HiveBrain v1.2.0
Get Started
← Back to all entries
snippetMinor

Number of buckets for bucket sort (most significant bits)

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
bucketsnumberbitsforbucketsortmostsignificant

Problem

Wikipedia has this pseudocode on bucket sort:


function bucketSort(array, n) is
buckets ← new array of n empty lists
for i = 0 to (length(array)-1) do
insert array[i] into buckets[msbits(array[i], k)]
for i = 0 to n - 1 do
nextSort(buckets[i]);
return the concatenation of buckets[0], ...., buckets[n-1]


It also explains:


Here array is the array to be sorted and n is the number of buckets to use. The function msbits(x,k) returns the k most significant bits of x (floor(x/2^(size(x)-k))); different functions can be used to translate the range of elements in array to n buckets...

What baffles me is: Why would one want to initialize buckets with n buckets when 2^k is the maximum number of values that msbits can possibly generate, given the way it's defined?

Solution

The code you found is unfortunately written and explained in a very confusing way, most notably in a way that, in my humble opinion, misses the entire point of Bucketsort.

Let $A$ be an array of $n$ elements, taken from the ordered set $(S, \le)$. Bucketsort is a sorting algorithm that sorts $A$ in linear time with probability $1$, subject to the hypothesis that the statistical distribution of the elements of $A$ over $S$ is known in advance.

For the sake of simplicity, we will assume $S$ to be the real interval $[0, 1)$ and the distribution of $A$ to be uniform over $[0, 1)$. Our second assumption is without loss of generality. In fact, should we be provided with a different probability density function $\rho$, we could simply replace each $x \in A$ with:

$$
x' =\int_{0}^x \rho(z) dz
$$

An analogous argument holds for different choices of $S$.

Bucketsort works as follows:

  • Split $[0, 1)$ in $n$ consecutive intervals, where $n = |A|$, such that the $i$-th interval is $\left[ \frac{i-1}{n}, \frac{i}{n} \right)$



  • Create $n$ empty lists $L_1, L_2, \dots, L_n$



  • For each element $x$ of $A$, add $x$ to the list corresponding to the interval in which it falls



  • Sort each list individually, then return the concatenation $L_1 \circ L_2 \circ \dots \circ L_n$



Correctness is obvious, but complexity isn't. The crucial hypothesis is that, by assumption, we are sorting elements uniformly distributed over $[0, 1)$. Therefore, if $n$ grows large, as it is the case in the usual asymptotic setting, at the end of our third step, each of the lists will contain a constant number of elements with probability $1$; if that weren't the case, the elements wouldn't be uniformly distributed after all. Also observe that "with probability $1$" is not the same as "certainly", although it's as close as it gets.

The choice of exactly $n$ buckets is due to the fact that we want our last step to take linear time; but that can only be achieved if each list contains a constant amount of elements. We could have chosen just as well to use $n/c$ buckets for any $c \in \Theta(1)$, but we couldn't have started with $ k \in o(n)$ buckets; each of them would have contained $n/k \in \omega(1)$ elements. The total cost of sorting them would have been:

$$
k \left ( \frac{n}{k} \log \frac{n}{k} \right ) = n \log \omega(1) \in \omega(n)
$$

Context

StackExchange Computer Science Q#69962, answer score: 3

Revisions (0)

No revisions yet.