snippetMinor
Number of buckets for bucket sort (most significant bits)
Viewed 0 times
bucketsnumberbitsforbucketsortmostsignificant
Problem
Wikipedia has this pseudocode on bucket sort:
It also explains:
Here
What baffles me is: Why would one want to initialize
function bucketSort(array, n) is
buckets ← new array of n empty lists
for i = 0 to (length(array)-1) do
insert array[i] into buckets[msbits(array[i], k)]
for i = 0 to n - 1 do
nextSort(buckets[i]);
return the concatenation of buckets[0], ...., buckets[n-1]It also explains:
Here
array is the array to be sorted and n is the number of buckets to use. The function msbits(x,k) returns the k most significant bits of x (floor(x/2^(size(x)-k))); different functions can be used to translate the range of elements in array to n buckets...What baffles me is: Why would one want to initialize
buckets with n buckets when 2^k is the maximum number of values that msbits can possibly generate, given the way it's defined?Solution
The code you found is unfortunately written and explained in a very confusing way, most notably in a way that, in my humble opinion, misses the entire point of Bucketsort.
Let $A$ be an array of $n$ elements, taken from the ordered set $(S, \le)$. Bucketsort is a sorting algorithm that sorts $A$ in linear time with probability $1$, subject to the hypothesis that the statistical distribution of the elements of $A$ over $S$ is known in advance.
For the sake of simplicity, we will assume $S$ to be the real interval $[0, 1)$ and the distribution of $A$ to be uniform over $[0, 1)$. Our second assumption is without loss of generality. In fact, should we be provided with a different probability density function $\rho$, we could simply replace each $x \in A$ with:
$$
x' =\int_{0}^x \rho(z) dz
$$
An analogous argument holds for different choices of $S$.
Bucketsort works as follows:
Correctness is obvious, but complexity isn't. The crucial hypothesis is that, by assumption, we are sorting elements uniformly distributed over $[0, 1)$. Therefore, if $n$ grows large, as it is the case in the usual asymptotic setting, at the end of our third step, each of the lists will contain a constant number of elements with probability $1$; if that weren't the case, the elements wouldn't be uniformly distributed after all. Also observe that "with probability $1$" is not the same as "certainly", although it's as close as it gets.
The choice of exactly $n$ buckets is due to the fact that we want our last step to take linear time; but that can only be achieved if each list contains a constant amount of elements. We could have chosen just as well to use $n/c$ buckets for any $c \in \Theta(1)$, but we couldn't have started with $ k \in o(n)$ buckets; each of them would have contained $n/k \in \omega(1)$ elements. The total cost of sorting them would have been:
$$
k \left ( \frac{n}{k} \log \frac{n}{k} \right ) = n \log \omega(1) \in \omega(n)
$$
Let $A$ be an array of $n$ elements, taken from the ordered set $(S, \le)$. Bucketsort is a sorting algorithm that sorts $A$ in linear time with probability $1$, subject to the hypothesis that the statistical distribution of the elements of $A$ over $S$ is known in advance.
For the sake of simplicity, we will assume $S$ to be the real interval $[0, 1)$ and the distribution of $A$ to be uniform over $[0, 1)$. Our second assumption is without loss of generality. In fact, should we be provided with a different probability density function $\rho$, we could simply replace each $x \in A$ with:
$$
x' =\int_{0}^x \rho(z) dz
$$
An analogous argument holds for different choices of $S$.
Bucketsort works as follows:
- Split $[0, 1)$ in $n$ consecutive intervals, where $n = |A|$, such that the $i$-th interval is $\left[ \frac{i-1}{n}, \frac{i}{n} \right)$
- Create $n$ empty lists $L_1, L_2, \dots, L_n$
- For each element $x$ of $A$, add $x$ to the list corresponding to the interval in which it falls
- Sort each list individually, then return the concatenation $L_1 \circ L_2 \circ \dots \circ L_n$
Correctness is obvious, but complexity isn't. The crucial hypothesis is that, by assumption, we are sorting elements uniformly distributed over $[0, 1)$. Therefore, if $n$ grows large, as it is the case in the usual asymptotic setting, at the end of our third step, each of the lists will contain a constant number of elements with probability $1$; if that weren't the case, the elements wouldn't be uniformly distributed after all. Also observe that "with probability $1$" is not the same as "certainly", although it's as close as it gets.
The choice of exactly $n$ buckets is due to the fact that we want our last step to take linear time; but that can only be achieved if each list contains a constant amount of elements. We could have chosen just as well to use $n/c$ buckets for any $c \in \Theta(1)$, but we couldn't have started with $ k \in o(n)$ buckets; each of them would have contained $n/k \in \omega(1)$ elements. The total cost of sorting them would have been:
$$
k \left ( \frac{n}{k} \log \frac{n}{k} \right ) = n \log \omega(1) \in \omega(n)
$$
Context
StackExchange Computer Science Q#69962, answer score: 3
Revisions (0)
No revisions yet.