patternMinor
Efficient n-choose-k random sampling
Viewed 0 times
randomchoosesamplingefficient
Problem
Is there an efficient method of sampling an n-choose-k combination at random (with uniform probability, for example)?
I have read this question but it asks for generations of all combinations, not combinations at random.
I general I'm aware of rejection sampling, however it's very inefficient.
I also came across reservoir sampling, but that appears to be primarily geared towards very large or unknown n. I'm more interested in large but finite n (definitely not large enough to not be able to fit in memory. Well. An n-sized array itself will fit in memory, but the state space of all n-choose-k combinations might not).
Is there any survey/review on this topic? Does Knuth cover random n-choose-k sampling in his TAOCP texts?
Thanks in advance.
Edit: To be a bit more specific, a 5-choose-3 space over the string 'ABCDE' would look like this:
['ABC', 'ABD', 'ABE', 'ACD', 'ACE', 'ADE', 'BCD', 'BCE', 'BDE', 'CDE']
(Note: this is combination without replacement). And I want to be able to sample from this space with uniform distribution, using a general algorithm (one that works with arbitrary n and k).
I have read this question but it asks for generations of all combinations, not combinations at random.
I general I'm aware of rejection sampling, however it's very inefficient.
I also came across reservoir sampling, but that appears to be primarily geared towards very large or unknown n. I'm more interested in large but finite n (definitely not large enough to not be able to fit in memory. Well. An n-sized array itself will fit in memory, but the state space of all n-choose-k combinations might not).
Is there any survey/review on this topic? Does Knuth cover random n-choose-k sampling in his TAOCP texts?
Thanks in advance.
Edit: To be a bit more specific, a 5-choose-3 space over the string 'ABCDE' would look like this:
['ABC', 'ABD', 'ABE', 'ACD', 'ACE', 'ADE', 'BCD', 'BCE', 'BDE', 'CDE']
(Note: this is combination without replacement). And I want to be able to sample from this space with uniform distribution, using a general algorithm (one that works with arbitrary n and k).
Solution
Here is the simplest algorithm, which is efficient when $k$ is much smaller than $n$ relatively.
Input: two positive integers $n$ and $k$ with $k\le n$
Output: a random permutation of $k$ integers from $1,2,\cdots,n$
Algorithm:
If $k\le n/2$, then the algorithm above runs in $O(k)$ average time. For example, if $n=1000$ and $k=50$, it will use the random number generator less than 55 times in average. This is much better the reservoir sampling that will use the random number generator about 1000 times.
How to speed up the algorithm if $k$ is not much smaller than $n$?
We will let each element in
Input: two positive integers $n$ and $k$ with $k\le n$
Output: a random permutation of $k$ integers from $1,2,\cdots,n$
Algorithm:
- Let
arrbe an array of size $n$ and a default value that is notTrue.
- Let
outbe an empty array.
- Let
ibe a random integer in $[0,n)$. Ifarr[i]is notTrue, appendi+1tooutand setarr[i]toTrue.
- Go back to 3 unless we have selected $k$ elements.
- return
out.
If $k\le n/2$, then the algorithm above runs in $O(k)$ average time. For example, if $n=1000$ and $k=50$, it will use the random number generator less than 55 times in average. This is much better the reservoir sampling that will use the random number generator about 1000 times.
How to speed up the algorithm if $k$ is not much smaller than $n$?
We will let each element in
arr be a pair (i, False). After we have selected $n/4$ elemented, we will compactify arr by removing the elements whose second entries have been changed to True. We will set $n$ to $n-n/4$ and $k$ to $k-n/4$. Repeat the algorithm.Context
StackExchange Computer Science Q#104930, answer score: 4
Revisions (0)
No revisions yet.