HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Efficient n-choose-k random sampling

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
randomchoosesamplingefficient

Problem

Is there an efficient method of sampling an n-choose-k combination at random (with uniform probability, for example)?

I have read this question but it asks for generations of all combinations, not combinations at random.

I general I'm aware of rejection sampling, however it's very inefficient.

I also came across reservoir sampling, but that appears to be primarily geared towards very large or unknown n. I'm more interested in large but finite n (definitely not large enough to not be able to fit in memory. Well. An n-sized array itself will fit in memory, but the state space of all n-choose-k combinations might not).

Is there any survey/review on this topic? Does Knuth cover random n-choose-k sampling in his TAOCP texts?

Thanks in advance.

Edit: To be a bit more specific, a 5-choose-3 space over the string 'ABCDE' would look like this:

['ABC', 'ABD', 'ABE', 'ACD', 'ACE', 'ADE', 'BCD', 'BCE', 'BDE', 'CDE']

(Note: this is combination without replacement). And I want to be able to sample from this space with uniform distribution, using a general algorithm (one that works with arbitrary n and k).

Solution

Here is the simplest algorithm, which is efficient when $k$ is much smaller than $n$ relatively.

Input: two positive integers $n$ and $k$ with $k\le n$

Output: a random permutation of $k$ integers from $1,2,\cdots,n$

Algorithm:

  • Let arr be an array of size $n$ and a default value that is not True.



  • Let out be an empty array.



  • Let i be a random integer in $[0,n)$. If arr[i] is not True, append i+1 to out and set arr[i] to True.



  • Go back to 3 unless we have selected $k$ elements.



  • return out.



If $k\le n/2$, then the algorithm above runs in $O(k)$ average time. For example, if $n=1000$ and $k=50$, it will use the random number generator less than 55 times in average. This is much better the reservoir sampling that will use the random number generator about 1000 times.

How to speed up the algorithm if $k$ is not much smaller than $n$?

We will let each element in arr be a pair (i, False). After we have selected $n/4$ elemented, we will compactify arr by removing the elements whose second entries have been changed to True. We will set $n$ to $n-n/4$ and $k$ to $k-n/4$. Repeat the algorithm.

Context

StackExchange Computer Science Q#104930, answer score: 4

Revisions (0)

No revisions yet.