patternModerate
pick K random integers without repetition
Viewed 0 times
randomwithoutintegersrepetitionpick
Problem
Suppose we've to pick $K (\le N)$ random integers in the range $[0, N - 1]$ for very large $N$ such that there is no repetition, while also deterministically minimizing the number of calls made to the $rand()$ function. How would we do it?
Assume $rand(X)$ produces a random integer in the range $[0,X - 1]$; calling it with any argument counts as one call.
Picking at random till we get a unique value may not be efficient for smaller N, and $O(N)$ algorithms, i.e. ones which require looping through the entire range of integers is also not feasible, as $N$ could be up to $10^{18}$.
Is there any way to accomplish this?
Assume $rand(X)$ produces a random integer in the range $[0,X - 1]$; calling it with any argument counts as one call.
Picking at random till we get a unique value may not be efficient for smaller N, and $O(N)$ algorithms, i.e. ones which require looping through the entire range of integers is also not feasible, as $N$ could be up to $10^{18}$.
Is there any way to accomplish this?
Solution
Ṃųỻịgǻňạcểơửṩ mentioned that this is the reservoir problem. The reservoir sampling problem, though, is explicitly a very strict requirement for a streaming algorithm, i.e. it works in just one pass over a stream of values whose eventual size we don't know.
In this case, we want to do it in one pass, but we do know both N and K, so the problem is a little different and it permits a simpler solution.
Floyd's algorithm is designed for this task. It has some resemblance to both the Fisher-Yates shuffle and to reservoir sampling.
A couple of minor differences from the framing in your question:
Here's the Python code:
A good source for the origin of and mathematical justification for Floyd's algorithm is this article by Jon Bentley from 1987. It's worth noting that the algorithm only calls randint() K times.
A very brief description of how it works is that, on iteration t, it draws a number randomly from [1..n-k+t]. If the number has not been drawn before, it is added to the sample. Otherwise, it adds the number n-k+t to the sample. This works to give the proper probabilities of inclusion in the final set.
In Bentley's article there is a description of the recursive formulation of the algorithm:
We can appreciate the correctness of Floyd’s algorithm anecdotally. When M = 5 and N = 10, the algorithm first recursively computes in S a 4-element random sample from 1..9. Next it assigns to T a random integer in the range 1..10. Of the 10 values that T can assume, exactly 5 result in inserting 10 into S: the four values already in S, and the value 10 itself. Thus element 10 is inserted into the set with the correct probability of 5/10.
I can't do better than that explanation, right now!
The algorithm has been mentioned on StackOverflow and proofs are here on Math SE.
In this case, we want to do it in one pass, but we do know both N and K, so the problem is a little different and it permits a simpler solution.
Floyd's algorithm is designed for this task. It has some resemblance to both the Fisher-Yates shuffle and to reservoir sampling.
A couple of minor differences from the framing in your question:
- K
- we use randint(i, j) which produces a random integer between i and j inclusive
- the samples will be generated in the range 1 to N inclusive (but of course it is easy to adjust this at the end)
Here's the Python code:
import random
n = 200
k = 5
s = set()
for j in range(n-k+1, n+1):
t = random.randint(1, j)
if t not in s:
s.add(t)
else:
s.add(j)
print(s)A good source for the origin of and mathematical justification for Floyd's algorithm is this article by Jon Bentley from 1987. It's worth noting that the algorithm only calls randint() K times.
A very brief description of how it works is that, on iteration t, it draws a number randomly from [1..n-k+t]. If the number has not been drawn before, it is added to the sample. Otherwise, it adds the number n-k+t to the sample. This works to give the proper probabilities of inclusion in the final set.
In Bentley's article there is a description of the recursive formulation of the algorithm:
We can appreciate the correctness of Floyd’s algorithm anecdotally. When M = 5 and N = 10, the algorithm first recursively computes in S a 4-element random sample from 1..9. Next it assigns to T a random integer in the range 1..10. Of the 10 values that T can assume, exactly 5 result in inserting 10 into S: the four values already in S, and the value 10 itself. Thus element 10 is inserted into the set with the correct probability of 5/10.
I can't do better than that explanation, right now!
The algorithm has been mentioned on StackOverflow and proofs are here on Math SE.
Code Snippets
import random
n = 200
k = 5
s = set()
for j in range(n-k+1, n+1):
t = random.randint(1, j)
if t not in s:
s.add(t)
else:
s.add(j)
print(s)Context
StackExchange Computer Science Q#165180, answer score: 16
Revisions (0)
No revisions yet.