HiveBrain v1.2.0
Get Started
← Back to all entries
patternModerate

pick K random integers without repetition

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
randomwithoutintegersrepetitionpick

Problem

Suppose we've to pick $K (\le N)$ random integers in the range $[0, N - 1]$ for very large $N$ such that there is no repetition, while also deterministically minimizing the number of calls made to the $rand()$ function. How would we do it?

Assume $rand(X)$ produces a random integer in the range $[0,X - 1]$; calling it with any argument counts as one call.

Picking at random till we get a unique value may not be efficient for smaller N, and $O(N)$ algorithms, i.e. ones which require looping through the entire range of integers is also not feasible, as $N$ could be up to $10^{18}$.

Is there any way to accomplish this?

Solution

Ṃųỻịgǻňạcểơửṩ mentioned that this is the reservoir problem. The reservoir sampling problem, though, is explicitly a very strict requirement for a streaming algorithm, i.e. it works in just one pass over a stream of values whose eventual size we don't know.

In this case, we want to do it in one pass, but we do know both N and K, so the problem is a little different and it permits a simpler solution.

Floyd's algorithm is designed for this task. It has some resemblance to both the Fisher-Yates shuffle and to reservoir sampling.

A couple of minor differences from the framing in your question:

  • K



  • we use randint(i, j) which produces a random integer between i and j inclusive



  • the samples will be generated in the range 1 to N inclusive (but of course it is easy to adjust this at the end)



Here's the Python code:

import random

n = 200
k = 5
s = set()

for j in range(n-k+1, n+1):
    t = random.randint(1, j)
    if t not in s:
        s.add(t)
    else:
        s.add(j)
        
print(s)


A good source for the origin of and mathematical justification for Floyd's algorithm is this article by Jon Bentley from 1987. It's worth noting that the algorithm only calls randint() K times.

A very brief description of how it works is that, on iteration t, it draws a number randomly from [1..n-k+t]. If the number has not been drawn before, it is added to the sample. Otherwise, it adds the number n-k+t to the sample. This works to give the proper probabilities of inclusion in the final set.

In Bentley's article there is a description of the recursive formulation of the algorithm:

We can appreciate the correctness of Floyd’s algorithm anecdotally. When M = 5 and N = 10, the algorithm first recursively computes in S a 4-element random sample from 1..9. Next it assigns to T a random integer in the range 1..10. Of the 10 values that T can assume, exactly 5 result in inserting 10 into S: the four values already in S, and the value 10 itself. Thus element 10 is inserted into the set with the correct probability of 5/10.

I can't do better than that explanation, right now!

The algorithm has been mentioned on StackOverflow and proofs are here on Math SE.

Code Snippets

import random

n = 200
k = 5
s = set()

for j in range(n-k+1, n+1):
    t = random.randint(1, j)
    if t not in s:
        s.add(t)
    else:
        s.add(j)
        
print(s)

Context

StackExchange Computer Science Q#165180, answer score: 16

Revisions (0)

No revisions yet.