patterncModerate
Uniform random numbers in an integer interval in C
Viewed 0 times
randomnumbersintervaluniforminteger
Problem
I need to generate random integers inside a range [a, b] in C. I used the normal implementation that you will see everywhere on the internet and wrote the following:
Now, this works just fine, but the distribution over the range really isn't good, and it's starting to cause secondary algorithms I have to behave badly. How can I improve that function so that it is:
/**
* randRange Generates random integers in range
* @param lower Lower bound
* @param upper Upper bound
* @return random integer between lower and upper (inclusive)
*/
int randRange(int lower, int upper) {
return rand() % (++upper - lower) + lower;
}Now, this works just fine, but the distribution over the range really isn't good, and it's starting to cause secondary algorithms I have to behave badly. How can I improve that function so that it is:
- Not biased
- Doesn't rely on anything not on the standard library
- Is suitable for a Monte Carlo algorithm
- Fast
Solution
There are many good reasons to not use modulo for random numbers like this:
LCGs have poor entropy in lower bits
It's well known that the lower bits on Linear Congruential Generators (LCG) have poor entropy. And although not required to be,
This means that using:
has even worse entropy as it is only using the lower bits due to the modulo operator.
Using modulo skews the distribution
As if low entropy wasn't enough, using modulo in this way also skews the distribution of numbers.
To see this, imagine
Zero is twice as likely as any other outcome. This is clearly not uniform! Even for larger RAND_MAX and range values the problem persists. There are some special cases where it works but in general it doesn't.
So how should you do it?
Unfortunately though you see the approach with modulo shown everywhere on the internet because it's easier to understand. Or because people don't know better. Whatever the reason using modulo is not the way to limit the range of a random number generator.
There are a few different ways:
Use all the bits - Floating point rescale
As rand() returns a some what uniform random number in the range [0,RAND_MAX] and we want all numbers to be used; So we can re-scale this range to [0,1[ and multiply it by our desired range and quantize it. This requires a bit of care so that we don't introduce additional bias around the edges of the range:
This method while better than the modulo in the aspect that it has higher entropy as it uses the high bits of the rand() call still suffers some bias problems. Consider again if RAND_MAX=3, min=0 and max=2;
the good news is that as RAND_MAX grows larger, the bias approaches (but never reaches) zero as 1/x (which is also true when using the modulo approach but exaggerated by the bad entropy in the lower bits).
For example for RAND_MAX=3000 you will have 1001 trials that result in rnd=0 and 1000 trials each that result in rnd=1 or rnd=2.
Depending on your requirements, this may or may not be enough.
Use all the bits - Integer only w/ re-rolls
So we want to use the whole range and use integers only, if we split the range of random numbers from rand() into equally large "chunks" and depending on which chunk we "land in" we determine the outcome. In the general case there will be some slack at the end of the range as the chunk size doesn't evenly divide the range, if we roll a value in the slack we just roll again:
We have to be careful about the inequalities and off by ones here as any error adds a bias. Lets check:
Yay! No bias! Lets try again with RAND_MAX = 4;
As you can see no bias here.
The things to take note of: The reroll takes care of the bias; and using the largest possible chunkSize and thus using the largest part possible of the output range of rand() avoids the issue with the low bits being of poor quality (unless you have a very large range but then you're SOL anyway).
The probability of a re-roll is:
note that:
otherwise
chance of a re-roll.
In closing
That all being said and done, linear congruential generators are generally poor choices for any kind of serious pseudo random numbers overall, you should look into a better generator like for example the widely used Mersenne Twister.
LCGs have poor entropy in lower bits
It's well known that the lower bits on Linear Congruential Generators (LCG) have poor entropy. And although not required to be,
rand() is typically an LCG.This means that using:
rand()%range + minhas even worse entropy as it is only using the lower bits due to the modulo operator.
Using modulo skews the distribution
As if low entropy wasn't enough, using modulo in this way also skews the distribution of numbers.
To see this, imagine
RAND_MAX=3 and you do rand()%3 then the possible outcomes are:0%3 = 0
1%3 = 1
2%3 = 2
3%3 = 0Zero is twice as likely as any other outcome. This is clearly not uniform! Even for larger RAND_MAX and range values the problem persists. There are some special cases where it works but in general it doesn't.
So how should you do it?
Unfortunately though you see the approach with modulo shown everywhere on the internet because it's easier to understand. Or because people don't know better. Whatever the reason using modulo is not the way to limit the range of a random number generator.
There are a few different ways:
Use all the bits - Floating point rescale
As rand() returns a some what uniform random number in the range [0,RAND_MAX] and we want all numbers to be used; So we can re-scale this range to [0,1[ and multiply it by our desired range and quantize it. This requires a bit of care so that we don't introduce additional bias around the edges of the range:
// Use whatever c equivalent of next_after get one ULP larger than
// RAND_MAX. This transforms the range from to [0,1[ with minimal
// skew in the distribution.
double normalized = rand() / std::next_after(double(RAND_MAX), DOUBLE_MAX);
int rnd = min + int(normalized * (max - min + 1));This method while better than the modulo in the aspect that it has higher entropy as it uses the high bits of the rand() call still suffers some bias problems. Consider again if RAND_MAX=3, min=0 and max=2;
rand()=0 -> norm = 0 rnd = 0
rand()=1 -> norm = 0.333333333333333332 rnd = 0
rand()=2 -> norm = 0.666666666666666665 rnd = 1
rand()=3 -> norm = 0.9999999999999999998 rnd = 2the good news is that as RAND_MAX grows larger, the bias approaches (but never reaches) zero as 1/x (which is also true when using the modulo approach but exaggerated by the bad entropy in the lower bits).
For example for RAND_MAX=3000 you will have 1001 trials that result in rnd=0 and 1000 trials each that result in rnd=1 or rnd=2.
Depending on your requirements, this may or may not be enough.
Use all the bits - Integer only w/ re-rolls
So we want to use the whole range and use integers only, if we split the range of random numbers from rand() into equally large "chunks" and depending on which chunk we "land in" we determine the outcome. In the general case there will be some slack at the end of the range as the chunk size doesn't evenly divide the range, if we roll a value in the slack we just roll again:
int range = max - min + 1;
// Largest value that when multiplied by "range"
// is less than or equal to RAND_MAX
int chunkSize = (RAND_MAX + 1) / range;
int endOfLastChunk = chunkSize * range;
int r = rand();
while(r >= endOfLastChunk){
r = rand();
}
return min + r / chunkSize;We have to be careful about the inequalities and off by ones here as any error adds a bias. Lets check:
RAND_MAX = 5;
min = 0;
max = 2;
range = 3;
chunkSize = (5 + 1)/3 = 2;
endOfLastChunk = 2 * 3 = 6;
r = 0 -> return 0;
r = 1 -> return 1/2 = 0;
r = 2 -> return 2/2 = 1;
r = 3 -> return 3/2 = 1;
r = 4 -> return 4/4 = 2;
r = 5 -> return 5/4 = 2;Yay! No bias! Lets try again with RAND_MAX = 4;
chunkSize = (4 + 1)/3 = 1;
endOfLastChunk = 1 * 3 = 3;
r = 0 -> return 0;
r = 1 -> return 1/1 = 1;
r = 2 -> return 2/1 = 2;
r = 3 -> reroll
r = 4 -> rerollAs you can see no bias here.
The things to take note of: The reroll takes care of the bias; and using the largest possible chunkSize and thus using the largest part possible of the output range of rand() avoids the issue with the low bits being of poor quality (unless you have a very large range but then you're SOL anyway).
The probability of a re-roll is:
(RAND_MAX + 1 - endOfLastChunk) / (RAND_MAX + 1)note that:
(RAND_MAX + 1 - endOfLastChunk) < rangeotherwise
chunkSize would be bigger. So we have:range / (RAND_MAX + 1)chance of a re-roll.
In closing
That all being said and done, linear congruential generators are generally poor choices for any kind of serious pseudo random numbers overall, you should look into a better generator like for example the widely used Mersenne Twister.
Code Snippets
rand()%range + min0%3 = 0
1%3 = 1
2%3 = 2
3%3 = 0// Use whatever c equivalent of next_after get one ULP larger than
// RAND_MAX. This transforms the range from to [0,1[ with minimal
// skew in the distribution.
double normalized = rand() / std::next_after(double(RAND_MAX), DOUBLE_MAX);
int rnd = min + int(normalized * (max - min + 1));rand()=0 -> norm = 0 rnd = 0
rand()=1 -> norm = 0.333333333333333332 rnd = 0
rand()=2 -> norm = 0.666666666666666665 rnd = 1
rand()=3 -> norm = 0.9999999999999999998 rnd = 2int range = max - min + 1;
// Largest value that when multiplied by "range"
// is less than or equal to RAND_MAX
int chunkSize = (RAND_MAX + 1) / range;
int endOfLastChunk = chunkSize * range;
int r = rand();
while(r >= endOfLastChunk){
r = rand();
}
return min + r / chunkSize;Context
StackExchange Code Review Q#159604, answer score: 10
Revisions (0)
No revisions yet.