patternMinor
Bias of first values produced by a family of RNGs
Viewed 0 times
rngsbiasproducedfirstvaluesfamily
Problem
Suppose I'm doing a large number of pseudo-random but deterministic experiments, where each experiment requires generating several random numbers.
I'm approaching this by having each experiment use a random number generator seeded with a random-looking seed - e.g. hash of the index of the experiment. This is especially convenient for parallelizing my experiments.
I'm wondering if this is a sound way to get good randomness. In particular, this article points out at the end that the first value returned by e.g. Java's random generator (LCG) is highly biased even if you vary the seed.
How can I avoid this bias? Is it sufficient to throw away one or two numbers, as that article suggests? Is there any research on this topic? (I wasn't able to find the right keywords)
Would, e.g., Mersenne Twister behave better?
I'm approaching this by having each experiment use a random number generator seeded with a random-looking seed - e.g. hash of the index of the experiment. This is especially convenient for parallelizing my experiments.
I'm wondering if this is a sound way to get good randomness. In particular, this article points out at the end that the first value returned by e.g. Java's random generator (LCG) is highly biased even if you vary the seed.
How can I avoid this bias? Is it sufficient to throw away one or two numbers, as that article suggests? Is there any research on this topic? (I wasn't able to find the right keywords)
Would, e.g., Mersenne Twister behave better?
Solution
Yes. This general approach (use a hash of the index as seed to your prng) is a sound way to perform repeatable but pseudorandom experiments.
The most critical factor is to choose the pseudo-random number generator wisely, so that its statistical properties are sufficient for the kind of simulations you are doing. Some pseudo-random number generators are especially awful: for instance, on some platforms
It's important to include the hashing step. If you don't do that, then you're feeding closely related seeds into the pseudo-random generators. For some pseudo-random generators, bad things might happen: you might get closely related outputs. For instance, that's exactly what the article you mentioned is warning about, when you use Java's
In particular, if you're concerned about this risk, one extremely effective solution is to use a strong cryptographic hash function (e.g., SHA256) as your hash. This ensures that the resulting seeds will look entirely unrelated. After all, any detectable relationship in the resulting seeds would likely count as a "break" of the cryptographic hash function, and would be huge news. So, heuristically, if you use a cryptographic hash function to hash the index and use its output as the seed, you're likely on safe ground. This can be formalized using the random oracle model, but I'll spare you the technical details -- you can read more about that sort of analysis in the cryptographic literature if you really want, but in the interests of space, I'll spare you all the mathematics.
I expect that using a good hash will be more effective/reliable than throwing away the first few numbers, and more effective than trying to choose a different pseudo-random number generator. Those alternatives are a crapshoot: they might work, but they might not. In contrast, there is sound science behind the "hash the index and use the result as a seed" approach.
The most critical factor is to choose the pseudo-random number generator wisely, so that its statistical properties are sufficient for the kind of simulations you are doing. Some pseudo-random number generators are especially awful: for instance, on some platforms
rand() is terrible (its least significant bit alternates between 0 and 1), and Excel's RAND() has been criticized in the statistical literature. So, choose wisely. But this issue is more a matter of choosing a suitable pseudo-random generator, which is something you need to do anyway.It's important to include the hashing step. If you don't do that, then you're feeding closely related seeds into the pseudo-random generators. For some pseudo-random generators, bad things might happen: you might get closely related outputs. For instance, that's exactly what the article you mentioned is warning about, when you use Java's
Math.random(). Using a good hash function will fix this, because now the seeds to the pseudo-random generator will look unrelated (each bit will be different with probability about 1/2, etc.).In particular, if you're concerned about this risk, one extremely effective solution is to use a strong cryptographic hash function (e.g., SHA256) as your hash. This ensures that the resulting seeds will look entirely unrelated. After all, any detectable relationship in the resulting seeds would likely count as a "break" of the cryptographic hash function, and would be huge news. So, heuristically, if you use a cryptographic hash function to hash the index and use its output as the seed, you're likely on safe ground. This can be formalized using the random oracle model, but I'll spare you the technical details -- you can read more about that sort of analysis in the cryptographic literature if you really want, but in the interests of space, I'll spare you all the mathematics.
I expect that using a good hash will be more effective/reliable than throwing away the first few numbers, and more effective than trying to choose a different pseudo-random number generator. Those alternatives are a crapshoot: they might work, but they might not. In contrast, there is sound science behind the "hash the index and use the result as a seed" approach.
Context
StackExchange Computer Science Q#44718, answer score: 2
Revisions (0)
No revisions yet.