snippetMinor

How to extract randomness from a file?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

filehowrandomnessextractfrom

Problem

I have a generator of files with approximately 7 bits /byte entropy. The files are about 30KB each in length. I'd like to use these as sources of entropy to generate random numbers. Theoretically I should be able to achieve about 26 KB (max) of randomness per file.

This is a hobby and I know about hashes, but they're primarily checksums. I'm looking for more sophisticated methods of extraction designed specifically for that purpose. I'm hoping to use this as a true random number generator.

Supplemental:

Following comments, and in order to solicit non hash based responses, I'm evaluating the following architecture...

It works like this:-

The entropy is the 30KB file

The compressor reduces the entropy to a target size of 26KB

A seed is derived from all of the bytes of the entropy

Some PRNG implementation produces 26KB of pseudo random output

The two outputs are xored together to produce true random numbers

So in summary, a 30KB file of 86% entropy is manipulated into a 26KB file of 100% entropy. Entropy is preserved throughout the extraction process, and all the output is totally dependant on the input. 26KB of full entropy goes in, and 26KB of full entropy comes out.

I suggest that this is a good method to extract entropy from complete files. Or not.

Solution

In practice, in about 95% of cases, the correct answer is probably going to be: forget about those files, use /dev/urandom or CryptGenRandom() (or equivalent). They provide very high quality random numbers -- That's what they're designed for. So this is the most pragmatic answer, and it will provide excellent quality randomness.

Let's say you can't do that, maybe because it's a headless embedded device with no access to entropy, or you want your random numbers to be a repeatable deterministic function of the files. Then in that case the best answer is likely to be -- guess what? -- hashing. In particular: concatenate all the files, take the SHA256 hash of the concatenation, and then use that as a key for AES-256-CTR; then use AES-256-CTR to generate as much output as you want. This is basically using a cryptographic hash to generate a seed for a cryptographic-quality pseudorandom generator, and it satisfies all your requirements. This is a solid answer, from the perspective of robust engineering and high quality random numbers.

OK, let's say you are a theorist. If you are a theorist, you won't like the last answer, because it relies upon cryptographic assumptions that have not been mathematically proven to hold. Pragmatically, you probably shouldn't worry about that -- you rely upon those assumptions every day (e.g., when doing e-commerce or entering your password into a website), and other issues are far more likely to affect you in practice -- but let's say you are a theorist.

From a theoretical perspective, the cryptographic solution is potentially unsatisfying, because it relies upon an unproven assumption, and that is... inelegant. So, it's interesting to ask what can be achieved that will provably work. If that's what you're interested in, you'll want to read about randomness extractors.

If you care about pragmatics, randomness extractors are probably not the best possible solution: extractors need stronger assumptions about the distribution of the data in your files, they are more limited in how much random output they can produce, and they are more fragile. But they do come with provably good properties, and they are mathematically elegant and beautiful.

Context

StackExchange Computer Science Q#45631, answer score: 6

Revisions (0)

No revisions yet.