HiveBrain v1.2.0
Get Started
← Back to all entries
patternMajor

Can PRNGs be used to magically compress stuff?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
magicallycanusedcompressstuffprngs

Problem

This idea occurred to me as a kid learning to program and
on first encountering PRNG's. I still don't know how realistic
it is, but now there's stack exchange.

Here's a 14 year-old's scheme for an amazing compression algorithm:

Take a PRNG and seed it with seed s to get a long sequence
of pseudo-random bytes. To transmit that sequence to another party,
you need only communicate a description of the PRNG, the appropriate seed
and the length of the message. For a long enough sequence, that
description would be much shorter then the sequence itself.

Now suppose I could invert the process. Given enough time and
computational resources, I could do a brute-force search and find
a seed (and PRNG, or in other words: a program) that produces my
desired sequence (Let's say an amusing photo of cats being mischievous).

PRNGs repeat after a large enough number of bits have been generated,
but compared to "typical" cycles my message is quite short so this
dosn't seem like much of a problem.

Voila, an effective (if rube-Goldbergian) way to compress data.

So, assuming:

  • The sequence I wish to compress is finite and known in advance.



  • I'm not short on cash or time (Just as long as a finite amount


of both is required)

I'd like to know:

  • Is there a fundamental flaw in the reasoning behind the scheme?



  • What's the standard way to analyse these sorts of thought experiments?



Summary

It's often the case that good answers make clear not only the answer,
but what it is that I was really asking. Thanks for everyone's patience
and detailed answers.

Here's my nth attempt at a summary of the answers:

  • The PRNG/seed angle doesn't contribute anything, it's no more


than a program that produces the desired sequence as output.

  • The pigeonhole principle: There are many more messages of


length > k than there are (message generating) programs of
length

  • It's worth mentioning that the interpreter of the program


(message) is necessarily fixed in advance. And it

Solution

You've got a brilliant new compression scheme, eh? Alrighty, then...

♫ Let's all play, the entropy game ♫

Just to be simple, I will assume you want to compress messages of exactly $n$ bits, for some fixed $n$. However, you want to be able to use it for longer messages, so you need some way of differentiating your first message from the second (it cannot be ambiguous what you have compressed).

So, your scheme is to determine some family of PRNG/seeds such that if you want to compress, say, $01000111001$, then you just write some number $k$, which identifies some precomputed (and shared) seed/PRNG combo that generates those bits after $n$ queries. Alright. How many different bit-strings of length $n$ are there? $2^n$ (you have n choices between two items; $0$ and $1$). That means you will have to compute $2^n$ of these combos. No problem. However, you need to write out $k$ in binary for me to read it. How big can $k$ get? Well, it can be as big as $2^n$. How many bits do I need to write out $2^n$? $\log{2^n} = n$.

Oops! Your compression scheme needs messages as long as what you're compressing!

"Haha!", you say, "but that's in the worst case! One of my messages will be mapped to $0$, which needs only $1$ bit to represent! Victory!"

Yes, but your messages have to be unambiguous! How can I tell apart $1$ followed by $0$ from $10$? Since some of your keys are length $n$, all of them must be, or else I can't tell where you've started and stopped.

"Haha!", you say, "but I can just put the length of the string in binary first! That only needs to count to $n$, which can be represented by $\log{n}$ bits! So my $0$ now comes prefixed with only $\log{n}$ bits, I still win!"

Yes, but now those really big numbers are prefixed with $\log{n}$ bits. Your compression scheme has made some of your messages even longer! And half of all of your numbers start with $1$, so half of your messages are that much longer!

You then proceed to throw out more ideas like a terminating character, gzipping the number, and compressing the length itself, but all of those run into cases where the resultant message is just longer. In fact, for every bit you save on some message, another message will get longer in response. In general, you're just going to be shifting around the "cost" of your messages. Making some shorter will just make others longer. You really can't fit $2^n$ different messages in less space than writing out $2^n$ binary strings of length $n$.

"Haha!", you say, "but I can choose some messages as 'stupid' and make them illegal! Then I don't need to count all the way to $2^n$, because I don't support that many messages!"

You're right, but you haven't really won. You've just shrunk the set of messages you support. If you only supported $a=0000000011010$ and $b=111111110101000$ as the messages you send, then you can definitely just have the code $a\rightarrow 0$, $b\rightarrow 1$, which matches exactly what I've said. Here, $n=1$. The actual length of the messages isn't important, it's how many there are.

"Haha!", you say, "but I can simply determine that those stupid messages are rare! I'll make the rare ones big, and the common ones small! Then I win on average!"

Yep! Congratulations, you've just discovered entropy! If you have $n$ messages, where the $i$th message has probability $p_i$ of being sent, then you can get your expected message length down to the entropy $H = \sum_{i=1}^np_i\log(1/p_i)$ of this set of messages. That's a kind of weird expression, but all you really need to know is that's it's biggest when all messages are equally likely, and smaller when some are more common than others. In the extreme, if you know basically every message is going to be $a=000111010101$. Then you can use this super efficient code: $a\rightarrow0$, $x\rightarrow1x$ otherwise. Then your expected message length is basically $1$, which is awesome, and that's going to be really close to the entropy $H$. However, $H$ is a lower bound, and you really can't beat it, no matter how hard you try.

Anything that claims to beat entropy is probably not giving enough information to unambiguously retrieve the compressed message, or is just wrong. Entropy is such a powerful concept that we can lower-bound (and sometimes even upper-bound) the running time of some algorithms with it, because if they run really fast (or really slow), then they must be doing something that violates entropy.

Context

StackExchange Computer Science Q#23010, answer score: 45

Revisions (0)

No revisions yet.