patternMinor

What determines the entropy of a program's source code?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

data-compression cs kolmogorov-complexity stackoverflow

thewhatsourceprogramdeterminesentropycode

Problem

a few days ago I asked a question about the limits of compression:
Can PRNGs be used to magically compress-stuff?

The idea common to all the answers was that if you consider all programs
of length

All refactorings, which might alter the AST but preserve all input/output

relationships, will do the same if they do not inflate the size of the source code.

a PL that provides multiple constructs ("There's more then one way to do it")

for achieving the same result (control flow constructs, for example) will , generally speaking, have more programs of length

This brings me to the following question: If I consider all programs
in $C_k$ that generate the same output (it's an equivalence relation)
and then proceed to compress them, what would be the results? Would
the equivalent programs compress down to roughly the same size or not?
Does the compressability depend on the meaning of the program (which
is identical) or the bit sequence (which is not)?

A statement given in the answers to the previous question is
that the entropy of a bit sequence determines how compressible it is.
Is that entropy determined by the input/output relationships encoded
in the program, or by it's bit-representation in language X?
What's the right way to think of this?

update
Since a program describes a computational process
rather then merely output, the notion of equivalence
suggested really doesn't indicate me much about
the relative complexity of programs which are equivalent
by that definition.

Solution

This is a classical question in the foundations of Kolmogorov complexity: does the programming language matter for the sake of defining minimal description length? It depends on your model, but in general the answer is "no" or "almost no".

Suppose that $P_1,P_2$ are two Turing complete programming languages. In particular, in $P_1$ you can write an interpreter $I$ for $P_2$. Therefore any program $\pi$ written in $P_2$ can be converted in $P_1$ by pasting together $I$ and $\pi$. What is the length of the combined program? In any reasonable programming language, it is $O(|\pi|)$, and so different programming languages only have at most a constant multiplicative advantage over each other. If we are more stringent in our requirements on the programming languages, the answer should be $|\pi| + O(1)$, which is what happens in Kolmogorov complexity. In this case different programming languages only have at most a constant additive advantage over each other.

The idea behind $\pi + O(1)$ is that you should be able to concatenate the code for $I$ with $\pi$. But in practice this is not quite possible. Suppose for example that $P_1$ is $C$. We need to write $\pi$ as a string. If the character set of $\pi$ includes backslash or double quotes, then we will need to escape these, which could result in a multiplicative blow-up. Still, we get a $C$ program whose length is $O(|\pi|)$. If we are careful in our definitions then we can prevent this — check any introductory text on Kolmogorov complexity for the details.

Context

StackExchange Computer Science Q#23238, answer score: 5

Revisions (0)

No revisions yet.