patternMinor

Why is self-information defined the way it is?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

whythewayselfdefinedinformation

Problem

The self-information of an event of probability $p_x$ is defined as $I(p_x)=-\log_2(p_x)$.¹

I fully understand this for equiprobable events of the form $p_x = \frac{1}{2^k}$.
In that case, we want to encode $2^k$ events so we need $\log_2(2^k)=k$ bits.
So $I(p_x)$ should be $\log_2(\frac{1}{p_x}) = -\log_2(p_x)$

However I am confused about the generalization to events of different probabilities and probabilities where $p_x \neq \frac{1}{2^k}$.
Then, I am wondering why $\log$ is a good choice.

I've never come farther as to such tutorials:
http://www.i-programmer.info/babbages-bag/213-information-theory.html
But as I read it they just think about what they want, namely a function $I$ with $I(xy)=I(x)+I(y)$ and then find that $-\log_2(p_x)$ satisfies all their needs. But is $\log$ the only choice they have? $I(x)=0$ works too! What are the exact requirements that we need? And why do we finally choose $I(p_x)=-\log_2(p_x)$?

¹I know $2$ is just the special case for bits, chosen for simplicity.

Solution

Though it is more common to discuss axiomatic derivation of the expectaion of $I(\cdot)$, that it, the entropy function (to which, several sets of axioms exist, see here), for the self information $I(\cdot)$, we usually care about the following properties (as defined by Shannon):

If the event happens with certainty, there is no information in it. Thus, if $p=1$, $I(p)=0$

Furthermore, $I(p)$ must be decreasing with $p$: the more likely the event is, the less information it gives us (or alternatively, the more surprising the event is, the more information its occurrence contains)

As you mentioned, we want that the information of two independent events will add up: $I(p_xp_y)=I(p_x)+I(p_y)$. The event of both $X$ and $Y$ occurring has probability $p_x\cdot p_y$ if independent.

These axioms lead to $I(p) =\log \frac1p$, up to a normalization (e.g., choosing the basis of the logarithm).

Context

StackExchange Computer Science Q#43534, answer score: 4

Revisions (0)

No revisions yet.