HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Why is self-information defined the way it is?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
whythewayselfdefinedinformation

Problem

The self-information of an event of probability $p_x$ is defined as $I(p_x)=-\log_2(p_x)$.¹

I fully understand this for equiprobable events of the form $p_x = \frac{1}{2^k}$.
In that case, we want to encode $2^k$ events so we need $\log_2(2^k)=k$ bits.
So $I(p_x)$ should be $\log_2(\frac{1}{p_x}) = -\log_2(p_x)$

However I am confused about the generalization to events of different probabilities and probabilities where $p_x \neq \frac{1}{2^k}$.
Then, I am wondering why $\log$ is a good choice.

I've never come farther as to such tutorials:
http://www.i-programmer.info/babbages-bag/213-information-theory.html
But as I read it they just think about what they want, namely a function $I$ with $I(xy)=I(x)+I(y)$ and then find that $-\log_2(p_x)$ satisfies all their needs. But is $\log$ the only choice they have? $I(x)=0$ works too! What are the exact requirements that we need? And why do we finally choose $I(p_x)=-\log_2(p_x)$?

¹I know $2$ is just the special case for bits, chosen for simplicity.

Solution

Though it is more common to discuss axiomatic derivation of the expectaion of $I(\cdot)$, that it, the entropy function (to which, several sets of axioms exist, see here), for the self information $I(\cdot)$, we usually care about the following properties (as defined by Shannon):

  • If the event happens with certainty, there is no information in it. Thus, if $p=1$, $I(p)=0$



  • Furthermore, $I(p)$ must be decreasing with $p$: the more likely the event is, the less information it gives us (or alternatively, the more surprising the event is, the more information its occurrence contains)



  • As you mentioned, we want that the information of two independent events will add up: $I(p_xp_y)=I(p_x)+I(p_y)$. The event of both $X$ and $Y$ occurring has probability $p_x\cdot p_y$ if independent.



These axioms lead to $I(p) =\log \frac1p$, up to a normalization (e.g., choosing the basis of the logarithm).

Context

StackExchange Computer Science Q#43534, answer score: 4

Revisions (0)

No revisions yet.