patternMinor
Why is self-information defined the way it is?
Viewed 0 times
whythewayselfdefinedinformation
Problem
The self-information of an event of probability $p_x$ is defined as $I(p_x)=-\log_2(p_x)$.¹
I fully understand this for equiprobable events of the form $p_x = \frac{1}{2^k}$.
In that case, we want to encode $2^k$ events so we need $\log_2(2^k)=k$ bits.
So $I(p_x)$ should be $\log_2(\frac{1}{p_x}) = -\log_2(p_x)$
However I am confused about the generalization to events of different probabilities and probabilities where $p_x \neq \frac{1}{2^k}$.
Then, I am wondering why $\log$ is a good choice.
I've never come farther as to such tutorials:
http://www.i-programmer.info/babbages-bag/213-information-theory.html
But as I read it they just think about what they want, namely a function $I$ with $I(xy)=I(x)+I(y)$ and then find that $-\log_2(p_x)$ satisfies all their needs. But is $\log$ the only choice they have? $I(x)=0$ works too! What are the exact requirements that we need? And why do we finally choose $I(p_x)=-\log_2(p_x)$?
¹I know $2$ is just the special case for bits, chosen for simplicity.
I fully understand this for equiprobable events of the form $p_x = \frac{1}{2^k}$.
In that case, we want to encode $2^k$ events so we need $\log_2(2^k)=k$ bits.
So $I(p_x)$ should be $\log_2(\frac{1}{p_x}) = -\log_2(p_x)$
However I am confused about the generalization to events of different probabilities and probabilities where $p_x \neq \frac{1}{2^k}$.
Then, I am wondering why $\log$ is a good choice.
I've never come farther as to such tutorials:
http://www.i-programmer.info/babbages-bag/213-information-theory.html
But as I read it they just think about what they want, namely a function $I$ with $I(xy)=I(x)+I(y)$ and then find that $-\log_2(p_x)$ satisfies all their needs. But is $\log$ the only choice they have? $I(x)=0$ works too! What are the exact requirements that we need? And why do we finally choose $I(p_x)=-\log_2(p_x)$?
¹I know $2$ is just the special case for bits, chosen for simplicity.
Solution
Though it is more common to discuss axiomatic derivation of the expectaion of $I(\cdot)$, that it, the entropy function (to which, several sets of axioms exist, see here), for the self information $I(\cdot)$, we usually care about the following properties (as defined by Shannon):
These axioms lead to $I(p) =\log \frac1p$, up to a normalization (e.g., choosing the basis of the logarithm).
- If the event happens with certainty, there is no information in it. Thus, if $p=1$, $I(p)=0$
- Furthermore, $I(p)$ must be decreasing with $p$: the more likely the event is, the less information it gives us (or alternatively, the more surprising the event is, the more information its occurrence contains)
- As you mentioned, we want that the information of two independent events will add up: $I(p_xp_y)=I(p_x)+I(p_y)$. The event of both $X$ and $Y$ occurring has probability $p_x\cdot p_y$ if independent.
These axioms lead to $I(p) =\log \frac1p$, up to a normalization (e.g., choosing the basis of the logarithm).
Context
StackExchange Computer Science Q#43534, answer score: 4
Revisions (0)
No revisions yet.