Shannon Entropy of 0.922, 3 Distinct Values
Problem
Given a string of values $AAAAAAAABC$, the Shannon Entropy in log base $2$ comes to $0.922$. From what I understand, in base $2$ the Shannon Entropy rounded up is the minimum number of bits in binary to represent a single one of the values.
Taken from the introduction on this wikipedia page:
https://en.wikipedia.org/wiki/Entropy_%28information_theory%29
So, how can three values be represented by one bit? $A$ could be $1$, $B$ could be $0$; but how could you represent $C$?
Thank you in advance.
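For concreteness, the $0.922$ figure in the question can be reproduced directly from the symbol frequencies ($p_A = 0.8$, $p_B = p_C = 0.1$). A minimal sketch (the helper name is my own):

```python
from collections import Counter
from math import log2

def shannon_entropy(s: str) -> float:
    """Shannon entropy in bits per symbol, from empirical frequencies."""
    n = len(s)
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

h = shannon_entropy("AAAAAAAABC")  # -> approximately 0.922
```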
Solution
Here is a concrete encoding that can represent each symbol in less than 1 bit on average:
First, split the input string into pairs of successive characters (e.g. AAAAAAAABC becomes AA|AA|AA|AA|BC). Then encode AA as 0, AB as 100, AC as 101, BA as 110, CA as 1110, BB as 111100, BC as 111101, CB as 111110, CC as 111111.
I've not said what happens if there is an odd number of symbols, but you can just encode the last symbol using some fixed fallback encoding; it doesn't materially affect the rate when the input is long.
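The pair encoding above is a prefix code, so decoding is a simple greedy scan. A sketch of the encoder and decoder (assuming an even-length input, as discussed; the function names are my own):

```python
# Codeword table for pairs of symbols, exactly as given in the answer.
CODE = {"AA": "0", "AB": "100", "AC": "101", "BA": "110", "CA": "1110",
        "BB": "111100", "BC": "111101", "CB": "111110", "CC": "111111"}

def encode(s: str) -> str:
    """Encode a string of A/B/C symbols, two symbols at a time."""
    # Assumes len(s) is even; odd-length handling is left out as in the answer.
    return "".join(CODE[s[i:i + 2]] for i in range(0, len(s), 2))

def decode(bits: str) -> str:
    """Greedily match codewords; valid because the code is prefix-free."""
    inv = {v: k for k, v in CODE.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inv:
            out.append(inv[buf])
            buf = ""
    return "".join(out)

bits = encode("AAAAAAAABC")  # four "AA" pairs then "BC"
```

On the question's string this happens to use exactly 10 bits for 10 symbols; the $0.96$ bits-per-symbol figure below is the expectation over the pair distribution, not the cost on this one string.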
This is a Huffman code for the distribution of independent pairs of symbols, and corresponds to choosing $n = 2$ in Yuval's answer. Larger $n$ would lead to even better codes (approaching the Shannon entropy in the limit, as he mentioned).
The average number of bits per symbol pair for the above encoding is
$$\frac{8}{10} \cdot \frac{8}{10} \cdot 1 + 3 \cdot \frac{8}{10} \cdot \frac{1}{10} \cdot 3 + \frac{1}{10} \cdot \frac{8}{10} \cdot 4 + 4 \cdot \frac{1}{10} \cdot \frac{1}{10} \cdot 6 = 1.92$$
i.e. $1.92/2 = 0.96$ bits per symbol, which is quite close to the Shannon entropy of $0.922$ for such a simple encoding.
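The expected-length sum above can be checked mechanically by weighting each pair's codeword length by its probability under independent draws:

```python
# Symbol probabilities and the pair codeword table from the answer.
P = {"A": 0.8, "B": 0.1, "C": 0.1}
CODE = {"AA": "0", "AB": "100", "AC": "101", "BA": "110", "CA": "1110",
        "BB": "111100", "BC": "111101", "CB": "111110", "CC": "111111"}

# Expected bits per pair: sum over all 9 pairs of P(xy) * codeword length.
avg_pair = sum(P[x] * P[y] * len(CODE[x + y]) for x in P for y in P)
# avg_pair == 1.92, so 0.96 bits per symbol
```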
Context
StackExchange Computer Science Q#100278, answer score: 18