HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Length of shortest codeword in Huffman encoding

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
huffmanshortestlengthencodingcodeword

Problem

Under Huffman Encoding, if one character occurs more than 1/3rd of the time, is it guaranteed that there will be at least one character whose codeword is of length 1?

I thought of 2 cases where this might be true, but they're not exactly proofs -

Case 1: There is only one character which occurs more than 1/3rd of the time. In this case, a good huffman encoding scheme will ensure that this character corresponds to a single-character encoding, because there are no competing characters. Even though there may be characters that occur with high probability (but still less than 0.33), they must have length greater than 1 to ensure the prefix-free nature of the encoding.

Case 2: There are multiple characters that occur more than 33% of the time. We can also make the stronger assertion that in this case, there can only be at most 2 such characters. This can be further subdivided into cases where 1) there are a total of 2 characters in the encoding scheme, and 2) where there are more than 2 characters in the scheme. In case 1), both the characters can correspond to a single-character encoding scheme (0 and 1 for each respectively). In case 2), at most 1 of the 2 largely prevalent/frequent characters must have an encoding of length of 1 (since we're considering a good huffman encoding scheme) while the other n-1 characters have a length of at least 2.

I'm afraid that this is not the right approach. Could someone please help me out?

Solution

Consider the distribution $\Pr[a]=0.34,\Pr[b]=\Pr[c]=\Pr[d]=0.22$.

Huffman's algorithm will first merge (say) $c$ and $d$ to create the symbol $cd$ with probability $0.44$. It will then merge $a$ and $b$ to create the symbol $ab$ with probability $0.56$. Finally, it will merge the remaining symbols.

All codewords thus have length 2.

You can check that the same example works up to $\Pr[a] = 0.4$ (we need $\Pr[a] \leq 2\Pr[b] = (2/3)(1-\Pr[a])$). What if $\Pr[a] > 0.4$? Consider the time at which Huffman's algorithm merges $a$ with some symbol $X$ (which may be composed of several original symbols). If $a$ and $X$ are the only symbols, then $a$ will be encoded using a symbol of length 1. Otherwise, there is some other symbol $Y$ of probability at least $\max(\Pr[a],\Pr[X]) > 0.4$. We see that $a,X,Y$ must be the only remaining symbols, and furthermore $\Pr[a] > \Pr[X]$ (otherwise there would 3 symbols whose probability exceeds 0.4).

Suppose that $Y$ was obtained by merging two other symbols $Z,W$. At that point, since $X$ itself or one of its constituents wasn't merged, we must have $\Pr[X] \geq \max(\Pr[Z],\Pr[W]) \geq \Pr[Y]/2$. But $\Pr[Y] > 0.4$ and so $\Pr[X] > 0.2$, and so the probabilities sum up to more than 1. This contradiction shows that $Y$ is an original symbol, and so in this case there is a codeword of length 1.

Concluding, the threshold for guaranteeing a codeword of length 1 is 0.4.

Context

StackExchange Computer Science Q#104512, answer score: 2

Revisions (0)

No revisions yet.