How does one choose an optimal alphabet for finding a Huffman encoding?
Problem
Huffman encoding performs best when the symbols of the alphabet used by the string to be encoded follow a dyadic distribution, that is, when every symbol probability is a negative power of two.
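To make the dyadic condition concrete (a standard fact, written with the notation $p_i$ and $\ell_i$, which is not part of the original question): if symbol $i$ occurs with probability $p_i$ and receives a codeword of length $\ell_i$, the expected length $L$ of a Huffman code satisfies

$$H \le L = \sum_i p_i \ell_i < H + 1, \qquad H = -\sum_i p_i \log_2 p_i,$$

with $L = H$ exactly when every $p_i = 2^{-\ell_i}$. For example, $p = (1/2, 1/4, 1/4)$ gives lengths $(1, 2, 2)$ and $L = H = 1.5$ bits per symbol.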
Given an arbitrary bit string $S$, how can we find the best alphabet for encoding? Suppose $S$ is an ASCII file. Given the regularity of 1-byte characters that such files exhibit, we would expect an optimal, or at least pretty good, alphabet to consist of, say, 8-bit or 16-bit words (for which we then build codes after constructing the Huffman tree). Is there an algorithm for finding the optimal word width (assuming we use constant-length words)?
I would guess that, to evaluate an alphabet fairly, we should also count the cost of storing the encoding itself. This addresses the case where the alphabet is just one symbol: the entire original string. Technically the message would be just one bit, but the stored encoding tree would have to indicate that this one bit is the code for the original string, so we would have trivially increased our message by two bits!
(Constant-length encoding information such as width size, encoding table size, etc., need not be considered for the comparisons, of course).
Solution
The size required to store the Huffman code table scales with the number of codewords. We expect the number of distinct $k$-letter words to be exponential in $k$, in fact roughly $2^{kH}$, where $H$ is the source entropy, though since the file is finite, for large $k$ we will actually see fewer. Still, this suggests that once $k$ is logarithmic in the file length, most of the $k$-letter words in the file will be almost unique, so the code table becomes comparable in size to the data and the overall compression for such a $k$ would typically be quite small. In view of that, you can just try several values of $k$ and choose the best one. After some such experiments, you can formulate, and perhaps prove, a hypothesis about the optimal value of $k$ in different situations.
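The "try several values of $k$" approach can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function names are hypothetical, the input is assumed to be a Python string of `'0'`/`'1'` characters, the tail is zero-padded to a multiple of $k$, and the table cost (each distinct word stored raw in $k$ bits plus one byte for its code length) is just one assumed cost model among many.

```python
import heapq
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code over freqs."""
    if len(freqs) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(w, i, [sym]) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next_id, s1 + s2))
        next_id += 1
    return lengths


def encoded_size_bits(bits, k):
    """Payload plus (assumed) table cost when bits is split into k-bit words."""
    # Assumption: pad the tail with zeros so the length is a multiple of k.
    padded = bits + "0" * (-len(bits) % k)
    words = [padded[i:i + k] for i in range(0, len(padded), k)]
    freqs = Counter(words)
    lengths = huffman_code_lengths(freqs)
    payload = sum(freqs[w] * lengths[w] for w in freqs)
    # Assumed table cost: k raw bits per distinct word + 8 bits for its length.
    table = len(freqs) * (k + 8)
    return payload + table


def best_word_width(bits, widths):
    """Return the candidate width with the smallest total encoded size."""
    return min(widths, key=lambda k: encoded_size_bits(bits, k))


if __name__ == "__main__":
    sample = "0001" * 64  # a highly regular 256-bit string
    for k in (1, 2, 4, 8):
        print(k, encoded_size_bits(sample, k))
    print("best width:", best_word_width(sample, [1, 2, 4, 8]))
```

Under this cost model the table term grows with the number of distinct words, so on less regular inputs a large $k$ is penalised in the way the answer describes; a different table encoding would shift where the minimum falls.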
Context
StackExchange Computer Science Q#26278, answer score: 4