patternMinor
How would the use of an external lookup table affect file compression?
Viewed 0 times
thefilecompressionaffectwouldlookuphowexternalusetable
Problem
It occurred to me that, the way compression essentially works, is that you have a mapping of long patterns to short patterns, which you replace one way to compress and the other to decompress. My understanding of the way compression algorithms work, is that they create this mapping dynamically, with things like huffman coding being used to create the lookup table.
And so I wonder: if a general lookup table was pre-built, and not derived in any way from the input file, how would this affect the compression? My guess would be that if would make the compression slightly less effective (because the ordering of the mappings would not be optimized for the specific case), but would make it much faster to run, since it doesn't need to build the table as it goes.
And so I wonder: if a general lookup table was pre-built, and not derived in any way from the input file, how would this affect the compression? My guess would be that if would make the compression slightly less effective (because the ordering of the mappings would not be optimized for the specific case), but would make it much faster to run, since it doesn't need to build the table as it goes.
Solution
There are compression programs such as zstd which explicitly allow a separate dictionary text, and stream-based compressors such as LZW can be easily converted to use a dictionary by prepending it on the input. The dictionary appears as a fixed prefix on the output and can be removed, ie $C(xy)=C(x)C(y|x)$.
The benefit of using one text as a dictionary for another is a function of their mutual information, which is based on Kolmogorov Complexity (an incomputable minimum). However a compression algorithm turns out to be a good way to approximate it, using a technique known as Normalized Compression Distance. The texts are compressed separately and together as above, and the difference between $C(y|x)$ and $C(y)$ measures similarity. So, while it's difficult to predict the benefit, measuring it empirically has some uses.
The benefit of using one text as a dictionary for another is a function of their mutual information, which is based on Kolmogorov Complexity (an incomputable minimum). However a compression algorithm turns out to be a good way to approximate it, using a technique known as Normalized Compression Distance. The texts are compressed separately and together as above, and the difference between $C(y|x)$ and $C(y)$ measures similarity. So, while it's difficult to predict the benefit, measuring it empirically has some uses.
Context
StackExchange Computer Science Q#91384, answer score: 3
Revisions (0)
No revisions yet.