patternMinor
Compression algorithms for small strings - building upon / extending Huffman
Viewed 0 times
huffmanuponcompressionbuildingalgorithmssmallforstringsextending
Problem
I'm exploring algorithms to compress small strings like the following - every line is to be compressed individually, i.e. even those small strings should get compressed:
I have a couple of megabytes worth of data like the above, which allows me to create a static statistical model, i.e. for Huffman. If I have such a static model, I can embed that with the decompressor, removing the need to transmit/store it, i.e. trading compression rate for adaptability (important with these small strings).
I have now tried various algorithms, and the results of compressing those megabytes worth of strings (using a simple program tha
ZYXH1|104932,104932,927.6000,200,1,927.4000,1600,1
ZYXH1|104932,104932,3390,1600,1,3389,2700,1
HXDFF|0001739.42;HXDFF:Y|0U010:49:32
ZYXH1|104932,104932,5360,5400,1,5350,16400,1
ZYXH1|104931,104931,1966,2800,1,1965,3800,1;HXYFBR|0003.12;HXCFBR|0001.14;HXCFER|0006.7;HXMKCF|0003268409320112.0
ZYXH1|104932,104932,526,192400,1,525,88400,1
ZYXH1|104932,104932,803,6300,1,802,63900,1
ZYXDFF|803;ZYXDFF:Y|104932;ZYXVW|100800.6381;ZYXVL|1121700,104932;TAXTIC|L719;TAXNO|L1121700
ZYXH1|104932,104932,1367,2100,1,1366.5000,2800,1
HXYFBR|0002.04;HXCFBR|0001.52;HXCFER|00015.0;HXMKCF|0001676300748887.4
ZYXH1|104932,104932,380,63500,1,379,21500,1
ZYXH1|104932,104932,5360,5400,1,5350,16400,1
ZYXH1|104932,104932,803,3900,1,802,63800,1;ZYXVA|898075800
ZYXH1|104932,104932,3045,4700,1,3040,5500,1;ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.4356;ZYXVL|164600,104932;ZYXVA|498213500;TAXTIC|L199;TAXTICD|L42;TAXNO|L164600;TAXDFG|Y0058
ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.3304;ZYXVL|169800,104932;TAXTIC|L202;TAXNO|L169800
ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.2935;ZYXVL|171700,104932;TAXTIC|L206;TAXNO|L171700
ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.2764;ZYXVL|172600,104932;HXCFF|0003040;HXCFF:T|0T010:49;TAXTIC|L212;TAXNO|L172600
ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.2745;ZYXVL|172700,104932;TAXTIC|L213;TAXNO|L172700
ZYXH1|104932,104932,4168,1200,1,4167,100,1I have a couple of megabytes worth of data like the above, which allows me to create a static statistical model, i.e. for Huffman. If I have such a static model, I can embed that with the decompressor, removing the need to transmit/store it, i.e. trading compression rate for adaptability (important with these small strings).
I have now tried various algorithms, and the results of compressing those megabytes worth of strings (using a simple program tha
Solution
A simple variant of Huffman is due, I believe, to David Wheeler.
Suppose the alphabet is $\Sigma = \{s_1, \dots, s_n\}$ and let $\star$ be some new character that's not in $\Sigma$. For each character $s\in\Sigma$, let $p(s)$ be the most common character to occur immediately after $s$ in your dataset.
To compress the string $X=x_1\dots x_N$, you first scan through it and, each time $x_{i+1}=p(x_i)$, replace $x_{i+1}$ with $\star$, giving a new string over alphabet $\Sigma\cup\{{\star}\}$. Then run ordinary Huffman on the resulting string. The idea is that the character $\star$ should be really common in the modified string, so it should get a really short codeword.
For example, if your alphabet is $\{a,b,c,d\}$ and $p(a)=a$, $p(b)=a$, $p(c)=d$ and $p(d)=c$, then the string $abacd$ would be transformed to $ab{\star}c{\star}$ and $cdcdcd$ would be transformed to $c{\star}{\star}{\star}{\star}{\star}$.
Since all your example strings begin with
Suppose the alphabet is $\Sigma = \{s_1, \dots, s_n\}$ and let $\star$ be some new character that's not in $\Sigma$. For each character $s\in\Sigma$, let $p(s)$ be the most common character to occur immediately after $s$ in your dataset.
To compress the string $X=x_1\dots x_N$, you first scan through it and, each time $x_{i+1}=p(x_i)$, replace $x_{i+1}$ with $\star$, giving a new string over alphabet $\Sigma\cup\{{\star}\}$. Then run ordinary Huffman on the resulting string. The idea is that the character $\star$ should be really common in the modified string, so it should get a really short codeword.
For example, if your alphabet is $\{a,b,c,d\}$ and $p(a)=a$, $p(b)=a$, $p(c)=d$ and $p(d)=c$, then the string $abacd$ would be transformed to $ab{\star}c{\star}$ and $cdcdcd$ would be transformed to $c{\star}{\star}{\star}{\star}{\star}$.
Since all your example strings begin with
Z, it might also be worth storing the most common initial character to allow you to replace the first character of the string with $\star$, too.Context
StackExchange Computer Science Q#100106, answer score: 4
Revisions (0)
No revisions yet.