HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Compression algorithms for small strings - building upon / extending Huffman

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
huffmanuponcompressionbuildingalgorithmssmallforstringsextending

Problem

I'm exploring algorithms to compress small strings like the following - every line is to be compressed individually, i.e. even those small strings should get compressed:

ZYXH1|104932,104932,927.6000,200,1,927.4000,1600,1
ZYXH1|104932,104932,3390,1600,1,3389,2700,1
HXDFF|0001739.42;HXDFF:Y|0U010:49:32
ZYXH1|104932,104932,5360,5400,1,5350,16400,1
ZYXH1|104931,104931,1966,2800,1,1965,3800,1;HXYFBR|0003.12;HXCFBR|0001.14;HXCFER|0006.7;HXMKCF|0003268409320112.0
ZYXH1|104932,104932,526,192400,1,525,88400,1
ZYXH1|104932,104932,803,6300,1,802,63900,1
ZYXDFF|803;ZYXDFF:Y|104932;ZYXVW|100800.6381;ZYXVL|1121700,104932;TAXTIC|L719;TAXNO|L1121700
ZYXH1|104932,104932,1367,2100,1,1366.5000,2800,1
HXYFBR|0002.04;HXCFBR|0001.52;HXCFER|00015.0;HXMKCF|0001676300748887.4
ZYXH1|104932,104932,380,63500,1,379,21500,1
ZYXH1|104932,104932,5360,5400,1,5350,16400,1
ZYXH1|104932,104932,803,3900,1,802,63800,1;ZYXVA|898075800
ZYXH1|104932,104932,3045,4700,1,3040,5500,1;ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.4356;ZYXVL|164600,104932;ZYXVA|498213500;TAXTIC|L199;TAXTICD|L42;TAXNO|L164600;TAXDFG|Y0058
ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.3304;ZYXVL|169800,104932;TAXTIC|L202;TAXNO|L169800
ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.2935;ZYXVL|171700,104932;TAXTIC|L206;TAXNO|L171700
ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.2764;ZYXVL|172600,104932;HXCFF|0003040;HXCFF:T|0T010:49;TAXTIC|L212;TAXNO|L172600
ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.2745;ZYXVL|172700,104932;TAXTIC|L213;TAXNO|L172700
ZYXH1|104932,104932,4168,1200,1,4167,100,1


I have a couple of megabytes worth of data like the above, which allows me to create a static statistical model, i.e. for Huffman. If I have such a static model, I can embed that with the decompressor, removing the need to transmit/store it, i.e. trading compression rate for adaptability (important with these small strings).

I have now tried various algorithms, and the results of compressing those megabytes worth of strings (using a simple program tha

Solution

A simple variant of Huffman is due, I believe, to David Wheeler.

Suppose the alphabet is $\Sigma = \{s_1, \dots, s_n\}$ and let $\star$ be some new character that's not in $\Sigma$. For each character $s\in\Sigma$, let $p(s)$ be the most common character to occur immediately after $s$ in your dataset.

To compress the string $X=x_1\dots x_N$, you first scan through it and, each time $x_{i+1}=p(x_i)$, replace $x_{i+1}$ with $\star$, giving a new string over alphabet $\Sigma\cup\{{\star}\}$. Then run ordinary Huffman on the resulting string. The idea is that the character $\star$ should be really common in the modified string, so it should get a really short codeword.

For example, if your alphabet is $\{a,b,c,d\}$ and $p(a)=a$, $p(b)=a$, $p(c)=d$ and $p(d)=c$, then the string $abacd$ would be transformed to $ab{\star}c{\star}$ and $cdcdcd$ would be transformed to $c{\star}{\star}{\star}{\star}{\star}$.

Since all your example strings begin with Z, it might also be worth storing the most common initial character to allow you to replace the first character of the string with $\star$, too.

Context

StackExchange Computer Science Q#100106, answer score: 4

Revisions (0)

No revisions yet.