HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Algorithm for compressing binary data that can efficiently be partially decompressed

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
canefficientlydecompressedpartiallyalgorithmbinarycompressingforthatdata

Problem

So I have been given the following problem and since I have no experience at all in this field, I would really appreciate some advice/guidance.

These are the requirements:

So three times per second a certain amount (fixed per session) of 32-bit values come in. These values have to be stored into a file with a timestamp per data-entry and it must be possible to at any time, with a given timestamp and interval, read out the values that are stored in that file.

This is all relatively simple to do, but this produces a 5GB file every 24 hours. If this file is zipped, it just leaves a 500MB file, so it is compressable but in that case you can't instantly retreive datapoints because you would have to unzip the entire file.

So I was thinking that this could be solved like this:

  • Buffer the first 10MB of sensor data.



  • Then sort all the values by the amount of times that they contain the same value. (Sensor data, so quite a few of the channels will


mostly contain stable values)

  • Run an algorithm that detects recurring patterns, replaces those patterns with shorter values and defines a table with the recurring patterns and their replacements.



  • The compress the buffer and save it to the disk and compress every new series of sensor data with the pre-defined table.



If this would be implemented properly, it should be possible to uncompress an individual datapoint and the file is sorted by timestamp so finding the right one should be trivial with a binary search.

So I guess the questions I'm having are the follwing

  • Is this at all possible? I do not expect compression results near Zip and 2x - 5x would be enough.



  • What is a good (and if possible semi-easy to implement) algorithm to find patterns in binary data? I have been doing some research and I mostly ran into algorithms that only apply to text data.



Any form of help is greatly appreciated!

Solution

Eliminate timestamps

If the data is sampled at regular intervals (e.g. three times per second), it is sufficient to store the timestamp of the first point and the interval. This enables you to calulate the offset from a timestamp and vice versa and grants you access to any datapoint in constant time.

Compression schemes

As others have already laid out, you will always lose the ability to seek when using a compression scheme. Compress blockwise and search within the blocks.

Lempel-Ziv-like algorithms

The Lempel-Ziv family and similar algorithms provide good compression for data with many/long reoccurring strings, but fail to recognize the relationship between integers in noisy or slowly changing sequences. Below are three sequences of integers: the first consists of constant values, the second counts upward and somehow resembles timestamps, and the third is noisy and represents a sensor reading.

$$[900, 900, 900, 900, 900, 900, 900, 900, 900, 900]$$
$$[890, 892, 894, 896, 898, 900, 902, 904, 906, 908]$$
$$[900, 890, 905, 901, 918, 892, 919, 916, 904, 920]$$

Zip will greatly reduce the size of the first sequence, have a hard time on the second and won't provide any compression on the third.

Delta encoding

To reduce the number of bits per number, the range of possible numbers must get smaller. If we incorporate the knowledge that values change slowly, we conclude that storing only the difference to the previous value yields values close to zero. The three example sequences above become:

$$[900, 0, 0, 0, 0, 0, 0, 0, 0, 0]$$
$$[890, 2, 2, 2, 2, 2, 2, 2, 2, 2]$$
$$[900, -10, 15, -4, 17, -26, 27, -3, -12, 16]$$

If we know the rate of change is approximately constant, as in case of timestamps, we can apply delta encoding twice. Below is a sequence of timestamps in original form and after applying delta encoding once/twice.

$$[96, 296, 497, 701, 904, 1102]$$
$$[96, 200, 201, 204, 203, 198]$$
$$[96, 104, 1, 3, -1, -5]$$

All we need now is a way to make the (on average) smaller numbers into less disk space.

Variable width integers

We can reduce the number of bits by using variable width integers that use less bits for smaller (by magnitude) numbers. The usual encoding is:

  • For numbers that fit into 7-bit signed integers, use one byte.



  • For larger numbers



  • store the 7 least significant bits in one byte.



  • set the MSB (continue-flag).



  • encode the remainder in the following bytes, using the same encoding.



This way 32-bit integers are encoded in 1-5 bytes, dependent on the actual value.

Conclusion

  • If possible, don't store timestamps at all (independent of compression).



  • Compress in blocks. Trade-off between access time and compression.



  • If the sensor readings are exactly constant for some periods of time, zip is the way to go.



  • If the sensor readings are approximately constant, use delta encoding.



  • If the timestamps are changing at approximately constant rate, use delta encoding twice.

Context

StackExchange Computer Science Q#54722, answer score: 3

Revisions (0)

No revisions yet.