HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Efficiently processing large (~100 MB) structured binary data in Python 3

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
efficiently100largestructuredbinarypythonprocessingdata

Problem

I need to process the data contained in a (relatively) large binary file.

The file has the following structure:

40 bytes of initialization,
4 bytes, 1 byte,
4 bytes, 1 byte,
...
lots of such 5-byte blocks
...


The number of 5-byte blocks (to which I'll refer to as 'timetags' in the following) may vary, but the total size of the file can be in the order of ~100 MBs.

In each 5-byte block the first 4 bytes encode a uint32_t (unsigned integer) 'timestamp', and the fifth byte is a number encoding a 'channel'.

My task is to find out whether there are contiguous sequences of 4 timetags, such that the corresponding timestamps are within a certain time window from each other, and if that is the case store the corresponding channels.

For example, if there is a sequence of timetags whose decoded data is

100, 2
300, 4
310, 5
340, 8,
369, 6,
413, 8


and my time window is 100, then I store the list [4,5,8,6].

In general the number of such fourfold coincidences will be extremely small with respect to the total number of timetags (e.g. for a ~100MB file I have ~10 such coincidences).
Also, the timestamps are generally in increasing order, but sometimes there is a sudden jump (when the timestamps becomes too big for the 4 bytes to encode) and the count starts over, and this has to be taken into account (see below for an example file).

I tried a variety of things to do this, but the best I achieved is a ~20s time to process a ~50MB file.
However, implementing the same code in C, I get a performance of ~0.2s for a ~100MB file.
While I understand that a C program is naturally faster than a python one, such a big performance gap really seems strange to me, so there must be something wrong with my python code.

Here is an example of the binary file to process (this is a reduced version of ~1MB).
It contains a single fourfold coincidence at the last four timetags.

Here is my best implementation:

```
#!/usr/bin/python3
import struct
import os

# ============

Solution

Reading the whole file at once is a major source of inefficiency. Think of all page faults and cache misses. Sometimes (for random access algorithms) it is unavoidable, but if the file is processed sequentially, streaming is more natural. You never need more than 2 records at a time, and you never seek backwards.

try:
        while True:
            timetag = input_file.read(timetag.size)
            timestamp, channel = struct.unpack("LB", timetag)
            ....
    except struct.error:
        pass


Besides, your code recomputes each timestamp at least twice. The recommended approach does it just once, and simplifies comparison to

def timestamps_in_window(triggering_timestamp_bytes, new_timestamp_bytes):
        return 0 < new_timestamp_bytes - triggering_timestamp_bytes < WINDOW_SIZE

Code Snippets

try:
        while True:
            timetag = input_file.read(timetag.size)
            timestamp, channel = struct.unpack("LB", timetag)
            ....
    except struct.error:
        pass
def timestamps_in_window(triggering_timestamp_bytes, new_timestamp_bytes):
        return 0 < new_timestamp_bytes - triggering_timestamp_bytes < WINDOW_SIZE

Context

StackExchange Code Review Q#117080, answer score: 6

Revisions (0)

No revisions yet.