patternpythonMinor
Efficiently processing large (~100 MB) structured binary data in Python 3
Viewed 0 times
efficiently100largestructuredbinarypythonprocessingdata
Problem
I need to process the data contained in a (relatively) large binary file.
The file has the following structure:
The number of 5-byte blocks (to which I'll refer to as 'timetags' in the following) may vary, but the total size of the file can be in the order of ~100 MBs.
In each 5-byte block the first 4 bytes encode a
My task is to find out whether there are contiguous sequences of 4 timetags, such that the corresponding timestamps are within a certain time window from each other, and if that is the case store the corresponding channels.
For example, if there is a sequence of timetags whose decoded data is
and my time window is
In general the number of such fourfold coincidences will be extremely small with respect to the total number of timetags (e.g. for a ~100MB file I have ~10 such coincidences).
Also, the timestamps are generally in increasing order, but sometimes there is a sudden jump (when the timestamps becomes too big for the 4 bytes to encode) and the count starts over, and this has to be taken into account (see below for an example file).
I tried a variety of things to do this, but the best I achieved is a ~20s time to process a ~50MB file.
However, implementing the same code in C, I get a performance of ~0.2s for a ~100MB file.
While I understand that a C program is naturally faster than a python one, such a big performance gap really seems strange to me, so there must be something wrong with my python code.
Here is an example of the binary file to process (this is a reduced version of ~1MB).
It contains a single fourfold coincidence at the last four timetags.
Here is my best implementation:
```
#!/usr/bin/python3
import struct
import os
# ============
The file has the following structure:
40 bytes of initialization,
4 bytes, 1 byte,
4 bytes, 1 byte,
...
lots of such 5-byte blocks
...The number of 5-byte blocks (to which I'll refer to as 'timetags' in the following) may vary, but the total size of the file can be in the order of ~100 MBs.
In each 5-byte block the first 4 bytes encode a
uint32_t (unsigned integer) 'timestamp', and the fifth byte is a number encoding a 'channel'.My task is to find out whether there are contiguous sequences of 4 timetags, such that the corresponding timestamps are within a certain time window from each other, and if that is the case store the corresponding channels.
For example, if there is a sequence of timetags whose decoded data is
100, 2
300, 4
310, 5
340, 8,
369, 6,
413, 8and my time window is
100, then I store the list [4,5,8,6].In general the number of such fourfold coincidences will be extremely small with respect to the total number of timetags (e.g. for a ~100MB file I have ~10 such coincidences).
Also, the timestamps are generally in increasing order, but sometimes there is a sudden jump (when the timestamps becomes too big for the 4 bytes to encode) and the count starts over, and this has to be taken into account (see below for an example file).
I tried a variety of things to do this, but the best I achieved is a ~20s time to process a ~50MB file.
However, implementing the same code in C, I get a performance of ~0.2s for a ~100MB file.
While I understand that a C program is naturally faster than a python one, such a big performance gap really seems strange to me, so there must be something wrong with my python code.
Here is an example of the binary file to process (this is a reduced version of ~1MB).
It contains a single fourfold coincidence at the last four timetags.
Here is my best implementation:
```
#!/usr/bin/python3
import struct
import os
# ============
Solution
Reading the whole file at once is a major source of inefficiency. Think of all page faults and cache misses. Sometimes (for random access algorithms) it is unavoidable, but if the file is processed sequentially, streaming is more natural. You never need more than 2 records at a time, and you never seek backwards.
Besides, your code recomputes each timestamp at least twice. The recommended approach does it just once, and simplifies comparison to
try:
while True:
timetag = input_file.read(timetag.size)
timestamp, channel = struct.unpack("LB", timetag)
....
except struct.error:
passBesides, your code recomputes each timestamp at least twice. The recommended approach does it just once, and simplifies comparison to
def timestamps_in_window(triggering_timestamp_bytes, new_timestamp_bytes):
return 0 < new_timestamp_bytes - triggering_timestamp_bytes < WINDOW_SIZECode Snippets
try:
while True:
timetag = input_file.read(timetag.size)
timestamp, channel = struct.unpack("LB", timetag)
....
except struct.error:
passdef timestamps_in_window(triggering_timestamp_bytes, new_timestamp_bytes):
return 0 < new_timestamp_bytes - triggering_timestamp_bytes < WINDOW_SIZEContext
StackExchange Code Review Q#117080, answer score: 6
Revisions (0)
No revisions yet.