patterncsharpModerate

Python 3 decompression routine 10x slower than C# equivalent

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

10xequivalentroutineslowerthandecompressionpython

Problem

I'm relatively new to Python 3, especially IO programming, writing a Blender add-on to import model data.

The model data is available in a custom compression, and I originally wrote code in C# to decompress it in memory, porting it to Python 3.

However, being a little unsure about the "optimal" usage of IO classes and functions in Python, I got a little speed problem. The code runs 10 times slower in Python compared to the C# equivalent, and I don't find any more optimization potential, because of my limited Python knowledge.

A test yielded the following speed results at decompressing the same file (around 50 megabytes of data):

C#: ~4-5 seconds

Python: ~43 seconds

I wonder if anyone can point out where I'm failing hard in Python and need to learn a lot...

Here is the C# code. It uses an extended System.IO.BinaryReader and System.IO.BinaryWriter, operating the same; just with handling endianness more flexible.

```
public static int Decompress(Stream input, MemoryStream output)
{
using (BinaryDataReader reader = new BinaryDataReader(input, true))
using (BinaryDataWriter writer = new BinaryDataWriter(output, true))
{
reader.ByteOrder = ByteOrder.BigEndian;
uint decompressedSize = reader.ReadUInt32();
// Decompress the data.
int decompressedBytes = 0;
while (decompressedBytes = 0; i--)
{
// Check if bit of the current chunk is set.
if ((groupConfig & (1 > 12/1 byte (8 bits) + 1 nibble (4 bits)/);
if (nibble == 0)
{
// Nibble is 0, the number of bytes to read is in third byte, which is (size + 0x12).
dataSize = reader.ReadByte() + 0x12;
}
else
{
// Nibble is not 0, and determines (size + 0x02) of bytes to read.
dataSize = nibble + 0x02;
// Re

Solution

This kind of bit-slinging code is one of Python's weak spots, I'm afraid. But it is possible to make substantial improvements.

-
First of all, let's establish a baseline. This is in Python 3:

>>> benchmark = lambda:decompress(open('image.compressed', 'rb'))
>>> from timeit import timeit
>>> timeit(benchmark, number=1)
80.23794965818524

-
One of the reasons why C# runs so much faster is that it has a JIT compiler. So we can try Python's JIT compiler, PyPy. Unfortunately, this does not yet support Python 3, so we have to backport the code by adding a couple of calls to ord.

This reduces the runtime by 80%:

$ pypy cr129054.py
14.9790380001

But I suspect that PyPy does not work in the context of a Blender extension, so how much speedup can we get in plain Python 3?

-
Replace io.BytesIO with bytearray. The seek and read calls become indexing operations, and the write method calls become extend or append.

def decompress(compressed):
    decompressed_size = struct.unpack(">I", compressed.read(4))[0]
    decompressed = bytearray()
    # Decompress the data.
    while len(decompressed) H", compressed.read(2))[0]
                # If the nibble of the first back seek offset byte is
                # 0, the config is 3 bytes long.
                nibble = offset >> 12 # 1 byte (8 bits) + 1 nibble (4 bits)
                if nibble:
                    # Nibble is not 0, and determines (size + 0x02) of
                    # bytes to read.
                    data_size = nibble + 0x02
                    # Remaining bits are the real back seek offset
                    offset &= 0x0FFF
                else:
                    # Nibble is 0, the number of bytes to read is in
                    # third byte, which is (size + 0x12).
                    data_size = compressed.read(1)[0] + 0x12
                for j in range(0, data_size):
                    decompressed.append(decompressed[-offset])
    return decompressed

I haven't investigated exactly why this is faster, but it brings the runtime down by about 30%:

>>> timeit(benchmark, number=1)
55.275238760281354

-
Instead of reading and writing one byte at a time from the decompressed output:

for j in range(0, data_size):
    decompressed.append(decompressed[-offset])

read as much as possible to minimize the number of operations:

offset += 1
if data_size == offset:
    chunk = decompressed[-offset:]
elif data_size < offset:
    chunk = decompressed[-offset:data_size-offset]
else:
    copies, remainder = divmod(data_size, offset)
    chunk = decompressed[-offset:] * copies
    if remainder:
        chunk += decompressed[-offset:-offset + remainder]
decompressed.extend(chunk)

(I didn't check this very carefully, so it's possible that there's an off-by-one error. But this idea is right even if the code is wrong.)

This cuts another 15% off the runtime:

>>> timeit(benchmark, number=1)
43.47242012480274

-
Avoid having to look up modules, function, and methods by caching them in local variables:

_unpack = struct.unpack
_divmod = divmod
_read = compressed.read
_extend = decompressed.extend

This saves a little bit:

>>> timeit(benchmark, number=1)
39.8517839522101

-
Precompute the bit values. Instead of:

for i in range(7, -1, -1):
    # Check if the bit of the current chunk is set.
    if group_config & (1 << i) == 1 << i:
        # Bit is set, copy 1 raw byte to the output.
        decompressed.extend(compressed.read(1))

write:

for i in (128, 64, 32, 16, 8, 4, 2, 1):
    # Check if the bit of the current chunk is set.
    if group_config & i:
        # Bit is set, copy 1 raw byte to the output.
        _extend(_read(1))

This reduces the runtime to about a third of the original:

>>> timeit(benchmark, number=1)
27.295489253941923

That's as far as I got in plain Python. I think that if this was a bottleneck in my application, I would consider switching to the C API.

Code Snippets

>>> benchmark = lambda:decompress(open('image.compressed', 'rb'))
>>> from timeit import timeit
>>> timeit(benchmark, number=1)
80.23794965818524

$ pypy cr129054.py
14.9790380001

def decompress(compressed):
    decompressed_size = struct.unpack(">I", compressed.read(4))[0]
    decompressed = bytearray()
    # Decompress the data.
    while len(decompressed) < decompressed_size:
        # Read the configuration byte of a decompression setting
        # group, and go through each bit of it.
        group_config = compressed.read(1)[0]
        for i in range(7, -1, -1):
            # Check if the bit of the current chunk is set.
            if group_config & (1 << i) == 1 << i:
                # Bit is set, copy 1 raw byte to the output.
                decompressed.extend(compressed.read(1))
            elif len(decompressed) < decompressed_size:
                # Bit is not set and data copying configuration
                # follows, either 2 or 3 bytes long.
                offset = struct.unpack(">H", compressed.read(2))[0]
                # If the nibble of the first back seek offset byte is
                # 0, the config is 3 bytes long.
                nibble = offset >> 12 # 1 byte (8 bits) + 1 nibble (4 bits)
                if nibble:
                    # Nibble is not 0, and determines (size + 0x02) of
                    # bytes to read.
                    data_size = nibble + 0x02
                    # Remaining bits are the real back seek offset
                    offset &= 0x0FFF
                else:
                    # Nibble is 0, the number of bytes to read is in
                    # third byte, which is (size + 0x12).
                    data_size = compressed.read(1)[0] + 0x12
                for j in range(0, data_size):
                    decompressed.append(decompressed[-offset])
    return decompressed

>>> timeit(benchmark, number=1)
55.275238760281354

for j in range(0, data_size):
    decompressed.append(decompressed[-offset])

Context

StackExchange Code Review Q#129054, answer score: 13

Revisions (0)

No revisions yet.