patterncsharpModerate
Python 3 decompression routine 10x slower than C# equivalent
Viewed 0 times
10xequivalentroutineslowerthandecompressionpython
Problem
I'm relatively new to Python 3, especially IO programming, writing a Blender add-on to import model data.
The model data is available in a custom compression, and I originally wrote code in C# to decompress it in memory, porting it to Python 3.
However, being a little unsure about the "optimal" usage of IO classes and functions in Python, I got a little speed problem. The code runs 10 times slower in Python compared to the C# equivalent, and I don't find any more optimization potential, because of my limited Python knowledge.
A test yielded the following speed results at decompressing the same file (around 50 megabytes of data):
I wonder if anyone can point out where I'm failing hard in Python and need to learn a lot...
Here is the C# code. It uses an extended
```
public static int Decompress(Stream input, MemoryStream output)
{
using (BinaryDataReader reader = new BinaryDataReader(input, true))
using (BinaryDataWriter writer = new BinaryDataWriter(output, true))
{
reader.ByteOrder = ByteOrder.BigEndian;
uint decompressedSize = reader.ReadUInt32();
// Decompress the data.
int decompressedBytes = 0;
while (decompressedBytes = 0; i--)
{
// Check if bit of the current chunk is set.
if ((groupConfig & (1 > 12/1 byte (8 bits) + 1 nibble (4 bits)/);
if (nibble == 0)
{
// Nibble is 0, the number of bytes to read is in third byte, which is (size + 0x12).
dataSize = reader.ReadByte() + 0x12;
}
else
{
// Nibble is not 0, and determines (size + 0x02) of bytes to read.
dataSize = nibble + 0x02;
// Re
The model data is available in a custom compression, and I originally wrote code in C# to decompress it in memory, porting it to Python 3.
However, being a little unsure about the "optimal" usage of IO classes and functions in Python, I got a little speed problem. The code runs 10 times slower in Python compared to the C# equivalent, and I don't find any more optimization potential, because of my limited Python knowledge.
A test yielded the following speed results at decompressing the same file (around 50 megabytes of data):
- C#: ~4-5 seconds
- Python: ~43 seconds
I wonder if anyone can point out where I'm failing hard in Python and need to learn a lot...
Here is the C# code. It uses an extended
System.IO.BinaryReader and System.IO.BinaryWriter, operating the same; just with handling endianness more flexible.```
public static int Decompress(Stream input, MemoryStream output)
{
using (BinaryDataReader reader = new BinaryDataReader(input, true))
using (BinaryDataWriter writer = new BinaryDataWriter(output, true))
{
reader.ByteOrder = ByteOrder.BigEndian;
uint decompressedSize = reader.ReadUInt32();
// Decompress the data.
int decompressedBytes = 0;
while (decompressedBytes = 0; i--)
{
// Check if bit of the current chunk is set.
if ((groupConfig & (1 > 12/1 byte (8 bits) + 1 nibble (4 bits)/);
if (nibble == 0)
{
// Nibble is 0, the number of bytes to read is in third byte, which is (size + 0x12).
dataSize = reader.ReadByte() + 0x12;
}
else
{
// Nibble is not 0, and determines (size + 0x02) of bytes to read.
dataSize = nibble + 0x02;
// Re
Solution
This kind of bit-slinging code is one of Python's weak spots, I'm afraid. But it is possible to make substantial improvements.
-
First of all, let's establish a baseline. This is in Python 3:
-
One of the reasons why C# runs so much faster is that it has a JIT compiler. So we can try Python's JIT compiler, PyPy. Unfortunately, this does not yet support Python 3, so we have to backport the code by adding a couple of calls to
This reduces the runtime by 80%:
But I suspect that PyPy does not work in the context of a Blender extension, so how much speedup can we get in plain Python 3?
-
Replace
I haven't investigated exactly why this is faster, but it brings the runtime down by about 30%:
-
Instead of reading and writing one byte at a time from the decompressed output:
read as much as possible to minimize the number of operations:
(I didn't check this very carefully, so it's possible that there's an off-by-one error. But this idea is right even if the code is wrong.)
This cuts another 15% off the runtime:
-
Avoid having to look up modules, function, and methods by caching them in local variables:
This saves a little bit:
-
Precompute the bit values. Instead of:
write:
This reduces the runtime to about a third of the original:
That's as far as I got in plain Python. I think that if this was a bottleneck in my application, I would consider switching to the C API.
-
First of all, let's establish a baseline. This is in Python 3:
>>> benchmark = lambda:decompress(open('image.compressed', 'rb'))
>>> from timeit import timeit
>>> timeit(benchmark, number=1)
80.23794965818524-
One of the reasons why C# runs so much faster is that it has a JIT compiler. So we can try Python's JIT compiler, PyPy. Unfortunately, this does not yet support Python 3, so we have to backport the code by adding a couple of calls to
ord.This reduces the runtime by 80%:
$ pypy cr129054.py
14.9790380001But I suspect that PyPy does not work in the context of a Blender extension, so how much speedup can we get in plain Python 3?
-
Replace
io.BytesIO with bytearray. The seek and read calls become indexing operations, and the write method calls become extend or append.def decompress(compressed):
decompressed_size = struct.unpack(">I", compressed.read(4))[0]
decompressed = bytearray()
# Decompress the data.
while len(decompressed) H", compressed.read(2))[0]
# If the nibble of the first back seek offset byte is
# 0, the config is 3 bytes long.
nibble = offset >> 12 # 1 byte (8 bits) + 1 nibble (4 bits)
if nibble:
# Nibble is not 0, and determines (size + 0x02) of
# bytes to read.
data_size = nibble + 0x02
# Remaining bits are the real back seek offset
offset &= 0x0FFF
else:
# Nibble is 0, the number of bytes to read is in
# third byte, which is (size + 0x12).
data_size = compressed.read(1)[0] + 0x12
for j in range(0, data_size):
decompressed.append(decompressed[-offset])
return decompressedI haven't investigated exactly why this is faster, but it brings the runtime down by about 30%:
>>> timeit(benchmark, number=1)
55.275238760281354-
Instead of reading and writing one byte at a time from the decompressed output:
for j in range(0, data_size):
decompressed.append(decompressed[-offset])read as much as possible to minimize the number of operations:
offset += 1
if data_size == offset:
chunk = decompressed[-offset:]
elif data_size < offset:
chunk = decompressed[-offset:data_size-offset]
else:
copies, remainder = divmod(data_size, offset)
chunk = decompressed[-offset:] * copies
if remainder:
chunk += decompressed[-offset:-offset + remainder]
decompressed.extend(chunk)(I didn't check this very carefully, so it's possible that there's an off-by-one error. But this idea is right even if the code is wrong.)
This cuts another 15% off the runtime:
>>> timeit(benchmark, number=1)
43.47242012480274-
Avoid having to look up modules, function, and methods by caching them in local variables:
_unpack = struct.unpack
_divmod = divmod
_read = compressed.read
_extend = decompressed.extendThis saves a little bit:
>>> timeit(benchmark, number=1)
39.8517839522101-
Precompute the bit values. Instead of:
for i in range(7, -1, -1):
# Check if the bit of the current chunk is set.
if group_config & (1 << i) == 1 << i:
# Bit is set, copy 1 raw byte to the output.
decompressed.extend(compressed.read(1))write:
for i in (128, 64, 32, 16, 8, 4, 2, 1):
# Check if the bit of the current chunk is set.
if group_config & i:
# Bit is set, copy 1 raw byte to the output.
_extend(_read(1))This reduces the runtime to about a third of the original:
>>> timeit(benchmark, number=1)
27.295489253941923That's as far as I got in plain Python. I think that if this was a bottleneck in my application, I would consider switching to the C API.
Code Snippets
>>> benchmark = lambda:decompress(open('image.compressed', 'rb'))
>>> from timeit import timeit
>>> timeit(benchmark, number=1)
80.23794965818524$ pypy cr129054.py
14.9790380001def decompress(compressed):
decompressed_size = struct.unpack(">I", compressed.read(4))[0]
decompressed = bytearray()
# Decompress the data.
while len(decompressed) < decompressed_size:
# Read the configuration byte of a decompression setting
# group, and go through each bit of it.
group_config = compressed.read(1)[0]
for i in range(7, -1, -1):
# Check if the bit of the current chunk is set.
if group_config & (1 << i) == 1 << i:
# Bit is set, copy 1 raw byte to the output.
decompressed.extend(compressed.read(1))
elif len(decompressed) < decompressed_size:
# Bit is not set and data copying configuration
# follows, either 2 or 3 bytes long.
offset = struct.unpack(">H", compressed.read(2))[0]
# If the nibble of the first back seek offset byte is
# 0, the config is 3 bytes long.
nibble = offset >> 12 # 1 byte (8 bits) + 1 nibble (4 bits)
if nibble:
# Nibble is not 0, and determines (size + 0x02) of
# bytes to read.
data_size = nibble + 0x02
# Remaining bits are the real back seek offset
offset &= 0x0FFF
else:
# Nibble is 0, the number of bytes to read is in
# third byte, which is (size + 0x12).
data_size = compressed.read(1)[0] + 0x12
for j in range(0, data_size):
decompressed.append(decompressed[-offset])
return decompressed>>> timeit(benchmark, number=1)
55.275238760281354for j in range(0, data_size):
decompressed.append(decompressed[-offset])Context
StackExchange Code Review Q#129054, answer score: 13
Revisions (0)
No revisions yet.