Optimizing Huffman Decoding
Problem
I've been working on implementing a fast Huffman decoder for video decoding. With some help from you, I have been able to produce a decent implementation; however, I am still not satisfied with the results. I have done some research, both on Stack Overflow and in research papers, but I haven't found anything that would further improve my current implementation.
I'm using a lookup table for decoding. I've removed most of the branches and byte moves (MOVZX), and I've aligned my data for optimal cache usage.
I think I've got the decoding itself as fast as it can get; however, building the lookup table (build_tables2) takes as much time as the decoding itself.
My profiler gives me:
- decoding: 52%
- build lookup-table (build_tables2): 45%
On average my decoder builds ~100 tables per frame (of a maximum of 256, one for each node in the biggest tree), which is a lot, but I don't see how I could reduce it. 16-bit tables seem out of the question.
Does anyone have any ideas as to how I can speed up building the lookup table? Maybe I could somehow encode some kind of additional data in the frame/"input" which would help with building the tables?
```
typedef uint8_t block_type;
static const int bits_per_block = 8;
static const int values_per_block = 256;

struct node_t
{
    node_t* children; // 0 right, 1 left
    int32_t value;    // if leaf, the value of the current node; otherwise the index of the next lookup-table
    int32_t is_leaf;
};

// NOTE:
// entry is 16 bytes and will be aligned during allocation, thus "values" will always be on 16 byte alignment
struct entry
{
    uint8_t values[bits_per_block];
    std::array<entry, values_per_block>* next_table;
    int32_t num_values;
};

typedef std::array<entry, values_per_block> table_t;
typedef std::array<table_t, values_per_block> tables_t;

void build_tables2(node_t* nodes, tables_t& tables, int& table_count);
void unpack_tree(void* data, node_t* nodes);

void decode_huff(void* input, uint8_t* dest)
{
    // Initial setup
    std::shared_ptr<node_t> nodes(reinterpret_cast<node_t*>(scalable
```
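Since the listing above is truncated, here is a minimal, self-contained sketch of the scheme the question describes, not the poster's code: one 256-entry table per internal tree node, where each entry holds the symbols decoded from one input byte plus the index of the table to continue in. The names (`Node`, `Entry`, `build_tables`, `decode`, `demo`) are illustrative.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Node {
    const Node* child[2] = {nullptr, nullptr}; // 0 and 1 branches
    int value = -1;                            // symbol if leaf
    bool leaf = false;
};

struct Entry {
    std::array<uint8_t, 8> values{}; // symbols decoded from this byte (max 8 with 8-bit blocks)
    int num_values = 0;
    int next_table = 0;              // index of the table for the node the byte ends in
};

using Table = std::array<Entry, 256>;

// One table per internal node (inner[0] is assumed to be the root): walk all 8
// bits of every byte value once, emitting a symbol each time a leaf is reached.
static std::vector<Table> build_tables(const std::vector<const Node*>& inner) {
    std::vector<Table> tables(inner.size());
    for (std::size_t t = 0; t < inner.size(); ++t) {
        for (int byte = 0; byte < 256; ++byte) {
            Entry& e = tables[t][byte];
            const Node* cur = inner[t];
            for (int bit = 7; bit >= 0; --bit) {
                cur = cur->child[(byte >> bit) & 1];
                if (cur->leaf) {
                    e.values[e.num_values++] = static_cast<uint8_t>(cur->value);
                    cur = inner[0]; // back to the root
                }
            }
            // record which internal node the byte left us in
            for (std::size_t i = 0; i < inner.size(); ++i)
                if (inner[i] == cur) e.next_table = static_cast<int>(i);
        }
    }
    return tables;
}

// Decode with one table lookup per input byte; stop after n symbols.
static std::string decode(const std::vector<Table>& tables,
                          const std::vector<uint8_t>& in, int n) {
    std::string out;
    int t = 0;
    for (uint8_t b : in) {
        const Entry& e = tables[t][b];
        for (int i = 0; i < e.num_values && static_cast<int>(out.size()) < n; ++i)
            out.push_back(static_cast<char>(e.values[i]));
        t = e.next_table;
    }
    return out;
}

// Tiny demonstration code: A = 0, B = 10, C = 11.
std::string demo() {
    static Node A, B, C, n1, root;
    A.leaf = true; A.value = 'A';
    B.leaf = true; B.value = 'B';
    C.leaf = true; C.value = 'C';
    n1.child[0] = &B;  n1.child[1] = &C;
    root.child[0] = &A; root.child[1] = &n1;
    auto tables = build_tables({&root, &n1});
    // "ABCA" encodes as 0 10 11 0, zero-padded -> byte 0b01011000
    return decode(tables, {0x58}, 4);
}
```

`build_tables` here does the per-node, per-byte tree walk that the question's `build_tables2` presumably performs; its cost (256 entries times 8 bits, per table) is exactly the part the question wants to reduce.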
Solution
Do you have some control over the input encoder?
If so, it should be providing you with a Huffman-tree delta rather than a whole new tree for every frame. In the same way that the Huffman algorithm exploits similarity in the encoded data, the tree will also have a very high degree of similarity from frame to frame.
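As a sketch of how that delta could pay off on the decoder side (hypothetical: `can_reuse_tables` and the code-length representation are assumptions, not part of the answer): if each frame carries the canonical per-symbol code lengths, identical lengths mean an identical tree, so the previous frame's lookup tables can be reused without rebuilding anything.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical reuse check: with canonical Huffman codes, the tree is fully
// determined by the per-symbol code lengths, so equal lengths from one frame
// to the next mean the previously built tables are still valid.
inline bool can_reuse_tables(const std::vector<uint8_t>& prev_lengths,
                             const std::vector<uint8_t>& cur_lengths) {
    return prev_lengths == cur_lengths;
}
```

A finer-grained delta (rebuilding only tables under changed subtrees) needs care: with canonical codes a single length change can shift the codes of untouched symbols as well.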
Also some efficiency and platform concerns:
Please rewrite expressions of the form

```
((n >> #) & 1)
```

where `#` is a digit, as

```
(((n & (1 << #)) != 0) ? 1 : 0)
```

Compile-time versus run-time calculation is always going to be faster.

```
*reinterpret_cast<int32_t*>(entry.values + entry.num_values) = current->value;
```

This is dependent on a little-endian architecture. Prefer correct over fast:
```
auto* next_value = entry.values; // get the starting value element address
...
*next_value = static_cast<uint8_t>(current->value); // set its value
next_value += current->is_leaf;
...
// at the end of the loop unrolling, calculate num_values
// from the offset, thus avoiding multiple dereferences of the struct member
entry.num_values = next_value - entry.values;
```
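To make the two points above concrete (illustrative helpers, not from the answer): the two bit-test forms agree for every input, and taking the low byte of a value with a plain `static_cast` is endian-independent, unlike the pointer reinterpretation.

```cpp
#include <cstdint>
#include <cstring>

inline int bit_shift_form(int n, int k) { return (n >> k) & 1; }
// same value, but (1 << k) is a compile-time constant when k is a literal digit
inline int bit_mask_form(int n, int k) { return ((n & (1 << k)) != 0) ? 1 : 0; }

// check equivalence over every byte value and bit position
inline bool forms_agree() {
    for (int n = 0; n < 256; ++n)
        for (int k = 0; k < 8; ++k)
            if (bit_shift_form(n, k) != bit_mask_form(n, k)) return false;
    return true;
}

// The cast-store above writes the value's low byte first only on little-endian
// hosts; this reports the byte order of the machine we are running on.
inline bool host_is_little_endian() {
    const int32_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1); // first byte in memory
    return first == 1;
}

// Portable alternative: static_cast always yields the low 8 bits.
inline uint8_t low_byte(int32_t v) { return static_cast<uint8_t>(v); }
```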
Context
StackExchange Code Review Q#4518, answer score: 4