Optimizing Huffman Decoding
Problem
I've been working on implementing a fast Huffman decoder for video decoding. With some help from you, I have been able to produce a decent implementation; however, I am still not satisfied with the results. I have done some research, both on Stack Overflow and in research papers, but I haven't found anything that would further improve my current implementation.
I'm using a lookup table for decoding. I've removed most of the branches and byte moves (MOVZX), and I've aligned my data for optimal cache usage.
I think I've got the decoding itself as fast as it can get; however, building the lookup table (build_tables2) takes as much time as the decoding itself.
My profiler gives me:
- decoding: 52%
- build lookup-table (build_tables2): 45%
On average my decoder builds ~100 tables per frame (of a maximum of 256, one for each node in the biggest tree), which is a lot, but I don't see how I could reduce it. 16-bit tables seem out of the question.
Does anyone have any ideas as to how I can speed up building the lookup table? Maybe I could somehow encode some kind of additional data in the frame/"input" which would help with building the tables?
```
typedef uint8_t block_type;
static const int bits_per_block = 8;
static const int values_per_block = 256;

struct node_t
{
    node_t* children; // 0 right, 1 left
    int32_t value;    // if leaf, the value of the current node; otherwise the index of the next lookup-table
    int32_t is_leaf;
};

// NOTE:
// entry is 16 bytes and will be aligned during allocation, thus "values" will always be on 16 byte alignment
struct entry
{
    uint8_t values[bits_per_block];
    std::array<entry, values_per_block>* next_table;
    int32_t num_values;
};

typedef std::array<entry, values_per_block> table_t;
typedef std::array<table_t, values_per_block> tables_t;

void build_tables2(node_t* nodes, tables_t& tables, int& table_count);
void unpack_tree(void* data, node_t* nodes);

void decode_huff(void* input, uint8_t* dest)
{
    // Initial setup
    std::shared_ptr<node_t> nodes(reinterpret_cast<node_t*>(scalable
```
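Since the listing above is truncated, here is a minimal, self-contained sketch of the scheme the question describes, not the poster's code: one 256-entry table per internal tree node, where each entry holds the symbols decoded from one input byte plus the index of the table to continue in. The names (`Node`, `Entry`, `build_tables`, `decode`, `demo`) are illustrative.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Node {
    const Node* child[2] = {nullptr, nullptr}; // 0 and 1 branches
    int value = -1;                            // symbol if leaf
    bool leaf = false;
};

struct Entry {
    std::array<uint8_t, 8> values{}; // symbols decoded from this byte (max 8 with 8-bit blocks)
    int num_values = 0;
    int next_table = 0;              // index of the table for the node the byte ends in
};

using Table = std::array<Entry, 256>;

// One table per internal node (inner[0] is assumed to be the root): walk all 8
// bits of every byte value once, emitting a symbol each time a leaf is reached.
static std::vector<Table> build_tables(const std::vector<const Node*>& inner) {
    std::vector<Table> tables(inner.size());
    for (std::size_t t = 0; t < inner.size(); ++t) {
        for (int byte = 0; byte < 256; ++byte) {
            Entry& e = tables[t][byte];
            const Node* cur = inner[t];
            for (int bit = 7; bit >= 0; --bit) {
                cur = cur->child[(byte >> bit) & 1];
                if (cur->leaf) {
                    e.values[e.num_values++] = static_cast<uint8_t>(cur->value);
                    cur = inner[0]; // back to the root
                }
            }
            // record which internal node the byte left us in
            for (std::size_t i = 0; i < inner.size(); ++i)
                if (inner[i] == cur) e.next_table = static_cast<int>(i);
        }
    }
    return tables;
}

// Decode with one table lookup per input byte; stop after n symbols.
static std::string decode(const std::vector<Table>& tables,
                          const std::vector<uint8_t>& in, int n) {
    std::string out;
    int t = 0;
    for (uint8_t b : in) {
        const Entry& e = tables[t][b];
        for (int i = 0; i < e.num_values && static_cast<int>(out.size()) < n; ++i)
            out.push_back(static_cast<char>(e.values[i]));
        t = e.next_table;
    }
    return out;
}

// Tiny demonstration code: A = 0, B = 10, C = 11.
std::string demo() {
    static Node A, B, C, n1, root;
    A.leaf = true; A.value = 'A';
    B.leaf = true; B.value = 'B';
    C.leaf = true; C.value = 'C';
    n1.child[0] = &B;  n1.child[1] = &C;
    root.child[0] = &A; root.child[1] = &n1;
    auto tables = build_tables({&root, &n1});
    // "ABCA" encodes as 0 10 11 0, zero-padded -> byte 0b01011000
    return decode(tables, {0x58}, 4);
}
```

`build_tables` here does the per-node, per-byte tree walk that the question's `build_tables2` presumably performs; its cost (256 entries times 8 bits, per table) is exactly the part the question wants to reduce.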
Solution
Do you have some control over the input encoder?
If so, it should be providing you with a Huffman-tree delta rather than a whole new tree for every frame. In the same way that the Huffman algorithm exploits similarity in the encoded data, the tree will also have a very high degree of similarity from frame to frame.
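As a sketch of how that delta could pay off on the decoder side (hypothetical: `can_reuse_tables` and the code-length representation are assumptions, not part of the answer): if each frame carries the canonical per-symbol code lengths, identical lengths mean an identical tree, so the previous frame's lookup tables can be reused without rebuilding anything.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical reuse check: with canonical Huffman codes, the tree is fully
// determined by the per-symbol code lengths, so equal lengths from one frame
// to the next mean the previously built tables are still valid.
inline bool can_reuse_tables(const std::vector<uint8_t>& prev_lengths,
                             const std::vector<uint8_t>& cur_lengths) {
    return prev_lengths == cur_lengths;
}
```

A finer-grained delta (rebuilding only tables under changed subtrees) needs care: with canonical codes a single length change can shift the codes of untouched symbols as well.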
Also some efficiency and platform concerns:
Please rewrite expressions of the form

```
((n >> #) & 1)
```

where `#` is a digit, as

```
(((n & (1 << #)) != 0) ? 1 : 0)
```

Compile-time versus run-time calculation is always going to be faster.

```
*reinterpret_cast<int32_t*>(entry.values + entry.num_values) = current->value;
```

This is dependent on a little-endian architecture. Prefer correct over fast:
```
auto* next_value = entry.values; // get the starting value element address
...
*next_value = static_cast<uint8_t>(current->value); // set its value
next_value += current->is_leaf;
...
// at the end of the loop unrolling, calculate num_values
// from the offset, thus avoiding multiple dereferences of the struct member
entry.num_values = next_value - entry.values;
```
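To make the two points above concrete (illustrative helpers, not from the answer): the two bit-test forms agree for every input, and taking the low byte of a value with a plain `static_cast` is endian-independent, unlike the pointer reinterpretation.

```cpp
#include <cstdint>
#include <cstring>

inline int bit_shift_form(int n, int k) { return (n >> k) & 1; }
// same value, but (1 << k) is a compile-time constant when k is a literal digit
inline int bit_mask_form(int n, int k) { return ((n & (1 << k)) != 0) ? 1 : 0; }

// check equivalence over every byte value and bit position
inline bool forms_agree() {
    for (int n = 0; n < 256; ++n)
        for (int k = 0; k < 8; ++k)
            if (bit_shift_form(n, k) != bit_mask_form(n, k)) return false;
    return true;
}

// The cast-store above writes the value's low byte first only on little-endian
// hosts; this reports the byte order of the machine we are running on.
inline bool host_is_little_endian() {
    const int32_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1); // first byte in memory
    return first == 1;
}

// Portable alternative: static_cast always yields the low 8 bits.
inline uint8_t low_byte(int32_t v) { return static_cast<uint8_t>(v); }
```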
Context
StackExchange Code Review Q#4518, answer score: 4