Huffman compressor in C++
Problem
I have this Huffman compressor that can compress both text and binary files by treating everything as binary data. (By the way, it is fully compatible with this Java implementation.) Here is what I have:
Code
main.cpp
```
#include "bit_string.hpp"
#include "byte_counts.hpp"
#include "huffman_decoder.hpp"
#include "huffman_deserializer.hpp"
#include "huffman_encoder.hpp"
#include "huffman_serializer.hpp"
#include "huffman_tree.hpp"
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
using std::cout;
using std::cerr;
using std::endl;
#define ASSERT(C) if (!(C)) report(#C, __FILE__, __LINE__)
void report(const char condition, const char file, size_t line)
{
cerr & data);
std::vector file_read(std::string& file_name);
int main(int argc, const char * argv[])
{
try
{
exec(argc, argv);
}
catch (std::runtime_error& error)
{
cerr & data)
{
std::ofstream file(file_name, std::ios::out | std::ofstream::binary);
std::size_t size = data.size();
char* byte_data = new char[size];
std::copy(data.begin(), data.end(), byte_data);
file.write(byte_data, size);
file.close();
}
std::vector file_read(std::string& file_name)
{
std::ifstream file(file_name, std::ios::in | std::ifstream::binary);
std::vector ret;
std::filebuf* pbuf = file.rdbuf();
std::size_t size = pbuf->pubseekoff(0, file.end, file.in);
pbuf->pubseekpos(0, file.in);
char* buffer = new char[size];
pbuf->sgetn(buffer, size);
for (std::size_t i = 0; i != size; ++i)
{
ret.push_back((int8_t) buffer[i]);
}
delete[] buffer;
file.close();
return std::move(ret);
}
void do_decode(int argc, const char * argv[])
{
if (argc != 4)
{
throw std::runtime_error{BAD_CMD_FORMAT};
}
std::string flag = argv[1];
if (flag != DECODE_FLAG_SHORT and flag != DECODE_FLAG_LONG)
{
throw std::runtime_error{BAD_CMD_FORMAT};
}
Solution
For efficiency you mention "correct use of move semantics", but there is a more fundamental problem with this setup: it uses tree walking decoding. That algorithm really only exists to show that Huffman codes are decodable, not to actually decode them.
Unlike most other Huffman decoding algorithms, it will scale to very unbalanced trees. However, very unbalanced trees are not necessary for good compression, only for theoretical optimality. Every real life use of Huffman codes (that I know of) sacrifices theoretical optimality to enable fast decoding.
If the code length is limited* to some reasonable size (say 15, like Deflate, but you can go a bit higher; 20 is doable even with the simplest table-based decoding algorithm), you can decode a symbol using a single table lookup. The table would be indexed by a bit buffer holding sufficient bits to decode any symbol, and give you a pair (symbol, length) decoding the first symbol in the buffer. The table is easy to generate from the list of code words. There are some more complex schemes that reduce the table size, and multi-level tables would enable accelerated decoding even without a limited code word length, but that obviously comes at an efficiency price: it becomes tree decoding again, though with a high branching factor, so multiple bits are decoded per level.
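To make that concrete, here is a minimal sketch of such a single-level table. The names (`Code`, `Entry`, `build_decode_table`, `decode_one`) and the 15-bit limit are assumptions for illustration, not taken from the code above:
```
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int MAX_CODE_LEN = 15; // Deflate-style limit, an assumption here

struct Code  { uint16_t bits; uint8_t len; uint8_t symbol; }; // one code word
struct Entry { uint8_t symbol; uint8_t len; };                // lookup result

// Fill one entry per possible MAX_CODE_LEN-bit window: every bit pattern
// that starts with a given code word maps to that code's (symbol, length).
std::vector<Entry> build_decode_table(const std::vector<Code>& codes)
{
    std::vector<Entry> table(std::size_t{1} << MAX_CODE_LEN);
    for (const Code& c : codes)
    {
        int pad = MAX_CODE_LEN - c.len;            // unconstrained low bits
        uint32_t first = uint32_t(c.bits) << pad;  // code word left-aligned
        for (uint32_t i = 0; i < (uint32_t{1} << pad); ++i)
        {
            table[first | i] = Entry{c.symbol, c.len};
        }
    }
    return table;
}

// Decode one symbol: index with the next MAX_CODE_LEN bits of the stream
// (oldest bit in the highest position) and report how many bits it used.
uint8_t decode_one(const std::vector<Entry>& table, uint32_t window, int& used)
{
    Entry e = table[window & ((uint32_t{1} << MAX_CODE_LEN) - 1)];
    used = e.len;
    return e.symbol;
}
```
With a 15-bit limit the table has 32768 entries; pushing the limit to 20 bits already costs about a million, which is one reason to keep the limit modest.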
*: Code lengths can be limited by dividing the symbol counts by 2 (rounding up) and then re-creating a Huffman tree; repeat until the limit is no longer exceeded. This is a bit of a dumb way to do it (but simple to implement, and it does work); you can also solve it in one go with the package-merge algorithm.
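A sketch of that divide-and-retry loop, assuming nothing about the real codebase; the helper `max_huffman_length` below recomputes the tree height from scratch with a priority queue, tracking only (weight, subtree height) pairs:
```
#include <algorithm>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Height of the Huffman tree built from the nonzero counts,
// which equals the longest code length.
int max_huffman_length(const std::vector<uint64_t>& counts)
{
    using Item = std::pair<uint64_t, int>; // (weight, subtree height)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> q;
    for (uint64_t c : counts)
        if (c > 0)
            q.push({c, 0});
    if (q.size() < 2)
        return q.empty() ? 0 : 1;
    while (q.size() > 1)
    {
        Item a = q.top(); q.pop();
        Item b = q.top(); q.pop();
        // Merging two subtrees adds one level above both of them.
        q.push({a.first + b.first, std::max(a.second, b.second) + 1});
    }
    return q.top().second;
}

// Halve all counts (rounding up, so no symbol disappears) and rebuild
// until the longest code fits into `limit` bits.
std::vector<uint64_t> limit_code_lengths(std::vector<uint64_t> counts, int limit)
{
    while (max_huffman_length(counts) > limit)
        for (uint64_t& c : counts)
            if (c > 0)
                c = (c + 1) / 2;
    return counts;
}
```
Halving flattens the count distribution, so in the worst case all counts reach 1 and the tree becomes balanced, which is why the loop terminates for any sensible limit.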
Secondly, it does not use canonical Huffman codes. This means it has to store the full code words in the header, making the compressed file bigger. When using canonical Huffman codes, the code word lengths are sufficient to recreate the code words themselves. You can apply run-length encoding to the table of lengths to (typically) make it even smaller.
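For illustration, here is the usual length-to-code reconstruction, following the canonical scheme Deflate uses (the function name is made up): given only the code length of each symbol, it hands out consecutive code values within each length, shorter lengths first, so encoder and decoder derive identical code words from the lengths alone.
```
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint32_t> canonical_codes(const std::vector<uint8_t>& lengths)
{
    int max_len = 0;
    for (uint8_t l : lengths)
        max_len = std::max<int>(max_len, l);

    // Count how many codes exist of each length (length 0 = unused symbol).
    std::vector<uint32_t> count(max_len + 1, 0);
    for (uint8_t l : lengths)
        if (l > 0)
            ++count[l];

    // First code value for each length.
    std::vector<uint32_t> next(max_len + 1, 0);
    uint32_t code = 0;
    for (int len = 1; len <= max_len; ++len)
    {
        code = (code + count[len - 1]) << 1;
        next[len] = code;
    }

    // Assign consecutive code words in symbol order within each length.
    std::vector<uint32_t> codes(lengths.size(), 0);
    for (std::size_t s = 0; s < lengths.size(); ++s)
        if (lengths[s] > 0)
            codes[s] = next[lengths[s]]++;
    return codes;
}
```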
Of course both of these suggestions would cause you to lose compatibility with the linked Java code, so perhaps you cannot use them.
Context
StackExchange Code Review Q#148713, answer score: 3