patterncppModerate
Simple LZW compression algorithm
Viewed 0 times
lzwsimplecompressionalgorithm
Problem
It's a simple version of LZW compression algorithm with 12 bit codes. I want to know what's good and what's bad about this code. Be as picky as you like.
```
// Compile with gcc 4.7.2 or later, using the following command line:
//
// g++ -std=c++0x lzw.c -o lzw
//
//LZW algorithm implemented using fixed 12 bit codes.
#include
#include
#include
#include
#include
#include
#define MAX_DEF 4096
using namespace std;
string convert_int_to_bin(int number)
{
string result = bitset(number).to_string();
return result;
}
void compress(string input, int size, string filename) {
unordered_map compress_dictionary(MAX_DEF);
//Dictionary initializing with ASCII
for ( int unsigned i = 0 ; i dictionary(MAX_DEF);
//Dictionary initializing with ASCII
for ( int unsigned i = 0 ; i binary(subinput);
code = binary.to_ullong();
i+=12;
if ( dictionary.find(code) ==dictionary.end() )
dictionary.insert(make_pair(code,(previous_string + previous_string.substr(0,1))));
outputFile<<dictionary[code];
if ( previous_string.size())
dictionary.insert(make_pair(next_code++,previous_string + dictionary[code][0]));
previous_string = dictionary[code];
}
outputFile.close();
}
string convert_char_to_string(const char *pCh, int arraySize){
string str;
if (pCh[arraySize-1] == '\0') str.append(pCh);
else for(int i=0; i<arraySize; i++) str.append(1,pCh[i]);
return str;
}
static void show_usage()
{
cerr << "Usage: \n"
<< "Specify the file that needs to be compressed or decompressed\n"
<<"lzw -c input #compress file input\n"
<<"lzw -d input #decompress file input\n"
<<"Compressed data will be found in a file with the same name but with a .lzw extension\n"
<<"Decompressed data can be found in a file with the same name and a _uncompressed.txt extension\n"
<< endl;
}
in
```
// Compile with gcc 4.7.2 or later, using the following command line:
//
// g++ -std=c++0x lzw.c -o lzw
//
//LZW algorithm implemented using fixed 12 bit codes.
#include
#include
#include
#include
#include
#include
#define MAX_DEF 4096
using namespace std;
string convert_int_to_bin(int number)
{
string result = bitset(number).to_string();
return result;
}
void compress(string input, int size, string filename) {
unordered_map compress_dictionary(MAX_DEF);
//Dictionary initializing with ASCII
for ( int unsigned i = 0 ; i dictionary(MAX_DEF);
//Dictionary initializing with ASCII
for ( int unsigned i = 0 ; i binary(subinput);
code = binary.to_ullong();
i+=12;
if ( dictionary.find(code) ==dictionary.end() )
dictionary.insert(make_pair(code,(previous_string + previous_string.substr(0,1))));
outputFile<<dictionary[code];
if ( previous_string.size())
dictionary.insert(make_pair(next_code++,previous_string + dictionary[code][0]));
previous_string = dictionary[code];
}
outputFile.close();
}
string convert_char_to_string(const char *pCh, int arraySize){
string str;
if (pCh[arraySize-1] == '\0') str.append(pCh);
else for(int i=0; i<arraySize; i++) str.append(1,pCh[i]);
return str;
}
static void show_usage()
{
cerr << "Usage: \n"
<< "Specify the file that needs to be compressed or decompressed\n"
<<"lzw -c input #compress file input\n"
<<"lzw -d input #decompress file input\n"
<<"Compressed data will be found in a file with the same name but with a .lzw extension\n"
<<"Decompressed data can be found in a file with the same name and a _uncompressed.txt extension\n"
<< endl;
}
in
Solution
Here are some things I see that may help you improve your code.
Shouldn't a compressed file be smaller?
Imagine my surprise when I discovered that a 4057-byte file (the source code itself) became 18720 bytes when compressed! Then I looked at the "compressed" file and saw that it was composed of ASCII
Pass by const reference where practical
The first argument to
Note that this is a
The
Remember to free allocated memory
The program should
Eliminate buggy
The code for converting from a char pointer to a string as currently in the code is this:
However this has a few problems. First is the bug that occurs if the passed buffer has two
Note that there are faster ways to do this, but none as succinct.
Use C++11 compile switches
The program uses C++ 11 constructs, so I'd recommend changing the command-line compilation suggestion in the commment to this instead:
Prefer
Rather than using a
It doesn't make a lot of difference here, but generally the advantage is that the value has an associated type. Where it does make a difference is in the next suggestion:
Make relationships between constants explicit
The numbers 12 and 4096 in the program are associated, but that association is implicit rather than explicit. It would be better to make the relationship between the two constants explicit:
Now all places that currently use the "magic number" 12 should use the constant
Avoid using the same unnamed constant multiple times
The code currently initializes
This is advantageous because there is no way to make a mistake in typing the constant two different ways. A similar change should be made to
Use a
In the
Use this:
Bail out early on error
The code currently checks to see if the file is open and then has an
Also note that I've used
Use all appropriate
The program us
Shouldn't a compressed file be smaller?
Imagine my surprise when I discovered that a 4057-byte file (the source code itself) became 18720 bytes when compressed! Then I looked at the "compressed" file and saw that it was composed of ASCII
'1' and '0' characters rather than binary. If that had been converted to binary output, that would be 2340 bytes which actually is smaller than the input. I'd suggest emitting binary rather than ASCII representation of binary which is 8x the size.Pass by const reference where practical
The first argument to
compress is a std::string but that causes the entire input buffer to be duplicated. Better would be to make it const std::string & because it is not modified and it doesn't need to be duplicated. The for loop then becomes for(const char c: input) {
// ...
}Note that this is a
const char not const char & because on most machines it's quicker and smaller to simply use a char rather than a reference to one.The
filename argument should also be const string &.Remember to free allocated memory
The program should
delete[] memblock at the end of main. Better yet, see the following suggestionEliminate buggy
convert_char_to_string()The code for converting from a char pointer to a string as currently in the code is this:
string convert_char_to_string(const char *pCh, int arraySize){
string str;
if (pCh[arraySize-1] == '\0')
str.append(pCh);
else
for(int i=0; i<arraySize; i++)
str.append(1,pCh[i]);
return str;
}However this has a few problems. First is the bug that occurs if the passed buffer has two
'\0' characters in it, one of which is at the end. The bug is that this code will only copy up to the first '\0' character, silently truncating the input buffer. The second problem is the loop. As written, it will probably reallocate and copy the contents of str many times as it grows. That pointlessly slows the program because you already know how big the string should be. Better still would be to eliminate it entirely, and eliminate memblock by copying the file directly into the string:string input = string(std::istreambuf_iterator(file),
std::istreambuf_iterator());Note that there are faster ways to do this, but none as succinct.
Use C++11 compile switches
The program uses C++ 11 constructs, so I'd recommend changing the command-line compilation suggestion in the commment to this instead:
g++ -std=c++11 lzw.c -o lzwPrefer
constexpr to old-style #defineRather than using a
#define for MAX_DEF the code could use a constexpr:constexpr std::size_t MAX_DEF = 4096;It doesn't make a lot of difference here, but generally the advantage is that the value has an associated type. Where it does make a difference is in the next suggestion:
Make relationships between constants explicit
The numbers 12 and 4096 in the program are associated, but that association is implicit rather than explicit. It would be better to make the relationship between the two constants explicit:
constexpr std::size_t BITS=12;
constexpr std::size_t MAX_DEF=1<<BITS;Now all places that currently use the "magic number" 12 should use the constant
BITS instead. Better still, would be to have a Compressor class for which these would be member variables, and compress and decompress would be member functions.Avoid using the same unnamed constant multiple times
The code currently initializes
compress_dictionary and then sets next_code to 256. This would be a little more straightforward and avoids having the same constant multiple places in the same routine:unsigned next_code;
for ( next_code = 0 ; next_code < 256 ; next_code++ ){
compress_dictionary[string(1,next_code)] = next_code;
}This is advantageous because there is no way to make a mistake in typing the constant two different ways. A similar change should be made to
decompress which also eliminates another ugly peculiarity, which is to have declared i as two different types within the same function.Use a
for loop instead of while where it makes senseIn the
decompress routine, the while loop would be better expressed as a for loop. Instead of this:int i = 0;
while (i<size){
//
i+=12;
//
}Use this:
for (int i=0; i < size; i += BITS) {
//
}Bail out early on error
The code currently checks to see if the file is open and then has an
else clause way at the bottom of main to handle an error. Better would be to bail out early on error, and not enclose the entire program in an if clause:if (!file)
{
cout << "Unable to open file." << endl;
show_usage();
}Also note that I've used
!file for the condition. This is the usual C++ idiom for file error checking.Use all appropriate
#includesThe program us
Code Snippets
for(const char c: input) {
// ...
}string convert_char_to_string(const char *pCh, int arraySize){
string str;
if (pCh[arraySize-1] == '\0')
str.append(pCh);
else
for(int i=0; i<arraySize; i++)
str.append(1,pCh[i]);
return str;
}string input = string(std::istreambuf_iterator<char>(file),
std::istreambuf_iterator<char>());g++ -std=c++11 lzw.c -o lzwconstexpr std::size_t MAX_DEF = 4096;Context
StackExchange Code Review Q#86543, answer score: 12
Revisions (0)
No revisions yet.