HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

Histogram generator

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
histogramgeneratorstackoverflow

Problem

This is some code to generate a histogram (or maybe: histochart) in the form of a simple list of words and how many of that word were found in the input file. It accepts an input file name on the command line. For our purpose, a "word" consists of English letters and/or the apostrophe (e.g., so we're is treated as one word, not two). Case insensitive comparison must be used, so (for example) Build and build are treated as equal (and the result should show the words entirely in lower case).

All other characters (digits, punctuation other than apostrophe, etc.) are to be ignored.

The result is to be sorted in inverse order by frequency (i.e., the words that occurred the most often listed first).

It also uses (requires) a set of stop words (stored in a file named "stopwords.txt") that are to be ignored. This is intended to hold words like "a", "an", "the", "am", "are", "is", "was", "were", etc., that are similar to punctuation--necessary for grammatical sentences, but irrelevant to meaning, so their presence in the histogram would just add noise.

```
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

typedef const std::pair ytype;

namespace std {
std::ostream &operator {
public:
mask const *get_table() {
static std::vector::mask> table(table_size, (mask)space);
std::fill(table.begin() + 'a', table.begin()+'z', (mask)alpha);
std::fill(table.begin() + 'A', table.begin()+'Z', (mask)alpha);
table['\''] = (mask)alpha; // an apostrophe will also be considered part of a word

return &table[0];
}
alpha_only(size_t refs = 0) : std::ctype(get_table(), false, refs) { }
};

int main(int argc, char **argv) {
if (argc \n";
return 1;
}

std::ifstream input(argv[1]);

input.imbue(std::locale(std::locale(), new alpha_only));

std::map counts;

std::ifstream stop("stopwords.txt");

std::set stopwords{
std::istream_iterator(sto

Solution

Optimizing the algorithm

All tests were done on a 4.7mb file containing a copy of the history of the US and a copy of a medical manual; around 1400 pages worth of content).

Timings are an average of executing one single run through the file 10 times. The code was compiled under VC++2015 x64.

Here's a link to the test program; simply use your own files where appropriate.

http://coliru.stacked-crooked.com/a/28907bc68796a0bf
Better containers for lookup

You're using a std::map<> in order to count word occurrence. You should use a std::unordered_map<> as it provies amortized O(1) lookup. This is highly beneficial for large documents. We replace:

std::map counts;

With:

std::unordered_map counts;

This reduced runtime from 524ms to 352ms.

Similarly, you're using a std::set<> in order to find stopwords in O(logn) time. You should use a std::unordered_set<> as it provides amortized O(1) lookup. We replace:

std::set stopwords{ ... };

With:

std::unordered_set stopwords{ ... };

I've tested this with the following stopwords.txt file:

a an the am are is was were and then for I

This reduced runtime from 352ms to 325ms.

Optimize tolower()

We can cut out execution time by directly modifying the string reference instead of copying it in. Thus, this function:

std::string tolower(std::string in) {
    for (auto &c : in) {
        c = std::tolower(unsigned char(c));
    }
    return in;
}


Can be implemented like so:

void tolower(std::string& in) {
    for (auto &c : in) {
        c = std::tolower(unsigned char(c));
    }
}


Demo

Furthermore, since your character table only uses ASCII characters, this provides a bigger speed boost than calling std::tolower() for every individual character:

void tolower_ref( std::string& in )
{
    static constexpr int diff{ 'a' - 'A' };
    for ( auto &c : in )
        if ( c < 'a' && c != '\'' )
            c += diff;
}


Demo

This reduced runtime from 325ms to 274ms.

Conclusion

By applying a small couple of changes, the speed of the algorithm was increased by ~50%.

Code Snippets

std::string tolower(std::string in) {
    for (auto &c : in) {
        c = std::tolower(unsigned char(c));
    }
    return in;
}
void tolower(std::string& in) {
    for (auto &c : in) {
        c = std::tolower(unsigned char(c));
    }
}
void tolower_ref( std::string& in )
{
    static constexpr int diff{ 'a' - 'A' };
    for ( auto &c : in )
        if ( c < 'a' && c != '\'' )
            c += diff;
}

Context

StackExchange Code Review Q#129329, answer score: 4

Revisions (0)

No revisions yet.