patterncppMinor
Histogram generator
Viewed 0 times
histogramgeneratorstackoverflow
Problem
This is some code to generate a histogram (or maybe: histochart) in the form of a simple list of words and how many of that word were found in the input file. It accepts an input file name on the command line. For our purpose, a "word" consists of English letters and/or the apostrophe (e.g., so
All other characters (digits, punctuation other than apostrophe, etc.) are to be ignored.
The result is to be sorted in inverse order by frequency (i.e., the words that occurred the most often listed first).
It also uses (requires) a set of stop words (stored in a file named "stopwords.txt") that are to be ignored. This is intended to hold words like "a", "an", "the", "am", "are", "is", "was", "were", etc., that are similar to punctuation--necessary for grammatical sentences, but irrelevant to meaning, so their presence in the histogram would just add noise.
```
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
typedef const std::pair ytype;
namespace std {
std::ostream &operator {
public:
mask const *get_table() {
static std::vector::mask> table(table_size, (mask)space);
std::fill(table.begin() + 'a', table.begin()+'z', (mask)alpha);
std::fill(table.begin() + 'A', table.begin()+'Z', (mask)alpha);
table['\''] = (mask)alpha; // an apostrophe will also be considered part of a word
return &table[0];
}
alpha_only(size_t refs = 0) : std::ctype(get_table(), false, refs) { }
};
int main(int argc, char **argv) {
if (argc \n";
return 1;
}
std::ifstream input(argv[1]);
input.imbue(std::locale(std::locale(), new alpha_only));
std::map counts;
std::ifstream stop("stopwords.txt");
std::set stopwords{
std::istream_iterator(sto
we're is treated as one word, not two). Case insensitive comparison must be used, so (for example) Build and build are treated as equal (and the result should show the words entirely in lower case).All other characters (digits, punctuation other than apostrophe, etc.) are to be ignored.
The result is to be sorted in inverse order by frequency (i.e., the words that occurred the most often listed first).
It also uses (requires) a set of stop words (stored in a file named "stopwords.txt") that are to be ignored. This is intended to hold words like "a", "an", "the", "am", "are", "is", "was", "were", etc., that are similar to punctuation--necessary for grammatical sentences, but irrelevant to meaning, so their presence in the histogram would just add noise.
```
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
typedef const std::pair ytype;
namespace std {
std::ostream &operator {
public:
mask const *get_table() {
static std::vector::mask> table(table_size, (mask)space);
std::fill(table.begin() + 'a', table.begin()+'z', (mask)alpha);
std::fill(table.begin() + 'A', table.begin()+'Z', (mask)alpha);
table['\''] = (mask)alpha; // an apostrophe will also be considered part of a word
return &table[0];
}
alpha_only(size_t refs = 0) : std::ctype(get_table(), false, refs) { }
};
int main(int argc, char **argv) {
if (argc \n";
return 1;
}
std::ifstream input(argv[1]);
input.imbue(std::locale(std::locale(), new alpha_only));
std::map counts;
std::ifstream stop("stopwords.txt");
std::set stopwords{
std::istream_iterator(sto
Solution
Optimizing the algorithm
All tests were done on a 4.7mb file containing a copy of the history of the US and a copy of a medical manual; around 1400 pages worth of content).
Timings are an average of executing one single run through the file 10 times. The code was compiled under VC++2015 x64.
Here's a link to the test program; simply use your own files where appropriate.
http://coliru.stacked-crooked.com/a/28907bc68796a0bf
Better containers for lookup
You're using a
With:
This reduced runtime from 524ms to 352ms.
Similarly, you're using a
With:
I've tested this with the following stopwords.txt file:
a an the am are is was were and then for I
This reduced runtime from 352ms to 325ms.
Optimize
We can cut out execution time by directly modifying the string reference instead of copying it in. Thus, this function:
Can be implemented like so:
Demo
Furthermore, since your character table only uses ASCII characters, this provides a bigger speed boost than calling
Demo
This reduced runtime from 325ms to 274ms.
Conclusion
By applying a small couple of changes, the speed of the algorithm was increased by ~50%.
All tests were done on a 4.7mb file containing a copy of the history of the US and a copy of a medical manual; around 1400 pages worth of content).
Timings are an average of executing one single run through the file 10 times. The code was compiled under VC++2015 x64.
Here's a link to the test program; simply use your own files where appropriate.
http://coliru.stacked-crooked.com/a/28907bc68796a0bf
Better containers for lookup
You're using a
std::map<> in order to count word occurrence. You should use a std::unordered_map<> as it provies amortized O(1) lookup. This is highly beneficial for large documents. We replace:std::map counts;With:
std::unordered_map counts;This reduced runtime from 524ms to 352ms.
Similarly, you're using a
std::set<> in order to find stopwords in O(logn) time. You should use a std::unordered_set<> as it provides amortized O(1) lookup. We replace:std::set stopwords{ ... };With:
std::unordered_set stopwords{ ... };I've tested this with the following stopwords.txt file:
a an the am are is was were and then for I
This reduced runtime from 352ms to 325ms.
Optimize
tolower()We can cut out execution time by directly modifying the string reference instead of copying it in. Thus, this function:
std::string tolower(std::string in) {
for (auto &c : in) {
c = std::tolower(unsigned char(c));
}
return in;
}Can be implemented like so:
void tolower(std::string& in) {
for (auto &c : in) {
c = std::tolower(unsigned char(c));
}
}Demo
Furthermore, since your character table only uses ASCII characters, this provides a bigger speed boost than calling
std::tolower() for every individual character:void tolower_ref( std::string& in )
{
static constexpr int diff{ 'a' - 'A' };
for ( auto &c : in )
if ( c < 'a' && c != '\'' )
c += diff;
}Demo
This reduced runtime from 325ms to 274ms.
Conclusion
By applying a small couple of changes, the speed of the algorithm was increased by ~50%.
Code Snippets
std::string tolower(std::string in) {
for (auto &c : in) {
c = std::tolower(unsigned char(c));
}
return in;
}void tolower(std::string& in) {
for (auto &c : in) {
c = std::tolower(unsigned char(c));
}
}void tolower_ref( std::string& in )
{
static constexpr int diff{ 'a' - 'A' };
for ( auto &c : in )
if ( c < 'a' && c != '\'' )
c += diff;
}Context
StackExchange Code Review Q#129329, answer score: 4
Revisions (0)
No revisions yet.