HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

Fast Document Distance, C++

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
fastdistancedocument

Problem

This is an attempt at a fast "document distance" implementation in C++. For reference, I found this problem in MIT's OpenCourseware 6.006 algorithms course.

A quick look at the callgrind output shows that my performance killers are, in order:

  • the transfrom call in formatToken



  • Constructing inputString in wordOccurences



  • Indexing into the unordered_map (calling operator[])



  • reading from the stringstream in order to split the string



Here's the code:

```
#include
#include
#include
#include
#include
#include
#include
#include

using namespace std;

string const g_source = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz1234567890";
string const g_destination = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";

bool containsKey(unordered_map const &wordMap, string const &key)
{
return wordMap.find(key) != wordMap.end();
}

template
unordered_map translationTable(
ForwardIterator first,
ForwardIterator last,
ForwardIterator dFirst)
{
unordered_map table;

for(; first != last; ++first, ++dFirst) {
table[dFirst] = first;
}

return table;
}

void formatToken(string &rawString)
{
static unordered_map charTranslationTable = translationTable(
g_source.begin(),
g_source.end(),
g_destination.begin());

transform(
rawString.begin(),
rawString.end(),
rawString.begin(),
[](char c) { char ret = charTranslationTable[c]; return ret ? ret : ' '; });
}

unordered_map wordOccurences(ifstream &is)
{
unordered_map result;
string inputString{istreambuf_iterator(is), istreambuf_iterator()};
formatToken(inputString);

istringstream stream(inputString);
string token;

while (stream) {
stream >> token;
result[token] += 1;
}

return result;
}

double documentMagnitude(unordered_map const &document)
{
double magnitude = 0.0;

for (auto const &word : document) {
magnit

Solution

Speed/optimization

Unfortunately, istream_buf_iterators aren't a particularly fast way to read the entirety of a file into a string. One simple way to improve speed is to use the istream's read instead. To do that, you can replace this:

string inputString{istreambuf_iterator(is), istreambuf_iterator()};


...with something like this:

string inputString;

is.seekg(0, std::ios::end);
size_t length = is.tellg();
inputString.resize(length);
is.seekg(0, std::ios::beg);
is.read(&inputString[0], length);


At least in my testing, this reduces the time from around 800 ms to around 600 ms for a pair of test files I threw together (about 6.5 megabytes, mostly from project Gutenberg, with one chapter of one book edited out of one of the files).

If you're running this under Linux, you'll probably also benefit from adding a line like this toward the beginning of main:

std::ios::sync_with_stdio(false);


On Windows this has essentially no effect, but on Linux it can make quite a bit of difference.

Correctness

Command line arguments

You declare the command line arguments as argc/argv like usual, but attempt to refer to two arguments as arg[1] and arg[2] (should be argv instead of arg).

You should also check that argc > 2 before attempting to use argv[2].

EOF detection

This code:

while (stream) {
    stream >> token;
    result[token] += 1;
}


...is broken. It doesn't detect the end of the file correctly. For accurate EOF detection, consider using something like this instead:

while (stream >> token)
    ++result[token];


Translation

I'm pretty sure you transposed these two strings:

string const g_source      = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz1234567890";
string const g_destination = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";


I'm pretty sure the intent was the either upper-case or lower-case input would be mapped to lower-case output. As it stands, you seem to be trying to map lower-case input to both lower-case and upper-case output.

Style

Simplicity

Although it probably won't make a huge difference in speed, I'd rather use a simple std::array for the translation table, and just fill in the result character for every possible input:

static const size = std::numeric_limits::max();

std::array translation_table;

for (int i=0; i<size; i++)
   if (std::isalpha(i))
       translation_table[i] = std::tolower(i);
   else if (std::isdigit(i))
       translation_table[i] = i;
   else
       translation_table[i] = ' ';


Then translating a character becomes utterly trivial:

transform(
    rawString.begin(),
    rawString.end(),
    rawString.begin(),
    [](unsigned char c) { return translation_table[c]; });


Also note the change to unsigned char, so if we have an input that's negative when viewed as a char, we don't try to index outside the table.

Encapsulation

Taking the next step from the last point, I'd rather hide the implementation of the translation a bit more by wrapping it up into a class. For example, something like this might be usable:

class formatter {
    class xlat_table {
        static const int table_size = std::numeric_limits::max();
        std::array table;
    public:
        xlat_table() {
            for (int i = 0; i < table_size; i++)
                if (std::isalpha(i))
                    table[i] = std::tolower(i);
                else if (std::isdigit(i))
                    table[i] = i;
                else table[i] = ' ';
        }
        char operator[](size_t n) { return table[n]; }
    };

    static xlat_table table;

public:

    void operator()(std::string &rawString) {  
        transform(
            rawString.begin(),
            rawString.end(),
            rawString.begin(),
            [this](unsigned char c) { return table[c]; });
    }
};

formatter::xlat_table formatter::table;


This creates the xlat_table as a static member of formatter, so we can create formatter objects freely, without initializing a new translation table every time.

Conclusion

This should improve speed at least a little bit, but this will probably be I/O bound in most cases, so just translating the code to C++ isn't necessarily going to give any huge speed improvement. On the other hand, if you have something that can do really fast I/O (e.g., a reasonably high-end SSD) you might be more likely to see substantial gains.

My final code putting the pieces together looks like this:

```
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

using namespace std;

bool containsKey(unordered_map const &wordMap, string const &key)
{
return wordMap.find(key) != wordMap.end();
}

class formatter {
class xlat_table {
static const int table_size = std::numeric_limits::max();
std::array table;
public:
xlat_table() {
for (int i = 0; i wordOccurences(ist

Code Snippets

string inputString{istreambuf_iterator<char>(is), istreambuf_iterator<char>()};
string inputString;

is.seekg(0, std::ios::end);
size_t length = is.tellg();
inputString.resize(length);
is.seekg(0, std::ios::beg);
is.read(&inputString[0], length);
std::ios::sync_with_stdio(false);
while (stream) {
    stream >> token;
    result[token] += 1;
}
while (stream >> token)
    ++result[token];

Context

StackExchange Code Review Q#127864, answer score: 2

Revisions (0)

No revisions yet.