patterncppMinor
Fast Document Distance, C++
Viewed 0 times
fastdistancedocument
Problem
This is an attempt at a fast "document distance" implementation in C++. For reference, I found this problem in MIT's OpenCourseware 6.006 algorithms course.
A quick look at the
Here's the code:
```
#include
#include
#include
#include
#include
#include
#include
#include
using namespace std;
string const g_source = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz1234567890";
string const g_destination = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
bool containsKey(unordered_map const &wordMap, string const &key)
{
return wordMap.find(key) != wordMap.end();
}
template
unordered_map translationTable(
ForwardIterator first,
ForwardIterator last,
ForwardIterator dFirst)
{
unordered_map table;
for(; first != last; ++first, ++dFirst) {
table[dFirst] = first;
}
return table;
}
void formatToken(string &rawString)
{
static unordered_map charTranslationTable = translationTable(
g_source.begin(),
g_source.end(),
g_destination.begin());
transform(
rawString.begin(),
rawString.end(),
rawString.begin(),
[](char c) { char ret = charTranslationTable[c]; return ret ? ret : ' '; });
}
unordered_map wordOccurences(ifstream &is)
{
unordered_map result;
string inputString{istreambuf_iterator(is), istreambuf_iterator()};
formatToken(inputString);
istringstream stream(inputString);
string token;
while (stream) {
stream >> token;
result[token] += 1;
}
return result;
}
double documentMagnitude(unordered_map const &document)
{
double magnitude = 0.0;
for (auto const &word : document) {
magnit
A quick look at the
callgrind output shows that my performance killers are, in order:- the
transfromcall informatToken
- Constructing
inputStringinwordOccurences
- Indexing into the
unordered_map(callingoperator[])
- reading from the
stringstreamin order to split the string
Here's the code:
```
#include
#include
#include
#include
#include
#include
#include
#include
using namespace std;
string const g_source = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz1234567890";
string const g_destination = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
bool containsKey(unordered_map const &wordMap, string const &key)
{
return wordMap.find(key) != wordMap.end();
}
template
unordered_map translationTable(
ForwardIterator first,
ForwardIterator last,
ForwardIterator dFirst)
{
unordered_map table;
for(; first != last; ++first, ++dFirst) {
table[dFirst] = first;
}
return table;
}
void formatToken(string &rawString)
{
static unordered_map charTranslationTable = translationTable(
g_source.begin(),
g_source.end(),
g_destination.begin());
transform(
rawString.begin(),
rawString.end(),
rawString.begin(),
[](char c) { char ret = charTranslationTable[c]; return ret ? ret : ' '; });
}
unordered_map wordOccurences(ifstream &is)
{
unordered_map result;
string inputString{istreambuf_iterator(is), istreambuf_iterator()};
formatToken(inputString);
istringstream stream(inputString);
string token;
while (stream) {
stream >> token;
result[token] += 1;
}
return result;
}
double documentMagnitude(unordered_map const &document)
{
double magnitude = 0.0;
for (auto const &word : document) {
magnit
Solution
Speed/optimization
Unfortunately,
...with something like this:
At least in my testing, this reduces the time from around 800 ms to around 600 ms for a pair of test files I threw together (about 6.5 megabytes, mostly from project Gutenberg, with one chapter of one book edited out of one of the files).
If you're running this under Linux, you'll probably also benefit from adding a line like this toward the beginning of
On Windows this has essentially no effect, but on Linux it can make quite a bit of difference.
Correctness
Command line arguments
You declare the command line arguments as
You should also check that
EOF detection
This code:
...is broken. It doesn't detect the end of the file correctly. For accurate EOF detection, consider using something like this instead:
Translation
I'm pretty sure you transposed these two strings:
I'm pretty sure the intent was the either upper-case or lower-case input would be mapped to lower-case output. As it stands, you seem to be trying to map lower-case input to both lower-case and upper-case output.
Style
Simplicity
Although it probably won't make a huge difference in speed, I'd rather use a simple
Then translating a character becomes utterly trivial:
Also note the change to
Encapsulation
Taking the next step from the last point, I'd rather hide the implementation of the translation a bit more by wrapping it up into a class. For example, something like this might be usable:
This creates the
Conclusion
This should improve speed at least a little bit, but this will probably be I/O bound in most cases, so just translating the code to C++ isn't necessarily going to give any huge speed improvement. On the other hand, if you have something that can do really fast I/O (e.g., a reasonably high-end SSD) you might be more likely to see substantial gains.
My final code putting the pieces together looks like this:
```
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
using namespace std;
bool containsKey(unordered_map const &wordMap, string const &key)
{
return wordMap.find(key) != wordMap.end();
}
class formatter {
class xlat_table {
static const int table_size = std::numeric_limits::max();
std::array table;
public:
xlat_table() {
for (int i = 0; i wordOccurences(ist
Unfortunately,
istream_buf_iterators aren't a particularly fast way to read the entirety of a file into a string. One simple way to improve speed is to use the istream's read instead. To do that, you can replace this:string inputString{istreambuf_iterator(is), istreambuf_iterator()};...with something like this:
string inputString;
is.seekg(0, std::ios::end);
size_t length = is.tellg();
inputString.resize(length);
is.seekg(0, std::ios::beg);
is.read(&inputString[0], length);At least in my testing, this reduces the time from around 800 ms to around 600 ms for a pair of test files I threw together (about 6.5 megabytes, mostly from project Gutenberg, with one chapter of one book edited out of one of the files).
If you're running this under Linux, you'll probably also benefit from adding a line like this toward the beginning of
main:std::ios::sync_with_stdio(false);On Windows this has essentially no effect, but on Linux it can make quite a bit of difference.
Correctness
Command line arguments
You declare the command line arguments as
argc/argv like usual, but attempt to refer to two arguments as arg[1] and arg[2] (should be argv instead of arg).You should also check that
argc > 2 before attempting to use argv[2].EOF detection
This code:
while (stream) {
stream >> token;
result[token] += 1;
}...is broken. It doesn't detect the end of the file correctly. For accurate EOF detection, consider using something like this instead:
while (stream >> token)
++result[token];Translation
I'm pretty sure you transposed these two strings:
string const g_source = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz1234567890";
string const g_destination = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";I'm pretty sure the intent was the either upper-case or lower-case input would be mapped to lower-case output. As it stands, you seem to be trying to map lower-case input to both lower-case and upper-case output.
Style
Simplicity
Although it probably won't make a huge difference in speed, I'd rather use a simple
std::array for the translation table, and just fill in the result character for every possible input:static const size = std::numeric_limits::max();
std::array translation_table;
for (int i=0; i<size; i++)
if (std::isalpha(i))
translation_table[i] = std::tolower(i);
else if (std::isdigit(i))
translation_table[i] = i;
else
translation_table[i] = ' ';Then translating a character becomes utterly trivial:
transform(
rawString.begin(),
rawString.end(),
rawString.begin(),
[](unsigned char c) { return translation_table[c]; });Also note the change to
unsigned char, so if we have an input that's negative when viewed as a char, we don't try to index outside the table.Encapsulation
Taking the next step from the last point, I'd rather hide the implementation of the translation a bit more by wrapping it up into a class. For example, something like this might be usable:
class formatter {
class xlat_table {
static const int table_size = std::numeric_limits::max();
std::array table;
public:
xlat_table() {
for (int i = 0; i < table_size; i++)
if (std::isalpha(i))
table[i] = std::tolower(i);
else if (std::isdigit(i))
table[i] = i;
else table[i] = ' ';
}
char operator[](size_t n) { return table[n]; }
};
static xlat_table table;
public:
void operator()(std::string &rawString) {
transform(
rawString.begin(),
rawString.end(),
rawString.begin(),
[this](unsigned char c) { return table[c]; });
}
};
formatter::xlat_table formatter::table;This creates the
xlat_table as a static member of formatter, so we can create formatter objects freely, without initializing a new translation table every time.Conclusion
This should improve speed at least a little bit, but this will probably be I/O bound in most cases, so just translating the code to C++ isn't necessarily going to give any huge speed improvement. On the other hand, if you have something that can do really fast I/O (e.g., a reasonably high-end SSD) you might be more likely to see substantial gains.
My final code putting the pieces together looks like this:
```
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
using namespace std;
bool containsKey(unordered_map const &wordMap, string const &key)
{
return wordMap.find(key) != wordMap.end();
}
class formatter {
class xlat_table {
static const int table_size = std::numeric_limits::max();
std::array table;
public:
xlat_table() {
for (int i = 0; i wordOccurences(ist
Code Snippets
string inputString{istreambuf_iterator<char>(is), istreambuf_iterator<char>()};string inputString;
is.seekg(0, std::ios::end);
size_t length = is.tellg();
inputString.resize(length);
is.seekg(0, std::ios::beg);
is.read(&inputString[0], length);std::ios::sync_with_stdio(false);while (stream) {
stream >> token;
result[token] += 1;
}while (stream >> token)
++result[token];Context
StackExchange Code Review Q#127864, answer score: 2
Revisions (0)
No revisions yet.