HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppModerate

Count every word occurrence from file

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
fileeverywordoccurrencecountfrom

Problem

How good is this algorithm? Personally I think it's over-complicated and needs some improvement. Should I be using iterators here (I think it's "harder" this way) or indexing? Should I stick with a vector or try other containers? Is a vector (array) of pointers generally better than one of objects?

```
#include
#include
#include
#include
struct StringOccurrence //stores word and number of occurrences
{
std::string m_str;
unsigned int m_count;
StringOccurrence(const char* str, unsigned int count) : m_str(str), m_count(count) {};
};

int main()
{
std::string path;
std::cout > path;
std::ifstream in(path);
if (!in) //check if file path is valid
{
std::cerr vec;
std::string lineBuff;
while (std::getline(in, lineBuff)) // write multiline text to vector of strings
{
vec.push_back(lineBuff);
}

std::vector strOc;
std::string stringBuff;

for (auto it = vec.begin(); it begin(); it2 end();it2++) //itterate through each letter
{
if (*it2 != ' ') //keep adding letters to buffer until space (word)
{
stringBuff += *it2;
}
if (*it2 == ' ' || it2 + 1 == it->end()) //(need fix?)
{
if (*(it2 - 1) == ' ') //check for reccurring spaces so they are not counted as words
{
continue;
}
for (auto it3 = strOc.begin(); it3 m_str) //if word was found increase count
{
(*it3)->m_count++;
goto end; //skip next step (need fix?)
}
}
strOc.push_back(new StringOccurrence(stringBuff.c_str(), 1)); //if word was not found add it
end:
stringBuff.clear(); //empty buffer for next word
}
}
}

std::ofstream out("test2.out"); //write to file
for (auto it = strOc.begin();it m_str m_count

Solution

Excessive Memory Use

The first thing you do is read every line into memory:

while (std::getline(in, lineBuff)) // write multiline text to vector of strings
{
    vec.push_back(lineBuff);
}


Why? Do you need to have every line in memory? No. You're just trying to count the words. What if the file was 100TB worth of "all work and no play makes Jack a dull boy"?

Poor Memory Use

std::vector strOc;


That should yell antipattern to you. You're not holding pointers to existing objects, this is strOc actually owning that memory. You use new to add to it. I don't see any deletes. So you're leaking memory. Doesn't matter so much in this case, but this is exactly what RAII is for. Prefer:

std::vector> strOc;


Though, regardless, don't use a vector to begin with because...

Data Container Choice

You are holding a vector of StringOccurrences. That means that on each new word, you have to walk through the whole vector doing string comparisons. That is O(N). Everything about this is bad. To start with:

goto end; //skip next step (need fix?)


Yes, this needs a fix. Never use goto. The way to search for things is to use the ` library, specifically std::find_if:

auto it = std::find_if(strOc.begin(), strOc.end(), [&](StringOccurence* so){
    return so->m_str == stringBuff;
});
if (it != strOc.end()) {
    // success case
}
else {
    // new word case
}


That'll make your code easier to understand by far and drop the
goto. But it won't handle the O(N) problem. For that, we just need to use an entirely new data structure. We need to map a word to a number, and we don't care about word ordering. Thus, std::unordered_map. This will have O(1) lookup. And we don't even need to do all that work ourselves!

std::unordered_map strOc;
// for each word
++strOc[stringBuff]; // this will insert new elements as necessary


Cool.

Read the words directly

You are walking through a line character by character. That is error-prone at best, hard to follow at worst. But there's already support for this in C++. Use a
std::stringstream. You could have just done:

while (std::getline(in, lineBuff)) {
    std::string word;
    std::istringstream iss(lineBuff);
    while (iss >> word) {
        ...
    }
}


Then again,
operator>> is defined on any istream, not just istringstream. So we can use that on the ifstream directly:

std::string word;
while (in >> word) {
    ...
}


Use the right loops

Don't use
auto` with iterators - just use a range-based for-expression. It'll save on the typing and add to clarity.

Improved solution

std::unordered_map wordCounts;
std::string word;
while (in >> word) {
    ++wordCounts[word];
}

std::ofstream out("test2.out");
for (auto const& wc : wordCounts) {
    out << wc.first << ' ' << wc.second << '\n';
}

Code Snippets

while (std::getline(in, lineBuff)) // write multiline text to vector of strings
{
    vec.push_back(lineBuff);
}
std::vector<StringOccurrence*> strOc;
std::vector<std::unique_ptr<StringOccurrence>> strOc;
goto end; //skip next step (need fix?)
auto it = std::find_if(strOc.begin(), strOc.end(), [&](StringOccurence* so){
    return so->m_str == stringBuff;
});
if (it != strOc.end()) {
    // success case
}
else {
    // new word case
}

Context

StackExchange Code Review Q#104639, answer score: 11

Revisions (0)

No revisions yet.