patterncppModerate
Count every word occurrence from file
Viewed 0 times
fileeverywordoccurrencecountfrom
Problem
How good is this algorithm? Personally I think it's over-complicated and needs some improvement. Should I be using iterators here (I think it's "harder" this way) or indexing? Should I stick with a vector or try other containers? Is a vector (array) of pointers generally better than one of objects?
```
#include
#include
#include
#include
struct StringOccurrence //stores word and number of occurrences
{
std::string m_str;
unsigned int m_count;
StringOccurrence(const char* str, unsigned int count) : m_str(str), m_count(count) {};
};
int main()
{
std::string path;
std::cout > path;
std::ifstream in(path);
if (!in) //check if file path is valid
{
std::cerr vec;
std::string lineBuff;
while (std::getline(in, lineBuff)) // write multiline text to vector of strings
{
vec.push_back(lineBuff);
}
std::vector strOc;
std::string stringBuff;
for (auto it = vec.begin(); it begin(); it2 end();it2++) //itterate through each letter
{
if (*it2 != ' ') //keep adding letters to buffer until space (word)
{
stringBuff += *it2;
}
if (*it2 == ' ' || it2 + 1 == it->end()) //(need fix?)
{
if (*(it2 - 1) == ' ') //check for reccurring spaces so they are not counted as words
{
continue;
}
for (auto it3 = strOc.begin(); it3 m_str) //if word was found increase count
{
(*it3)->m_count++;
goto end; //skip next step (need fix?)
}
}
strOc.push_back(new StringOccurrence(stringBuff.c_str(), 1)); //if word was not found add it
end:
stringBuff.clear(); //empty buffer for next word
}
}
}
std::ofstream out("test2.out"); //write to file
for (auto it = strOc.begin();it m_str m_count
```
#include
#include
#include
#include
struct StringOccurrence //stores word and number of occurrences
{
std::string m_str;
unsigned int m_count;
StringOccurrence(const char* str, unsigned int count) : m_str(str), m_count(count) {};
};
int main()
{
std::string path;
std::cout > path;
std::ifstream in(path);
if (!in) //check if file path is valid
{
std::cerr vec;
std::string lineBuff;
while (std::getline(in, lineBuff)) // write multiline text to vector of strings
{
vec.push_back(lineBuff);
}
std::vector strOc;
std::string stringBuff;
for (auto it = vec.begin(); it begin(); it2 end();it2++) //itterate through each letter
{
if (*it2 != ' ') //keep adding letters to buffer until space (word)
{
stringBuff += *it2;
}
if (*it2 == ' ' || it2 + 1 == it->end()) //(need fix?)
{
if (*(it2 - 1) == ' ') //check for reccurring spaces so they are not counted as words
{
continue;
}
for (auto it3 = strOc.begin(); it3 m_str) //if word was found increase count
{
(*it3)->m_count++;
goto end; //skip next step (need fix?)
}
}
strOc.push_back(new StringOccurrence(stringBuff.c_str(), 1)); //if word was not found add it
end:
stringBuff.clear(); //empty buffer for next word
}
}
}
std::ofstream out("test2.out"); //write to file
for (auto it = strOc.begin();it m_str m_count
Solution
Excessive Memory Use
The first thing you do is read every line into memory:
Why? Do you need to have every line in memory? No. You're just trying to count the words. What if the file was 100TB worth of "all work and no play makes Jack a dull boy"?
Poor Memory Use
That should yell antipattern to you. You're not holding pointers to existing objects, this is
Though, regardless, don't use a
Data Container Choice
You are holding a vector of
Yes, this needs a fix. Never use
Improved solution
The first thing you do is read every line into memory:
while (std::getline(in, lineBuff)) // write multiline text to vector of strings
{
vec.push_back(lineBuff);
}Why? Do you need to have every line in memory? No. You're just trying to count the words. What if the file was 100TB worth of "all work and no play makes Jack a dull boy"?
Poor Memory Use
std::vector strOc;That should yell antipattern to you. You're not holding pointers to existing objects, this is
strOc actually owning that memory. You use new to add to it. I don't see any deletes. So you're leaking memory. Doesn't matter so much in this case, but this is exactly what RAII is for. Prefer:std::vector> strOc;Though, regardless, don't use a
vector to begin with because...Data Container Choice
You are holding a vector of
StringOccurrences. That means that on each new word, you have to walk through the whole vector doing string comparisons. That is O(N). Everything about this is bad. To start with:goto end; //skip next step (need fix?)Yes, this needs a fix. Never use
goto. The way to search for things is to use the ` library, specifically std::find_if:
auto it = std::find_if(strOc.begin(), strOc.end(), [&](StringOccurence* so){
return so->m_str == stringBuff;
});
if (it != strOc.end()) {
// success case
}
else {
// new word case
}
That'll make your code easier to understand by far and drop the goto. But it won't handle the O(N) problem. For that, we just need to use an entirely new data structure. We need to map a word to a number, and we don't care about word ordering. Thus, std::unordered_map. This will have O(1) lookup. And we don't even need to do all that work ourselves!
std::unordered_map strOc;
// for each word
++strOc[stringBuff]; // this will insert new elements as necessary
Cool.
Read the words directly
You are walking through a line character by character. That is error-prone at best, hard to follow at worst. But there's already support for this in C++. Use a std::stringstream. You could have just done:
while (std::getline(in, lineBuff)) {
std::string word;
std::istringstream iss(lineBuff);
while (iss >> word) {
...
}
}
Then again, operator>> is defined on any istream, not just istringstream. So we can use that on the ifstream directly:
std::string word;
while (in >> word) {
...
}
Use the right loops
Don't use auto` with iterators - just use a range-based for-expression. It'll save on the typing and add to clarity.Improved solution
std::unordered_map wordCounts;
std::string word;
while (in >> word) {
++wordCounts[word];
}
std::ofstream out("test2.out");
for (auto const& wc : wordCounts) {
out << wc.first << ' ' << wc.second << '\n';
}Code Snippets
while (std::getline(in, lineBuff)) // write multiline text to vector of strings
{
vec.push_back(lineBuff);
}std::vector<StringOccurrence*> strOc;std::vector<std::unique_ptr<StringOccurrence>> strOc;goto end; //skip next step (need fix?)auto it = std::find_if(strOc.begin(), strOc.end(), [&](StringOccurence* so){
return so->m_str == stringBuff;
});
if (it != strOc.end()) {
// success case
}
else {
// new word case
}Context
StackExchange Code Review Q#104639, answer score: 11
Revisions (0)
No revisions yet.