HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

Counting words in files

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
wordsfilescounting

Problem

I made a simple program for my study that calculates words in a text files and prints every word and its repeated times in the files.

How can I improve this code?

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

typedef std::unordered_map Words;

Words wordsInFile(const std::string& fileName)
{
    std::ifstream file(fileName);
    if (!file.is_open())
    {
        throw "Can't open the file" + fileName;
    }

    Words loadFromFile;

    for (std::string word; file >> word;)
    {
        ++loadFromFile[word];
    }

    return loadFromFile;
}

int main(int argc, char *argv[])
{
    std::vector> futures;

    for (int i = 1; i < argc; ++i) 
    {
        futures.push_back(std::async([=] { return wordsInFile(argv[i]); }));
    }

    Words words;

    for (auto& i : futures) 
    {
        const auto results = i.get();

        for (const auto& j : results) 
        {
            words[j.first] = j.second;
        }
    }

    std::cout << "Word\tRepeated Times\n-------------------------\n";
    for (const auto& i : words)
        std::cout << i.first << "\t\t" << i.second << '\n';
}


test.txt

File system is huge subject need more work out. file is plain text file
This is some junk words. simple program for words counting in text file. 1 2 3
2 2 2 2
@ % & ^ *
all so good
this test text file


Output:

`Word Repeated Times
-------------------------
File 1
out. 1
1 1
system 1
2 5
is 3
need 1
huge 1
subject 1
work 1
more 1
% 1
file 3
plain 1
text 3
This 1
some 1
junk 1
program 1
words. 1
simple 1
for 1
words 1
counting 1
in 1
file. 1
so 1
3 1
@ 1
& 1
^ 1
this

Solution

I see some things that may help you improve your code.

Simplify your code

You don't need a lambda at all. Use this instead:

futures.push_back(std::async(wordsInFile,argv[i]));


Also, instead of if (!file.is_open()) use the more idiomatic if (!file).

Fix the counting

If we have two files which both contain the word fox exactly once then I would expect the result of this program to print fox 2 but in fact, it would falsely claim a count of 1. The problem is this line:

words[j.first] = j.second;


This overwrites the count instead of accumulating it. This is what you need instead:

words[j.first] += j.second;


Use reserve to prevent reallocations

This code already knows how many futures are going to be created. The code can be made a little more efficient by using this:

futures.reserve(argc-1);


With large numbers of files, this eliminates the re-allocation that would be required by dynamically resizing the vector.

Check for exceptions

If any of the file names can't be read. That is, if it can be opened but not read, as with a directory under Linux, this code will throw an error that isn't caught and so therefore results in a crash. Since future::get() will throw any error that was stored, it's that part of the code that should check for exceptions. One alternative would be this:

Words results;
    try {
        results = i.get();
    }
    catch(std::exception &err)
    {
        std::cerr << "ERROR: " << err.what() << '\n';
    }
    catch(std::string &err)
    {
        std::cerr << "ERROR: " << err << '\n';
    }
    for (const auto& j : results) 
    {
        words[j.first] += j.second;
    }


Note that this will simply skip the offending file name but continue to process the rest.

Measure your results

Create a non-parallel version of this code and compare the timing to see if it's faster or slower than this version. I have found that it's often true that the overhead of the implicit (as in async) or explicit (as in thread) thread creation swamps the savings in time by making things parallel. The only way to know for sure is to time it.

Code Snippets

futures.push_back(std::async(wordsInFile,argv[i]));
words[j.first] = j.second;
words[j.first] += j.second;
futures.reserve(argc-1);
Words results;
    try {
        results = i.get();
    }
    catch(std::exception &err)
    {
        std::cerr << "ERROR: " << err.what() << '\n';
    }
    catch(std::string &err)
    {
        std::cerr << "ERROR: " << err << '\n';
    }
    for (const auto& j : results) 
    {
        words[j.first] += j.second;
    }

Context

StackExchange Code Review Q#73310, answer score: 3

Revisions (0)

No revisions yet.