patterncppMinor
Counting words in files
Viewed 0 times
wordsfilescounting
Problem
I made a simple program for my study that calculates words in a text files and prints every word and its repeated times in the files.
How can I improve this code?
test.txt
Output:
`Word Repeated Times
-------------------------
File 1
out. 1
1 1
system 1
2 5
is 3
need 1
huge 1
subject 1
work 1
more 1
% 1
file 3
plain 1
text 3
This 1
some 1
junk 1
program 1
words. 1
simple 1
for 1
words 1
counting 1
in 1
file. 1
so 1
3 1
@ 1
& 1
^ 1
this
How can I improve this code?
#include
#include
#include
#include
#include
#include
#include
#include
typedef std::unordered_map Words;
Words wordsInFile(const std::string& fileName)
{
std::ifstream file(fileName);
if (!file.is_open())
{
throw "Can't open the file" + fileName;
}
Words loadFromFile;
for (std::string word; file >> word;)
{
++loadFromFile[word];
}
return loadFromFile;
}
int main(int argc, char *argv[])
{
std::vector> futures;
for (int i = 1; i < argc; ++i)
{
futures.push_back(std::async([=] { return wordsInFile(argv[i]); }));
}
Words words;
for (auto& i : futures)
{
const auto results = i.get();
for (const auto& j : results)
{
words[j.first] = j.second;
}
}
std::cout << "Word\tRepeated Times\n-------------------------\n";
for (const auto& i : words)
std::cout << i.first << "\t\t" << i.second << '\n';
}test.txt
File system is huge subject need more work out. file is plain text file
This is some junk words. simple program for words counting in text file. 1 2 3
2 2 2 2
@ % & ^ *
all so good
this test text file
Output:
`Word Repeated Times
-------------------------
File 1
out. 1
1 1
system 1
2 5
is 3
need 1
huge 1
subject 1
work 1
more 1
% 1
file 3
plain 1
text 3
This 1
some 1
junk 1
program 1
words. 1
simple 1
for 1
words 1
counting 1
in 1
file. 1
so 1
3 1
@ 1
& 1
^ 1
this
Solution
I see some things that may help you improve your code.
Simplify your code
You don't need a lambda at all. Use this instead:
Also, instead of
Fix the counting
If we have two files which both contain the word
This overwrites the count instead of accumulating it. This is what you need instead:
Use
This code already knows how many
With large numbers of files, this eliminates the re-allocation that would be required by dynamically resizing the
Check for exceptions
If any of the file names can't be read. That is, if it can be opened but not read, as with a directory under Linux, this code will throw an error that isn't caught and so therefore results in a crash. Since
Note that this will simply skip the offending file name but continue to process the rest.
Measure your results
Create a non-parallel version of this code and compare the timing to see if it's faster or slower than this version. I have found that it's often true that the overhead of the implicit (as in
Simplify your code
You don't need a lambda at all. Use this instead:
futures.push_back(std::async(wordsInFile,argv[i]));Also, instead of
if (!file.is_open()) use the more idiomatic if (!file).Fix the counting
If we have two files which both contain the word
fox exactly once then I would expect the result of this program to print fox 2 but in fact, it would falsely claim a count of 1. The problem is this line:words[j.first] = j.second;This overwrites the count instead of accumulating it. This is what you need instead:
words[j.first] += j.second;Use
reserve to prevent reallocationsThis code already knows how many
futures are going to be created. The code can be made a little more efficient by using this:futures.reserve(argc-1);With large numbers of files, this eliminates the re-allocation that would be required by dynamically resizing the
vector.Check for exceptions
If any of the file names can't be read. That is, if it can be opened but not read, as with a directory under Linux, this code will throw an error that isn't caught and so therefore results in a crash. Since
future::get() will throw any error that was stored, it's that part of the code that should check for exceptions. One alternative would be this:Words results;
try {
results = i.get();
}
catch(std::exception &err)
{
std::cerr << "ERROR: " << err.what() << '\n';
}
catch(std::string &err)
{
std::cerr << "ERROR: " << err << '\n';
}
for (const auto& j : results)
{
words[j.first] += j.second;
}Note that this will simply skip the offending file name but continue to process the rest.
Measure your results
Create a non-parallel version of this code and compare the timing to see if it's faster or slower than this version. I have found that it's often true that the overhead of the implicit (as in
async) or explicit (as in thread) thread creation swamps the savings in time by making things parallel. The only way to know for sure is to time it.Code Snippets
futures.push_back(std::async(wordsInFile,argv[i]));words[j.first] = j.second;words[j.first] += j.second;futures.reserve(argc-1);Words results;
try {
results = i.get();
}
catch(std::exception &err)
{
std::cerr << "ERROR: " << err.what() << '\n';
}
catch(std::string &err)
{
std::cerr << "ERROR: " << err << '\n';
}
for (const auto& j : results)
{
words[j.first] += j.second;
}Context
StackExchange Code Review Q#73310, answer score: 3
Revisions (0)
No revisions yet.