HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

Finding the top k elements from a file

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
thetopfileelementsfindingfrom

Problem

I was writing a program in C++11 to find top k words from a file. I think priority queues will give the optimal solution so I implemented using that. The code works and compiles. Could anyone please review this and suggest some more C++11/14-ish things to do?

I also have some specific questions:

-
Is there way in C++11 or C++14 to avoid using an iterator to find whether an element exists or not in a map?

I am talking about the condition:

if(it != word_map.end()) {
  word_map[word] = ++word_map[word];
}


How can I avoid using an iterator? Is there a "defaultdict" in C++ like we have in Python? I suppose here I can just remove it since default initialization will be 0, but what is the general way?

-
I saw a weird behavior for this:

while(input_file>>word) {
  //....
}


This loop gives the correct count. But if I do:

while(input_file.good()) {
   //..
}


this gives the count with an additional 1 added if the word is present on two different lines.

For example:

this is a test.
this.


while giving a count of this as 3.

-
I think I need to write a custom comparator for the priority queue to compare only the second part of the "pair". In that case, on what basis is the priority queue operating as of now?

/** top k words in a file
  */
#include 
#include 
#include 
#include 
using namespace std;

void getTopKWords(const map& word_map, const unsigned int k) {
  priority_queue > pq;
  for(auto item : word_map) {
    pq.push(make_pair(item.first, item.second));
  }

  for(unsigned int i=0;i word_map;
  std::string word = "";
  while(input_file >> word) {
    auto it = word_map.find(word);
    if(it != word_map.end()) {
      word_map[word] = ++word_map[word];
    }
    else {
      word_map[word] = 1;
    }
  }
  input_file.close();
  getTopKWords(word_map, k);
}

int main() {
  getWordFrequency("test_file.txt", 5);
}

Solution

Answer to questions:


I think priority queues will give the optimal solution so i implemented using that.

I would bet a heap is better for anything bigger than test data. Note that in a priority queue you are keeping all the data. So every time you add an item you are adding to its size and the length of time it takes to insert.

In real terms a heap would be better because you only keep the top 10 (or hover many you actually require). So each new value is only being compared against the top 10 items.

See:

http://en.cppreference.com/w/cpp/algorithm/make_heap

http://en.cppreference.com/w/cpp/algorithm/push_heap

http://en.cppreference.com/w/cpp/algorithm/pop_heap


The code works and compiles. Could anyone please review this and suggest some more C++11/14-ish things to do?.

Covered in the next section:


Is there way in c++11 or c++14 to avoid using an iterator to find whether an element exists or not in a map ?

if(it != word_map.end()) {
  word_map[word] = ++word_map[word];
}
else {
  word_map[word] = 1;
}


Yes. This can be replaced by

++word_map[word];


If the item does not exist in the map then using operator[] will insert the key and set the default value to zero. We then apply the ++ operator and increment it. So when a value is first inserted it gets the value 1, subsequent increments increase the stored value appropriately.


I saw a weird behavior for this : while(input_file>>word) { //.... }


this loop gives the correct count. But if i do while(input_file.good()) { //.. } this gives the count with an additional 1 added if the word is present on two different lines. For example :

This is correct. In all languages you test the result of the read operation (the behavior when you use while(input_file>>word)).

In all languages this is wrong while(input_file.good()) {.

The reason is that the EOF flag is not set until you read past the actual eof. BUT the last successful read reads up-to but not past the EOF. So when you read the last word, there is no more data on the input but the EOF flag is not set because you have not read past the end of file.

So if you use while(input_file.good()) you will re-enter the loop after you have read all the data. It is not until you try and read data from the stream that the EOF will be set.

You can do it this way but you must test the state of the read as well:

while(input_file.good())
{
    if (input_file >> word)
    {
        // read worked so you can processes the data
    }
}


It seems much easier to write:

while(input_file >> word)
    {
        // Only enter the loop if the read worked.
    }


This is because operator>> reads a word from the stream and returns as a result a reference to the stream. So you are now using the stream as the condition of the while(). When used in this context the stream is converted to a boolean by calling good(). So in this case you have done the read and then tested the state of the stream to see if the read worked.


while give count of this as 3.

Yep you entered the loop again when there was no data to read and counted the extra entry into the loop.

I think i need to write a custom comparator for the priority queue to compare only the second part of the "pair".


Yep

Code Review

Don't do this:

using namespace std;


See every other C++ reviewed question for why.

And Why is “using namespace std;” considered bad practice?.

Don't use unsigned int.

void getTopKWords(const map& word_map, const unsigned int k) {


I know it seems logical that the your input should be a positive value. But unfortunately if somebody calls the function and accidentally passes a negative number it will still work; It just passes a very large positive value instead (stupid auto-conversions).

The general consensus is that you should always used signed integers. The only time that using unsigned is useful is when your integer is holding a set of bit flags.

To make this work as required you would need to define a custom ordering.

priority_queue > pq;


Why not create and open the file in a single line:

ifstream input_file;
  input_file.open(filename);

  // Easier to read as:
  ifstream input_file(filename);


String defaults to the empty string.

std::string word = "";
  // Easier to write as:
  std::string word;


This we mentioned above:

auto it = word_map.find(word);
    if(it != word_map.end()) {
      word_map[word] = ++word_map[word];
    }
    else {
      word_map[word] = 1;
    }
    // Can all be replaced with:
    ++word_map[word];


Don't manually close a file.

input_file.close();


The exception is if you want to check that the close worked and potentially report a problem to the user. Since you are not doing this check then let the destructor of the file call close() for you. This way it caches all errors cleanly and discards them.

Code to use Heap to get top - N Items

```
struct CountTester
{

Code Snippets

if(it != word_map.end()) {
  word_map[word] = ++word_map[word];
}
else {
  word_map[word] = 1;
}
++word_map[word];
while(input_file.good())
{
    if (input_file >> word)
    {
        // read worked so you can processes the data
    }
}
while(input_file >> word)
    {
        // Only enter the loop if the read worked.
    }
I think i need to write a custom comparator for the priority queue to compare only the second part of the "pair".

Context

StackExchange Code Review Q#86812, answer score: 4

Revisions (0)

No revisions yet.