snippetcppMinor
How to search for a word in a sorted text file efficiently?
Viewed 0 times
efficientlyfilesearchtextwordforhowsorted
Problem
I want to write my own program to search for a word in a sorted text file. In the text file each word is in one line. I want to write this myself because I did not find any simple program (without libraries such as boost) which can provide suggestion word which means a word with most similarity. Here is my code but I don't know if I am in the right or wrong track. Appreciate any suggestion.
#include
#include
#include
using namespace std;
int main(){
ifstream fin;
fin.open("/usr/dict/words");
if(!fin.good()){cout>search;
coutword)
{
last=middle-1;
cout<<"b"; //**
}
else
{
found=true;
break;
}
cout<<first<<" "<<last<<" "<<" "<<middle<<endl; //**
}
return 0;
}Solution
Your biggest problem is the code does not take into account not finding the word. If it fails your code will loop forever. You need a not found exit case.
This is always a bad sign
That line is nearly always wrong (in every programming language I know). Especially in your case that should never be true (since you know where the end is).
Remember that the last successful read reads up-to but not past the EOF. The eof-flag is only set when you read past EOF.
So on the last good read you will read up-to the end of file processes the data. There is no more data left in the file but the eof-flag will not be set because you have not read past the EOF yet so you will re-enter the loop. The next attempt to read data will fail (setting the eof-flag), but the read has failed and thus left the variables you just read into in a questionable state.
Luckily you don't get into that state because you move after each search. But it is a sign of the wrong type of test. This should be:
You are assuming that words start on the exact boundary of middle!
Not sure I believe that will hold. Middle is a good estimate. But you need to find the start of the word. So seek forward for the next word.
Correct typing in C++ is everything:
If you don't care about the actual type (you just want the correct one). Then let the compiler work it out for you.
Why take two lines to open the file:
Sure you can test for good(). But its better to let the stream decide if it is a usable state.
If the size of the dictionary is not huge (couple of Meg). I personally would just load it into memory and let the standard data structures handle it. Alternatively You could map the file to memory and do a more standard search.
This is always a bad sign
while(!fin.eof())That line is nearly always wrong (in every programming language I know). Especially in your case that should never be true (since you know where the end is).
Remember that the last successful read reads up-to but not past the EOF. The eof-flag is only set when you read past EOF.
So on the last good read you will read up-to the end of file processes the data. There is no more data left in the file but the eof-flag will not be set because you have not read past the EOF yet so you will re-enter the loop. The next attempt to read data will fail (setting the eof-flag), but the read has failed and thus left the variables you just read into in a questionable state.
Luckily you don't get into that state because you move after each search. But it is a sign of the wrong type of test. This should be:
while(searching) // Set to false when you finish searching.
// Which may be when you find the word
// or there are no more word to search.You are assuming that words start on the exact boundary of middle!
middle=(first+last)/2;
fin.seekg(middle);
fin>>search;Not sure I believe that will hold. Middle is a good estimate. But you need to find the start of the word. So seek forward for the next word.
middle=(first+last)/2;
fin.seekg(middle);
fin.ignore(std::numeric_limits::max(), '\n');
if (!fin.eof()) // That may have read past EOF
{
}Correct typing in C++ is everything:
fin.seekg(0,ios::end);
int last = fin.tellg(); // This does not return an int.
// With todays big files this can be rather a large
// number.
// That line should be.
std::streamsize last = fin.tellg();If you don't care about the actual type (you just want the correct one). Then let the compiler work it out for you.
// C++11 let the compiler deduct the correct type of last.
auto last = fin.tellg()Why take two lines to open the file:
ifstream fin;
fin.open("/usr/dict/words");
// Much simpler to read and write:
ifstream fin("/usr/dict/words");Sure you can test for good(). But its better to let the stream decide if it is a usable state.
if(!fin.good()){cout<<"No dictionary file!\n"; exit(1);}
// Prefer to let the stream decide:
if (fin) // Stream used in a boolean context will convert itself into
// something useful that is convertible to true if you can
// use the stream or false if the stream is not usable.
{
cout<<"No dictionary file!\n";
exit(1);
}If the size of the dictionary is not huge (couple of Meg). I personally would just load it into memory and let the standard data structures handle it. Alternatively You could map the file to memory and do a more standard search.
bool isWordInDict(std::string const& word);
{
struct MyDict: std::set
{
typedef std::set::const_iterator const_iterator;
MyDict()
{
std::ifstream fin("/usr/dict/words");
std::copy(std::istream_iterator(fin), std::istream_iterator(),
std::inserter(*this, end()));
}
};
static const MyDict dict;
MyDict::const_iterator find = dict.find(word);
return find != dict.end();
}Code Snippets
while(!fin.eof())while(searching) // Set to false when you finish searching.
// Which may be when you find the word
// or there are no more word to search.middle=(first+last)/2;
fin.seekg(middle);
fin>>search;middle=(first+last)/2;
fin.seekg(middle);
fin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
if (!fin.eof()) // That may have read past EOF
{
}fin.seekg(0,ios::end);
int last = fin.tellg(); // This does not return an int.
// With todays big files this can be rather a large
// number.
// That line should be.
std::streamsize last = fin.tellg();Context
StackExchange Code Review Q#52702, answer score: 6
Revisions (0)
No revisions yet.