patterncppMinor
File parsing code in C++
Viewed 0 times
codefileparsing
Problem
I'm processing about 3.5gb/15minutes, and I was wondering if I could get this down to a better time...
Note that this is for a very specific CSV file, so I am not concerned about generality - only speed of implementation and optimization.
Note that this is for a very specific CSV file, so I am not concerned about generality - only speed of implementation and optimization.
int main(int argc, char *argv[]) {
std::string filename;
std::cout > filename;
std::string ticker;
std::cout > ticker;
std::ifstream instream(filename.c_str());
std::string ask_filename = ticker + "_ASK.NIT";
std::ofstream askstream(ask_filename.c_str());
std::string bid_filename = ticker + "_BID.NIT";
std::ofstream bidstream(bid_filename.c_str());
std::string line;
std::getline(instream,line);
while(std::getline(instream,line))
{
std::stringstream lineStream(line);
std::string cell;
std::string new_line;
std::vector my_str_vec;
while(std::getline(lineStream,cell,','))
{
my_str_vec.push_back(cell);
//new_line.append(cell.append(";"));
}
// works on date
std::string my_date = my_str_vec[0];
std::string::iterator my_iter;
std::string processed_date = "";
for(my_iter = my_date.begin(); my_iter != my_date.end(); ++my_iter) {
if(std::isalnum(*my_iter) || *my_iter == ' ')
processed_date.append(1,(*my_iter));
}
my_str_vec[0] = processed_date;
std::vector::iterator my_vec_iter;
for(my_vec_iter = my_str_vec.begin() + 1; my_vec_iter != my_str_vec.end(); ++my_vec_iter) {
std::string my_semicol = ";";
*my_vec_iter = my_semicol.append(*my_vec_iter);
}
askstream << my_str_vec[0] << my_str_vec[1] << my_str_vec[3] << std::endl;
bidstream << my_str_vec[0] << my_str_vec[2] << my_str_vec[4] << std::endl;
}
askstream.close();
bidstream.close();
return 0;
}Solution
First off streams and specifically the stream operators >> and << are designed for robustness not speed. They make the code harder to write incorrectly (and provide some local specific conversions) that make expensive in terms of using (consider changing to C fscanf() if speed is paramount).
So if speed is you objective streams is not the best choice.
Assuming we are staying with streams (Ask another question about how to write this in C for speed):
Dead Code
Second delete dead code (it confuses a review) and costs in creation.
Small Inefficiencies
There is no need to create the cell object each time through the loop create it once outside the loop. (Personally I would leave it in the loop, but if you are going for speed then this is an obvious simple save).
When scanning the date you are making a copy of the value:
No need just make a reference (you can even use a const reference for safety):
Use Reserve
Building the string character by character may not be the most efficient way of getting the date (as the string
Use reserve so you know that it is not re-sized internally, then use a standard algorithm for optimal looping:
Standard algorithms.
Your code will be optimal when you use the standard algorithms. Its worth learning what is available.
http://www.sgi.com/tech/stl/table_of_contents.html
See section 5
I showed you how to copy all but the characters you want. But there is one to remove all elements in-place. I'll let you experiment.
So if speed is you objective streams is not the best choice.
Assuming we are staying with streams (Ask another question about how to write this in C for speed):
Dead Code
Second delete dead code (it confuses a review) and costs in creation.
std::string new_line;Small Inefficiencies
There is no need to create the cell object each time through the loop create it once outside the loop. (Personally I would leave it in the loop, but if you are going for speed then this is an obvious simple save).
When scanning the date you are making a copy of the value:
std::string my_date = my_str_vec[0];No need just make a reference (you can even use a const reference for safety):
std::string const& my_date = my_str_vec[0];
// ^^^^^^Use Reserve
Building the string character by character may not be the most efficient way of getting the date (as the string
processed_date may be resized on each insert). for(my_iter = my_date.begin(); my_iter != my_date.end(); ++my_iter) {
if(std::isalnum(*my_iter) || *my_iter == ' ')
processed_date.append(1,(*my_iter));
}Use reserve so you know that it is not re-sized internally, then use a standard algorithm for optimal looping:
processed_date.reserve(my_date.size()); // Note: not resize()
std::remove_copy(my_date.begin(), my_date.end(),
std::back_inserter(processed_date),
[](char const& my_char) {return !(std::isalnum(*my_char) || *my_char == ' ');}
);
// Note if you don't have C++0x then lamda is easily replaced by a functor.Standard algorithms.
Your code will be optimal when you use the standard algorithms. Its worth learning what is available.
http://www.sgi.com/tech/stl/table_of_contents.html
See section 5
I showed you how to copy all but the characters you want. But there is one to remove all elements in-place. I'll let you experiment.
Code Snippets
std::string new_line;std::string my_date = my_str_vec[0];std::string const& my_date = my_str_vec[0];
// ^^^^^^for(my_iter = my_date.begin(); my_iter != my_date.end(); ++my_iter) {
if(std::isalnum(*my_iter) || *my_iter == ' ')
processed_date.append(1,(*my_iter));
}processed_date.reserve(my_date.size()); // Note: not resize()
std::remove_copy(my_date.begin(), my_date.end(),
std::back_inserter(processed_date),
[](char const& my_char) {return !(std::isalnum(*my_char) || *my_char == ' ');}
);
// Note if you don't have C++0x then lamda is easily replaced by a functor.Context
StackExchange Code Review Q#4585, answer score: 8
Revisions (0)
No revisions yet.