HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

File parsing code in C++

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
codefileparsing

Problem

I'm processing about 3.5gb/15minutes, and I was wondering if I could get this down to a better time...

Note that this is for a very specific CSV file, so I am not concerned about generality - only speed of implementation and optimization.

int main(int argc, char *argv[]) {
    std::string filename;
    std::cout > filename;

    std::string ticker;
    std::cout > ticker;

    std::ifstream instream(filename.c_str());
    std::string ask_filename = ticker + "_ASK.NIT";
    std::ofstream askstream(ask_filename.c_str());

    std::string bid_filename = ticker + "_BID.NIT";

    std::ofstream bidstream(bid_filename.c_str());

    std::string line;
    std::getline(instream,line);

    while(std::getline(instream,line))
    {
        std::stringstream lineStream(line);
        std::string cell;
        std::string new_line;
        std::vector my_str_vec;
        while(std::getline(lineStream,cell,','))
        {
            my_str_vec.push_back(cell);
            //new_line.append(cell.append(";"));
        }
        // works on date
        std::string my_date = my_str_vec[0];
        std::string::iterator my_iter;
        std::string processed_date = "";
        for(my_iter = my_date.begin(); my_iter != my_date.end(); ++my_iter) {
            if(std::isalnum(*my_iter) || *my_iter == ' ')
                processed_date.append(1,(*my_iter));
        }

        my_str_vec[0] = processed_date;

        std::vector::iterator my_vec_iter;
        for(my_vec_iter = my_str_vec.begin() + 1; my_vec_iter != my_str_vec.end(); ++my_vec_iter) {
            std::string my_semicol = ";";
            *my_vec_iter = my_semicol.append(*my_vec_iter);
        }

        askstream << my_str_vec[0] << my_str_vec[1] << my_str_vec[3] << std::endl;
        bidstream << my_str_vec[0] << my_str_vec[2] << my_str_vec[4] << std::endl;
    }

    askstream.close();
    bidstream.close();

    return 0;
}

Solution

First off streams and specifically the stream operators >> and << are designed for robustness not speed. They make the code harder to write incorrectly (and provide some local specific conversions) that make expensive in terms of using (consider changing to C fscanf() if speed is paramount).

So if speed is you objective streams is not the best choice.

Assuming we are staying with streams (Ask another question about how to write this in C for speed):

Dead Code

Second delete dead code (it confuses a review) and costs in creation.

std::string new_line;


Small Inefficiencies

There is no need to create the cell object each time through the loop create it once outside the loop. (Personally I would leave it in the loop, but if you are going for speed then this is an obvious simple save).

When scanning the date you are making a copy of the value:

std::string my_date = my_str_vec[0];


No need just make a reference (you can even use a const reference for safety):

std::string const&   my_date = my_str_vec[0];
          //    ^^^^^^


Use Reserve

Building the string character by character may not be the most efficient way of getting the date (as the string processed_date may be resized on each insert).

for(my_iter = my_date.begin(); my_iter != my_date.end(); ++my_iter) {
        if(std::isalnum(*my_iter) || *my_iter == ' ')
            processed_date.append(1,(*my_iter));
    }


Use reserve so you know that it is not re-sized internally, then use a standard algorithm for optimal looping:

processed_date.reserve(my_date.size());  // Note: not resize()
std::remove_copy(my_date.begin(), my_date.end(), 
                 std::back_inserter(processed_date),
                 [](char const& my_char) {return !(std::isalnum(*my_char) || *my_char == ' ');}
                );
// Note if you don't have C++0x then lamda is easily replaced by a functor.


Standard algorithms.

Your code will be optimal when you use the standard algorithms. Its worth learning what is available.

http://www.sgi.com/tech/stl/table_of_contents.html

See section 5

I showed you how to copy all but the characters you want. But there is one to remove all elements in-place. I'll let you experiment.

Code Snippets

std::string new_line;
std::string my_date = my_str_vec[0];
std::string const&   my_date = my_str_vec[0];
          //    ^^^^^^
for(my_iter = my_date.begin(); my_iter != my_date.end(); ++my_iter) {
        if(std::isalnum(*my_iter) || *my_iter == ' ')
            processed_date.append(1,(*my_iter));
    }
processed_date.reserve(my_date.size());  // Note: not resize()
std::remove_copy(my_date.begin(), my_date.end(), 
                 std::back_inserter(processed_date),
                 [](char const& my_char) {return !(std::isalnum(*my_char) || *my_char == ' ');}
                );
// Note if you don't have C++0x then lamda is easily replaced by a functor.

Context

StackExchange Code Review Q#4585, answer score: 8

Revisions (0)

No revisions yet.