pattern python Minor

Large ASCII file data read

Submitted by: @import:stackexchange-codereview
file read large ascii data

Problem

I am moving a project from Python to C++, partly in order to achieve a speed up.

This part of the code reads large .txt data files (well, only about 2 MB per file, but quite a lot of files) and needs to convert the data to floating point.

This C++ code is not faster than Python bytecode (.pyc). Regardless, my application requires faster processing. What are the main things I am doing wrong?

Below is a complete standalone representative example that will compile with:

cl.exe turtlereader.cpp (or another compiler, I believe)

```
#include <fstream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

class TurtleFileReader {
public:

// pointless constructor for this example
TurtleFileReader(){};

// Turtle read looks for zones in the input file and directs the fileread process.
int TurtleRead() {

std::ifstream readfile;
readfile.open("sampleinput.txt");

// records line we are reading on:
int linenumber = 0;
int data_starts_on_line = 0; // init to 0

// find first zone = line
std::string line;
while (std::getline(readfile, line)) {
linenumber += 1;
if ( line.find("ZONE") > total_z_values;

// at this point, we have come across a zone line in the file. And,
// we know how many data lines to read next.
for (size_t i = 0; i vec = LineSplit(line);
if (vec.size() == 9) {
a1_.push_back(vec[0]);
a2_.push_back(vec[1]);
a3_.push_back(vec[2]);
a4_.push_back(vec[3]);
a5_.push_back(vec[3]);
a6_.push_back(vec[4]);
a7_.push_back(vec[5]);
a8_.push_back(vec[6]);
a9_.push_back(vec[5]);

// Do some check on the data here:
if (vec[0] > 10 || vec[0]
template<typename T>
std::vector<T> LineSplit(const std::string& line) {
std::istringstream is(line);
return std::vector<T>(std::istream_iterator<T>(is), std::istream_iterator<T>());
}
```

Solution

The first principle of optimization is "measure, don't guess". So the first step is to use a profiler on your platform to find the most time-consuming steps in your algorithm. The result may depend on compilation options (optimization turned on or off). On my platform (x86-64 Linux, g++ 4.8.1 with -O3), the most time-consuming operation is:

template<typename T>
std::vector<T> LineSplit(const std::string& line) {
    std::istringstream is(line);
    return std::vector<T>(std::istream_iterator<T>(is), std::istream_iterator<T>());
}
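
If a profiler is not at hand, a rough timing harness can give a first indication of how much this function costs. The following is only a sketch under assumptions: `TimeLineSplitMs` is a hypothetical helper name, the sample line is made up, and `std::chrono::steady_clock` measures wall-clock time, which is noisier than what a real profiler reports.

```cpp
#include <chrono>
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// LineSplit as quoted in the answer.
template<typename T>
std::vector<T> LineSplit(const std::string& line) {
    std::istringstream is(line);
    return std::vector<T>(std::istream_iterator<T>(is), std::istream_iterator<T>());
}

// Hypothetical helper (not from the original post): calls LineSplit<double>
// `repeats` times on `line` and returns the elapsed wall-clock time in ms.
// `values_seen` receives the total number of parsed values, so the work
// has an observable result and is less likely to be optimized away.
double TimeLineSplitMs(const std::string& line,
                       std::size_t repeats,
                       std::size_t& values_seen) {
    values_seen = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < repeats; ++i) {
        values_seen += LineSplit<double>(line).size();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling, for example, `TimeLineSplitMs("1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0", 100000, n)` with and without optimization flags should show how much the result depends on the build settings. The absolute numbers will differ between machines.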


I would first try to write a specialization of this method for double and parse the line manually (using pointer arithmetic and the strtod() function from the C standard library, declared in <cstdlib>), then measure again and optimize the next bottleneck.
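
One possible shape for such a manual parser is sketched below. It is an illustration, not the answer author's code: `LineSplitDouble` is a hypothetical free function rather than a true template specialization, and the reserve size of 9 is an assumption taken from the `vec.size() == 9` check in the question. Note that `std::strtod` is locale-dependent and also accepts forms such as hexadecimal floats, `inf` and `nan`, which may or may not be acceptable for the data at hand.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical strtod-based stand-in for LineSplit<double>.
// Parses whitespace-separated numbers and stops at the end of the line
// or at the first token that strtod cannot convert.
std::vector<double> LineSplitDouble(const std::string& line) {
    std::vector<double> values;
    values.reserve(9);  // assumption: data rows hold nine values

    const char* cursor = line.c_str();  // null-terminated, as strtod requires
    char* end = nullptr;
    while (true) {
        // strtod skips leading whitespace and sets `end` just past the number.
        const double value = std::strtod(cursor, &end);
        if (end == cursor) {
            break;  // nothing was converted: end of line or non-numeric text
        }
        values.push_back(value);
        cursor = end;  // continue right after the parsed number
    }
    return values;
}
```

This avoids constructing a `std::istringstream` for every line, which is the cost the measurement above points at. Whether it is actually faster on a given platform should be confirmed by measuring again.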

Code Snippets

template<typename T>
std::vector<T> LineSplit(const std::string& line) {
    std::istringstream is(line);
    return std::vector<T>(std::istream_iterator<T>(is), std::istream_iterator<T>());
}

Context

StackExchange Code Review Q#51381, answer score: 6
