
Fast CSV parser

Submitted by: @import:stackexchange-codereview

Problem

I have written a CSV parser to read in 12,609,000 lines of CSV data. Each line has 16 fields, a mix of string and double data. The following code takes about 40 seconds to parse the data. I would like to know if there is a better way of doing this that would make it even faster. I have seen recommendations to load all of the CSV data from the file into memory at once and then perform the parsing, as in here.

I have tried that, but it didn't make much of a difference. I am posting the core of my code, which takes the longest time: about 37 of the 40 seconds.

```
// outBuffer is a (char*) to a char array containing the CSV
// string I load from the file. outBufferSize has the length
// of outBuffer.

// Get number of lines in the file.
int num_rows = 0;
for (int i = 0; i < outBufferSize; i++) {
    if (outBuffer[i] == '\n')
        num_rows++;
}

// Assign required memory
lastPtr = calloc(num_rows, sizeof(double));
bidPtr = calloc(num_rows, sizeof(double));
askPtr = calloc(num_rows, sizeof(double));
volPtr = calloc(num_rows, sizeof(double));
highPtr = calloc(num_rows, sizeof(double));
lowPtr = calloc(num_rows, sizeof(double));
bidSizePtr = calloc(num_rows, sizeof(double));
askSizePtr = calloc(num_rows, sizeof(double));
lastSizePtr = calloc(num_rows, sizeof(double));

// Get the indices of end-of-line characters in outBuffer.
int *line_ends = calloc(num_rows, sizeof(int));
for (int i = 0, j = 0; i < outBufferSize; i++) {
    if (outBuffer[i] == '\n')
        line_ends[j++] = i;
}

int line = 0;
int line_begin = 0;
char token[4096];
while (line < num_rows) {

    // Copy the current line into token and null-terminate it for sscanf.
    size_t len = (size_t)(line_ends[line] - line_begin);
    memcpy(token, &outBuffer[line_begin], len);
    token[len] = '\0';
    line_begin = line_ends[line] + 1;

    sscanf(token, "%[^,],%[^,],%f,%f,%f,%f,%f,%f,%f,%f,%[^,],%f,%f,%f,%f,%f",
           time, symbol, &last, &bid, &ask, &volume, &y_close, &open, &hig
```

Solution

First, I would take the memcpy out of the loop. Just make token a char* and keep moving it to the next line_begin, replacing each \n with a null terminator so that every line becomes a properly terminated string. That will save you a lot of byte copying you don't need.
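The idea can be sketched as a small helper that walks the buffer in place. This is a minimal sketch, not the asker's code: `next_line` is a hypothetical name, and it assumes the buffer is writable and that only lines ending in `\n` count, which matches how the question counts rows.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the next '\n'-terminated line starting at
   *cursor, or NULL when no complete line remains before end.
   The '\n' is overwritten with '\0', so the returned pointer is a normal
   C string that lives inside the original buffer (no copy is made). */
char *next_line(char **cursor, char *end)
{
    char *line = *cursor;
    if (line >= end)
        return NULL;

    char *nl = memchr(line, '\n', (size_t)(end - line));
    if (nl == NULL)
        return NULL;        /* trailing data without '\n' is not a line */

    *nl = '\0';             /* terminate this line in place */
    *cursor = nl + 1;       /* next call starts after the old '\n' */
    return line;
}
```

With this, the parsing loop would look roughly like `char *cursor = outBuffer, *end = outBuffer + outBufferSize; while ((token = next_line(&cursor, end)) != NULL) { sscanf(token, ...); }`, and the separate `line_ends` pass is no longer needed.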

Second, you can pass &lastPtr[line] directly to sscanf and avoid the reassignment from a temporary variable.
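A reduced sketch of that change, using only five of the sixteen fields. The function name `parse_row`, the buffer sizes, and the field widths are assumptions made for illustration. Note that scanning straight into a `double` array requires `%lf`, since `%f` expects a `float *`.

```c
#include <stdio.h>

/* Hypothetical reduced row: time, symbol, last, bid, ask.
   time_str must hold at least 32 bytes and symbol at least 16 bytes;
   the widths in the format string keep sscanf from overrunning them.
   The numeric fields are written straight into row `line` of each array.
   Returns the number of fields converted (5 on success). */
int parse_row(const char *token, char *time_str, char *symbol,
              double *lastPtr, double *bidPtr, double *askPtr, size_t line)
{
    return sscanf(token, "%31[^,],%15[^,],%lf,%lf,%lf",
                  time_str, symbol,
                  &lastPtr[line], &bidPtr[line], &askPtr[line]);
}
```

Checking the return value against the expected field count is a cheap way to catch malformed lines.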

Third, if you put all your values inside a struct and make one big array of that struct, instead of a bunch of separate arrays, values that you access close together in time will also sit close together in memory, which should help you avoid cache misses.
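A minimal sketch of that layout, with field names derived from the array names in the question (`lastPtr` becomes `last`, and so on). The type name `Quote` and the helpers `alloc_quotes` and `spread` are hypothetical.

```c
#include <stdlib.h>

/* One row of numeric data; all fields of a row are adjacent in memory. */
typedef struct {
    double last, bid, ask, vol, high, low;
    double bidSize, askSize, lastSize;
} Quote;

/* One zero-initialized allocation replaces the nine separate calloc calls.
   Returns NULL on failure; the caller frees the result with free(). */
Quote *alloc_quotes(size_t num_rows)
{
    return calloc(num_rows, sizeof(Quote));
}

/* Example of a per-row computation: bid and ask sit next to each other
   in the struct, so both are usually fetched together. */
double spread(const Quote *q)
{
    return q->ask - q->bid;
}
```

Whether this helps depends on the access pattern. It suits code that touches several fields of the same row at once; if later processing scans a single column across all rows, the separate arrays may be the better layout.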

Context

StackExchange Code Review Q#129521, answer score: 6
