Reading file into structure
Problem
I'm currently trying to read a fairly large file into a C program for later use. The file is about 800 megabytes and contains around 20 million lines of data in the following format:
YYYYMMDD HHMMSSMMM,X.XXXXX0,X.XXXXX0,0
- Y: Year
- M: Month
- D: Day
- H: Hour
- M: Minute
- S: Second
- M: Milliseconds
- X: floating point number
Here are a few examples:
```
20150101 130021493,1.209650,1.210070,0
20150101 130044493,1.209720,1.210140,0
20150101 130044743,1.209650,1.210070,0
20150101 130045493,1.209720,1.210140,0
20150101 130045743,1.209670,1.210090,0
```
I want to read the data of a single line into the following structure:
```
struct forexData {
    struct tm timestamp;
    uint32_t bidQuote;
    uint32_t askQuote;
};
```
As you can already imagine, this data represents the bid/ask quotes at a specific point in time. The bid/ask quotes are stored in `uint32_t`, since I prefer an integer value over floats for later use. The structures of each line are placed in another structure, which will contain all data.
```
struct forexDataSet {
    struct forexData **data;
    uint32_t capacity;
    uint32_t cardinality;
};
```
The `data` property points to an array of pointers to `forexData` structures. The `capacity` property describes how many pointers the array can hold, and `cardinality` holds how many pointers are currently in use.
Here are the routines to read in the file:
```
// Headers required by the code shown here
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define INITIAL_SET_SIZE 10000000
#define SET_INCREASE_STEP 1000000

// fileDescriptor returned by open(fileName, O_RDONLY, 0);
struct forexDataSet *createForexDataSetFromFile(int fileDescriptor) {
    char buf[39000];
    ssize_t nBytesRead;
    struct forexDataSet *set = calloc(1, sizeof(struct forexDataSet));
    set->data = malloc(INITIAL_SET_SIZE * sizeof(struct forexData *));
    if (!set->data) {
        return NULL;
    }
    set->capacity = INITIAL_SET_SIZE;
    do {
        nBytesRead = read(fileDescriptor, buf
        // ... (the listing is cut off here in the source)
```
Solution
No need for realloc
The first thing I noticed is that your program has an inefficient reallocation strategy. On a file with 20 million entries, your program will need to reallocate 10 times. But you don't need to reallocate at all. Since you know that every entry in the file is 39 bytes long (38 characters plus the newline), all you need to do is measure the file size and divide by 39 to determine the number of entries. Here is the code I used to do it:
```
#define INPUT_LINE_LEN 39

// In createForexDataSetFromFile()
off_t fsize;
ssize_t numEntries;

fsize = lseek(fileDescriptor, 0, SEEK_END);
numEntries = (fsize + INPUT_LINE_LEN - 1) / INPUT_LINE_LEN;
lseek(fileDescriptor, 0, SEEK_SET);
```
This change didn't end up speeding up the program, but it did simplify things by removing the whole reallocation portion of the code. It also reduced the total memory usage from around 1.7 GB to about 1.3 GB for a 20 million line input file.
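As a variation on the size calculation above, the sketch below uses `fstat()` instead of seeking, and refuses inputs such as pipes for which a file size is not meaningful. The helper name `countRecords` and its out-parameter are illustrative assumptions, not part of the original program; it assumes every line is exactly 39 bytes including the newline, except possibly a final line with no trailing newline.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

#define INPUT_LINE_LEN 39

// Hypothetical helper: computes how many fixed-length records the file
// behind fd holds. Returns false if fstat() fails or fd does not refer
// to a regular file. A final line without a trailing newline is still
// counted, matching the rounding used in the lseek() version.
static bool countRecords(int fd, size_t *numEntries)
{
    struct stat st;

    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) {
        return false;
    }
    *numEntries = ((size_t) st.st_size + INPUT_LINE_LEN - 1) / INPUT_LINE_LEN;
    return true;
}
```

Because `fstat()` does not move the file offset, no second call is needed to rewind before reading.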
One array of structs
Right now, you allocate an array of pointers, and then allocate one data structure per line in the file. You end up calling `malloc` 20 million times more than you need to. Instead of an array of pointers, you should just allocate one array of structs. The code would look like this:
```
struct forexDataSet {
    struct forexData *data;
    uint32_t numEntries;
};

// fileDescriptor returned by open(fileName, O_RDONLY, 0);
struct forexDataSet *createForexDataSetFromFile(int fileDescriptor) {
    char buf[INPUT_LINE_LEN * 1000];
    ssize_t nBytesRead;
    off_t fsize;
    ssize_t numEntries;
    ssize_t entryNum = 0;

    fsize = lseek(fileDescriptor, 0, SEEK_END);
    numEntries = (fsize + INPUT_LINE_LEN - 1) / INPUT_LINE_LEN;
    lseek(fileDescriptor, 0, SEEK_SET);

    struct forexDataSet *set = calloc(1, sizeof(struct forexDataSet));
    if (!set) {
        return NULL;
    }
    set->data = malloc(numEntries * sizeof(struct forexData));
    if (!set->data) {
        free(set);
        return NULL;
    }
    set->numEntries = numEntries;
    do {
        nBytesRead = read(fileDescriptor, buf, sizeof(buf));
        // Signed index, so a failed read (-1) skips the loop instead of
        // being compared as a huge unsigned value.
        ssize_t lineStart;
        for (lineStart = 0; lineStart < nBytesRead; lineStart += INPUT_LINE_LEN) {
            createForexDataFromString(buf + lineStart, &set->data[entryNum++]);
        }
    } while (nBytesRead == sizeof(buf));
    return set;
}

void createForexDataFromString(const char *str, struct forexData *tmp) {
    strptime(str, "%Y%m%d %H%M%S", &tmp->timestamp);
    tmp->bidQuote = (uint32_t) (atof(str + 19) * 1000000);
    tmp->askQuote = (uint32_t) (atof(str + 28) * 1000000);
}
```
This change sped up the program from 21 sec to 17.2 sec on my test case. It also reduced the memory usage from 1.3 GB to 1.0 GB.
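One caveat with the read loop in this section: `read()` is allowed to return fewer bytes than requested even before end of file, which would leave a record split across two calls. A minimal sketch of a wrapper that keeps reading until the buffer is full or the file ends is shown here; `readFull` is a hypothetical helper name, not something from the original code.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: reads until `count` bytes are stored in `buf`,
// end of file is reached, or an error occurs. Returns the number of
// bytes read (less than `count` only at end of file), or -1 on error.
static ssize_t readFull(int fd, char *buf, size_t count)
{
    size_t total = 0;

    while (total < count) {
        ssize_t n = read(fd, buf + total, count - total);
        if (n < 0) {
            if (errno == EINTR) {
                continue;   // interrupted by a signal, retry
            }
            return -1;
        }
        if (n == 0) {
            break;          // end of file
        }
        total += (size_t) n;
    }
    return (ssize_t) total;
}
```

Calling `readFull(fileDescriptor, buf, sizeof(buf))` in place of `read()` would keep every chunk a whole number of 39-byte records until the last one, since the buffer size is a multiple of the record length.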
Custom read time
One of the slower parts of the program is the call to `strptime()`. Since the time is in a fixed format, you could do better by writing your own custom time parser, like this:
```
static inline int getInt4(const char *str)
{
    return (str[0] - '0') * 1000 + (str[1] - '0') * 100 +
           (str[2] - '0') * 10 + (str[3] - '0');
}

static inline int getInt2(const char *str)
{
    return (str[0] - '0') * 10 + (str[1] - '0');
}

void createForexDataFromString(const char *str, struct forexData *tmp) {
    tmp->timestamp.tm_year = getInt4(str) - 1900;
    tmp->timestamp.tm_mon = getInt2(str+4) - 1;
    tmp->timestamp.tm_mday = getInt2(str+6);
    tmp->timestamp.tm_hour = getInt2(str+9);
    tmp->timestamp.tm_min = getInt2(str+11);
    tmp->timestamp.tm_sec = getInt2(str+13);
    tmp->bidQuote = (uint32_t) (atof(str + 19) * 1000000);
    tmp->askQuote = (uint32_t) (atof(str + 28) * 1000000);
}
```
This change sped up the program from 17.2 sec to 7.8 sec.
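The fixed-offset parsers in this section trust that every character is a digit. If the input might be malformed, a checked variant is one option. The sketch below is only an assumption about how that could look, with `parseDigits` as a hypothetical name; it trades a little speed for validation.

```c
#include <stdbool.h>

// Hypothetical helper: converts exactly `len` decimal digits starting at
// `str` into an int. Returns false (leaving *out untouched) if any of
// those characters is not a digit.
static bool parseDigits(const char *str, int len, int *out)
{
    int value = 0;

    for (int i = 0; i < len; i++) {
        if (str[i] < '0' || str[i] > '9') {
            return false;
        }
        value = value * 10 + (str[i] - '0');
    }
    *out = value;
    return true;
}
```

For the sample line `20150101 130021493,1.209650,1.210070,0`, `parseDigits(str, 4, &year)` yields 2015 and `parseDigits(str + 9, 2, &hour)` yields 13.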
More compact time format
I'm not sure if you need the time in `struct tm` format, but that format wastes a lot of space. It is defined as a struct containing at least nine `int` fields, which is typically going to be 36 bytes or more. Multiply that by 20 million entries and you are using at least 720 MB just to hold the timestamps. I changed the timestamp to a custom struct which uses only 8 bytes, like this:
```
struct forexTimestamp {
    uint8_t sec;
    uint8_t min;
    uint8_t hour;
    uint8_t day;
    uint8_t mon;
    uint16_t year;
};

struct forexData {
    struct forexTimestamp timestamp;
    uint32_t bidQuote;
    uint32_t askQuote;
};
```
This change only sped up the program from 7.8 to 7.6 sec, but it reduced the memory usage from over 1 GB to about 320 MB.
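The memory figure can be checked with a little arithmetic. On common ABIs the compact timestamp occupies 8 bytes (7 bytes of fields plus 1 byte of padding so that `year` is 2-byte aligned), so each `forexData` is 16 bytes and 20 million entries need about 320 MB. The sketch below repeats the structs and encodes that layout assumption; `datasetBytes` is a hypothetical helper, and the sizes are not guaranteed by the C standard.

```c
#include <stddef.h>
#include <stdint.h>

struct forexTimestamp {
    uint8_t sec;
    uint8_t min;
    uint8_t hour;
    uint8_t day;
    uint8_t mon;
    uint16_t year;
};

struct forexData {
    struct forexTimestamp timestamp;
    uint32_t bidQuote;
    uint32_t askQuote;
};

// Assumption: typical alignment rules. Compilation fails on a platform
// where the layout differs, instead of silently using more memory.
_Static_assert(sizeof(struct forexData) == 16, "unexpected forexData layout");

// Hypothetical helper: bytes needed for an array of numEntries records.
static size_t datasetBytes(size_t numEntries)
{
    return numEntries * sizeof(struct forexData);
}
```

With 20 million entries, `datasetBytes(20000000)` comes to 320,000,000 bytes under this layout.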
Custom read float
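The other slow part of the program is reading the float and converting to an integer. You can make your own custom float reading function as well:
```
// Assumes float is in the format X.XXXXXX and returns the float
// multiplied by 1000000. str[1] is the decimal point, so the six
// fractional digits are str[2] through str[7].
static inline uint32_t getFloat_1_6(const char *str)
{
    return (str[0] - '0') * 1000000 + (str[2] - '0') * 100000 +
           (str[3] - '0') * 10000 + (str[4] - '0') * 1000 +
           (str[5] - '0') * 100 + (str[6] - '0') * 10 +
           (str[7] - '0');
}

void createForexDataFromString(const char *str, struct forexData *tmp) {
    tmp->timestamp.year = getInt4(str);
    tmp->timestamp.mon = getInt2(str+4);
    tmp->timestamp.day = getInt2(str+6);
    tmp->timestamp.hour = getInt2(str+9);
    tmp->timestamp.min = getInt2(str+11);
    tmp->timestamp.sec = getInt2(str+13);
    tmp->bidQuote = getFloat_1_6(str + 19);
    tmp->askQuote = getFloat_1_6(str + 28);
}
```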
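As a worked example, the quote `1.209650` should be stored as the integer 1209650: the leading digit contributes 1 × 1000000 and the six digits after the decimal point contribute 209650. The sketch below is a loop-based take on the same idea that also rejects malformed input; `parseQuote_1_6` is a hypothetical name and the `X.XXXXXX` layout is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: parses a quote in the fixed format X.XXXXXX into
// its value multiplied by 1000000. Returns false (leaving *out untouched)
// if the decimal point or any digit is not where it is expected.
static bool parseQuote_1_6(const char *str, uint32_t *out)
{
    uint32_t value = 0;

    for (int i = 0; i < 8; i++) {
        if (i == 1) {
            if (str[i] != '.') {
                return false;
            }
            continue;       // skip the decimal point
        }
        if (str[i] < '0' || str[i] > '9') {
            return false;
        }
        value = value * 10 + (uint32_t) (str[i] - '0');
    }
    *out = value;
    return true;
}
```

Working on the digits directly also avoids the detour through `double`, where the multiply-and-truncate step can be sensitive to rounding.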
Code Snippets
```
#define INPUT_LINE_LEN 39

// In createForexDataSetFromFile()
off_t fsize;
ssize_t numEntries;

fsize = lseek(fileDescriptor, 0, SEEK_END);
numEntries = (fsize + INPUT_LINE_LEN - 1) / INPUT_LINE_LEN;
lseek(fileDescriptor, 0, SEEK_SET);
```
```
struct forexDataSet {
    struct forexData *data;
    uint32_t numEntries;
};

// fileDescriptor returned by open(fileName, O_RDONLY, 0);
struct forexDataSet *createForexDataSetFromFile(int fileDescriptor) {
    char buf[INPUT_LINE_LEN * 1000];
    ssize_t nBytesRead;
    off_t fsize;
    ssize_t numEntries;
    ssize_t entryNum = 0;

    fsize = lseek(fileDescriptor, 0, SEEK_END);
    numEntries = (fsize + INPUT_LINE_LEN - 1) / INPUT_LINE_LEN;
    lseek(fileDescriptor, 0, SEEK_SET);

    struct forexDataSet *set = calloc(1, sizeof(struct forexDataSet));
    if (!set) {
        return NULL;
    }
    set->data = malloc(numEntries * sizeof(struct forexData));
    if (!set->data) {
        free(set);
        return NULL;
    }
    set->numEntries = numEntries;
    do {
        nBytesRead = read(fileDescriptor, buf, sizeof(buf));
        // Signed index, so a failed read (-1) skips the loop instead of
        // being compared as a huge unsigned value.
        ssize_t lineStart;
        for (lineStart = 0; lineStart < nBytesRead; lineStart += INPUT_LINE_LEN) {
            createForexDataFromString(buf + lineStart, &set->data[entryNum++]);
        }
    } while (nBytesRead == sizeof(buf));
    return set;
}

void createForexDataFromString(const char *str, struct forexData *tmp) {
    strptime(str, "%Y%m%d %H%M%S", &tmp->timestamp);
    tmp->bidQuote = (uint32_t) (atof(str + 19) * 1000000);
    tmp->askQuote = (uint32_t) (atof(str + 28) * 1000000);
}
```
```
static inline int getInt4(const char *str)
{
    return (str[0] - '0') * 1000 + (str[1] - '0') * 100 +
           (str[2] - '0') * 10 + (str[3] - '0');
}

static inline int getInt2(const char *str)
{
    return (str[0] - '0') * 10 + (str[1] - '0');
}

void createForexDataFromString(const char *str, struct forexData *tmp) {
    tmp->timestamp.tm_year = getInt4(str) - 1900;
    tmp->timestamp.tm_mon = getInt2(str+4) - 1;
    tmp->timestamp.tm_mday = getInt2(str+6);
    tmp->timestamp.tm_hour = getInt2(str+9);
    tmp->timestamp.tm_min = getInt2(str+11);
    tmp->timestamp.tm_sec = getInt2(str+13);
    tmp->bidQuote = (uint32_t) (atof(str + 19) * 1000000);
    tmp->askQuote = (uint32_t) (atof(str + 28) * 1000000);
}
```
```
struct forexTimestamp {
    uint8_t sec;
    uint8_t min;
    uint8_t hour;
    uint8_t day;
    uint8_t mon;
    uint16_t year;
};

struct forexData {
    struct forexTimestamp timestamp;
    uint32_t bidQuote;
    uint32_t askQuote;
};
```
```
// Assumes float is in the format X.XXXXXX and returns the float
// multiplied by 1000000. str[1] is the decimal point, so the six
// fractional digits are str[2] through str[7].
static inline uint32_t getFloat_1_6(const char *str)
{
    return (str[0] - '0') * 1000000 + (str[2] - '0') * 100000 +
           (str[3] - '0') * 10000 + (str[4] - '0') * 1000 +
           (str[5] - '0') * 100 + (str[6] - '0') * 10 +
           (str[7] - '0');
}

void createForexDataFromString(const char *str, struct forexData *tmp) {
    tmp->timestamp.year = getInt4(str);
    tmp->timestamp.mon = getInt2(str+4);
    tmp->timestamp.day = getInt2(str+6);
    tmp->timestamp.hour = getInt2(str+9);
    tmp->timestamp.min = getInt2(str+11);
    tmp->timestamp.sec = getInt2(str+13);
    tmp->bidQuote = getFloat_1_6(str + 19);
    tmp->askQuote = getFloat_1_6(str + 28);
}
```
Context
StackExchange Code Review Q#122757, answer score: 6