Speed up CSV reading code (vector of doubles)
Problem
I am trying to read a single-columned CSV of doubles into Java with a string header. It is 11 megabytes and takes over 15 minutes to read, which is clearly unacceptable. In R this CSV would take about 3 seconds to load.
This CSV file may contain strings so I am parsing it with this in mind.
The CSV reading method needs to return Vector due to reliance on this output by other parts of the application.
The issue is not due to the isNumber static method, since each call to that is taking 200 nanoseconds, thus contributing approximately 0.2 seconds to the 15 minutes of parsing time.
Double.valueOf() only takes about 500 nanoseconds, so it is not that either. csvData.add() is only taking 80 nanoseconds so it is not that.

private static Vector readTXTFileSingle(String csvFileName) throws IOException {
    String line = null;
    BufferedReader stream = null;
    Vector csvData = new Vector();
    try {
        stream = new BufferedReader(new FileReader(csvFileName));
        while ((line = stream.readLine()) != null) {
            String[] splitted = line.split(",");
            if (!NumberUtils.isNumber(splitted[0])) {
                continue;
            }
            Double dataLine = Double.valueOf(splitted[0]);
            csvData.add(dataLine);
        }
    } finally {
        if (stream != null)
            stream.close();
    }
    return csvData;
}

Solution
As per Bobby's comment, Vector is your problem, but not for the reason he says...
Vector is a synchronized class. Each call to any method on Vector has to acquire the object's lock, with the memory-synchronization cost that implies, and that is wasted time in a situation where its usage is in a single thread only.
The fact that you use Vector indicates that you are running some really old code, or you have not properly read the JavaDoc for it.
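If other parts of the application really do insist on receiving a Vector, one possible approach is to accumulate values in an unsynchronized ArrayList and copy them into a Vector once, at the boundary. The following is only a sketch of that idea; the class and method names (`LegacyVectorBridge`, `collect`, `toLegacyVector`) are hypothetical and do not come from the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

// Hypothetical helper, for illustration only.
final class LegacyVectorBridge {

    // Accumulate values in an ArrayList, whose methods are not synchronized.
    static List<Double> collect(double[] values) {
        List<Double> out = new ArrayList<>(values.length);
        for (double v : values) {
            out.add(v); // autoboxes each double to a Double
        }
        return out;
    }

    // Copy once into a Vector for callers that still require that type.
    static Vector<Double> toLegacyVector(List<Double> values) {
        return new Vector<>(values);
    }

    public static void main(String[] args) {
        Vector<Double> legacy = toLegacyVector(collect(new double[] {1.5, 2.0, -3.25}));
        System.out.println(legacy); // prints [1.5, 2.0, -3.25]
    }
}
```

With this arrangement the synchronized class is touched once per file instead of once per row.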
A secondary performance problem is that each value is being converted to a Double object. In cases where you have large amounts of data, and where there is a primitive available for you to use, it is generally faster to use the primitive (in this case, double instead of Double).
You should also be using the Java 7 try-with-resources mechanism for your stream.
My recommendation is to change the signature of your method to return a List... actually, no, my recommendation is to return a primitive double[] array. If you are interested in speed, this will be a significant improvement:

private static double[] readTXTFileSingle(String csvFileName) throws IOException {
    double[] csvData = new double[4096]; // arbitrary starting size.
    int dcnt = 0;
    try (BufferedReader stream = new BufferedReader(new FileReader(csvFileName))) {
        String line = null;
        while ((line = stream.readLine()) != null) {
            String[] splitted = line.split(",");
            if (!NumberUtils.isNumber(splitted[0])) {
                continue;
            }
            double dataLine = Double.parseDouble(splitted[0]);
            if (dcnt >= csvData.length) {
                // add 50% to array size.
                csvData = Arrays.copyOf(csvData, dcnt + (dcnt / 2));
            }
            csvData[dcnt++] = dataLine;
        }
    }
    // trim the array to the number of values actually read.
    return Arrays.copyOf(csvData, dcnt);
}

Edit:
One other thing, if you want another tweak in performance, use:

String[] splitted = line.split(",", 2);

Since you never access more than the first field in the record, you do not need to look for commas beyond the first comma.
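To make the effect of the limit argument concrete, here is a small sketch that compares `split(",", 2)` with a plain `indexOf` cut at the first comma. The `indexOf` variant is an alternative added here for illustration, not something the answer proposes, and the class and method names (`FirstFieldDemo`, `firstFieldBySplit`, `firstFieldByIndexOf`) are made up.

```java
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
final class FirstFieldDemo {

    // With a limit of 2, split stops after the first comma;
    // everything after it stays unsplit in element [1].
    static String firstFieldBySplit(String line) {
        return line.split(",", 2)[0];
    }

    // Alternative that allocates no array: cut at the first comma, if any.
    static String firstFieldByIndexOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }

    public static void main(String[] args) {
        String line = "3.14,foo,bar,baz";
        System.out.println(Arrays.toString(line.split(",", 2))); // prints [3.14, foo,bar,baz]
        System.out.println(firstFieldBySplit(line));             // prints 3.14
        System.out.println(firstFieldByIndexOf("42.0"));         // prints 42.0
    }
}
```

Whether either variant makes a measurable difference for an 11 megabyte file would need to be timed on the real data.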
Code Snippets
private static double[] readTXTFileSingle(String csvFileName) throws IOException {
    double[] csvData = new double[4096]; // arbitrary starting size.
    int dcnt = 0;
    try (BufferedReader stream = new BufferedReader(new FileReader(csvFileName))) {
        String line = null;
        while ((line = stream.readLine()) != null) {
            String[] splitted = line.split(",");
            if (!NumberUtils.isNumber(splitted[0])) {
                continue;
            }
            double dataLine = Double.parseDouble(splitted[0]);
            if (dcnt >= csvData.length) {
                // add 50% to array size.
                csvData = Arrays.copyOf(csvData, dcnt + (dcnt / 2));
            }
            csvData[dcnt++] = dataLine;
        }
    }
    return Arrays.copyOf(csvData, dcnt);
}

String[] splitted = line.split(",", 2);

Context
StackExchange Code Review Q#40842, answer score: 6