Reading Large Files in Java getting really slow
Problem
I am trying to read a large file ~200 MB (~300 million lines of text). I am using a relatively standard way of reading like:
for (int i = 1; i <= numProcs; i++) { // loop bound assumed: one trace file per processor
    try {
        ArrayList<Trace> traces = new ArrayList<Trace>();
        int lineNum = 1;
        BufferedReader br = new BufferedReader(new FileReader(traceFilePrefix + numProcs + "/" + traceFilePrefix + i + ".PRG"));
        while ((line = br.readLine()) != null) {
            line = line.toUpperCase();
            // validate line format
            if (line.matches(regexLineFormat)) {
                StringTokenizer tokenizer = new StringTokenizer(line);
                instrType = Integer.parseInt(tokenizer.nextToken());
                hexAddr = tokenizer.nextToken();
                traces.add(new Trace(instrType, hexAddr));
            } else {
                // line is in invalid format
                System.err.println("ERR: line " + lineNum + " is invalid \"" + line + "\"");
            }
            lineNum++;
            if (lineNum % 1000 == 0) System.out.println("line " + lineNum);
        }
        br.close();
        processorsTraces.add(i-1, traces);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I find that reading slows down at about line 20 million... in fact, it doesn't seem to progress at all. How can I improve this? What's the problem in this simple piece of code?
The latest run seems to have run out of heap space... I guess I can't store data in an ArrayList like that?
This is supposed to be a simulator. It first parses trace files containing hexadecimal addresses of memory accesses. It is then supposed to run a simulation on this data. This code snippet only parses the traces and stores them in an ArrayList.
Solution
The issue is almost certainly in the structure of your
Trace class and its memory efficiency. You should ensure that instrType and hexAddress are stored as memory-efficient structures. The instrType appears to be an int, which is good, but make sure that it is declared as an int in the Trace class.
The more likely problem is the size of the hexAddress String. You may not realise it, but Strings are notorious for 'leaking' memory. In this case, you have a line and you think you are just getting the hexString from it... but in reality, the hexString holds on to the entire line. Yes, really. For example, look at the following code:
import java.util.StringTokenizer;

public class SToken {
    public static void main(String[] args) {
        StringTokenizer tokenizer = new StringTokenizer("99 bottles of beer");
        int instrType = Integer.parseInt(tokenizer.nextToken());
        String hexAddr = tokenizer.nextToken();
        System.out.println(instrType + hexAddr);
    }
}
Now, set a break-point in your IDE (I use Eclipse) and run it, and you will see that hexAddr contains a char[] array for the entire line, with an offset of 3 and a count of 7.
Because of the way String.substring and similar constructs work (on JVMs before Java 7 update 6, a substring shares its parent String's char[] rather than copying it), short strings can keep huge amounts of memory alive (in theory that memory is shared with other strings, though). As a consequence, you are essentially storing the entire file in memory!
At a minimum, you should change your code to:
hexAddr = new String(tokenizer.nextToken().toCharArray());
But even better would be:
long hexAddr = parseHexAddress(tokenizer.nextToken());
EDIT: Request to get help on how to debug the code:
To debug the code, you need to set a break-point in the method. Select the line
System.out.println(...);
and use the keyboard shortcut Ctrl-Shift-B, which toggles a break-point; you will get the 'dot' marker in the left margin (the original answer shows a picture of it here).
Then, with that break-point enabled, you need to debug the program:
Right-click on the 'main' word in
public static void main (String[] ...)
and select 'Debug As -> Java Application'.
This should prompt you to switch to the 'Debug' perspective, and you will have a window with the 'Variables' displayed. One of the variables will be hexAddr, and you can expand it to show its contents (the original answer shows a screenshot here).
From the variables section there are a few things of note. At the bottom, the 'value' variable is highlighted, and its internal value is shown in the area below ([9, 9, , b, o, ......, r]). You can see the count and offset variables for hexAddr too. This means that the hexAddr variable stores the char[] array for the entire line, but also stores an offset and count into that array, which it uses to manage the subset of the chars that this String instance is interested in.
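Returning to the parseHexAddress suggestion: the answer does not show that helper, so the following is only a minimal sketch of what it might look like, paired with a Trace-like class that keeps both fields as primitives so no String (and no shared char[]) is retained per line. It assumes Java 8 or later (for Long.parseUnsignedLong), that each address token is plain hexadecimal without a 0x prefix, and that addresses fit in 64 bits. The class name CompactTrace and the sample line "2 7FFF0A10" are hypothetical and not taken from the question.

```java
import java.util.StringTokenizer;

// Hypothetical stand-in for the asker's Trace class: two primitives, no String.
final class CompactTrace {
    private final int instrType;
    private final long address;

    CompactTrace(int instrType, long address) {
        this.instrType = instrType;
        this.address = address;
    }

    int getInstrType() { return instrType; }

    long getAddress() { return address; }

    // Assumed shape of the answer's parseHexAddress helper: parse a bare hex
    // token (no "0x" prefix) as an unsigned 64-bit value.
    // Throws NumberFormatException if the token is not valid hexadecimal.
    static long parseHexAddress(String token) {
        return Long.parseUnsignedLong(token, 16);
    }

    public static void main(String[] args) {
        // Sample line in an assumed "<instrType> <hexAddr>" format.
        StringTokenizer tokenizer = new StringTokenizer("2 7FFF0A10");
        int instrType = Integer.parseInt(tokenizer.nextToken());
        long address = parseHexAddress(tokenizer.nextToken());

        CompactTrace trace = new CompactTrace(instrType, address);
        // prints "2 7fff0a10"
        System.out.println(trace.getInstrType() + " " + Long.toHexString(trace.getAddress()));
    }
}
```

Once the token has been parsed into a long, the token String and the line it came from become garbage as soon as the loop moves on, regardless of how the JVM implements substring. With hundreds of millions of lines, even one small object per trace may still be too much for the heap, so the remaining per-entry overhead is worth measuring before settling on this layout.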
Context
StackExchange Code Review Q#33976, answer score: 18