Load fifty million integers as quickly as possible in Java
Problem
I am working my way through Project Euler. Many of the problems deal with prime number calculations: this is the type of code that can be extracted into a separate class and reused.
While calculating primes can be done quickly using certain methods, I was able to speed up the process of "getting prime numbers" by using a list of primes on disk and loading them. It is an order of magnitude faster than any calculation, but still takes roughly three minutes loading from an SSD.
What I did:
- I took a publicly-available list of the first fifty million primes (some problems do use a significant portion of those numbers).
- I wrote a program to convert the numbers from their textual representation to binary using Java's DataOutputStream. The file is simply fifty million four-byte integers in a row.
- The utility method below loads each integer into an array using DataInputStream. It produces correct results, but it is slow.
Is there any way to speed up the process of loading 50,000,000 integers? If I need to preprocess the data differently I am willing to do that.
```
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class Test {
    public static int[] loadByQuantity(int argNumPrimes) {
        int numPrimes = Math.min(argNumPrimes, 50_000_000);
        int[] primes = new int[numPrimes];
        try (DataInputStream in = new DataInputStream(new FileInputStream("primes.bin"))) {
            for (int i = 0; i < primes.length; ++i) {
                primes[i] = in.readInt();
            }
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        }
        return primes;
    }
}
```
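The text-to-binary conversion step described above can be sketched as follows (a hedged sketch: the file names and the one-prime-per-line input format are assumptions, not taken from the original post):

```java
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;

public class ConvertPrimes {
    public static void main(String[] args) throws IOException {
        // Read one prime per line from the text list and write each one as a
        // 4-byte big-endian int, producing a file of back-to-back integers.
        try (BufferedReader in = new BufferedReader(new FileReader("primes.txt"));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream("primes.bin")))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.writeInt(Integer.parseInt(line.trim()));
            }
        }
    }
}
```

DataOutputStream.writeInt always writes big-endian, which is the layout the loading code below assumes.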
Solution
Your code is a bit messy, and having the file name hard-coded into your function is not great. Also, the "Hungarian Notation" (using things like `arg` to prefix your function parameters - note, it's a parameter, not an argument, by the way) is not conventional.
On the other hand, I understand this is an exercise to test performance, and it's not about the reusability (yet) of the code.
Still, I know from experience that reusability will happen, and your code will need some changes.
I also know, from experience, that memory-mapped IO is much faster in Java than other IO forms, so I figured I would have a kick at this problem.
I took my prime generator I had reviewed here on Code Review previously (Thread Safe Prime Generator) and I generated the first 50,000,000 primes into a file on my own system, then I tested your code against it (and after changing the file name to a parameter, it worked).
Out of interest, generating the primes and writing them to file took about 5 minutes on my computer - 311 seconds - much of which was probably IO time too.
Then, I converted your function to be:
```
public static int[] loadByQuantity(int argNumPrimes, String fname) {
    int numPrimes = Math.min(argNumPrimes, 50_000_000);
    int[] primes = new int[numPrimes];
    try (DataInputStream in = new DataInputStream(new FileInputStream(fname))) {
        for (int i = 0; i < primes.length; ++i) {
            primes[i] = in.readInt();
        }
    } catch (IOException ex) {
        throw new RuntimeException(ex);
    }
    return primes;
}
```
The only difference is the file name is now a parameter.
Then I wrote a little test system to time how long your code took for me:
```
public static void time(Supplier<int[]> task, String name) {
    long nanos = System.nanoTime();
    int[] primes = task.get();
    System.out.printf("Task %s took %.3fms\n", name, (System.nanoTime() - nanos) / 1000000.0);
    System.out.printf("Count %d\nFirst %d\nLast %d\n", primes.length, primes[0], primes[primes.length - 1]);
    System.out.println("----");
}

public static void main(String[] args) {
    int count = 50_000_000;
    time(() -> loadByQuantity(count, "primes.dat"), "OP");
}
```
I then implemented the same functionality using an NIO mechanism, specifically memory-mapped IO: http://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html
This type of IO is designed to significantly reduce the number of memory copies made of the input file. It also means that Java reads the content directly from the OS, without first copying it into Java's memory space.
For the types of sequential IO this problem has, I expected the performance improvements to be significant.
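One detail worth making explicit (my note, not part of the original answer): DataOutputStream.writeInt writes big-endian bytes, and ByteBuffer also defaults to big-endian, so the mapped bytes decode to the correct values without any ByteOrder configuration. A minimal round-trip check:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class ByteOrderCheck {
    public static void main(String[] args) throws IOException {
        // Write two ints the same way the primes file was produced...
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeInt(2);
            out.writeInt(982451653);
        }
        // ...then decode the raw bytes with ByteBuffer.getInt().
        // ByteBuffer's default order is BIG_ENDIAN, matching DataOutputStream.
        ByteBuffer bb = ByteBuffer.wrap(bos.toByteArray());
        System.out.println(bb.getInt()); // 2
        System.out.println(bb.getInt()); // 982451653
    }
}
```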
Further, the MappedByteBuffer has methods like getInt() which decode 4 bytes into an int for you: http://docs.oracle.com/javase/8/docs/api/java/nio/ByteBuffer.html#getInt--
Here's the code I came up with. Note, I have used the same exception handling that you use, and also the same array initialization. I believe you should be throwing exceptions from these methods, and not just wrapping them in runtime exceptions:
```
public static int[] loadByNIO(int argNumPrimes, String fname) {
    int numPrimes = Math.min(argNumPrimes, 50_000_000);
    int[] primes = new int[numPrimes];
    try (FileChannel fc = FileChannel.open(Paths.get(fname))) {
        MappedByteBuffer mbb = fc.map(MapMode.READ_ONLY, 0, numPrimes * 4L);
        for (int i = 0; i < numPrimes; i++) {
            primes[i] = mbb.getInt();
        }
    } catch (IOException ex) {
        throw new RuntimeException(ex);
    }
    return primes;
}
```
I then ran the process a couple of times in the main method, and compared the results:
```
public static void main(String[] args) {
    int count = 50_000_000;
    time(() -> loadByQuantity(count, "primes.dat"), "OP");
    time(() -> loadByNIO(count, "primes.dat"), "NIO");
    time(() -> loadByQuantity(count, "primes.dat"), "OP");
    time(() -> loadByNIO(count, "primes.dat"), "NIO");
}
```
The NIO operation is, as expected, significantly faster... about 1500 times faster:
```
Task OP took 214163.250ms
Count 50000000
First 2
Last 982451653
----
Task NIO took 141.511ms
Count 50000000
First 2
Last 982451653
----
Task OP took 214633.128ms
Count 50000000
First 2
Last 982451653
----
Task NIO took 159.571ms
Count 50000000
First 2
Last 982451653
----
```
I admit, it is far faster than I was expecting too... but the results are correct.
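As an aside (my addition, not something benchmarked in the answer): the per-int getInt() loop can likely be folded into a single bulk transfer by viewing the mapped bytes as an IntBuffer. A sketch under the same assumptions as loadByNIO:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Paths;

public class BulkLoad {
    public static int[] loadByBulkNIO(int argNumPrimes, String fname) {
        int numPrimes = Math.min(argNumPrimes, 50_000_000);
        int[] primes = new int[numPrimes];
        try (FileChannel fc = FileChannel.open(Paths.get(fname))) {
            // asIntBuffer() presents the mapped bytes as ints; a single bulk
            // get(int[]) then copies all of them without a Java-level loop.
            fc.map(MapMode.READ_ONLY, 0, numPrimes * 4L).asIntBuffer().get(primes);
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        }
        return primes;
    }
}
```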
Now, I encourage you to experiment with how to put this concept behind a Java stream, instead of populating the full 50,000,000 values into an array. By streaming the results you can start your "real" computation sooner, without the latency of having to read all the primes from the file. For example, consider what you could do with logic like `primes.stream()....`, where primes was reading the values on-demand from the file.
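A hedged sketch of that streaming idea (the class and method names here are illustrative, since the answer leaves the design as an exercise): map the file once, then let an IntStream pull each value on demand.

```java
import java.io.IOException;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Paths;
import java.util.stream.IntStream;

public class PrimeStream {
    // Expose the memory-mapped file as an IntStream that reads each value
    // on demand, so computation can begin before the whole file is touched.
    // (The mapping stays valid after the channel is closed.)
    public static IntStream primes(String fname, int count) throws IOException {
        try (FileChannel fc = FileChannel.open(Paths.get(fname))) {
            IntBuffer ints = fc.map(MapMode.READ_ONLY, 0, count * 4L).asIntBuffer();
            return IntStream.range(0, count).map(ints::get);
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. sum the first 1000 primes without loading all 50M into an array
        long sum = primes("primes.dat", 1000).asLongStream().sum();
        System.out.println(sum);
    }
}
```

Because the stream is lazy, short-circuiting operations like findFirst or limit never read the rest of the file at all.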
Here's the full code I have been running:
```
package prperf;

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Paths;
import java.util.function.Supplier;

// Reassembled from the snippets above; the original listing was truncated
// here and the class name was not preserved, so "Test" is a stand-in.
public class Test {

    public static int[] loadByQuantity(int argNumPrimes, String fname) {
        int numPrimes = Math.min(argNumPrimes, 50_000_000);
        int[] primes = new int[numPrimes];
        try (DataInputStream in = new DataInputStream(new FileInputStream(fname))) {
            for (int i = 0; i < primes.length; ++i) {
                primes[i] = in.readInt();
            }
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        }
        return primes;
    }

    public static int[] loadByNIO(int argNumPrimes, String fname) {
        int numPrimes = Math.min(argNumPrimes, 50_000_000);
        int[] primes = new int[numPrimes];
        try (FileChannel fc = FileChannel.open(Paths.get(fname))) {
            MappedByteBuffer mbb = fc.map(MapMode.READ_ONLY, 0, numPrimes * 4L);
            for (int i = 0; i < numPrimes; i++) {
                primes[i] = mbb.getInt();
            }
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        }
        return primes;
    }

    public static void time(Supplier<int[]> task, String name) {
        long nanos = System.nanoTime();
        int[] primes = task.get();
        System.out.printf("Task %s took %.3fms\n", name, (System.nanoTime() - nanos) / 1000000.0);
        System.out.printf("Count %d\nFirst %d\nLast %d\n", primes.length, primes[0], primes[primes.length - 1]);
        System.out.println("----");
    }

    public static void main(String[] args) {
        int count = 50_000_000;
        time(() -> loadByQuantity(count, "primes.dat"), "OP");
        time(() -> loadByNIO(count, "primes.dat"), "NIO");
        time(() -> loadByQuantity(count, "primes.dat"), "OP");
        time(() -> loadByNIO(count, "primes.dat"), "NIO");
    }
}
```
Context
StackExchange Code Review Q#153857, answer score: 106