HashMap with MySQL
Problem
I faced a scenario where the task was to read a file containing 3 million IP addresses.
There is a MySQL table with the columns Id and PrimaryIP. PrimaryIP can hold multiple IPs separated by #, and it can also contain CIDR blocks. In total there are 8000 records, each with multiple IPs and CIDR blocks.
My task was to read that file, check it against the database, and write the matching IP,ID pairs to a file. Initially, when I ran my program, it failed with:
java.lang.OutOfMemoryError: Java heap space
I increased the heap by 3 GB. It was still failing, so I later split the file into 6 subfiles of 0.5 million lines each.
To expand the CIDR blocks into IP lists, I used Apache SubnetUtils.
```
public static void main(String[] args) {
    String sqlQuery = "SELECT id,PrimaryIP from IPTable where PrimaryIP != '' limit 100000;";
    Connection connection = null;
    Statement statement = null;
    File oFile = new File("output.txt");
    System.out.println(new Date());
    try{
        List fileData = FileUtils.readLines(new File("input.txt"));
        System.out.println("File Data Size : "+fileData.size());
        Class.forName("com.mysql.jdbc.Driver");
        connection = DriverManager.getConnection("jdbc:mysql://localhost/db?user=root&password=pwd");
        statement = connection.createStatement();
        ResultSet resultSet = statement.executeQuery(sqlQuery);
        System.out.println("Started with MySQL Querying");
        Map primaryIPIDMap = new HashMap();
        while (resultSet.next()) {
            primaryIPIDMap.clear();
            int recordID = resultSet.getInt(1);
            if (resultSet.getString(2).contains("#")) {
                String primaryIP[] = resultSet.getString(2).split("#");
                for (int i = 0; i < primaryIP.length; i++) {
                    if (primaryIP[i].contains("/")) {
```
Solution
You have 'too much' data to keep in resident memory. There are a number of things you can do to smooth this out, but they will take a bit more effort than rearranging code.
Read input line per line
Right now, the algorithm looks a little like this:
For each record in the database, search the input file for a match and write that match to output.
Since the database is the indexed one, and the input file (probably) isn't, we'll save time going at it the other way:
For each line in the input file, find the matching record(s) and write them to output.
This will let us read the input file line by line, just once, using java.io.BufferedReader, which will save you from having to slice your files into bits. We will have to hit the database (or an in-memory version that you keep) more often, but databases are built for this, and may cache some things:
```
try ( Connection connection = /*...*/ ;
      Statement statement = /*...*/ ) {
    try ( BufferedReader in = Files.newBufferedReader(input, charset);
          BufferedWriter out = Files.newBufferedWriter(output, charset) ) {
        for ( String line = in.readLine(); line != null; line = in.readLine() ) {
            // 1. parse line
            // 2. query database
            // 3. write out result
        }
    }
}
```
As a bonus, this means you can use your program as a 'text utility' that reads from standard input and writes to standard output, where compression could drastically cut down on disk churning:
```
zcat input.gz | java pkg.MyAwesomeFilter | gzip > output.gz
```
Database Schema
"There is a MySQL table which contains Id,PrimaryIP, PrimaryIP can be multiple IP separated by #, moreover that PrimaryIP can also contain CIDR IP."
Remove the packing by '#'; let the database worry about compressing results if it really has to. Split out the IP addresses so that you have a clean Id x PrimaryIP M:N relation. This will make querying on PrimaryIP easier.
The CIDR blocks will be a bit harder to fit into this. See if you can make an extra table, IPRange, that stores each range by its minimum and maximum possible IP address. For instance, 10.0.0.0/8 would be 10.0.0.0 -> 10.255.255.255. This way, you won't have to store every possible address.
Now that you have a more direct schema, you can make the database do some of the heavy lifting:
```
(select id from iptable where primaryIP = ?)
union
(select id from iprange where ? between minIP and maxIP)
```
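If you'd rather keep the 8000 records in memory than hit the database for every input line, the same "exact match plus range match" lookup could be sketched roughly as follows. This is only an illustration under assumptions: `InMemoryIpIndex` and its method names are made up, addresses are assumed to be held as numbers (`long`), and the ranges are assumed not to overlap (overlapping CIDR blocks would need an interval tree or a plain scan instead).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the two tables queried above.
final class InMemoryIpIndex {
    // Single addresses: numeric IP -> ids of the records that list it.
    private final Map<Long, Set<Integer>> exact = new HashMap<>();
    // Ranges keyed by their minimum address; the value holds {max, id}.
    // Assumption: ranges do not overlap, so one floorEntry() probe is enough.
    private final TreeMap<Long, long[]> ranges = new TreeMap<>();

    public void addAddress(long ip, int id) {
        exact.computeIfAbsent(ip, k -> new TreeSet<>()).add(id);
    }

    public void addRange(long min, long max, int id) {
        ranges.put(min, new long[] { max, id });
    }

    // Returns the ids matching the address, by exact match or by range.
    public Set<Integer> lookup(long ip) {
        Set<Integer> ids = new TreeSet<>(exact.getOrDefault(ip, Collections.emptySet()));
        // The only range that can contain ip is the one with the largest min <= ip.
        Map.Entry<Long, long[]> candidate = ranges.floorEntry(ip);
        if (candidate != null && ip <= candidate.getValue()[0]) {
            ids.add((int) candidate.getValue()[1]);
        }
        return ids;
    }
}
```

With only a few thousand records, a structure like this should fit comfortably in the heap, while the 3 million input lines are still streamed one at a time.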
For range queries to work, however, you will need to change the way you store the IP addresses so that they can be compared quickly. You could either make them string-comparable by padding each segment to always be 3 characters long, or you could parse the address as a number and use that:
```
10.0.0.0 -> 010 000 000 000
192.168.1.11 -> 192 168 001 011
```
(I think this should also work for IPv6.)
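Both storage options, plus the minimum and maximum address of a CIDR block for the IPRange table, can be computed with a little bit arithmetic. The following is a minimal sketch for IPv4 only; the class `IpKeys` and its method names are made up for illustration and are not part of any library.

```java
// Hypothetical helper; handles dotted IPv4 addresses only.
final class IpKeys {
    private IpKeys() {}

    // Parses "a.b.c.d" into an unsigned 32-bit value stored in a long.
    public static long toLong(String ip) {
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // Zero-pads each octet to three digits so string order equals numeric order.
    public static String toPadded(String ip) {
        long v = toLong(ip);
        return String.format("%03d %03d %03d %03d",
                (v >> 24) & 0xFF, (v >> 16) & 0xFF, (v >> 8) & 0xFF, v & 0xFF);
    }

    // Returns {min, max} for a CIDR block such as "10.0.0.0/8".
    public static long[] cidrRange(String cidr) {
        int slash = cidr.indexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("Not a CIDR block: " + cidr);
        }
        long base = toLong(cidr.substring(0, slash));
        int prefix = Integer.parseInt(cidr.substring(slash + 1));
        if (prefix < 0 || prefix > 32) {
            throw new IllegalArgumentException("Bad prefix length in: " + cidr);
        }
        long hostBits = (1L << (32 - prefix)) - 1; // low bits that vary inside the block
        long min = base & ~hostBits & 0xFFFFFFFFL; // clear the host bits
        long max = min | hostBits;                 // set the host bits
        return new long[] { min, max };
    }
}
```

Here `IpKeys.toPadded("192.168.1.11")` yields `192 168 001 011`, matching the example above, and `IpKeys.cidrRange("10.0.0.0/8")` gives the numeric forms of 10.0.0.0 and 10.255.255.255. The numeric form would typically go into an unsigned integer or BIGINT column, which keeps the `between` comparison cheap.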
Other remarks
Consider using try-with-resources (see example above) for your database connections and file streams; it really cleans up I/O code.
You don't seem to specify the character encodings of your input and output. Considering the domain, I doubt you'll run into serious trouble, but it may be useful to decide on an encoding like, say, UTF-8.
I doubt String.intern() will help you much. Interning helps to reduce the number of copies in memory, true, but since you use them as hash keys, you'll discard duplicates anyway.
Context
StackExchange Code Review Q#73618, answer score: 3