Frequency count of single words and word pairs over 20 million tweets
Problem
My application needs to read text files of 20 million lines and count the frequencies of single words and of two-word pairs (adjacent words).
For example:
A B A B SSS G D A
One word Frequency
- A: 3
- B: 2
- SSS: 1
- G: 1
Two Word Frequency
- A B: 2
- B A: 1
- B SSS: 1
- SSS G: 1
After the reading method, it puts all HashMap values in the TreeMap for sorting and gets an OutOfMemoryError.
Note: it took 40 minutes to read a file.
Please also help me find a way to decrease its processing time, such as by speeding up the file reading.
```
package tweetfile20million;

import java.util.Comparator;
import java.util.Map;

enum FileType {
    OneWord, TwoWord
}

/**
 *
 * @author KayD
 */
public class ValueComparator implements Comparator<String> {
    Map<String, Integer> map;

    public ValueComparator(Map<String, Integer> base) {
        this.map = base;
    }

    // Orders keys by descending count looked up in the backing map.
    public int compare(String a, String b) {
        if (map.get(a) >= map.get(b)) {
            return -1;
        } else {
            return 1;
        } // returning 0 would merge keys
    }
}
```
Second File
```
package tweetfile20million;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/**
 *
 * @author KayD
 */
public class FileOperation {
    String fileDir = "B:\\20milyontweet\\";
    String dataFile = fileDir + "all_tweets.txt";
    HashMap<String, Integer> hash = new HashMap<>();
    long startTime;

    public void readText(FileType fileType) {
        try {
            // Reading Starting
            startTime = (System.currentTimeMillis());
            this.readFile(fileType);
            // Writer Starting
            long middleTime = (System.currentTimeMillis());
            this.writeFile(
```
Solution
Your code looks fine, but you are running into trouble because you load everything into memory and process it sequentially on a single thread.
There are three aspects of your code you can optimize, each of which should give you pretty large gains:
- Optimize Your Memory Utilization
Consider storing your data in a DB or datastore. For 20 million words, you can have a maximum of 20 million records in your hash, which is doable, but by the time you get to two-word combinations you are talking about a ton of records. Let us conservatively say there are only 2000 unique words and we are picking 2 words each: we get nearly 2,000,000 records (C(2000,2) = 1,999,000, see http://stattrek.com/online-calculator/combinations-permutations.aspx). Since "A B" and "B A" are counted as different pairs, the real upper bound is about twice that. At roughly 32 bytes per record plus 4 bytes of table capacity each, and before counting the String keys themselves, that is on the order of 100-200MB just for the hash (http://java-performance.info/memory-consumption-of-java-data-types-2/). Chances are that there are many more than 2000 unique words, and even if you don't have every single combination, it is still a lot.
The first thing you can do is size your HashMap for the large number of entries you should be expecting. One of the reasons your code is slow is that the HashMap has to keep growing, rehashing its entries each time, as it approaches its limit. If you create it with `new HashMap<>(2100000)`, you can avoid some of this slowdown.
Better yet though, I would recommend pushing this off to a dedicated system like Redis. This would save you from running out of memory on the Java VM. It would also allow you to run multiple instances of the program, which segues into my next point of optimization.
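As a rough illustration of the presizing advice, here is a minimal sketch that counts single words and adjacent pairs in one pass. The class `PresizedCounter` and its methods are hypothetical names, not part of the original `FileOperation` class, and the expected entry counts (2,000 words, 2,000,000 pairs) are simply the estimates from the paragraph above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the asker's FileOperation class.
class PresizedCounter {

    // Picks an initial capacity large enough that `expected` entries fit
    // without a resize at HashMap's default load factor of 0.75.
    static int capacityFor(int expected) {
        return (int) (expected / 0.75f) + 1;
    }

    // Counts every word and every adjacent word pair of one line.
    static void countLine(String line, Map<String, Integer> oneWord, Map<String, Integer> twoWord) {
        String[] words = line.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (words[i].isEmpty()) {
                continue; // a blank line splits into one empty token
            }
            oneWord.merge(words[i], 1, Integer::sum);
            if (i + 1 < words.length) {
                twoWord.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        // Assumed sizes: the 2,000-word / 2,000,000-pair estimate from the answer.
        Map<String, Integer> oneWord = new HashMap<>(capacityFor(2_000));
        Map<String, Integer> twoWord = new HashMap<>(capacityFor(2_000_000));
        countLine("A B A B SSS G D A", oneWord, twoWord);
        System.out.println(oneWord.get("A"));   // 3
        System.out.println(twoWord.get("A B")); // 2
    }
}
```

Presizing only removes the cost of repeated resizing; it does not reduce the memory the entries themselves need, so it will not by itself prevent the OutOfMemoryError.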
- Optimize your CPU Utilization
Right now, you're only using 1 CPU core. A modern computer has 2, if not 4 or 8, CPU cores, and the rest are sitting idle. So theoretically, you are only using 1/2, 1/4, or even 1/8 of your computing power. Once you have switched over to a separate datastore like Redis, you can fix this by spawning multiple processes doing the same task. The easiest way to do that is to split the file into 4 or 8 smaller files and run the same program on each of them. You'll notice that your system utilization will go from around 25% to a lot higher.
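If splitting the file and running several processes is more than you want to set up, a similar effect can be approximated inside a single JVM. The sketch below is a swapped-in variant, not the multi-process Redis setup described above: it uses a parallel stream with a `ConcurrentHashMap` of `LongAdder` counters. The class name `ParallelCounter` is hypothetical, and only single words are counted to keep it short.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-JVM alternative to running several processes.
class ParallelCounter {

    // Counts words across all lines, spreading the lines over the available cores.
    static Map<String, LongAdder> countWords(List<String> lines) {
        Map<String, LongAdder> counts = new ConcurrentHashMap<>();
        lines.parallelStream().forEach(line -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // computeIfAbsent is atomic on ConcurrentHashMap;
                    // LongAdder tolerates concurrent increments.
                    counts.computeIfAbsent(word, k -> new LongAdder()).increment();
                }
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("A B A", "B SSS G", "D A");
        Map<String, LongAdder> counts = countWords(lines);
        System.out.println(counts.get("A").sum()); // 3
    }
}
```

For a real file, `java.nio.file.Files.lines(path)` returns a stream that can be switched to parallel mode with `.parallel()`, so the lines never need to be held in a list. Whether this beats the multi-process approach depends on how much time goes to parsing versus disk reads, so it is worth measuring.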
- Optimize your IO
Lastly, you can optimize your IO. Disks generally deliver large chunks of data much faster than the same data requested one line at a time, because every small read request carries its own overhead. If you rewrite your code to read in 10, 20, or 50MB at a time, you should see a performance increase in IO. You should be able to do this once you have moved the hash out of memory and into Redis.
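One low-effort way to read in larger chunks is to give `BufferedReader` a much bigger buffer than its default of 8192 characters. The following is a minimal sketch under that assumption; the class `LargeBufferReader`, the 16M-character buffer size, and the temporary file used in `main` are illustrative choices, not values taken from the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical reader for illustration; the buffer size is an assumption to tune.
class LargeBufferReader {

    // 16M chars of buffer instead of BufferedReader's 8192-char default.
    static final int BUFFER_CHARS = 16 * 1024 * 1024;

    // Reads the file line by line through the large buffer and counts the lines.
    static long countLines(Path file, Charset charset) throws IOException {
        long lines = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset), BUFFER_CHARS)) {
            while (reader.readLine() != null) {
                lines++; // a real program would tokenize and count words here
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("tweets", ".txt");
        Files.write(tmp, Arrays.asList("A B A B", "SSS G D A"), StandardCharsets.UTF_8);
        System.out.println(countLines(tmp, StandardCharsets.UTF_8)); // 2
        Files.delete(tmp);
    }
}
```

Because `BufferedReader` already batches its reads, the gain from a larger buffer may be modest. If the 40 minutes are dominated by map growth or garbage collection rather than disk access, this change alone will not move the needle much.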
Other Methodologies
Hope these suggestions help! Finally, if you are really looking to approach the task differently, you can look into a Map/Reduce framework like Hadoop (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html). With something like this, you can distribute the work across multiple computers and get the job completed extremely fast, though I think that if you follow the three optimization techniques above, you'll be in good enough shape.
Context
StackExchange Code Review Q#114245, answer score: 8