Creating a fast Android dictionary (word counts)
Problem
This is a follow-up to an earlier question of mine.
I am currently working on an application that computes various statistics. One task is to analyse a large number of sentences for their word counts.
The specifications are:
- sentences are read from an SQLiteDatabase (up to 20k, with an average of about 15 words each)
- transformation: split on whitespace (to get the words of each sentence)
- transformation: toLowerCase (to minimize variations of words)
- transformation: remove everything matching [^a-zA-Z] (for the same reason as above)
- get word + count for the x most common words (x is not fixed yet, maybe 10-15)
- preserve a flag indicating whether the message was sent or received
What I'm looking for:
- improvements to make the code run faster
- alternative approaches for this task
- (general hints to improve the task)
Current version with the suggested improvements made:
```
//fields
private static final CharMatcher pat_rep = CharMatcher.inRange('A', 'Z').or(CharMatcher.inRange('a', 'z'))
        .precomputed();
private static final Pattern pat_split = Pattern.compile("\\s");
private HashMultiset<String> sent = HashMultiset.create();
private HashMultiset<String> rcvd = HashMultiset.create();
private Cursor c1;
private Cursor c2;
//start
c1 = db.rawQuery("select lower(DATA) as SENTENCE, SENT from MESSAGELIST", null);
while (c1.moveToNext()) {
    String[] words = pat_split.split(c1.getString(c1.getColumnIndex("SENTENCE")));
    // The query exposes the flag as SENT; the original read "key_from_me",
    // a column the query does not select.
    int from_me = c1.getInt(c1.getColumnIndex("SENT"));
    for (String in : words) {
        // keep only the letters a-z / A-Z of each token
        in = pat_rep.retainFrom(in);
        if (!in.equals("")) {
            if (from_me == 1) {
                sent.add(in);
            } else {
                rcvd.add(in);
            }
        }
    }
}
db.execSQL("create temp table if not exists WORDS (WORD varchar, SENT integer, CNT integer)");
SQLiteStatement ins = db.compileStatement("insert into WORDS values (?, ?, ?)");
db.beginTransaction();
Iterator<Multiset.Entry<String>> i = sent.entrySet().
// [the rest of the snippet is truncated in the source]
```
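The "x most common words" requirement from the specification does not need a full sort of every distinct word. The following is a minimal sketch, not the asker's code: it assumes the counts are available as a plain `Map<String, Integer>` (the question uses Guava's `HashMultiset` instead), and `TopWordsSketch` and `topWords` are hypothetical names chosen for illustration. It keeps a min-heap of at most x entries, so only the current best candidates are held at any time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopWordsSketch {

    /** Returns up to x entries with the highest counts, most common first. */
    static List<Map.Entry<String, Integer>> topWords(Map<String, Integer> counts, int x) {
        if (x <= 0) {
            return new ArrayList<>();
        }
        // Orders entries by ascending count, so the heap's head is the smallest kept entry.
        Comparator<Map.Entry<String, Integer>> byCount =
                (a, b) -> Integer.compare(a.getValue(), b.getValue());
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(x + 1, byCount);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.add(entry);
            if (heap.size() > x) {
                heap.poll(); // drop the entry with the smallest count
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        Collections.sort(result, Collections.reverseOrder(byCount));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 9);
        counts.put("and", 5);
        counts.put("cat", 3);
        counts.put("dog", 1);
        for (Map.Entry<String, Integer> entry : topWords(counts, 2)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

If the data stays in a Guava multiset, `Multisets.copyHighestCountFirst` may be a simpler route, at the cost of copying and ordering all distinct words.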
Solution
Just some ideas.
- Maybe you could use batching?
- You may also be able to save some time by iterating over `sent.entrySet()` instead of looking up the count separately.
- Split on `[^a-zA-Z]`, since you throw the non-letters away later anyway.
- Can't you use JDK5 loops like `for (String in : sent) {...}`?
- I guess `clearBindings` is unnecessary, as you always overwrite everything.
- Make all fields private. Always (unless you have a very good reason not to). At least I hope that `pat_rep` etc. are fields.
- Split your method. Shorter methods are easier to read and to optimize.
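Two of these suggestions, splitting directly on non-letters and walking the entries with a for-each loop, can be sketched in plain Java. This is a minimal sketch under stated assumptions rather than the asker's code: it uses a `HashMap` instead of Guava's `HashMultiset` so it runs without extra dependencies, and `WordCountSketch` and `countWords` are hypothetical names. One behavioural difference is worth knowing: splitting on `[^a-zA-Z]+` turns "don't" into "don" and "t", whereas splitting on whitespace and then retaining letters yields "dont".

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

final class WordCountSketch {

    // Any run of non-letters is a separator, so no per-word cleanup pass is needed.
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-zA-Z]+");

    /** Adds the lower-cased words of one sentence to the given counts. */
    static void countWords(String sentence, Map<String, Integer> counts) {
        for (String word : NON_LETTERS.split(sentence.toLowerCase(Locale.ROOT))) {
            if (word.isEmpty()) {
                continue; // a leading separator yields one empty token
            }
            Integer old = counts.get(word);
            counts.put(word, old == null ? 1 : old + 1);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> sent = new HashMap<>();
        countWords("Hello, hello world!", sent);
        countWords("  World of words", sent);
        // Each entry carries word and count together; no second lookup per word.
        for (Map.Entry<String, Integer> entry : sent.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

In the question's code, the same for-each shape would apply to `sent.entrySet()`, with each entry's element and count bound to the compiled insert statement.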
Code Snippets
```
for (String in : sent) {...}
```
Context
StackExchange Code Review Q#60579, answer score: 3