patternjavaMinor
Finding the most common DNA patterns of some given length
Viewed 0 times
thelengthgivenpatternsdnasomefindingcommonmost
Problem
I'm working on an algorithm to search through 100,000+ lines of dna sequencing, with the fastest time possible. Here is my current code, I was wondering if there's any ways to make this run faster:
To be clear, it runs fast (0.08s for 100,000 chars), but I want to see how fast I can get it to go.
public static void mostCommonKmer(String s, int k) {
// your code here
HashMap m = new HashMap<>();
int l = 0, g;
String n = "", ss;
for (int i = 0; i < s.length() - k; i++) {
ss = s.substring(i, i+k);
g = m.getOrDefault(ss, 1);
m.put(ss, g + 1);
if (l < g) {
l = g;
n = ss;
}
}
System.out.println("Most Frequent " + k + "-mer = " + n);
System.out.println("frequency = " + l);
}- s is the 100,000+ character string passed to the method and k is the length of pattern to look for
To be clear, it runs fast (0.08s for 100,000 chars), but I want to see how fast I can get it to go.
Solution
You can optimize a bit by pre-calculating s.length() - k
Furthermore, you can cut a bit of execution time by replacing the auto boxing and unboxing you do by using a mutable value class, something like this:
Raw execution time for 1000 repetitions (with proper ramp-up) using 100000 chars and k = 150 went from 3308 ms for the original code to 3190 using both changes on my machine.
And as you are on codereview, one additional remark: longer variable names (longest, subsequence, n?) do not hurt execution time, but have the potential to make the code more readable.... ;-)
for (int i = 0, max = s.length() - k; i < max; i++) {
...
}Furthermore, you can cut a bit of execution time by replacing the auto boxing and unboxing you do by using a mutable value class, something like this:
private static class MutableInt {
public int value;
public MutableInt(int value) {
this.value = value;
}
}
public static void mostCommonKmer3(String s, int k) {
HashMap m = new HashMap<>();
int l = 0;
MutableInt g;
String n = "", ss;
for (int i = 0, max = s.length() - k; i new MutableInt(1));
g.value++;
if (l < g.value) {
l = g.value;
n = ss;
}
}
//System.out.println("Most Frequent " + k + "-mer = " + n);
//System.out.println("frequency = " + l);
}Raw execution time for 1000 repetitions (with proper ramp-up) using 100000 chars and k = 150 went from 3308 ms for the original code to 3190 using both changes on my machine.
And as you are on codereview, one additional remark: longer variable names (longest, subsequence, n?) do not hurt execution time, but have the potential to make the code more readable.... ;-)
Code Snippets
for (int i = 0, max = s.length() - k; i < max; i++) {
...
}private static class MutableInt {
public int value;
public MutableInt(int value) {
this.value = value;
}
}
public static void mostCommonKmer3(String s, int k) {
HashMap<String, MutableInt> m = new HashMap<>();
int l = 0;
MutableInt g;
String n = "", ss;
for (int i = 0, max = s.length() - k; i < max; i++) {
ss = s.substring(i, i+k);
// note: computeIfAbsent over getOrDefault in this case, so that
// creation will only take place when necessary
g = m.computeIfAbsent(ss, key -> new MutableInt(1));
g.value++;
if (l < g.value) {
l = g.value;
n = ss;
}
}
//System.out.println("Most Frequent " + k + "-mer = " + n);
//System.out.println("frequency = " + l);
}Context
StackExchange Code Review Q#160352, answer score: 5
Revisions (0)
No revisions yet.