HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Finding the most common DNA patterns of some given length

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
thelengthgivenpatternsdnasomefindingcommonmost

Problem

I'm working on an algorithm to search through 100,000+ lines of dna sequencing, with the fastest time possible. Here is my current code, I was wondering if there's any ways to make this run faster:

public static void mostCommonKmer(String s, int k) {
    // your code here
    HashMap m = new HashMap<>();
    int l = 0, g;
    String n = "", ss;
    for (int i = 0; i < s.length() - k; i++) {
        ss = s.substring(i, i+k);
        g = m.getOrDefault(ss, 1);
        m.put(ss, g + 1);
        if (l < g) {
            l = g;
            n = ss;
        }
    }
    System.out.println("Most Frequent " + k + "-mer = " + n);
    System.out.println("frequency = " + l);
}


  • s is the 100,000+ character string passed to the method and k is the length of pattern to look for



To be clear, it runs fast (0.08s for 100,000 chars), but I want to see how fast I can get it to go.

Solution

You can optimize a bit by pre-calculating s.length() - k

for (int i = 0, max = s.length() - k; i < max; i++) {
    ...
}


Furthermore, you can cut a bit of execution time by replacing the auto boxing and unboxing you do by using a mutable value class, something like this:

private static class MutableInt {
    public int value;

    public MutableInt(int value) {
        this.value = value;
    }
}

public static void mostCommonKmer3(String s, int k) {
    HashMap m = new HashMap<>();
    int l = 0;
    MutableInt g;
    String n = "", ss;
    for (int i = 0, max = s.length() - k; i  new MutableInt(1));
        g.value++;
        if (l < g.value) {
            l = g.value;
            n = ss;
        }
    }
    //System.out.println("Most Frequent " + k + "-mer = " + n);
    //System.out.println("frequency = " + l);
}


Raw execution time for 1000 repetitions (with proper ramp-up) using 100000 chars and k = 150 went from 3308 ms for the original code to 3190 using both changes on my machine.

And as you are on codereview, one additional remark: longer variable names (longest, subsequence, n?) do not hurt execution time, but have the potential to make the code more readable.... ;-)

Code Snippets

for (int i = 0, max = s.length() - k; i < max; i++) {
    ...
}
private static class MutableInt {
    public int value;

    public MutableInt(int value) {
        this.value = value;
    }
}

public static void mostCommonKmer3(String s, int k) {
    HashMap<String, MutableInt> m = new HashMap<>();
    int l = 0;
    MutableInt g;
    String n = "", ss;
    for (int i = 0, max = s.length() - k; i < max; i++) {
        ss = s.substring(i, i+k);

        // note: computeIfAbsent over getOrDefault in this case, so that
        // creation will only take place when necessary
        g = m.computeIfAbsent(ss, key -> new MutableInt(1));
        g.value++;
        if (l < g.value) {
            l = g.value;
            n = ss;
        }
    }
    //System.out.println("Most Frequent " + k + "-mer = " + n);
    //System.out.println("frequency = " + l);
}

Context

StackExchange Code Review Q#160352, answer score: 5

Revisions (0)

No revisions yet.