HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Sequence Alignment of two Strings

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
twostringsalignmentsequence

Problem

I have some code which aligns the sequence of two strings, I am doing numbers just for my implementation.

I was wondering whether there are any performance enhancements I could make as the code itself is \$O(n^2)\$ which isn't ideal in terms of scalability, here's the code:

int[] goal = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,15, 0};
        int[] test = {1, 5, 9, 11, 2, 10, 0, 6, 8, 14, 15, 12, 7, 4, 3, 13};
        int gap = 1;
        // The penalties to apply
        int substitution = 1, match = 0;

        int[][] opt = new int[goal.length + 1][goal.length + 1];

        // First of all, compute insertions and deletions at 1st row/column
        for (int i = 1; i <= goal.length; i++) {
            opt[i][0] = opt[i - 1][0] + gap;
        }
        for (int j = 1; j <= test.length; j++) {
            opt[0][j] = opt[0][j - 1] + gap;
        }

        for (int i = 1; i <= goal.length; i++) {
            for (int j = 1; j <= test.length; j++) {
                int scoreDiag = opt[i - 1][j - 1] + (goal[i - 1] == test[j - 1] ? match : substitution); // different symbol
                int scoreLeft = opt[i][j - 1] + gap; // insertion
                int scoreUp = opt[i - 1][j] + gap; // deletion
                // we take the minimum
                opt[i][j] = Math.min(Math.min(scoreDiag, scoreLeft), scoreUp);
            }
        }

        for (int i = 0; i <= goal.length; i++) {
            for (int j = 0; j <= test.length; j++) {
                System.out.print(opt[i][j] + "\t");
            }
            System.out.println();
        }


The value that I actually need from the code is opt[goal.length][test.length]

Solution

One general way to speed up dynamic algorithms is the "Method of Four Russians" - \$O(\frac{n^2}{\log n})\$. But to implement this method you should use limited alphabet (not like numbers in your implementation).

Although theoretically this method looks awesome, practically the size of a lookup table is growing fast (\$O(3^{2t} \cdot \left|A \right| \cdot t)\$, where \$\left|A \right|\$ is the size of an alphabet, and \$t\$ is a block size). For example, in the case of DNA (4 letters alphabet) it can fit to the modern cash memory just in the case of \$t\$ equals to 3 (\$\sim0.1MB\$) or 4 (\$\sim10MB\$).

Another way to optimize the pairwise alignment is to take advantage of an extended instruction set of modern CPUs (SIMD extension). Up to now there are several such implementations of the Smith-Waterman algorithm (1 or 2).

Context

StackExchange Code Review Q#160564, answer score: 3

Revisions (0)

No revisions yet.