patternjavaMinor
Sequence Alignment of two Strings
Viewed 0 times
twostringsalignmentsequence
Problem
I have some code which aligns the sequence of two strings, I am doing numbers just for my implementation.
I was wondering whether there are any performance enhancements I could make as the code itself is \$O(n^2)\$ which isn't ideal in terms of scalability, here's the code:
The value that I actually need from the code is
I was wondering whether there are any performance enhancements I could make as the code itself is \$O(n^2)\$ which isn't ideal in terms of scalability, here's the code:
int[] goal = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,15, 0};
int[] test = {1, 5, 9, 11, 2, 10, 0, 6, 8, 14, 15, 12, 7, 4, 3, 13};
int gap = 1;
// The penalties to apply
int substitution = 1, match = 0;
int[][] opt = new int[goal.length + 1][goal.length + 1];
// First of all, compute insertions and deletions at 1st row/column
for (int i = 1; i <= goal.length; i++) {
opt[i][0] = opt[i - 1][0] + gap;
}
for (int j = 1; j <= test.length; j++) {
opt[0][j] = opt[0][j - 1] + gap;
}
for (int i = 1; i <= goal.length; i++) {
for (int j = 1; j <= test.length; j++) {
int scoreDiag = opt[i - 1][j - 1] + (goal[i - 1] == test[j - 1] ? match : substitution); // different symbol
int scoreLeft = opt[i][j - 1] + gap; // insertion
int scoreUp = opt[i - 1][j] + gap; // deletion
// we take the minimum
opt[i][j] = Math.min(Math.min(scoreDiag, scoreLeft), scoreUp);
}
}
for (int i = 0; i <= goal.length; i++) {
for (int j = 0; j <= test.length; j++) {
System.out.print(opt[i][j] + "\t");
}
System.out.println();
}The value that I actually need from the code is
opt[goal.length][test.length]Solution
One general way to speed up dynamic algorithms is the "Method of Four Russians" - \$O(\frac{n^2}{\log n})\$. But to implement this method you should use limited alphabet (not like numbers in your implementation).
Although theoretically this method looks awesome, practically the size of a lookup table is growing fast (\$O(3^{2t} \cdot \left|A \right| \cdot t)\$, where \$\left|A \right|\$ is the size of an alphabet, and \$t\$ is a block size). For example, in the case of DNA (4 letters alphabet) it can fit to the modern cash memory just in the case of \$t\$ equals to 3 (\$\sim0.1MB\$) or 4 (\$\sim10MB\$).
Another way to optimize the pairwise alignment is to take advantage of an extended instruction set of modern CPUs (SIMD extension). Up to now there are several such implementations of the Smith-Waterman algorithm (1 or 2).
Although theoretically this method looks awesome, practically the size of a lookup table is growing fast (\$O(3^{2t} \cdot \left|A \right| \cdot t)\$, where \$\left|A \right|\$ is the size of an alphabet, and \$t\$ is a block size). For example, in the case of DNA (4 letters alphabet) it can fit to the modern cash memory just in the case of \$t\$ equals to 3 (\$\sim0.1MB\$) or 4 (\$\sim10MB\$).
Another way to optimize the pairwise alignment is to take advantage of an extended instruction set of modern CPUs (SIMD extension). Up to now there are several such implementations of the Smith-Waterman algorithm (1 or 2).
Context
StackExchange Code Review Q#160564, answer score: 3
Revisions (0)
No revisions yet.