patternMinor
Matlab implementation of Needleman-Wunsch algorithm
Viewed 0 times
matlabneedlemanalgorithmimplementationwunsch
Problem
This code (an implementation of the path finding Needleman-Wunsch algorithm), given two sequences, correctly aligns and calculates the similarity of them. For example, given two sequences:
AAA,CCC
or
AA,CA
it correctly aligns them as
and
respectively, while returning the similarity.
```
% Matlab implementation of NW algorithm
% Written by Joseph Farah
% Completed 7/14/16
% Last updated 7/14/16
function similarity = needleman(s1, s2)
% defining variables
% sequence related variables
s1 = strcat('-',s1);
s2 = strcat('-',s2);
seq1 = s1;
seq2 = s2;
%seq1 = '-ACTTAGATTACTTG';
%seq2 = '-CCCACTAGATTATTAG';
fseq1 = '';
fseq2 = '';
% weights
w_s = 2;
w_indel = 1;
% temporary costs for the matrix fill
h_cost = 0;
v_cost = 0;
d_cost = 0;
% the matrices
cost_matrix = zeros(length(seq1),length(seq2));
direction_cell_matrix = cell(length(seq1),length(seq2));
%% Initiliazation and Scoring
for x = 1:length(seq1)
for y = 1:length(seq2)
point = [seq1(x), seq2(y)];
if point(1) == '-' && point(2) == '-'
cost_matrix(x,y) = 0;
direction_cell_matrix{x,y} = 'n';
continue
end
% apparently, to get it to go in the right order, moving
% horizontally is defined by y-1 and vertically by x-1
if point(1) == '-' % this means it is on the lip --
cost_matrix(x,y) = cost_matrix(x,y-1) + w_indel;
% the only way it can go is left!
direction_cell_matrix{x,y} = 'h';
continue
end
if point(2) == '-' % is it vertical?
cost_matrix(x,y) = cost_matrix(x-1,y) + w_indel;
% the only way it can go is up!
direction_cell_matrix{x,y} = 'v';
continue
end
% if none of the above conditions were met, the point must be in
% the body of the array. First thing we have to do is calculate the
% cost of the current point.
h_cost = cost
AAA,CCC
or
AA,CA
it correctly aligns them as
---AAA
CCC---and
-AA
CA-respectively, while returning the similarity.
```
% Matlab implementation of NW algorithm
% Written by Joseph Farah
% Completed 7/14/16
% Last updated 7/14/16
function similarity = needleman(s1, s2)
% defining variables
% sequence related variables
s1 = strcat('-',s1);
s2 = strcat('-',s2);
seq1 = s1;
seq2 = s2;
%seq1 = '-ACTTAGATTACTTG';
%seq2 = '-CCCACTAGATTATTAG';
fseq1 = '';
fseq2 = '';
% weights
w_s = 2;
w_indel = 1;
% temporary costs for the matrix fill
h_cost = 0;
v_cost = 0;
d_cost = 0;
% the matrices
cost_matrix = zeros(length(seq1),length(seq2));
direction_cell_matrix = cell(length(seq1),length(seq2));
%% Initiliazation and Scoring
for x = 1:length(seq1)
for y = 1:length(seq2)
point = [seq1(x), seq2(y)];
if point(1) == '-' && point(2) == '-'
cost_matrix(x,y) = 0;
direction_cell_matrix{x,y} = 'n';
continue
end
% apparently, to get it to go in the right order, moving
% horizontally is defined by y-1 and vertically by x-1
if point(1) == '-' % this means it is on the lip --
cost_matrix(x,y) = cost_matrix(x,y-1) + w_indel;
% the only way it can go is left!
direction_cell_matrix{x,y} = 'h';
continue
end
if point(2) == '-' % is it vertical?
cost_matrix(x,y) = cost_matrix(x-1,y) + w_indel;
% the only way it can go is up!
direction_cell_matrix{x,y} = 'v';
continue
end
% if none of the above conditions were met, the point must be in
% the body of the array. First thing we have to do is calculate the
% cost of the current point.
h_cost = cost
Solution
I'll take this from the bottom up:
The last part of the code that calculates the similarity can be vectorized by comparing all elements of the two vectors and summing the ones that are equal:
Can instead be written like this:
or if you want the
Note that
Timing of loop approach versus vectorized approach:
You are using
Instead of
Timing of
You're using only the first elements of the
Specifying that you only want the first index improve the runtime of up to 30%.
If you can safely assume that the equal elements appear in the start of the vectors, you're better of defining your own
A very simple function that has runtime
The last part of the code that calculates the similarity can be vectorized by comparing all elements of the two vectors and summing the ones that are equal:
sim_score = 0;
for ii=1:length(fseq1)
if fseq1(ii) == fseq2(ii)
sim_score = sim_score + 1;
else
continue
end
end
similarity = sim_score/length(fseq1);Can instead be written like this:
similarity = sum(fseq1 == fseq2) / numel(fseq1);or if you want the
sim_score too:sim_score = sum(fseq1 == fseq);
similarity = sim_score / numel(fseq1);Note that
numel is faster and more robust than length.Timing of loop approach versus vectorized approach:
You are using
strcat a lot. strcat is good as it's easy to read and understand that you're concatenating strings (vector of characters). However, regular concatenating with brackets [str1, str2] is faster:Instead of
fseq2 = strcat('-',fseq2); you can do fseq2 = ['-', fseq2].Timing of
strcat vs brackets:You're using only the first elements of the
index variable after calling find. find has a second optional parameter that specifies the number of indices you want to store, so you can do: index = find(min_list_cost == minimum, 1);
if index == 1
...
if index == 2
...Specifying that you only want the first index improve the runtime of up to 30%.
If you can safely assume that the equal elements appear in the start of the vectors, you're better of defining your own
find_first function. This assumption is safe if you have a lot of repetitive elements, such as AACCTTATTCTTA... A very simple function that has runtime
O(1) if the first equal value appear in the beginning of the vector.function idx = find_first(x, val)
idx = 0;
for ii = 1:numel(x)
if x(ii) == val
idx = ii;
break
end
end
endCode Snippets
sim_score = 0;
for ii=1:length(fseq1)
if fseq1(ii) == fseq2(ii)
sim_score = sim_score + 1;
else
continue
end
end
similarity = sim_score/length(fseq1);similarity = sum(fseq1 == fseq2) / numel(fseq1);sim_score = sum(fseq1 == fseq);
similarity = sim_score / numel(fseq1);index = find(min_list_cost == minimum, 1);
if index == 1
...
if index == 2
...function idx = find_first(x, val)
idx = 0;
for ii = 1:numel(x)
if x(ii) == val
idx = ii;
break
end
end
endContext
StackExchange Code Review Q#135217, answer score: 2
Revisions (0)
No revisions yet.