HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Matlab implementation of Needleman-Wunsch algorithm

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
matlabneedlemanalgorithmimplementationwunsch

Problem

This code (an implementation of the path finding Needleman-Wunsch algorithm), given two sequences, correctly aligns and calculates the similarity of them. For example, given two sequences:


AAA,CCC

or


AA,CA

it correctly aligns them as

---AAA
CCC---


and

-AA
CA-


respectively, while returning the similarity.

```
% Matlab implementation of NW algorithm
% Written by Joseph Farah
% Completed 7/14/16
% Last updated 7/14/16
function similarity = needleman(s1, s2)
% defining variables
% sequence related variables
s1 = strcat('-',s1);
s2 = strcat('-',s2);
seq1 = s1;
seq2 = s2;
%seq1 = '-ACTTAGATTACTTG';
%seq2 = '-CCCACTAGATTATTAG';
fseq1 = '';
fseq2 = '';
% weights
w_s = 2;
w_indel = 1;
% temporary costs for the matrix fill
h_cost = 0;
v_cost = 0;
d_cost = 0;
% the matrices
cost_matrix = zeros(length(seq1),length(seq2));
direction_cell_matrix = cell(length(seq1),length(seq2));
%% Initiliazation and Scoring
for x = 1:length(seq1)
for y = 1:length(seq2)
point = [seq1(x), seq2(y)];
if point(1) == '-' && point(2) == '-'
cost_matrix(x,y) = 0;
direction_cell_matrix{x,y} = 'n';
continue
end
% apparently, to get it to go in the right order, moving
% horizontally is defined by y-1 and vertically by x-1
if point(1) == '-' % this means it is on the lip --
cost_matrix(x,y) = cost_matrix(x,y-1) + w_indel;
% the only way it can go is left!
direction_cell_matrix{x,y} = 'h';
continue
end
if point(2) == '-' % is it vertical?
cost_matrix(x,y) = cost_matrix(x-1,y) + w_indel;
% the only way it can go is up!
direction_cell_matrix{x,y} = 'v';
continue
end
% if none of the above conditions were met, the point must be in
% the body of the array. First thing we have to do is calculate the
% cost of the current point.
h_cost = cost

Solution

I'll take this from the bottom up:

The last part of the code that calculates the similarity can be vectorized by comparing all elements of the two vectors and summing the ones that are equal:

sim_score = 0;
for ii=1:length(fseq1)
    if fseq1(ii) == fseq2(ii)
        sim_score = sim_score + 1;
    else
        continue
    end
end
similarity = sim_score/length(fseq1);


Can instead be written like this:

similarity = sum(fseq1 == fseq2) / numel(fseq1);


or if you want the sim_score too:

sim_score = sum(fseq1 == fseq);
similarity = sim_score / numel(fseq1);


Note that numel is faster and more robust than length.

Timing of loop approach versus vectorized approach:

You are using strcat a lot. strcat is good as it's easy to read and understand that you're concatenating strings (vector of characters). However, regular concatenating with brackets [str1, str2] is faster:

Instead of fseq2 = strcat('-',fseq2); you can do fseq2 = ['-', fseq2].

Timing of strcat vs brackets:

You're using only the first elements of the index variable after calling find. find has a second optional parameter that specifies the number of indices you want to store, so you can do:

index = find(min_list_cost == minimum, 1);
if index == 1
   ...
if index == 2
   ...


Specifying that you only want the first index improve the runtime of up to 30%.

If you can safely assume that the equal elements appear in the start of the vectors, you're better of defining your own find_first function. This assumption is safe if you have a lot of repetitive elements, such as AACCTTATTCTTA...

A very simple function that has runtime O(1) if the first equal value appear in the beginning of the vector.

function idx = find_first(x, val)

idx = 0;
for ii = 1:numel(x)
    if x(ii) == val
        idx = ii;
        break
    end
end
end

Code Snippets

sim_score = 0;
for ii=1:length(fseq1)
    if fseq1(ii) == fseq2(ii)
        sim_score = sim_score + 1;
    else
        continue
    end
end
similarity = sim_score/length(fseq1);
similarity = sum(fseq1 == fseq2) / numel(fseq1);
sim_score = sum(fseq1 == fseq);
similarity = sim_score / numel(fseq1);
index = find(min_list_cost == minimum, 1);
if index == 1
   ...
if index == 2
   ...
function idx = find_first(x, val)

idx = 0;
for ii = 1:numel(x)
    if x(ii) == val
        idx = ii;
        break
    end
end
end

Context

StackExchange Code Review Q#135217, answer score: 2

Revisions (0)

No revisions yet.