HiveBrain v1.2.0
Get Started
← Back to all entries
gotchaMinor

Why does Banded Needleman-Wunsch give alignments with no more than d base pairs of indels?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
bandedwhyneedlemanwiththanmoregivedoesalignmentsindels

Problem

A common modification to the Needleman-Wunsch to reduce running time is to only fill in the cells along a diagonal band of the matrix (slides 27/28). Let 2d + 1 represent the width of the band. Then based on the provided text, this ensures that the total number of inserts and deletes does not exceed d.

However, I can't understand the intuition behind the algorithm. If we let L represent a list of the size of each mutation in the alignment in-order (where negative sizes represent deletions), doesn't the band restriction merely say that the sum of values in any prefix slice of L must stay between -d and d? But the total number of mutations could easily exceed d.

E.g. If d = 4, throughout a long alignment, there could be 5 distinct mutations, where L refers to a series of [insertion, deletion, insertion, insertion, deletion] with sizes: [3,-2,2,1,-4]. Visually, such an alignment would represent a wiggly line through the DP matrix that results in plenty of small insertions and deletions, alternating to ensure that the alignment stays within the band.

In other words, the band does not actually restrict the total number of insertions and deletions as claimed? If I'm misunderstanding something, I'd appreciate any clarification. Thanks!

P.S. Was unsure if this post should go under the bioinformatics stackexchange beta.

Solution

You are correct, the number of insertions/deletions $d$ is not constrained (only) by the bandwidth.

However, the algorithm only uses the fact that if we know $d$, then the path must stay within the $2d+1$ band. This band does not have to be tight: a zigzag path would indeed fit within a much narrower band.

You may wonder why the smallest bandwidth (which is indeed equal to the maximum sum over prefixes of your list $L$) isn't used as a parameter instead of $d$. Since $d$ is related to the similarity of the strings, it isn't unreasonable that an upper bound of $d$ is known in some applications. The smallest bandwidth doesn't seem to correspond to a 'natural' parameter in the context of similarity as $d$ does, so it doesn't seem reasonable that it is known and can be used in the algorithm.

Context

StackExchange Computer Science Q#83320, answer score: 2

Revisions (0)

No revisions yet.