patternpythonMinor

Simple DNA sequence finder w/ mismatch tolerance

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

simplefindersequencednatolerancemismatch

Problem

The goal with this function is to find one DNA sequence within another sequence, with a specified amount of mismatch tolerance. For example:

dnasearch('acc','ccc',0) shouldn't find a match, while

dnasearch('acc','ccc',1) should find one.

EDIT: Function should also be able to find pat sequence in a sequence larger than the pat sequence; furthermore it should be able to find multiple matches within that sequence if they exist. For example, dnasearch('acc','acaaatc',1) would find two matches: 'aca' and 'act'. Thanks to @weronika for pointing this out.

Here's my working draft:

def dnasearch(pat,seq,mismatch_allowed=0):
patset1=set([pat])
patset2=set([pat])
for number in range(mismatch_allowed):
    for version in patset1:
        for a in range(len(version)):
            for letter in 'actg':
                newpat=version[:a]+letter+version[a+1:]
                patset2.add(newpat)
    patset1|=patset2
for n in patset1:
    if seq.find(n) != -1: print 'At position', seq.find(n),\
       'found',n

I was wondering how to write this function:

With fewer for-loops.

Without the second patset. Originally I tried just adding new patterns to patset1, but I got the error: RuntimeError: Set changed size during iteration.

Simpler in any other way.

Solution

This is a solved problem in bioinformatics and there are specific algorithms depending on your specific use-case:

Do you allow gaps?

How many mismatches do gaps incur?

… etc.

In the general case, this is a global pairwise alignment which is computed by the Needleman–Wunsch algorithm. If you are just interested in the score of the alignment, rather than the complete alignment, this can be simplified by not performing the backtracking.

Now, if you have a threshold and are only interested in finding matches below said threshold, then a more efficient variant employs the Ukkonen trick – unfortunately, references for this are hard to find online, and the print literature is quite expensive. But what it does is essentially break the recurrence in the above-mentioned algorithm once the score exceeds the threshold chosen before (using the insight that the error score can only increase, never decrease in a given column).

But all this is unnecessary if you don’t allow gaps. In that case, the algorithm is as straightforward as walking over both strings at once and looking whether the current positions match.

And this can be expressed in a single line:

def distance(a, b):
    return sum(map(lambda (x, y): 0 if x == y else 1, zip(a, b)))

and:

def distance_less_than(k, a, b):
    return distance(a, b) < k

Code Snippets

def distance(a, b):
    return sum(map(lambda (x, y): 0 if x == y else 1, zip(a, b)))

def distance_less_than(k, a, b):
    return distance(a, b) < k

Context

StackExchange Code Review Q#12522, answer score: 4

Revisions (0)

No revisions yet.