patternpythonMinor
Simple DNA sequence finder w/ mismatch tolerance
Viewed 0 times
simplefindersequencednatolerancemismatch
Problem
The goal with this function is to find one DNA sequence within another sequence, with a specified amount of mismatch tolerance. For example:
EDIT: Function should also be able to find pat sequence in a sequence larger than the pat sequence; furthermore it should be able to find multiple matches within that sequence if they exist. For example,
Here's my working draft:
I was wondering how to write this function:
dnasearch('acc','ccc',0)shouldn't find a match, while
dnasearch('acc','ccc',1)should find one.
EDIT: Function should also be able to find pat sequence in a sequence larger than the pat sequence; furthermore it should be able to find multiple matches within that sequence if they exist. For example,
dnasearch('acc','acaaatc',1) would find two matches: 'aca' and 'act'. Thanks to @weronika for pointing this out.Here's my working draft:
def dnasearch(pat,seq,mismatch_allowed=0):
patset1=set([pat])
patset2=set([pat])
for number in range(mismatch_allowed):
for version in patset1:
for a in range(len(version)):
for letter in 'actg':
newpat=version[:a]+letter+version[a+1:]
patset2.add(newpat)
patset1|=patset2
for n in patset1:
if seq.find(n) != -1: print 'At position', seq.find(n),\
'found',nI was wondering how to write this function:
- With fewer for-loops.
- Without the second
patset. Originally I tried just adding new patterns topatset1, but I got the error:RuntimeError: Set changed size during iteration.
- Simpler in any other way.
Solution
This is a solved problem in bioinformatics and there are specific algorithms depending on your specific use-case:
In the general case, this is a global pairwise alignment which is computed by the Needleman–Wunsch algorithm. If you are just interested in the score of the alignment, rather than the complete alignment, this can be simplified by not performing the backtracking.
Now, if you have a threshold and are only interested in finding matches below said threshold, then a more efficient variant employs the Ukkonen trick – unfortunately, references for this are hard to find online, and the print literature is quite expensive. But what it does is essentially break the recurrence in the above-mentioned algorithm once the score exceeds the threshold chosen before (using the insight that the error score can only increase, never decrease in a given column).
But all this is unnecessary if you don’t allow gaps. In that case, the algorithm is as straightforward as walking over both strings at once and looking whether the current positions match.
And this can be expressed in a single line:
and:
- Do you allow gaps?
- How many mismatches do gaps incur?
- … etc.
In the general case, this is a global pairwise alignment which is computed by the Needleman–Wunsch algorithm. If you are just interested in the score of the alignment, rather than the complete alignment, this can be simplified by not performing the backtracking.
Now, if you have a threshold and are only interested in finding matches below said threshold, then a more efficient variant employs the Ukkonen trick – unfortunately, references for this are hard to find online, and the print literature is quite expensive. But what it does is essentially break the recurrence in the above-mentioned algorithm once the score exceeds the threshold chosen before (using the insight that the error score can only increase, never decrease in a given column).
But all this is unnecessary if you don’t allow gaps. In that case, the algorithm is as straightforward as walking over both strings at once and looking whether the current positions match.
And this can be expressed in a single line:
def distance(a, b):
return sum(map(lambda (x, y): 0 if x == y else 1, zip(a, b)))and:
def distance_less_than(k, a, b):
return distance(a, b) < kCode Snippets
def distance(a, b):
return sum(map(lambda (x, y): 0 if x == y else 1, zip(a, b)))def distance_less_than(k, a, b):
return distance(a, b) < kContext
StackExchange Code Review Q#12522, answer score: 4
Revisions (0)
No revisions yet.