snippetpythonMinor
Create a list of all strings within hamming distance of a reference string with a given alphabet
Viewed 0 times
alphabetreferencecreateallwithdistancegivenwithinliststrings
Problem
For a bioinformatic problem, I found myself wanting to create a list of strings within Hamming distance "k" of a reference sequence. I wanted to do so quickly and pythonically. I've implemented it in pure python and in cython, with and without type declarations. The time performance has been identical. (I also compared the compiled python version with an interpreted version defined in ipython, which also performed identically.)
cfi has been set as shorthand to
Results:
`motif = "ACCTACTGAACT"
%timeit -n 5 PermuteMotifN(motif, 6)
5 loops, best of 3: 6.93s per loop
%timeit -n 5 PyPermuteMotifN(motif, 6)
5 loops, best of 3: 6.81
cfi has been set as shorthand to
chain.from_iterable in order to reduce the number of dot operators being used as in the following module-level import and definition:from itertools import chain
cfi = chain.from_iterable
@cython.returns(list)
def PermuteMotifOnce(cython.str motif, set alphabet={"A", "C", "G", "T"}):
"""
Gets all strings within hamming distance 1 of motif and returns it as a
list.
"""
return list(set(cfi([[
motif[:pos] + alpha + motif[pos + 1:] for
alpha in alphabet] for
pos in range(len(motif))])))
def PyPermuteMotifOnce(motif, alphabet={"A", "C", "G", "T"}):
"""
Gets all strings within hamming distance 1 of motif and returns it as a
list.
"""
return list(set(cfi([[
motif[:pos] + alpha + motif[pos + 1:] for
alpha in alphabet] for
pos in range(len(motif))])))
@cython.returns(list)
def PermuteMotifN(cython.str motif, cython.long n=-1):
assert n > 0
cdef set workingSet
cdef cython.long i
workingSet = {motif}
for i in range(n):
workingSet = set(cfi(map(PermuteMotifOnce, workingSet)))
return list(workingSet)
def PyPermuteMotifN(motif, n=-1):
assert n > 0
workingSet = {motif}
for i in range(n):
workingSet = set(cfi(map(PermuteMotifOnce, workingSet)))
return list(workingSet)Results:
`motif = "ACCTACTGAACT"
%timeit -n 5 PermuteMotifN(motif, 6)
5 loops, best of 3: 6.93s per loop
%timeit -n 5 PyPermuteMotifN(motif, 6)
5 loops, best of 3: 6.81
Solution
Your strategy here is to keep a working set and repeatedly expand it by adding strings within Hamming distance 1 of elements in the set. This involves a lot of duplicated work, because the same candidate strings are going to be generated many times over (and then thrown away because they are already members of the working set).
I find that I get a substantial speedup just by generating each string exactly once.
In pure Python, this runs about 4 times as fast as the code in the original post.
I find that I get a substantial speedup just by generating each string exactly once.
from itertools import chain, combinations, product
def hamming_circle(s, n, alphabet):
"""Generate strings over alphabet whose Hamming distance from s is
exactly n.
>>> sorted(hamming_circle('abc', 0, 'abc'))
['abc']
>>> sorted(hamming_circle('abc', 1, 'abc'))
['aac', 'aba', 'abb', 'acc', 'bbc', 'cbc']
>>> sorted(hamming_circle('aaa', 2, 'ab'))
['abb', 'bab', 'bba']
"""
for positions in combinations(range(len(s)), n):
for replacements in product(range(len(alphabet) - 1), repeat=n):
cousin = list(s)
for p, r in zip(positions, replacements):
if cousin[p] == alphabet[r]:
cousin[p] = alphabet[-1]
else:
cousin[p] = alphabet[r]
yield ''.join(cousin)
def hamming_ball(s, n, alphabet):
"""Generate strings over alphabet whose Hamming distance from s is
less than or equal to n.
>>> sorted(hamming_ball('abc', 0, 'abc'))
['abc']
>>> sorted(hamming_ball('abc', 1, 'abc'))
['aac', 'aba', 'abb', 'abc', 'acc', 'bbc', 'cbc']
>>> sorted(hamming_ball('aaa', 2, 'ab'))
['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba']
"""
return chain.from_iterable(hamming_circle(s, i, alphabet)
for i in range(n + 1))In pure Python, this runs about 4 times as fast as the code in the original post.
Code Snippets
from itertools import chain, combinations, product
def hamming_circle(s, n, alphabet):
"""Generate strings over alphabet whose Hamming distance from s is
exactly n.
>>> sorted(hamming_circle('abc', 0, 'abc'))
['abc']
>>> sorted(hamming_circle('abc', 1, 'abc'))
['aac', 'aba', 'abb', 'acc', 'bbc', 'cbc']
>>> sorted(hamming_circle('aaa', 2, 'ab'))
['abb', 'bab', 'bba']
"""
for positions in combinations(range(len(s)), n):
for replacements in product(range(len(alphabet) - 1), repeat=n):
cousin = list(s)
for p, r in zip(positions, replacements):
if cousin[p] == alphabet[r]:
cousin[p] = alphabet[-1]
else:
cousin[p] = alphabet[r]
yield ''.join(cousin)
def hamming_ball(s, n, alphabet):
"""Generate strings over alphabet whose Hamming distance from s is
less than or equal to n.
>>> sorted(hamming_ball('abc', 0, 'abc'))
['abc']
>>> sorted(hamming_ball('abc', 1, 'abc'))
['aac', 'aba', 'abb', 'abc', 'acc', 'bbc', 'cbc']
>>> sorted(hamming_ball('aaa', 2, 'ab'))
['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba']
"""
return chain.from_iterable(hamming_circle(s, i, alphabet)
for i in range(n + 1))Context
StackExchange Code Review Q#88912, answer score: 7
Revisions (0)
No revisions yet.