HiveBrain v1.2.0
Get Started
← Back to all entries
snippetpythonMinor

Create a list of all strings within hamming distance of a reference string with a given alphabet

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
alphabetreferencecreateallwithdistancegivenwithinliststrings

Problem

For a bioinformatic problem, I found myself wanting to create a list of strings within Hamming distance "k" of a reference sequence. I wanted to do so quickly and pythonically. I've implemented it in pure python and in cython, with and without type declarations. The time performance has been identical. (I also compared the compiled python version with an interpreted version defined in ipython, which also performed identically.)

cfi has been set as shorthand to chain.from_iterable in order to reduce the number of dot operators being used as in the following module-level import and definition:

from itertools import chain
cfi = chain.from_iterable

@cython.returns(list)
def PermuteMotifOnce(cython.str motif, set alphabet={"A", "C", "G", "T"}):
    """
    Gets all strings within hamming distance 1 of motif and returns it as a
    list.
    """
    return list(set(cfi([[
        motif[:pos] + alpha + motif[pos + 1:] for
        alpha in alphabet] for
                         pos in range(len(motif))])))

def PyPermuteMotifOnce(motif, alphabet={"A", "C", "G", "T"}):
    """
    Gets all strings within hamming distance 1 of motif and returns it as a
    list.
    """
    return list(set(cfi([[
        motif[:pos] + alpha + motif[pos + 1:] for
        alpha in alphabet] for
                         pos in range(len(motif))])))

@cython.returns(list)
def PermuteMotifN(cython.str motif, cython.long n=-1):
    assert n > 0
    cdef set workingSet
    cdef cython.long i
    workingSet = {motif}
    for i in range(n):
        workingSet = set(cfi(map(PermuteMotifOnce, workingSet)))
    return list(workingSet)

def PyPermuteMotifN(motif, n=-1):
    assert n > 0
    workingSet = {motif}
    for i in range(n):
        workingSet = set(cfi(map(PermuteMotifOnce, workingSet)))
    return list(workingSet)


Results:

`motif = "ACCTACTGAACT"
%timeit -n 5 PermuteMotifN(motif, 6)
5 loops, best of 3: 6.93s per loop
%timeit -n 5 PyPermuteMotifN(motif, 6)
5 loops, best of 3: 6.81

Solution

Your strategy here is to keep a working set and repeatedly expand it by adding strings within Hamming distance 1 of elements in the set. This involves a lot of duplicated work, because the same candidate strings are going to be generated many times over (and then thrown away because they are already members of the working set).

I find that I get a substantial speedup just by generating each string exactly once.

from itertools import chain, combinations, product

def hamming_circle(s, n, alphabet):
    """Generate strings over alphabet whose Hamming distance from s is
    exactly n.

    >>> sorted(hamming_circle('abc', 0, 'abc'))
    ['abc']
    >>> sorted(hamming_circle('abc', 1, 'abc'))
    ['aac', 'aba', 'abb', 'acc', 'bbc', 'cbc']
    >>> sorted(hamming_circle('aaa', 2, 'ab'))
    ['abb', 'bab', 'bba']

    """
    for positions in combinations(range(len(s)), n):
        for replacements in product(range(len(alphabet) - 1), repeat=n):
            cousin = list(s)
            for p, r in zip(positions, replacements):
                if cousin[p] == alphabet[r]:
                    cousin[p] = alphabet[-1]
                else:
                    cousin[p] = alphabet[r]
            yield ''.join(cousin)

def hamming_ball(s, n, alphabet):
    """Generate strings over alphabet whose Hamming distance from s is
    less than or equal to n.

    >>> sorted(hamming_ball('abc', 0, 'abc'))
    ['abc']
    >>> sorted(hamming_ball('abc', 1, 'abc'))
    ['aac', 'aba', 'abb', 'abc', 'acc', 'bbc', 'cbc']
    >>> sorted(hamming_ball('aaa', 2, 'ab'))
    ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba']

    """
    return chain.from_iterable(hamming_circle(s, i, alphabet)
                               for i in range(n + 1))


In pure Python, this runs about 4 times as fast as the code in the original post.

Code Snippets

from itertools import chain, combinations, product

def hamming_circle(s, n, alphabet):
    """Generate strings over alphabet whose Hamming distance from s is
    exactly n.

    >>> sorted(hamming_circle('abc', 0, 'abc'))
    ['abc']
    >>> sorted(hamming_circle('abc', 1, 'abc'))
    ['aac', 'aba', 'abb', 'acc', 'bbc', 'cbc']
    >>> sorted(hamming_circle('aaa', 2, 'ab'))
    ['abb', 'bab', 'bba']

    """
    for positions in combinations(range(len(s)), n):
        for replacements in product(range(len(alphabet) - 1), repeat=n):
            cousin = list(s)
            for p, r in zip(positions, replacements):
                if cousin[p] == alphabet[r]:
                    cousin[p] = alphabet[-1]
                else:
                    cousin[p] = alphabet[r]
            yield ''.join(cousin)

def hamming_ball(s, n, alphabet):
    """Generate strings over alphabet whose Hamming distance from s is
    less than or equal to n.

    >>> sorted(hamming_ball('abc', 0, 'abc'))
    ['abc']
    >>> sorted(hamming_ball('abc', 1, 'abc'))
    ['aac', 'aba', 'abb', 'abc', 'acc', 'bbc', 'cbc']
    >>> sorted(hamming_ball('aaa', 2, 'ab'))
    ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba']

    """
    return chain.from_iterable(hamming_circle(s, i, alphabet)
                               for i in range(n + 1))

Context

StackExchange Code Review Q#88912, answer score: 7

Revisions (0)

No revisions yet.