HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Array-search algorithm for identical sequences and character distances

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
searcharrayidenticalcharacteralgorithmforanddistancessequences

Problem

Here is Python code written to perform the following operations:

  • Find each occurrence of a three-letter sequence in a character array (e.g. a sequence such as ('a','b','c')), including overlapping sequences of up to 2 shared characters.



  • Count the characters between the start of each sequence and the start of all identical sequences following it.



  • Every time the same number of characters results from #2, increment a counter for that specific number of characters (regardless of which character sequence caused it).



  • Return a dictionary containing all accumulated counters for character distances.



counts = {}
# repeating groups of three identical chars in a row
for i in range(len(array)-2):
    for j in range(i+1,len(array)-2):
        if ((array[i] == array[j]) & (array[i+1] == array[j+1]) & (array[i+2] == array[j+2])):
            if counts.has_key(j-i) == False:
                counts[j-i] = 1
            else:
                counts[j-i] += 1


This code was originally written in another programming language, but I would like to apply any optimizations or improvements available in Python.

Solution


  1. Improving the code



Here are some ways in which you could improve your code, while leaving the algorithm unchanged:

-
You can avoid the need to check counts.has_key(j-i) if you use collections.Counter.

-
You are looping over all pairs of distinct indexes between 0 and len(array)-2. The function itertools.combinations provides a way to cleanly express your intention.

-
Instead of comparing three pairs of array elements, you can compare one pair of array slices.

-
Organize the code into a function with a docstring.

-
Generalize it from 3 to \$k\$.

-
Python doesn't have an "array" type, but it does have sequences. So your variables could be better named.

Here's the revised code after making the above improvements:

from collections import Counter
from itertools import combinations

def subseq_distance_counts(seq, k=3):
    """Return a dictionary whose keys are distances and whose values
    are the number of identical pairs of `k`-length subsequences
    of `seq` separated by that distance. `k` defaults to 3.

    """
    counts = Counter()
    for i, j in combinations(range(len(seq) - k + 1), 2):
        if seq[i:i + k] == seq[j:j + k]:
            counts[j - i] += 1
    return counts


It's possible to rewrite the body of this function as a single expression:

return Counter(j - i
                   for i, j in combinations(range(len(seq) - k + 1), 2)
                   if seq[i:i + k] == seq[j:j + k])


Some people prefer this style of coding, and others prefer the explicitness of the first version. It doesn't make much difference to the performance, so comes down to personal taste.

  1. Improving the algorithm



Here's an alternative approach (that you could use if the subsequences are hashable). Your original algorithm is \$ O(n^2) \$, where \$n\$ is the length of the sequence. The algorithm below is \$O(n + m)\$, where \$m\$ is the number of repeated triples. This won't make any different in the worst case, where \$m = Θ(n^2)\$, but in cases where \$m = ο(n^2)\$ it should be an improvement.

def subseq_distance_counts(seq, k=3):
    """Return a dictionary whose keys are distances and whose values
    are the number of identical pairs of `k`-length subsequences
    of `seq` separated by that distance. `k` defaults to 3.

    >>> subseq_distance_counts('abcabcabc')
    Counter({3: 4, 6: 1})
    >>> subseq_distance_counts('aaaaaa', 1)
    Counter({1: 5, 2: 4, 3: 3, 4: 2, 5: 1})

    """
    positions = defaultdict(list) # List of positions where each subsequence appears.
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return Counter(j - i
                   for p in positions.itervalues()
                   for i, j in combinations(p, 2))

Code Snippets

from collections import Counter
from itertools import combinations

def subseq_distance_counts(seq, k=3):
    """Return a dictionary whose keys are distances and whose values
    are the number of identical pairs of `k`-length subsequences
    of `seq` separated by that distance. `k` defaults to 3.

    """
    counts = Counter()
    for i, j in combinations(range(len(seq) - k + 1), 2):
        if seq[i:i + k] == seq[j:j + k]:
            counts[j - i] += 1
    return counts
return Counter(j - i
                   for i, j in combinations(range(len(seq) - k + 1), 2)
                   if seq[i:i + k] == seq[j:j + k])
def subseq_distance_counts(seq, k=3):
    """Return a dictionary whose keys are distances and whose values
    are the number of identical pairs of `k`-length subsequences
    of `seq` separated by that distance. `k` defaults to 3.

    >>> subseq_distance_counts('abcabcabc')
    Counter({3: 4, 6: 1})
    >>> subseq_distance_counts('aaaaaa', 1)
    Counter({1: 5, 2: 4, 3: 3, 4: 2, 5: 1})

    """
    positions = defaultdict(list) # List of positions where each subsequence appears.
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return Counter(j - i
                   for p in positions.itervalues()
                   for i, j in combinations(p, 2))

Context

StackExchange Code Review Q#19409, answer score: 4

Revisions (0)

No revisions yet.