patternpythonMinor
Array-search algorithm for identical sequences and character distances
Viewed 0 times
searcharrayidenticalcharacteralgorithmforanddistancessequences
Problem
Here is Python code written to perform the following operations:
This code was originally written in another programming language, but I would like to apply any optimizations or improvements available in Python.
- Find each occurrence of a three-letter sequence in a character array (e.g. a sequence such as
('a','b','c')), including overlapping sequences of up to 2 shared characters.
- Count the characters between the start of each sequence and the start of all identical sequences following it.
- Every time the same number of characters results from #2, increment a counter for that specific number of characters (regardless of which character sequence caused it).
- Return a dictionary containing all accumulated counters for character distances.
counts = {}
# repeating groups of three identical chars in a row
for i in range(len(array)-2):
for j in range(i+1,len(array)-2):
if ((array[i] == array[j]) & (array[i+1] == array[j+1]) & (array[i+2] == array[j+2])):
if counts.has_key(j-i) == False:
counts[j-i] = 1
else:
counts[j-i] += 1This code was originally written in another programming language, but I would like to apply any optimizations or improvements available in Python.
Solution
- Improving the code
Here are some ways in which you could improve your code, while leaving the algorithm unchanged:
-
You can avoid the need to check
counts.has_key(j-i) if you use collections.Counter.-
You are looping over all pairs of distinct indexes between
0 and len(array)-2. The function itertools.combinations provides a way to cleanly express your intention.-
Instead of comparing three pairs of array elements, you can compare one pair of array slices.
-
Organize the code into a function with a docstring.
-
Generalize it from 3 to \$k\$.
-
Python doesn't have an "array" type, but it does have sequences. So your variables could be better named.
Here's the revised code after making the above improvements:
from collections import Counter
from itertools import combinations
def subseq_distance_counts(seq, k=3):
"""Return a dictionary whose keys are distances and whose values
are the number of identical pairs of `k`-length subsequences
of `seq` separated by that distance. `k` defaults to 3.
"""
counts = Counter()
for i, j in combinations(range(len(seq) - k + 1), 2):
if seq[i:i + k] == seq[j:j + k]:
counts[j - i] += 1
return countsIt's possible to rewrite the body of this function as a single expression:
return Counter(j - i
for i, j in combinations(range(len(seq) - k + 1), 2)
if seq[i:i + k] == seq[j:j + k])Some people prefer this style of coding, and others prefer the explicitness of the first version. It doesn't make much difference to the performance, so comes down to personal taste.
- Improving the algorithm
Here's an alternative approach (that you could use if the subsequences are hashable). Your original algorithm is \$ O(n^2) \$, where \$n\$ is the length of the sequence. The algorithm below is \$O(n + m)\$, where \$m\$ is the number of repeated triples. This won't make any different in the worst case, where \$m = Θ(n^2)\$, but in cases where \$m = ο(n^2)\$ it should be an improvement.
def subseq_distance_counts(seq, k=3):
"""Return a dictionary whose keys are distances and whose values
are the number of identical pairs of `k`-length subsequences
of `seq` separated by that distance. `k` defaults to 3.
>>> subseq_distance_counts('abcabcabc')
Counter({3: 4, 6: 1})
>>> subseq_distance_counts('aaaaaa', 1)
Counter({1: 5, 2: 4, 3: 3, 4: 2, 5: 1})
"""
positions = defaultdict(list) # List of positions where each subsequence appears.
for i in range(len(seq) - k + 1):
positions[seq[i:i + k]].append(i)
return Counter(j - i
for p in positions.itervalues()
for i, j in combinations(p, 2))Code Snippets
from collections import Counter
from itertools import combinations
def subseq_distance_counts(seq, k=3):
"""Return a dictionary whose keys are distances and whose values
are the number of identical pairs of `k`-length subsequences
of `seq` separated by that distance. `k` defaults to 3.
"""
counts = Counter()
for i, j in combinations(range(len(seq) - k + 1), 2):
if seq[i:i + k] == seq[j:j + k]:
counts[j - i] += 1
return countsreturn Counter(j - i
for i, j in combinations(range(len(seq) - k + 1), 2)
if seq[i:i + k] == seq[j:j + k])def subseq_distance_counts(seq, k=3):
"""Return a dictionary whose keys are distances and whose values
are the number of identical pairs of `k`-length subsequences
of `seq` separated by that distance. `k` defaults to 3.
>>> subseq_distance_counts('abcabcabc')
Counter({3: 4, 6: 1})
>>> subseq_distance_counts('aaaaaa', 1)
Counter({1: 5, 2: 4, 3: 3, 4: 2, 5: 1})
"""
positions = defaultdict(list) # List of positions where each subsequence appears.
for i in range(len(seq) - k + 1):
positions[seq[i:i + k]].append(i)
return Counter(j - i
for p in positions.itervalues()
for i, j in combinations(p, 2))Context
StackExchange Code Review Q#19409, answer score: 4
Revisions (0)
No revisions yet.