patternpythonMinor
Finding most common contiguous sub-lists in an array of lists
Viewed 0 times
arraysubfindinglistscontiguouscommonmost
Problem
Objective: Given a set of sequences ( eg: step1->step2->step3, step1->step3->step5) ) arranged in an array of lists, count the number of times every contiguous sub-lists occur
Where I need your help:
-
The code below works, but is very slow on my original dataset (100M sequences, with 100 unique steps). Can you help me make this more efficient? For larger datasets, is there a more efficient programming method than just brute force?
-
My code currently depends on each element in the list being a single character. How can I adapt this code to handle multiple-character elements?
Working code:
Where I need your help:
-
The code below works, but is very slow on my original dataset (100M sequences, with 100 unique steps). Can you help me make this more efficient? For larger datasets, is there a more efficient programming method than just brute force?
-
My code currently depends on each element in the list being a single character. How can I adapt this code to handle multiple-character elements?
Working code:
from collections import Counter
sequences = [['A','B'],['A','B','B'],['A','C','A','B']]
counts = Counter()
for sequence in sequences:
input = "".join(sequence)
for j in range(1,len(input)+1):
counts = counts + Counter(input[i:i+j] for i in range(len(input)-(j-1)))
print counts
for x in counts:
print x,":",counts[x]," times"Solution
- Write a test case
When working on performance of code, the first thing to do is to make a reproducible test case. We'll need to have your code in a function:
from collections import Counter
def subsequence_counts_1(sequences):
counts = Counter()
for sequence in sequences:
input = "".join(sequence)
for j in range(1,len(input)+1):
counts = counts + Counter(input[i:i+j] for i in range(len(input)-(j-1)))
return countsand then we need some test data:
import random
import string
def test_data(n, m, choices):
"""Return a list of n lists of m items chosen randomly from choices."""
return [[random.choice(choices) for _ in range(m)] for _ in range(n)]So let's try a small example:
>>> data = test_data(50, 50, string.ascii_uppercase)
>>> from timeit import timeit
>>> timeit(lambda:subsequence_counts_1(data), number=1)
102.42408156394958- Don't build data structures by repeated addition
The main problem with the code is this line:
counts = counts + Counter(...)which has essentially the same effect as:
new_counts = counts + Counter(...)
counts = new_countsThat is, it creates a new
Counter object and populates it with the counts from both of the addends. In particular, this involves copying across the whole contents of counts into the new object.We can avoid all that copying by using the
update method:def subsequence_counts_2(sequences):
counts = Counter()
for sequence in sequences:
input = "".join(sequence)
for j in range(1,len(input)+1):
counts.update(input[i:i+j] for i in range(len(input)-(j-1)))
return countsand this is a thousand times faster:
>>> timeit(lambda:subsequence_counts_2(data), number=1)
0.088530634064227343. Further improvements
There are few more minor improvements that we could make:
-
Instead of having separate iterations over
i and j, we can iterate over both at the same time using itertools.combinations.-
Instead of calling
Counter.update for each sequence, we could do all the work in one comprehension.This results in the following:
from itertools import combinations
def subsequence_counts_3(sequences):
return Counter(seq[i:j]
for seq in map(''.join, sequences)
for i, j in combinations(range(len(seq) + 1), 2))which is about 60% faster than
subsequence_counts_2:>>> timeit(lambda:subsequence_counts_3(data), number=1)
0.052610712591558695But it's still not going to be able to solve your problem in a reasonable amount of time:
>>> data2 = test_data(100, 100, string.ascii_uppercase)
>>> timeit(lambda:subsequence_counts_3(data2), number=1)
0.5441978382878006So to process 100 million sequences of 100 characters would take more than half a million seconds, which is more than six days.
- Using tuples
If you want to handle other kinds of data, convert the sequences to tuples:
def subsequence_counts_4(sequences):
return Counter(seq[i:j]
for seq in map(tuple, sequences)
for i, j in combinations(range(len(seq) + 1), 2))and then you can use any hashable items:
>>> data3 = test_data(50, 50, list(range(10)))
>>> subsequence_counts_4(data3)[(8, 8, 8)]
5Code Snippets
from collections import Counter
def subsequence_counts_1(sequences):
counts = Counter()
for sequence in sequences:
input = "".join(sequence)
for j in range(1,len(input)+1):
counts = counts + Counter(input[i:i+j] for i in range(len(input)-(j-1)))
return countsimport random
import string
def test_data(n, m, choices):
"""Return a list of n lists of m items chosen randomly from choices."""
return [[random.choice(choices) for _ in range(m)] for _ in range(n)]>>> data = test_data(50, 50, string.ascii_uppercase)
>>> from timeit import timeit
>>> timeit(lambda:subsequence_counts_1(data), number=1)
102.42408156394958counts = counts + Counter(...)new_counts = counts + Counter(...)
counts = new_countsContext
StackExchange Code Review Q#108052, answer score: 5
Revisions (0)
No revisions yet.