HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Finding most common contiguous sub-lists in an array of lists

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
arraysubfindinglistscontiguouscommonmost

Problem

Objective: Given a set of sequences ( eg: step1->step2->step3, step1->step3->step5) ) arranged in an array of lists, count the number of times every contiguous sub-lists occur

Where I need your help:

-
The code below works, but is very slow on my original dataset (100M sequences, with 100 unique steps). Can you help me make this more efficient? For larger datasets, is there a more efficient programming method than just brute force?

-
My code currently depends on each element in the list being a single character. How can I adapt this code to handle multiple-character elements?

Working code:

from collections import Counter

sequences = [['A','B'],['A','B','B'],['A','C','A','B']]
counts = Counter()

for sequence in sequences:
    input = "".join(sequence)

    for j in range(1,len(input)+1):
        counts = counts + Counter(input[i:i+j] for i in range(len(input)-(j-1)))
        print counts

for x in counts:
    print x,":",counts[x]," times"

Solution


  1. Write a test case



When working on performance of code, the first thing to do is to make a reproducible test case. We'll need to have your code in a function:

from collections import Counter

def subsequence_counts_1(sequences):
    counts = Counter()
    for sequence in sequences:
        input = "".join(sequence)
        for j in range(1,len(input)+1):
            counts = counts + Counter(input[i:i+j] for i in range(len(input)-(j-1)))
    return counts


and then we need some test data:

import random
import string

def test_data(n, m, choices):
    """Return a list of n lists of m items chosen randomly from choices."""
    return [[random.choice(choices) for _ in range(m)] for _ in range(n)]


So let's try a small example:

>>> data = test_data(50, 50, string.ascii_uppercase)
>>> from timeit import timeit
>>> timeit(lambda:subsequence_counts_1(data), number=1)
102.42408156394958


  1. Don't build data structures by repeated addition



The main problem with the code is this line:

counts = counts + Counter(...)


which has essentially the same effect as:

new_counts = counts + Counter(...)
counts = new_counts


That is, it creates a new Counter object and populates it with the counts from both of the addends. In particular, this involves copying across the whole contents of counts into the new object.

We can avoid all that copying by using the update method:

def subsequence_counts_2(sequences):
    counts = Counter()
    for sequence in sequences:
        input = "".join(sequence)
        for j in range(1,len(input)+1):
            counts.update(input[i:i+j] for i in range(len(input)-(j-1)))
    return counts


and this is a thousand times faster:

>>> timeit(lambda:subsequence_counts_2(data), number=1)
0.08853063406422734


 3. Further improvements

There are few more minor improvements that we could make:

-
Instead of having separate iterations over i and j, we can iterate over both at the same time using itertools.combinations.

-
Instead of calling Counter.update for each sequence, we could do all the work in one comprehension.

This results in the following:

from itertools import combinations

def subsequence_counts_3(sequences):
    return Counter(seq[i:j]
                   for seq in map(''.join, sequences)
                   for i, j in combinations(range(len(seq) + 1), 2))


which is about 60% faster than subsequence_counts_2:

>>> timeit(lambda:subsequence_counts_3(data), number=1)
0.052610712591558695


But it's still not going to be able to solve your problem in a reasonable amount of time:

>>> data2 = test_data(100, 100, string.ascii_uppercase)
>>> timeit(lambda:subsequence_counts_3(data2), number=1)
0.5441978382878006


So to process 100 million sequences of 100 characters would take more than half a million seconds, which is more than six days.

  1. Using tuples



If you want to handle other kinds of data, convert the sequences to tuples:

def subsequence_counts_4(sequences):
    return Counter(seq[i:j]
                   for seq in map(tuple, sequences)
                   for i, j in combinations(range(len(seq) + 1), 2))


and then you can use any hashable items:

>>> data3 = test_data(50, 50, list(range(10)))
>>> subsequence_counts_4(data3)[(8, 8, 8)]
5

Code Snippets

from collections import Counter

def subsequence_counts_1(sequences):
    counts = Counter()
    for sequence in sequences:
        input = "".join(sequence)
        for j in range(1,len(input)+1):
            counts = counts + Counter(input[i:i+j] for i in range(len(input)-(j-1)))
    return counts
import random
import string

def test_data(n, m, choices):
    """Return a list of n lists of m items chosen randomly from choices."""
    return [[random.choice(choices) for _ in range(m)] for _ in range(n)]
>>> data = test_data(50, 50, string.ascii_uppercase)
>>> from timeit import timeit
>>> timeit(lambda:subsequence_counts_1(data), number=1)
102.42408156394958
counts = counts + Counter(...)
new_counts = counts + Counter(...)
counts = new_counts

Context

StackExchange Code Review Q#108052, answer score: 5

Revisions (0)

No revisions yet.