HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Very slow Python Combination Generator

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
combinationslowgeneratorpythonvery

Problem

I have the following function:

from collections import Counter

import itertools

def combinationsWithMaximumCombinedIntervals(data, windowWidth, maximumCombinedIntervals):

    for indices in itertools.combinations(range(len(data)),windowWidth):
        previous = indices[0]
        combinedIntervals = 0
        for i in indices[1:]:
            combinedIntervals = combinedIntervals + i-previous-1 
            if combinedIntervals > maximumCombinedIntervals:                    
                break
            previous = i
        else:   
            yield(tuple(data[i] for i in indices))

test = ["a","b","c","d","e","f","g","h","i","j","k","l"]
combinations = combinationsWithMaximumCombinedIntervals(test, 5, 2)

for i in combinations:
    pass


The function is associated with a set of combinations for the data supplied to it. Each element in the list represents a combination with the length of each of these combinations equal to windowLength.

If we take the flowing combination ("a","b","c","d","f"), then the combined interval is 1 since "d" to "f" is a hop of 1. If we take the flowing combination ("a","b","c","e","g"), then the combined interval is 2 since "c" to "e" and "e" to "g" are hops of 1 each. Finally, combination ("a","b","c","d","g") has a combined interval of 2 since "d" to "g" is a hop of 2. The maximumCombinedIntervals argument limits the number of combinations based on the combined interval. In other words, this combination is not available when maximumCombinedIntervals is 2: ("a","c","e","f","h") since the combination of hops is now greater than 2.

I wish to run this function on data with a length of many thousands, a windowWidth in the order of 10 and a maximumCombinedIntervals under 10. However this takes forever to execute.

Can anyone suggest how I might refactor this function so that it will run much more quickly in Python. itertools.combinations is implemented in C.

Here is a similar function which I'm also interested in using:

Solution

As for your actual algorithm, you need to change it. Filtering down python's generating makes things relatively simple. However, in this case you are going throw away way more items then you keep. As a result, this isn't going to be an efficient method.

What I'd try is iterating over all possible starting positions. For each starting position I'd take the next windowWidth+maximumIntervalSize items. Then I'd use itertools.combinations to grab each possible set of maximumIntervalSize items to remove from the list.

EDIT

Review of new code:

def combinationsWithMaximumCombinedIntervals2(data, combinationWidth, maximumCombinedIntervals):


The python style guide recommends lowercase_with_underscores for function and parameter names. The name are also just a bit on the longish side. (Which is way better then being too short or cryptic)

windowWidth = combinationWidth + maximumCombinedIntervals

    combinations = []

    for startingIndex in range(len(data)-windowWidth+1):
        for indices in itertools.combinations(range(startingIndex, startingIndex+windowWidth), combinationWidth):


If possible, try to avoid looping over indexes. Code is cleaner if it loops over the data elements.

if not indices in combinations:


You use if indices not in combinations for a slightly cleaner expression

yield(tuple(data[j] for j in indices))


You don't need the parenthesis for yield, it is a statement not a function.

combinations.append(indices)


Combinations would be faster to check if it were a set rather then a list.

But the real question here is how to avoid the duplicates being generated in the first place. We simple restrict the combinations generated to include the startingIndex. This prevents duplicates because no later window can produce the same combinations. (They don't have the startingIndex in them). But it still produces all values because all combinations without the startingIndex will still be contained in the next window. (The next window is the current window except without startingIndex and with an additional value) The only tricky part is the end, when there are no more windows. This is most easily handled by continuing to generate just with a smaller window size for as long as possible.

My solution (tested to make sure it produces the same results as yours, so I don't make any more silly mistakes):

def combinationsWithMaximumCombinedIntervals3(data, combinationWidth, maximumCombinedIntervals):
    windowWidth = combinationWidth + maximumCombinedIntervals

    for startingIndex in range(len(data)-combinationWidth+1):
        window = data[startingIndex + 1: startingIndex + windowWidth]
        for indices in itertools.combinations(window, combinationWidth - 1):
            yield (data[startingIndex],) + indices

Code Snippets

def combinationsWithMaximumCombinedIntervals2(data, combinationWidth, maximumCombinedIntervals):
windowWidth = combinationWidth + maximumCombinedIntervals

    combinations = []

    for startingIndex in range(len(data)-windowWidth+1):
        for indices in itertools.combinations(range(startingIndex, startingIndex+windowWidth), combinationWidth):
if not indices in combinations:
yield(tuple(data[j] for j in indices))
combinations.append(indices)

Context

StackExchange Code Review Q#4682, answer score: 4

Revisions (0)

No revisions yet.