HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Return "support count" for a given itemset after searching a large list of transactions

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
aftersearchingreturnlistlargefortransactionscountitemsetgiven

Problem

Not having found a satisfactory association rules implementation in Python, I set out to code my own. The code below runs reasonably quickly, but the groceries dataset used for testing is peanuts compared to the actual transactions dataset I'll be using in production (think hundreds of thousands of transactions). I'm looking for any performance improvements. Most of the objects iterated over for creating itemset frequencies are sets, but is there anything else in Python I should be thinking of, speed-wise, that's relevant to the code below? I think the final list comprehension maybe could use some improvement, but I haven't had any luck with multiprocessing.Pool yet.

```
import operator
from itertools import combinations
from bs4 import BeautifulSoup
import urllib2

# separate into support_count and support
def support_count(items):
'''
Let transactions be a list of transactions, n_transactions be the number of elements in that list, and X be an itemset.

In pseudocode, support_count(X) - support count for an itemset of length 1:

count = 0
For t in transactions:
if X is an element of t:
count += 1
return count

support_count((X,Y)) - support count for an itemset of length two:
count = 0
For t in transactions:
if X or Y is an element of t:
count += 1
return count
'''
# if str, then we know we're looking for the support of one item
if type(items) is str:
# if type(items) is str then we know we're searching
# for support_count for one item and can use rows_indiv (list of sets of words)
spec_rows = rows_indiv
count = sum([1 for row in spec_rows if items in row])
# if tuple, then we know we're looking for the support rule between two items
elif type(items) is tuple:
if items[0] == items[1]:
return None
else:
# use rows_all_type_combos to satisfy or condition...
# either the whole

Solution

One simple improvement will be to realize that sum can take a generator expression. It does not need an intermediate list which will take up a lot of memory if the number of transactions becomes large. Same goes for any, which will even use short-circuit evaluation in that case. So you can write:

count = sum(1 for row in spec_rows if items in row)


and

count = sum(1 for row in spec_rows if any(item in row for item in items))


Secondly, your specs rows_indiv seem to be used exactly once (and have the length of the number of entries). You could make it a generator to save a lot of space (and time for memory allocation).

rows_indiv = (set(a) for a in rows)


On the other hand, this is not true for rows_all_type_combos, because some of the list's members are being re-assigned in the for loop after its construction.

Code Snippets

count = sum(1 for row in spec_rows if items in row)
count = sum(1 for row in spec_rows if any(item in row for item in items))
rows_indiv = (set(a) for a in rows)

Context

StackExchange Code Review Q#142660, answer score: 2

Revisions (0)

No revisions yet.