patternpythonMinor
Return "support count" for a given itemset after searching a large list of transactions
Viewed 0 times
aftersearchingreturnlistlargefortransactionscountitemsetgiven
Problem
Not having found a satisfactory association rules implementation in Python, I set out to code my own. The code below runs reasonably quickly, but the groceries dataset used for testing is peanuts compared to the actual transactions dataset I'll be using in production (think hundreds of thousands of transactions). I'm looking for any performance improvements. Most of the objects iterated over for creating itemset frequencies are
```
import operator
from itertools import combinations
from bs4 import BeautifulSoup
import urllib2
# separate into support_count and support
def support_count(items):
'''
Let transactions be a list of transactions, n_transactions be the number of elements in that list, and X be an itemset.
In pseudocode, support_count(X) - support count for an itemset of length 1:
count = 0
For t in transactions:
if X is an element of t:
count += 1
return count
support_count((X,Y)) - support count for an itemset of length two:
count = 0
For t in transactions:
if X or Y is an element of t:
count += 1
return count
'''
# if str, then we know we're looking for the support of one item
if type(items) is str:
# if type(items) is str then we know we're searching
# for support_count for one item and can use rows_indiv (list of sets of words)
spec_rows = rows_indiv
count = sum([1 for row in spec_rows if items in row])
# if tuple, then we know we're looking for the support rule between two items
elif type(items) is tuple:
if items[0] == items[1]:
return None
else:
# use rows_all_type_combos to satisfy or condition...
# either the whole
sets, but is there anything else in Python I should be thinking of, speed-wise, that's relevant to the code below? I think the final list comprehension maybe could use some improvement, but I haven't had any luck with multiprocessing.Pool yet.```
import operator
from itertools import combinations
from bs4 import BeautifulSoup
import urllib2
# separate into support_count and support
def support_count(items):
'''
Let transactions be a list of transactions, n_transactions be the number of elements in that list, and X be an itemset.
In pseudocode, support_count(X) - support count for an itemset of length 1:
count = 0
For t in transactions:
if X is an element of t:
count += 1
return count
support_count((X,Y)) - support count for an itemset of length two:
count = 0
For t in transactions:
if X or Y is an element of t:
count += 1
return count
'''
# if str, then we know we're looking for the support of one item
if type(items) is str:
# if type(items) is str then we know we're searching
# for support_count for one item and can use rows_indiv (list of sets of words)
spec_rows = rows_indiv
count = sum([1 for row in spec_rows if items in row])
# if tuple, then we know we're looking for the support rule between two items
elif type(items) is tuple:
if items[0] == items[1]:
return None
else:
# use rows_all_type_combos to satisfy or condition...
# either the whole
Solution
One simple improvement will be to realize that
and
Secondly, your specs
On the other hand, this is not true for
sum can take a generator expression. It does not need an intermediate list which will take up a lot of memory if the number of transactions becomes large. Same goes for any, which will even use short-circuit evaluation in that case. So you can write:count = sum(1 for row in spec_rows if items in row)and
count = sum(1 for row in spec_rows if any(item in row for item in items))Secondly, your specs
rows_indiv seem to be used exactly once (and have the length of the number of entries). You could make it a generator to save a lot of space (and time for memory allocation).rows_indiv = (set(a) for a in rows)On the other hand, this is not true for
rows_all_type_combos, because some of the list's members are being re-assigned in the for loop after its construction.Code Snippets
count = sum(1 for row in spec_rows if items in row)count = sum(1 for row in spec_rows if any(item in row for item in items))rows_indiv = (set(a) for a in rows)Context
StackExchange Code Review Q#142660, answer score: 2
Revisions (0)
No revisions yet.