patternpythonMinor
SQL GROUPING SETS in Python using Pandas
Viewed 0 times
pandasgroupingsqlusingpythonsets
Problem
The code below is intended to provide SQL's GROUPING SETS functionality in Python with the aid of Pandas.
Background on SQL GROUPING SETS
There are at least two advantages to doing this in Python:
There is at least one disadvantage to doing this in Python:
I am posting this question in the hopes that the code can be cleaned up. Secondarily I am posting to determine if anyone would find this useful. The example below has been tested with Python 2.7.6 and Pandas 0.15.1.
``
'''
uniqcols = (i for i in grpby if len(df[i].unique()) == 1)
pwrset = (i for i in powerset(grpby))
s = set()
for uniqcol in uniqcols:
for i in pwrset:
if uniqcol in i:
s.add(i) # add a level of aggregation only when non-redundant
workingdf = df.copy()
#for idx,i in enumerate(powerset(grpby)):
for idx,i in enumerate(s):
print(' grouping by: {}'.format(i))
if i != ():
tmp = aggfunc( workingdf.groupby(i) )
else:
# hack to get output to be a DataFrameGroupBy object:
# insert dummy column on which to group by
# old, naive code:
# tmp = aggfunc( workingdf )
dummycolname = hash(tuple(workingdf.columns.tolist()))
Background on SQL GROUPING SETS
There are at least two advantages to doing this in Python:
- it is less verbose than SQL, and
- it provides a non-redundant set of aggregations without the need to manually specify them.
There is at least one disadvantage to doing this in Python:
- memory usage is quite likely higher than would be in SQL.
I am posting this question in the hopes that the code can be cleaned up. Secondarily I am posting to determine if anyone would find this useful. The example below has been tested with Python 2.7.6 and Pandas 0.15.1.
``
from __future__ import division, print_function
import itertools as it
import pandas as pd
from pandas.util.testing import assert_frame_equal
def powerset(iterable):
"powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable)
return it.chain.from_iterable(it.combinations(s,r) for r in range(len(s)+1))
def grouper(df,grpby=None,aggfunc=None):
''' produces aggregate DataFrame from DataFrames for non-redundant groupings
workingdf` is used to avoid modifying original DataFrame'''
uniqcols = (i for i in grpby if len(df[i].unique()) == 1)
pwrset = (i for i in powerset(grpby))
s = set()
for uniqcol in uniqcols:
for i in pwrset:
if uniqcol in i:
s.add(i) # add a level of aggregation only when non-redundant
workingdf = df.copy()
#for idx,i in enumerate(powerset(grpby)):
for idx,i in enumerate(s):
print(' grouping by: {}'.format(i))
if i != ():
tmp = aggfunc( workingdf.groupby(i) )
else:
# hack to get output to be a DataFrameGroupBy object:
# insert dummy column on which to group by
# old, naive code:
# tmp = aggfunc( workingdf )
dummycolname = hash(tuple(workingdf.columns.tolist()))
Solution
About this part:
Revised code:
def grouper(df,grpby=None,aggfunc=None):
uniqcols = (i for i in grpby if len(df[i].unique()) == 1)
pwrset = (i for i in powerset(grpby))
s = set()
for uniqcol in uniqcols:
for i in pwrset:
if uniqcol in i:
s.add(i) # add a level of aggregation only when non-redundant- The two optional arguments are in fact required by the function.
- The convention is that
iis a loop variable of typeint.
- Bug:
pwrsetis a generator expression. As such it can only be iterated through once, just like the iterator returned frompowerset(grpby). Using square brackets[]instead of()would turn it into a list comprehension and fix the bug.
- Changing the other generator expression into a list comprehension would allow you to swap the order of the nested for loops. This would be more natural since the nested loop is filtering
pwrset, and more efficient because you could break out of the inner loop after firsts.add(i). Thensdoes not even need to be aset(because the rest of the code merely iterates overs). If you makeuniqcolsasetinstead, the inner loop could be avoided altogether.
Revised code:
def grouper(df, grpby, aggfunc):
uniqcols = set(col for col in grpby if len(df[col].unique()) == 1)
s = [cols for cols in powerset(grpby) if not uniqcols.isdisjoint(cols)]Code Snippets
def grouper(df,grpby=None,aggfunc=None):
uniqcols = (i for i in grpby if len(df[i].unique()) == 1)
pwrset = (i for i in powerset(grpby))
s = set()
for uniqcol in uniqcols:
for i in pwrset:
if uniqcol in i:
s.add(i) # add a level of aggregation only when non-redundantdef grouper(df, grpby, aggfunc):
uniqcols = set(col for col in grpby if len(df[col].unique()) == 1)
s = [cols for cols in powerset(grpby) if not uniqcols.isdisjoint(cols)]Context
StackExchange Code Review Q#72269, answer score: 2
Revisions (0)
No revisions yet.