HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

SQL GROUPING SETS in Python using Pandas

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
pandasgroupingsqlusingpythonsets

Problem

The code below is intended to provide SQL's GROUPING SETS functionality in Python with the aid of Pandas.

Background on SQL GROUPING SETS

There are at least two advantages to doing this in Python:

  • it is less verbose than SQL, and



  • it provides a non-redundant set of aggregations without the need to manually specify them.



There is at least one disadvantage to doing this in Python:

  • memory usage is quite likely higher than would be in SQL.



I am posting this question in the hopes that the code can be cleaned up. Secondarily I am posting to determine if anyone would find this useful. The example below has been tested with Python 2.7.6 and Pandas 0.15.1.

``
from __future__ import division, print_function

import itertools as it
import pandas as pd

from pandas.util.testing import assert_frame_equal

def powerset(iterable):
"powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable)
return it.chain.from_iterable(it.combinations(s,r) for r in range(len(s)+1))

def grouper(df,grpby=None,aggfunc=None):
''' produces aggregate DataFrame from DataFrames for non-redundant groupings
workingdf` is used to avoid modifying original DataFrame
'''
uniqcols = (i for i in grpby if len(df[i].unique()) == 1)
pwrset = (i for i in powerset(grpby))
s = set()
for uniqcol in uniqcols:
for i in pwrset:
if uniqcol in i:
s.add(i) # add a level of aggregation only when non-redundant

workingdf = df.copy()
#for idx,i in enumerate(powerset(grpby)):
for idx,i in enumerate(s):
print(' grouping by: {}'.format(i))
if i != ():
tmp = aggfunc( workingdf.groupby(i) )
else:
# hack to get output to be a DataFrameGroupBy object:
# insert dummy column on which to group by
# old, naive code:
# tmp = aggfunc( workingdf )
dummycolname = hash(tuple(workingdf.columns.tolist()))

Solution

About this part:

def grouper(df,grpby=None,aggfunc=None):
    uniqcols = (i for i in grpby if len(df[i].unique()) == 1)
    pwrset = (i for i in powerset(grpby))
    s = set()
    for uniqcol in uniqcols:
        for i in pwrset:
            if uniqcol in i:
                s.add(i) # add a level of aggregation only when non-redundant


  • The two optional arguments are in fact required by the function.



  • The convention is that i is a loop variable of type int.



  • Bug: pwrset is a generator expression. As such it can only be iterated through once, just like the iterator returned from powerset(grpby). Using square brackets [] instead of () would turn it into a list comprehension and fix the bug.



  • Changing the other generator expression into a list comprehension would allow you to swap the order of the nested for loops. This would be more natural since the nested loop is filtering pwrset, and more efficient because you could break out of the inner loop after first s.add(i). Then s does not even need to be a set (because the rest of the code merely iterates over s). If you make uniqcols a set instead, the inner loop could be avoided altogether.



Revised code:

def grouper(df, grpby, aggfunc):
    uniqcols = set(col for col in grpby if len(df[col].unique()) == 1)
    s = [cols for cols in powerset(grpby) if not uniqcols.isdisjoint(cols)]

Code Snippets

def grouper(df,grpby=None,aggfunc=None):
    uniqcols = (i for i in grpby if len(df[i].unique()) == 1)
    pwrset = (i for i in powerset(grpby))
    s = set()
    for uniqcol in uniqcols:
        for i in pwrset:
            if uniqcol in i:
                s.add(i) # add a level of aggregation only when non-redundant
def grouper(df, grpby, aggfunc):
    uniqcols = set(col for col in grpby if len(df[col].unique()) == 1)
    s = [cols for cols in powerset(grpby) if not uniqcols.isdisjoint(cols)]

Context

StackExchange Code Review Q#72269, answer score: 2

Revisions (0)

No revisions yet.