Chi Square Independence Test for Two Pandas DF columns

Submitted by: @import:stackexchange-codereview

Problem

I want to run scipy.stats.chi2_contingency() on two columns of a pandas DataFrame. The data is categorical, like this:

var1 var2
0 1
1 0
0 2
0 1
0 2


Here is the example data: TU Berlin Server

The task is to build the contingency table (cross-tabulated counts) for every combination of categories. Example:

              var1
            0     1
        -------------
      0 |   0     1
var2  1 |   2     0
      2 |   2     0


I'm not really a coder, but this is what I got (working):

import pandas as pd
import scipy.stats as scs

def create_list_sum_of_categories(df, var, cat, var2):
    list1 = []
    for cat2 in range(int(df[var2].min()), int(df[var2].max())+1):
        list1.append( len(df[ (df[var] == cat) & (df[var2] == cat2) ]))
    return list1

def chi_square_of_df_cols(df,col1,col2):
    ''' for each category of col1 create list with sums of each category of col2'''
    result_list = []
    for cat in range(int(df[col1].min()), int(df[col1].max())+1):
        result_list.append(create_list_sum_of_categories(df,col1,cat,col2))

    return scs.chi2_contingency(result_list)

test_df = pd.read_csv('test_data_for_chi_square.csv')
print(chi_square_of_df_cols(test_df,'var1','var2'))


My question is twofold:

  • Can you confirm that this actually does what I want?

  • If you have suggestions for making this code cleaner (e.g. putting everything in one function), please go ahead!

Solution

I would use existing pandas features wherever possible to keep this code minimal. This aids readability and reduces the chance of bugs creeping into complicated loop structures.

import pandas
from scipy.stats import chi2_contingency

def chisq_of_df_cols(df, c1, c2):
    groupsizes = df.groupby([c1, c2]).size()
    ctsum = groupsizes.unstack(c1)
    # Category combinations that never occur become NaN after unstack();
    # replace them with 0 so chi2_contingency receives valid counts.
    return chi2_contingency(ctsum.fillna(0))

test_df = pandas.DataFrame([[0, 1], [1, 0], [0, 2], [0, 1], [0, 2]], columns=['var1', 'var2'])
print(chisq_of_df_cols(test_df, 'var1', 'var2'))
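
As a possible alternative, pandas.crosstab builds the same contingency table in a single call and fills combinations that never occur with 0. The following is a minimal sketch on the same sample data, not part of the original answer; the helper name chisq_via_crosstab is hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chisq_via_crosstab(df, c1, c2):
    """Hypothetical helper: chi-square independence test for two columns."""
    # Rows are the categories of c2, columns the categories of c1;
    # combinations that never occur are counted as 0.
    table = pd.crosstab(df[c2], df[c1])
    # The result unpacks into statistic, p-value, degrees of freedom
    # and the table of expected frequencies.
    chi2, p, dof, expected = chi2_contingency(table)
    return table, chi2, p, dof, expected


test_df = pd.DataFrame(
    [[0, 1], [1, 0], [0, 2], [0, 1], [0, 2]], columns=["var1", "var2"]
)
table, chi2, p, dof, expected = chisq_via_crosstab(test_df, "var1", "var2")
print(table)
print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")
```

Printing the table next to the test result makes it easy to compare the counts against the hand-built contingency table from the question before trusting the statistic.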

Context

StackExchange Code Review Q#96761, answer score: 13
