
Chi Square Independence Test for Two Pandas DF columns

Submitted by: @import:stackexchange-codereview · Mar 10, 2026

Problem

I want to run scipy.stats.chi2_contingency() on two columns of a pandas DataFrame. The data is categorical, like this:

var1    var2
0       1
1       0
0       2
0       1
0       2

Here is the example data: TU Berlin Server

The task is to build the cross-tabulation (contingency table) of counts for each combination of categories. Example:

         var1
         0    1
---------------------
     0 | 0    1
var2 1 | 2    0
     2 | 2    0
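
For reference, a table with this layout can also be produced directly with pandas.crosstab. The following is only a minimal sketch using the five sample rows shown above; the names `sample` and `table` are illustrative and not part of the original code.

```python
import pandas as pd

# The five sample rows from the example above.
sample = pd.DataFrame(
    {"var1": [0, 1, 0, 0, 0], "var2": [1, 0, 2, 1, 2]}
)

# Rows are the categories of var2, columns the categories of var1.
# Combinations that never occur are counted as 0.
table = pd.crosstab(sample["var2"], sample["var1"])
print(table)
```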

I'm not really a coder, but this is what I got (working):

import pandas as pd
import scipy.stats as scs

def create_list_sum_of_categories(df, var, cat, var2):
    list1 = []
    for cat2 in range(int(df[var2].min()), int(df[var2].max())+1):
        list1.append(len(df[(df[var] == cat) & (df[var2] == cat2)]))
    return list1

def chi_square_of_df_cols(df, col1, col2):
    ''' for each category of col1 create list with sums of each category of col2'''
    result_list = []
    for cat in range(int(df[col1].min()), int(df[col1].max())+1):
        result_list.append(create_list_sum_of_categories(df, col1, cat, col2))

    return scs.chi2_contingency(result_list)

test_df = pd.read_csv('test_data_for_chi_square.csv')
print(chi_square_of_df_cols(test_df, 'var1', 'var2'))

My question has two parts:

Can you confirm that this actually does what I want?

If you have suggestions to make this code more beautiful (e.g. include everything in one function), please go ahead!

Solution

I would use existing pandas features where possible to keep this code minimal. This aids readability and reduces the chance of bugs creeping into complicated loop structures.

import pandas
from scipy.stats import chi2_contingency

def chisq_of_df_cols(df, c1, c2):
    groupsizes = df.groupby([c1, c2]).size()
    ctsum = groupsizes.unstack(c1)
    # unstack leaves NaN for category combinations that never occur;
    # fillna(0) turns them into zero counts before running the test
    return chi2_contingency(ctsum.fillna(0))

test_df = pandas.DataFrame([[0, 1], [1, 0], [0, 2], [0, 1], [0, 2]], columns=['var1', 'var2'])
print(chisq_of_df_cols(test_df, 'var1', 'var2'))
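
The value returned by chi2_contingency can be unpacked into the test statistic, the p-value, the degrees of freedom and the table of expected frequencies. Here is a minimal sketch on the same five sample rows. It builds the table with pandas.crosstab, which is an alternative to the groupby/unstack approach and not part of the answer above, and the variable names are illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

test_df = pd.DataFrame(
    [[0, 1], [1, 0], [0, 2], [0, 1], [0, 2]], columns=["var1", "var2"]
)

# crosstab counts every (var2, var1) combination and fills absent ones with 0,
# so no fillna step is needed here.
observed = pd.crosstab(test_df["var2"], test_df["var1"])

# Unpack the four components of the result.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}")
print(expected)
```

With only five observations, the expected counts are far below the common rule of thumb of about 5 per cell, so the p-value for this toy table is not meaningful; the example only shows the mechanics.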

Context

StackExchange Code Review Q#96761, answer score: 13
