Chi-Square Independence Test for Two Pandas DataFrame Columns
Problem
I want to calculate scipy.stats.chi2_contingency() for two columns of a pandas DataFrame. The data is categorical, like this:
var1  var2
0     1
1     0
0     2
0     1
0     2
Here is the example data: TU Berlin Server
The task is to build the crosstable sums (contingency table) of each category-relationship. Example:
           var1
           0   1
       ---------
     0 |   0   1
var2 1 |   2   0
     2 |   2   0
I'm not really a coder, but this is what I got (working):
# imports implied by the pd and scs aliases used below
import pandas as pd
import scipy.stats as scs

def create_list_sum_of_categories(df, var, cat, var2):
    list1 = []
    for cat2 in range(int(df[var2].min()), int(df[var2].max())+1):
        list1.append( len(df[ (df[var] == cat) & (df[var2] == cat2) ]))
    return list1

def chi_square_of_df_cols(df,col1,col2):
    ''' for each category of col1 create list with sums of each category of col2'''
    result_list = []
    for cat in range(int(df[col1].min()), int(df[col1].max())+1):
        result_list.append(create_list_sum_of_categories(df,col1,cat,col2))
    return scs.chi2_contingency(result_list)

test_df = pd.read_csv('test_data_for_chi_square.csv')
print(chi_square_of_df_cols(test_df,'var1','var2'))
My question is geared towards two things:
- Can you confirm that this actually does what I want?
- If you have suggestions to make this code more beautiful (e.g. include everything in one function), please go ahead!
Solution
I would try to use existing pandas features where possible to keep this code minimal - this aids readability and reduces the possibility of bugs being introduced in complicated loop structures.
import pandas
from scipy.stats import chi2_contingency

def chisq_of_df_cols(df, c1, c2):
    # count the rows for every observed (c1, c2) combination
    groupsizes = df.groupby([c1, c2]).size()
    # pivot the c1 categories into columns to get the contingency table
    ctsum = groupsizes.unstack(c1)
    # unstack leaves NaN where a combination never occurs; fillna(0) turns those into zero counts
    return chi2_contingency(ctsum.fillna(0))

test_df = pandas.DataFrame([[0, 1], [1, 0], [0, 2], [0, 1], [0, 2]], columns=['var1', 'var2'])
chisq_of_df_cols(test_df, 'var1', 'var2')
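If you want something shorter still, pandas.crosstab may be worth a look: it builds the same kind of contingency table in one call and reports combinations that never occur as 0, so fillna is not needed. What follows is only a sketch and not part of the code above; the name chisq_via_crosstab is made up for illustration. It returns the table next to the test result so you can compare it by eye with the example table in the question.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chisq_via_crosstab(df, c1, c2):
    """Hypothetical helper: chi-square test of independence for two columns."""
    # rows are the categories of c2, columns the categories of c1;
    # combinations that never occur are counted as 0
    table = pd.crosstab(df[c2], df[c1])
    return table, chi2_contingency(table)


test_df = pd.DataFrame(
    [[0, 1], [1, 0], [0, 2], [0, 1], [0, 2]], columns=["var1", "var2"]
)
table, result = chisq_via_crosstab(test_df, "var1", "var2")
# the result unpacks into statistic, p-value, degrees of freedom, expected counts
chi2, p, dof, expected = result
print(table)
print(chi2, p, dof)
```

For the five example rows the printed table should match the one in the question, which is a quick way to confirm that a hand-written loop and the pandas version count the same thing.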
Context
StackExchange Code Review Q#96761, answer score: 13