Subtract multiple columns in PANDAS DataFrame by a series (single column)

Submitted by: @import:stackexchange-codereview

Problem

Background

I have a large number of very large pandas DataFrames that need to be normalized with the following operation: log2(data) - mean(log2(data)), where the mean is taken across each row.

Example Data

The example DataFrame my_df looks like this:

     iovrrx    nfinsu    mvdfjc    idjges    fubmrg    lvuhfv
0  0.987654  0.206104  0.802920  0.011157  0.860618  0.575871
1  0.706397  0.860083  0.939230  0.436194  0.557081  0.706964
2  0.043139  0.729435  0.597488  0.700998  0.974193  0.917758
3  0.316080  0.461547  0.844540  0.510143  0.908475  0.877330
4  0.828839  0.177670  0.610833  0.328238  0.327697  0.689756


Question

I have tried to perform the normalization described above in many different ways; however, the following code snippet is the only one that I have gotten to work:

log_div_ave = my_df.apply(np.log2).values.T - my_df.apply(np.log2).mean(axis=1).values

log_div_ave = pd.DataFrame(log_div_ave.T,columns=my_df.columns)

print(log_div_ave)

     iovrrx    nfinsu    mvdfjc    idjges    fubmrg    lvuhfv
0  1.667378 -0.593258  1.368628 -4.800610  1.468744  0.889117
1  0.056992  0.340988  0.467991 -0.638518 -0.285601  0.058149
2 -3.467018  0.612699  0.324830  0.555330  1.030127  0.944032
3 -0.941776 -0.395590  0.476099 -0.251165  0.581380  0.531053
4  0.933714 -1.288174  0.493400 -0.402633 -0.405015  0.668708


As you can see, I'm converting the DataFrame to a NumPy array and transposing it just so I can subtract the row means of the data. I then have to transpose the resulting array and reconstitute it as a DataFrame. Is there a simpler way to do all of this?

Solution

There's no need to transpose. You can subtract along either axis of a DataFrame using its subtract method.

First, take the log base 2 of your DataFrame. Using apply is fine, but you can also pass a DataFrame directly to NumPy functions.
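
As a quick check of that point, here is a minimal sketch comparing the two ways of taking the log. The toy_df DataFrame and its column names are made up for illustration and are not the data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the DataFrame from the question.
toy_df = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [8.0, 16.0, 32.0]})

via_apply = toy_df.apply(np.log2)  # applies np.log2 column by column
via_ufunc = np.log2(toy_df)        # passes the whole DataFrame to NumPy

# Both results are DataFrames with the original index and column labels.
print(type(via_ufunc).__name__)
print(np.allclose(via_ufunc.to_numpy(), via_apply.to_numpy()))
```

Passing the DataFrame directly avoids the per-column Python-level call that apply makes, although for a plain elementwise function like this the results should be the same.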

Store the log base 2 DataFrame so you can use its subtract method. You can also reuse it when you take the mean of each row.

Finally, subtract along the index axis: from each column of the log2 DataFrame, subtract the row means, matched by row label.

log2df = np.log2(my_df)
log2mean = log2df.mean(axis='columns')
log_div_ave = log2df.subtract(log2mean, axis='index')
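
To see the three steps end to end, the sketch below rebuilds my_df from the values printed in the question (rounded to six decimals, so results agree with the question's output only approximately) and compares the result with the question's transpose-based version. The names original, matches_original and rows_centered are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Example data copied from the question, as printed (6 decimal places).
my_df = pd.DataFrame(
    [
        [0.987654, 0.206104, 0.802920, 0.011157, 0.860618, 0.575871],
        [0.706397, 0.860083, 0.939230, 0.436194, 0.557081, 0.706964],
        [0.043139, 0.729435, 0.597488, 0.700998, 0.974193, 0.917758],
        [0.316080, 0.461547, 0.844540, 0.510143, 0.908475, 0.877330],
        [0.828839, 0.177670, 0.610833, 0.328238, 0.327697, 0.689756],
    ],
    columns=["iovrrx", "nfinsu", "mvdfjc", "idjges", "fubmrg", "lvuhfv"],
)

log2df = np.log2(my_df)
log2mean = log2df.mean(axis="columns")                  # one mean per row
log_div_ave = log2df.subtract(log2mean, axis="index")   # align on row labels

# The question's transpose-based computation, kept only for comparison.
original = pd.DataFrame(
    (log2df.to_numpy().T - log2mean.to_numpy()).T, columns=my_df.columns
)

matches_original = np.allclose(log_div_ave.to_numpy(), original.to_numpy())
rows_centered = np.allclose(log_div_ave.mean(axis="columns").to_numpy(), 0.0)
print(matches_original, rows_centered)
```

The axis='index' argument matters here: a plain log2df - log2mean aligns the Series index with the DataFrame's column labels, which do not match in this example, so the result would be filled with NaN.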

Context

StackExchange Code Review Q#156447, answer score: 3
