HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Conditional Concatenation of a Pandas DataFrame

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
dataframeconcatenationconditionalpandas

Problem

I am concatenating columns of a Python Pandas Dataframe and want to improve the speed of my code.

My data has the following structure:

Apple    Pear    Cherry
    1        2       3
    4        5       NaN
    7        8       9


I only want to concatenate the contents of the Cherry column if there is actually value in the respective row. If my code works correctly, the result of the example above should be:

Result
    1 :: 2 :: 3
    4 :: 5
    7 :: 8 :: 9


My code so far is this:

a_dataframe[result] = a_dataframe.apply(lambda r:
          str(r.loc['apple']) + ' :: ' + str(r.loc['pear'])+' :: '+str(r.loc['cherry'])
          if pd.notnull(r.loc['cherry']) & (r.loc['cherry'] != "")
          # if cherry value is empty, do not add cherry into result
          else str(r.loc['apple']) + ' :: ' + str(r.loc['pear']),
          axis=1)


Any thoughts on how I can improve the speed of my code? Can I run this without an apply statement using only Pandas column operations?

Thanks up front for the help.

Solution

There's no need to create a lambda for this.

Let's suppose we have the following dataframe:

my_df = pd.DataFrame({
    'Apple':  ['1', '4', '7'],
    'Pear':   ['2', '5', '8'],
    'Cherry': ['3', np.nan, '9']})


Which is:

Apple Cherry Pear
   1      3    2
   4    NaN    5
   7      9    8


An easier way to achieve what you want without the apply() function is:

  • use iterrows() to parse each row one by one.



  • use Series() and str.cat() to do the merge.



You'll get this:

l = []
for _, row in my_df.iterrows():
    l.append(pd.Series(row).str.cat(sep='::'))

empty_df = pd.DataFrame(l, columns=['Result'])


Doing this, NaN will automatically be taken out, and will lead us to the desired result:

Result
1::3::2
   4::5
7::9::8


The entire program may look like:

import pandas as pd
import numpy as np

def merge_columns(my_df):
    l = []
    for _, row in my_df.iterrows():
        l.append(pd.Series(row).str.cat(sep='::'))
    empty_df = pd.DataFrame(l, columns=['Result'])

    return empty_df.to_string(index=False)

if __name__ == '__main__':
    my_df = pd.DataFrame({
        'Apple': ['1', '4', '7'],
        'Pear': ['2', '5', '8'],
        'Cherry': ['3', np.nan, '9']})
    print(merge_columns(my_df))


There are other things that I added to my answer as:

  • if __name__ == '__main__'



  • added the logic into its own function so that you can reuse it later



As @MathiasEttinger suggested, you can also modify the above function to use list comprehension to get a slightly better performance:

def merge_columns_1(my_df):
    l = [pd.Series(row).str.cat(sep='::') for _, row in my_df.iterrows()]

    return pd.DataFrame(l, columns=['Result']).to_string(index=False)


I'll let the order of the columns as an exercise for OP.

Code Snippets

my_df = pd.DataFrame({
    'Apple':  ['1', '4', '7'],
    'Pear':   ['2', '5', '8'],
    'Cherry': ['3', np.nan, '9']})
Apple Cherry Pear
   1      3    2
   4    NaN    5
   7      9    8
l = []
for _, row in my_df.iterrows():
    l.append(pd.Series(row).str.cat(sep='::'))

empty_df = pd.DataFrame(l, columns=['Result'])
Result
1::3::2
   4::5
7::9::8
import pandas as pd
import numpy as np


def merge_columns(my_df):
    l = []
    for _, row in my_df.iterrows():
        l.append(pd.Series(row).str.cat(sep='::'))
    empty_df = pd.DataFrame(l, columns=['Result'])

    return empty_df.to_string(index=False)


if __name__ == '__main__':
    my_df = pd.DataFrame({
        'Apple': ['1', '4', '7'],
        'Pear': ['2', '5', '8'],
        'Cherry': ['3', np.nan, '9']})
    print(merge_columns(my_df))

Context

StackExchange Code Review Q#154550, answer score: 5

Revisions (0)

No revisions yet.