patternpythonMinor
Conditional Concatenation of a Pandas DataFrame
Viewed 0 times
dataframeconcatenationconditionalpandas
Problem
I am concatenating columns of a Python Pandas Dataframe and want to improve the speed of my code.
My data has the following structure:
I only want to concatenate the contents of the Cherry column if there is actually value in the respective row. If my code works correctly, the result of the example above should be:
My code so far is this:
Any thoughts on how I can improve the speed of my code? Can I run this without an apply statement using only Pandas column operations?
Thanks up front for the help.
My data has the following structure:
Apple Pear Cherry
1 2 3
4 5 NaN
7 8 9I only want to concatenate the contents of the Cherry column if there is actually value in the respective row. If my code works correctly, the result of the example above should be:
Result
1 :: 2 :: 3
4 :: 5
7 :: 8 :: 9My code so far is this:
a_dataframe[result] = a_dataframe.apply(lambda r:
str(r.loc['apple']) + ' :: ' + str(r.loc['pear'])+' :: '+str(r.loc['cherry'])
if pd.notnull(r.loc['cherry']) & (r.loc['cherry'] != "")
# if cherry value is empty, do not add cherry into result
else str(r.loc['apple']) + ' :: ' + str(r.loc['pear']),
axis=1)Any thoughts on how I can improve the speed of my code? Can I run this without an apply statement using only Pandas column operations?
Thanks up front for the help.
Solution
There's no need to create a lambda for this.
Let's suppose we have the following dataframe:
Which is:
An easier way to achieve what you want without the
You'll get this:
Doing this,
The entire program may look like:
There are other things that I added to my answer as:
As @MathiasEttinger suggested, you can also modify the above function to use list comprehension to get a slightly better performance:
I'll let the order of the columns as an exercise for OP.
Let's suppose we have the following dataframe:
my_df = pd.DataFrame({
'Apple': ['1', '4', '7'],
'Pear': ['2', '5', '8'],
'Cherry': ['3', np.nan, '9']})Which is:
Apple Cherry Pear
1 3 2
4 NaN 5
7 9 8An easier way to achieve what you want without the
apply() function is:- use
iterrows()to parse each row one by one.
- use
Series()andstr.cat()to do the merge.
You'll get this:
l = []
for _, row in my_df.iterrows():
l.append(pd.Series(row).str.cat(sep='::'))
empty_df = pd.DataFrame(l, columns=['Result'])Doing this,
NaN will automatically be taken out, and will lead us to the desired result:Result
1::3::2
4::5
7::9::8The entire program may look like:
import pandas as pd
import numpy as np
def merge_columns(my_df):
l = []
for _, row in my_df.iterrows():
l.append(pd.Series(row).str.cat(sep='::'))
empty_df = pd.DataFrame(l, columns=['Result'])
return empty_df.to_string(index=False)
if __name__ == '__main__':
my_df = pd.DataFrame({
'Apple': ['1', '4', '7'],
'Pear': ['2', '5', '8'],
'Cherry': ['3', np.nan, '9']})
print(merge_columns(my_df))There are other things that I added to my answer as:
if __name__ == '__main__'
- added the logic into its own function so that you can reuse it later
As @MathiasEttinger suggested, you can also modify the above function to use list comprehension to get a slightly better performance:
def merge_columns_1(my_df):
l = [pd.Series(row).str.cat(sep='::') for _, row in my_df.iterrows()]
return pd.DataFrame(l, columns=['Result']).to_string(index=False)I'll let the order of the columns as an exercise for OP.
Code Snippets
my_df = pd.DataFrame({
'Apple': ['1', '4', '7'],
'Pear': ['2', '5', '8'],
'Cherry': ['3', np.nan, '9']})Apple Cherry Pear
1 3 2
4 NaN 5
7 9 8l = []
for _, row in my_df.iterrows():
l.append(pd.Series(row).str.cat(sep='::'))
empty_df = pd.DataFrame(l, columns=['Result'])Result
1::3::2
4::5
7::9::8import pandas as pd
import numpy as np
def merge_columns(my_df):
l = []
for _, row in my_df.iterrows():
l.append(pd.Series(row).str.cat(sep='::'))
empty_df = pd.DataFrame(l, columns=['Result'])
return empty_df.to_string(index=False)
if __name__ == '__main__':
my_df = pd.DataFrame({
'Apple': ['1', '4', '7'],
'Pear': ['2', '5', '8'],
'Cherry': ['3', np.nan, '9']})
print(merge_columns(my_df))Context
StackExchange Code Review Q#154550, answer score: 5
Revisions (0)
No revisions yet.