Undoing corrections to a big dataframe

Submitted by: @import:stackexchange-codereview

Problem

I have two dataframes. The first (900 rows) contains corrections that have been applied to deals. The second (140,000 rows) contains the list of deals with corrected values. What I am trying to do is put the old values back.

To link the corrected deals to the corrections I have to compare a number of attributes. In the correction dataframe (900 rows) I have the old and the new value for each corrected attribute. But each correction can affect a different attribute, so I check every possible corrected attribute (in the correction dataframe), comparing the new value with the old one to see whether that attribute was corrected. If it was, I put the old value back. Note that a single correction can apply to several deals that share the same data in the identifying fields.

To finish, I create a new column on the Deals dataframe (140,000 rows) holding a boolean that is true when a deal has been uncorrected and false otherwise.

My code right now is quite gross. I wanted to factor it a bit, but the iteration process blocked me. It runs, but it has to go through 900 × 140,000 row pairs. I ran it overnight (14 h) on a quad-core VM with 12 GB RAM and it only got through 150 × 140,000 in that time.

How can I improve performance?

```
def Uncorrection(Correction,dataframe):
    dataframe['Modified']=np.nan
    #getting the link between the corrections and deals
    b=0

    for index in Correction.index:
        b+=1 #just values to see progression of the program
        c=0
        for index1 in dataframe.index:
            c+=1
            a=0
            print('Handling correction '+str(b)+' and deal '+str(c)) # printing progress
            if (Correction.loc[index,'BO Branch Code']==dataframe.loc[index1,'Wings Branch'] and Correction.loc[index,'Profit Center']==dataframe.loc[index1,'Profit Center'] and Correction.loc[index,'Back Office']==dataframe.loc[index1,'Back Office']
                and Correction.loc[index,'BO System Code']==dataframe.loc[index1,'BO System
```

Solution

Would you please close out this question and repost it as a new, cleaned-up question? I get the sense that your code and your needs have been evolving recently. The code you posted is difficult to evaluate, it definitely conforms to your assessment of "gross", and your recent edits have likely cleaned it up. Posting the time it takes to execute a single iteration (out of 140k iterations) would also be helpful.
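The per-iteration timing mentioned above could be gathered along these lines. This is only a sketch: `process_pair` is a hypothetical stand-in for the body of the inner loop, not a function from the posted code, and the sample rows are made up.

```python
import time


def process_pair(correction, deal):
    """Hypothetical stand-in for the work done for one correction/deal pair."""
    return correction["Profit Center"] == deal["Profit Center"]


def time_one_iteration(correction, deal, repeats=1000):
    """Return the average wall-clock seconds spent on one correction/deal pair."""
    start = time.perf_counter()
    for _ in range(repeats):
        process_pair(correction, deal)
    return (time.perf_counter() - start) / repeats


# Made-up sample rows; real code would pass actual rows from the two dataframes.
per_pair = time_one_iteration({"Profit Center": "A"}, {"Profit Center": "A"})

# Extrapolate to the full 900 x 140,000 nested loop described in the question.
estimated_total_seconds = per_pair * 900 * 140_000
```

Averaging over many repeats smooths out timer noise, and the extrapolated figure gives reviewers a sense of scale without an overnight run.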

DRY - don't repeat yourself

I'm looking at clauses like this:

if Correction.loc[index,'Back Office Seniority'] != Correction.loc[index,'Back Office Seniority _M']:
    dataframe.loc[index1,'BO Seniority'] = Correction.loc[index,'Back Office Seniority']
    a = 1


It's pretty clear you have a need for modeling synonyms. That is, you need a dictionary that maps e.g. 'Back Office Seniority' -> 'BO Seniority'.

With that in hand, you could turn lots of ifs into just one if in the middle of a loop. It might not affect performance, but it would have a very very strong effect on how reviewers interact with your code.
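A minimal sketch of that idea, assuming pandas and a `SYNONYMS` dictionary. Only the 'Back Office Seniority' -> 'BO Seniority' pair comes from the posted code; the second pair, the function name `restore_old_values`, and the sample data are hypothetical.

```python
import pandas as pd

# Correction column name -> Deals column name.
# The first pair is taken from the quoted code; the second is hypothetical.
SYNONYMS = {
    "Back Office Seniority": "BO Seniority",
    "Example Field": "Example Fld",
}


def restore_old_values(correction, deals, deal_index, synonyms=SYNONYMS):
    """Put old values back on one deal; return True if anything was changed."""
    modified = False
    for corr_col, deal_col in synonyms.items():
        # In the quoted code the unsuffixed column holds the value that is
        # written back, and the ' _M' column holds the corrected value.
        if correction[corr_col] != correction[corr_col + " _M"]:
            deals.loc[deal_index, deal_col] = correction[corr_col]
            modified = True
    return modified


# Made-up sample data, one correction and one deal.
corrections = pd.DataFrame({
    "Back Office Seniority": [3], "Back Office Seniority _M": [5],
    "Example Field": ["A"], "Example Field _M": ["A"],
})
deals = pd.DataFrame({"BO Seniority": [5], "Example Fld": ["A"]})

changed = restore_old_values(corrections.loc[0], deals, 0)
```

The returned flag can feed the 'Modified' column directly, which replaces the hand-maintained `a = 1` assignments in each branch.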

Also, there seems to be a

if Correction.loc[index, foo] != Correction.loc[index, foo + ' _M']:


interaction going on that your code should explicitly model, rather than using copy-n-paste string constants.
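One way to model that pairing explicitly is to derive it from the column names instead of typing each pair by hand. The helper names and sample columns below are hypothetical; only the `' _M'` suffix and 'Back Office Seniority' come from the quoted code.

```python
MODIFIED_SUFFIX = " _M"  # suffix seen in the quoted code


def corrected_field_names(columns, suffix=MODIFIED_SUFFIX):
    """Return base column names that have a matching '<name><suffix>' column."""
    present = set(columns)
    return [
        name for name in columns
        if not name.endswith(suffix) and name + suffix in present
    ]


def was_corrected(row, field, suffix=MODIFIED_SUFFIX):
    """True when the base value differs from the value in the suffixed column."""
    return row[field] != row[field + suffix]


# Made-up column list and row; `row` can be a dict or a pandas Series.
columns = ["Back Office Seniority", "Back Office Seniority _M", "BO Trade Id"]
row = {"Back Office Seniority": 3, "Back Office Seniority _M": 5, "BO Trade Id": "T1"}

fields = corrected_field_names(columns)
flags = {field: was_corrected(row, field) for field in fields}
```

With the suffix held in one constant, a renamed column or a new corrected attribute no longer requires another copy-pasted `if`.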

On a separate topic, I'm looking at this:

if (((Correction.get_value(i,'Emetteur Trade Id').strip()==dataframe.get_value(j,'Emetteur Trade Id').strip()) and Correction.get_value(i,'Emetteur Trade Id').strip()!='#') or
        (Correction.get_value(i,'Emetteur Trade Id').strip()=='#' and Correction.get_value(i,'BO Trade Id').strip()==dataframe.get_value(j,'Trade Id').strip())):
    print ('level 2 success')
    # dataframe.set_value(j, 'Modified', 2)
    if (int(Correction.get_value(i,'UE'))==int(dataframe.get_value(j,'Entity')) and Correction.get_value(i,'Id Ricos').strip()==dataframe.get_value(j,'Siris Id').strip()):
        print ('level 4 success')


Is level 3 like Fight Club? We just don't talk about it?
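Numbered levels are easier to follow when each check has a name. The sketch below restates the two quoted conditions as small predicates; the function names and sample rows are hypothetical, and the rows can be dicts or pandas Series.

```python
def trade_id_matches(corr, deal):
    """Compare 'Emetteur Trade Id', or the BO trade id when the former is '#'."""
    emetteur = corr["Emetteur Trade Id"].strip()
    if emetteur != "#":
        return emetteur == deal["Emetteur Trade Id"].strip()
    return corr["BO Trade Id"].strip() == deal["Trade Id"].strip()


def entity_matches(corr, deal):
    """Compare the entity number and the Ricos/Siris identifier."""
    return (int(corr["UE"]) == int(deal["Entity"])
            and corr["Id Ricos"].strip() == deal["Siris Id"].strip())


def correction_applies(corr, deal):
    """True when both named checks pass for this correction/deal pair."""
    return trade_id_matches(corr, deal) and entity_matches(corr, deal)


# Made-up sample rows.
corr = {"Emetteur Trade Id": "# ", "BO Trade Id": "T1", "UE": "7", "Id Ricos": "R9 "}
deal = {"Emetteur Trade Id": "X", "Trade Id": " T1", "Entity": 7, "Siris Id": "R9"}

applies = correction_applies(corr, deal)
```

Each predicate can then be tested on its own, and a missing check becomes obvious because it has no function.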

The code you posted may "work" in the sense that it produces useful output, but it does not appear to be ready for a code review. You clearly have some ideas about how to usefully refactor it. I invite you to apply some of those ideas and to repost. We will still be here, ready to review!

Context

StackExchange Code Review Q#156584, answer score: 2
