HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Rating tennis players in a database, taking days to run

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
ratingplayersdatabasetennisdaysruntaking

Problem

I have this project in data analysis for creating a ranking of tennis players. Currently, it takes more than 6 days to run on my computer.

Can you review the code and see where's the problem?

Project steps:

-
I have a database of 600,000 tennis matches called matchdatabase.The database fileds are a) winner name, b) loser name, c) tournament, d) other fields for the winner and loser.

-
From that database, I create a playerdatabase with every player in the matchdatabase.

-
For each match in the matchdatabase it goes into the playerdatabase, retrieves the ranking/elo and computes the expected result.

-
It updates the ranking after the match into the playerdatabase

This for loop ends up running 1 match/second, so the whole database takes several days to run!

```
import pandas as pd
import glob
import numpy as np
import math

all_data = pd.read_csv('tennisdatabase.csv')
all_data = all_data.sort(['date'], ascending=[0])
all_data = all_data.reindex(index = np.arange(1, len(all_data) + 1))

#it checks every player in the matchdatabase and creates a database of players

playerdatabase = pd.DataFrame()
list_winners = pd.pivot_table(all_data,index=["winner_name"],values=["tourney_id"],aggfunc=np.count_nonzero)
list_losers = pd.pivot_table(all_data,index=["loser_name"],values=["tourney_id"],aggfunc=np.count_nonzero)
firstloss = pd.pivot_table(all_data,index=["loser_name"],values=["date"],aggfunc=np.min)
firstwin = pd.pivot_table(all_data,index=["winner_name"],values=["date"],aggfunc=np.min)
playerdatabase = pd.concat([list_winners, list_losers, firstloss, firstwin], axis=1)
playerdatabase['NumberOfGames'] = 0

#defines a elo calculator for expectations and modified ratings

def getExpectation(rating_1, rating_2):
"calculator for the expected result to player 1 based on the rating of both players"
calc = (1.0 / (1.0 + pow(10, ((rating_2 - rating_1) / 400.0))))
return calc

def modifyRating(rating, expected, actual, kfactor):
"gives

Solution

Without knowing what your csv file is structured, it is hard to give too much concrete. I do have some suggestions, however.

  • You can most likely drastically increase performance by converting strings like the player names to categorical data. Strings are slow in pandas, especially string lookup in a large column (as you have here many times). Using categeorical data converts it to integers seamlessly behind-the scenes, so you can benefit from using strings while still have fast lookups.



  • You should loop over the rows rather than re-indexing so much. In fact all you really need is the winner name and loser name from each match, which you can get at the beginning of each loop.



  • You may not be able to calculate the Rating all at once, but you can calculate Number of Games all at once by just counting how many times a player is a loser and adding that to how many times the same player is a winner.



  • Your other functions are one-liners. This is probably a small part, but it would be better to not have them as functions at all.

Context

StackExchange Code Review Q#94080, answer score: 5

Revisions (0)

No revisions yet.