Rating tennis players in a database, taking days to run
Problem
I have this project in data analysis for creating a ranking of tennis players. Currently, it takes more than 6 days to run on my computer.
Can you review the code and see where the problem is?

Project steps:
- I have a database of 600,000 tennis matches called `matchdatabase`. The database fields are a) winner name, b) loser name, c) tournament, d) other fields for the winner and loser.
- From that database, I create a `playerdatabase` with every player in the `matchdatabase`.
- For each match in the `matchdatabase`, it goes into the `playerdatabase`, retrieves the ranking/Elo and computes the expected result.
- It updates the ranking after the match into the `playerdatabase`.

This `for` loop ends up running 1 match/second, so the whole database takes several days to run!
```
import pandas as pd
import glob
import numpy as np
import math
all_data = pd.read_csv('tennisdatabase.csv')
all_data = all_data.sort(['date'], ascending=[0])
all_data = all_data.reindex(index = np.arange(1, len(all_data) + 1))
#it checks every player in the matchdatabase and creates a database of players
playerdatabase = pd.DataFrame()
list_winners = pd.pivot_table(all_data,index=["winner_name"],values=["tourney_id"],aggfunc=np.count_nonzero)
list_losers = pd.pivot_table(all_data,index=["loser_name"],values=["tourney_id"],aggfunc=np.count_nonzero)
firstloss = pd.pivot_table(all_data,index=["loser_name"],values=["date"],aggfunc=np.min)
firstwin = pd.pivot_table(all_data,index=["winner_name"],values=["date"],aggfunc=np.min)
playerdatabase = pd.concat([list_winners, list_losers, firstloss, firstwin], axis=1)
playerdatabase['NumberOfGames'] = 0
#defines a elo calculator for expectations and modified ratings
def getExpectation(rating_1, rating_2):
    "calculator for the expected result to player 1 based on the rating of both players"
    calc = (1.0 / (1.0 + pow(10, ((rating_2 - rating_1) / 400.0))))
    return calc
def modifyRating(rating, expected, actual, kfactor):
    "gives
```
Solution
Without knowing how your `csv` file is structured, it is hard to give anything too concrete. I do have some suggestions, however.
- You can most likely drastically increase performance by converting strings like the player names to `categorical` data. Strings are slow in `pandas`, especially string lookup in a large column (as you have here many times). Using `categorical` data converts it to integers seamlessly behind the scenes, so you can benefit from using strings while still having fast lookups.
- You should loop over the rows rather than re-indexing so much. In fact, all you really need is the winner name and loser name from each match, which you can get at the beginning of each loop iteration.
- You may not be able to calculate the `Rating` all at once, but you can calculate `Number of Games` all at once by just counting how many times a player is a loser and adding that to how many times the same player is a winner.
- Your other functions are one-liners. This is probably a small part of the run time, but it would be better to not have them as functions at all.
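
As a rough sketch of how these suggestions might fit together, the snippet below uses a tiny made-up table in place of `tennisdatabase.csv`. Only the column names `winner_name`, `loser_name` and `date` come from the question. The starting rating of 1500, the K-factor of 32, the chronological processing order and the textbook Elo update are all assumptions, since the original `modifyRating` is not shown in full.

```python
import pandas as pd

# Hypothetical miniature stand-in for tennisdatabase.csv.
matches = pd.DataFrame({
    "date": [20150101, 20150102, 20150103, 20150104],
    "winner_name": ["A", "B", "A", "C"],
    "loser_name": ["B", "C", "C", "A"],
})

# Number of games for every player in one vectorized step:
# appearances as winner plus appearances as loser.
number_of_games = pd.concat(
    [matches["winner_name"], matches["loser_name"]]
).value_counts()

# Store the names as categorical data (integer codes behind the scenes).
for col in ("winner_name", "loser_name"):
    matches[col] = matches[col].astype("category")

START_RATING = 1500.0  # assumed starting rating, not from the question
K_FACTOR = 32.0        # assumed K-factor, not from the question

# Keep the ratings in a plain dict so each lookup and update is O(1),
# instead of searching and re-indexing a DataFrame for every match.
ratings = {}
for row in matches.sort_values("date").itertuples(index=False):
    winner, loser = row.winner_name, row.loser_name
    r_win = ratings.get(winner, START_RATING)
    r_lose = ratings.get(loser, START_RATING)
    # Expected score of the winner, written inline rather than as a function.
    expected_win = 1.0 / (1.0 + 10 ** ((r_lose - r_win) / 400.0))
    # Textbook Elo update: the winner gains what the loser gives up.
    delta = K_FACTOR * (1.0 - expected_win)
    ratings[winner] = r_win + delta
    ratings[loser] = r_lose - delta

# Convert back to pandas only once, after the loop.
playerdatabase = pd.DataFrame({
    "Rating": pd.Series(ratings),
    "NumberOfGames": number_of_games,
})
```

The point of the design is that the loop body touches only Python scalars and a dictionary, and `pandas` is used for the bulk operations before and after the loop.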
Context
StackExchange Code Review Q#94080, answer score: 5