patternpythonMinor

Rating tennis players in a database, taking days to run

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

ratingplayersdatabasetennisdaysruntaking

Problem

I have this project in data analysis for creating a ranking of tennis players. Currently, it takes more than 6 days to run on my computer.

Can you review the code and see where's the problem?

Project steps:

-
I have a database of 600,000 tennis matches called matchdatabase.The database fileds are a) winner name, b) loser name, c) tournament, d) other fields for the winner and loser.

-
From that database, I create a playerdatabase with every player in the matchdatabase.

-
For each match in the matchdatabase it goes into the playerdatabase, retrieves the ranking/elo and computes the expected result.

-
It updates the ranking after the match into the playerdatabase

This for loop ends up running 1 match/second, so the whole database takes several days to run!

```
import pandas as pd
import glob
import numpy as np
import math

all_data = pd.read_csv('tennisdatabase.csv')
all_data = all_data.sort(['date'], ascending=[0])
all_data = all_data.reindex(index = np.arange(1, len(all_data) + 1))

#it checks every player in the matchdatabase and creates a database of players

playerdatabase = pd.DataFrame()
list_winners = pd.pivot_table(all_data,index=["winner_name"],values=["tourney_id"],aggfunc=np.count_nonzero)
list_losers = pd.pivot_table(all_data,index=["loser_name"],values=["tourney_id"],aggfunc=np.count_nonzero)
firstloss = pd.pivot_table(all_data,index=["loser_name"],values=["date"],aggfunc=np.min)
firstwin = pd.pivot_table(all_data,index=["winner_name"],values=["date"],aggfunc=np.min)
playerdatabase = pd.concat([list_winners, list_losers, firstloss, firstwin], axis=1)
playerdatabase['NumberOfGames'] = 0

#defines a elo calculator for expectations and modified ratings

def getExpectation(rating_1, rating_2):
"calculator for the expected result to player 1 based on the rating of both players"
calc = (1.0 / (1.0 + pow(10, ((rating_2 - rating_1) / 400.0))))
return calc

def modifyRating(rating, expected, actual, kfactor):
"gives

Solution

Without knowing what your csv file is structured, it is hard to give too much concrete. I do have some suggestions, however.

You can most likely drastically increase performance by converting strings like the player names to categorical data. Strings are slow in pandas, especially string lookup in a large column (as you have here many times). Using categeorical data converts it to integers seamlessly behind-the scenes, so you can benefit from using strings while still have fast lookups.

You should loop over the rows rather than re-indexing so much. In fact all you really need is the winner name and loser name from each match, which you can get at the beginning of each loop.

You may not be able to calculate the Rating all at once, but you can calculate Number of Games all at once by just counting how many times a player is a loser and adding that to how many times the same player is a winner.

Your other functions are one-liners. This is probably a small part, but it would be better to not have them as functions at all.

Context

StackExchange Code Review Q#94080, answer score: 5

Revisions (0)

No revisions yet.