Slow code pandas 8 million rows
Problem
I have the following code which takes 10-15 minutes to execute. That is way too slow considering that my database is growing daily. Is there any chance to make it faster?
# Replace all empty lists ([], '[]') in dataframe with NaN's
df = df.mask(df.applymap(str).eq('[]'))
# Replace all zeros in dataframe with NaN's
df[df == 0.0] = np.nan
# Replace empty strings in dataframe with NaN's
df.replace('', np.nan, inplace=True)
# Replace all strings with value 'null' in a dataframe with NaN
df.replace('null', np.NaN, inplace=True)
Solution
Not a lot to review there:
The code is well documented and readable. The only thing I frowned at was df[df == 0.0], but my Python is probably just not good enough.
I can only think of the fact that replace can take a list of strings for to_replace, so you can merge the last two statements, which should give a speedup.
From a design perspective, all the routines writing to your database should never write empty lists, zeros, empty strings or 'null' strings, but NaN instead. Then you would never have to run this script in the first place.
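The suggestion to merge the last two replace calls could look roughly like the sketch below. It assumes the placeholders are exactly the strings '' and 'null'. The helper name clean_placeholders and the sample frame are hypothetical, made up for illustration, and the speedup on the real 8-million-row data has not been measured here.

```python
import numpy as np
import pandas as pd


def clean_placeholders(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with '' and 'null' replaced by NaN."""
    # Passing a list as to_replace covers both placeholder strings
    # in one replace call instead of two separate calls.
    return df.replace(["", "null"], np.nan)


# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({"a": ["x", "", "null"], "b": ["null", "y", ""]})
cleaned = clean_placeholders(sample)
```

Returning a new frame instead of using inplace=True leaves the original data untouched, which makes it easier to compare before and after while checking the result.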
Context
StackExchange Code Review Q#162833, answer score: 4