gotcha | python | Major
Pandas vectorized operations vs apply: 10-100x performance difference
Problem
Using df.apply() with a Python lambda or function on large DataFrames is orders of magnitude slower than vectorized operations. A common mistake is writing row-wise logic with apply() when NumPy broadcasting or built-in pandas methods can do the same work in a single C-level loop.
Solution
Replace apply() with vectorized pandas/numpy operations wherever possible:
# SLOW — row-wise apply
df['result'] = df.apply(lambda row: row['a'] * 2 + row['b'], axis=1)
# FAST — vectorized
df['result'] = df['a'] * 2 + df['b']
# SLOW — string processing with apply
df['upper'] = df['name'].apply(lambda x: x.upper())
# FAST — vectorized string method
df['upper'] = df['name'].str.upper()
# SLOW — conditional with apply
df['label'] = df['score'].apply(lambda x: 'high' if x > 0.5 else 'low')
# FAST — numpy where
import numpy as np
df['label'] = np.where(df['score'] > 0.5, 'high', 'low')
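Before swapping apply() for a vectorized expression in a real pipeline, it can help to confirm that both forms give the same answer. The following is a minimal sketch of such a check; the sample data, the seed, and the variable names are made up for illustration, and only the column names follow the snippets above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the snippets above.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.random(1_000),
    "b": rng.random(1_000),
    "score": rng.random(1_000),
    "name": ["alice", "bob"] * 500,
})

# Arithmetic: row-wise apply vs. vectorized expression.
slow_result = df[["a", "b"]].apply(lambda row: row["a"] * 2 + row["b"], axis=1)
fast_result = df["a"] * 2 + df["b"]
arithmetic_matches = bool(
    np.allclose(slow_result.to_numpy(dtype=float), fast_result.to_numpy(dtype=float))
)

# Strings: Series.apply vs. the .str accessor.
slow_upper = df["name"].apply(lambda x: x.upper())
fast_upper = df["name"].str.upper()
strings_match = slow_upper.tolist() == fast_upper.tolist()

# Conditional: Series.apply vs. np.where.
slow_label = df["score"].apply(lambda x: "high" if x > 0.5 else "low")
fast_label = np.where(df["score"] > 0.5, "high", "low")
labels_match = slow_label.tolist() == fast_label.tolist()

print(arithmetic_matches, strings_match, labels_match)
```

A check like this is cheap to run on a small sample and catches subtle differences (for example in missing-value handling) before the faster version replaces the slow one.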
Why
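Most pandas built-in operations and NumPy ufuncs execute in compiled C/Cython code, with no Python interpreter overhead per element. apply() calls a Python function once per row (and with axis=1 it also builds a Series object for every row), paying full interpreter overhead each time: n Python-level calls versus a single C loop. The gain varies by operation: .str methods on object-dtype columns still invoke Python string methods per element, so expect a smaller speedup there than for numeric math.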
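To make the per-row cost visible, the sketch below wraps the formula in a small callable that counts how often pandas invokes it. The class name CountingFormula and the 10,000-row frame are hypothetical choices for illustration, not part of any library.

```python
import pandas as pd

df = pd.DataFrame({"a": range(10_000), "b": range(10_000)})


class CountingFormula:
    """Hypothetical helper: computes a * 2 + b and counts its own invocations."""

    def __init__(self):
        self.calls = 0

    def __call__(self, row):
        self.calls += 1
        return row["a"] * 2 + row["b"]


formula = CountingFormula()

# Row-wise apply: pandas hands each row to the Python callable.
via_apply = df.apply(formula, axis=1)

# Vectorized: the same arithmetic expressed on whole columns.
vectorized = df["a"] * 2 + df["b"]

print(f"rows: {len(df)}, Python-level calls made by apply: {formula.calls}")
print(f"same values: {bool((via_apply == vectorized).all())}")
```

The vectorized expression never enters a user-supplied Python function at all, which is where the time difference comes from.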
Gotchas
- apply(func, axis=1) is almost always the wrong choice for numeric math
- apply() on a single column (Series.apply) is faster than row-wise DataFrame.apply but still slower than vectorized
- np.vectorize() is syntactic sugar for a Python loop — it does NOT provide C-level performance
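- Multi-branch conditionals rarely need apply(): numpy.select evaluates several conditions in one vectorized pass. Reserve apply() for logic that has no vectorized equivalent, such as calling an arbitrary Python function on each value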
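For the multi-condition case mentioned in the last gotcha, numpy.select can be sketched as follows. The thresholds, the labels, and the tier column are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; the thresholds and labels below are illustrative only.
df = pd.DataFrame({"score": [0.95, 0.72, 0.50, 0.31, 0.05]})

# Conditions are evaluated in order; the first one that matches wins.
conditions = [
    df["score"] >= 0.9,
    df["score"] >= 0.5,
]
choices = ["high", "medium"]

# Rows matching no condition fall back to the default label.
df["tier"] = np.select(conditions, choices, default="low")
print(df)
```

Because the first matching condition wins, ordering the list from most to least restrictive plays the role of an if/elif chain. Passing a string default keeps the result a string array, consistent with the string choices.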
Code Snippets
Benchmark comparing apply() vs vectorized arithmetic on 1M rows
import time
import numpy as np
import pandas as pd
# 1M rows of random floats
df = pd.DataFrame({'a': np.random.rand(1_000_000), 'b': np.random.rand(1_000_000)})
# Row-wise apply: one Python call per row
start = time.perf_counter()
df['slow'] = df.apply(lambda r: r['a'] * 2 + r['b'], axis=1)
print(f'apply: {time.perf_counter() - start:.2f}s')
# Vectorized: a single pass in compiled code
start = time.perf_counter()
df['fast'] = df['a'] * 2 + df['b']
print(f'vectorized: {time.perf_counter() - start:.4f}s')
Context
Data transformation pipelines processing DataFrames with millions of rows