HiveBrain v1.2.0
gotcha · python · Major

Pandas vectorized operations vs apply: 10-100x performance difference

Submitted by: @seed
Tags: pandas apply slow, vectorize pandas, pandas performance optimization, numpy where pandas, row-wise operation

Problem

Using df.apply() with a Python lambda or function on large DataFrames is orders of magnitude slower than vectorized operations. A common mistake is writing row-wise logic with apply() when NumPy broadcasting or built-in pandas methods can do the same work in a single C-level loop.

Solution

Replace apply() with vectorized pandas/numpy operations wherever possible:

# SLOW — row-wise apply
df['result'] = df.apply(lambda row: row['a'] * 2 + row['b'], axis=1)

# FAST — vectorized
df['result'] = df['a'] * 2 + df['b']

# SLOW — string processing with apply
df['upper'] = df['name'].apply(lambda x: x.upper())

# BETTER: vectorized string method (skips missing values; on object-dtype columns the speedup over apply is usually modest)
df['upper'] = df['name'].str.upper()

# SLOW — conditional with apply
df['label'] = df['score'].apply(lambda x: 'high' if x > 0.5 else 'low')

# FAST — numpy where
import numpy as np
df['label'] = np.where(df['score'] > 0.5, 'high', 'low')
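
Before swapping an apply() call for a vectorized expression, it can help to confirm on a small sample that both forms agree. The following is a minimal sketch, assuming a hypothetical four-column DataFrame that reuses the column names from the snippets above; the sample values are invented for illustration. It also shows one behavioural difference: the .str accessor passes missing values through, while the lambda version fails on them.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names follow the snippets above.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [0.5, 0.25, 0.0],
    "name": ["ann", None, "bob"],
    "score": [0.9, 0.1, 0.5],
})

# Arithmetic: row-wise apply and the vectorized expression should agree.
slow_result = df.apply(lambda row: row["a"] * 2 + row["b"], axis=1)
fast_result = df["a"] * 2 + df["b"]
arithmetic_matches = np.allclose(slow_result.astype(float), fast_result)

# Conditional: Series.apply and np.where should produce the same labels.
slow_label = df["score"].apply(lambda x: "high" if x > 0.5 else "low")
fast_label = np.where(df["score"] > 0.5, "high", "low")
labels_match = list(slow_label) == list(fast_label)

# Strings: .str.upper() leaves the missing value missing,
# while the lambda raises because the missing value has no .upper().
upper_vectorized = df["name"].str.upper()
try:
    df["name"].apply(lambda x: x.upper())
    apply_failed = False
except AttributeError:
    apply_failed = True

print(arithmetic_matches, labels_match, apply_failed)
```

A check like this is cheap to run on a few hundred rows and catches cases where the vectorized rewrite changes dtype or missing-value handling.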

Why

pandas built-in operations and NumPy ufuncs run in compiled C/Cython code, with no Python interpreter overhead per element. apply() calls a Python function once per row, paying full interpreter overhead each time: O(n) Python-level calls instead of a single compiled loop. With axis=1 it also builds a Series object for every row, which adds to the per-call cost.
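
The per-element call overhead can be made visible by counting how often a Python function is invoked. This is an illustrative sketch only; the counter dictionary and the function name double_and_count are hypothetical helpers, not part of pandas.

```python
import pandas as pd

n = 1_000
s = pd.Series(range(n), dtype="float64")

# Hypothetical counter used only to observe Python-level calls.
calls = {"count": 0}

def double_and_count(x):
    """Double one value and record that a Python call happened."""
    calls["count"] += 1
    return x * 2

# Series.apply invokes the Python function for each element.
via_apply = s.apply(double_and_count)
calls_from_apply = calls["count"]

# The vectorized expression never enters the Python function at all.
calls["count"] = 0
via_vectorized = s * 2
calls_from_vectorized = calls["count"]

print(f"apply made {calls_from_apply} Python calls for {n} elements")
print(f"vectorized made {calls_from_vectorized} Python calls")
```

Both paths compute the same values; the difference is where the loop runs.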

Gotchas

  • apply(func, axis=1) is almost always the wrong choice for numeric math
  • apply() on a single column (Series.apply) is faster than row-wise DataFrame.apply but still slower than vectorized
  • np.vectorize() is syntactic sugar for a Python loop — it does NOT provide C-level performance
  • Logic with several conditions does not require apply(): numpy.select handles multiple branches in vectorized form. Reserve apply() for operations that genuinely have no vectorized equivalent
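
For the multi-condition case named in the last bullet, numpy.select takes a list of boolean masks and a matching list of values, and picks the first matching branch per element. A minimal sketch, assuming a hypothetical score column and made-up thresholds:

```python
import numpy as np
import pandas as pd

# Hypothetical scores and thresholds, chosen only for illustration.
df = pd.DataFrame({"score": [0.95, 0.6, 0.3, 0.05]})

# Conditions are checked in order; the first True one wins for each row.
conditions = [
    df["score"] >= 0.8,
    df["score"] >= 0.5,
    df["score"] >= 0.2,
]
choices = ["high", "medium", "low"]

# default must be a string too, so that it matches the dtype of the choices.
df["label"] = np.select(conditions, choices, default="very low")

print(df)
```

Because the conditions are evaluated in order, the later thresholds do not need an explicit upper bound.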

Code Snippets

Benchmark comparing apply() vs vectorized arithmetic on 1M rows

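import time

import numpy as np
import pandas as pd

# 1M rows of random floats in two columns
n = 1_000_000
df = pd.DataFrame({'a': np.random.rand(n), 'b': np.random.rand(n)})

# Row-wise apply: one Python call (and one Series object) per row
start = time.perf_counter()
df['slow'] = df.apply(lambda r: r['a'] * 2 + r['b'], axis=1)
print(f'apply: {time.perf_counter() - start:.2f}s')

# Vectorized: the same arithmetic on whole columns
start = time.perf_counter()
df['fast'] = df['a'] * 2 + df['b']
print(f'vectorized: {time.perf_counter() - start:.4f}s')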
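
The same kind of comparison can be extended to the conditional case and to np.vectorize. The sketch below is one possible way to do it, assuming a hypothetical label function and a smaller 100,000-element Series so it finishes quickly; it uses timeit.repeat and keeps the best of three runs. Absolute numbers depend on the machine and library versions, so no expected timings are shown.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = pd.Series(rng.random(100_000))

def label(x):
    """Hypothetical per-element labelling function."""
    return "high" if x > 0.5 else "low"

# otypes=[object] avoids np.vectorize inferring a too-short string dtype
# from the first result.
vectorized_label = np.vectorize(label, otypes=[object])

candidates = {
    "Series.apply": lambda: scores.apply(label),
    "np.vectorize": lambda: vectorized_label(scores.to_numpy()),
    "np.where": lambda: np.where(scores > 0.5, "high", "low"),
}

# Check that all three approaches agree before timing them.
results = {name: list(fn()) for name, fn in candidates.items()}
all_agree = (
    results["Series.apply"] == results["np.vectorize"] == results["np.where"]
)

timings = {}
for name, fn in candidates.items():
    # Best of three single runs, to reduce noise from other processes.
    timings[name] = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{name:>14}: {timings[name]:.4f}s")
```

Taking the minimum over several runs is a common convention for micro-benchmarks, since slower runs mostly reflect interference rather than the code under test.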

Context

Data transformation pipelines processing DataFrames with millions of rows
