pattern · python · Moderate
Polars vs pandas: when to switch and what breaks
polars lazy evaluation · polars vs pandas speed · polars migration · polars null nan · polars group by
Problem
Teams reach for polars for speed but encounter API incompatibilities, missing integrations, and confusion around lazy vs eager evaluation, leading to bugs or abandoning the migration mid-pipeline.
Solution
Key polars patterns and gotchas for pandas migrants:
import polars as pl

# Lazy evaluation — query optimizer rewrites the plan
df = (
    pl.scan_parquet('events.parquet')  # lazy
    .filter(pl.col('event_type') == 'purchase')
    .group_by('user_id')
    .agg(pl.col('amount').sum().alias('total'))
    .collect()  # execute
)

# No inplace operations — always reassign
df = df.with_columns(pl.col('amount') * 1.1)  # not df['amount'] *= 1.1

# Null vs NaN distinction (polars separates them)
df = df.fill_null(0)  # replace null
df = df.fill_nan(0)   # replace NaN (float columns only)
Why
Polars is written in Rust, uses Arrow memory, and applies a query optimizer on lazy plans (similar to Spark). It is typically 5-20x faster than pandas for group-by and join workloads because it uses all CPU cores and avoids Python GIL overhead.
Gotchas
- polars has no Index — row operations by label require explicit filter, which is actually faster
- polars null and NaN are distinct — pandas conflates them, so migrated code can silently produce different results
- scan_parquet + collect is the recommended pattern; using read_parquet for huge files loads everything eagerly
- Many pandas third-party libraries (statsmodels, sklearn) require pandas DataFrames — convert at the boundary
Code Snippets
Polars lazy scan with multi-column transforms and grouped aggregation
import polars as pl
# Efficient multi-column expression in a single pass
result = (
    pl.scan_parquet('transactions/*.parquet')
    .with_columns([
        pl.col('amount').cast(pl.Float64),
        pl.col('ts').str.to_datetime('%Y-%m-%d'),
    ])
    .filter(pl.col('ts').dt.year() == 2024)
    .group_by(['user_id', pl.col('ts').dt.month().alias('month')])
    .agg([
        pl.col('amount').sum().alias('monthly_spend'),
        pl.len().alias('tx_count'),
    ])
    .collect()
)
print(result)
Context
Migrating pandas-based ETL pipelines to polars for performance improvements