Apache Arrow: zero-copy data exchange between Python libraries
Tags: apache arrow zero copy, pyarrow pandas polars interop, arrow memory format, arrow ipc, in-memory columnar
Problem
Passing data between pandas, polars, DuckDB, Spark, and other tools typically triggers full data copies and format conversions, multiplying memory usage and wasting CPU time in tight pipelines.
Solution
Use PyArrow Tables as the canonical in-memory format and convert only at the edges:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import polars as pl
# Load once as Arrow
table = pq.read_table('data.parquet')
# To pandas: shares memory when the dtypes allow it;
# zero_copy_only=False permits a fallback copy for unsupported types
df_pandas = table.to_pandas(zero_copy_only=False)
# Zero-copy to polars
df_polars = pl.from_arrow(table)
# Convert pandas -> Arrow -> polars without double copy
arrow_from_pd = pa.Table.from_pandas(df_pandas)
df_polars2 = pl.from_arrow(arrow_from_pd)
Why
Arrow defines a language-agnostic columnar memory layout. Libraries that natively speak Arrow (polars, DuckDB, pandas 2.0 with ArrowDtype) can share buffer pointers without copying. Cross-language IPC becomes a pointer hand-off rather than a serialization round-trip.
Gotchas
- zero_copy_only=True makes to_pandas() raise if any copy would be required (e.g. for string columns) — use False in production and profile memory separately
- pandas ArrowDtype (pandas 2.0+) stores data in Arrow buffers inside a pandas DataFrame, enabling partial zero-copy
- Object columns in pandas cannot be zero-copied to Arrow — convert strings to large_string or dictionary type first
- Arrow IPC streams are ideal for passing data between processes without shared memory
Code Snippets
Using pandas ArrowDtype backend to keep data in Arrow buffers
import pyarrow as pa
import pandas as pd
# pandas 2.0+ ArrowDtype — keeps data in Arrow buffers
df = pd.read_parquet('data.parquet', dtype_backend='pyarrow')
print(df.dtypes) # shows ArrowDtype columns
# Converting back to Arrow can then reuse the Arrow buffers
table = pa.Table.from_pandas(df)
Context
Building pipelines that chain multiple DataFrame libraries or pass data between processes