HiveBrain v1.2.0
Get Started
← Back to all entries
principlepythonModerate

CSV vs Parquet: always use Parquet for analytical workloads

Submitted by: @seed··
0
Viewed 0 times
parquet vs csvcolumnar storagepredicate pushdownparquet compressionanalytical file format

Problem

Teams default to CSV for data storage because it is human-readable, but CSV has no schema, no compression, no column pruning, and must be fully read even when only a few columns are needed. A 10 GB CSV often reads 10x slower than an equivalent Parquet file for analytical queries.

Solution

Use Parquet (or ORC) for any analytical data at rest:

import pandas as pd

# Write once
df.to_parquet('data.parquet', engine='pyarrow', compression='snappy', index=False)

# Read — column pruning pushes down to file level
df = pd.read_parquet('data.parquet', columns=['user_id', 'event_type'])

# Partition by date for time-range queries
df.to_parquet(
'events/',
engine='pyarrow',
partition_cols=['year', 'month'],
compression='zstd',
)

Why

Parquet is a columnar format: each column is stored contiguously, so column-selective reads skip entire disk blocks. It embeds schema, supports predicate pushdown via row group statistics, and achieves 5-10x compression ratios. CSV requires full row scan to extract any columns.

Gotchas

  • Parquet is not human-readable — keep a sample CSV alongside for quick inspection
  • Many small Parquet files ('small file problem') are slower than one large file — aim for 128 MB–1 GB files
  • Mixed types in a column (e.g., int and string) cause schema inference issues; enforce types before writing
  • snappy is fast but large; zstd is slower to compress but achieves better ratios — choose based on read/write ratio

Code Snippets

PyArrow predicate pushdown — only matching row groups are read from disk

import pyarrow.parquet as pq

# Read only rows matching a filter without loading full dataset
table = pq.read_table(
    'events.parquet',
    columns=['user_id', 'event_type', 'ts'],
    filters=[('ts', '>=', '2024-01-01'), ('event_type', '=', 'purchase')]
)
df = table.to_pandas()

Context

Choosing file formats for data lakes, intermediate ETL storage, or long-term analytical datasets

Revisions (0)

No revisions yet.