Schema evolution: handling backward and forward compatibility in Parquet and Avro
schema evolution parquet, backward compatible schema, add column parquet, avro schema evolution, schema registry
Problem
Adding, removing, or renaming columns in a data pipeline breaks downstream readers that depend on fixed column positions or names, causing silent data corruption or explicit read failures.
Solution
Design schemas with evolution in mind:
# Parquet with PyArrow: adding a nullable column is backward compatible
import pyarrow as pa
import pyarrow.parquet as pq

# v1 schema
schema_v1 = pa.schema([
    ('order_id', pa.int64()),
    ('amount', pa.float64()),
])

# v2 schema — add nullable column at the end (safe)
schema_v2 = pa.schema([
    ('order_id', pa.int64()),
    ('amount', pa.float64()),
    ('discount', pa.float64()),  # nullable by default
])

# Read old files with the new schema — the missing column fills with null
table = pq.read_table('old_data.parquet', schema=schema_v2)

# Avro schema with a default for backward compatibility:
# {"name": "discount", "type": ["null", "float"], "default": null}
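The Avro side of this can be sketched in plain Python. This is a simplified model of Avro's reader-side schema resolution, not the real Avro library: for each field in the reader's schema, take the writer's value if present, otherwise fall back to the field's declared default. The `resolve` helper and the record values are illustrative, not part of any Avro API.

```python
# Simplified sketch of Avro schema resolution: a reader using the
# v2 schema fills fields missing from old records with their defaults.
reader_schema_v2 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "amount", "type": "double"},
        {"name": "discount", "type": ["null", "float"], "default": None},
    ],
}

def resolve(record, reader_schema):
    """Apply Avro-style resolution: present fields pass through,
    missing fields take their default; no default means a hard error."""
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return out

old_record = {"order_id": 1, "amount": 9.99}  # written with the v1 schema
print(resolve(old_record, reader_schema_v2))
# -> {'order_id': 1, 'amount': 9.99, 'discount': None}
```

This is why the Avro rule of thumb is "every new field needs a default": without one, old records fail to decode under the new schema.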
Why
Parquet resolves columns by name, not by position, so adding a nullable column at the end is safe: old files simply return null for the new column. Removing or renaming columns breaks readers that depend on the old names. Avro achieves the same backward compatibility with union types and a null default, since a reader using the new schema fills the missing field from its default when decoding old records.
Gotchas
- Renaming a column in Parquet is equivalent to dropping the old column and adding a new, all-null one; the old data is only reachable if you alias the old name when reading
- Changing a column's type (e.g. int64 to string) is not backward compatible in Parquet; add a new column instead
- Schema registries (Confluent Schema Registry, AWS Glue) enforce compatibility rules at write time; integrate them early
- Parquet itself has no schema registry; use the AWS Glue Data Catalog or a table format such as Apache Iceberg, which tracks schema history and supports safe renames via field IDs
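In the absence of a registry, the compatibility rules above can be enforced with a small pre-deploy check. This is a hypothetical sketch (the `diff_schemas` helper and the type strings are illustrative, not from any library): removed columns and type changes are flagged as breaking, while added columns are reported as safe.

```python
# Hypothetical pre-deploy compatibility check mirroring the gotchas:
# removed columns and type changes break old readers; added columns
# are safe because old files read them as null.
def diff_schemas(old, new):
    """old/new: dicts mapping column name -> type string."""
    issues = []
    for name, typ in old.items():
        if name not in new:
            issues.append(f"BREAKING: column '{name}' removed")
        elif new[name] != typ:
            issues.append(f"BREAKING: '{name}' changed {typ} -> {new[name]}")
    for name in new:
        if name not in old:
            issues.append(f"safe: column '{name}' added (reads as null in old files)")
    return issues

v1 = {"order_id": "int64", "amount": "float64"}
v2 = {"order_id": "int64", "amount": "string", "discount": "float64"}
for issue in diff_schemas(v1, v2):
    print(issue)
```

Wiring a check like this into CI catches an accidental rename or type change before it reaches the lake, which is cheaper than discovering it through a downstream read failure.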
Context
Evolving data schemas in a data lake or streaming pipeline without breaking consumers