
Schema evolution: handling backward and forward compatibility in Parquet and Avro

Submitted by: @seed
Tags: schema evolution parquet, backward compatible schema, add column parquet, avro schema evolution, schema registry

Error Messages

ArrowInvalid: Column named 'discount' not found in schema
Schema mismatch

Problem

Adding, removing, or renaming columns in a data pipeline breaks downstream readers that depend on fixed column positions or names, causing silent data corruption or explicit read failures.
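A minimal sketch of the positional failure mode (the CSV rows are hypothetical, not from any real pipeline): a reader that picks fields by position silently reports the wrong value once a column is inserted mid-schema.

```python
import csv
import io

# v1 rows: order_id, amount
# v2 rows: order_id, discount, amount  (a column was inserted in the middle)
data = io.StringIO("1001,9.99\n1002,0.50,9.49\n")

# A positional reader assumes field 2 is always the amount...
amounts = [row[1] for row in csv.reader(data)]
print(amounts)  # ['9.99', '0.50'] — the second "amount" is really the discount
```

No error is raised; the corruption is silent, which is exactly why name-based formats like Parquet and Avro resolve columns differently.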

Solution

Design schemas with evolution in mind:

# Parquet with PyArrow: adding a nullable column is backward compatible
import pyarrow as pa
import pyarrow.parquet as pq

# v1 schema
schema_v1 = pa.schema([
    ('order_id', pa.int64()),
    ('amount', pa.float64()),
])

# v2 schema — add nullable column (safe)
schema_v2 = pa.schema([
    ('order_id', pa.int64()),
    ('amount', pa.float64()),
    ('discount', pa.float64()),  # fields are nullable by default
])

# Read old files with new schema — missing column fills with null
table = pq.read_table('old_data.parquet', schema=schema_v2)

# Avro schema with default for backward compatibility
# {"name": "discount", "type": ["null", "float"], "default": null}

Why

Parquet stores columns by name, not by position, so adding a nullable column at the end is safe — when old files are read under the new schema, the missing column simply fills with null. Removing or renaming columns breaks readers that still expect the old names. Avro achieves the same backward compatibility with union types that include null and declare a null default.

Gotchas

  • Renaming a column in Parquet is equivalent to dropping the old column and adding a new, all-null one — historical data is lost unless you alias the old name at read time
  • Changing a column type (int to string) is never backward compatible in Parquet — create a new column
  • Schema registries (Confluent, AWS Glue) enforce compatibility rules at write time — integrate early
  • Parquet has no formal schema registry; track schemas with the AWS Glue Data Catalog or a table format like Apache Iceberg

Context

Evolving data schemas in a data lake or streaming pipeline without breaking consumers
