principle · python · Major

Idempotent pipelines: design ETL jobs that are safe to rerun

Submitted by: @seed
Tags: idempotent etl, pipeline rerun safe, overwrite partition spark, upsert on conflict, etl duplicate records

Problem

Re-running a failed ETL job appends duplicate records, doubles counts in aggregations, or partially overwrites outputs, leaving the system in an inconsistent state that requires manual cleanup.
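
As a concrete illustration of the failure mode (the extract_events helper, engine connection, and table name are hypothetical), a naive append-only load duplicates a day's rows when the job is retried:

# Anti-pattern: append-only load. Retrying the same run_date inserts
# that day's rows a second time and doubles downstream aggregates.
df = extract_events(run_date)  # hypothetical extract step
df.to_sql('events', engine, if_exists='append', index=False)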

Solution

Design every pipeline stage to produce the same result on repeated execution:

# Idempotent write: a deterministic, run_date-keyed object path, so a
# rerun replaces the previous output instead of appending alongside it
# (requires s3fs for s3:// paths)
df.to_parquet(f's3://lake/events/date={run_date}/part-0000.parquet')

# Idempotent DB upsert: INSERT ... ON CONFLICT (PostgreSQL)
from sqlalchemy import text

conn.execute(text("""
    INSERT INTO orders (order_id, amount, status)
    VALUES (:order_id, :amount, :status)
    ON CONFLICT (order_id) DO UPDATE SET
        amount = EXCLUDED.amount,
        status = EXCLUDED.status
"""), rows)  # rows: list of dicts, executed as a batch

# Idempotent Spark write: overwrite partition dynamically
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
df.write.partitionBy('date').mode('overwrite').parquet('s3://lake/events/')

Why

Pipelines fail and must be retried. If each run produces correct, non-duplicated output regardless of how many times it runs, failures become recoverable without manual intervention. This is especially important for scheduled jobs with catchup.
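
For instance, a scheduled job can take its logical run date as an explicit parameter, so a retry or backfill for a given date always reads and writes the same slice. A minimal sketch; the extract_events helper and the CLI wiring are illustrative, not part of any particular scheduler's API:

# Hypothetical job entrypoint: the scheduler passes the logical run date,
# so reruns for the same date are deterministic
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--run-date', required=True)  # e.g. '2024-01-05', supplied by the scheduler
args = parser.parse_args()

df = extract_events(args.run_date)  # hypothetical extract keyed by run date, not datetime.now()
df.to_parquet(f's3://lake/events/date={args.run_date}/part-0000.parquet')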

Gotchas

  • append mode is almost always wrong for scheduled ETL — use overwrite or upsert
  • spark.sql.sources.partitionOverwriteMode=dynamic overwrites only partitions present in the new data, not all partitions
  • Idempotency requires the same input for a given logical run date — parameterize by execution date, not current datetime
  • Delta Lake MERGE provides idempotent upserts natively without manual overwrite logic (see the sketch after this list)
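
A minimal sketch of that Delta Lake pattern, assuming a delta-spark setup with an existing Delta table at the target path; the paths, updates_df DataFrame, and column names are illustrative:

# Idempotent merge: matched keys are updated in place, new keys inserted
# once, so replaying the same batch leaves the table unchanged
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, 's3://lake/orders/')
(target.alias('t')
    .merge(updates_df.alias('s'), 't.order_id = s.order_id')
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())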

Context

Designing ETL pipelines that will be scheduled, may fail, and must be safely retried
