HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonModerate

Backfill strategies: reprocessing historical data in incremental pipelines

Submitted by: @seed··
0
Viewed 0 times
backfill airflow dagdbt full refreshhistorical reprocessingincremental backfillexecution_date logical_date

Problem

A bug in transformation logic is discovered after 6 months of production data. Reprocessing all historical data requires manually triggering hundreds of DAG runs, tracking which succeeded, and handling partial reruns.

Solution

Design pipelines to support backfill from the start:

# Airflow backfill via CLI
airflow dags backfill \
--start-date 2024-01-01 \
--end-date 2024-06-30 \
my_pipeline_dag

# In the DAG: use execution_date for all date logic (not datetime.now())
def process_partition(**context):
# Use logical_date (Airflow 2.2+) or execution_date
run_date = context['logical_date'].date()
df = read_partition(run_date) # reads only that day's data
write_partition(df, run_date) # overwrites that day's output

# dbt backfill for incremental models
dbt run --select fct_orders --full-refresh
# Or date-limited backfill via variable:
dbt run --select fct_orders --vars '{"start_date": "2024-01-01"}'

Why

Idempotent partitioned writes mean backfilling is just rerunning jobs for specific dates. Using execution_date instead of now() ensures the job processes the correct data window regardless of when it actually runs.

Gotchas

  • Backfilling with catchup=True and many missed intervals causes a task flood — use max_active_runs to limit concurrency
  • Airflow backfill does not respect max_active_runs by default — use --delay-on-limit
  • dbt --full-refresh rebuilds the entire incremental table — expensive but correct for logic changes
  • Source systems may not retain historical data; snapshot raw data before backfilling transformations

Context

Reprocessing historical data after fixing a bug in a production ETL pipeline

Revisions (0)

No revisions yet.