patternpythonModerate
Backfill strategies: reprocessing historical data in incremental pipelines
Viewed 0 times
backfill airflow dagdbt full refreshhistorical reprocessingincremental backfillexecution_date logical_date
Problem
A bug in transformation logic is discovered after 6 months of production data. Reprocessing all historical data requires manually triggering hundreds of DAG runs, tracking which succeeded, and handling partial reruns.
Solution
Design pipelines to support backfill from the start:
# Airflow backfill via CLI
airflow dags backfill \
--start-date 2024-01-01 \
--end-date 2024-06-30 \
my_pipeline_dag
# In the DAG: use execution_date for all date logic (not datetime.now())
def process_partition(**context):
# Use logical_date (Airflow 2.2+) or execution_date
run_date = context['logical_date'].date()
df = read_partition(run_date) # reads only that day's data
write_partition(df, run_date) # overwrites that day's output
# dbt backfill for incremental models
dbt run --select fct_orders --full-refresh
# Or date-limited backfill via variable:
dbt run --select fct_orders --vars '{"start_date": "2024-01-01"}'
# Airflow backfill via CLI
airflow dags backfill \
--start-date 2024-01-01 \
--end-date 2024-06-30 \
my_pipeline_dag
# In the DAG: use execution_date for all date logic (not datetime.now())
def process_partition(**context):
# Use logical_date (Airflow 2.2+) or execution_date
run_date = context['logical_date'].date()
df = read_partition(run_date) # reads only that day's data
write_partition(df, run_date) # overwrites that day's output
# dbt backfill for incremental models
dbt run --select fct_orders --full-refresh
# Or date-limited backfill via variable:
dbt run --select fct_orders --vars '{"start_date": "2024-01-01"}'
Why
Idempotent partitioned writes mean backfilling is just rerunning jobs for specific dates. Using execution_date instead of now() ensures the job processes the correct data window regardless of when it actually runs.
Gotchas
- Backfilling with catchup=True and many missed intervals causes a task flood — use max_active_runs to limit concurrency
- Airflow backfill does not respect max_active_runs by default — use --delay-on-limit
- dbt --full-refresh rebuilds the entire incremental table — expensive but correct for logic changes
- Source systems may not retain historical data; snapshot raw data before backfilling transformations
Context
Reprocessing historical data after fixing a bug in a production ETL pipeline
Revisions (0)
No revisions yet.