Idempotency & backfills

The previous lesson left us with a powerful tool: every run owns a data interval, and tasks should process exactly that interval. This lesson cashes that in. We'll see why idempotency is the property that makes retries and backfills safe, how to design tasks to have it, and how to reprocess history without the classic disaster: double-counting.

This is the lesson most tutorials skip — they teach you Airflow operators but never idempotency — and it's exactly why so many real pipelines quietly corrupt their own data on the first re-run.

The problem: re-running is dangerous by default

Recall that an orchestrator retries failed tasks and lets you backfill history. Both mean the same task runs more than once for the same interval. Now picture shopflow_daily's ingest_orders task (ShopFlow — see Meet ShopFlow) written naively to land yesterday's orders into raw.orders:

-- NAIVE: just append whatever we pulled
INSERT INTO raw.orders SELECT * FROM june_23_orders;

It works the first time. Then the task fails after the insert but before it reports success (the network blipped on the final acknowledgment). The orchestrator dutifully retries. The insert runs again. Now June 23rd's orders are in raw.orders twice, every fact_sales revenue number that dbt_run builds on top of them is wrong, and nothing errored. This is double-counting, and it is the canonical data-pipeline bug.

The retry didn't cause the bug. The task design did. The task was not idempotent.
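To make the failure concrete, here is a minimal sketch that reproduces it with Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the in-memory database stands in for the warehouse, the table is named raw_orders (SQLite has no raw schema by default), and the two sample orders and the ingest_naive helper are hypothetical.

```python
import sqlite3


def ingest_naive(conn, rows):
    """Blind append: every call stacks the same rows on top of the table."""
    conn.executemany(
        "INSERT INTO raw_orders (order_id, dt, amount) VALUES (?, ?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, dt TEXT, amount REAL)")

# Hypothetical stand-in for the June 23 pull.
june_23_orders = [(1, "2026-06-23", 20.0), (2, "2026-06-23", 35.0)]

ingest_naive(conn, june_23_orders)  # first attempt: insert lands, ack is lost
ingest_naive(conn, june_23_orders)  # orchestrator retry: same rows appended again

row_count, revenue = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM raw_orders WHERE dt = '2026-06-23'"
).fetchone()
print(row_count, revenue)  # 4 110.0
```

Two real orders worth 55.0 now show up as four rows worth 110.0, and no statement raised an error along the way.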

Idempotency: the property that makes re-runs safe

Idempotency means running an operation again with the same input produces the same end state — running it once or five times leaves the data identical. An idempotent task can be safely retried, re-run, or backfilled without corrupting anything.

The word comes from math (f(f(x)) = f(x)), but the practical test is simple: if I run this task twice for the same interval, is the result the same as running it once? If yes, idempotent. If running twice doubles the rows, not idempotent.
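The practical test can be written down directly. The sketch below is illustrative only: is_idempotent, append_orders, and replace_partition are hypothetical helpers, and a plain Python list of (dt, order_id) tuples stands in for a table.

```python
import copy


def is_idempotent(task, state):
    """True if applying `task` twice ends in the same state as applying it once."""
    once = task(copy.deepcopy(state))
    twice = task(task(copy.deepcopy(state)))
    return once == twice


def append_orders(table):
    # Blind append: every call adds the same two rows again.
    return table + [("2026-06-23", 1), ("2026-06-23", 2)]


def replace_partition(table):
    # Drop this interval's rows, then write them fresh.
    kept = [row for row in table if row[0] != "2026-06-23"]
    return kept + [("2026-06-23", 1), ("2026-06-23", 2)]


start = [("2026-06-22", 9)]  # one row from a neighbouring day
append_ok = is_idempotent(append_orders, start)
replace_ok = is_idempotent(replace_partition, start)
print(append_ok, replace_ok)  # False True
```

The append version fails because the second application changes the state again; the replace version passes because the second application rebuilds exactly what the first one left.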

The naive INSERT … SELECT fails the test. The fix is to make ingest_orders own its partition and replace it rather than append to it. ShopFlow partitions raw.orders by order date — call the partition column dt (derived from order_ts) — so each shopflow_daily run owns exactly one day's dt.

A partition is a slice of a table keyed by something (almost always the data interval — e.g. dt = '2026-06-23'). A partition-scoped task writes only to its own partition.

Pattern 1 — delete-then-insert (idempotent within a partition)

-- IDEMPOTENT: clear this interval's partition, then write it fresh.
DELETE FROM raw.orders WHERE dt = '2026-06-23';
INSERT INTO raw.orders SELECT * FROM june_23_orders WHERE dt = '2026-06-23';

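Now trace the retry: ingest_orders runs, deletes the dt = '2026-06-23' partition (removing whatever the earlier attempt left behind, whether a complete copy or half-written rows), and reinserts June 23. Run it again: same delete, same insert, same final state. One copy of June 23's orders in raw.orders, every time, no matter how many retries. Where the engine supports it, run the delete and the insert in a single transaction so readers never see an empty or partial partition between the two statements. The dt comes from the run's logical date (last lesson), so each run cleanly owns exactly one partition.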
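The retry trace can be checked with a small runnable sketch. It assumes SQLite (via Python's sqlite3) as a stand-in for the warehouse, a table named raw_orders, and hypothetical sample rows; the ingest_orders function here is an illustration of the pattern, not ShopFlow's actual task.

```python
import sqlite3


def ingest_orders(conn, dt, source_rows):
    """Replace the dt partition: delete it, then write it fresh, in one transaction."""
    rows = [r for r in source_rows if r[1] == dt]
    with conn:  # commits on success, rolls back if a statement raises
        conn.execute("DELETE FROM raw_orders WHERE dt = ?", (dt,))
        conn.executemany(
            "INSERT INTO raw_orders (order_id, dt, amount) VALUES (?, ?, ?)", rows
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, dt TEXT, amount REAL)")
conn.execute("INSERT INTO raw_orders VALUES (9, '2026-06-22', 12.5)")  # neighbouring day
conn.commit()

source = [(1, "2026-06-23", 20.0), (2, "2026-06-23", 35.0)]  # hypothetical pull

counts = []
for _attempt in range(3):  # first run plus two retries
    ingest_orders(conn, "2026-06-23", source)
    counts.append(
        conn.execute(
            "SELECT COUNT(*) FROM raw_orders WHERE dt = '2026-06-23'"
        ).fetchone()[0]
    )

other_day = conn.execute(
    "SELECT COUNT(*) FROM raw_orders WHERE dt = '2026-06-22'"
).fetchone()[0]
print(counts, other_day)  # [2, 2, 2] 1
```

The June 23 count stays at two after every attempt, and the June 22 row is never touched because the delete is scoped to a single dt.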

Pattern 2 — overwrite/`MERGE`/`INSERT OVERWRITE`

Many engines give you this directly. Spark SQL and Hive support INSERT OVERWRITE with a PARTITION clause, which replaces one partition in a single statement; most warehouses support MERGE (upsert by key). dbt's incremental models (exactly what dbt_run builds for fact_sales downstream) use this idea under the hood. The shape is always the same: target one partition by the interval, and replace it, so a re-run overwrites rather than stacks. One caveat for MERGE: an upsert by key inserts and updates rows, but it does not remove rows that have disappeared from the source unless you also handle deletes.
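As a rough illustration of the upsert-by-key idea, the sketch below uses SQLite's INSERT ... ON CONFLICT DO UPDATE through Python's sqlite3 module. This is a stand-in, not warehouse MERGE syntax, and the raw_orders table, its order_id primary key, and the sample rows are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER PRIMARY KEY, dt TEXT, amount REAL)"
)


def upsert_orders(conn, rows):
    """Insert new keys, overwrite existing ones; re-running never stacks rows."""
    with conn:  # one transaction per call
        conn.executemany(
            """
            INSERT INTO raw_orders (order_id, dt, amount) VALUES (?, ?, ?)
            ON CONFLICT(order_id) DO UPDATE
            SET dt = excluded.dt, amount = excluded.amount
            """,
            rows,
        )


first_pull = [(1, "2026-06-23", 20.0), (2, "2026-06-23", 35.0)]
upsert_orders(conn, first_pull)
upsert_orders(conn, first_pull)  # retry: same keys, so nothing is duplicated

corrected_pull = [(1, "2026-06-23", 20.0), (2, "2026-06-23", 40.0)]
upsert_orders(conn, corrected_pull)  # re-run with a corrected amount for order 2

rows = conn.execute(
    "SELECT order_id, dt, amount FROM raw_orders ORDER BY order_id"
).fetchall()
print(rows)  # [(1, '2026-06-23', 20.0), (2, '2026-06-23', 40.0)]
```

The key does the work that the partition does in delete-then-insert: a second write for the same order_id lands on the existing row instead of beside it.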

:::tip The rule of thumb
A task should produce the same output whether it runs once or ten times for a given interval. Achieve it by making each run own and overwrite a partition keyed by its logical date, never by blind-appending. If you remember one thing from this chapter, remember this.
:::

Backfilling: reprocessing history correctly

Now backfills become almost trivial. A backfill is running a pipeline for past intervals it never ran — or must re-run because logic changed. ShopFlow's analysts redefine how fact_sales attributes revenue; you must recompute the last 90 days.

Because each shopflow_daily run is idempotent and partition-scoped, a backfill is just "run shopflow_daily for each interval from March 25 to today." The orchestrator (Airflow's backfill command, Dagster's backfill UI) walks the intervals and triggers a run per partition: each run re-lands that day's raw.orders dt partition via delete-insert, then dbt_run rebuilds fact_sales for that day. Each run overwrites its own day. No double-counting, no skipped days, no manual date juggling — the same property that made retries safe makes backfills safe.
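Stripped of the orchestrator, a backfill is a loop over logical dates. The sketch below is not Airflow's or Dagster's API; daily_intervals, backfill, and run_shopflow_daily are hypothetical helpers, a dict stands in for the partitioned table, and "today" is assumed to be 2026-06-23 so that the half-open range starting March 25 covers 90 daily intervals.

```python
from datetime import date, timedelta


def daily_intervals(start, end):
    """Yield each logical date in the half-open range [start, end)."""
    current = start
    while current < end:
        yield current
        current += timedelta(days=1)


def backfill(run_for_date, start, end):
    """Call run_for_date once per daily interval; return the dates processed."""
    processed = []
    for logical_date in daily_intervals(start, end):
        run_for_date(logical_date)
        processed.append(logical_date)
    return processed


partitions = {}  # stand-in for a table partitioned by dt


def run_shopflow_daily(logical_date):
    # Hypothetical idempotent run: overwrite the partition for this date.
    key = logical_date.isoformat()
    partitions[key] = f"orders for {key}"


dates = backfill(run_shopflow_daily, date(2026, 3, 25), date(2026, 6, 23))
backfill(run_shopflow_daily, date(2026, 3, 25), date(2026, 6, 23))  # run it all again

print(len(dates), dates[0], dates[-1], len(partitions))
# 90 2026-03-25 2026-06-22 90
```

Running the whole backfill a second time leaves 90 partitions, not 180, because each run assigns to its own key rather than appending.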

Contrast the alternative if ingest_orders weren't idempotent: every backfilled day would append onto raw.orders a second time, doubling ShopFlow's order history for those 90 days (and tripling it on the next re-run). The difference between "backfill is one command" and "backfill corrupts everything" is entirely whether you built idempotent, partition-scoped tasks.

A late-arriving day

Partition-keying earns its keep when data shows up late. Say June 23rd's shopflow_daily run landed raw.orders for dt = '2026-06-23', but a batch of June 23 orders was delayed in the source system and only appeared two days later. You don't patch rows by hand — you simply re-run shopflow_daily for the June 23 logical date. ingest_orders deletes the dt = '2026-06-23' partition and re-lands it complete (now including the late orders), dbt_run rebuilds June 23's fact_sales, and quality_check re-tests. Because the task owns that one partition by its order-date dt, the late day is corrected in place with zero risk to any other day.
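The late-arrival scenario can be walked through in a few lines. This is a sketch under stated assumptions: SQLite via Python's sqlite3 stands in for the warehouse, the table is called raw_orders, the source is a plain Python list, and ingest_orders and partition_total are hypothetical helpers.

```python
import sqlite3


def ingest_orders(conn, dt, source_rows):
    """Delete the dt partition, then re-land it from the source, in one transaction."""
    rows = [r for r in source_rows if r[1] == dt]
    with conn:
        conn.execute("DELETE FROM raw_orders WHERE dt = ?", (dt,))
        conn.executemany(
            "INSERT INTO raw_orders (order_id, dt, amount) VALUES (?, ?, ?)", rows
        )


def partition_total(conn, dt):
    """Return (row count, revenue) for one dt partition."""
    return conn.execute(
        "SELECT COUNT(*), SUM(amount) FROM raw_orders WHERE dt = ?", (dt,)
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, dt TEXT, amount REAL)")

source = [
    (9, "2026-06-22", 12.5),
    (1, "2026-06-23", 20.0),
    (2, "2026-06-23", 35.0),
]
ingest_orders(conn, "2026-06-22", source)  # scheduled run for June 22
ingest_orders(conn, "2026-06-23", source)  # scheduled run for June 23

# Two days later, a delayed June 23 order finally appears in the source.
source.append((3, "2026-06-23", 15.0))
ingest_orders(conn, "2026-06-23", source)  # re-run only the June 23 logical date

june_23 = partition_total(conn, "2026-06-23")
june_22 = partition_total(conn, "2026-06-22")
print(june_23, june_22)  # (3, 70.0) (1, 12.5)
```

June 23 picks up the late order without duplicating the two it already had, and June 22 is exactly as the original run left it.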

:::note Backfill ≠ catchup
They're cousins, not twins. Catchup is the scheduler automatically filling missed intervals when you enable a DAG (often a surprise). A backfill is you deliberately re-running a chosen range. Both rely on idempotency to be safe; the difference is who initiated it and whether it was intentional.
:::

Watch out for non-partitioned side effects

Idempotency is easy for a single partitioned table and tricky for side effects. If your "task" also sends an email or fires a webhook, re-running it sends the email again — emails are not idempotent. The discipline: keep re-runnable data work (partition overwrites) separate from one-shot side effects, or guard side effects with an idempotency key so the second attempt is a no-op. We'll touch alerting and external triggers in the reliability lesson.
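One way to guard a side effect is sketched below. It is illustrative only: send_report_email is a hypothetical function, the key format is an assumption, an in-memory set stands in for what would need to be durable storage, and a list stands in for the mail system.

```python
sent_keys = set()  # assumption: in a real pipeline this would be durable storage
outbox = []        # stand-in for the mail system


def send_report_email(logical_date, recipient):
    """Send at most once per (logical_date, recipient); return True if sent."""
    key = f"daily-report:{logical_date}:{recipient}"
    if key in sent_keys:
        return False  # already sent for this interval: the re-run is a no-op
    outbox.append((recipient, f"ShopFlow report for {logical_date}"))
    sent_keys.add(key)
    return True


first = send_report_email("2026-06-23", "analyst@example.com")
second = send_report_email("2026-06-23", "analyst@example.com")  # retry or backfill
next_day = send_report_email("2026-06-24", "analyst@example.com")

print(first, second, next_day, len(outbox))  # True False True 2
```

The key is derived from the logical date, so a retry of the same interval is suppressed while a different interval still sends. Note that this sketch does not make "send" and "record the key" atomic; a crash between the two could still produce a duplicate or a lost message, which is one reason to keep side effects out of the re-runnable data task where possible.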

Why it matters

Orchestrators retry and backfill, which means tasks run more than once for the same interval — so a naive blind-append task double-counts the moment it's retried, silently corrupting every number downstream. Idempotency — same input, same end state, no matter how many runs — is the property that makes re-runs safe, and you get it by making each task own and overwrite a partition keyed by its logical date (delete-then-insert, INSERT OVERWRITE PARTITION, or MERGE). Build that, and backfilling history becomes a single safe command instead of a corruption event. This is the difference between a pipeline that runs and one you can trust. Next, we map all of this onto a real orchestrator: Airflow.

