Retries, SLAs, triggering & decoupling compute
You can now model a pipeline as an idempotent DAG and run it in Airflow or a Dagster-style asset graph. This final lesson is about making it production-grade: surviving the flaky failures that will happen, promising and policing freshness, choosing how runs get triggered, parameterizing cleanly, and respecting the boundary that keeps the whole system healthy — orchestrate, don't compute.
Failure is normal — design for it
In a real pipeline, transient failures are not exceptional; they're Tuesday. For shopflow_daily (ShopFlow — see Meet ShopFlow), the source database rate-limits ingest_orders, a warehouse connection drops mid-dbt_run, a cloud region hiccups. Most of these succeed if you simply try again in a moment. That's what retries are for.
A retry automatically re-runs a failed task instance a configured number of times before declaring it failed. Because
ingest_ordersis idempotent, retrying is safe — re-running for the same interval overwrites that day'sraw.ordersdtpartition rather than double-counting. (Retries and idempotency are a package deal; one is dangerous without the other.)
But naive retries can make things worse. If a downstream service is overwhelmed, retrying instantly — three times in three seconds — just piles on more load. The fix is exponential backoff:
Exponential backoff increases the wait between retries — e.g. 1 min, then 2, then 4 — giving a struggling dependency time to recover instead of hammering it. Often paired with a small random jitter so many tasks don't all retry in lockstep.
When retries are exhausted, the task truly fails — and the orchestrator's failure semantics kick in: downstream tasks that depend on it are skipped (not run on missing data). If ingest_orders can't land raw.orders, dbt_run and quality_check are skipped rather than building fact_sales on a missing day, and a human is alerted. This is the whole point of declaring dependencies — a failure stops the bad data from propagating.
SLAs and alerting: promising freshness
Retries handle crashes. But a pipeline can succeed and still be a problem if it succeeds too late — ShopFlow's executive revenue dashboard needs fact_sales ready by 8 a.m., and today shopflow_daily is still grinding at 9:30. That's where an SLA comes in.
An SLA (Service Level Agreement) is a promise about when a pipeline's output will be ready (e.g. "fresh by 8 a.m."). Orchestrators let you attach an SLA to a task/DAG and alert when it's missed, even if the task eventually succeeds.
The distinction matters: a failure alert says "it broke"; an SLA / freshness alert says "it's late." Mature pipelines watch both, because stale-but-successful data quietly erodes trust just as much as an outright failure. (In the asset world from lesson 8.5, this is expressed as a freshness policy on the fact_sales asset — "must be no more than 24h stale" — same idea, asset vocabulary.) Alerting routes to where humans actually look — Slack, PagerDuty, email — and good alerts are actionable: which DAG (shopflow_daily), which interval, what failed, and a link to the logs.
How runs get triggered: three options
Not every pipeline should run on a fixed clock. There are three ways to trigger work, in roughly increasing sophistication:
- Schedule (time-based). "Run
shopflow_dailyat 2 a.m." Simple and predictable. The weakness: if the source's nightly export is late,ingest_ordersruns on incomplete data or pads with a guessed gap (the fragile cron pattern from lesson 8.1). - Sensor / event (wait for a condition). A sensor waits for something to be true — a file lands, a partition appears, an external job finishes — then lets downstream run. For ShopFlow, a sensor at the head of
shopflow_dailycan wait for the day'sraw.orderspartition to land (or for the source export file to appear) beforedbt_runstarts, instead of guessing that 2 a.m. is "late enough." This reacts to reality instead of guessing timing. - Data-availability trigger (data-aware). "Rebuild
fact_saleswhenraw.ordersupdates." The asset-based model from lesson 8.5 — the most robust, because the downstream fires exactly when its inputs are genuinely ready.
Sensor anti-pattern: poke mode vs deferrable
Sensors are powerful but have a famous trap, worth stating plainly:
:::warning Don't let sensors hog worker slots A poke-mode sensor sits on a worker slot and re-checks its condition every few seconds for hours. Run a hundred sensors waiting on a hundred files and they can consume every worker slot doing nothing but waiting — your real tasks queue behind idle pollers. The fix is the deferrable (async) operator/sensor: it releases its worker slot while waiting and resumes via a lightweight trigger process when the condition is met. Prefer deferrable sensors; avoid armies of poking ones. This is one of the highest-impact reliability fixes in a busy Airflow deployment. :::
Parameterization and dynamic task mapping
A pipeline should be a template driven by inputs, not a pile of hardcoded constants.
Parameterization means designing a pipeline so values like the run's date, the environment (dev/prod), or the source bucket are inputs, not baked-in literals. The most important parameter is the one from lesson 8.2: the logical date / data interval, which is what makes a task re-runnable for any interval.
Sometimes you don't know how many tasks you need until runtime — say, one task per file that landed today. Hardcoding ten tasks for ten files breaks on the day eleven files arrive. Dynamic task mapping solves it:
Dynamic task mapping generates a variable number of parallel task instances at runtime from a list — e.g. "for each file in today's drop, run a process task." The DAG adapts to the data instead of assuming a fixed shape.
But beware where you compute that list. Generating it with a heavy call at the top level of the DAG file runs it on every parse and stalls the scheduler — the top-level-code pitfall from lesson 8.4. Compute the list inside the pipeline (in a task), then map over it.
The boundary that keeps it all healthy: decouple compute
We end where lesson 8.1 began, because it's that important.
:::tip Orchestrate, don't compute. The orchestrator's job is to trigger work and react to it — not to crunch the data itself. To process a billion rows, the orchestrator should fire off a Spark job (Chapter 5), trigger a dbt run in the warehouse (Chapter 7), or call an ingestion service — and then wait and react to success or failure. It should never pull that billion rows into a worker's memory. :::
Why this is non-negotiable:
- Scalability. Real engines (Spark, the warehouse) are built to scale across many machines; an Airflow worker is sized for coordination. Move the compute into the worker and you've capped your scale at one small box and discarded your engine.
- Stability. A heavy task can exhaust a worker's memory and take down other tasks sharing it. Coordination work stays light and predictable.
- Separation of concerns. The orchestrator owns when and in what order; the engine owns how to process. Each does the thing it's good at, and you can swap either independently.
So the healthy pattern is always: orchestrator → triggers external engine → waits → reacts. The orchestrator is the conductor, not the orchestra. dbt as an orchestrated step is the perfect example: Airflow/Dagster says "run these dbt models now," dbt does the SQL transformation in the warehouse, and the orchestrator just watches the result.
Why it matters
Production pipelines are defined by how they handle the bad days. Retries (safe because tasks are idempotent) with exponential backoff ride out transient failures without hammering a struggling dependency; when retries exhaust, failure semantics skip downstream tasks and alert a human. SLAs / freshness policies catch the subtler failure — succeeding too late — that erodes trust silently. Choose triggering deliberately: schedule, sensor, or data-availability, and always prefer deferrable sensors over slot-hogging poke mode. Parameterize (especially by the data interval) and use dynamic task mapping for data-shaped fan-out — but never put heavy work in top-level code. Above all, orchestrate, don't compute: the orchestrator triggers Spark, dbt, and the warehouse and reacts to them — it is the conductor, never the orchestra. That completes the chapter; prove it on the checkpoint.
Next: Chapter 8 checkpoint →