Chapter 8 · Orchestration & Pipelines
By now you can land raw data (Chapter 6), reshape it with SQL and dbt (Chapter 7), and crunch it at scale with Spark (Chapter 5). But each of those is a step. In a real platform, dozens of steps must run in the right order, on the right schedule, and recover cleanly when one of them fails at 3 a.m. The piece that makes that happen is orchestration — and it is one of the highest-leverage skills a data engineer can own, because a pipeline that produces correct numbers but can't be re-run safely or recovered after a failure is not actually production-grade.
This chapter teaches orchestration concept-first. We start with why it exists (and why a pile of cron jobs is not enough), build up the DAG mental model, then confront the two ideas that trip up almost everyone: the logical date / data interval (which period of data a run covers versus when the run actually executes) and idempotency (designing tasks so a re-run or backfill is safe and never double-counts). Only then do we map those durable ideas onto today's tools (Airflow 3, Dagster, Prefect) and the big 2025–2026 shift toward data-aware, asset-based scheduling.
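To make the "covers versus executes" distinction concrete before we get to any real tool, here is a minimal sketch of the classic interval-based model for a daily schedule. The `DataInterval` class and `daily_interval` function are hypothetical names invented for illustration, not any orchestrator's API, and some tools and schedule types define the interval differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataInterval:
    """Hypothetical helper: the slice of time a single run is responsible for."""
    start: datetime  # inclusive; plays the role of the run's "logical date"
    end: datetime    # exclusive; in this model, the earliest moment the run can execute


def daily_interval(logical_date: datetime) -> DataInterval:
    # Assumption: a daily schedule where each run covers exactly one day
    # and can only start once that day has fully elapsed.
    return DataInterval(start=logical_date, end=logical_date + timedelta(days=1))


interval = daily_interval(datetime(2025, 1, 1, tzinfo=timezone.utc))
print(f"covers   [{interval.start:%Y-%m-%d}, {interval.end:%Y-%m-%d})")
print(f"executes no earlier than {interval.end:%Y-%m-%d %H:%M} UTC")
```

Under these assumptions, the run labelled 1 January processes 1 January's data but cannot start until 2 January begins, which is exactly the gap that surprises people the first time they read a scheduler's run list.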
The durable idea
A pipeline is a DAG of idempotent, partition-scoped tasks — and the orchestrator's job is to run them in dependency order, on schedule, and recover cleanly from failure.
The shape (a DAG of idempotent tasks), the data-interval model, and the recovery semantics are durable. The specific orchestrator, its API, and its operators are dated — you will likely switch tools at least once in your career, and these concepts carry across all of them.
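The durable idea fits in a few lines of tool-agnostic Python. The sketch below is a toy under stated assumptions, not how any real orchestrator is implemented: the in-memory `warehouse`, the three task functions, and `run_pipeline` are all hypothetical. It uses the standard library's `graphlib.TopologicalSorter` to run tasks in dependency order, and every task overwrites its own partition rather than appending to it.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory "warehouse": table name -> partition key -> rows.
warehouse: dict[str, dict[str, list[int]]] = {"raw": {}, "clean": {}, "daily_total": {}}


def extract(partition: str) -> None:
    # Overwrite the partition instead of appending, so a re-run replaces it.
    warehouse["raw"][partition] = [3, -1, 4]


def transform(partition: str) -> None:
    # Keep only non-negative values, again writing the whole partition.
    warehouse["clean"][partition] = [x for x in warehouse["raw"][partition] if x >= 0]


def aggregate(partition: str) -> None:
    warehouse["daily_total"][partition] = [sum(warehouse["clean"][partition])]


TASKS = {"extract": extract, "transform": transform, "aggregate": aggregate}

# The DAG: each task maps to the set of tasks it depends on.
DEPENDS_ON = {"extract": set(), "transform": {"extract"}, "aggregate": {"transform"}}


def run_pipeline(partition: str) -> list[str]:
    """Run every task once, in dependency order, scoped to one partition."""
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    for name in order:
        TASKS[name](partition)
    return order


order = run_pipeline("2025-01-01")
first_result = dict(warehouse["daily_total"])

run_pipeline("2025-01-01")  # simulate a retry or backfill of the same partition
assert warehouse["daily_total"] == first_result  # no double-counting
```

Because each task replaces its partition, running the same day twice leaves the totals unchanged. Had `aggregate` appended to the table instead, the second run would have doubled the day's numbers, which is the failure mode the rest of this chapter is designed to prevent.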
What's in this chapter
- Why orchestration exists → — dependencies, scheduling, retries, observability, and backfills, and why a folder full of cron jobs collapses under them.
- DAGs, scheduling & the logical date → — tasks, dependencies, schedule intervals, and the notorious execution/logical date and catchup.
- Idempotency & backfills → — designing partition-scoped tasks so re-runs and historical reprocessing never double-count.
- Airflow core (and Airflow 3) → — operators, sensors, hooks, XComs, executors, the TaskFlow API, and Airflow 3's data-aware assets.
- Asset-based vs task-based orchestration → — Dagster's software-defined assets versus Airflow's task-centric model, and when each fits.
- Retries, SLAs, triggering & decoupling compute → — exponential backoff, SLAs and alerting, deferrable sensors, parameterization, and keeping heavy compute out of the orchestrator.
- Chapter 8 checkpoint → — a quiz to lock it in.
Let's start with the problem orchestration was invented to solve.