Chapter 8 · Orchestration & Pipelines

By now you can land raw data (Chapter 6), reshape it with SQL and dbt (Chapter 7), and crunch it at scale with Spark (Chapter 5). But each of those is a step. In a real platform, dozens of steps must run in the right order, on the right schedule, and recover cleanly when one of them fails at 3 a.m. The piece that makes that happen is orchestration — and it is one of the highest-leverage skills a data engineer can own, because a pipeline that produces correct numbers but can't be re-run safely or recovered after a failure is not actually production-grade.

This chapter teaches orchestration concept-first. We start with why it exists (and why a pile of cron jobs is not enough), build up the DAG mental model, then confront the two ideas that trip up almost everyone: the logical date / data interval (the period of data a run covers, as opposed to the moment it executes) and idempotency (designing tasks so a re-run or backfill is safe and never double-counts). Only then do we map those durable ideas onto today's tools (Airflow 3, Dagster, Prefect) and the big 2025–2026 shift toward data-aware, asset-based scheduling.
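To make the gap between "covers" and "executes" concrete before the full treatment later in the chapter, here is a minimal sketch in plain Python. It assumes a daily schedule and the common convention that a scheduled run starts only after the window it covers has closed; the helper name `daily_interval` is hypothetical and not part of any orchestrator's API.

```python
from datetime import datetime, timedelta, timezone


def daily_interval(logical_date):
    """Return the half-open [start, end) window that a daily run covers."""
    start = logical_date.replace(hour=0, minute=0, second=0, microsecond=0)
    return start, start + timedelta(days=1)


# The logical date labels the data being processed, not the wall-clock time.
logical_date = datetime(2025, 1, 1, tzinfo=timezone.utc)
start, end = daily_interval(logical_date)

# Assumed convention: the run cannot start until its window has closed,
# so the run "for January 1" executes on January 2 at the earliest.
earliest_execution = end

print(f"covers   [{start.isoformat()}, {end.isoformat()})")
print(f"executes at or after {earliest_execution.isoformat()}")
```

The point of the sketch is that a task should read its window from the run's interval, never from the current clock time, so that a backfill months later processes exactly the same slice of data.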

The durable idea

A pipeline is a DAG of idempotent, partition-scoped tasks — and the orchestrator's job is to run them in dependency order, on schedule, and recover cleanly from failure.

The shape (a DAG of idempotent tasks), the data-interval model, and the recovery semantics are durable. The specific orchestrator, its API, and its operators are dated — you will likely switch tools at least once in your career, and these concepts carry across all of them.
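The durable idea can be sketched without any orchestrator at all. The example below is a toy under stated assumptions, not how any particular tool works: the task names, the in-memory `warehouse` dictionary, and the sample rows are all hypothetical, and the standard library's `graphlib.TopologicalSorter` stands in for the scheduler's dependency resolution.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory "warehouse": table name -> {partition key -> data}.
warehouse = {"raw": {}, "clean": {}, "daily_totals": {}}


def land(day):
    # Overwrite the whole partition instead of appending to it.
    warehouse["raw"][day] = [
        {"day": day, "amount": 10},
        {"day": day, "amount": -1},  # a bad record to be filtered out
        {"day": day, "amount": 5},
    ]


def clean(day):
    rows = warehouse["raw"][day]
    warehouse["clean"][day] = [r for r in rows if r["amount"] > 0]


def aggregate(day):
    rows = warehouse["clean"][day]
    warehouse["daily_totals"][day] = sum(r["amount"] for r in rows)


TASKS = {"land": land, "clean": clean, "aggregate": aggregate}

# The DAG: each task maps to the set of tasks it depends on.
DEPENDENCIES = {"land": set(), "clean": {"land"}, "aggregate": {"clean"}}


def run_pipeline(day):
    """Run every task for one partition in dependency order."""
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](day)
    return order


order = run_pipeline("2025-01-01")
run_pipeline("2025-01-01")  # a re-run replaces the partition, it does not add to it

print(order)
print(warehouse["daily_totals"])
```

Because every task writes to a single partition keyed by `day` and replaces it wholesale, running the pipeline twice for the same day leaves the same totals as running it once. A real orchestrator adds scheduling, retries, and state tracking around this same shape.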

What's in this chapter

Let's start with the problem orchestration was invented to solve.
