
Why orchestration exists

Imagine your first real pipeline. Every morning you need to: pull yesterday's orders from an API, load them into the warehouse, run a dbt transformation to build a clean daily_sales table, and email a dashboard refresh. Four steps. You write four scripts and schedule each one with cron — the classic Unix job scheduler that runs a command at a fixed time. Load runs at 1:00, transform at 1:30, email at 2:00. It works on Monday.

By Friday it has broken in four different ways, and understanding why tells you exactly what an orchestrator is for.

The four ways cron breaks

1. Dependencies are implicit and fragile. Your transform at 1:30 assumes the load at 1:00 finished. But one morning the API is slow and the load takes 45 minutes. At 1:30 the transform runs anyway — against incomplete data — and silently produces wrong numbers. Cron knows about time, not about "this step must finish before that step starts." You faked the dependency with a 30-minute gap, and the gap was a guess.

A dependency is a "must run before" relationship between two tasks. The core thing an orchestrator gives you is the ability to state dependencies explicitly — "transform runs only after load succeeds" — instead of hoping a time gap is big enough.
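A minimal sketch of what "stating dependencies explicitly" can look like, using Python's standard-library `graphlib`. The task names mirror the four steps of the example pipeline; the `pipeline` mapping and the `run_order` helper are illustrative assumptions, not any particular orchestrator's API.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it may start.
# Task names are hypothetical; they mirror the four steps in the example.
pipeline = {
    "pull_orders": set(),
    "load_warehouse": {"pull_orders"},
    "transform_daily_sales": {"load_warehouse"},
    "email_dashboard": {"transform_daily_sales"},
}

def run_order(deps):
    """Return an execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = run_order(pipeline)
print(order)
```

The order here is derived from the declared relationships rather than from clock times. This sketch covers ordering only; a real orchestrator also waits for each upstream task to succeed before starting the next one.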

2. Failures are silent and unrecovered. The load fails (the API returned a 500). Cron runs the command, the command exits with an error, and... nothing happens. Cron does not retry, does not alert, does not stop the downstream transform from running on stale data. You find out two days later when someone asks why sales looks flat. You need retries (automatically re-run a failed step) and alerting (tell a human when something is wrong) — neither of which cron provides.
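As a rough illustration of retries plus alerting, here is a small sketch in plain Python. The helper `run_with_retries`, its parameters, and the `flaky_load` task are all hypothetical names invented for this example; real orchestrators expose retries and alerts through their own configuration.

```python
import time

def run_with_retries(task, retries=3, delay_seconds=0.0, alert=print):
    """Run task(); on failure retry up to `retries` more times, then alert and re-raise."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception as exc:
            attempt += 1
            if attempt > retries:
                # Out of retries: tell a human, then let the failure propagate.
                alert(f"task {task.__name__} failed after {attempt} attempts: {exc}")
                raise
            time.sleep(delay_seconds)

# A stand-in for the load step: fails twice (as if the API returned a 500), then succeeds.
calls = {"count": 0}

def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("API returned 500")
    return "loaded"

result = run_with_retries(flaky_load, retries=3)
```

Passing `alert` as a callable keeps the sketch independent of any specific notification channel; in practice it might post to chat or page someone.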

3. You're blind. Which steps ran today? Which succeeded? How long did each take? Where exactly did Tuesday's run fail, and what was the error? With cron, the answers are scattered across log files on a server, if they were captured at all. You have no observability — no single place that shows the state and history of every run.

4. Re-running history is a nightmare. Marketing changes the definition of "active customer" and asks you to recompute the last 90 days. With cron, you'd hand-edit dates and run 90 commands by hand, praying you don't skip or double-run a day. You need backfills — a first-class way to re-run a pipeline over a range of historical periods, correctly and once each.
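The "correctly and once each" requirement can be sketched as a loop over a date range that remembers which days are already done. The `backfill` function, its `already_done` parameter, and the example dates are assumptions made for illustration only.

```python
from datetime import date, timedelta

def backfill(run_for_day, start, end, already_done=None):
    """Call run_for_day(day) once for each day in [start, end], skipping completed days."""
    done = set() if already_done is None else already_done
    day = start
    while day <= end:
        if day not in done:
            run_for_day(day)
            done.add(day)
        day += timedelta(days=1)
    return done

# Hypothetical 90-day window; `processed.append` stands in for the real pipeline run.
processed = []
completed = backfill(processed.append, date(2026, 1, 1), date(2026, 3, 31))
```

Calling `backfill` again with `already_done=completed` runs nothing, which is the property that protects against double-running a day.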

What an orchestrator is

An orchestrator is a system that runs your pipeline's tasks for you while solving exactly those four problems. You describe the pipeline once — the tasks and how they depend on each other — and the orchestrator takes responsibility for:

  • Dependencies — run tasks in the correct order; a task starts only when the tasks it depends on have succeeded.
  • Scheduling — trigger the whole pipeline on a schedule (or when new data arrives — more on that later).
  • Retries & failure semantics — automatically retry a flaky task, and when something truly fails, stop downstream work rather than feeding it bad data.
  • Observability — one UI/API showing every run, every task's status, logs, and timing — so you can see and debug what happened.
  • Backfills — re-run any range of historical periods on demand, cleanly.

[Diagram: under cron, the 1:00 load and the 1:30 transform are linked only by a guessed gap. Cron has no real dependencies, no retries, no alerts, and no history.]

The orchestrator does not replace cron's idea of "run on a schedule" — it absorbs it and adds dependencies, recovery, visibility, and backfills on top. This is the difference between a set of jobs that happen to run near each other and a pipeline.
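To make the responsibilities concrete, here is a toy runner in plain Python that orders tasks by their dependencies, skips everything downstream of a failure, and returns a status record for each task. It is a sketch under stated assumptions: `run_pipeline`, the status strings, and the task names are invented here, and scheduling, retries, and backfills are left out.

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps):
    """Run tasks in dependency order; skip anything downstream of a failure.

    tasks: name -> zero-argument callable
    deps:  name -> set of upstream task names
    Returns name -> "success" | "failed" | "skipped".
    """
    status = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = deps.get(name, set())
        if any(status[u] != "success" for u in upstream):
            status[name] = "skipped"  # never feed downstream work bad data
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status

def load():
    raise RuntimeError("API returned 500")

tasks = {"load": load, "transform": lambda: None, "email": lambda: None}
deps = {"load": set(), "transform": {"load"}, "email": {"transform"}}
status = run_pipeline(tasks, deps)
print(status)
```

With the load failing, the transform and the email are skipped rather than run on stale data, and the returned mapping is a very small stand-in for the run history an orchestrator's UI would show.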

The crucial boundary: orchestrate, don't compute

Here is a principle worth internalizing now, because it shapes every good pipeline you'll ever build:

:::tip The orchestrator coordinates; the engines compute.
An orchestrator's job is to trigger work in the right order and recover from failure, not to do the heavy lifting itself. The actual data crunching should happen in a purpose-built engine: a Spark cluster (Chapter 5), the warehouse running your dbt models (Chapter 7), or an ingestion service. The orchestrator says "run this Spark job now," waits, and reacts to success or failure. It does not load a billion rows into its own memory.
:::

Why this matters: orchestrators run a scheduler and lightweight workers that are sized for coordination, not for terabytes of data. If you pull a huge dataset into a worker and process it there, you starve the worker, slow the whole platform, and lose the scalability of your real compute engine. We return to this in the lesson on triggering and decoupling compute — but keep it in mind throughout: the orchestrator points at the engines; it is not an engine.
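The "trigger, wait, react" pattern can be sketched as follows. `FakeEngine` is a hypothetical in-memory stand-in for an external compute engine, and its `submit` and `state` methods are invented for this example; they do not correspond to a real Spark or warehouse client API.

```python
import time

class FakeEngine:
    """Stand-in for an external compute engine (hypothetical, not a real client API)."""

    def __init__(self, polls_until_done=2, final_state="SUCCEEDED"):
        self._remaining = polls_until_done
        self._final_state = final_state

    def submit(self, job_name):
        # The engine starts the job on its own infrastructure and returns a run id.
        return f"{job_name}-run-1"

    def state(self, run_id):
        if self._remaining > 0:
            self._remaining -= 1
            return "RUNNING"
        return self._final_state

def trigger_and_wait(engine, job_name, poll_seconds=0.0, max_polls=100):
    """Submit a job and poll until it finishes; no data passes through this function."""
    run_id = engine.submit(job_name)
    for _ in range(max_polls):
        state = engine.state(run_id)
        if state == "SUCCEEDED":
            return run_id
        if state == "FAILED":
            raise RuntimeError(f"{run_id} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"{run_id} still running after {max_polls} polls")

run_id = trigger_and_wait(FakeEngine(), "daily_sales_transform")
```

The orchestrator-side code only ever holds a run id and a state string. The rows themselves stay inside the engine, which is what keeps the worker lightweight.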

The landscape, briefly

You'll meet these properly across the chapter, but to orient you, here are the orchestrators you'll hear about in 2026:

  • Apache Airflow (now at version 3.x) — the long-time default; Python-defined DAGs, a huge ecosystem of operators, and as of Airflow 3 a strong push into data-aware scheduling.
  • Dagster — built around assets (the data objects you produce) rather than tasks; strong on testing, types, and data lineage.
  • Prefect — Python-native, lightweight, dynamic; pipelines are decorated Python functions.
  • Managed and adjacent options: Astronomer and Google Cloud Composer (managed Airflow), AWS Step Functions (cloud-native state machines), and newer entrants like Mage and Kestra.

We teach the concepts using mostly Airflow vocabulary (because it's the most common and most imitated) and contrast Dagster's asset model, but everything generalizes.

Why it matters

A pile of cron jobs fakes dependencies with time gaps, never retries, never alerts, leaves no usable history, and makes backfills a manual nightmare — it collapses the moment a pipeline grows past a few steps. Orchestration replaces it with a system that runs tasks in explicit dependency order, on a schedule, with retries and alerting, full observability, and first-class backfills. And the dividing line that keeps it healthy is orchestrate, don't compute: the orchestrator triggers external engines and reacts to them; it never becomes the data-processing engine itself. With the why in hand, the next lesson makes the model concrete — the DAG, and the schedule that drives it.
