Chapter 8 checkpoint
You can now reason about pipelines as orchestrated DAGs. Recall the chapter, then prove it.
The throughline
- A pipeline is a DAG — directed (dependencies have a direction), acyclic (no loops). You declare dependencies; the orchestrator derives order and parallelism, then runs, retries, observes, and backfills. A pile of cron jobs can't: it fakes dependencies with time gaps and has no retries, alerts, history, or safe backfills.
- The logical date labels the data interval a run covers, not the wall-clock time it executes. Tasks must derive their window from the interval, never from now(). Catchup governs whether enabling a DAG floods you with historical runs, so set it deliberately (often False).
- Idempotency (same interval → same end state) is what makes retries and backfills safe. Get it by making each task own and overwrite a partition keyed by its logical date, never blind-append; otherwise retries double-count.
- Airflow names the model: operators, sensors, hooks, XComs (small values only), connections/variables, the scheduler, executors (Local/Celery/Kubernetes), the TaskFlow API. The cardinal pitfall is top-level code that runs on every parse. Airflow 3 adds data-aware assets, DAG versioning, and multi-team deployments.
- Asset-centric orchestration (Dagster, Airflow 3 assets) declares data objects and derives the DAG from them, enabling lineage, freshness, and data-aware scheduling. This is the 2025–2026 shift away from task-centric "run these steps" pipelines. Reliability comes from retries with exponential backoff, SLAs and alerting, deferrable (not poke) sensors, and the rule "orchestrate, don't compute": trigger Spark, dbt, or the warehouse; don't crunch data in the orchestrator.
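To make the logical-date and idempotency points concrete, here is a minimal sketch in plain Python, with no Airflow dependency. The names `warehouse`, `EVENTS`, `load_daily`, `backfill`, and `total` are hypothetical, and the in-memory dict only stands in for a partitioned table. The sketch shows a task that takes its window from the interval it is handed and overwrites the partition it owns, so reruns leave the same end state.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for a partitioned table: partition key -> rows.
warehouse: dict[str, list[dict]] = {}

# Hypothetical source events (assumed data, for illustration only).
EVENTS = [
    {"ts": datetime(2025, 1, 1, 3, tzinfo=timezone.utc), "amount": 10},
    {"ts": datetime(2025, 1, 1, 20, tzinfo=timezone.utc), "amount": 5},
    {"ts": datetime(2025, 1, 2, 9, tzinfo=timezone.utc), "amount": 7},
]


def load_daily(interval_start: datetime, interval_end: datetime) -> None:
    """Load one data interval. The window comes from the arguments, never now()."""
    rows = [e for e in EVENTS if interval_start <= e["ts"] < interval_end]
    # Overwrite the whole partition owned by this interval instead of appending,
    # so a retry or backfill of the same interval cannot double-count.
    warehouse[interval_start.date().isoformat()] = rows


def backfill(first_day: datetime, days: int) -> None:
    """Run load_daily once per daily interval, oldest first."""
    for i in range(days):
        start = first_day + timedelta(days=i)
        load_daily(start, start + timedelta(days=1))


def total() -> int:
    """Sum the amounts across every loaded partition."""
    return sum(row["amount"] for rows in warehouse.values() for row in rows)


if __name__ == "__main__":
    day = datetime(2025, 1, 1, tzinfo=timezone.utc)
    backfill(day, 2)
    backfill(day, 2)  # rerunning the same intervals changes nothing
    print(total())
```

Had `load_daily` appended to the partition instead of assigning it, the second `backfill` call would have doubled the total. In a real orchestrator the interval bounds would come from the run's context rather than from a loop like this one.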
Quiz
Chapter 8 — Orchestration & Pipelines
That completes Chapter 8: you can model a pipeline as an idempotent DAG, reason about the logical date and catchup, backfill history without double-counting, name Airflow's building blocks (and Airflow 3's assets), choose between task- and asset-centric scheduling, and keep pipelines reliable while decoupling orchestration from compute. Next, the data starts moving continuously: streaming and real-time.