DAGs, scheduling & the logical date
The last lesson named the shape of a pipeline: a DAG. This lesson makes that precise and then tackles the single most confusing thing in all of orchestration — the logical date and catchup. Almost every newcomer gets burned by this once. Reading this lesson carefully is how you avoid being one of them.
What a DAG is
DAG stands for Directed Acyclic Graph. Break it down:
- Graph: a set of nodes (here, tasks, meaning single units of work like "land raw orders") connected by edges.
- Directed: each edge has a direction. An arrow from ingest_orders to dbt_run means ingest must run before transform. That arrow is a dependency.
- Acyclic: no cycles. You can never follow the arrows and end up back where you started. This guarantees a valid running order exists. With a cycle (A depends on B, which depends on A), neither task could ever start.
So a DAG is just "these tasks, run in this dependency order, and the dependencies never loop." That's the whole idea, and it's why every orchestrator — Airflow, Dagster, Prefect — is built on it.
Take ShopFlow's daily pipeline (see Meet ShopFlow). The DAG shopflow_daily lands yesterday's orders, rebuilds the warehouse models, then tests them:
ingest_orders → dbt_run → quality_check
The orchestrator computes a valid order by topological sort, a fancy name for "repeatedly pick any task whose dependencies are all done." Tasks with no dependency between them can run in parallel. shopflow_daily has no such tasks: its chain is strictly linear, because dbt_run can't start until ingest_orders lands raw.orders, and quality_check can't test models dbt_run hasn't built yet. Either way, you declare what depends on what; the orchestrator figures out the order and the parallelism.
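The "repeatedly pick any task whose dependencies are all done" rule can be sketched in a few lines of Python. This is an illustration, not how any particular orchestrator implements it. The function name run_order and the dictionary layout (each task mapped to the set of tasks it depends on) are assumptions made for this example, and the sketch assumes every task appears as a key.

```python
def run_order(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; every task in a wave can run in parallel.

    `deps` maps each task to the set of tasks it depends on.
    Assumes every task appears as a key in `deps`.
    """
    remaining = {task: set(upstream) for task, upstream in deps.items()}
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # A task is ready once all of its upstream tasks are done.
        ready = sorted(t for t, upstream in remaining.items() if upstream <= done)
        if not ready:
            # Nothing can start: the remaining tasks wait on each other.
            raise ValueError("cycle detected: no task is ready to run")
        waves.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return waves


shopflow_daily = {
    "ingest_orders": set(),
    "dbt_run": {"ingest_orders"},
    "quality_check": {"dbt_run"},
}

print(run_order(shopflow_daily))
# [['ingest_orders'], ['dbt_run'], ['quality_check']]
```

Each inner list is one wave. shopflow_daily yields three waves of one task each because the chain is linear; two tasks that both depend only on ingest_orders would land in the same wave. A dependency loop raises an error, which is the "nothing could ever start" case.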
:::note You define dependencies, not order
A beginner instinct is to think "I'll list tasks top to bottom in the order they run." Don't. You declare dependencies (dbt_run needs ingest_orders), and the orchestrator derives the order. This is what lets it parallelize independent branches and recover correctly.
:::
The schedule interval
A DAG by itself is a template. To actually run, it needs a schedule interval: how often the orchestrator should kick off a run. You express it as a cron expression (e.g. 0 2 * * * = "2 a.m. daily") or a preset like @daily / @hourly. shopflow_daily runs @daily — once per day, after the previous day's orders are complete.
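A cron expression has five space-separated fields: minute, hour, day of month, month, and day of week. As a rough sketch, the helper below labels those fields. The names describe_schedule and PRESETS are made up for this example and are not part of any orchestrator's API; the preset expansions are the conventional cron meanings of @daily and @hourly.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

# Conventional expansions of the two presets mentioned above.
PRESETS = {
    "@daily": "0 0 * * *",   # midnight, every day
    "@hourly": "0 * * * *",  # minute 0 of every hour
}


def describe_schedule(schedule: str) -> dict[str, str]:
    """Expand a preset if needed, then label each of the five cron fields."""
    expression = PRESETS.get(schedule, schedule)
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 cron fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


print(describe_schedule("0 2 * * *"))
print(describe_schedule("@daily"))
```

Reading the first result field by field gives minute 0, hour 2, and a wildcard everywhere else, which is "2 a.m. daily."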
Each time the schedule fires, the orchestrator creates a DAG run — one execution of the whole DAG — which in turn creates a task instance for each task (the specific run of that task for that period). "Did dbt_run succeed?" is always really "did this task instance succeed?"
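One way to picture the relationship is as plain data: a DAG run holds one task instance per task, all for the same period. The classes and the create_dag_run helper below are hypothetical, simplified stand-ins written for this lesson, not the data model of Airflow, Dagster, or Prefect, and the year in the example date is illustrative.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TaskInstance:
    task_id: str
    logical_date: date      # the period this particular run of the task covers
    state: str = "queued"


@dataclass
class DagRun:
    dag_id: str
    logical_date: date
    task_instances: list[TaskInstance] = field(default_factory=list)


def create_dag_run(dag_id: str, task_ids: list[str], logical_date: date) -> DagRun:
    """One DAG run creates one task instance per task, all for the same period."""
    run = DagRun(dag_id, logical_date)
    run.task_instances = [TaskInstance(task_id, logical_date) for task_id in task_ids]
    return run


run = create_dag_run(
    "shopflow_daily",
    ["ingest_orders", "dbt_run", "quality_check"],
    date(2024, 6, 23),  # illustrative year
)
for ti in run.task_instances:
    print(ti.task_id, ti.logical_date, ti.state)
```

"Did dbt_run succeed?" then means: what is the state of the dbt_run task instance inside this specific run?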
The hard part: logical date ≠ when it runs
Here is the concept that confuses everyone, so we'll go slowly with a concrete trace.
shopflow_daily processes one day of orders per run. The key insight: the data a run processes, and the moment the run executes, are two different things.
The data interval is the window of data a run is responsible for — for a daily DAG, one calendar day. The logical date (historically called the execution date in Airflow) is the label identifying that interval. It is not the wall-clock time the run starts.
Why are they different? Because you can only process a day's data after that day is over. To process all of June 23rd's ShopFlow orders, you must wait until June 23rd has fully elapsed — so the shopflow_daily run actually executes early on June 24th. The run's logical date points at the interval it covers (the 23rd), even though it physically runs on the 24th.
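Let's trace it. The shopflow_daily run with logical date June 23rd covers the data interval from June 23rd 00:00 up to (but not including) June 24th 00:00, and it executes early on June 24th, once that interval has closed.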
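The same trace can be computed for several consecutive days. This is a minimal sketch that assumes a daily schedule firing at midnight UTC and a logical date equal to the start of the interval. The function name daily_run and the year 2024 are illustrative choices, not taken from any tool.

```python
from datetime import datetime, timedelta, timezone


def daily_run(logical_date: datetime) -> dict[str, datetime]:
    """Describe one run of a daily DAG, labeled by the start of its interval."""
    data_interval_start = logical_date
    data_interval_end = logical_date + timedelta(days=1)
    return {
        "logical_date": logical_date,
        "data_interval_start": data_interval_start,
        "data_interval_end": data_interval_end,
        # The run cannot start before its interval has fully elapsed.
        "earliest_run_time": data_interval_end,
    }


for day in (22, 23, 24):
    run = daily_run(datetime(2024, 6, day, tzinfo=timezone.utc))
    print(
        f"logical date {run['logical_date'].date()} | "
        f"covers {run['data_interval_start'].date()} up to {run['data_interval_end'].date()} | "
        f"runs at or after {run['earliest_run_time'].isoformat()}"
    )
```

In every row the earliest run time is one day later than the logical date, which is exactly the gap between "the day the data belongs to" and "the day the run executes."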
So inside ingest_orders, you must use the logical date, never today() or the wall clock, to decide which orders to pull:
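# CORRECT: derive the window from the run's own logical/data interval.
# (Sketch: the orchestrator injects these values at run time; the exact variable
# names are tool-specific and change between versions.)
from datetime import datetime, timezone
data_interval_start = datetime(2024, 6, 23, tzinfo=timezone.utc)  # example value; year is illustrative
data_interval_end = datetime(2024, 6, 24, tzinfo=timezone.utc)    # example value
start, end = data_interval_start, data_interval_end  # the run's own window
correct_sql = f"SELECT * FROM orders WHERE order_ts >= '{start}' AND order_ts < '{end}'"
# WRONG: uses wall-clock "now", so a re-run next week processes the WRONG day
wrong_sql = "SELECT * FROM orders WHERE order_ts >= CURRENT_DATE - 1"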
The wrong version is a trap: it works the first time, but if you ever re-run shopflow_daily for June 23rd a week later (to fix a bug), CURRENT_DATE - 1 now points at a totally different day. The correct version always reprocesses exactly the interval the run owns (June 23rd's orders, landing into raw.orders), no matter when you run it. This is the foundation of safe re-runs and backfills, the subject of the next lesson.
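To see the difference concretely, here is a small self-contained sketch using an in-memory SQLite table. The table layout, the sample rows, and the function names fetch_orders and wall_clock_window are all invented for this illustration; a real pipeline would query its actual source. The sketch also passes the window as bound parameters rather than formatting it into the SQL string.

```python
import sqlite3
from datetime import datetime, timedelta


def fetch_orders(conn: sqlite3.Connection, start: datetime, end: datetime) -> list[int]:
    """Return ids of orders with start <= order_ts < end (a half-open window)."""
    rows = conn.execute(
        "SELECT order_id FROM orders "
        "WHERE order_ts >= ? AND order_ts < ? ORDER BY order_id",
        (start.isoformat(sep=" "), end.isoformat(sep=" ")),
    )
    return [row[0] for row in rows]


def wall_clock_window(now: datetime) -> tuple[datetime, datetime]:
    """The WRONG approach: 'yesterday' relative to whenever the code happens to run."""
    today = datetime(now.year, now.month, now.day)
    return today - timedelta(days=1), today


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_ts TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2024-06-22 23:59:00"),
        (2, "2024-06-23 00:00:00"),
        (3, "2024-06-23 18:30:00"),
        (4, "2024-06-24 00:00:00"),
    ],
)

# Interval-driven: the same answer whether this runs on June 24th or July 1st.
start = datetime(2024, 6, 23)
end = start + timedelta(days=1)
print(fetch_orders(conn, start, end))  # [2, 3]

# Wall-clock-driven: a re-run on July 1st looks at June 30th instead.
print(fetch_orders(conn, *wall_clock_window(datetime(2024, 7, 1, 9, 0))))  # []
```

The interval-driven call depends only on its arguments, so re-running it later cannot change which day it covers. The wall-clock version silently drifts to whatever "yesterday" means at run time.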
:::tip Tasks are parameterized by their interval, not by "now."
A well-built task asks "what interval am I responsible for?" and processes exactly that. It never asks "what time is it right now?" This one habit makes re-runs and backfills correct by construction. We formalize it as parameterization later.
:::
Catchup: the other half of the trap
Now the second famous gotcha. You write a DAG with start_date = January 1 and @daily, and you finally turn it on in June. What happens?
In Airflow 2.x, catchup is on by default (Airflow 3 switched the default to off), which means the scheduler reasons: "This daily DAG was supposed to start January 1. It's now June. I owe a run for every missed interval." It immediately schedules ~150 runs, one per day from January through today, and starts grinding through them.
Catchup is whether the scheduler creates a run for every past interval between the start date and now. With catchup on, enabling a DAG triggers a flood of historical runs. With it off, the scheduler skips the backlog and the DAG runs only from the most recent interval forward.
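The size of that flood is simple date arithmetic. The sketch below assumes a daily DAG, and it assumes that with catchup off only the most recent completed interval gets a run when the DAG is enabled (this mirrors how Airflow describes its behavior, but tools differ). The function name intervals_to_run and the year 2024 are illustrative.

```python
from datetime import date, timedelta


def intervals_to_run(start_date: date, today: date, catchup: bool) -> list[date]:
    """Logical dates a daily DAG gets runs for when it is enabled on `today`."""
    # Every daily interval from start_date that has fully elapsed before today.
    completed = [
        start_date + timedelta(days=offset)
        for offset in range((today - start_date).days)
    ]
    if catchup:
        return completed      # one run per missed interval
    return completed[-1:]     # assumption: only the latest completed interval


start_date = date(2024, 1, 1)
enabled_on = date(2024, 6, 1)

print(len(intervals_to_run(start_date, enabled_on, catchup=True)))   # 152
print(intervals_to_run(start_date, enabled_on, catchup=False))       # [datetime.date(2024, 5, 31)]
```

Enabling the DAG on June 1st with catchup on queues 152 runs in this example (2024 is a leap year), in line with the "~150" figure above. With catchup off, there is a single run for May 31st and then the normal daily cadence.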
Sometimes catchup-on is exactly what you want (you genuinely need to process all of history). Often it's a nasty surprise: your warehouse and API are suddenly hit by ~150 backlogged runs you didn't ask for, as many at once as the scheduler's concurrency limits allow. The fix depends on intent:
- If you don't want history processed automatically: set catchup=False. The DAG starts running from the current interval forward.
- If you do want history, but in a controlled way: leave catchup off and run an explicit backfill over the range you choose (covered next).
:::note A real-world default
Many teams set catchup=False as the standard and reach for explicit backfills when they truly need history. A surprise stampede of historical runs at deploy time is one of the most common "what just happened to our warehouse bill?" incidents for new Airflow users.
:::
Why it matters
A DAG — directed (dependencies have a direction), acyclic (no loops, so an order always exists) — is the universal shape of a pipeline; you declare dependencies and the orchestrator derives order and parallelism. A schedule interval turns that template into recurring DAG runs, each made of task instances. The concept to never forget: the logical date labels the data interval a run covers, not the wall-clock time it executes — so tasks must derive their work window from the interval, not from "now," or re-runs silently process the wrong day. And catchup governs whether enabling a DAG floods you with historical runs; knowing to set it deliberately (often False) saves you a very bad afternoon. With the interval model clear, we can finally do history correctly: idempotency and backfills.