Airflow core (and Airflow 3)
We've built the durable model: a DAG of idempotent, partition-scoped tasks driven by a logical date. Now we map it onto the most widely used orchestrator, Apache Airflow. Learn Airflow's vocabulary and the concepts transfer to almost every other tool — and you'll recognize the durable ideas from earlier lessons dressed in concrete names.
Everything in this lesson is the dated layer — specific APIs, class names, and executors. The concepts underneath (a scheduler queuing ready tasks, reusable task templates, passing small values between tasks) are durable. We flag which is which.
The building blocks
Operators — reusable task templates
An operator is a template for one kind of task. Instead of writing the plumbing to run SQL, you instantiate a SQL operator with your query; instead of writing HTTP code, you use an HTTP operator. Common families:
- A Python operator runs a Python function.
- A SQL/warehouse operator runs a query against Snowflake/BigQuery/Postgres.
- A transfer operator moves data between systems (e.g. cloud storage → warehouse).
- A trigger operator kicks off external work — a Spark job, a dbt run — which is exactly the orchestrate-don't-compute boundary in action.
An operator instance in a DAG is a task; when it runs for a particular interval it's a task instance.
Sensors — tasks that wait
A sensor is a special task that waits for a condition before downstream tasks run: a file lands in a bucket, a partition appears in a table, an external job finishes. Sensors are how a pipeline reacts to data availability rather than guessing with a time gap (remember cron's fragile 30-minute gap from lesson 8.1).
:::warning Poke-mode sensors are an anti-pattern at scale A classic sensor runs in poke mode: it occupies a worker slot and re-checks the condition every N seconds for hours. A hundred such sensors waiting for files can consume all your worker slots doing nothing but waiting — starving real work. The modern fix is deferrable (async) sensors, which release the worker slot while waiting and resume via a lightweight trigger. We dig into this in the reliability lesson; for now: prefer deferrable, avoid armies of poking sensors. :::
Hooks — clients for external systems
A hook is Airflow's reusable client for talking to an external system (a warehouse, an API, cloud storage). Operators use hooks under the hood; you use hooks directly when you write custom logic. A hook handles connection details and auth so your task code stays focused.
Connections & variables — config out of code
- A connection is a stored, named bundle of credentials + endpoint for an external system. Your task references it by name (
my_warehouse) and Airflow injects the real secret at runtime — so credentials never live in your DAG code. - A variable is a stored, named configuration value (a bucket name, a feature flag) read at runtime.
Both keep secrets and environment-specific values out of code, which is what makes the same DAG run unchanged across dev and prod — the parameterization principle again.
XComs — passing small values between tasks
XCom ("cross-communication") lets one task hand a small value to another — an extracted row count, a file path, an ID. The emphasis is small: XCom is metadata plumbing, not a data pipe. Passing a whole dataset through XCom is an anti-pattern (it'd serialize gigabytes through Airflow's metadata store). Move data through storage/the warehouse; pass pointers through XCom. This is the orchestrate-don't-compute boundary yet again.
The scheduler
The scheduler is Airflow's brain: it parses your DAG files, figures out which task instances are ready (their dependencies succeeded, their interval has arrived), and queues them. It's continuously asking "what can run now?" The scheduler is also why the next pitfall matters so much.
Executors — where tasks actually run
The executor decides where and how queued tasks physically run. The common choices:
- LocalExecutor — tasks run as processes on one machine. Fine for small setups.
- CeleryExecutor — tasks run on a pool of worker machines via a queue. Scales horizontally.
- KubernetesExecutor — each task runs in its own Kubernetes pod (Chapter 4 / cloud guide), isolated and elastically scaled.
The executor is purely a dated operational choice; the DAG you write is identical regardless.
The TaskFlow API — DAGs as plain functions
Older Airflow made you instantiate operator objects and wire dependencies manually. The TaskFlow API lets you write tasks as decorated Python functions, and it automatically passes return values between them (via XCom) and infers dependencies from the call graph. Here is ShopFlow's pipeline (ShopFlow — see Meet ShopFlow) as a TaskFlow DAG — ingest_orders → dbt_run → quality_check:
# TaskFlow style — concepts are durable, exact API is DATED
from airflow.decorators import dag, task
@dag(schedule="@daily", start_date=..., catchup=False) # catchup off by default — lesson 8.2
def shopflow_daily():
@task
def ingest_orders(data_interval_start=None, data_interval_end=None):
# land THIS run's order-date partition into raw.orders.
# use the run's OWN interval, never "now" — lesson 8.2 / idempotency
rows = query_orders(data_interval_start, data_interval_end)
overwrite_partition("raw.orders", dt=data_interval_start, rows=rows) # idempotent — lesson 8.3
return rows.count
@task
def dbt_run(order_count):
# orchestrate, don't compute: trigger dbt to build stg_* + fact_sales — lesson 8.1
run_dbt(select="stg_orders+ fact_sales")
@task
def quality_check():
run_dbt(command="test", select="fact_sales") # tests on the freshly built models
quality_check_t = quality_check()
dbt_run(ingest_orders()) >> quality_check_t # ingest_orders → dbt_run → quality_check
shopflow_daily()
Notice how the durable lessons show up: catchup=False, deriving the window from the data interval, an idempotent partition overwrite on raw.orders, and dbt_run/quality_check triggering dbt rather than crunching data themselves (the orchestrate-don't-compute boundary). The TaskFlow decorators are dated sugar; the discipline underneath is what makes it correct.
The number-one Airflow pitfall: top-level code
The scheduler re-parses every DAG file constantly to discover changes. Any code at the top level of the file (not inside a task) runs on every parse — potentially every few seconds, for every DAG.
Top-level code is code that executes when Airflow parses the DAG file, before any task runs. Heavy work here — a database query, a big import, an API call to build the DAG — runs on every parse and can grind the scheduler to a halt.
# BAD — runs on EVERY DAG parse, hammering the API and slowing the scheduler
data = requests.get("https://api.example.com/config").json() # top-level!
# GOOD — the network call lives INSIDE a task, so it runs only when the task runs
@task
def fetch_config():
return requests.get("https://api.example.com/config").json()
The rule: DAG files should be cheap to parse. Keep imports light and put all real work — queries, API calls, heavy computation — inside tasks, never at module level. Slow DAG parsing is one of the most common reasons an Airflow deployment feels sluggish.
What Airflow 3 changed (2025–2026)
Airflow 3 is a major release; the headline shifts you should know:
- Data-aware (asset) scheduling. Airflow 3 elevates assets (called datasets in Airflow 2.4+). A DAG can declare it produces an asset and another DAG can be triggered when that asset updates — scheduling driven by data availability, not just the clock. This is the big industry move we cover head-on in the asset-vs-task lesson. You can even partition assets, aligning the data-aware model with the interval thinking from lesson 8.2.
- DAG versioning. Airflow 3 tracks which version of a DAG's code produced a given run, so history stays interpretable when you change a DAG — no more "which code actually ran last Tuesday?"
- A modernized architecture for multi-team deployments — clearer isolation so several teams can run on shared Airflow without stepping on each other, plus a separated API/task-execution layer.
The takeaway: Airflow is moving toward the data-aware, asset-centric world that Dagster popularized — which is exactly why the next lesson contrasts the two models directly.
Why it matters
Airflow gives the durable model concrete names: operators (task templates), sensors (wait for a condition — prefer deferrable over slot-hogging poke mode), hooks (clients for external systems), connections/variables (secrets and config out of code), XCom (pass small values, not data), the scheduler (queues ready tasks), and executors (Local/Celery/Kubernetes — where tasks run). The TaskFlow API writes all of this as plain decorated functions. The cardinal pitfall is top-level code, which runs on every parse and can stall the scheduler — keep real work inside tasks. And Airflow 3 pushes hard into data-aware asset scheduling, DAG versioning, and multi-team deployments. That asset shift is big enough to deserve its own lesson — next, we compare it to Dagster's asset-first model.