
Asset-based vs task-based orchestration

There are two fundamentally different ways to think about a pipeline, and the gap between them is the biggest conceptual shift in orchestration over the last few years. Most older guides teach only the first and never mention the second — so let's close that gap head-on. The two models are task-centric and asset-centric, and understanding both makes you fluent across Airflow, Dagster, and Prefect.

Two ways to describe the same pipeline

Picture ShopFlow's goal (see Meet ShopFlow): the fact_sales table, built from raw.orders, which ingest_orders lands from the source database.

Task-centric (the classic model, e.g. Airflow). You describe the verbs — the steps to perform and their order. This is exactly the shopflow_daily DAG from the last lessons:

"Run ingest_orders, then dbt_run, then quality_check."

The orchestrator's job is to run those steps in order. The tables those steps produce — fact_sales, dim_customer — are an implicit side effect; nowhere does the system have a first-class concept of "the fact_sales table." If someone asks "what produces fact_sales and is it fresh?", the system can't answer directly; it only knows about tasks.
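To make "it only knows about tasks" concrete, here is a minimal sketch in plain Python. It is not Airflow code: the three functions are hypothetical stand-ins that only record their own names, and `run_pipeline` is an assumed toy runner. The point is what the data structure contains, namely an ordered list of verbs and no table names at all.

```python
from typing import Callable

# Hypothetical stand-ins for the three ShopFlow steps; each only logs its name.
run_log: list[str] = []

def ingest_orders() -> None:
    run_log.append("ingest_orders")   # in real life: lands raw.orders

def dbt_run() -> None:
    run_log.append("dbt_run")         # in real life: builds the downstream tables

def quality_check() -> None:
    run_log.append("quality_check")

# The pipeline is an ordered list of verbs. No table name appears anywhere
# in the structure this toy "orchestrator" sees.
shopflow_daily: list[Callable[[], None]] = [ingest_orders, dbt_run, quality_check]

def run_pipeline(tasks: list[Callable[[], None]]) -> list[str]:
    """Run each task in order and return the names of the tasks that ran."""
    for task in tasks:
        task()
    return [task.__name__ for task in tasks]

executed = run_pipeline(shopflow_daily)
print(executed)  # ['ingest_orders', 'dbt_run', 'quality_check']
```

Ask this structure "what produces fact_sales?" and there is nothing to look up: the table names live only in comments and inside the task bodies.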

Asset-centric (the newer model, pioneered by Dagster). You describe the nouns — the data objects you want to exist and what each depends on:

"fact_sales is an asset built from the stg_orders asset, built from the raw.orders asset, which is loaded from the source database."

A software-defined asset (SDA) is a declarative definition of a data object — a table, a file, an ML model — together with the code that produces it. You declare the asset and its upstream assets; the orchestrator derives the DAG from those dependencies.

Notice the inversion. In the task model you write the steps (ingest_orders → dbt_run → quality_check) and ShopFlow's tables are a byproduct. In the asset model you declare the tables (fact_sales and dim_customer) as assets, and the steps (and their order) are derived. The DAG still exists; it's computed from asset dependencies instead of hand-wired.
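The "derived" part can be sketched with nothing but the standard library. This is an illustration of the idea, not Dagster's API: the `assets` dict and `derive_run_order` are assumptions made for the example, and `source_db` is a hypothetical name for the source database. Each asset declares only what it is built from, and the run order falls out of a topological sort.

```python
from graphlib import TopologicalSorter

# Each asset maps to the set of assets it is built from (its upstreams).
# "source_db" is a hypothetical name for the source database.
assets: dict[str, set[str]] = {
    "raw.orders": {"source_db"},
    "stg_orders": {"raw.orders"},
    "fact_sales": {"stg_orders"},
}

def derive_run_order(deps: dict[str, set[str]]) -> list[str]:
    """Compute an execution order in which every upstream comes first."""
    return list(TopologicalSorter(deps).static_order())

run_order = derive_run_order(assets)
print(run_order)  # ['source_db', 'raw.orders', 'stg_orders', 'fact_sales']
```

Nobody wrote "run this, then that." Adding a new asset means adding one entry with its upstreams, and the order is recomputed.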

[Diagram: the tasks ingest_orders and dbt_run, and the assets raw.orders and stg_orders.]

Why the asset model caught on

Declaring data objects instead of steps unlocks things that are awkward in the pure task world:

  • Lineage for free. Because every asset declares its upstream assets, the system knows the data lineage — "fact_sales comes from stg_orders comes from raw.orders comes from the source database." You get a live dependency graph of ShopFlow's data, not just its jobs, which is gold for debugging and impact analysis.
  • "Is this table fresh?" is a first-class question. The system tracks each asset's last materialization, so freshness, staleness, and "what needs rebuilding?" are native concepts — "when was fact_sales last rebuilt?" is answerable directly.
  • Data-aware scheduling falls out naturally. If you've declared that fact_sales depends on raw.orders, then "rebuild fact_sales when raw.orders updates" is the obvious behavior — scheduling driven by data availability, not just the clock.
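Lineage and freshness can both be read off the declared dependencies, as the rough sketch below shows. Everything in it is assumed for illustration: the `upstream` dict, the `source_db` name, the timestamps, and the staleness rule (an asset is stale if a direct upstream was materialized after it). Real orchestrators apply richer freshness policies.

```python
from datetime import datetime

# Declared dependencies: asset -> direct upstream assets.
# "source_db" is a hypothetical name for the source database.
upstream: dict[str, list[str]] = {
    "raw.orders": ["source_db"],
    "stg_orders": ["raw.orders"],
    "fact_sales": ["stg_orders"],
}

def lineage(asset: str) -> list[str]:
    """Return every ancestor of `asset`, nearest first."""
    ancestors: list[str] = []
    pending = list(upstream.get(asset, []))
    while pending:
        parent = pending.pop(0)
        if parent not in ancestors:
            ancestors.append(parent)
            pending.extend(upstream.get(parent, []))
    return ancestors

# Made-up materialization times, purely for the example.
last_materialized: dict[str, datetime] = {
    "raw.orders": datetime(2025, 1, 2, 1, 30),
    "stg_orders": datetime(2025, 1, 2, 2, 0),
    "fact_sales": datetime(2025, 1, 1, 2, 5),   # built before today's stg_orders
}

def is_stale(asset: str) -> bool:
    """Assumed rule: stale if any direct upstream was materialized later."""
    built_at = last_materialized[asset]
    return any(
        last_materialized.get(parent, datetime.min) > built_at
        for parent in upstream.get(asset, [])
    )

print(lineage("fact_sales"))   # ['stg_orders', 'raw.orders', 'source_db']
print(is_stale("fact_sales"))  # True
print(is_stale("stg_orders"))  # False
```

Both questions are answered from data the system already holds, the dependency declarations and the materialization log, which is why they come "for free" in an asset model.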

Materialize is the asset-world verb for "run the code that produces this asset's data," the asset analog of "run this task."

Asset-based (data-aware) scheduling triggers work when an upstream data object updates, rather than purely on a time schedule. It's the model behind Dagster assets and, increasingly, Airflow 3's assets/datasets.

Time-based vs data-aware scheduling

This is the practical heart of the shift, so make it concrete:

  • Time-based: "Run dbt_run at 2 a.m." But what if ingest_orders is late landing raw.orders? You either build fact_sales on stale data or pad with a guessed gap (the fragile cron pattern from lesson 8.1).
  • Data-aware: "Rebuild fact_sales whenever raw.orders has new data." No guessing about timing — the downstream reacts to the upstream actually being ready.
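A toy simulation makes the difference visible. The two functions below are assumptions made for this example, not any scheduler's API, and times are hours after midnight written as floats (2.0 is 2 a.m.). Each returns when the fact_sales build would start and whether it would read stale raw.orders data.

```python
def time_based(ingest_done_at: float, build_at: float = 2.0) -> dict:
    """Fixed-clock trigger: the build starts at `build_at` no matter what.

    If ingest_orders finishes after that, the build reads yesterday's data.
    """
    return {"started_at": build_at, "stale": ingest_done_at > build_at}

def data_aware(ingest_done_at: float) -> dict:
    """Data-aware trigger: the build starts when raw.orders has landed."""
    return {"started_at": ingest_done_at, "stale": False}

# Normal night: ingest lands at 1:30, so both approaches work.
print(time_based(1.5))   # {'started_at': 2.0, 'stale': False}
print(data_aware(1.5))   # {'started_at': 1.5, 'stale': False}

# Bad night: ingest lands at 2:45, after the 2 a.m. trigger has fired.
print(time_based(2.75))  # {'started_at': 2.0, 'stale': True}
print(data_aware(2.75))  # {'started_at': 2.75, 'stale': False}
```

Padding the clock trigger (say, moving `build_at` to 3.0) only moves the failure threshold; the data-aware version has no threshold to guess.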

Data-aware scheduling directly fixes the late-upstream problem that time gaps only paper over. This is the 2025–2026 industry direction: Dagster was asset-first from the start, and Airflow 3 made assets first-class (lesson 8.4), building on the Datasets feature introduced in Airflow 2.4, precisely to offer it. Even dbt fits naturally here: a dbt project is essentially a graph of data assets (models), which is why dbt and asset-based orchestrators pair so well.

When each model fits

Neither is universally "better" — they suit different shapes of work:

| Lean task-centric (Airflow-style) when… | Lean asset-centric (Dagster-style) when… |
| --- | --- |
| The pipeline is a sequence of operational actions (trigger a job, call an API, move a file) where the output isn't a neat table | The pipeline's whole purpose is producing and keeping data assets fresh (tables, files, features, models) |
| You have a large existing Airflow estate and ecosystem of operators | You want built-in lineage, freshness, and data-aware scheduling from day one |
| Steps don't map cleanly to "one task = one data object" | Work maps cleanly to "this code produces this dataset," and you value strong typing/testing of those datasets |

:::tip The convergence
The lines are blurring fast. Airflow 3 brought assets to the task world; Dagster can still run plain ops/tasks when an action isn't a data object; Prefect stays flexible Python. So the real skill isn't picking a camp; it's thinking in both: "what data objects must exist (assets)?" and "what operational steps produce them (tasks)?" Fluency in both vocabularies is what makes you portable across tools.
:::

Why it matters

Task-centric orchestration describes the verbs — the steps and their order — and treats the data tables as a byproduct; it's the classic Airflow model and fits operational, action-shaped pipelines. Asset-centric orchestration (Dagster's software-defined assets, now echoed by Airflow 3's assets) describes the nouns — the data objects you want to exist and their dependencies — and derives the DAG, giving you lineage, freshness, and data-aware scheduling (rebuild when upstream data updates) for free. The 2025–2026 shift is decisively toward data-aware/asset thinking, and the two models are converging, so the durable skill is to reason in both. Next, we close the loop on running pipelines reliably: retries, SLAs, smart triggering, and keeping heavy compute out of the orchestrator.
