Chapter 7 checkpoint
You can now transform raw warehouse data into trustworthy, modeled, documented tables — treating transformation like software. Recall the spine, take the quiz, then trace one model.
The throughline
- ELT beat ETL because cheap, elastic, decoupled cloud-warehouse compute removed the reason to transform before loading: load raw first, transform in place with SQL, keep the raw data forever. The modern data stack = managed EL + warehouse + dbt + BI.
- dbt is a SQL compiler and orchestrator with no compute of its own: a model is a
SELECT,ref()andsource()wire models into a DAG, and dbt compiles it to ordered plain SQL the warehouse runs. When in doubt, read the compiled SQL. - Materializations decide what a model becomes: view (no data, pays at query), table (rebuilt each run, cheap to query), ephemeral (inlined CTE), incremental (build once, then only new rows).
- Incremental =
is_incremental()filters to new rows on later runs (via a watermark); the strategy (append/merge/delete+insert/insert_overwrite) decides how they combine in.mergeon aunique_keyis the forgiving default. - Jinja lets SQL write SQL (
{{ }}substitutes,{% %}controls); macros + packages (dbt_utils) keep it DRY — but don't over-template. Tests (generic:unique/not_null/accepted_values/relationships; singular) shift quality left so bad data fails the build. - Sources + freshness guard stale inputs; snapshots = SCD Type 2 (history, going forward only); seeds = small reference CSVs; exposures = downstream consumers;
dbt docs= living lineage graph. - Layering staging → intermediate → marts ≡ medallion bronze/silver/gold. The semantic layer (MetricFlow) defines each metric once. CI/CD: slim CI (
state:modified+) +--deferto prod + environment promotion. 2026: dbt Fusion (Rust, column-level lineage) and SQLMesh (the alternative).
Quiz
Chapter 7 — Transformation & the Modern Data Stack
Pass to unlock the Next button belowTrace this dbt model
No cluster needed — the daily skill is reading a model and predicting its compiled SQL and behavior. Work through this ShopFlow daily-revenue rollup in your head (ShopFlow — see Meet ShopFlow):
-- models/marts/daily_revenue.sql
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='order_date') }}
select
cast(order_ts as date) as order_date,
sum(line_revenue) as revenue
from {{ ref('fact_sales') }}
{% if is_incremental() %}
where cast(order_ts as date) >= (select max(order_date) from {{ this }})
{% endif %}
group by 1
Question 1: On the very first dbt run, which rows of fact_sales does this read, and why?
Question 2: On the next day's run, what does the {% if is_incremental() %} block do, and why is >= (not >) used here?
Question 3: Why is merge on unique_key='order_date' the right strategy for this model?
Answers
- All rows. On the first run
is_incremental()isfalse, so thewhereblock disappears and theSELECTreads the entirefact_saleshistory to build the full table. (Same on any--full-refresh.) - On later runs
is_incremental()istrue, so dbt addswhere cast(order_ts as date) >= (select max(order_date) from {{ this }}), scanning only recent orders instead of all history. It uses>=(not>) because today's revenue is still accumulating — re-reading and re-summing today's orders keeps it correct as more sales land, rather than freezing a partial total.{{ this }}is the model's own already-built table. - Because the same
order_dateis recomputed across runs (today's row is reprocessed every run as new ShopFlow orders land), you must update the existing row for that date, not insert a duplicate.mergeonunique_key='order_date'upserts: it updates the row whoseorder_datealready exists and inserts genuinely new dates.appendwould create duplicate revenue rows per day.
Where this leaves you
You can now read any dbt project and explain why each model is built the way it is: the ELT shift, the compiled-SQL-on-a-DAG model, the four materializations and the incremental strategies, Jinja/macros/tests, sources/snapshots/docs, the staging→marts (bronze→gold) layering, the semantic layer, and a slim, deferred CI/CD workflow — plus where Fusion and SQLMesh are taking the engine. But transformations don't run themselves on a schedule, in the right order, recovering from failure. Coordinating all the pipeline's work — ingestion, then dbt, then everything downstream — is the job of an orchestrator, which is Chapter 8.