Skip to main content

Chapter 7 checkpoint

You can now transform raw warehouse data into trustworthy, modeled, documented tables — treating transformation like software. Recall the spine, take the quiz, then trace one model.

The throughline

  • ELT beat ETL because cheap, elastic, decoupled cloud-warehouse compute removed the reason to transform before loading: load raw first, transform in place with SQL, keep the raw data forever. The modern data stack = managed EL + warehouse + dbt + BI.
  • dbt is a SQL compiler and orchestrator with no compute of its own: a model is a SELECT, ref() and source() wire models into a DAG, and dbt compiles it to ordered plain SQL the warehouse runs. When in doubt, read the compiled SQL.
  • Materializations decide what a model becomes: view (no data, pays at query), table (rebuilt each run, cheap to query), ephemeral (inlined CTE), incremental (build once, then only new rows).
  • Incremental = is_incremental() filters to new rows on later runs (via a watermark); the strategy (append / merge / delete+insert / insert_overwrite) decides how they combine in. merge on a unique_key is the forgiving default.
  • Jinja lets SQL write SQL ({{ }} substitutes, {% %} controls); macros + packages (dbt_utils) keep it DRY — but don't over-template. Tests (generic: unique/not_null/accepted_values/relationships; singular) shift quality left so bad data fails the build.
  • Sources + freshness guard stale inputs; snapshots = SCD Type 2 (history, going forward only); seeds = small reference CSVs; exposures = downstream consumers; dbt docs = living lineage graph.
  • Layering staging → intermediate → marts ≡ medallion bronze/silver/gold. The semantic layer (MetricFlow) defines each metric once. CI/CD: slim CI (state:modified+) + --defer to prod + environment promotion. 2026: dbt Fusion (Rust, column-level lineage) and SQLMesh (the alternative).

Quiz

Required checkpoint

Chapter 7 — Transformation & the Modern Data Stack

Pass to unlock the Next button below

Trace this dbt model

No cluster needed — the daily skill is reading a model and predicting its compiled SQL and behavior. Work through this ShopFlow daily-revenue rollup in your head (ShopFlow — see Meet ShopFlow):

-- models/marts/daily_revenue.sql
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='order_date') }}

select
cast(order_ts as date) as order_date,
sum(line_revenue) as revenue
from {{ ref('fact_sales') }}

{% if is_incremental() %}
where cast(order_ts as date) >= (select max(order_date) from {{ this }})
{% endif %}

group by 1

Question 1: On the very first dbt run, which rows of fact_sales does this read, and why? Question 2: On the next day's run, what does the {% if is_incremental() %} block do, and why is >= (not >) used here? Question 3: Why is merge on unique_key='order_date' the right strategy for this model?

Answers
  1. All rows. On the first run is_incremental() is false, so the where block disappears and the SELECT reads the entire fact_sales history to build the full table. (Same on any --full-refresh.)
  2. On later runs is_incremental() is true, so dbt adds where cast(order_ts as date) >= (select max(order_date) from {{ this }}), scanning only recent orders instead of all history. It uses >= (not >) because today's revenue is still accumulating — re-reading and re-summing today's orders keeps it correct as more sales land, rather than freezing a partial total. {{ this }} is the model's own already-built table.
  3. Because the same order_date is recomputed across runs (today's row is reprocessed every run as new ShopFlow orders land), you must update the existing row for that date, not insert a duplicate. merge on unique_key='order_date' upserts: it updates the row whose order_date already exists and inserts genuinely new dates. append would create duplicate revenue rows per day.

Where this leaves you

You can now read any dbt project and explain why each model is built the way it is: the ELT shift, the compiled-SQL-on-a-DAG model, the four materializations and the incremental strategies, Jinja/macros/tests, sources/snapshots/docs, the staging→marts (bronze→gold) layering, the semantic layer, and a slim, deferred CI/CD workflow — plus where Fusion and SQLMesh are taking the engine. But transformations don't run themselves on a schedule, in the right order, recovering from failure. Coordinating all the pipeline's work — ingestion, then dbt, then everything downstream — is the job of an orchestrator, which is Chapter 8.

Next: Chapter 8: Orchestration & Pipelines →