Chapter 7 · Transformation & the Modern Data Stack
You've learned to land raw data in a warehouse (Chapter 6) and to model it dimensionally (Chapter 4). This chapter is about the step in between: transformation — turning raw, ugly, source-shaped tables into the clean, tested, modeled tables that analysts, dashboards, and machine-learning features depend on. This is where most data engineers and analytics engineers spend most of their day, and it's the chapter where a single tool — dbt — dominates the landscape.
This is the heart of the running example. Every worked model in this chapter builds ShopFlow (ShopFlow — see Meet ShopFlow) — the raw tables Chapter 6 landed (raw.orders, raw.customers, raw.products, raw.order_items) become dbt staging models (stg_orders, stg_customers, …) and finally the star schema Chapter 4 designed (fact_sales + dim_customer/dim_product/dim_date). You watch the same rows travel raw → staging → marts.
The durable idea
Transform data in the warehouse with version-controlled, tested SQL — treat transformations like software, not one-off scripts.
For two decades, "transformation" meant a thicket of fragile, undocumented scripts that nobody dared touch. The durable shift this chapter teaches is that transformations are software: they live in version control, they're built from small reusable pieces, they're tested automatically, they document themselves, and they ship through CI/CD. The warehouse got cheap and fast enough to do the heavy lifting, so transformation moved into it — and software-engineering discipline came with it.
ELT, modular tested SQL, a transformation DAG, and a governed metrics layer are durable. dbt itself, its Cloud product, the specific adapters, and this year's engine (Fusion) are dated — important to know, but they will change.
What you'll learn
- The ELT paradigm — why "extract, load, then transform" beat the old ETL order once the cloud warehouse got cheap, and what the "modern data stack" actually is.
- dbt's mental model — the single most important idea in the chapter: a dbt model is just a
SELECT, andref()wires thoseSELECTs into a DAG of compiled SQL. Project structure, sources, and the compiled-SQL view. - Materializations & incremental models — view vs table vs incremental vs ephemeral, and the hardest real-world dbt topic: incremental strategies (
append,merge,delete+insert,insert_overwrite) andis_incremental(). - Jinja, macros, packages & testing — DRY transformations with Jinja and macros (and the danger of over-templating),
dbt_utils, and shift-left testing (generic + singular tests). - Sources, snapshots, seeds & docs — source freshness, snapshots (dbt's SCD Type 2), seeds, exposures, and the living lineage graph from
dbt docs. - Layering, the semantic layer & CI/CD — the staging → intermediate → marts convention mapped to medallion (bronze/silver/gold), the dbt Semantic Layer / MetricFlow, slim CI with
state:modifiedand deferral, and the 2026 landscape (dbt Fusion vs SQLMesh). - Chapter 7 checkpoint — lock it in with a quiz.
By the end you'll be able to read any dbt project, explain why a model is materialized the way it is, reason about incremental loads without fear, and describe the modern data stack end to end.
Start here: The ELT paradigm →