Skip to main content

Chapter 7 · Transformation & the Modern Data Stack

You've learned to land raw data in a warehouse (Chapter 6) and to model it dimensionally (Chapter 4). This chapter is about the step in between: transformation — turning raw, ugly, source-shaped tables into the clean, tested, modeled tables that analysts, dashboards, and machine-learning features depend on. This is where most data engineers and analytics engineers spend most of their day, and it's the chapter where a single tool — dbt — dominates the landscape.

This is the heart of the running example. Every worked model in this chapter builds ShopFlow (ShopFlow — see Meet ShopFlow) — the raw tables Chapter 6 landed (raw.orders, raw.customers, raw.products, raw.order_items) become dbt staging models (stg_orders, stg_customers, …) and finally the star schema Chapter 4 designed (fact_sales + dim_customer/dim_product/dim_date). You watch the same rows travel raw → staging → marts.

The durable idea

Transform data in the warehouse with version-controlled, tested SQL — treat transformations like software, not one-off scripts.

For two decades, "transformation" meant a thicket of fragile, undocumented scripts that nobody dared touch. The durable shift this chapter teaches is that transformations are software: they live in version control, they're built from small reusable pieces, they're tested automatically, they document themselves, and they ship through CI/CD. The warehouse got cheap and fast enough to do the heavy lifting, so transformation moved into it — and software-engineering discipline came with it.

ELT, modular tested SQL, a transformation DAG, and a governed metrics layer are durable. dbt itself, its Cloud product, the specific adapters, and this year's engine (Fusion) are dated — important to know, but they will change.

What you'll learn

  • The ELT paradigm — why "extract, load, then transform" beat the old ETL order once the cloud warehouse got cheap, and what the "modern data stack" actually is.
  • dbt's mental model — the single most important idea in the chapter: a dbt model is just a SELECT, and ref() wires those SELECTs into a DAG of compiled SQL. Project structure, sources, and the compiled-SQL view.
  • Materializations & incremental models — view vs table vs incremental vs ephemeral, and the hardest real-world dbt topic: incremental strategies (append, merge, delete+insert, insert_overwrite) and is_incremental().
  • Jinja, macros, packages & testing — DRY transformations with Jinja and macros (and the danger of over-templating), dbt_utils, and shift-left testing (generic + singular tests).
  • Sources, snapshots, seeds & docs — source freshness, snapshots (dbt's SCD Type 2), seeds, exposures, and the living lineage graph from dbt docs.
  • Layering, the semantic layer & CI/CD — the staging → intermediate → marts convention mapped to medallion (bronze/silver/gold), the dbt Semantic Layer / MetricFlow, slim CI with state:modified and deferral, and the 2026 landscape (dbt Fusion vs SQLMesh).
  • Chapter 7 checkpoint — lock it in with a quiz.

By the end you'll be able to read any dbt project, explain why a model is materialized the way it is, reason about incremental loads without fear, and describe the modern data stack end to end.

Start here: The ELT paradigm →