Chapter 4 · Data Modeling & Warehousing
Storing data is not the same as modeling it. Chapter 2 gave you durable, columnar files; Chapter 3 gave you SQL and the engines that run it. But a pile of correct tables is still hard to query if nobody decided what each row means. This chapter is about deliberate design: shaping data inside a data warehouse so that analysts can answer business questions quickly, consistently, and without re-deriving the same number five different ways.
A data warehouse is a database built for analytics — large queries over modeled, integrated, historical data — rather than for running an app. Modeling is the discipline of organizing that data so its meaning is obvious and its queries are fast.
The durable idea
Model data for the questions it will answer — declare the grain, separate measurements (facts) from context (dimensions), keep dimensions conformed and history intact — so the warehouse stays understandable and trustworthy as it grows.
Dimensional modeling, the concept of grain, and slowly changing dimensions are durable — they have outlived three generations of database hardware. The specific warehouse product (Snowflake, BigQuery, Redshift, Databricks SQL) and its tuning knobs are dated. We teach the durable ideas first, then map them onto today's tools.
What you'll learn
- The cloud data warehouse — what a warehouse is, why separating storage from compute changed everything, and the columnar-MPP internals (micro-partitions, clustering/sort keys, caching, materialized views) that drive cost and speed.
- Dimensional modeling — the Kimball method: facts vs dimensions, the all-important grain declaration, star vs snowflake schemas, conformed dimensions, surrogate keys, and degenerate dimensions.
- Fact tables & measures — the three fact-table types (transaction, periodic snapshot, accumulating snapshot) and additive / semi-additive / non-additive measures.
- Slowly changing dimensions — how to track history when a dimension changes: SCD Types 0–6, the effective-date / current-flag mechanics, late-arriving dimensions, and why Type 2 dominates analytics.
- Three traditions: Kimball, Inmon & Data Vault — dimensional (Kimball) vs corporate-information-factory 3NF (Inmon) vs Data Vault (hubs/links/satellites), and when each fits.
- Modern modeling in the cloud — medallion (bronze/silver/gold) layering, One Big Table vs star schema as a tradeoff not a dogma, the semantic / metrics layer, and idempotent loading (full refresh vs incremental, merge keys, surrogate-key generation).
- Checkpoint — a quiz to lock it in.
Let's start with the box all of this lives in: the cloud data warehouse.