Chapter 4 checkpoint

You can now design data for the questions it will answer and load it safely into a cloud warehouse. Recall the spine, then take the quiz.

The throughline

A data warehouse is an OLAP system built on columnar storage + MPP; the cloud's defining move is separating storage from compute (Snowflake virtual warehouses, BigQuery slots, Redshift clusters, Databricks SQL warehouses). Speed and cost reduce to one question: how much data did the query actually read? — which pruning over micro-partitions (helped by clustering/sort keys/partitioning, caching, and materialized views) controls.
Dimensional modeling separates facts (measurements you SUM) from dimensions (context you filter/group by). Declare the grain first — what one fact row means — or you double-count. The star schema (flat dimensions) beats the snowflake schema on columnar warehouses. Surrogate keys replace fragile source IDs and enable history; conformed dimensions give one shared definition; degenerate dimensions (order number) live in the fact.
Fact types: transaction (per event), periodic snapshot (per thing per period), accumulating snapshot (per process, updated in place). Additivity: additive (sum anywhere), semi-additive (balances — don't sum across time), non-additive (ratios — store components, compute last).
SCD Type 2 adds a new row per version with a surrogate key + effective/expiration dates + is_current flag, so facts stay bonded to the version true when they happened. Type 2 dominates because it preserves point-in-time history. Late-arriving dimensions need placeholder/Unknown handling and date-correct bonding.
Kimball (dimensional, serve-first) vs Inmon (3NF core, integrate-first) vs Data Vault (hubs/links/satellites — auditability & change-resilience). They layer: an integration core feeds Kimball marts, which is what users query.
Modern stack: medallion (bronze/silver/gold = staging/intermediate/marts); OBT vs star is a tradeoff, not dogma; a semantic layer defines metrics once; loads must be idempotent (full refresh, or incremental with a merge key + MERGE).

Quiz

Required checkpoint

Chapter 4 — Data Modeling & Warehousing

Pass to unlock the Next button below

You can now model data for the questions it will answer and land it safely in a warehouse. But modeled tables don't fill themselves, and the heavy transformations behind silver and gold often need more than SQL on one machine. The next chapter is how data gets processed at scale: batch processing and Spark.

Next: Chapter 5: Batch Processing & Spark →

The throughline​

Quiz​

The throughline

Quiz