Modern modeling in the cloud

The patterns in this chapter are decades old, but the tools around them changed everything. Cheap cloud storage made redundancy nearly free; dbt (Chapter 7) made SQL transformations modular and tested; columnar warehouses made wide-table scans fast. So the modern question isn't "Kimball or Inmon?" so much as "how do I layer my transformations, and how wide should my final tables be?" This lesson maps the durable ideas onto today's practice — and closes the two gaps most guides leave open: the One Big Table debate and the semantic layer.

The medallion architecture: bronze → silver → gold

Modern warehouses rarely transform raw data into final models in one leap. Instead they layer it, refining quality at each stage. The popular name (from Databricks, but now near-universal) is the medallion architecture, with three tiers:

Bronze (raw). Source data landed as-is — same shape it arrived in, minimally touched, append-only. Bronze is your replayable source of truth: if a downstream transform has a bug, you fix the code and rebuild from bronze without re-pulling from the source. (This is exactly the ELT pattern from Chapter 1 — load raw first, transform in the warehouse.)
Silver (cleaned / conformed). Bronze, cleaned and integrated: deduplicated, typed, standardized, joined across sources, business keys reconciled. Silver is trustworthy but not yet shaped for a specific question — the integration layer.
Gold (modeled / serving). The business-ready layer users and BI tools query: star schemas, aggregates, and metrics built for consumption. This is where your facts, dimensions, and SCDs from earlier in the chapter actually live.

In the dbt world this same layering has another vocabulary you'll see in every project, and it maps cleanly:

Medallion	dbt layer	What it does
Bronze	staging	One model per source table; light renaming/typing, no business logic
Silver	intermediate	Join, dedupe, apply business rules; not yet final marts
Gold	marts	The fact/dimension/wide tables users query

The medallion architecture isn't a new idea — it's the Inmon insight ("integrate before you serve") expressed as warehouse layers, with bronze added as a replayable raw tier. The durable principle: separate raw from cleaned from served, so each can be rebuilt and reasoned about independently.

One Big Table (OBT) vs the star schema

Here's a debate people turn into dogma, and shouldn't. The star schema keeps facts and dimensions separate and joins them at query time. One Big Table (OBT) — also called a wide table or denormalized table — does the opposite: it pre-joins the fact and all its dimensions into a single very wide table, so the customer's city, the product's category, and the date's quarter are already columns on every sales row. No joins at query time.

Why OBT became viable — and popular — in the cloud:

Columnar storage makes width cheap. A 200-column wide table costs almost nothing for a query that reads 4 columns, because columnar engines only scan referenced columns (Chapter 2). The old penalty for wide tables largely evaporated.
Joins are the expensive part. On big data, the join (and its shuffle, Chapter 3) is often costlier than scanning a pre-joined wide table. Pre-computing the join once, on write, can beat re-doing it on every read.
Simplicity for consumers. A BI user or notebook hits one table — no join logic, no fan-out traps, no forgotten dimension.

But OBT is not free, which is why "always use OBT" is as wrong as "always use a star":

	Star schema	One Big Table (OBT)
Storage	Less (dimensions stored once)	More (dimension attributes repeated on every row)
Query joins	Required	None
Reusability	Dimensions shared across many facts	Logic re-done per wide table
SCD / history	Clean (Type 2 in the dimension)	Awkward — history must be baked in, easy to mis-grain
Maintenance	Change a dimension once	Rebuild every wide table that embedded it
Best for	Many facts sharing conformed dimensions; evolving attributes	A specific, stable, heavily-read consumption surface

:::tip OBT vs star is a context-dependent tradeoff, not a religion Use a star schema when many facts share conformed dimensions, when attributes change and you need clean SCD history, and when reusability matters — i.e. as your gold/serving core. Build a One Big Table on top of that core as a final consumption surface for a specific, stable, heavily-queried use case (a dashboard, a feature table for ML). They are layers, not rivals: a well-run warehouse often has a star-schema gold layer and purpose-built wide tables derived from it. Anyone who tells you one is universally correct is selling dogma. :::

The semantic / metrics layer: one definition of "revenue"

Now the gap that quietly causes more business pain than any schema choice. Ask three analysts for "active users" or "revenue" and you'll often get three different SQL queries — one excludes refunds, one includes tax, one counts trials as active. Every dashboard is subtly different, every number is argued about, and trust dies. The schema can be perfect and this still happens, because the definition of a metric lives in whatever SQL each person happened to write.

A semantic layer (a.k.a. metrics layer) is a single, governed place where metric and dimension definitions live once, decoupled from any individual query. You define ShopFlow's revenue once — "sum of fact_sales.line_revenue minus refunds, excluding tax" — and every tool (BI dashboards, notebooks, ad-hoc SQL) asks the semantic layer for revenue by region by month and gets the same, correct computation. It's the single source of truth for metrics, sitting between your modeled tables and your consumers.

What it buys you:

Consistency. One definition, used everywhere — the number is the same in the executive dashboard and the analyst's notebook.
Governed, additive-aware math. Recall the non-additive trap from lesson 4.3: a semantic layer is the right home for "store additive components, compute the ratio last." It computes SUM(revenue)/SUM(cost) correctly at any grouping, so margin % is never wrong-summed.
Decoupling. Change the definition once and every consumer updates; rename a column without breaking every dashboard.

You'll meet concrete implementations in Chapter 7 (dbt's metrics/semantic layer and standalone semantic layers like Cube). The durable idea: define metrics once, in one governed place, so the whole organization computes them identically.

Idempotent loading: full refresh, incremental, and merge

Finally, the mechanics of getting data into these tables repeatedly without corrupting them. From Chapter 1, idempotency means re-running an operation produces the same result — so a retry, a backfill, or a double-run is safe. This is the single most important property of a production load. There are two base strategies:

Full refresh. Drop and rebuild the whole table from source every run. Dead simple and inherently idempotent (the result depends only on the source, not on how many times you ran it). Perfect for small tables; ruinous for a billion-row fact table you'd rebuild nightly.
Incremental. Process only new or changed rows since the last run and add them to the existing table. Far cheaper at scale, but you must make it idempotent yourself — or a re-run will double-insert the same rows.

The tool that makes incremental loads idempotent is the MERGE statement (also called upsert — update-or-insert), keyed on a merge key (a unique business/surrogate key that identifies a row):

MERGE INTO fact_sales AS target
USING stg_order_items AS source
ON  target.order_id   = source.order_id            -- the merge key:
AND target.product_key = source.product_key        -- ShopFlow's line-item grain
WHEN MATCHED THEN UPDATE SET ...           -- row already exists → update
WHEN NOT MATCHED THEN INSERT ...;          -- new row → insert

Because MERGE keys on ShopFlow's line-item grain (order_id + product_key), running it twice with the same data produces the same table — the second run matches existing rows and updates them in place instead of inserting duplicates. That's what makes the load idempotent, and it's exactly how dbt's incremental models and accumulating-snapshot facts (lesson 4.3) update rows safely.

Two threads from earlier tie in here:

Surrogate-key generation. As rows load, dimensions need their warehouse-generated surrogate keys (lesson 4.2) assigned deterministically. A common modern technique is a hash of the natural key (and, for SCD Type 2, the effective date), so the same input always yields the same key — which keeps key generation itself idempotent across re-runs and parallel loads.
Late-arriving dimensions (lesson 4.4) are fundamentally a loading-pattern problem: your MERGE and join logic must handle facts that reference not-yet-loaded members, via placeholder rows or an "Unknown" member, rather than dropping or mis-bonding them.

:::warning Incremental without a merge key is a duplication bug waiting to happen An incremental load that just INSERTs "new" rows will silently double-count the moment it re-runs after a partial failure, a backfill, or an overlapping window. Always make incremental loads idempotent — key them on a stable merge key and MERGE/upsert — so a re-run converges to the same table instead of inflating it. "It worked once" is not the bar; "it produces the same result every time you run it" is. :::

Why it matters

Modern modeling layers the durable ideas onto the cloud stack. The medallion architecture — bronze (raw, replayable), silver (cleaned, conformed), gold (modeled, served) — maps directly onto dbt's staging → intermediate → marts, expressing "separate raw from cleaned from served so each is rebuildable." One Big Table vs star is a tradeoff, not a dogma: stars excel as the conformed, history-aware serving core; wide pre-joined tables excel as stable, heavily-read consumption surfaces built on top — and columnar storage made wide tables cheap. A semantic / metrics layer defines each metric once, in one governed place, so the whole organization computes "revenue" identically (and computes ratios from additive components, never wrong-summing them). And robust loading is idempotent: full refresh is trivially safe; incremental loads must be made safe with a merge key and MERGE/upsert, deterministic (often hash-based) surrogate keys, and a deliberate late-arriving-dimension policy — so any re-run converges to the same correct table.

That completes the modeling-and-warehousing toolkit. Lock it in with the checkpoint.

Next: Chapter 4 checkpoint →

The medallion architecture: bronze → silver → gold​

One Big Table (OBT) vs the star schema​

The semantic / metrics layer: one definition of "revenue"​

Idempotent loading: full refresh, incremental, and merge​

Why it matters​

The medallion architecture: bronze → silver → gold

One Big Table (OBT) vs the star schema

The semantic / metrics layer: one definition of "revenue"

Idempotent loading: full refresh, incremental, and merge

Why it matters