Modern modeling in the cloud
The patterns in this chapter are decades old, but the tools around them changed everything. Cheap cloud storage made redundancy nearly free; dbt (Chapter 7) made SQL transformations modular and tested; columnar warehouses made wide-table scans fast. So the modern question isn't "Kimball or Inmon?" so much as "how do I layer my transformations, and how wide should my final tables be?" This lesson maps the durable ideas onto today's practice — and closes the two gaps most guides leave open: the One Big Table debate and the semantic layer.
The medallion architecture: bronze → silver → gold
Modern warehouses rarely transform raw data into final models in one leap. Instead they layer it, refining quality at each stage. The popular name (from Databricks, but now near-universal) is the medallion architecture, with three tiers:
- Bronze (raw). Source data landed as-is — same shape it arrived in, minimally touched, append-only. Bronze is your replayable source of truth: if a downstream transform has a bug, you fix the code and rebuild from bronze without re-pulling from the source. (This is exactly the ELT pattern from Chapter 1 — load raw first, transform in the warehouse.)
- Silver (cleaned / conformed). Bronze, cleaned and integrated: deduplicated, typed, standardized, joined across sources, business keys reconciled. Silver is trustworthy but not yet shaped for a specific question — the integration layer.
- Gold (modeled / serving). The business-ready layer users and BI tools query: star schemas, aggregates, and metrics built for consumption. This is where your facts, dimensions, and SCDs from earlier in the chapter actually live.
In the dbt world this same layering has another vocabulary you'll see in every project, and it maps cleanly:
| Medallion | dbt layer | What it does |
|---|---|---|
| Bronze | staging | One model per source table; light renaming/typing, no business logic |
| Silver | intermediate | Join, dedupe, apply business rules; not yet final marts |
| Gold | marts | The fact/dimension/wide tables users query |
The medallion architecture isn't a new idea — it's the Inmon insight ("integrate before you serve") expressed as warehouse layers, with bronze added as a replayable raw tier. The durable principle: separate raw from cleaned from served, so each can be rebuilt and reasoned about independently.
One Big Table (OBT) vs the star schema
Here's a debate people turn into dogma, and shouldn't. The star schema keeps facts and dimensions separate and joins them at query time. One Big Table (OBT) — also called a wide table or denormalized table — does the opposite: it pre-joins the fact and all its dimensions into a single very wide table, so the customer's city, the product's category, and the date's quarter are already columns on every sales row. No joins at query time.
Why OBT became viable — and popular — in the cloud:
- Columnar storage makes width cheap. A 200-column wide table costs almost nothing for a query that reads 4 columns, because columnar engines only scan referenced columns (Chapter 2). The old penalty for wide tables largely evaporated.
- Joins are the expensive part. On big data, the join (and its shuffle, Chapter 3) is often costlier than scanning a pre-joined wide table. Pre-computing the join once, on write, can beat re-doing it on every read.
- Simplicity for consumers. A BI user or notebook hits one table — no join logic, no fan-out traps, no forgotten dimension.
But OBT is not free, which is why "always use OBT" is as wrong as "always use a star":
| Star schema | One Big Table (OBT) | |
|---|---|---|
| Storage | Less (dimensions stored once) | More (dimension attributes repeated on every row) |
| Query joins | Required | None |
| Reusability | Dimensions shared across many facts | Logic re-done per wide table |
| SCD / history | Clean (Type 2 in the dimension) | Awkward — history must be baked in, easy to mis-grain |
| Maintenance | Change a dimension once | Rebuild every wide table that embedded it |
| Best for | Many facts sharing conformed dimensions; evolving attributes | A specific, stable, heavily-read consumption surface |
:::tip OBT vs star is a context-dependent tradeoff, not a religion Use a star schema when many facts share conformed dimensions, when attributes change and you need clean SCD history, and when reusability matters — i.e. as your gold/serving core. Build a One Big Table on top of that core as a final consumption surface for a specific, stable, heavily-queried use case (a dashboard, a feature table for ML). They are layers, not rivals: a well-run warehouse often has a star-schema gold layer and purpose-built wide tables derived from it. Anyone who tells you one is universally correct is selling dogma. :::
The semantic / metrics layer: one definition of "revenue"
Now the gap that quietly causes more business pain than any schema choice. Ask three analysts for "active users" or "revenue" and you'll often get three different SQL queries — one excludes refunds, one includes tax, one counts trials as active. Every dashboard is subtly different, every number is argued about, and trust dies. The schema can be perfect and this still happens, because the definition of a metric lives in whatever SQL each person happened to write.
A semantic layer (a.k.a. metrics layer) is a single, governed place where metric and dimension definitions live once, decoupled from any individual query. You define ShopFlow's revenue once — "sum of fact_sales.line_revenue minus refunds, excluding tax" — and every tool (BI dashboards, notebooks, ad-hoc SQL) asks the semantic layer for revenue by region by month and gets the same, correct computation. It's the single source of truth for metrics, sitting between your modeled tables and your consumers.
What it buys you:
- Consistency. One definition, used everywhere — the number is the same in the executive dashboard and the analyst's notebook.
- Governed, additive-aware math. Recall the non-additive trap from lesson 4.3: a semantic layer is the right home for "store additive components, compute the ratio last." It computes
SUM(revenue)/SUM(cost)correctly at any grouping, so margin % is never wrong-summed. - Decoupling. Change the definition once and every consumer updates; rename a column without breaking every dashboard.
You'll meet concrete implementations in Chapter 7 (dbt's metrics/semantic layer and standalone semantic layers like Cube). The durable idea: define metrics once, in one governed place, so the whole organization computes them identically.
Idempotent loading: full refresh, incremental, and merge
Finally, the mechanics of getting data into these tables repeatedly without corrupting them. From Chapter 1, idempotency means re-running an operation produces the same result — so a retry, a backfill, or a double-run is safe. This is the single most important property of a production load. There are two base strategies:
- Full refresh. Drop and rebuild the whole table from source every run. Dead simple and inherently idempotent (the result depends only on the source, not on how many times you ran it). Perfect for small tables; ruinous for a billion-row fact table you'd rebuild nightly.
- Incremental. Process only new or changed rows since the last run and add them to the existing table. Far cheaper at scale, but you must make it idempotent yourself — or a re-run will double-insert the same rows.
The tool that makes incremental loads idempotent is the MERGE statement (also called upsert — update-or-insert), keyed on a merge key (a unique business/surrogate key that identifies a row):
MERGE INTO fact_sales AS target
USING stg_order_items AS source
ON target.order_id = source.order_id -- the merge key:
AND target.product_key = source.product_key -- ShopFlow's line-item grain
WHEN MATCHED THEN UPDATE SET ... -- row already exists → update
WHEN NOT MATCHED THEN INSERT ...; -- new row → insert
Because MERGE keys on ShopFlow's line-item grain (order_id + product_key), running it twice with the same data produces the same table — the second run matches existing rows and updates them in place instead of inserting duplicates. That's what makes the load idempotent, and it's exactly how dbt's incremental models and accumulating-snapshot facts (lesson 4.3) update rows safely.
Two threads from earlier tie in here:
- Surrogate-key generation. As rows load, dimensions need their warehouse-generated surrogate keys (lesson 4.2) assigned deterministically. A common modern technique is a hash of the natural key (and, for SCD Type 2, the effective date), so the same input always yields the same key — which keeps key generation itself idempotent across re-runs and parallel loads.
- Late-arriving dimensions (lesson 4.4) are fundamentally a loading-pattern problem: your
MERGEand join logic must handle facts that reference not-yet-loaded members, via placeholder rows or an "Unknown" member, rather than dropping or mis-bonding them.
:::warning Incremental without a merge key is a duplication bug waiting to happen
An incremental load that just INSERTs "new" rows will silently double-count the moment it re-runs after a partial failure, a backfill, or an overlapping window. Always make incremental loads idempotent — key them on a stable merge key and MERGE/upsert — so a re-run converges to the same table instead of inflating it. "It worked once" is not the bar; "it produces the same result every time you run it" is.
:::
Why it matters
Modern modeling layers the durable ideas onto the cloud stack. The medallion architecture — bronze (raw, replayable), silver (cleaned, conformed), gold (modeled, served) — maps directly onto dbt's staging → intermediate → marts, expressing "separate raw from cleaned from served so each is rebuildable." One Big Table vs star is a tradeoff, not a dogma: stars excel as the conformed, history-aware serving core; wide pre-joined tables excel as stable, heavily-read consumption surfaces built on top — and columnar storage made wide tables cheap. A semantic / metrics layer defines each metric once, in one governed place, so the whole organization computes "revenue" identically (and computes ratios from additive components, never wrong-summing them). And robust loading is idempotent: full refresh is trivially safe; incremental loads must be made safe with a merge key and MERGE/upsert, deterministic (often hash-based) surrogate keys, and a deliberate late-arriving-dimension policy — so any re-run converges to the same correct table.
That completes the modeling-and-warehousing toolkit. Lock it in with the checkpoint.
Next: Chapter 4 checkpoint →