Slowly changing dimensions (SCD)
Here's a question that sounds trivial and isn't: a ShopFlow customer moves from Seattle to Austin. What should the warehouse do? (This is the canonical SCD example from Meet ShopFlow: customers.city can change, and dim_customer tracks it as SCD Type 2.)
If you just overwrite their city in dim_customer, then every past sale they made in Seattle now reports as an Austin sale. Your "revenue by region" history silently rewrites itself — last year's Seattle numbers shrink, last year's Austin numbers grow, and nobody changed a single transaction. If instead you never update the city, the warehouse is wrong about where they live now.
This is the slowly changing dimension problem: dimension attributes (a customer's city, a product's category, a rep's territory) change occasionally over time — slowly, compared to the firehose of facts — and you must decide how the warehouse remembers history. Ralph Kimball numbered the strategies, and the numbers (especially Type 2) are spoken fluently in every data team. This lesson teaches each type, and then the actual mechanics of Type 2 — the part most tutorials wave at and never show.
Type 0 — never change
Type 0: retain the original, ignore the change. The attribute is fixed forever — it's part of the record's identity. A customer's original signup date, a product's original launch date, a person's date of birth. There's no history to track because the value, by definition, never changes. Type 0 is a deliberate "this is immutable," not laziness.
Type 1 — overwrite (no history)
Type 1: overwrite the old value with the new one. The dimension always reflects the current truth, and the past is forgotten. After the move, the customer's row simply says city = 'Austin', and it's as if they were always in Austin.
Before: customer_key=501 email=maria@… city=Seattle
After: customer_key=501 email=maria@… city=Austin ← overwritten, Seattle gone
Type 1 is correct when history doesn't matter or the old value was simply wrong — fixing a misspelled name, correcting a typo'd email. It's simple and cheap. Its cost is exactly the problem above: every historical fact is now attributed to the new value. Use Type 1 only when you genuinely don't care what the value was.
Type 2 — add a new row (full history)
Type 2: keep history by adding a new row for the changed version, and marking the old row as expired. This is the one that matters most, so we'll do it carefully.
The key realization from lesson 4.2: you can't use the natural key (email) as the primary key anymore, because now there are two rows for the same customer. This is the reason surrogate keys exist. Each version gets its own surrogate key.
The dimension carries three bookkeeping columns:
effective_date(orvalid_from) — when this version of the row became true.expiration_date(orvalid_to) — when it stopped being true (often a far-future date like9999-12-31for the current version).is_current(a boolean current flag) —truefor exactly one row per natural key,falsefor the expired versions.
Trace Maria moving on 2026-03-15:
customer_key email city effective_date expiration_date is_current
501 maria@… Seattle 2024-01-10 2026-03-14 false
502 maria@… Austin 2026-03-15 9999-12-31 true
Walk through what just happened and why each piece is needed:
- The old Seattle row (
501) is not deleted. Itsexpiration_dateis set to the day before the change, andis_currentflips tofalse. It still exists, frozen, describing the period Maria lived in Seattle. - A new row (
502) is inserted for Austin, with a new surrogate key,effective_date = 2026-03-15,expiration_date = 9999-12-31,is_current = true. - The natural key (email) is identical on both rows — that's how you know it's the same person. The surrogate keys differ — that's how the warehouse keeps the versions distinct.
Now the payoff — facts join to the right version automatically. A sale Maria made in 2025 was loaded with customer_key = 501 (the surrogate that was current then). A sale in May 2026 was loaded with customer_key = 502. So:
-- Revenue by city, historically accurate
SELECT c.city, SUM(f.line_revenue) AS revenue
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY c.city;
The 2025 sale rolls up under Seattle (via key 501) and the May-2026 sale under Austin (via key 502) — exactly as it should be. History doesn't rewrite itself, because each fact is permanently bonded to the version of the customer that was true when the fact occurred. That is the entire point of Type 2, and it's why Type 2 dominates analytics: it's the only common type that preserves a faithful, point-in-time view of the past.
:::tip Two ways to query a Type 2 dimension
To ask "what is true now?" filter WHERE is_current = true (the current flag is a fast, convenient shortcut). To ask "what was true at the time of each fact?" simply join on the surrogate key the fact already carries — the fact points at the correct historical version with no date logic needed. The effective_date/expiration_date range is there for the rarer case of "what was true as of an arbitrary date?" — WHERE '2025-08-01' BETWEEN effective_date AND expiration_date.
:::
Type 3 — add a column (limited history)
Type 3: keep one prior value in a second column. The customer row gets current_city and previous_city. After the move: current_city = 'Austin', previous_city = 'Seattle'. You can see the immediately prior value but not a full timeline, and a second move overwrites Seattle. Type 3 is niche — useful for "compare current to previous" when you only ever care about one step back (e.g. a sales-territory realignment where you want to report under both the old and new alignment for a transition period).
Types 4 and 6 — briefly
- Type 4: a separate history table. Keep the main dimension small and current (like Type 1), and push all historical versions into a companion history table. Useful when a dimension is huge and most queries only want the current version, but you still need the full audit trail somewhere.
- Type 6: combine 1 + 2 + 3 (1×2×3 = 6). A Type 2 row structure (full history) plus extra columns that always show the current value (a Type 1/3 flavor on top), so you can report a fact under both the historically-true attribute and the customer's current attribute from the same join. Powerful, but more complex to build and reason about.
| Type | Strategy | History kept | Use when |
|---|---|---|---|
| 0 | Never change | N/A (immutable) | Original/birth values |
| 1 | Overwrite | None | Old value was wrong, or history doesn't matter |
| 2 | New row + flags | Full | You need accurate point-in-time history (the default) |
| 3 | Extra column | One prior value | You only ever care about "current vs previous" |
| 4 | History table | Full (separate) | Big dim, mostly-current queries + audit trail |
| 6 | 1+2+3 hybrid | Full + current view | Report under both historical and current attribute |
Late-arriving dimensions (the gotcha nobody warns you about)
In a perfect pipeline, the dimension row exists before the fact that references it. Reality disagrees. A late-arriving dimension (a.k.a. an early-arriving fact) is when a fact shows up referencing a dimension member you haven't loaded yet — an order arrives for a customer whose record hasn't synced from the source system. The fact has nowhere valid to point its foreign key.
You have a few standard moves, and picking one on purpose beats letting the load crash or silently drop rows:
- Insert a placeholder (inferred) dimension row. Create a stub dimension row for the unknown member — surrogate key assigned, natural key recorded, descriptive attributes set to "Unknown" or null, flagged as inferred. The fact points at it immediately. When the real dimension data arrives later, you update the stub in place (Type 1) with the true attributes. The fact's key never has to change.
- Route to an "Unknown" member. Point the fact at a single reserved "Unknown" dimension row (a standard practice is reserving surrogate key
-1for "Unknown"/"Not applicable"). Simpler, but you lose the ability to backfill that specific member's real attributes.
There's a second flavor: a late-arriving fact for a Type 2 dimension — a fact whose event date is in the past, when the dimension had an older version. The correct behavior is to bond the fact to the version that was current on the fact's event date (using the effective_date/expiration_date range), not to today's current version. Getting this wrong re-attributes old facts to new attribute values — the exact history corruption Type 2 was built to prevent.
:::warning Don't let late arrivals silently drop or mis-bond If your load joins facts to dimensions with an inner join and a member is missing, those fact rows vanish — revenue quietly disappears and no error fires. And if a late-arriving fact bonds to the current Type 2 version instead of the version that was valid on its event date, history is corrupted. Decide your late-arrival policy (placeholder vs Unknown member, and date-correct Type 2 bonding) as part of the load design, not after someone notices the totals are off. :::
Why it matters
A slowly changing dimension changes occasionally, and you must choose how the warehouse remembers: Type 0 never changes (immutable values); Type 1 overwrites and keeps no history (use only when the old value was wrong or doesn't matter); Type 2 adds a new row per version — with a fresh surrogate key, effective/expiration dates, and an is_current flag — so each fact stays bonded to the version that was true when it happened and history never rewrites itself. That faithfulness is why Type 2 dominates analytics. Types 3, 4, and 6 cover narrower needs (one prior value, a separate history table, a hybrid that shows both historical and current). And late-arriving dimensions — facts that reference not-yet-loaded members, or old facts hitting a Type 2 dimension — must be handled deliberately with placeholder rows or an "Unknown" member and date-correct bonding, or revenue silently drops and history quietly corrupts. Next, we zoom out from how to build a dimension to whole-warehouse philosophies — Kimball vs Inmon vs Data Vault.