Skip to main content

Dimensional modeling: facts, dimensions, and grain

You have a warehouse. Now: how should the tables be shaped? You could dump everything into one giant table, or copy your app's normalized database verbatim — and both will hurt. Dimensional modeling, the method Ralph Kimball popularized in the 1990s, is the most durable answer the industry has. It is built on one beautifully simple idea: separate the things you measure from the things you measure them by.

  • The things you measure — amounts, counts, durations — go in fact tables.
  • The things you measure them by — the customer, the product, the date, the store — go in dimension tables.

A sale is a measurement (it has a quantity and a revenue). The customer who bought, the product sold, the date it happened, and the store it happened in are all context. Dimensional modeling puts the measurement in the center and the context around it. That shape is the star schema, and it's the workhorse of analytics.

Throughout this chapter we model the warehouse for ShopFlow — see Meet ShopFlow — whose source tables (customers, products, orders, order_items) you've already met. We'll turn those into ShopFlow's canonical star: fact_sales plus dim_customer, dim_product, dim_date, and dim_store.

Facts: the measurements (the verbs)

A fact table holds the numeric measurements of a business process — one row per event or measurement. A row in ShopFlow's fact_sales table might be: "On 2026-06-23, customer 88 bought product 412 at store 7, quantity 2, unit_price $29.99, line_revenue $59.98." The numeric columns (quantity, unit_price, line_revenue) are the measures — the things you'll SUM, AVG, and COUNT. The other columns are foreign keys (customer_key, product_key, date_key) pointing out to dimension tables (which customer, which product, which date, which store), plus the order_id that ties line items back to their order.

Fact tables are usually long and skinny: billions of rows, but only a handful of measures and keys. They grow forever as the business operates.

Dimensions: the context (the nouns)

A dimension table holds the descriptive context you filter and group by — one row per thing. ShopFlow's dim_customer has one row per customer version with name, city, region, signup date; dim_product has one row per product with name, category, current price. Dimensions are usually short and wide: relatively few rows, but many descriptive columns.

Dimensions are what make analytics human. The fact table says "customer 88, product 412." The dimensions translate that into "Maria in Seattle bought the Large Blue Mug from the Kitchenware category." Every "slice and dice" a business user wants — by region, by category, by month — is a GROUP BY on a dimension column.

Grain: declare it FIRST, or everything breaks

Here is the step almost every tutorial skips and almost every real project regrets skipping. Before you add a single column, you must declare the grain of a fact table: exactly what one row represents.

The grain of ShopFlow's fact_sales is: one row per order_items line — one product line within one order.

That sentence is a contract. It decides everything downstream:

  • It tells you which dimensions are valid. If the grain is "one row per order line item," then customer, product, date, and store all apply to a single line — good. "Cashier on shift" might not, if a transaction can change hands.
  • It tells you which measures are valid. At line-item grain, quantity, unit_price, and line_revenue (= quantity × unit_price) are correct. But the order's amount (the whole-order total from ShopFlow's orders table) is not — it belongs to the whole order, and if you store it on every line item, you'll double-count it the instant someone sums it.
  • It prevents the #1 modeling bug. The most common way fact tables produce wrong numbers is mixing grains: putting a transaction-level measure and an order-level measure in the same row, then summing. The total is silently inflated, the dashboard is wrong, and trust evaporates.

:::warning Skipping grain declaration is the #1 modeling mistake If you cannot state, in one plain sentence, what a single row of your fact table means, stop — you are not ready to add columns. A vague grain ("sales stuff") leads directly to double-counted revenue, impossible joins, and measures that can't be safely aggregated. Declaring the grain first, in words, before designing the table, is the discipline that separates fact tables that work from fact tables that lie. Every measure and dimension you add must be true at that grain. :::

The rule of thumb: prefer the finest (most atomic) grain you can. One row per line item is more flexible than one row per order, because you can always sum line items up to an order, but you can never split an order down into line items you didn't store. Atomic facts answer questions you haven't thought of yet.

The star schema

Put it together and you get a star schema: one fact table in the center, dimension tables radiating out, joined by keys. Drawn out, it looks like a star.

dim_customer\n(name,city, region — SCD2)fact_sales\nGRAIN:one order lineitem\n(quantity,dim_product\n(name,category, price)dim_date\n(day,month, quarter,year)dim_store\n(channel:web/mobile/marketplace)

A typical query reads like the business question itself — "revenue by category by month":

SELECT p.category,
d.month,
SUM(f.line_revenue) AS revenue
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY p.category, d.month;

Notice the shape: aggregate the fact's measure, group by dimension attributes, join on keys. Nearly every analytic query over a star schema looks like this. That predictability is why the star schema endures — it's easy for humans to write and easy for the engine to optimize.

Snowflake schema: when dimensions get normalized

A snowflake schema is a star schema where dimensions are normalized — split into multiple related tables instead of one flat table. Instead of dim_product holding category and category_manager directly, you'd have dim_product → dim_category → dim_department, each in its own table.

fact_salesdim_productdim_categorydim_department

The tradeoff:

  • Star (denormalized dimensions) — fewer joins, simpler queries, faster on columnar warehouses; some redundancy (the category name repeats on every product row). This is the default for analytics.
  • Snowflake (normalized dimensions) — less redundancy, but more joins and more complexity for every query.

On a modern columnar warehouse, storage redundancy is cheap and joins cost more than duplicated strings, so flat (star) dimensions are usually preferred. Snowflaking earns its keep only when a sub-dimension is large, shared across many dimensions, or genuinely changes independently. (We'll meet the Kimball-vs-Inmon normalization debate in full in lesson 4.5.)

Naming note: don't confuse the snowflake schema (a modeling pattern) with Snowflake the warehouse product from the last lesson. Same word, unrelated meaning.

Surrogate keys: don't trust the source's IDs

A surrogate key is a meaningless, warehouse-generated integer (1, 2, 3…) used as the primary key of a dimension, instead of the natural key (the business identifier the source system uses, like a customer email or a product SKU). The fact table's foreign keys point to surrogate keys.

Why bother, when the source already has IDs?

  • Sources change and collide. A customer's email changes; two source systems reuse the same SKU; an acquired company's IDs overlap yours. A warehouse-owned surrogate key is stable and unique regardless.
  • It enables history (SCD Type 2). This is the big one. When you need to keep multiple versions of the same customer over time (their old city and their new city), you cannot use the natural key as the primary key — it's no longer unique across versions. The surrogate key gives each version its own identity. (Full mechanics in lesson 4.4.)
  • It decouples and speeds joins. A compact integer key joins faster than a long natural string and shields facts from messy source identifiers.

The natural key still lives in the dimension as an ordinary column (so you can look a customer up by email), but the key the fact joins on is the surrogate.

Two more dimension concepts you'll meet constantly

Conformed dimensions. A conformed dimension is a single dimension table shared, identically, across multiple fact tables. If dim_date and dim_customer are used by both fact_sales and fact_support_tickets, then "customer" means the same thing in both — so you can compare sales and support for the same customer and trust the numbers line up. Conformance is what lets an organization have one definition of "customer" or "product" instead of five conflicting ones. It is the backbone of an enterprise-consistent warehouse, and the reason teams invest in shared dimensions rather than letting each project build its own.

Degenerate dimensions. Sometimes a fact has an identifier that's useful for grouping but has no descriptive attributes of its own — classically an order number or invoice number. In ShopFlow this is order_id: it would be silly to build a dim_order table that contains nothing but the order id. So you leave order_id in fact_sales itself as a column with no dimension table behind it. That's a degenerate dimension: a dimension key that lives in the fact because it has nowhere else meaningful to go. It still does real work — GROUP BY order_id lets you count line items per order, or sum a single order's line revenue back up to its order total.

When NOT to fully dimensional-model

The star schema is the durable default for analytics — but "always build a full star" is its own anti-pattern. Dimensional modeling is an investment (declared grains, surrogate keys, conformed dimensions, SCD machinery), and it only pays off when the data is shared, evolving, and queried many ways. Reach for less when it isn't:

  • A raw / staging model is enough. Early in a pipeline — or for one-off exploration — cleaned, typed staging tables answer the question without facts and dimensions at all. Don't model what nobody queries yet.
  • A wide one-big-table (OBT) is enough. For a single, stable, heavily-read consumption surface (one dashboard, one notebook), pre-joining everything into one wide table is often simpler and faster than maintaining a star — no join logic for consumers, no fan-out traps. The full tradeoff is in Modern modeling in the cloud.
  • Columnar warehouses blunt some star benefits. A star schema was partly a performance design from the row-store era. On a modern columnar/MPP warehouse, wide tables scan cheaply and joins are the expensive part, so the old "normalize into a star to save space and speed scans" argument is weaker. The star's clarity and reusability still earn their keep — its raw-speed edge no longer always does.
  • Small or early projects. A handful of tables that a few people query don't justify conformed dimensions and SCD Type 2. Over-modeling a young project front-loads cost you may never recoup; you can always evolve into a star once questions and reuse emerge.

:::tip The honest decision rule Reach for a full star schema when many fact processes share dimensions, attributes change over time (history matters), and the data is sliced many ways by many people — that's exactly what facts, conformed dimensions, and SCDs are for. Skip or defer it when a staging/raw layer answers the question, when one wide OBT serves a single stable consumer, or when the project is too small to repay the modeling. Model for shared, evolving, multi-angle analytics — not for its own sake. :::

Why it matters

Dimensional modeling organizes analytics around one idea: separate facts (the numeric measurements of a process — what you SUM) from dimensions (the descriptive context you filter and group by). The non-negotiable first step is declaring the grain — exactly what one fact row represents — because every valid measure and dimension follows from it, and mixing grains is the #1 cause of double-counted, wrong numbers. The result is the star schema (fact in the center, flat dimensions around it), preferred over the more-normalized snowflake schema on columnar warehouses because joins cost more than redundancy. Surrogate keys (warehouse-generated integers) replace fragile source IDs and are what make history-tracking possible; conformed dimensions give the whole organization one consistent definition of a thing; and degenerate dimensions (like an order number) live in the fact when they have no attributes of their own. Next, we go deeper on the fact side: the different types of fact tables and which measures you're even allowed to add up.

Next: Fact tables & measures →