Meet ShopFlow — our running example
Most guides teach each topic with a fresh, throwaway example — taxis here, sensors there, weather data on the next page. This one doesn't. From here to the final chapter we follow one company and one dataset as it travels the entire data lifecycle: born in an app, landed in storage, queried, modeled, reshaped in batch and in real time, governed, and paid for.
That company is ShopFlow, a fictional online store. This page defines its data once. Every later chapter reaches for these exact table and column names, so when Chapter 7 builds a stg_orders model on top of the raw.orders table that Chapter 6 ingested from the orders source table you meet below, you're watching the same row move — not learning a new example.
Think of this page as the cast list for a story told across twelve chapters. ShopFlow sells things online; its customers place orders; each order contains line items for products. That's the whole world. Keep this page open in a tab — whenever a later chapter shows orders or fact_sales, it means these tables.
The business in one paragraph
ShopFlow runs a typical online store. Customers sign up, browse products, and place orders; each order is paid, shipped, and occasionally cancelled. The app that runs the store records every one of these facts in its own operational database — and the business wants to learn from all of it: revenue by day, top products, repeat-customer rate, carts abandoned. Turning that scattered operational data into trustworthy answers is the data engineering you'll build, chapter by chapter, on ShopFlow's data.
1. The source systems — where ShopFlow's data is born
ShopFlow's web app writes to an OLTP database (an operational, row-oriented database tuned for the app's own reads and writes — see OLTP vs OLAP). These are the source tables. Data engineers read from them; they never own them.
customers — one row per registered customer.
| column | type | notes |
|---|---|---|
customer_id | bigint | primary key |
name | text | |
email | text | PII |
city | text | can change over time (drives the SCD example in Ch. 4) |
region | text | e.g. US-West |
signup_date | date | |
updated_at | timestamp | last change — drives incremental extraction (Ch. 6) |
products — one row per sellable product.
| column | type | notes |
|---|---|---|
product_id | bigint | primary key |
name | text | |
category | text | e.g. Books, Electronics |
unit_price | decimal(10,2) | current list price |
orders — one row per placed order (the order header).
| column | type | notes |
|---|---|---|
order_id | bigint | primary key |
customer_id | bigint | → customers.customer_id |
order_ts | timestamp | when the order was placed |
status | text | placed · paid · shipped · cancelled |
amount | decimal(12,2) | order total |
order_items — one row per product line within an order (the order detail). This is the finest grain in the system.
| column | type | notes |
|---|---|---|
order_id | bigint | → orders.order_id |
product_id | bigint | → products.product_id |
quantity | int | |
unit_price | decimal(10,2) | price captured at sale time (may differ from current products.unit_price) |
Primary key is the pair (order_id, product_id).
The event stream
Separately, ShopFlow's app emits a real-time event stream (Chapter 9) — every meaningful action as it happens, onto a topic called order_events:
| field | type | notes |
|---|---|---|
event_id | uuid | unique per event |
order_id | bigint | → orders.order_id |
event_type | text | placed · paid · shipped · cancelled |
ts | timestamp | event time (the basis for watermarks in Ch. 9) |
2. The destination — the ShopFlow warehouse (star schema)
After ingestion and modeling, ShopFlow's data lands in an analytics warehouse as a star schema (taught in Chapter 4): one central fact table of measurable events, surrounded by dimension tables of descriptive context.
fact_sales— grain: one order line item. Measures:quantity,unit_price,line_revenue(= quantity × unit_price). Foreign keys:customer_key,product_key,date_key,order_id(degenerate dimension).dim_customer— one row per customer version. Tracked as SCD Type 2 (Ch. 4) so a customer moving fromSeattletoAustinkeeps history. Carrieslifetime_valuedownstream.dim_product— product name, category, current price.dim_date— one row per calendar day (the canonical date dimension).dim_store— channel/store the order came through (web, mobile, marketplace).
3. The journey — where each chapter acts on ShopFlow's data
Here is the whole guide, as one pipeline over ShopFlow's data. Each chapter does one thing to the same dataset:
4. The canonical names — the contract every chapter uses
So the dataset stays recognizable end to end, these names are fixed. When you see them later, they always mean what's defined here.
| Stage | Objects |
|---|---|
| Source (OLTP) | customers, products, orders, order_items |
| Stream | order_events (topic) |
| Raw landing (Ch. 6) | raw.customers, raw.products, raw.orders, raw.order_items |
| Staging (Ch. 7, dbt) | stg_customers, stg_products, stg_orders, stg_order_items |
| Marts / warehouse (Ch. 4 + 7) | fact_sales, dim_customer, dim_product, dim_date, dim_store |
| Streaming output (Ch. 9) | fact_revenue_1m (per-minute revenue) |
| Pipeline (Ch. 8) | DAG shopflow_daily: ingest_orders → dbt_run → quality_check |
:::note Durable vs dated ShopFlow's shape — sources → raw → staging → star schema, with a parallel event stream — is durable; it's how analytics data is built almost everywhere. The specific engines we run it on (a particular warehouse, Spark version, or dbt syntax) are dated examples, flagged as such where they appear. :::
→ Next: OLTP vs OLAP — why ShopFlow's app database and its analytics warehouse are built on opposite trade-offs, and why that gap is the reason your job exists.