Meet ShopFlow — our running example

Most guides teach each topic with a fresh, throwaway example — taxis here, sensors there, weather data on the next page. This one doesn't. From here to the final chapter we follow one company and one dataset as it travels the entire data lifecycle: born in an app, landed in storage, queried, modeled, reshaped in batch and in real time, governed, and paid for.

That company is ShopFlow, a fictional online store. This page defines its data once. Every later chapter reaches for these exact table and column names, so when Chapter 7 builds a stg_orders model on top of the raw.orders table that Chapter 6 ingested from the orders source table you meet below, you're watching the same row move — not learning a new example.

In plain English

Think of this page as the cast list for a story told across twelve chapters. ShopFlow sells things online; its customers place orders; each order contains line items for products. That's the whole world. Keep this page open in a tab — whenever a later chapter shows orders or fact_sales, it means these tables.

The business in one paragraph

ShopFlow runs a typical online store. Customers sign up, browse products, and place orders; each order is paid, shipped, and occasionally cancelled. The app that runs the store records every one of these facts in its own operational database — and the business wants to learn from all of it: revenue by day, top products, repeat-customer rate, carts abandoned. Turning that scattered operational data into trustworthy answers is the data engineering you'll build, chapter by chapter, on ShopFlow's data.

1. The source systems — where ShopFlow's data is born

ShopFlow's web app writes to an OLTP database (an operational, row-oriented database tuned for the app's own reads and writes — see OLTP vs OLAP). These are the source tables. Data engineers read from them; they never own them.

customers — one row per registered customer.

column	type	notes
`customer_id`	bigint	primary key
`name`	text
`email`	text	PII
`city`	text	can change over time (drives the SCD example in Ch. 4)
`region`	text	e.g. `US-West`
`signup_date`	date
`updated_at`	timestamp	last change — drives incremental extraction (Ch. 6)

products — one row per sellable product.

column	type	notes
`product_id`	bigint	primary key
`name`	text
`category`	text	e.g. `Books`, `Electronics`
`unit_price`	decimal(10,2)	current list price

orders — one row per placed order (the order header).

column	type	notes
`order_id`	bigint	primary key
`customer_id`	bigint	→ `customers.customer_id`
`order_ts`	timestamp	when the order was placed
`status`	text	`placed` · `paid` · `shipped` · `cancelled`
`amount`	decimal(12,2)	order total

order_items — one row per product line within an order (the order detail). This is the finest grain in the system.

column	type	notes
`order_id`	bigint	→ `orders.order_id`
`product_id`	bigint	→ `products.product_id`
`quantity`	int
`unit_price`	decimal(10,2)	price captured at sale time (may differ from current `products.unit_price`)

Primary key is the pair (order_id, product_id).

The event stream

Separately, ShopFlow's app emits a real-time event stream (Chapter 9) — every meaningful action as it happens, onto a topic called order_events:

field	type	notes
`event_id`	uuid	unique per event
`order_id`	bigint	→ `orders.order_id`
`event_type`	text	`placed` · `paid` · `shipped` · `cancelled`
`ts`	timestamp	event time (the basis for watermarks in Ch. 9)

2. The destination — the ShopFlow warehouse (star schema)

After ingestion and modeling, ShopFlow's data lands in an analytics warehouse as a star schema (taught in Chapter 4): one central fact table of measurable events, surrounded by dimension tables of descriptive context.

fact_sales — grain: one order line item. Measures: quantity, unit_price, line_revenue (= quantity × unit_price). Foreign keys: customer_key, product_key, date_key, order_id (degenerate dimension).
dim_customer — one row per customer version. Tracked as SCD Type 2 (Ch. 4) so a customer moving from Seattle to Austin keeps history. Carries lifetime_value downstream.
dim_product — product name, category, current price.
dim_date — one row per calendar day (the canonical date dimension).
dim_store — channel/store the order came through (web, mobile, marketplace).

3. The journey — where each chapter acts on ShopFlow's data

Here is the whole guide, as one pipeline over ShopFlow's data. Each chapter does one thing to the same dataset:

4. The canonical names — the contract every chapter uses

So the dataset stays recognizable end to end, these names are fixed. When you see them later, they always mean what's defined here.

Stage	Objects
Source (OLTP)	`customers`, `products`, `orders`, `order_items`
Stream	`order_events` (topic)
Raw landing (Ch. 6)	`raw.customers`, `raw.products`, `raw.orders`, `raw.order_items`
Staging (Ch. 7, dbt)	`stg_customers`, `stg_products`, `stg_orders`, `stg_order_items`
Marts / warehouse (Ch. 4 + 7)	`fact_sales`, `dim_customer`, `dim_product`, `dim_date`, `dim_store`
Streaming output (Ch. 9)	`fact_revenue_1m` (per-minute revenue)
Pipeline (Ch. 8)	DAG `shopflow_daily`: `ingest_orders → dbt_run → quality_check`

:::note Durable vs dated ShopFlow's shape — sources → raw → staging → star schema, with a parallel event stream — is durable; it's how analytics data is built almost everywhere. The specific engines we run it on (a particular warehouse, Spark version, or dbt syntax) are dated examples, flagged as such where they appear. :::

→ Next: OLTP vs OLAP — why ShopFlow's app database and its analytics warehouse are built on opposite trade-offs, and why that gap is the reason your job exists.

The business in one paragraph​

1. The source systems — where ShopFlow's data is born​

The event stream​

2. The destination — the ShopFlow warehouse (star schema)​

3. The journey — where each chapter acts on ShopFlow's data​

4. The canonical names — the contract every chapter uses​