Relational fundamentals: keys, normalization, ACID, indexes
The source systems you ingest from are overwhelmingly relational databases — PostgreSQL, MySQL, and their kin. ShopFlow's customers, products, orders, and order_items tables (the running example) are exactly such a schema, and we use them throughout this lesson. To reason about why their data is shaped the way it is (and how to reshape it for analytics later), you need a working grasp of the relational model: keys, normalization, transactions, ACID, and indexes. Skip this and the modeling chapter (4) will feel arbitrary — you won't be able to reason about why you denormalize. So we earn it from scratch here.
The plain-English on-ramp
A relational database stores data in tables — rows and columns, like spreadsheets — where tables can reference each other. That "reference each other" part is the whole idea behind the word relational. Instead of repeating a customer's full details on every one of their orders, you keep customers in one table and orders in another, and each order just points at a customer. The pointing is done with keys. Splitting data up to avoid repeating yourself is normalization. Doing several changes as one all-or-nothing unit is a transaction, and the guarantees around transactions are ACID. Making lookups fast is what an index does. That's the whole lesson in five words — now let's make each precise.
Keys: how rows point at each other
- A primary key is a column (or set of columns) whose value uniquely identifies each row in a table. No two rows share it, and it's never empty.
order_idin anorderstable;customer_idin acustomerstable. It's the row's permanent name. - A foreign key is a column in one table that holds the primary key value of a row in another table — that's the "pointer." An
orderstable has acustomer_idforeign key whose value matches thecustomer_idprimary key over incustomers. It links the two.
customers orders
┌────────────┬───────┐ ┌──────────┬─────────────┬────────┐
│ customer_id│ name │ │ order_id │ customer_id │ amount │
│ (PK) │ │ │ (PK) │ (FK) ─────┼──┐ │
├────────────┼───────┤ ├──────────┼─────────────┤ │ │
│ 7 │ Ana │◄───────────┤ 8841 │ 7 │ │ 79 │
│ 9 │ Ben │ │ 8842 │ 9 │ │ 12 │
└────────────┴───────┘ └──────────┴─────────────┴────────┘
▲ │
└──── the FK value (7) matches a PK ───────┘
The database can enforce that a foreign key always points at a real row (you can't have an order for a customer who doesn't exist). That enforcement is referential integrity — a guarantee the source gives you, and one you often lose when you copy data into an analytics system, which is part of why analytics data needs its own quality checks (Chapter 11).
Normalization: don't repeat yourself, 1NF → 3NF
Normalization is the practice of structuring tables so that each fact is stored exactly once, by splitting data across related tables. The motivation is concrete: if you store a customer's address on every order, then (a) you waste space, and (b) when they move, you must update every order or risk contradictory data — an update anomaly. Normalization eliminates that by keeping each fact in one place and pointing at it with keys.
There are formal levels called normal forms. You need a working grasp of the first three:
- First Normal Form (1NF) — each cell holds a single, atomic value; no lists crammed into one column, and each row is unique. Bad: a
phone_numberscell containing"555-1234, 555-9876". Fixed: one phone number per row (or a separatephonestable). - Second Normal Form (2NF) — 1NF, plus every non-key column depends on the whole primary key, not just part of it. (Only bites when the primary key is made of multiple columns.) Bad: an
order_itemstable keyed by(order_id, product_id)that also storesproduct_name— butproduct_namedepends only onproduct_id, half the key. Fixed: moveproduct_nameto aproductstable. - Third Normal Form (3NF) — 2NF, plus no non-key column depends on another non-key column (no "transitive" dependencies). Bad: a
customerstable storingzip_codeandcity, whencityis determined byzip_code. Fixed: keepzip_codeand look upcityfrom azip_codestable.
The plain-English summary that carries you 95% of the way: 3NF means every column depends on the key, the whole key, and nothing but the key. OLTP databases are usually designed to roughly 3NF, because it makes writes safe and consistent — exactly what operational systems need.
Worked example: normalizing a flat table
Imagine ShopFlow had stored its orders as one flat table (a beginner's first instinct):
orders_flat
┌──────────┬───────────┬──────────────┬─────────────┬────────────┐
│ order_id │ cust_name │ cust_city │ product │ unit_price │
├──────────┼───────────┼──────────────┼─────────────┼────────────┤
│ 8841 │ Ana │ Seattle │ Coffee Maker│ 79 │
│ 8842 │ Ana │ Seattle │ Tea Kettle │ 35 │
└──────────┴───────────┴──────────────┴─────────────┴────────────┘
Ana's name and city repeat. If she moves, you must fix multiple rows (update anomaly). Normalize to 3NF — the four tables ShopFlow actually uses, each fact stored once:
customers products order_items
┌─────────────┬──────┬───────┐ ┌────────────┬───────┬────────┐ ┌──────────┬────────────┬─────┐
│ customer_id │ name │ city │ │ product_id │ name │ price │ │ order_id │ product_id │ qty │
├─────────────┼──────┼───────┤ ├────────────┼───────┼────────┤ ├──────────┼────────────┼─────┤
│ 7 │ Ana │Seattle│ │ 100 │Coffee │ 79 │ │ 8841 │ 100 │ 1 │
└─────────────┴──────┴───────┘ │ 101 │Tea │ 35 │ │ 8842 │ 101 │ 1 │
└────────────┴───────┴────────┘ └──────────┴────────────┴─────┘
orders
┌──────────┬─────────────┐
│ order_id │ customer_id │
├──────────┼─────────────┤
│ 8841 │ 7 │
│ 8842 │ 7 │
└──────────┴─────────────┘
Now Ana's city lives in exactly one row of customers. A move is a one-row update. That's normalization working — and it's exactly the shape of ShopFlow's source schema. (That changeable city is what drives the slowly-changing-dimension example back in Chapter 4.)
Denormalization: the deliberate opposite, for analytics
Here's the twist that confuses people: for analytics, you often do the reverse — on purpose. Denormalization is deliberately duplicating data (folding those three tables back into one wide one) to make reads faster and simpler.
Why would you undo good work? Because OLTP normalization optimizes writes and consistency, but OLAP analytics optimizes reads. An analyst asking "sales by customer city" against a normalized OLTP schema must join several tables — and joins are expensive at scale. If you've already folded city, name, and product into one wide analytics table, the query is a simple scan of one table (and recall from OLTP vs OLAP that columnar storage makes wide tables cheap to scan).
The tradeoff is exactly the one normalization was avoiding: duplication means a value lives in many places, so if it changes you must rewrite many rows. But analytics data is mostly rebuilt by a pipeline, not edited in place — so the update anomaly that's fatal in OLTP is a non-issue in OLAP. The pipeline just regenerates the denormalized table from the normalized source.
:::tip The durable rule Normalize for writes (OLTP); denormalize for reads (OLAP). Source systems are normalized so updates are safe; your analytics models (Chapter 4) are denormalized so queries are fast and simple. A data engineer constantly moves data from the first shape to the second — that reshaping is the transformation stage of the lifecycle. This is durable; it long predates any current tool. :::
Transactions, ACID, and isolation levels
A transaction is a group of operations treated as one indivisible unit: either all of them happen, or none do. The classic example is a bank transfer — debit one account, credit another. If the debit succeeds but the credit fails, money vanishes. A transaction makes the pair all-or-nothing.
The guarantees a relational database makes about transactions are the acronym ACID:
- Atomicity — all-or-nothing; a transaction fully completes or fully rolls back, never half.
- Consistency — a transaction moves the database from one valid state to another, respecting all rules (like referential integrity).
- Isolation — concurrent transactions don't corrupt each other; the result is as if they ran one at a time.
- Durability — once a transaction is committed, it survives crashes and power loss (it's safely written to disk).
Isolation levels are a knob on the "I" in ACID: how strictly the database keeps concurrent transactions from seeing each other's in-progress work. Stronger isolation is more correct but slower; weaker is faster but risks subtle anomalies. The standard levels, weakest to strongest: Read Uncommitted → Read Committed → Repeatable Read → Serializable. You don't need to memorize the anomaly each one allows; you need to know that isolation level is a correctness-vs-speed tradeoff the source database makes, and that the data you ingest is a snapshot of a system that's constantly changing underneath you — which is exactly why ingestion must handle "the data moved while I was reading it."
:::caution Why this matters for you, not just app developers You may think ACID is the app team's problem. But you ingest from ACID systems and often lose those guarantees on the way to analytics: object storage isn't transactional, so a half-finished write can leave a partial file. Reclaiming transaction-like guarantees on the analytics side is precisely what open table formats (Delta, Iceberg — Chapter 10) were invented to do. The vocabulary you're learning now is what makes Chapter 10 make sense. :::
Indexes: how lookups get fast
When you ask ShopFlow's OLTP database "give me order_id 8841," how does it find that one row among millions without scanning them all? An index is a separate, sorted data structure (usually a B-tree) that maps a column's values to the rows that hold them — like a book's index maps words to page numbers. With an index on order_id, the database jumps almost straight to the row instead of reading the whole table (a full table scan).
The tradeoffs, which you'll reason about constantly:
- Indexes make reads on the indexed column fast, but make writes slower (every insert must also update the index) and cost storage.
- They help selective lookups ("one specific order") enormously, but help little when a query reads most of the table anyway — which is why OLAP systems lean on columnar layout and partitioning (next idea, and Chapter 2) rather than the per-row indexes OLTP relies on.
That last point ties the lesson together: OLTP gets speed from indexes on a row store; OLAP gets speed from column pruning and partitioning on a column store. Same goal (don't read what you don't need), different machinery — because the workloads are opposite.
Why it matters
The relational model is the shape of nearly every source system you'll ingest from. Primary keys name rows, foreign keys point between them, and referential integrity keeps the pointers valid. Normalization (1NF–3NF) stores each fact once — the key, the whole key, and nothing but the key — which is right for OLTP writes; denormalization deliberately duplicates for fast OLAP reads, and moving data from the first shape to the second is the heart of the transformation stage. ACID and isolation levels are the correctness guarantees source databases make (and that you often must reclaim downstream with table formats). And indexes make selective lookups fast on row stores, just as columnar layout and partitioning will for analytics. Master this vocabulary and Chapters 2, 4, and 10 will feel inevitable instead of arbitrary.
So far every system we've discussed runs on one machine's terms. But data engineering happens at a scale where one machine isn't enough and failure is constant — which forces a new set of ideas. After one more foundational split (how data moves — batch vs streaming), we confront the distributed-systems reality head-on.
Where this leads: denormalization becomes dimensional modeling in Chapter 4; reclaiming ACID on the lake is Chapter 10; partitioning as the OLAP answer to indexes is Chapter 2.