ACID, time travel, and schema evolution
The last three lessons built the machinery: a metadata layer that defines a table as an immutable list of files and commits by atomically swapping a pointer. Now we collect the payoff — the features that machinery unlocks — and confront the one genuine trade-off that every format makes you decide: copy-on-write vs merge-on-read.
ACID and snapshot isolation, traced
You met ACID in lesson 10.1 as the warehouse guarantee a bare lake lacked. The table format restores it. Let's trace an update on ShopFlow's fact_sales as an Iceberg-style table (ShopFlow — see Meet ShopFlow) to see each letter earned. fact_sales currently is Snapshot 5, made of files A, B, C. A MERGE job runs that affects order-line rows living in file B (say, today's corrected order lines).
- The job writes a new file B′ containing B's rows with the update applied. Files A, B, C are untouched; B′ sits off to the side. No reader can see B′ yet — it's in no committed snapshot.
- The job builds a new metadata state:
fact_salesis now files A, B′, C (B replaced by B′). This becomes Snapshot 6. - The job attempts the atomic commit: make Snapshot 6 the current one. If another job committed first, this fails and the job retries against the new state (optimistic concurrency). If it succeeds, the pointer now names Snapshot 6.
Concretely, that MERGE upserts corrected order lines into fact_sales — the single statement the table format wraps in one atomic snapshot:
-- ACID upsert into ShopFlow's fact_sales (Iceberg-style MERGE)
MERGE INTO fact_sales t
USING staged_order_lines s
ON t.order_id = s.order_id
AND t.product_key = s.product_key
WHEN MATCHED THEN UPDATE SET
quantity = s.quantity,
unit_price = s.unit_price,
line_revenue = s.quantity * s.unit_price
WHEN NOT MATCHED THEN INSERT *;
Now read off the guarantees:
- Atomicity — until step 3 succeeds, nothing changed; after it, everything changed. There is no half-state. ✔
- Isolation — the daily revenue query that started while the job ran was following Snapshot 5 and saw A, B, C the whole time — a consistent
line_revenuetotal, never a mix. This is snapshot isolation: every reader reads one immutable snapshot start to finish, oblivious to concurrent writers. ✔ - Consistency — readers only ever see committed snapshots, never partial ones. ✔
- Durability — committed snapshots are files in durable object storage. ✔
Time travel: old snapshots are still there
Notice what step 1 didn't do: it never deleted file B. After the commit, B is no longer in the current snapshot, but Snapshot 5 still references it, and Snapshot 5 still exists in the metadata. So the table's previous version is fully intact and queryable. That is time travel: reading the table as of an earlier snapshot or timestamp.
-- Iceberg: read fact_sales as it was at a past snapshot or time
SELECT * FROM fact_sales VERSION AS OF 5;
SELECT * FROM fact_sales TIMESTAMP AS OF '2026-06-20 00:00:00';
-- "What did yesterday's revenue look like before today's MERGE?"
SELECT date_key, SUM(line_revenue) AS revenue
FROM fact_sales TIMESTAMP AS OF '2026-06-24 00:00:00'
GROUP BY date_key;
Time travel is the same superpower a warehouse charges dearly for, and it falls out of the architecture for free: because commits never mutate old files and old snapshots are retained, every past version is just another pointer you can read. For ShopFlow it's used for auditing ("what did fact_sales say last Tuesday before the correction?"), reproducing a model training run on the exact fact_sales it saw, debugging a bad nightly load, and rolling back a bad write by re-pointing fact_sales at the previous good snapshot.
:::note Time travel is not free forever
Those old files cost storage, and old snapshots accumulate. You don't keep every version of a busy table like fact_sales indefinitely — you expire snapshots past some retention window, which lets the old files they referenced be cleaned up. That's table maintenance, and it's lesson 10.5. The point here: time travel exists because you retain history, and you trade storage for how far back you can travel.
:::
Schema evolution and schema enforcement
Because the schema lives in the table's metadata (not baked irreversibly into every file), the table format can manage schema change safely — two related powers:
- Schema enforcement — when a write doesn't match
fact_sales's declared schema (line_revenuearrives as text, or the requiredorder_idis missing), the table format rejects it rather than silently corrupting the table. This is the guarantee the bare lake couldn't give (lesson 10.1): bad data is stopped at the door. - Schema evolution — you can intentionally change the schema — add a column, rename one, drop one, widen a type (
int→long) — and the format records the change in metadata without rewriting existing data files. Old files simply lack the new column; reads returnnullfor it on those rows. Crucially, good formats track columns by a stable internal field ID, not by name or position, so a rename is a metadata edit and a dropped-then-readded column doesn't accidentally resurrect old data. (This is why renaming a column in Iceberg is safe, whereas in a naive folder-of-Parquet lake it can silently mis-map columns.)
Make it concrete with the exact drift ShopFlow hits in Chapters 2 and 6: the source starts sending a discount on each order line, and fact_sales needs to carry it. In a bare folder-of-Parquet lake that's a crisis — old files have no discount, new files do, and a careless reader mis-aligns columns. With a table format it's one metadata edit, no rewrite of history:
-- Add discount to fact_sales without rewriting a single old file
ALTER TABLE fact_sales ADD COLUMN discount decimal(10,2);
After this commit, every pre-existing order-line file is untouched; queries over old snapshots return discount = null for rows written before the change, and new loads populate it. The historical revenue numbers don't move; the column simply appears going forward.
This pairing — enforce by default, evolve on request — is exactly the discipline ShopFlow's schema-drift problem (Chapter 6) needs: accidental drift is blocked, the deliberate discount addition is cheap and safe.
The real trade-off: copy-on-write vs merge-on-read
Everything above glossed over one decision with real consequences. When you update or delete some rows in a file, you have two strategies, and they trade read speed against write speed.
Copy-on-write (COW). Rewrite the whole file with the changes applied — exactly the trace above, where B became B′. After commit, the data files are clean and complete: a reader just reads them, fast. But the write was expensive — to change a few rows you rewrote an entire (possibly large) file. This is write amplification: a small logical change causes a large physical write.
Merge-on-read (MOR). Don't rewrite the file. Instead write a small delta: a side file recording the changes — either delete markers saying "rows X, Y in file B are gone" (Delta's deletion vectors, Iceberg's delete files) and/or new rows in a separate log file (Hudi's approach). The write is cheap and fast. But now a reader must merge the base file with its deltas on the fly — read B, then apply its delete markers and updates — which makes reads slower. This is read amplification: every read pays to reconcile base + deltas.
How to choose:
- Copy-on-write when reads vastly outnumber writes and writes are infrequent/bulk — you pay the rewrite cost rarely and get fast reads forever after. Good for a nightly-rebuilt
fact_salesthat dashboards hammer all day. - Merge-on-read when writes are frequent and small — streaming upserts from
order_events, high-rate CDC offorders(Chapter 6) — where rewriting wholefact_salesfiles every time an order flipsplaced → paid → shippedwould be ruinous. You accept slower reads in exchange for cheap, fast writes. This is exactly Hudi's home turf, and why Delta added deletion vectors.
MOR doesn't make read amplification permanent: a periodic compaction job merges the deltas back into clean base files (turning MOR data back into COW-clean data), trading some background compute to restore read speed. That compaction is, again, table maintenance — our next and final lesson.
:::tip The one-line summary Copy-on-write = pay on write, fast reads. Merge-on-read = pay on read, fast writes. Pick by your read/write ratio; use compaction to claw back read speed on MOR tables. This single trade-off is the most important "it depends" in the entire lakehouse, and the gap most other guides skip. :::
Why it matters
The lakehouse's headline features all fall out of the metadata-over-files architecture. ACID with snapshot isolation comes from committing via an atomic pointer swap, so readers always see one consistent, immutable snapshot. Time travel comes free because old snapshots and their files are retained — you can query, audit, or roll back to any past version (until you expire it). Schema enforcement blocks bad data at the door, while schema evolution lets you deliberately add/rename/drop/widen columns via metadata without rewriting files (tracked by stable field IDs). The one genuine trade-off is copy-on-write vs merge-on-read: COW rewrites whole files for fast reads and slow writes (write amplification); MOR writes small deltas for fast writes and slower reads (read amplification), reclaimed later by compaction. Choose by your read/write ratio — and that word compaction points straight to the last piece: keeping a lakehouse healthy in production.