Chapter 10 checkpoint

You can now explain how cheap files in object storage become a reliable, versioned, engine-agnostic table. Recall the spine, take the quiz, then trace one commit by hand.

The throughline

  • Object storage stores files, not tables, so a bare data lake has no atomic commits, no consistent reads, and no schema enforcement — no ACID. That gap is why table formats exist; a lake + table format = a lakehouse.
  • A table format is not a file format. Parquet describes one file's bytes; Iceberg/Delta/Hudi are metadata over many files saying which files are the table and what version you're reading. They use Parquet, they don't replace it.
  • The core trick: a table is an immutable list of files plus a pointer; committing = write new files aside, then atomically swap the pointer. The metadata is a tree (Iceberg: metadata → snapshot → manifest list → manifests → data files) or an ordered transaction log (Delta's _delta_log + checkpoints).
  • Superpowers fall out of this: ACID with snapshot isolation, time travel (old snapshots retained), schema enforcement + evolution (schema in metadata, stable field IDs).
  • The real trade-off: copy-on-write (rewrite whole files → fast reads, slow writes) vs merge-on-read (write small delta files → fast writes, slower reads until compaction merges the deltas back into base files). Hudi/MOR shine for streaming upserts.
  • Iceberg is 2026's default open standard; Delta is strong in Spark/Databricks (UniForm bridges to Iceberg); Hudi owns mutation-heavy streaming ingestion.
  • The catalog holds the current-version pointer and is the 2026 interoperability battleground (Iceberg REST, Polaris, Unity, Glue, UniForm). Tables need maintenance (compaction, snapshot expiration, manifest rewrite, orphan cleanup) or they rot.
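The "immutable list of files plus a pointer" idea from the list above can be sketched in a few lines of Python. This is a toy model, not any real catalog's API: `ToyCatalog`, `CommitConflict`, and the file names are hypothetical, snapshots are plain tuples held in memory, and the pointer swap is only trivially atomic because the example is single-threaded. A real catalog has to provide that atomic compare-and-swap itself.

```python
from typing import Dict, Optional, Tuple


class CommitConflict(Exception):
    """Raised when another writer moved the pointer first."""


class ToyCatalog:
    """Hypothetical in-memory catalog: retained snapshots plus one pointer."""

    def __init__(self) -> None:
        # Snapshot id -> immutable tuple of data file names.
        self.snapshots: Dict[int, Tuple[str, ...]] = {0: ()}
        self.current_id = 0  # the "pointer"

    def files(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        """Files of the current snapshot, or of an older retained one."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

    def commit(self, expected_id: int, new_files: Tuple[str, ...]) -> int:
        """Compare-and-swap: publish new_files only if the pointer is unchanged."""
        if self.current_id != expected_id:
            raise CommitConflict(f"expected {expected_id}, found {self.current_id}")
        new_id = self.current_id + 1
        self.snapshots[new_id] = new_files  # older snapshots stay untouched
        self.current_id = new_id            # the single swap that publishes the change
        return new_id


catalog = ToyCatalog()
base_id = catalog.current_id
first = catalog.commit(base_id, catalog.files() + ("a.parquet",))

# A second writer that still believes base_id is current is rejected and must retry.
try:
    catalog.commit(base_id, ("b.parquet",))
    stale_commit_rejected = False
except CommitConflict:
    stale_commit_rejected = True

print(catalog.files())         # ('a.parquet',)
print(catalog.files(0))        # ()  <- reading a retained older snapshot
print(stale_commit_rejected)   # True
```

Because a commit only adds a new snapshot and moves the pointer, readers of older snapshots are never disturbed, and a writer working from a stale pointer fails cleanly instead of overwriting someone else's commit.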

Quiz

Chapter 10 — Table Formats & the Lakehouse

Trace this: one Iceberg-style commit

Work it through before reading the answer. A table is currently Snapshot 9, made of data files [P, Q, R]. A MERGE job needs to update rows that live in file Q and append a batch of brand-new rows, using copy-on-write. A second job is reading the table the whole time, having started at Snapshot 9.

Walk through: (1) What files does the write job create? (2) What is the new snapshot's file list? (3) When does the change become visible? (4) What does the concurrent reader see, and is file Q deleted? (5) After the commit, can someone still query the table as of Snapshot 9 — and what maintenance later changes that?

Answer
  1. New files: a rewritten Q′ (file Q's rows with the updates applied — copy-on-write rewrites the whole file) plus one or more new files S holding the appended rows. P, Q, R are untouched; Q′ and S are written off to the side, in no committed snapshot, so no reader can see them yet.
  2. New snapshot (Snapshot 10): file list [P, Q′, R, S]. Q is replaced by Q′, S is added, and P and R are carried over.
  3. Visibility: nothing is visible until the job atomically swaps the catalog pointer from Snapshot 9 to Snapshot 10. Before the swap, the table is unchanged; after it, the whole change is live at once. That single swap is the atomic commit (atomicity).
  4. The concurrent reader started on Snapshot 9 and keeps reading [P, Q, R] consistently to the end — snapshot isolation; it never sees a half-state. Because of that, Q is NOT deleted on commit — Snapshot 9 still references it, so it must remain readable.
  5. Yes. SELECT ... VERSION AS OF 9 still works, because Snapshot 9 and its files (P, Q, R) are retained: that's time travel. What ends it is snapshot expiration: once Snapshot 9 ages past the retention window, expiring it makes Q unreferenced, after which Q can be physically deleted to reclaim storage. (And had the job crashed after writing Q′/S but before the swap, those would be orphan files for the orphan-cleanup job to remove.)
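The five steps of the answer can also be replayed as a small Python sketch. It is a toy model under stated assumptions: snapshots are tuples of file names in a dictionary, `current` stands in for the catalog pointer, `storage` stands in for the object store, and `unreferenced` is a hypothetical helper, not a function from Iceberg or any other library.

```python
# Toy replay of the trace: snapshots are immutable file lists,
# and `current` is the pointer the catalog would hold.
snapshots = {9: ("P", "Q", "R")}
current = 9

# The concurrent reader pins the snapshot that was current when it started.
reader_files = snapshots[current]

# (1) Copy-on-write: Q is rewritten as Q' and the new rows go to S.
#     Both exist in storage but no snapshot references them yet.
storage = {"P", "Q", "R", "Q'", "S"}

# (2) Snapshot 10's file list: Q swapped for Q', S appended, P and R carried over.
snapshots[10] = tuple("Q'" if f == "Q" else f for f in snapshots[9]) + ("S",)

# (3) The atomic commit is one pointer swap.
current = 10

# (4) The reader keeps its pinned list; Q is still in storage.
print(reader_files)          # ('P', 'Q', 'R')
print(snapshots[current])    # ('P', "Q'", 'R', 'S')


def unreferenced(storage, snapshots):
    """Files in storage that no retained snapshot points to."""
    referenced = {f for files in snapshots.values() for f in files}
    return storage - referenced


# (5) While Snapshot 9 is retained, nothing is deletable.
before_expiry = unreferenced(storage, snapshots)
del snapshots[9]             # snapshot expiration
after_expiry = unreferenced(storage, snapshots)

print(before_expiry)         # set()
print(after_expiry)          # {'Q'}
```

The same `unreferenced` check covers the crash case from the answer: if the job had written Q′ and S but never registered Snapshot 10, those two files would be the ones no snapshot points to.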

You can now turn cheap files on object storage into a reliable, versioned, engine-agnostic table — the lakehouse. But "the commit succeeded" is not the same as "the data is correct and trustworthy." The next chapter is about exactly that: testing, contracts, observability, lineage, and governance — making the data a product people can rely on.

Next: Chapter 11: Data Quality, Governance & Observability →