Chapter 10 · Table Formats & the Lakehouse

For two decades, you had to choose. You could put your data in a warehouse — a system that gave you transactions, schemas, and fast SQL, but charged a premium and locked your data inside one vendor's engine. Or you could dump files into a data lake — cheap, open object storage you could point any tool at — and give up nearly every guarantee the warehouse offered. There was no middle ground.

This chapter is about the technology that erased the choice: the open table format. Apache Iceberg, Delta Lake, and Apache Hudi take a pile of files sitting in object storage and wrap them in a layer of metadata that brings back ACID transactions, time travel, and schema evolution — the warehouse's guarantees — while keeping the lake's cheapness and openness. The architecture that results is the lakehouse: one copy of your data, in open formats, that batch and streaming engines, SQL warehouses, and ML tools can all read and write safely.

The durable idea

An open table format is a transaction log of metadata layered over plain files — it gives object storage the reliability of a warehouse without locking you into one engine.

The concept — a metadata layer that turns files into a transactional table — is durable and will outlive every product name in this chapter. Which format "wins," its exact feature list, and which catalog you use are dated specifics. We teach the concept first, then map it onto Iceberg, Delta, and Hudi as they stand in 2026.
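
To make the idea concrete, here is a minimal sketch of a transaction log layered over file names. It is a toy model, not how any of the three formats is implemented: `ToyTable`, `Commit`, `files_at`, and the `part-*.parquet` names are hypothetical, and real formats persist their metadata in object storage rather than in memory. It only shows the core mechanism: a table version is whatever set of files you get by replaying the log up to that point.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Commit:
    """One log entry: the data files this commit added and removed."""
    added: frozenset = frozenset()
    removed: frozenset = frozenset()


@dataclass
class ToyTable:
    """Hypothetical table: an append-only list of commits over file names."""
    log: list = field(default_factory=list)

    def commit(self, expected_version, added=(), removed=()):
        # Optimistic concurrency: the write lands only if nobody else
        # committed since this writer read the table.
        if expected_version != len(self.log):
            raise RuntimeError("conflict: table changed since it was read")
        self.log.append(Commit(frozenset(added), frozenset(removed)))
        return len(self.log)  # the new version number

    def files_at(self, version=None):
        # Replay the log up to `version` to rebuild that snapshot's file set.
        version = len(self.log) if version is None else version
        live = set()
        for entry in self.log[:version]:
            live -= entry.removed
            live |= entry.added
        return live


table = ToyTable()
v1 = table.commit(0, added={"part-000.parquet", "part-001.parquet"})
# Rewriting a file (say, after an update) removes the old one and adds
# the new one in a single commit, so readers never see a half-applied change.
v2 = table.commit(v1, added={"part-002.parquet"}, removed={"part-001.parquet"})

current = table.files_at()    # latest snapshot
as_of_v1 = table.files_at(v1)  # "time travel" to the first version
```

In this sketch the data files themselves are never modified. Atomic commits, conflict detection, and time travel all come from the log, which is why the same approach can sit on top of plain object storage.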

:::tip One distinction to hold onto from the start
A table format is not a file format. Parquet (Chapter 2) is a file format: it describes the bytes of one file. Iceberg and Delta are table formats: they describe how a collection of Parquet files adds up to one consistent, versioned table. Confusing the two is the single most common mistake newcomers make, and we'll nail it down in lesson 10.2.
:::

What you'll learn

This chapter builds from the problem to the architecture:

The lake, the warehouse, and the divide that collapsed — why object storage can't give you ACID on its own, what a warehouse bought you, and why the lakehouse became possible.
What a table format actually is — the metadata tree and transaction log that turn files into a table, and why it is not a file format.
Iceberg, Delta Lake, and Hudi — the three open formats, their architectures, and where each one is strong.
ACID, time travel, and schema evolution — the lakehouse superpowers, plus the copy-on-write vs merge-on-read trade-off behind every upsert.
Catalogs, interoperability, and maintenance — the 2026 catalog battleground (Iceberg REST, Unity, Glue, Polaris, UniForm) and the table upkeep that keeps a lakehouse from rotting.
Chapter 10 checkpoint — a quiz and a trace-the-commit exercise.

This chapter builds on everything before it. Lakehouse tables live in object storage as Parquet files (Chapter 2), and distributed query engines (Chapter 3), Spark (Chapter 5), and Kafka/Flink streaming (Chapter 9) are the engines that read and write them. The chapter also sets up governance and lineage (Chapter 11), which the catalog layer increasingly owns.

Next: The lake, the warehouse, and the divide that collapsed →
