
What a table format actually is

This is the conceptual heart of the chapter. If you understand one thing about the lakehouse, make it this: a table format is a layer of metadata over your data files, not a new way to store the bytes of a file. Once that clicks, ACID, time travel, and schema evolution all stop being magic and become obvious consequences.

The distinction everyone gets wrong: table format ≠ file format

A file format describes the bytes of a single file. Parquet (Chapter 2) is a file format: it says how one file lays out columns, compresses them, and stores its statistics. One Parquet file from ShopFlow's fact_sales table (see Meet ShopFlow) knows nothing about the others; it is one self-contained unit.

A table format describes how a collection of files adds up to one logical table. For ShopFlow's fact_sales, it is metadata that says: "These specific order-line Parquet files, as of right now, are fact_sales. These older files are previous versions. This is the schema (quantity, unit_price, line_revenue, the foreign keys). This is how the data is partitioned (by date_key)." Iceberg, Delta Lake, and Hudi are table formats. They do not replace Parquet — they use it. The actual order-line rows still live in Parquet files; the table format just tracks which files count, and when.

:::danger The most common newcomer mistake
Calling Iceberg or Delta a "file format" and comparing them to Parquet. They are not alternatives to Parquet; they sit on top of Parquet (Iceberg can also use ORC or Avro for data; Delta uses Parquet). The right mental model:

Parquet = how one file stores rows. Iceberg/Delta = which files are the table, and what version you're reading.

Get this wrong and the rest of the lakehouse never makes sense.
:::

Figure: the table format layer (Iceberg / Delta / Hudi) sits above the data files fact_sales-001.parquet, fact_sales-002.parquet, and fact_sales-003.parquet. Each .parquet is a file format.

The core trick: a table is a pointer to a list of files

Here's how that metadata layer manufactures atomicity on storage that has none. Instead of letting "the table" mean "whatever files happen to be in this directory," the table format defines the table as an explicit, named list of exactly which files belong to it right now — and a tiny pointer that says "this list is the current table."

To change the table, a writer does not edit files in place. It:

  1. Writes any new data files (new Parquet, off to the side — nobody is reading them yet).
  2. Writes a new list of files describing the table's new state (old files it keeps, minus any it's removing, plus the new ones).
  3. Atomically swaps the pointer to point at the new list.

Steps 1 and 2 touch only new files, so readers running during them see nothing change — they're still following the old pointer to the old list. The entire change becomes visible in one operation: the pointer swap in step 3. That single swap either happens or it doesn't. That is your atomic commit. Object storage couldn't make 400 file operations atomic — but it can atomically update one small pointer, and the table format reduces every change to exactly that.
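
The three steps can be sketched on a local filesystem. This is a minimal illustration, not how any real table format lays out its metadata: the `CURRENT` pointer file, the `files-….json` list names, and the `commit` and `read_table` helpers are all hypothetical, and `os.replace` stands in for whatever atomic operation a real catalog or object store provides. It also skips conflict detection between concurrent writers.

```python
import json
import os
import tempfile
import uuid

POINTER = "CURRENT"  # hypothetical name for the tiny pointer file


def read_table(table_dir):
    """Follow the pointer to the current file list; no pointer means an empty table."""
    pointer_path = os.path.join(table_dir, POINTER)
    if not os.path.exists(pointer_path):
        return []
    with open(pointer_path) as f:
        list_name = f.read().strip()
    with open(os.path.join(table_dir, list_name)) as f:
        return json.load(f)


def commit(table_dir, added, removed=()):
    """Steps 2 and 3: write a new immutable file list, then swap the pointer."""
    removed = set(removed)
    new_state = [p for p in read_table(table_dir) if p not in removed] + list(added)

    # Step 2: the new list gets a fresh name, so no existing list is ever edited.
    list_name = f"files-{uuid.uuid4().hex}.json"
    with open(os.path.join(table_dir, list_name), "w") as f:
        json.dump(new_state, f)

    # Step 3: os.replace is an atomic rename on POSIX filesystems, so readers
    # see either the old pointer or the new one, never a half-written state.
    tmp_path = os.path.join(table_dir, POINTER + ".tmp")
    with open(tmp_path, "w") as f:
        f.write(list_name)
    os.replace(tmp_path, os.path.join(table_dir, POINTER))
    return list_name


# Step 1 (writing the Parquet files themselves) is assumed to have happened already.
table_dir = tempfile.mkdtemp()
v1 = commit(table_dir, ["fact_sales-001.parquet", "fact_sales-002.parquet"])
v2 = commit(table_dir, ["fact_sales-003.parquet"], removed=["fact_sales-001.parquet"])
print(read_table(table_dir))
# ['fact_sales-002.parquet', 'fact_sales-003.parquet']
```

Notice that the first list (`v1`) is still on disk and still describes the table as it was. Nothing was overwritten; only the pointer moved.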

This is the whole game. Everything else — time travel, isolation, schema evolution — falls out of "a table is an immutable list of files, and committing means swapping which list is current."

The metadata tree

In practice the "list of files" isn't one flat list — it's a small tree of metadata, for performance. The names differ by format, but the shape is the same. We'll use Iceberg's vocabulary because it's the most explicit; lesson 10.3 maps Delta and Hudi onto it.

A snapshot is the table as of one committed moment: one complete version. Each snapshot points to a manifest list, which names the snapshot's manifest files. A manifest is a list of data files, each entry carrying per-file statistics (row count, and min/max values for each column). The whole thing hangs off a top-level metadata file that records the current schema, partition layout, and which snapshot is current.

Figure: the catalog points to the current metadata file, metadata.json (schema · partitioning · list of snapshots). The metadata file lists Snapshot v2 (current) and Snapshot v1 (old, kept for time travel). A snapshot has a manifest list (the manifests in this snapshot), which names Manifest A and Manifest B (each → data files + stats), which in turn list fact_sales-001.parquet, fact_sales-002.parquet, and fact_sales-003.parquet.

Reading the tree top to bottom: a catalog (lesson 10.5) holds the pointer to the current metadata file; the metadata file names the current snapshot; the snapshot's manifest list names its manifests; each manifest lists the data files and their statistics. To read the table, an engine walks down to gather the data-file list for the current snapshot.
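
To make the walk concrete, here is a toy version of the tree as plain Python dictionaries. Every name, key, and row count below is invented for illustration; real Iceberg metadata is stored as JSON and Avro files with a much richer schema. The point is only the shape of the traversal.

```python
# Illustrative stand-ins for the four metadata levels (not Iceberg's real schemas).
catalog = {"fact_sales": "metadata-v2.json"}

metadata_files = {
    "metadata-v2.json": {
        "schema": ["date_key", "product_key", "quantity", "unit_price", "line_revenue"],
        "partition_by": "date_key",
        "current_snapshot": "v2",
        "snapshots": {"v1": "manifest-list-v1", "v2": "manifest-list-v2"},
    }
}

manifest_lists = {
    "manifest-list-v1": ["manifest-A"],
    "manifest-list-v2": ["manifest-A", "manifest-B"],
}

manifests = {
    "manifest-A": [
        {"path": "fact_sales-001.parquet", "row_count": 1000},  # made-up counts
        {"path": "fact_sales-002.parquet", "row_count": 1200},
    ],
    "manifest-B": [
        {"path": "fact_sales-003.parquet", "row_count": 800},
    ],
}


def data_files(table, snapshot_id=None):
    """Walk catalog -> metadata file -> snapshot -> manifest list -> manifests."""
    metadata = metadata_files[catalog[table]]
    snapshot_id = snapshot_id or metadata["current_snapshot"]
    manifest_names = manifest_lists[metadata["snapshots"][snapshot_id]]
    return [entry["path"] for name in manifest_names for entry in manifests[name]]


print(data_files("fact_sales"))
# ['fact_sales-001.parquet', 'fact_sales-002.parquet', 'fact_sales-003.parquet']
print(data_files("fact_sales", snapshot_id="v1"))
# ['fact_sales-001.parquet', 'fact_sales-002.parquet']
```

The second call is time travel in miniature: the old snapshot is still in the metadata, so reading it is the same walk from a different starting node. Note also that both snapshots share manifest-A, which is why a commit does not need to rewrite the whole tree.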

:::tip Why the stats in the manifests matter so much
Each manifest entry stores per-file min/max values for each column. So before opening a single data file, the engine can ask the metadata: "which fact_sales files could possibly contain rows for date_key = 2026-06-01 with product_key = 42?" Files whose min/max ranges can't match are skipped entirely. This is partition pruning and file skipping done from metadata: the engine reads only the relevant files, often without a single slow list call on the bucket. It's a large part of why lakehouse tables query fast. (Compare Parquet's internal row-group statistics from Chapter 2: same idea, now lifted to the whole-table level.)
:::
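
The skipping test itself is just a range check. Here is a rough sketch for equality filters, assuming each manifest entry carries `min` and `max` dictionaries keyed by column; that structure, the `might_match` and `prune` helpers, and the sample ranges are all hypothetical, chosen to match the question asked in the tip.

```python
from datetime import date

# Hypothetical manifest entries with per-file min/max statistics.
manifest_entries = [
    {
        "path": "fact_sales-001.parquet",
        "min": {"date_key": date(2026, 5, 30), "product_key": 1},
        "max": {"date_key": date(2026, 5, 31), "product_key": 90},
    },
    {
        "path": "fact_sales-002.parquet",
        "min": {"date_key": date(2026, 6, 1), "product_key": 1},
        "max": {"date_key": date(2026, 6, 1), "product_key": 40},
    },
    {
        "path": "fact_sales-003.parquet",
        "min": {"date_key": date(2026, 6, 1), "product_key": 41},
        "max": {"date_key": date(2026, 6, 1), "product_key": 99},
    },
]


def might_match(entry, predicates):
    """A file can hold matching rows only if every filter value lies within [min, max]."""
    return all(
        entry["min"][column] <= value <= entry["max"][column]
        for column, value in predicates.items()
    )


def prune(entries, predicates):
    """Return the paths of the files that cannot be ruled out from metadata alone."""
    return [entry["path"] for entry in entries if might_match(entry, predicates)]


query = {"date_key": date(2026, 6, 1), "product_key": 42}
print(prune(manifest_entries, query))
# ['fact_sales-003.parquet']
```

File 001 is ruled out by its date range and file 002 by its product_key range, so only one of the three files is ever opened. A surviving file is not guaranteed to contain a match; the statistics can only prove absence.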

Delta's transaction log: the same idea, kept as a log

Delta Lake arranges the same information slightly differently, and it's worth seeing because the framing is illuminating. Delta keeps a transaction log, a directory called _delta_log stored alongside fact_sales's data files. It contains an ordered sequence of small JSON files: 00000.json, 00001.json, and so on (shortened here; real Delta file names are zero-padded to 20 digits). Each one records one commit as a set of actions: "added these order-line files, removed those files, the schema is now this."

The current state of the table is the result of replaying the log from the start: start empty, apply commit 0, then 1, then 2, and you arrive at exactly which files are live and what the schema is. A new commit is just writing the next-numbered JSON file. Because two writers can't both successfully create 00007.json (only one wins the create), the log is the atomic-commit mechanism — same pointer-swap principle, expressed as "append the next entry to the log."
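
Both halves of that paragraph, replay and "only one wins the create", fit in a short sketch. It is heavily simplified: the action format (`{"add": path}` / `{"remove": path}`) and the `write_commit` and `replay` helpers are invented for illustration, file names use the shortened five-digit form from above, and Python's exclusive-create mode on a local directory stands in for the storage-level guarantee a real Delta writer relies on.

```python
import json
import os
import tempfile


def write_commit(log_dir, version, actions):
    """Try to claim the next log entry; exactly one writer per version can succeed."""
    path = os.path.join(log_dir, f"{version:05d}.json")
    # Mode "x" is exclusive create: it raises FileExistsError if the file exists.
    with open(path, "x") as f:
        json.dump(actions, f)


def replay(log_dir):
    """Start empty and apply every commit in order to get the live file set."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for action in json.load(f):
                if "add" in action:
                    live.add(action["add"])
                elif "remove" in action:
                    live.discard(action["remove"])
    return sorted(live)


log_dir = tempfile.mkdtemp()
write_commit(log_dir, 0, [{"add": "fact_sales-001.parquet"},
                          {"add": "fact_sales-002.parquet"}])
write_commit(log_dir, 1, [{"remove": "fact_sales-001.parquet"},
                          {"add": "fact_sales-003.parquet"}])

# A second writer that also believes the next version is 1 loses the race.
try:
    write_commit(log_dir, 1, [{"add": "fact_sales-999.parquet"}])
    lost_race = False
except FileExistsError:
    lost_race = True

print(replay(log_dir), lost_race)
# ['fact_sales-002.parquet', 'fact_sales-003.parquet'] True
```

The losing writer's change never becomes part of the table. In a real system it would re-read the log, check whether its change still applies, and retry as version 2.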

Replaying thousands of tiny JSON files would get slow, so Delta periodically writes a checkpoint: a Parquet file that snapshots the table's full state at, say, commit 100, so a reader can start from the checkpoint and replay only the few commits after it, instead of all 100.
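
Here is a small in-memory sketch of what a checkpoint buys you. The assumptions are loud ones: commits are a plain Python list of invented `{"add": ...}` / `{"remove": ...}` actions, a checkpoint is just a saved set of live files rather than a Parquet file, and `state_at` is a hypothetical helper that also counts how many commits it had to replay.

```python
def state_at(commits, checkpoints, version):
    """Rebuild the live file set at `version`, starting from the latest usable checkpoint."""
    usable = [v for v in checkpoints if v <= version]
    if usable:
        start = max(usable)
        live = set(checkpoints[start])  # full table state as of commit `start`
        first = start + 1
    else:
        live = set()  # no checkpoint: start empty and replay from commit 0
        first = 0

    replayed = 0
    for v in range(first, version + 1):
        for action in commits[v]:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
        replayed += 1
    return sorted(live), replayed


# 103 commits (versions 0..102), each adding one made-up file.
commits = [[{"add": f"fact_sales-{v:03d}.parquet"}] for v in range(103)]

# Write a checkpoint capturing the full state as of commit 100.
checkpoints = {100: state_at(commits, {}, 100)[0]}

files_fast, replayed_fast = state_at(commits, checkpoints, 102)
files_slow, replayed_slow = state_at(commits, {}, 102)
print(files_fast == files_slow, replayed_fast, replayed_slow)
# True 2 103
```

Both paths arrive at the same table; the checkpoint only changes how much of the log has to be read to get there.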

:::note Two shapes, one principle
Iceberg builds a fresh snapshot tree per commit and atomically swaps a pointer to the new root. Delta appends an entry to an ordered log and reconstructs state by replay (with checkpoints for speed). Different data structures, identical guarantee: every change becomes one indivisible operation, so readers always see a consistent version. Hold the principle; the exact file layouts are implementation detail that shifts between format versions.
:::

Why it matters

A table format is metadata and a transaction log layered over ordinary data files — it is not a file format, and it does not replace Parquet; it tracks which Parquet files make up the table and when. The core trick is to define a table as an immutable list of files plus a tiny pointer to the current list, so every change reduces to writing new files off to the side and then atomically swapping the pointer — the one operation object storage can do indivisibly. The metadata is organized as a tree (Iceberg: metadata file → snapshot → manifest list → manifests → data files) or an ordered transaction log (Delta's _delta_log with checkpoints), and the per-file statistics it carries let engines prune to just the relevant files without scanning the bucket. Every lakehouse superpower in the next lessons — atomic commits, consistent reads, time travel, schema evolution — is a direct consequence of this one design. Next, we meet the three formats that implement it.

Next: Iceberg, Delta Lake, and Hudi →