Catalogs, interoperability, and maintenance
We have a metadata layer that makes files behave like a table. Two questions remain before you can run a real lakehouse. First: who holds the pointer to "the current version of this table," and how do the many engines find and agree on it? That's the catalog, and it's the hottest battleground in the lakehouse in 2026. Second: what keeps a constantly written table healthy? Left alone it rots: small files pile up, old snapshots accumulate, dead files linger. So we also cover maintenance, the upkeep that most guides skip and that rotted production tables prove necessary.
The catalog: who holds the pointer
Recall the very top of the metadata tree (lesson 10.2): something has to store the pointer to "the current metadata file for this table," and it has to be updated atomically on each commit. A catalog is that something — plus the directory that maps human table names (shopflow.fact_sales — ShopFlow, see Meet ShopFlow) to their metadata location. It answers two things for every engine: where does fact_sales live, and what is its current version right now.
The catalog is also where the atomic commit ultimately lands: when the nightly load swaps fact_sales's pointer, it's the catalog that enforces "only one writer wins this swap." Get the catalog wrong and two engines can disagree about what fact_sales even is.
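The "only one writer wins this swap" rule can be sketched as a compare-and-swap on the pointer. This is a minimal in-memory illustration, not how any real catalog is implemented: `ToyCatalog`, `CommitConflict`, and the `metadata/v*.json` paths are hypothetical names, and the lock stands in for whatever atomic primitive a real catalog uses.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer moved the pointer first."""


class ToyCatalog:
    """Maps table names to the location of their current metadata file."""

    def __init__(self):
        self._pointers = {}            # table name -> metadata file location
        self._lock = threading.Lock()  # stand-in for the catalog's atomic primitive

    def current(self, table):
        with self._lock:
            return self._pointers.get(table)

    def swap(self, table, expected, new):
        """Compare-and-swap: succeed only if the pointer still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table} is no longer at {expected}")
            self._pointers[table] = new


catalog = ToyCatalog()
catalog.swap("shopflow.fact_sales", None, "metadata/v1.json")  # table creation

# Two writers both read v1 as their base and each prepare a new metadata file.
base = catalog.current("shopflow.fact_sales")
catalog.swap("shopflow.fact_sales", base, "metadata/v2-nightly.json")  # first swap wins

try:
    catalog.swap("shopflow.fact_sales", base, "metadata/v2-stream.json")
    loser_committed = True
except CommitConflict:
    loser_committed = False  # the loser must re-read the pointer and retry
```

The second writer is not corrupted or silently overwritten; it simply learns that its base is stale and has to retry on top of the new current version.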
Why this is suddenly the center of gravity: a lakehouse's whole promise is many engines, one copy of data. That only holds if the nightly Spark load, the dbt models, Trino serving the dashboard, and a BI tool all consult the same catalog and agree on fact_sales's current version. So the catalog became the integration point — and whoever owns the catalog owns the lakehouse. Hence the 2026 fight.
The 2026 interoperability battleground
The names you'll meet:
- Iceberg REST catalog — an open standard API (not a product) for catalogs. Any engine that speaks the REST spec can talk to any catalog that implements it. This is the key move toward vendor-neutral interoperability: engines target one open interface instead of N proprietary ones.
- Apache Polaris — an open-source implementation of the Iceberg REST catalog (originated by Snowflake, donated to Apache). A vendor-neutral catalog you can self-host.
- AWS Glue Data Catalog — AWS's managed catalog, long the default on AWS for lake tables; speaks Iceberg.
- Databricks Unity Catalog — Databricks' governance-first catalog (table discovery plus permissions, lineage, auditing), now partly open-sourced and Iceberg-REST-compatible.
- Delta UniForm — from lesson 10.3: write Iceberg metadata alongside Delta so the same files are readable by Iceberg clients. A bridge so a Delta table isn't walled off from the Iceberg ecosystem.
- Snowflake Iceberg tables — Snowflake reading/writing Iceberg directly, so warehouse and lakehouse share one open copy instead of two.
:::tip The durable read on the 2026 catalog war
The durable trend: the industry is converging on Iceberg-as-the-open-table-standard and a standard REST catalog API, so your data isn't trapped behind one vendor's catalog. The dated part is which specific catalog product wins commercially (Polaris vs Unity vs Glue vs …). Optimize for the open interface (pick a catalog that speaks the Iceberg REST standard) and the commercial winner matters far less. The catalog, increasingly, is also where governance lives (permissions, lineage), which hands you straight to Chapter 11.
:::
Maintenance: keep the table from rotting
A lakehouse table is not "set and forget." Every commit and every streaming microbatch into fact_sales adds files and metadata; without upkeep, performance decays and storage bloats. These maintenance jobs are not optional in production — skipping them is the single most common way a healthy lakehouse turns into a slow, expensive mess.
- Compaction (bin-packing): the small-files problem. Frequent writes, especially streaming order_events upserts, produce many tiny fact_sales files. Engines pay fixed per-file overhead (open, read metadata, plan), so a fact_sales of a million 1 MB files is far slower and pricier to query than the same order lines in a thousand 1 GB files, even though the bytes are identical. Compaction rewrites many small files into fewer right-sized ones (Delta's OPTIMIZE, Iceberg's rewrite_data_files). This is the table-format-level version of the file-sizing concern from Chapter 2: same problem, now caused by the commit cadence.
- Snapshot / version expiration. Time travel (lesson 10.4) retains old fact_sales snapshots and their files forever unless you intervene. Expiring snapshots past a retention window (say, 7 days) lets the now-unreferenced old order-line files be deleted, reclaiming storage, at the cost of how far back you can query fact_sales. (Delta's equivalent is VACUUM.)
- Manifest rewriting. Over many commits, Iceberg's manifest files themselves fragment and grow numerous, slowing the metadata planning that makes the daily revenue query fast. Periodically rewriting/compacting manifests keeps the metadata tree lean.
- Orphan-file cleanup. A fact_sales load that crashed after writing order-line files but before committing leaves files in the bucket that no snapshot references: orphans. They waste storage and never get cleaned by snapshot expiration (which only removes files that were once referenced). A dedicated orphan-file cleanup job scans for unreferenced files and removes them. (Run it carefully: it must not delete files an in-flight write is about to commit.)
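To make the bin-packing idea behind compaction concrete, here is a minimal planning sketch. It only decides which small files to rewrite together; it does not touch storage. The function name `plan_compaction`, the 512 MB target, and the first-fit-decreasing strategy are assumptions for illustration, not what Delta's OPTIMIZE or Iceberg's rewrite_data_files do internally.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Group small files into bins of at most target_mb (first-fit decreasing)."""
    bins = []  # each bin lists the file sizes to rewrite into one output file
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already right-sized, so leave it alone
        for b in bins:
            if sum(b) + size <= target_mb:
                b.append(size)  # fits in an existing bin
                break
        else:
            bins.append([size])  # no bin had room, so open a new one
    # A bin holding a single file gains nothing from being rewritten.
    return [b for b in bins if len(b) > 1]


# Hypothetical table state: 2000 streaming-sized 1 MB files plus one 600 MB file.
small_files = [1] * 2000 + [600]
plan = plan_compaction(small_files)

files_before = len(small_files)
files_after = files_before - sum(len(b) for b in plan) + len(plan)
```

In this toy input the 2000 one-megabyte files collapse into four output files and the 600 MB file is untouched, so the file count drops from 2001 to 5 while the bytes stay the same.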
:::danger Unmaintained tables rot
The most common production lakehouse failure isn't a dramatic outage; it's a table that silently got slow and expensive because nobody scheduled compaction and snapshot expiration. Treat maintenance as a first-class, scheduled part of the pipeline (Chapter 8 orchestration), not an afterthought. Many managed platforms now run it automatically; if yours doesn't, you own it.
:::
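The difference between snapshot expiration and orphan-file cleanup comes down to set arithmetic over file references. The sketch below uses a hypothetical in-memory snapshot log and bucket listing (all file names, dates, and the `expire` helper are made up for illustration); real engines read this information from table metadata and object-store listings.

```python
from datetime import datetime, timedelta

# Hypothetical snapshot log: (commit time, data files the snapshot references).
now = datetime(2026, 1, 15)
snapshots = [
    (now - timedelta(days=10), {"a.parquet", "b.parquet"}),
    (now - timedelta(days=3), {"b.parquet", "c.parquet"}),
    (now, {"c.parquet", "d.parquet"}),
]
# Files physically in the bucket; crashed.parquet was written but never committed.
in_bucket = {"a.parquet", "b.parquet", "c.parquet", "d.parquet", "crashed.parquet"}


def expire(snapshots, now, retention):
    """Drop snapshots older than `retention`; return survivors and deletable files."""
    kept = [s for s in snapshots if now - s[0] <= retention]
    expired = [s for s in snapshots if now - s[0] > retention]
    still_referenced = set().union(*(files for _, files in kept))
    once_referenced = set().union(*(files for _, files in expired))
    # Only files that no surviving snapshot needs may be deleted.
    return kept, once_referenced - still_referenced


kept, deletable = expire(snapshots, now, timedelta(days=7))

# Orphans were never referenced by any snapshot, so expiration cannot find them.
ever_referenced = set().union(*(files for _, files in snapshots))
orphans = in_bucket - ever_referenced
```

Expiration frees `a.parquet` (only the 10-day-old snapshot used it), while `crashed.parquet` is found only by comparing the bucket listing against every reference. A production cleanup job would also skip recently written files, so it does not delete files an in-flight write is about to commit.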
One table for streaming and batch
A quietly profound payoff of all this: because fact_sales is just a metadata-managed list of files with atomic commits, streaming and batch engines can write to the same fact_sales without stepping on each other. A Flink or Spark Structured Streaming job (Chapter 9) consuming ShopFlow's order_events can continuously append/upsert order lines into the Iceberg or Hudi fact_sales, while the nightly Spark batch job compacts or backfills the same table, while Trino serves the daily revenue dashboard over it — each using snapshot isolation to get a consistent view.
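A toy model of why readers stay consistent while stream and batch jobs keep committing: every commit produces a new immutable snapshot, and a query reads only the snapshot it pinned when it started. `ToyTable` and the file names are hypothetical, and this sketch deliberately serializes commits; it does not model conflicting concurrent writers.

```python
class ToyTable:
    """Each commit appends an immutable snapshot: a tuple of data file names."""

    def __init__(self):
        self.snapshots = [()]  # snapshot 0 is the empty table

    def current_id(self):
        return len(self.snapshots) - 1

    def commit_append(self, new_files):
        """Streaming-style commit: add files on top of the latest snapshot."""
        self.snapshots.append(self.snapshots[-1] + tuple(new_files))

    def commit_rewrite(self, old_files, new_file):
        """Compaction-style commit: replace several files with one."""
        survivors = tuple(f for f in self.snapshots[-1] if f not in old_files)
        self.snapshots.append(survivors + (new_file,))

    def scan(self, snapshot_id):
        """Readers see exactly the files of the snapshot they pinned."""
        return self.snapshots[snapshot_id]


fact_sales = ToyTable()
fact_sales.commit_append(["s1.parquet", "s2.parquet"])  # streaming micro-batches

pinned = fact_sales.current_id()  # the dashboard query starts here

fact_sales.commit_append(["s3.parquet"])  # the stream keeps writing
fact_sales.commit_rewrite({"s1.parquet", "s2.parquet"}, "compacted-1.parquet")  # nightly batch

dashboard_view = fact_sales.scan(pinned)
latest_view = fact_sales.scan(fact_sales.current_id())
```

The dashboard query still sees `s1.parquet` and `s2.parquet` even though compaction has since replaced them, which is also why expired snapshots' files can only be deleted once no reader may still need them.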
This collapses the old, painful Lambda architecture (a separate "speed layer" for fresh order_events and "batch layer" for the correct nightly fact_sales, with the headache of keeping two pipelines and two copies in sync). With a lakehouse fact_sales, the stream and the batch jobs converge on one table — fresh and correct, one copy, one schema. Hudi's record-level upserts and MOR (lesson 10.4) make it especially strong here, which is why streaming order_events lands so naturally in a lakehouse.
How to choose
Pulling the chapter together into a decision:
- Lakehouse vs warehouse vs lake. Use a warehouse (Chapter 4) when you want maximum simplicity and SQL performance, data volumes are moderate, and lock-in/cost are acceptable. Use a bare lake essentially never for tables you care about — its lack of ACID is why table formats exist. Use a lakehouse when you want one open copy of data, at scale, readable by many engines (SQL, Spark, ML, streaming) without duplication or lock-in. In 2026 the lakehouse is the default for serious data platforms; warehouses increasingly read lakehouse tables (Snowflake Iceberg) rather than replace them.
- Which format. Default to Iceberg for openness and engine-agnosticism: the safe pick for a nightly-loaded, read-heavy fact_sales. Choose Delta if you're Spark/Databricks-centric (and use UniForm to stay Iceberg-readable). Choose Hudi for mutation-heavy streaming ingestion / high-rate CDC, e.g. if fact_sales is kept live by upserting order_events.
- Which catalog. Prefer one that speaks the Iceberg REST standard (Polaris, Unity, Glue) so you keep engine freedom; let governance needs (permissions, lineage) tilt the choice. That's Chapter 11's concern.
- Plan maintenance from day one. Whatever you pick, schedule compaction, snapshot expiration, and orphan cleanup — or use a platform that does. An unmaintained table is a slow table.
Why it matters
The catalog holds the pointer to each table's current version and is where the atomic commit lands, so it's the point all engines must agree on — which is why the 2026 fight (Iceberg REST, Polaris, Unity Catalog, AWS Glue, plus UniForm and Snowflake Iceberg tables) is really a fight over interoperability; the durable bet is the open Iceberg-REST standard, not any single product. A lakehouse table also needs ongoing maintenance — compaction to fix the small-files problem, snapshot expiration to reclaim time-travel storage, manifest rewriting to keep metadata lean, and orphan-file cleanup to remove files no snapshot references — or it silently rots into a slow, costly table. Finally, because commits are atomic and isolated, streaming and batch can write one shared table, collapsing the old Lambda two-pipeline split into a single fresh-and-correct copy. Choose lakehouse-by-default, Iceberg-by-default, an open-standard catalog, and scheduled maintenance — and you have a platform that is cheap, open, reliable, and engine-agnostic at once. That wraps the chapter; lock it in at the checkpoint.
Next: Chapter 10 checkpoint →