Schema management and streaming architecture: registry, CDC, lambda vs kappa
So far you can move events through a durable log, process them statefully, handle time correctly, and get exactly-once results. This lesson is the glue that makes it survive contact with a real organization — the parts that don't show up in tutorials but break pipelines in production.
Three questions drive it. First: a stream lives for years, and the shape of its events will change (a team adds a field, renames another). If a producer changes the format, do all the consumers break? That's schema management. Second: where do high-value streams even come from? Often from a database that was never designed to emit events — and the answer is change data capture (CDC). Third: how do you wire batch and streaming together into one system — do you run both side by side (lambda) or make everything a stream (kappa)? Get these right and your streaming platform is durable; get them wrong and it's a fragile pile that shatters the first time someone adds a column.
Schema management: contracts for a stream that lives for years
A Kafka message is just bytes. Producers and consumers must agree on how to interpret those bytes — the schema (the fields, their types). With one producer and one consumer you could hard-code it. With dozens of consumers reading a stream for years while producers evolve, you need an enforced contract. Two pieces:
Serialization formats. Instead of bulky, schema-less JSON, streams typically use compact binary formats that carry or reference a schema:
- Avro — a compact binary format with an explicit schema (defined in JSON). Dominant in the Kafka world; great for evolution.
- Protobuf (Protocol Buffers) — Google's schema-based binary format; also widely used, strong cross-language support.
Both are smaller and faster than JSON and — crucially — typed, so a consumer knows exactly what it's getting.
The Schema Registry. A Schema Registry is a service that stores and versions every topic's schema and hands out a tiny schema ID. Producers register a schema and embed just the ID in each message; consumers fetch the schema by ID to decode. (Confluent's Schema Registry is the common implementation.) The registry's real job isn't storage — it's enforcing that schema changes are safe before a bad one ever reaches the topic.
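To make the "embed just the ID" idea concrete, here is a minimal sketch of the framing. It assumes the Confluent wire format (one magic byte `0`, then a 4-byte big-endian schema ID, then the payload). `ToyRegistry`, `encode`, and `decode` are hypothetical names for illustration, not the real client API, and JSON stands in for the Avro binary body so the example needs only the standard library.

```python
import json
import struct

MAGIC_BYTE = 0  # Confluent wire format: the first byte is always 0


class ToyRegistry:
    """Hypothetical in-memory stand-in for a schema registry."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}

    def register(self, schema: dict) -> int:
        # Identical schemas get the same ID; new ones get the next integer.
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            new_id = len(self._ids) + 1
            self._ids[key] = new_id
            self._by_id[new_id] = schema
        return self._ids[key]

    def get(self, schema_id: int) -> dict:
        return self._by_id[schema_id]


def encode(registry: ToyRegistry, schema: dict, record: dict) -> bytes:
    schema_id = registry.register(schema)
    # Header: 1 magic byte + 4-byte big-endian schema ID, then the payload.
    header = struct.pack(">bI", MAGIC_BYTE, schema_id)
    # Assumption: JSON replaces the Avro binary body to keep this stdlib-only.
    return header + json.dumps(record).encode("utf-8")


def decode(registry: ToyRegistry, message: bytes):
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a registry-framed message")
    schema = registry.get(schema_id)  # consumer looks the schema up by ID
    return schema, json.loads(message[5:].decode("utf-8"))
```

Each message pays only five bytes of overhead for its schema reference, which is why the full schema never has to travel with the data.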
Compatibility modes — the part that prevents 3 a.m. pages
When a producer wants to evolve a schema, the registry checks the new version against a compatibility mode and rejects it if it would break readers. The three you must know:
- Backward compatible — a new schema can read data written with the old schema. This lets you upgrade consumers first. Safe changes: adding an optional field (with a default), removing a field. This is the most common default — it means you can deploy a new consumer and it still understands old messages still in the topic.
- Forward compatible — old consumers can read data written with the new schema. This lets you upgrade producers first — old consumers keep working against new data (ignoring fields they don't know). Safe changes: adding a field, removing an optional field.
- Full compatible — both backward and forward: any order of upgrade is safe. The strictest and safest.
The point of compatibility modes: in a system with many independently deployed producers and consumers, you can never upgrade them all at once. The chosen mode guarantees that nothing breaks as long as upgrades follow the order it permits (consumers first for backward, producers first for forward, either order for full), and the registry rejects an unsafe change at registration time instead of letting it break a live consumer at runtime.
:::warning The breaking change that takes down consumers
Changing a field's type (e.g. int → string), or adding a required field with no default, is not backward-compatible — old data can't be read under the new schema. Without a registry enforcing compatibility, such a change deploys silently and every consumer that hits an old-format message crashes or mis-parses. The registry's job is to reject that change at registration time. This is the single most common "why did the whole pipeline fall over?" cause, and it's entirely preventable.
:::
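The three modes can be sketched as one question asked in two directions: can a reader on schema A decode data written with schema B? The checker below is a simplified illustration, not the registry's actual algorithm. It assumes a toy schema shape (a dict of field name to `{"type": ..., "default": ...}`), requires identical types on shared fields, and ignores Avro's type promotion rules and unions.

```python
def can_read(reader: dict, writer: dict) -> bool:
    """True if a reader using `reader` can decode data written with `writer`."""
    for name, spec in reader.items():
        if name in writer:
            # Shared field: this toy requires identical types (no promotion).
            if writer[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            # Reader expects a field the writer never wrote and has no default.
            return False
    # Fields that exist only in the writer are skipped by the reader.
    return True


def is_compatible(mode: str, old: dict, new: dict) -> bool:
    backward = can_read(reader=new, writer=old)  # new readers, old data
    forward = can_read(reader=old, writer=new)   # old readers, new data
    checks = {"BACKWARD": backward, "FORWARD": forward, "FULL": backward and forward}
    return checks[mode]


# Hypothetical schemas for illustration.
OLD = {"order_id": {"type": "long"}, "event_type": {"type": "string"}}
ADD_OPTIONAL = dict(OLD, channel={"type": "string", "default": None})
ADD_REQUIRED = dict(OLD, channel={"type": "string"})
TYPE_CHANGE = dict(OLD, order_id={"type": "string"})
REMOVE_FIELD = {"order_id": {"type": "long"}}
```

Under these assumptions, `ADD_OPTIONAL` passes all three modes, `ADD_REQUIRED` passes only forward, `REMOVE_FIELD` passes only backward (the removed field had no default for old readers to fall back on), and `TYPE_CHANGE` passes none.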
Change data capture (CDC): turning a database into a stream
Many of the most valuable streams aren't emitted by an app on purpose — they're the changes happening in an operational database (new orders, updated profiles). Change data capture (CDC) streams a database's row-level changes (inserts, updates, deletes) as events, by reading the database's transaction log (the write-ahead log every database already keeps for its own durability).
Why read the log instead of polling with SELECT? Because the log captures every change in order — including deletes and intermediate updates that a periodic SELECT * WHERE updated_at > … would miss — with near-zero load on the source (it's tailing a file the DB already writes). Debezium is the dominant open-source CDC platform: it tails MySQL/Postgres/etc. transaction logs and produces a Kafka stream of change events, each describing the before/after of a row. (We met CDC as an ingestion pattern in Ingestion & integration; here it's the source of a stream.)
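A small simulation shows what polling loses. The change history, table shape, and function names below are hypothetical; the point is only to compare what a single `updated_at` poll sees against what the log recorded.

```python
# Hypothetical change history for an orders table: (op, primary_key, row, timestamp).
LOG = [
    ("insert", 1, {"status": "placed"}, 1),
    ("update", 1, {"status": "paid"}, 2),
    ("update", 1, {"status": "shipped"}, 3),
    ("insert", 2, {"status": "placed"}, 4),
    ("delete", 2, None, 5),
]


def apply_log(log):
    """Build the live table state the way the database itself would."""
    table = {}
    for op, pk, row, ts in log:
        if op == "delete":
            table.pop(pk, None)
        else:
            table[pk] = dict(row, updated_at=ts)
    return table


def poll(table, since):
    """Rows a periodic `updated_at > since` query would return."""
    return [(pk, row["status"]) for pk, row in sorted(table.items())
            if row["updated_at"] > since]


table = apply_log(LOG)
seen_by_polling = poll(table, since=0)            # one poll after all five changes
seen_by_cdc = [(op, pk) for op, pk, _, _ in LOG]  # every change, in order
```

The poll returns a single row, order 1 as `shipped`. The `placed` and `paid` states are gone, and order 2 never appears at all because it was inserted and deleted between polls. The log-based view keeps all five changes.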
CDC pairs naturally with log compaction (lesson 9.2): a compacted CDC topic keyed by primary key keeps the latest state of each row — replay it and you reconstruct the current table.
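As a rough sketch of that replay, the events below borrow the general shape of Debezium's change-event envelope (`op` of `c`/`u`/`d` with `before` and `after` row images), heavily simplified. The `compact` function is a toy model of log compaction, assuming a `None` value acts as a tombstone that removes the key; real compaction runs in the broker and retains tombstones for a while.

```python
def compact(topic):
    """Keep only the latest value per key; a None value (tombstone) drops the key."""
    latest = {}
    for key, value in topic:
        if value is None:
            latest.pop(key, None)
        else:
            latest[key] = value
    return latest


def rebuild_table(compacted):
    """Current table = the `after` image of each surviving key."""
    return {key: event["after"]
            for key, event in compacted.items()
            if event["after"] is not None}


# Hypothetical CDC topic keyed by primary key.
TOPIC = [
    (1, {"op": "c", "before": None, "after": {"id": 1, "status": "placed"}}),
    (2, {"op": "c", "before": None, "after": {"id": 2, "status": "placed"}}),
    (1, {"op": "u", "before": {"id": 1, "status": "placed"},
         "after": {"id": 1, "status": "paid"}}),
    (2, {"op": "d", "before": {"id": 2, "status": "placed"}, "after": None}),
    (2, None),  # tombstone following the delete
]

current = rebuild_table(compact(TOPIC))
```

Here `current` holds only row 1 in its latest state; row 2 has been deleted and compacted away.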
Streaming SQL: continuous queries instead of code
Writing stateful, windowed, event-time processing as low-level code is powerful but slow to author and hard to staff. Streaming SQL lets you express it as continuous SQL queries that run forever over streams, emitting results as new events arrive:
- ksqlDB — SQL over Kafka Streams (lesson 9.4), for Kafka-native stream processing.
- Flink SQL — SQL over Flink, with full event-time, windowing, and exactly-once.
A streaming SELECT … GROUP BY TUMBLE(…) never finishes — it maintains a materialized view (a query result kept continuously up to date) that updates with every event. This is the same SQL you learned in Chapter 3, now running over an unbounded input — a huge accessibility win, because analysts can build streaming pipelines without learning a stream-processing API.
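To illustrate what a windowed streaming aggregate keeps up to date, here is a minimal Python model of a per-window, per-key count. It is a sketch of the idea only: the class name and 60-second window are assumptions, and it emits on every event, whereas real engines decide when to emit using watermarks and triggers and also handle late data.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size


class TumblingCountView:
    """Continuously maintained result of a per-window, per-key COUNT(*)."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)

    def on_event(self, event_time: int, key: str):
        # Each event falls in exactly one window, identified by its start time.
        window_start = event_time - (event_time % self.window_seconds)
        self.counts[(window_start, key)] += 1
        # Return the updated row, like one entry in the query's change stream.
        return (window_start, key, self.counts[(window_start, key)])


view = TumblingCountView()
updates = [view.on_event(t, k) for t, k in
           [(5, "created"), (30, "created"), (61, "created"), (59, "paid")]]
```

The query never "finishes"; `view.counts` is simply the current answer, and each new event revises one row of it.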
Lambda vs kappa: two ways to wire it all together
You have batch (Chapters 5–8) and streaming (this chapter). How do you combine them into one system that serves both fast-but-approximate and slow-but-complete results? Two reference architectures:
Lambda architecture
Run two layers in parallel:
- A batch layer reprocesses all historical data periodically — slow, but complete and authoritative (your warehouse/lakehouse jobs).
- A speed (streaming) layer processes events in real time — fast, but approximate and only covers recent data.
- A serving layer merges both so queries get real-time freshness and historical accuracy.
The fatal flaw: you maintain the same business logic twice — once in batch code, once in streaming code — and keep them in sync forever. Two codebases that must produce identical results is a perpetual source of drift and bugs.
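A toy version of the three layers makes the duplication visible. Everything here is hypothetical (function names, the event shape, the cutoff convention); note that `batch_view` and `speed_view` implement the same counting rule twice, which is exactly the part that drifts in practice.

```python
def batch_view(events, cutoff):
    """Batch layer: complete recount of everything up to `cutoff`."""
    counts = {}
    for ts, key in events:
        if ts <= cutoff:
            counts[key] = counts.get(key, 0) + 1
    return counts


def speed_view(events, cutoff):
    """Speed layer: the same counting logic, reimplemented for events after `cutoff`."""
    counts = {}
    for ts, key in events:
        if ts > cutoff:
            counts[key] = counts.get(key, 0) + 1
    return counts


def serve(batch, speed):
    """Serving layer: merge both views at query time."""
    merged = dict(batch)
    for key, n in speed.items():
        merged[key] = merged.get(key, 0) + n
    return merged


# Hypothetical (timestamp, channel) events; the last batch run covered ts <= 2.
EVENTS = [(1, "web"), (2, "mobile"), (3, "web"), (4, "web")]
result = serve(batch_view(EVENTS, cutoff=2), speed_view(EVENTS, cutoff=2))
```

If someone changes the counting rule in one function and not the other, the merged answer is silently wrong for part of the timeline.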
Kappa architecture
Drop the batch layer entirely. There is one streaming layer, and everything is a stream. "Batch" reprocessing isn't a separate system — it's just replaying the log through the same streaming code (the log is replayable, lesson 9.1). Need to fix a bug or recompute history? Rewind to offset 0 and reprocess through the one pipeline.
Kappa's promise: one codebase, one system, no batch/stream logic drift. Its requirement: a durable, replayable log with enough retention to reprocess history, and a stream engine that handles event time and exactly-once well — which lessons 9.1–9.5 give you. Kappa has become the default mental model precisely because the streaming tools matured enough to make the separate batch layer unnecessary for many workloads. Lambda survives where the batch layer does heavy, fundamentally-batchy work (huge ML training, complex historical recomputes) that's awkward as a stream.
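The replay idea can be sketched in a few lines. This assumes the log is fully retained and models it as a Python list; `process`, the transforms, and the event fields are hypothetical. The key point is that live processing and reprocessing call the same function, differing only in which code version and starting offset they use.

```python
def process(log, transform, from_offset=0):
    """Run the single streaming pipeline over the log starting at `from_offset`."""
    state = {}
    for offset in range(from_offset, len(log)):
        key, amount = transform(log[offset])
        state[key] = state.get(key, 0) + amount
    return state


def buggy_transform(event):
    # Bug: treats cents as whole currency units.
    return event["order_id"], event["amount_cents"]


def fixed_transform(event):
    return event["order_id"], event["amount_cents"] / 100


# Hypothetical retained log.
LOG = [
    {"order_id": "o1", "amount_cents": 1250},
    {"order_id": "o2", "amount_cents": 400},
    {"order_id": "o1", "amount_cents": 750},
]

wrong = process(LOG, buggy_transform)
# Reprocessing is the same pipeline rewound to offset 0 with the fixed code.
right = process(LOG, fixed_transform, from_offset=0)
```

There is no second codebase to keep in sync: fixing the bug and recomputing history are the same deployment.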
Streaming into the lakehouse
The modern endpoint ties this chapter to the next. Instead of streaming results into a separate real-time database, the pattern is increasingly Kafka → stream processor → a lakehouse table (Apache Iceberg or Delta Lake, Chapter 10). The stream processor writes exactly-once (lesson 9.5) into ACID table formats on object storage, so the same table serves both real-time and historical queries — collapsing lambda's two stores into one. This is why streaming and the lakehouse are converging: a transactional table format lets a stream land continuously while batch SQL reads it consistently.
A traced example: a safe schema change on the order_events stream
ShopFlow (see Meet ShopFlow) publishes order_events to Kafka in Avro — the contract's four fields (event_id, order_id, event_type, ts), each explicitly typed, with the registry set to backward compatibility. A product team wants to add a channel field (web / mobile / marketplace) to every event.
1. They register the new schema: the existing order_events fields plus an optional channel (default null). The registry checks it against backward compatibility → a new optional field with a default is safe (a new-schema reader can still read old records, defaulting the missing field) → accepted.
2. Producers (ShopFlow's app) start emitting order_events records carrying channel.
3. Consumers are deployed on their own schedule. A consumer still on the old schema reading a new record? Backward compatibility was chosen to upgrade consumers first, but because an Avro reader skips fields its own schema does not declare, even old consumers simply ignore the unknown channel, so nothing breaks either way.
4. Contrast the unsafe path: had they instead changed order_id from bigint to string, the registry would have rejected it at step 1, preventing the deploy that would otherwise have crashed every consumer hitting an old-format order_events record. The registry turned a 3 a.m. incident into a deploy-time error message.
The whole point: schema evolution on a long-lived stream like order_events is routine and safe when a registry enforces compatibility, and catastrophic when it doesn't.
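The read-side behaviour in this trace can be sketched as follows. The field types are assumptions (the lesson names the four fields but not their types), the field lists mimic the JSON form of an Avro record schema, and `read_as` is a simplified stand-in for Avro schema resolution rather than a real library call.

```python
# Assumed field types for illustration; the source only names the fields.
V1_FIELDS = [
    {"name": "event_id", "type": "string"},
    {"name": "order_id", "type": "long"},
    {"name": "event_type", "type": "string"},
    {"name": "ts", "type": "long"},
]
V2_FIELDS = V1_FIELDS + [
    {"name": "channel", "type": ["null", "string"], "default": None},
]


def read_as(record: dict, reader_fields: list) -> dict:
    """Project a written record onto the reader's schema."""
    out = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]  # missing in old data: use the default
        else:
            raise ValueError(f"missing field with no default: {name}")
    return out  # fields the reader does not declare are dropped


old_record = {"event_id": "e1", "order_id": 42, "event_type": "created", "ts": 1000}
new_record = dict(old_record, channel="web")

upgraded_reads_old = read_as(old_record, V2_FIELDS)  # channel defaults to None
legacy_reads_new = read_as(new_record, V1_FIELDS)    # channel is ignored
```

Both directions succeed, which is why the optional-field change is safe regardless of who deploys first.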
Why it matters
These are the parts that separate a demo from a platform. A Schema Registry with the right compatibility mode is what lets dozens of teams evolve a shared stream for years without coordinated, breaking deploys — the most common preventable cause of pipeline outages. CDC (Debezium) is how you turn the databases you already have into high-fidelity streams without touching the apps. Streaming SQL makes all of it accessible to people who don't write stream-processing code. And lambda vs kappa is the architectural fork: kappa's "one replayable log, one codebase" has become the default because the streaming tooling matured, with the lakehouse increasingly the unified landing zone for both real-time and historical reads — which is exactly where Chapter 10 picks up.
Common pitfalls
- No schema enforcement (raw JSON, no registry). Works until a producer changes a type and silently breaks every consumer at runtime. A registry rejects the unsafe change at deploy time instead.
- Choosing the wrong compatibility mode. Backward lets you upgrade consumers first; forward lets you upgrade producers first; full allows either. Picking one that doesn't match your deploy order forces lock-step upgrades you wanted to avoid.
- Treating a type change or new required field as harmless. These are breaking changes — old data can't be read. Add optional fields with defaults instead.
- Polling a database for changes instead of CDC. SELECT … WHERE updated_at > x misses deletes and intermediate updates and loads the source. Read the transaction log (Debezium) for complete, low-impact change streams.
- Reaching for lambda by default. Maintaining batch and streaming logic in parallel is a perpetual drift hazard. Prefer kappa (one replayable pipeline) unless a genuinely batch-only workload justifies the second layer.
→ Where this leads: the lakehouse table formats this lesson streams into are Chapter 10: Table Formats & the Lakehouse. Schema contracts here connect to data quality & contracts.
← Foundations: compatibility builds on schemas/serialization from Storage & file formats; CDC was introduced as a pattern in Ingestion & integration; streaming SQL reuses Chapter 3 SQL.