Chapter 9 checkpoint
You can now reason about a streaming system end to end: model data as an append-only log, scale and order it with Kafka, get time right, hold state through failures, deliver each event exactly once, and wire it into a durable architecture. Recall the throughline, then prove it with the quiz.
The throughline
- A stream is an unbounded, ordered, replayable log of events. Batch is the special case that stopped. Producers append; many independent consumers each track their own offset and can replay. Default to batch — reach for streaming only when staleness has a per-minute cost and a fast action consumes the result.
- Kafka is a partitioned log, not a queue. A topic is many partitions, each an independent ordered log; order exists only within a partition. The key decides the partition (
hash(key) mod N), so key by the entity whose order you care about. Replication + ISR +acks=allmakes an acked write durable. A consumer group fans partitions across members (capped by partition count), healed by rebalancing; watch consumer lag. - Event time, not processing time. Compute on when events happened for correct, reproducible results. A watermark ("event time has reached T") fires windows (tumbling / sliding / session). Handle stragglers with allowed lateness (restate the window) + a side output (catch the too-late) — never silently drop.
- Stateful processing needs memory that survives crashes. Keyed state (in memory or RocksDB) is snapshotted by checkpoints (state + offsets, consistently) for crash recovery, and savepoints for planned upgrades. Engines: Flink (true streaming, lowest latency, big state), Spark Structured Streaming (micro-batch, unifies with Spark/batch), Kafka Streams (library embedded in your app).
- Delivery semantics is a chain, not a checkbox. At-most-once (may lose) → at-least-once (may duplicate — the cheap default) → exactly-once = idempotent producer + transactions + consistent checkpoints + transactional/idempotent sink. The most portable trick: dedup on
(key, event_id)or upsert so duplicates become harmless. - The production glue. A Schema Registry (Avro/Protobuf) enforces compatibility (backward = upgrade consumers first; forward = producers first; full = either) — rejecting breaking changes at deploy time. CDC/Debezium turns a database's transaction log into a stream. Streaming SQL (ksqlDB / Flink SQL) makes it a continuous materialized view.
- Architecture: kappa over lambda. Lambda runs batch + speed layers in parallel (logic duplicated, drifts). Kappa has one streaming pipeline; reprocessing = replaying the log. Increasingly, streams land exactly-once in the lakehouse (Iceberg/Delta), so one ACID table serves real-time and historical reads.
Quiz
Chapter 9 — Streaming & Real-Time
Pass to unlock the Next button belowYou can now design a streaming system that's correct under the three hard truths — the data never ends, events arrive late and out of order, and machines fail mid-flight. Notice where every chapter's streams kept landing: an ACID table on object storage that serves both real-time and historical reads. That unification — batch and streaming over one transactional table — is the lakehouse, and it's Chapter 10.