Chapter 9 · Streaming & Real-Time
Every chapter so far has, at heart, processed bounded data: a file, a table, a partition — a dataset with a known beginning and end that a batch job reads to completion and then stops. That model is the backbone of analytics, and it's correct for most of what a data team does. But a growing share of the business can't wait for the nightly run. Fraud has to be caught as the card is swiped. A dashboard of orders should move while the sale is happening. A recommendation should reflect what you clicked ten seconds ago. For these, the data never "ends" — it's an endless arrival of events — and a job that waits to read all of it would wait forever.
This chapter is about that world: streaming, the continuous, low-latency processing of unbounded data (data that has no end) as it arrives. We thread it through one concrete stream: ShopFlow's order_events (see Meet ShopFlow), which records every placed, paid, shipped, or cancelled order event as it happens, and which this chapter turns into fact_revenue_1m, per-minute revenue that is windowed and delivered exactly-once.
The shift in mindset
Streaming is not "batch, but faster." It's a different model of computation, and most streaming bugs come from carrying batch intuitions into it:
- In batch, you have all the data before you compute. In streaming, you compute on partial data and refine as more arrives — so you must decide when an answer is "done enough" to emit.
- In batch, "now" is implicit and singular. In streaming, there are multiple clocks — when an event happened versus when you processed it — and confusing them produces wrong answers that look right.
- In batch, a failed job just re-runs the file. In streaming, the job is always running, so fault tolerance and delivery guarantees (did each event count once? twice? not at all?) become first-class design problems, not afterthoughts.
Get those three things right — partial answers, multiple clocks, and delivery guarantees — and streaming is tractable. Get them wrong and you ship a pipeline that's subtly, silently incorrect.
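The "multiple clocks" point is easiest to see with numbers. The following is a minimal sketch in Python, not ShopFlow's actual pipeline: the tuple layout (event time, processing time, amount), the sample values, and the helper `revenue_per_minute` are all invented for illustration. It buckets the same three events into one-minute windows twice, once by when each event happened and once by when it was processed.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical order events: (event_time, processed_at, amount).
# The second event happened at 12:00:50 but was only processed at 12:01:20.
events = [
    (datetime(2024, 1, 1, 12, 0, 10), datetime(2024, 1, 1, 12, 0, 11), 30.0),
    (datetime(2024, 1, 1, 12, 0, 50), datetime(2024, 1, 1, 12, 1, 20), 20.0),
    (datetime(2024, 1, 1, 12, 1, 5), datetime(2024, 1, 1, 12, 1, 6), 50.0),
]


def revenue_per_minute(events, clock):
    """Sum amounts into one-minute windows keyed by the chosen clock."""
    totals = defaultdict(float)
    for event_time, processed_at, amount in events:
        ts = event_time if clock == "event" else processed_at
        # Truncate the timestamp to the start of its minute.
        window_start = ts.replace(second=0, microsecond=0)
        totals[window_start] += amount
    return dict(totals)


by_event_time = revenue_per_minute(events, "event")
by_processing_time = revenue_per_minute(events, "processing")

for window, total in sorted(by_event_time.items()):
    print("event time     ", window.time(), total)
for window, total in sorted(by_processing_time.items()):
    print("processing time", window.time(), total)
```

Both groupings add up to the same overall total, which is why the processing-time answer looks plausible. Only the event-time grouping credits the late 20.0 to the minute in which the sale actually occurred.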
The durable idea
A stream is an unbounded, ordered, replayable log of events. Streaming systems exist to answer questions over that log correctly despite three hard truths: the data never ends, events arrive late and out of order relative to when they actually happened, and machines fail mid-flight.
The log, an append-only, ordered, replayable sequence of records, is the foundational abstraction this whole chapter builds on, and it's remarkably durable: it predates Kafka and will outlast it. The concepts in this chapter (the log, event time, watermarks, windowing, state, exactly-once) stay stable over decades. The tools that implement them (Apache Kafka, Apache Flink, Spark Structured Streaming, Kafka Streams, Redpanda, Amazon Kinesis, Google Pub/Sub) are tied to their moment and will eventually be replaced, so we'll name them as today's instances of the durable ideas, not as the ideas themselves.
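To make "append-only, ordered, replayable" concrete, here is a toy in-memory sketch in Python. The `Log` class and its `append` and `read` methods are hypothetical names chosen for illustration and do not correspond to the API of Kafka or any other tool listed above; a real log is also durable and distributed, which this sketch ignores.

```python
class Log:
    """A toy append-only log held in memory (illustrative only)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return its offset (its position)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Return all records from `offset` onward; reading removes nothing."""
        return list(self._records[offset:])


log = Log()
for status in ["placed", "paid", "shipped"]:
    log.append({"order_id": 1, "status": status})

first_pass = log.read(0)  # a consumer reading from the beginning
replay = log.read(0)      # the same read again, e.g. after a crash
tail = log.read(2)        # a consumer that already handled offsets 0 and 1

print([r["status"] for r in first_pass])
print([r["status"] for r in tail])
```

The design choice worth noticing is that the reader, not the log, keeps track of position. Because reading does not consume records, any number of consumers can read the same data independently, and any of them can rewind to an earlier offset and replay.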
What this chapter covers
- Streaming fundamentals — unbounded vs bounded data, events, the log, producers and consumers, and when real-time is actually worth its cost.
- Kafka architecture — topics, partitions, offsets, and consumer groups; brokers, replication and ISR; retention and log compaction; and how keys decide partitioning and ordering. The lesson that fixes "Kafka is just a queue."
- Time and watermarks — event time vs processing time vs ingestion time, watermarks, windowing (tumbling, sliding, session), and handling late and out-of-order data.
- Stateful processing — keyed state, checkpoints and savepoints, state backends, fault-tolerant recovery, and the tradeoffs between Flink, Spark Structured Streaming, and Kafka Streams.
- Delivery semantics & exactly-once — at-most-once, at-least-once, exactly-once; idempotent producers, transactions, transactional sinks (two-phase commit), and dedup on a key.
- Schema management & streaming architecture — Schema Registry, Avro/Protobuf and compatibility modes; lambda vs kappa; CDC with Debezium; streaming SQL; and feeding the lakehouse (Kafka → Iceberg/Delta).
Then a checkpoint quiz to lock it in.
Where this chapter connects to the rest of the guide: streaming is the real-time counterpart to batch processing, it's the live arm of ingestion (CDC is a stream), and its output increasingly lands in the lakehouse you'll meet next.