Skip to main content

Streaming fundamentals: unbounded data and the log

Walk me through it

Imagine a conveyor belt at the end of a factory line. Boxes come down it one after another, and they never stop coming — there's no last box you can wait for. Your job is to inspect each box as it passes and keep a running tally on a whiteboard. You can't wait until "all the boxes have arrived" to start counting, because they never all arrive. You count as they go by, and at any moment your whiteboard holds the best answer so far.

That conveyor belt is a stream. Each box is an event. The whiteboard tally is your continuously-updated result. This lesson builds the vocabulary and the one core data structure — the log — that the rest of the chapter relies on. Everything fancy later (watermarks, exactly-once, Flink) is machinery for doing the whiteboard tally correctly when boxes arrive out of order and the factory occasionally loses power.

Bounded vs unbounded data

The cleanest way to tell batch and streaming apart is the shape of the data, not the speed.

  • Bounded data is finite: it has a known beginning and end. A CSV file. Yesterday's orders table. A Parquet partition. You can read all of it, and then you're done. Every chapter before this one processed bounded data.
  • Unbounded data has no end. Clicks on a website. Card swipes. GPS pings from a fleet of trucks. Sensor readings. New events keep arriving for as long as the system runs. You can never "read all of it" — so any computation must work on a moving, partial view.

That single difference cascades into everything. With bounded data you ask, "what is the total?" and get one final number. With unbounded data, "the total" is never final — it's "the total as of now," and it changes with the next event. Streaming is the discipline of answering questions over data that won't hold still.

:::note Batch is a special case of streaming A useful unifying idea: a bounded dataset is just an unbounded stream that happens to have stopped. Modern engines like Apache Flink and Spark Structured Streaming lean on this — the same operators run over a finite file or an endless stream. That's why the kappa architecture (lesson 9.6) can claim "everything is a stream; batch is replay." :::

The atom: an event

An event is an immutable record that something happened at a point in time. Three properties matter:

  • It happened in the past. OrderPlaced, PaymentCaptured, TemperatureRead. Events are facts, named in the past tense. You never "update" an event — if something changes, that's a new event.
  • It carries a timestamp. When it happened is part of the data, not metadata you can ignore. This becomes the crux of lesson 9.3 (event time).
  • It's usually small and self-describing. A JSON or Avro record. ShopFlow (see Meet ShopFlow) emits its order_events this way: {"event_id":"e-1","order_id":1001,"event_type":"placed","ts":"2026-06-24T10:00:01Z"}.

A stream, then, is an unbounded, ordered, append-only sequence of events. "Append-only" is load-bearing: you only ever add events to the end; you never edit or delete in the middle. That constraint is what makes the next idea possible.

The core data structure: the log

Under almost every streaming system sits one beautifully simple structure: the log (sometimes "commit log"). A log is an append-only, ordered sequence of records, each assigned a monotonically increasing number called its offset.

offset: 0 1 2 3 4 5 → (grows forever)
record: [evt] [evt] [evt] [evt] [evt] [evt]
▲ ▲
oldest newest (the "head")

Three properties make the log the perfect substrate for streaming:

  1. Order is preserved. Records sit in the exact sequence they were written, and each has a stable position (its offset — the integer index of a record in the log). Offset 4 always means the same record.
  2. It's append-only and immutable. Writers only add to the head; existing records never change. That makes writes cheap (no random updates) and reads predictable.
  3. It's replayable. Because nothing is destroyed (until retention expires) and everything is ordered, a reader can start at any offset and re-read forward. Replay is the secret weapon behind fault recovery, reprocessing with fixed logic, and bootstrapping new consumers.

This is not a new idea — a database's write-ahead log and a bank's ledger are the same shape. What Kafka and friends did was make the log itself the public, shared, durable abstraction that many systems read and write, rather than a private internal detail.

:::tip Why "the log" beats "a queue" A traditional message queue deletes a message once it's been delivered and acknowledged — it's a hand-off. A log retains messages after they're read, so many independent consumers can each read the whole thing at their own pace, and any of them can rewind. This single difference is why Kafka is a log, not a queue, and why "Kafka is just a queue" leads you astray (the full story is lesson 9.2). :::

Producers and consumers

Two roles act on the log:

  • A producer is any client that writes (appends) events to the log. ShopFlow's web app appends a placed event to order_events; its payment service appends paid.
  • A consumer is any client that reads events from the log. A fraud detector, a revenue dashboard, a job that copies order_events to the warehouse.

The pivotal design choice: producers and consumers are decoupled. The producer appends and moves on — it doesn't know or care who reads, how many readers there are, or how fast they go. Each consumer independently tracks its own offset — its bookmark of "the next record I haven't read yet." Consumer A can be at offset 1,000,000 (live) while consumer B replays from offset 0 (backfilling), reading the same log without affecting each other.

Producer\n(ShopFlowweb app)LOGProducer\n(ShopFlowpayments)01Consumer A\n(fraud,at offset 4 — live)ConsumerB\n(warehouseloader, at offset 1

This decoupling is the architectural payoff of streaming: you can add a brand-new consumer years later, point it at offset 0, and let it rebuild its world from history — without touching the producers or the existing consumers.

A traced example: a running count

Let's make the conveyor-belt tally concrete. Suppose ShopFlow's paid events arrive on order_events, and a consumer keeps a running total of paid revenue (looking up each order's amount). The log fills up:

offsetevent_typeorder amountrunning total after processing
0paid$10$10
1paid$25$35
2paid$5$40
3paid$60$100

At offset 2 the answer is "$40 so far" — correct as of that point, not wrong, just partial. The consumer also records "I have processed up to offset 2." Now the machine crashes. On restart, it reads its saved offset (2), resumes at offset 3, and continues — it does not re-add offsets 0–2. That saved bookmark (committing the offset) is the seed of fault tolerance; lesson 9.5 makes it bulletproof for exactly-once. Notice what made recovery trivial: the log was replayable and the consumer remembered its offset. Strip away either and you'd either lose data or double-count.

Why it matters — and when real-time is worth it

Streaming is powerful, but it is more expensive to build and operate than batch: always-on infrastructure, harder testing, subtler correctness. The senior move is to ask whether you actually need it. A useful test:

  • What's the cost of staleness? If a number being an hour old costs nothing (most reporting), batch is cheaper and simpler. If staleness has a price per minute — fraud losses, missed alerts, a worse user experience — real-time earns its keep.
  • Is there an action at the other end? Real-time pays off when a fast decision consumes it (block this transaction, page this engineer, re-rank this feed). A dashboard nobody watches at 3 a.m. does not need sub-second freshness.
  • Is the volume continuous? Truly continuous, high-velocity sources (telemetry, logs, clickstream, CDC) are natural streams. A file that lands once a day is a batch in a trench coat.

Default to batch; reach for streaming when staleness has a real, per-unit-time cost and a fast action consumes the result. Choosing streaming "because real-time sounds modern" is the most expensive mistake in this chapter.

Common pitfalls

  • Treating a stream like a finite dataset. Writing SELECT SUM(amount) FROM purchases and expecting a final number. Over an unbounded stream there is no final number — only "so far." You must add a window (lesson 9.3) to get a complete, bounded answer.
  • Assuming events arrive in order. They mostly do within one partition, but across the system, networks delay and reorder. Designing as if "newest event = latest timestamp" is the root of a whole class of bugs (lesson 9.3).
  • Forgetting to commit offsets — or committing them too early. Commit before you've durably processed an event and a crash loses it (at-most-once); the timing of that commit is your delivery guarantee (lesson 9.5).
  • Confusing a log with a queue. Expecting messages to disappear after reading, or expecting only one consumer. The log retains and fans out — that's the point.
  • Reaching for streaming when batch would do. The most common and most expensive pitfall. Real-time you don't need is just operational cost.

Checkpoint

Required checkpoint

9.1 Streaming fundamentals

Pass to unlock the Next button below

Going deeper: the log abstraction you just learned is implemented concretely — with partitions, replication, and consumer groups — in Kafka architecture. The "compute on partial views" problem gets its real solution in Time and watermarks.

Context: this is the real-time counterpart to Batch processing & Spark; see also Ingestion & integration for where streams enter the platform.

Next: Kafka architecture →