Stateful processing: keyed state, checkpoints, and the engines

Some stream questions you can answer by looking at one event alone: "is this temperature above 100?" That's stateless — each event judged on its own, nothing remembered. But the interesting questions need memory: "how many times has this user failed to log in?", "what's the running total per store?", "is this the third fraud signal for this card in five minutes?" To answer those, the processor must remember things between events — and remember them per key (per user, per store, per card).

That memory is called state, and it raises a scary question: the processor is just a program running on a machine, and machines crash. If a fraud counter lives in memory and the process dies, does the count vanish? Do we double-count on restart? This lesson is about how stream processors hold state and survive failures without losing or duplicating it — and how today's engines (Apache Flink, Spark Structured Streaming, Kafka Streams) make those tradeoffs differently.

Stateless vs stateful operations

  • Stateless operations look at one event and emit a result: filtering (amount > 100), mapping/reshaping, routing. No memory needed; trivially parallel and trivially recoverable (just reprocess).
  • Stateful operations accumulate information across many events: aggregations (counts, sums, averages), windows (which hold their running aggregate until they fire — lesson 9.3), joins (matching events from two streams), and deduplication (remembering keys already seen). These need somewhere to store and update that accumulated information.
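
To make the contrast concrete, here is a minimal Python sketch. The event shapes and the names `is_large_order` and `FailedLoginCounter` are illustrative assumptions, not part of any engine's API.

```python
from collections import defaultdict


def is_large_order(event):
    """Stateless: the answer depends only on this one event."""
    return event["amount"] > 100


class FailedLoginCounter:
    """Stateful: the answer depends on earlier events for the same user."""

    def __init__(self):
        self.failures = defaultdict(int)  # user -> failed logins seen so far

    def process(self, event):
        self.failures[event["user"]] += 1
        return self.failures[event["user"]]


counter = FailedLoginCounter()
events = [{"user": "ann"}, {"user": "bob"}, {"user": "ann"}]
counts = [counter.process(e) for e in events]
print(counts)  # [1, 1, 2]
```

The filter can run on any worker in any order. The counter gives the right answer only if every event for a given user reaches the same instance, and only for as long as its dictionary survives.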

Every windowed aggregate you wrote in lesson 9.3 was secretly stateful: the window's running sum is state, held until the watermark fires the window. ShopFlow's fact_revenue_1m (see Meet ShopFlow) is exactly this — a tumbling 1-minute window whose running revenue total is keyed state that lives until the watermark closes the minute.

Keyed state

The dominant form of state in stream processing is keyed state — state partitioned by key, exactly the keys you've been using since lesson 9.2. The engine maintains an independent slice of state per key. For ShopFlow, a "spend so far this browsing session, per customer" aggregate keyed on customer_id:

keyed state for ShopFlow "session spend per customer_id":

cust-7 → $40
cust-19 → $0
cust-42 → $215 ← each key has its own independent value

When a paid event for cust-42 arrives, the operator looks up cust-42's state ($215), adds the order's amount (say $45, giving $260), and writes it back, touching only that key's slice. (Session windows, lesson 9.3, are themselves keyed state on customer_id: a customer's clicks group into one session until a gap of inactivity closes it.) This is why keying matters beyond ordering: it's also how state is sharded. Because each key's state is independent, the engine can spread keys across many parallel workers, each owning a disjoint set of keys and their state. This is the same fan-out you saw with partitions and consumer groups.
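
The lookup, update, and sharding described above can be sketched in a few lines of Python. This is a simplified illustration: `NUM_WORKERS`, `worker_for`, and `on_paid` are hypothetical names, the $45 order amount is an assumption chosen so that $215 becomes $260, and real engines use their own key-to-worker assignment schemes.

```python
import zlib

NUM_WORKERS = 3  # hypothetical parallelism


def worker_for(key: str) -> int:
    """Stable hash, so the same key always routes to the same worker."""
    return zlib.crc32(key.encode("utf-8")) % NUM_WORKERS


# One independent state dict per worker; each holds a disjoint set of keys.
worker_state = [dict() for _ in range(NUM_WORKERS)]


def on_paid(customer_id: str, amount: int) -> int:
    """Look up this key's slice, add the amount, write it back."""
    state = worker_state[worker_for(customer_id)]
    state[customer_id] = state.get(customer_id, 0) + amount
    return state[customer_id]


on_paid("cust-42", 215)
total = on_paid("cust-42", 45)  # assumed order amount
on_paid("cust-7", 40)
print(total)  # 260
```

No worker ever reads another worker's dictionary, which is what lets the engine scale out by adding workers.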

State backends: where state physically lives

State has to live somewhere, and the choice is a real tradeoff:

  • In-memory (on-heap) — fastest access, but bounded by RAM and lost if the process dies (recovered only from checkpoints). Good for small state.
  • On-disk embedded store (e.g. RocksDB) — state spills to a local disk-backed key-value store. Slower per access than RAM but holds far more state than fits in memory (hundreds of GB to terabytes per worker). This is how Flink and Kafka Streams support huge keyed state — billions of keys — without needing enormous RAM.

A state backend is simply the choice of where keyed state lives and how it's snapshotted. The durable principle: small state → memory; large state → an embedded on-disk store like RocksDB. Either way, the recovery story is the same, and it's the next idea.
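
One way to picture a state backend is as a small get/put interface with interchangeable storage behind it. The sketch below is an assumption-laden illustration: the class names are hypothetical, and Python's standard-library `sqlite3` stands in for an embedded store such as RocksDB only because it ships with the language.

```python
import os
import sqlite3
import tempfile


class MemoryBackend:
    """State lives in a Python dict: fast, bounded by RAM."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=0):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class DiskBackend:
    """Stand-in for an embedded on-disk store (sqlite3 here, not RocksDB)."""

    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS state (k TEXT PRIMARY KEY, v INTEGER)"
        )

    def get(self, key, default=0):
        row = self._db.execute(
            "SELECT v FROM state WHERE k = ?", (key,)
        ).fetchone()
        return row[0] if row else default

    def put(self, key, value):
        self._db.execute(
            "INSERT OR REPLACE INTO state (k, v) VALUES (?, ?)", (key, value)
        )
        self._db.commit()


def add_spend(backend, customer_id, amount):
    """Operator logic is identical regardless of where state lives."""
    backend.put(customer_id, backend.get(customer_id) + amount)
    return backend.get(customer_id)


path = os.path.join(tempfile.mkdtemp(), "state.db")
results = [add_spend(b, "cust-42", 215) for b in (MemoryBackend(), DiskBackend(path))]
print(results)  # [215, 215]
```

The operator code does not change between backends; only capacity and access cost do. A local disk file is still lost if the machine is lost, which is why both variants depend on checkpoints.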

Checkpoints: surviving failure without losing or duplicating state

Here's the crux. A streaming job runs forever, accumulating state. A machine fails. How do we restart exactly where we were — same state, same input position — so we neither lose events nor double-count them?

The answer is the checkpoint: a periodic, consistent snapshot of (a) all operator state and (b) the input positions (Kafka offsets) that produced that state, written durably to reliable storage (object storage like S3). "Consistent" is the magic word — a checkpoint captures the state exactly as of a specific set of input offsets, so state and offsets always agree.

Recovery then becomes mechanical:

  1. The job dies (a worker crashes).
  2. It restarts, loads the last successful checkpoint — restoring every key's state and the exact input offsets that state corresponds to.
  3. It rewinds the source to those offsets and replays forward.

Because state and offsets were snapshotted together and the log is replayable (lesson 9.1), the job resumes as if nothing happened — no lost state, and (with the right sink) no duplicated effects. This checkpoint mechanism is the foundation that makes exactly-once possible (lesson 9.5): the snapshot is the "memory," the replayable source is the "redo," and they're kept consistent.
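
The recovery steps above can be sketched as a toy job over a replayable list. Everything here is a simplified assumption: `log` stands in for a Kafka partition, `Job` is a hypothetical class, and the snapshot is kept in a variable rather than written to object storage.

```python
import copy

# Replayable source: position in this list plays the role of a Kafka offset.
log = [("cust-7", 40), ("cust-42", 215), ("cust-42", 45), ("cust-7", 10)]


class Job:
    def __init__(self):
        self.state = {}
        self.offset = 0  # next log position to read

    def step(self):
        key, amount = log[self.offset]
        self.state[key] = self.state.get(key, 0) + amount
        self.offset += 1

    def checkpoint(self):
        # State and offset are captured together, so they always agree.
        return {"state": copy.deepcopy(self.state), "offset": self.offset}

    @classmethod
    def restore(cls, snapshot):
        job = cls()
        job.state = copy.deepcopy(snapshot["state"])
        job.offset = snapshot["offset"]
        return job


job = Job()
job.step()
job.step()
durable = job.checkpoint()  # pretend this was written to durable storage
job.step()                  # progress made after the checkpoint...
del job                     # ...is lost when the worker crashes

recovered = Job.restore(durable)
while recovered.offset < len(log):
    recovered.step()        # rewind to the snapshot's offset and replay

print(recovered.state)  # {'cust-7': 50, 'cust-42': 260}
```

The third event is processed twice in wall-clock terms, but its first effect was discarded with the crashed state, so the final totals count it once.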

[Diagram: Source (Kafka offsets) → Stateful operator (keyed state) → Sink]

:::note Checkpoint vs savepoint
A checkpoint is automatic and periodic, owned by the engine for crash recovery (often pruned, keeping just the latest). A savepoint is manually triggered and retained: you take one deliberately before upgrading your job's code, migrating clusters, or rescaling, then restart from it. Same snapshot mechanism, different intent: checkpoints heal crashes; savepoints enable planned changes without losing state.
:::
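
The difference in retention can be sketched as follows. `SnapshotStore` and its methods are hypothetical names for illustration; real engines expose this through their own configuration and commands.

```python
class SnapshotStore:
    def __init__(self, retained_checkpoints=1):
        self.retained_checkpoints = retained_checkpoints
        self.checkpoints = []  # engine-owned, pruned automatically
        self.savepoints = {}   # user-owned, kept until deleted on purpose

    def add_checkpoint(self, snapshot):
        self.checkpoints.append(snapshot)
        # Keep only the most recent N checkpoints.
        self.checkpoints = self.checkpoints[-self.retained_checkpoints:]

    def add_savepoint(self, name, snapshot):
        self.savepoints[name] = snapshot


store = SnapshotStore()
for offset in (100, 200, 300):
    store.add_checkpoint({"offset": offset})
store.add_savepoint("before-upgrade", {"offset": 250})

print(store.checkpoints)  # [{'offset': 300}]
print(store.savepoints)   # {'before-upgrade': {'offset': 250}}
```

Later checkpoints keep replacing the old ones, while the named savepoint stays available for a planned restart.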

The engines and their tradeoffs

Three engines dominate, and they sit at different points on a latency vs. simplicity spectrum. The durable concepts (keyed state, event time, watermarks, checkpoints) are the same across all three; the execution model differs.

Flink — true streaming

Flink processes events one at a time, as they arrive, as a continuous dataflow. This "true streaming" model gives it the lowest latency (millisecond range), the richest event-time and watermark support, flexible windowing (including sessions), and large RocksDB-backed keyed state with consistent checkpoints. The cost is operational complexity: it's a full distributed system to run (often via a managed service). Reach for Flink when you need low latency, sophisticated event-time logic, and large state.

Spark Structured Streaming — micro-batch

Spark Structured Streaming treats a stream as a sequence of tiny batches: it collects events for a short interval (say, a few hundred ms to a few seconds), runs a small batch job, and repeats. This micro-batch model means latency is bounded below by the batch interval, so it is typically higher than Flink's per-event latency (Spark also has an experimental continuous processing mode aimed at lower latency). The payoff is unification: it's the same Spark engine, API, and skills as your batch jobs, so one tool and one codebase serve both batch and streaming. Reach for Spark Structured Streaming when you already run Spark and latency on the order of seconds is fine; the operational and skill savings are large.
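
A rough way to see why the batch interval bounds latency: group events by arrival time and measure how long each waits before its batch is even triggered. This is a toy model with hypothetical names (`micro_batches`, a 500 ms interval); it ignores processing time and does not use Spark's API.

```python
def micro_batches(events, interval_ms):
    """Group (arrival_ms, payload) pairs into consecutive fixed-size batches."""
    batches = {}
    for arrival_ms, payload in events:
        # The batch containing this event is triggered at the next boundary.
        batch_end = (arrival_ms // interval_ms + 1) * interval_ms
        batches.setdefault(batch_end, []).append((arrival_ms, payload))
    return batches


events = [(10, "a"), (480, "b"), (510, "c"), (990, "d")]
batches = micro_batches(events, interval_ms=500)

# Time each event waits before its batch starts processing.
waits = {
    payload: batch_end - arrival
    for batch_end, batch in batches.items()
    for arrival, payload in batch
}
print(waits)  # {'a': 490, 'b': 20, 'c': 490, 'd': 10}
```

An event that arrives just after a boundary waits almost the full interval, while a per-event engine would start on it immediately.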

Kafka Streams — a library, not a cluster

Kafka Streams is a client library you embed in your own application (a JVM service) rather than a separate processing cluster. Your app reads from Kafka, transforms with the Streams API (including keyed state via local RocksDB, backed up to compacted Kafka topics for recovery), and writes back to Kafka. There's no extra cluster to operate — it scales by running more instances of your app (consumer-group style). The tradeoff: it's tied to Kafka in/out and is best for Kafka-to-Kafka stream processing inside an application, not heavyweight analytics. Reach for Kafka Streams for lightweight, app-embedded stream processing without standing up Flink or Spark.
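
The "local store backed up to a compacted topic" idea can be sketched without Kafka at all. In this simplified Python illustration, `changelog` is a plain list standing in for a changelog topic, and `LocalStore`, `compact`, and `restore` are hypothetical names rather than Kafka Streams APIs.

```python
changelog = []  # stand-in for a compacted Kafka changelog topic


class LocalStore:
    """Local key-value state whose every update is also appended to the log."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value
        changelog.append((key, value))


def compact(records):
    """Keep only the latest value per key, as log compaction eventually does."""
    latest = {}
    for key, value in records:
        latest[key] = value
    return list(latest.items())


def restore(records):
    """Rebuild local state on a fresh instance by replaying the changelog."""
    data = {}
    for key, value in records:
        data[key] = value
    return data


store = LocalStore()
store.put("cust-42", 215)
store.put("cust-7", 40)
store.put("cust-42", 260)

compacted = compact(changelog)
rebuilt = restore(compacted)
print(rebuilt == store.data)  # True
```

Compaction keeps the changelog proportional to the number of keys rather than the number of updates, so a replacement instance can rebuild its state without replaying the full history.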

| | Flink | Spark Structured Streaming | Kafka Streams |
|---|---|---|---|
| Model | true per-event streaming | micro-batch | library, per-event |
| Latency | lowest (ms) | bounded by batch interval | low (ms) |
| State | large, RocksDB, checkpoints | managed, checkpoints | local RocksDB + Kafka backup |
| Operate | own cluster (often managed) | reuse Spark cluster | no cluster; embed in your app |
| Best for | low-latency, rich event-time, big state | unifying with existing Spark/batch | Kafka→Kafka inside an app |

There's also a strong trend toward streaming SQL layers on top of these — ksqlDB (over Kafka Streams) and Flink SQL (over Flink) — so you can express stateful, windowed processing as SQL instead of low-level code. We cover that in lesson 9.6.

A traced example: a per-minute revenue window surviving a crash

Goal: ShopFlow's fact_revenue_1m — "sum the order amounts of paid events into a tumbling 1-minute window." Keyed state = the running revenue total for the current minute-window (keyed on the window).

  1. paid for order 1001 ($30) at 10:01:10 → window [10:01–10:02) revenue = $30.
  2. paid for order 1002 ($20) at 10:01:40 → window [10:01–10:02) revenue = $50.
  3. A checkpoint fires at 10:01:45 — snapshotting {[10:01–10:02): $50} together with the Kafka offset just consumed, to S3.
  4. paid for order 1003 ($15) at 10:01:55 → window [10:01–10:02) revenue = $65; the watermark passes 10:02 → emit fact_revenue_1m row: minute 10:01 = $65.
  5. The worker crashes right after emitting (before the next checkpoint).
  6. On restart, it loads the last checkpoint: window [10:01–10:02) revenue = $50, and rewinds Kafka to that snapshot's offset.
  7. It replays from there: the order-1003 paid event arrives again, revenue goes $50 → $65, and the window emits minute 10:01 = $65. The total is correct and counts order 1003 only once, because we restored to the consistent (state, offset) pair and replayed.

Notice the subtlety: the fact_revenue_1m row was emitted twice in wall-clock terms (once before the crash, once on replay), but the input was replayed from a consistent point, so logically it's one window producing one revenue total. Making the external effect (the warehouse write to fact_revenue_1m) also happen exactly once is the job of the sink (lesson 9.5). Checkpoints get the internal state exactly right; the sink closes the loop to the outside world.
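
The trace can be replayed as a short Python simulation. This is a sketch under stated assumptions: `log`, `run`, and `fire` are hypothetical helpers, watermarks are reduced to an explicit `fire` call, and the final upsert is only one possible way a sink could absorb the repeated row.

```python
log = [
    {"order": 1001, "amount": 30, "minute": "10:01"},
    {"order": 1002, "amount": 20, "minute": "10:01"},
    {"order": 1003, "amount": 15, "minute": "10:01"},
]
emitted = []  # rows sent toward the sink, in wall-clock order


def run(state, offset, until):
    """Consume log[offset:until], updating per-window revenue in place."""
    for event in log[offset:until]:
        state[event["minute"]] = state.get(event["minute"], 0) + event["amount"]
    return until


def fire(state, minute):
    """Watermark passed the window end: emit the total and drop the state."""
    emitted.append((minute, state.pop(minute)))


# Steps 1-3: two orders, then a checkpoint of (state, offset).
state = {}
offset = run(state, 0, 2)
checkpoint = {"state": dict(state), "offset": offset}

# Step 4: order 1003 arrives and the window fires.
offset = run(state, offset, 3)
fire(state, "10:01")

# Steps 5-7: crash, restore from the checkpoint, replay, fire again.
state = dict(checkpoint["state"])
offset = run(state, checkpoint["offset"], 3)
fire(state, "10:01")

print(emitted)  # [('10:01', 65), ('10:01', 65)]

# One possible sink behavior (an assumption): upsert keyed on the minute.
warehouse = {}
for minute, total in emitted:
    warehouse[minute] = total
print(warehouse)  # {'10:01': 65}
```

Both emissions carry the same total because the replay started from a consistent (state, offset) pair. Whether the warehouse ends up with one row or two depends entirely on how the sink handles the repeat.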

Why it matters

Almost everything valuable in streaming — counts, windows, joins, dedup, sessionization — is stateful, and state is what makes streaming hard to operate: it has to be held efficiently (memory vs RocksDB), sharded by key, and above all survive failure. Checkpoints (automatic, for crashes) and savepoints (manual, for upgrades/rescaling) are the mechanism that lets a forever-running job recover to a consistent point without losing or corrupting state. And the engine choice — Flink for low-latency/rich-state, Spark Structured Streaming to unify with batch, Kafka Streams to embed in an app — is one of the highest-leverage decisions in a streaming architecture. Get the concepts right and the engine is an implementation detail you can swap.

Common pitfalls

  • Unbounded state growth. Keyed state with no expiry (a counter per user that never clears) grows forever and eventually OOMs or fills disk. Use windows, TTLs, or compaction to bound it.
  • Assuming in-memory state survives a restart. It doesn't — only checkpoints do. If you never checkpoint (or checkpoint to non-durable storage), a crash loses all state.
  • Confusing checkpoints and savepoints. Trying to upgrade a job's code by relying on the auto-pruned latest checkpoint, instead of taking a deliberate, retained savepoint first.
  • Picking Flink for a Spark shop reflexively. If you already operate Spark and second-level latency is acceptable, Structured Streaming's unification often beats adding a whole new system to run.
  • Forgetting state is keyed. A "global" counter that ignores keys either becomes a hotspot (all events on one worker) or is simply the wrong aggregate.
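
As an illustration of the first pitfall, here is a minimal sketch of time-to-live (TTL) expiry on keyed state. `TtlState` is a hypothetical class and time is passed in explicitly to keep the example deterministic; real engines provide their own TTL configuration.

```python
class TtlState:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, last_update_time)

    def add(self, key, amount, now):
        value, last = self.entries.get(key, (0, now))
        if now - last >= self.ttl:
            value = 0  # the previous value has expired; start over
        self.entries[key] = (value + amount, now)
        return value + amount

    def expire(self, now):
        """Drop every key that has been idle for at least the TTL."""
        stale = [k for k, (_, last) in self.entries.items() if now - last >= self.ttl]
        for k in stale:
            del self.entries[k]
        return len(stale)


state = TtlState(ttl_seconds=300)
state.add("card-1", 1, now=0)
state.add("card-2", 1, now=200)
state.add("card-1", 1, now=250)  # within the TTL, so the count becomes 2
removed = state.expire(now=520)  # card-2 idle 320 s, card-1 idle 270 s

print(removed, state.entries)  # 1 {'card-1': (2, 250)}
```

Without the `expire` pass, every key ever seen would stay in `entries` forever, which is the unbounded growth described above.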

Going deeper: checkpoints get state internally right; making external effects happen once is Delivery semantics & exactly-once. A backed-up stateful operator is a slow consumer — how that shows up as lag and backpressure on-call is Streaming operations & failure modes. The SQL layer over these engines (ksqlDB, Flink SQL) is in Schema management & streaming architecture.

Foundations: keyed state shards by the same keys from Kafka architecture; windows are the stateful operators from Time and watermarks. Spark Structured Streaming reuses the engine from Batch processing & Spark.
