Skip to main content

Streaming operations & failure modes: dead-letter queues, consumer lag, and backpressure

Walk me through it

Everything so far has been about getting a stream correct — ordered (lesson 9.2), on-time (lesson 9.3), stateful through crashes (lesson 9.4), exactly-once (lesson 9.5). But correctness is what you design; operability is what you live with. A streaming pipeline runs forever, which means it's the one part of your data platform that can page you at 3 a.m. — and when it does, the cause is almost never "the watermark math was wrong." It's one of three blunt, physical failure modes: a message the consumer can't process, consumers falling behind producers, or consumers that simply can't keep up with the firehose.

This lesson is the on-call runbook the other lessons assume you already have. We'll name the three failure modes precisely — poison messages / dead-letter queues, consumer lag, and backpressure — add the rebalancing storm that amplifies all three, and trace a real incident end to end. A guide that teaches exactly-once but not "what do I do when lag is climbing and the pager is going off" has trained an architect who can't keep the lights on. This is the keep-the-lights-on lesson.

Failure mode 1 — Poison messages and the dead-letter queue

A poison message (also called a poison pill) is a single record your consumer cannot process and will never be able to process — not because the system is down, but because that record is broken. Maybe it's malformed JSON, a schema the consumer can't decode, a null where code assumes a value, or a payload that triggers a bug. The defining trait: retrying it will fail the same way every time.

Here's why one bad record is so dangerous. A naive consumer reads a record, processes it, then commits its offset (its position in the partition). If processing throws, a well-meaning consumer retries — but it can't commit past a record it never processed. So it retries the same record forever. The whole partition is now stuck behind one poison message: every healthy record after it waits in line, lag climbs without limit, and the pipeline is effectively dead — all because of one bad byte. This is the streaming equivalent of a single failed row aborting an entire batch load, except the batch eventually ends and the stream never does.

The fix is the dead-letter queue (DLQ): a separate topic (or table) where you route records you can't process, so the main stream keeps flowing. The pattern has three moves:

  1. Detect — wrap processing in a try/catch. A processing exception (not a transient infra error) means "this record is poison."
  2. Route — instead of retrying forever, publish the failed record to a <topic>.DLQ topic, enriched with why it failed (exception, stack trace, original offset, timestamp) — then commit the offset on the main topic so the consumer moves on. The poison record is now parked, not blocking.
  3. Alert + replay — the DLQ is monitored. A human (or an automated job) inspects the parked records, fixes the root cause (deploy a code fix, correct the schema), and replays the DLQ records back through the now-fixed consumer. The DLQ is a quarantine, not a graveyard — its records are meant to be reprocessed.
("order_events\ntopic")Consumer\ntry:process\ncatch: it'spoisonSink("order_events.DLQ\n(why it failed)")successexception →\nroute +commit offset

:::warning Distinguish "poison" from "transient" The whole pattern hinges on telling a permanent failure (bad record → DLQ immediately) from a transient one (the database was briefly down → retry, don't DLQ). DLQ-ing a record that would have succeeded on retry silently drops good data; retrying a genuine poison message forever blocks the partition. The usual rule: a few bounded retries (often with backoff) to absorb transient blips, and only after they're exhausted does the record go to the DLQ. :::

Failure mode 2 — Consumer lag (the stream's fever thermometer)

You met consumer lag briefly in lesson 9.2; here it becomes your primary operational signal. Definition:

Consumer lag = (the offset of the newest record in a partition, the log-end offset) − (the offset your consumer group has last committed). It's the count of records produced but not yet consumedhow far behind real-time you are, measured in messages.

Lag of zero means you're caught up. Lag of 2 million means there are two million records sitting in the log that your consumers haven't gotten to yet. Lag is reported per partition; you usually watch the max across partitions in a group (one hot partition can be badly behind while others are fine — averaging hides it).

Why lag grows. Lag is a simple flow imbalance: it rises whenever the producer rate exceeds the consumer rate. Concretely:

  • A traffic spike — producers suddenly emit faster than consumers drain.
  • A slow consumer — a downstream dependency (a database write, an API call) got slower, so each record takes longer.
  • A stuck consumer — a poison message (failure mode 1) blocking a partition makes its lag climb without bound.
  • Too few consumers/partitions — not enough parallelism to match the producer rate (recall: a group's parallelism is capped by partition count).
  • A rebalancing storm (failure mode 4) — consumers spend their time reassigning instead of consuming.

Lag vs throughput — don't confuse them. Throughput is records/second flowing through the consumer; lag is the backlog of records waiting. High throughput can coexist with high lag (you're processing fast but producers are faster). The danger sign isn't the absolute number — it's the trend: steadily rising lag means consumption can't match production and the gap will only widen. Flat lag (even if large) means you're keeping pace.

Why rising lag is an emergency, not a curiosity. A topic has finite retention (lesson 9.2) — say 7 days. If your lag grows so large that records age out of retention before you read them, those records are deleted, unread — permanent data loss. Climbing lag is a countdown to falling off the retention cliff.

How to monitor and alert. Lag is exposed by Kafka itself (consumer-group offset metadata) and surfaced by tools like Burrow, Kafka Lag Exporter, or your managed platform's console, scraped into Prometheus/Grafana or Datadog. Two complementary alerts:

  • Threshold alert — "max lag in group > N for 5 minutes." Catches a backlog building up.
  • Trend / time-to-drain alert — alert on lag that's monotonically rising, or on consumer-lag-in-time ("we're 8 minutes behind real-time"), which is more intuitive than a raw count and directly comparable to your retention window and your SLA.

Failure mode 3 — Backpressure (when consumers can't keep up)

Backpressure is the situation — and the family of responses to it — when a consumer cannot process records as fast as they arrive. Lag is the symptom you measure; backpressure is the condition, and how you react to it defines whether the system degrades gracefully or falls over.

When the inflow exceeds what a stage can handle, you have exactly three options — buffer, scale, or shed — and a healthy system blends them:

  1. Buffer (absorb) — let records queue up and ride out a temporary spike. In Kafka the broker is the buffer: the log durably holds the backlog (bounded by retention), so a short burst just shows up as lag that drains once the spike passes. This is free and correct for transient spikes — but buffering a sustained overload only delays the retention cliff. A bounded buffer that fills must eventually scale or shed.
  2. Scale (add capacity) — increase consumption rate to match production. Add consumers to the group (up to the partition count), and if you're already at that ceiling, add partitions so the group can fan out wider. Many engines also support autoscaling on lag — spin up more consumers when lag crosses a threshold. Scaling is the right answer to a sustained increase in legitimate load.
  3. Shed (drop / sample) — when you cannot scale fast enough and the data is loss-tolerant, deliberately drop or sample records to protect the system: keep 1 in 10 metrics events, drop low-priority traffic, or apply rate limits. Load shedding is a conscious tradeoff — sacrifice completeness to preserve liveness — and it's only acceptable where occasional loss is fine (telemetry, some metrics), never for payments or orders.
Inflow > processingrate\n(backpressure)BUFFER\nlog absorbsthespike\n(transientSCALE\nadd consumers→ addpartitions\n(autoscaSHED\ndrop /sample\n(loss-tolerant data only)

:::note Pull-based backpressure is a feature of the log A subtle gift of the pull-based consumer model (consumers ask the broker for records at their own pace, rather than the broker pushing) is that backpressure is built in: a slow consumer simply pulls slower, and the unread records sit safely in the durable log as lag. Contrast a push system with no durable buffer, where an overwhelmed consumer must drop records or crash. The log turns "I'm overwhelmed" into "I have visible, recoverable lag" — which is exactly why it's the substrate under streaming. True-streaming engines like Flink propagate this backpressure upstream through their operator chain, all the way back to the source, so a slow sink naturally throttles the read rate instead of blowing up memory. :::

Failure mode 4 — Rebalancing storms

A rebalance (lesson 9.2) is how a consumer group heals: when a member joins, crashes, or leaves, Kafka reassigns partitions so each still has exactly one owner. Healthy and necessary. The pathology is the rebalancing storm — rebalances firing over and over, so the group spends its time reassigning partitions instead of consuming them. During a (classic "stop-the-world") rebalance, consumption pauses, so a storm makes lag spike and throughput collapse.

The usual trigger is a liveness misconfiguration. A consumer must periodically heartbeat to the group coordinator and call poll() within max.poll.interval.ms to prove it's alive. If processing a batch takes longer than that interval (e.g. a slow downstream call), the coordinator assumes the consumer died, kicks it out, and rebalances — then the consumer finishes, rejoins, and triggers another rebalance. The group thrashes. Common fixes: make per-record processing faster or fetch smaller batches (lower max.poll.records) so a poll cycle finishes in time; raise max.poll.interval.ms if work is legitimately slow; and prefer cooperative/incremental rebalancing, which reassigns only the partitions that must move instead of revoking everything on every rebalance. The lesson: a rebalancing storm masquerades as "consumers are slow," but the root cause is usually that processing outran the liveness timeout.

The on-call runbook mindset

The reason to name these four failure modes is that on-call is not the time to think — it's the time to execute a checklist. The runbook mindset turns a panicked 3 a.m. page into a triage sequence:

  1. Alert fires — usually "consumer lag > threshold" or "lag rising / N minutes behind real-time." Lag is the symptom that surfaces almost every failure mode, which is why it's the alert you build first.
  2. Triage the cause — is one partition's lag spiking (→ a poison message stuck, or a hot key) or all partitions (→ a global slowdown or backpressure)? Is throughput zero (→ stuck/crashed consumer or rebalancing storm) or just slower than production (→ scale)?
  3. Act per the cause — poison message → confirm it's in the DLQ and the partition moved on; sustained load → scale consumers/partitions; transient spike → confirm lag is draining on its own; rebalancing storm → fix the liveness timeout; loss-tolerant flood you can't scale → shed.
  4. Verify recovery — watch lag trend back toward zero and confirm you're inside the retention window (you didn't lose data off the cliff).
  5. Post-incident — replay any DLQ'd records after the fix, and feed the root cause back into alert thresholds and capacity. Every page should make the next one less likely.

The deeper point: design for these failure modes before you ship, not after the first incident. A DLQ, a lag dashboard with threshold and trend alerts, autoscaling on lag, and sane liveness timeouts are the four things that separate a stream you can sleep through from one that pages you nightly.

A traced example: a poison record and a lag spike, end to end

ShopFlow's (see Meet ShopFlow) consumer group enrich-orders reads topic order_events (6 partitions) and writes enriched rows to the warehouse. At 02:14 the max-lag alert fires: lag on partition 3 is climbing past 200k while the other five partitions sit near zero.

  1. 02:14 — alert. "Max consumer lag in enrich-orders > 100k for 5 min." On-call is paged.
  2. Triage. One partition (3) is spiking, the rest are flat → not a global slowdown, not backpressure across the board. A single stuck partition points at a poison message blocking offset progress on partition 3.
  3. Confirm. Consumer logs show the same order_events record at offset 48,217,902 throwing a JsonParseException on every retry — a producer shipped a truncated order_events payload (a half-serialized paid event). Classic poison pill: retried forever, blocking every healthy record behind it.
  4. Why lag climbed. Partition 3 hasn't committed past 48,217,902 since 02:09, so its backlog grew at the full produce rate while the consumer spun uselessly. Other partitions, unaffected, stayed near zero.
  5. Route + unblock. The deploy already had a try/catch around parsing, but it was set to retry indefinitely. The fix (hotfix config): after 3 retries, publish the bad record to order_events.DLQ with the exception and original offset, commit past it, and continue. Partition 3 unsticks; its 200k backlog now drains as the consumer races forward.
  6. Verify. Within ~8 minutes partition-3 lag falls back toward zero — and crucially it drained before the 7-day retention cliff, so no data was lost (the one poison record is safely parked in the DLQ, not gone).
  7. Replay. Next morning: inspect order_events.DLQ, see the truncated payloads, fix the upstream producer bug, then replay the DLQ records through the fixed consumer. The quarantined order_events are recovered. The DLQ did its job — it parked one bad record instead of letting it kill the whole partition.

Trace what each mechanism bought: the DLQ turned "one bad byte stops the pipeline" into "one bad byte is quarantined." Lag monitoring caught it in minutes, not when customers noticed missing orders. Per-partition lag (not the average) localized it to partition 3 instantly. And retention was the clock the whole incident was racing — recovery means draining the backlog before records expire. That's the on-call loop: alert → triage with lag → act per cause → verify against retention → replay.

Why it matters

Watermarks and exactly-once are what you put on the design doc; DLQs, lag dashboards, and backpressure handling are what you put on the pager. In production, the streaming incidents that actually wake you are mundane and physical — a malformed record jamming a partition, consumers drifting behind producers toward the retention cliff, or a load spike consumers can't drain — and every one of them is survivable by design if you built the four defenses in advance: a dead-letter queue so one poison record can't stop the stream, consumer-lag monitoring with both threshold and trend alerts so you see "falling behind" before it becomes "lost data," a scale-or-shed plan for backpressure, and sane liveness timeouts so the group doesn't thrash. An engineer who can recite exactly-once but freezes when lag is climbing isn't yet trusted with the stream. This lesson is what makes you the person who is.

Common pitfalls

  • No DLQ, so one poison record halts a partition. A single un-processable message retried forever blocks every record behind it and runs lag off the retention cliff. Route permanent failures to a DLQ and commit past them.
  • Treating the DLQ as a graveyard. A DLQ nobody monitors or replays is just silent data loss with extra steps. Alert on DLQ volume, fix the root cause, and replay the parked records.
  • DLQ-ing transient errors. Routing a record to the DLQ because the database blinked drops data that would have succeeded on retry. Bound-retry transient failures; DLQ only permanent ones.
  • Monitoring throughput but not lag. A pipeline can process fast (high throughput) while still falling further behind (rising lag). Lag — especially its trend and max across partitions — is the health signal; throughput alone hides the backlog.
  • Alerting only on a lag threshold, not the trend. A flat-but-large lag may be fine; a small-but-rising lag is an emergency. Alert on monotonic growth / time-behind-real-time, not just an absolute count.
  • Confusing buffering with fixing. Letting the log absorb a sustained overload just postpones the retention cliff. Transient spikes buffer; sustained load must scale (more consumers/partitions) or shed.
  • Shedding load where loss is unacceptable. Dropping/sampling is fine for telemetry, fatal for payments or orders. Shed only loss-tolerant data.
  • Slow processing triggering rebalancing storms. If a poll cycle outruns max.poll.interval.ms, the coordinator evicts the consumer and the group thrashes. Process faster, fetch smaller batches, raise the interval, or use cooperative rebalancing.

Checkpoint

Required checkpoint

9.7 Operations & failure modes

Pass to unlock the Next button below

Going deeper: these failure modes are how the exactly-once guarantees of lesson 9.5 get operated — a DLQ and a replay are exactly-once thinking applied to the records that break. The lag and backpressure you manage here are pressure on the keyed state and checkpoints of lesson 9.4 (a backed-up stateful operator is a slow consumer).

Foundations: consumer lag, offsets, consumer groups, rebalancing, and retention all come from Kafka architecture; the DLQ is the streaming cousin of the dead-letter / quarantine pattern in Ingestion & integration.

Next: Chapter 9 checkpoint →