Kafka architecture: partitions, offsets, consumer groups, and ordering

In lesson 9.1 you met the log: one append-only, ordered sequence. That's a beautiful idea, but a single log on a single machine has two problems: it can only go as fast as one disk and one network card, and if that machine dies, your stream dies with it. Apache Kafka is the most widely used system that takes the log idea and makes it fast (by splitting it into pieces) and durable (by copying those pieces across machines). This lesson covers how it does both, and the price you pay for the speed, which is the single most misunderstood thing about Kafka: order.

Throughout, "Kafka" stands for the whole family of log systems with the same model — Confluent Cloud / Confluent Platform (managed/enterprise Kafka), Redpanda (a Kafka-compatible engine), Amazon Kinesis, and Google Pub/Sub all share these durable concepts even where they rename the parts. Learn the model once; map the vocabulary per tool.

Topics, partitions, and offsets

A topic is a named feed of events — for ShopFlow (see Meet ShopFlow), the canonical one is order_events. Producers write to a topic; consumers read from it. So far, that's just "the log with a name."

The crucial move is that a topic is split into partitions. A partition is an independent, ordered log — a shard of the topic. A topic with 6 partitions is really 6 separate append-only logs that, together, hold the topic's data.

Topic "order_events" (3 partitions)

partition 0: [o0][o1][o2][o3] ← each partition is its own ordered log
partition 1: [o0][o1][o2][o3][o4]
partition 2: [o0][o1]

offsets are PER-PARTITION (each partition counts from 0)

This is why partitions exist and what they cost:

  • They are the unit of parallelism. Different partitions live on different brokers (servers) and are written and read in parallel. Want more throughput? Add partitions. Within one consumer group, a topic's maximum read parallelism equals its partition count.
  • An offset is per-partition. The offset (a record's integer position) counts from 0 within each partition independently. "Offset 4" is meaningless without naming the partition. There is no global offset across a topic.
  • Order exists only inside a partition. Records in partition 1 are strictly ordered relative to each other. But Kafka makes no ordering promise across partitions — partition 0's offset 3 and partition 2's offset 3 could have happened in any real-world order. This is the heart of the whole lesson, and we return to it below.
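
To make the per-partition offset concrete, here is a minimal in-memory sketch. `ToyTopic` and its methods are hypothetical names invented for this illustration; nothing here is a Kafka API. It only shows that each partition counts its own offsets from 0.

```python
class ToyTopic:
    """In-memory stand-in for a topic: one independent list per partition."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: list[list[str]] = [[] for _ in range(num_partitions)]

    def append(self, partition: int, record: str) -> int:
        """Append to one partition and return the record's offset there."""
        log = self.partitions[partition]
        log.append(record)
        return len(log) - 1

    def read(self, partition: int, offset: int) -> str:
        """An offset only identifies a record together with its partition."""
        return self.partitions[partition][offset]


topic = ToyTopic(num_partitions=3)
first = topic.append(0, "order 1001 placed")   # offset 0 in partition 0
second = topic.append(2, "order 2002 placed")  # offset 0 in partition 2
third = topic.append(0, "order 1001 paid")     # offset 1 in partition 0
```

Two different records both sit at offset 0 here, which is why "offset 0" names nothing until you also say which partition.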

:::warning "Kafka is just a queue" — the costly half-truth If you picture Kafka as one ordered queue, you will write code that silently breaks under load. Kafka is a partitioned log: it's ordered within a partition and unordered across partitions, it retains data after reads, and it fans out to groups of consumers. Miss partitions, offsets, consumer groups, or the ordering rule and you'll ship bugs that only appear once traffic spreads events across partitions. This lesson exists to close exactly that gap. :::

Keys → partition → ordering (the rule that matters most)

How does a producer decide which partition an event goes to? By its key. Every Kafka message can carry a key (e.g. customer_id, order_id). The rule:

partition = hash(key) mod (number of partitions)

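The same key always hashes to the same partition, as long as the partition count stays fixed. Messages with no key are spread across partitions for load balance (round-robin in older clients, batch-by-batch "sticky" assignment in newer ones).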

This gives you the one ordering guarantee Kafka offers, stated precisely:

All events with the same key go to the same partition, and a partition is strictly ordered. Therefore events sharing a key are processed in order; events with different keys have no ordering guarantee relative to each other.

The design consequence is enormous. For ShopFlow, a single order's lifecycle must process in order (placed before paid before shipped), so key order_events by order_id. Then all of an order's events land in one partition, in order — while different orders fan out across partitions for parallelism. (If you instead need each customer's events ordered — e.g. for sessionization — key by customer_id.) You get per-key ordering and scale at the same time. Key your events by the entity whose order you care about; if you key by something irrelevant (or nothing), you forfeit ordering where you needed it.
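
The routing rule above can be sketched in a few lines. This is an illustration under assumptions, not Kafka's partitioner: `partition_for` is a hypothetical helper, and `zlib.crc32` stands in for the real hash only because it is stable across runs (Kafka's Java client hashes the serialized key with murmur2).

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition: hash(key) mod N.

    crc32 is an assumption of this sketch, chosen for determinism.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("1001", "placed"),
    ("2002", "placed"),
    ("1001", "paid"),
    ("1001", "shipped"),
]

# Route each event by its order_id key into one of 3 partition logs.
partitions: dict[int, list[tuple[str, str]]] = {p: [] for p in range(3)}
for order_id, status in events:
    partitions[partition_for(order_id, 3)].append((order_id, status))

# Every event for order 1001 sits in one partition, in production order.
home = partition_for("1001", 3)
lifecycle = [status for order_id, status in partitions[home] if order_id == "1001"]
```

Whatever partition order 1001 lands in, its three events appear there in the order they were produced, which is the whole guarantee.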

:::note The repartitioning trap Because partition = hash(key) mod N, changing the partition count N re-shuffles which partition a key maps to. Existing data stays where it was, but new events for an old key may now route to a different partition than the historical ones — breaking per-key order across the boundary. This is why teams pick a generous partition count up front and avoid resizing live topics casually. :::
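
A rough way to see the trap is to count how many keys change partition when N changes. The helper names below are hypothetical, and `zlib.crc32` is again only a stand-in for the real hash, so the exact count will differ from a real cluster. The effect is the same for any hash taken mod N.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """hash(key) mod N, with crc32 as a stand-in hash for this sketch."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys: list[str], old_n: int, new_n: int) -> list[str]:
    """Keys whose partition differs between the old and new partition counts."""
    return [k for k in keys if partition_for(k, old_n) != partition_for(k, new_n)]


order_ids = [str(i) for i in range(1000, 1100)]
moved = moved_keys(order_ids, old_n=6, new_n=8)
print(f"{len(moved)} of {len(order_ids)} keys route to a new partition")
```

For every key in `moved`, events written before the resize and events written after it live in different partitions, so nothing orders them relative to each other any more.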

Brokers, replication, and ISR

A Kafka cluster is a set of brokers (servers). Partitions are distributed across them. Durability comes from replication: each partition is copied replication-factor times onto different brokers.

For each partition, one replica is the leader and the rest are followers:

  • The leader handles all writes for that partition and, by default, all reads. Producers talk only to the leader, and consumers do too unless follower fetching is configured.
  • Followers continuously copy the leader's records. They're hot standbys.
  • If the leader's broker dies, one of the in-sync followers is promoted to leader, and clients reconnect. The partition stays available, and no committed record is lost.

Broker 1: partition 0 (LEADER)     partition 1 (follower)
Broker 2: partition 0 (follower)   partition 1 (LEADER)
Broker 3: partition 0 (follower)   partition 1 (follower)

The key durability concept is the ISR, the in-sync replica set: the replicas currently caught up to the leader. A write is considered committed (safe, won't be lost) once all replicas in the ISR have it. A producer can ask for acks=all, meaning "don't acknowledge my write until the full ISR has it." That is the strongest durability setting, and the one exactly-once depends on (lesson 9.5). With acks=1 (leader only), a leader crash after the ack but before a follower copies the record loses that record. The acks setting, measured against the ISR, is the dial between speed and "I will never lose an acknowledged write."
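
The difference between acks=1 and acks=all can be modeled with simple bookkeeping. This is a sketch, not broker code: `high_watermark` and `can_ack` are hypothetical helpers, and the broker names and record counts are made up. The one fact it leans on is that a record is committed once every ISR member holds it.

```python
def high_watermark(log_end: dict[str, int], isr: set[str]) -> int:
    """Number of records that every in-sync replica already holds."""
    return min(log_end[replica] for replica in isr)


def can_ack(acks: str, offset: int, leader: str,
            log_end: dict[str, int], isr: set[str]) -> bool:
    """Whether a producer waiting on the record at `offset` gets its ack yet."""
    if acks == "1":
        # Leader has the record; followers are not consulted.
        return offset < log_end[leader]
    if acks == "all":
        # Every replica in the ISR has the record.
        return offset < high_watermark(log_end, isr)
    raise ValueError(f"unsupported acks value in this sketch: {acks!r}")


# Records held per replica; broker-1 leads, broker-3 is still catching up.
log_end = {"broker-1": 10, "broker-2": 10, "broker-3": 7}
isr = {"broker-1", "broker-2", "broker-3"}

hw = high_watermark(log_end, isr)                      # 7
ack_1 = can_ack("1", 8, "broker-1", log_end, isr)      # True
ack_all = can_ack("all", 8, "broker-1", log_end, isr)  # False
```

The record at offset 8 is already acknowledged under acks=1 but not yet under acks=all. That window is exactly where a leader crash can lose an acknowledged write.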

Retention and log compaction

Unlike a queue, a Kafka topic keeps records after they're consumed. How long is governed by retention:

  • Time/size retention (the default): keep records for N days or until the partition reaches S gigabytes, then delete the oldest. A 7-day retention means any consumer can replay the last 7 days, but older data is gone.
  • Log compaction: a different mode that, instead of deleting by age, keeps only the latest record per key (older records for the same key are eventually garbage-collected). This turns the log from a history of changes into a snapshot of current state — perfect for "the current value for each customer_id." A compacted topic can be replayed to rebuild a complete key→latest-value table, which is exactly how stream processors restore state and how CDC topics (lesson 9.6) represent "the current row."

Use time/size retention for event histories; use compaction when you only care about the latest value per key.
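
Compaction's end state is easy to sketch: keep the last record per key and drop the rest. `compact` is a hypothetical helper and the customer tiers are invented sample data. Real compaction runs in the background, so a reader may still see some older records until it catches up.

```python
def compact(log: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the last record for each key, preserving log order."""
    last_index = {key: i for i, (key, _value) in enumerate(log)}
    return [record for i, record in enumerate(log) if last_index[record[0]] == i]


# (customer_id, loyalty_tier) updates in the order they were written.
log = [
    ("c1", "bronze"),
    ("c2", "silver"),
    ("c1", "gold"),
    ("c3", "bronze"),
    ("c2", "gold"),
]

compacted = compact(log)
# [('c1', 'gold'), ('c3', 'bronze'), ('c2', 'gold')]

# Replaying either log builds the same key -> latest-value table.
table = dict(compacted)
```

The compacted log is shorter but replays to the same table as the full history, which is what makes it usable for restoring state.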

Consumer groups and rebalancing

Now the read side, where Kafka's fan-out really shines. A consumer group is a set of consumer instances that cooperate to read a topic. The rule:

Within a consumer group, each partition is assigned to exactly one consumer. Kafka divides the topic's partitions among the group's members.

This gives you horizontal scaling for free: a topic with 6 partitions can be read by up to 6 consumers in a group, each owning a partition, processing in parallel. Add a 7th consumer and it sits idle (no partition left to own) — partition count is the ceiling on a group's parallelism.

Different groups are independent: the fraud-detector group and the warehouse-loader group each read the whole topic, at their own offsets, oblivious to each other. That's the log's fan-out — many independent readers of the same data.

[Diagram: a topic's partitions (p0, p1, p2, p3) and the consumers that read them (consumer A, consumer B, consumer X).]

When membership changes — a consumer joins, crashes, or leaves — Kafka performs a rebalance: it reassigns partitions across the surviving members so every partition still has exactly one owner. Rebalancing is what makes the group self-healing (a dead consumer's partitions are picked up by others), but it's also a brief pause and a common source of latency spikes, so production systems tune it.
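
The one-owner-per-partition rule and a rebalance can both be sketched with a single assignment function. `assign` is a hypothetical helper that deals partitions out round-robin. Real assignors differ, and some try to move as few partitions as possible during a rebalance, so treat this as the simplest strategy that satisfies the rule.

```python
def assign(partitions: list[int], members: list[str]) -> dict[str, list[int]]:
    """Give every partition exactly one owner by dealing them out in turn."""
    owners: dict[str, list[int]] = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        owners[members[i % len(members)]].append(partition)
    return owners


partitions = list(range(6))

three = assign(partitions, ["A", "B", "C"])
# {'A': [0, 3], 'B': [1, 4], 'C': [2, 5]}

seven = assign(partitions, ["A", "B", "C", "D", "E", "F", "G"])
# 'G' owns nothing: six partitions cannot feed seven consumers.

# Rebalance: consumer B crashes, the survivors split all six partitions.
after_crash = assign(partitions, ["A", "C"])
# {'A': [0, 2, 4], 'C': [1, 3, 5]}
```

In this naive version a rebalance reshuffles nearly everything, which hints at why rebalances cause pauses and why production systems tune them.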

Each consumer commits its offset per partition, so after a crash or rebalance the new owner resumes from the last committed position — the recovery mechanic from lesson 9.1, now operating per partition. The gap between the head and a group's committed offset is consumer lag — the single most important health metric for a stream. Rising lag means consumers can't keep up with producers; it's your early-warning siren.
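
Lag is plain arithmetic per partition: where the log ends minus where the group has committed. `consumer_lag` is a hypothetical helper and the offsets are invented. Treating a partition with no committed offset as starting from 0 is an assumption of this sketch.

```python
def consumer_lag(log_end: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Records written but not yet committed as processed, per partition."""
    # Assumption: a partition with no committed offset counts from 0.
    return {p: end - committed.get(p, 0) for p, end in log_end.items()}


log_end = {0: 1200, 1: 1180, 2: 950}    # next offset to be written, per partition
committed = {0: 1200, 1: 1100, 2: 400}  # the group's committed position

lag = consumer_lag(log_end, committed)  # {0: 0, 1: 80, 2: 550}
total_lag = sum(lag.values())           # 630
```

A total hides the shape of the problem. Here partition 2 carries most of the lag, which usually points at one slow consumer or one hot key rather than a group that is uniformly too small.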

A traced example: keying for order

ShopFlow streams order_events to a topic with 3 partitions. Two events arrive for order 1001: placed, then paid. Order is critical: if paid is processed before placed, a downstream consumer that builds order state might record a payment against an order it doesn't yet know exists.

Wrong: produce with no key. The two events round-robin to, say, partition 0 and partition 2. Two consumers in the group own those partitions and process in parallel, so paid might be handled before placed. Silent corruption that only appears under load.

Right: produce with key = order_id (1001). Both events hash to the same partition (say partition 1), land in order (placed at a lower offset than paid), and the one consumer that owns partition 1 processes them in that order. Meanwhile order 2002's events route to a different partition and process in parallel — full throughput, per-order correctness. The fix was one line: set the key.
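
The two outcomes in this trace can be replayed in a small simulation. `route` is a hypothetical helper: strict round-robin models the keyless case and `zlib.crc32` stands in for Kafka's hash, so the partition numbers it picks will not match the ones named above.

```python
import zlib
from itertools import count

NUM_PARTITIONS = 3


def route(events: list[tuple[str, str]], keyed: bool) -> dict[int, list[tuple[str, str]]]:
    """Place events into partitions, by key hash or by round-robin."""
    partitions: dict[int, list[tuple[str, str]]] = {p: [] for p in range(NUM_PARTITIONS)}
    turn = count()
    for order_id, status in events:
        if keyed:
            p = zlib.crc32(order_id.encode("utf-8")) % NUM_PARTITIONS
        else:
            p = next(turn) % NUM_PARTITIONS
        partitions[p].append((order_id, status))
    return partitions


events = [("1001", "placed"), ("1001", "paid")]

unkeyed = route(events, keyed=False)  # placed and paid land in different partitions
keyed = route(events, keyed=True)     # both land in one partition, placed first
```

In the unkeyed result, two consumers could each pick up one event and nothing orders them. In the keyed result, a single consumer sees placed before paid.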

When NOT to use Kafka

Kafka is the backbone of streaming at scale — and like Spark in Chapter 5, the senior move is sometimes not to reach for it. A partitioned, replicated log is a lot of standing machinery (brokers, ISR, consumer-group rebalancing, retention tuning), and you only want to pay for it when a real throughput or decoupling need justifies it:

  • Below a real throughput / decoupling need. If a single service can handle the volume directly, or one producer talks to one consumer, you don't need a distributed log. A plain message queue (SQS, RabbitMQ) — or even a database table polled as a work queue — is simpler and enough.
  • A scheduled batch would do. If the data lands periodically and staleness has no per-minute cost, this is the "streaming when batch would do" trap from lesson 9.1. A file dropped hourly is a batch in a trench coat, not a stream.
  • You can't carry the operational burden. Self-managed Kafka means partition sizing, broker capacity, replication, lag monitoring, and rebalance tuning. Below serious scale, that overhead can dwarf the problem — managed options (Confluent Cloud, Kinesis, Pub/Sub, Redpanda) lower it but don't make it free.
  • You want a database. Kafka retains and replays records, which tempts people to treat it as a store. It has no indexes, no ad-hoc queries, no point lookups by value. It's a log to move and re-derive state, not a system of record to query.

:::tip The honest decision rule Reach for Kafka when you have genuinely high-volume, continuous events and many independent consumers that must read the same stream at their own pace, replay history, and scale out — that fan-out-plus-replay is what a queue can't give you. Skip it when one consumer suffices (a queue is simpler), when batch would do (cheaper), or when you actually need to query the data (use a database). Kafka is a partitioned, replayable log — not a queue, not a scheduler, and not a database. :::

Why it matters

Kafka is the de facto backbone of streaming, and almost every streaming bug at scale traces back to a partition/ordering misunderstanding. Internalize four things and you can reason about any Kafka system: (1) a topic is a set of partitions, each an independent ordered log; (2) the key decides the partition and is therefore your only ordering lever; (3) replication + ISR + acks=all is what makes an acknowledged write durable; (4) a consumer group fans work across members, capped by partition count, healed by rebalancing. These map cleanly onto Confluent, Redpanda, Kinesis, and Pub/Sub.

Common pitfalls

  • Expecting global ordering across a topic. There is none — only per-partition order. Code that assumes "later timestamp ⇒ read later" breaks once events span partitions.
  • Not keying (or mis-keying) order-sensitive events. No key ⇒ events for one entity scatter across partitions and get processed out of order. Key by the entity whose order you care about.
  • Resizing partitions casually. Changing partition count re-maps keys and breaks per-key ordering across the change. Provision generously; resize rarely and deliberately.
  • Running acks=1 and assuming zero loss. A leader crash after ack but before replication loses the write. Use acks=all with a healthy ISR when durability matters.
  • Ignoring consumer lag. Lag is your stream's fever thermometer. A pipeline with quietly climbing lag is falling behind real time; once it falls further behind than the retention window, unread records are deleted before the consumer reaches them.
  • More consumers than partitions and expecting more speed. Extra consumers in a group sit idle. Throughput is bounded by partition count.

Checkpoint

9.2 Kafka architecture

Going deeper: durability here (ISR, acks) becomes end-to-end correctness in Delivery semantics & exactly-once; the state a consumer keeps gets formalized in Stateful processing.

Foundations: this builds directly on the log abstraction from Streaming fundamentals. Partitioning by key echoes the partitioning ideas from Storage & file formats.

Next: Time and watermarks →