Distributed systems & idempotency: the spine of every pipeline

This is the lesson that separates a data engineer from someone who writes scripts. At any real scale, your data lives on many machines, and those machines fail constantly — not as an exception, but as the normal operating condition. Every later chapter (Spark, streaming, the lakehouse, orchestration) is, underneath, a set of techniques for staying correct despite that. If you internalize one idea here, make it this: failure is normal, so every pipeline step must be safe to re-run. The property that makes that possible is idempotency, and it is the spine of this entire guide.

The plain-English on-ramp

Imagine you text a friend "transfer me $50" and your phone loses signal mid-send. Did it go through? You don't know. If you re-send to be safe, do you risk asking for $100? It depends entirely on how the system handles a repeated message. If sending the same request twice has the same effect as sending it once, you can safely re-send after any failure. If it doesn't, re-sending is dangerous.

That tiny scenario is distributed systems. Whenever two computers talk over a network, a message can be lost, delayed, duplicated, or arrive after the sender gave up and tried again. You can never be sure a remote step "definitely happened exactly once." So instead of trying to make networks perfect (impossible), engineers design steps that are safe to repeat. That design property is idempotency, and building pipelines out of idempotent steps is how you stay correct in a world where everything intermittently breaks.
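One common way to make the "$50" request safe to re-send is to attach a unique request ID and have the receiver ignore IDs it has already processed. The sketch below is a minimal in-memory illustration of that idea; the `Ledger` class, the `transfer_in` method, and the `req-001` ID are hypothetical names invented for this example, not part of any real payment API.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy in-memory account that deduplicates requests by a caller-supplied ID."""
    balance: int = 0
    seen_request_ids: set = field(default_factory=set)

    def transfer_in(self, request_id: str, amount: int) -> int:
        # A repeated request_id means "this is a retry": apply nothing new.
        if request_id in self.seen_request_ids:
            return self.balance
        self.seen_request_ids.add(request_id)
        self.balance += amount
        return self.balance


ledger = Ledger()
ledger.transfer_in("req-001", 50)   # original send
ledger.transfer_in("req-001", 50)   # re-send after the lost signal, same ID
print(ledger.balance)               # 50
```

Because the retry carries the same ID, the sender never needs to know whether the first attempt arrived. A real system would also have to persist the set of seen IDs durably, which this sketch leaves out.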

Why "failure is normal" — and the primitives that follow

A single hard drive fails rarely. But a data platform might run on thousands of disks and machines — so at that scale, something is always broken right now. A truth the field states plainly: the network is unreliable, machines die, and partial failure is the default. Data engineers don't prevent failure; they design around it. Five primitives capture how, and you need a conceptual grasp of each (not a PhD).

Partitioning (sharding): split the data across machines

A dataset too big for one machine is partitioned (also called sharded): split into pieces, each living on a different machine, usually by a partition key (e.g. by customer ID, or by date). This is how systems hold more data than one machine can — and, crucially for analytics, it lets a query skip irrelevant partitions. A query for "yesterday's orders" against data partitioned by date reads one partition, not the whole history. (This is the cost undercurrent and the OLAP speed trick from Chapter 2: don't read what you don't need.)
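Partition skipping can be sketched in a few lines. The example below assumes a tiny, made-up order list and uses a plain dictionary to stand in for per-date partitions; a real engine would keep each partition in separate files or on separate machines, but the routing and the skip work the same way.

```python
from collections import defaultdict

# Hypothetical order rows: (order_id, order_date, amount)
orders = [
    (1, "2026-06-22", 20),
    (2, "2026-06-23", 35),
    (3, "2026-06-23", 15),
    (4, "2026-06-24", 50),
]

# "Write" side: route each row to a partition chosen by the partition key.
partitions = defaultdict(list)
for row in orders:
    partitions[row[1]].append(row)


def total_for_day(day: str) -> tuple[int, int]:
    """Return (total amount, rows scanned), touching only one partition."""
    rows = partitions.get(day, [])
    return sum(amount for _, _, amount in rows), len(rows)


total, scanned = total_for_day("2026-06-23")
print(total, scanned)  # 50 2
```

The query scans two rows instead of four. On a table with years of history, the same lookup by partition key is what keeps "yesterday's orders" cheap.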

Replication: keep copies so failure doesn't lose data

Replication is keeping multiple copies of the same data on different machines. If one dies, a copy survives — that's durability and availability. But replication creates a new problem: when you update the data, the copies are briefly out of sync. Which copy does a reader see? That question is consistency, and it's the crux of the next idea.
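The "briefly out of sync" window is easy to see in a toy model. The sketch below assumes one primary and asynchronous followers; `ReplicatedValue` and its methods are illustrative names, and real replication ships a log of changes over a network rather than copying a single value.

```python
class ReplicatedValue:
    """Toy primary with asynchronous followers (illustrative only)."""

    def __init__(self, n_followers: int):
        self.primary = None
        self.followers = [None] * n_followers
        self.pending = []  # writes accepted by the primary, not yet shipped

    def write(self, value):
        self.primary = value
        self.pending.append(value)

    def replicate(self):
        # Ship queued writes to every follower, in order.
        for value in self.pending:
            self.followers = [value] * len(self.followers)
        self.pending.clear()

    def read_follower(self, i: int):
        return self.followers[i]


store = ReplicatedValue(n_followers=2)
store.write("v1")
store.replicate()
store.write("v2")
stale = store.read_follower(0)   # follower has not seen "v2" yet
store.replicate()
fresh = store.read_follower(0)   # now it has
print(stale, fresh)              # v1 v2
```

Between the write of `v2` and the next `replicate()` call, a reader on a follower sees `v1` while the primary already holds `v2`.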

CAP and PACELC: the tradeoff you can't escape

The CAP theorem states that when a distributed system hits a network partition (P: some machines can't talk to each other, which is unrelated to the data partitioning described above), it must choose between Consistency (C: every read sees the latest write, or gets an error) and Availability (A: every request gets an answer, but possibly a stale one). You cannot have both during a partition; you pick which to sacrifice. A bank balance leans CP (refuse rather than show wrong money); a social feed leans AP (show slightly stale likes rather than an error).

PACELC extends it with the part CAP omits: Else (when there's no partition), you still trade Latency vs Consistency — keeping copies perfectly in sync costs time. So even on a healthy day, stronger consistency is slower. You don't need the formalism; you need the instinct: consistency, availability, and latency are in tension, and every data store you pick has made a choice among them. "Eventual consistency" (a write appears everywhere eventually, not instantly) is the common AP/low-latency choice you'll meet constantly.
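The CP-versus-AP choice comes down to what a replica does when it cannot reach the rest of the system. The sketch below is a deliberately simplified model: the `Replica` class, its `mode` flag, and the stored string are assumptions made for illustration, and real stores expose this choice through configuration rather than a two-value switch.

```python
class Replica:
    """Toy replica that may be cut off from the primary (illustrative only)."""

    def __init__(self, mode: str):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.value = "balance=100"   # last value copied from the primary
        self.partitioned = False

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Cannot confirm this is the latest write, so refuse to answer.
            raise RuntimeError("unavailable: cannot reach primary")
        # AP (or a healthy network): answer from the local copy, possibly stale.
        return self.value


cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True

print(ap.read())              # balance=100 (may be stale)
try:
    cp.read()
except RuntimeError as err:
    print(err)                # unavailable: cannot reach primary
```

Neither behavior is wrong; the CP replica protects correctness at the cost of an error, and the AP replica protects uptime at the cost of possible staleness.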

Consistency models: a spectrum, not a switch

Following from the above, consistency isn't on/off — it's a spectrum from strong (every read sees the latest write, as if one machine) to eventual (reads may be briefly stale but all copies converge). Stronger is easier to reason about but slower and less available; weaker is faster and more available but your code must tolerate staleness. Knowing where a system sits tells you what surprises to expect.

Consensus: how machines agree (Raft/Paxos, conceptually)

If data is replicated across machines, how do they agree on what the truth is — say, which one is the leader, or whether a write really committed — when any of them might fail mid-decision? Consensus algorithms (the famous ones are Paxos and the more teachable Raft) are protocols that let a group of machines reach reliable agreement even though some may fail or messages may be lost. You will almost never implement one — but they run underneath the orchestrators, message buses, and databases you use, and "this system uses Raft for leader election" should now parse as "it has a principled way to agree despite failures." That conceptual grasp is all this lesson asks.
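One piece of consensus that is worth holding in your head is the majority rule that Raft and Paxos both rely on: a decision counts once more than half the nodes have accepted it. The helper names below are made up for this sketch, and it covers only the quorum arithmetic, not terms, logs, or leader election.

```python
def majority(n_nodes: int) -> int:
    """Smallest number of nodes that is more than half the cluster."""
    return n_nodes // 2 + 1


def tolerated_failures(n_nodes: int) -> int:
    """How many nodes can be down while a majority is still reachable."""
    return n_nodes - majority(n_nodes)


def is_committed(acks: int, n_nodes: int) -> bool:
    """A write counts as committed once a majority has acknowledged it."""
    return acks >= majority(n_nodes)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

Note that four nodes tolerate no more failures than three, which is why such clusters are usually sized with an odd number of nodes.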

Idempotency: the property that makes pipelines safe

Now the payoff. Idempotency means: performing an operation multiple times produces the same result as performing it once. Press an elevator button five times — the elevator still comes once. That's idempotent. A pipeline step is idempotent if re-running it (after a crash, a retry, a duplicate message) leaves the data exactly as if it ran once.

Why this is the spine of everything: because networks and machines fail, retries are inevitable — the orchestrator will re-run a failed step, a message bus will occasionally deliver twice. If your steps are idempotent, retries are free and safe. If they aren't, every retry risks doubling data, double-charging, or corrupting the table. Idempotency is what makes "just re-run it" a safe answer instead of a disaster.

Worked example: non-idempotent vs idempotent

Task: load yesterday's ShopFlow orders into the daily sales table. The job crashes after writing 800 of 1,000 rows and is retried from the start.

Non-idempotent design — INSERT (append):

-- First run: appends 800 rows, then crashes.
-- Retry: appends ALL 1,000 rows again.
INSERT INTO daily_orders
SELECT * FROM stg_orders WHERE order_date = '2026-06-23';
-- Result: 1,800 rows. 800 duplicated. The dashboard is now WRONG.

Idempotent design — overwrite the partition (delete-then-insert, or MERGE):

-- Replace yesterday's slice atomically: both statements run in one transaction.
-- (Transaction syntax varies slightly by engine.)
BEGIN;
DELETE FROM daily_orders WHERE order_date = '2026-06-23';
INSERT INTO daily_orders
SELECT * FROM stg_orders WHERE order_date = '2026-06-23';
COMMIT;
-- First run: crashes partway, so nothing commits and the table is unchanged.
-- Retry: clears the slice (including rows left by any earlier load) and inserts the full 1,000.
-- Result: EXACTLY 1,000 rows, every time, no matter how often it runs.

The fix is a design choice: make the step define the final desired state of a slice ("yesterday's table should be exactly these 1,000 rows") rather than a blind incremental action ("append these rows"). Overwrite-by-partition, MERGE/upsert keyed on a primary key, and deterministic output paths (the same run date always writes to the same location, so a retry overwrites instead of adding) are the everyday tools. Designing steps to be idempotent is the single most important habit in this guide, which is why we have been promising it since Lesson 1.
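The crash-and-retry scenario above can be replayed end to end with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the tables are reduced to two columns, the "crash" is a raised exception after 800 rows, and the append design commits each row as it is written so that partial work survives the crash, as it would in a non-transactional load.

```python
import sqlite3

DAY = "2026-06-23"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, order_date TEXT)")
conn.execute("CREATE TABLE daily_orders (order_id INTEGER, order_date TEXT)")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?)",
                 [(i, DAY) for i in range(1000)])
conn.commit()


def copy_rows(crash_after=None, commit_each_row=False):
    """Copy staged rows one by one, optionally 'crashing' partway through."""
    rows = conn.execute(
        "SELECT order_id, order_date FROM stg_orders WHERE order_date = ?",
        (DAY,),
    ).fetchall()
    for n, row in enumerate(rows):
        if crash_after is not None and n == crash_after:
            raise RuntimeError("simulated crash")
        conn.execute("INSERT INTO daily_orders VALUES (?, ?)", row)
        if commit_each_row:
            conn.commit()


def append_load(crash_after=None):
    # Non-idempotent: every row is committed as soon as it is written.
    copy_rows(crash_after, commit_each_row=True)


def overwrite_load(crash_after=None):
    # Idempotent: delete + insert inside one transaction.
    # `with conn` commits on success and rolls back on an exception.
    with conn:
        conn.execute("DELETE FROM daily_orders WHERE order_date = ?", (DAY,))
        copy_rows(crash_after)


def run_with_one_crash(load):
    conn.execute("DELETE FROM daily_orders")   # start from an empty table
    conn.commit()
    try:
        load(crash_after=800)
    except RuntimeError:
        pass
    load()  # the retry
    return conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]


append_count = run_with_one_crash(append_load)
overwrite_count = run_with_one_crash(overwrite_load)
print(append_count, overwrite_count)  # 1800 1000
```

Running `overwrite_load()` any number of additional times leaves the count at 1,000, which is the property the lesson is after.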

Delivery guarantees: at-least-once, at-most-once, exactly-once

When one system hands data to another over an unreliable network, there are exactly three possible promises about how many times each record gets delivered. You'll choose between them constantly (especially in streaming, Chapter 9):

At-most-once: each record is delivered zero or one times. Never duplicated, but may be lost (the sender does not retry, so a record dropped by a failure is gone). Fine for low-value, high-volume data where a dropped reading doesn't matter (some metrics). Rarely what you want for business data.
At-least-once: each record is delivered one or more times. Never lost, but may be duplicated (a retry re-sends a record that had actually arrived). This is the common, robust default, but it requires the consumer to be idempotent to handle the duplicates safely.
Exactly-once: each record is delivered, and takes effect, once and only once; never lost, never duplicated. The ideal, but genuinely hard and expensive across a network.
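The difference between the first two guarantees is whether the sender retries when it does not get an acknowledgement. The simulation below is a toy: the `deliver` function and the scripted sets of lost messages and lost acks are invented for this sketch so the outcome is deterministic.

```python
def deliver(records, lost_messages, lost_acks, retry):
    """Simulate sending records over a flaky link.

    lost_messages / lost_acks: sets of (record, attempt) pairs that fail.
    retry=False models at-most-once; retry=True models at-least-once.
    """
    received = []
    for record in records:
        attempt = 0
        while True:
            attempt += 1
            arrived = (record, attempt) not in lost_messages
            if arrived:
                received.append(record)
            acked = arrived and (record, attempt) not in lost_acks
            if acked or not retry:
                break
    return received


records = ["a", "b", "c"]
lost_messages = {("b", 1)}   # b's first send never arrives
lost_acks = {("c", 1)}       # c arrives, but its ack is lost

at_most_once = deliver(records, lost_messages, lost_acks, retry=False)
at_least_once = deliver(records, lost_messages, lost_acks, retry=True)
print(at_most_once)    # ['a', 'c']
print(at_least_once)   # ['a', 'b', 'c', 'c']
```

Without retries, `b` is lost. With retries, nothing is lost, but `c` arrives twice because the sender could not tell a lost ack from a lost message.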

Here's the punchline that ties the whole lesson together: true exactly-once delivery is nearly impossible, but exactly-once effect is achievable — by combining at-least-once delivery with idempotent processing. Let the network deliver duplicates (cheap, reliable), and make your processing idempotent so duplicates don't matter (the MERGE above). What looks like "exactly-once" in modern systems (Spark, Kafka, Flink) is almost always at-least-once delivery + idempotent/transactional writes. Now you know the trick behind the marketing term.
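A minimal sketch of that combination: duplicates arrive, and an upsert keyed on the primary key makes them harmless. It uses Python's `sqlite3` with SQLite's `ON CONFLICT ... DO UPDATE` clause (available in SQLite 3.24 and later) as a stand-in for MERGE; the `orders` table and the message list are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount INTEGER)"
)

# At-least-once delivery: order 2 shows up twice.
delivered = [(1, 20), (2, 35), (2, 35), (3, 15)]


def consume(message):
    # Upsert keyed on the primary key: a duplicate rewrites the same row.
    with conn:
        conn.execute(
            "INSERT INTO orders (order_id, amount) VALUES (?, ?) "
            "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount",
            message,
        )


for message in delivered:
    consume(message)

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM orders"
).fetchone()
print(row_count, total)  # 3 70
```

Four messages were delivered, three rows exist, and the total is correct. Replaying the whole list again would change nothing.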

:::tip Durable vs dated
"Failure is normal; design idempotent steps; combine at-least-once with idempotency for exactly-once effect" is as durable as it gets: it's true of every distributed data system, past and future. The tools that implement it (Kafka's transactions, Spark's checkpointing, Iceberg's atomic commits) are dated instances of this one idea. Learn the principle; the tools become recognizable applications of it.
:::

A note on data formats: how the bytes are shaped

One more foundational vocabulary set, because it threads through everything you'll ingest and store. Data arrives in different physical shapes, and the shape determines how you read it:

Structured data has a fixed, known schema — neat rows and columns (a relational table, a CSV with consistent columns). Easy to query.
Semi-structured data has some structure but it's flexible and self-describing — JSON and XML, where each record carries its own field names and records can vary. Common from APIs.
Unstructured data has no inherent tabular structure — free text, images, audio.

Cross-cutting that: text vs binary formats. Text formats (CSV, JSON, XML) are human-readable but bulky and slow to parse. Binary formats (like Parquet, Chapter 2) are compact and fast for machines but not eyeball-readable. And one decision shapes how you handle the schema:

Schema-on-write — you enforce the structure when data is written (a relational database rejects a row that doesn't fit). Clean and validated, but rigid.
Schema-on-read — you store data raw and apply a structure when you read it (dump JSON into a lake now, interpret it at query time). Flexible and fast to ingest, but you discover bad data later, at read time.
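The two approaches differ only in when the check runs. The sketch below assumes a hypothetical two-field schema (`order_id`, `amount`) and two hand-written JSON lines, one of them malformed; the `validate` helper is invented for illustration.

```python
import json

REQUIRED = {"order_id": int, "amount": float}  # hypothetical schema


def validate(record: dict) -> dict:
    for name, expected_type in REQUIRED.items():
        if not isinstance(record.get(name), expected_type):
            raise ValueError(f"bad or missing field: {name}")
    return record


raw_lines = [
    '{"order_id": 1, "amount": 9.5}',
    '{"order_id": "oops", "amount": 3.0}',   # wrong type for order_id
]

# Schema-on-write: validate before storing; the bad row is rejected now.
table, rejected = [], []
for line in raw_lines:
    try:
        table.append(validate(json.loads(line)))
    except ValueError:
        rejected.append(line)

# Schema-on-read: store everything raw; problems surface at query time.
lake = list(raw_lines)
read_errors = 0
for line in lake:
    try:
        validate(json.loads(line))
    except ValueError:
        read_errors += 1

print(len(table), len(rejected), len(lake), read_errors)  # 1 1 2 1
```

Both paths find the same bad record. Schema-on-write finds it at ingestion and keeps it out of the table; schema-on-read accepts it immediately and leaves every later reader to deal with it.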

This text-vs-binary, structure, and schema-on-read-vs-write vocabulary is exactly what Chapter 2 builds on — and it's why "load raw, transform later" (ELT, schema-on-read) and "validate up front" (schema-on-write) are the two ends of a real tradeoff you'll keep meeting.

Common pitfalls

Writing append-only steps. A blind INSERT is the classic non-idempotent trap. On retry it duplicates. Default to overwrite-a-partition or MERGE keyed on a primary key.
Assuming the network is reliable. "It usually works" is not a design. Plan for the timeout, the duplicate, the out-of-order arrival — they will happen at scale.
Chasing literal exactly-once delivery. It's a trap. Build at-least-once + idempotent processing instead; that's what the real systems do.
Confusing replication with backup. Replicas keep you available when a machine dies, but if a bad pipeline deletes good data, replication faithfully propagates that delete to every replica. You still need backups.

Why it matters

At scale, failure is the normal condition, so data engineering is the art of staying correct despite it. Partitioning spreads data (and lets queries skip what they don't need); replication keeps copies for durability but raises the consistency question; CAP says that during a network partition you must give up either consistency or availability, and PACELC adds that even without one you trade latency against consistency; consensus (Raft/Paxos) is how machines agree despite failures. The keystone is idempotency, designing every step so that re-running it is safe, which, combined with cheap at-least-once delivery, is how modern systems achieve exactly-once effect. Carry this mindset into every later chapter; it is what "production-grade" actually means.

You've now built the entire foundation: the role, the lifecycle, OLTP vs OLAP, the relational model, batch vs streaming and ETL vs ELT, and the distributed-systems spine. One practical thread is still loose: when your data is small enough to fit on a single machine, you don't need any of this distributed machinery — a single-node dataframe is often the simplest, fastest tool. The next lesson makes that case before we lock the chapter in.

Where this leads: idempotency reappears in ingestion (Ch. 6), orchestration retries (Ch. 8), and exactly-once streaming (Ch. 9); partitioning is the heart of Ch. 2; reclaiming atomic commits on the lake is Ch. 10.

Next: Single-machine dataframes: pandas vs Polars →
