Skip to main content

Extraction patterns: full, incremental & idempotent

You've identified a source. Now the core question: how much do you pull, and what happens when the job runs again? The answer is the difference between a pipeline that scales for years and one that quietly grinds to a halt — or worse, double-counts your revenue on every retry.

We'll ground this in ShopFlow's orders table (ShopFlow — see Meet ShopFlow) — the busiest, fastest-growing table in the store — and watch it land into raw.orders. There are two extraction shapes, full and incremental, and one property you want on both, idempotency. Let's build them from the ground up.

Full load: reload everything

A full load (or full refresh) re-reads the entire source every run and replaces the destination with it. Night one, you pull all 10 million ShopFlow orders into raw.orders. Night two, you pull all 10 million again — including the ones that didn't change.

It is the simplest possible pattern, and for small, slowly changing tables it's the right call: ShopFlow's products catalog (a few thousand rows that rarely change) is cheaper to reload wholesale than to track changes on. Full loads are also trivially correct — the destination is always an exact mirror of the source, with no drift to debug.

But it doesn't scale. As orders grows, a full load gets slower, more expensive, and heavier on the store's database, until the nightly job that took 4 minutes takes 4 hours and then doesn't finish inside its window at all. The moment a table is large and growing — which orders always is — you need to stop moving data that didn't change.

Incremental load: move only what changed

An incremental load pulls only the records that are new or modified since the last successful run. To do that you need a way to ask "what changed since last time?" The most common mechanism is a high-watermark.

The high-watermark pattern

A watermark is a saved marker recording how far you got last time — typically the maximum value of an ever-increasing column. ShopFlow's orders table carries an updated_at column for exactly this purpose. The classic columns are:

  • An updated_at timestamp the source updates whenever a row changes (ShopFlow's orders.updated_at bumps every time an order's status or amount changes), or
  • A monotonically increasing ID (an auto-incrementing primary key like order_id) for append-only tables that never update existing rows.

The loop, for ShopFlow's orders:

  1. Read your saved watermark (say, 2026-06-23 02:00:00).
  2. Query only newer rows: SELECT * FROM orders WHERE updated_at > '2026-06-23 02:00:00'.
  3. Load those rows into raw.orders.
  4. Save the new high-water mark: the max updated_at you just saw.

Each run moves a thin slice instead of the whole table. The 4-hour full load of every order becomes a 30-second incremental pull of just the orders that were placed or changed status since last night.

Tracing it

Suppose the watermark is 10:00:00 and ShopFlow's orders source now holds:

order_idstatusupdated_at
101shipped09:55:00
102placed10:05:00
103shipped10:07:00

Your query WHERE updated_at > '10:00:00' returns orders 102 and 103. Order 101 is skipped — it didn't change. You load 102 and 103 into raw.orders, then save the new watermark: 10:07:00. Next run starts there.

:::warning The boundary bugs that bite everyone High-watermarks have two notorious failure modes:

  • Lost updates from > vs >=. If you use strict > and two rows share the exact same updated_at as your watermark, and you only processed one of them, the other is skipped forever. Many teams use >= and rely on idempotency (below) to absorb the resulting overlap.
  • Late-arriving / clock-skew rows. A row written with an updated_at earlier than your watermark (because of a transaction that committed late, or a source clock that drifted) will never be picked up. Log-based CDC (next lesson) sidesteps this entirely by reading the database's change log in commit order instead of trusting a timestamp column. :::

Idempotency: making re-runs safe

Here is the reality that forces the next idea. Jobs fail. A network blips, a worker is killed, an orchestrator retries a task, an engineer re-runs a backfill. So your load will run more than once over overlapping data. If running it twice corrupts your data, you don't have a pipeline — you have a time bomb.

An operation is idempotent when applying it multiple times has the same effect as applying it once. Loading orders 102 and 103 into raw.orders twice should leave exactly one copy of each, not two.

Why naive appends break

The naive load is an INSERT (append). If your incremental pull overlaps the previous run by even one row — which the >= watermark guarantees — a plain append writes that order again. Now raw.orders has duplicate order_ids, and any SUM(amount) downstream double-counts ShopFlow's revenue. At-least-once delivery (the norm for queues and APIs that retry) makes overlap not a possibility but a certainty.

The fix: a dedup key + upsert

Two tools make a load idempotent:

  • A dedup key (deduplication key) — a column or combination of columns that uniquely identifies a record. For orders it's order_id; for order_items it's the pair (order_id, product_id). It's how you recognize "this is the same record I already have."
  • An UPSERT — a single operation meaning "update if the key exists, insert if it doesn't." In SQL this is MERGE, or INSERT ... ON CONFLICT (order_id) DO UPDATE. Instead of blindly appending, you merge by key: an order you've seen before overwrites its prior version; a new order is added.

With an upsert keyed on order_id, loading orders 102 and 103 into raw.orders a second time simply re-writes the same two rows in place. The destination is identical whether the load ran once or five times. That is idempotency, and it's what lets you use a forgiving >= watermark, retry freely, and re-run the ShopFlow backfill without fear.

Read saved watermarkPull rowswhere\nupdated_at >=watermarkUPSERT by dedupkey\n(update ifexists, insert ifSave newmax(updated_at)\nasnext watermark

"Exactly-once-ish" — the honest framing

You'll hear vendors promise exactly-once ingestion. True end-to-end exactly-once delivery across a network is famously hard and often impossible to guarantee strictly. What practically works — and what most systems actually mean — is at-least-once delivery made idempotent at the destination: the source may hand you a record more than once, but because you upsert by a dedup key (or track processed offsets), the effect is exactly-once. Deliver at-least-once, deduplicate on write. Remember that phrase; it's the honest version of "exactly-once."

Why it matters

Full loads are simple and correct but stop scaling the moment a table is large and growing. Incremental loads move only what changed, usually via a high-watermark on updated_at or a monotonic ID — at the cost of subtle boundary and late-arrival bugs. And because jobs will re-run, every load should be idempotent: a dedup key plus an upsert makes running once or five times produce the same correct state, which is the only honest route to "exactly-once." Get incremental-and-idempotent right and the rest of ingestion is detail.

The watermark approach trusts a timestamp column. The next lesson removes that trust entirely by reading the database's own change log.

Next: Change data capture (CDC) →