Skip to main content

Idempotent SQL & semi-structured data

Two things separate SQL you write for a report from SQL you write for a pipeline. First, a pipeline runs over and over — on a schedule, after failures, during backfills — so the same SQL must be safe to re-run without creating duplicates or double-counting. Second, real source data is rarely clean tables; it arrives as JSON blobs, arrays, and nested structures that you must crack open inside SQL. This lesson teaches both, because both show up on day one of a real data-engineering job and in almost every interview.

Idempotency: the property that makes re-running safe

Idempotent means running an operation once or many times produces the same result. Pressing a crosswalk button ten times is idempotent — the light changes once. Appending a row ten times is not — you get ten rows.

This matters because pipelines fail and retry. A task dies halfway, the orchestrator (Chapter 8) retries it, and a non-idempotent step now double-loads yesterday's data. Idempotent SQL is how you make "run it again" always safe. Three patterns deliver it.

Pattern 1 — Deduplication with QUALIFY ROW_NUMBER()

The most common impurity in raw data is duplicate rows for the same logical key — the same ShopFlow order (ShopFlow — see Meet ShopFlow) lands twice in raw.orders because an upstream system retried, or change-data-capture (Chapter 6) emitted the same record more than once. You want one row per key — one row per order_id — and usually the latest one.

The previous lesson gave you the tool: ROW_NUMBER() partitioned by the key, ordered so the row you want is number 1. The clean way to keep only that row is QUALIFY — a clause that filters on a window function the way WHERE filters on columns and HAVING filters on groups:

SELECT *
FROM raw.orders
QUALIFY ROW_NUMBER() OVER (
PARTITION BY order_id -- one group per logical key
ORDER BY order_ts DESC -- newest first
) = 1;

This returns exactly one row per order_id — the most recently placed/updated. Trace it: within each order_id partition the rows are numbered by order_ts descending, so the newest gets 1; QUALIFY … = 1 keeps only those.

:::note QUALIFY is the missing fourth filter SQL's filter clauses run at different stages: WHERE (rows, before grouping) → GROUP BYHAVING (groups) → window functions → QUALIFY (window results). QUALIFY is supported in Snowflake, BigQuery, DuckDB, Databricks/Spark, and Trino. On an engine that lacks it (e.g. PostgreSQL), wrap the window in a CTE and filter in an outer WHERE rn = 1 — identical result, two more lines. :::

A subtle but critical detail: make the ordering deterministic. If two rows tie on order_ts, ROW_NUMBER breaks the tie arbitrarily, so two runs can keep different rows — your output is no longer reproducible. Add a tiebreaker that is guaranteed unique (here the per-record ingestion id the loader stamps on each landed row):

QUALIFY ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY order_ts DESC, ingestion_id DESC -- unique tiebreaker → deterministic
) = 1

Deterministic ordering — a sort that always produces the same result given the same input — is what makes a dedup reproducible and therefore idempotent.

Pattern 2 — MERGE / UPSERT

Appending data is not idempotent; an upsert is. UPSERT = update if the key exists, insert if it doesn't. The standard SQL statement for it is MERGE, which reconciles a target table against a source in one atomic operation. Here we upsert ShopFlow line items into the fact_sales table — grain: one order line item, so its natural key is the pair (order_id, product_id) — from a deduped stg_order_items source:

MERGE INTO fact_sales AS t
USING stg_order_items AS s
ON t.order_id = s.order_id
AND t.product_id = s.product_id -- the (order_id, product_id) grain
WHEN MATCHED THEN
UPDATE SET quantity = s.quantity,
unit_price = s.unit_price,
line_revenue = s.quantity * s.unit_price
WHEN NOT MATCHED THEN
INSERT (order_id, product_id, quantity, unit_price, line_revenue)
VALUES (s.order_id, s.product_id, s.quantity, s.unit_price,
s.quantity * s.unit_price);

Run this once or a hundred times with the same staging data and fact_sales ends up identical: existing line items updated in place, new ones inserted, no duplicates. That is idempotency by construction. MERGE is the backbone of incremental fact loads and slowly-changing dimensions (Chapter 4), and it's a first-class statement in Snowflake, BigQuery, Spark SQL, and the lakehouse table formats of Chapter 10.

:::warning Dedup your source before you MERGE MERGE errors (or silently mis-updates, depending on engine) if the source has more than one row per join key — it doesn't know which one to apply. The standard fix is to run the QUALIFY ROW_NUMBER() dedup from Pattern 1 on the staging data first, then MERGE the clean result. Dedup, then merge. :::

Pattern 3 — Idempotent loads by partition (delete-insert)

The third pattern doesn't need MERGE at all and is the dominant pattern on the data lake: load data one partition at a time, and make each load replace its partition rather than append to it. ShopFlow's orders land partitioned by order date, so a daily pipeline keys on that date:

-- Idempotent daily load of ShopFlow orders: blow away the day, then re-insert it.
DELETE FROM orders WHERE order_date = DATE '2026-01-15';
INSERT INTO orders
SELECT * FROM stg_orders WHERE order_date = DATE '2026-01-15';

Re-running for 2026-01-15 deletes and rewrites only that day's orders, so the table is identical no matter how many times the task retries. (Here order_date is the partition column derived from order_ts at landing time.) Table formats like Iceberg and Delta make this an atomic INSERT OVERWRITE/REPLACE on the partition. The durable idea: scope each write to a partition and make it idempotent within that scope.

Querying semi-structured data

Raw data rarely arrives as tidy columns. Event streams, API payloads, and logs arrive as JSON; warehouses store these in a flexible column type — Snowflake calls it VARIANT, BigQuery has JSON/STRUCT/ARRAY, DuckDB and Spark have STRUCT/MAP/LIST. A data engineer's job is to query into and flatten these without first writing a parsing program. SQL can do it.

Reaching into nested fields

You navigate JSON/struct fields with path syntax (the exact operator is dialect-specific, but the idea is universal):

-- conceptually identical across engines; syntax varies
SELECT
payload:user.id AS user_id, -- Snowflake VARIANT path
payload['user']['country'] AS country, -- DuckDB / bracket style
json_value(payload, '$.user.id') AS uid -- standard JSON_VALUE
FROM raw_events;

The durable skill is reading a pathuser.id means "the id field inside the user object." Cast the result to a real type (::INT, ::STRING) so downstream queries get a clean column.

Arrays and UNNEST / lateral joins

The hard part is arrays: one row contains a list of items and you need one output row per item. ShopFlow's app emits each order as a JSON document with its line items nested inside — exactly the shape you must flatten into the flat order_items table from the contract. The operation that turns an array into rows is UNNEST (also called EXPLODE in Spark, FLATTEN in Snowflake):

-- raw: one row per order, items is an array of structs
-- raw_orders
-- order_id │ items
-- 1 │ [ {product_id:42, quantity:2}, {product_id:77, quantity:1} ]

SELECT o.order_id, i.product_id, i.quantity
FROM raw_orders AS o,
UNNEST(o.items) AS i; -- one row per line item

-- order_id │ product_id │ quantity
-- 1 │ 42 │ 2
-- 1 │ 77 │ 1

That output is precisely the order_items grain (one row per order_id × product_id) — flattening the nested order payload into line-item rows is how ShopFlow's ingestion (Chapter 6) populates order_items.

This is a lateral join: UNNEST(o.items) is evaluated per row of orders, using that row's items — a join where the right side depends on the left. (Some engines spell it CROSS JOIN UNNEST(...) or LATERAL FLATTEN(input => items); same operation.) Flattening arrays into rows is one of the most common real ingestion tasks, and "how do I explode this JSON array into rows?" is a frequent interview question. The answer is always UNNEST + a lateral join.

1 rowitems = [A, B, C]3 rowsA | B | CUNNEST / EXPLODE

To go the other way — collapse many rows back into one array/struct — use aggregation functions like ARRAY_AGG, STRUCT, or MAP. Flatten on the way in, re-nest on the way out.

Why it matters

Pipelines re-run, and data arrives dirty and nested — so the SQL that survives production is idempotent and comfortable with semi-structured data. QUALIFY ROW_NUMBER() with a deterministic order gives you one clean row per key; MERGE and partition delete-insert make loads safe to retry without duplicates; UNNEST/lateral joins crack open the JSON and arrays that real sources emit. These patterns are the difference between a pipeline that quietly double-counts after a retry and one you can re-run with total confidence.

Common pitfalls

  • Non-deterministic dedup. A ROW_NUMBER ordering with ties keeps a different row each run. Always add a unique tiebreaker.
  • MERGE on a duplicated source. Dedup the staging side first, or the merge is ambiguous.
  • Appending in a retry-able task. Plain INSERT in a step that can re-run is the classic double-load bug. Use upsert or partition-replace.
  • Forgetting to cast extracted JSON. payload:amount is often a VARIANT/string; compare or sum it without ::NUMERIC and you get string sorting or errors.
  • Exploding then accidentally fanning out a join. After UNNEST, your row count multiplies; if you later SUM an order-level field (like orders.amount) you'll over-count it once per line item. Aggregate at the right grain.

Checkpoint

  1. You have raw.orders with occasional duplicate order_ids. Write the clause that keeps exactly one row per order_id — the latest by order_ts — and say what makes it deterministic.
  2. Why must you dedup the source before a MERGE into fact_sales?
  3. A raw_orders row has items as an array of {product_id, quantity}. Write the query producing one row per line item, and name the operation.
Answers
  1. QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_ts DESC, <unique_tiebreaker> DESC) = 1. The unique tiebreaker (e.g. ingestion_id) makes the ordering deterministic, so every run keeps the same row.
  2. MERGE matches each target row to one source row per key (here the (order_id, product_id) grain); if the source has duplicates it can't decide which to apply (error or wrong update). Dedup first, then merge.
  3. SELECT o.order_id, i.product_id, i.quantity FROM raw_orders o, UNNEST(o.items) AS i; — the operation is UNNEST (a.k.a. EXPLODE/FLATTEN), evaluated as a lateral join.