Idempotent SQL & semi-structured data
Two things separate SQL you write for a report from SQL you write for a pipeline. First, a pipeline runs over and over — on a schedule, after failures, during backfills — so the same SQL must be safe to re-run without creating duplicates or double-counting. Second, real source data is rarely clean tables; it arrives as JSON blobs, arrays, and nested structures that you must crack open inside SQL. This lesson teaches both, because both show up on day one of a real data-engineering job and in almost every interview.
Idempotency: the property that makes re-running safe
Idempotent means running an operation once or many times produces the same result. Pressing a crosswalk button ten times is idempotent — the light changes once. Appending a row ten times is not — you get ten rows.
This matters because pipelines fail and retry. A task dies halfway, the orchestrator (Chapter 8) retries it, and a non-idempotent step now double-loads yesterday's data. Idempotent SQL is how you make "run it again" always safe. Three patterns deliver it.
Pattern 1 — Deduplication with QUALIFY ROW_NUMBER()
The most common impurity in raw data is duplicate rows for the same logical key — the same ShopFlow order (ShopFlow — see Meet ShopFlow) lands twice in raw.orders because an upstream system retried, or change-data-capture (Chapter 6) emitted the same record more than once. You want one row per key — one row per order_id — and usually the latest one.
The previous lesson gave you the tool: ROW_NUMBER() partitioned by the key, ordered so the row you want is number 1. The clean way to keep only that row is QUALIFY — a clause that filters on a window function the way WHERE filters on columns and HAVING filters on groups:
SELECT *
FROM raw.orders
QUALIFY ROW_NUMBER() OVER (
PARTITION BY order_id -- one group per logical key
ORDER BY order_ts DESC -- newest first
) = 1;
This returns exactly one row per order_id — the most recently placed/updated. Trace it: within each order_id partition the rows are numbered by order_ts descending, so the newest gets 1; QUALIFY … = 1 keeps only those.
:::note QUALIFY is the missing fourth filter
SQL's filter clauses run at different stages: WHERE (rows, before grouping) → GROUP BY → HAVING (groups) → window functions → QUALIFY (window results). QUALIFY is supported in Snowflake, BigQuery, DuckDB, Databricks/Spark, and Trino. On an engine that lacks it (e.g. PostgreSQL), wrap the window in a CTE and filter in an outer WHERE rn = 1 — identical result, two more lines.
:::
A subtle but critical detail: make the ordering deterministic. If two rows tie on order_ts, ROW_NUMBER breaks the tie arbitrarily, so two runs can keep different rows — your output is no longer reproducible. Add a tiebreaker that is guaranteed unique (here the per-record ingestion id the loader stamps on each landed row):
QUALIFY ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY order_ts DESC, ingestion_id DESC -- unique tiebreaker → deterministic
) = 1
Deterministic ordering — a sort that always produces the same result given the same input — is what makes a dedup reproducible and therefore idempotent.
Pattern 2 — MERGE / UPSERT
Appending data is not idempotent; an upsert is. UPSERT = update if the key exists, insert if it doesn't. The standard SQL statement for it is MERGE, which reconciles a target table against a source in one atomic operation. Here we upsert ShopFlow line items into the fact_sales table — grain: one order line item, so its natural key is the pair (order_id, product_id) — from a deduped stg_order_items source:
MERGE INTO fact_sales AS t
USING stg_order_items AS s
ON t.order_id = s.order_id
AND t.product_id = s.product_id -- the (order_id, product_id) grain
WHEN MATCHED THEN
UPDATE SET quantity = s.quantity,
unit_price = s.unit_price,
line_revenue = s.quantity * s.unit_price
WHEN NOT MATCHED THEN
INSERT (order_id, product_id, quantity, unit_price, line_revenue)
VALUES (s.order_id, s.product_id, s.quantity, s.unit_price,
s.quantity * s.unit_price);
Run this once or a hundred times with the same staging data and fact_sales ends up identical: existing line items updated in place, new ones inserted, no duplicates. That is idempotency by construction. MERGE is the backbone of incremental fact loads and slowly-changing dimensions (Chapter 4), and it's a first-class statement in Snowflake, BigQuery, Spark SQL, and the lakehouse table formats of Chapter 10.
:::warning Dedup your source before you MERGE
MERGE errors (or silently mis-updates, depending on engine) if the source has more than one row per join key — it doesn't know which one to apply. The standard fix is to run the QUALIFY ROW_NUMBER() dedup from Pattern 1 on the staging data first, then MERGE the clean result. Dedup, then merge.
:::
Pattern 3 — Idempotent loads by partition (delete-insert)
The third pattern doesn't need MERGE at all and is the dominant pattern on the data lake: load data one partition at a time, and make each load replace its partition rather than append to it. ShopFlow's orders land partitioned by order date, so a daily pipeline keys on that date:
-- Idempotent daily load of ShopFlow orders: blow away the day, then re-insert it.
DELETE FROM orders WHERE order_date = DATE '2026-01-15';
INSERT INTO orders
SELECT * FROM stg_orders WHERE order_date = DATE '2026-01-15';
Re-running for 2026-01-15 deletes and rewrites only that day's orders, so the table is identical no matter how many times the task retries. (Here order_date is the partition column derived from order_ts at landing time.) Table formats like Iceberg and Delta make this an atomic INSERT OVERWRITE/REPLACE on the partition. The durable idea: scope each write to a partition and make it idempotent within that scope.
Querying semi-structured data
Raw data rarely arrives as tidy columns. Event streams, API payloads, and logs arrive as JSON; warehouses store these in a flexible column type — Snowflake calls it VARIANT, BigQuery has JSON/STRUCT/ARRAY, DuckDB and Spark have STRUCT/MAP/LIST. A data engineer's job is to query into and flatten these without first writing a parsing program. SQL can do it.
Reaching into nested fields
You navigate JSON/struct fields with path syntax (the exact operator is dialect-specific, but the idea is universal):
-- conceptually identical across engines; syntax varies
SELECT
payload:user.id AS user_id, -- Snowflake VARIANT path
payload['user']['country'] AS country, -- DuckDB / bracket style
json_value(payload, '$.user.id') AS uid -- standard JSON_VALUE
FROM raw_events;
The durable skill is reading a path — user.id means "the id field inside the user object." Cast the result to a real type (::INT, ::STRING) so downstream queries get a clean column.
Arrays and UNNEST / lateral joins
The hard part is arrays: one row contains a list of items and you need one output row per item. ShopFlow's app emits each order as a JSON document with its line items nested inside — exactly the shape you must flatten into the flat order_items table from the contract. The operation that turns an array into rows is UNNEST (also called EXPLODE in Spark, FLATTEN in Snowflake):
-- raw: one row per order, items is an array of structs
-- raw_orders
-- order_id │ items
-- 1 │ [ {product_id:42, quantity:2}, {product_id:77, quantity:1} ]
SELECT o.order_id, i.product_id, i.quantity
FROM raw_orders AS o,
UNNEST(o.items) AS i; -- one row per line item
-- order_id │ product_id │ quantity
-- 1 │ 42 │ 2
-- 1 │ 77 │ 1
That output is precisely the order_items grain (one row per order_id × product_id) — flattening the nested order payload into line-item rows is how ShopFlow's ingestion (Chapter 6) populates order_items.
This is a lateral join: UNNEST(o.items) is evaluated per row of orders, using that row's items — a join where the right side depends on the left. (Some engines spell it CROSS JOIN UNNEST(...) or LATERAL FLATTEN(input => items); same operation.) Flattening arrays into rows is one of the most common real ingestion tasks, and "how do I explode this JSON array into rows?" is a frequent interview question. The answer is always UNNEST + a lateral join.
To go the other way — collapse many rows back into one array/struct — use aggregation functions like ARRAY_AGG, STRUCT, or MAP. Flatten on the way in, re-nest on the way out.
Why it matters
Pipelines re-run, and data arrives dirty and nested — so the SQL that survives production is idempotent and comfortable with semi-structured data. QUALIFY ROW_NUMBER() with a deterministic order gives you one clean row per key; MERGE and partition delete-insert make loads safe to retry without duplicates; UNNEST/lateral joins crack open the JSON and arrays that real sources emit. These patterns are the difference between a pipeline that quietly double-counts after a retry and one you can re-run with total confidence.
Common pitfalls
- Non-deterministic dedup. A
ROW_NUMBERordering with ties keeps a different row each run. Always add a unique tiebreaker. MERGEon a duplicated source. Dedup the staging side first, or the merge is ambiguous.- Appending in a retry-able task. Plain
INSERTin a step that can re-run is the classic double-load bug. Use upsert or partition-replace. - Forgetting to cast extracted JSON.
payload:amountis often aVARIANT/string; compare or sum it without::NUMERICand you get string sorting or errors. - Exploding then accidentally fanning out a join. After
UNNEST, your row count multiplies; if you laterSUMan order-level field (likeorders.amount) you'll over-count it once per line item. Aggregate at the right grain.
Cross-links
- Builds on: Analytic SQL —
QUALIFYand dedup are window functions applied. - Used by: Ingestion & integration (idempotent loads, schema drift), Data modeling (
MERGEpowers slowly-changing dimensions), and the lakehouse (atomic partition overwrites). - Next: Inside a query engine → — how the planner turns all this SQL into an execution plan.
Checkpoint
- You have
raw.orderswith occasional duplicateorder_ids. Write the clause that keeps exactly one row perorder_id— the latest byorder_ts— and say what makes it deterministic. - Why must you dedup the source before a
MERGEintofact_sales? - A
raw_ordersrow hasitemsas an array of{product_id, quantity}. Write the query producing one row per line item, and name the operation.
Answers
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_ts DESC, <unique_tiebreaker> DESC) = 1. The unique tiebreaker (e.g.ingestion_id) makes the ordering deterministic, so every run keeps the same row.MERGEmatches each target row to one source row per key (here the(order_id, product_id)grain); if the source has duplicates it can't decide which to apply (error or wrong update). Dedup first, then merge.SELECT o.order_id, i.product_id, i.quantity FROM raw_orders o, UNNEST(o.items) AS i;— the operation isUNNEST(a.k.a. EXPLODE/FLATTEN), evaluated as a lateral join.