Testing pipelines: the data testing pyramid & data-diff in CI
The previous lesson gave you the six quality dimensions and turned each into an assertion — a pass/fail check on the data. That answers "is the data good?" But there's a second, equally important question a serious data team has to answer on every change: "is this pipeline code correct, and does my change do only what I intended?" Those are different questions. A not_null test on a production table tells you nothing about whether the SQL you just edited still computes revenue correctly — or whether it silently changed 40,000 rows you never meant to touch.
This lesson is about testing the pipeline itself, the way software engineers test code: a layered strategy called the testing pyramid, adapted for data. We'll run it on ShopFlow (ShopFlow — see Meet ShopFlow) — testing a customer_tiers model built on top of its canonical stg_customers and stg_orders staging tables. We'll build it from the bottom up — unit tests on transformation logic, integration tests against a staging warehouse, dbt's generic/singular tests in their proper place — and then add the technique that's specific to data and that interviewers increasingly ask about: data-diffing in CI, where you compare the output your changed code produces against the output production currently produces, and review the difference before you merge. The dbt CI/CD lesson covered slim CI and deferral — the mechanics of running dbt on a pull request. This lesson is the test strategy that runs inside that machinery.
:::note Two meanings of "test" — keep them straight Data tests (last lesson) assert things about rows: "no nulls", "freshness < 6h". Pipeline tests (this lesson) assert things about code: "given these inputs, the transform produces these outputs". Both matter; they catch different failures. A column might be perfectly non-null while your transform quietly joins on the wrong key. This lesson is mostly about the second kind, plus the data-diff technique that bridges them. :::
The testing pyramid, translated to data
In software, the testing pyramid is a rule of thumb about how many tests of each kind to write: lots of fast, cheap unit tests at the bottom, fewer integration tests in the middle, and a small number of slow, expensive end-to-end tests at the top. The shape is a pyramid because the cheaper a test is to run, the more of them you can afford and the faster they give feedback.
The same shape works for data pipelines, with the layers redefined for our world:
A fixture, used at every layer, is a small, fixed set of input rows you control completely — you wrote them, so you know exactly what the correct output is. Fixtures are the foundation of all pipeline testing: a test is only meaningful if you know the right answer in advance, and on real production data you usually don't.
The durable idea: push as much verification as possible down to the cheap, fast layer, and reserve the slow layers for the failures only they can catch. Let's build each layer.
Unit tests: one transform, known input → expected output
A unit test isolates the smallest meaningful piece of logic — a single transformation — and checks it against fixtures with a known correct answer. No warehouse, no real data, no network: you feed in a handful of rows you constructed, run the transform, and assert the output equals what you computed by hand.
Say a transform classifies a customer's tier from their total spend:
def assign_tier(spend: float) -> str:
if spend >= 1000: return "gold"
if spend >= 100: return "silver"
return "bronze"
The unit test pins down the behavior — including the boundaries, where bugs hide:
def test_assign_tier_boundaries():
assert assign_tier(0) == "bronze"
assert assign_tier(99.99) == "bronze" # just below the silver cutoff
assert assign_tier(100) == "silver" # exactly the cutoff (is it >= or >?)
assert assign_tier(1000) == "gold" # exactly the gold cutoff
That 100 case is the whole point: a unit test is where you encode the decision "is the cutoff inclusive?" so a future edit can't silently flip it. The same applies to SQL transforms — a CASE expression, a window function, a dedup. You don't always have a Python function to call; in SQL the unit-test pattern is run the transform against a tiny seeded input table and compare to a tiny expected output table. dbt supports exactly this with dbt unit tests (since dbt 1.8): you provide mock input rows and the expected output, and dbt runs the model's compiled SQL against the mock data — no production tables involved.
# a dbt unit test: feed mock rows, assert the model's output
unit_tests:
- name: test_tier_assignment
model: customer_tiers
given:
- input: ref('stg_customers')
rows:
- {customer_id: 1, total_spend: 99.99}
- {customer_id: 2, total_spend: 100}
- {customer_id: 3, total_spend: 1500}
expect:
rows:
- {customer_id: 1, tier: 'bronze'}
- {customer_id: 2, tier: 'silver'}
- {customer_id: 3, tier: 'gold'}
:::tip Unit tests are the only place you test logic, not data Every other layer mixes your code with real(ish) data, so a failure is ambiguous — bad code or bad data? A unit test on fixtures removes that ambiguity entirely: the input is fixed, so any failure is a logic bug, full stop. Write them for the gnarly transforms — boundary conditions, deduplication keys, currency/timezone math, the CASE statement everyone gets wrong — not for trivial column renames. :::
Integration tests: run the pipeline on a staging warehouse
Unit tests prove each piece is correct in isolation. They cannot prove the pieces fit together: that stg_orders and stg_customers actually join on a compatible key, that the warehouse accepts the schema your models produce, that a multi-step DAG runs front-to-back without an error. That's the job of an integration test — running the assembled pipeline end-to-end against a real (but non-production) warehouse, on sample data, and asserting properties of the result.
A staging warehouse (or a throwaway CI schema — the environments idea from Chapter 7) is a real instance of your warehouse engine, separate from production, that exists so tests can run real queries without touching real data or real consumers. You seed it with a small, representative sample dataset, run the pipeline, and check invariants — properties that must hold no matter what the exact data is:
- Row-count invariants — "the orders mart has exactly as many rows as there are distinct
order_ids in the seed" (no fan-out from a bad join), or "≥ 1 row" (the pipeline produced something). - Schema invariants — "the mart has these columns, with these types" (a downstream consumer depends on them; a rename or type change should fail the build, not surprise them).
- Key invariants — "
order_idis unique in the output" (the join didn't duplicate), "everycustomer_idin the mart exists in the customers seed" (referential integrity held through the whole DAG).
These are the same kinds of assertions as runtime data tests — but here they run on controlled sample data in CI, so a failure unambiguously means your pipeline code is broken, not that today's data was weird. That's the distinction the last lesson drew as CI vs runtime, now made concrete: integration tests are CI tests that exercise the whole assembled pipeline, not just one model.
Where dbt's generic and singular tests fit
You met dbt tests in Chapter 7 and again last lesson. It's worth placing them precisely in the pyramid, because they're easy to over-claim. dbt offers two kinds:
- Generic tests — the reusable, parameterized ones (
not_null,unique,accepted_values,relationships, plus packaged ones from dbt-utils/dbt-expectations). You attach them to columns in YAML; dbt compiles each into a query that returns failing rows. - Singular tests — a one-off
.sqlfile containing a query that should return zero rows; any rows it returns are failures. This is how you express a bespoke assertion that no generic test covers ("no order has aship_datebefore itsorder_date").
The key insight: dbt generic and singular tests assert on the output of a model, not on its logic in isolation. They run on whatever rows are present — sample data in CI, full data at runtime. That makes them a superb fit for the integration layer (run on staging sample data, they check key/schema/referential invariants) and the runtime layer (run on production data, they catch bad data). They are not unit tests: they don't feed fixed inputs and check fixed outputs; they assert properties of whatever the model produced. So the pyramid for a dbt project is:
| Layer | What runs | Catches |
|---|---|---|
| Unit | dbt unit tests (mock input → expected output) | logic bugs in a single model |
| Integration | the dbt build on staging sample data + generic/singular tests | assembly failures, schema/key/ref invariants |
| Data-diff | build on a PR, diff output vs prod | unintended changes to rows your test list never anticipated |
That third row is the one most teams are missing — and it's next.
Data-diffing in CI: comparing your PR's output to production
Here's the failure mode that unit and integration tests structurally cannot catch. You change a transform — you think you're only fixing the gold tier cutoff. Your unit tests pass (you updated them). Your integration tests pass (the pipeline still assembles, keys are still unique). You merge. A week later finance notices that silver-tier counts dropped 12% — because your change had a side effect you never imagined, on rows no test was written to watch. Tests check the things you thought to check. The change you didn't intend, by definition, isn't on that list.
Data-diffing closes this gap. The idea: build your changed pipeline's output in a CI schema, build (or reuse) production's current output, and compute the difference between the two datasets — then put that diff in front of a human reviewer before the merge. Instead of asking "do my assertions pass?", data-diff asks the more honest question: "exactly what does my change do to the data, row by row and column by column?"
A data-diff reports two complementary things:
- Schema-level diff — columns added, dropped, renamed, or retyped between prod and the PR. ("You added
tier_v2and droppedtier— is that intended? Three downstream modelsref()tier.") - Value-level diff — for matching primary keys, which rows changed, which were added, which were removed, and for changed rows, which columns differ and by how much. ("40,120 rows changed
tier; of those, 38,900 went bronze→silver, which you expected; but 1,220 went silver→bronze, which you did not — investigate.")
The tooling in 2026: data-diff (the open-source library) and Datafold (its commercial platform) are the names most associated with this, computing efficient cross-database diffs and posting a summary as a pull-request comment. dbt has its own flavors of the same pattern — dbt-audit-helper's compare_relations / compare_all_columns macros, and "build in CI, diff against the deferred prod relation". The brand is the dated part; the durable technique is diff the PR's data output against prod's, surface it in review, and gate the merge on a human understanding the blast radius.
:::tip Why data-diff catches what tests miss A test encodes a hypothesis you had ("revenue should never be negative"). A data-diff requires no hypothesis — it shows you the actual effect of your change against the baseline, including effects you'd never have predicted. That's why a mature data CI runs both: assertions for the failures you can name, and a data-diff for the regressions you can't. The diff is also the single best code-review artifact a data team produces — a reviewer who sees "this PR changes 1,220 rows in an unexpected direction" reviews far better than one reading raw SQL. :::
Gating the merge: tying it together in CI
Each layer is only valuable if a failure actually stops the merge. Gating means configuring your CI/CD (GitHub Actions, GitLab CI, etc.) so the pull request cannot be merged unless the checks pass — a required status check on the branch. A typical data-pipeline CI job, in order, fastest-first (so cheap failures stop you before expensive steps run):
- Lint / compile — does the SQL parse and the dbt project compile? (seconds)
- Unit tests — do the transforms produce expected output on fixtures? (seconds)
- Integration build + tests — does the pipeline assemble on the staging schema and hold its invariants? (minutes — this is where slim CI +
--deferkeeps it cheap by building onlystate:modified+) - Data-diff — what does the change do versus prod? Posted as a PR comment for the reviewer. (minutes)
Steps 1–3 gate automatically (red = blocked). Step 4 is usually informational-but-required-to-read — the diff rarely auto-fails, because some rows are supposed to change; the judgment "is this the intended change?" is the reviewer's. That human gate on the diff is the point.
A traced worked example
Walk a single pull request through the whole pyramid. The change: raise the gold tier cutoff from $1000 to $1500 of total spend. The author believes this only moves some customers from gold down to silver.
-
Unit test (fixtures, seconds). The author updates
test_tier_assignment: a$1500customer should now begold, a$1200customer should now besilver(was gold). dbt runs the model's SQL on these mock rows. Green — the boundary logic is correct. What this proved: the transform's logic does what the author intends in isolation. What it can't prove: the effect on real customers, or on anything downstream. -
Integration run (staging warehouse, minutes). CI seeds the staging schema with a sample of ~5,000 customers, runs the full pipeline (
state:modified+, deferring unchanged upstream models to prod), then runs the generic/singular tests.customer_idis still unique in the mart; the schema is unchanged (still atiercolumn, type text); row count equals distinct customers. Green — the pipeline assembles correctly and its invariants hold. What this proved: nothing broke structurally. What it can't prove: whether the values changed the way the author expects. -
Data-diff vs prod (minutes). CI diffs the PR's
customer_tiersoutput against production's, joined oncustomer_id. The PR comment reports:- Schema diff: none. ✅
- Value diff:
4,302rows changedtier. Breakdown:4,180went gold → silver (expected — that's the whole change). But122rows wentsilver → bronze. That direction has nothing to do with the gold cutoff.
-
Reviewer decision. The
gold→silvermovement is exactly the intended effect — approve that part. Butsilver→bronzeis a surprise: raising the gold cutoff should never touch the silver/bronze boundary. The reviewer blocks the PR and asks why. Investigation finds the author had also "tidied" the silver branch and accidentally changed>= 100to> 100, knocking every customer with exactly$100spend down to bronze. No unit test covered the exact-100silver case after the edit, and no integration invariant was violated — only the data-diff's value-level comparison against prod surfaced the 122-row regression. The author fixes the off-by-one, the diff comes back clean (only the expectedgold→silverrows), and the PR merges.
That 122-row regression is the entire argument for data-diffing: it was invisible to assertion-based testing and visible immediately in a value-level diff against the baseline.
Common pitfalls
- Testing only the happy path. Unit tests that check
$1500 → goldbut never the boundaries ($100,$1000,$0, nulls, negatives) — exactly where the bugs in the traced example lived. The boundaries and the weird inputs are the point of unit testing. - No staging data, so no integration layer. Teams with unit tests and runtime tests but nothing in between never run the assembled pipeline until production. Joins that fan out, schema mismatches, and DAG-ordering bugs then surface on real consumers instead of in CI.
- Flaky tests on live data. Pointing "integration" tests at a live, changing source means they pass or fail based on today's data, not your code — so they go red randomly, the team learns to ignore red, and the gate becomes theater. Integration tests must run on fixed sample/seed data you control.
- Asserting exact values that legitimately drift. Hard-coding
assert revenue == 48211.50against real data, or auto-failing a data-diff on any changed row. Real data and intended changes move numbers constantly; such tests cry wolf until everyone ignores them. Assert invariants and bounds (uniqueness, ranges, referential integrity, "row count within ±5% of yesterday"), and treat the data-diff as a human-reviewed signal, not an auto-fail. - Confusing data tests with pipeline tests. A green wall of
not_null/uniquetests feels like coverage, but it only checks data, never your logic or your change's blast radius. The traced regression passed every column test. You need the pipeline-test layers on top of data tests.
Quiz
Pipeline testing & data-diff in CI
Pass to unlock the Next button belowWhy it matters
Testing data means testing two different things: the data (last lesson's assertions) and the pipeline code (this lesson). For the code, the testing pyramid gives the strategy: many fast unit tests that pin transformation logic to fixtures with known outputs (the only place you test logic free of data ambiguity), fewer integration tests that run the assembled pipeline on a staging warehouse with sample data and assert row-count, schema, and key invariants, and dbt's generic/singular tests sitting in the integration and runtime layers (asserting on a model's output, not its isolated logic). On top sits data-diffing — comparing your PR's output to production's, schema- and value-level, and reviewing it before merge — which catches the unintended regressions no assertion was written to find, exactly like the 122-row off-by-one in the traced example. All of it is only real if it gates the merge: required CI checks, fastest-first, with the data-diff as the human-judged final signal. Tools (dbt unit tests, data-diff/Datafold, dbt-audit-helper) are the dated part; the durable part is the layered strategy and the diff-before-merge discipline. Next, we push the strongest checks all the way upstream to the producer, as a contract.