Skip to main content

Testing pipelines: the data testing pyramid & data-diff in CI

The previous lesson gave you the six quality dimensions and turned each into an assertion — a pass/fail check on the data. That answers "is the data good?" But there's a second, equally important question a serious data team has to answer on every change: "is this pipeline code correct, and does my change do only what I intended?" Those are different questions. A not_null test on a production table tells you nothing about whether the SQL you just edited still computes revenue correctly — or whether it silently changed 40,000 rows you never meant to touch.

This lesson is about testing the pipeline itself, the way software engineers test code: a layered strategy called the testing pyramid, adapted for data. We'll run it on ShopFlow (ShopFlow — see Meet ShopFlow) — testing a customer_tiers model built on top of its canonical stg_customers and stg_orders staging tables. We'll build it from the bottom up — unit tests on transformation logic, integration tests against a staging warehouse, dbt's generic/singular tests in their proper place — and then add the technique that's specific to data and that interviewers increasingly ask about: data-diffing in CI, where you compare the output your changed code produces against the output production currently produces, and review the difference before you merge. The dbt CI/CD lesson covered slim CI and deferral — the mechanics of running dbt on a pull request. This lesson is the test strategy that runs inside that machinery.

:::note Two meanings of "test" — keep them straight Data tests (last lesson) assert things about rows: "no nulls", "freshness < 6h". Pipeline tests (this lesson) assert things about code: "given these inputs, the transform produces these outputs". Both matter; they catch different failures. A column might be perfectly non-null while your transform quietly joins on the wrong key. This lesson is mostly about the second kind, plus the data-diff technique that bridges them. :::

The testing pyramid, translated to data

In software, the testing pyramid is a rule of thumb about how many tests of each kind to write: lots of fast, cheap unit tests at the bottom, fewer integration tests in the middle, and a small number of slow, expensive end-to-end tests at the top. The shape is a pyramid because the cheaper a test is to run, the more of them you can afford and the faster they give feedback.

The same shape works for data pipelines, with the layers redefined for our world:

END-TO-END /DATA-DIFFrun the realpipeline on a PR;diff output vsproduction<i>few, slowest,most realistic</i>INTEGRATION TESTSrun the pipeline ona staging warehousewith sample data;assert row counts,schema, keyinvariants<i>some, mediumcost</i>UNIT TESTSone transform ontiny fixtureswith known input →expected output<i>many, fastest,cheapest</i>

A fixture, used at every layer, is a small, fixed set of input rows you control completely — you wrote them, so you know exactly what the correct output is. Fixtures are the foundation of all pipeline testing: a test is only meaningful if you know the right answer in advance, and on real production data you usually don't.

The durable idea: push as much verification as possible down to the cheap, fast layer, and reserve the slow layers for the failures only they can catch. Let's build each layer.

Unit tests: one transform, known input → expected output

A unit test isolates the smallest meaningful piece of logic — a single transformation — and checks it against fixtures with a known correct answer. No warehouse, no real data, no network: you feed in a handful of rows you constructed, run the transform, and assert the output equals what you computed by hand.

Say a transform classifies a customer's tier from their total spend:

def assign_tier(spend: float) -> str:
if spend >= 1000: return "gold"
if spend >= 100: return "silver"
return "bronze"

The unit test pins down the behavior — including the boundaries, where bugs hide:

def test_assign_tier_boundaries():
assert assign_tier(0) == "bronze"
assert assign_tier(99.99) == "bronze" # just below the silver cutoff
assert assign_tier(100) == "silver" # exactly the cutoff (is it >= or >?)
assert assign_tier(1000) == "gold" # exactly the gold cutoff

That 100 case is the whole point: a unit test is where you encode the decision "is the cutoff inclusive?" so a future edit can't silently flip it. The same applies to SQL transforms — a CASE expression, a window function, a dedup. You don't always have a Python function to call; in SQL the unit-test pattern is run the transform against a tiny seeded input table and compare to a tiny expected output table. dbt supports exactly this with dbt unit tests (since dbt 1.8): you provide mock input rows and the expected output, and dbt runs the model's compiled SQL against the mock data — no production tables involved.

# a dbt unit test: feed mock rows, assert the model's output
unit_tests:
- name: test_tier_assignment
model: customer_tiers
given:
- input: ref('stg_customers')
rows:
- {customer_id: 1, total_spend: 99.99}
- {customer_id: 2, total_spend: 100}
- {customer_id: 3, total_spend: 1500}
expect:
rows:
- {customer_id: 1, tier: 'bronze'}
- {customer_id: 2, tier: 'silver'}
- {customer_id: 3, tier: 'gold'}

:::tip Unit tests are the only place you test logic, not data Every other layer mixes your code with real(ish) data, so a failure is ambiguous — bad code or bad data? A unit test on fixtures removes that ambiguity entirely: the input is fixed, so any failure is a logic bug, full stop. Write them for the gnarly transforms — boundary conditions, deduplication keys, currency/timezone math, the CASE statement everyone gets wrong — not for trivial column renames. :::

Integration tests: run the pipeline on a staging warehouse

Unit tests prove each piece is correct in isolation. They cannot prove the pieces fit together: that stg_orders and stg_customers actually join on a compatible key, that the warehouse accepts the schema your models produce, that a multi-step DAG runs front-to-back without an error. That's the job of an integration test — running the assembled pipeline end-to-end against a real (but non-production) warehouse, on sample data, and asserting properties of the result.

A staging warehouse (or a throwaway CI schema — the environments idea from Chapter 7) is a real instance of your warehouse engine, separate from production, that exists so tests can run real queries without touching real data or real consumers. You seed it with a small, representative sample dataset, run the pipeline, and check invariants — properties that must hold no matter what the exact data is:

  • Row-count invariants — "the orders mart has exactly as many rows as there are distinct order_ids in the seed" (no fan-out from a bad join), or "≥ 1 row" (the pipeline produced something).
  • Schema invariants — "the mart has these columns, with these types" (a downstream consumer depends on them; a rename or type change should fail the build, not surprise them).
  • Key invariants — "order_id is unique in the output" (the join didn't duplicate), "every customer_id in the mart exists in the customers seed" (referential integrity held through the whole DAG).

These are the same kinds of assertions as runtime data tests — but here they run on controlled sample data in CI, so a failure unambiguously means your pipeline code is broken, not that today's data was weird. That's the distinction the last lesson drew as CI vs runtime, now made concrete: integration tests are CI tests that exercise the whole assembled pipeline, not just one model.

("sample seed data(fixtures, small)")run the FULLpipelineon the stagingwarehouseassert row-countinvariantsassert schema(columns + types)assert keyinvariants(uniqueness, refs)Vpipeline assembles ✅block the PR ❌yesno

Where dbt's generic and singular tests fit

You met dbt tests in Chapter 7 and again last lesson. It's worth placing them precisely in the pyramid, because they're easy to over-claim. dbt offers two kinds:

  • Generic tests — the reusable, parameterized ones (not_null, unique, accepted_values, relationships, plus packaged ones from dbt-utils/dbt-expectations). You attach them to columns in YAML; dbt compiles each into a query that returns failing rows.
  • Singular tests — a one-off .sql file containing a query that should return zero rows; any rows it returns are failures. This is how you express a bespoke assertion that no generic test covers ("no order has a ship_date before its order_date").

The key insight: dbt generic and singular tests assert on the output of a model, not on its logic in isolation. They run on whatever rows are present — sample data in CI, full data at runtime. That makes them a superb fit for the integration layer (run on staging sample data, they check key/schema/referential invariants) and the runtime layer (run on production data, they catch bad data). They are not unit tests: they don't feed fixed inputs and check fixed outputs; they assert properties of whatever the model produced. So the pyramid for a dbt project is:

LayerWhat runsCatches
Unitdbt unit tests (mock input → expected output)logic bugs in a single model
Integrationthe dbt build on staging sample data + generic/singular testsassembly failures, schema/key/ref invariants
Data-diffbuild on a PR, diff output vs produnintended changes to rows your test list never anticipated

That third row is the one most teams are missing — and it's next.

Data-diffing in CI: comparing your PR's output to production

Here's the failure mode that unit and integration tests structurally cannot catch. You change a transform — you think you're only fixing the gold tier cutoff. Your unit tests pass (you updated them). Your integration tests pass (the pipeline still assembles, keys are still unique). You merge. A week later finance notices that silver-tier counts dropped 12% — because your change had a side effect you never imagined, on rows no test was written to watch. Tests check the things you thought to check. The change you didn't intend, by definition, isn't on that list.

Data-diffing closes this gap. The idea: build your changed pipeline's output in a CI schema, build (or reuse) production's current output, and compute the difference between the two datasets — then put that diff in front of a human reviewer before the merge. Instead of asking "do my assertions pass?", data-diff asks the more honest question: "exactly what does my change do to the data, row by row and column by column?"

A data-diff reports two complementary things:

  • Schema-level diff — columns added, dropped, renamed, or retyped between prod and the PR. ("You added tier_v2 and dropped tier — is that intended? Three downstream models ref() tier.")
  • Value-level diff — for matching primary keys, which rows changed, which were added, which were removed, and for changed rows, which columns differ and by how much. ("40,120 rows changed tier; of those, 38,900 went bronze→silver, which you expected; but 1,220 went silver→bronze, which you did not — investigate.")
("PROD outputcurrent main")DATA-DIFFjoin on primary key,compare every column("PR outputyour changed code")schema diff:cols added/dropped/retypedvalue diff:N rows changed,+added / −removed,which columns differRapprove & merge ✅block & investigateyes, expectedno, surprise

The tooling in 2026: data-diff (the open-source library) and Datafold (its commercial platform) are the names most associated with this, computing efficient cross-database diffs and posting a summary as a pull-request comment. dbt has its own flavors of the same pattern — dbt-audit-helper's compare_relations / compare_all_columns macros, and "build in CI, diff against the deferred prod relation". The brand is the dated part; the durable technique is diff the PR's data output against prod's, surface it in review, and gate the merge on a human understanding the blast radius.

:::tip Why data-diff catches what tests miss A test encodes a hypothesis you had ("revenue should never be negative"). A data-diff requires no hypothesis — it shows you the actual effect of your change against the baseline, including effects you'd never have predicted. That's why a mature data CI runs both: assertions for the failures you can name, and a data-diff for the regressions you can't. The diff is also the single best code-review artifact a data team produces — a reviewer who sees "this PR changes 1,220 rows in an unexpected direction" reviews far better than one reading raw SQL. :::

Gating the merge: tying it together in CI

Each layer is only valuable if a failure actually stops the merge. Gating means configuring your CI/CD (GitHub Actions, GitLab CI, etc.) so the pull request cannot be merged unless the checks pass — a required status check on the branch. A typical data-pipeline CI job, in order, fastest-first (so cheap failures stop you before expensive steps run):

  1. Lint / compile — does the SQL parse and the dbt project compile? (seconds)
  2. Unit tests — do the transforms produce expected output on fixtures? (seconds)
  3. Integration build + tests — does the pipeline assemble on the staging schema and hold its invariants? (minutes — this is where slim CI + --defer keeps it cheap by building only state:modified+)
  4. Data-diff — what does the change do versus prod? Posted as a PR comment for the reviewer. (minutes)

Steps 1–3 gate automatically (red = blocked). Step 4 is usually informational-but-required-to-read — the diff rarely auto-fails, because some rows are supposed to change; the judgment "is this the intended change?" is the reviewer's. That human gate on the diff is the point.

A traced worked example

Walk a single pull request through the whole pyramid. The change: raise the gold tier cutoff from $1000 to $1500 of total spend. The author believes this only moves some customers from gold down to silver.

  1. Unit test (fixtures, seconds). The author updates test_tier_assignment: a $1500 customer should now be gold, a $1200 customer should now be silver (was gold). dbt runs the model's SQL on these mock rows. Green — the boundary logic is correct. What this proved: the transform's logic does what the author intends in isolation. What it can't prove: the effect on real customers, or on anything downstream.

  2. Integration run (staging warehouse, minutes). CI seeds the staging schema with a sample of ~5,000 customers, runs the full pipeline (state:modified+, deferring unchanged upstream models to prod), then runs the generic/singular tests. customer_id is still unique in the mart; the schema is unchanged (still a tier column, type text); row count equals distinct customers. Green — the pipeline assembles correctly and its invariants hold. What this proved: nothing broke structurally. What it can't prove: whether the values changed the way the author expects.

  3. Data-diff vs prod (minutes). CI diffs the PR's customer_tiers output against production's, joined on customer_id. The PR comment reports:

    • Schema diff: none. ✅
    • Value diff: 4,302 rows changed tier. Breakdown: 4,180 went gold → silver (expected — that's the whole change). But 122 rows went silver → bronze. That direction has nothing to do with the gold cutoff.
  4. Reviewer decision. The gold→silver movement is exactly the intended effect — approve that part. But silver→bronze is a surprise: raising the gold cutoff should never touch the silver/bronze boundary. The reviewer blocks the PR and asks why. Investigation finds the author had also "tidied" the silver branch and accidentally changed >= 100 to > 100, knocking every customer with exactly $100 spend down to bronze. No unit test covered the exact-100 silver case after the edit, and no integration invariant was violated — only the data-diff's value-level comparison against prod surfaced the 122-row regression. The author fixes the off-by-one, the diff comes back clean (only the expected gold→silver rows), and the PR merges.

That 122-row regression is the entire argument for data-diffing: it was invisible to assertion-based testing and visible immediately in a value-level diff against the baseline.

Common pitfalls

  • Testing only the happy path. Unit tests that check $1500 → gold but never the boundaries ($100, $1000, $0, nulls, negatives) — exactly where the bugs in the traced example lived. The boundaries and the weird inputs are the point of unit testing.
  • No staging data, so no integration layer. Teams with unit tests and runtime tests but nothing in between never run the assembled pipeline until production. Joins that fan out, schema mismatches, and DAG-ordering bugs then surface on real consumers instead of in CI.
  • Flaky tests on live data. Pointing "integration" tests at a live, changing source means they pass or fail based on today's data, not your code — so they go red randomly, the team learns to ignore red, and the gate becomes theater. Integration tests must run on fixed sample/seed data you control.
  • Asserting exact values that legitimately drift. Hard-coding assert revenue == 48211.50 against real data, or auto-failing a data-diff on any changed row. Real data and intended changes move numbers constantly; such tests cry wolf until everyone ignores them. Assert invariants and bounds (uniqueness, ranges, referential integrity, "row count within ±5% of yesterday"), and treat the data-diff as a human-reviewed signal, not an auto-fail.
  • Confusing data tests with pipeline tests. A green wall of not_null/unique tests feels like coverage, but it only checks data, never your logic or your change's blast radius. The traced regression passed every column test. You need the pipeline-test layers on top of data tests.

Quiz

Required checkpoint

Pipeline testing & data-diff in CI

Pass to unlock the Next button below

Why it matters

Testing data means testing two different things: the data (last lesson's assertions) and the pipeline code (this lesson). For the code, the testing pyramid gives the strategy: many fast unit tests that pin transformation logic to fixtures with known outputs (the only place you test logic free of data ambiguity), fewer integration tests that run the assembled pipeline on a staging warehouse with sample data and assert row-count, schema, and key invariants, and dbt's generic/singular tests sitting in the integration and runtime layers (asserting on a model's output, not its isolated logic). On top sits data-diffing — comparing your PR's output to production's, schema- and value-level, and reviewing it before merge — which catches the unintended regressions no assertion was written to find, exactly like the 122-row off-by-one in the traced example. All of it is only real if it gates the merge: required CI checks, fastest-first, with the data-diff as the human-judged final signal. Tools (dbt unit tests, data-diff/Datafold, dbt-audit-helper) are the dated part; the durable part is the layered strategy and the diff-before-merge discipline. Next, we push the strongest checks all the way upstream to the producer, as a contract.

Next: Data contracts & shift-left →