Data quality dimensions & testing
Ask ten engineers "is this data good?" and you'll get ten answers, because "good" is vague. The first real skill of data quality is refusing the vague question and replacing it with specific, measurable ones. The industry has settled on a small vocabulary for this — a handful of dimensions of quality — and once you have them, "is this data good?" turns into a checklist you can actually test.
This lesson defines those dimensions from scratch, turns each into a concrete assertion, shows where those assertions run (and the big distinction of when they run), and closes the gap that most courses miss: quality is not just per-column rules — it's also reconciliation (did all the rows make it across intact?) and idempotency (does re-running the pipeline produce the same answer, not duplicates?). We'll express every assertion on our running example — ShopFlow (ShopFlow — see Meet ShopFlow), whose orders, customers, and order_items source tables you already know.
What "data quality" actually means: the six dimensions
A data quality dimension is one specific aspect of correctness you can measure independently. Data can be perfect on one dimension and broken on another — present but wrong, valid but stale — which is exactly why you need them separated. The widely used set:
| Dimension | The question it answers | A failing example |
|---|---|---|
| Accuracy | Does the value match reality? | A ShopFlow customer's city is recorded as a real, well-formed city — but it's the wrong one. |
| Completeness | Is expected data present (no missing rows/values)? | 4% of orders have a NULL customer_id. |
| Consistency | Does the data agree with itself / other systems? | orders.amount ≠ sum of its matching order_items line revenue; the warehouse fact_sales count ≠ the source count. |
| Timeliness (freshness) | Is the data recent enough to be useful? | The "daily" shopflow_daily revenue output last updated 30 hours ago. |
| Validity | Does the value conform to its rules (type, format, range, allowed set)? | orders.status = "shppd"; an amount of -3; a customers.email with no @. |
| Uniqueness | Is each real-world thing represented once (no dups)? | The same order_id appears twice because a load ran twice. |
:::note Accuracy is the hard one — and the one tests can't fully catch Validity is "the value looks right" (a well-formed email). Accuracy is "the value is right" (it's the customer's actual email). You can test validity with a regex; you usually cannot test accuracy directly, because that requires knowing ground truth. This is why accuracy leans on reconciliation against a trusted source (later in this lesson) and on observability (next lesson) rather than on a single column rule. Confusing the two — "my regex passes, so the data is accurate" — is a classic and dangerous mistake. :::
The first thing this list kills is the most common gap in the whole field: reducing data quality to "add a not_null test." A not_null test covers part of completeness for one column. It says nothing about freshness, uniqueness, consistency, or accuracy. Real quality work means covering all six dimensions deliberately.
From dimension to test: an assertion
A data test (or assertion) is an executable statement of an expectation about the data that either passes or fails when run against real rows. Each dimension becomes one or more assertions:
- Completeness → "
orders.customer_idhas zero nulls" / "theorderstable has ≥ 10,000 rows." - Uniqueness → "
orders.order_idis unique" / "no duplicate (order_id,product_id) pairs inorder_items." - Validity → "
orders.statusis in (placed,paid,shipped,cancelled)" / "customers.emailmatches an email pattern" / "orders.amount≥ 0." - Consistency → "
orders.amount= sum of matchingorder_itemsline revenue" / "everyorders.customer_idexists incustomers(referential integrity)." - Timeliness → "
orders'max(order_ts)is within the last 6 hours." - Accuracy → typically reconciliation: "
fact_salesrow count = sourceorder_itemsrow count (±0)."
The point of writing them down is that a vague worry becomes a check that runs automatically and fails loudly. "I hope there are no dupes" becomes a uniqueness test that turns red the moment a dup appears.
Where tests live: the three workhorse tools
Three tools dominate data testing in 2026. They overlap, and teams often run more than one. What matters is the model each uses — the tool names are dated; the models are durable.
dbt tests — assertions that live next to your models
If you've read Chapter 7, you know dbt turns SQL SELECTs into managed tables. dbt also has built-in tests: you declare them in YAML right beside the model, and dbt compiles each into a query that returns the failing rows — zero rows means pass.
# schema.yml — tests declared alongside the ShopFlow orders model
models:
- name: stg_orders
columns:
- name: order_id
tests:
- not_null # completeness
- unique # uniqueness
- name: status
tests:
- accepted_values: # validity — ShopFlow's order lifecycle
values: ['placed', 'paid', 'shipped', 'cancelled']
- name: customer_id
tests:
- relationships: # consistency / referential integrity
to: ref('stg_customers')
field: customer_id
That single block covers four of the six ShopFlow dimensions: order_id is complete and unique, status is valid (only the four real lifecycle values), and every customer_id resolves to a real customer (consistency — the orders.customer_id → customers foreign key from the contract). dbt ships four generic tests (not_null, unique, accepted_values, relationships) — covering completeness, uniqueness, validity, and consistency out of the box — plus custom and packaged tests (e.g. dbt-utils, dbt-expectations) for ranges, freshness, row counts, and more. Their superpower: tests are versioned in Git with the transformation they test, so the check and the code evolve together.
Great Expectations — a rich, standalone validation library
Great Expectations (GE) is a Python library whose unit is an Expectation — a named, parameterized assertion like expect_column_values_to_not_be_null or expect_column_values_to_be_between. You group Expectations into a Suite, run it against a batch of data, and GE produces a human-readable validation result (and "Data Docs" — generated documentation of what's expected). GE is engine-agnostic (pandas, Spark, SQL warehouses), has a huge built-in vocabulary of expectations, and isn't tied to dbt — so it's common where validation must happen outside the transformation layer (e.g. on raw files at ingestion, before dbt ever runs).
Soda — checks as a readable DSL, built for monitoring
Soda (open-source Soda Core + its cloud) expresses checks in a compact YAML-ish language called SodaCL:
checks for orders:
- row_count > 0 # volume / completeness
- missing_count(customer_id) = 0 # completeness
- duplicate_count(order_id) = 0 # uniqueness
- freshness(order_ts) < 6h # timeliness — newest ShopFlow order
- invalid_count(status) = 0: # validity
valid values: [placed, paid, shipped, cancelled]
Soda's design leans toward scheduled monitoring of warehouses — readable checks, freshness as a first-class idea, easy alert routing — which puts it on the bridge between testing and observability (next lesson).
:::tip How to choose (the durable version) Don't memorize feature tables. The decision is: are your assertions about a model you transform (dbt tests, in-repo, Git-versioned), about arbitrary data at any stage including raw (Great Expectations), or about continuously monitoring tables on a schedule (Soda)? Many teams use dbt tests for transformation logic and a monitoring layer on top. The categories outlive the brands. :::
The dimension every course forgets: when tests run
Two assertions can be identical yet play completely different roles depending on when they execute. This is the CI vs runtime distinction, and missing it is a real gap.
- CI checks (shift-left, on code change): run in your CI/CD pipeline when someone changes the pipeline code — against sample, staging, or a dry run. (CI/CD = Continuous Integration / Continuous Delivery, the automation that tests and ships code on every change, covered for data pipelines in Chapter 12's DataOps lesson.) Their job: stop a bad transformation from ever being merged. Fast, on synthetic/sample data, blocking the PR.
- Runtime checks (on data change): run in production as data flows, against the real, full dataset each load. Their job: catch bad data — a corrupt upstream file, a today-only null spike — that no code change could have predicted. They gate or quarantine the load (a circuit breaker: stop the pipeline before bad data spreads downstream).
You need both. CI catches bugs in your logic before they ship; runtime catches bad data in reality that perfect logic can't prevent. A team with only CI tests ships clean code that still serves garbage the day an upstream system misbehaves.
Beyond columns: reconciliation and idempotency
Column rules check values within a table. Two whole-pipeline checks catch failures that no per-column rule will — and they're exactly what production data jobs require.
Reconciliation — did everything make it across, intact?
Reconciliation is verifying that a target dataset faithfully represents its source after data moved or transformed — the data engineer's version of double-entry bookkeeping. The everyday checks:
- Row counts:
count(target)vscount(source)(allowing for known, expected filtering). A silent drop from 1,000,000 → 980,000 rows is invisible to every column test but obvious to a count. - Control totals: sum a meaningful numeric column on both sides — e.g.
sum(amount)in source vs warehouse. If the totals match, you have strong evidence rows weren't dropped, duplicated, or mangled. (Borrowed straight from accounting and financial batch processing.) - Source-vs-target diffs: for a sample of keys, compare the actual values field-by-field to catch silent transformation bugs (a timezone shift, a truncated string, a misjoined column).
Reconciliation is the practical way you get at accuracy and consistency, the two dimensions a lone column rule can't reach.
Idempotency — does re-running change the answer?
Idempotency means an operation produces the same result whether it runs once or many times. It matters because pipelines retry — a task fails halfway, the orchestrator (Chapter 8) re-runs it, and a non-idempotent load appends the same rows again → duplicates. (This connects directly to delivery semantics from streaming, Chapter 9: at-least-once delivery will re-deliver, so your writes must be idempotent or you'll double-count.)
You both design for idempotency (use MERGE/upsert keyed on a business key, or delete-then-insert a partition, instead of blind INSERT) and test for it:
- A uniqueness/dedup test on the natural key (
duplicate_count(order_id) = 0) catches the symptom. - A re-run check: run the load twice in staging; the row count and control totals must be identical the second time. If they grow, the pipeline isn't idempotent.
:::warning The duplicate that passes every "normal" test
A retried non-idempotent load can produce a table where every value is valid, nothing is null, all formats are correct — and revenue is silently double-counted because half the orders appear twice. Only a uniqueness check on the business key plus reconciliation against the source total catches it. This is the canonical example of why "we added not_null tests" is nowhere near enough.
:::
Why it matters
"Is this data good?" is unanswerable until you split it into the six dimensions — accuracy, completeness, consistency, timeliness, validity, uniqueness — and turn each into an assertion that passes or fails. You express those assertions in dbt tests (versioned beside your models), Great Expectations (rich, engine-agnostic, any stage), or Soda (readable checks built for scheduled monitoring) — and you run them in two places that do different jobs: CI stops bad code from merging (shift-left), runtime stops bad data from reaching consumers. Finally, real quality goes beyond column rules: reconciliation (row counts, control totals, source-vs-target diffs) proves the data crossed intact, and idempotency ensures a retried pipeline doesn't silently double your numbers. Tests prove the things you thought to check. The next lesson pushes the strongest of those checks upstream to the producer as a contract — and the lesson after that monitors the things you didn't think to check.