Chapter 11 checkpoint

You can now make data trustworthy on purpose — testing it, contracting it, monitoring it, tracing it, and governing it. Recall the spine, then take the quiz.

The throughline

  • Quality isn't one thing — it's six dimensions: accuracy, completeness, consistency, timeliness (freshness), validity, uniqueness. Each becomes an assertion (pass/fail test). "Add a not_null test" covers a sliver of one dimension — real quality covers all six.
  • Testing tools & timing: dbt tests (versioned beside models), Great Expectations (rich, any stage), Soda (readable, monitoring-leaning). Run them in CI (shift-left: stop bad code from merging) and at runtime (stop bad data from reaching consumers). Beyond columns: reconciliation (row counts, control totals, source-vs-target diffs) and idempotency (a retried load must not duplicate rows).
  • Data contracts & shift-left: an explicit, versioned producer↔consumer agreement on schema + semantics + SLAs + ownership, enforced in the producer's CI/CD. Breaking-change detection blocks a breaking change before it ships and names who it breaks. The dominant 2026 governance pattern.
  • Observability = catching the unspecified. Five pillars: freshness, volume, schema, distribution, lineage. Combine explicit rules for what you must guarantee (SLAs) with anomaly detection (learned baselines, no hand-set thresholds) for scalable coverage. Define "fresh enough" via SLI→SLO→SLA; operationalize with Slack/PagerDuty alerts routed to an owner, a humane on-call, runbooks, and blameless postmortems.
  • Lineage = the dependency graph. Table-level (which datasets) vs column-level (which field, traced to its origin). Two jobs: root-cause debugging (walk upstream) and impact analysis (walk downstream to the blast radius — before a change and during an incident). Auto-captured (dbt/SQL parse, query logs, OpenLineage).
  • Governance = controls you implement, not paperwork: catalog + glossary + discovery + ownership (find it); RBAC/ABAC + row/column security + masking/tokenization (touch it); GDPR/CCPA right-to-be-forgotten (which collides with immutable lakes — solved by table-format deletes, snapshot expiry, or crypto-shredding, located via column-level lineage) + retention + auditing (keep it); FinOps — cost per query/run, attribution via tags, workload isolation (pay for it).
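
As a refresher on the first two bullets, here is a minimal sketch of quality dimensions expressed as pass/fail assertions, plus a reconciliation check and an idempotent load. It uses plain Python rather than dbt, Great Expectations, or Soda, and every name in it (the sample rows, the column names, and the helper functions) is hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sample data; column names are illustrative, not from a real schema.
NOW = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
ROWS = [
    {"order_id": 1, "amount": 20.0, "status": "paid", "loaded_at": NOW - timedelta(minutes=5)},
    {"order_id": 2, "amount": 35.5, "status": "refunded", "loaded_at": NOW - timedelta(minutes=4)},
    {"order_id": 3, "amount": 12.0, "status": "paid", "loaded_at": NOW - timedelta(minutes=3)},
]


def check_completeness(rows, column):
    """Completeness: every row has a non-null value in `column`."""
    return all(r.get(column) is not None for r in rows)


def check_uniqueness(rows, column):
    """Uniqueness: no value of `column` appears more than once."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))


def check_validity(rows, column, allowed):
    """Validity: every value of `column` belongs to the allowed set."""
    return all(r[column] in allowed for r in rows)


def check_freshness(rows, column, now, max_age):
    """Timeliness: the newest timestamp is no older than `max_age`."""
    if not rows:
        return False  # an empty table cannot be called fresh
    return now - max(r[column] for r in rows) <= max_age


def reconcile(source_rows, target_rows, amount_column):
    """Reconciliation: row counts and a control total agree between source and target."""
    same_count = len(source_rows) == len(target_rows)
    source_total = round(sum(r[amount_column] for r in source_rows), 2)
    target_total = round(sum(r[amount_column] for r in target_rows), 2)
    return same_count and source_total == target_total


def idempotent_load(target, batch, key):
    """Idempotency: upsert by key, so replaying the same batch adds no duplicates."""
    by_key = {r[key]: r for r in target}
    for r in batch:
        by_key[r[key]] = r
    return list(by_key.values())


def run_checks(rows, now):
    """Run each assertion and return a name -> pass/fail mapping."""
    return {
        "completeness": check_completeness(rows, "amount"),
        "uniqueness": check_uniqueness(rows, "order_id"),
        "validity": check_validity(rows, "status", {"paid", "refunded"}),
        "freshness": check_freshness(rows, "loaded_at", now, timedelta(hours=1)),
    }


if __name__ == "__main__":
    print(run_checks(ROWS, NOW))
    loaded = idempotent_load([], ROWS, "order_id")
    loaded = idempotent_load(loaded, ROWS, "order_id")  # simulated retry
    print(len(loaded), reconcile(ROWS, loaded, "amount"))
```

The same functions could run in CI against fixture data and at runtime against freshly loaded rows; in practice a testing tool would take their place, but the shape stays the same: one assertion per guarantee, each returning pass or fail.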

Quiz

Chapter 11 — Data Quality, Governance & Observability

You can now make a data platform trustworthy on purpose — defining quality as six measurable dimensions and testing them in CI and at runtime, pushing the strongest checks upstream as data contracts, monitoring the five observability pillars with rules and anomaly detection, tracing data with table- and column-level lineage for debugging and impact analysis, and governing who can find, touch, keep, and pay for it. The final chapter zooms out: scaling these systems, making (and recording) hard architecture decisions, and growing your data engineering career.

Next: Chapter 12: Scale, Decisions & Career →