Chapter 11 · Data Quality, Governance & Observability
A pipeline that runs is not the same as a pipeline you can trust. By now you can move data, reshape it, store it in a lakehouse, and orchestrate the whole thing. This chapter is about the part that separates a hobby project from a platform a business will bet money on: making the output trustworthy on purpose, and keeping it that way as it grows, as people churn, and as regulators come knocking.
That word trust hides three distinct jobs, and people constantly collapse them into one (usually "add a few tests") and then wonder why their data still breaks:
- Quality: is the data correct? Are values present, valid, unique, consistent, and fresh? You answer this by testing data like you test code (lessons 11.2 and 11.3).
- Observability: when quality does slip, do you find out before your CEO does? You answer this by monitoring data like you monitor a service, watching freshness, volume, schema, distribution, and lineage (lesson 11.4, with lineage covered in depth in 11.5).
- Governance: who is allowed to touch this data, who owns it, can we find it, are we allowed to keep it, and what is it costing us? You answer this with catalogs, access control, privacy controls, retention, and cost attribution (lesson 11.6).
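To make "testing data like you test code" concrete before the lessons map it onto real tools, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than any tool's API: the function name `check_quality`, the column names (`order_id`, `customer_id`, `amount`, `loaded_at`), and the 24-hour freshness threshold are all hypothetical, and every row is assumed to carry a `loaded_at` timestamp.

```python
from datetime import datetime, timedelta, timezone


def check_quality(rows, now, max_age=timedelta(hours=24)):
    """Return a list of failure messages; an empty list means every check passed.

    Hypothetical helper: column names and thresholds are assumptions for illustration.
    """
    failures = []

    # Presence: every row must carry a non-null customer_id.
    if any(r.get("customer_id") is None for r in rows):
        failures.append("presence: null customer_id")

    # Validity: amount must be a number and must not be negative.
    if any(
        not isinstance(r.get("amount"), (int, float)) or r["amount"] < 0
        for r in rows
    ):
        failures.append("validity: negative or non-numeric amount")

    # Uniqueness: order_id is treated as the primary key, so no duplicates.
    ids = [r.get("order_id") for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("uniqueness: duplicate order_id")

    # Freshness: the newest row must have been loaded within max_age.
    newest = max((r["loaded_at"] for r in rows), default=None)
    if newest is None or now - newest > max_age:
        failures.append("freshness: newest row older than max_age")

    return failures


now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "customer_id": "c1", "amount": 10.0,
     "loaded_at": now - timedelta(hours=1)},
    {"order_id": 1, "customer_id": None, "amount": -5,
     "loaded_at": now - timedelta(hours=2)},
]
print(check_quality(rows, now))
# ['presence: null customer_id', 'validity: negative or non-numeric amount',
#  'uniqueness: duplicate order_id']
```

These are explicit rules: they only catch what you thought to write down. That gap is why observability is a separate job, and nothing in the sketch says who may read the table, which is governance.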
The durable idea
Trust is a property you engineer in — test data like code, monitor it like a service, and govern who can touch it, find it, keep it, and pay for it.
Notice how durable that is. The principles (define what "good" means, test it, monitor it, trace it, restrict it, expire it) long outlast any tool. The specific tools (Great Expectations, Soda, dbt tests, Monte Carlo, OpenMetadata, Unity Catalog, Collibra…) are a snapshot of the current market and will churn. So we teach the concepts first and map them onto today's tools second, exactly as the rest of this guide does. When a name in this chapter is gone in five years, the idea it implemented will still be how the job is done.
How this chapter builds
A natural reading: define and test quality → push those checks upstream to the producer as a contract → monitor the parts tests can't catch → use lineage to debug and assess blast radius when something breaks → wrap it all in governance (who, find, keep, pay).
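As a preview of the "blast radius" step in that flow, here is a rough sketch of impact analysis as a graph walk. It assumes lineage is available as a simple mapping from each dataset to its direct downstream consumers. The table names and the `downstream` helper are hypothetical, and real lineage tools expose this through their own interfaces.

```python
from collections import deque


def downstream(lineage, start):
    """Return every dataset reachable from `start` (breadth-first walk).

    `lineage` maps a dataset name to the datasets that read from it directly.
    """
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(start)  # guard against a cycle leading back to the start
    return seen


# Hypothetical table-level lineage: producer -> direct consumers.
lineage = {
    "raw.orders": ["stg.orders"],
    "raw.customers": ["stg.customers"],
    "stg.orders": ["mart.revenue", "mart.churn"],
    "stg.customers": ["mart.churn"],
}

print(sorted(downstream(lineage, "raw.orders")))
# ['mart.churn', 'mart.revenue', 'stg.orders']
```

If `raw.orders` breaks, that set is who you warn. The same walk over reversed edges answers the debugging question of what feeds a broken table.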
Planned lessons
- 11.2 — Data quality dimensions & testing: the six dimensions (accuracy, completeness, consistency, timeliness, validity, uniqueness), where tests live (dbt tests, Great Expectations, Soda — CI vs runtime), and reconciliation/idempotency checks.
- 11.3 — Data contracts & shift-left: enforcing schema and semantics at the producer boundary, breaking-change detection, and why shift-left is the dominant 2026 governance pattern.
- 11.4 — Data observability: the five pillars (freshness, volume, schema, distribution, lineage), anomaly detection vs explicit rules, freshness SLAs/SLOs, and alerting/on-call for pipelines.
- 11.5 — Lineage & impact analysis: table- and column-level lineage, how it powers debugging and trust, and impact (blast-radius) analysis.
- 11.6 — Governance, privacy & cost: catalogs and discovery, ownership/stewardship, RBAC/ABAC, masking/tokenization, PII and GDPR/CCPA (including right-to-be-forgotten on immutable lakes), retention, auditing, and cost/FinOps.
- 11.7 — Checkpoint: lock it in with a quiz.