
Data observability

Tests and contracts (11.2, 11.3) share a limit: they only catch what you thought to specify. You wrote not_null on orders.customer_id, so nulls get caught. But you never wrote "alert me if today's ShopFlow order count is 40% below normal," or "tell me when shopflow_daily mysteriously stops updating," or "warn me when the average orders.amount suddenly triples," because you can't enumerate in advance every way data can go wrong. (ShopFlow is introduced in Meet ShopFlow; shopflow_daily is its nightly pipeline: ingest_orders → dbt_run → quality_check.) This is the data world's version of the known-unknowns versus unknown-unknowns problem from the Cloud guide's observability chapter, and the answer has the same name: observability.

Data observability is the ability to understand the health of your data and pipelines — and to get told when something is off — including problems you didn't write an explicit rule for. Where lesson 11.2's tests are pass/fail rules you authored, observability adds continuous monitoring and learned baselines that flag the anomalies rules would miss.

The five pillars of data observability

The field has converged on five things to watch. A monitoring tool covers data health to the extent it covers these five pillars — memorize them; they're the durable checklist:

| Pillar | The question | What "broken" looks like |
| --- | --- | --- |
| Freshness | Is the data up to date? | shopflow_daily hasn't refreshed fact_sales in 30 hours. |
| Volume | Did the expected amount of data arrive? | ShopFlow normally lands ~40,000 orders/day; today only 1,200 arrived (a partial load), or 120,000 (a duplicate load). |
| Schema | Did the structure change? | orders.status was renamed to order_status, or amount was dropped upstream. |
| Distribution (quality) | Are the values in their normal range/shape? | orders.customer_id % null jumped from 0.1% to 18%; average orders.amount tripled; a new unseen status value appeared. |
| Lineage | What's upstream/downstream of this? | (The connective pillar: which sources feed fact_sales and what it feeds. Lesson 11.5.) |

Notice how these map to (and extend) the dimensions from 11.2: freshness is timeliness; volume catches the silent row drops that reconciliation worries about; distribution is completeness/validity watched over time instead of as a one-shot rule; schema is the contract being violated in reality; lineage is the thread that lets you trace a problem to its source.
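To make the schema pillar concrete, here is a minimal sketch of a schema-drift check. It assumes you can snapshot a table's columns as a column-to-type mapping; the `SchemaDiff` and `diff_schema` names and the sample column types are hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaDiff:
    added: dict[str, str]                 # present now, not expected
    removed: dict[str, str]               # expected, missing now
    retyped: dict[str, tuple[str, str]]   # column -> (expected type, observed type)

    @property
    def changed(self) -> bool:
        return bool(self.added or self.removed or self.retyped)


def diff_schema(expected: dict[str, str], observed: dict[str, str]) -> SchemaDiff:
    """Compare two column -> type mappings and report what drifted."""
    added = {c: t for c, t in observed.items() if c not in expected}
    removed = {c: t for c, t in expected.items() if c not in observed}
    retyped = {
        c: (expected[c], observed[c])
        for c in expected.keys() & observed.keys()
        if expected[c] != observed[c]
    }
    return SchemaDiff(added, removed, retyped)


# Hypothetical snapshots of the orders table: yesterday's vs today's.
expected = {"order_id": "bigint", "customer_id": "bigint",
            "status": "varchar", "amount": "numeric"}
observed = {"order_id": "bigint", "customer_id": "bigint",
            "order_status": "varchar"}

drift = diff_schema(expected, observed)
if drift.changed:
    print("removed:", sorted(drift.removed))
    print("added:", sorted(drift.added))
```

Note that a rename such as status to order_status surfaces as one removed column plus one added column; a plain diff like this cannot tell a rename from a drop-and-add.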

:::note Why "pillars" again? Same idea, different layer
The Cloud guide had pillars of system telemetry (metrics, logs, traces). These are pillars of data telemetry: the analog for the data flowing through the system rather than the services running it. The instinct is identical: don't watch one signal, watch a small complete set that together tells you whether things are healthy.
:::

Explicit rules vs anomaly detection

There are two ways to decide "is this value a problem?", and a mature setup uses both.

Explicit rules (thresholds you set)

You state the boundary: "orders freshness must be < 6h," "orders row count must be > 30,000/day," "customer_id null rate must be < 1%." These are exactly the assertions from 11.2, just evaluated continuously. Strengths: precise, predictable, no surprises, perfect when you know the right threshold (an SLA, a hard business rule). Weakness: you have to know and maintain the number for every metric on every table — which doesn't scale to thousands of tables, and can't anticipate the weird stuff.
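The three example thresholds above can be sketched as a small rule table evaluated against the latest metrics. This is an illustration only: the `Rule` class, the metric names, and the sample values are assumptions, and a real setup would pull the metrics from the warehouse on a schedule.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    metric: str
    passes: Callable[[float], bool]


# The thresholds quoted in the text, expressed as predicates.
RULES = [
    Rule("orders freshness < 6h", "freshness_hours", lambda v: v < 6),
    Rule("orders row count > 30,000/day", "row_count", lambda v: v > 30_000),
    Rule("customer_id null rate < 1%", "customer_id_null_rate", lambda v: v < 0.01),
]


def evaluate(rules: list[Rule], metrics: dict[str, float]) -> list[str]:
    """Return the names of the rules that the current metrics violate."""
    return [r.name for r in rules if not r.passes(metrics[r.metric])]


# Hypothetical metric readings for today's orders table.
metrics = {
    "freshness_hours": 9.0,
    "row_count": 41_250,
    "customer_id_null_rate": 0.18,
}

violations = evaluate(RULES, metrics)
for name in violations:
    print("VIOLATED:", name)
```

The maintenance cost the text mentions is visible here: every metric on every table needs its own hand-picked number in `RULES`.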

Anomaly detection (learned baselines)

Anomaly detection means the tool learns a metric's normal pattern from history — its typical value, its daily and weekly seasonality, its normal variance — and alerts when a new observation deviates significantly from that learned baseline, with no threshold you had to set. It notices that Tuesday's ShopFlow order volume is unusually low for a Tuesday, that shopflow_daily's freshness is slipping relative to its normal cadence, that the orders.amount distribution has shifted in a way you never thought to rule-check.

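[Diagram: two ways to judge a freshness gap. Explicit rule: you declare "freshness < 6h", then check "9h?". Anomaly detection: the tool learns history ("updates ~every 1h, low on weekends"), then checks "today's gap far outside learned pattern?"]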
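As a rough illustration of a learned baseline, the sketch below scores today's order count against previous counts for the same weekday, using the median and the median absolute deviation (MAD). This is one simple technique chosen for the example, not how any specific platform works; the history values and the 3.5 cutoff are assumptions.

```python
from statistics import median


def robust_z(history: list[float], observed: float) -> float:
    """Distance of `observed` from the history's median, in scaled-MAD units."""
    center = median(history)
    mad = median(abs(x - center) for x in history)
    if mad == 0:
        # No spread in the history: any difference at all is infinitely unusual.
        if observed == center:
            return 0.0
        return float("inf") if observed > center else float("-inf")
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    return (observed - center) / (1.4826 * mad)


def is_anomalous(history: list[float], observed: float, cutoff: float = 3.5) -> bool:
    """Flag observations far outside the learned same-weekday pattern."""
    return abs(robust_z(history, observed)) > cutoff


# Hypothetical order counts for the last eight Tuesdays.
previous_tuesdays = [39_800, 40_500, 41_200, 40_100, 39_600, 40_900, 40_300, 40_700]

print(is_anomalous(previous_tuesdays, 40_050))   # an ordinary Tuesday
print(is_anomalous(previous_tuesdays, 1_200))    # partial load
print(is_anomalous(previous_tuesdays, 120_000))  # duplicate load
```

Comparing only against the same weekday is a crude way to respect weekly seasonality. It also shows where the noise comes from: a legitimate spike looks exactly like a duplicate load to this check.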

The trade-offs mirror the alerting lessons elsewhere in this guide: anomaly detection scales to coverage you could never hand-author, but it can be noisy (false positives on legitimate spikes like Black Friday) and is opaque ("why did it fire?"). Explicit rules are precise but don't scale and don't catch the unforeseen. Use rules for the things you know and must guarantee (SLAs, hard constraints) and anomaly detection for broad, scalable coverage of the unknown-unknowns across many tables.

:::tip The gap this closes
The most common quality failure in the field isn't a missing not_null test. It's the table that silently stopped updating three days ago, or whose volume quietly halved, with every column rule still passing. Picture ShopFlow's worst case: an upstream extract config breaks and orders drops to zero today. shopflow_daily runs green, every not_null/unique/accepted_values test on the (now empty) load passes, and fact_sales simply gains no new rows. No assertion you'd think to write catches "the data just… stopped." But a volume monitor sees today's count crater from ~40,000 to 0, and a freshness monitor sees fact_sales go stale. Anomaly detection makes this even harder to miss: it knows zero orders is wildly outside ShopFlow's learned baseline and pages on it. A guide that teaches only not_null and never freshness/anomaly monitoring has left its readers blind to the most common real incident.
:::
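The empty-load case is easy to reproduce. In the sketch below, the three column checks are simplified stand-ins written for this example (they are not dbt's implementations), and the 30,000 minimum is borrowed from the explicit-rule example earlier in the lesson.

```python
def not_null(rows: list[dict], column: str) -> bool:
    return all(row[column] is not None for row in rows)


def unique(rows: list[dict], column: str) -> bool:
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def accepted_values(rows: list[dict], column: str, allowed: set[str]) -> bool:
    return all(row[column] in allowed for row in rows)


def volume_ok(row_count: int, minimum: int) -> bool:
    return row_count >= minimum


# The upstream extract broke, so today's load contains no rows at all.
todays_orders: list[dict] = []

column_tests_pass = all([
    not_null(todays_orders, "customer_id"),
    unique(todays_orders, "order_id"),
    accepted_values(todays_orders, "status", {"placed", "shipped", "returned"}),
])
volume_check_passes = volume_ok(len(todays_orders), minimum=30_000)

print("column tests pass:", column_tests_pass)
print("volume check passes:", volume_check_passes)
```

Every column test passes vacuously because there are no rows to fail it; only the volume check notices that nothing arrived.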

SLAs and SLOs for data

You can't say data is "fresh enough" without a number. Borrowing straight from SRE (the Cloud guide's SLI/SLO/SLA lesson), applied to data:

  • SLI (Service-Level Indicator) — the measured number. "Hours since orders last updated." "% of shopflow_daily runs that completed by 06:00."
  • SLO (Service-Level Objective) — the internal target on that indicator. "orders freshness < 2h, 99% of the time." This is what your alerts and dashboards are built around.
  • SLA (Service-Level Agreement) — the committed promise to consumers, usually looser than the SLO with consequences attached. "ShopFlow's fact_sales revenue dataset is delivered by 07:00, or the finance team is formally notified."

The most common data SLO is a freshness target: a maximum staleness for a dataset, often committed to consumers as a freshness SLA. It belongs in the data contract (11.3), is monitored as the freshness pillar here, and is what an alert pages on when violated. Defining "fresh enough" as a number, rather than a vibe, is what turns "the dashboard looks stale" into an objective, alertable condition with an owner.
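A small sketch of how the SLI and SLO fit together, using the "orders freshness < 2h, 99% of the time" objective from the list above. The timestamps and the window of readings are made up for illustration, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def freshness_hours(last_updated: datetime, now: datetime) -> float:
    """SLI: hours elapsed since the dataset last updated."""
    return (now - last_updated).total_seconds() / 3600


def slo_attainment(samples: list[float], target_hours: float) -> float:
    """Fraction of SLI samples that met the freshness target."""
    if not samples:
        raise ValueError("need at least one SLI sample")
    return sum(s < target_hours for s in samples) / len(samples)


# Hypothetical timestamps for a single SLI reading.
now = datetime(2024, 5, 14, 6, 0, tzinfo=timezone.utc)
last_updated = datetime(2024, 5, 14, 4, 30, tzinfo=timezone.utc)
current_sli = freshness_hours(last_updated, now)

# Hypothetical window of 200 readings: 197 within target, 3 stale.
window = [0.5] * 197 + [3.0] * 3
attainment = slo_attainment(window, target_hours=2.0)
slo_met = attainment >= 0.99

print(f"current freshness: {current_sli:.1f}h")
print(f"attainment: {attainment:.3f}, SLO met: {slo_met}")
```

The current reading is within target, yet the objective is missed over the window. That is the reason to track attainment over time rather than only the latest value.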

Operationalizing: alerts, on-call, and incident response

Observability is worthless if a problem fires into a void. The operational layer — identical in spirit to running any production service — is what makes it real.

  • Alerting & routing. Detected problems become alerts routed to where people work: Slack (or Teams) for awareness and lower-severity issues; PagerDuty (or Opsgenie) to page an on-call human for severity that can't wait. Route by ownership (11.6) so the alert reaches the team that owns the broken dataset, not a shared channel everyone mutes.
  • Severity & actionability. As with system alerting: every page must be actionable, or you breed alert fatigue and people start ignoring the pager. Anomaly detection's noise risk makes this especially acute for data — tune it, or the first false-positive flood trains everyone to ignore it.
  • On-call for pipelines. A data on-call rotation owns pipeline health: respond to freshness/volume/quality alerts, triage severity, mitigate (e.g. halt downstream publishing so bad data doesn't spread — the circuit breaker from 11.2), then diagnose. A runbook per common failure ("shopflow_daily's ingest_orders task late → check the upstream orders extract job → …") turns a 2 a.m. page into a checklist.
  • Incident response & blameless postmortems. A data incident gets the same treatment as a system one: detect → triage → mitigate-before-diagnose → resolve → blameless postmortem (fix the system that let bad data through, not the person) → owned action items. This is where lineage (next lesson) becomes the debugging superpower: when a metric is wrong, lineage tells you which upstream table to investigate and which downstream dashboards are affected.
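The routing idea in the list above (severity decides page versus chat, ownership decides who) can be sketched as a pure function. The owner map, team names, and channel naming are hypothetical, and the sketch deliberately stops at choosing a destination rather than calling any Slack or PagerDuty API.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = 1
    HIGH = 2


@dataclass(frozen=True)
class Alert:
    dataset: str
    pillar: str
    severity: Severity
    message: str


# Hypothetical dataset -> owning team map (see ownership, 11.6).
OWNERS = {"fact_sales": "analytics-eng", "orders": "ingestion"}
FALLBACK_OWNER = "data-platform"


def route(alert: Alert) -> tuple[str, str]:
    """Return (destination kind, target) for an alert."""
    team = OWNERS.get(alert.dataset, FALLBACK_OWNER)
    if alert.severity is Severity.HIGH:
        return ("page", f"{team}-oncall")
    return ("chat", f"#{team}-alerts")


stale = Alert("fact_sales", "freshness", Severity.HIGH, "no refresh in 30h")
drift = Alert("orders", "distribution", Severity.LOW, "amount mean shifted")

print(route(stale))
print(route(drift))
```

Keeping the decision separate from delivery makes the routing policy easy to test, and the fallback owner ensures an alert for an unmapped dataset still lands with a named team.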

The tool landscape (dated — the pillars are not)

  • Data observability platforms: Monte Carlo, Metaplane, and Anomalo are dedicated SaaS products that auto-monitor the five pillars with anomaly detection across your warehouse, with minimal config.
  • Open-source / dbt-native: Elementary plugs into dbt to add anomaly tests, freshness, and volume monitoring from your existing dbt project. It is a popular, lower-cost on-ramp.
  • Soda (from 11.2) extends from testing into scheduled monitoring, bridging the two.

Don't anchor on the names; anchor on "does it cover freshness, volume, schema, distribution, and lineage, with rules and anomaly detection, routed to an owner?"

Common pitfalls

  • Equating data quality with column tests only. Tests catch specified problems; they miss the table that silently stopped updating or whose volume halved. Observability (freshness/volume/anomaly) is the other half.
  • Anomaly detection with no tuning → alert fatigue. A noisy detector that pages on every Black Friday spike gets muted, and then it's worse than nothing. Tune baselines and route by severity.
  • No freshness SLA. Without a committed number, "is it fresh enough?" is unanswerable and unalertable. Put a freshness SLO/SLA on every important dataset.
  • Alerts with no owner. An alert in a channel nobody owns is noise. Route to the dataset's owning team (11.6).
  • Monitoring without lineage. An alert that says "table X is anomalous" but can't tell you what feeds X or what X feeds leaves you debugging blind — which is exactly why lineage is the fifth pillar.

Why it matters

Tests and contracts catch what you specified; data observability catches what you didn't. Watch the five pillars (freshness, volume, schema, distribution, lineage), using explicit rules for the thresholds you must guarantee (SLAs) and anomaly detection for scalable coverage of the unknown-unknowns across thousands of tables. Make "fresh enough" concrete with SLIs → SLOs → SLAs, the most common being a freshness SLA that lives in the contract and pages when violated. Then operationalize: route alerts to the owning team via Slack/PagerDuty, run a humane data on-call with runbooks, and treat data incidents with blameless postmortems. The fifth pillar, lineage, is the thread that makes every alert debuggable, so it gets its own lesson next.
