Data contracts & shift-left
In the last lesson you tested data after it arrived in your warehouse. That's necessary — but notice it's fundamentally reactive: by the time your runtime test goes red, the bad data already exists, a downstream dashboard may already be wrong, and you're now doing forensics. This lesson is about the most important shift in data engineering practice of the last few years: stop catching breakage downstream where it hurts, and start preventing it upstream where it's born. That idea is called shift-left, and the mechanism that implements it is the data contract.
Where data actually breaks
Picture a typical org, ShopFlow (see Meet ShopFlow). A producer, the application team that owns the orders service, emits data (events, or the orders table). A consumer, the data team, builds stg_orders, the fact_sales mart, a revenue dashboard, and an ML feature on top of it. Now the app team, doing their normal job, renames orders.status → order_status, or changes amount from dollars to cents, or starts emitting NULL customer_ids for guest checkout.
They had no idea anyone depended on the old shape. To them it was a private implementation detail. But three layers downstream, the data team's pipeline silently breaks — or worse, doesn't break and just produces wrong numbers. This is the single most common cause of data incidents: an upstream change the producer didn't know was breaking, discovered by the consumer hours or days later through a wrong report.
The root problem is an invisible, unagreed dependency. Nothing recorded that the data team relied on orders.status being a non-null string in one of ShopFlow's four lifecycle values (placed, paid, shipped, cancelled). So nothing could warn the app team that their change would break it.
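To make the "doesn't break and just produces wrong numbers" case concrete, here is a minimal Python sketch. The `total_revenue` function and the sample rows are hypothetical stand-ins for the data team's logic, not ShopFlow's actual pipeline.

```python
from decimal import Decimal

def total_revenue(orders):
    """Consumer-side logic that silently assumes `amount` is in dollars."""
    return sum(Decimal(str(o["amount"])) for o in orders)

# Rows as the consumer expects them: amount in dollars.
before = [{"order_id": 1, "amount": "19.99"}, {"order_id": 2, "amount": "5.01"}]

# The same two orders after a hypothetical producer change to integer cents.
after = [{"order_id": 1, "amount": 1999}, {"order_id": 2, "amount": 501}]

revenue_before = total_revenue(before)  # 25.00
revenue_after = total_revenue(after)    # 2500: no error raised, just 100x too large

# A rename, by contrast, fails loudly on the consumer side.
renamed = [{"order_id": 1, "order_amount": "19.99"}]
try:
    total_revenue(renamed)
    rename_failed = False
except KeyError:
    rename_failed = True
```

The rename at least surfaces as an error. The unit change passes every type check and only shows up when someone notices the revenue figure is implausible.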
What a data contract is
A data contract is an explicit, versioned, machine-readable agreement between a data producer and its consumers about the data the producer will emit. It's the database/event-stream analog of an API contract (like an OpenAPI spec for a REST API): a promise about shape and behavior that both sides can check against. A real contract specifies more than columns:
- Schema: field names, types, nullability, and the structure. (order_id: bigint, not null; amount: decimal(12,2); status: enum[placed, paid, shipped, cancelled].)
- Semantics: what the fields mean and their rules. (amount is the order total in dollars (USD), not cents; order_ts is UTC; status may only transition forward through ShopFlow's lifecycle.) Semantics are where the subtle, expensive breakages live: if the app team quietly switched amount from dollars to cents, the data is still schema-valid but every revenue number is off by 100×.
- Quality guarantees & SLAs: the dimensions from 11.2, turned into promises such as "order_id is unique and non-null" and "the orders feed is no more than 2 hours stale" (a freshness SLA, or Service-Level Agreement, a committed target; see 11.4).
- Ownership: who owns this data and is accountable when it breaks (ties into stewardship in 11.6).
- Version: contracts are versioned, so changes are explicit and negotiable rather than silent.
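The quality and SLA clauses in the list above are the easiest ones to turn into executable checks. The following is a rough Python sketch of the two quoted guarantees. The function names and the row shape (a list of dicts) are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# The "no more than 2 hours stale" clause, expressed as a duration.
FRESHNESS_THRESHOLD = timedelta(hours=2)

def is_fresh(last_loaded_at, now, threshold=FRESHNESS_THRESHOLD):
    """Return True if the newest load is within the freshness threshold."""
    return now - last_loaded_at <= threshold

def order_id_problems(rows):
    """Count null order_ids and list duplicated ones."""
    ids = [r.get("order_id") for r in rows]
    null_count = sum(1 for i in ids if i is None)
    counts = Counter(i for i in ids if i is not None)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return {"null_count": null_count, "duplicates": duplicates}

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
on_time = is_fresh(datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc), now)  # 1h30m old
stale = is_fresh(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc), now)      # 3h old

problems = order_id_problems(
    [{"order_id": 1}, {"order_id": 1}, {"order_id": None}, {"order_id": 2}]
)
```

Checks like these cover the quality clauses only. The semantic clauses (units, timezone, lifecycle order) usually need either extra rules or human review.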
In practice a contract is often a YAML or JSON file checked into the producer's repository, written in a spec like the open Data Contract Specification or ODCS (Open Data Contract Standard):

    # orders.contract.yaml: ShopFlow orders, owned by the app (checkout) team
    model: orders
    owner: shopflow-app-team
    schema:
      order_id: { type: bigint, required: true, unique: true }
      customer_id: { type: bigint, required: true, description: "FK → customers.customer_id" }
      amount: { type: decimal, required: true, description: "Order total in DOLLARS (USD), 2dp" }
      status: { type: string, required: true, enum: [placed, paid, shipped, cancelled] }
      order_ts: { type: timestamp, required: true, description: "UTC; when the order was placed" }
    servicelevels:
      freshness: { threshold: 2h }
      availability: { description: "published by 02:00 UTC daily" }

This is the agreement between ShopFlow's app team (who own the orders service) and the data team (who build stg_orders and fact_sales on it). It pins the schema, the semantics that matter (amount in dollars, order_ts in UTC, the four-value status lifecycle), the freshness SLA, and the owner — so none of it can change silently.
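As an illustration of what "machine-readable" buys you, here is a minimal Python sketch that checks a single row against the schema section of the contract. To stay dependency-free it uses a hand-copied dict rather than parsing the YAML file. The mapping from contract types to Python types, and the `validate_row` helper itself, are assumptions of this sketch.

```python
from datetime import datetime
from decimal import Decimal

# Hand-copied mirror of the schema section of orders.contract.yaml.
# The contract-type -> Python-type mapping is an assumption of this sketch.
CONTRACT_SCHEMA = {
    "order_id":    {"type": int, "required": True},
    "customer_id": {"type": int, "required": True},
    "amount":      {"type": Decimal, "required": True},
    "status":      {"type": str, "required": True,
                    "enum": ["placed", "paid", "shipped", "cancelled"]},
    "order_ts":    {"type": datetime, "required": True},
}

def validate_row(row, schema=CONTRACT_SCHEMA):
    """Return a list of human-readable contract violations for one row."""
    errors = []
    for name, rule in schema.items():
        value = row.get(name)
        if value is None:
            if rule.get("required"):
                errors.append(f"{name}: required field is missing or null")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(
                f"{name}: expected {rule['type'].__name__}, got {type(value).__name__}"
            )
        elif "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: {value!r} is not one of {rule['enum']}")
    return errors

good_row = {
    "order_id": 1, "customer_id": 7, "amount": Decimal("19.99"),
    "status": "paid", "order_ts": datetime(2024, 1, 1, 9, 30),
}
# Guest checkout with a null customer_id and an unknown status value.
guest_row = dict(good_row, customer_id=None, status="refunded")

good_errors = validate_row(good_row)
guest_errors = validate_row(guest_row)
```

Note what this cannot see: a row whose amount is in cents still passes, because the "DOLLARS" rule lives only in the description.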
Shift-left: enforce at the producer boundary
Shift-left is a term borrowed from software testing: move quality checks earlier (further "left") in the lifecycle — toward the moment the data (or code) is created, rather than the moment it's consumed. Applied here: the contract is checked at the producer, in the producer's own CI/CD, before the change ships.
Concretely, ShopFlow's app-team pipeline gains a gate: when the app team opens a pull request that changes the orders data, CI validates the new data/schema against orders.contract.yaml and fails the build if it would break the contract — before merge, before deploy, before a single bad row reaches the data team's stg_orders.
The payoff is the reversal of the failure mode: instead of the consumer discovering breakage after it ships (reactive forensics), the producer is told before they ship that their change is breaking, who it breaks, and is forced to handle it deliberately. The invisible dependency is now visible and enforced.
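A CI gate of this kind can be sketched in a few lines of Python. This is a simplified illustration, not any particular tool's implementation. The `consumers` list is a hypothetical addition that the YAML contract above does not contain, and a real gate would load the contract from the repository rather than hard-code it.

```python
# Hypothetical, simplified view of the contract. The `consumers` list is an
# assumption of this sketch, not a field shown in orders.contract.yaml.
CONTRACT = {
    "model": "orders",
    "consumers": ["stg_orders", "fact_sales"],
    "schema": {
        "order_id": "bigint",
        "customer_id": "bigint",
        "amount": "decimal",
        "status": "string",
        "order_ts": "timestamp",
    },
}

def check_against_contract(proposed_schema, contract=CONTRACT):
    """Return a list of contract violations found in the proposed schema."""
    problems = []
    for field, expected_type in contract["schema"].items():
        if field not in proposed_schema:
            problems.append(f"field '{field}' was removed or renamed")
        elif proposed_schema[field] != expected_type:
            problems.append(
                f"field '{field}' changed type: "
                f"{expected_type} -> {proposed_schema[field]}"
            )
    return problems

def gate(proposed_schema, contract=CONTRACT):
    """Print a report and return a process exit code (0 = pass, 1 = fail)."""
    problems = check_against_contract(proposed_schema, contract)
    if not problems:
        print(f"{contract['model']}: contract check passed")
        return 0
    print(f"{contract['model']}: contract check FAILED")
    for problem in problems:
        print(f"  - {problem}")
    print(f"  affected consumers: {', '.join(contract['consumers'])}")
    return 1

# Example pull request: the app team renames status -> order_status.
proposed = {
    "order_id": "bigint", "customer_id": "bigint", "amount": "decimal",
    "order_status": "string", "order_ts": "timestamp",
}
exit_code = gate(proposed)  # a CI wrapper would pass this to sys.exit()
```

Returning an exit code rather than raising keeps the function easy to test. The CI job only needs the non-zero status to block the merge.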
:::tip Shift-left vs runtime tests are partners, not rivals
Lesson 11.2's CI-vs-runtime split lives inside the consumer. The data contract pushes a check even further left, into the producer's repo, before the data is even emitted. Layered: contract check at the producer (don't emit a breaking change) → runtime checks at the consumer (catch reality the contract can't promise) → observability (catch what nobody specified). Each layer catches what the previous can't.
:::
Breaking-change detection: what counts as "breaking"?
The engine behind the gate is breaking-change detection — automatically comparing a proposed schema/contract against the current one and classifying the diff. The durable categories (you'll see them in Protobuf/Avro registries, GraphQL, and contract tools alike):
- Backward-compatible (safe): old consumers keep working. Adding an optional/nullable field; adding a new enum value consumers ignore. Allowed to ship freely.
- Breaking: old consumers will fail or misread. Removing a field, renaming a field, narrowing a type (string→int), making a nullable field required, changing units/meaning. Blocked, or allowed only via an explicit version bump.
A change that is technically schema-compatible but semantically breaking (dollars→cents) is the dangerous middle — which is why contracts carry descriptions and semantic rules, and why a human review of the diff still matters.
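The classification itself is mechanical enough to sketch. The Python below is a deliberately conservative illustration under stated assumptions: schemas are plain dicts, every type change counts as breaking (real tools may allow safe widenings), and a new required field is treated as breaking even though the text above only lists optional additions as safe.

```python
def classify_change(old, new):
    """Compare two schemas and return (verdict, reasons).

    Each schema maps field name -> {"type": str, "required": bool}.
    """
    breaking, safe = [], []
    for name, spec in old.items():
        if name not in new:
            # A rename looks like a removal plus an addition.
            breaking.append(f"removed or renamed field: {name}")
            continue
        if new[name]["type"] != spec["type"]:
            # Conservative: any type change is flagged, not only narrowing.
            breaking.append(
                f"type changed on {name}: {spec['type']} -> {new[name]['type']}"
            )
        if new[name]["required"] and not spec["required"]:
            breaking.append(f"nullable field made required: {name}")
    for name, spec in new.items():
        if name not in old:
            if spec["required"]:
                # Assumption of this sketch: treat as breaking to be safe.
                breaking.append(f"new required field: {name}")
            else:
                safe.append(f"added optional field: {name}")
    verdict = "breaking" if breaking else "backward-compatible"
    return verdict, breaking + safe

current = {
    "amount": {"type": "decimal", "required": True},
    "status": {"type": "string", "required": True},
}
with_coupon = dict(current, coupon_code={"type": "string", "required": False})
renamed = {
    "amount": {"type": "decimal", "required": True},
    "order_status": {"type": "string", "required": True},
}

safe_result = classify_change(current, with_coupon)
rename_result = classify_change(current, renamed)
# Dollars -> cents changes no field or type, so the diff is empty.
units_result = classify_change(current, current)
```

The last call is the point of the "dangerous middle": a unit change produces an empty diff, so a schema-only classifier reports it as backward-compatible.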
Where the registry comes in
For streaming data, this enforcement is often handled by a schema registry — a service that stores the versioned schema for each topic/event and rejects producers trying to publish a payload that violates the registered schema's compatibility rules. This is the Chapter 9 streaming tie-in: in event-driven systems, the Confluent Schema Registry (with Avro/Protobuf) is the original, battle-tested data-contract enforcement point — a producer literally cannot publish an incompatible message. Modern "data contract" tooling generalizes that idea from event streams to tables, dbt models, and files.
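To show the mechanism without depending on any vendor client, here is a toy in-memory registry in Python. `ToySchemaRegistry` is a hypothetical class written for this lesson. It is not the Confluent API, and its single compatibility rule is far simpler than the modes real registries offer.

```python
class IncompatibleSchemaError(Exception):
    """Raised when a new schema version would break existing consumers."""

class ToySchemaRegistry:
    """In-memory stand-in for a schema registry; not a real client API."""

    def __init__(self):
        self._versions = {}  # subject -> list of schemas, oldest first

    def register(self, subject, schema):
        """Store a new schema version, rejecting incompatible ones."""
        versions = self._versions.setdefault(subject, [])
        if versions:
            self._check_compatible(versions[-1], schema)
        versions.append(schema)
        return len(versions)  # 1-based version number

    @staticmethod
    def _check_compatible(old, new):
        # Simplified rule for this sketch: every existing field must survive
        # with the same type. New fields are allowed.
        for field, field_type in old.items():
            if new.get(field) != field_type:
                raise IncompatibleSchemaError(
                    f"field '{field}' was removed or changed"
                )

registry = ToySchemaRegistry()
v1 = registry.register("orders", {"order_id": "long", "status": "string"})
v2 = registry.register(
    "orders", {"order_id": "long", "status": "string", "coupon_code": "string"}
)
try:
    # Renaming status -> order_status drops a field consumers rely on.
    registry.register("orders", {"order_id": "long", "order_status": "string"})
    rejected = False
except IncompatibleSchemaError:
    rejected = True
```

The check happens at registration time, so the incompatible version is never stored and the producer finds out before any consumer does.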
How contracts plug into the stack
You don't need an exotic tool to start — contracts compose with what you already have:
- dbt has first-class model contracts: declare a model's columns/types as enforced, and dbt fails the build if the model's actual output drifts from the declared shape. This is a contract on the transformation boundary.
- Schema registries (Confluent, AWS Glue Schema Registry) enforce contracts on the streaming boundary.
- Great Expectations / Soda suites can be the runtime expression of a contract's quality clauses.
- Dedicated data-contract tooling and catalogs (often integrated with DataHub or OpenMetadata, lesson 11.6) store contracts, render them as docs, and wire the CI gate.
Common pitfalls
- Treating a contract as just a schema. Schema-only contracts miss semantics (units, timezone, meaning) and SLAs (freshness) — exactly where the costly, silent breakages live.
- Putting the contract on the consumer side. A contract the consumer enforces after receiving data is just a runtime test renamed. The whole point of shift-left is enforcement in the producer's pipeline, before they ship.
- No version field / no breaking-change policy. Without versioning and a "what's breaking" rule, every change is a judgment call and the contract erodes into documentation nobody trusts.
- A contract with no owner. If no one is accountable for the orders data, a failing contract check has no one to act on it. Contracts presume ownership (11.6).
- Boiling the ocean. You don't write contracts for all 4,000 tables on day one. Put contracts on the few high-value, multi-consumer datasets where a silent break is most expensive, and grow from there.
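One way to avoid the "no breaking-change policy" pitfall is to write the policy down as code. The Python sketch below assumes the contract uses semantic-style MAJOR.MINOR.PATCH versions, which is a common convention but not something this lesson's contract specifies. The function names are hypothetical.

```python
def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch

def version_bump_ok(old_version, new_version, is_breaking):
    """Breaking changes need a major bump; any other change needs a higher version."""
    old, new = parse_version(old_version), parse_version(new_version)
    if is_breaking:
        return new[0] > old[0]
    return new > old

minor_for_safe_change = version_bump_ok("1.4.0", "1.5.0", is_breaking=False)
minor_for_breaking_change = version_bump_ok("1.4.0", "1.5.0", is_breaking=True)
major_for_breaking_change = version_bump_ok("1.4.0", "2.0.0", is_breaking=True)
```

With a rule like this in CI, "is this change allowed?" stops being a judgment call: a breaking diff without a major bump simply fails the build.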
Why it matters
Most data incidents are born at the producer boundary, from an upstream change nobody knew was breaking a downstream consumer: an invisible, unagreed dependency. A data contract makes that dependency explicit and machine-readable: schema plus semantics plus quality SLAs plus ownership, versioned. Shift-left enforces it in the producer's CI/CD, so breaking-change detection blocks the change before it ships and names who it would break, turning a reactive downstream fire-drill into a proactive upstream conversation. Contracts compose with tools you've already met: schema registries on streams, dbt model contracts on transformations, Great Expectations or Soda for the quality clauses. As of 2026 this is among the most prominent governance patterns in data engineering, yet many guides never teach it. Contracts and tests both check things you anticipated. Next: catching the breakage nobody specified, with data observability.
Next: Data observability →