Lineage & impact analysis
A ShopFlow revenue number on a dashboard is wrong. Where do you even start? In a modern stack, that figure is the end of a long chain: a raw source (raw.orders) → an ingestion job → a staging table (stg_orders) → joined intermediate models → a final mart (fact_sales) → the dashboard. The wrongness could have entered anywhere along that chain. Without a map of how data flows, debugging is archaeology — opening table after table, guessing. Data lineage is that map, and it's the single most useful artifact for trust and debugging in a data platform. It's also the fifth observability pillar from the last lesson — the thread that makes every alert investigable. (ShopFlow — see Meet ShopFlow.)
What lineage is
Data lineage is the record of where data comes from, how it flows, and how it transforms as it moves through your systems — a directed graph of dependencies. Read it one way it answers "what feeds this?"; read it the other way it answers "what does this feed?"
Each node in that graph is a dataset (or a column); each edge is a transformation that produced one from another. The direction of the edges is the whole point: it lets you walk upstream (toward sources) and downstream (toward consumers).
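The graph described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular catalog stores lineage: the dataset names follow the ShopFlow example, while `revenue_dashboard`, `build_index`, `walk`, `upstream`, and `downstream` are hypothetical names chosen for the sketch.

```python
from collections import defaultdict, deque

# Each edge points from a dataset to the dataset built from it
# (upstream -> downstream). "revenue_dashboard" is a hypothetical consumer.
EDGES = [
    ("raw.orders", "stg_orders"),
    ("raw.order_items", "stg_order_items"),
    ("stg_orders", "fact_sales"),
    ("stg_order_items", "fact_sales"),
    ("dim_customer", "fact_sales"),
    ("fact_sales", "revenue_dashboard"),
]


def build_index(edges):
    """Index the edge list in both directions so either walk is cheap."""
    children, parents = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)
    return children, parents


def walk(start, neighbors):
    """Return every node reachable from start by following neighbors."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


CHILDREN, PARENTS = build_index(EDGES)


def upstream(dataset):
    """Everything this dataset depends on: 'what feeds this?'"""
    return walk(dataset, PARENTS)


def downstream(dataset):
    """Everything that depends on this dataset: 'what does this feed?'"""
    return walk(dataset, CHILDREN)
```

The same `walk` function answers both questions; only the direction of the index changes.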
Table-level vs column-level lineage
Lineage comes at two resolutions, and the difference is enormous in practice:
- Table-level lineage records dependencies between datasets: "fact_sales is built from stg_orders, stg_order_items, and dim_customer." Useful, but coarse: it tells you the table is involved, not which part.
- Column-level lineage records dependencies between individual columns: "fact_sales.line_revenue is computed from stg_order_items.quantity × stg_order_items.unit_price, and fact_sales inherits the order total from stg_orders.amount, which traces to raw.orders.amount." It maps each field back through every transformation to its origin columns.
Column-level lineage is what turns "this table is somehow involved" into "this exact field comes from that exact source column via these specific steps." For both debugging and the privacy work in the next lesson, that precision is the difference between a five-minute answer and a five-hour hunt.
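To make the column-level idea concrete, here is a small sketch that traces a mart column back to its origin columns. It assumes the lineage is acyclic. The name `fact_sales.order_total` is hypothetical (the text does not name the column that inherits the order total), and the mapping from stg_order_items columns to raw.order_items columns is likewise an assumption for illustration.

```python
# Column-level edges: each derived column maps to the columns it is computed from.
COLUMN_PARENTS = {
    "fact_sales.line_revenue": [
        "stg_order_items.quantity",
        "stg_order_items.unit_price",
    ],
    # Hypothetical column name for the inherited order total.
    "fact_sales.order_total": ["stg_orders.amount"],
    "stg_orders.amount": ["raw.orders.amount"],
    # Assumed: staging columns pass straight through from the raw table.
    "stg_order_items.quantity": ["raw.order_items.quantity"],
    "stg_order_items.unit_price": ["raw.order_items.unit_price"],
}


def origin_columns(column, parents=COLUMN_PARENTS):
    """Return the source columns (those with no recorded parents) behind column.

    Assumes the lineage graph has no cycles.
    """
    direct = parents.get(column, [])
    if not direct:
        return {column}
    origins = set()
    for parent in direct:
        origins |= origin_columns(parent, parents)
    return origins
```

Calling `origin_columns("fact_sales.line_revenue")` returns the two raw.order_items columns and nothing from raw.orders, which is exactly the narrowing that table-level lineage cannot give you.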
:::note Why column-level is worth the extra effort
Table-level lineage on a 200-column mart tells you "something in this big table depends on that big table." Column-level lineage tells you the one column that's affected, so you fix the right thing, alert only the genuinely impacted consumers, and (crucially for PII, 11.6) know exactly which downstream columns inherit sensitive data. Many treatments of lineage stop at table-level, which leaves out the resolution that day-to-day debugging actually depends on.
:::
How lineage gets captured
You do not draw lineage by hand — it would be obsolete by lunch. It's captured automatically, three ways:
- By parsing SQL / transformation code. Tools read your queries and models (especially dbt, where every model declares its sources via ref() and source()) and derive the graph from the code itself. dbt produces table-level lineage for its own project essentially for free; column-level lineage additionally requires parsing the compiled SQL, which some dbt tooling and many catalogs can do.
- By reading query logs. Warehouses (Snowflake, BigQuery, Databricks) log every query; a catalog can parse those logs to reconstruct who-built-what across all tools, not just dbt.
- Via a standard. OpenLineage is an open specification for emitting lineage events from any job (Spark, Airflow, dbt, etc.), so a heterogeneous stack produces one unified graph instead of per-tool islands.
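As a rough illustration of the first capture method, the sketch below pulls parent datasets out of dbt-style model SQL by matching `ref()` and `source()` calls with regular expressions. It is a deliberate simplification: real tools render the Jinja properly, and the regexes here ignore forms such as the two-argument `ref()`. The SQL text, its column names, and the function name `model_parents` are hypothetical.

```python
import re

# Matches {{ ref('model') }} with single or double quotes.
REF_RE = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")
# Matches {{ source('schema', 'table') }} with single or double quotes.
SOURCE_RE = re.compile(
    r"\{\{\s*source\(\s*['\"]([^'\"]+)['\"]\s*,\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}"
)


def model_parents(sql):
    """Return the sorted upstream datasets a model's SQL declares."""
    refs = REF_RE.findall(sql)
    sources = [f"{schema}.{table}" for schema, table in SOURCE_RE.findall(sql)]
    return sorted(set(refs) | set(sources))


# Hypothetical model bodies, loosely following the ShopFlow example.
FACT_SALES_SQL = """
select o.order_id, i.quantity * i.unit_price as line_revenue
from {{ ref('stg_orders') }} o
join {{ ref('stg_order_items') }} i on i.order_id = o.order_id
join {{ ref('dim_customer') }} c on c.customer_id = o.customer_id
"""
STG_ORDERS_SQL = "select * from {{ source('raw', 'orders') }}"

print(model_parents(FACT_SALES_SQL))  # ['dim_customer', 'stg_order_items', 'stg_orders']
print(model_parents(STG_ORDERS_SQL))  # ['raw.orders']
```

Running this over every model in a project yields the table-level edge list; column-level edges need an actual SQL parser, not pattern matching.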
The graph then lives in a data catalog (next lesson) — OpenMetadata, DataHub, Unity Catalog, Atlan, Collibra, Apache Atlas — which renders it and lets you click through it.
The two jobs lineage does
Lineage isn't a pretty diagram for its own sake. It earns its keep on two everyday tasks that run in opposite directions.
1. Root-cause debugging — walk upstream
A ShopFlow revenue number is wrong, or an observability alert fired on fact_sales. You walk the lineage backward: fact_sales ← stg_orders ← raw.orders (and ← stg_order_items ← raw.order_items). At each hop you check the data; the first place it goes bad is your culprit. Combine with observability and the search collapses further: if the freshness/volume/distribution alerts also fired on stg_orders but not on stg_order_items, lineage + observability point straight at the orders ingestion. Lineage turns "the number's wrong, somewhere upstream" into a directed search instead of a guess.
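The "lineage plus observability" narrowing can be sketched as follows. Assume you have the set of datasets on which alerts fired; the likeliest root causes are the alerted datasets that have no alerted ancestor of their own. The function names and the shape of the `alerted` set are assumptions for illustration, and the sketch treats raw.orders as unmonitored, so the search bottoms out at stg_orders.

```python
# Parent (upstream) edges, following the ShopFlow chain in the text.
PARENTS = {
    "fact_sales": ["stg_orders", "stg_order_items"],
    "stg_orders": ["raw.orders"],
    "stg_order_items": ["raw.order_items"],
}


def upstream_of(node, parents):
    """Return every transitive ancestor of node."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def root_cause_candidates(start, parents, alerted):
    """Alerted datasets at or upstream of start with no alerted ancestor."""
    scope = (upstream_of(start, parents) | {start}) & alerted
    return sorted(
        node for node in scope if not (upstream_of(node, parents) & alerted)
    )


# Alerts fired on fact_sales and stg_orders, but not on stg_order_items.
alerted = {"fact_sales", "stg_orders"}
print(root_cause_candidates("fact_sales", PARENTS, alerted))  # ['stg_orders']
```

The result points at stg_orders, so the next thing to inspect is the step that loads it: the orders ingestion.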
2. Impact analysis — walk downstream
Impact analysis is the reverse: given a dataset (or column), find everything downstream that depends on it — its blast radius. You use it in two moments:
- Proactively, before you change something. ShopFlow's app team is about to change orders.amount, say by switching its units or widening its type. Impact analysis answers "what breaks if I do this?" before they do it. Walk downstream from orders.amount: it feeds stg_orders.amount, which feeds fact_sales, which feeds the exec revenue dashboard, the finance close report, and the churn model, all of them at risk. It's the data-side complement to the producer's breaking-change detection from the data contract lesson (11.3): the contract blocks the breaking change to orders.amount; impact analysis shows you who it would break so you can migrate them.
- Reactively, when something is already broken. A corrupt batch landed in stg_orders overnight (every orders.amount doubled). Impact analysis instantly lists every downstream asset touched by it, so you can tell the finance and exec teams "your ShopFlow revenue numbers are affected, hold off" before they make a decision on bad data. That early, accurate "here's who's affected" is a huge part of what trust in a data team actually is.
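A blast-radius query is a downstream walk that also remembers how each asset was reached, so you can explain to a consumer why they are affected. The sketch below is one way to do that under the assumption that lineage is stored as child edges; the asset identifiers `exec_revenue_dashboard`, `finance_close_report`, and `churn_model` are hypothetical names for the consumers mentioned in the text.

```python
from collections import deque

# Child (downstream) edges for the orders.amount example.
CHILDREN = {
    "orders.amount": ["stg_orders.amount"],
    "stg_orders.amount": ["fact_sales"],
    "fact_sales": [
        "exec_revenue_dashboard",
        "finance_close_report",
        "churn_model",
    ],
}


def blast_radius(start, children):
    """Map each downstream asset to the shortest lineage path from start."""
    paths = {}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for child in children.get(path[-1], []):
            if child not in paths:
                paths[child] = path + [child]
                queue.append(paths[child])
    return paths


impact = blast_radius("orders.amount", CHILDREN)
for asset, path in impact.items():
    print(f"{asset}: {' -> '.join(path)}")
```

The same function serves both moments: run it before a change to see what would break, or during an incident to list who needs to be told.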
:::tip Lineage is how trust is operationalized
Two of the three jobs from the chapter intro lean directly on lineage. Debugging (walk upstream to the root cause) and impact analysis (walk downstream to the blast radius) are the daily mechanics of being trusted: fast root cause, and "we'll tell you the moment your data is affected, accurately." A data team without lineage is slow to debug and can't reliably say who's affected, which is exactly when stakeholders stop trusting it.
:::
Lineage and the lessons around it
- Observability (11.4): lineage is the fifth pillar and the connective one — it's what makes the other four alerts investigable (root cause) and scopeable (impact).
- Contracts (11.3): impact analysis tells the producer who a breaking change affects; the contract is what enforces the conversation.
- Privacy (11.6): column-level lineage answers "where did this PII flow?" (for example, tracing customers.email into stg_customers and dim_customer). That is essential for finding every place a ShopFlow customer's data landed when they ask to be forgotten, and for verifying that masking on a source column actually propagates downstream.
Common pitfalls
- Stopping at table-level lineage. It tells you a table is involved, not which column — leaving the precise debugging and PII-tracing questions unanswered. Column-level is where the real value is.
- Hand-maintained lineage. A diagram in a wiki is wrong within a week. Lineage must be auto-derived from code/query logs or it's a liability.
- Lineage you can't act on. A graph with no ownership attached tells you what's affected but not who to tell. Lineage pays off when paired with ownership (11.6) so impact analysis routes to real people.
- Ignoring impact analysis before changes. Shipping a schema change without checking the blast radius is how the producer-boundary breakage of 11.3 happens. Look downstream first.
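The "lineage you can't act on" pitfall is easy to see in code: once impacted assets are known, routing them to people is a simple lookup, provided ownership metadata exists. In the sketch below the owner names, asset identifiers, and the `UNOWNED` placeholder are all hypothetical.

```python
from collections import defaultdict

# Hypothetical ownership metadata, as a catalog might hold it.
OWNERS = {
    "exec_revenue_dashboard": "analytics-team",
    "finance_close_report": "finance-team",
    "fact_sales": "analytics-team",
}


def notify_list(impacted, owners):
    """Group impacted assets by owner; unowned assets are surfaced explicitly."""
    by_owner = defaultdict(list)
    for asset in sorted(impacted):
        by_owner[owners.get(asset, "UNOWNED")].append(asset)
    return dict(by_owner)


impacted = {"fact_sales", "exec_revenue_dashboard", "finance_close_report", "churn_model"}
for owner, assets in notify_list(impacted, OWNERS).items():
    print(owner, assets)
```

Anything that lands under `UNOWNED` is the gap the pitfall describes: the graph knows the asset is affected, but nobody knows whom to tell.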
Why it matters
Lineage is the dependency graph of your data — where it comes from, how it flows, what it becomes. Table-level lineage names the datasets involved; column-level lineage traces each field back through every transformation to its origin, and that precision is what makes real debugging and PII-tracing tractable. It's captured automatically (parsing dbt/SQL, query logs, or OpenLineage) and lives in your catalog. It earns its keep on two opposite-direction jobs: root-cause debugging (walk upstream to find where a bad number entered) and impact analysis (walk downstream to find the blast radius — both before a change to see what breaks, and during an incident to tell affected consumers fast and accurately). Lineage is, quite literally, how a data team operationalizes trust. Next, we wrap quality, observability, and lineage inside the broader discipline that governs who may touch the data, whether you can find it, whether you're allowed to keep it, and what it costs: governance.