
Sources, freshness, snapshots, seeds & docs

The previous lessons covered the core of dbt — models, materializations, Jinja, tests. This lesson rounds out the toolkit with the pieces that make a project production-grade: knowing your raw data is fresh, capturing history that source systems throw away, loading reference data, declaring who consumes your data, and generating documentation that never goes stale. These are the features beginner tutorials skip and real jobs depend on.

Sources and freshness: trust your inputs

You met sources in the mental-model lesson — the YAML declarations of the raw tables EL loaded, referenced with {{ source(...) }}. Declaring sources buys you a second, crucial capability: source freshness checks.

A pipeline can run perfectly and still produce wrong results if its raw input is stale — if the sync that lands raw.orders broke last night, dbt happily transforms yesterday's ShopFlow orders into today's "fresh" revenue dashboard. A freshness check catches this. You annotate a source with how recently you expect new data, and dbt source freshness checks the most recent timestamp against that threshold:

# models/staging/_sources.yml (ShopFlow: see Meet ShopFlow)
sources:
  - name: shopflow
    schema: raw
    tables:
      - name: orders
        loaded_at_field: _loaded_at  # column holding the load timestamp
        freshness:
          warn_after: {count: 6, period: hour}    # warn if newest row > 6h old
          error_after: {count: 24, period: hour}  # error if > 24h old

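dbt runs select max(_loaded_at) from raw.orders and compares the result to the current time. If the newest row is more than 6 hours old, the check warns; if it is more than 24 hours old, the check errors and dbt source freshness exits with a failure. Run that command as the first step of the job so a failure stops the pipeline before it builds anything on stale ShopFlow data. This is the data-engineering equivalent of checking the milk's date before you cook with it. (Source freshness is a pillar of data observability; see Chapter 11.)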
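The threshold comparison can be pictured with a small Python sketch. This is not dbt's implementation; the function name `freshness_status` and the status labels are illustrative assumptions, and the thresholds simply mirror the YAML above.

```python
from datetime import datetime, timedelta, timezone

# Thresholds mirroring the YAML above: warn after 6 hours, error after 24.
WARN_AFTER = timedelta(hours=6)
ERROR_AFTER = timedelta(hours=24)


def freshness_status(max_loaded_at: datetime, now: datetime) -> str:
    """Classify a source as 'pass', 'warn', or 'error' from the age of its newest row."""
    age = now - max_loaded_at
    if age > ERROR_AFTER:
        return "error"
    if age > WARN_AFTER:
        return "warn"
    return "pass"


# A fixed "now" keeps the example deterministic.
now = datetime(2026, 3, 15, 12, 0, tzinfo=timezone.utc)
print(freshness_status(now - timedelta(hours=2), now))   # pass
print(freshness_status(now - timedelta(hours=9), now))   # warn
print(freshness_status(now - timedelta(hours=30), now))  # error
```

The error threshold is checked first, so a source that is 30 hours old reports the more severe status instead of a warning.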

Snapshots: dbt's SCD Type 2

Here's a problem that bites every data team, and ShopFlow's dim_customer (see Meet ShopFlow) is a textbook case. Source databases are mutable: when a ShopFlow customer moves from Seattle to Austin, the city on that customer's row in the source customers table is updated in place. The old value is gone. But analytics often needs the history: "how much revenue came from customers living in Seattle last March?" The source can't answer that; it only knows the current city.

The pattern for capturing that history is called Slowly Changing Dimension Type 2 (SCD Type 2), the same technique you used to design dim_customer in Chapter 4. Instead of overwriting a changed row, you keep every version with validity timestamps. dbt implements this with snapshots: each time dbt snapshot runs, dbt compares the raw source to the snapshot table and, when a tracked row has changed, closes off the old version and inserts a new one.

-- snapshots/dim_customer_snapshot.sql
{% snapshot dim_customer_snapshot %}
{# strategy='timestamp' detects changes via the updated_at column #}
{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}
select customer_id, name, email, city, region, signup_date, updated_at
from {{ source('shopflow', 'customers') }}
{% endsnapshot %}
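To make the timestamp strategy concrete, here is a simplified Python model of what one snapshot run does to the history table. It is a sketch under assumptions, not dbt's actual merge logic: `apply_snapshot` is a hypothetical helper, dates are plain ISO strings, and deletes and dbt's other metadata columns are ignored.

```python
def apply_snapshot(history: list[dict], source_rows: list[dict]) -> list[dict]:
    """Apply one snapshot run: close changed versions and append new ones."""
    # Index the currently open version (dbt_valid_to is None) of each customer.
    current = {r["customer_id"]: r for r in history if r["dbt_valid_to"] is None}
    for src in source_rows:
        open_row = current.get(src["customer_id"])
        if open_row is None:
            # First time this key is seen: open its first version.
            history.append({**src, "dbt_valid_from": src["updated_at"], "dbt_valid_to": None})
        elif src["updated_at"] > open_row["updated_at"]:
            # The source row is newer than the open version: close it, open a new one.
            open_row["dbt_valid_to"] = src["updated_at"]
            history.append({**src, "dbt_valid_from": src["updated_at"], "dbt_valid_to": None})
    return history


history: list[dict] = []
apply_snapshot(history, [{"customer_id": 42, "city": "Seattle", "updated_at": "2026-01-01"}])
apply_snapshot(history, [{"customer_id": 42, "city": "Austin", "updated_at": "2026-03-15"}])
# Re-running with an unchanged source row adds nothing.
apply_snapshot(history, [{"customer_id": 42, "city": "Austin", "updated_at": "2026-03-15"}])

for row in history:
    print(row["city"], row["dbt_valid_from"], row["dbt_valid_to"])
```

The third call shows why a daily run is safe to repeat: a row whose updated_at has not advanced leaves the history untouched.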

Run dbt snapshot daily and dbt maintains the history table that becomes dim_customer. Watch one customer move cities — the exact Seattle→Austin change from Chapter 4:

| customer_id | city | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- |
| 42 | Seattle | 2026-01-01 | 2026-03-15 |
| 42 | Austin | 2026-03-15 | null |

dbt adds the dbt_valid_from / dbt_valid_to columns automatically. A null dbt_valid_to means "this is the current version." Now "which customers lived in Seattle on 2026-02-01?" is answerable: find rows where dbt_valid_from is on or before that date and dbt_valid_to is either later than it or null. The two change-detection strategies are timestamp (trust ShopFlow's customers.updated_at column; fast and preferred) and check (compare a list of columns for any change; use it when there's no reliable timestamp).
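A point-in-time lookup against the snapshot can be sketched as follows. The example uses an in-memory SQLite table as a stand-in for the warehouse, stores dates as ISO strings, and assumes a half-open validity interval (valid from inclusive, valid to exclusive); `city_as_of` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table dim_customer_snapshot ("
    "customer_id integer, city text, dbt_valid_from text, dbt_valid_to text)"
)
conn.executemany(
    "insert into dim_customer_snapshot values (?, ?, ?, ?)",
    [
        (42, "Seattle", "2026-01-01", "2026-03-15"),
        (42, "Austin", "2026-03-15", None),  # null dbt_valid_to = current version
    ],
)


def city_as_of(customer_id: int, as_of: str):
    """Return the city recorded for a customer on the given ISO date, or None."""
    row = conn.execute(
        """
        select city
        from dim_customer_snapshot
        where customer_id = ?
          and dbt_valid_from <= ?
          and (dbt_valid_to > ? or dbt_valid_to is null)
        """,
        (customer_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None


print(city_as_of(42, "2026-02-01"))  # Seattle
print(city_as_of(42, "2026-03-15"))  # Austin
print(city_as_of(42, "2025-12-31"))  # None
```

Treating dbt_valid_to as exclusive means the change date itself matches exactly one version, so a join on this condition never double-counts a customer.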

:::tip Snapshot the source, early and forever
You can only capture history going forward, from the moment you start snapshotting. If a column's history might ever matter (ShopFlow's city certainly does), snapshot it now; you can't recover versions overwritten before you started. Snapshots run against the raw customers source (before transformation) so you capture truth at the edge.
:::

Seeds: small reference data as code

Sometimes you need a small, static lookup table that doesn't live in any source system — for ShopFlow, a mapping of each product category to a merchandising department that the app never stored. A seed is a CSV file you commit to the seeds/ folder; dbt seed loads it into the warehouse as a table you can ref() like any model.

seeds/category_departments.csv:

category,department
Books,Media
Electronics,Tech
Apparel,Softlines

A model can then join to the seed with ref():

select p.*, d.department
from {{ ref('stg_products') }} p
left join {{ ref('category_departments') }} d on p.category = d.category

Seeds are version-controlled (the data lives in git, so changes are reviewable) — perfect for small, hand-maintained reference data. Not for loading real source data; that's EL's job. The line: seeds are for tiny, static, business-owned lookups, measured in rows, not gigabytes.
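One consequence of the left join is worth seeing in miniature: a product whose category is missing from the seed keeps its row but gets no department. The Python sketch below imitates that join with the seed's CSV content; the product rows, including the "Garden" category, are hypothetical stand-ins for stg_products.

```python
import csv
import io

SEED_CSV = """category,department
Books,Media
Electronics,Tech
Apparel,Softlines
"""

# Load the seed into a category -> department lookup.
departments = {
    row["category"]: row["department"]
    for row in csv.DictReader(io.StringIO(SEED_CSV))
}

# Hypothetical product rows standing in for stg_products.
products = [
    {"product_id": 1, "category": "Books"},
    {"product_id": 2, "category": "Electronics"},
    {"product_id": 3, "category": "Garden"},  # not present in the seed
]

# Left join: keep every product; unmapped categories get department None.
joined = [{**p, "department": departments.get(p["category"])} for p in products]

unmapped = sorted({p["category"] for p in joined if p["department"] is None})
print(unmapped)  # ['Garden']
```

A check like `unmapped` is a cheap way to notice that the seed needs a new row when the app introduces a category.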

Exposures: declaring who depends on your data

A dbt DAG normally ends at your marts — but in reality, something consumes those marts: a dashboard, an ML model, a reverse-ETL sync. An exposure is a YAML declaration of one of those downstream consumers, which dbt adds to the DAG as a leaf node:

# models/marts/_exposures.yml
exposures:
  - name: shopflow_revenue_dashboard
    type: dashboard
    url: https://bi.shopflow.com/dashboards/42
    depends_on:
      - ref('fact_sales')
    owner: {name: Data Team, email: data@shopflow.com}

Now lineage extends past dbt into the real world. Before you change fact_sales, dbt's graph tells you the ShopFlow revenue dashboard depends on it — so you know what you might break. Exposures turn "what queries this model?" from a guessing game into a documented fact.
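The impact question ("what breaks if I change this?") is a downstream graph walk. The Python sketch below illustrates the idea on a hand-written copy of the ShopFlow DAG; the `CHILDREN` mapping and the `exposure:` / `source:` name prefixes are assumptions made for the example, not dbt's internal representation.

```python
from collections import deque

# Hypothetical edges: each node maps to the nodes that depend on it.
CHILDREN = {
    "source:raw.orders": ["stg_orders"],
    "source:raw.order_items": ["stg_order_items"],
    "stg_orders": ["fact_sales"],
    "stg_order_items": ["fact_sales"],
    "fact_sales": ["exposure:shopflow_revenue_dashboard"],
}


def downstream_exposures(node: str) -> list[str]:
    """Breadth-first walk from node, returning every exposure reachable downstream."""
    seen, queue, found = {node}, deque([node]), []
    while queue:
        for child in CHILDREN.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
                if child.startswith("exposure:"):
                    found.append(child)
    return sorted(found)


print(downstream_exposures("stg_orders"))  # ['exposure:shopflow_revenue_dashboard']
```

In a real project you would not write this yourself; a graph selector such as `dbt ls --select "fact_sales+"` lists the downstream nodes for you.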

Docs and the lineage graph: living documentation

The single most loved dbt feature is dbt docs. Because dbt already knows every model, its ref()/source() dependencies, its tests, and its column descriptions (from YAML), it can generate a documentation website automatically — including an interactive lineage graph of the entire DAG.

[Lineage graph: source raw.orders → stg_orders → fact_sales; source raw.order_items → stg_order_items → fact_sales; fact_sales → exposure ShopFlow revenue dashboard]

Add a description to any model or column in YAML and it shows up in the docs:

models:
  - name: fact_sales
    description: "ShopFlow sales fact. Grain: one order line item."
    columns:
      - name: line_revenue
        description: "Line revenue = quantity × unit_price; see the line_revenue macro."

Why this matters more than ordinary docs: it's living documentation generated from the code itself. A hand-written wiki rots the moment someone renames a column. The dbt docs site is rebuilt from the project each time you run dbt docs generate, so if that command runs with every deployment, the lineage and descriptions stay in sync with the actual project. Click any node and you see its SQL, its tests, its description, and what's upstream and downstream, which answers "where does this number come from?" and "what breaks if I change this?" instantly. This DAG-driven lineage is exactly what the 2026 engines push further with column-level lineage, which we'll meet in the next lesson.

Why it matters

These are the features that make a dbt project trustworthy in production. Sources + freshness verify your raw inputs aren't stale before you build on them. Snapshots implement SCD Type 2 — capturing the history mutable source systems overwrite, but only going forward, so start early. Seeds version-control small reference lookups as code. Exposures extend lineage to the dashboards and models that consume your data. And dbt docs turns the whole DAG into living, auto-generated documentation and a clickable lineage graph that's always in sync with the code. With models, tests, history, and docs in hand, the last lesson assembles them into a layered architecture, a governed metrics layer, and a real CI/CD workflow.
