dbt's mental model: it's just compiled SQL on a DAG
dbt has a reputation for being magical, and that reputation hurts learners — they memorize syntax without understanding what's happening. This lesson kills the magic. By the end you'll hold the single mental model that explains everything dbt does:
A dbt model is just a
SELECTstatement.ref()wires models into a DAG. dbt compiles the whole thing down to plain SQL and runs it in your warehouse. That's it.
Everything else — tests, docs, materializations, incremental loads — is built on that foundation.
What dbt is (and isn't)
dbt (data build tool) is the T in ELT (from the previous lesson): a framework for transforming data that's already in your warehouse. Crucially, dbt does not move or process data itself. It has no compute engine. It is a compiler and orchestrator: you write SQL SELECT statements, and dbt turns them into the right CREATE TABLE/CREATE VIEW commands and sends them to the warehouse, which does all the actual work.
That's the first thing to internalize: the warehouse does the computing; dbt just generates and sequences the SQL. dbt is to your warehouse what a build tool like Make or Maven is to a compiler — it figures out what to build and in what order, then hands the real work off.
There are two main flavors, identical in concept for learning:
- dbt Core — the free, open-source command-line tool. You run
dbt run,dbt test, etc. yourself. - dbt Cloud — a paid hosted product that adds a web IDE, a scheduler, a docs viewer, and CI — wrapping dbt Core.
A model is a SELECT
A model is the central object in dbt, and it is shockingly simple: a model is a single .sql file containing one SELECT statement. The filename becomes the model's name. That's the whole definition.
Here is a complete, valid dbt model on ShopFlow (ShopFlow — see Meet ShopFlow). Save it as stg_orders.sql — it cleans the raw.orders table Chapter 6 landed:
-- models/staging/stg_orders.sql
select
order_id,
customer_id,
order_ts,
status,
amount as order_amount
from raw.orders
When you run dbt run, dbt wraps that SELECT in a CREATE statement and executes it in the warehouse. By default it builds a view, so dbt sends roughly:
create view analytics.stg_orders as (
select order_id, customer_id, order_ts, ... from raw.orders
);
You wrote a SELECT; dbt added the create view ... as around it. You never write CREATE, INSERT, or DROP — you just describe the shape of the result you want, and dbt handles the boilerplate to persist it. (How dbt decides whether to build a view, a table, or something else is the subject of the next lesson on materializations.)
ref(): how models depend on each other
Real transformation is layered: a clean customers table is built from a staging table, which is built from raw. You never want to hard-code the physical table name of one model inside another — that's brittle. Instead, models refer to each other with the ref() function.
Here's a second model that builds on top of stg_orders — the per-customer rollup that feeds dim_customer's lifetime_value:
-- models/intermediate/int_customer_orders.sql
select
customer_id,
count(*) as order_count,
sum(order_amount) as lifetime_value
from {{ ref('stg_orders') }} -- <-- depend on the model, not a raw table
group by customer_id
{{ ref('stg_orders') }} is the heart of dbt. Those {{ }} are Jinja (a templating language we'll cover fully in the Jinja & macros lesson) — for now just read ref('stg_orders') as "the table that the stg_orders model builds." When dbt compiles int_customer_orders, it replaces {{ ref('stg_orders') }} with the actual fully-qualified table name, e.g. analytics.stg_orders. The compiled SQL becomes:
create view analytics.int_customer_orders as (
select customer_id, count(*) as order_count, sum(order_amount) as lifetime_value
from analytics.stg_orders -- ref() resolved to the real name
group by customer_id
);
ref() does two jobs at once, and both matter:
- It resolves the real physical name — so if you rename a model or point it at a different schema (dev vs prod), every downstream model follows automatically. No hard-coded names to update.
- It declares a dependency — by reading
ref('stg_orders')insideint_customer_orders, dbt learns thatint_customer_ordersdepends onstg_orders, and therefore must be built after it.
The DAG: dependencies dbt infers for you
Collect every ref() across all your models and you get a graph: who depends on whom. That graph is a DAG — a Directed Acyclic Graph (directed = dependencies point one way; acyclic = no loops, since a model can't depend on itself). dbt builds this DAG automatically just by reading the ref() calls — you never write it by hand.
The DAG is what makes dbt powerful. From it, dbt knows:
- Build order. When you run
dbt run, dbt does a topological sort of the DAG and builds models in dependency order —stg_ordersbeforeint_customer_orders, always — and builds independent models in parallel for speed. - What's downstream. Change
stg_ordersand dbt can tell you (and rebuild) everything affected. - Lineage. The DAG is your data lineage — covered when we reach docs and the lineage graph.
This is the punchline of the whole tool: you write small SELECTs and connect them with ref(); dbt assembles them into a DAG and compiles it to ordered, plain SQL.
source(): naming your raw inputs
Models ref() other models, but the DAG has to start somewhere — at the raw tables that EL loaded into the warehouse. You don't ref() those; you declare them as sources in a YAML file and reference them with source():
# models/staging/_sources.yml
sources:
- name: shopflow # a logical group
schema: raw # the real schema in the warehouse (raw landing, Ch. 6)
tables:
- name: orders
- name: customers
- name: products
- name: order_items
-- models/staging/stg_orders.sql (the grown-up version)
select order_id, customer_id, order_ts, status, amount as order_amount
from {{ source('shopflow', 'orders') }} -- not a hard-coded "raw.orders"
{{ source('shopflow', 'orders') }} compiles to the real raw table name (raw.orders). Declaring sources gives you the same two benefits as ref() — a single place to change physical names, and explicit DAG entry points — plus a hook for freshness checks (covered later). The rule of thumb: source() for raw data EL gave you; ref() for everything dbt builds.
The fact at the finest grain
ShopFlow's central mart, fact_sales, lives at the order-line-item grain — one row per product line within an order. It joins the staging models and computes the one measure everything downstream needs, line_revenue:
-- models/marts/fact_sales.sql
{{ config(materialized='table') }}
select
oi.order_id,
oi.product_id,
o.customer_id,
cast(o.order_ts as date) as order_date,
oi.quantity,
oi.unit_price,
oi.quantity * oi.unit_price as line_revenue -- the core measure
from {{ ref('stg_order_items') }} oi
join {{ ref('stg_orders') }} o on oi.order_id = o.order_id
where o.status <> 'cancelled'
Two ref()s — stg_order_items and stg_orders — so dbt knows fact_sales builds after both. line_revenue = quantity × unit_price is exactly the measure the Chapter 4 star schema defined; you'll see it pulled into a macro and a metric in later lessons.
The compiled-SQL mental model
Whenever dbt confuses you, run dbt compile (or look in the target/compiled/ folder) and read the compiled SQL. dbt takes your SELECT + Jinja and produces plain SQL with every ref() and source() replaced by real table names — exactly what the warehouse runs. There is no hidden magic underneath; it's SQL all the way down.
:::tip When in doubt, read the compiled SQL
The biggest beginner mistake is treating dbt as a black box. It isn't — it's a SQL generator. If a model behaves unexpectedly, open its compiled version in target/compiled/ and read the literal SQL that ran. Ninety percent of dbt debugging is just reading compiled SQL.
:::
Project structure: where things live
A dbt project is a folder of files with a predictable layout. You don't need to memorize it, just recognize the pieces:
my_project/
├── dbt_project.yml # project config: name, model paths, defaults
├── models/ # your .sql models + .yml docs/tests
│ ├── staging/
│ └── marts/
├── tests/ # custom (singular) tests
├── macros/ # reusable Jinja functions
├── seeds/ # small CSVs loaded as tables
├── snapshots/ # SCD-2 history captures
└── target/ # generated artifacts (compiled SQL, etc.)
dbt_project.yml— the project's config file: its name, where models live, and default materializations.profiles.yml— kept outside the project (it holds connection secrets). It tells dbt which warehouse to connect to and with what credentials. The same project can point at a dev warehouse or prod just by switching profiles — this is how dbt separates "what to build" (the versioned project) from "where to build it" (the local profile).- Adapters — dbt connects to each warehouse through an adapter (
dbt-snowflake,dbt-bigquery,dbt-databricks,dbt-postgres). The adapter knows that warehouse's SQL dialect. Your models stay mostly portable; the adapter handles the dialect differences.
When NOT to use dbt
dbt is the default for warehouse SQL transformation — but its mental model tells you exactly where it stops being the right tool. dbt has no compute, only knows SQL, and runs as scheduled batch builds, so it's a poor fit when any of those assumptions break:
- Non-SQL or row-by-row transforms. dbt compiles
SELECTs; it can't express logic that isn't relational set-based work — calling an external API per row, heavy Python/ML feature engineering, image or document parsing. That belongs in Python/Spark, not a dbt model. - Streaming / real-time. dbt builds tables and views on a schedule (
dbt run); it does not process events as they arrive. Sub-minute freshness is a streaming job's job (Chapter 9), not dbt's. - A general-purpose orchestrator. dbt sequences its own models via the DAG, but it doesn't ingest data, call APIs, or coordinate non-dbt steps.
dbt runis usually one task inside an orchestrator like Airflow/Dagster (Chapter 8), not a replacement for it. - A handful of tables / low volume. For a few small tables transformed once, dbt's project scaffolding, profiles, and DAG are overhead you don't recoup. A couple of plain
CREATE TABLE ASstatements (or a scheduled SQL script) is simpler.
:::tip The honest decision rule Reach for dbt when you have many interdependent SQL transformations in a warehouse that benefit from versioning, testing, lineage, and automatic build-ordering. Skip it when the work isn't SQL, isn't in a warehouse, needs real-time freshness, or is small enough that the DAG buys you nothing. dbt owns the T in batch ELT — it is not an orchestrator, a stream processor, or a compute engine. :::
Why it matters
Strip away the reputation and dbt is one clean idea: a model is a SELECT; ref() and source() wire models into a DAG; dbt compiles that DAG to ordered plain SQL and runs it in your warehouse. dbt itself owns no compute — it's a SQL compiler and build tool. The DAG it infers from your ref() calls gives you free build-ordering, parallelism, downstream impact, and lineage. Hold the "it's just compiled SQL on a DAG" model and nothing dbt does will surprise you. The next question is what kind of object each model becomes in the warehouse — a view, a table, or an incrementally-updated table — which is exactly what materializations control.