Skip to main content

dbt's mental model: it's just compiled SQL on a DAG

dbt has a reputation for being magical, and that reputation hurts learners — they memorize syntax without understanding what's happening. This lesson kills the magic. By the end you'll hold the single mental model that explains everything dbt does:

A dbt model is just a SELECT statement. ref() wires models into a DAG. dbt compiles the whole thing down to plain SQL and runs it in your warehouse. That's it.

Everything else — tests, docs, materializations, incremental loads — is built on that foundation.

What dbt is (and isn't)

dbt (data build tool) is the T in ELT (from the previous lesson): a framework for transforming data that's already in your warehouse. Crucially, dbt does not move or process data itself. It has no compute engine. It is a compiler and orchestrator: you write SQL SELECT statements, and dbt turns them into the right CREATE TABLE/CREATE VIEW commands and sends them to the warehouse, which does all the actual work.

That's the first thing to internalize: the warehouse does the computing; dbt just generates and sequences the SQL. dbt is to your warehouse what a build tool like Make or Maven is to a compiler — it figures out what to build and in what order, then hands the real work off.

There are two main flavors, identical in concept for learning:

  • dbt Core — the free, open-source command-line tool. You run dbt run, dbt test, etc. yourself.
  • dbt Cloud — a paid hosted product that adds a web IDE, a scheduler, a docs viewer, and CI — wrapping dbt Core.

A model is a SELECT

A model is the central object in dbt, and it is shockingly simple: a model is a single .sql file containing one SELECT statement. The filename becomes the model's name. That's the whole definition.

Here is a complete, valid dbt model on ShopFlow (ShopFlow — see Meet ShopFlow). Save it as stg_orders.sql — it cleans the raw.orders table Chapter 6 landed:

-- models/staging/stg_orders.sql
select
order_id,
customer_id,
order_ts,
status,
amount as order_amount
from raw.orders

When you run dbt run, dbt wraps that SELECT in a CREATE statement and executes it in the warehouse. By default it builds a view, so dbt sends roughly:

create view analytics.stg_orders as (
select order_id, customer_id, order_ts, ... from raw.orders
);

You wrote a SELECT; dbt added the create view ... as around it. You never write CREATE, INSERT, or DROP — you just describe the shape of the result you want, and dbt handles the boilerplate to persist it. (How dbt decides whether to build a view, a table, or something else is the subject of the next lesson on materializations.)

ref(): how models depend on each other

Real transformation is layered: a clean customers table is built from a staging table, which is built from raw. You never want to hard-code the physical table name of one model inside another — that's brittle. Instead, models refer to each other with the ref() function.

Here's a second model that builds on top of stg_orders — the per-customer rollup that feeds dim_customer's lifetime_value:

-- models/intermediate/int_customer_orders.sql
select
customer_id,
count(*) as order_count,
sum(order_amount) as lifetime_value
from {{ ref('stg_orders') }} -- <-- depend on the model, not a raw table
group by customer_id

{{ ref('stg_orders') }} is the heart of dbt. Those {{ }} are Jinja (a templating language we'll cover fully in the Jinja & macros lesson) — for now just read ref('stg_orders') as "the table that the stg_orders model builds." When dbt compiles int_customer_orders, it replaces {{ ref('stg_orders') }} with the actual fully-qualified table name, e.g. analytics.stg_orders. The compiled SQL becomes:

create view analytics.int_customer_orders as (
select customer_id, count(*) as order_count, sum(order_amount) as lifetime_value
from analytics.stg_orders -- ref() resolved to the real name
group by customer_id
);

ref() does two jobs at once, and both matter:

  1. It resolves the real physical name — so if you rename a model or point it at a different schema (dev vs prod), every downstream model follows automatically. No hard-coded names to update.
  2. It declares a dependency — by reading ref('stg_orders') inside int_customer_orders, dbt learns that int_customer_orders depends on stg_orders, and therefore must be built after it.

The DAG: dependencies dbt infers for you

Collect every ref() across all your models and you get a graph: who depends on whom. That graph is a DAG — a Directed Acyclic Graph (directed = dependencies point one way; acyclic = no loops, since a model can't depend on itself). dbt builds this DAG automatically just by reading the ref() calls — you never write it by hand.

("raw.orders\n(source)")stg_orders\n(model)("raw.customers\n(source)")stg_customers\n(model)int_customer_orders\n(model)dim_customer\n(mart)

The DAG is what makes dbt powerful. From it, dbt knows:

  • Build order. When you run dbt run, dbt does a topological sort of the DAG and builds models in dependency order — stg_orders before int_customer_orders, always — and builds independent models in parallel for speed.
  • What's downstream. Change stg_orders and dbt can tell you (and rebuild) everything affected.
  • Lineage. The DAG is your data lineage — covered when we reach docs and the lineage graph.

This is the punchline of the whole tool: you write small SELECTs and connect them with ref(); dbt assembles them into a DAG and compiles it to ordered, plain SQL.

source(): naming your raw inputs

Models ref() other models, but the DAG has to start somewhere — at the raw tables that EL loaded into the warehouse. You don't ref() those; you declare them as sources in a YAML file and reference them with source():

# models/staging/_sources.yml
sources:
- name: shopflow # a logical group
schema: raw # the real schema in the warehouse (raw landing, Ch. 6)
tables:
- name: orders
- name: customers
- name: products
- name: order_items
-- models/staging/stg_orders.sql (the grown-up version)
select order_id, customer_id, order_ts, status, amount as order_amount
from {{ source('shopflow', 'orders') }} -- not a hard-coded "raw.orders"

{{ source('shopflow', 'orders') }} compiles to the real raw table name (raw.orders). Declaring sources gives you the same two benefits as ref() — a single place to change physical names, and explicit DAG entry points — plus a hook for freshness checks (covered later). The rule of thumb: source() for raw data EL gave you; ref() for everything dbt builds.

The fact at the finest grain

ShopFlow's central mart, fact_sales, lives at the order-line-item grain — one row per product line within an order. It joins the staging models and computes the one measure everything downstream needs, line_revenue:

-- models/marts/fact_sales.sql
{{ config(materialized='table') }}

select
oi.order_id,
oi.product_id,
o.customer_id,
cast(o.order_ts as date) as order_date,
oi.quantity,
oi.unit_price,
oi.quantity * oi.unit_price as line_revenue -- the core measure
from {{ ref('stg_order_items') }} oi
join {{ ref('stg_orders') }} o on oi.order_id = o.order_id
where o.status <> 'cancelled'

Two ref()s — stg_order_items and stg_orders — so dbt knows fact_sales builds after both. line_revenue = quantity × unit_price is exactly the measure the Chapter 4 star schema defined; you'll see it pulled into a macro and a metric in later lessons.

The compiled-SQL mental model

Whenever dbt confuses you, run dbt compile (or look in the target/compiled/ folder) and read the compiled SQL. dbt takes your SELECT + Jinja and produces plain SQL with every ref() and source() replaced by real table names — exactly what the warehouse runs. There is no hidden magic underneath; it's SQL all the way down.

You write:\nSELECT +ref()/source()\n+Jinjadbt compilesPlain SQL\n(refsresolved to\nrealtable names)Warehouse runsit\nas CREATETABLE/VIEW

:::tip When in doubt, read the compiled SQL The biggest beginner mistake is treating dbt as a black box. It isn't — it's a SQL generator. If a model behaves unexpectedly, open its compiled version in target/compiled/ and read the literal SQL that ran. Ninety percent of dbt debugging is just reading compiled SQL. :::

Project structure: where things live

A dbt project is a folder of files with a predictable layout. You don't need to memorize it, just recognize the pieces:

my_project/
├── dbt_project.yml # project config: name, model paths, defaults
├── models/ # your .sql models + .yml docs/tests
│ ├── staging/
│ └── marts/
├── tests/ # custom (singular) tests
├── macros/ # reusable Jinja functions
├── seeds/ # small CSVs loaded as tables
├── snapshots/ # SCD-2 history captures
└── target/ # generated artifacts (compiled SQL, etc.)
  • dbt_project.yml — the project's config file: its name, where models live, and default materializations.
  • profiles.yml — kept outside the project (it holds connection secrets). It tells dbt which warehouse to connect to and with what credentials. The same project can point at a dev warehouse or prod just by switching profiles — this is how dbt separates "what to build" (the versioned project) from "where to build it" (the local profile).
  • Adapters — dbt connects to each warehouse through an adapter (dbt-snowflake, dbt-bigquery, dbt-databricks, dbt-postgres). The adapter knows that warehouse's SQL dialect. Your models stay mostly portable; the adapter handles the dialect differences.

When NOT to use dbt

dbt is the default for warehouse SQL transformation — but its mental model tells you exactly where it stops being the right tool. dbt has no compute, only knows SQL, and runs as scheduled batch builds, so it's a poor fit when any of those assumptions break:

  • Non-SQL or row-by-row transforms. dbt compiles SELECTs; it can't express logic that isn't relational set-based work — calling an external API per row, heavy Python/ML feature engineering, image or document parsing. That belongs in Python/Spark, not a dbt model.
  • Streaming / real-time. dbt builds tables and views on a schedule (dbt run); it does not process events as they arrive. Sub-minute freshness is a streaming job's job (Chapter 9), not dbt's.
  • A general-purpose orchestrator. dbt sequences its own models via the DAG, but it doesn't ingest data, call APIs, or coordinate non-dbt steps. dbt run is usually one task inside an orchestrator like Airflow/Dagster (Chapter 8), not a replacement for it.
  • A handful of tables / low volume. For a few small tables transformed once, dbt's project scaffolding, profiles, and DAG are overhead you don't recoup. A couple of plain CREATE TABLE AS statements (or a scheduled SQL script) is simpler.

:::tip The honest decision rule Reach for dbt when you have many interdependent SQL transformations in a warehouse that benefit from versioning, testing, lineage, and automatic build-ordering. Skip it when the work isn't SQL, isn't in a warehouse, needs real-time freshness, or is small enough that the DAG buys you nothing. dbt owns the T in batch ELT — it is not an orchestrator, a stream processor, or a compute engine. :::

Why it matters

Strip away the reputation and dbt is one clean idea: a model is a SELECT; ref() and source() wire models into a DAG; dbt compiles that DAG to ordered plain SQL and runs it in your warehouse. dbt itself owns no compute — it's a SQL compiler and build tool. The DAG it infers from your ref() calls gives you free build-ordering, parallelism, downstream impact, and lineage. Hold the "it's just compiled SQL on a DAG" model and nothing dbt does will surprise you. The next question is what kind of object each model becomes in the warehouse — a view, a table, or an incrementally-updated table — which is exactly what materializations control.

Next: Materializations & incremental models →