dbt's mental model: it's just compiled SQL on a DAG

dbt has a reputation for being magical, and that reputation hurts learners — they memorize syntax without understanding what's happening. This lesson kills the magic. By the end you'll hold the single mental model that explains everything dbt does:

A dbt model is just a SELECT statement. ref() wires models into a DAG. dbt compiles the whole thing down to plain SQL and runs it in your warehouse. That's it.

Everything else — tests, docs, materializations, incremental loads — is built on that foundation.

What dbt is (and isn't)

dbt (data build tool) is the T in ELT (from the previous lesson): a framework for transforming data that's already in your warehouse. Crucially, dbt does not move or process data itself. It has no compute engine. It is a compiler and orchestrator: you write SQL SELECT statements, and dbt turns them into the right CREATE TABLE/CREATE VIEW commands and sends them to the warehouse, which does all the actual work.

That's the first thing to internalize: the warehouse does the computing; dbt just generates and sequences the SQL. dbt is to your warehouse what a build tool like Make or Maven is to a compiler — it figures out what to build and in what order, then hands the real work off.

There are two main flavors, identical in concept for learning:

dbt Core — the free, open-source command-line tool. You run dbt run, dbt test, etc. yourself.
dbt Cloud — a paid hosted product that adds a web IDE, a scheduler, a docs viewer, and CI — wrapping dbt Core.

A model is a SELECT

A model is the central object in dbt, and it is shockingly simple: a model is a single .sql file containing one SELECT statement. The filename becomes the model's name. That's the whole definition.

Here is a complete, valid dbt model on ShopFlow (ShopFlow — see Meet ShopFlow). Save it as stg_orders.sql — it cleans the raw.orders table Chapter 6 landed:

-- models/staging/stg_orders.sql
select
    order_id,
    customer_id,
    order_ts,
    status,
    amount as order_amount
from raw.orders

When you run dbt run, dbt wraps that SELECT in a CREATE statement and executes it in the warehouse. By default it builds a view, so dbt sends roughly:

create view analytics.stg_orders as (
    select order_id, customer_id, order_ts, ... from raw.orders
);

You wrote a SELECT; dbt added the create view ... as around it. You never write CREATE, INSERT, or DROP — you just describe the shape of the result you want, and dbt handles the boilerplate to persist it. (How dbt decides whether to build a view, a table, or something else is the subject of the next lesson on materializations.)

`ref()`: how models depend on each other

Real transformation is layered: a clean customers table is built from a staging table, which is built from raw. You never want to hard-code the physical table name of one model inside another — that's brittle. Instead, models refer to each other with the ref() function.

Here's a second model that builds on top of stg_orders — the per-customer rollup that feeds dim_customer's lifetime_value:

-- models/intermediate/int_customer_orders.sql
select
    customer_id,
    count(*)            as order_count,
    sum(order_amount)   as lifetime_value
from {{ ref('stg_orders') }}      -- <-- depend on the model, not a raw table
group by customer_id

{{ ref('stg_orders') }} is the heart of dbt. Those {{ }} are Jinja (a templating language we'll cover fully in the Jinja & macros lesson) — for now just read ref('stg_orders') as "the table that the stg_orders model builds." When dbt compiles int_customer_orders, it replaces {{ ref('stg_orders') }} with the actual fully-qualified table name, e.g. analytics.stg_orders. The compiled SQL becomes:

create view analytics.int_customer_orders as (
    select customer_id, count(*) as order_count, sum(order_amount) as lifetime_value
    from analytics.stg_orders       -- ref() resolved to the real name
    group by customer_id
);

ref() does two jobs at once, and both matter:

It resolves the real physical name — so if you rename a model or point it at a different schema (dev vs prod), every downstream model follows automatically. No hard-coded names to update.
It declares a dependency — by reading ref('stg_orders') inside int_customer_orders, dbt learns that int_customer_orders depends on stg_orders, and therefore must be built after it.

The DAG: dependencies dbt infers for you

Collect every ref() across all your models and you get a graph: who depends on whom. That graph is a DAG — a Directed Acyclic Graph (directed = dependencies point one way; acyclic = no loops, since a model can't depend on itself). dbt builds this DAG automatically just by reading the ref() calls — you never write it by hand.

The DAG is what makes dbt powerful. From it, dbt knows:

Build order. When you run dbt run, dbt does a topological sort of the DAG and builds models in dependency order — stg_orders before int_customer_orders, always — and builds independent models in parallel for speed.
What's downstream. Change stg_orders and dbt can tell you (and rebuild) everything affected.
Lineage. The DAG is your data lineage — covered when we reach docs and the lineage graph.

This is the punchline of the whole tool: you write small SELECTs and connect them with ref(); dbt assembles them into a DAG and compiles it to ordered, plain SQL.

`source()`: naming your raw inputs

Models ref() other models, but the DAG has to start somewhere — at the raw tables that EL loaded into the warehouse. You don't ref() those; you declare them as sources in a YAML file and reference them with source():

# models/staging/_sources.yml
sources:
  - name: shopflow       # a logical group
    schema: raw          # the real schema in the warehouse (raw landing, Ch. 6)
    tables:
      - name: orders
      - name: customers
      - name: products
      - name: order_items

-- models/staging/stg_orders.sql  (the grown-up version)
select order_id, customer_id, order_ts, status, amount as order_amount
from {{ source('shopflow', 'orders') }}   -- not a hard-coded "raw.orders"

{{ source('shopflow', 'orders') }} compiles to the real raw table name (raw.orders). Declaring sources gives you the same two benefits as ref() — a single place to change physical names, and explicit DAG entry points — plus a hook for freshness checks (covered later). The rule of thumb: source() for raw data EL gave you; ref() for everything dbt builds.

The fact at the finest grain

ShopFlow's central mart, fact_sales, lives at the order-line-item grain — one row per product line within an order. It joins the staging models and computes the one measure everything downstream needs, line_revenue:

-- models/marts/fact_sales.sql
{{ config(materialized='table') }}

select
    oi.order_id,
    oi.product_id,
    o.customer_id,
    cast(o.order_ts as date)          as order_date,
    oi.quantity,
    oi.unit_price,
    oi.quantity * oi.unit_price       as line_revenue   -- the core measure
from {{ ref('stg_order_items') }} oi
join {{ ref('stg_orders') }} o on oi.order_id = o.order_id
where o.status <> 'cancelled'

Two ref()s — stg_order_items and stg_orders — so dbt knows fact_sales builds after both. line_revenue = quantity × unit_price is exactly the measure the Chapter 4 star schema defined; you'll see it pulled into a macro and a metric in later lessons.

The compiled-SQL mental model

Whenever dbt confuses you, run dbt compile (or look in the target/compiled/ folder) and read the compiled SQL. dbt takes your SELECT + Jinja and produces plain SQL with every ref() and source() replaced by real table names — exactly what the warehouse runs. There is no hidden magic underneath; it's SQL all the way down.

:::tip When in doubt, read the compiled SQL The biggest beginner mistake is treating dbt as a black box. It isn't — it's a SQL generator. If a model behaves unexpectedly, open its compiled version in target/compiled/ and read the literal SQL that ran. Ninety percent of dbt debugging is just reading compiled SQL. :::

Project structure: where things live

A dbt project is a folder of files with a predictable layout. You don't need to memorize it, just recognize the pieces:

my_project/
├── dbt_project.yml      # project config: name, model paths, defaults
├── models/              # your .sql models + .yml docs/tests
│   ├── staging/
│   └── marts/
├── tests/               # custom (singular) tests
├── macros/              # reusable Jinja functions
├── seeds/               # small CSVs loaded as tables
├── snapshots/           # SCD-2 history captures
└── target/              # generated artifacts (compiled SQL, etc.)

dbt_project.yml — the project's config file: its name, where models live, and default materializations.
profiles.yml — kept outside the project (it holds connection secrets). It tells dbt which warehouse to connect to and with what credentials. The same project can point at a dev warehouse or prod just by switching profiles — this is how dbt separates "what to build" (the versioned project) from "where to build it" (the local profile).
Adapters — dbt connects to each warehouse through an adapter (dbt-snowflake, dbt-bigquery, dbt-databricks, dbt-postgres). The adapter knows that warehouse's SQL dialect. Your models stay mostly portable; the adapter handles the dialect differences.

When NOT to use dbt

dbt is the default for warehouse SQL transformation — but its mental model tells you exactly where it stops being the right tool. dbt has no compute, only knows SQL, and runs as scheduled batch builds, so it's a poor fit when any of those assumptions break:

Non-SQL or row-by-row transforms. dbt compiles SELECTs; it can't express logic that isn't relational set-based work — calling an external API per row, heavy Python/ML feature engineering, image or document parsing. That belongs in Python/Spark, not a dbt model.
Streaming / real-time. dbt builds tables and views on a schedule (dbt run); it does not process events as they arrive. Sub-minute freshness is a streaming job's job (Chapter 9), not dbt's.
A general-purpose orchestrator. dbt sequences its own models via the DAG, but it doesn't ingest data, call APIs, or coordinate non-dbt steps. dbt run is usually one task inside an orchestrator like Airflow/Dagster (Chapter 8), not a replacement for it.
A handful of tables / low volume. For a few small tables transformed once, dbt's project scaffolding, profiles, and DAG are overhead you don't recoup. A couple of plain CREATE TABLE AS statements (or a scheduled SQL script) is simpler.

:::tip The honest decision rule Reach for dbt when you have many interdependent SQL transformations in a warehouse that benefit from versioning, testing, lineage, and automatic build-ordering. Skip it when the work isn't SQL, isn't in a warehouse, needs real-time freshness, or is small enough that the DAG buys you nothing. dbt owns the T in batch ELT — it is not an orchestrator, a stream processor, or a compute engine. :::

Why it matters

Strip away the reputation and dbt is one clean idea: a model is a SELECT; ref() and source() wire models into a DAG; dbt compiles that DAG to ordered plain SQL and runs it in your warehouse. dbt itself owns no compute — it's a SQL compiler and build tool. The DAG it infers from your ref() calls gives you free build-ordering, parallelism, downstream impact, and lineage. Hold the "it's just compiled SQL on a DAG" model and nothing dbt does will surprise you. The next question is what kind of object each model becomes in the warehouse — a view, a table, or an incrementally-updated table — which is exactly what materializations control.

Next: Materializations & incremental models →

What dbt is (and isn't)​

A model is a SELECT​

ref(): how models depend on each other​

The DAG: dependencies dbt infers for you​

source(): naming your raw inputs​

The fact at the finest grain​

The compiled-SQL mental model​

Project structure: where things live​

When NOT to use dbt​

Why it matters​