Skip to main content

Materializations & incremental models

In the last lesson every model became a view by default. That's only one of several options. A materialization is dbt's word for how a model gets persisted in the warehouse — and choosing the right one is one of the highest-leverage decisions in a dbt project, because it directly controls build time and cost. This lesson covers the four materializations and then goes deep on the one that trips up almost everyone: incremental.

What a materialization is

Recall from the mental-model lesson that dbt wraps your SELECT in a CREATE statement. A materialization decides which CREATE statement — and therefore what physical object exists in the warehouse and how it's refreshed on each dbt run. You set it with a config block at the top of a model (or as a default in dbt_project.yml):

{{ config(materialized='table') }}

select ...

There are four built-in materializations. Three are simple; the fourth (incremental) needs the rest of this lesson.

The three simple materializations

view (the default)

dbt creates a view — a saved query that stores no data; it re-runs against the underlying tables every time someone queries it.

  • Build cost: ~zero. dbt run just (re)defines the view; no data is computed at build time.
  • Query cost: paid every time the view is queried, because the logic re-executes.
  • Freshness: always live — a view reflects the current state of its inputs.
  • Use it for: lightweight transformations, staging models, anything queried infrequently.

table

dbt drops and fully recreates a physical table on every dbt run, executing your SELECT and storing the result.

  • Build cost: the full query runs at build time, every run.
  • Query cost: cheap and fast — it's a real table; no logic re-runs.
  • Freshness: as fresh as your last dbt run.
  • Use it for: models that are expensive to compute but queried often (e.g. dashboard-facing marts) — pay once at build, query cheaply many times.

The view-vs-table choice is a classic compute trade-off: view shifts cost to query time; table shifts cost to build time. Query something rarely → view. Query something constantly → table.

ephemeral

An ephemeral model is never built as its own object. Instead, dbt inlines its SQL as a CTE (Common Table Expression — a WITH ... AS (...) subquery) directly into any model that ref()s it.

  • Build cost: none — nothing is created in the warehouse.
  • Use it for: small, reusable bits of logic you want to keep DRY but don't need to expose as their own table or view. The downside: it can't be queried or tested on its own (it has no physical existence), and over-use makes compiled SQL hard to read.
view\nstores:nothing\nbuild:cheap · query: paystable\nstores: fulldata\nbuild: fulleach run · query:ephemeral\nstores:nothing\ninlined asa CTE into refsincremental\nstores:full data\nbuild:only NEW/changed

The problem incremental solves

Now the hard one. Take ShopFlow's fact_sales (ShopFlow — see Meet ShopFlow) once the store has been running for years: two billion order line items, growing by 50 million a day as new orders come in. Materialize it as a table and every single dbt run drops and rebuilds all two billion rows — scanning every order line ever placed, every time, to add one day's sales. That's absurdly slow and expensive. You don't want to rebuild history; you want to process only the new rows and append/merge them into the existing table.

That's exactly what the incremental materialization does: on the first run it builds the whole table, and on every run after that it processes only the new or changed rows and adds them to what's already there.

How incremental works: is_incremental()

An incremental model has to behave differently on its first run (build everything) versus later runs (only the new rows). dbt gives you a flag for this: the is_incremental() macro. It returns:

  • false on the first run, or when you do a --full-refresh (rebuild from scratch), or if the table doesn't exist yet.
  • true on subsequent runs, when the table already exists and you're adding to it.

You use it to filter the SELECT down to just new rows on incremental runs. Here's the canonical pattern on ShopFlow's fact_sales — read it slowly:

{{ config(
materialized='incremental',
unique_key=['order_id', 'product_id'] -- the order-line grain
) }}

select
oi.order_id,
oi.product_id,
o.customer_id,
o.order_ts,
oi.quantity * oi.unit_price as line_revenue
from {{ ref('stg_order_items') }} oi
join {{ ref('stg_orders') }} o on oi.order_id = o.order_id

{% if is_incremental() %}
-- This block ONLY appears on incremental runs.
-- Only pull order lines newer than the newest order we already have.
where o.order_ts > (select max(order_ts) from {{ this }})
{% endif %}

Two new pieces:

  • {{ this }} — Jinja for "this model's own table," i.e. the table being built. select max(order_ts) from {{ this }} asks the already-built fact_sales for its high-water mark.
  • The {% if is_incremental() %} block — on the first run, is_incremental() is false, the block vanishes, and the SELECT reads all of ShopFlow's order lines to build the full table. On every later run it's true, so dbt adds where order_ts > (the max we already loaded), scanning only the new slice.

Let's trace it:

Runis_incremental()SQL that runsRows processed
1st (or --full-refresh)falsefull SELECT, no whereall 2 billion lines
2nd (next day)true... where order_ts > max(order_ts)~50 million new
3rd (next day)truesame filter, new max~50 million new

The filter that defines "new rows" — here order_ts > max(...) — is yours to write, and getting it right is the crux of incremental modeling. A column you filter on like this is called a watermark (the same idea you met in Chapter 6 ingestion), and order_ts is exactly the watermark ShopFlow uses.

Incremental strategies: how new rows get combined in

Filtering to new rows is half the job. The other half is: once you have those new rows, how do you put them into the existing table? That choice is the incremental strategy, and it's where merge-vs-overwrite confusion lives. There are four common strategies (availability varies by adapter):

append

Just INSERT the new rows. Fastest, simplest. Use when rows are immutable and never duplicated — pure event logs where a row, once written, never changes. Danger: if a row you already loaded shows up again, you get a duplicateappend does no deduplication.

merge (the most common)

Run a SQL MERGE: for each new row, if a row with the same unique_key already exists, update it; otherwise insert it. This is "upsert" (update-or-insert). Use when rows can change after first being written — a ShopFlow order whose status goes placed → paid → shipped, which must update the existing stg_orders row, not duplicate it. The unique_key config tells dbt how to match a new row to an existing one.

-- models/staging/stg_orders.sql (as an incremental upsert on the order header)
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='order_id') }}

On an incremental run, dbt loads the new rows into a temp table, then issues roughly:

merge into analytics.stg_orders as target
using new_rows as source
on target.order_id = source.order_id -- match on unique_key
when matched then update set ... -- order existed -> update its status
when not matched then insert ...; -- new order -> insert it

delete+insert

DELETE the rows that match your new batch's keys, then INSERT the new batch. Achieves an upsert-like result on warehouses without an efficient MERGE. Use when merge isn't well-supported or you want to fully replace a set of keys.

insert_overwrite

Replace data one partition at a time. The model is partitioned (e.g. by day); on each run dbt computes the affected partitions and overwrites those whole partitions, leaving the rest untouched. Use when you reprocess data in chunks — "rebuild today's and yesterday's partitions" — common on partitioned warehouses like BigQuery. It's idempotent at the partition grain: re-running for the same day yields the same result, no duplicates.

New/changedrows\n(afteris_incrementalIncremental\nstrategy?INSERTonly\n(immutableevents;\nrisk:upsert onunique_key\n(rowscan change)delete matchingkeys,\nthen insertreplacewhole\npartitions(by day, etc.)appendmergedelete+insertinsert_overwrite

:::tip Choosing a strategy Immutable event logs → append (fastest, but watch for duplicates). Mutable records with a stable key → merge (the safe default; needs unique_key). Partitioned reprocessing → insert_overwrite (idempotent per partition). delete+insert when merge isn't available. When unsure, merge with a correct unique_key is the forgiving choice — it won't duplicate. :::

The honest trade-off: incremental is faster but riskier

Incremental models are the biggest performance win in dbt and the biggest source of subtle bugs. The risks are real:

  • A wrong watermark filter silently loses data. If a late-arriving ShopFlow order has an order_ts older than your max, where order_ts > max(...) skips it forever. (Fixes: a lookback window like > max(...) - interval '3 days', or a merge on the key.)
  • append with re-delivered rows duplicates data. Use merge if rows can repeat.
  • Incremental tables drift from a full rebuild over time. Schedule a periodic --full-refresh (rebuild from scratch) to re-sync, and always run --full-refresh after changing the model's columns — an incremental run won't add a new column to an existing table.

The mental model: first run builds everything; later runs filter to new rows with is_incremental() + a watermark, then combine them in with your chosen strategy. Get the filter and the strategy right and you can transform billions of rows in seconds a day.

Why it matters

A materialization decides what each model becomes: a view (no stored data, pays at query time), a table (fully rebuilt each run, cheap to query), ephemeral (inlined as a CTE, never built), or incremental (build once, then process only new/changed rows). Incremental is the key to scale: is_incremental() filters to new rows on later runs, and the strategyappend, merge, delete+insert, or insert_overwrite — decides how those rows merge into the existing table. It's the most powerful and most error-prone dbt feature, so reason about the watermark and strategy deliberately. Next we'll see the Jinja machinery ({% if %}, {{ this }}) that made incremental possible — and how to test your models so a bad incremental load can't ship silently.

Next: Jinja, macros, packages & testing →