Skip to main content

The cloud data warehouse

You already know two kinds of system from Chapter 1: an OLTP database (the operational database behind an app — many small, fast reads and writes) and an OLAP system (built for big analytic queries over many rows). A data warehouse is the classic OLAP system: a database whose entire design is bent toward answering analytic questions — "what was revenue by region last quarter?" — over large, integrated, historical data.

This lesson explains what makes a warehouse a warehouse, and then opens the hood on the cloud warehouse specifically, because the internals are exactly the things that determine your bill and your query speed. If you only learn which buttons to click, you'll be lost when the product changes. If you learn why the buttons exist, every warehouse looks the same.

A warehouse is not a bigger app database

It is tempting to think a warehouse is just a large PostgreSQL. It isn't — it's optimized for the opposite workload.

Operational database (OLTP)Data warehouse (OLAP)
Typical query"Get order #4471 for this user""Sum revenue by region for 3 years"
Rows touchedA handfulMillions to billions
WritesConstant, tiny, transactionalBulk loads, append-mostly
Storage layoutRow-oriented (whole rows together)Columnar (each column stored separately)
GoalLow-latency single-record correctnessHigh-throughput scans and aggregation

The single most important physical difference is columnar storage (introduced in Chapter 2): the warehouse stores each column's values together rather than each row's values together. When a query asks for SUM(revenue) GROUP BY region, the engine reads only the revenue and region columns and skips everything else. For analytic queries that touch a few columns of very many rows, this is a massive saving — and it's why you cannot just point your app database at a dashboard and expect it to scale.

The second difference is MPPMassively Parallel Processing. A warehouse splits a query across many machines (or many compute units), each scanning a slice of the data in parallel, then combines the partial results. Columnar layout + MPP is the engine of every modern analytic warehouse.

The cloud shift: separating storage from compute

For decades, warehouses (Teradata, on-prem Oracle, early Redshift) tied storage and compute together: the same machines that held the disks also ran the queries. That coupling had two painful consequences — to store more data you had to buy more compute, and everyone's queries fought over the same fixed cluster.

The defining innovation of the cloud warehouse is separation of storage and compute: the data lives in cheap, effectively-infinite object storage (Chapter 2's S3/GCS/Azure Blob), and independent, resizable compute clusters read from it on demand.

Yourtables\n(columnarmicro-partitions)Compute clusterA\n(BI dashboards)Compute clusterB\n(data loading)Compute clusterC\n(data science)

Why this matters, concretely:

  • Scale each independently. Hold a petabyte but spin up a tiny cluster for a small query; hold a gigabyte but throw huge compute at one heavy job. Storage cost and compute cost are now separate line items.
  • Isolate workloads. The team running heavy data-science queries gets its own compute cluster and can't slow down the executives' dashboards, even though both read the same tables. No more "who is hammering the warehouse?"
  • Pay for what you run. Compute can pause when idle. You pay storage continuously (it's cheap) and compute only while a cluster is awake.

Every major cloud warehouse is built on this model; they just name the compute differently:

WarehouseName for a compute clusterNote
SnowflakeVirtual warehouseT-shirt sizes (XS, S, M…); auto-suspend/resume
Google BigQuerySlotsServerless by default; slots are units of compute
Amazon RedshiftCluster / RA3 nodes, Redshift ServerlessRA3 separated storage from compute; Serverless removes cluster management
Databricks SQLSQL warehouseCompute over lakehouse tables (Chapter 10)

:::note Durable vs dated The principle — separate storage from compute so you can scale and isolate each independently — is durable and now universal. The names ("virtual warehouse," "slots," "SQL warehouse") and the exact sizing knobs are dated product detail. Learn the principle; the names are a lookup. :::

Micro-partitions and pruning: how a scan skips data it doesn't need

A warehouse holding a trillion rows can't physically read all of them for every query. The trick is to read as little as possible. Modern warehouses store a table as a large number of micro-partitions — small, immutable, columnar chunks of the table (in Snowflake, automatically created blocks of a few megabytes; BigQuery and Redshift have analogous block structures). Crucially, the warehouse keeps metadata about each chunk: for every column, the minimum and maximum value it contains.

That metadata enables pruning (also called partition elimination): before reading data, the engine checks the per-chunk min/max and skips any chunk that cannot possibly contain matching rows.

Trace it. Your orders table has 100,000 micro-partitions, and you run:

SELECT SUM(amount) FROM orders WHERE order_date = '2026-06-23';

The engine looks at each micro-partition's recorded min/max order_date. A chunk whose range is 2024-01-01 … 2024-03-31 cannot contain June 2026 rows, so it is skipped without being read. If your data happens to be organized so June-2026 rows cluster into a handful of chunks, the engine reads ~10 chunks instead of 100,000 — a 10,000× reduction in I/O. Pruning is the number-one reason one query costs pennies and a logically identical one costs dollars.

Clustering and sort keys: helping the pruner help you

Pruning only works well if related rows are physically near each other. If your June-2026 orders are scattered evenly across all 100,000 chunks, every chunk's min/max range spans all dates, nothing can be skipped, and you scan the whole table.

The levers that control physical ordering:

  • Snowflake — clustering keys. You declare a clustering key (e.g. order_date) and Snowflake works to keep rows with similar keys in the same micro-partitions, so date-filtered queries prune hard.
  • Amazon Redshift — sort keys (and distribution keys). A sort key physically orders rows on disk (so range filters prune); a distribution key controls how rows are spread across compute nodes (so joins/aggregations avoid moving data between machines — a shuffle, from Chapter 3).
  • BigQuery — partitioning + clustering. You partition a table (commonly by date) and optionally cluster within partitions; queries that filter on the partition column scan only matching partitions.

:::tip The cost-and-speed mental model Most "why is this query slow / expensive?" answers reduce to one question: how much data did the engine actually read? Filters on a clustered/partitioned/sorted column → heavy pruning → little data read → fast and cheap. Filters on an unsorted column, or SELECT *, or no filter at all → no pruning → full scan → slow and expensive. Tuning a warehouse is mostly arranging data so pruning can happen. :::

Caching: the cheapest query is the one you don't run

Cloud warehouses cache aggressively, and understanding the layers prevents both confusion and waste:

  • Result cache. If you run the exact same query and the underlying data hasn't changed, the warehouse returns the stored result instantly — often for zero compute cost. (This is why re-running a dashboard query is instant the second time, and why benchmarking the same query twice is misleading.)
  • Local/warehouse cache. Compute clusters keep recently read micro-partitions in fast local storage, so a follow-up query over the same data avoids re-fetching from object storage.

Materialized views: precompute the expensive part

A regular view is just a saved query — it re-runs every time you select from it. A materialized view (MV) stores the computed result and keeps it incrementally up to date as base data changes. If a heavy aggregation (say, daily revenue by region) is read constantly, materializing it means the expensive computation happens once on write instead of on every read.

The tradeoff: an MV trades storage and refresh cost for read speed. Use it when an expensive result is read far more often than the base data changes; skip it when data churns constantly or the query is rare.

Why it matters

A data warehouse is an OLAP system built for large analytic queries, and its power comes from columnar storage + MPP: read only the columns you need, in parallel across many machines. The cloud's defining move was separating storage from compute so each scales and is billed independently and workloads can be isolated — the thing Snowflake calls a virtual warehouse, BigQuery calls slots, Redshift a cluster, Databricks a SQL warehouse. Underneath, tables are micro-partitions with min/max metadata, and pruning skips chunks that can't match — which is why clustering keys / sort keys / partitioning (arranging data so filters prune) is the single biggest cost-and-speed lever, helped by result caching and materialized views. Internalize one question — how much data did this query actually read? — and warehouse performance stops being mysterious.

Now that you understand the box, the next lesson is how to shape the data inside it so the questions are easy to ask: dimensional modeling.

Next: Dimensional modeling →