The engine landscape: lake engines vs warehouses
You now understand what every analytic engine does inside — parse, plan, optimize, join, and distribute the work. The good news is that those internals are nearly identical across products, which means the dozen engine names you'll hear — DuckDB, Trino, Spark SQL, Athena, Snowflake, BigQuery, Redshift, ClickHouse — are far less intimidating than they look. They mostly differ on one axis, and once you see it, the whole landscape collapses into a simple map you can reason about. This lesson draws that map.
The one durable distinction
A common beginner mistake is to treat warehouses and lake query engines as unrelated, mysterious technologies. They aren't. The durable distinction is:
Lake query engines vs cloud warehouses are the same SQL surface on a different compute-and-storage model.
Both speak almost-identical analytic SQL (everything in this chapter — windows, CTEs, joins, MERGE). Both run the parse→optimize→distribute pipeline you've learned. What differs is who owns the storage and how the compute is operated:
- A query-on-the-lake engine brings compute to your data. Your tables are open files (Parquet/Iceberg, Chapter 2) in your object storage, and the engine is a stateless compute layer you point at them. You own the data; the engine is interchangeable.
- A cloud data warehouse is an integrated, managed product that owns both storage and compute (in its own optimized internal format) and hands you a SQL endpoint. You load data into it; it manages everything (Chapter 4 goes deep on the warehouse).
This maps directly onto the storage/compute separation from the previous lesson: lake engines make that separation explicit and open (any engine, your files); modern warehouses have adopted the same separation internally but keep the storage proprietary and managed.
The lake query engines
All four bring compute to open files in object storage; they differ in scale and deployment.
- DuckDB — an in-process, single-machine OLAP engine: a library that runs inside your Python/CLI process, no server, no cluster. It reads Parquet/CSV/JSON and remote files directly and is astonishingly fast on a laptop for datasets up to tens of GB. By 2026 it has become the default engine for local development and embedded analytics — the "SQLite of analytics." If you skip DuckDB you miss the single most important shift in the recent landscape: a huge amount of data work no longer needs a cluster at all. Reach for it for local exploration, tests, notebooks, and small-to-medium pipelines.
- Trino (formerly PrestoSQL) / Presto — a distributed, coordinator-and-workers MPP engine (3.5) built to run interactive SQL across huge datasets in object storage, and to federate across many sources (join data in S3 with data in PostgreSQL in one query). It's the open-source workhorse for "query the lake at scale." (Presto and Trino are siblings from the same original project.)
- Spark SQL — the SQL interface to Apache Spark, the general distributed compute engine of Chapter 5. Same MPP model, but Spark also does non-SQL transformations, ML, and streaming, so Spark SQL is the choice when SQL is one part of a larger programmatic pipeline.
- Amazon Athena — a serverless managed front-end (built on Trino/Presto) that runs SQL over files in S3 with zero infrastructure and per-query, pay-per-byte-scanned billing. It's "Trino without operating Trino." The pricing model makes the column/partition pruning from 3.3 directly visible as dollars.
The cloud warehouses
Integrated, fully-managed products that own storage and compute and present a polished SQL endpoint. You'll see these as the destination in Chapter 4.
- Snowflake — pioneered the modern decoupled warehouse with independently-sized virtual warehouses (compute) over shared storage. Known for elastic scaling, near-zero administration, and the
VARIANTsemi-structured type from 3.2. - Google BigQuery — fully serverless: no clusters to size at all; you submit SQL and Google allocates compute behind the scenes, billing by bytes scanned (or slots). Deep partition/clustering support and native nested/
STRUCTdata. - Amazon Redshift — AWS's warehouse; the older clustered model plus newer serverless and lake-querying (Spectrum) options, increasingly converging on the same decoupled architecture as the others.
The lines blur on purpose: warehouses now read open lake formats (Iceberg), and lake engines now offer managed, warehouse-like experiences. The axis remains the right way to reason, even as products straddle it — it tells you who owns the data and who operates the compute.
ClickHouse: the third shape
ClickHouse doesn't fit either box neatly. It's a column-oriented database built for blistering real-time analytical queries — sub-second aggregations over billions of rows — typically on data loaded into its own highly-tuned format. Think dashboards, observability, product/user-facing analytics, and anything where queries must return in milliseconds while thousands of users hit it at once. It's the engine to name when the requirement is interactive, high-concurrency, real-time OLAP rather than batch lake queries or warehouse-style modeling. Where a lake engine optimizes for "query files I already have" and a warehouse for "managed central analytics," ClickHouse optimizes ruthlessly for latency and concurrency on freshly-arriving data. Four pieces of its architecture explain that speed, and they're worth knowing because they're why you'd pick it.
The MergeTree storage engine — its heart. ClickHouse stores each table as many small, immutable, sorted data parts on disk, columnar and compressed. A table's ORDER BY clause defines the sort order of every part. Inserts don't update rows in place; they write a new part, and a background process continuously merges small parts into bigger sorted ones (hence MergeTree). This LSM-tree-like design — write new immutable chunks, compact them later — is exactly the append-only, idempotency-friendly pattern from Chapter 1, and it's why ingest is so fast: a write is just appending a sorted file, never a random in-place mutation.
The sparse primary-key index. This is ClickHouse's signature trick and a real contrast with OLTP databases. A transactional database keeps a dense index — one entry per row — so it can jump to any single row instantly (great for OLTP point lookups). ClickHouse instead keeps a sparse index: one entry roughly every 8,192 rows (a granule), storing just the primary-key value at each granule boundary. The index is therefore tiny enough to live in memory even for a billion-row table. To answer a filtered query, ClickHouse binary-searches this sparse index to find which granules could contain matching rows, then reads only those granules' columns — skipping the vast majority of the table. It trades the ability to fetch one exact row cheaply (an OLTP need it doesn't care about) for the ability to scan-and-skip enormous columnar tables at OLAP speed. This is the same data-skipping idea as Parquet row-group pruning, pushed to the extreme and baked into the storage engine.
Async inserts. Because each insert creates a part and many tiny parts hurt (more merging, more files), ClickHouse handles high-rate ingestion — think a flood of events from many producers — with asynchronous inserts: incoming rows are buffered server-side and flushed as one larger part, instead of one part per tiny request. That lets it absorb thousands of small concurrent inserts per second (typical of streaming/observability sources) without drowning in micro-parts, bridging ClickHouse toward the streaming world.
Materialized views. In ClickHouse a materialized view is not a periodically-refreshed cached query (the warehouse meaning) — it's an insert-time trigger. When rows land in a source table, the view's query runs on that incoming block and writes the result into a destination table, continuously. Pointed at an aggregating MergeTree engine, this maintains pre-aggregated rollups (e.g. per-minute counts) as data arrives, so a dashboard reads an already-summarized table instead of scanning raw events each time. It's how ClickHouse serves real-time aggregate dashboards at sub-second latency under heavy concurrency — the heavy lifting happens once, on write.
When to choose it. Reach for ClickHouse when you need real-time, high-concurrency analytical queries on a continuous stream of fresh data — user-facing product analytics, observability/metrics/logs at scale, live operational dashboards — and sub-second latency under many simultaneous users is a hard requirement. It is not the tool for ad-hoc querying of arbitrary lake files (use Trino/DuckDB) or for being the central modeled warehouse the whole company joins against (use a cloud warehouse); its sweet spot is a purpose-built, fast-ingest, fast-read serving layer for analytics.
A decision rule
You'll choose engines for the rest of your career; a durable starting rule:
- Local dev, tests, notebooks, data up to ~tens of GB → DuckDB. No cluster, runs in-process, fast.
- Interactive SQL over a big data lake / federating sources → Trino (self-operated) or Athena (serverless on S3).
- SQL as part of a larger programmatic batch/ML pipeline → Spark SQL.
- A managed, low-ops central warehouse for analysts and BI → a cloud warehouse (Snowflake / BigQuery / Redshift).
- Real-time, high-concurrency, sub-second dashboards → ClickHouse.
And the meta-rule that outlives every product: the SQL you write is portable; the compute model is the decision. Because the language is durable, you can prototype a query in DuckDB on your laptop and run essentially the same SQL on Trino or Snowflake at scale.
:::note Keeping SQL itself clean and portable
Across all these engines, sqlfluff is the standard open-source SQL linter and auto-formatter — it enforces a consistent style and catches mistakes, and it's dialect-aware (DuckDB, Snowflake, BigQuery, Trino, …). Wiring it into CI keeps a team's SQL readable and consistent regardless of which engine runs it, mirroring the linting discipline you'd apply to any codebase.
:::
Why it matters
The engine landscape looks crowded but collapses onto one durable axis: lake query engines (DuckDB, Trino, Spark SQL, Athena) bring interchangeable compute to your open files, while cloud warehouses (Snowflake, BigQuery, Redshift) are integrated managed products that own storage and compute behind a SQL endpoint — same SQL surface, different compute model. DuckDB has made local, cluster-free analytics the default for development; ClickHouse owns real-time OLAP. Because the SQL and the internals you learned this chapter are shared across all of them, you can read any engine's plan and move queries between them. Choosing an engine is then not about syntax — it's about scale, who operates the compute, and who owns the data.
Common pitfalls
- Treating warehouses and lake engines as unrelated magic. They're the same SQL on different compute/storage ownership. The axis is the whole point.
- Defaulting to a cluster for small data. Reaching for Spark/Trino on a few GB when DuckDB on a laptop would be faster and simpler. Match the engine to the data size.
- Ignoring the billing model. On Athena/BigQuery you pay per byte scanned, so
SELECT *and missing partition filters are literally money. The pruning skills from 3.3 are cost control. - Assuming SQL is 100% portable. The core is portable, but dialects differ at the edges (
QUALIFY, JSON paths,MERGEsyntax, functions). Lint withsqlfluffand test on the target engine.
Cross-links
- Builds on: Distributed execution (storage/compute separation), Storage & file formats (the open files lake engines read).
- Leads to: Data modeling & warehousing (the warehouse in depth), Batch processing & Spark (Spark SQL's engine), and the lakehouse (where lake engines and warehouses converge).
- Next: Chapter 3 checkpoint →.
Checkpoint
- In one sentence, what is the durable difference between a query-on-the-lake engine and a cloud warehouse?
- For exploring a 4 GB Parquet file on your laptop with no cluster, which engine is the 2026 default, and what makes it different from Trino?
- You're billed per byte scanned on Athena. Name two SQL/engine techniques from this chapter that directly cut that bill.
Answers
- They expose essentially the same analytic SQL, but a lake engine brings interchangeable compute to your open files in your object storage (you own the data), while a warehouse is an integrated managed product that owns both storage and compute behind a SQL endpoint.
- DuckDB — it's in-process and single-machine (a library running inside your process, no server/cluster), whereas Trino is a distributed coordinator-and-workers engine for large-scale querying.
- Any two of: projection/column pruning (
SELECTonly needed columns, not*), partition pruning (filter on the partition column so whole files are skipped), predicate pushdown (filters reach the scan), and using columnar formats so less data is read.