Chapter 3 · SQL & Query Engines

SQL is the lingua franca of data. You already met it in Chapter 1 as the language of the relational database; in Chapter 2 you saw the columnar files (Parquet, ORC) that analytic data physically lives in. This chapter is where the two meet: the SQL you write and the engine that turns it into work over those files.

Two skills separate a data engineer from someone who "knows some SQL":

Analytic SQL beyond CRUD. Most courses stop at SELECT … JOIN … GROUP BY. Real data work — and almost every data-engineering interview — runs on window functions, CTEs, and idempotent patterns (deduplication, MERGE/UPSERT) that those courses never teach. We fix that first.
Reading the engine, not just the language. SQL is declarative: you say what you want, and the engine decides how. When a query is slow, you cannot fix it without seeing the plan the engine chose. So we open the hood — parser, optimizer, join algorithms, distributed execution — and learn to read an EXPLAIN plan the way a mechanic reads an engine.

The durable idea

SQL declares what you want; the engine decides how. The language is one durable skill; reading and steering the engine's plan is the other — and together they are how you write queries that scale.

The SQL language (joins, windows, CTEs) and query-planning concepts (logical vs physical plans, join algorithms, predicate pushdown, shuffles) are durable — they have barely changed in decades and are nearly identical across every engine. The specific engines and dialects (DuckDB vs Trino vs Snowflake vs BigQuery) are dated detail layered on top. Learn the durable core once and you can read any engine's plan with a glossary in hand.

What you'll be able to do

Write window functions (ROW_NUMBER, RANK, LAG/LEAD, running totals) and recursive CTEs, and use GROUPING SETS/ROLLUP/CUBE for multi-level aggregates.
Write idempotent SQL — MERGE/upsert, QUALIFY ROW_NUMBER() deduplication, deterministic ordering — so a pipeline can re-run without creating duplicates.
Query semi-structured data (JSON/VARIANT, arrays, structs) with UNNEST/lateral joins.
Explain how an engine goes parse → logical plan → optimize → physical plan → execute, and read an EXPLAIN ANALYZE to find the slow step.
Name and choose between join algorithms — hash, sort-merge, broadcast, nested-loop — and explain why join order and cardinality decide query speed.
Describe distributed (MPP) execution — coordinator + workers, stages, shuffles/exchanges, skew, and spill — and the separation of storage and compute that defines modern engines.
Place today's engines on one map: lake query engines (Trino, Presto, DuckDB, Spark SQL, Athena) vs cloud warehouses (Snowflake, BigQuery, Redshift) — same SQL surface, different compute model.

Lessons

Analytic SQL: windows, CTEs & grouping sets — the SQL that dominates real DE work, beyond SELECT/JOIN/GROUP BY.
Idempotent SQL & semi-structured data — MERGE, QUALIFY deduplication, deterministic ordering, and querying JSON/arrays/structs.
Inside a query engine: parse, plan, optimize, execute — the four stages, cost- vs rule-based optimization, predicate pushdown, and reading EXPLAIN.
Join algorithms & why join order matters — hash, sort-merge, broadcast, nested-loop, and the cardinality math behind plan choice.
Distributed execution: stages, shuffles, skew & spill — how MPP engines split a query across workers, and what goes wrong at scale.
The engine landscape: lake engines vs warehouses — DuckDB, Trino, Spark SQL, Athena, Snowflake, BigQuery, Redshift, ClickHouse — one SQL surface, many compute models.
Chapter 3 checkpoint — lock it in with a quiz.

Start with Analytic SQL →

The durable idea​

What you'll be able to do​

Lessons​

The durable idea

What you'll be able to do

Lessons