Chapter 3 · SQL & Query Engines
SQL is the lingua franca of data. You already met it in Chapter 1 as the language of the relational database; in Chapter 2 you saw the columnar files (Parquet, ORC) that analytic data physically lives in. This chapter is where the two meet: the SQL you write and the engine that turns it into work over those files.
Two skills separate a data engineer from someone who "knows some SQL":
- Analytic SQL beyond CRUD. Most courses stop at
SELECT … JOIN … GROUP BY. Real data work — and almost every data-engineering interview — runs on window functions, CTEs, and idempotent patterns (deduplication, MERGE/UPSERT) that those courses never teach. We fix that first. - Reading the engine, not just the language. SQL is declarative: you say what you want, and the engine decides how. When a query is slow, you cannot fix it without seeing the plan the engine chose. So we open the hood — parser, optimizer, join algorithms, distributed execution — and learn to read an
EXPLAINplan the way a mechanic reads an engine.
The durable idea
SQL declares what you want; the engine decides how. The language is one durable skill; reading and steering the engine's plan is the other — and together they are how you write queries that scale.
The SQL language (joins, windows, CTEs) and query-planning concepts (logical vs physical plans, join algorithms, predicate pushdown, shuffles) are durable — they have barely changed in decades and are nearly identical across every engine. The specific engines and dialects (DuckDB vs Trino vs Snowflake vs BigQuery) are dated detail layered on top. Learn the durable core once and you can read any engine's plan with a glossary in hand.
What you'll be able to do
- Write window functions (
ROW_NUMBER,RANK,LAG/LEAD, running totals) and recursive CTEs, and useGROUPING SETS/ROLLUP/CUBEfor multi-level aggregates. - Write idempotent SQL —
MERGE/upsert,QUALIFY ROW_NUMBER()deduplication, deterministic ordering — so a pipeline can re-run without creating duplicates. - Query semi-structured data (JSON/VARIANT, arrays, structs) with
UNNEST/lateral joins. - Explain how an engine goes parse → logical plan → optimize → physical plan → execute, and read an
EXPLAIN ANALYZEto find the slow step. - Name and choose between join algorithms — hash, sort-merge, broadcast, nested-loop — and explain why join order and cardinality decide query speed.
- Describe distributed (MPP) execution — coordinator + workers, stages, shuffles/exchanges, skew, and spill — and the separation of storage and compute that defines modern engines.
- Place today's engines on one map: lake query engines (Trino, Presto, DuckDB, Spark SQL, Athena) vs cloud warehouses (Snowflake, BigQuery, Redshift) — same SQL surface, different compute model.
Lessons
- Analytic SQL: windows, CTEs & grouping sets — the SQL that dominates real DE work, beyond
SELECT/JOIN/GROUP BY. - Idempotent SQL & semi-structured data — MERGE,
QUALIFYdeduplication, deterministic ordering, and querying JSON/arrays/structs. - Inside a query engine: parse, plan, optimize, execute — the four stages, cost- vs rule-based optimization, predicate pushdown, and reading
EXPLAIN. - Join algorithms & why join order matters — hash, sort-merge, broadcast, nested-loop, and the cardinality math behind plan choice.
- Distributed execution: stages, shuffles, skew & spill — how MPP engines split a query across workers, and what goes wrong at scale.
- The engine landscape: lake engines vs warehouses — DuckDB, Trino, Spark SQL, Athena, Snowflake, BigQuery, Redshift, ClickHouse — one SQL surface, many compute models.
- Chapter 3 checkpoint — lock it in with a quiz.
Start with Analytic SQL →