Skip to main content

Chapter 3 checkpoint

You can now write the analytic SQL that real data work runs on and read the engine plan beneath it. Recall the throughline, then prove it with the quiz.

The throughline

  • Analytic SQL beyond CRUD: window functions summarize without collapsing rows — ROW_NUMBER/RANK/DENSE_RANK (ties behave differently), LAG/LEAD for deltas, and framed SUM(... ) OVER (... ) for running totals. CTEs name pipeline steps (recursive CTEs walk hierarchies). GROUPING SETS/ROLLUP/CUBE produce multi-level subtotals in one pass.
  • Idempotent SQL: make pipelines safe to re-run — QUALIFY ROW_NUMBER() with deterministic ordering for one-row-per-key dedup, MERGE/upsert and partition delete-insert to avoid duplicates. UNNEST/lateral joins flatten JSON/arrays/structs.
  • The engine pipeline: parse → logical plan → optimize → physical plan → execute. Rule-based rewrites are always-safe (pushdown, pruning); cost-based optimization uses statistics to choose join order and algorithm. The wins all mean read less: predicate pushdown, projection pruning, partition pruning, dynamic filtering. Read EXPLAIN ANALYZE bottom-up; the estimated-vs-actual row gap reveals bad stats.
  • Join algorithms: hash (O(M+N), small side in memory) is the equi-join default; sort-merge handles huge/pre-sorted inputs; broadcast copies a small table to every worker to avoid a shuffle; nested-loop is the brute-force fallback. Cardinality estimates drive the choice and the join order.
  • Distributed execution: coordinator + workers, plan split into stages joined by shuffles/exchanges (the dominant cost). Separation of storage and compute is the modern trait. Skew (lopsided key) and spill (out-of-memory → disk) are the classic failure modes.
  • The engine map: lake engines (DuckDB, Trino, Spark SQL, Athena) bring compute to your open files; warehouses (Snowflake, BigQuery, Redshift) own storage+compute behind a SQL endpoint — same SQL, different compute model. DuckDB is the default local OLAP engine; ClickHouse owns real-time analytics.

Quiz

Required checkpoint

Chapter 3 — SQL & Query Engines

Pass to unlock the Next button below

You can now write the SQL that real pipelines run on and read the plan that explains why a query is slow. Chapter 4 takes the next step: not just querying data, but modeling it — designing the warehouse tables, facts, and dimensions that this SQL queries.

Next: Chapter 4: Data Modeling & Warehousing →