A complete path into data engineering. Start at "what even is a columnar file?" and end able to ingest, store, model, transform, orchestrate and serve data at scale — with the durable concepts foregrounded so the skills outlast any one tool. Every term is defined the first time it appears.
-- Transform raw events into a trusted, modeled table.
SELECT date_trunc('day', ts) AS day,
       count(*) AS orders
FROM bronze.events
WHERE event = 'order_placed'
GROUP BY 1;
# → Materialized gold.daily_orders · 1.2M rows scanned
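To try the query above without a warehouse, here is a minimal sketch using Python's built-in sqlite3 module. Several things are assumptions made for illustration: the four sample rows are hypothetical, the tables are named bronze_events and gold_daily_orders because an in-memory SQLite database has no bronze or gold schemas, and SQLite's date() stands in for date_trunc('day', ...). The table is dropped and rebuilt on every run, one simple way to make the step idempotent, so running it twice leaves the same result rather than duplicated rows.

```python
import sqlite3

# Hypothetical sample rows standing in for bronze.events (ts, event).
RAW_EVENTS = [
    ("2024-01-01 09:15:00", "order_placed"),
    ("2024-01-01 17:40:00", "order_placed"),
    ("2024-01-01 18:02:00", "page_view"),
    ("2024-01-02 08:05:00", "order_placed"),
]


def build_daily_orders(conn):
    """Rebuild the daily orders table from scratch and return its rows."""
    # Drop-and-recreate keeps the step idempotent: a rerun replaces the
    # table instead of appending to it.
    conn.execute("DROP TABLE IF EXISTS gold_daily_orders")
    conn.execute(
        """
        CREATE TABLE gold_daily_orders AS
        SELECT date(ts) AS day,      -- SQLite stand-in for date_trunc('day', ts)
               count(*) AS orders
        FROM bronze_events
        WHERE event = 'order_placed'
        GROUP BY date(ts)
        """
    )
    return conn.execute(
        "SELECT day, orders FROM gold_daily_orders ORDER BY day"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bronze_events (ts TEXT, event TEXT)")
conn.executemany("INSERT INTO bronze_events VALUES (?, ?)", RAW_EVENTS)

first = build_daily_orders(conn)
second = build_daily_orders(conn)  # rerun: same table, no duplicate rows
print(first)
```

With these sample rows, the page_view event is filtered out and each day collapses to a single row of order counts.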
Every tool — a warehouse, Spark, an orchestrator, a streaming bus — exists to serve one part of that journey. Master what job each tool does and the crowded vendor landscape collapses into a handful of roles you can hold in your head.
Engines get renamed and re-launched every year. But columnar storage, partitioning, idempotency, and dimensional modeling are decades-stable. This guide teaches the concept first, then shows it in today's tools.
Each one builds on the last — where data lives, how you query and model it, how you move and reshape it in batch and in real time, and how you keep it trustworthy and affordable. Master one topic per page and always know what comes next.
The whole guide fans out from that one question. Twenty minutes from now you’ll know exactly what columnar storage, partitioning, and the OLTP/OLAP split are — and why every later decision rests on them.