2026 Edition · From your first pipeline to a lakehouse

How data is
actually moved.

A complete path into data engineering. Start at "what even is a columnar file?" and end able to ingest, store, model, transform, orchestrate and serve data at scale — with the durable concepts foregrounded so the skills outlast any one tool. Every term is defined the first time it appears.

Start from the Introduction →

Browse all 12 chapters

12 chapters

Batch · Stream · Lakehouse: the whole pipeline

Last reviewed: Jun ’26

Free · source-available

orders.sql

-- Transform raw events into a trusted, modeled table.
SELECT date_trunc('day', ts) AS day,
       count(*) AS orders
FROM bronze.events
WHERE event = 'order_placed'
GROUP BY 1;

# → Materialized gold.daily_orders · 1.2M rows scanned

declarative · versioned · tested like any code
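To try the query above without a warehouse, here is a minimal sketch using Python's built-in sqlite3 module. It rests on several assumptions that are not part of the guide: the four sample events are made up, SQLite has no bronze/gold schemas so the tables are named bronze_events and gold_daily_orders, and date(ts) stands in for date_trunc('day', ts), which SQLite does not provide. The helper name build_daily_orders is hypothetical.

```python
import sqlite3

# Hypothetical sample rows standing in for bronze.events (not from the guide).
RAW_EVENTS = [
    ("2026-06-01 09:15:00", "order_placed"),
    ("2026-06-01 17:40:00", "order_placed"),
    ("2026-06-01 18:02:00", "page_view"),
    ("2026-06-02 08:30:00", "order_placed"),
]


def build_daily_orders(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Rebuild gold_daily_orders from bronze_events and return its rows."""
    # Drop and recreate so that re-running the build gives the same table.
    conn.execute("DROP TABLE IF EXISTS gold_daily_orders")
    conn.execute(
        """
        CREATE TABLE gold_daily_orders AS
        SELECT date(ts) AS day,      -- assumption: replaces date_trunc('day', ts)
               count(*) AS orders
        FROM bronze_events
        WHERE event = 'order_placed'
        GROUP BY date(ts)
        """
    )
    return conn.execute(
        "SELECT day, orders FROM gold_daily_orders ORDER BY day"
    ).fetchall()


# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bronze_events (ts TEXT, event TEXT)")
conn.executemany("INSERT INTO bronze_events VALUES (?, ?)", RAW_EVENTS)

daily_orders = build_daily_orders(conn)
print(daily_orders)
```

With the sample rows above, the page_view event is filtered out and the remaining three orders collapse into one row per day. Because the table is dropped and rebuilt each time, calling build_daily_orders again leaves the result unchanged.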

Start here

Two ground-truth facts before chapter 1.

Data engineering is plumbing — move data from where it's made to where it's trusted.

Every tool — a warehouse, Spark, an orchestrator, a streaming bus — exists to serve one part of that journey. Master what job each tool does and the crowded vendor landscape collapses into a handful of roles you can hold in your head.

Learn the durable concept, not the dated vendor.

Engines get renamed and relaunched every year, but columnar storage, partitioning, idempotency, and dimensional modeling have stayed stable for decades. This guide teaches the concept first, then shows it in today's tools.

The full path

Twelve chapters, read in order.

Each one builds on the last — where data lives, how you query and model it, how you move and reshape it in batch and in real time, and how you keep it trustworthy and affordable. Master one topic per page and always know what comes next.

01 Data Engineering Foundations: The role, the lifecycle, OLTP vs OLAP
02 Storage & File Formats: Object stores, columnar, Parquet, partitioning
03 SQL & Query Engines: Analytic SQL, planners, Trino & DuckDB
04 Data Modeling & Warehousing: Star schemas, facts & dimensions, SCDs
05 Batch Processing & Spark: Distributed jobs, DataFrames, the shuffle
06 Ingestion & Integration: CDC, incremental loads, idempotency
07 Transformation & the Modern Data Stack: ELT, dbt, bronze/silver/gold
08 Orchestration & Pipelines: DAGs, Airflow/Dagster, retries & backfills
09 Streaming & Real-Time: Kafka, Flink, windowing, exactly-once
10 Table Formats & the Lakehouse: Iceberg, Delta, ACID on object storage
11 Quality, Governance & Observability: Tests, contracts, lineage, freshness
12 Scale, Decisions & Career: Solo → enterprise, and where you fit

Who it’s for

Meets you wherever you are.

“I can write code and SQL…”

…but I’ve never built a data pipeline, and words like "lakehouse" and "CDC" are a blur. Start at file formats — no prior data-platform experience assumed.

“I run production pipelines.”

…and want a sharp 2026 refresh on the lakehouse, streaming, orchestration, data quality and the decision rules that actually hold up.

Ready?

What happens between a raw event and a trusted table?

The whole guide fans out from that one question. Twenty minutes from now you’ll know exactly what columnar storage, partitioning, and the OLTP/OLAP split are — and why every later decision rests on them.

Start from the Introduction →