Become a data engineer
This guide takes you from "I can write code and some SQL, but I've never built a data pipeline" to being able to ingest, store, model, transform, orchestrate, and serve data at scale — reliably and cost-effectively. It is written linearly: read top to bottom and every term is defined the first time it appears. You do not need any prior data-platform background — chapter 1 starts at "what does a data engineer actually do?".
It is also useful to working engineers who want a sharp 2026 refresh: the decision rules, the durable-vs-dated framing, and the patterns that actually hold up in production.
The one idea this whole guide rests on
Data engineering is moving data from where it is produced to where it can be trusted and used — reliably, repeatably, and at scale.
Every tool you'll meet — a warehouse, a file format, Spark, an orchestrator, a streaming bus, a table format — exists to serve one part of that journey: getting raw data in, reshaping it into something trustworthy, and making it available to analysts, models, and applications. Learn the underlying job each tool does and the crowded vendor landscape collapses into a handful of roles you can hold in your head.
Durable vs dated — the rule that keeps this guide useful
The data ecosystem moves fast, but not all of it moves fast. We constantly separate two things:
- Durable — concepts that have been true for a decade and will stay true: columnar storage, partitioning, the difference between OLTP and OLAP, idempotent pipelines, dimensional modeling, exactly-once semantics. Most of your study time should go here.
- Dated — the specific vendor, this quarter's "new" engine, the exact pricing model, the current syntax. Useful to know, but it will change. We flag dated facts as dated so you never anchor your understanding on them.
:::tip How to read this guide When something is marked durable, internalize it. When something is marked dated (a product name, a price, a config flag), treat it as an example, not a law — verify it against current docs when you actually build. :::
How the guide is structured
The twelve chapters follow the real arc of a piece of data: where it lives, how you query and model it, how you move and reshape it in batch and in real time, and how you keep it trustworthy and affordable.
- Part A — Foundations & storage (Ch. 1–4): what the role is, how data is stored on disk, how you query it, and how you model it for analytics.
- Part B — Processing & pipelines (Ch. 5–8): process data in bulk with Spark, ingest it from sources, transform it with the modern data stack, and orchestrate the whole thing.
- Part C — Real-time, quality & scale (Ch. 9–12): handle streaming and real-time data, adopt open table formats and the lakehouse, keep data trustworthy and governed, and adapt the playbook from solo to enterprise.
Where this sits on the ladder
This guide assumes you can already write a small program and basic SQL. If you can't yet, start with Programming Basics. It cross-links the sibling Modern Cloud Engineer Guide (where data infrastructure runs) and the Modern AI Guide (where data feeds models) rather than duplicating them.
Ready? Start with Chapter 1: Data Engineering Foundations →