Chapter 1 checkpoint
You've built the mental model the rest of the guide stands on. Before moving on to storage and file formats, lock it in. Recall the throughline first, then take the quiz.
The throughline
- Data engineering is reliable plumbing — it turns raw, scattered source data into trustworthy data products. The data engineer owns the pipelines and platform; the platform engineer runs the infra below, and the analytics engineer, analyst, and data scientist all use the data above. The two non-negotiable languages are SQL and Python, and a pipeline is a DAG of idempotent, observable, testable steps — not a pile of scripts.
- The lifecycle — generation → ingestion → storage → transformation → serving — is the backbone, with four undercurrents (security, data management, orchestration, cost) cutting through every stage.
- OLTP vs OLAP is the deepest split: operational systems do many small fast transactions, stored row-by-row; analytical systems do huge scans of few columns, stored column-by-column. The two are physically different storage designs tuned for opposite access patterns, which is why you never run analytics on the prod database and why columnar formats and warehouses exist.
- Relational fundamentals: keys link tables, normalization (1NF–3NF) stores each fact once (right for OLTP writes), denormalization duplicates for fast OLAP reads, ACID guarantees transactions, and indexes speed selective lookups on row stores.
- Batch vs streaming is a latency/throughput/cost spectrum (default to batch); ETL vs ELT is a cost-driven architectural choice (cheap elastic cloud compute pushed the T into the warehouse → ELT won).
- Failure is normal: partitioning, replication, CAP/PACELC, and consensus describe distributed reality. Idempotency (every step is safe to re-run) is the spine, and at-least-once delivery plus idempotent processing is how you get an exactly-once effect.
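The last point is the easiest one to see in code. Below is a minimal sketch, not a production pattern: the event shape `(event_id, account, amount)`, the function names, and the in-memory `seen` set are all hypothetical stand-ins chosen for illustration. It shows the same duplicated delivery applied by a naive step and by an idempotent one.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical event shape for illustration: (event_id, account, amount).
Event = Tuple[str, str, int]


def naive_apply(events: Iterable[Event], balances: Dict[str, int]) -> None:
    """Non-idempotent: every delivery changes state, duplicates included."""
    for _event_id, account, amount in events:
        balances[account] = balances.get(account, 0) + amount


def idempotent_apply(
    events: Iterable[Event], balances: Dict[str, int], seen: Set[str]
) -> None:
    """Idempotent: an event id that was already processed is skipped."""
    for event_id, account, amount in events:
        if event_id in seen:
            continue  # duplicate delivery or re-run: no effect
        balances[account] = balances.get(account, 0) + amount
        seen.add(event_id)


# At-least-once delivery: event "e1" arrives twice.
delivered = [("e1", "alice", 100), ("e2", "bob", 50), ("e1", "alice", 100)]

naive_balances: Dict[str, int] = {}
naive_apply(delivered, naive_balances)

safe_balances: Dict[str, int] = {}
seen_ids: Set[str] = set()
idempotent_apply(delivered, safe_balances, seen_ids)
idempotent_apply(delivered, safe_balances, seen_ids)  # re-run the whole batch

print(naive_balances)  # alice is double-counted
print(safe_balances)   # each event applied exactly once
```

In a real pipeline the set of processed ids and the state change would have to be committed together (for example in one transaction, or via an upsert keyed on the event id); the in-memory set here only illustrates the idea.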
Quiz
Chapter 1 — Data Engineering Foundations
Passed? You have the foundation the entire field is built on: the role, the lifecycle, the OLTP/OLAP split, the relational model, the batch/streaming and ETL/ELT decisions, and the idempotent, failure-aware mindset. Next we go down to the bytes: how data is physically stored on disk, and why columnar formats like Parquet and cheap object storage are the substrate everything else sits on.