The data engineering lifecycle
You now know what a data engineer does. This lesson gives you the map of the work: the fixed sequence of stages a piece of data passes through on its way from "just happened" to "trusted and useful." This sequence is called the data engineering lifecycle, and it is the single most useful mental model in the field. Once you can place any tool or task onto it, the whole landscape stops feeling like a random pile of vendors.
The plain-English on-ramp
Follow one fact through its life. A ShopFlow customer buys a coffee maker at 2:14 p.m. (The tables named in the steps below are ShopFlow's canonical tables, which the whole guide reuses.)
- The checkout app writes a row to ShopFlow's orders table: order_id 8841, customer_id 7, status paid, amount 79, plus an order_items line for the coffee maker. The fact has been generated.
- That night (or seconds later), a pipeline copies that row out of the app's database into the data platform. The fact has been ingested.
- It lands in cheap, durable storage alongside billions of other rows. The fact is now stored.
- A job cleans it, joins it to its customers record, and rolls it into a "daily sales" table. The fact has been transformed.
- The next morning, that table powers a dashboard the finance team opens, and a model that forecasts demand. The fact is now served.
Generation → ingestion → storage → transformation → serving. Five stages, always in that order. Every pipeline you ever build is some slice of this lifecycle. That's the whole map.
The five stages, defined
1. Generation
Generation is where data is born, in the source systems — application databases, third-party APIs, event streams, IoT sensors, uploaded files. Crucially, the data engineer usually does not own or control these. You read from them; another team (or another company) decides their schema, their reliability, and when they change. A huge part of the job is dealing gracefully with source systems that change without warning. The source's shape — is it neat rows, or messy JSON? — drives every decision downstream, which is why we devote Relational fundamentals and Chapter 2 to it.
2. Ingestion
Ingestion is getting data out of the source systems and into your platform. This is often the hardest, most failure-prone stage, because it's where your reliable world meets the messy outside one. The central choices here are batch vs streaming (do you pull data in scheduled chunks, or continuously as it arrives?) — the entire subject of Batch vs streaming — and push vs pull (does the source send to you, or do you fetch from it?). Ingestion is also where idempotency earns its keep: networks fail mid-transfer, so an ingestion step must be safe to re-run without creating duplicates. Chapter 6 is dedicated to this stage.
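To make "safe to re-run without creating duplicates" concrete, here is a minimal sketch of one common way to get idempotency: each run writes a file named after the day it covers and replaces any earlier attempt instead of appending. The function names, the JSON-lines file format, the date, and order 8842 are assumptions made up for illustration; a local temporary directory stands in for the platform's landing area.

```python
import json
import tempfile
from pathlib import Path


def ingest_day(rows, landing_dir, day):
    """Write one day's rows to a file named after that day, replacing any earlier attempt."""
    landing_dir = Path(landing_dir)
    landing_dir.mkdir(parents=True, exist_ok=True)
    target = landing_dir / f"orders_{day}.jsonl"
    tmp = landing_dir / f"orders_{day}.jsonl.tmp"
    # Write to a temporary file first so a crash never leaves a half-written target.
    with tmp.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Path.replace overwrites the destination, so a re-run replaces the slice
    # instead of appending to it.
    tmp.replace(target)
    return target


def count_rows(path):
    """Count the lines (one row per line) in a landed file."""
    with Path(path).open(encoding="utf-8") as f:
        return sum(1 for _ in f)


# Hypothetical rows: 8841 comes from the running example, 8842 is invented.
rows = [
    {"order_id": 8841, "customer_id": 7, "status": "paid", "amount": 79},
    {"order_id": 8842, "customer_id": 9, "status": "paid", "amount": 120},
]

with tempfile.TemporaryDirectory() as landing:
    ingest_day(rows, landing, "2024-05-01")
    retried = ingest_day(rows, landing, "2024-05-01")  # simulated retry of the same day
    rows_after_retry = count_rows(retried)
    files_after_retry = len(list(Path(landing).iterdir()))

print(rows_after_retry, files_after_retry)
```

The design choice that matters is that the output location is derived from the slice of data being loaded, not from the time the job happened to run, so running the job twice converges on the same state.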
3. Storage
Storage is keeping the data, durably and affordably, in a form the later stages can use. It is not a single decision — storage runs underneath the whole lifecycle (you store raw ingested data, store intermediate transformed data, store the final served tables). The key tensions are cost (object storage is cheap; databases are pricier) and query speed (how data is laid out on disk determines how fast and cheap it is to read — the heart of OLTP vs OLAP and Chapter 2's file formats). Choosing the storage layout is one of the highest-leverage decisions a data engineer makes.
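The link between layout and read cost can be sketched with nothing more than directories. In this illustration each order date gets its own folder, so a query for one day opens only that day's file. The `order_date=<date>` folder naming is a common convention assumed here, the helper names and sample rows are hypothetical, and CSV stands in for a real columnar format purely to keep the example dependency-free.

```python
import csv
import tempfile
from pathlib import Path


def write_partition(root, order_date, rows):
    """Store one day's rows under root/order_date=<date>/part-0.csv."""
    part_dir = Path(root) / f"order_date={order_date}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with (part_dir / "part-0.csv").open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "amount"])
        writer.writerows(rows)


def read_day(root, order_date):
    """Read only the files for one date; other partitions are never opened."""
    part_dir = Path(root) / f"order_date={order_date}"
    rows = []
    files_opened = 0
    for path in sorted(part_dir.glob("*.csv")):
        files_opened += 1
        with path.open(newline="", encoding="utf-8") as f:
            rows.extend(csv.DictReader(f))
    return rows, files_opened


with tempfile.TemporaryDirectory() as root:
    # Three days of (invented) history on disk.
    write_partition(root, "2024-04-29", [(8001, 15)])
    write_partition(root, "2024-04-30", [(8500, 42)])
    write_partition(root, "2024-05-01", [(8841, 79)])
    day_rows, files_opened = read_day(root, "2024-05-01")
    partitions_on_disk = len(list(Path(root).iterdir()))

print(partitions_on_disk, files_opened, day_rows)
```

With three partitions on disk, the one-day read touches a single file. The same idea, scaled up, is why a well-chosen partition key keeps a daily job from scanning all of history.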
4. Transformation
Transformation is turning raw, stored data into something trustworthy and useful: cleaning bad values, standardizing formats, joining datasets together, aggregating, and modeling it into business-friendly shapes (Chapter 4). This is where raw data becomes a data product. The big architectural question here — ETL vs ELT (transform before loading, or load first and transform inside the warehouse?) — is covered in Batch vs streaming, and the tooling (dbt, the modern data stack) in Chapter 7.
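As a rough illustration of "clean, standardize, join, aggregate", the sketch below runs one SQL statement against an in-memory SQLite database standing in for the warehouse. The table names follow ShopFlow's orders, order_items, and products, but the column lists are simplified and partly assumed (quantity, unit_price, and product_id are hypothetical), and all rows other than order 8841 are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, order_ts TEXT);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER,
                          quantity INTEGER, unit_price REAL);

-- Note the inconsistent category spelling that the transformation must standardize.
INSERT INTO products VALUES (1, 'Kitchen'), (2, ' kitchen '), (3, 'Audio');
INSERT INTO orders VALUES
    (8841, 'paid',      '2024-05-01 14:14:00'),
    (8842, 'paid',      '2024-05-01 16:30:00'),
    (8843, 'cancelled', '2024-05-01 17:00:00');
INSERT INTO order_items VALUES
    (8841, 1, 1, 79.0),
    (8842, 2, 2, 10.0),
    (8842, 3, 1, 50.0),
    (8843, 3, 1, 50.0);
""")

daily_sales = conn.execute("""
    SELECT date(o.order_ts)                 AS order_date,
           lower(trim(p.category))          AS category,
           SUM(oi.quantity * oi.unit_price) AS total_sales
    FROM order_items AS oi
    JOIN orders   AS o ON o.order_id = oi.order_id
    JOIN products AS p ON p.product_id = oi.product_id
    WHERE o.status = 'paid'  -- cleaning: cancelled orders are not sales
    GROUP BY date(o.order_ts), lower(trim(p.category))
    ORDER BY lower(trim(p.category))
""").fetchall()
conn.close()

for row in daily_sales:
    print(row)
```

The grouping expressions are repeated in full rather than referenced by alias, because the alias `category` would otherwise be ambiguous with the raw `products.category` column and the two spellings of "kitchen" would not be merged.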
5. Serving
Serving is delivering the transformed data to the people and systems that consume it: analytics (dashboards, reports, ad-hoc SQL), machine learning (features and training data for models), and reverse ETL / data-driven applications (pushing modeled data back into operational tools). Serving is the point of the whole lifecycle — every earlier stage exists to make this one correct, fast, and affordable. If serving is wrong, nothing upstream mattered.
The undercurrents: four concerns that run through every stage
The five stages are the horizontal flow. Four concerns cut vertically through all of them: they aren't stages you complete once, they apply at every stage. These are called the undercurrents, and attending to them is what separates a production pipeline from a toy one.
- Security — who can read or change the data, at every stage. Data is a liability as much as an asset; a leak at any stage is a breach. (The customer's side of the shared-responsibility line from the Cloud guide applies directly.)
- Data management — governance, quality, cataloging, and lineage: what data you have, where it came from, whether it's correct, and who's allowed to use it. This is Chapter 11's territory and increasingly a legal requirement (privacy regulations).
- Orchestration — coordinating the stages so they run in the right order, on schedule, with retries when something fails. This is what turns five separate steps into one dependable pipeline (the DAG mental model from the previous lesson). Chapter 8 is dedicated to it.
- Cost — every stage spends money (storage to keep, compute to transform, egress to move). As established in the last lesson, cost is a first-class design constraint, evaluated throughout, not bolted on at the end.
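Of the four undercurrents, orchestration is the easiest to show in a few lines. The sketch below is a toy runner, not a real orchestrator: it runs named steps in order, retries a step when it raises a retryable error, and stops the whole run if a step is still failing after its last attempt. `TransientError`, `run_pipeline`, and the step functions are hypothetical names invented for this example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a dropped connection."""


def run_pipeline(steps, max_attempts=3, delay_seconds=0.0):
    """Run (name, callable) steps in order, retrying transient failures.

    Returns a log of (step name, outcome, attempts used). If a step is still
    failing after max_attempts, the run stops and later steps never start.
    Any other exception type propagates immediately.
    """
    log = []
    for name, step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                step()
                log.append((name, "ok", attempt))
                break
            except TransientError:
                if attempt == max_attempts:
                    # A real orchestrator would alert a human at this point.
                    log.append((name, "failed", attempt))
                    return log
                time.sleep(delay_seconds)
    return log


calls = {"ingest": 0}


def ingest():
    """Fails on the first call only, simulating a network blip."""
    calls["ingest"] += 1
    if calls["ingest"] == 1:
        raise TransientError("network blip")


def transform():
    pass


def serve():
    pass


log = run_pipeline([("ingest", ingest), ("transform", transform), ("serve", serve)])
print(log)
```

Here the ingest step succeeds on its second attempt and the remaining steps run once each. Retrying is only safe because each step is assumed to be idempotent; a step that appends on every attempt would turn retries into duplicates.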
A fully traced example: ShopFlow's daily orders pipeline
Let's trace a realistic pipeline against the lifecycle, naming each stage and undercurrent as it appears. It runs over ShopFlow's orders, order_items, and products tables.
Goal: every morning, the finance dashboard shows yesterday's total sales by product category.
- Generation: ShopFlow's app writes each order to the orders table (with its line items in order_items) in a PostgreSQL database (an operational, row-oriented store; see OLTP vs OLAP). The data engineer doesn't control this schema.
- Ingestion: at 1 a.m., a batch job (it runs once a day; see Batch vs streaming) reads yesterday's new orders (those whose order_ts falls on yesterday's date) from PostgreSQL and writes them as files into cheap object storage. Undercurrent (idempotency): the job is written so that if it fails halfway and is retried, it overwrites yesterday's slice rather than appending duplicates.
- Storage: those files land in a columnar format (Chapter 2) partitioned by order date, so a query for "yesterday" reads only yesterday's files, not all of history. Undercurrent (cost): this layout means the next step scans megabytes, not terabytes.
- Transformation: a job (SQL via the warehouse, Chapter 4) cleans the rows, joins each order_items line to its products.category, and aggregates to total sales per category per day. Undercurrent (data management): a quality check fails the run if total sales are negative or a category is missing.
- Serving: the resulting small, clean table powers the finance dashboard. Undercurrent (security): only finance and the data team can read it.
Undercurrent (orchestration): an orchestrator runs stages 2 through 5 (ingestion through serving) in order, starting each only when the previous one succeeded, retrying transient failures, and alerting a human if the run is still broken after retries.
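The quality check in the transformation step can be sketched as a function that raises, and so fails the run, when the aggregated table breaks either rule. This is an illustration under assumptions: "a category is missing" is read here as a row whose category is empty or null, and `QualityCheckError`, `check_daily_sales`, and the sample rows are hypothetical.

```python
class QualityCheckError(Exception):
    """Raised to fail the pipeline run when the output table looks wrong."""


def check_daily_sales(rows):
    """Fail if any row has a negative total or no category.

    Each row is a dict with "category" and "total_sales" keys (assumed shape).
    All problems are collected before raising, so one failure message
    reports everything that is wrong with the run.
    """
    problems = []
    for index, row in enumerate(rows):
        if not row["category"]:
            problems.append(f"row {index}: missing category")
        if row["total_sales"] < 0:
            problems.append(f"row {index}: negative total_sales {row['total_sales']}")
    if problems:
        raise QualityCheckError("; ".join(problems))


good = [{"category": "kitchen", "total_sales": 99.0}]
bad = [
    {"category": None, "total_sales": 50.0},
    {"category": "audio", "total_sales": -5.0},
]

check_daily_sales(good)  # passes silently

try:
    check_daily_sales(bad)
    bad_run_failed = False
    failure_message = ""
except QualityCheckError as exc:
    bad_run_failed = True
    failure_message = str(exc)

print(bad_run_failed, failure_message)
```

Raising an exception, rather than logging a warning, is what lets an orchestrator treat a bad table like any other failed step and keep it away from the dashboard.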
Notice that the interesting engineering lives in the undercurrents — idempotency, cost-aware layout, quality checks, ordered retries — exactly the things a "just write a script" approach skips. That is the lesson.
Why it matters
The data engineering lifecycle (generation → ingestion → storage → transformation → serving) is the backbone map of the field: every pipeline is a slice of it, and every tool you'll meet serves a stage or an undercurrent. Cutting through all five stages are the four undercurrents (security, data management, orchestration, and cost), which are where production reliability actually lives. Learn to place any task or tool onto this map, and the overwhelming vendor landscape collapses into "which stage, which undercurrent." Each later chapter zooms into one part of this picture.
The very first distinction the lifecycle forces on you is at the generation and storage stages: the systems that produce data are built completely differently from the systems that analyze it. That split — OLTP vs OLAP — is the next, foundational lesson.
Where this leads: ingestion's batch-vs-streaming choice is the next-but-one lesson; orchestration's "DAG of steps" is the mental model from What a data engineer does, expanded in Chapter 8.