
What a data engineer actually does

Before any tool names, any Spark-vs-warehouse debate, any streaming bus — one plain definition. Data engineering is building the reliable systems that move data from where it is produced to where it can be trusted and used. That sentence contains the whole job. Everything in this chapter, and most of this guide, is an elaboration of it.

The plain-English on-ramp

Imagine a company that sells things online — say ShopFlow, the running example we'll follow for the whole guide. Every second, facts get produced: someone clicks a product, an orders row is written, a payment succeeds, a delivery van pings its location. Each of those facts is born inside some app's database, designed to handle that one app and nothing else.

Now the business wants to learn from all those facts together: How many orders this week versus last? Which products get abandoned in the cart? Should we restock? Answering those questions means pulling data out of a dozen separate systems, lining it up, cleaning the messy bits, and putting it somewhere a person (or a dashboard, or a machine-learning model) can ask questions of all of it at once — quickly and correctly.

That pulling, lining-up, cleaning, and serving is data engineering. A data engineer is the person who builds and runs the plumbing that does it, day after day, without breaking. When you hear "data pipeline," picture exactly this: an automated assembly line that takes raw, scattered, messy source data in one end and produces trustworthy, query-ready data out the other.
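To make the assembly-line picture concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `RAW_ORDERS` records, the field names, and the `clean`, `daily_revenue`, and `run_pipeline` helpers are hypothetical, not ShopFlow's real schema or any library's API.

```python
from datetime import date

# Hypothetical raw order events as they might arrive from a source system:
# stray whitespace, inconsistent casing, a duplicate, and a row with no amount.
RAW_ORDERS = [
    {"order_id": "A1", "email": " Ana@Example.com ", "amount": "19.99", "day": "2024-05-01"},
    {"order_id": "A1", "email": " Ana@Example.com ", "amount": "19.99", "day": "2024-05-01"},
    {"order_id": "A2", "email": "bo@example.com", "amount": "", "day": "2024-05-01"},
    {"order_id": "A3", "email": "CY@example.com", "amount": "5.00", "day": "2024-05-02"},
]


def clean(rows):
    """Drop duplicate and unusable rows, and normalize the types."""
    seen = set()
    cleaned = []
    for row in rows:
        # Skip repeated order ids and rows that have no amount at all.
        if row["order_id"] in seen or not row["amount"]:
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": row["order_id"],
            "email": row["email"].strip().lower(),
            "amount_cents": round(float(row["amount"]) * 100),
            "day": date.fromisoformat(row["day"]),
        })
    return cleaned


def daily_revenue(rows):
    """Aggregate cleaned rows into a query-ready total per day."""
    totals = {}
    for row in rows:
        totals[row["day"]] = totals.get(row["day"], 0) + row["amount_cents"]
    return totals


def run_pipeline(raw):
    """Raw, messy data in one end; a small data product out the other."""
    return daily_revenue(clean(raw))


revenue_by_day = run_pipeline(RAW_ORDERS)
```

Silently dropping the row with no amount keeps the sketch short. A production pipeline would more likely count or quarantine such rows so that someone can see how much data was rejected.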

:::note Why "plumbing" is the right metaphor
Plumbing is invisible when it works and catastrophic when it doesn't. Nobody thanks the data engineer when the morning dashboard is correct, but everyone notices when the revenue number is wrong, or yesterday's data is missing, or the report takes six hours. The job is judged on reliability, not cleverness. Hold on to that; it shapes every decision in this guide.
:::

Defining the term, precisely

Let's tighten the definition with the words you'll use for the rest of your career:

  • Data — recorded facts. A click, a row in a table, a sensor reading, a line in a log file.
  • A source system — the place data is born: an application's database, a third-party API, a stream of events, a file someone uploads. Data engineers rarely own these; they read from them.
  • A data pipeline — an automated, repeatable process that moves and reshapes data from one place to another. Pipelines are the unit of work a data engineer ships.
  • A data platform — the collection of systems (storage, query engines, orchestrators, catalogs) the pipelines run on. The platform is the factory; the pipelines are the assembly lines inside it.
  • A data product — the trustworthy output others depend on: a cleaned table, a metric, a feature for a model, a dashboard's backing dataset.

So a fuller definition: a data engineer designs, builds, and operates the pipelines and platform that turn raw source data into reliable data products.

Where the data engineer sits among the data roles

"Works with data" describes at least five different jobs. People constantly confuse them, and job titles are inconsistent between companies — so learn the roles (durable) and treat the titles (dated) as approximations. Here is the honest map, in the order data flows.

  • Platform / infrastructure engineer — provisions and operates the underlying compute, storage, and networking the data platform runs on (often using the cloud and Kubernetes from the sibling Cloud Engineer guide). They give the data engineer reliable infrastructure to build on. Below the data engineer in the stack.
  • Data engineer (you) — builds the pipelines and the data platform: ingestion, storage layout, the heavy transformations, orchestration, reliability, and cost. You make trustworthy data exist and stay fresh. The owner of the plumbing.
  • Analytics engineer — a newer role that lives on top of the data engineer's output, mostly writing SQL (the language for querying tabular data, defined in Relational fundamentals) to model already-loaded data into clean, business-friendly tables — typically with a tool called dbt (Chapter 7). They turn raw-but-loaded data into well-modeled data. Think "software-engineering discipline applied to SQL transformations."
  • Data analyst — consumes the modeled data to answer business questions: builds dashboards, runs ad-hoc queries, reports the numbers. They produce insight, not infrastructure.
  • Data scientist / ML engineer — uses the data to build statistical and machine-learning models — predictions, recommendations, forecasts (the territory of the sibling AI guide). They depend on the data engineer for clean, reliable inputs (and on the ML platform to serve models).

[Diagram: the data roles in the order data flows. Platform engineer (compute, storage, network); Data engineer (pipelines + platform: ingest, …); Analytics engineer (SQL modeling); Data analyst (dashboards, insight); Data scientist / ML engineer.]

The clean one-line test: a data engineer makes trustworthy data exist and stay fresh; everyone downstream uses it. If your job is to keep the pipelines running and the tables correct and on time, you're doing data engineering — whatever the title on the offer letter says.

:::tip Durable vs dated
The roles (someone provisions infra, someone moves and reshapes data, someone models it in SQL, someone analyzes it, someone models with ML) are durable: they map onto any data org. The titles and the boundaries between them are dated and vary wildly: at a small startup, one person is all five; at a big company, "analytics engineer" might not exist as a title at all. Learn the work, not the words.
:::

The two non-negotiable languages, and the rest of the toolkit

You do not need to learn fifty tools to start. You need two languages and a handful of habits. Everything else is built on these.

  • SQL — the language for asking questions of tabular data. It is the lingua franca of data work; you will write it every day, forever. We teach it properly in Chapter 3, and lean on relational basics in Relational fundamentals.
  • Python — the general-purpose programming language data engineering standardized on. You'll use it to write pipeline logic, call APIs, and wrangle data with libraries like pandas and the newer, faster Polars (both let you manipulate tables of data in memory; we use them throughout). Modern Python practice — type hints, packaging your code, and reproducible environments with tools like uv or poetry — is part of the job, not optional polish.
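As a small taste of the two languages working together, the sketch below uses Python's built-in `sqlite3` module to run a SQL query that answers "how many orders this week versus last?". The `orders` table, its columns, the sample rows, and the `weekly_comparison` helper are all assumptions made up for this example; a real warehouse would have its own schema and date functions.

```python
import sqlite3

# An in-memory database stands in for a real warehouse (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT PRIMARY KEY, ordered_on TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("A1", "2024-05-06", 1999),
        ("A2", "2024-05-07", 500),
        ("A3", "2024-05-14", 2500),
        ("A4", "2024-05-15", 1200),
        ("A5", "2024-05-16", 800),
    ],
)

# ISO-formatted dates compare correctly as text, so plain >= and < are enough here.
WEEKLY_ORDERS_SQL = """
SELECT
    CASE WHEN ordered_on >= :this_week_start THEN 'this_week' ELSE 'last_week' END AS period,
    COUNT(*) AS orders,
    SUM(amount_cents) AS revenue_cents
FROM orders
WHERE ordered_on >= :last_week_start
  AND ordered_on < :next_week_start
GROUP BY period
ORDER BY period
"""


def weekly_comparison(conn, last_week_start, this_week_start, next_week_start):
    """Return {period: (order_count, revenue_cents)} for two consecutive weeks."""
    params = {
        "last_week_start": last_week_start,
        "this_week_start": this_week_start,
        "next_week_start": next_week_start,
    }
    rows = conn.execute(WEEKLY_ORDERS_SQL, params).fetchall()
    return {period: (orders, revenue) for period, orders, revenue in rows}


result = weekly_comparison(conn, "2024-05-06", "2024-05-13", "2024-05-20")
```

SQL states the question; Python supplies the parameters, runs the query, and shapes the answer for whatever comes next. That division of labor shows up in most pipelines.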

Around those two languages sit the durable supporting tools every data engineer uses:

  • Git — version control, so your pipeline code is reviewed, history is tracked, and changes ship through CI (continuous integration — automated tests that run on every change).
  • The command line (Linux/bash) — where you actually run, inspect, and debug things on servers.
  • Docker — packages your code and its dependencies into a portable container so it runs identically on your laptop and in production.
  • Make / uv / poetry — automate the repetitive commands and pin exact dependency versions so a pipeline is reproducible.
  • Jupyter — interactive notebooks for exploring data and prototyping before you harden logic into a real pipeline.

You'll meet each in context. The point now is that data engineering is software engineering applied to data — which means the same disciplines (version control, testing, automation, reproducibility) apply. A pipeline is code, and it deserves to be treated like code.

:::caution The trap: "data engineering is just writing Python scripts"
The most common beginner misconception is that the job is a pile of one-off scripts that move files around. It is not. A production pipeline runs unattended, on a schedule, against systems that fail: networks drop, a source is briefly down, a job gets retried. The hard part isn't the happy-path script; it's making every step safe to re-run and observable when it breaks. That distributed-systems mindset (failure is normal, design for retries) is what separates a data engineer from someone who writes scripts. We make it the spine of Distributed-systems primitives, and it recurs in every later chapter.
:::
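Here is one way "failure is normal, design for retries" can look in code: a small retry wrapper with exponential backoff. This is a sketch under stated assumptions, not a prescribed implementation. `TransientError`, `run_with_retries`, and the flaky `fetch_orders` step are hypothetical names invented for the example, and real orchestrators usually provide retries as a built-in feature.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, brief outages)."""


def run_with_retries(step, max_attempts=3, base_delay_s=0.01, sleep=time.sleep):
    """Call step() until it succeeds, doubling the wait between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the failure surface so someone sees it
            sleep(base_delay_s * 2 ** (attempt - 1))


# A simulated source that fails twice and then recovers (assumption for the sketch).
calls = {"count": 0}


def fetch_orders():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("source briefly unavailable")
    return ["A1", "A2"]


orders = run_with_retries(fetch_orders)
```

Retrying is only safe when the step being retried can run more than once without doing damage, which is why re-runnable steps and retries always travel together.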

Cost is a design constraint, not an afterthought

Here's something junior engineers learn the hard way and you'll learn now: in data engineering, every design decision is also a cost decision. Data is big, and big means expensive. The three levers you constantly trade off:

  • Storage — keeping the bytes. Cheap per gigabyte, but it adds up across years of history and copies.
  • Compute — processing the bytes (scanning, joining, aggregating). Usually the expensive part: a badly written query can scan a terabyte and cost real money for a single run.
  • Egress — moving bytes out of a system or region. Often surprisingly costly and easy to forget.

A pipeline that works but scans ten times more data than necessary, or copies a dataset across regions every hour, can quietly cost more than the team that built it. So from the very first lesson, we treat cost as a first-class design input — "will this scale affordably?" sits right next to "will this be correct?". (FinOps — the discipline of managing cloud cost — is covered in depth in the Cloud Engineer guide; here we just keep cost visible in every decision.)
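A little arithmetic makes the "ten times more data" point tangible. The price below is a made-up round number chosen only to keep the math easy; real engines price compute in different ways (per byte scanned, per second of cluster time, and so on), so treat `HYPOTHETICAL_PRICE_PER_TB` and `monthly_scan_cost` as illustrative assumptions rather than anyone's actual rate card.

```python
# ASSUMPTION: an invented price, in arbitrary currency units, per terabyte scanned.
HYPOTHETICAL_PRICE_PER_TB = 5.00


def monthly_scan_cost(gb_scanned_per_run, runs_per_day, days=30,
                      price_per_tb=HYPOTHETICAL_PRICE_PER_TB):
    """Estimate the monthly cost of a scheduled query from the data it scans."""
    tb_scanned = gb_scanned_per_run * runs_per_day * days / 1000
    return tb_scanned * price_per_tb


# An hourly job that rescans a whole 1,000 GB table on every run...
full_scan = monthly_scan_cost(gb_scanned_per_run=1000, runs_per_day=24)

# ...versus the same job reading only the 100 GB it actually needs.
pruned_scan = monthly_scan_cost(gb_scanned_per_run=100, runs_per_day=24)

print(f"full scan: {full_scan:.2f} per month")
print(f"pruned:    {pruned_scan:.2f} per month")
```

Both jobs produce the same table. Under these assumed numbers, the only difference is a tenfold gap in the monthly bill, which is why "how much does this scan?" belongs in design review alongside "is this correct?".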

A pipeline is a DAG of idempotent, observable, testable steps

One mental model to carry through the whole guide: a data pipeline is best pictured as a DAG (a Directed Acyclic Graph). That means a set of steps (ingest ShopFlow's orders, then clean them, then join to customers, then publish the daily sales table) connected by arrows showing what depends on what, with no cycles: no step can depend, directly or indirectly, on its own output. The orchestrator (Chapter 8) runs the steps in dependency order.
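The ShopFlow example can be written down as a tiny DAG using `graphlib` from Python's standard library (Python 3.9+). This is only a sketch of the idea: the step names are invented, `ingest_customers` is an extra step assumed here so the join has something to depend on, and a real orchestrator does far more than compute an order.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a step; its set lists the steps that must finish first.
DAILY_SALES_DAG = {
    "ingest_orders": set(),
    "ingest_customers": set(),  # assumed extra step, not named in the text
    "clean_orders": {"ingest_orders"},
    "join_customers": {"clean_orders", "ingest_customers"},
    "publish_daily_sales": {"join_customers"},
}


def execution_order(dag):
    """Return one valid run order; raises CycleError if the graph has a loop."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(DAILY_SALES_DAG)
print(order)

# "Acyclic" is enforced, not just hoped for: a loop is rejected outright.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle detected: this is not a valid pipeline")
```

Several run orders can be valid for the same graph (the two ingest steps are independent of each other), so the guarantee is only that every step runs after everything it depends on.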

Three properties make that DAG production-grade, and we'll define each properly when it matters:

  1. Idempotent: re-running a step with the same inputs leaves the same end state as running it once, so a retry after a failure is safe (defined fully in Distributed-systems primitives).
  2. Observable: you can see whether each step ran, how long it took, and whether the data looks right (Chapter 11).
  3. Testable: the logic can be checked automatically before it ships, like any other code.

Keep this picture in your head: every pipeline you build is a DAG of steps that are each safe to re-run, watchable, and tested. That single sentence is most of what separates a robust data platform from a fragile pile of scripts.
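A minimal sketch of a single step with all three properties, using the standard-library `sqlite3` module. The `daily_sales` table, its columns, and the `publish_daily_sales` function are hypothetical, and "delete the day, then insert it again in one transaction" is just one common way to make a load re-runnable, assumed here for illustration.

```python
import sqlite3
import time


def publish_daily_sales(conn, day, rows):
    """Replace one day's rows inside a single transaction, so re-runs are safe."""
    started = time.perf_counter()
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, product, revenue_cents) VALUES (?, ?, ?)",
            [(day, product, cents) for product, cents in rows],
        )
    # Observable: report what happened instead of finishing silently.
    return {
        "step": "publish_daily_sales",
        "day": day,
        "rows_written": len(rows),
        "seconds": time.perf_counter() - started,
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, product TEXT, revenue_cents INTEGER)")

rows = [("mug", 1999), ("tote", 500)]
first_run = publish_daily_sales(conn, "2024-05-01", rows)
second_run = publish_daily_sales(conn, "2024-05-01", rows)  # simulated retry

# Testable: the idempotency claim is checked by code, not taken on faith.
total_rows = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
assert total_rows == 2, "a retry must not duplicate rows"
```

A plain append-only insert would have left four rows after the retry. The returned dictionary is a stand-in for the metrics and logs a real platform would collect.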

Why it matters

Data engineering is the reliable plumbing that turns raw, scattered, messy source data into trustworthy data products others depend on. The data engineer owns the pipelines and the platform — sitting above the platform engineer (who runs the infrastructure) and below the analytics engineer, analyst, and data scientist (who all use the data the engineer makes exist). The two non-negotiable languages are SQL and Python, surrounded by the same software-engineering disciplines — Git, CI, the command line, containers, reproducible environments. And from day one, cost is a design constraint and a pipeline is a DAG of idempotent, observable, testable steps, not a pile of scripts.

The natural next question is: what exactly is the journey this data takes? That journey has a name and a fixed set of stages — the data engineering lifecycle — and it's the backbone of everything that follows.

Where this leads: the role you just placed on the map runs the lifecycle in the next lesson, and the "failure is normal" mindset becomes concrete in Distributed-systems primitives.
