DataOps, reliability & security

Two data engineers can both "build a working pipeline." The difference between junior and senior is everything around the pipeline: is it in version control, tested in CI, deployed through dev→staging→prod, defined as code, monitored against a commitment, recoverable when it breaks, and secured? This bundle of practices, borrowed from software and site-reliability engineering and applied to data, is called DataOps. It is the maturity layer most data curricula skip even though the job market demands it. The specific tools will date; the rigor is durable, and it's the clearest line between someone who can build a pipeline and someone you'd trust to run a platform.

DataOps is, in plain terms, applying proven software-engineering and operations discipline — version control, automated testing, continuous delivery, environments, infrastructure as code, and monitoring — to data pipelines and data products, so they're reliable, repeatable, and safe to change.

Version control and CI/CD for pipelines

The foundation is treating everything as code in Git: your dbt models, your orchestration DAGs, your SQL, your transformations, your infrastructure definitions, your tests. Git gives you history, review, and the ability to reproduce and roll back — the same reasons it's non-negotiable in software. "The transformation logic lives only in the warehouse UI" or "the pipeline is a notebook on someone's laptop" is the junior anti-pattern that makes a platform unreviewable and unrecoverable.

On top of Git sits CI/CD (Continuous Integration / Continuous Delivery): automation — typically GitHub Actions, GitLab CI, or similar — that runs on every change. For data pipelines, CI means: on a pull request, automatically lint the SQL, compile the dbt project, run the data tests on a sample, and check the models build — catching a broken transformation before it reaches production, not after it corrupts a dashboard. CD means: once merged, the change is automatically deployed to the next environment. This is exactly the DORA-style delivery rigor — ship small changes frequently, with automated checks, so failures are caught early and recovery is fast.
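As a rough illustration of the "run the data tests on a sample" step, here is a minimal Python sketch of the kind of check a CI job might run on a pull request. The function names, the `order_id` and `amount` columns, and the sample rows are all hypothetical; in a real project, tests like these would more likely be declared in dbt than hand-written.

```python
from collections import Counter


def run_data_tests(rows, key="order_id", required=("order_id", "amount")):
    """Return a list of failure messages; an empty list means the sample passed.

    The column names are hypothetical stand-ins for a real model's columns.
    """
    failures = []

    # Not-null test: every required column must have a value in every row.
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                failures.append(f"row {i}: {col} is null")

    # Uniqueness test: the key column must not repeat within the sample.
    counts = Counter(row.get(key) for row in rows if row.get(key) is not None)
    for value, n in counts.items():
        if n > 1:
            failures.append(f"{key}={value} appears {n} times")

    return failures


def ci_exit_code(failures):
    """Map test results to a process exit code: non-zero fails the CI check."""
    return 1 if failures else 0
```

A CI step could call `run_data_tests` on a sampled extract and exit with `ci_exit_code(...)`; any non-zero exit code is what turns the pull-request check red.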

A pipeline change that wasn't tested in CI is a production incident waiting for a schedule. The whole point of CI/CD for data is to move the moment-of-failure-discovery from "the executive's dashboard is wrong" to "the pull-request check went red."

Environments: dev → staging → prod

A core DataOps practice is separate environments so you never develop or experiment against production data and production consumers:

  • Dev — where an engineer builds and iterates, against sample or sandboxed data.
  • Staging / pre-prod — a production-like environment where changes are validated end-to-end before release.
  • Prod — the real pipelines feeding real dashboards, models, and decisions.

The same code is promoted through the stages; only configuration (which database, which credentials, which schema) differs. dbt's targets and warehouse schemas, separate cloud accounts, or separate projects all implement this. The durable rule: changes flow dev → staging → prod through review and CI, never straight into prod by hand. Editing production transformations live is how silent data corruption happens.
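One way to picture "same code, different configuration" is a small Python sketch in which the pipeline reads its environment name at runtime and looks up its settings. The `PIPELINE_ENV` variable, the database names, and the schema names are assumptions made for illustration, not part of any particular tool.

```python
import os

# Hypothetical per-environment settings. The pipeline code is identical in
# every stage; only these values change.
ENVIRONMENTS = {
    "dev": {"database": "analytics_dev", "schema": "sandbox"},
    "staging": {"database": "analytics_staging", "schema": "marts"},
    "prod": {"database": "analytics", "schema": "marts"},
}


def load_config(env=None):
    """Pick settings for the given environment, defaulting to PIPELINE_ENV or dev."""
    env = env or os.environ.get("PIPELINE_ENV", "dev")
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    return {"env": env, **ENVIRONMENTS[env]}


def target_table(config, table):
    """Build the fully qualified table name for the active environment."""
    return f"{config['database']}.{config['schema']}.{table}"
```

Defaulting to `dev` is a deliberate choice in this sketch: a forgotten setting then points at sandboxed data rather than at production.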

Infrastructure as code and containers

Infrastructure as code (IaC) means defining your platform's infrastructure — the warehouse, the object-storage buckets, the IAM roles, the orchestrator, the networking — in declarative configuration files (most commonly Terraform) checked into Git, rather than clicking it together in a console. The payoff is the same as for pipeline code: it's reviewable, reproducible, version-controlled, and you can stand up an identical environment (that staging copy!) from one command. A platform built by hand-clicking is undocumented, un-reproducible, and impossible to recover after an account is lost.
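To make the "declarative" idea concrete, the toy Python sketch below compares a desired set of resources with what currently exists and lists the changes needed. This is only a conceptual illustration under simplifying assumptions: it is not Terraform's configuration language or its actual planning algorithm, and the resource names are hypothetical.

```python
def plan(desired, actual):
    """Compare desired resources with existing ones and list the changes needed.

    Both arguments map a resource name to a dict of its settings.
    """
    changes = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            changes.append(("create", name))
        elif name not in desired:
            changes.append(("delete", name))
        elif desired[name] != actual[name]:
            changes.append(("update", name))
    return changes


# Hypothetical desired state, as it might be described in files kept in Git.
desired = {
    "bucket.raw": {"versioning": True},
    "role.ingest": {"access": "read"},
}

# Hypothetical current state of a hand-clicked account.
actual = {
    "bucket.raw": {"versioning": False},
    "vm.legacy": {},
}

changes = plan(desired, actual)
```

Because the desired state lives in a reviewable file, the list of changes can be inspected in a pull request before anything is applied, and applying the same definition to an empty account yields only "create" steps, which is how an identical staging copy can be stood up.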

Containers package code with its exact dependencies so it runs identically everywhere. Docker is the standard: your Spark job, your dbt run, your custom ingestion script ships as an image that behaves the same on a laptop and in production — killing the "works on my machine" class of bug. Kubernetes orchestrates many containers across a cluster, scaling and restarting them; data tools increasingly run on it (the Spark-on-Kubernetes operator, Airflow's KubernetesExecutor, Dagster/Flink deployments). You don't need to operate Kubernetes to be a strong data engineer — but you must understand that your pipelines run as containers on a cluster, because that's the substrate of the modern platform.

:::tip You don't have to master IaC/K8s, but you can't ignore them
The most common senior-vs-junior gap in 2026 data engineering isn't a missing SQL skill; it's missing DataOps maturity: no Git, no CI, no environments, no IaC, no containers. You don't need to be a Kubernetes administrator (that's the cloud/platform engineer's deep specialty, covered in depth in the sibling guide). But a data engineer who has never put a pipeline in CI, promoted through environments, or seen their job run as a container is operating at a junior level regardless of how good their SQL is. Learn the workflow; lean on the cloud guide for the depth.
:::

Reliability: SLAs, SLOs, and treating data as a product

Once a dashboard or table is something people depend on, it's a data product, and a product needs a reliability commitment. The durable vocabulary, borrowed from SRE, is SLA, SLO, and SLI.

To make it concrete, take ShopFlow's shopflow_daily DAG (see Meet ShopFlow), which feeds fact_sales to the morning revenue dashboards:

  • SLA (Service-Level Agreement) — the promise to consumers: "ShopFlow's fact_sales is fresh by 8 a.m. and passes its quality tests." It's the contract the finance and merchandising teams plan their morning around.
  • SLO (Service-Level Objective) — your internal target that keeps you safely inside the SLA: "shopflow_daily completes by 7 a.m." — an hour of headroom for a retry. A common data SLO is freshness (how recently fact_sales updated) plus completeness (yesterday's orders all landed) and quality (the tests on fact_sales pass).
  • SLI (Service-Level Indicator) — the actual measured number (today fact_sales was fresh at 7:12 a.m.) you compare against the objective.
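The relationship between the three terms can be sketched in a few lines of Python, using the times from the ShopFlow example above. The function name and the status strings are hypothetical.

```python
from datetime import time

# Times taken from the ShopFlow example.
SLA_FRESH_BY = time(8, 0)      # promise to consumers
SLO_COMPLETE_BY = time(7, 0)   # stricter internal target


def evaluate_freshness(sli, slo=SLO_COMPLETE_BY, sla=SLA_FRESH_BY):
    """Compare the measured freshness time (the SLI) with the SLO and the SLA."""
    if sli <= slo:
        return "within SLO"
    if sli <= sla:
        return "SLO missed, SLA met"
    return "SLA breached"


# Today's measurement: fact_sales was fresh at 7:12 a.m.
status = evaluate_freshness(time(7, 12))
```

With the 7:12 a.m. measurement, the internal objective was missed but the promise to consumers still held, which is exactly the situation the hour of headroom is meant to absorb.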

The practical upshot is monitoring and alerting on the things consumers care about: is the data fresh, complete, and passing its quality tests? This is the data observability from Chapter 11 wired into an alert — if shopflow_daily is still running at 7:30 a.m. or its quality_check step fails, the on-call data engineer is paged before the merchandising team opens a stale ShopFlow dashboard. A single daily SLA like ShopFlow's rarely needs 24/7 on-call; a business-hours rotation that owns the morning run is usually the right-sized commitment — match the on-call burden to the SLA, exactly as you match architecture to scale.
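The paging rule described above might look like the following Python sketch. The state strings (`"running"`, `"success"`, `"failed"`) and the function name are assumptions for illustration; a real orchestrator exposes its own state names and alerting hooks.

```python
from datetime import time

# Threshold from the ShopFlow example: page if the run is still going at 7:30.
PAGE_IF_RUNNING_AFTER = time(7, 30)


def should_page(now, dag_state, quality_check_state):
    """Decide whether to page the on-call engineer for shopflow_daily.

    The state values are hypothetical labels, not a specific orchestrator's API.
    """
    # A failed quality check means consumers would see bad data.
    if quality_check_state == "failed":
        return True
    # A run still in progress at the threshold puts the 8 a.m. SLA at risk.
    if dag_state == "running" and now >= PAGE_IF_RUNNING_AFTER:
        return True
    return False
```

The alert is tied to what consumers care about (late or failing data) rather than to every internal hiccup, which keeps the on-call burden matched to the SLA.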

Incident response, on-call, and postmortems

When a data product breaks the SLA — a source disappeared, a bad deploy corrupted a table, a pipeline silently produced wrong numbers — you need incident response: detect (the alert fired), assess impact (who relies on this?), mitigate (pause downstream, roll back, backfill from good data), communicate (tell affected consumers honestly), and resolve. Mature teams put a data engineer on-call for the critical pipelines. Afterward comes the most durable practice of all: the blameless postmortem — a written analysis of what happened, why, and what systemic change prevents a recurrence, focused on the system and process, not on blaming a person. The goal isn't to find a culprit; it's to make the same failure impossible next time. A team that writes honest postmortems gets monotonically more reliable; one that hunts for someone to blame learns nothing and hides the next failure.

Security at scale

A platform concentrates an organization's most sensitive data in one place, which makes data-engineering security non-optional. The durable, least-exotic essentials:

  • Secrets management — database passwords, API keys, and cloud credentials live in a secrets manager (AWS Secrets Manager, GCP Secret Manager, Vault) or the orchestrator's secret store, never hard-coded in code, notebooks, or — the classic disaster — committed to Git. Pipelines fetch them at runtime.
  • Least-privilege IAM — each pipeline, service, and person gets only the permissions it actually needs (read this bucket, write that schema), nothing more. A compromised ingestion job with read-only access to one dataset is a contained incident; one with admin over the whole warehouse is a catastrophe. This is the same least-privilege principle from cloud IAM, applied to data access.
  • Encryption in transit and at rest — data is encrypted at rest (on the storage, usually on by default in managed services) and in transit (TLS on every connection) so intercepted or stolen storage is useless without the keys.
  • Network controls — keep databases and warehouses in private networks (a VPC), reachable only through controlled paths, not exposed to the public internet. Combined with IAM, this is defense in depth.
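For the first item, "pipelines fetch them at runtime" can be sketched in Python as below. The sketch assumes the secrets manager or orchestrator injects the secret into the job as an environment variable, which is one common pattern but not the only one; `WAREHOUSE_PASSWORD` and the function names are hypothetical.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not provided to the job at runtime."""


def get_secret(name, environ=None):
    """Read a secret from the environment instead of hard-coding it.

    Assumes the orchestrator or secrets manager has injected the value as an
    environment variable. The value is never written into the error message.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise MissingSecretError(f"secret {name!r} is not set for this job")
    return value


def warehouse_settings(environ=None):
    """Assemble connection settings; only the password comes from a secret."""
    return {
        "user": "shopflow_ingest",  # hypothetical least-privilege service user
        "password": get_secret("WAREHOUSE_PASSWORD", environ),
    }
```

Failing loudly when the secret is absent is a deliberate choice here: a job that cannot find its credential should stop, not fall back to a default baked into the code.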

These overlap heavily with the cloud security and governance/privacy material; the data-engineering responsibility is to apply them to the pipelines and the data: secrets out of code, least-privilege everywhere, encryption on, networks closed.

Common pitfalls

  • Logic living outside Git. Transformations in a warehouse UI or a laptop notebook — unreviewable, unreproducible, unrecoverable.
  • No CI / editing prod by hand. Changes that skip automated tests and environments, so failures are discovered by consumers instead of by a red PR check.
  • Hand-clicked infrastructure. No Terraform, so the platform is undocumented and can't be rebuilt or staged.
  • Data with no SLO or monitoring. A "data product" nobody monitors for freshness/quality, so staleness and corruption are found late, by the people who depend on it.
  • Secrets in code and over-broad IAM. Hard-coded credentials (especially committed to Git) and pipelines granted far more access than they need — the two most common, most damaging security mistakes.

Why it matters

DataOps is software- and SRE-grade rigor applied to data: everything in Git, CI/CD that tests and builds pipeline changes before production, separate dev/staging/prod environments with promotion through review, and infrastructure as code (Terraform) plus containers (Docker, running on Kubernetes) so the platform is reproducible. Reliability means treating dashboards and tables as data products with SLAs/SLOs for freshness, completeness, and quality, monitoring and alerting against them, incident response with on-call, and blameless postmortems that keep the same failure from recurring. Security at scale is the unglamorous essentials done consistently: secrets managed (never in code/Git), least-privilege IAM, encryption in transit and at rest, and closed networks. This maturity, not raw SQL skill, is the clearest line between junior and senior in 2026. With the platform built, decided, costed, and operated, the next lesson turns to where the field is heading: AI in the pipeline and the role split.