
Designing a data platform & DE system design

So far this guide has taught you the parts: a store, a query engine, a model, a pipeline, an orchestrator, a streaming bus, a table format, a quality test. This lesson teaches you to assemble those parts into a whole, to design a data platform end to end, and then to do it under pressure. "Design a pipeline for this scale" is the single most common senior data-engineering interview question, and it is also the actual senior job. The mechanics of any one tool date quickly; the ability to compose tools sensibly for a given scale, estimate whether your design holds, and reason about how it fails is durable, and it is what gets you hired and promoted.

A data platform, in plain terms, is the whole connected system that gets data from where it is produced to where it can be trusted and used: the ingestion, the storage, the processing, the serving layer, plus the orchestration, quality, and governance wrapped around them. Designing one is choosing which component plays each role and how big each needs to be.

The same shape at every scale

Almost every data platform, from a hobby project to a bank, has the same five-stage spine you met in Chapter 1. What changes with scale is which tool fills each box and how much machinery surrounds it.

Sources (apps, APIs, DBs, events) → Ingestion (batch / CDC / stream) → Storage (lake / warehouse / lakehouse) → Processing (SQL / dbt / Spark) → Serving (BI, APIs, ML, reverse-ETL)

The senior insight is that the spine is constant; the implementation should be the simplest thing that meets the requirement. Let's walk the same platform at three scales, and make it concrete with the pipeline you've followed throughout this guide: ShopFlow (see Meet ShopFlow), whose customers/products/orders/order_items sources become fact_sales plus its dimensions. The right ShopFlow platform looks wildly different depending on whether ShopFlow is one founder's side project or a 5,000-person retailer: same spine, different machinery.

Solo / small (gigabytes, one person)

You have a few data sources and one consumer (you, or a small team's dashboard). The right platform is almost embarrassingly small: a managed warehouse (or even DuckDB on a laptop / a single Postgres), a lightweight ingestion tool or a few scheduled scripts, dbt for transformation, and a BI tool or notebook to serve. No Spark, no Kafka, no Kubernetes. Cron or a single scheduled GitHub Action is enough orchestration. For ShopFlow at this scale: the four source tables sync nightly into one warehouse, dbt builds stg_* then fact_sales + dims, and a single scheduled job runs the lot — no streaming for order_events yet, because a daily revenue dashboard is the only consumer. Reaching for distributed systems here is the number-one beginner mistake — you pay all the operational cost of scale with none of the data to justify it.

Startup / mid (terabytes, a small team)

Volume and number of sources grow; multiple teams now depend on the data. The platform fills out: a cloud warehouse or lakehouse as the core, a managed ingestion service (Fivetran/Airbyte-style) plus change data capture for the high-value source databases, dbt for the modeling layer, a real orchestrator (Airflow/Dagster/Prefect), and the beginnings of data-quality tests and a catalog. Streaming appears only where a real low-latency need exists. This is where the "modern data stack" lives.

Enterprise (petabytes, many teams)

Now the hard problems are organizational, not just technical: hundreds of sources, dozens of teams, strict governance, multiple regions, and reliability commitments. You'll see a lakehouse with open table formats, real-time streaming backbones (Kafka/Kinesis), Spark/Flink for heavy processing, a mature catalog and lineage, formal data contracts, SLAs on data products, and often a data mesh style split where domain teams own their data products on a shared self-serve platform. The platform team's job becomes building the paved road others build on.

:::tip Design for the scale you have, with a path to the next
The mistake in both directions is symmetrical. Under-building (an enterprise on spreadsheets) collapses under load and governance needs; over-building (a startup on a petabyte-scale stack) drowns a small team in operational toil. The senior move: build for your current scale plus one, choosing components that could grow (e.g. a warehouse that scales compute independently, storage that's already columnar and partitioned), so you can grow into the next tier without a rewrite, but you don't pay for it today.
:::

The DE system-design interview

"Design a data pipeline that ingests X and serves Y" is the canonical senior data-engineering interview, and it's deliberately open-ended. Interviewers aren't looking for one right answer — they're watching how you think. A reliable structure:

  1. Clarify requirements first. Never start drawing boxes. Ask: How much data (volume per day, total)? How fast must it be fresh — daily, hourly, seconds (latency/freshness requirement)? Batch or real-time? Who consumes it and how (dashboards, an API, an ML model)? What's the budget and team size? These answers determine the architecture; guessing them is the most common failure.
  2. Estimate the scale (the capacity math below). This separates senior candidates instantly.
  3. Sketch the spine — sources → ingestion → storage → processing → serving — and name a concrete component for each, justified by the requirements.
  4. Address the cross-cutting concerns — orchestration, idempotency and failure handling, data quality, schema evolution, cost, and monitoring. This is where seniority shows; juniors stop at the happy path.
  5. State the trade-offs — "I chose batch because the freshness requirement is daily and it's far cheaper and simpler; I'd switch to streaming if they needed sub-minute." Naming the alternative and the trigger to switch is the senior signal.
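Steps 1 and 5 can be made concrete by writing the requirements down as data and letting the freshness answer drive the batch-versus-streaming choice. The sketch below is illustrative only: the `Requirements` fields, the `choose_processing_mode` name, and the 60-second and one-hour cut-offs are assumptions for this example, not rules from this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Answers gathered in step 1, before any boxes are drawn (hypothetical fields)."""
    events_per_day: int
    freshness_seconds: int          # how stale the served data is allowed to be
    consumers: tuple[str, ...]      # e.g. dashboards, an API, an ML model


def choose_processing_mode(req: Requirements) -> str:
    """Map the freshness requirement to a processing style.

    The cut-offs are assumed for illustration; a real decision would also
    weigh budget and team size.
    """
    if req.freshness_seconds < 60:
        return "streaming"
    if req.freshness_seconds < 3600:
        return "micro-batch"
    return "batch"


daily_dashboard = Requirements(400_000, 24 * 3600, ("BI dashboard",))
live_revenue = Requirements(400_000, 30, ("ops API",))

print(choose_processing_mode(daily_dashboard))  # batch
print(choose_processing_mode(live_revenue))     # streaming
```

The useful part is less the function than the habit: the same volume produces two different architectures once the freshness answer changes, which is exactly the "alternative and trigger to switch" you state in step 5.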

Capacity and throughput estimation

Capacity estimation is back-of-the-envelope math to check that your design is in the right ballpark before you commit. It is the single most under-practiced data-engineering interview skill (most candidates only drill SQL puzzles), and it's a daily senior task. The pattern is always the same: turn the requirement into a rate and a size, then sanity-check it.

Worked example — clickstream pipeline. "Ingest clickstream events from an app with 10 million daily active users, each generating ~50 events/day, serve hourly aggregates."

  • Event volume: 10M users × 50 = 500M events/day. Spread evenly that's 500M / 86,400 s ≈ 5,800 events/sec average — but traffic is peaky, so assume a peak of ~3× average ≈ 17,000 events/sec. Your ingestion layer must sustain peak, not average.
  • Data size: at ~1 KB/event, that's 500M × 1 KB = 500 GB/day raw, ≈ 180 TB/year before compression. With columnar Parquet compression (say 5–10×), call it ~20–35 TB/year stored — cheap on object storage.
  • Read the result: 17K events/sec comfortably exceeds what a single database or naive script can absorb, so you need a buffer that decouples producers from consumers — a streaming log (Kafka/Kinesis) or at minimum micro-batched object-storage writes. But 500 GB/day is not "you need a 200-node cluster" territory; a modest warehouse or Spark job handles the hourly rollups easily. The numbers told you the shape of the answer.

The point isn't precision — it's orders of magnitude. Estimation catches "this won't fit on one machine" and equally catches "you absolutely do not need Kafka for this," which is the over-engineering trap. Always compute: events/sec at peak, bytes/day, and the resulting storage/year.

The same math on ShopFlow. Run it on ShopFlow's order_events stream. Say ShopFlow does 100K orders/day, each emitting ~4 events (placed → paid → shipped, occasionally cancelled), so ~400K events/day ≈ 5/sec average, peaking at maybe ~25/sec during a flash sale. That is nothing: a single Postgres absorbs it, and the honest estimate tells you ShopFlow does not need Kafka for ingestion at this size. It needs Kafka only if and when streaming latency (per-minute revenue) becomes a real requirement (Ch. 9). When ShopFlow grows 10× to a million orders/day, redo the math: ~50/sec average is still small, but fact_sales storage and query cost (lesson 12.3) start to bite first. Estimation tells you which wall you hit first.
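Both worked examples reduce to the same three numbers, so the arithmetic fits in a small helper. This is a back-of-the-envelope sketch, not a sizing tool: the function name, the default 1 KB event size, the 3× peak factor, and the 5–10× compression range are the assumptions used in the clickstream example, and the 5× flash-sale factor for ShopFlow is an assumption chosen to land near the ~25/sec figure.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Estimate:
    avg_events_per_sec: float
    peak_events_per_sec: float
    raw_gb_per_day: float
    raw_tb_per_year: float
    stored_tb_per_year: tuple[float, float]  # (high compression, low compression)


def estimate(events_per_day, bytes_per_event=1_000, peak_factor=3.0,
             compression=(5.0, 10.0)):
    """Turn a daily event count into a rate and a size (decimal units)."""
    avg = events_per_day / SECONDS_PER_DAY
    raw_bytes_per_day = events_per_day * bytes_per_event
    raw_tb_per_year = raw_bytes_per_day * 365 / 1e12
    low_ratio, high_ratio = compression
    return Estimate(
        avg_events_per_sec=avg,
        peak_events_per_sec=avg * peak_factor,
        raw_gb_per_day=raw_bytes_per_day / 1e9,
        raw_tb_per_year=raw_tb_per_year,
        stored_tb_per_year=(raw_tb_per_year / high_ratio,
                            raw_tb_per_year / low_ratio),
    )


clickstream = estimate(10_000_000 * 50)            # 10M users x 50 events
shopflow = estimate(100_000 * 4, peak_factor=5.0)  # 100K orders x ~4 events
shopflow_10x = estimate(1_000_000 * 4, peak_factor=5.0)

for name, est in [("clickstream", clickstream),
                  ("shopflow", shopflow),
                  ("shopflow 10x", shopflow_10x)]:
    print(f"{name}: {est.avg_events_per_sec:,.0f}/s avg, "
          f"{est.peak_events_per_sec:,.0f}/s peak, "
          f"{est.raw_gb_per_day:,.1f} GB/day raw")
```

The three printed rows differ by three orders of magnitude in events/sec, which is the whole point of the exercise: the clickstream row argues for a buffer in front of storage, while both ShopFlow rows argue against one.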

Idempotency: the property that makes pipelines safe

You met idempotency in ingestion; it is the load-bearing concept of platform design, so it returns here as a design principle. An operation is idempotent if running it multiple times has the same effect as running it once. This matters because, in any real platform, steps will run more than once: an orchestrator retries a failed task, a backfill re-runs yesterday's job, a streaming consumer redelivers a message after a crash, an engineer re-triggers a DAG by hand. If re-running a step double-counts revenue or duplicates rows, your data is silently wrong — the worst kind of failure.

The durable techniques:

  • Idempotent writes via overwrite-by-partition. Instead of appending today's data, delete and replace the target partition (e.g. re-running ShopFlow's daily load for order_date = '2026-06-24' replaces that day's fact_sales rows), so re-running the day's job yields the same table, not duplicated revenue. This is why partitioned tables and the MERGE/insert-overwrite patterns from the lakehouse chapter matter, and why ShopFlow's shopflow_daily DAG can be safely retried.
  • Natural keys + upsert/MERGE. Key each record by a stable business identifier and MERGE (upsert) so a re-delivered row updates in place instead of inserting a duplicate.
  • Deduplication keys. Tag each event with a unique id and drop duplicates on read or load.
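The first technique, overwrite-by-partition, can be sketched with Python's built-in `sqlite3` module. SQLite has no real partitions, so a `DELETE` by `order_date` inside a single transaction stands in for an insert-overwrite here; the `fact_sales` columns and the `load_partition` helper are hypothetical and chosen only for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (order_id INTEGER, order_date TEXT, revenue REAL)"
)


def load_partition(conn, order_date, rows):
    """Replace one day's rows: delete the 'partition', then insert it again."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM fact_sales WHERE order_date = ?", (order_date,))
        conn.executemany(
            "INSERT INTO fact_sales (order_id, order_date, revenue) "
            "VALUES (?, ?, ?)",
            [(order_id, order_date, revenue) for order_id, revenue in rows],
        )


day = "2026-06-24"
rows = [(1, 20.0), (2, 35.5)]

load_partition(conn, day, rows)
load_partition(conn, day, rows)  # simulated retry or backfill of the same day

count, total = conn.execute(
    "SELECT COUNT(*), SUM(revenue) FROM fact_sales WHERE order_date = ?", (day,)
).fetchone()
print(count, total)  # 2 55.5
```

Had `load_partition` only appended, the second call would have left four rows and doubled the day's revenue. Wrapping the delete and the insert in one transaction matters too: a crash between them rolls back rather than leaving the day empty.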

A design is only as reliable as its retries are safe. Retries are what make distributed pipelines robust; idempotency is what makes those retries safe. Design every pipeline step so it can be re-run with no harm — then failures become a non-event.
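A redelivered message is the simplest retry to test. The following minimal sketch, again on `sqlite3`, combines a deduplication key with a natural-key upsert. The table names, the event shape, and `apply_event` are assumptions made for illustration, and the `ON CONFLICT ... DO UPDATE` clause assumes SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, amount REAL)"
)
conn.execute("CREATE TABLE seen_events (event_id TEXT PRIMARY KEY)")


def apply_event(conn, event):
    """Apply one order event at most once, even if it is delivered repeatedly."""
    with conn:  # dedup record and upsert commit together, or not at all
        # Deduplication key: the insert is ignored if the id was already seen.
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen_events (event_id) VALUES (?)",
            (event["event_id"],),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery, nothing to do
        # Natural key + upsert: update the existing order instead of duplicating it.
        conn.execute(
            "INSERT INTO orders (order_id, status, amount) VALUES (?, ?, ?) "
            "ON CONFLICT(order_id) DO UPDATE SET "
            "status = excluded.status, amount = excluded.amount",
            (event["order_id"], event["status"], event["amount"]),
        )
        return True


events = [
    {"event_id": "e1", "order_id": 1, "status": "placed", "amount": 20.0},
    {"event_id": "e2", "order_id": 1, "status": "paid", "amount": 20.0},
    {"event_id": "e2", "order_id": 1, "status": "paid", "amount": 20.0},  # redelivered
]
applied = [apply_event(conn, e) for e in events]
orders = conn.execute("SELECT order_id, status, amount FROM orders").fetchall()

print(applied)  # [True, True, False]
print(orders)   # [(1, 'paid', 20.0)]
```

The two mechanisms cover different failures: the event id catches the exact same message arriving twice, while the upsert keeps the table at one row per order even when distinct events touch the same key.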

Failure handling: design for when, not if

A platform is a chain of systems, and at scale something is always partially broken. Senior design assumes failure and plans for it explicitly:

  • A source is late or missing — does the pipeline wait, skip, or fail loudly? Use orchestration sensors and freshness checks; never silently process stale or absent data.
  • A bad batch lands — schema drift, a corrupt file, garbage values. Quarantine it (a "dead-letter" path or a reject table), alert, and keep good data flowing rather than crashing the whole run. (This is the data-quality gate from Chapter 11 acting as a circuit breaker.)
  • A step crashes mid-run — because steps are idempotent, the orchestrator just retries the task with backoff; no partial-write corruption because you write atomically (write to a temp location, then swap) or replace-by-partition.
  • Backfills — re-processing historical data after a bug fix or a new column. A platform designed with partition-overwrite idempotency backfills trivially; one built on blind appends turns a backfill into a duplication disaster.
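Three of these behaviours (quarantining bad rows, writing atomically, and retrying with backoff) fit in a short standard-library sketch. Everything named here is hypothetical: the validation rule, the JSON file layout, and the helper names are assumptions for illustration, and a real platform would usually delegate retries to the orchestrator rather than hand-roll them.

```python
import json
import os
import tempfile
import time


def split_batch(rows):
    """Separate valid rows from rejects instead of failing the whole batch."""
    good, rejects = [], []
    for row in rows:
        amount = row.get("amount")
        if "order_id" in row and isinstance(amount, (int, float)) and amount >= 0:
            good.append(row)
        else:
            rejects.append(row)
    return good, rejects


def write_atomically(path, rows):
    """Write to a temp file in the target directory, then swap it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(rows, handle)
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a half-written temp file behind
        raise


def with_retries(task, attempts=3, base_delay=0.01):
    """Re-run an idempotent task with exponential backoff on OSError."""
    for attempt in range(attempts):
        try:
            return task()
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


out_dir = tempfile.mkdtemp()
batch = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": "garbage"},  # bad value
    {"amount": 5.0},                       # missing key
]
good, rejects = split_batch(batch)

target = os.path.join(out_dir, "order_date=2026-06-24.json")
reject_path = os.path.join(out_dir, "rejects.json")
with_retries(lambda: write_atomically(target, good))
with_retries(lambda: write_atomically(reject_path, rejects))

print(len(good), len(rejects))  # 1 2
```

Because the target file is replaced whole, re-running the same day after a crash overwrites it rather than appending, so the retry wrapper is safe to apply. The reject file plays the role of the dead-letter path: the good row still lands while the two bad rows wait for inspection.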

The throughline: clarify → estimate → sketch the spine → handle the cross-cutting concerns (idempotency, failure, quality, cost) → state trade-offs. That is both the interview answer and the job.

Common pitfalls

  • Drawing boxes before clarifying requirements. Volume, freshness, consumers, and budget determine the design; assuming them is the top interview failure.
  • Skipping capacity estimation. Without the events/sec and bytes/day math you can't tell over- from under-engineering — and you can't defend either.
  • Designing only the happy path. Real platforms are judged on how they handle late sources, bad batches, crashes, and backfills. Stopping at "data flows in and out" reads as junior.
  • Append-only thinking. Non-idempotent steps double-count on every retry and backfill — the most common cause of silently wrong data.
  • Cargo-culting big-tech architectures. Kafka + Flink + Spark + a mesh because a FAANG blog post used them, for 10 GB of data. Match the spine's implementation to your actual scale.

Why it matters

A data platform is the five-stage spine (sources → ingest → store → process → serve) wrapped in orchestration, quality, and governance, and good design is choosing the simplest implementation of that spine that meets the requirement, at solo, startup, or enterprise scale. The canonical senior interview ("design a pipeline for this scale") and the senior job are the same skill: clarify requirements, estimate capacity (events/sec at peak, bytes/day, storage/year), sketch the spine with justified components, state the trade-offs, and handle the cross-cutting concerns, above all idempotency (re-runnable steps via partition-overwrite/MERGE so retries and backfills can't double-count) and failure handling (late sources, bad batches, crashes, backfills). Tools date quickly; this compose-estimate-and-fail-gracefully judgment is durable, and it is the foundation for the decisions the next lesson formalizes.

Next: 12.2 Architecture decisions & ADRs →