
Architecture decisions & ADRs

A senior data engineer's real job isn't writing transformations — it's making decisions under uncertainty and constraints, and being able to defend them later to an engineer, a finance partner, or a skeptical stakeholder. This lesson gives you a concrete rule for the handful of architecture questions that recur in every data role, shows you why "boring" is usually the right default, and hands you the durable artifact — the ADR — that makes your reasoning survive the people who made it. Tools churn yearly; the ability to decide well and record why is a career-long skill, and it's the clearest senior signal there is.

The recurring decisions, with a rule for each

Across every data platform, the same "should we…?" questions come back. None has a universal answer — but each has a default and a trigger to deviate. Internalize these and you'll out-reason most architecture arguments. Notice the shape that runs through all of them: every default is the simpler, cheaper, lower-operational-cost option, and you deviate only on a specific, articulated trigger — almost always a real latency, scale, or cost-at-scale requirement you can name.

Batch vs streaming

Process data in scheduled chunks (batch), or continuously as events arrive (streaming)?

Default: batch. It is dramatically simpler to build, test, debug, and reason about, far cheaper, and meets the freshness requirement of the overwhelming majority of analytics (daily or hourly is plenty for most dashboards and reports). Reach for streaming only when there is a real sub-minute latency requirement — fraud detection, live operational dashboards, real-time personalization, alerting. Streaming buys low latency at the cost of always-on infrastructure, harder debugging, exactly-once complexity, and on-call burden. The honest question is "what business decision changes if this data is one hour old instead of one second old?" — if the answer is "nothing," you don't need streaming.
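The default and its trigger can be written down as a tiny rule. This is a minimal sketch, not a standard: the 60-second cutoff is an assumed reading of "sub-minute," and the function name and return labels are hypothetical.

```python
from datetime import timedelta

# Assumption: "sub-minute" is modeled as a staleness tolerance under 60 seconds.
STREAMING_THRESHOLD = timedelta(minutes=1)


def choose_processing_mode(max_acceptable_staleness: timedelta) -> str:
    """Return "streaming" only when the business needs sub-minute data."""
    if max_acceptable_staleness < STREAMING_THRESHOLD:
        return "streaming"
    # Default: anything that tolerates a minute or more of staleness stays batch.
    return "batch"


# A daily dashboard tolerates 24 hours of staleness; fraud scoring does not.
daily_dashboard = choose_processing_mode(timedelta(hours=24))
fraud_detection = choose_processing_mode(timedelta(seconds=2))
print(daily_dashboard, fraud_detection)
```

The input that matters is the staleness the business decision can tolerate, not the speed at which the source emits events.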

Warehouse vs lakehouse

A managed cloud data warehouse (Snowflake/BigQuery/Redshift), or a lakehouse on object storage with open table formats (Iceberg/Delta/Hudi)?

Default: a cloud warehouse for a small-to-mid team that's mostly SQL analytics. It's the lowest-friction path — managed, fast, great SQL ergonomics, minimal ops. Move toward (or add) a lakehouse when you have large/diverse data including semi-structured and ML workloads, want to avoid storage lock-in with open formats, need a single copy of data serving both BI and Spark/ML, or your scale makes warehouse storage+compute pricing expensive. Many mature platforms run both: a lakehouse for raw/large/ML data and a warehouse (or warehouse-style engine) for curated BI. The choice hinges on workload mix, scale, and team SQL-vs-engineering balance — not on which is "newer."

Applied to ShopFlow (see Meet ShopFlow): should ShopFlow put fact_sales in a plain warehouse or a lakehouse? At today's scale — a handful of source tables, BI-style "revenue by day / top products" questions, a small team — a warehouse is the right default: fact_sales and its dims are clean star-schema SQL with no semi-structured or ML mix to justify the lakehouse's extra machinery. ShopFlow earns the lakehouse later — when it adds product clickstream/recommendation features (semi-structured + ML on the same data), wants fact_sales as open Iceberg/Delta to avoid lock-in (Ch. 10), or its order volume makes warehouse storage pricing pinch. Same data, different answer at different scale — exactly the kind of call you record in an ADR below.

Build vs buy

Build a capability yourself, or use an existing managed product/service?

Default: buy (or use a managed service) for anything that isn't your core differentiator. Ingestion connectors, the warehouse engine, orchestration hosting, a catalog — companies whose entire job is that problem do it better than you will. The classic data-engineering money pit is building and maintaining your own source connectors when a managed ingestion tool exists, or self-operating a Spark/Kafka cluster a managed service would run for a fraction of the engineer-hours. For ShopFlow: don't hand-write a connector to extract its orders/customers source DB — a managed ingestion tool (or CDC) lands raw.orders for far fewer engineer-hours, and ShopFlow's differentiator is selling things online, not maintaining database connectors. Build only when it's genuinely core to your value, no product fits a durable need, or the cost-at-scale of buying genuinely exceeds the fully-loaded cost of building and operating it. Always price the engineer-hours, not just the license.
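"Price the engineer-hours, not just the license" is simple arithmetic, sketched below. Every number here (license fee, hours, hourly rate) is a made-up placeholder for illustration, and the `Option` class and `cheaper_option` helper are hypothetical names; substitute your own estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    annual_license_usd: float      # vendor or cloud bill
    annual_engineer_hours: float   # build + maintain + on-call time

    def fully_loaded_cost(self, hourly_rate_usd: float) -> float:
        """License plus the cost of the engineer time the option consumes."""
        return self.annual_license_usd + self.annual_engineer_hours * hourly_rate_usd


def cheaper_option(a: Option, b: Option, hourly_rate_usd: float) -> Option:
    """Pick the option with the lower fully loaded annual cost."""
    return min((a, b), key=lambda o: o.fully_loaded_cost(hourly_rate_usd))


# Illustrative placeholders only; not from any real vendor or team.
buy = Option("managed ingestion", annual_license_usd=30_000, annual_engineer_hours=40)
build = Option("hand-written connectors", annual_license_usd=0, annual_engineer_hours=600)
rate = 100.0  # assumed fully loaded hourly cost of an engineer

print(buy.fully_loaded_cost(rate))    # 34000.0
print(build.fully_loaded_cost(rate))  # 60000.0
print(cheaper_option(buy, build, rate).name)
```

With these placeholder inputs the "free" build costs more than the licensed product once engineer time is counted. The comparison flips only if the license grows or the maintenance hours shrink, which is the cost-at-scale trigger described above.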

Managed vs open-source / self-hosted

Run open-source software yourself (self-host Airflow, Spark, Kafka, Trino), or use the managed/cloud version (MWAA/Astronomer, Databricks/EMR, MSK/Confluent)?

Default: managed. You offload patching, scaling, backups, and upgrades to the provider, an enormous operational saving, especially for a small team. Open-source/self-hosted wins when you have a hard requirement managed can't meet (specific version, on-prem, data residency) or extreme cost at very large scale, where the managed premium dwarfs the cost of a dedicated platform team, and in either case you have the in-house expertise to operate it well. "We'll self-host to save money" usually loses once you price the on-call, the upgrades, and the 3 a.m. pages. Open-source still matters enormously for avoiding lock-in at the format and interface layer (open table formats, SQL, Arrow) even when you use a managed engine.

[Figure: architecture decision flow. Freshness need? daily/hourly → batch · sub-minute … | Workload mix? SQL BI → warehouse · large + ML + open → … | Core differentiator? no → buy/managed · yes … | Managed premium worth it? small team → managed · …]

"Boring data tech": why the unexciting choice usually wins

There's a durable principle in this field, sometimes called "choose boring technology": prefer mature, well-understood, widely deployed tools over the exciting new thing. Boring here is a compliment: it means the tool's failure modes are known, the documentation and community are deep, hiring is easy, and you won't be the one debugging a bleeding-edge bug at midnight. A team has a limited budget of novelty it can absorb; spend it on the part that's actually your differentiator, and pick boring, proven defaults (a standard warehouse, SQL, dbt, a mainstream orchestrator, Parquet) for everything else.

This is the antidote to résumé-driven development: choosing a trendy tool because it looks good on a résumé, not because it fits the problem. The shiny streaming-mesh-vector-everything stack that impresses on a slide is often a small team's undoing. Boring, appropriate, and justified beats novel and impressive almost every time.

:::tip The honest test for any "let's adopt X"
Ask three questions: (1) What specific requirement, named in cost, latency, or scale terms, does X meet that our current boring default can't? (2) What does X cost us in operational burden, hiring, and complexity, fully loaded? (3) Is this the part of our system where novelty actually buys differentiation? If you can't answer (1) concretely, you're probably résumé-driven, not requirement-driven. Write the answer down; that is the subject of the next section.
:::

Writing it down: Architecture Decision Records (ADRs)

Here's the gap that quietly sinks data teams: a big call — "warehouse or lakehouse," "batch or streaming," "build the connector or buy" — gets made in a meeting or a Slack thread, and a year later nobody remembers why. A new engineer "fixes" the weird-looking choice, breaks something the original decision deliberately prevented, and the lesson is re-learned the expensive way. The durable fix is the Architecture Decision Record (ADR) — a short, plain document capturing one significant decision: what you decided, the context, why, and what you traded away.

An ADR is small on purpose:

# ADR 0007: Use daily batch ingestion, not streaming, for ShopFlow sales analytics

## Status
Accepted — 2026-06-24

## Context
ShopFlow needs `fact_sales` (and its dims) in the warehouse for the revenue,
top-products, and repeat-customer dashboards. Stakeholders confirmed the
freshness requirement is "by 8 a.m. daily," not real-time. Team is 3 data
engineers with no streaming experience. Volume is ~100K orders/day.

## Decision
Run the `shopflow_daily` DAG: ingest the `orders`/`customers`/`order_items`/
`products` sources nightly to raw, build `stg_*` → `fact_sales` + dims with dbt,
partitioned by order date. The `order_events` stream stays parked; no Kafka/Flink.

## Consequences
+ Far simpler to build, test, and operate; no always-on streaming infra.
+ Much cheaper; no exactly-once or on-call streaming burden for a 3-person team.
+ Idempotent: the daily load overwrites the order-date partition of `fact_sales`,
so retries/backfills never double-count revenue.
- Data is up to ~1 day stale. Accepted: no current ShopFlow decision needs it fresher.
- If a real sub-minute use case appears (e.g. live per-minute revenue on a flash
sale, `fact_revenue_1m`), add a streaming path for that slice only off
`order_events` (supersede this ADR).
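The idempotency consequence in the ADR above is worth seeing concretely. The sketch below uses an in-memory dict as a stand-in for a date-partitioned `fact_sales`; the function names, the sample rows, and the partition date are all hypothetical, and a real warehouse would do the overwrite with its own partition-replace mechanism.

```python
# In-memory stand-in for a date-partitioned fact table: {order_date: [rows]}.
FactTable = dict[str, list[dict]]


def load_partition_overwrite(table: FactTable, order_date: str, rows: list[dict]) -> None:
    """Replace the whole partition, so re-running the same load changes nothing."""
    table[order_date] = list(rows)


def load_partition_append(table: FactTable, order_date: str, rows: list[dict]) -> None:
    """Naive alternative: append rows, which duplicates them on every retry."""
    table.setdefault(order_date, []).extend(rows)


def total_revenue(table: FactTable) -> int:
    return sum(row["amount"] for rows in table.values() for row in rows)


# Hypothetical day of orders, loaded twice to simulate a retry or backfill.
day_rows = [{"order_id": 1, "amount": 40}, {"order_id": 2, "amount": 60}]
overwrite_table: FactTable = {}
append_table: FactTable = {}
for _ in range(2):
    load_partition_overwrite(overwrite_table, "2026-06-23", day_rows)
    load_partition_append(append_table, "2026-06-23", day_rows)

print(total_revenue(overwrite_table))  # 100
print(total_revenue(append_table))     # 200
```

The overwrite version reports the same revenue no matter how many times the load runs, which is the property the ADR relies on when it says retries and backfills never double-count.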

ADRs are durable knowledge: cheap to write, numbered and append-only (you supersede an old one with a new one rather than editing history), and they make the reasoning — not just the outcome — survive turnover. A team with a folder of ADRs answers "why is it built this way?" in minutes instead of archaeology. Crucially, the ADR records the trigger to revisit ("if a real sub-minute need appears…"), so the decision isn't dogma — it's a defensible default with a written exit condition. This is also the artifact that does double duty in the portfolio: an ADR in your project repo signals judgment, and almost nobody includes one.
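The "numbered and append-only" convention can be sketched in a few lines. This is an illustration under assumptions, not a tool the lesson prescribes: the `ADR` and `ADRLog` classes are hypothetical, and in practice the log is usually just a folder of numbered Markdown files.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ADR:
    number: int
    title: str
    status: str = "Accepted"
    superseded_by: Optional[int] = None


@dataclass
class ADRLog:
    records: list = field(default_factory=list)

    def accept(self, title: str) -> ADR:
        """Append a new record; numbers only ever grow."""
        adr = ADR(number=len(self.records) + 1, title=title)
        self.records.append(adr)
        return adr

    def supersede(self, old_number: int, new_title: str) -> ADR:
        """Record a new decision and point the old one at it, leaving its text intact."""
        old = self.records[old_number - 1]
        new = self.accept(new_title)
        old.status = f"Superseded by ADR {new.number:04d}"
        old.superseded_by = new.number
        return new


log = ADRLog()
first = log.accept("Use daily batch ingestion, not streaming")
second = log.supersede(first.number, "Add a streaming path for per-minute revenue")
print(first.status)       # Superseded by ADR 0002
print(len(log.records))   # 2
```

The old record is never deleted or rewritten; only its status changes to point forward, so the history of why the first decision was made stays readable.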

A close cousin is the RFC ("Request for Comments") — a longer proposal circulated before a big decision to gather feedback and align the team, after which the outcome is recorded tersely as an ADR. Both express the same durable instinct: make trade-offs explicit and written, not implicit and forgotten. The act of writing forces clearer thinking, and the artifact pays dividends for years.

Common pitfalls

  • Streaming by default. Reaching for Kafka/Flink when the freshness requirement is hourly — paying always-on cost and on-call burden for latency nobody needs.
  • Building undifferentiated infrastructure. Hand-maintaining source connectors or self-operating a cluster a managed service would run for far fewer engineer-hours.
  • Self-hosting "to save money" without pricing ops. The license is cheap; the on-call, upgrades, and expertise are not.
  • Résumé-driven / novel-by-default. Choosing the trendy tool with no named requirement it uniquely meets. Boring, justified defaults win.
  • Decisions only in chat. No ADR, so the why evaporates and the next engineer breaks it. Write down the decision, the context, the trade-off, and the trigger to revisit.

Why it matters

The recurring data-architecture calls — batch vs streaming, warehouse vs lakehouse, build vs buy, managed vs open-source — each default to the simpler, cheaper option, and you deviate only on a specific, articulated trigger (usually a real latency, scale, or cost-at-scale need you can name). "Boring data tech" — mature, proven, well-understood — is usually the right default; spend your scarce novelty budget only where it buys differentiation, and beware résumé-driven choices. The discipline that makes good decisions durable is writing them down: an ADR captures one decision's context, choice, trade-off, and trigger-to-revisit so the reasoning survives turnover, and an RFC aligns the team before a big change. This decision-and-documentation skill is the clearest senior signal in data engineering — and the next lesson makes one axis of it, cost, a first-class concern.
