
Batch vs streaming, and ETL vs ELT

Two architectural decisions shape almost every pipeline you'll build, and beginners routinely treat both as style choices when they're actually engineering tradeoffs with cost and latency consequences. The first: do you process data in scheduled chunks or continuously as it arrives? The second: do you transform before loading or load first, transform later? Get these two framings right and you can reason about most pipeline designs you'll ever see.

Part 1 — Batch vs streaming

The plain-English on-ramp

Two ways to do the dishes. Batch: let them pile up and wash the whole load once after dinner — efficient per dish, but the clean plates aren't available until the batch runs. Streaming: wash each dish the instant it's dirty — plates are always ready, but you're at the sink constantly and it's more effort per dish.

Data works the same way. Batch processing waits, collects a bounded chunk of data, and processes it all at once on a schedule (every night, every hour). Stream processing handles each event the moment it arrives, continuously, forever. The tradeoff is the same as the dishes: batch is cheaper and simpler per record but adds latency (data is stale until the next run); streaming is fresh but more complex and usually costlier to run.

Defining the terms

  • Batch processing — processing data in bounded chunks on a schedule. The data has a clear start and end ("yesterday's ShopFlow orders"). Simple to reason about, easy to re-run, cheap. Latency: minutes to hours.
  • Stream processing — processing an unbounded sequence of events continuously, one (or a few) at a time, as they occur (ShopFlow's order_events stream — see Meet ShopFlow). No "end" — it runs forever. Latency: milliseconds to seconds.
  • Micro-batch — the middle ground: process very small batches very frequently (e.g. every few seconds). It looks streaming-ish to users but is implemented as tiny, frequent batch runs, which keeps batch's simplicity and re-runnability while shrinking latency. Spark Streaming popularized this model, and Spark Structured Streaming (Chapter 5/9) uses it by default.

It's a spectrum, not a binary:

slower / cheaper / simpler ◄─────────────────────────────► faster / costlier / harder
nightly batch → hourly batch → micro-batch (secs) → true streaming (ms)
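To make the three definitions concrete, here is a minimal sketch that computes the same running revenue total in each style. The `ORDER_AMOUNTS` list and all function names are hypothetical stand-ins, not part of ShopFlow; a real stream would be unbounded and would arrive over time rather than sit in a list.

```python
from itertools import islice
from typing import Iterable, Iterator

# Hypothetical order amounts standing in for events on an order stream.
ORDER_AMOUNTS = [20.0, 35.5, 12.0, 8.5, 50.0, 4.0, 19.0]


def batch_total(amounts: Iterable[float]) -> float:
    """Batch: the input is bounded, so process all of it in one pass."""
    return sum(amounts)


def streaming_totals(events: Iterable[float]) -> Iterator[float]:
    """Streaming: emit an updated result after every single event."""
    running = 0.0
    for amount in events:
        running += amount
        yield running


def micro_batch_totals(events: Iterable[float], size: int) -> Iterator[float]:
    """Micro-batch: emit an updated result after each small chunk."""
    iterator = iter(events)
    running = 0.0
    while chunk := list(islice(iterator, size)):
        running += batch_total(chunk)  # each chunk is a tiny batch job
        yield running


if __name__ == "__main__":
    print(batch_total(ORDER_AMOUNTS))
    print(list(streaming_totals(ORDER_AMOUNTS)))
    print(list(micro_batch_totals(ORDER_AMOUNTS, size=3)))
```

All three arrive at the same final total. What differs is how many intermediate answers a consumer sees along the way: one for batch, one per event for streaming, one per chunk for micro-batch.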

The tradeoffs: latency, throughput, cost

Three dials move together, and you can't max all three:

  • Latency — how fresh the data is (time from event happening to it being usable). Streaming minimizes it; batch trades it away.
  • Throughput — how much data you can process per unit time. Batch excels — processing a million rows together is far more efficient per row than a million one-at-a-time events.
  • Cost & complexity — a streaming system runs 24/7 (always-on compute) and must handle hard problems batch dodges: events arriving out of order, late events, exactly-once delivery (next lesson). Batch jobs run, finish, and release their resources.
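The throughput claim above can be illustrated with a toy cost model: each call to the processing system pays a fixed overhead (network round trip, commit, scheduling), and batching amortizes that overhead across many rows. The overhead and per-row figures below are invented for illustration, not measurements of any real system.

```python
def total_microseconds(rows: int, rows_per_call: int,
                       overhead_us: int, per_row_us: int) -> int:
    """Total processing time when rows are sent rows_per_call at a time."""
    calls = -(-rows // rows_per_call)  # ceiling division
    return calls * overhead_us + rows * per_row_us


# Assumed, illustrative numbers only.
ROWS = 1_000_000
OVERHEAD_US = 5_000   # fixed cost paid once per call
PER_ROW_US = 10       # work that scales with the row count

one_at_a_time = total_microseconds(ROWS, 1, OVERHEAD_US, PER_ROW_US)
single_batch = total_microseconds(ROWS, ROWS, OVERHEAD_US, PER_ROW_US)

if __name__ == "__main__":
    print(f"one at a time: {one_at_a_time / 1e6:.3f} s")
    print(f"single batch:  {single_batch / 1e6:.3f} s")
```

Under these made-up numbers the per-row work is identical in both cases (10 seconds); the one-at-a-time path spends an extra 5,000 seconds on per-call overhead alone. Real streaming engines reduce that overhead in various ways, so treat this as intuition for the shape of the tradeoff rather than a prediction.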

When each is actually appropriate

The honest decision rule — and it leans harder toward batch than beginners expect:

  • Default to batch. Most analytics genuinely don't need second-fresh data. A daily sales dashboard, a weekly report, a monthly model retrain — nightly or hourly batch is simpler, cheaper, easier to re-run after a failure, and easier to reason about. Choosing streaming when batch suffices is one of the most common over-engineering mistakes in data.
  • Reach for streaming when freshness has real business value: fraud detection (catch it in the act, not tomorrow), real-time personalization, live operational dashboards, alerting, anything where a minutes-old answer is worthless.
  • Micro-batch is the pragmatic middle when you want "pretty fresh" (seconds) without the full complexity and cost of true streaming.
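The decision rule above can be sketched as a small function that starts from the freshness the business needs rather than from a tool. The thresholds are arbitrary placeholders chosen to match the rough latency ranges in this lesson; a real decision would also weigh cost and team experience.

```python
def choose_mode(max_staleness_seconds: float) -> str:
    """Pick the simplest processing style that meets the freshness need.

    Thresholds are illustrative assumptions, not industry standards.
    """
    if max_staleness_seconds < 1:
        return "streaming"      # sub-second answers: true streaming
    if max_staleness_seconds < 60:
        return "micro-batch"    # "pretty fresh": tiny, frequent batches
    return "batch"              # minutes or more: the default


if __name__ == "__main__":
    print(choose_mode(0.2))      # e.g. fraud check during checkout
    print(choose_mode(10))       # e.g. near-live operations dashboard
    print(choose_mode(86_400))   # e.g. daily sales report
```

Note that the function falls through to batch: anything that does not demonstrably need seconds-level freshness gets the simpler option.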

:::tip Durable vs dated
The tradeoff (latency vs throughput vs cost/complexity) is durable and will outlive every tool. The tools (cron for batch; Spark, Kafka, Flink for streaming) are dated. Ask "what latency does the business actually need, and what will the fresher option cost?", not "what's the cool real-time tool everyone mentions?" Chapter 9 covers streaming in depth; this is the framing.
:::

Part 2 — ETL vs ELT

The plain-English on-ramp

You're moving data from a source into your warehouse — say ShopFlow's orders and order_items into the analytics warehouse — and somewhere along the way you have to transform it (clean, reshape, join to customers and products). The only question is where the transform happens relative to the load — before or after the data lands in the warehouse. That ordering is the entire ETL-vs-ELT debate, and it's a genuine architectural decision driven by cost, not a naming quirk.

Defining the terms

Both are the same three steps — Extract (pull from source), Load (put in warehouse), Transform (reshape) — in a different order:

  • ETL — Extract, Transform, Load. Pull data out, transform it on a separate processing system first, then load the finished, clean result into the warehouse. The warehouse only ever holds transformed data.
  • ELT — Extract, Load, Transform. Pull data out, load it raw into the warehouse immediately, then transform it inside the warehouse using the warehouse's own compute (SQL). The warehouse holds both raw and transformed data.

ETL: Extract → Transform (separate engine) → Load
ELT: Extract → Load (raw) → Transform (in warehouse)
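The difference in ordering is easier to see side by side. The sketch below uses an in-memory SQLite database as a stand-in for the warehouse; the table names, column names, and sample rows are hypothetical, and a real cloud warehouse would differ in scale and SQL dialect.

```python
import sqlite3

# Hypothetical extracted rows: (order_id, email, amount_cents as text).
EXTRACTED = [
    (1, "  Alice@Example.com ", "1999"),
    (2, "bob@example.com", "500"),
]


def run_etl(rows):
    """ETL: transform in application code first, then load only clean rows."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE orders_clean (order_id INTEGER, email TEXT, amount_cents INTEGER)"
    )
    cleaned = [(oid, email.strip().lower(), int(cents)) for oid, email, cents in rows]
    warehouse.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", cleaned)
    return warehouse


def run_elt(rows):
    """ELT: load rows untouched, then transform with the warehouse's own SQL."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE orders_raw (order_id INTEGER, email TEXT, amount_cents TEXT)"
    )
    warehouse.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", rows)
    warehouse.execute(
        """
        CREATE TABLE orders_clean AS
        SELECT order_id,
               lower(trim(email)) AS email,
               CAST(amount_cents AS INTEGER) AS amount_cents
        FROM orders_raw
        """
    )
    return warehouse


def table_names(warehouse):
    """List the tables that exist in the stand-in warehouse."""
    query = "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    return [name for (name,) in warehouse.execute(query)]


def clean_rows(warehouse):
    """Read back the transformed table in a stable order."""
    return warehouse.execute("SELECT * FROM orders_clean ORDER BY order_id").fetchall()
```

Both paths produce the same `orders_clean` table. The ETL warehouse contains only that table, while the ELT warehouse also keeps `orders_raw`, which is what makes re-transforming possible later.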

Why the industry shifted from ETL to ELT

For decades, ETL dominated. The reason was economic: warehouse storage and compute were expensive and scarce, so you couldn't afford to dump raw data in and process it there — you transformed first, on cheaper dedicated hardware, and loaded only the lean finished result to spare the precious warehouse.

Then cloud data warehouses (Chapter 4) changed the economics: storage became cheap and compute became elastic and cheap-at-scale. That single shift flipped the calculus. Now it's cheaper and simpler to load everything raw and transform it inside the warehouse with SQL — which is why the modern default is ELT. The benefits compound:

  • Keep the raw data. Because storage is cheap, you load and retain the untransformed source. If you discover a transform bug, or need a column you discarded, you re-transform from raw — no re-extracting from the (possibly changed) source.
  • Transform in SQL, in one place. Analytics engineers reshape data with SQL right in the warehouse (the dbt + modern-data-stack workflow, Chapter 7) instead of a separate ETL system. Fewer moving parts.
  • Elastic compute. The warehouse scales up for the transform and back down after, so you pay for the burst, not an always-on ETL cluster.
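The "keep the raw data" benefit can be shown with a small, assumed scenario: a transform has a case-sensitivity bug, and because the raw table is still in the warehouse, the fix is just a rebuild from raw with corrected SQL. SQLite again stands in for the warehouse, and the table, column, and status values are hypothetical.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders_raw (order_id INTEGER, status TEXT, amount_cents INTEGER)"
)
warehouse.executemany(
    "INSERT INTO orders_raw VALUES (?, ?, ?)",
    [(1, "shipped", 1999), (2, "Shipped", 500), (3, "cancelled", 700)],
)

# First attempt: misses rows where the source wrote 'Shipped' with a capital S.
BUGGY_SQL = (
    "SELECT SUM(amount_cents) AS total_cents "
    "FROM orders_raw WHERE status = 'shipped'"
)
# Corrected transform: normalize case before comparing.
FIXED_SQL = (
    "SELECT SUM(amount_cents) AS total_cents "
    "FROM orders_raw WHERE lower(status) = 'shipped'"
)


def rebuild_shipped_revenue(conn, transform_sql):
    """Drop and recreate the derived table from the retained raw table."""
    conn.execute("DROP TABLE IF EXISTS shipped_revenue")
    conn.execute(f"CREATE TABLE shipped_revenue AS {transform_sql}")
    (total,) = conn.execute("SELECT total_cents FROM shipped_revenue").fetchone()
    return total


if __name__ == "__main__":
    print(rebuild_shipped_revenue(warehouse, BUGGY_SQL))
    print(rebuild_shipped_revenue(warehouse, FIXED_SQL))
```

No re-extraction from the source is involved in the fix. Had only the buggy output been loaded, the second order's revenue could be recovered only by going back to the source system.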

:::caution Don't present ETL and ELT as interchangeable
They are not synonyms or stylistic preferences; they are an architectural decision with a cost driver. The whole reason ELT won is that cheap, elastic cloud compute made it economical to push the T into the warehouse, where it wasn't before. State the cost reason; don't just shuffle the letters. ETL still wins in specific cases, e.g. when you must scrub sensitive data (PII) before it ever touches the warehouse for compliance, or when the source-to-warehouse network is the bottleneck. Know why you're choosing each.
:::
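As an illustration of the PII case, the sketch below transforms before loading so that raw email addresses never reach the warehouse. The salted SHA-256 key, the `SALT` value, and the table layout are assumptions made for the example; a salted hash is pseudonymization rather than anonymization, and whether it satisfies a particular compliance requirement is a separate question.

```python
import hashlib
import sqlite3

# Hypothetical extracted rows: (order_id, customer email, amount in cents).
EXTRACTED = [
    (1, "Alice@Example.com", 1999),
    (2, "alice@example.com ", 500),
    (3, "bob@example.com", 700),
]

SALT = "example-salt"  # placeholder; a real secret would live outside the code


def pseudonymize(email: str, salt: str) -> str:
    """Replace an email with a stable, salted SHA-256 key."""
    normalized = email.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def load_scrubbed(rows, salt):
    """ETL: scrub PII in the pipeline, then load only the scrubbed rows."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE orders_clean (order_id INTEGER, customer_key TEXT, amount_cents INTEGER)"
    )
    scrubbed = [(oid, pseudonymize(email, salt), cents) for oid, email, cents in rows]
    warehouse.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", scrubbed)
    return warehouse


if __name__ == "__main__":
    wh = load_scrubbed(EXTRACTED, SALT)
    for row in wh.execute("SELECT * FROM orders_clean ORDER BY order_id"):
        print(row)
```

Because the key is derived from the normalized address, orders 1 and 2 still group under the same customer, so joins and per-customer aggregates keep working without the address itself being stored.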

The connection back to OLTP/OLAP and the lifecycle

ELT is only possible because the warehouse is a powerful OLAP engine that can transform huge volumes cheaply (see OLTP vs OLAP). And both ETL and ELT are just the ingestion + transformation stages of the lifecycle, with the boundary between "load" and "transform" drawn in a different place. Every framing in this chapter reinforces the others, which is the sign you're learning durable concepts, not disconnected facts.

Why it matters

Batch vs streaming is a spectrum, not a switch: batch trades freshness for simplicity, cheapness, and easy re-runs (and is the right default); streaming buys low latency at the cost of always-on complexity; micro-batch splits the difference. Pick based on the latency the business actually needs and what the fresher option costs. ETL vs ELT is an architectural decision driven by cost: ETL transforms before loading (born when warehouse compute was scarce); ELT loads raw then transforms in-warehouse, and became the modern default precisely because cheap, elastic cloud compute made it economical to push the T into the warehouse. Both are slices of the lifecycle with the load/transform boundary moved.

We've now hit the wall these decisions keep running into: at real scale, data lives on many machines, and machines fail constantly. The final foundational lesson confronts that directly — the distributed-systems primitives, and the idea of idempotency, that hold every pipeline together.

Where this leads: streaming is Chapter 9; the in-warehouse "T" of ELT is dbt and the modern data stack in Chapter 7; cheap elastic warehouse compute is Chapter 4.

Next: Distributed-systems primitives & idempotency →