Chapter 5 · Batch Processing & Spark
So far this guide has been about where data sits: file formats (Chapter 2), query engines (Chapter 3), and the models we shape data into (Chapter 4). This chapter is about moving and reshaping data in bulk when there is too much of it to fit, or process, on a single machine.
That last clause is the whole reason this chapter exists. A laptop with 16 GB of memory can sort a few gigabytes of data without breaking a sweat. Ask it to join two 500 GB tables and it will thrash, swap to disk, and eventually crash. The answer the industry settled on is distributed batch processing: split the work across a cluster of many machines that each handle a slice, then combine the results. Batch processing means working over a large, bounded dataset all at once (as opposed to streaming, Chapter 9, which processes records continuously as they arrive). A cluster is just a group of computers working together as one logical machine.
The dominant engine for this is Apache Spark — open-source software that runs a single program across hundreds of machines, hiding most (not all) of the distributed-systems pain behind a friendly DataFrame API. Spark is one of the most-asked-about technologies in data engineering interviews, and the questions are remarkably consistent: how does Spark execute a job? what is a shuffle and why is it slow? what is data skew and how do you fix it? By the end of this chapter you will be able to answer all three from first principles.
The durable idea
Distributed batch processing splits a big job across a cluster — and the shuffle (moving data between machines) is where the job lives or dies.
The model — partition the data, run the same code on each partition in parallel, and pay a steep cost whenever data has to cross the network between machines — is durable. It predates Spark (it's the lesson of MapReduce) and will outlive it. Spark's specific APIs, configuration flags, and version numbers are dated; we isolate those and teach the model first.
A critical caveat up front: you may not need Spark
Spark earns its complexity at scale. For datasets that fit on one beefy machine — and "one beefy machine" in 2026 means hundreds of gigabytes of RAM — single-node engines like DuckDB and Polars are often faster than Spark and vastly simpler to operate (no cluster, no driver, no executors). A recurring mistake in this field is reaching for Spark for a 2 GB CSV. This chapter teaches Spark and teaches you when not to use it.
What this chapter covers
- From MapReduce to Spark — why distributed processing looks the way it does, and why in-memory execution won.
- Spark architecture — the driver, executors, and cluster manager; how a job becomes stages and tasks; where the shuffle fits.
- The execution model — RDDs vs DataFrames, transformations vs actions, lazy evaluation, lineage and the DAG, and the Catalyst optimizer that makes DataFrames fast.
- The shuffle, skew, and joins — the expensive trio: why shuffles cost so much, how data skew quietly destroys performance, the join strategies Spark picks between, and how Adaptive Query Execution fixes problems at runtime.
- Tuning & PySpark in practice — partition sizing,
repartitionvscoalesce, caching, out-of-memory errors, why Python UDFs are a performance trap, writing partitioned Parquet, and when to skip Spark entirely. - Checkpoint — a quiz over the whole chapter.
Throughout, the worked examples run on one job: ShopFlow's daily revenue — read order_items (and orders), compute line_revenue = quantity × unit_price, and sum it per day (ShopFlow — see Meet ShopFlow). The same tables drive the shuffle, skew, and join examples, so you watch one pipeline get tuned rather than a new toy dataset per lesson.
When you finish, you'll be able to read a Spark query plan, explain why a job is slow, and reach for the right fix — or correctly decide Spark was the wrong tool.