Skip to main content

Chapter 5 checkpoint

You can now read a Spark job from read to write, picture the machines running it, find the shuffles, and explain why a job is slow. Recall the spine, take the quiz, then trace one more job.

The throughline

  • Distributed batch processing splits a big, bounded dataset across a cluster and processes the pieces in parallel. The model comes from MapReduce (map in parallel → shuffle to regroup by key → reduce); Spark kept that model but runs steps in memory, making it far faster, with friendly DataFrame/SQL APIs on top.
  • A Spark app = a driver (runs your code, plans, schedules; don't .collect() big data into it), executors (do the work; one task per partition per core), and a cluster manager (YARN/K8s/standalone hands out resources). One action → one job; a shuffle is the stage boundary; a stage = one task per partition.
  • Use DataFrames/SQL, not RDDs, because the Catalyst optimizer can see and rewrite them (pushdown, pruning) and Tungsten runs them with codegen/vectorization. Transformations are lazy; an action triggers the whole DAG (its lineage also gives fault tolerance). Narrow transformations don't move data; wide ones (groupBy, join, orderBy) shuffle.
  • The shuffle is expensive (disk write + all-to-all network + disk read, plus spill). Data skew (one hot key) bottlenecks a job on one straggler task — fix with salting, skew hints, or AQE splitting it. Joins: broadcast hash join (copy the small side, no big-table shuffle; ~10 MB autoBroadcast default) > sort-merge (both sides shuffle+sort, the big-to-big default) / shuffle hash. AQE re-optimizes at runtime: coalesce partitions, switch to broadcast, split skew.
  • Tune partitions (~100–200 MB each, ≥ cores); spark.sql.shuffle.partitions (200 default); coalesce to reduce (no shuffle) vs repartition to increase/rebalance (shuffle). Cache only what you reuse. Avoid Python UDFs (row-by-row serialization + Catalyst blind) — prefer built-ins, then Arrow pandas UDFs. Write partitioned Parquet. For data that fits on one machine, DuckDB/Polars beat Spark.

Quiz

Required checkpoint

Chapter 5 — Batch Processing & Spark

Pass to unlock the Next button below

Trace one more job (practice)

Read this PySpark job and answer in your head — it's the ShopFlow revenue-per-region job (ShopFlow — see Meet ShopFlow), and it's exactly the reasoning the chapter built toward.

import pyspark.sql.functions as F

orders = spark.read.parquet("s3://shopflow/orders/") # 40 partitions
customers = spark.read.parquet("s3://shopflow/customers/") # tiny dim: ~3 MB

result = (
orders.filter(orders.amount > 0) # (A)
.join(customers, "customer_id") # (B)
.groupBy("region").sum("amount") # (C)
)
result.write.parquet("s3://shopflow/region_totals/") # (D)

Question 1: Which operations are narrow and which are wide? Question 2: How many jobs does this trigger, and where are the stage boundaries? Question 3: What join strategy should Spark pick for (B), and why does it avoid a shuffle of orders?

Answers
  1. (A) filter is narrow (no data movement). (B) join is wide in general, but here see Q3. (C) groupBy is wide — it shuffles to regroup rows by region. (D) write is the action.
  2. One action (write) → one job. The groupBy in (C) forces a shuffle, so there's a stage boundary there: roughly a stage that reads+filters+joins, then the shuffle, then a stage that sums and writes. (If the join broadcasts — Q3 — it adds no shuffle of its own, so the only shuffle is the groupBy.)
  3. customers is ~3 MB, under the ~10 MB autoBroadcastJoinThreshold, so Spark picks a broadcast hash join: it copies the tiny customers dimension to every executor, letting each join its slice of orders locally with no shuffle of the big orders table. That's the cheap join, chosen automatically.

You can now reason about distributed batch jobs end to end: the machines, the plan, the shuffles, and the fixes. Next we look at how data gets into the lake in the first place — the ingestion and integration patterns that feed every batch job you just learned to tune.

Next: Chapter 6: Ingestion & Integration →