Spark architecture: driver, executors, and the cluster
You can't tune what you can't picture. The single biggest reason people write slow Spark and can't fix it is that they learned the API without the execution model — they know how to call .groupBy() but have no mental image of the machines that run it. This lesson fixes that. By the end you'll be able to point at every moving part of a running Spark application and say what it does.
The whole picture in plain English
A running Spark application is a small organization with one boss and many workers:
- The driver is the boss. It runs your program, decides what work needs doing, breaks it into pieces, and hands those pieces out. It does not crunch the bulk data itself.
- The executors are the workers. Each is a process on a machine in the cluster, given some CPU cores and some memory, that actually reads, transforms, and writes the data — one slice at a time.
- The cluster manager is HR. The driver asks it "I need 10 workers with 4 cores and 16 GB each," and the cluster manager finds machines and starts executors on them.
Your data is split into partitions (chunks), and each executor processes some partitions in parallel. That's the entire architecture in one breath. Now let's give each part its proper definition.
The driver
The driver is the process that runs your main program — the one that holds your SparkSession (the object you call Spark through). Its jobs:
- Run your application code line by line.
- Turn the operations you requested into a logical plan, then optimize it into a physical plan of actual work (the next lesson covers this optimization).
- Break that plan into tasks and schedule them onto executors.
- Track progress, handle failures (re-scheduling lost work), and collect final results.
The driver is a single point of failure: if it dies, the whole application dies. It's also where results land when you call something like .collect(). That last fact is a classic footgun — .collect() pulls all the data from every executor back into the driver's memory. Call it on a billion-row DataFrame and you out-of-memory the driver instantly. The driver is a coordinator with modest memory; never funnel big data through it.
The executors
An executor is a JVM process launched on a worker machine, allocated a fixed slice of cores (parallelism) and memory. Executors do the real work:
- Run the tasks the driver sends them.
- Hold cached/persisted data in memory (a later lesson).
- Communicate with each other during a shuffle (not through the driver — executors talk directly).
The unit of parallelism is this: one task processes one partition on one core. So if you have 200 partitions and a cluster offering 50 cores total (say 10 executors × 5 cores each), Spark runs 50 tasks at a time in 4 waves of 50. More cores → more partitions processed simultaneously → more parallelism — up to the number of partitions. If you only have 4 partitions, 50 cores can't help you; 46 of them sit idle. (That partition-count-vs-cores relationship is central to tuning, in lesson 5.5.)
:::note Why executors talk to each other directly During a shuffle, an executor needs data that other executors produced. They exchange it peer-to-peer over the network, not via the driver. This is exactly the MapReduce shuffle from the last lesson, now between executor processes — and it's why the shuffle is expensive: it's a burst of all-to-all network traffic and disk I/O among the workers. :::
The cluster manager
The cluster manager is the external system that owns the cluster's machines and parcels them out to applications. The driver requests resources; the cluster manager grants them and launches the executors. Spark supports several, and which one you use is a dated, environment-specific detail — the architecture above is identical regardless:
- Standalone — Spark's own built-in manager. Simple; good for small or dedicated clusters.
- YARN — the resource manager that came with Hadoop. Common in older on-prem and EMR setups.
- Kubernetes (K8s) — increasingly the default in cloud-native and Databricks-style environments; executors run as pods.
- Managed services hide this entirely — on Databricks, Amazon EMR, or Google Cloud Dataproc, the platform provisions and manages the cluster for you.
You rarely care which manager beyond initial setup; you care that something grants the driver its executors. Learn the roles, not the vendor.
How one job becomes jobs, stages, and tasks
This is the vocabulary interviewers probe and dashboards display, so let's nail it precisely. When you trigger work, Spark builds a hierarchy:
- A job is launched by a single action (an operation that forces computation — like
writeorcount; defined in the next lesson). One action → one job. - A job is split into stages. A new stage begins at every shuffle. A stage is the run of work that can happen without moving data between machines.
- A stage is split into tasks — one task per partition. All tasks in a stage run the same code on different partitions, in parallel.
The key sentence to memorize: the shuffle is the stage boundary. Operations that don't need to move data across the network (filter, map a column) chain together inside one stage. The moment you do something that regroups data by key across machines (a join, a groupBy), Spark must shuffle — and that shuffle splits the work into a before stage and an after stage.
Tracing a job into stages and tasks
Trace this PySpark snippet — the canonical ShopFlow batch job, computing revenue per customer from line items (ShopFlow — see Meet ShopFlow). (PySpark is Spark's Python API; we use it throughout. A DataFrame is a distributed table — formally introduced next lesson, but read it here as "a big table spread across the cluster.")
items = spark.read.parquet("s3://shopflow/order_items/") # say this loads into 8 partitions
result = (
items.filter(items.quantity > 0) # narrow: no data movement
.groupBy("order_id").sum("quantity") # WIDE: needs a shuffle
)
result.write.parquet("s3://shopflow/items_per_order/") # action → triggers the job
Here's what Spark builds when write (the action) runs:
JOB (one action: write)
│
├─ STAGE 0 ── read parquet + filter quantity > 0
│ └─ 8 tasks (one per input partition), all running in parallel
│
│ ╳╳╳ SHUFFLE ╳╳╳ ← groupBy must move rows so equal order_ids co-locate
│
└─ STAGE 1 ── sum quantity per order_id + write out
└─ N tasks (one per post-shuffle partition)
Read it top to bottom:
- One action (
write) → one job. - The
filterdoesn't move data, so it fuses with the read into Stage 0. Stage 0 has 8 tasks because the input had 8 partitions — each task handles one, all 8 in parallel (resources permitting). - The
groupByneeds every row with the sameorder_idon the same machine, so it shuffles. That shuffle is the boundary: everything after it is Stage 1. - Stage 1's task count equals the number of post-shuffle partitions (a tunable number — default historically 200; covered in lesson 5.5).
If you can produce that breakdown for an arbitrary snippet, you can read the Spark UI and reason about any job's performance. Most "why is my job slow?" answers are visible right here: a stage with one giant task (skew, lesson 5.4) or a stage with thousands of tiny tasks (over-partitioning, lesson 5.5).
Why it matters
Every performance decision in Spark is really a question about this picture. Why did my driver run out of memory? — you .collect()-ed too much into the boss. Why is one task taking 100× longer than the others? — that partition is skewed; the executor running it is overloaded. Why is my job not using all my cores? — fewer partitions than cores, so executors sit idle. Why is this stage so slow? — there's a shuffle at its boundary moving data across the network. You cannot tune Spark by memorizing config flags; you tune it by mapping the symptom onto driver/executors/stages/shuffle and fixing the right part.
Common pitfalls
- Calling
.collect()or.toPandas()on large data. This drags everything into the driver, which has modest memory and is a single point of failure. Use it only on already-small results. - Confusing executors with the cluster manager. The cluster manager hands out machines; executors do the work. Swapping the manager (YARN ↔ K8s) changes nothing about how your job executes.
- Thinking more executors always means faster. Parallelism is capped by partition count. With 4 partitions, a 200-core cluster runs 4 tasks and wastes 196 cores. Match partitions to cores (lesson 5.5).
- Forgetting the shuffle is the stage boundary. People count stages without asking why there's a boundary. Every extra stage means a shuffle you paid for — the count of stages is a rough cost meter.
Cross-links
- ← Came from From MapReduce to Spark — the shuffle between executors is the MapReduce shuffle.
- → Going deeper: Transformations, actions & lazy evaluation builds the logical/physical plan the driver schedules.
- → The shuffle that creates stage boundaries gets its own lesson: The shuffle, skew & joins.
Checkpoint
- Which component runs your application code and schedules tasks — and why is calling
.collect()on big data dangerous for it? - Complete the sentence: "A new stage begins at every ______."
- You have 12 partitions and a cluster with 4 cores. How many tasks run at once, and how many waves does the stage take?
Answers
- The driver runs your code and schedules tasks.
.collect()pulls all data from every executor into the driver's memory, which is modest and is a single point of failure — so it can out-of-memory and kill the whole application. - "...begins at every shuffle." The shuffle is the stage boundary.
- 4 tasks at once (one per core), in 3 waves of 4 (12 partitions ÷ 4 cores). The other 8 partitions wait their turn.