Tuning & PySpark in practice: partitions, caching, UDFs & OOMs
This is the lesson that turns understanding into working software. You know the architecture, the execution model, and the shuffle. Now we cover the concrete levers you pull in PySpark (Spark's Python API) to make a real job fast — and the traps that quietly destroy performance. This is also where we're honest about the most underrated tuning decision of all: whether to use Spark at all.
Partitions and parallelism
A partition is one chunk of your distributed data, and (lesson 5.2) one task processes one partition on one core. So partition count is your parallelism dial, and getting it right is most of practical tuning.
- Too few partitions → not enough tasks to fill your cores (idle cluster), and each partition may be too big to fit in memory (spill/OOM).
- Too many partitions → thousands of tiny tasks, each with scheduling overhead that swamps the actual work. Death by a thousand tasks.
The rule of thumb: aim for partitions of roughly 100–200 MB each, and for at least as many partitions as you have cluster cores (often a small multiple, so waves keep all cores busy). These are guidelines, not laws — measure in the Spark UI. On ShopFlow (see Meet ShopFlow), the large order_items table is what you size partitions around; the small customers dimension is something you'd rather broadcast (lesson 5.4, broadcast(customers)) than partition and shuffle at all.
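To make the rule of thumb concrete, here is a minimal sizing sketch. The helper name `suggest_partitions`, the 128 MB target (one value inside the 100–200 MB range above), and the `waves=2` multiplier are assumptions for illustration, not Spark settings. Treat the result as a starting point to verify in the Spark UI.

```python
import math

MB = 1024 ** 2


def suggest_partitions(data_bytes: int, total_cores: int,
                       target_mb: int = 128, waves: int = 2) -> int:
    """Starting-point partition count (hypothetical helper, not a Spark API)."""
    # Enough partitions that each one holds roughly target_mb of data.
    by_size = math.ceil(data_bytes / (target_mb * MB))
    # At least a small multiple of the core count, so every core stays busy.
    by_cores = total_cores * waves
    return max(by_size, by_cores, 1)


# 50 GiB on a 40-core cluster: data size dominates.
print(suggest_partitions(50 * 1024 ** 3, total_cores=40))  # 400
# 1 GiB on a 64-core cluster: core count dominates.
print(suggest_partitions(1 * 1024 ** 3, total_cores=64))   # 128
```

The function takes the larger of the two constraints because each one guards against a different failure: oversized partitions on one side, idle cores on the other.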
spark.sql.shuffle.partitions
After a shuffle (lesson 5.4), the number of output partitions is set by spark.sql.shuffle.partitions, which defaults to 200 regardless of how much data you have. That fixed default is a frequent culprit:
- Processing 5 GB? 200 partitions might be fine.
- Processing 5 TB? 200 partitions means each is ~25 GB — far too big; they'll spill and OOM. You need far more.
- Processing 50 MB? 200 partitions means 200 near-empty tasks of pure overhead.
In modern Spark (AQE is enabled by default from Spark 3.2 onward), AQE's coalescing (lesson 5.4) auto-shrinks this when data is small, which removes much of the old need to hand-tune it. Coalescing only merges partitions, though, so for very large jobs you still raise the setting deliberately. The 200 default and AQE's coalescing of it are common interview points.
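The arithmetic behind the three cases above can be checked in a few lines. This is a sketch under stated assumptions: the helper names and the 200 MB target are hypothetical, and `spark` is whatever active SparkSession you pass in. The three configuration keys are real Spark SQL settings.

```python
import math

GB = 10 ** 9  # decimal units, matching the round numbers in the text
MB = 10 ** 6


def partition_size_bytes(data_bytes: int, shuffle_partitions: int = 200) -> float:
    """Average bytes per shuffle partition if the data spreads evenly."""
    return data_bytes / shuffle_partitions


def shuffle_partitions_for(data_bytes: int, target_bytes: int = 200 * MB) -> int:
    """Partition count that keeps each partition near target_bytes (assumed target)."""
    return max(1, math.ceil(data_bytes / target_bytes))


def configure_shuffle(spark, data_bytes: int) -> int:
    """Raise the shuffle partition count for a large job; leave shrinking to AQE."""
    n = shuffle_partitions_for(data_bytes)
    spark.conf.set("spark.sql.shuffle.partitions", str(n))
    # AQE coalescing merges small shuffle partitions at runtime.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    return n


for label, size in [("5 GB", 5 * GB), ("5 TB", 5000 * GB), ("50 MB", 50 * MB)]:
    print(label, partition_size_bytes(size) / MB, "MB per partition at the default of 200")
```

For the 5 TB case, `shuffle_partitions_for(5000 * GB)` suggests 25,000 partitions under the assumed 200 MB target, which shows how far the default of 200 is from a workable value at that scale.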
repartition vs coalesce
Two operations change partition count, and confusing them is a classic mistake:
- repartition(n): changes to exactly n partitions by doing a full shuffle that redistributes rows evenly. It can increase or decrease the count, and it produces balanced partitions. It's a wide transformation, so you pay a shuffle.
- coalesce(n): reduces the partition count by merging existing partitions without a full shuffle. It only goes down, and it can produce uneven partitions (it just glues neighbors together), but it's cheap because it avoids the shuffle.
The decision:
| Want to... | Use | Cost |
|---|---|---|
| Reduce partitions (e.g. before writing fewer output files) | coalesce(n) | Cheap — no shuffle |
| Increase partitions, or rebalance skew evenly | repartition(n) | Expensive — full shuffle |
Classic use: after a heavy filter leaves you with 200 sparse partitions, coalesce(20) to write 20 tidy files cheaply. But if those 200 are skewed, coalesce keeps the imbalance — you need repartition to actually rebalance. Reduce → coalesce; increase or rebalance → repartition is the line to remember.
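The difference is easy to see in a toy model that tracks only row counts per partition. This is a pure-Python simulation with hypothetical function names; it is not how Spark actually assigns partitions, only an illustration of "glue neighbors together" versus "redistribute evenly".

```python
from typing import List


def coalesce_sim(partition_sizes: List[int], n: int) -> List[int]:
    """Merge neighbouring partitions into n groups; rows never move between groups."""
    if n >= len(partition_sizes):
        return list(partition_sizes)  # coalesce cannot increase the count
    groups = [0] * n
    for i, size in enumerate(partition_sizes):
        groups[i * n // len(partition_sizes)] += size
    return groups


def repartition_sim(partition_sizes: List[int], n: int) -> List[int]:
    """Full shuffle: spread all rows evenly over exactly n partitions."""
    base, extra = divmod(sum(partition_sizes), n)
    return [base + (1 if i < extra else 0) for i in range(n)]


skewed = [1000, 10, 10, 10, 10, 10, 10, 10]  # one hot partition
print(coalesce_sim(skewed, 2))     # [1030, 40]  imbalance survives
print(repartition_sim(skewed, 2))  # [535, 535]  balanced, at the cost of a shuffle
```

In the model, as in Spark, the cheap operation inherits whatever imbalance the input had, and only the expensive one removes it.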
Caching and persistence: when it helps, when it hurts
Recall (lesson 5.3) that each action re-runs the whole lineage. So if you define an expensive DataFrame once and then run several actions on it, Spark recomputes it for every action. Caching (.cache(), or .persist() for control over the storage level) tells Spark to keep that DataFrame's computed result after the first action, so later actions reuse it instead of recomputing. For DataFrames, .cache() keeps the result in memory and spills to disk whatever doesn't fit.
# ShopFlow: priced line items reused by several reports (see Meet ShopFlow)
# Assumes an active SparkSession `spark` and an `orders` DataFrame loaded earlier.
from pyspark.sql import functions as F

priced = (spark.read.parquet("s3://shopflow/order_items/")
          .join(orders, "order_id")  # costly: shuffle join
          .withColumn("line_revenue", F.col("quantity") * F.col("unit_price")))
priced.cache()   # mark for caching; lazy, nothing is computed yet
priced.count()   # action 1: computes AND caches
daily = priced.groupBy("dt").agg(F.sum("line_revenue"))  # reads the cache when run, no recompute
percust = priced.groupBy("customer_id").agg(F.sum("line_revenue"))  # reads the cache when run, no recompute
When caching helps: you reuse the same computed DataFrame multiple times (multiple actions, or iterative algorithms). That's the whole win — pay once, reuse many.
When caching hurts (the part people miss):
- If you use the DataFrame only once, caching is pure overhead — you spend memory/time storing something you never reread.
- Cache competes for the same memory executors need to do work. Over-caching shrinks execution memory and causes spill and OOM — the opposite of faster.
- Forgetting to .unpersist() leaves stale data hogging memory for the rest of the job.
Cache deliberately, only what you reuse, and free it when done. Caching is not a "make it faster" button; it's a "stop recomputing the thing I reuse" tool.
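One way to make "free it when done" hard to forget is to scope the cache with try/finally. The sketch below is an assumption-laden illustration: `build_reports`, the output folder names, and the `storage_level` parameter are hypothetical, and `priced` is expected to be a DataFrame with `dt`, `customer_id`, and `line_revenue` columns as in the example above.

```python
def build_reports(priced, out_dir: str, storage_level=None) -> None:
    """Cache `priced` only for the duration of the reports that reuse it."""
    # persist() with no argument uses the DataFrame default storage level;
    # pass e.g. pyspark.StorageLevel.DISK_ONLY to choose a different one.
    if storage_level is None:
        priced.persist()
    else:
        priced.persist(storage_level)
    try:
        daily = priced.groupBy("dt").agg({"line_revenue": "sum"})
        per_customer = priced.groupBy("customer_id").agg({"line_revenue": "sum"})
        # The first write is the action that computes and caches `priced`;
        # the second write reads it back from the cache.
        daily.write.mode("overwrite").parquet(f"{out_dir}/daily_revenue")
        per_customer.write.mode("overwrite").parquet(f"{out_dir}/customer_revenue")
    finally:
        # Release the cached blocks even if a write fails.
        priced.unpersist()
```

The `finally` block matters because a failed report would otherwise leave the cached data occupying executor memory for the rest of the job.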
Memory management and the OOMs you'll meet
Spark jobs fail with out-of-memory (OOM) errors more than any other way, and now you can diagnose all the common ones from earlier lessons:
- Driver OOM: you .collect() or .toPandas()-ed too much data into the driver (lesson 5.2). Fix: don't pull big data to the driver; write it out instead.
- Executor OOM from skew: one giant skewed partition won't fit (lesson 5.4). Fix: salt / AQE skew handling / more partitions.
- Executor OOM from too-few partitions: each partition is too big. Fix: more partitions (raise shuffle partitions / repartition).
- Executor OOM from over-caching: cache ate the execution memory. Fix: cache less, unpersist.
- Broadcast OOM: you broadcast a table too large for executor memory (lesson 5.4). Fix: don't force-broadcast big tables.
Notice that every OOM maps back to a concept you already have. That's the payoff of learning the model before the knobs.
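For the driver OOM in particular, the fix usually comes down to two small habits, sketched here. The helper names `preview` and `export` and the 1,000-row cap are assumptions for illustration; the DataFrame methods they call (`limit`, `toPandas`, `write.mode(...).parquet(...)`) are standard PySpark.

```python
def preview(df, max_rows: int = 1000):
    """Bring a bounded sample to the driver instead of the whole DataFrame."""
    # limit() caps the rows before toPandas() pulls them into driver memory.
    return df.limit(max_rows).toPandas()


def export(df, path: str) -> None:
    """Write the full result from the executors; nothing is collected on the driver."""
    df.write.mode("overwrite").parquet(path)
```

Use `preview` when you want to look at data and `export` when you want to keep it; neither path lets result size grow with the dataset on the driver.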
PySpark UDFs: the performance trap
This is the single most important practical warning in PySpark, and it follows directly from Catalyst (lesson 5.3).
A UDF (user-defined function) is your own function that Spark applies to data. In Python, a plain UDF is a catastrophe for performance, for two compounding reasons:
- Catalyst can't see inside it — it's an opaque black box, so all the optimizations (pushdown, pruning, codegen) are switched off across it.
- It serializes row-by-row across the JVM↔Python boundary. Spark runs on the JVM; your Python UDF runs in a separate Python process. For a plain Python UDF, Spark must serialize each row, ship it to Python, run your function, serialize the result, and ship it back — one row at a time. That per-row round trip is staggeringly slow compared to Spark's vectorized, codegen'd built-ins.
The hierarchy of what to use, best to worst:
- Built-in functions (pyspark.sql.functions: F.upper, F.when, F.regexp_replace, …). Always prefer these. They run in the JVM, fully optimized, vectorized. Almost everything you'd write a UDF for already exists as a built-in.
- Pandas UDFs (a.k.a. vectorized UDFs), powered by Apache Arrow. If you must write custom logic, use these. Arrow is a columnar in-memory format that lets Spark transfer a batch of rows to Python at once with near-zero serialization cost, and your function operates on a whole pandas Series vectorized. Far faster than a plain UDF.
- Plain Python UDFs. The row-by-row serializer. Use only as a last resort, and expect it to be the slowest part of your job.
# SLOW: plain Python UDF computing line_revenue (row-by-row JVM↔Python serialization, Catalyst blind)
from pyspark.sql.functions import udf

@udf("double")
def line_rev(qty, price):
    return float(qty) * float(price)

items.withColumn("line_revenue", line_rev(items.quantity, items.unit_price))

# FAST: built-in column expression (runs in the JVM, vectorized, fully optimized)
from pyspark.sql import functions as F
items.withColumn("line_revenue", F.col("quantity") * F.col("unit_price"))
The takeaway interviewers want: "Avoid Python UDFs — they serialize row-by-row across the JVM/Python boundary and blind Catalyst. Use built-ins; if you must go custom, use Arrow-backed pandas UDFs."
Writing to the lake: partitioned Parquet
A core daily PySpark task is reading from and writing to the data lake (Chapter 2). You write Parquet (the columnar format from Chapter 2) and usually partition the output by a column so downstream queries can skip irrelevant files (partition pruning, Chapter 3):
# ShopFlow daily-revenue output, partitioned by order date (dt)
(daily.write
    .mode("overwrite")
    .partitionBy("dt")  # one folder per day → queries can prune
    .parquet("s3://shopflow/daily_revenue/"))
This lays out files as s3://shopflow/daily_revenue/dt=2026-06-24/part-*.parquet — the same dt=… layout ShopFlow's orders already use (Chapter 2). A later query filtered to one day reads only that folder. Two cautions: don't partition by a high-cardinality column (e.g. customer_id or order_id) — you'll create millions of tiny files (the "small files problem"); and use coalesce before writing to control file count. Partitioning columns are the bridge from your Spark job to the Iceberg/Delta table formats in Chapter 10, which write Parquet underneath and add transactions on top.
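Both cautions come from the same arithmetic: each write task can emit one file for every partition value it holds, so the worst-case file count is the product of the two. The helper below is a hypothetical back-of-the-envelope calculator, and the counts fed to it (365 days, 2 million customers) are made-up illustration values, not ShopFlow figures.

```python
def max_output_files(write_tasks: int, distinct_partition_values: int) -> int:
    """Upper bound on files written by a partitionBy() write."""
    # Worst case: every task holds rows for every partition value.
    return write_tasks * distinct_partition_values


# One year of dt values written from 200 tasks.
print(max_output_files(200, 365))        # 73000
# Same data after coalesce(8) before the write.
print(max_output_files(8, 365))          # 2920
# Partitioning by a high-cardinality column instead (2 million customers).
print(max_output_files(200, 2_000_000))  # 400000000
```

The bound is loose when data is already clustered by the partition column, but it shows why both a low-cardinality partition column and a reduced task count matter.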
When NOT to use Spark: DuckDB and Polars
The most important tuning decision is sometimes not to use Spark. Spark's distributed machinery — driver, executors, shuffles, cluster manager — is overhead you only want to pay when the data genuinely exceeds one machine. For data that fits on a single (even large) node, single-node engines are simpler and often faster:
- DuckDB — an in-process analytical SQL engine ("SQLite for analytics"). Point it at Parquet files, write SQL, get answers fast, no cluster.
- Polars — a fast DataFrame library (Rust-backed) with a lazy, optimized API, a great Spark-free replacement for pandas-style work.
A 2 GB CSV does not need a Spark cluster; DuckDB or Polars will finish before Spark's executors even start up, with none of the operational weight. Reach for Spark when the data is bigger than one machine or when you're already standing in a Spark-based platform; reach for DuckDB/Polars when it fits on a node. Knowing this — and saying it in an interview — signals real judgment, because over-reaching for Spark is one of the field's most common mistakes.
:::tip The honest decision rule
Does the data fit (with headroom) on one big machine? If yes, strongly consider DuckDB or Polars: simpler, cheaper, often faster. If no, or you need to process it across a cluster you already run, that's Spark's job. "Big data tools for small data" is a real and costly anti-pattern.
:::
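The decision rule can be written down as a few lines of code, which makes its inputs explicit. Everything here is an assumption for illustration: the function name, the return strings, and especially the 3x headroom factor, which stands in for "fits with headroom" and should be tuned to your workload.

```python
def pick_engine(data_gb: float, node_ram_gb: float,
                headroom: float = 3.0, already_on_spark: bool = False) -> str:
    """Rough engine choice following the decision rule (hypothetical helper)."""
    if already_on_spark:
        # Processing on a cluster you already run is Spark's job.
        return "spark"
    if data_gb * headroom <= node_ram_gb:
        # Fits on one machine with room to spare.
        return "single-node (DuckDB or Polars)"
    return "spark"


print(pick_engine(1.5, node_ram_gb=64))    # single-node (DuckDB or Polars)
print(pick_engine(5000, node_ram_gb=512))  # spark
```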
Spark on the lakehouse, and the bridge to streaming (dated)
Two final placements so you know where Spark sits in the modern stack:
- The lakehouse / Databricks. Spark most often runs today inside managed lakehouse platforms — Databricks (whose runtime is an optimized Spark) on top of Delta Lake, or open Apache Iceberg tables — writing the partitioned Parquet you just saw with ACID transactions layered on. That whole world is Chapter 10.
- Structured Streaming — the bridge to Chapter 9. Spark's Structured Streaming lets you run the same DataFrame code against an unbounded, continuously-arriving stream instead of a fixed file — Spark treats the stream as a table that grows and re-runs your query incrementally. It's the conceptual bridge from this chapter's batch processing to streaming (Chapter 9): same API, same execution model, unbounded input.
The specific products and runtimes named here reflect the landscape at the time of writing and will change; the batch execution model you learned is durable and underlies all of them.
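To show what "same API, unbounded input" looks like, here is a minimal Structured Streaming sketch. The function name, the input path, and the assumption that the incoming files carry `dt` and `line_revenue` columns are all hypothetical; `spark` is an active SparkSession and `schema` is a StructType you supply, since file-based streaming sources need the schema up front.

```python
def start_daily_revenue_stream(spark, input_path: str, schema):
    """Run the daily-revenue aggregation over files as they arrive."""
    # readStream instead of read: the folder is treated as a growing table.
    items = spark.readStream.schema(schema).parquet(input_path)
    # Same DataFrame code as the batch version.
    daily = items.groupBy("dt").agg({"line_revenue": "sum"})
    # "complete" mode re-emits the full aggregated table on each trigger;
    # the console sink is for experimentation only.
    return (daily.writeStream
            .outputMode("complete")
            .format("console")
            .start())
```

The returned query handle runs until stopped; calling `.awaitTermination()` on it blocks the driver while the stream processes new files.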
Why it matters
This lesson is the difference between understanding Spark and operating it. Sizing partitions, choosing repartition vs coalesce, caching only what you reuse, diagnosing each OOM by its cause, banishing Python UDFs in favor of built-ins, writing pruning-friendly partitioned Parquet, and — crucially — knowing when DuckDB/Polars beat Spark entirely: these are the moves that show up in code review, in the on-call pager, and in the interview. Every one of them is just a concept from earlier in this chapter applied to a knob.
Common pitfalls
- Leaving spark.sql.shuffle.partitions at 200 for a multi-terabyte job. Each partition becomes huge and spills/OOMs. Raise it (or rely on AQE for the small-data direction).
- Using repartition when coalesce would do (or vice versa). Reducing files? coalesce (no shuffle). Rebalancing skew? repartition (must shuffle).
- Caching everything "to be safe." Cache steals execution memory and causes the spills it was meant to prevent. Cache only reused DataFrames; unpersist when done.
- Writing Python UDFs out of habit. They serialize row-by-row and blind Catalyst. Search pyspark.sql.functions first; use Arrow pandas UDFs only if nothing built-in fits.
- Partitioning output by a high-cardinality column. Creates millions of tiny files. Partition by low-cardinality columns (date, region); control file count with coalesce.
- Reaching for Spark on small data. A few gigabytes belong in DuckDB or Polars, not a cluster.
Cross-links
- ← Came from The shuffle, skew & joins — shuffle partitions, broadcast OOMs, and AQE coalescing all started there.
- ← Caching answers the "actions re-run the lineage" problem from Transformations, actions & lazy evaluation; driver OOM traces to Spark architecture.
- → Partitioned Parquet writes feed the table formats in Chapter 10 · Table Formats & the Lakehouse; Structured Streaming bridges to Chapter 9 · Streaming & Real-Time.
- → Then lock it in with the chapter checkpoint.
Checkpoint
- You filtered a DataFrame down and now have 200 sparse partitions; you want to write 10 output files. repartition(10) or coalesce(10), and why?
- Why is a plain Python UDF so slow, and what are the two better options in order?
- A colleague spins up a Spark cluster to process a 1.5 GB CSV. What's the issue, and what would you suggest?
Answers
- coalesce(10): you're reducing partitions, and coalesce merges existing ones without a full shuffle, so it's cheap. (Use repartition only if you needed to increase the count or rebalance skewed partitions evenly, which costs a shuffle.)
- A plain Python UDF serializes data row-by-row across the JVM↔Python boundary and is an opaque black box that blinds Catalyst (no pushdown/codegen). Better, in order: (1) built-in pyspark.sql.functions (JVM, vectorized, optimized); (2) Arrow-backed pandas/vectorized UDFs (batch transfer, vectorized) if you must write custom logic.
- 1.5 GB easily fits on one machine, so Spark's cluster overhead (driver, executors, shuffles, startup) is unjustified; it's "big-data tools for small data." Suggest a single-node engine like DuckDB or Polars, which will be simpler, cheaper, and likely faster.
Next: Chapter 5 checkpoint →