Skip to main content

Join algorithms & why join order matters

"How does a database join two tables?" is the single most common question in data-engineering and Spark interviews — and the single most useful thing to understand when a query mysteriously takes an hour. A JOIN is just a relational operator in the plan; but how the engine physically performs it is a choice among a few algorithms with very different costs. The optimizer makes that choice, and when it chooses wrong, your query crawls. This lesson gives you the four algorithms, the cost intuition for each, and the cardinality math that drives the decision.

We'll ground each one in ShopFlow's two everyday joins (ShopFlow — see Meet ShopFlow): orderscustomers (a huge fact table joined to a tiny dimension — the broadcast case) and ordersorder_items (two large tables joined on order_id — the shuffle / sort-merge case).

Why there's a choice at all

Logically, a join matches rows from two inputs on a condition. Naively, you'd compare every row on the left against every row on the right — for tables of M and N rows that's M × N comparisons. With a million rows on each side that's a trillion comparisons. So engines use smarter algorithms that exploit hashing or sorting to avoid the full cross-product. There are four you must know.

The four join algorithms

1. Nested-loop join

The brute-force baseline: for each row on the left (the outer), scan the right (the inner) for matches.

  • Cost: O(M × N) — quadratic. Brutal for two large tables.
  • When the engine picks it: when one side is tiny, when there's no equality condition to hash on (e.g. a range or inequality join a.x BETWEEN b.lo AND b.hi), or as a last resort. An index on the inner side turns each lookup fast and makes it viable.
  • Mental model: two for loops. Avoid it for big-on-big equi-joins.

2. Hash join

The default for equality joins (a.id = b.id) on a single machine and within each worker of a distributed engine. Two phases:

  1. Build: pick the smaller input, read it once, and load it into an in-memory hash table keyed by the join key. (A hash table is a dictionary giving O(1) lookup by key.)
  2. Probe: stream the larger input; for each row, hash its key and look it up in the table to find matches.
  • Cost: O(M + N) — read each side once. Vastly better than nested-loop.
  • The catch: the build side must fit in memory. If the smaller table is still too big, the engine spills the hash table to disk (slow) or splits the join into partitions. Choosing the smaller side to build is critical — and the optimizer needs good statistics to know which side is smaller.
Small input("Hash tablekeyed by join key")Large inputMatched rowsbuildprobe each row

3. Sort-merge join

Sort both inputs by the join key, then walk them in lockstep like merging two sorted lists, emitting matches where keys line up.

  • Cost: dominated by the two sorts, O(M log M + N log N); the merge itself is linear.
  • When it wins: when the inputs are already sorted on the join key (then it's nearly free — no build, no hash, no memory blow-up), or when a hash table wouldn't fit in memory. It's memory-friendly and degrades gracefully because sorting can spill to disk in an orderly way.
  • Trade-off vs hash: hash join avoids sorting but needs the build side in memory; sort-merge pays for sorting but handles huge, unsorted inputs without a memory cliff. Engines pick based on input size, sort order, and available memory.

4. Broadcast join (map-side / replicated join)

The most important join to understand in distributed systems, and the one that fixes more slow Spark/Trino queries than any other. When one side is small and the other is huge and spread across many workers, instead of shuffling the giant table across the network to co-locate keys, the engine copies the entire small table to every worker (broadcasts it). Each worker then joins its local slice of the big table against its in-memory copy of the small one — no shuffle of the big table at all.

For ShopFlow this is the orderscustomers join: orders is a billion-row fact table spread across the cluster, while customers is a small dimension. Broadcasting customers to every worker lets each one enrich its local slice of orders with customer attributes — no shuffle of orders at all.

  • Why it's a huge win: the alternative is a shuffle join (a hash join that first repartitions both tables across the network by key so matching keys land on the same worker). Shuffling the billion-row orders table over the network is enormously expensive; broadcasting a few-MB customers dimension is nearly free. Broadcasting eliminates the big shuffle.
  • The limit: the broadcast table must be small enough to fit in each worker's memory (engines have a threshold, often tens to hundreds of MB). Broadcast a table that's too big and every worker runs out of memory.
  • The classic mistake: the orders × customers join runs as a full shuffle join because the optimizer's stats said customers was bigger than it is, so it didn't broadcast. Fixing the stats (or hinting the broadcast) can turn an hour into seconds.

By contrast, ordersorder_items joins two large tables on order_id: neither side is small enough to broadcast, so the engine must shuffle both by order_id (or sort-merge them) to co-locate matching keys. That's the case the broadcast trick can't rescue — it's an inherently distributed join, covered in the next lesson.

Small dim(copied to all)Worker 1fact slice ⋈ dimWorker 2fact slice ⋈ dimWorker 3fact slice ⋈ dim

Why join order matters

When a query joins three or more tables, the engine must pick an order to join them in — and the order dramatically changes how many intermediate rows flow through the plan. Consider A ⋈ B ⋈ C where the A ⋈ B result is 100 million rows but B ⋈ C is only 1,000 rows. Joining B ⋈ C first means the next join processes 1,000 rows; doing A ⋈ B first means it churns through 100 million. Same answer, wildly different cost. Picking a good order is one of the optimizer's hardest jobs (the number of possible orders explodes combinatorially), and it relies entirely on one input: cardinality estimates.

Cardinality is the whole game

Cardinality is the estimated number of rows an operator will produce. Every physical choice above depends on it:

  • Which side to build/broadcast in a hash or broadcast join → needs to know which side is smaller.
  • Whether to broadcast at all → needs to know if the small side is under the threshold.
  • What order to join → needs to know each intermediate result's size.

The optimizer derives cardinality from statistics (row counts, distinct values, null fractions, histograms). When those are stale or the query has tricky correlated filters, estimates go wrong — and a wrong estimate cascades: it picks a shuffle when it should broadcast, builds the wrong side, or joins in a terrible order. Most catastrophically slow joins trace back to a bad cardinality estimate. This is why "the optimizer chose the wrong join" and "the statistics are stale" are usually the same bug.

You can see all of this in the EXPLAIN plan: which algorithm each join uses (HashJoin, SortMergeJoin, BroadcastHashJoin), which side is built or broadcast, and — with EXPLAIN ANALYZE — the gap between estimated and actual rows that reveals a bad estimate.

Why it matters

A join's algorithm and order decide whether a query finishes in seconds or hours, and both are chosen by the optimizer from cardinality estimates. Hash join (O(M+N), needs the small side in memory) is the equi-join default; sort-merge handles huge or pre-sorted inputs without a memory cliff; broadcast eliminates a giant network shuffle when one side is small — the highest-leverage trick in distributed SQL; nested-loop is the brute-force fallback for tiny inputs or non-equality conditions. Knowing these lets you read a plan, spot the join that's shuffling a billion rows it shouldn't, and fix it — usually by repairing statistics so the optimizer broadcasts. This is the most interview-tested topic in the chapter for a reason: it's where SQL meets physics.

Common pitfalls

  • An exploding join. A join key that isn't unique on the side you assumed multiplies rows (a fan-out). Joining on a non-key, or forgetting a duplicate, can turn a million rows into a billion. Check grain; dedup first (3.2).
  • A missed broadcast. A small dimension that should be broadcast gets a full shuffle join because stats are wrong. Refresh stats; hint the broadcast if needed.
  • Broadcasting something too big. Forcing a broadcast on a table that doesn't fit in worker memory makes every worker OOM. Respect the threshold.
  • Skewed join keys. If one key value dominates (e.g. customer_id = 0 for "unknown"), one worker gets a giant share and stalls the whole stage — skew, covered next.
  • Range/inequality joins at scale. ON a.x BETWEEN b.lo AND b.hi can't hash, so it falls back toward nested-loop. Be wary of these on large inputs.

Checkpoint

  1. You equi-join the 2-billion-row orders table to the 5 MB customers dimension across a distributed cluster. Which join algorithm should the engine choose, and what expensive operation does it avoid?
  2. In a hash join, which side does the engine build the hash table from, and what goes wrong if it picks the other side?
  3. Joining orders ⋈ order_items yields 50 M rows; orders ⋈ customers yields 200 rows; the final result joins all three. Why does join order matter here, and what single input does the optimizer rely on to get it right?
Answers
  1. A broadcast (map-side) join — copy the 5 MB customers dimension to every worker so each joins its local slice of orders. It avoids shuffling the 2-billion-row orders table across the network.
  2. It builds from the smaller input (so the hash table fits in memory) and probes with the larger. Pick the bigger side to build and the hash table may not fit → it spills to disk (slow) or OOMs.
  3. Joining the tiny orders ⋈ customers (200 rows) first keeps the intermediate result small for the next join; doing orders ⋈ order_items (50 M rows) first churns 50 M rows downstream — same answer, far more work. The optimizer relies on cardinality estimates (from statistics) to choose.