Skip to main content

Partitioning, file sizing & the small-files problem

So far we've talked about one file. Real datasets are thousands or millions of files in a directory tree. How you arrange those files is its own load-bearing decision — and it's where some of the most common, most painful production mistakes live. This lesson covers the three that matter: partitioning (skipping at the directory level), the small-files problem (the top operational pain point most guides skip), and file sizing (the sweet spot in between).

Partitioning: skipping before you even open a file

Row-group skipping (Inside Parquet) lets an engine skip slices within a file using footer stats. Partitioning is the same idea one level up: skip whole files using nothing but their directory path — before opening a single byte.

Partitioning means physically splitting a dataset into subfolders by the value of one or more columns. The universal convention is Hive-style partitioning: directories named column=value.

Take ShopFlow's orders (our running example — see Meet ShopFlow). Orders have an order_ts timestamp; we derive an order date dt from it and partition on that:

orders/
dt=2026-06-22/
part-0001.parquet
part-0002.parquet
dt=2026-06-23/
part-0001.parquet
dt=2026-06-24/
part-0001.parquet

The partition column (dt here) is encoded in the path, not stored in the files. Now consider yesterday's revenue:

SELECT SUM(amount) FROM orders WHERE dt = '2026-06-23';

The engine looks at the directory names, sees only dt=2026-06-23/ can match, and reads only that folder — every other day's orders are skipped without being opened. This is partition pruning: eliminating whole partitions (directories) by matching the query's filter against the partition values in the paths. On orders partitioned by date, a query for one day touches one day's orders instead of years of them.

WHERE dt ='2026-06-23List directory namesdt=2026-06-22/ —skipdt=2026-06-23/ —READdt=2026-06-24/ —skip

:::tip Partition on what you filter by — and keep it low-cardinality A good partition column is one your queries filter on constantly and that has few distinct values: ShopFlow's dt (or year/month), customer region, order status. These are the columns that make pruning pay off without exploding the number of directories. This "filter on it + low cardinality" test is the durable rule. Hold onto it — the next section is what happens when you break it. :::

The small-files problem

Here is the failure most introductions never mention, and it's one of the most common real-world data-platform pains: thousands of tiny files cripple performance, even though the total data size is modest.

Why tiny files are so costly:

  • Per-file overhead dominates. Every file means a separate open, a metadata/footer read, and (on object storage) a separate network request with real latency. Reading 10,000 × 1 MB files is dramatically slower than reading 100 × 100 MB files holding the same data — the bytes are identical, but you've paid the per-file tax 100× more often.
  • Listing gets expensive. An engine must list the files before reading them; on object storage, listing millions of objects is slow and sometimes billed (more in Object storage).
  • Metadata bloat. Query planners track every file; millions of entries slow planning and strain catalogs.
  • Worse compression and skipping. A 1 MB file can't hold a full ~128 MB row group, so you lose the columnar and skipping advantages the file format was supposed to give you.

Where do tiny files come from? Streaming and frequent micro-batch writes are the classic source — a job that writes a new small file every few seconds, or every Kafka micro-batch, accumulates millions of slivers fast. Over-partitioning (next section) is the other big source.

Compaction: the fix

The remedy is compaction (also called "bin-packing"): a background job that periodically reads many small files and rewrites them as fewer large ones. Ten thousand 1 MB files in a partition become twenty 500 MB files; the data is unchanged, but reads get fast again. Compaction is a routine, scheduled part of operating any platform fed by streaming. (Modern table formats in Chapter 10 automate it — but the problem and the fix are durable regardless of tooling.)

File sizing: aim for the sweet spot

If tiny files are bad, are huge files good? No — there's a sweet spot.

  • Too small (KB–few MB): per-file overhead dominates (the whole problem above).
  • Too large (multi-GB single files): hurt parallelism (fewer files than workers means idle workers) and make any rewrite expensive.
  • The sweet spot: roughly 128 MB to 512 MB per file. Big enough to amortize per-file overhead and hold full row groups; small enough that a parallel engine can spread files across workers. This range lines up with the ~128 MB row-group size and with how distributed engines split work.

:::tip A durable target Aim for files in the ~128–512 MB range, and compact small files on a schedule. The exact number is dated (it tracks engine defaults and block sizes), but "hundreds of MB, not KB and not many GB" is the durable target. If a query is mysteriously slow, count the files — a partition with thousands of tiny files is a red flag. :::

The over-partitioning trap

Now the mistake that ties the whole lesson together — and it's worth burning in, because it's advice that sounds reasonable and is catastrophically wrong.

Tempting bad idea: "partitioning speeds up filtered queries, so let's partition ShopFlow's orders by customer_id!" You filter by customer a lot, after all. But customer_id is high-cardinality — millions of distinct customers. Partition by it and you create:

  • Millions of directories, one per customer — many holding a single tiny file of a few orders.
  • The small-files problem at maximum severity: listing becomes glacial, metadata explodes, every read pays the per-file tax millions of times.
  • Queries that don't filter by customer_id (like daily revenue) now have to scan millions of partitions — the layout helped one query pattern and wrecked all the others.

This is over-partitioning: choosing a partition key with too many distinct values. The directory explosion costs you far more than the pruning saves. The discipline:

  • Partition on low-cardinality columns you filter on broadly: dt, region. Keep total partitions to a sane number (hundreds to low thousands, not millions).
  • For a high-cardinality filter column like customer_id, don't partition — instead sort/cluster the orders by it within reasonably sized files. Then row-group min/max skipping (Inside Parquet) prunes by customer_id inside files, giving you the skipping benefit with none of the directory explosion.

:::note Two different "partitioning" — don't confuse them This lesson is about partitioning on disk — physical key=value directories that an engine prunes by path. That is not the same as a warehouse or table format partitioning a managed table internally (Chapter 10), nor the same as a streaming system's partitions (Chapter 9). Same word, three different mechanisms. Here it strictly means directory layout on storage. :::

A traced example

ShopFlow's streaming pipeline lands orders to the lake from the order_events stream, micro-batch every 30 seconds, partitioned dt=…/customer_id=…/:

  1. Symptom: a "last 7 days of revenue" query that should be quick takes minutes and keeps getting slower.
  2. Diagnosis: listing reveals ~4 million tiny files — customer_id partitioning created a directory per customer per day, each holding a sliver from each micro-batch. Classic over-partitioning and small files.
  3. Fix: drop customer_id from the partition scheme (partition by dt only); sort by customer_id within files so per-customer skipping still works via row-group stats; run a compaction job to bin-pack each day's slivers into a handful of ~256 MB files.
  4. Result: the same query now prunes to 7 date partitions of a few large, well-sorted files each — back to seconds. Nothing about the orders changed; only their layout.

Why it matters

How you arrange many files is as load-bearing as the format inside one. Hive-style key=value partitioning lets an engine prune whole directories by path before opening a byte — so partition on low-cardinality columns you filter on (dt, region). The small-files problem — thousands of tiny files crushing performance through per-file overhead, listing cost, and metadata bloat — is the top operational pain in practice; the fix is compaction (bin-pack into fewer large files) and a ~128–512 MB target size. The trap that ties it together is over-partitioning on a high-cardinality key like customer_id, which detonates a directory-and-tiny-file explosion; the right move is to not partition on it and instead sort within files so row-group skipping handles it. With layout on storage settled, the final lesson covers the substrate everything sits on — object storage — and the in-memory columnar standard, Arrow.

Next: Object storage, the data lake & Apache Arrow →