
Inside Parquet: row groups, pages & footer stats

Most guides tell you "use Parquet" and stop. That leaves you unable to answer the question that actually matters on the job: why can a query that filters on a date read only a sliver of a huge file? The answer is entirely in Parquet's internal structure. This lesson opens the file up so you can reason about performance instead of cargo-culting it.

Apache Parquet is the dominant open columnar file format for analytics — the default on data lakes, read and written by virtually every engine (Spark, DuckDB, Trino, warehouses). It is columnar (from the last lesson), self-describing (the schema lives inside the file), and immutable (you write a whole file at once; you never edit it in place). Understanding its anatomy is the single highest-leverage piece of storage knowledge a data engineer can have.

Why not just one big column-stream per file?

The last lesson described columnar as "write all of column A, then all of column B." If Parquet did only that for a billion-order file of ShopFlow's orders (our running example — see Meet ShopFlow), you'd hit two problems: you couldn't process the file in parallel chunks, and a filter like WHERE order_ts >= '2026-06-01' would still have to scan the entire column to find matching rows. Parquet solves both by adding horizontal slicing on top of the columnar layout. That combination is the whole trick.

The anatomy, top to bottom

A Parquet file is a nested structure. From the outside in:

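[Figure: Parquet file layout. The file holds row groups (row group 1, row group 2, each ~128 MB of rows) followed by a footer (metadata: schema plus per-row-group stats). Each row group holds column chunks (order_id, status, amount), and each column chunk holds pages (encoded and compressed).]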

Row group — a horizontal slice of the table: some number of complete rows (commonly tuned to ~128 MB of data). The file is a sequence of row groups stacked one after another. This is the unit of parallelism — different workers can process different row groups at once — and, crucially, the unit of skipping.

Column chunk: within a row group, the data for one column of that slice's rows, stored contiguously. So a row group contains one column chunk per column. This is where the columnar layout lives: inside a row group, each column is still its own contiguous run.

Page — a column chunk is divided into pages, the smallest unit Parquet encodes and compresses. Each page holds a run of values for that column, and encoding (dictionary, RLE, delta — see Compression & encoding) and compression (Snappy, ZSTD) are applied per page. Pages also carry their own small stats.

Footer — at the end of the file sits the footer: the file's metadata. It contains the schema (so the file is self-describing), and — the part that makes Parquet fast — the offsets and per-column statistics for every row group, including the min and max value of each column in each row group. The footer is read first (a reader seeks to the end), giving it a map of the whole file before touching any data.
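To make the nesting concrete, here is a minimal toy model in plain Python. It is a sketch of the shape only, not of the real format: real Parquet stores encoded, compressed bytes and sizes row groups and pages in bytes, while the constants `ROWS_PER_GROUP` and `VALUES_PER_PAGE` and the function `build_file` are hypothetical names chosen for illustration.

```python
# Toy model of the nesting only; real Parquet stores encoded, compressed bytes.
ROWS_PER_GROUP = 3   # hypothetical; real files size row groups in bytes
VALUES_PER_PAGE = 2  # hypothetical; real pages are also sized in bytes


def build_file(rows, columns):
    """Slice rows into row groups, then store each group column by column."""
    row_groups = []
    footer = {"schema": columns, "row_groups": []}
    for start in range(0, len(rows), ROWS_PER_GROUP):
        group_rows = rows[start:start + ROWS_PER_GROUP]
        chunks = {}
        stats = {}
        for i, col in enumerate(columns):
            values = [row[i] for row in group_rows]  # one column of this slice
            # A column chunk is a sequence of pages.
            chunks[col] = [values[p:p + VALUES_PER_PAGE]
                           for p in range(0, len(values), VALUES_PER_PAGE)]
            stats[col] = (min(values), max(values))
        row_groups.append(chunks)
        footer["row_groups"].append(stats)  # known only once the group is done
    return row_groups, footer


orders = [
    (1, "paid", 40), (2, "paid", 800), (3, "new", 5),
    (4, "paid", 4500), (5, "void", 200),
]
row_groups, footer = build_file(orders, ["order_id", "status", "amount"])
print(len(row_groups))                    # 2
print(row_groups[0]["amount"])            # [[40, 800], [5]]
print(footer["row_groups"][0]["amount"])  # (5, 800)
```

Note that the footer entry for a row group is appended only after every column of that group has been processed, which mirrors why the real footer is written last.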

:::note Why the footer lives at the end
Parquet writes data first and metadata last because it can't know a row group's stats (min/max, sizes, offsets) until it has finished writing that row group. Writing the footer last lets it stream data out without buffering everything. Readers know to seek to the end to find it.
:::
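How does a reader find the footer without scanning the file? An unencrypted Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic `PAR1`, so the last 8 bytes are enough to locate the metadata. The sketch below shows that lookup on a synthetic byte string; `locate_footer` is a hypothetical helper, and the "footer" here is placeholder text rather than real Thrift-encoded metadata.

```python
import struct

MAGIC = b"PAR1"


def locate_footer(data: bytes):
    """Return (start, length) of the footer metadata inside a Parquet byte string."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    (footer_len,) = struct.unpack("<I", data[-8:-4])  # 4-byte little-endian length
    start = len(data) - 8 - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, footer_len


# Synthetic stand-in for a file: magic, data, footer, footer length, magic.
fake_footer = b"schema+stats"
fake_file = (MAGIC + b"row-group-bytes" + fake_footer
             + struct.pack("<I", len(fake_footer)) + MAGIC)

start, length = locate_footer(fake_file)
print(fake_file[start:start + length])  # b'schema+stats'
```

On a real file, a reader would fetch only the tail bytes first (for example with a ranged read on object storage) and then decode the metadata they point to.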

How the structure powers pushdown

Now the payoff. Two optimizations fall directly out of this anatomy.

Projection pushdown (skip columns)

We met this last lesson as column pruning. Because each column chunk has a known offset recorded in the footer, a reader handed SELECT status, amount reads only the status and amount column chunks and seeks past every other column's bytes. The footer is the map that makes those seeks possible.
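A minimal sketch of that idea, assuming a toy layout where the "footer" is just a list of `(offset, length)` pairs per column chunk. The helpers `write_toy_file` and `read_columns` and the `notes` column are hypothetical, and the chunk bytes are placeholders rather than real encoded pages.

```python
def write_toy_file(row_groups):
    """Concatenate column chunks into one byte string and record their offsets.

    row_groups: list of dicts mapping column name -> bytes of that chunk.
    Returns (data, footer) where footer[i][col] = (offset, length).
    """
    data = bytearray()
    footer = []
    for group in row_groups:
        entry = {}
        for col, chunk in group.items():
            entry[col] = (len(data), len(chunk))
            data += chunk
        footer.append(entry)
    return bytes(data), footer


def read_columns(data, footer, wanted):
    """Read only the wanted columns; return (chunks, bytes_read)."""
    result = {col: [] for col in wanted}
    bytes_read = 0
    for entry in footer:
        for col in wanted:
            offset, length = entry[col]
            result[col].append(data[offset:offset + length])  # seek, then read
            bytes_read += length
    return result, bytes_read


groups = [
    {"order_id": b"11112222", "status": b"PN", "amount": b"\x05\x10", "notes": b"x" * 100},
    {"order_id": b"33334444", "status": b"PV", "amount": b"\x20\x30", "notes": b"y" * 100},
]
data, footer = write_toy_file(groups)
chunks, bytes_read = read_columns(data, footer, ["status", "amount"])
print(bytes_read, "of", len(data))  # 8 of 224
```

The wide `notes` column dominates the file size, yet a query that never mentions it never touches its bytes.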

Predicate pushdown (skip rows, via row-group skipping)

This is the one the file structure uniquely enables. A predicate is a filter condition — the WHERE clause. Predicate pushdown means pushing that filter down to the storage layer so it can skip data before the engine ever processes it.

Here's the mechanism. Suppose the query is WHERE amount > 1000, and the footer says:

| Row group | amount min | amount max |
|---|---|---|
| 1 | 5 | 800 |
| 2 | 200 | 4500 |
| 3 | 10 | 95 |

Before reading a single value, the engine consults the footer stats:

  • Row group 1: max is 800, which is not > 1000. No row here can match. Skip the entire row group — never read it off disk.
  • Row group 2: max is 4500, so some rows might match. Read it and check.
  • Row group 3: max is 95. Skip it.

This is row-group skipping (also called data skipping). With min/max stats per row group, the engine eliminates whole horizontal slices using only the tiny footer, reading just the row groups that could contain matches. On a well-sorted file, a selective filter can skip 90%+ of row groups.
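The decision the engine makes per row group is small enough to write out. The sketch below applies it to the min/max values from the worked example; `may_contain_greater_than` and `plan_scan` are hypothetical names, and a real engine handles many predicate shapes (ranges, equality, nulls) rather than only `>`.

```python
# (min, max) of `amount` per row group, taken from the worked example.
footer_stats = {1: (5, 800), 2: (200, 4500), 3: (10, 95)}


def may_contain_greater_than(stats, threshold):
    """True if some value in [min, max] could satisfy `value > threshold`."""
    _lo, hi = stats
    return hi > threshold


def plan_scan(footer_stats, threshold):
    """Split row groups into those to read and those to skip."""
    to_read = [rg for rg, st in footer_stats.items()
               if may_contain_greater_than(st, threshold)]
    to_skip = [rg for rg in footer_stats if rg not in to_read]
    return to_read, to_skip


print(plan_scan(footer_stats, 1000))  # ([2], [1, 3])
```

A row group in `to_read` is only a "maybe": the engine still has to read it and filter its rows, because the stats say nothing about which individual values exceed the threshold.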

[Figure: row-group skipping for WHERE amount > 1000. The engine reads the footer stats and asks, per row group, whether the max could satisfy the filter. RG1 (max 800): no, skip. RG2 (max 4500): maybe, read and filter. RG3 (max 95): no, skip.]

:::tip Sorting makes skipping work
Min/max skipping only helps if values are clustered by row group. If amount is sorted (or naturally ordered, like a timestamp), each row group covers a narrow min–max range and filters skip lots of groups. If values are randomly distributed, every row group's range approaches the [min, max] of the whole column and almost nothing can be skipped. So sorting data on your filter columns before writing is a real, durable performance lever, and one that stays invisible unless you understand row groups.
:::
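The effect of clustering is easy to see on synthetic data. The sketch below writes the same 1,000 values into ten row groups twice, once sorted and once dealt round-robin so that every group spans nearly the whole range, then counts how many groups a filter of `> 899` would have to read. The helper names and the data are illustrative assumptions, not measurements from a real file.

```python
def row_group_ranges(values, rows_per_group):
    """Return the (min, max) of each consecutive slice of rows."""
    groups = (values[i:i + rows_per_group]
              for i in range(0, len(values), rows_per_group))
    return [(min(g), max(g)) for g in groups]


def groups_read(ranges, threshold):
    """Count row groups whose max could satisfy `value > threshold`."""
    return sum(1 for _lo, hi in ranges if hi > threshold)


amounts_sorted = list(range(1000))
# Unclustered: deal the same values round-robin, so group g holds g, g+10, ..., g+990.
amounts_scattered = [v for start in range(10) for v in range(start, 1000, 10)]

sorted_ranges = row_group_ranges(amounts_sorted, 100)
scattered_ranges = row_group_ranges(amounts_scattered, 100)

print(groups_read(sorted_ranges, 899))     # 1
print(groups_read(scattered_ranges, 899))  # 10
```

Both layouts hold identical values and return identical query results; only the amount of data that has to be read differs.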

Inspecting a real file with DuckDB

You don't have to take the anatomy on faith. DuckDB — a small, embeddable analytic (OLAP) database — reads Parquet natively and can show you the metadata. (You can't run this in the browser here, so trace it.) Given a file orders.parquet:

```sql
-- Schema and types, straight from the footer (self-describing)
DESCRIBE SELECT * FROM 'orders.parquet';

-- The file's structure: one row per row group per column,
-- including the min/max stats that drive skipping
SELECT row_group_id, column_id, num_values, stats_min, stats_max
FROM parquet_metadata('orders.parquet');

-- File-level summary: row-group and row counts, writer, format version
SELECT * FROM parquet_file_metadata('orders.parquet');
```

Reading parquet_metadata output is a genuinely useful debugging skill: if a query is slow, checking whether your filter column has tight per-row-group min/max ranges tells you immediately whether skipping is even possible.

Why it matters

A Parquet file is columnar plus horizontal slicing: it's a stack of row groups (horizontal slices, the unit of parallelism and skipping); each row group holds one column chunk per column (the columnar layout); each chunk is split into pages (the unit of encoding and compression); and a footer at the end carries the schema and the per-row-group min/max statistics. That footer is what turns "read fewer columns" (projection pushdown) into also "read fewer rows" (predicate pushdown via row-group skipping) — the engine eliminates whole slices using only the metadata. The practical lever this hands you: sort your data on the columns you filter by, so each row group covers a narrow range and skipping actually fires. Inspect any file with DuckDB's parquet_metadata. Next we go one level deeper — into the pages — to see the compression and encoding tricks that make columnar data so small.
