Compression & encoding

The last lesson said columnar pages are "encoded and compressed." Those are two distinct steps, and conflating them is a common source of confusion. Encoding is a column-aware transform that exploits the structure of a single column. Compression is a general-purpose, byte-level squeeze applied afterward. Parquet does encoding first, then compression — and understanding both lets you make a real decision instead of accepting defaults blindly.

Two layers of shrinking

Picture one page of ShopFlow's order status column (our running example — see Meet ShopFlow): paid, paid, paid, paid, shipped, shipped, paid, paid.

Encoding understands "these are values of one column." Compression just sees bytes. Both shrink the data; they stack.

Encodings: exploit the column's structure

Dictionary encoding

When a column has few distinct values repeated many times (low cardinality — cardinality just means the count of distinct values), storing the full value every time is wasteful. Dictionary encoding builds a small lookup table of the distinct values and replaces each value with a tiny integer index.

Our status column → dictionary {0: paid, 1: shipped}, data becomes 0,0,0,0,1,1,0,0. The strings "paid"/"shipped" are stored once; the column body is now small integers. This is Parquet's default for most columns and is hugely effective on the low-cardinality string columns analytics is full of — ShopFlow's status (placed/paid/shipped/cancelled), product category, customer region, currency.

The dictionary also supercharges skipping: the engine can check a filter value against the dictionary and discard the whole page if the value isn't present.

Run-length encoding (RLE)

When the same value repeats consecutively, run-length encoding stores the value once plus a count, instead of repeating it. The dictionary-encoded run 0,0,0,0,1,1,0,0 becomes (0×4)(1×2)(0×2) — three little pairs instead of eight values. RLE shines on sorted or naturally clustered columns, which is another reason sorting your data pays off (last lesson). Parquet combines dictionary + RLE so tightly the pair is often called "RLE/dictionary."

Delta encoding

For columns whose values change by small amounts — timestamps, monotonically increasing IDs, sequential counters — delta encoding stores the first value, then only the differences between consecutive values. ShopFlow's order_ts column, with orders arriving a few seconds apart, stores one full timestamp plus a stream of small deltas (+4000, +1200, +9000… ms), which take far fewer bytes than full 8-byte timestamps and also compress well afterward. The same trick collapses the steadily-increasing order_id.

:::note Encoding is automatic — but not magic Parquet writers pick encodings per column automatically based on the data. You rarely set them by hand. The value of knowing them is diagnostic: when you understand why a low-cardinality, sorted column compresses 50× and a high-cardinality random UUID column barely compresses at all, you can design schemas and sort orders that play to encoding's strengths. :::

Compression codecs: the general-purpose squeeze

After encoding, Parquet applies a general-purpose compression codec to each page. Codecs don't know anything about columns — they find and eliminate byte-level redundancy. The choice is a three-way tradeoff:

Compression ratio — how much smaller the data gets (less storage, less I/O).
CPU cost — how much processor time compressing/decompressing burns.
Splittability — whether a compressed file can be divided and processed in parallel without decompressing the whole thing first.

The common codecs, roughly fastest-to-smallest:

Codec	Ratio	Speed (de/compress)	Notes
Snappy	modest	very fast	Parquet's default. Optimized for speed over ratio; the safe everyday choice.
LZO	modest	very fast	Snappy-like; historically used in Hadoop. Less common now.
ZSTD	high	fast (tunable)	Modern sweet spot — near-gzip ratios at near-Snappy speed, with a level dial. Increasingly the recommended default.
gzip	high	slow	Great ratio, heavy CPU. Fine for cold/archival data read rarely.

:::tip A durable default For most lake data, Snappy (the default) is fine, and ZSTD is the better modern choice when you want noticeably smaller files without much CPU penalty — pick ZSTD when storage or scan cost matters and CPU headroom exists. Reserve gzip for cold data you write once and rarely read, where maximum shrink beats decompression speed. (Which codec is "best" shifts over time as libraries improve — the tradeoff axes ratio/CPU/splittability are the durable part; the rankings are dated.) :::

The splittability gotcha

Splittability is why this matters beyond storage cost. A 1 GB file that a parallel engine can split into eight chunks gets processed by eight workers at once. But some compression — notably a gzipped CSV or JSON file (one giant gzip stream over the whole file) — is not splittable: to read byte 500 MB you must decompress from the start, so the whole file is stuck on one worker. That single non-splittable file becomes a serial bottleneck in an otherwise parallel job.

Parquet sidesteps this entirely: because it compresses per page within per row group, the file stays splittable (workers grab different row groups) even though each page is compressed. This is a quiet but major advantage of Parquet over compressed text formats, and it's invisible unless you understand the page/row-group structure from the last lesson.

A traced example

ShopFlow's orders history, 100 GB raw as CSV. We rewrite it as Parquet:

Columnar layout groups like values → already more compressible than row-major CSV.
Encoding: status (4 distinct values) → dictionary + RLE, shrinks dramatically. customer_id (high cardinality, millions of customers) → barely helped by dictionary, stays larger. order_ts (ascending) → delta, shrinks a lot.
Compression: ZSTD over the encoded pages squeezes the remaining byte redundancy.

Result: ~100 GB CSV → roughly 8–15 GB Parquet — and it's still splittable for parallel reads, where the gzipped CSV alternative would be one-tenth the speed on read because a single worker must decompress it serially. Smaller and faster.

Why it matters

Columnar data shrinks in two stacked steps. Encodings exploit a column's structure: dictionary (few distinct values → integer indexes), RLE (consecutive repeats → value + count, loves sorted data), and delta (small changes → store differences, ideal for timestamps/IDs). General-purpose codecs then squeeze byte redundancy, trading ratio vs CPU vs splittability — Snappy (fast default), ZSTD (modern sweet spot), gzip (best ratio, slow, for cold data). The decisive practical points: choose ZSTD when scan/storage cost matters, reserve gzip for cold data, and remember that gzipped CSV/JSON is not splittable (a serial bottleneck) while Parquet stays splittable because it compresses per page. Encodings reward low cardinality and sorted columns — design for that. Next: with the bytes understood, we step up to choosing which format to use — Parquet, Avro, or ORC — and how each survives schema change.

Next: Choosing a format: Parquet vs Avro vs ORC →

Two layers of shrinking​

Encodings: exploit the column's structure​

Dictionary encoding​

Run-length encoding (RLE)​

Delta encoding​

Compression codecs: the general-purpose squeeze​

The splittability gotcha​

A traced example​

Why it matters​