Chapter 2 · Storage & File Formats

Before you can process data, you have to store it well. The single decision that quietly governs the speed and cost of every query you will ever run against a dataset is made here, at rest: how the bytes are laid out on disk and which file format wraps them. Pick well and a query touches a few megabytes; pick badly and the same query drags terabytes off disk for no reason.

This chapter teaches that decision from first principles. We start with the physical reality — data is a one-dimensional stream of bytes, and a table is two-dimensional, so something has to choose an order. From there we derive why columnar layouts win for analytics, crack open Parquet to see exactly why a query engine can skip the data it doesn't need, and work through the practical knobs every data engineer turns: compression, encoding, partitioning, file sizing, and schema evolution. We finish on the substrate it all sits on — object storage — and the in-memory columnar standard, Apache Arrow.
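To make the "one-dimensional bytes, two-dimensional table" point concrete, here is a minimal sketch that serializes the same tiny table in both orders. The table, its column names (`user_id`, `amount`), and the fixed-width int32 encoding are assumptions made for illustration; real formats add headers, encodings, and compression on top.

```python
import struct

# Hypothetical 3-row table with two int32 columns: (user_id, amount).
rows = [(1, 10), (2, 20), (3, 30)]

# Row-oriented: each record's fields sit next to each other on disk.
row_bytes = b"".join(struct.pack("<ii", uid, amt) for uid, amt in rows)

# Column-oriented: all values of one column sit next to each other.
user_ids = [uid for uid, _ in rows]
amounts = [amt for _, amt in rows]
col_bytes = struct.pack("<3i", *user_ids) + struct.pack("<3i", *amounts)

# Summing `amount` from the columnar layout reads one contiguous 12-byte slice
# (bytes 12..23) and never touches the user_id bytes.
total_columnar = sum(struct.unpack("<3i", col_bytes[12:24]))

# From the row layout, the same sum has to step over a user_id in every record.
total_row = sum(
    struct.unpack_from("<i", row_bytes, offset=i * 8 + 4)[0]
    for i in range(len(rows))
)

print(total_columnar, total_row)
```

Both layouts hold the same 24 bytes and give the same answer. The difference is which bytes a single-column query has to pass over to get there.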

The durable idea

The less data a query has to read, the faster and cheaper it is. Columnar storage, per-row-group statistics, and partitioning all exist to let an engine read only the columns and rows a query actually needs — and skip everything else.

That idea is permanent. The formats this chapter covers are durable too: Parquet and ORC implement the idea directly, Avro is their row-oriented counterpart, and all three have been in wide use for over a decade. The specific object store, compression library version, and tooling will date, so we flag those as we go.
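The "skip everything else" part can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not how any particular engine or file format is implemented: `make_group` and `scan_greater_than` are hypothetical helpers, and each "row group" is just a list of integers with its min and max recorded next to it.

```python
def make_group(values):
    # Record min/max alongside the data, the way per-row-group statistics would.
    return {"min": min(values), "max": max(values), "values": values}

# Three hypothetical row groups of one integer column.
groups = [
    make_group([1, 5, 9]),
    make_group([10, 14, 19]),
    make_group([20, 25, 29]),
]

def scan_greater_than(groups, threshold):
    """Return values > threshold and how many groups had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # statistics prove no value here can match: skip the group
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read

matches, groups_read = scan_greater_than(groups, 18)
print(matches, groups_read)
```

With a threshold of 18, the first group is skipped on its statistics alone, so only two of the three groups are read.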

What you'll be able to do

By the end of this chapter you'll be able to:

  • Explain why columnar beats row-based for analytics — column pruning, better compression, and vectorized reads — and when a row format is still the right call.
  • Read a Parquet file's anatomy (row groups, column chunks, pages, footer, min/max stats) and reason about why predicate and projection pushdown work.
  • Choose a compression codec and encoding by trading compression ratio against CPU and splittability.
  • Pick the right format for the job: Parquet for the lake, Avro for transport/streaming, ORC for Hive.
  • Lay a dataset out with Hive-style partitioning without falling into the small-files or over-partitioning traps, and target sane file sizes.
  • Reason about schema evolution — adding, dropping, renaming columns — and backward/forward compatibility.
  • Place it all on object storage and understand listing cost, prefixes, and lifecycle policies, plus where Arrow fits in memory.

Lessons in this chapter

  1. Data layout on disk: rows vs columns — why the physical order of bytes is the whole game, and why analytics is columnar.
  2. Inside Parquet: row groups, pages & footer stats — the anatomy that makes predicate and projection pushdown possible.
  3. Compression & encoding — Snappy, ZSTD, gzip; dictionary, RLE, delta — and the ratio/CPU/splittability tradeoffs.
  4. Choosing a format: Parquet vs Avro vs ORC (and schema evolution) — row vs columnar formats, when each fits, and how each handles schema change.
  5. Partitioning, file sizing & the small-files problem — Hive-style layout, partition pruning, compaction, and the over-partitioning trap.
  6. Object storage, the data lake & Apache Arrow — S3/GCS/ADLS as the substrate, and Arrow as the in-memory columnar standard.
  7. Chapter 2 checkpoint — lock it in with a quiz.

Start with the physical layout — everything else in the chapter follows from it.
