
Chapter 2 checkpoint

You now understand how analytic data physically lives — from the byte order on disk up to the substrate it sits on. Recall the throughline, then prove it with the quiz.

The throughline

  • Layout is the whole game: a disk is one-dimensional, a table is two, so a format must choose an order. Columnar wins for analytics via column pruning, better compression (like values together), and vectorized reads. Row-based wins for OLTP and data in motion.
  • Parquet anatomy: row groups (horizontal slices, the unit of skipping) → column chunks → pages (encoded + compressed) → a footer with per-row-group min/max stats. That footer powers projection pushdown (skip columns) and predicate pushdown / row-group skipping (skip rows). Sorting on filter columns narrows each row group's min/max range, which is what makes skipping fire.
  • Two ways to shrink: encodings (dictionary, RLE, delta) exploit column structure; codecs (Snappy default, ZSTD modern sweet spot, gzip cold) trade ratio vs CPU vs splittability. Gzipped CSV/JSON isn't splittable; Parquet stays splittable.
  • Format by purpose: Parquet (columnar) for the lake; Avro (row, schema-carrying) for streaming/Kafka; ORC (columnar) for Hive. Convert CSV/JSON to binary on landing.
  • Schema evolution: add optional fields with defaults, never rename or repurpose; distinguish backward (new reader, old data) vs forward (old reader, new data) compatibility; enforce with a schema registry.
  • Layout across files: Hive-style key=value partitioning prunes directories by path — partition on low-cardinality filter columns. Beware the small-files problem (fix: compaction, target ~128–512 MB) and over-partitioning on high-cardinality keys (sort within files instead).
  • Substrate & memory: object storage (S3/GCS/ADLS) is the data lake — flat prefixes, billed listing, consistency-as-correctness, lifecycle policies. Arrow is the in-memory columnar standard enabling zero-copy interchange.
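
As a refresher on why sorting makes row-group skipping fire, here is a minimal sketch in plain Python. It is a toy model, not the Parquet reader: `make_row_groups` and `scan_equal` are hypothetical helpers, and each "row group" is just a dict holding its rows plus footer-style min/max stats.

```python
def make_row_groups(values, group_size):
    """Slice values into fixed-size groups and record min/max per group."""
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def scan_equal(groups, target):
    """Return (matching rows, number of groups read) for `value == target`."""
    hits, groups_read = [], 0
    for group in groups:
        # The stats prove no row in this group can match, so skip it unread.
        if target < group["min"] or target > group["max"]:
            continue
        groups_read += 1
        hits.extend(v for v in group["rows"] if v == target)
    return hits, groups_read


values = [7, 2, 9, 4, 1, 8, 3, 6, 5, 0, 11, 10]
unsorted_groups = make_row_groups(values, 4)
sorted_groups = make_row_groups(sorted(values), 4)

print(scan_equal(unsorted_groups, 5))  # every group's range covers 5
print(scan_equal(sorted_groups, 5))    # only one group's range covers 5
```

With unsorted data every group's min/max range overlaps the filter value, so nothing is skipped; after sorting, the ranges are disjoint and only one group is read.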

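The encodings bullet above can be illustrated the same way. The sketch below is a simplified illustration of dictionary encoding followed by run-length encoding on a sorted, low-cardinality column; the function names are hypothetical, and real Parquet encodings use bit-packed hybrid layouts rather than Python tuples.

```python
from itertools import groupby


def dictionary_encode(values):
    """Map each distinct value to a small integer code, in first-seen order."""
    dictionary = {}
    codes = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return list(dictionary), codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, len(list(run))) for code, run in groupby(codes)]


def decode(dictionary, runs):
    """Invert both steps to recover the original column."""
    return [dictionary[code] for code, count in runs for _ in range(count)]


column = ["DE", "DE", "DE", "FR", "FR", "US", "US", "US", "US"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

print(dictionary)  # distinct values, stored once
print(runs)        # one pair per run instead of one entry per row
```

Nine strings shrink to a three-entry dictionary plus three runs, which is why keeping like values together (columnar layout, sorted data) compresses so well before any codec is applied.
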
Quiz

Required checkpoint

Chapter 2 — Storage & File Formats


You can now reason from first principles about how data is stored and why a query is fast or slow — the foundation every processing engine builds on. Next, Chapter 3 puts these files to work: the SQL and query engines that read them.
