Chapter 2 checkpoint
You now understand how analytic data physically lives — from the byte order on disk up to the substrate it sits on. Recall the throughline, then prove it with the quiz.
The throughline
- Layout is the whole game: a disk is one-dimensional, a table is two, so a format must choose an order. Columnar wins for analytics via column pruning, better compression (like values together), and vectorized reads. Row-based wins for OLTP and data in motion.
- Parquet anatomy: row groups (horizontal slices, the unit of skipping) → column chunks → pages (encoded + compressed) → a footer with min/max stats for each column in each row group. That footer powers projection pushdown (skip columns) and predicate pushdown / row-group skipping (skip rows). Sorting on filter columns narrows each row group's min/max range, which is what lets skipping fire.
- Two ways to shrink: encodings (dictionary, RLE, delta) exploit column structure; codecs (Snappy default, ZSTD modern sweet spot, gzip cold) trade ratio vs CPU vs splittability. Gzipped CSV/JSON isn't splittable; Parquet stays splittable.
- Format by purpose: Parquet (columnar) for the lake; Avro (row, schema-carrying) for streaming/Kafka; ORC (columnar) for Hive. Convert CSV/JSON to binary on landing.
- Schema evolution: add optional fields with defaults, never rename or repurpose; distinguish backward (new reader, old data) vs forward (old reader, new data) compatibility; enforce with a schema registry.
- Layout across files: Hive-style key=value partitioning prunes directories by path; partition on low-cardinality filter columns. Beware the small-files problem (fix: compaction, target ~128–512 MB) and over-partitioning on high-cardinality keys (sort within files instead).
- Substrate & memory: object storage (S3/GCS/ADLS) is the data lake, with flat prefixes, billed listing, consistency-as-correctness, and lifecycle policies. Arrow is the in-memory columnar standard enabling zero-copy interchange.
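To make the row-group skipping point from the throughline concrete, here is a minimal sketch in plain Python. It is a toy model, not the Parquet API: the helpers `make_row_groups` and `scan`, the group size of 1,000, and the value range are all assumptions chosen for illustration. It shows how per-group min/max stats let a reader skip groups, and why the same data skips far better when sorted on the filter column.

```python
import random


def make_row_groups(values, group_size):
    """Split a column into fixed-size row groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def scan(groups, lo, hi):
    """Return rows with lo <= value <= hi and how many row groups were read."""
    groups_read = 0
    matches = []
    for group in groups:
        # The stats alone prove that no row in this group can match: skip it.
        if group["max"] < lo or group["min"] > hi:
            continue
        groups_read += 1
        matches.extend(v for v in group["rows"] if lo <= v <= hi)
    return matches, groups_read


random.seed(0)
values = list(range(10_000))   # hypothetical filter column, already sorted
shuffled = values[:]
random.shuffle(shuffled)       # same data, written in arbitrary order

sorted_groups = make_row_groups(values, 1_000)
shuffled_groups = make_row_groups(shuffled, 1_000)

# Same predicate against both layouts: 2500 <= value <= 2599
sorted_hits, sorted_read = scan(sorted_groups, 2_500, 2_599)
shuffled_hits, shuffled_read = scan(shuffled_groups, 2_500, 2_599)

print(f"sorted:   read {sorted_read} of {len(sorted_groups)} row groups")
print(f"shuffled: read {shuffled_read} of {len(shuffled_groups)} row groups")
```

Both scans return the same 100 rows, but the sorted layout touches a single row group, because only one group's min/max range overlaps the predicate. In the shuffled layout nearly every group spans almost the full value range, so the stats rule out little or nothing.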
Quiz
Chapter 2 — Storage & File Formats
You can now reason from first principles about how data is stored and why a query is fast or slow, which is the foundation every processing engine builds on. Next, Chapter 3 puts these files to work: the SQL and query engines that read them.