
Object storage, the data lake & Apache Arrow

Everything in this chapter — Parquet files, partitions, compaction — has to physically sit somewhere. For modern analytics that somewhere is almost always object storage, and the collection of files living there is the data lake. This lesson covers the substrate's important quirks (they shape how you lay data out), and then closes the chapter by zooming from disk into memory: Apache Arrow, the in-memory columnar standard that's the on-disk lesson's natural counterpart.

Object storage as the data lake

We met object storage as a primitive in the cloud chapter: files (objects) addressed by a unique key (name) over HTTP, grouped into buckets, effectively infinite, very cheap, and extremely durable. That cheapness and scale are exactly why it became the default home for analytic data.

The three you'll meet:

  • Amazon S3 (AWS) — the original; "S3" is almost a generic term.
  • Google Cloud Storage (GCS) (GCP).
  • Azure Data Lake Storage Gen2 (ADLS Gen2) (Azure) — object storage with a filesystem-style namespace layered on, tuned for analytics.

A data lake is simply a large volume of raw and processed files (mostly Parquet) stored on object storage, queried in place by engines like Spark, Trino, or DuckDB. No proprietary database owns the files — they're just objects, open formats, readable by any engine. That openness (decoupling storage from compute) is the whole appeal, and it's why this chapter's formats matter so much: the file is the interface.

The quirks that shape your layout

Object storage behaves differently from a filesystem, and three differences directly explain advice from earlier lessons.

There are no real folders — only key prefixes

Object storage is a flat namespace: an object's key is just a string like orders/dt=2026-06-23/part-0001.parquet. The slashes imply a folder hierarchy to you, but underneath it's one giant flat keyspace. The leading part of the key (orders/dt=2026-06-23/) is a prefix, and "listing a folder" really means "list all keys starting with this prefix." This is why Hive-style partition directories work at all — they're just shared key prefixes the engine filters on.
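A minimal sketch of the idea, assuming a small in-memory list of keys stands in for a bucket. The key names and the helpers `list_prefix` and `partition_values` are hypothetical and not any provider's API; the point is only that a "folder listing" is a string-prefix filter and a partition value is a substring of the key.

```python
# Hypothetical keys in a flat keyspace: the "bucket" stores only these strings.
KEYS = [
    "orders/dt=2026-06-22/part-0001.parquet",
    "orders/dt=2026-06-23/part-0001.parquet",
    "orders/dt=2026-06-23/part-0002.parquet",
    "customers/part-0001.parquet",
]


def list_prefix(keys, prefix):
    """'Listing a folder' is a string-prefix filter over the whole keyspace."""
    return sorted(k for k in keys if k.startswith(prefix))


def partition_values(key):
    """Extract Hive-style name=value segments from the key's prefix."""
    values = {}
    for segment in key.split("/")[:-1]:  # skip the final file-name segment
        if "=" in segment:
            name, value = segment.split("=", 1)
            values[name] = value
    return values


# Files for one partition: no directory is opened, only keys are compared.
day_files = list_prefix(KEYS, "orders/dt=2026-06-23/")
print(day_files)
print(partition_values(day_files[0]))
```

An engine that filters on `dt` does essentially this: it compares prefixes and reads only the matching objects.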

Listing is an operation — and it costs

Because there are no real folders, finding the files under a prefix requires a LIST API call (often paginated over many requests for big datasets). Listing is slower and pricier than you'd expect — providers bill per request, and a dataset of millions of objects can cost real money and time just to enumerate before any data is read. This is the deep reason the small-files problem (previous lesson) is so damaging on object storage specifically: millions of tiny objects means millions of listing entries to page through. Fewer, larger files keep listing cheap.
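To make the listing cost concrete, here is a rough back-of-the-envelope sketch. It assumes a page size of 1,000 keys per LIST request (the maximum for S3's ListObjectsV2; check your own provider's limit) and compares two hypothetical file sizes for the same 1 TiB dataset. The function names are illustrative.

```python
KEYS_PER_PAGE = 1000  # assumed page size; verify against your provider

TIB = 1024 ** 4
MIB = 1024 ** 2


def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding."""
    return (a + b - 1) // b


def objects_needed(dataset_bytes, file_bytes):
    """How many objects a dataset becomes at a given file size."""
    return ceil_div(dataset_bytes, file_bytes)


def list_requests(num_objects, keys_per_page=KEYS_PER_PAGE):
    """How many paginated LIST calls it takes to enumerate the objects."""
    return ceil_div(num_objects, keys_per_page)


small_files = objects_needed(TIB, 1 * MIB)    # 1 MiB objects
large_files = objects_needed(TIB, 256 * MIB)  # 256 MiB objects

print(small_files, list_requests(small_files))  # 1048576 objects, 1049 requests
print(large_files, list_requests(large_files))  # 4096 objects, 5 requests
```

Same data, roughly 200 times fewer LIST round trips, before a single byte of Parquet is read.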

Eventual consistency — a historical footgun, now mostly gone

Early object stores were eventually consistent: after you wrote or deleted an object, a subsequent read or list might briefly not reflect the change — the update "eventually" propagated. For data pipelines this caused real, maddening bugs: a job writes files, the next job lists the prefix and misses some that were just written, silently processing incomplete data. Whole tools existed just to work around it.

:::note Durable lesson, dated specifics
Major object stores are now strongly consistent (S3 became so in 2020): a read after a write sees the write. So the workarounds are mostly history. But the lesson is durable: the storage layer's consistency model is a correctness concern, not a detail. When you adopt any storage, know its consistency guarantees before you build pipelines that assume "I just wrote it, so I can read it." (The specific guarantees of each store are dated; that you must check them is permanent.)
:::

Lifecycle policies — automate hot → cold → delete

Object stores let you attach lifecycle policies: rules that automatically transition objects to cheaper, colder storage tiers as they age, or delete them after a retention period. For example: "Move raw/ objects to archive after 90 days; delete after 2 years." This is a major cost lever for a lake, because old raw data you rarely re-read shouldn't sit in the hottest, priciest tier. Set the rule once and tiering runs without manual cleanup jobs. Archive tiers usually charge for retrieval, so reserve them for data you rarely need back.
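The rule quoted above can be sketched as a small evaluator. This is a local simulation, not a provider API: real lifecycle policies are declared as configuration on the bucket and applied by the store itself. The `RULE` dictionary, its field names, and the reading of "2 years" as 730 days are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical rule mirroring "archive raw/ after 90 days; delete after 2 years".
RULE = {
    "prefix": "raw/",
    "archive_after_days": 90,
    "delete_after_days": 730,  # assumption: "2 years" taken as 730 days
}


def lifecycle_action(key, created, today, rule=RULE):
    """Return what the rule would do to one object: keep, archive, or delete."""
    if not key.startswith(rule["prefix"]):
        return "keep"  # the rule only applies under its prefix
    age_days = (today - created).days
    if age_days >= rule["delete_after_days"]:
        return "delete"
    if age_days >= rule["archive_after_days"]:
        return "archive"
    return "keep"


today = date(2026, 6, 23)
print(lifecycle_action("raw/events-a.json", date(2026, 6, 1), today))      # keep
print(lifecycle_action("raw/events-b.json", date(2026, 1, 1), today))      # archive
print(lifecycle_action("raw/events-c.json", date(2023, 1, 1), today))      # delete
print(lifecycle_action("curated/orders.parquet", date(2023, 1, 1), today)) # keep
```

Scoping the rule by prefix is what lets raw data age out while curated tables stay hot.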

Apache Arrow: columnar, but in memory

The whole chapter has been about data on disk. The moment an engine reads Parquet and starts computing, the data lives in memory — and there's a columnar standard for that too: Apache Arrow.

Apache Arrow is a language-agnostic, in-memory columnar format: a standardized way to lay out a table in RAM, column by column, identically across tools and programming languages. Parquet is the on-disk columnar standard; Arrow is the in-memory one. They're designed as partners: Parquet is optimized for compact, encoded storage, while Arrow is optimized for fast in-memory processing, with buffers laid out for vectorized CPU access (the vectorized-reads idea from earlier, applied in RAM).

Why a shared in-memory format matters: zero-copy

Without a standard, every tool has its own in-memory representation. Passing data from a Python library to a query engine to a dataframe tool means serializing (converting to bytes), shipping, and deserializing (rebuilding) at each hop — slow, and it duplicates the data in memory each time.

Arrow eliminates that. When two tools both speak Arrow, they can share the exact same bytes in memory with no conversion — this is zero-copy interchange. A query engine can hand a result to a Python dataframe by passing a pointer to the Arrow buffer; nothing is copied or re-serialized.
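The difference between a copying handoff and a zero-copy one can be illustrated with Python's standard library alone. This is an analogy, not Arrow's API: a typed `array` stands in for one contiguous column buffer, and a `memoryview` stands in for a second tool reading that buffer in place.

```python
from array import array

# One "column" of 64-bit integers stored contiguously, as a stand-in for a buffer.
column = array("q", [10, 20, 30, 40])

# Copying handoff: serialize to bytes, then rebuild a second, independent array.
wire = column.tobytes()
copied = array("q")
copied.frombytes(wire)

# Zero-copy handoff: a memoryview exposes the same underlying bytes, no duplicate.
shared = memoryview(column)

# Change the producer's data after the handoff.
column[0] = 99

print(copied[0])  # 10: the copy is a separate block of memory
print(shared[0])  # 99: the view reads the producer's buffer directly
```

The copy costs time and memory proportional to the data; the view costs neither, which is the property a shared layout makes possible across tools.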

[Figure: without a shared format, Tool A serializes its own format to bytes for Tool B; with Arrow, Tool A and Tool B read one shared Arrow buffer.]

This is why Arrow quietly underpins so much of the modern stack: DuckDB, Spark, pandas, Polars, and many other engines can produce or consume Arrow data, so it serves as a common in-memory currency and data moves between them with little or no conversion.

:::tip The clean mental split
Parquet = columnar on disk (compact, encoded, stats for skipping). Arrow = columnar in memory (fast to compute on, zero-copy to share). A typical flow: read Parquet from object storage → decode into Arrow in memory → compute → hand the Arrow buffer to the next tool with no copy. Two formats, one columnar idea, two stages of the data's life.
:::

Why it matters

Analytic data lives on object storage (S3 / GCS / ADLS Gen2), and a data lake is just open-format files (mostly Parquet) sitting there, queried in place, with storage decoupled from compute. Object storage's quirks directly justify earlier advice: it's a flat keyspace with prefixes (so partition "folders" are really shared prefixes), listing is a billed, slow operation (so the small-files problem hurts here most), its consistency model is a correctness concern to check (eventual-consistency bugs are mostly history but the lesson is permanent), and lifecycle policies automate hot→cold→delete for cost. Finally, Apache Arrow is the in-memory counterpart to Parquet's on-disk columnar layout, enabling zero-copy interchange so tools share the same buffer without serializing. That completes the storage picture: bytes on disk laid out columnar (Parquet), arranged across files (partitioning), on the right substrate (object storage), flowing into memory columnar (Arrow). Lock it in with the checkpoint.
