
Choosing a format: Parquet vs Avro vs ORC (and schema evolution)

You now understand layout (rows vs columns) and internals (Parquet's anatomy, encoding, compression). This lesson zooms out to the practical decision a data engineer makes constantly: which file format? There are really two questions hiding in that — text or binary? and, if binary, which of the three big formats? — plus a third that bites every team eventually: what happens when the schema changes?

Text formats: CSV and JSON

CSV (comma-separated values) and JSON are text formats: human-readable, universally supported, the lingua franca of data exchange. They're a fine interchange format — easy to inspect, easy to hand to a non-engineer. But as a storage format for analytics they're poor, for concrete reasons:

  • Row-based and untyped. CSV/JSON store whole records as text, so you get none of columnar's column pruning, and every number is parsed from a string on every read.
  • No embedded schema (CSV especially — is 01 a number or a string? is that column a date?). Types are guessed or configured externally, a constant source of bugs.
  • Bulky and (when gzipped) not splittable — the serial-bottleneck problem from the last lesson.
  • No statistics, so no predicate pushdown — every query is a full scan.
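To make the "untyped" point concrete, here is a minimal sketch using only Python's standard `csv` module. The column names and values are invented for illustration; the point is that every field comes back as a string, and a naive type guess quietly damages the data.

```python
import csv
import io

# A hypothetical three-column export; the values are made up for illustration.
raw = (
    "order_id,zip_code,order_date\n"
    "1001,01234,2024-03-05\n"
    "1002,98101,2024-03-06\n"
)

rows = list(csv.DictReader(io.StringIO(raw)))
first = rows[0]

# The csv module returns every field as a string: no types survive the file.
field_types = {name: type(value).__name__ for name, value in first.items()}

# A naive "looks numeric, so cast to int" rule silently drops the leading zero.
naive_zip = int(first["zip_code"])

print(field_types)
print(first["zip_code"], "->", naive_zip)
```

A typed binary format would record whether `zip_code` is a string or an integer once, in the schema, so no reader has to guess.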

:::tip The rule for text formats
Use CSV/JSON for interchange and small/human-facing data; convert to a binary format (Parquet/Avro/ORC) the moment data enters your platform for storage or processing. "Land it raw, then convert to Parquet" is a standard first step in almost every pipeline.
:::

The three binary formats

The binary formats are typed, self-describing, compressed, and built for machines. There are three that matter, and the durable way to keep them straight is by layout + purpose.

Apache Parquet — columnar, for the lake

Covered in depth already: columnar, with row groups, footer stats, and pushdown. Parquet is the default for analytic storage — data lakes, anything you'll run analytic queries over. If you're not sure, it's Parquet.

Apache Avro — row-based, schema-carrying, for transport

Apache Avro is the row-based binary format. Each record's fields are stored together (row-major), which makes it poor for column-pruned analytics but excellent for the job it's built for: moving data record by record. Two properties make Avro the standard for streaming and transport:

  • It carries its schema. An Avro file embeds its schema (as JSON) in a header; streaming systems pair messages with a schema registry. Either way, a reader always knows the exact shape and types of every record — no guessing.
  • It's compact and fast to write per-record. Producing one record at a time is cheap, exactly what a stream of events needs.

So Avro is the workhorse of Apache Kafka and streaming pipelines (Chapter 9): events flow as Avro records, each self-describing, written and read one at a time. Use Avro for data in motion — transport, message queues, streaming.

Apache ORC — columnar, for Hive

Apache ORC (Optimized Row Columnar) is, like Parquet, a columnar analytic format with a very similar story: stripes (its row groups), per-stripe statistics, pushdown, strong compression. It emerged from the Hive ecosystem and has the deepest support there, including features for ACID transactions in Hive (row-level updates/deletes). In practice Parquet and ORC are close technical cousins; the choice is usually decided by ecosystem. Use ORC when you're in a Hive-centric stack (or one that standardized on it); otherwise Parquet's broader engine support makes it the safer default.

[Figure: decision flowchart, "What's the job?". Analytics on a lake/warehouse → Parquet (columnar, default); data in motion (streaming, Kafka, transport) → Avro (row-based, schema-carrying); Hive-centric stack / Hive ACID → ORC (columnar); interchange / human-facing → CSV/JSON (then convert).]

| Format | Layout | Schema | Best for | Pushdown |
|---|---|---|---|---|
| CSV/JSON | row, text | none/external | interchange, small data | no |
| Parquet | columnar | embedded | lake/warehouse analytics | yes |
| Avro | row | embedded/registry | streaming, transport (Kafka) | no |
| ORC | columnar | embedded | Hive, Hive ACID | yes |
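The same decision can be written down as a small lookup. This is only a sketch of the lesson's rule of thumb: the function name and the job labels (`"analytics"`, `"streaming"`, `"hive"`, `"interchange"`) are hypothetical, chosen for this example.

```python
def choose_format(job: str) -> str:
    """Map a workload to the format this lesson recommends for it."""
    recommendations = {
        "analytics": "Parquet",                    # lake/warehouse queries
        "streaming": "Avro",                       # data in motion, Kafka, transport
        "hive": "ORC",                             # Hive-centric stack, Hive ACID
        "interchange": "CSV/JSON, then convert",   # human-facing, small data
    }
    # Parquet is the lesson's stated default when the job is unclear.
    return recommendations.get(job, "Parquet")


print(choose_format("streaming"))
print(choose_format("not sure yet"))
```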

:::note The common real pipeline
Events arrive as Avro through Kafka (row-based, schema-carrying, written per-record), get batched and rewritten as Parquet on the lake (columnar, for analytics). Each format is used at the stage it's best at; this is the pattern, not an exception.
:::
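The "batched and rewritten" step is easier to picture with a toy version. The sketch below uses plain Python lists and dicts, not the real Avro or Parquet libraries: it pivots row-shaped events into column-major batches, keeps min/max statistics per batch, and uses them to skip a batch during a scan. The event fields and function names are assumptions made for the example.

```python
from typing import Any

# Hypothetical order events, shaped like records arriving one at a time.
events = [
    {"order_id": 1, "country": "DE", "amount": 20.0},
    {"order_id": 2, "country": "US", "amount": 35.5},
    {"order_id": 3, "country": "US", "amount": 12.0},
    {"order_id": 4, "country": "FR", "amount": 80.0},
]


def to_column_batches(rows: list[dict[str, Any]], batch_size: int) -> list[dict[str, Any]]:
    """Pivot row records into column-major batches with min/max stats per column."""
    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(values), max(values)) for name, values in columns.items()}
        batches.append({"columns": columns, "stats": stats})
    return batches


def scan_amount_over(batches: list[dict[str, Any]], threshold: float) -> tuple[list[float], int]:
    """Read only the 'amount' column, skipping batches whose max rules them out."""
    hits: list[float] = []
    skipped = 0
    for batch in batches:
        _, high = batch["stats"]["amount"]
        if high <= threshold:
            skipped += 1  # the stats prove no value can match, so skip the batch
            continue
        hits.extend(v for v in batch["columns"]["amount"] if v > threshold)
    return hits, skipped


batches = to_column_batches(events, batch_size=2)
hits, skipped = scan_amount_over(batches, threshold=50.0)
print(hits, skipped)
```

The scan touches one column and skips one of the two batches entirely. That is the payoff of rewriting per-record data into a columnar layout before querying it.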

Schema evolution: the part every team trips on

Data schemas are never static. A new field gets added, an old one is deprecated, a column is renamed. Schema evolution is how a format copes when the schema of new data differs from the schema of old data you've already stored. Get this wrong and either old files become unreadable or new writers break old readers — and it will happen to you, so it's worth understanding precisely.

Two directions of compatibility, and they are not the same thing:

  • Backward compatibility — a new reader can read old data. (You upgraded your code; can it still read files written before the change?)
  • Forward compatibility — an old reader can read new data. (Someone started writing the new schema; can consumers that haven't upgraded yet still read it?)

In streaming especially you need both, because producers and consumers upgrade at different times.

How the common changes behave

  • Adding a column is the safe, common case — if the new column has a default (or is nullable). New readers reading old data use the default for the missing field (backward-compatible); old readers reading new data ignore the unknown field (forward-compatible). This is why "add nullable columns, never repurpose existing ones" is the golden rule.
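  • Dropping a column is the mirror image: new readers simply ignore a field that is still present in old data, while old readers that still expect it need a default (or nullability) to fall back on once it stops arriving. So it is usually fine if the column was optional/nullable.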
  • Renaming a column is the dangerous one. Naively, a rename looks like dropping the old name and adding a new one — readers lose the link to existing data. Formats that support field aliases (Avro does) or stable field IDs handle renames gracefully; without that, a rename silently orphans your old data. Avoid renames; if you must, use the format's alias mechanism.
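The three cases above can be checked mechanically. What follows is a simplified sketch of Avro-style resolution rules, not the Avro library itself: schemas are plain dicts, the field names are hypothetical, and type promotion is deliberately left out (any type change is rejected).

```python
def can_read(reader: dict, writer: dict) -> bool:
    """Return True if data written with `writer` can be decoded using `reader`.

    Simplified Avro-style rules: fields match by name or by a reader alias, a
    reader field missing from the writer needs a default, and extra writer
    fields are ignored. Type promotion is not modeled.
    """
    writer_fields = {f["name"]: f for f in writer["fields"]}
    for field in reader["fields"]:
        names = [field["name"], *field.get("aliases", [])]
        match = next((writer_fields[n] for n in names if n in writer_fields), None)
        if match is None:
            if "default" not in field:
                return False  # nothing to read and nothing to fall back on
        elif match["type"] != field["type"]:
            return False
    return True


BASE = [{"name": "order_id", "type": "long"}, {"name": "total", "type": "double"}]
v1 = {"fields": BASE}
add_with_default = {"fields": BASE + [{"name": "discount", "type": "double", "default": 0.0}]}
add_without_default = {"fields": BASE + [{"name": "discount", "type": "double"}]}
rename_naive = {"fields": [BASE[0], {"name": "amount", "type": "double"}]}
rename_aliased = {"fields": [BASE[0], {"name": "amount", "type": "double", "aliases": ["total"]}]}

results = {
    "add with default: new reader, old data": can_read(add_with_default, v1),
    "add with default: old reader, new data": can_read(v1, add_with_default),
    "add without default: new reader, old data": can_read(add_without_default, v1),
    "naive rename: new reader, old data": can_read(rename_naive, v1),
    "aliased rename: new reader, old data": can_read(rename_aliased, v1),
    "aliased rename: old reader, new data": can_read(v1, rename_aliased),
}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

Under these rules only the add-with-default change passes in both directions. In this sketch the alias rescues the new reader but not an old reader that still expects `total`, which is one more reason to avoid renames where you can.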

:::tip Make schema evolution boring
Treat schemas as append-mostly: add optional fields, give them defaults, don't rename, don't change a field's type, don't repurpose a column's meaning. In streaming, enforce this automatically with a schema registry that rejects an incompatible new schema before it can break consumers. Discipline here prevents a whole category of 2 a.m. incidents.
:::

A traced example

ShopFlow launches promo codes, so the team adds a discount field to the orders event already flowing as Avro through Kafka (our running example — see Meet ShopFlow):

  • You define discount as optional with default 0.00 and register the new schema.
  • Old consumers (not yet upgraded) read new records and simply ignore the discount field — forward-compatible, nothing breaks.
  • New consumers read old records (written before promo codes existed) and substitute the default 0.00, so the change is backward-compatible.
  • Both old and new producers and consumers coexist during the rollout. No outage, no backfill required, and every Parquet file already on the lake stays readable: those historical files simply have no discount column, so new readers see it as missing (typically null) unless the reader or table format fills in the 0.00 default.
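The rollout above can be traced at the record level. This is a sketch under stated assumptions: the schemas are written in Avro's JSON schema syntax, but every field name other than `discount` is invented, and `resolve` is a hypothetical stand-in for the projection an Avro reader performs, not a call into a real Avro library.

```python
import json

# Hypothetical schemas for the orders event, before and after promo codes.
ORDER_V1 = json.loads("""
{"type": "record", "name": "Order", "fields": [
  {"name": "order_id", "type": "long"},
  {"name": "total", "type": "double"}
]}
""")
ORDER_V2 = json.loads("""
{"type": "record", "name": "Order", "fields": [
  {"name": "order_id", "type": "long"},
  {"name": "total", "type": "double"},
  {"name": "discount", "type": "double", "default": 0.0}
]}
""")


def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a decoded record onto the reader's schema, Avro-style."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]  # field absent in the data: use default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved  # fields unknown to the reader are dropped


old_record = {"order_id": 1, "total": 40.0}                   # written with V1
new_record = {"order_id": 2, "total": 40.0, "discount": 5.0}  # written with V2

seen_by_old_consumer = resolve(new_record, ORDER_V1)  # forward compatibility
seen_by_new_consumer = resolve(old_record, ORDER_V2)  # backward compatibility

print(seen_by_old_consumer)
print(seen_by_new_consumer)
```

The old consumer never sees `discount`, and the new consumer gets `0.0` for orders that predate it. Remove the `"default"` from `ORDER_V2` and the second call raises instead, which is the failure a schema registry is meant to catch before deployment.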

That smooth rollout is only possible because you added an optional field with a default instead of renaming or repurposing one. The format gave you the mechanism; the discipline made it safe.

Why it matters

CSV/JSON are for interchange — convert to binary the moment data lands. Among binary formats, keep them straight by layout + purpose: Parquet (columnar) is the default for the lake/warehouse; Avro (row-based, schema-carrying) is for data in motion — streaming and Kafka; ORC (columnar) is the Hive choice, including Hive ACID. The standard pipeline uses Avro on the wire and Parquet at rest. Schema evolution is unavoidable, so design for it: distinguish backward (new reader, old data) from forward (old reader, new data) compatibility, add optional fields with defaults, never rename or repurpose, and enforce it with a schema registry in streaming. With formats and evolution settled, the next lesson is about how you arrange many files — partitioning, file sizing, and the small-files problem that quietly wrecks performance.
