Cost & performance at scale (FinOps)

There's a senior signal that most data curricula skip entirely: the ability to answer "what does this query cost?" and "what does this pipeline cost?" — and to bring those numbers down without breaking the data. In the cloud, performance and cost are the same lever viewed from two sides: the techniques that make a query scan less data make it both faster and cheaper. This lesson teaches that lever. The specific prices are dated and will change; the cost-per-unit mindset and the optimization techniques are durable, and they're increasingly what separates a senior data engineer from a junior one.

FinOps (a blend of "finance" and "operations") is the discipline of managing cloud cost as a continuous engineering practice — making cost visible, optimized, and governed by the people who actually create it, rather than a surprise the finance team discovers at month-end. For data engineering specifically, FinOps matters acutely because data systems have a dangerous property: cost scales with data volume and usage, often invisibly, so a small inefficiency multiplied across petabytes and thousands of daily queries becomes an enormous bill.

Cost as a first-class metric: cost-per-unit

The durable reframe is to stop thinking about the total bill and start thinking about cost per unit of value:

Cost per query — the average (and worst-case) dollar cost of an analytical query. In usage-priced engines like BigQuery (you pay per byte scanned) this is direct; in warehouses like Snowflake or Redshift (you pay for compute-time) it's compute-seconds per query. Tracking it surfaces the one dashboard that scans a full table every five minutes and quietly costs more than the rest of the platform combined. Concretely: ShopFlow's (see Meet ShopFlow) revenue dashboard runs SELECT date_key, SUM(line_revenue) FROM fact_sales GROUP BY date_key — if fact_sales is unpartitioned, every refresh scans the whole table, and that is ShopFlow's cost-per-query you want to drive down.
Cost per pipeline run — what one execution of a pipeline costs in compute and storage. A nightly job that costs $2 is fine; one that costs $2,000 because it reprocesses all history every night is a five-minute fix worth real money.
Cost per terabyte stored / per dataset / per team — via cost allocation (tagging resources by team/project) so each team sees what it spends and owns reducing it.

You can't optimize what you can't see. The first FinOps move is always visibility — attributing cost to specific queries, pipelines, datasets, and teams. Only then can you target the expensive 10% that drives 90% of the bill.

The techniques: where data cost (and latency) actually comes from

Almost all data cost comes down to one thing: how much data your engines read, write, and keep. Reduce the bytes touched and you cut cost and latency together. The durable techniques, most of which you've met earlier and which now reappear as cost levers:

Partitioning — read less by skipping whole chunks

Partitioning (from storage) splits a table into directories/files by a column, usually a date. A query filtered on that column (WHERE order_date = '2026-06-24') reads only the matching partition and skips the rest entirely — this is partition pruning. On a year of data, scanning one day instead of 365 is a ~365× reduction in bytes read, which is a ~365× cut in both cost (on scan-priced engines) and time. Partition on the column you filter by most (almost always the event/load date) — for ShopFlow's fact_sales, that's the order date behind date_key, so the dashboard's day/week filters prune hard. Cluster within partitions on customer_key or product_key for the "top products" / "this customer's history" queries. The classic mistakes: not partitioning at all (every query scans everything), or over-partitioning into millions of tiny files (the "small files problem" — metadata overhead swamps the savings).

ShopFlow's cost drivers, and what happens at 10×. Today ShopFlow's bill is small. The drivers that grow are (1) compute — every dashboard refresh and the nightly shopflow_daily dbt run scanning fact_sales; (2) storage — fact_sales and the raw orders partitions accumulating forever; (3) egress — if a BI tool or reverse-ETL job repeatedly pulls fact_sales out of the cloud, cross-region/internet egress is a sneaky line item (keep compute and storage in the same region). When ShopFlow's order volume 10×s, an unpartitioned full-scan dashboard gets 10× more expensive linearly, while a partitioned + incrementally-materialized one barely moves (it still reads only the recent partitions). That gap — flat vs linear cost growth — is the entire payoff of the techniques below. For the old data: tier it. ShopFlow rarely queries three-year-old orders partitions, so move them to a cheaper storage class (S3 Glacier / archive tiers) or drop them from the hot warehouse and keep them as cold Parquet in the lake — you keep history cheaply without paying hot-storage prices to retain orders nobody filters on.

Clustering / sorting — read less within partitions

Clustering (Snowflake/BigQuery) or sorting/Z-ordering (the lakehouse chapter's Delta/Iceberg) physically co-locates rows with similar values so the engine can skip data within a partition using min/max statistics per file (data skipping). Cluster on the high-cardinality columns you filter or join on (a user id, a product id) that partitioning is too coarse for. The combination — partition on date, cluster on the frequent filter key — is the bread-and-butter of a cheap, fast warehouse.

Caching and materialization — don't redo work

Result caching — warehouses cache query results; an identical query returns instantly and free until the underlying data changes. Stable, repeated dashboard queries benefit enormously.
Materialization — precompute and store an expensive aggregation (a materialized view or a dbt model table) once, so the thousand dashboard loads read the small pre-aggregated result instead of re-scanning the raw billions of rows each time. The trade-off is freshness and storage for compute — usually a huge net win for hot, repeated queries.
Incremental processing — the biggest single pipeline-cost lever. Reprocessing all history every run is the most common waste in data engineering. Process only the new/changed data (an incremental dbt model or a partition-overwrite of just today) and a job that scanned a year now scans a day.

Right-sizing, auto-suspend, and spot — don't pay for idle

Right-sizing compute — match warehouse/cluster size to the workload. A bigger warehouse finishes faster but costs proportionally more per second; the sweet spot is the size where the job is fast enough without paying for idle parallelism. Don't run an XL warehouse for a query an S handles.
Auto-suspend / scale-to-zero — the idle-is-the-enemy rule. A warehouse or cluster left running between queries bills for nothing. Set auto-suspend (Snowflake suspends after N idle seconds; serverless engines like BigQuery and Databricks SQL serverless effectively scale to zero) so you pay only while actually computing. A warehouse accidentally left on overnight is one of the most common surprise bills.
Spot / preemptible instances — for fault-tolerant batch work (Spark jobs that can lose and restart a task), spot instances offer the same compute at up to ~90% less, in exchange for the cloud reclaiming them on short notice. Because Spark tasks are already retryable and idempotent, batch processing is the ideal spot workload. Never put a stateful, can't-be-interrupted job on pure spot.

A worked example: one query, two costs

A dashboard runs SELECT country, SUM(revenue) FROM events GROUP BY country over a year of clickstream, refreshing every 5 minutes for 200 viewers.

Cost-blind version. events is one unpartitioned 50 TB table. Every refresh scans all 50 TB. At a scan-priced rate, and across hundreds of refreshes a day, this single dashboard can cost thousands of dollars a day — and it's slow.

Cost-aware version. The same data is partitioned by date and the dashboard only needs the last 90 days → it prunes to ~12 TB. A nightly incremental, materialized rollup (country, date, sum(revenue)) precomputes the aggregate into a tiny table; the dashboard now reads megabytes, served from result cache between refreshes, on an auto-suspending S warehouse. Same numbers on screen. Cost: a tiny fraction — pennies, not thousands.

The only differences are partitioning, materialization, caching, and right-sizing — design and tuning choices, no new data and no new feature.

:::tip The cost-aware design checklist Before shipping a query or pipeline, ask: (1) Is the table partitioned on the column I filter by, and is my query using the partition filter (pruning)? (2) Should hot/repeated results be cached or materialized instead of recomputed every time? (3) Am I processing incrementally (only new data) or blindly reprocessing all history? (4) Is the compute right-sized and set to auto-suspend when idle? (5) Can fault-tolerant batch run on spot? Pricing the design before building is the highest-leverage FinOps move there is — and bringing a cost estimate into the pull request makes it a team habit, not an afterthought. :::

The FinOps loop

FinOps isn't a one-time cleanup; it's a continuous loop matching the data-quality and observability rhythm: see → optimize → govern, repeating. See — attribute cost to queries/pipelines/teams via tagging and usage dashboards. Optimize — apply the techniques above to the expensive few that dominate the bill (the 80/20 almost always holds). Govern — set budgets, alerts on cost spikes (a runaway query, a warehouse left on), and guardrails (query byte-scan limits, default auto-suspend) so the same waste doesn't recur. Done well, cost-awareness becomes a property of the platform rather than a quarterly fire drill — and "what does this cost?" becomes a question every engineer can answer about their own work.

Common pitfalls

No cost visibility. A single total bill with no attribution to queries, pipelines, or teams — so the expensive 10% is invisible and nobody owns reducing it.
Unpartitioned full-table scans. The single biggest data-cost (and latency) sink: every query reads everything because there's no partition to prune to.
Reprocessing all history every run. Non-incremental pipelines that rescan years of data nightly when only a day changed.
Idle compute. Warehouses/clusters left running between or after jobs with no auto-suspend — paying for nothing.
Oversized compute. Running an XL warehouse for a query an S handles, paying for parallelism the workload can't use.

Why it matters

In the cloud, cost and performance are the same lever: the techniques that make an engine touch fewer bytes make queries both cheaper and faster. FinOps makes cost a first-class engineering concern through cost-per-unit metrics — cost per query, per pipeline run, per dataset — which require visibility (attribution via tagging) before you can optimize. The durable techniques all reduce bytes-touched or idle-time: partitioning (prune to one day, not a year), clustering/sorting (skip files within a partition), caching and materialization (don't redo expensive work), incremental processing (only new data, not all history), and right-sizing + auto-suspend + spot (don't pay for idle or on-demand). Run it as a continuous see → optimize → govern loop and "what does this cost?" becomes answerable for every piece of work — a defining senior signal. Cost is one axis of platform maturity; the next lesson covers the rest: the DataOps and reliability rigor that separates senior from junior.

Next: 12.4 DataOps, reliability & security →

Cost as a first-class metric: cost-per-unit​

The techniques: where data cost (and latency) actually comes from​

Partitioning — read less by skipping whole chunks​

Clustering / sorting — read less within partitions​

Caching and materialization — don't redo work​

Right-sizing, auto-suspend, and spot — don't pay for idle​

A worked example: one query, two costs​

The FinOps loop​

Common pitfalls​

Why it matters​