Skip to main content

Single-machine dataframes: pandas vs Polars

The last lesson was about staying correct across many machines. This one is the deliberate counterweight, and it's a correction beginners badly need: most data is small, and most data work doesn't need a cluster at all. A huge fraction of real pipelines run perfectly on a single computer using an in-memory dataframe — a table held entirely in RAM that you reshape with code. Knowing when to reach for one (instead of a warehouse or Spark) is one of the highest-leverage judgment calls in the whole field, because the simple tool is usually the right one.

The plain-English on-ramp

A dataframe is just a table living in your program's memory: named, typed columns and many rows, the same mental shape as a spreadsheet or a SQL table (ShopFlow's orders, say — the running example) — but you manipulate it with code (filter, group, join, add columns) rather than queries or clicks. You load some data in, transform it step by step, and write the result out.

The catch that defines everything in this lesson: a single-machine dataframe must fit in that one machine's RAM. A laptop with 16 GB of memory can comfortably hold a few GB of data (you need headroom for the work itself). That sounds limiting until you notice how small most real datasets actually are. "Big data" is a famous phrase, but in practice the median analytics table an engineer touches is megabytes to a few gigabytes — easily a single-node job. The cluster is the exception, not the default.

So the first durable instinct to build is: don't distribute until you have to. A distributed engine (Spark, Chapter 5) brings real costs — a cluster to run and pay for, network shuffles, harder debugging, slower startup. If your data fits on one machine, a dataframe skips all of that and is usually faster because there's no network and no coordination overhead.

The two libraries you'll meet: pandas and Polars

In the Python world (the field's lingua franca alongside SQL), two dataframe libraries dominate, and the contrast between them teaches the key concept.

  • pandas — the original, ubiquitous Python dataframe library (since 2008). Enormous ecosystem, taught everywhere, glued into every tutorial and ML library. Its model is eager: each operation runs immediately and materializes a full intermediate result in memory before the next line runs. It's single-threaded by default and was designed in an era of smaller data, so it can be memory-hungry and slow on larger frames.
  • Polars — a modern (2020s) dataframe library written in Rust, built around Apache Arrow's columnar memory (Chapter 2). It is multi-threaded, columnar, and offers a lazy execution mode. On the same laptop it is frequently 5–30× faster than pandas and uses far less memory, while offering a similar (cleaner, more consistent) API.

The durable point is not "Polars beats pandas" as a slogan — it's the execution model difference behind the speed, which is the same idea you'll meet again in Spark and in every query engine.

Eager vs lazy execution — the concept that matters

This is the lesson's keystone, so go slowly. It's the difference between doing each step as you say it and planning all the steps, then running them together.

Eager execution (pandas' default): every line runs the moment it's written. Consider this conceptual pipeline:

# pandas — eager. Each line runs NOW and builds a full intermediate in RAM.
# ShopFlow's orders, exported to Parquet.
df = pd.read_parquet("orders.parquet") # loads ALL columns, ALL rows
df = df[df["status"] == "paid"] # builds a filtered copy
df = df[["order_date", "amount"]] # builds another copy, 2 columns
out = df.groupby("order_date")["amount"].sum() # finally aggregates

Notice the waste: you read every column of every row off disk, even though you only ever needed order_date and amount for paid orders. Each intermediate df is a full materialized table sitting in memory. Eager is simple to reason about and debug (you can inspect every step), but it leaves performance on the table because nothing can see the whole plan and optimize it.

Lazy execution (Polars' lazy mode, and the model of every SQL engine and Spark): you describe the steps but nothing runs yet. The operations build a query plan — a description of what you want — and only when you call .collect() does the engine look at the entire plan, optimize it, and execute once.

# Polars — lazy. Builds a PLAN; nothing runs until .collect().
out = (
pl.scan_parquet("orders.parquet") # a plan node, not a load
.filter(pl.col("status") == "paid")
.group_by("order_date")
.agg(pl.col("amount").sum())
.collect() # NOW the optimizer runs it
)

Because the optimizer sees the whole thing before running, it can apply the exact tricks you met in storage and query enginesprojection pushdown (only read the order_date, amount, status columns from the Parquet file, never the rest) and predicate pushdown (only read paid rows, skipping row groups whose statistics rule them out). The eager version couldn't do that, because by the time you wrote the filter the data was already fully loaded.

read_parquetALL columns/rowsfilter→ full copyscan + filter +select + aggbuild a PLANOPTIMIZEpush filter &columnsdown to the scan

That "build a plan, optimize the whole thing, then execute" idea is durable — it is exactly how SQL query engines and Spark work. Learning it here on a laptop-sized dataframe means you already understand the engine internals of Chapter 3 and the lazy transformations of Chapter 5 before you meet them at scale.

:::tip Durable vs dated "Describe the transformation as a plan, let an optimizer rewrite it, then execute once" is the durable idea — it powers SQL engines, Spark, and Polars alike. The specific library (pandas, Polars) and even the language binding are dated. pandas won't disappear (its ecosystem is too deep), but the lazy/columnar model that Polars popularized on a single node is the direction the whole field has moved. Learn the model, not the API. :::

When a single node beats SQL or Spark

You now have three tools for tabular work — a dataframe (pandas/Polars), a SQL warehouse/engine (Chapters 3–4), and a distributed engine like Spark (Chapter 5). A durable decision rule:

  • Reach for a single-machine dataframe when the data fits comfortably in one machine's RAM (rule of thumb: up to a few GB on a laptop, tens of GB on a big cloud VM), and the work is imperative/procedural — row-by-row Python logic, calling an ML library, complex reshaping, or anything awkward to express in SQL. Dataframes shine where you need general-purpose code, not just set operations.
  • Reach for SQL (a warehouse or DuckDB) when the transformation is naturally a set operation (filter, join, aggregate), especially if the data already lives in the warehouse or lake and you want it declarative, versioned, and queryable by others. Often the cleanest single-node option of all is DuckDB — an in-process SQL engine that gives you warehouse-grade columnar SQL on a laptop with no cluster (3.6).
  • Reach for Spark only when the data genuinely does not fit on one machine (hundreds of GB to many TB) or you need the cluster's parallelism for raw throughput. The cluster's cost and complexity are the price of admission you pay only when single-node has actually run out of room.

The most common junior mistake is inverting this — spinning up Spark for 2 GB of data that DuckDB or Polars would crunch faster on a laptop, with none of the operational weight. Match the tool to the data size and the shape of the logic, and default to the smallest tool that fits.

Typing and packaging: where dataframe code bites in production

Dataframe code is famously easy to start and surprisingly easy to get wrong in production. Two durable concerns:

Typing / schema. A dataframe's column types (int64, float, string, datetime, categorical) are real and consequential, but in eager pandas they're often inferred loosely and silently — an all-integer column gains one null and becomes a float; a date arrives as a plain string; a join silently produces NaN where keys didn't match. Because Python itself is dynamically typed, none of this errors at "compile" time — it surfaces as wrong numbers downstream. The durable defenses: declare expected types/schemas explicitly rather than relying on inference, and validate the dataframe against a contract at pipeline boundaries (libraries like Pandera or Polars' typed schema let you assert "this column is a non-null int in this range" and fail fast). This is the same data-contract discipline from Chapter 11, applied at the single-node level. Polars, being Arrow-backed, has a stricter, more consistent type system than classic pandas, which heads off a whole class of these surprises.

Packaging / reproducibility. "It worked in my notebook" is not a pipeline. Dataframe code that runs once on your laptop has to run unattended, repeatedly, on a server, which means: the exact library versions must be pinned (a pandas or Polars upgrade can change behavior, and NumPy ABI mismatches are a classic breakage), the code must be packaged as an importable, testable module rather than living only in a notebook, and the runtime environment must be reproducible (a lockfile, a container image). These are ordinary software-engineering concerns, but they're the ones that most often separate a one-off analysis from a dependable pipeline — and they apply before you ever reach a cluster.

:::note Running this yourself In-browser Python execution isn't available in this guide, so the snippets above are for reading the model, not running here. To try them, install polars and pandas in a local Python environment and point them at any Parquet or CSV file — the eager-vs-lazy difference is most visible if you scan_parquet a file with many columns and select only a few. :::

Why it matters

Most data is small enough to fit on one machine, so the in-memory dataframe — not a cluster — is the right default for a large share of real work. pandas is the ubiquitous, eager, single-threaded original; Polars is the modern, columnar, multi-threaded library with a lazy mode. The keystone is eager vs lazy: eager runs each step immediately (simple, but no global optimization and wasteful intermediates), while lazy builds a plan the optimizer can rewrite — pushing filters and column selection down to the scan — then executes once. That lazy/plan/optimize model is exactly how SQL engines and Spark work, so learning it on a laptop pays off everywhere later. Reach for a dataframe when data fits in RAM and the logic is procedural; reach for SQL/DuckDB when it's a set operation; reach for Spark only when single-node truly runs out of room. And treat dataframe code like real software — explicit types/validation and pinned, packaged, reproducible environments — or it will betray you in production.

Common pitfalls

  • Defaulting to a cluster for small data. Spinning up Spark for a few GB that Polars or DuckDB would handle faster on a laptop. Distribute only when single-node runs out of memory.
  • Confusing eager with lazy. Expecting a Polars LazyFrame to have results before you call .collect(), or assuming pandas optimizes across lines (it doesn't — each line runs in isolation, the moment it's written).
  • Trusting inferred types. Letting pandas guess column types and being surprised when an int column becomes float, a date stays a string, or a bad join fills NaN. Declare and validate schemas.
  • Shipping a notebook as a pipeline. Unpinned dependencies and notebook-only code that "works on my machine." Package it, pin versions, make it reproducible.
  • Forcing procedural logic into SQL (or vice-versa). Use the dataframe for general-purpose Python/ML logic; use SQL for clean set operations. Picking the awkward tool makes both slower and harder.

Checkpoint

  1. In one sentence, what is the difference between eager and lazy execution, and which does pandas use by default?
  2. You have a 3 GB CSV and need to filter, group, and sum it — and the logic is a clean set operation. Name two single-machine tools that would handle this without a cluster, and say why you would not reach for Spark.
  3. Name one typing concern and one packaging concern that make dataframe code unreliable in production, and the durable fix for each.
Answers
  1. Eager runs each operation immediately and materializes a full intermediate result; lazy builds a query plan and only executes (optimized, in one pass) when you .collect(). pandas is eager by default; Polars offers a lazy mode.
  2. Polars (lazy, columnar, multi-threaded) or DuckDB (in-process columnar SQL) — both run in a single process and 3 GB fits comfortably in RAM. You would not reach for Spark because the data fits on one machine, so a cluster only adds cost, network shuffles, and operational complexity with no benefit.
  3. Typing: inferred/loose column types silently go wrong (int→float on a null, dates as strings, NaN from bad joins) — fix by declaring schemas and validating with a contract (Pandera / typed Polars schema). Packaging: unpinned deps and notebook-only code don't run reliably unattended — fix by pinning versions and packaging the code reproducibly (module + lockfile/container).