Glossary

Every key term from the guide, defined in plain English and listed alphabetically. Terms are introduced and explained in depth in the chapters; this page is the quick reference.

ABAC (Attribute-Based Access Control) — an access model where access is decided by attributes of the user, the data, and the context, evaluated as policy.
accepted_values (test) — asserts every value of a column is within an allowed set.
Accumulating snapshot fact table — A fact table with one row per multi-step process instance, updated in place as the process passes milestones (e.g. order placed→paid→shipped).
Accuracy — the data quality dimension measuring whether a value matches reality (the actual truth), as opposed to merely being well-formed.
ACID — the transactional guarantees (Atomicity, Consistency, Isolation, Durability) that make a table trustworthy; object storage alone provides only durability.
Action — an operation that asks for a result (count, collect, show, write); triggers Spark to plan and execute the whole chain of transformations.
Adapter — the plugin (dbt-snowflake, dbt-bigquery, dbt-databricks, etc.) through which dbt speaks a specific warehouse's SQL dialect.
Adaptive Query Execution (AQE) — Spark's runtime re-optimization that uses real observed statistics to coalesce shuffle partitions, switch to broadcast joins, and split skewed partitions automatically.
Additive measure — A measure that can be summed across every dimension (e.g. revenue, quantity).
ADR (Architecture Decision Record) — a short, numbered, append-only document capturing one significant decision: its context, the decision, the trade-offs, and the trigger to revisit, so the reasoning survives turnover.
Airbyte — an open-source EL connector platform with a managed cloud option and the ability to self-host or write custom connectors.
allowed lateness — a grace period after the watermark passes the end of a window, during which late events can still update that already-emitted window.
Amazon EMR — AWS's managed service for running Spark/Hadoop clusters.
Amazon S3 — AWS object storage; the original object store for data lakes and nearly the generic term for one.
Analytics engineer — a data role emphasizing the transformation and modeling layer (dbt, dimensional modeling, the medallion architecture) and partnering with analysts/stakeholders; the role closest to the business, SQL-and-dbt-heavy.
Anomaly detection (data) — monitoring that learns a metric's normal historical pattern and alerts on significant deviations, with no threshold set by hand.
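
One simple way to picture anomaly detection is a z-score check against recent history. The sketch below is only an illustration under assumptions: the function name `is_anomaly`, the sample row counts, and the sensitivity value of 3 standard deviations are all hypothetical, and real observability tools typically also model trend and seasonality. Note that the sensitivity is relative to the learned pattern; no absolute threshold on the metric itself is set by hand.

```python
from statistics import mean, stdev

def is_anomaly(history, value, sensitivity=3.0):
    """Flag `value` if it is more than `sensitivity` standard deviations from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # A perfectly flat history: any different value is a deviation.
        return value != mu
    return abs(value - mu) / sigma > sensitivity

# Hypothetical daily row counts for one table.
daily_rows = [1000, 1020, 990, 1010, 1005, 995, 1015]

print(is_anomaly(daily_rows, 1008))  # False
print(is_anomaly(daily_rows, 400))   # True
```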
Apache Arrow — the language-agnostic in-memory columnar standard, the counterpart to Parquet on disk, enabling zero-copy interchange between tools.
Apache Avro — the row-based, schema-carrying binary format; compact to write per-record and self-describing, the standard for streaming and transport (Kafka).
Apache Hadoop — the open-source platform that gave the world a free MapReduce implementation and HDFS; largely legacy now.
Apache Hudi — an open table format (Uber-origin) built for efficient record-level upserts and deletes via an index mapping record keys to files; strong for streaming/CDC ingestion.
Apache Iceberg — an open, engine-agnostic table format (metadata file → snapshot → manifest list → manifests → data files) known for hidden partitioning and partition evolution; the de facto open standard in 2026.
Apache NiFi — a flow-based, drag-and-connect data-routing tool, strong for on-prem and IoT/edge ingestion.
Apache ORC — a columnar analytic file format (stripes, per-stripe stats, pushdown) from the Hive ecosystem, with support for Hive ACID transactions.
Apache Paimon — an emerging streaming-first lakehouse table format, strong with Apache Flink.
Apache Parquet — the dominant open columnar file format for analytics; self-describing and immutable, built from row groups, column chunks, pages, and a footer with per-row-group statistics.
Apache Polaris — an open-source, self-hostable implementation of the Iceberg REST catalog (Snowflake-origin, donated to Apache).
Apache Spark — an open-source distributed batch engine that keeps intermediate data in memory (vs MapReduce's disk-between-steps), with friendly DataFrame/SQL APIs.
API (Application Programming Interface) — a defined set of HTTP endpoints a system exposes for programmatic access to its data.
append (strategy) — INSERT new rows only; fastest, for immutable event logs; risks duplicates if a row is re-delivered.
Asset-based (data-aware) scheduling — Triggering work when an upstream data object updates, rather than purely by the clock.
async insert (ClickHouse) — high-throughput ingestion where ClickHouse buffers many small incoming inserts server-side and flushes them as one larger data part, avoiding a flood of tiny parts from high-rate streaming sources.
at-least-once — delivery where each event is processed one or more times (never lost, may be duplicated).
At-least-once delivery — Each record is delivered one or more times: never lost, but may be duplicated (requires an idempotent consumer).
at-most-once — delivery where each event is processed zero or one times (may be lost, never duplicated).
At-most-once delivery — Each record is delivered zero or one times: never duplicated, but may be lost.
Athena — a serverless, fully-managed AWS query service (built on Trino/Presto) that runs SQL over S3 files with no infrastructure, billed per byte scanned.
Atomic commit — making a multi-file change to a table happen completely or not at all, achieved by writing new files aside and atomically swapping a single pointer to the new file list.
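
The write-aside-then-swap idea behind an atomic commit can be sketched on a local filesystem. This is an analogy under assumptions, not how any particular table format is implemented: the pointer file name `current.json` and the functions `commit` and `read_current` are hypothetical, and real lakehouse tables usually rely on a catalog or a conditional write rather than a file rename.

```python
import json
import os
import tempfile

def commit(table_dir, data_files):
    """Publish a new file list by atomically swapping a single pointer file."""
    pointer = os.path.join(table_dir, "current.json")
    # Write the new file list aside, in the same directory as the pointer.
    fd, tmp_path = tempfile.mkstemp(dir=table_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"files": data_files}, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic within one filesystem: readers see the old list or the new one.
    os.replace(tmp_path, pointer)

def read_current(table_dir):
    with open(os.path.join(table_dir, "current.json")) as f:
        return json.load(f)["files"]

table_dir = tempfile.mkdtemp()
commit(table_dir, ["part-0001.parquet"])
commit(table_dir, ["part-0001.parquet", "part-0002.parquet"])
print(read_current(table_dir))  # ['part-0001.parquet', 'part-0002.parquet']
```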
Atomicity — A transaction fully completes or fully rolls back, never half.
Auditing (data) — the immutable log of who accessed or changed what data and when, used to prove compliance and investigate breaches.
Auto-suspend (scale-to-zero) — automatically pausing or scaling compute to nothing when idle, so you pay only while actually computing; the fix for the common surprise of a warehouse left running.
Avro — a compact binary serialization format with an explicit schema, common for Kafka messages.
AWS DMS (Database Migration Service) — AWS's managed service for database migration and CDC.
AWS Glue Data Catalog — AWS's managed data catalog, a common default for lake/lakehouse tables on AWS.
Azure Data Lake Storage Gen2 (ADLS Gen2) — Azure object storage with a filesystem-style namespace tuned for analytics.
Backfill — re-processing historical data after a bug fix or a new column; trivial on a platform built with partition-overwrite idempotency, disastrous on one built on blind appends.
backpressure — a system's signal to slow upstream production when data arrives faster than it can be processed, instead of overflowing.
Backward compatibility — a new reader can read old data (e.g. new code reading files written before a schema change).
Batch processing — processing a large, bounded dataset all at once, as opposed to streaming's continuous record-by-record processing.
Batch vs streaming — the recurring choice between processing data in scheduled chunks (batch, the simpler/cheaper default that meets most freshness needs) and continuously as events arrive (streaming, used only for real sub-minute latency requirements).
Binary format — A compact, machine-efficient data format (e.g. Parquet); not human-readable.
binlog (binary log) — MySQL's ordered change log (its counterpart to PostgreSQL's write-ahead log), read by log-based CDC tools.
Blameless postmortem — a written analysis after an incident of what happened, why, and what systemic change prevents recurrence, focused on the system and process rather than blaming a person, so the same failure is much less likely to recur.
Blast radius — the set of downstream assets affected by a change to or a failure in an upstream dataset.
Boring technology — the durable principle of preferring mature, well-understood, widely-deployed tools (whose failure modes and community are known) over the exciting new thing, spending a team's limited novelty budget only where it buys real differentiation.
bounded data — a finite dataset with a known beginning and end (a file, a table snapshot); the kind batch jobs process.
Breaking-change detection — automatically comparing a proposed schema/contract against the current one and classifying the diff as backward-compatible or breaking.
Broadcast hash join — a join that copies a small table to every executor so the big table is joined locally with no shuffle of the big table; Spark selects it automatically when the small side is estimated to be below spark.sql.autoBroadcastJoinThreshold (10 MB by default).
Broadcast join (map-side join) — a distributed join that copies a small table to every worker so each joins its local slice of the large table, eliminating the costly network shuffle of the big table.
broker — a server in a Kafka-style cluster that stores partitions and serves reads/writes.
Build vs buy — the recurring choice between building a capability yourself and using a managed product; default to buy/managed for anything that isn't your core differentiator, pricing the fully-loaded engineer-hours, not just the license.
Business glossary — the agreed, plain-language definitions of business terms (e.g. "active customer") so they mean one consistent thing organization-wide.
Caching / persistence — keeping a computed DataFrame's result in memory (.cache()/.persist()) so repeated actions reuse it instead of re-running the lineage; helps only when the DataFrame is reused.
CAP theorem — During a network partition a distributed system must choose between Consistency and Availability; it cannot have both.
Capacity estimation — back-of-the-envelope math (events/sec at peak, bytes/day, storage/year) done before committing to a design, to check it's in the right order of magnitude and to tell over-engineering from under-engineering.
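
A back-of-the-envelope capacity estimate might look like the arithmetic below. Every input (daily volume, event size, peak factor) is an assumed number chosen purely for illustration; the point is the order of magnitude, not the precise figures.

```python
# All three inputs are assumptions for illustration.
events_per_day = 50_000_000   # assumed daily event volume
bytes_per_event = 500         # assumed average event size in bytes
peak_factor = 3               # assumed ratio of peak to average traffic

seconds_per_day = 86_400
avg_events_per_sec = events_per_day / seconds_per_day
peak_events_per_sec = avg_events_per_sec * peak_factor

gb_per_day = events_per_day * bytes_per_event / 1e9   # raw, uncompressed
tb_per_year = gb_per_day * 365 / 1000

print(f"average: {avg_events_per_sec:.0f} events/sec")
print(f"peak:    {peak_events_per_sec:.0f} events/sec")
print(f"storage: {gb_per_day:.0f} GB/day, {tb_per_year:.1f} TB/year")
```

Under these assumptions the workload is a few thousand events per second and single-digit terabytes per year, which suggests a modest design rather than a large distributed one.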
Cardinality — the number of distinct values in a column; a low-cardinality column (few distinct values) compresses well and makes a good partition key, while a high-cardinality column does neither.
Catalog — the service that maps table names to their metadata location and holds the pointer to each table's current version, enforcing the atomic commit; the integration point all engines must agree on.
Catalyst — Spark's query optimizer; turns DataFrame/SQL operations into an optimized physical plan via rules like predicate pushdown and column pruning.
Catchup — Whether the scheduler creates a run for every past interval since the start date when a DAG is first enabled or unpaused.
CCPA/CPRA — California privacy laws granting residents rights to know, delete, and opt out of the sale of their personal data.
change data capture (CDC) — streaming a database's row-level changes (inserts/updates/deletes) as events, typically from its transaction log.
change envelope — the structure of a CDC event describing the operation (op) plus the before and after row images and a timestamp.
Chargeback/showback — attributing data-platform cost back to the team, project, or dataset that incurred it via tags.
checkpoint — a periodic, consistent snapshot of all operator state plus input positions, used to restart after failure without data loss.
Checkpoint (Delta) — a Parquet snapshot of a Delta table's full state at a given log version, so readers replay only commits after it instead of the entire log.
CI (continuous integration) — Automated tests that run on every code change before it ships.
CI check (shift-left test) — a data test that runs in CI/CD when pipeline code changes (on sample/staging data) to stop a bad transformation from being merged.
CI/CD for pipelines — automation (e.g. GitHub Actions) that on every change lints SQL, compiles the dbt project, and runs data tests before production (CI), then deploys merged changes to the next environment (CD), catching broken transformations before they corrupt dashboards.
Circuit breaker (data) — a runtime gate that halts or quarantines a pipeline when a check fails, preventing bad data from spreading downstream.
ClickHouse — a column-oriented database built for blistering real-time, high-concurrency analytical queries (sub-second aggregations over billions of rows) on fresh data; powered by the MergeTree engine, a sparse primary-key index, async inserts, and insert-time materialized views. Used for observability, product/user-facing analytics, and live dashboards.
ClickHouse materialized view — not a cached query but an insert-time trigger: when rows land in a source table, the view's query runs on that incoming block and writes the result into a destination table, typically maintaining pre-aggregated rollups so dashboards read an already-summarized table.
Cloud data warehouse — an integrated, fully-managed product that owns both storage (proprietary format) and compute and presents a SQL endpoint (e.g. Snowflake, BigQuery, Redshift).
Cluster — a group of computers working together as one logical machine to process data in parallel.
Cluster manager — the external system (YARN, Kubernetes, or Spark standalone) that owns the cluster's machines and grants resources/executors to the driver.
Clustering (Z-ordering) — physically co-locating rows with similar values so the engine can skip data within a partition using per-file min/max statistics (data skipping); complements partitioning on high-cardinality filter/join keys.
Clustering key / sort key — A column used to physically co-locate related rows (Snowflake clustering key, Redshift sort key, BigQuery partition+cluster) so filters can skip most of the data.
coalesce(n) — reduce partition count by merging existing partitions without a full shuffle; cheap, only goes down, may leave partitions uneven.
Column chunk — the data for one column within a single row group, stored contiguously; where Parquet's columnar layout lives.
Column pruning — reading only the columns a query references and skipping the rest; trivial in a columnar layout because each column is a contiguous run of bytes.
Column store (columnar) — A storage layout that keeps each column together on disk; fast and cheap for scanning a few columns over many rows, and compresses well (OLAP).
Column-level lineage — lineage tracked at the granularity of individual columns (which source column feeds which downstream column), enabled by engines that statically parse SQL (e.g. dbt Fusion, SQLMesh).
Column-level security — access control that restricts which columns of a table a given user can read.
Column-major (columnar) — a layout that writes each column's values contiguously, one whole column after another; wins for analytics via column pruning, better compression, and vectorized reads.
Columnar storage — Storing each column's values together rather than each row's, so a query reads only the columns it needs; the core of analytic warehouse speed.
Common Table Expression (CTE) — a named, temporary result set defined with WITH, used to break a complex query into readable, named steps.
Compaction (bin-packing) — a background job that rewrites many small files into fewer large ones to restore read performance, routine in streaming-fed platforms.
Compiled SQL — the plain SQL dbt produces from a model's SELECT + Jinja, with every ref()/source() replaced by real table names; reading it (in target/compiled/) is the primary way to debug dbt.
Completeness — the data quality dimension measuring whether expected data is present, with no missing rows or values.
Compression codec — a general-purpose, byte-level algorithm (Snappy, ZSTD, gzip, LZO) applied after encoding, trading compression ratio against CPU cost and splittability.
Confluent Cloud — the dominant managed Kafka and Kafka Connect offering.
Conformed dimension — A single dimension table shared identically across multiple fact tables, giving the whole organization one consistent definition of a thing.
Connection — A stored, named set of credentials/endpoint for an external system, referenced by tasks instead of hardcoding secrets.
connector — pre-built, maintained integration code that knows a specific source's API quirks and extracts its data for you.
Consensus — A protocol (e.g. Raft, Paxos) letting a group of machines reach reliable agreement even when some fail or messages are lost.
Consistency — the data quality dimension measuring whether data agrees with itself and with other systems (e.g. a total equals the sum of its parts).
consumer — a client that reads events from a stream/topic.
consumer group — a set of consumers that cooperatively split a topic's partitions so each partition is read by exactly one member.
consumer lag — how many records a consumer is behind the latest offset; the primary health metric of a stream.
Container — A package of code plus its dependencies (built with Docker) that runs identically across machines.
contract enforcement at the boundary — checking incoming data against an agreed schema as it enters, rather than discovering breaks downstream.
Control total — a sum of a meaningful numeric column compared between source and target to confirm rows were not dropped, duplicated, or mangled.
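
A control-total comparison can be sketched as below. The helper `control_total`, the `amount` column, and the sample rows are hypothetical; the row count is included alongside the sum because the two checks are commonly paired.

```python
from decimal import Decimal

def control_total(rows, column):
    """Return (row count, sum of `column`) for a list of dict rows."""
    return len(rows), sum((r[column] for r in rows), Decimal("0"))

source = [
    {"amount": Decimal("10.00")},
    {"amount": Decimal("5.50")},
    {"amount": Decimal("4.50")},
]
target = source[:2]  # simulate a load that silently dropped one row

# The totals disagree, so the load did not preserve the data.
print(control_total(source, "amount") == control_total(target, "amount"))  # False
```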
Coordinator (driver) — the MPP node that parses, plans, optimizes, and splits a query and hands pieces to workers, doing no heavy data crunching itself.
Copy-on-write (COW) — an update strategy that rewrites whole data files with changes applied, giving fast reads at the cost of slow, amplified writes.
Cost levers — The three things every data design trades off: storage (keeping bytes), compute (processing bytes), and egress (moving bytes out).
Cost per query / cost per pipeline — first-class cost-per-unit metrics (dollars or compute-seconds per analytical query, or per pipeline run) that surface the expensive few workloads dominating the bill, enabling targeted optimization.
Cost-based optimization (CBO) — optimization that uses table statistics to estimate the cost of competing plans and pick the cheapest; chooses join order and join algorithm.
Crypto-shredding — making data permanently unreadable by encrypting it per-subject and destroying the key, used to satisfy deletion on immutable storage.
CUBE — a GROUP BY extension producing every combination of the grouping columns plus the grand total; for cross-tab "all dimensions" reports.
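
To see what "every combination of the grouping columns plus the grand total" means, the sketch below enumerates the grouping sets CUBE would aggregate over. The function name and the column names `region` and `product` are hypothetical, and the listing order is just one convenient choice.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """All subsets of the grouping columns, from full detail down to the grand total ()."""
    return [
        combo
        for size in range(len(columns), -1, -1)
        for combo in combinations(columns, size)
    ]

print(cube_grouping_sets(["region", "product"]))
# [('region', 'product'), ('region',), ('product',), ()]
```

With n grouping columns there are 2^n grouping sets, which is why CUBE output grows quickly as columns are added.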
Current flag (is_current) — A boolean on an SCD Type 2 dimension marking the one current version per natural key, for fast "what is true now?" queries.
cursor-based pagination — paging via an opaque next_cursor token; the most robust style for large, live datasets.
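
A client loop for cursor-based pagination might look like the sketch below. `fetch_page` is a hypothetical stand-in for a real API call; here the fake server happens to encode an offset in the cursor, but the client treats the token as opaque and simply passes it back.

```python
def fetch_page(cursor=None, page_size=2):
    """Hypothetical stand-in for an API call; returns (items, next_cursor)."""
    data = ["a", "b", "c", "d", "e"]
    start = int(cursor) if cursor is not None else 0
    items = data[start:start + page_size]
    end = start + len(items)
    next_cursor = str(end) if end < len(data) else None
    return items, next_cursor

def fetch_all():
    results, cursor = [], None
    while True:
        items, cursor = fetch_page(cursor)
        results.extend(items)
        if cursor is None:  # the server signals the last page by returning no cursor
            break
    return results

print(fetch_all())  # ['a', 'b', 'c', 'd', 'e']
```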
DAG (Directed Acyclic Graph) — the dependency graph dbt infers from all ref()/source() calls (directed = one-way dependencies, acyclic = no loops); dbt topologically sorts it to determine build order and parallelism, and it is the project's lineage.
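
The topological sort that turns a dependency graph into a build order can be illustrated with Python's standard library. This is not dbt's own implementation; the model names are hypothetical and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each key depends on the models in its set.
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "fct_orders": {"stg_orders", "stg_customers"},
    "mart_revenue": {"fct_orders"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

The two staging models have no dependency on each other, so either may come first, and a scheduler is free to build them in parallel.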
DAG versioning — Tracking which version of a DAG's code produced a given run, so history stays interpretable across changes.
Data — Recorded facts (a click, a row, a sensor reading, a log line).
Data analyst — Consumes modeled data to answer business questions via dashboards and ad-hoc queries; produces insight, not infrastructure.
data at rest — data you go and fetch on a schedule (databases, APIs, files).
Data catalog — a searchable inventory of an organization's data assets and their metadata (technical, business, and operational).
Data contract — an explicit, versioned, machine-readable agreement between a data producer and its consumers specifying schema, semantics, quality/SLA guarantees, and ownership.
Data engineering — Building the reliable systems that move data from where it is produced to where it can be trusted and used.
Data engineering lifecycle — The five stages every piece of data travels: generation, ingestion, storage, transformation, serving.
Data governance — the set of implemented controls answering who can find, touch, keep, and pay for each dataset.
data in motion — data continuously pushed to you as events occur (streams, queues).
Data interval — The window of data a single run is responsible for (e.g. one day's worth), distinct from when that run actually executes.
Data lake — raw files dumped into cheap object storage, readable by any engine; open and inexpensive but, without a table format, lacking atomic commits, consistent reads, and schema enforcement (a "data swamp").
Data lineage — the record of where data comes from, how it flows, and how it transforms across systems, represented as a directed dependency graph.
Data management — The undercurrent covering governance, quality, cataloging, and lineage of data.
Data observability — the ability to understand the health of data and pipelines and be alerted to problems, including ones no explicit rule was written for.
Data on-call — a rotation responsible for pipeline health: responding to freshness/volume/quality alerts, triaging, mitigating, and diagnosing, guided by runbooks.
Data owner — the party (usually a team or lead) accountable for a dataset.
Data pipeline — An automated, repeatable process that moves and reshapes data from one place to another; the unit of work a data engineer ships.
Data platform — the whole connected system that gets data from where it is produced to where it can be trusted and used: ingestion, storage, processing, and serving, plus the orchestration, quality, and governance wrapped around them.
Data platform engineer — a data role emphasizing building the platform other data people build on (warehouse/lakehouse infrastructure, orchestration, self-serve tooling, IaC, cost and reliability), treating the platform as a product; the fastest-rising of the data roles.
Data product — a dashboard, table, or dataset that people depend on, treated like a product with a reliability commitment (SLAs/SLOs), monitoring, an owner, and documentation.
Data quality dimension — one measurable aspect of data correctness (accuracy, completeness, consistency, timeliness, validity, uniqueness) that can be assessed independently of the others.
Data retention — the policy governing how long data is kept before deletion or archival, driven by law, purpose (data minimization), and cost.
Data scientist — Uses data to build statistical and machine-learning models (predictions, recommendations, forecasts); depends on the data engineer for clean inputs.
Data skew — an uneven distribution of data across workers (often a lopsided/sentinel key) that overloads one worker and stalls a whole stage; mitigated by salting, filtering, or adaptive skew handling.
Data steward — the party responsible for a dataset's day-to-day quality, documentation, and proper use.
Data test (assertion) — an executable statement of an expectation about data that passes or fails when run against real rows.
Data Vault — A warehouse philosophy decomposing data into hubs (business keys), links (relationships), and satellites (attributes + full history) for auditability and resilience to source change.
Data warehouse — A database built for analytics (OLAP): large queries over modeled, integrated, historical data, rather than running an application.
Databricks — a managed lakehouse platform whose runtime is an optimized Spark, typically over Delta Lake.
DataFrame — a distributed table of named, typed columns; the preferred Spark API because Catalyst can understand and optimize its declarative operations.
DataOps — applying proven software-engineering and operations discipline (version control, automated testing, CI/CD, environments, infrastructure as code, monitoring) to data pipelines and products so they're reliable, repeatable, and safe to change.
Dataset — a typed DataFrame with compile-time type safety (Scala/Java only; PySpark has no Dataset API).
Dataset / Asset (Airflow) — A named data object whose update can trigger downstream DAGs (Airflow's data-aware scheduling primitive).
dbt (data build tool) — a framework for transforming data already in a warehouse by writing version-controlled, tested SQL SELECTs; it owns no compute and acts as a SQL compiler and orchestrator that the warehouse executes.
dbt Cloud — the paid hosted product that wraps dbt Core with a web IDE, scheduler, docs viewer, and CI.
dbt Core — the free, open-source command-line edition of dbt (dbt run, dbt test, etc.).
dbt docs — auto-generated documentation, including an interactive lineage graph of the DAG, produced from the project's models, dependencies, tests, and YAML descriptions; living documentation always in sync with the code.
dbt Fusion — a 2025/2026 rewrite of dbt's engine in Rust that adds static analysis of SQL, enabling column-level lineage, earlier error detection, and faster compilation; the paradigm (compiled SQL on a DAG) is unchanged.
dbt model — a single .sql file containing one SELECT statement; the central object in dbt, persisted in the warehouse according to its materialization.
dbt package — a bundle of models and macros from another project installed via packages.yml and dbt deps; e.g. dbt_utils.
dbt test — a data assertion declared in YAML beside a dbt model and compiled to a query that returns failing rows; built-ins include not_null, unique, accepted_values, and relationships.
dbt_project.yml — the dbt project's main config file (name, model paths, default materializations).
dbt_utils — the most widely used community dbt package: battle-tested, cross-warehouse macros for surrogate keys, date spines, deduplication, pivoting, and more.
dead-letter queue (DLQ) — a holding area where unprocessable records are set aside with their error so good records still flow and failures aren't lost.
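
The dead-letter pattern can be sketched in a few lines. The function `process`, the expected `id` field, and the in-memory list standing in for the queue are all assumptions made for illustration; a real pipeline would write dead letters to a durable topic or table.

```python
import json

def process(raw_records):
    """Parse records; set aside any that fail, together with the error."""
    good, dead_letters = [], []
    for raw in raw_records:
        try:
            record = json.loads(raw)
            good.append({"id": int(record["id"])})
        except (ValueError, KeyError, TypeError) as err:
            # Keep the original payload and the reason so it can be fixed and replayed.
            dead_letters.append({"raw": raw, "error": repr(err)})
    return good, dead_letters

good, dlq = process(['{"id": "1"}', "not json", '{"name": "x"}'])
print(good)      # [{'id': 1}]
print(len(dlq))  # 2
```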
Debezium — the dominant open-source log-based CDC engine, emitting standardized change events (typically onto Kafka).
Decoupling orchestration from compute — Letting the orchestrator trigger external engines (Spark, dbt, the warehouse) rather than doing heavy processing itself.
dedup key — the field(s) (often key + event_id) used to recognize and drop duplicate events.
dedup key (deduplication key) — a column or combination that uniquely identifies a record, used to recognize duplicates.
Deferrable (async) operator — A task that releases its worker slot while waiting, resuming via a lightweight trigger instead of busy-polling.
Deferral (--defer) — a dbt CI feature where any ref() to a model not built in the current run resolves to the production version, so CI tests a change against real upstream data without rebuilding everything.
Degenerate dimension — A dimension key with no descriptive attributes of its own (e.g. an order number) that lives in the fact table itself.
delete+insert (strategy) — delete rows matching the new batch's keys, then insert the new batch; an upsert alternative where MERGE is unsupported.
Delete-insert (partition replace) — an idempotent load pattern that deletes a partition's data then re-inserts it, so re-running rewrites only that partition without creating duplicates.
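
The idempotency of a partition replace can be demonstrated with SQLite from the Python standard library. The table `sales`, its columns, and the function `load_partition` are hypothetical; the sketch assumes the delete and the insert run in a single transaction.

```python
import sqlite3

def load_partition(conn, day, rows):
    """Idempotently replace one day's partition: delete it, then re-insert it."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO sales (day, order_id, amount) VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, order_id INTEGER, amount REAL)")

batch = [(1, 10.0), (2, 25.5)]
load_partition(conn, "2026-01-01", batch)
load_partition(conn, "2026-01-01", batch)  # re-run: same end state, no duplicates

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)  # 2
```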
Deletion vector — a small side file marking which rows in a data file are deleted, so deletes/updates avoid rewriting the whole file (a merge-on-read technique).
Delta encoding — storing the first value then only the differences between consecutive values; ideal for timestamps and monotonically increasing IDs.
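
A minimal sketch of delta encoding on a list of integers follows. The function names and the sample timestamps are illustrative; real file formats additionally bit-pack the small differences.

```python
from itertools import accumulate

def delta_encode(values):
    """Keep the first value, then store only consecutive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """A running sum restores the original values."""
    return list(accumulate(deltas))

timestamps = [1700000000, 1700000005, 1700000007, 1700000012]
encoded = delta_encode(timestamps)
print(encoded)  # [1700000000, 5, 2, 5]
```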
Delta Lake — an open table format (Databricks-origin) that records commits as an ordered transaction log (_delta_log) reconstructed by replay, with strong layout tooling.
Delta UniForm — Delta Lake writing Iceberg metadata alongside its own so the same data files are readable by Iceberg clients.
_delta_log — Delta Lake's transaction log: an ordered series of JSON commit files (with periodic Parquet checkpoints) whose replay yields the table's current state; appending the next entry is the atomic commit.
Denormalization — Deliberately duplicating data so it is pre-joined and ready to read; optimizes query speed at the cost of redundancy.
Dependency (edge) — A "must run before" relationship between two tasks, drawn as an arrow in the DAG.
Deterministic ordering — a sort that always yields the same result for the same input (achieved with a unique tiebreaker), required to make a ROW_NUMBER dedup reproducible.
Dictionary encoding — replacing repeated values with small integer indexes into a lookup table of distinct values; effective on low-cardinality columns and Parquet's default.
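
Dictionary encoding can be sketched as follows. The function names and the sample column are illustrative only; this shows the idea, not Parquet's actual byte layout.

```python
def dict_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []  # distinct values, in first-seen order
    index_of = {}    # value -> its position in `dictionary`
    codes = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index_of[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    return [dictionary[c] for c in codes]

countries = ["US", "DE", "US", "US", "FR", "DE"]
dictionary, codes = dict_encode(countries)
print(dictionary)  # ['US', 'DE', 'FR']
print(codes)       # [0, 1, 0, 0, 2, 1]
```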
Dimension table — A table of descriptive context (one row per thing) that you filter and group facts by, e.g. customer, product, date, store.
Dimensional modeling — The Kimball method of organizing analytics data into facts (measurements) and dimensions (context) for understandable, fast queries.
Distributed system — A system whose data and computation span many machines that communicate over an unreliable network.
Distribution key — In Redshift, the column controlling how rows are spread across compute nodes, to minimize data movement (shuffle) during joins and aggregations.
Driver — the Spark process that runs your program, builds and optimizes the plan, and schedules tasks onto executors; a single point of failure and where .collect() lands data.
DuckDB — an in-process, single-machine OLAP engine ("the SQLite of analytics") that reads Parquet/CSV/JSON directly with no server or cluster; the 2026 default for local dev and embedded analytics.
DuckLake — an emerging approach that stores table metadata in a SQL database rather than in metadata files.
Durability — Once a transaction commits, it survives crashes and power loss.
Durable vs dated — the guide's core learning strategy: invest deeply in concepts that last a career (SQL, modeling, partitioning, idempotency, decision frameworks, cost thinking, communication) and hold loosely the volatile specifics (this year's vendor, pricing, certification code, or hot tool).
Dynamic filtering — a runtime optimization that builds a filter from the actual keys of a small, already-filtered join side and pushes it into the large side's scan to skip most of it.
Dynamic task mapping — Generating a variable number of parallel task instances at runtime from a list (e.g. one per file).
Eager execution — a dataframe model (pandas' default) where each operation runs immediately and materializes a full intermediate result in memory before the next line; simple to debug but unable to optimize across the whole pipeline. Contrast with lazy evaluation.
Effective date / expiration date — The valid-from / valid-to columns on an SCD Type 2 row that define the time window during which that version was true.
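
An "as of" lookup against effective and expiration dates might look like the sketch below. The column names `valid_from` and `valid_to`, the half-open interval convention (start inclusive, end exclusive), and the far-future date used for the current row are all assumptions; conventions vary between teams.

```python
from datetime import date

# Hypothetical SCD Type 2 rows for one customer.
versions = [
    {"customer_id": 1, "city": "Boston",
     "valid_from": date(2024, 1, 1), "valid_to": date(2025, 6, 1)},
    {"customer_id": 1, "city": "Denver",
     "valid_from": date(2025, 6, 1), "valid_to": date(9999, 12, 31)},
]

def as_of(rows, customer_id, day):
    """Return the version that was true on `day`, or None if there is none."""
    for row in rows:
        if row["customer_id"] == customer_id and row["valid_from"] <= day < row["valid_to"]:
            return row
    return None

print(as_of(versions, 1, date(2025, 1, 15))["city"])  # Boston
print(as_of(versions, 1, date(2025, 6, 1))["city"])   # Denver
```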
EL (Extract, Load) — the ingestion tool's job in the modern stack: faithfully extract and load raw data, leaving transformation downstream.
Elementary — a dbt-native data observability package that layers anomaly detection and alerting on top of dbt tests and metadata.
ELT — Extract, Load, Transform: the modern order where raw data is loaded into the warehouse first, then transformed in place using the warehouse's own SQL compute; won once cloud warehouses made storage cheap and compute elastic.
ELT (Extract, Load, Transform) — Pull data, load it raw into the warehouse, then transform it in-warehouse with SQL; the modern default, enabled by cheap elastic cloud compute.
Embedding — a vector (list of numbers) representing the meaning of a piece of text or image, produced by an embedding model, where similar meanings get nearby vectors, enabling search by semantic similarity rather than exact keyword match.
Encoding — a column-aware transform (dictionary, RLE, delta) that shrinks data by exploiting a single column's structure, applied before general compression.
Environments (dev/staging/prod) — separate dev, staging, and production environments through which the same code is promoted via review and CI, so changes are never developed or tested directly against production data and consumers.
Ephemeral (materialization) — a model never built as its own object but inlined as a CTE into any model that ref()s it; keeps logic DRY but cannot be queried or tested alone.
Estuary Flow — a newer ingestion platform centered on real-time CDC and streaming EL.
ETL — Extract, Transform, Load: the older order where raw data is transformed on a separate server before being loaded into the warehouse; justified when warehouse storage and compute were expensive and fixed.
ETL (Extract, Transform, Load) — Pull data, transform it on a separate engine first, then load the finished result into the warehouse.
event — an immutable record that something happened at a point in time (a click, a payment, a sensor reading); the atom of a stream.
event time — the timestamp at which an event actually happened (carried inside the event).
Eventual consistency — a storage model where a read shortly after a write may not yet reflect it; a historical source of pipeline bugs, now largely replaced by strong consistency on the major cloud object stores.
exactly-once — end-to-end semantics where each event affects results exactly one time, even across failures.
Exactly-once delivery — Each record takes effect once and only once; true exactly-once delivery is nearly impossible across a network.
Exactly-once effect — Achieving exactly-once results by combining at-least-once delivery with idempotent processing; how real systems implement "exactly-once."
exactly-once-ish — at-least-once delivery made effectively exactly-once by deduplicating at the destination (upsert by key / track offsets).
Executor — a worker process on a cluster machine, allocated cores and memory, that runs tasks on partitions and holds cached data; executors exchange shuffle data peer-to-peer.
EXPLAIN — a command that prints the optimizer's chosen plan for a query without running it; read bottom-up to understand execution.
EXPLAIN ANALYZE — a command that runs a query and annotates each plan operator with actual rows, time, and bytes; the estimated-vs-actual row gap reveals bad statistics.
explain() — the method that prints a DataFrame's physical query plan, showing what Spark will actually do (pushdowns, join strategy).
exponential backoff — retrying a failed request after a delay that doubles each attempt (1s, 2s, 4s), usually with random jitter.
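
As a rough illustration of exponential backoff with jitter, the Python sketch below computes a retry schedule. The function name backoff_delays, the 30-second cap, and the choice of scaling each delay by a random factor between 0.5 and 1.0 are assumptions made for this example; real clients use several different jitter schemes.

```python
import random


def backoff_delays(max_attempts, base=1.0, cap=30.0, rng=None):
    """Return the delay (in seconds) to wait before each retry.

    The un-jittered delay doubles each attempt (base, 2*base, 4*base, ...),
    is capped at `cap`, and is then scaled by a random factor in [0.5, 1.0]
    so that many clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(ceiling * rng.uniform(0.5, 1.0))
    return delays


# Without jitter the schedule would be 1s, 2s, 4s, 8s; with jitter each
# delay lands somewhere between half of that value and the full value.
print(backoff_delays(4, rng=random.Random(0)))
```

A caller would sleep for each delay in turn between attempts and give up once the list is exhausted. If the API returns a Retry-After header, honoring that value is usually preferable to the computed delay.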
Exposure — a YAML declaration of a downstream consumer of dbt models (a dashboard, ML model, or reverse-ETL sync) that extends lineage past the marts into the real world.
Fact table — A table of numeric measurements of a business process, one row per event/measurement, holding measures plus foreign keys to dimensions.
Failure handling — designing a pipeline to cope with the inevitable: late or missing sources, bad/corrupt batches (quarantined to a dead-letter path), mid-run crashes (retried because steps are idempotent), and backfills.
Feature store — a system that computes, stores, and serves ML features consistently for both training (batch/historical) and serving (live/low-latency), existing to prevent train/serve skew.
File format — the on-disk layout of a single file's bytes (e.g. Parquet, ORC, Avro). A table format sits on top of a file format; it does not replace it.
File sizing — choosing per-file size in the sweet spot (~128–512 MB): large enough to amortize per-file overhead, small enough for parallel reads.
FinOps — the discipline of managing cloud cost as a continuous engineering practice (making cost visible, optimized, and governed by the people who create it) rather than a month-end surprise; acute for data systems because cost scales with data volume and usage, often invisibly.
FinOps (data) — the discipline of making data-platform cost visible, attributable, and controllable (cost per query/run, tagging/chargeback, workload isolation, limits).
First Normal Form (1NF) — Each cell holds a single atomic value and each row is unique.
Five pillars of data observability — freshness, volume, schema, distribution (quality), and lineage — the standard set of signals to monitor for data health.
Fivetran — a fully-managed, closed-source EL connector platform priced by Monthly Active Rows (MAR).
follower — a replica that copies the leader's data so it can take over if the leader fails.
Footer — Parquet's trailing metadata block holding the schema plus per-row-group offsets and min/max statistics; read first to plan skipping.
Foreign key — A column holding the primary-key value of a row in another table, linking the two.
Forward compatibility — an old reader can read new data (e.g. a not-yet-upgraded consumer reading records written under a newer schema).
Freshness SLA — a committed maximum staleness for a dataset, declared in its contract, monitored as the freshness pillar, and paged on when violated.
full load (full refresh) — re-reading the entire source every run and replacing the destination with it.
Full refresh — A load that drops and rebuilds the whole table from source each run; simple and inherently idempotent, but costly at scale.
Full table scan — Reading every row of a table; what a query without a usable index or partition pruning falls back to.
GDPR — the EU General Data Protection Regulation, granting individuals rights over their personal data including access and erasure.
Generation — The lifecycle stage where data is born, in source systems the data engineer usually doesn't control.
Generic test — a reusable, parameterized test attached to a column in YAML; the four built-ins are unique, not_null, accepted_values, and relationships.
Git — Version control for code, so pipeline code is reviewed and its history tracked.
Google Cloud Dataproc — Google Cloud's managed Spark/Hadoop service.
Google Cloud Storage (GCS) — Google Cloud's object storage service.
Grain — The precise definition of what one fact-table row represents; declaring it first determines every valid measure and dimension and prevents double-counting (the #1 modeling mistake).
Granule (ClickHouse) — the block of rows (≈8,192 by default) that ClickHouse's sparse primary-key index points at; queries skip whole granules whose key range can't match, reading only the ones that might.
GraphQL API — an API style with a single endpoint where the client specifies exactly which fields it wants per query.
Great Expectations — an engine-agnostic Python validation library whose unit is an Expectation (a named assertion), grouped into Suites and producing human-readable validation results and Data Docs.
GROUPING SETS — a GROUP BY extension that computes several explicitly-listed grouping combinations in one query.
Hash join — the default equi-join algorithm: build an in-memory hash table from the smaller input, then probe it with the larger (O(M+N)); spills to disk if the build side doesn't fit in memory.
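
The build-then-probe shape of a hash join can be sketched in a few lines of Python. This is a toy in-memory version under simplifying assumptions (lists of dicts, an inner join on one key, no spilling to disk); the function name hash_join and the sample tables are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Inner equi-join of two lists of dicts on `key`.

    Build phase: index the (ideally smaller) input in a hash table.
    Probe phase: stream the other input and look up matches.
    """
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)

    joined = []
    for row in probe_rows:
        # .get() avoids inserting empty lists for keys with no match.
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


customers = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Lin"}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},  # no matching customer, dropped
]
result = hash_join(customers, orders, "customer_id")
print(result)
```

Each input is touched once, which is where the O(M+N) cost comes from. Building on the smaller side keeps the hash table small enough to stay in memory.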
HDFS (Hadoop Distributed File System) — Hadoop's distributed storage layer, mostly superseded by cloud object storage like S3.
Hidden partitioning — Iceberg recording partitioning as a metadata transform on a column so users filter on the real column and the engine maps it to partitions, removing the classic "filtered the wrong column → full scan" footgun.
high-watermark — a saved marker (often max updated_at or max ID) recording how far a previous incremental run got.
Hive-style partitioning — the directory naming convention column=value/ that encodes a partition key in the path rather than inside the files.
Hook — Airflow's reusable client for talking to an external system (a warehouse, an API, cloud storage).
HTTP 429 (Too Many Requests) — the status code an API returns when you exceed its rate limit, often with a Retry-After header.
Hub / link / satellite — Data Vault's three building blocks: hubs hold business keys, links hold relationships between hubs, satellites hold descriptive attributes with full version history.
Iceberg REST catalog — an open standard API for catalogs that any engine and any compliant catalog can speak, enabling vendor-neutral interoperability.
Idempotency — the property that running an operation multiple times has the same effect as running it once; the load-bearing concept of pipeline design because retries, redeliveries, and backfills will re-run steps, and idempotent steps can't double-count.
Idempotent — describes an operation that produces the same result whether it runs once or many times; the quality that makes a pipeline step safe to re-run.
Idempotent loading — Loading data so that re-running produces the same result, making retries and backfills safe; achieved via full refresh or incremental loads keyed on a merge key.
idempotent producer — a producer that de-duplicates its own retries so resending a record doesn't create duplicates in the log.
idempotent sink — a destination where writing the same record twice has the same effect as writing it once (e.g. upsert on a key).
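
A minimal sketch of an idempotent sink, using Python's built-in sqlite3 module as a stand-in destination. The payments table, its columns, and the write_batch helper are hypothetical; SQLite's INSERT OR REPLACE is used here as one simple way to upsert on a primary key, and a real warehouse would more likely use MERGE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT PRIMARY KEY, amount REAL)")


def write_batch(conn, batch):
    """Upsert each record by primary key, so replaying a batch is harmless."""
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO payments (payment_id, amount) VALUES (?, ?)",
            batch,
        )


batch = [("p-1", 20.0), ("p-2", 35.5)]
write_batch(conn, batch)
write_batch(conn, batch)  # simulated redelivery of the same batch

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM payments"
).fetchone()
print(row_count, total)
```

Because the second write replaces rows rather than appending them, the table holds the same two rows after the redelivery. A plain INSERT into a table without the key constraint would have doubled both the row count and the total.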
Impact analysis — using lineage to find everything downstream that depends on a dataset or column (its blast radius), before a change or during an incident.
Incremental (materialization) — a model built fully on the first run, then updated on later runs by processing only new or changed rows; the key to transforming very large tables cheaply.
Incremental load — A load that processes only new/changed rows; cheap at scale but must be made idempotent with a merge key and MERGE/upsert.
Incremental processing — processing only the new or changed data each run instead of reprocessing all history; the biggest single pipeline-cost lever and the most common waste when neglected.
Incremental strategy — how new rows are combined into an existing incremental table: append, merge, delete+insert, or insert_overwrite.
Index — A separate sorted structure (often a B-tree) mapping a column's values to rows, so selective lookups avoid a full table scan; speeds reads but slows writes.
Infrastructure as code (IaC) — defining a platform's infrastructure (warehouse, buckets, IAM roles, orchestrator, networking) in declarative, version-controlled files (commonly Terraform) instead of clicking it together in a console, making it reviewable, reproducible, and recoverable.
Ingestion — The lifecycle stage of getting data out of source systems and into the platform; often the most failure-prone stage.
ingestion time — the time an event entered the streaming system (assigned by the broker).
initial snapshot — a one-time full read of existing tables a CDC pipeline does before switching to tailing the change log.
Inmon (Corporate Information Factory) — A warehouse philosophy building a normalized 3NF enterprise core first, then spinning dimensional marts off it; integrate-first, consistency-focused.
insert_overwrite (strategy) — replace whole partitions (e.g. by day) affected by the new data, leaving others untouched; idempotent at partition grain, common on partitioned warehouses.
integration — the broader job of wiring many heterogeneous sources together into a coherent data platform.
Intermediate (layer) — the middle dbt layer that joins staging models and applies business logic to build reusable, non-user-facing pieces; often ephemeral or views.
is_incremental() — a dbt macro that returns false on the first run / full-refresh and true on subsequent incremental runs; used to filter a model's SELECT down to only new rows.
Isolation level — A correctness-vs-speed knob controlling how strictly concurrent transactions are kept from seeing each other's in-progress work (Read Uncommitted → Read Committed → Repeatable Read → Serializable).
ISR (in-sync replicas) — the set of replicas currently caught up to the leader; a write is "committed" once all ISR have it.
Jinja — a templating language dbt runs over .sql files before execution: {{ }} substitutes an expression into the SQL, {% %} controls logic (if/for/set); the warehouse only ever sees the resulting plain SQL.
jitter — small randomness added to backoff delays so many clients don't retry in lockstep.
Job (Spark) — the unit of work triggered by an action; typically one action produces one job.
Join (Spark) — combining rows of two DataFrames on a matching key; usually a wide transformation that shuffles, executed via one of three strategies.
Join order — the sequence in which a multi-table join is performed; a good order keeps intermediate results small and can change query cost by orders of magnitude.
Jupyter — Interactive notebooks for exploring data and prototyping before hardening logic into a pipeline.
Kafka Connect — a standardized runtime for moving data into and out of Kafka via reusable source and sink connectors.
kappa architecture — a design with a single streaming layer; reprocessing is done by replaying the log, eliminating the batch layer.
keyed state — per-key memory a stream operator maintains across events (counts, running totals, last-seen value).
Kimball (dimensional, bottom-up) — A warehouse philosophy modeling data as conformed star schemas by business process; serve-first, query-friendly.
LAG / LEAD — window functions that read a column's value from a row before (LAG) or after (LEAD) the current row in the ordered window; used for period-over-period deltas without a self-join.
Lakehouse — an architecture that adds an open table format over cheap object storage so one open copy of data gets warehouse-grade reliability (ACID, schema, time travel) while remaining readable by many engines.
Lambda architecture — a legacy design with separate batch and speed layers (two pipelines, two copies) that a single lakehouse table — written by both streaming and batch — collapses into one.
landing zone (raw / bronze layer) — the first, append-only, immutable place ingested data lands, preserved exactly as received.
Late-arriving dimension — A fact that references a dimension member not yet loaded (or an old fact hitting a Type 2 dimension); handled with placeholder/Unknown rows and date-correct matching (joining each fact to the dimension version valid on the fact's date) to avoid dropping or mis-attributing data.
Latency (data freshness) — The time between an event happening and it being usable downstream.
Lateral join — a join whose right side is evaluated per row of the left side using that row's values (e.g. UNNEST(o.items) per order row).
Lazy evaluation — Spark's model where transformations accumulate unexecuted and an action forces the whole recipe to run at once, enabling whole-plan optimization.
leader (partition leader) — the single replica that handles all reads and writes for a partition at a given time.
Least-privilege IAM — granting each pipeline, service, and person only the permissions it actually needs (read this bucket, write that schema) and nothing more, so a compromised component is a contained incident rather than a catastrophe.
Lifecycle policy — a rule that automatically transitions objects to colder, cheaper tiers as they age or deletes them after a retention period; a major lake cost lever.
Lineage — the recorded chain of how a dataset was derived; lets Spark recompute only lost partitions for fault tolerance.
Liquid clustering — Delta's automatic, incremental data layout by declared clustering keys that replaces fixed partitioning and Z-ordering and adapts to changing data and queries.
log compaction — a retention mode that keeps only the latest event per key, turning a log into a snapshot of current state.
log-based CDC — reading the database's own change log (WAL/binlog) to capture every change in commit order with minimal source load.
Logical date (execution date) — The label identifying which data interval a run covers, not the wall-clock time the run starts.
Logical plan — a tree of relational operators (Scan, Filter, Project, Join, Aggregate) expressing what a query computes, independent of how.
Macro — a named, reusable chunk of Jinja-templated SQL (a function for SQL) defined in macros/ and called from any model to keep transformations DRY; ref, source, and is_incremental are themselves macros.
Managed vs self-hosted — the choice between running open-source software yourself and using a cloud-managed version; default to managed to offload patching/scaling/backups, and self-host only when a hard requirement demands it and you have the expertise and extreme-scale economics to justify it.
manifest / success marker — a file (e.g. _SUCCESS) signaling that a batch of files is complete and safe to read.
Manifest file — an Iceberg metadata file listing data files with per-file statistics (row counts, min/max per column) used to prune to only the relevant files.
Manifest list — an Iceberg metadata file naming the manifests that make up one snapshot.
Map (phase) — the parallel, independent step that runs on each piece of input and emits intermediate key→value pairs.
MapReduce — Google's programming model for distributed processing: a parallel map phase emits key→value pairs, a shuffle regroups them by key, and a reduce phase combines values per key.
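
The three steps of MapReduce can be imitated on a single machine with the classic word count. This Python sketch is only an analogy, since a real framework runs each phase in parallel across many machines; the function names map_phase, shuffle, and reduce_phase are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: regroup all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)


lines = ["to be or not to be", "to do"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(counts)
```

The map calls are independent of one another, which is what lets a framework spread them across workers. The shuffle is the step that moves data between machines, and it is usually the expensive part.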
MAR (Monthly Active Rows) — Fivetran's consumption-based pricing metric, roughly the number of distinct rows changed per month.
Marts (layer) — the final, business-facing dbt layer of polished fact and dimension tables that analysts and dashboards query; usually materialized as tables.
Masking — obscuring a sensitive value on read; static masking permanently transforms a copy, dynamic masking transforms at query time based on who is asking.
Massively parallel processing (MPP) — an architecture that spreads one query across many machines working in parallel, the basis of modern distributed analytic engines.
Materialization — precomputing and storing an expensive aggregation once (a materialized view or a dbt table) so many repeated reads hit a small pre-aggregated result instead of re-scanning the raw data, trading freshness and storage for compute.
Materialized view — A view whose computed result is stored and kept up to date (refreshed incrementally or on a schedule, depending on the engine), trading storage and refresh cost for fast reads of an expensive query.
materialized view (streaming) — a query result kept continuously up to date as new events arrive.
Measure — A numeric column in a fact table that you aggregate (SUM/AVG/COUNT).
Medallion architecture — Layering warehouse data as bronze (raw, replayable) → silver (cleaned, conformed) → gold (modeled, served), mapping onto dbt's staging → intermediate → marts.
Meltano — an open-source orchestrator for running Singer taps and targets, giving control with no per-row vendor fee.
merge (strategy) — upsert on a unique_key: update the row if its key exists, else insert; the forgiving default when rows can change.
MERGE (UPSERT) — a SQL statement that reconciles a target table against a source in one atomic operation, updating rows whose key matches and inserting those that don't; an idempotent alternative to plain append.
Merge key — The unique business/surrogate key a MERGE/upsert matches on, so a re-run updates existing rows instead of inserting duplicates.
Merge-on-read (MOR) — an update strategy that writes small delta/delete files instead of rewriting data, giving fast writes at the cost of slower reads that merge base + deltas (reclaimed by compaction).
MergeTree (ClickHouse) — ClickHouse's core storage engine: data is written as many immutable, sorted, columnar parts (ordered by the table's ORDER BY), and a background process continuously merges small parts into larger ones; an LSM-tree-like design that makes ingest fast and reads sorted.
message queue / event log — a durable, ordered stream of events (e.g. Kafka) that producers append to and consumers subscribe to.
Metadata — data about data: schemas, lineage, descriptions, ownership, quality scores, and definitions.
MetricFlow — the engine behind the dbt Semantic Layer that turns YAML metric and dimension definitions into correct SQL on demand, giving every consumer one definition of each metric.
Micro-batch — Processing very small batches very frequently; a middle ground between batch and streaming.
Micro-partition — A small, immutable, columnar chunk of a table with per-column min/max metadata that lets the engine skip chunks that can't match a query.
Modern data stack — a loosely-coupled assembly of mostly-managed best-in-class tools around a cloud warehouse: managed EL (e.g. Fivetran/Airbyte) + cloud warehouse (Snowflake/BigQuery/Databricks) + dbt for in-warehouse ELT + BI/activation.
monotonically increasing ID — an ever-growing key (like an auto-increment primary key) usable as a watermark for append-only tables.
MPP (Massively Parallel Processing) — Splitting a query across many compute units that each scan a slice of data in parallel, then combining the partial results.
Narrow transformation — a transformation where each output partition depends on exactly one input partition (filter, select, withColumn); no shuffle, chains within a stage.
Natural key — The business identifier a source system uses (email, SKU); kept as a column in the dimension but not used as the joined key.
Nested-loop join — the brute-force join algorithm comparing each outer row against the inner side (O(M×N)); used for tiny inputs, non-equality conditions, or with an index on the inner side.
Non-additive measure — A measure that can't be meaningfully summed across any dimension (e.g. a ratio or percentage); store additive components and compute the ratio at query time.
Normalization — Organizing data to eliminate redundancy by splitting it into many related tables (target: third normal form / 3NF); optimizes write integrity.
not_null (test) — asserts a column is never null.
OAuth 2.0 — an auth flow granting a short-lived access token (plus a refresh token) that an extractor must renew on expiry.
Object storage — storing data as objects addressed by a unique key over HTTP in buckets; effectively infinite, cheap, and durable; the substrate of the data lake.
offset — the integer position of a record within a partition; how a consumer remembers where it is.
offset (connector) — the position recording how far a connector has read, letting it resume exactly where it left off after a restart.
OLAP (Online Analytical Processing) — Systems optimized for large analytic queries that scan and aggregate many rows; warehouses and lakehouses.
OLTP (Online Transaction Processing) — Systems optimized for many small, fast, correct reads and writes; the operational databases behind apps.
OLTP database — an Online Transaction Processing database that powers a live application and holds its current state, changing rows in place.
One Big Table (OBT) — A wide, denormalized table that pre-joins a fact and all its dimensions so no joins are needed at query time; a context-dependent tradeoff against the star schema, not a universal rule.
OpenLineage — an open specification for emitting lineage events from any data job, producing a unified lineage graph across heterogeneous tools.
Operator — A reusable template in Airflow for one kind of task (run SQL, call an API, trigger a job).
Optimistic concurrency control — a writer reads the current version, does its work, then commits as the next version, retrying if another writer committed first; efficient when conflicts are rare.
OPTIMIZE — Delta's command to compact many small files into fewer right-sized files (addressing the small-files problem).
Orchestration — Coordinating the tasks of a data pipeline so they run in the right order, on schedule, with retries, observability, and recovery.
Orchestrator — The tool (Airflow, Dagster, Prefect) that schedules, runs, monitors, and recovers a pipeline's tasks.
Orphan-file cleanup — deleting data files left in storage that no snapshot references (typically from a write that crashed before committing).
Out-of-memory (OOM) — a failure when a process (driver or executor) exceeds its allocated memory; common causes include .collect(), skew, too few partitions, over-caching, and oversized broadcasts.
Over-partitioning — partitioning on a high-cardinality column (e.g. user_id), creating millions of tiny-file directories; avoid by partitioning on low-cardinality keys and sorting within files.
PACELC — Extends CAP: under a Partition choose Availability or Consistency; Else (no partition) trade Latency vs Consistency.
Page — the smallest unit Parquet encodes and compresses; a run of one column's values within a column chunk.
pagination — returning results in pages plus a way to request the next, so a single API response never carries everything.
pandas — the ubiquitous original Python in-memory dataframe library; single-threaded and eager by default (each operation runs immediately), with a vast ecosystem but heavier memory use and slower performance than modern columnar alternatives on larger frames.
Pandas UDF (vectorized UDF) — an Arrow-backed UDF that transfers a batch of rows to Python at once and operates vectorized, far faster than a plain Python UDF.
Pandera — a Python library for declaring and validating a dataframe's schema (column types, nullability, value ranges) so bad data fails fast at a pipeline boundary; the single-node counterpart to data contracts.
Parameterization — Designing a pipeline so values like the run's date or environment are inputs, not hardcoded constants.
Parser — the engine stage that validates SQL syntax and resolves names against the catalog, producing an abstract syntax tree.
Partial failure — The normal condition at scale where some component is always broken while the rest works.
partition — an ordered, independently-stored shard of a topic; the unit of parallelism and the only scope in which order is guaranteed.
PARTITION BY — the window clause that splits rows into independent groups for a window function to compute over (like GROUP BY, but the rows are kept, not collapsed).
Partition evolution — changing a table's partitioning scheme without rewriting existing data, possible because Iceberg stores partitioning in metadata rather than in physical folder names.
Partition key — The column used to decide which partition a row belongs to.
Partition overwrite — an idempotent write pattern that deletes and replaces a target partition (e.g. one day) instead of appending, so re-running a job yields the same table rather than duplicated rows.
Partition pruning — the engine reading only the partitions that match a query's filter (e.g. one day out of a year) and skipping the rest entirely; the single biggest data cost-and-latency lever.
Partitioned Parquet — Parquet output laid out into folders by a column value (e.g. event_date=...) so downstream queries can prune to relevant files.
Partitioning (on disk) — physically splitting a dataset into subfolders by column value so an engine can skip whole directories by path; distinct from table-format or streaming partitioning.
Partitioning (sharding) — Splitting a dataset across machines or files by a key, so queries can skip irrelevant partitions.
Periodic snapshot fact table — A fact table with one row per thing per period (e.g. one account per day), recording a level sampled on a regular cadence.
Physical layout — the order in which a two-dimensional table's cells are written into a one-dimensional byte stream on disk; row-major or column-major. Decides how much data a query must read.
Physical plan — the logical plan with every execution choice made concrete (join algorithms, scan files/columns, and in distributed engines the stages and shuffles).
PII (Personally Identifiable Information) — data that identifies a person (name, email, SSN, phone, address, precise location, device IDs), subject to privacy law.
Platform engineer — Provisions and operates the underlying compute, storage, and networking the data platform runs on (below the data engineer in the stack).
Poke-mode sensor — A sensor that holds a worker slot and repeatedly checks a condition; an anti-pattern at scale versus deferrable sensors.
Polars — a fast Rust-backed DataFrame library with a lazy optimized API; a single-node alternative to Spark/pandas.
Portfolio capstone — an end-to-end project that proves you can ship: ingest → lake/warehouse → dbt → orchestrate → quality → dashboard, with senior touches almost nobody includes (Git + CI, Terraform, an honest README, and an ADR).
Predicate pushdown — moving filters as close to the data source as possible so rows are discarded before (or during) reading, often skipping whole file row-groups via footer statistics.
Prefix — the leading portion of an object key in a flat keyspace; "listing a folder" on object storage means listing all keys sharing a prefix.
Primary key — A column (or set) whose value uniquely identifies each row in a table.
processing time — the wall-clock time at which an event is processed by the system.
producer — a client that writes (publishes) events into a stream/topic.
profiles.yml — the dbt connection config kept outside the project; specifies which warehouse to connect to and with what credentials, letting one project target dev or prod.
Projection pruning (column pruning) — reading only the columns a query references; on columnar storage this is a major saving.
Projection pushdown — pushing the list of needed columns down to the storage reader so unreferenced columns are never read off disk; the column-pruning mechanism.
Protobuf — Google's schema-based binary serialization format, an alternative to Avro for stream messages.
Pruning (partition elimination) — Skipping micro-partitions/partitions whose min/max range can't contain matching rows, so the engine reads as little data as possible.
PySpark — Spark's Python API.
Python — The general-purpose programming language data engineering standardized on for pipeline logic and data wrangling.
QUALIFY — a SQL clause that filters on the result of a window function (as WHERE filters rows and HAVING filters groups); the idiomatic way to keep ROW_NUMBER() … = 1 for deduplication.
Query engine — the software that compiles declarative SQL into a concrete execution plan and runs it, deciding how to produce the result you declared.
Query optimizer — the engine stage that rewrites the logical plan into an equivalent cheaper one and chooses physical operators.
query-based CDC — inferring changes by repeatedly querying a timestamp column; simple but blind to deletes and intermediate states.
Query-on-the-lake engine — an engine that brings interchangeable compute to open files (Parquet/Iceberg) in your own object storage (e.g. DuckDB, Trino, Spark SQL, Athena); you own the data, the engine is swappable.
RAG (Retrieval-Augmented Generation) — the dominant pattern for using an organization's own data with an LLM: retrieve the relevant facts from your data and hand them to the model as context so it answers grounded in your data; fundamentally a data pipeline.
RANK / DENSE_RANK — ranking window functions: RANK gives tied rows the same rank then skips (1,1,3); DENSE_RANK gives ties the same rank with no gap (1,1,2).
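
To make the gap versus no-gap difference concrete, here is a small Python sketch that mimics both functions over a list ordered descending. The helpers rank and dense_rank are hypothetical and written for clarity rather than speed; they are not how a SQL engine computes them.

```python
def rank(values):
    """Mimic RANK(): ties share a rank, and the next rank skips ahead."""
    ordered = sorted(values, reverse=True)
    # The first position of a value in the sorted list is its rank.
    return [ordered.index(v) + 1 for v in ordered]


def dense_rank(values):
    """Mimic DENSE_RANK(): ties share a rank, with no gap afterwards."""
    ordered = sorted(values, reverse=True)
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(v) + 1 for v in ordered]


scores = [90, 90, 80]
print(rank(scores))        # [1, 1, 3]
print(dense_rank(scores))  # [1, 1, 2]
```

RANK answers "how many rows are ahead of me, plus one", while DENSE_RANK answers "how many distinct values are ahead of me, plus one".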
rate limit — a vendor's cap on how many API requests you may make per window; exceeding it returns HTTP 429.
RBAC (Role-Based Access Control) — an access model where permissions attach to roles and users are granted roles.
RDD (Resilient Distributed Dataset) — Spark's original low-level API: an immutable, partitioned collection of objects that Catalyst cannot optimize; the foundation everything compiles to.
Read amplification — every read paying extra work to reconcile a base file with its deltas, the cost of merge-on-read.
read replica — a synchronized copy of a production database you can query for extraction without loading the primary.
rebalance — the reassignment of partitions across a consumer group's members when membership changes.
Reconciliation — verifying that a target dataset faithfully represents its source after data has been moved or transformed, via row counts, control totals, and source-vs-target diffs.
Recursive CTE — a CTE that references itself (a base case UNION ALL a recursive step) to walk a hierarchy or generate a sequence until no new rows are produced.
Reduce (phase) — the step that gathers all values sharing a key and combines them.
ref() — the dbt function used to reference another model; it resolves to the model's real physical table name AND declares a dependency, which builds the DAG.
Referential integrity — The guarantee that every foreign key points at a real existing row.
Relational database — A database that stores data in tables (rows and columns) that reference each other via keys.
relationships (test) — asserts referential integrity: every value in a column has a matching row in another model.
repartition(n) — change to exactly n partitions via a full shuffle that redistributes rows evenly; can increase or rebalance, but costs a shuffle.
replayability — the ability to re-run all downstream transformations from preserved raw data without re-extracting from the source.
Replication — Keeping multiple copies of data on different machines for durability and availability.
replication factor — the number of copies of each partition kept on different brokers for fault tolerance.
requests / httpx — Python HTTP libraries used to build custom EL extractors (httpx adds async for higher throughput).
REST API — an API style where you request named resource endpoints (e.g. GET /charges) and receive JSON pages back.
Result cache — A warehouse cache that returns the stored result of an identical query instantly, often at zero compute cost, when the underlying data hasn't changed.
retention — the policy controlling how long (or how much) data a topic keeps before deleting it.
Retry — Automatically re-running a failed task instance a configured number of times before declaring it failed.
RFC (Request for Comments) — a longer proposal circulated to a team before a big decision to gather feedback and align, after which the outcome is recorded tersely as an ADR.
Right to be forgotten — the legal right of an individual to have all their personal data deleted across every system that holds it.
Right-sizing — matching warehouse/cluster compute to the workload so the job is fast enough without paying for idle parallelism (e.g. not running an XL warehouse for a query an S handles).
ROLLUP — a GROUP BY extension producing hierarchical subtotals: (a,b), then (a), then (), the grand total; suited to drill-down hierarchies.
Row group — a horizontal slice of a Parquet file holding some complete rows (often ~128 MB); the unit of parallelism and of row-group skipping.
Row store — A storage layout that keeps each row's columns together on disk; fast for fetching whole single rows (OLTP).
Row-group skipping (data skipping) — eliminating whole row groups whose min/max range cannot satisfy a filter, using only footer statistics, without reading their data.
Row-level security (RLS) — access control that filters which rows of a table a given user can see.
Row-major (row-oriented) — a layout that writes each record's fields contiguously, one whole row after another; ideal for transactional (OLTP) whole-row access.
ROW_NUMBER — a window function assigning a distinct sequential number (1,2,3,…) to the rows of each partition in ORDER BY order, breaking ties arbitrarily; the standard tool for "one row per key."
ROWS vs RANGE — frame modes: ROWS counts physical rows; RANGE counts logical peers tying on the ORDER BY value. The implicit default for an ordered window is RANGE, which surprises people on duplicate keys.
Rule-based optimization (RBO) — optimization that applies fixed, always-safe rewrite rules (pushdown, pruning, constant folding) regardless of the data.
Run-length encoding (RLE) — storing a consecutively repeated value once plus a count instead of repeating it; shines on sorted or clustered columns.
Running total — a cumulative sum produced by an ordered window aggregate with a frame from the partition start to the current row.
Runtime check — a data test that runs in production against the real, full dataset on each load to stop bad data from reaching consumers.
Résumé-driven development — choosing a trendy tool because it would look good on a résumé rather than because it fits the problem; the anti-pattern that "boring technology" guards against.
Salting — a skew fix that appends a random suffix to a hot key to spread it across many partitions/tasks, then aggregates the salted results back.
savepoint — a manually triggered, durable checkpoint used for upgrades, migrations, and intentional restarts.
Scaling out (horizontal scaling) — handling more load by spreading work across many machines; the foundation of big data.
Scaling up (vertical scaling) — handling more load by using a bigger single machine; has a hard ceiling at large data sizes.
SCD Type 1 — Overwrite the old value with the new one, keeping no history.
SCD Type 2 — Track full history by adding a new dimension row (new surrogate key) per change, with effective/expiration dates and a current flag, so facts stay tied to the version that was true when they occurred.
SCD Type 3 — Keep one prior value in a second column (limited history).
Schedule interval — How often a DAG runs (e.g. daily, hourly), expressed as a cron expression or a preset.
Scheduler — The orchestrator component that decides which task instances are ready and queues them to run.
schema compatibility — rules (backward/forward/full) governing whether a new schema version can be read alongside old data/consumers.
schema drift — uncoordinated change to a source's structure over time (added, dropped, renamed, or retyped columns) that breaks loads.
Schema enforcement — rejecting writes that don't match the table's declared schema, blocking accidental corruption (schema drift) at the door.
Schema evolution — how a format copes when the schema of new data differs from already-stored data; safest by adding optional fields with defaults and never renaming or repurposing columns.
Schema registry — a service storing versioned schemas for streaming topics/events that rejects producers attempting to publish a payload violating the registered compatibility rules.
Schema-on-read — Storing data raw and applying structure when it is read; flexible but defers discovery of bad data.
Schema-on-write — Enforcing structure when data is written; validated but rigid.
Second Normal Form (2NF) — 1NF plus every non-key column depends on the whole primary key, not just part of it.
Secrets management — storing credentials, API keys, and passwords in a dedicated secrets manager (or the orchestrator's secret store) and fetching them at runtime; never hard-coding them in code or notebooks or committing them to Git.
Seed — a small static CSV committed to seeds/ and loaded into the warehouse by dbt seed as a ref()-able table; for tiny, version-controlled reference lookups, not bulk source data.
Semantic layer — a governed place that defines each business metric once so every tool computes it identically; in dbt this is MetricFlow / the dbt Semantic Layer.
Semantic layer (metrics layer) — A single governed place where metric and dimension definitions live once, so every tool computes a metric like "revenue" identically.
Semi-additive measure — A measure that can be summed across some dimensions but not across time (e.g. balance, inventory on hand); use AVG or last value across time.
Semi-structured data — Self-describing data with flexible structure, such as JSON or XML.
Sensor — A special task that waits for a condition (a file lands, a partition appears) before downstream work runs.
Separation of storage and compute — The cloud-warehouse model where data lives in cheap shared object storage and independent, resizable compute clusters read from it, so each scales and is billed independently.
Serving — The lifecycle stage of delivering transformed data to analytics, machine learning, and applications.
session window — a dynamic window that groups events separated by less than a gap of inactivity; closes after the gap elapses.
SFTP (SSH File Transfer Protocol) — a secure protocol for transferring files onto a shared server, a common batch-drop source.
Shift left — moving data-quality checks earlier (closer to ingestion and transformation, in CI and the build) so bad data fails the pipeline rather than surfacing in a downstream dashboard.
Shift-left — moving quality checks earlier in the lifecycle (toward data creation), e.g. enforcing a data contract in the producer's CI/CD before a change ships.
Shuffle — the operation that physically moves data across the network so all rows sharing a key land on the same partition; required by every wide transformation and the most expensive thing Spark does.
Shuffle (exchange) — the step that redistributes data across the network so the next stage's workers get the rows they need (e.g. repartition by group/join key); usually the most expensive part of a distributed query.
Shuffle hash join — a join that shuffles both sides and builds an in-memory hash table per partition instead of sorting; used when a partition fits in memory.
Shuffle join — a distributed hash join that first repartitions both tables across the network by the join key so matching keys co-locate; the (expensive) alternative to a broadcast join.
side output (late data stream) — a separate output channel where events later than allowed lateness are routed instead of dropped.
Singer — an open specification for connectors where sources are "taps" and destinations are "targets" speaking a common JSON protocol.
Singular test — a one-off test written as a .sql file in tests/ whose SELECT returns the rows that violate a business rule; passes if it returns none.
sink connector — a Kafka Connect connector that pushes data from Kafka out to a destination.
SLA (Service Level Agreement) — A promise about when a pipeline's output will be ready; orchestrators can alert when it is missed.
SLA (Service-Level Agreement) — the reliability promise made to data consumers (e.g. "the daily sales table is fresh by 8 a.m. and 99.9% accurate"); the contract for a data product.
SLI/SLO/SLA (data) — Service-Level Indicator (measured number, e.g. hours since update), Service-Level Objective (internal target), and Service-Level Agreement (committed consumer promise) applied to data.
sliding (hopping) window — fixed-size windows that advance by a smaller step, so they overlap and an event can fall in several.
Slim CI — a dbt CI pattern that builds only the models changed in a pull request and their downstream children, selected with state:modified+ by comparing against a stored production manifest.
SLO (Service-Level Objective) — the internal target that keeps a team safely inside its SLA (e.g. "the pipeline completes by 7 a.m."); for data, commonly freshness plus completeness and quality.
Slowly changing dimension (SCD) — A dimension whose attributes change occasionally over time, requiring a deliberate strategy for how history is remembered.
Small files problem — performance degradation from creating very many tiny files, e.g. by partitioning output on a high-cardinality column.
Small-files problem — many tiny files (often from frequent/streaming writes) slowing and inflating queries due to fixed per-file overhead, fixed by compaction.
Snappy — a fast, modest-ratio compression codec; the default Parquet codec in Spark and many other writers, optimized for speed over size.
Snapshot — dbt's implementation of Slowly Changing Dimension Type 2: it captures a source's history going forward by closing the old row version (dbt_valid_to) and inserting a new one when a tracked row changes.
Snapshot expiration — removing old snapshots past a retention window so their now-unreferenced data files can be deleted, trading time-travel range for storage cost (Delta's VACUUM).
Snapshot isolation — each reader pins one immutable snapshot for its whole run, so concurrent writers are never seen mid-change; the basis of consistent reads on the lakehouse.
Snowflake schema — A star schema whose dimensions are normalized into multiple related tables; less redundancy but more joins (distinct from Snowflake the product).
Soda — a data-quality tool expressing checks in a compact DSL (SodaCL), oriented toward scheduled monitoring of warehouse tables, with freshness as a first-class check.
Software-defined asset — A declarative definition of a data object (a table, a file) plus the code that produces it; Dagster's core unit.
Sort-merge join — a join algorithm that sorts both inputs by the join key then merges them in lockstep; ideal when inputs are already sorted or too large for a hash table to fit in memory.
source — any system that holds data you want to pull in (a database, API, file drop, SaaS app, or message queue).
source connector — a Kafka Connect connector that pulls data into Kafka (e.g. Debezium).
Source freshness — a dbt check (dbt source freshness) comparing a source table's newest load timestamp against warn/error thresholds, catching stale upstream data before building on it.
Source system — The place data is born: an app database, a third-party API, an event stream, an uploaded file. Data engineers read from these but rarely own them.
source() — the dbt function used to reference a raw input table declared in YAML; marks DAG entry points and enables freshness checks. Rule of thumb: source() for raw data, ref() for everything dbt builds.
Spark SQL — the SQL interface to Spark DataFrames; queries are optimized by Catalyst just like DataFrame operations.
Sparse primary-key index (ClickHouse) — an index storing one key entry per granule (≈8,192 rows) rather than per row, so it stays small enough to hold in memory for huge tables; the engine binary-searches it to read only the granules that could match, trading cheap single-row lookups (an OLTP need) for fast columnar scan-and-skip.
spark.sql.autoBroadcastJoinThreshold — the size limit (default ~10 MB) under which Spark automatically broadcasts a join side.
spark.sql.shuffle.partitions — the config setting the number of post-shuffle partitions (historically 200); often tuned for very large jobs, increasingly auto-handled by AQE coalescing.
Spill — when an in-memory operation (hash build, sort, aggregation) doesn't fit in a worker's memory and the engine writes the excess to local disk, keeping the query correct but much slower.
Splittability — whether a compressed file can be divided and processed by parallel workers without decompressing the whole file first; Parquet is splittable, a gzipped CSV is not.
Spot (preemptible) instances — heavily discounted compute the cloud can reclaim on short notice, ideal for fault-tolerant, retryable batch work (e.g. Spark) and never for stateful, can't-be-interrupted jobs.
SQL (Structured Query Language) — The near-universal language for asking precise questions of tabular data; the lingua franca of data work.
sqlfluff — the standard open-source, dialect-aware SQL linter and auto-formatter, used to keep a team's SQL consistent and catch mistakes (e.g. in CI).
SQLMesh — a prominent alternative to dbt: same SQL-on-a-DAG philosophy, with native column-level lineage, breaking-vs-non-breaking change detection, virtual data environments, and stronger incremental backfills.
Stage — a chunk of a distributed query plan that runs fully in parallel across workers with no data movement between them; stages are connected by shuffles.
Staging (layer) — the first dbt layer: one model per source table doing only light cleanup (rename, cast, basic null handling); no joins or business logic; usually materialized as views.
Star schema — A fact table surrounded by flat (denormalized) dimension tables joined by keys — the default analytic schema shape.
state backend — where keyed state physically lives (in-memory, or on-disk like RocksDB) and how it's snapshotted.
state:modified — a dbt selector that picks models changed relative to a saved state/manifest; with a trailing + it also includes downstream models (the change's blast radius).
Statistics — summaries the optimizer keeps about data (row counts, distinct values, min/max, null fractions, size) that fuel cost-based decisions; stale stats cause bad plans.
Stitch — an older, simpler managed EL service built on the Singer standard.
Storage — The lifecycle stage (and cross-cutting concern) of keeping data durably and affordably in a usable form.
stream — an unbounded, append-only, time-ordered sequence of events.
Stream processing — Processing an unbounded sequence of events continuously as they arrive; low latency but more complex and costly.
streaming SQL — running continuous SQL queries over streams (e.g. ksqlDB, Flink SQL) instead of writing low-level code.
Strong consistency — A model where every read sees the latest write, as if there were one machine.
Structured data — Data with a fixed known schema (neat rows and columns).
Structured Streaming — Spark's streaming API that runs the same DataFrame code against an unbounded, continuously growing input; the bridge from batch to streaming.
Surrogate key — A meaningless, warehouse-generated key (often an integer or a hash) used as a dimension's primary key instead of the source's natural key; enables stable joins and SCD history.
Table — Rows and columns of structured data; the basic unit of a relational database.
Table (materialization) — a physical table fully dropped and recreated on each dbt run; expensive to build, cheap to query.
Table format — a metadata layer over a collection of data files that makes them behave as one consistent, versioned table (ACID, time travel, schema evolution); not a file format. Examples: Apache Iceberg, Delta Lake, Apache Hudi.
Table-level lineage — lineage recording dependencies between datasets ("table A is built from table B").
Task — A single unit of work in a pipeline (one node of the DAG), such as "load yesterday's orders."
TaskFlow API — Airflow's decorator-based style (@task) for writing tasks as plain Python functions with automatic data passing.
Text format — A human-readable data format (CSV, JSON, XML); bulky and slower to parse.
the log (commit log) — an append-only, ordered, replayable sequence of records; the foundational data structure under most streaming systems.
Third Normal Form (3NF) — 2NF plus no non-key column depends on another non-key column; every column depends on the key, the whole key, and nothing but the key.
this — a Jinja reference to the current model's own table, used in incremental models (e.g. select max(created_at) from {{ this }}) to find the high-water mark already loaded.
Throughput — the rate at which a system can process work, e.g. events or rows per second; an ingestion layer must sustain peak throughput, not average.
Time travel — querying a table as of a past snapshot or timestamp, possible because committing never mutates old files and old snapshots are retained until expired.
Timeliness (freshness) — the data quality dimension measuring whether data is recent enough to be useful.
Tokenization — replacing a sensitive value with a meaningless token that cannot be reversed without access to a separate secured vault holding the real value; the token can still be used as a join key.
tombstone — a special CDC record with a key and null value signaling that the row for that key has been deleted.
Tool-chasing — the career-limiting mistake of jumping to whatever technology is trending without the durable fundamentals under it, leaving an engineer stranded each time a tool is replaced.
Top-level code — Code that runs every time the orchestrator parses a DAG file; heavy work here slows the whole scheduler.
topic — a named, durable category/feed of events that producers write to and consumers read from.
Train/serve skew — the bug where a model feature is computed one way for training and a different way at prediction time, silently degrading the model; a feature store guarantees one definition feeds both.
Transaction — A group of operations treated as one indivisible all-or-nothing unit.
transaction (Kafka) — an atomic group of writes (and offset commits) that either all become visible or none do.
Transaction fact table — A fact table with one row per event at the moment it happens (the flexible default).
Transformation — an operation that describes a new dataset from an existing one (filter, join, groupBy); lazy — it records a step but computes nothing.
trigger-based CDC — using database triggers to record every change into an audit table; captures deletes but taxes the write hot path.
Trino (Presto) — a distributed MPP SQL engine for interactive queries over object storage that can also federate across many data sources; Presto and Trino are siblings from the same project.
tumbling window — fixed-size, non-overlapping, contiguous windows; every event lands in exactly one.
Tungsten — Spark's execution engine; speeds DataFrames with whole-stage code generation, off-heap memory, compact binary formats, and vectorization.
two-phase commit (2PC) — a protocol that coordinates a "prepare" then "commit" across systems so a sink and the stream commit atomically.
UDF (user-defined function) — a custom function applied to data; plain Python UDFs serialize row-by-row across the JVM↔Python boundary and blind Catalyst, so they are slow.
unbounded data — a dataset that has no defined end; events keep arriving forever, so you can never "read all of it" before processing.
Undercurrents — Four concerns that run through every lifecycle stage: security, data management, orchestration, and cost.
unique (test) — asserts a column has no duplicate values.
unique_key — the dbt config naming the column(s) that identify a row, used by merge/delete+insert to match new rows against existing ones.
Uniqueness — the data quality dimension measuring whether each real-world entity is represented exactly once, with no duplicates.
Unity Catalog — Databricks' governance-first catalog (discovery plus permissions, lineage, auditing), now Iceberg-REST-compatible.
UNNEST — the operation (also EXPLODE in Spark, FLATTEN in Snowflake) that turns an array column into one output row per element, evaluated as a lateral join.
Unstructured data — Data with no inherent tabular structure, such as free text, images, or audio.
Update anomaly — The risk of contradictory data when a duplicated fact is updated in some places but not others.
upsert — a single operation that updates a row if its key exists and inserts it if not (SQL MERGE / INSERT ... ON CONFLICT DO UPDATE).
uv / poetry — Tools that pin exact Python dependency versions so an environment is reproducible.
Validity — the data quality dimension measuring whether a value conforms to its defined rules (type, format, range, allowed set).
Variable — A stored, named configuration value an Airflow pipeline can read at runtime.
VARIANT — a flexible column type (notably in Snowflake) that stores semi-structured JSON-like data, navigated with path syntax.
Vector store (vector database) — a system that stores embedding vectors and quickly returns the k most similar to a query vector; examples include pgvector (a Postgres extension, the boring default) and dedicated services like Pinecone.
Vectorized reads — processing a contiguous array of same-typed values in one tight CPU loop (using SIMD); enabled by columnar in-memory layout.
View (materialization) — a saved query storing no data that re-executes on every query; cheap to build, pays cost at query time, always live.
Virtual warehouse — Snowflake's name for an independent compute cluster (sized XS, S, M…, with auto-suspend/resume); BigQuery calls compute "slots," Redshift a "cluster," Databricks a "SQL warehouse."
WAL (Write-Ahead Log) — PostgreSQL's ordered, append-only log in which every change is recorded before it is applied to the data files; used for crash recovery and replication, and read by log-based CDC.
Watermark — a column (often a timestamp) used to identify new rows in an incremental load, e.g. where created_at > the maximum created_at already loaded; a wrong watermark can silently lose late-arriving data.
WHERE vs HAVING — WHERE filters rows before grouping; HAVING filters groups after aggregation.
Wide transformation — a transformation where each output partition depends on many input partitions (groupBy, join, distinct, orderBy); requires a shuffle and forces a stage boundary.
window — a finite slice of an unbounded stream (by time or count) over which an aggregation is computed.
Window frame — the ROWS/RANGE BETWEEN … AND … clause that bounds which rows around the current one a window aggregate sees (e.g. UNBOUNDED PRECEDING AND CURRENT ROW for a running total).
Window function — a SQL function that computes a value across a set of rows related to the current row (defined by an OVER (PARTITION BY … ORDER BY … frame) clause) without collapsing them, so every original row survives alongside the computed column.
Worker (executor) — an MPP node that processes a slice of the data in parallel and streams results back; adding workers scales the query horizontally.
Workload isolation — giving teams or workloads separate compute so one cannot slow down another and each team's spend is tracked separately.
Write amplification — a small logical change causing a large physical write (e.g. rewriting an entire file to change a few rows under copy-on-write).
XCom (cross-communication) — Airflow's mechanism for passing small values between tasks.
YARN — the Hadoop resource manager, a common Spark cluster manager in older/on-prem and EMR setups.
Z-ORDER — reordering rows across files along an interleaved multi-column curve so commonly-filtered values cluster together, improving file skipping.
Zero-copy — sharing the exact same in-memory bytes between tools without serializing/deserializing or copying; enabled when both tools speak Arrow.
ZSTD — a modern compression codec offering near-gzip ratios at near-Snappy speed with a tunable level; an increasingly recommended default.
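
Worked illustration for the RANK / DENSE_RANK and ROW_NUMBER entries: the sketch below reproduces the three numbering schemes in plain Python over a single, already-sorted partition. The function name number_rows and the sample scores are hypothetical, chosen only for illustration; a real warehouse computes these with OVER (PARTITION BY … ORDER BY …), and which tied row gets the lower ROW_NUMBER is engine-dependent.

```python
from typing import Any, List, Sequence, Tuple


def number_rows(sorted_values: Sequence[Any]) -> List[Tuple[int, int, int]]:
    """Return (row_number, rank, dense_rank) for each value.

    Assumes the input is already ordered the way ORDER BY would order it
    and represents one partition.
    """
    result: List[Tuple[int, int, int]] = []
    rank = 0
    dense_rank = 0
    previous = object()  # sentinel that equals no real value

    for row_number, value in enumerate(sorted_values, start=1):
        if value != previous:
            # A new distinct value: RANK jumps to the row position
            # (leaving a gap after ties); DENSE_RANK advances by one.
            rank = row_number
            dense_rank += 1
            previous = value
        # Tied rows keep the same rank and dense_rank, but
        # ROW_NUMBER always increases.
        result.append((row_number, rank, dense_rank))
    return result


# Hypothetical scores, sorted descending; the first two rows tie.
rows = number_rows([100, 100, 90])
for row_number, rank, dense_rank in rows:
    print(row_number, rank, dense_rank)
```

For the tied input above, ROW_NUMBER yields 1,2,3, RANK yields 1,1,3, and DENSE_RANK yields 1,1,2, matching the patterns quoted in the entries.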