Change data capture (CDC)
The high-watermark pattern from the last lesson has a blind spot: it only sees what a timestamp column tells it. It can miss late-arriving rows, it can't reliably catch deletes (a deleted ShopFlow order has no updated_at to query — it's just gone), and it forces you to repeatedly scan the orders table. Change data capture fixes all three.
Change data capture (CDC) is the practice of detecting and streaming every change — every insert, update, and delete — to a database's data, as it happens, so a downstream system can stay in sync. For ShopFlow (ShopFlow — see Meet ShopFlow), CDC on the source orders and customers tables means every new order, every status flip from placed to shipped, every cancelled order removed, and every customer who changes city, flows into raw.orders/raw.customers as it happens. Instead of asking "what looks new?", CDC tells you exactly what changed, in the exact order it happened — which is why it beats repeated full loads of orders: no re-scanning millions of unchanged orders every night, and deletes and intermediate states you'd otherwise miss are captured faithfully.
There are three ways to do CDC, and they are not interchangeable. A guide that doesn't distinguish them leaves you unable to choose. Let's build all three.
Mechanism 1: query-based CDC
Query-based CDC is the high-watermark pattern you already know, dressed up. You periodically run SELECT * FROM orders WHERE updated_at > :last_seen. It's "CDC" only in the loosest sense — you're inferring changes from a column.
- Pros: trivial to set up, works on any database, no special permissions.
- Cons: misses deletes entirely (a cancelled ShopFlow order that gets hard-deleted never shows up in a query), misses intermediate states (if an order's
statuswentplaced → paid → shippedbetween polls, you only seeshipped), adds query load to the store database, and inherits all the watermark boundary bugs.
It's the right tool for low-stakes, append-mostly tables. It's the wrong tool when deletes matter or you need a faithful replica.
Mechanism 2: trigger-based CDC
A database trigger is a small piece of code the database runs automatically whenever a row is inserted, updated, or deleted. Trigger-based CDC installs triggers on the source tables that write a record of every change into a separate audit/shadow table, which you then ingest.
- Pros: captures inserts, updates, and deletes; captures every change, not just the latest; works without database-admin-level log access.
- Cons: the triggers fire inside every write transaction, adding latency and load to the production database on its hot path — exactly where you least want overhead. They're also invasive to install and maintain, and a buggy trigger can break writes.
Trigger-based CDC is a legacy workhorse: capable, but it taxes the very system it watches.
Mechanism 3: log-based CDC (the modern default)
Here's the key insight. Every relational database already writes down every change it makes, for its own survival. PostgreSQL calls it the WAL (Write-Ahead Log); MySQL calls it the binlog (binary log). This log is an ordered, append-only record of every committed transaction — it's how the database recovers after a crash and how replicas stay in sync.
Log-based CDC reads that log and turns each entry into a change event. You're not querying tables at all — you're tailing the database's own change journal.
- Pros: captures inserts, updates, and deletes in exact commit order; near-real-time; almost zero load on the source because reading the log is cheap and off the query hot path; sees every intermediate change, not just snapshots.
- Cons: requires configuring the database to expose its log (and the right permissions); the log format is engine-specific, so you generally use a tool rather than parse it yourself.
This is why log-based CDC is the modern default for keeping a warehouse or lake in sync with a production database. It's faithful, ordered, real-time, and gentle on the source.
Debezium: the open-source log reader
Debezium is the dominant open-source CDC engine. It connects to PostgreSQL, MySQL, MongoDB, SQL Server, Oracle and others, reads their logs, and emits a standardized stream of change events — typically onto Kafka (via Kafka Connect, covered in the connectors lesson). Managed CDC services like AWS DMS (Database Migration Service), Fivetran, and Estuary Flow do the same job as a hosted product. The principle is identical; the operational burden differs.
The change envelope: insert, update, delete
A CDC event isn't just a row — it's a change envelope describing what kind of change happened and carrying enough context to apply it. Here is a simplified Debezium-style envelope for ShopFlow order 102 shipping:
{
"op": "u",
"before": { "order_id": 102, "status": "placed", "amount": 50 },
"after": { "order_id": 102, "status": "shipped", "amount": 50 },
"ts_ms": 1750641000000
}
op— the operation:c(create/insert),u(update),d(delete),r(read, used during the initial snapshot).before— the row's state before the change (null for inserts).after— the row's state after the change (null for deletes).ts_ms— when it happened.
Downstream, you apply each event to raw.orders by the order's key (order_id): a c or u becomes an upsert (there's the idempotency from the last lesson — applying the same event twice is safe), and a d becomes a delete.
Tombstones: how deletes propagate
A delete is the subtle one. When a ShopFlow order is hard-deleted at the source, log-based CDC emits a d event (with the before image and a null after), often followed by a tombstone — a special record with the order's key (order_id) and a null value. In a log-compacted Kafka topic, the tombstone signals "this key is dead; the latest known state of this order is deleted," which lets compaction eventually drop it and tells consumers to remove the row from raw.orders.
Tombstones are the reason log-based CDC can faithfully replicate deletes while query-based CDC cannot. If your raw.orders must reflect deletes (and for an accurate replica of ShopFlow's order book, it must), this is precisely why you reach for log-based CDC over a watermark.
:::tip Snapshot, then stream
A fresh CDC pipeline starts with an initial snapshot — a one-time full read of the existing tables (emitted as r events) — and then switches to tailing the log for ongoing changes. That's how the destination begins as a complete copy and stays continuously in sync. The handoff from snapshot to streaming, done without missing or duplicating events, is the genuinely hard engineering that tools like Debezium handle for you.
:::
Choosing a mechanism
| Query-based | Trigger-based | Log-based | |
|---|---|---|---|
| Captures deletes | ✗ | ✓ | ✓ |
| Captures every change | ✗ (snapshots) | ✓ | ✓ |
| Load on source | Medium (queries) | High (hot path) | Very low |
| Latency | Poll interval | Near-real-time | Near-real-time |
| Setup difficulty | Trivial | Invasive | Needs log access |
| Typical use | Append-mostly tables | Legacy / no log access | Modern default |
Why it matters
CDC moves you from guessing what changed to knowing it. Query-based CDC is the watermark in disguise — easy, but blind to deletes and intermediate states. Trigger-based CDC catches everything but taxes the production hot path. Log-based CDC reads the database's own WAL/binlog — capturing inserts, updates, and deletes in exact commit order, in near-real-time, with almost no load on the source — which is why it's the modern default, with Debezium the canonical engine. The change envelope (op/before/after) and tombstones are how those changes — especially deletes — faithfully reach your warehouse, applied idempotently by key. Distinguish the three mechanisms and you can actually choose; conflate them and you'll pick wrong.
Next we turn from databases to the other big source — APIs — and settle where transformation belongs: ETL, ELT, or just EL.