Chapter 6 checkpoint

You've gone from "just call an API and write a file" to the real engineering of ingestion. Along the way you've watched ShopFlow's raw.orders, raw.customers, raw.products, and raw.order_items take shape (see Meet ShopFlow) through incremental extraction, log-based CDC, idempotent loads, API survival tactics, and the operational layer that keeps it all running. Recall the throughline, then take the quiz.

The throughline

  • Sources have shapes. Databases change in place (need change detection); APIs bury the work in pagination/rate limits/backoff/auth; files need new-file and completeness detection; SaaS is reached via connectors; message queues invert the model to subscribe-not-poll.
  • Incremental beats full once a table is large and growing — pull only what changed via a high-watermark (updated_at or a monotonic ID).
  • Idempotency is non-negotiable because jobs re-run: a dedup key + upsert makes one run and five runs produce the same state. That's the honest route to "exactly-once" — deliver at-least-once, deduplicate on write.
  • CDC comes in three flavors: query-based (blind to deletes), trigger-based (taxes the hot path), and log-based (reads the WAL/binlog — captures inserts, updates, and deletes in commit order, near-real-time, gentle on the source). Debezium is the canonical engine; tombstones are how deletes propagate.
  • Transformation moved downstream: ETL → ELT → plain EL. Ingestion tools faithfully extract-and-load raw; dbt owns the T.
  • The operational layer is the hard part: build/buy is a cost/control/locality trade-off; schema drift needs a schema registry and boundary contracts; dead-letter queues isolate bad records; backpressure matches speed to the slowest link; the append-only landing zone makes everything replayable.
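As a refresher on how the high-watermark and the dedup key + upsert fit together, here is a minimal sketch using an in-memory SQLite database. The table names (src_orders as the source, raw_orders standing in for raw.orders), the columns, and the helper functions are all assumptions made for illustration, not ShopFlow's actual schema or pipeline. It also assumes updated_at is stored as ISO-8601 text, so string comparison matches time order.

```python
import sqlite3

# Upsert keyed on order_id (the dedup key): insert new rows, overwrite existing ones.
UPSERT_SQL = """
INSERT INTO raw_orders (order_id, status, updated_at)
VALUES (?, ?, ?)
ON CONFLICT(order_id) DO UPDATE SET
    status = excluded.status,
    updated_at = excluded.updated_at
"""

def make_db():
    """Create a hypothetical source table and its raw landing copy."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE src_orders (order_id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT);
        CREATE TABLE raw_orders (order_id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT);
    """)
    return conn

def extract_changed(conn, watermark):
    """Incremental extraction: pull only rows changed after the watermark."""
    return conn.execute(
        "SELECT order_id, status, updated_at FROM src_orders "
        "WHERE updated_at > ? ORDER BY updated_at, order_id",
        (watermark,),
    ).fetchall()

def run_once(conn, watermark):
    """One ingestion run: extract, upsert, and return the new watermark."""
    rows = extract_changed(conn, watermark)
    with conn:  # commit the whole batch, or roll it back on error
        conn.executemany(UPSERT_SQL, rows)
    return max((r[2] for r in rows), default=watermark)

conn = make_db()
conn.executemany(
    "INSERT INTO src_orders VALUES (?, ?, ?)",
    [(1, "placed", "2024-01-01T10:00:00"), (2, "placed", "2024-01-01T11:00:00")],
)

wm = run_once(conn, "")        # first run loads everything
for _ in range(5):             # simulate retries of the same window
    run_once(conn, "")         # re-running leaves raw_orders unchanged

# An in-place update in the source moves updated_at past the watermark.
conn.execute(
    "UPDATE src_orders SET status = 'shipped', updated_at = '2024-01-02T09:00:00' "
    "WHERE order_id = 1"
)
wm2 = run_once(conn, wm)       # only order 1 is pulled this time

final = conn.execute(
    "SELECT order_id, status FROM raw_orders ORDER BY order_id"
).fetchall()
print(final)
```

The retries in the loop deliver the same rows again (at-least-once), and the upsert absorbs them, so one run and six runs leave raw_orders in the same state. Like any query-based approach, this sketch would not notice a row deleted from the source.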

Quiz

Chapter 6 — Ingestion & Integration

Passed? You can now get data in the right way — incrementally, idempotently, and replayably. Next we pick up that raw landing-zone data and refine it: the transformation layer and the modern data stack.

Next: Chapter 7: Transformation & the Modern Data Stack →