Chapter 6 · Ingestion & Integration

Every pipeline begins with the same unglamorous step: getting data in. Before you can model, transform, or analyze anything, the raw bytes have to travel from wherever they were born — a production database, a payments API, a vendor's nightly file drop, a stream of clickstream events — into a place your tools can reach. That movement is ingestion, and the broader job of wiring many sources together is integration.

This is the chapter where ShopFlow's raw.* tables are born (see Meet ShopFlow). The store's web app writes orders, customers, products, and line items into its operational database; here we pull those source tables into our platform and land them, untouched, as raw.orders, raw.customers, raw.products, and raw.order_items. Every later chapter builds on these landed copies, so getting ingestion right is the foundation the whole pipeline stands on.

It is tempting to wave this away as "just call an API and write a file." That framing is one of the most expensive mistakes in data engineering. The hard parts are not the happy path; they are the realities: a source that hands you the same ShopFlow order twice, a re-run that double-counts revenue, a connection that drops at order 4 million of 5 million, a vendor who quietly adds a column (a discount on orders) on a Tuesday and breaks your load on Wednesday, an API that rate-limits you after 100 requests for the products catalog, and a "load every order every night" job whose runtime grows from minutes to hours to never finishing. This chapter is about engineering through those realities deliberately.

The durable idea

Make ingestion incremental and idempotent: move only the data that changed, and make re-running a load safe — so the same source events always produce the same result, no matter how many times you retry.

Two ideas do the heavy lifting and will outlive every tool in this chapter:

Incremental — after the first load, only pull what is new or changed since last time (by a timestamp, an ID, or a database change log). Reloading everything forever does not scale.
Idempotent — running the load once or running it five times leaves the destination in the same correct state. Networks fail, jobs retry, backfills re-run; idempotency is what keeps those retries from corrupting your data.

Everything else — the specific connectors, the SaaS platforms, the streaming brokers — is dated machinery layered on top of these two durable principles. Learn the principle; the tool of the month becomes easy.
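To make the two principles concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is an illustration under stated assumptions, not the pipeline built later in the guide: the in-memory databases stand in for ShopFlow's operational database and our platform, raw_orders stands in for raw.orders, and the total and updated_at columns are hypothetical. The load reads a high watermark from the destination (the incremental part) and writes with a keyed upsert (the idempotent part).

```python
import sqlite3


def incremental_load(src, dst):
    """Copy only rows changed since the last run; safe to re-run."""
    # High watermark: the newest updated_at already landed (NULL on the first run).
    (watermark,) = dst.execute("SELECT MAX(updated_at) FROM raw_orders").fetchone()
    # Use >= rather than > so rows sharing the watermark timestamp are never skipped.
    rows = src.execute(
        "SELECT order_id, total, updated_at FROM orders WHERE updated_at >= ?",
        (watermark or "",),
    ).fetchall()
    # Upsert keyed on order_id: replaying a row overwrites it instead of duplicating it.
    dst.executemany(
        "INSERT OR REPLACE INTO raw_orders (order_id, total, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    dst.commit()
    return len(rows)


# Hypothetical stand-ins for the source database and the landing destination.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
ddl = "(order_id INTEGER PRIMARY KEY, total REAL, updated_at TEXT)"
src.execute("CREATE TABLE orders " + ddl)
dst.execute("CREATE TABLE raw_orders " + ddl)
src.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 20.0, "2024-01-01T10:00:00"), (2, 35.5, "2024-01-01T11:00:00")],
)

incremental_load(src, dst)  # first run: loads everything
incremental_load(src, dst)  # retry: re-pulls the watermark row, creates no duplicates

# The source changes: one order is updated and one is new.
src.execute(
    "UPDATE orders SET total = 40.0, updated_at = '2024-01-02T09:00:00' WHERE order_id = 2"
)
src.execute("INSERT INTO orders VALUES (3, 12.0, '2024-01-02T09:30:00')")
incremental_load(src, dst)  # pulls only the two changed rows
```

The inclusive >= comparison deliberately re-reads rows at the watermark, which is only safe because the write is an upsert: the two principles lean on each other. This sketch keeps one current row per order; an append-only landing table with a dedup key is another way to get the same re-run safety.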

What this chapter covers

Sources: where data actually comes from → — relational databases, REST and GraphQL APIs, files and SFTP, SaaS connectors, and message queues, and how each one shapes your ingestion strategy.
Extraction patterns: full, incremental & idempotent → — full loads, incremental high-watermark extraction, dedup keys, and upserts that make re-runs safe.
Change data capture (CDC) → — query-based vs trigger-based vs log-based CDC, Debezium, the insert/update/delete envelope, and tombstones.
APIs, connectors & ETL vs ELT vs EL → — pagination, rate limits, retries with backoff, auth, and where transformation belongs in the modern stack.
Connectors, schema drift & the landing zone → — Fivetran vs Airbyte vs build-your-own, Kafka Connect, schema registries, dead-letter queues, backpressure, and the append-only raw layer that makes everything replayable.
Chapter 6 checkpoint → — lock it in.

How this connects

Ingestion lands data into the storage and file formats from Chapter 2 — typically an append-only raw zone in object storage or a warehouse. From there, Chapter 7's transformation layer reshapes it, and Chapter 8's orchestration schedules and retries the loads. The streaming half of ingestion deepens in Chapter 9. Ingestion is the front door; the rest of the guide is the house.

Let's start at the source.

Next: Sources: where data actually comes from →
