Connectors, schema drift & the landing zone
You now know the sources, the extraction patterns, CDC, and the four realities of APIs. This lesson covers the operational layer: the platforms that run ingestion for you, the failures they don't save you from, and the storage design that makes the whole thing recoverable. These are the parts most introductions skip — and the parts that decide whether a pipeline survives contact with reality.
Build vs buy: the real trade-off
You can write and operate every extractor yourself, or you can use a managed connector platform that hosts hundreds of pre-built connectors. The usual framing — "Fivetran vs Airbyte" as a tribal allegiance — is wrong. It's a cost vs control vs locality trade-off. Lay out the options honestly:
- Fivetran — fully managed, closed-source, polished. You click, it syncs. You pay by MAR (Monthly Active Rows — roughly, the number of distinct rows changed per month). Maximum convenience, least control, consumption-based cost that can surprise you at scale.
- Airbyte — open-source with a managed cloud option. Hundreds of community connectors; you can self-host it (data never leaves your network — locality) and even write custom connectors. More control and potentially far cheaper, at the price of operating it yourself.
- Stitch — a simpler, older managed EL service (built on the Singer standard).
- Meltano / Singer — Singer is an open specification for connectors (sources are "taps," destinations are "targets," speaking a common JSON protocol); Meltano is an open-source orchestrator for running them. Maximum control and zero per-row vendor fee — you own the code and the compute.
- Estuary Flow — a newer platform centered on real-time CDC and streaming EL.
- Apache NiFi — a flow-based, drag-and-connect data-routing tool, strong in on-prem and IoT/edge ingestion.
- AWS DMS — managed database migration and CDC, AWS-native.
- Build-your-own — the
requests/httpxextractor from the last lesson. Total control, total maintenance burden.
The honest decision rule:
| You should... | When... |
|---|---|
| Buy (Fivetran, managed Airbyte) | The source is a popular SaaS with a fiddly API, your team is small, and engineer time is worth more than the per-row bill. |
| Self-host (Airbyte, Meltano) | Cost at your volume is painful, data can't leave your network for compliance, or you need custom connectors. |
| Build | No connector exists, or the source is core enough that you want to own every line. |
There is no universally right answer — only the right answer for your volume, your compliance constraints, and your team size. Re-evaluate as those change.
Kafka Connect: the streaming ingestion backbone
For streaming and CDC, the connector framework you'll meet most is Kafka Connect — a standardized runtime for moving data in and out of Kafka without writing custom code. It defines two connector roles:
- Source connectors pull data into Kafka (e.g., Debezium runs as a Kafka Connect source connector, streaming database changes onto topics).
- Sink connectors push data out of Kafka into a destination (a warehouse, S3, Elasticsearch).
Kafka Connect handles the operational scaffolding — scaling across workers, tracking offsets (how far each connector has read, so it can resume exactly where it left off after a restart), and retries. Confluent Cloud is the dominant managed Kafka + Connect offering. This is the backbone of real-time EL, and it's where CDC from the previous lesson actually lands. (Streaming itself is Chapter 9.)
Schema drift: the silent breaker
Here is the failure every long-running pipeline eventually hits. Schema drift is when a source's structure changes over time without coordination — a column is added, renamed, dropped, or has its type changed (an int becomes a string). The vendor ships it on a Tuesday; your load, which expected the old shape, breaks (or worse, silently mangles data) on Wednesday.
ShopFlow gives the textbook case (ShopFlow — see Meet ShopFlow): the store rolls out promo codes, and an engineer adds a new discount column to the source orders table. Your orders → raw.orders load was written for the old five-column shape. What happens next is entirely up to how your ingestion handles drift.
How mature ingestion handles drift:
- Additive changes (the new
orders.discountcolumn) are usually absorbed automatically — good connectors detect the new field and add it toraw.orders, leaving older rows null fordiscount. The load keeps running; the column simply appears downstream. - Breaking changes (dropping
orders.amount, or changingamountfromdecimaltostring) need a policy: fail loudly, quarantine the bad records, or coerce — but never silently corrupt ShopFlow's revenue figures.
Schema registry and contract enforcement
The disciplined defense is a schema registry — a central service that stores the agreed-upon schema for each data stream and validates every record against it at the boundary. Confluent Schema Registry (using formats like Avro, Protobuf, or JSON Schema) is the canonical example. With a registry, a producer can't publish a record that violates the registered schema, and consumers know exactly what shape to expect.
This is contract enforcement at the boundary: the schema is a contract between the source and your platform, checked as data enters rather than discovered three transformations later when a dashboard goes blank. Catching a bad shape at the front door — instead of debugging a corrupted report a week later — is the entire value. (Data contracts and quality run deeper in Chapter 11.)
Partial failure: dead letters and backpressure
Real ingestion runs into bad records and overwhelmed destinations. Two patterns handle them.
Dead-letter queues
In a batch of 100,000 records, three will be malformed — unparseable JSON, a violated schema, a missing required field. You face a choice: fail the whole batch (one bad record blocks 99,997 good ones — unacceptable) or drop the bad ones silently (you lose data and never know — also unacceptable). The answer is a dead-letter queue (DLQ): a separate holding area where records that can't be processed are set aside with their error, so the good records flow through and the failures are preserved for inspection, fixing, and replay. Process what you can; quarantine what you can't; never silently lose data.
Backpressure and rate limiting
Backpressure is what a system does when data arrives faster than it can be processed: instead of buffering until it runs out of memory and crashes, it signals "slow down" back up the chain — the consumer's pace throttles the producer. A streaming system reading a Kafka topic naturally applies backpressure by simply reading slower; the unread events wait durably in the log. The same instinct on the source side is rate limiting — deliberately capping your own request rate so you don't overwhelm a source API (the 429 story from the last lesson) or your own destination. Both are about matching speed to the slowest link instead of pretending it doesn't exist.
The landing zone: raw, immutable, replayable
Everything in this chapter converges on one storage pattern. A landing zone (often called the raw or bronze layer) is the first place ingested data lands — for ShopFlow, exactly the raw.orders, raw.customers, raw.products, and raw.order_items tables this whole chapter has been filling. It is append-only and immutable: you add raw records exactly as received and never edit or delete them in place.
Why design it this way?
- Replayability. Because ShopFlow's raw, unmodified source data is preserved forever in
raw.*, you can re-run all downstream transformations from scratch whenever you fix a bug or add a new model — without re-extracting from the store database. The raw layer is your re-do button. - Auditability & debugging. When ShopFlow's revenue number looks wrong, you can trace it back to the exact order bytes that arrived in
raw.orders. Nothing was overwritten, so the evidence is intact. - Decoupling. Ingestion's only job becomes "land the raw data faithfully" (the EL from the last lesson). All cleaning and shaping happens after, in the transform layer — so a transformation bug never costs you the source data.
This append-only bronze layer is the foundation of the medallion architecture (bronze → silver → gold) you'll see again in Chapter 7 and the lakehouse chapter. Ingestion fills bronze; transformation refines it forward.
Why it matters
Choosing a connector platform is a cost / control / locality trade-off, not a religion: Fivetran buys convenience for a per-row bill, Airbyte/Meltano trade operational effort for control and lower cost, and build-your-own trades maximum maintenance for total ownership. Kafka Connect (with Debezium as a source connector) is the streaming EL backbone, resuming exactly via offsets. The genuinely hard parts no platform fully removes are schema drift (defend with a schema registry and contract enforcement at the boundary), partial failure (a dead-letter queue so one bad record never blocks the batch or vanishes silently), and backpressure (match speed to the slowest link). And all of it lands in an append-only, immutable landing zone — the raw bronze layer that makes your entire pipeline replayable, auditable, and safe to rebuild. Master these and you've mastered the part of ingestion that actually breaks.
Next: Chapter 6 checkpoint →