Sources: where data actually comes from

Before you can choose how to ingest, you have to understand what you're ingesting from. A source is any system that holds data you want to pull into your platform. The mistake beginners make is treating all sources the same — "data is data." It isn't. A live production database, a paginated web API, and a nightly CSV drop have radically different shapes, and that shape decides almost everything: how you detect change, how fast you can go, what fails, and how you recover.

This lesson is a tour of the five families you'll meet over and over. Throughout, our source is ShopFlow (ShopFlow — see Meet ShopFlow): its orders, customers, products, and order_items live in the store's operational database, with its products catalog also reachable through a REST API. Ingesting them is exactly what fills the raw.* landing tables this chapter is about.

1. Relational databases (the operational store)

The most common source is an OLTP database — Online Transaction Processing, the live PostgreSQL or MySQL database powering an application. It holds the system of record: users, orders, payments, inventory. ShopFlow's web app writes to exactly such a database — its orders, customers, products, and order_items tables are the system of record for the store. (We met OLTP vs OLAP in Chapter 1.)

Two things make databases special as a source:

They change in place. A ShopFlow orders row's status flips from placed to shipped; the old value is gone. Unlike a log of events, a database table shows you only the current state. To capture history, you have to detect when rows change — the whole problem of incremental extraction and CDC, covered in the next two lessons.
You can read them directly with SQL. You can issue SELECT * FROM orders WHERE updated_at > '2026-06-23' against ShopFlow's orders table and stream the result. This is powerful but dangerous: a heavy analytical scan against the live store database can slow down the very app serving its customers. Best practice is to read from a read replica — a copy the database keeps in sync precisely so you can query it without touching the primary.

:::warning Don't hammer the primary Pointing a full-table extract at a busy production primary is a classic incident. Read from a replica, throttle your reads, or use log-based CDC (next lesson) which reads the database's change log instead of querying tables at all. :::

2. REST and GraphQL APIs (someone else's system)

When the data lives in a system you don't own — Stripe, Salesforce, HubSpot, GitHub — you usually reach it through an API (Application Programming Interface), a defined set of HTTP endpoints the vendor exposes for programmatic access. ShopFlow's products catalog is served behind such a REST API (the merchandising team owns it, not us), so we ingest products over HTTP rather than reading the table directly — the APIs lesson works that example end to end.

A REST API offers endpoints like GET /v1/products that return JSON. You ask for a resource; you get a page of records back.
A GraphQL API exposes a single endpoint where you describe exactly which fields you want in a query, and the server returns just those. It avoids over-fetching but shifts complexity to you.

APIs are the hardest easy-looking source. The endpoint is one line; the realities are not. You must handle pagination (results come in pages, not all at once), rate limits (the vendor caps how many requests per minute you may make), retries with backoff (transient failures need re-attempting without hammering), and authentication (API keys, OAuth tokens that expire). We dedicate a full lesson to these in APIs, connectors & ETL vs ELT vs EL.

3. Files and SFTP (the batch drop)

A huge amount of real-world data still arrives as files: a CSV of yesterday's transactions, a Parquet export, a fixed-width legacy mainframe extract, a JSON dump. They land in object storage (S3, GCS, Azure Blob) or on an SFTP server — SSH File Transfer Protocol, a secure way for a partner to drop files onto a server you both can reach.

Files feel simple, and the easy cases are. The realities:

Detecting new files. You need to know which files are new since last run — usually by listing a folder and tracking which filenames or modification times you've already processed (a watermark, see the next lesson).
Partial and late files. A file may still be uploading when you read it, or arrive a day late. Many teams require a manifest or a _SUCCESS marker file that signals "this batch is complete and safe to read."
Format drift. The vendor changes the column order, the delimiter, or the encoding without telling you. (More on this under schema drift in the connectors lesson.)

4. SaaS connectors (the managed shortcut)

For popular SaaS sources — Salesforce, Google Analytics, NetSuite, Zendesk — you rarely write the API code yourself. A connector is pre-built, maintained integration code that knows a specific source's API quirks (its pagination style, its rate limits, its schema) and pulls the data for you. Platforms like Fivetran and Airbyte are catalogs of hundreds of such connectors.

The trade-off is control vs maintenance: a managed connector saves you from re-learning Salesforce's notoriously fiddly API, but you pay for it (often by volume) and you inherit the vendor's choices about schema and refresh behavior. We weigh build-vs-buy in the connectors lesson.

5. Message queues and event streams (data in motion)

The previous four sources are data at rest — you go and fetch it. The fifth is data in motion: a continuous stream of events pushed onto a message queue or event log as they happen. The dominant example is Apache Kafka, a durable, replayable log of events (clicks, orders, sensor readings) that producers append to and consumers read from. ShopFlow has exactly such a stream — its order_events topic, where every placed/paid/shipped/cancelled action is emitted as it happens. We meet that stream as a source family here; it gets its full treatment in Chapter 9.

Streaming sources flip the model: instead of polling a database on a schedule, you subscribe and receive events as they're produced. This is the backbone of real-time ingestion and the natural landing spot for CDC events (a database's changes published as a stream). Streaming gets a full treatment in Chapter 9; here we care that it's a source family with its own ingestion pattern — consume-and-commit-offset rather than query-and-watermark.

The mental model: at rest vs in motion

Every source ultimately lands in the same place — an append-only landing zone (covered in the connectors lesson) — but how you get it there differs by family.

Why it matters

The source's shape dictates your entire strategy. Databases change in place, so you need change detection (watermarks or CDC) and you must protect the primary. APIs look trivial but bury the work in pagination, rate limits, retries, and auth. Files need new-file detection and completeness signals. SaaS sources are usually best reached through maintained connectors, trading cost for not owning the API's quirks. Message queues invert the model entirely — you subscribe to a stream instead of polling. Name your source family first, and the right pattern follows.

Next we look at the most fundamental decision once you've picked a source: do you reload everything, or only what changed?

Next: Extraction patterns: full, incremental & idempotent →

1. Relational databases (the operational store)​

2. REST and GraphQL APIs (someone else's system)​

3. Files and SFTP (the batch drop)​

4. SaaS connectors (the managed shortcut)​

5. Message queues and event streams (data in motion)​

The mental model: at rest vs in motion​

Why it matters​