Skip to main content

APIs, connectors & ETL vs ELT vs EL

API ingestion is where "just call an API and write a file" goes to die. The call itself is one line. Everything that makes it production-grade — getting all the data, not getting banned, surviving a flaky network, and staying authenticated — is the actual job. We'll work it on ShopFlow's products catalog (ShopFlow — see Meet ShopFlow), which the merchandising team exposes behind a REST API rather than letting us read the table directly — so ingesting products into raw.products means going over HTTP. This lesson covers those four realities, then settles a question that quietly defines the whole modern data stack: where does transformation belong?

The four realities of API ingestion

1. Pagination — you never get it all at once

APIs don't hand you ShopFlow's entire catalog in one response; that would time out and exhaust memory. Instead they paginate — return a page of products plus a way to ask for the next one. The three common styles:

  • Offset/limit: GET /v1/products?limit=100&offset=200 — "skip 200, give me the next 100." Simple, but slow and fragile on large or changing datasets.
  • Page number: GET /v1/products?page=3&per_page=100.
  • Cursor-based: the response includes an opaque next_cursor token you pass to the next request: GET /v1/products?cursor=abc123. This is the most robust for large, live datasets and the modern default.

Your extractor must loop until there are no more pages — follow the cursor, or increment the page, until the API signals the end (an empty page, a null cursor, or a has_more: false flag). Forgetting to loop is the single most common cause of "we're only getting the first 100 rows."

2. Rate limits — go too fast and you're cut off

A rate limit is the vendor's cap on how many requests you may make in a window — say, 100 requests per minute, which the ShopFlow products API enforces. Exceed it and the API returns HTTP 429 (Too Many Requests), often with a Retry-After header telling you how long to wait. A robust extractor respects rate limits proactively (throttling its own request rate so paging through the whole catalog stays under the cap) and reacts to 429s by pausing as instructed.

3. Retries with exponential backoff

Networks fail transiently — a connection resets, a server returns a 503 for two seconds. The wrong response is to give up; the also-wrong response is to instantly retry in a tight loop and hammer a struggling server. The right response is exponential backoff: retry after a delay that doubles each attempt — 1s, 2s, 4s, 8s — usually with a little random jitter added so many clients don't retry in lockstep.

Trace a flaky products endpoint:

GET /v1/products?cursor=xyz → 503 (wait 1s + jitter)
retry → 503 (wait 2s + jitter)
retry → 503 (wait 4s + jitter)
retry → 200 ✓ (got the page; continue)

Retry only idempotent, transient failures (429, 5xx, timeouts). A 400 Bad Request or 401 Unauthorized won't fix itself by retrying — that's a bug or an expired credential, and you should fail loudly instead.

4. Authentication

You must prove who you are on every request. The common shapes:

  • API key — a secret token sent in a header (Authorization: Bearer sk_live_...). Simple; keep it in a secrets manager, never in code.
  • OAuth 2.0 — you exchange credentials for a short-lived access token that expires, plus a long-lived refresh token to mint new ones. Your extractor must detect expiry (a 401) and refresh transparently mid-run.

:::warning These four are the connector Pagination, rate limits, backoff, and auth aren't edge cases bolted onto API ingestion — they are API ingestion. When you "use a connector" instead of writing your own, what you're really buying is someone else having solved these four for a specific source's quirks. We'll weigh that build-vs-buy choice in the next lesson. :::

Building your own: the custom EL extractor

When no connector exists, or you need full control, you write the extractor yourself. In Python the standard tools are requests (the classic synchronous HTTP library) and httpx (a modern alternative supporting async for higher throughput). A custom extractor for ShopFlow's products catalog is just the four realities, coded:

import httpx, time

def extract_products(api_key):
cursor, rows = None, []
with httpx.Client(headers={"Authorization": f"Bearer {api_key}"}) as client:
while True: # 1. paginate: loop until done
params = {"limit": 100}
if cursor:
params["cursor"] = cursor
resp = backoff_get(client, "/v1/products", params) # 3. retries inside
data = resp.json()
rows.extend(data["data"])
cursor = data.get("next_cursor")
if not cursor: # end of pages
break
return rows

The backoff_get helper wraps reality #2 (honor Retry-After on 429) and #3 (exponential backoff on 5xx). This is the heart of what tools like Meltano/Singer standardize (more on those next lesson). The point: a real extractor is a loop with backoff and auth, not a single call.

Loading idempotently: upsert on product_id

Extracting the pages is only half the job — you still have to land them, and landing must be idempotent (the rule from the extraction-patterns lesson). The products API has no notion of "what changed since last time," so a re-run re-fetches products you already have; a flaky network can hand you the same page twice. So you don't INSERT the pages into raw.products — you upsert them, keyed on the catalog's natural key, product_id:

INSERT INTO raw.products (product_id, name, category, unit_price)
VALUES (...) -- the page of products just fetched
ON CONFLICT (product_id) DO UPDATE
SET name = EXCLUDED.name,
category = EXCLUDED.category,
unit_price = EXCLUDED.unit_price;

Now re-running the whole catalog load — after a 429 pause, a retry, or a backfill — leaves raw.products with exactly one row per product_id, updated to the latest values. A price change overwrites in place; a brand-new product is inserted; a product you've already seen is a no-op. Extract faithfully, upsert on the key, and the load is safe to run as often as it must.

ETL vs ELT vs EL: where transformation lives

Now the question that shapes the modern stack. Once data is extracted, you load it and at some point transform it (clean, reshape, join, aggregate). The order of those steps names three architectures.

ETL — Extract, Transform, Load (the classic)

ETL transforms data before loading it: extract from the source, reshape it in a separate processing engine, then load the finished result into the warehouse. This was necessary when warehouse storage and compute were expensive and scarce — you cleaned data outside the warehouse to avoid paying to store raw junk. The cost: transformation logic lives in a separate system, slower to change, and you've thrown away the raw data before loading.

ELT — Extract, Load, Transform (the modern default)

ELT flips the last two steps: extract, load the raw data straight into the warehouse, then transform it inside the warehouse using SQL. Cheap cloud warehouse storage and elastic compute (Snowflake, BigQuery, Databricks) made this the default. The wins: you keep the raw data — ShopFlow's untouched raw.orders/raw.products are replayable, re-transformable anytime without re-extracting — transformations are version-controlled SQL anyone on the team can read, and you lean on the warehouse's massive parallelism. This is what Chapter 7's transformation layer (dbt) is built around, when it turns raw.orders into stg_orders.

EL — just Extract and Load (the connector's job)

In the modern stack, the ingestion tool's only job is ELExtract from the source and Load raw into the warehouse, faithfully and idempotently. Transformation is a separate concern handled downstream by dbt or SQL. This separation is the whole philosophy behind Fivetran, Airbyte, and friends: they do EL well and leave T to the warehouse. When you see "EL," read it as "ELT where someone else owns the L and you own the T."

ExtractTransform\n(separateengine)ExtractLoad\n(raw)

Why it matters

API ingestion is defined by four realities — pagination (loop until the data runs out), rate limits (respect 429s), retries with exponential backoff (survive transient failures without hammering), and auth (keys and expiring OAuth tokens). A custom extractor in requests/httpx is exactly those four coded as a loop; a connector is someone else having solved them for a specific source. And the modern stack settles transformation's home: it moved from ETL (transform before load) to ELT (load raw, transform in-warehouse), so dedicated ingestion tools now do plain EL — faithful extract-and-load — while dbt owns the T. Keep extraction dumb and faithful; make data smart downstream.

Next we compare the connector platforms that do EL for you, and confront the parts they don't save you from — schema drift, dead letters, and backpressure.

Next: Connectors, schema drift & the landing zone →