Governance, privacy & cost
"Governance" is the word that makes engineers groan — it sounds like committees, spreadsheets, and policy PDFs nobody reads. Reframe it: data governance is the set of controls that answer four operational questions about every dataset — Can people find it? Who's allowed to touch it? Are we allowed to keep it? What is it costing us? Every one of those is something you implement with real mechanisms (catalogs, access policies, masking, retention jobs, cost attribution), not paperwork. This lesson is the implementer's tour. Treating governance as paperwork instead of as code-and-controls is the gap that gets data teams fined, breached, or bankrupted by a runaway query bill.
Find it: catalogs, metadata, and discovery
In a platform with thousands of tables, the first failure is mundane: nobody can find the right data, or knows if they can trust it. Three ShopFlow analysts each build a slightly different "revenue" table because none could find the others' — when fact_sales (ShopFlow — see Meet ShopFlow) was the certified answer all along. The fix is a data catalog.
A data catalog is a searchable inventory of an organization's data assets and their metadata — data about the data. Metadata it holds:
- Technical: schemas, types, row counts, freshness, lineage (11.5).
- Business: a plain-language description, a business glossary (the agreed definition of terms like "active customer" or "net revenue," so "revenue" means one thing org-wide), and which assets are certified/trusted.
- Operational & social: ownership, popularity, quality scores, sample queries.
Discovery is the search-and-browse experience on top — a ShopFlow analyst searching "revenue by day" gets the certified fact_sales mart, its owner, its freshness, its lineage back to raw.orders, and a description (grain: one order line item; measures: quantity, unit_price, line_revenue), so they reuse the trusted asset instead of reinventing a wrong one.
Ownership and stewardship are the human spine of all governance. A data owner is accountable for a dataset (usually a team/lead); a data steward is responsible for its quality, documentation, and proper use day-to-day. Every alert (11.4), every contract (11.3), every impact notification (11.5) needs a who to route to — that who is defined here. Governance without ownership is alerts firing into the void.
Catalog tools (this list will date; the concepts won't): OpenMetadata and DataHub (open-source), Atlan and Collibra (enterprise SaaS), Unity Catalog (Databricks' governance + catalog layer), and Apache Atlas (the Hadoop-era original). They store the glossary, render lineage, and increasingly host data contracts.
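To make the catalog idea concrete, here is a minimal in-memory sketch of a catalog entry and a discovery search that ranks certified assets first. It is not the data model of any of the tools above; the `CatalogEntry` fields, the owner and steward names, and the `tmp_revenue_v2` table are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One asset's metadata: technical, business, and operational."""
    name: str
    description: str
    owner: str                      # accountable team (hypothetical values below)
    steward: str                    # day-to-day quality and documentation
    certified: bool = False
    upstream: list[str] = field(default_factory=list)  # one hop of lineage


def search(catalog: list[CatalogEntry], query: str) -> list[CatalogEntry]:
    """Return entries matching every query term, certified assets first."""
    terms = query.lower().split()
    hits = [
        e for e in catalog
        if all(t in f"{e.name} {e.description}".lower() for t in terms)
    ]
    # sorted() is stable, so entries with the same certification keep catalog order.
    return sorted(hits, key=lambda e: not e.certified)


catalog = [
    CatalogEntry("tmp_revenue_v2", "ad hoc revenue by day", "unknown", "unknown"),
    CatalogEntry(
        "fact_sales",
        "Certified revenue by day. Grain: one order line item. "
        "Measures: quantity, unit_price, line_revenue.",
        owner="analytics-eng",
        steward="sales-data-steward",
        certified=True,
        upstream=["raw.orders"],
    ),
]

results = search(catalog, "revenue by day")
```

Here `results[0]` is the certified fact_sales entry, with its owner and its upstream raw.orders attached, so the analyst lands on the trusted asset before the ad hoc one.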
Touch it: access control
Once data is findable, the next question is who may read or change what. Two models dominate.
- RBAC (Role-Based Access Control). Permissions attach to roles, and users get roles. "The analyst role can read fact_sales and dim_product; the pii_approved role can also read the email column of dim_customer." Simple, auditable, the default everywhere. It scales by role, not per-person.
- ABAC (Attribute-Based Access Control). Access is decided by attributes of the user (department, clearance, region), the data (tags like pii, confidential), and context (time, location), evaluated as policy. "Allow read if user.region == data.region and user.clearance >= data.sensitivity." More expressive and dynamic; one tag-driven policy can cover thousands of tables without enumerating each.
:::note RBAC vs ABAC — the durable distinction
RBAC = "what role are you in?" (static, simple, the workhorse). ABAC = "what attributes do you, the data, and the context have?" (dynamic, policy-driven, scales to fine-grained and per-region rules). Modern platforms blend them: roles for the broad strokes, attribute/tag-based policies for fine-grained and data-aware rules (e.g. "anything tagged pii is masked unless you're pii_approved"). This connects to the Security guide's "identity is the new perimeter" — access is the real boundary.
:::
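The two models can be sketched side by side. This is an illustrative in-memory check, not any warehouse's grant system: `ROLE_GRANTS`, `User`, and `Dataset` are hypothetical names, and numeric clearance and sensitivity levels are an assumption.

```python
from dataclasses import dataclass

# RBAC: permissions attach to roles; users hold roles.
# A grant is (table, column); "*" means every column of the table.
ROLE_GRANTS = {
    "analyst": {("fact_sales", "*"), ("dim_product", "*")},
    "pii_approved": {("dim_customer", "email")},
}


def rbac_allows(roles: set[str], table: str, column: str = "*") -> bool:
    """True if any of the user's roles grants the table or the specific column."""
    granted = set().union(*(ROLE_GRANTS.get(r, set()) for r in roles))
    return (table, "*") in granted or (table, column) in granted


# ABAC: the decision is a policy over attributes of the user and the data.
@dataclass(frozen=True)
class User:
    region: str
    clearance: int


@dataclass(frozen=True)
class Dataset:
    region: str
    sensitivity: int


def abac_allows(user: User, data: Dataset) -> bool:
    """The policy quoted above: same region and sufficient clearance."""
    return user.region == data.region and user.clearance >= data.sensitivity
```

Note the difference in how each one scales: adding a table to RBAC means adding a grant, while the ABAC policy already covers any new dataset that carries the `region` and `sensitivity` attributes.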
Table-level grants are too coarse for real privacy. Two finer controls:
- Row-level security (RLS): different users see different rows of the same table. A ShopFlow regional manager queries fact_sales joined to dim_customer and silently sees only their region's rows (US-West, say): one table, policy-filtered per user.
- Column-level security: restrict access to specific columns. Everyone can read dim_customer (for region, city, lifetime_value), but only pii_approved roles see the email column; others get it hidden or masked.
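Both controls amount to a filter applied between storage and the caller. The sketch below simulates that on a small in-memory dim_customer; in a real warehouse the engine enforces the policy, and the sample rows and the `RESTRICTED_COLUMNS` mapping here are hypothetical.

```python
ROWS = [
    {"customer_id": 1, "region": "US-West", "city": "Portland",
     "lifetime_value": 420.0, "email": "a@example.com"},
    {"customer_id": 2, "region": "EU", "city": "Lyon",
     "lifetime_value": 310.0, "email": "b@example.com"},
]

# Column-level policy: column name -> role required to see it.
RESTRICTED_COLUMNS = {"email": "pii_approved"}


def query_dim_customer(user_region: str, roles: set[str]) -> list[dict]:
    """Return only the rows and columns this caller is allowed to see."""
    visible = []
    for row in ROWS:
        if row["region"] != user_region:      # row-level security
            continue
        visible.append({
            col: val for col, val in row.items()
            if RESTRICTED_COLUMNS.get(col) is None      # unrestricted column
            or RESTRICTED_COLUMNS[col] in roles         # caller holds the role
        })
    return visible
```

A US-West analyst calling `query_dim_customer("US-West", {"analyst"})` gets one row with no email key; adding `pii_approved` to the roles returns the same row with the email included.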
Protect it: masking, tokenization, and PII
PII — Personally Identifiable Information — is data that identifies a person: name, email, SSN, phone, address, precise location, device IDs. It's governed by law and is the data you most need to protect. In ShopFlow, the field flagged PII in the contract is customers.email (which flows into stg_customers and dim_customer). The key insight: most people who query a table with PII don't need the actual sensitive values — a ShopFlow analyst computing signups-per-day needs to count customers, not read their emails. So you let them query the table while hiding the sensitive values:
- Masking: replace or obscure a sensitive value on read. Static masking permanently transforms data in a copy (e.g. a non-prod dataset); dynamic masking transforms it at query time based on who's asking: ShopFlow's customers.email shows as j***@***.com to an analyst but in full to a pii_approved role, from the same stored value.
- Tokenization: replace a sensitive value with a meaningless token, with the real value kept in a separate secured vault. 4111-1111-1111-1111 becomes tok_9f3a…. The token can still join (the same card always tokenizes the same way) so analytics work, but the token reveals nothing and can't be reversed without the vault. The standard for card data and the most common way to keep PII useful while unreadable.
- Hashing/pseudonymization and encryption (at rest and in transit) round out the kit, and column-level lineage (11.5) is what lets you verify a masking rule on customers.email actually propagates to every downstream column that inherited it (stg_customers.email, dim_customer.email).
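A minimal sketch of dynamic masking and deterministic tokenization follows, using only the standard library. The `TokenVault` class and the keyed-HMAC token scheme are assumptions chosen for illustration, not how any particular tokenization product works; a production vault would be a separate, access-controlled service.

```python
import hashlib
import hmac
import secrets


def mask_email(email: str) -> str:
    """Keep the first character and the top-level domain, e.g. j***@***.com."""
    local, _, domain = email.partition("@")
    tld = domain.rsplit(".", 1)[-1]
    return f"{local[:1]}***@***.{tld}"


def read_email(stored: str, roles: set[str]) -> str:
    """Dynamic masking: one stored value, a different view per caller."""
    return stored if "pii_approved" in roles else mask_email(stored)


class TokenVault:
    """Deterministic tokens from a keyed HMAC; the vault keeps the reverse map."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)     # never leaves the vault
        self._reverse: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        digest = hmac.new(self._key, value.encode(), hashlib.sha256).hexdigest()
        token = f"tok_{digest[:16]}"            # truncated for readability only
        self._reverse[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Only callers with vault access can recover the original value."""
        return self._reverse[token]
```

Because the same input always yields the same token within one vault, two tables tokenized by that vault can still be joined on the token column without either table holding the real value.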
Keep it (or not): privacy law, the right to be forgotten, and retention
Two laws set the baseline you'll be asked to comply with:
- GDPR (EU General Data Protection Regulation) — grants individuals rights over their personal data, including the right of access and the right to erasure ("right to be forgotten").
- CCPA/CPRA (California) — similar rights for California residents: know what's collected, delete it, opt out of its sale.
The right that breaks naïve data architectures is the right to be forgotten: a person can demand you delete all their personal data, and you must comply within a legal deadline (under GDPR, generally one month from the request) across every system that holds it. Concretely for ShopFlow: the customer with customer_id = 80421 emails "delete my data." You must erase their email (and any other PII) from customers and from every place it flowed (stg_customers, dim_customer, and any backups or lake snapshots) while keeping their de-identified orders/fact_sales rows if you still need aggregate revenue. The obligation is deletion of personal data, not necessarily of every transaction.
:::warning Right-to-be-forgotten on an immutable lake — the trap nobody warns you about
Earlier chapters celebrated immutable, append-only storage: data lakes and the lakehouse (Chapter 10) write immutable files and never update in place — that immutability is what makes them reliable and time-travelable. But "never delete a row" collides head-on with "delete this person's rows by law." You cannot just DELETE from a pile of immutable Parquet files. The real mechanisms:
- Use a table format that supports deletes. Iceberg / Delta Lake / Hudi (Chapter 10) implement DELETE over immutable files via copy-on-write (rewrite affected files without the row) or merge-on-read (record a delete marker, apply on read). This is a concrete reason the lakehouse table formats matter for governance, not just performance.
- Then prune history. Time travel and old snapshots still contain the deleted rows; true erasure requires expiring/compacting old snapshots so the data is genuinely gone, not just hidden from the latest version.
- Crypto-shredding. Encrypt each person's data with a per-person key; to "forget" them, destroy their key — the ciphertext remains but is permanently unreadable. The standard trick when physically rewriting petabytes per request is impractical.
- Find it first with lineage. You can't delete what you can't locate. Column-level lineage is what tells you every table customer 80421's PII flowed into (customers.email → stg_customers → dim_customer), so the erasure reaches all of them, not just the source.
A team that learned "immutable lakes are great" but never "how do I delete one person from one" is unprepared for the compliance work the job actually involves. This is a classic, costly gap.
:::
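The locate, delete, then prune sequence can be sketched with a toy model. This is an in-memory illustration of the ideas only, not the Iceberg, Delta Lake, or Hudi API; `LINEAGE`, `downstream_columns`, and `SnapshotTable` are hypothetical names.

```python
from collections import deque

# Column-level lineage edges from the text: source column -> downstream columns.
LINEAGE = {
    "customers.email": ["stg_customers.email"],
    "stg_customers.email": ["dim_customer.email"],
}


def downstream_columns(start: str) -> list[str]:
    """Breadth-first walk: every column that inherited `start`, start included."""
    seen, queue = [start], deque([start])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


class SnapshotTable:
    """Toy copy-on-write table: every write appends a new immutable snapshot."""

    def __init__(self, rows):
        self.snapshots = [tuple(rows)]

    def current(self):
        return self.snapshots[-1]

    def delete_where(self, predicate):
        # Rewrite the data without the matching rows; old snapshots are untouched.
        self.snapshots.append(tuple(r for r in self.current() if not predicate(r)))

    def expire_old_snapshots(self):
        # Until this runs, time travel can still read the "deleted" rows.
        self.snapshots = [self.current()]


table = SnapshotTable([
    {"customer_id": 80421, "email": "x@example.com"},
    {"customer_id": 7, "email": "y@example.com"},
])
table.delete_where(lambda r: r["customer_id"] == 80421)
still_in_history = any(r["customer_id"] == 80421 for r in table.snapshots[0])
table.expire_old_snapshots()
```

After the delete, `still_in_history` is true: the row is gone from the current version but remains in the first snapshot. Only `expire_old_snapshots()` removes it for good, and the same steps would need to run for every table that `downstream_columns("customers.email")` returns.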
- Data retention is the policy of how long you keep data before deleting it — driven by law (some records must be kept for years; others must not be kept once their purpose ends — data minimization) and cost. Implemented as automated jobs (lifecycle rules, partition expiry) that delete or archive data past its retention window.
- Auditing is the immutable log of who accessed or changed what, when — required to prove compliance, investigate breaches, and satisfy regulators. "Show that only authorized people saw this PII last quarter" is an audit-log query. (Note: an audit log of access is intentionally append-only and is itself usually exempt from erasure.)
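Both items reduce to small, schedulable pieces of logic. The sketch below assumes daily date partitions and a simple list-of-dicts audit log; the function names, the log shape, and the sample users are hypothetical.

```python
from datetime import date, timedelta


def expired_partitions(partitions: list[date], today: date,
                       retention_days: int) -> list[date]:
    """Daily partitions strictly older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if p < cutoff)


# Append-only access log (hypothetical events).
AUDIT_LOG = [
    {"user": "ana", "roles": {"analyst"},
     "table": "dim_customer", "column": "email"},
    {"user": "raj", "roles": {"pii_approved"},
     "table": "dim_customer", "column": "email"},
    {"user": "ana", "roles": {"analyst"},
     "table": "fact_sales", "column": "line_revenue"},
]


def unauthorized_pii_reads(log: list[dict], pii_columns: set[str],
                           required_role: str = "pii_approved") -> list[str]:
    """Users who read a PII column without holding the required role."""
    return [
        e["user"] for e in log
        if e["column"] in pii_columns and required_role not in e["roles"]
    ]


to_drop = expired_partitions(
    [date(2024, 1, 1), date(2024, 3, 1), date(2024, 3, 31)],
    today=date(2024, 4, 1),
    retention_days=30,
)
violations = unauthorized_pii_reads(AUDIT_LOG, {"email"})
```

A retention job would drop or archive everything in `to_drop` on each run, and a non-empty `violations` list is the kind of result the "show that only authorized people saw this PII" query is meant to surface.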
Pay for it: cost governance (FinOps)
Governance includes the bill. In cloud data platforms, compute is usually metered: you pay per byte scanned (BigQuery on-demand), per warehouse-second (Snowflake), or per cluster-hour (Spark/Databricks), so one careless query or runaway pipeline can cost thousands. FinOps (the cloud cost-management discipline; see the Cloud guide's FinOps chapter) applied to data means making cost visible, attributable, and controllable:
- Cost per query / per pipeline-run. Measure what each query and each pipeline costs (e.g. what one nightly shopflow_daily run costs to rebuild fact_sales) so the expensive ones are visible instead of buried in a lump-sum bill.
- Attribution. Tag queries, jobs, and warehouses by team/project/dataset so cost is charged back to whoever incurred it (showback/chargeback). Attribution is to cost what ownership is to quality: without it, no one is accountable and spend balloons.
- Workload isolation. Give teams/workloads separate compute (separate Snowflake warehouses, BigQuery reservations, Spark pools) so one team's giant backfill doesn't slow or bill another's, and each team's spend is isolated and capped.
- Controls. Query cost limits, auto-suspend on idle warehouses, partition/cluster your tables (Chapter 2) so queries scan less, and alert on cost anomalies (the same anomaly-detection instinct from 11.4, pointed at dollars).
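Attribution and anomaly alerting can be sketched in a few lines. The example assumes a scan-priced engine; `PRICE_PER_TIB` is a made-up rate, not any vendor's actual price, and the job records and the z-score threshold are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

TIB = 2 ** 40
PRICE_PER_TIB = 5.0   # hypothetical rate in dollars per TiB scanned


def query_cost(bytes_scanned: int) -> float:
    """Dollar cost of one query under the assumed scan-based pricing."""
    return bytes_scanned / TIB * PRICE_PER_TIB


def cost_by_team(jobs: list[dict]) -> dict[str, float]:
    """Showback: total cost per team tag; untagged spend is made visible."""
    totals: dict[str, float] = defaultdict(float)
    for job in jobs:
        totals[job.get("team", "UNTAGGED")] += query_cost(job["bytes_scanned"])
    return dict(totals)


def is_cost_anomaly(history: list[float], today: float, z: float = 3.0) -> bool:
    """Flag today's spend if it sits more than z standard deviations above the mean."""
    mu, sigma = mean(history), pstdev(history)
    return sigma > 0 and (today - mu) / sigma > z


jobs = [
    {"team": "growth", "job": "shopflow_daily", "bytes_scanned": 2 * TIB},
    {"job": "adhoc_export", "bytes_scanned": 1 * TIB},   # missing team tag
]
totals = cost_by_team(jobs)
```

Surfacing an explicit `UNTAGGED` bucket is a deliberate choice: it turns missing tags into a visible line item that someone has to claim, rather than spend that quietly disappears into the total.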
:::tip Governance is implemented, not filed
Every item here is a mechanism you build and run: a catalog you populate, RBAC/ABAC policies you write, masking rules you apply, retention jobs you schedule, audit logs you query, cost tags you enforce. The deliverable of governance is controls in production, not a binder. That's the mindset that separates a data engineer who can be trusted with regulated, expensive data from one who can't.
:::
Common pitfalls
- Governance as paperwork. A policy doc with no enforcing mechanism changes nothing. Governance is catalogs, access policies, masking, retention jobs, and cost controls — implemented.
- Table-level grants only. Without row/column-level security and masking, "access to the table" leaks PII to everyone who can read it. Restrict at the row/column and mask the sensitive fields.
- Assuming an immutable lake can't (or trivially can) delete. Both are wrong: you need it for right-to-be-forgotten, and it requires real machinery (table-format deletes, snapshot expiry, or crypto-shredding) plus lineage to find the data.
- No ownership. Unowned data means unactioned alerts, unanswered contracts, and unattributed cost. Ownership/stewardship is the spine.
- Ignoring cost until the bill arrives. Un-attributed, un-isolated compute is how one query costs $40k. Tag, isolate, limit, and alert on spend — FinOps is governance too.
Why it matters
Governance is controls you implement, answering four operational questions. Find it: a catalog + business glossary + discovery, anchored by ownership/stewardship so every alert and contract has a who. Touch it: RBAC (by role) and ABAC (by attribute/tag), refined with row- and column-level security, and masking/tokenization so most people query PII tables without ever seeing the sensitive values. Keep it: GDPR/CCPA give people a right to be forgotten that collides with immutable lakes — solved with table-format deletes (Iceberg/Delta/Hudi), snapshot expiry, or crypto-shredding, located via column-level lineage — plus retention policies and audit logs to prove compliance. Pay for it: FinOps makes data cost visible, attributed (tags/chargeback), and controlled (workload isolation, query limits). Done right, governance is what lets an organization trust a data platform with its most sensitive, regulated, and expensive data. Next, lock the whole chapter in.
Next: Chapter 11 checkpoint →