The Data Governance Framework That Finally Stopped Shipping Broken Metrics (and Got Us Through the Audit)
Governance isn’t a binder of policies. It’s the reliability layer for data products: contracts, controls, and feedback loops that keep analytics compliant, secure, and useful.
If your governance program can’t tell you whether the revenue fact table is fresh, accurate, and access-controlled, it’s not governance. It’s theater.
The moment you realize governance is a production problem
I’ve watched more than one org sail through app uptime targets while the data stack quietly lights itself on fire. The dashboard is “up,” but it’s wrong. Finance is reconciling in spreadsheets. The CRO is calling the data team at 10pm because pipeline numbers changed again. And then—inevitably—Security shows up with an audit request that reads like a ransom note: “Show me who accessed customer PII, when, and why.”
Here’s the hard truth: data governance is not paperwork. It’s the set of technical controls and operating habits that make data reliable, secure, and worth paying for.
When GitPlumbers gets pulled into a “why are our metrics untrustworthy?” situation, it’s rarely about one bad query. It’s usually missing fundamentals:
- No clear ownership of key datasets
- No enforced contracts between producers and consumers
- Weak access controls (“just give them `SELECT` on the schema”)
- Quality checks that exist in Confluence, not in CI
- Zero visibility into lineage, so every change is a roulette spin
A governance framework that doesn’t kill delivery
The governance frameworks that actually work in the real world look more like SRE than a compliance committee. They’re built on three ideas:
- Data products: Treat important datasets like products with owners, roadmaps, and reliability targets.
- Controls as code: If a rule matters, it must be enforceable by tooling (`dbt`, `OPA`, `Terraform`, `Unity Catalog`, `Lake Formation`).
- Feedback loops: Measure, alert, and run incidents when data breaks—because it will.
You don’t need to boil the ocean. Start where the business feels pain:
- Revenue reporting (bookings, ARR/MRR)
- Customer analytics (churn, retention)
- Risk and compliance reporting (PII access, deletion requests)
If you can make those boringly reliable, the rest becomes easier.
The 6 building blocks: ownership, contracts, classification, access, quality, and lineage
1) Ownership: name humans, not teams
Every “decision dataset” needs:
- Data Owner (accountable): usually a business leader (e.g., VP Finance for revenue numbers)
- Data Steward (operational): someone who understands definitions and exceptions
- Tech Owner (engineering): the person who ships changes and carries the pager
A simple RACI beats a fancy org chart. If nobody is accountable, the dataset will drift.
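The registry itself can be trivially simple. A sketch of the idea, with hypothetical names and addresses, checked in next to the pipeline code:

```python
# A RACI registry can literally be a dict (or YAML file) versioned with the
# pipelines. The structure and contacts here are illustrative placeholders;
# the point is that every decision dataset resolves to named humans.
OWNERS = {
    "analytics.fct_bookings": {
        "data_owner": "VP Finance",
        "steward": "jane.doe@company.com",
        "tech_owner": "data-platform-oncall@company.com",
    },
}

def accountable_for(dataset: str) -> str:
    """Return the accountable owner, failing loudly for orphaned datasets."""
    entry = OWNERS.get(dataset)
    if entry is None:
        raise KeyError(f"no registered owner for {dataset}: it will drift")
    return entry["data_owner"]
```

A CI check that calls `accountable_for` on every Tier 1 dataset turns “nobody owns this” from a discovery during an incident into a failed build.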
2) Data contracts: stop surprise schema changes
A contract doesn’t have to be academic. It’s just a published agreement:
- Schema and field meanings
- Constraints (nullability, uniqueness)
- Freshness expectations
- Backfill and deprecation rules
Here’s a pragmatic contract style we’ve used successfully (store it next to the pipeline code):
```yaml
# contracts/customer_events.yaml
version: 1
dataset: analytics.customer_events
owner:
  business: "VP Growth"
  technical: "data-platform@company.com"
slos:
  freshness_minutes: 60
  availability: 0.995
schema:
  - name: event_id
    type: string
    required: true
    constraints:
      - unique
  - name: customer_id
    type: string
    required: true
  - name: email
    type: string
    required: false
    classification: PII
  - name: event_ts
    type: timestamp
    required: true
    constraints:
      - not_null
notes:
  deprecations:
    - field: "utm_campaign"
      sunset_date: "2026-03-01"
```

3) Classification: you can’t protect what you haven’t labeled
If you do nothing else for compliance, do this: tag sensitive data.
- PII (email, phone, address)
- PHI (health info)
- PCI (card data)
- Secrets/tokens
Most modern stacks support tags:
- `Snowflake`: tags + masking policies
- `BigQuery`: policy tags
- `Databricks Unity Catalog`: tags + row/column-level access
Once classification exists, policy becomes enforceable.
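Before warehouse-native tags exist, a first-pass scan gets you a candidate list to review. A minimal sketch, assuming simple name and value heuristics (real classifiers are fancier; this just seeds the tagging backlog):

```python
import re

# Hypothetical heuristics: flag columns whose names or sampled values look
# like PII. Output feeds a human review, not automatic enforcement.
PII_NAME_PATTERN = re.compile(r"(email|phone|ssn|address)", re.IGNORECASE)
EMAIL_VALUE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def classify_columns(columns, samples=None):
    """Return {column_name: 'PII' | 'NONE'} based on simple heuristics."""
    samples = samples or {}
    tags = {}
    for col in columns:
        if PII_NAME_PATTERN.search(col):
            tags[col] = "PII"  # suspicious column name
        elif any(EMAIL_VALUE.fullmatch(str(v)) for v in samples.get(col, [])):
            tags[col] = "PII"  # sampled values look like email addresses
        else:
            tags[col] = "NONE"
    return tags

tags = classify_columns(
    ["event_id", "customer_email", "event_ts", "contact"],
    samples={"contact": ["alice@example.com"]},
)
```

Feed the `PII` candidates into whatever tag mechanism your warehouse supports; the scan is only the on-ramp.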
4) Access control: default-deny and make exceptions auditable
I’ve seen “everyone gets SELECT on prod” more times than I care to admit. It’s fast—until it’s catastrophic.
Use:
- RBAC for broad roles (Analyst, Finance, Support)
- ABAC for sensitive attributes (PII access only with training + ticket + approval)
- Just-in-time access for elevated queries (hours, not months)
Here’s a simplified policy-as-code example using Open Policy Agent (the concept matters even if your enforcement point is Ranger/Lake Formation/Unity Catalog):
```rego
# opa/data_access.rego
package data.access

import rego.v1

default allow := false

# Non-sensitive datasets are allowed
allow if {
    input.dataset.classification != "PII"
}

# PII requires membership in the approved group and a valid ticket
allow if {
    input.dataset.classification == "PII"
    "pii_approved" in input.user.groups
    input.request.ticket_id != ""
    input.request.ticket_id_valid == true
}
```

The measurable win: auditability. You can answer “who accessed PII and why” without archaeology.
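It also helps to unit-test the decision logic before wiring up the enforcement point. A pure-Python mirror of the same default-deny rules (a hypothetical helper for tests; OPA or your catalog remains the actual enforcer):

```python
def allow_access(classification, user_groups, ticket_id, ticket_valid):
    """Mirror of the default-deny policy: non-PII is open, PII requires
    the approved group plus a non-empty, validated ticket."""
    if classification != "PII":
        return True
    return (
        "pii_approved" in user_groups
        and bool(ticket_id)
        and ticket_valid
    )
```

Table-driven tests over this function catch policy regressions (for example, an empty ticket slipping through) before they ship.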
Quality & reliability: build gates, not dashboards that apologize
If your “data quality process” is a weekly meeting where someone says “numbers look off,” you’re already behind.
You need automated checks in two places:
- In CI/CD (block merges that break contracts)
- In production (alert when SLOs are violated)
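The CI side can be as simple as diffing the live schema against the published contract. A sketch, with the contract shown pre-parsed into a dict (in CI you would load the YAML file with a parser):

```python
# Hypothetical CI gate: fail the build when the warehouse schema drifts
# from the published contract. `contract` mirrors the YAML contract format.
contract = {
    "schema": [
        {"name": "event_id", "type": "string", "required": True},
        {"name": "customer_id", "type": "string", "required": True},
        {"name": "event_ts", "type": "timestamp", "required": True},
    ]
}

def contract_violations(contract, live_schema):
    """Compare contract fields against a live {column_name: type} schema."""
    violations = []
    for field in contract["schema"]:
        live_type = live_schema.get(field["name"])
        if live_type is None and field["required"]:
            violations.append(f"missing required column: {field['name']}")
        elif live_type is not None and live_type != field["type"]:
            violations.append(
                f"type drift on {field['name']}: {live_type} != {field['type']}"
            )
    return violations

# Example: someone quietly changed event_ts to a string upstream.
issues = contract_violations(
    contract,
    {"event_id": "string", "customer_id": "string", "event_ts": "string"},
)
```

Wire `contract_violations` into the merge pipeline and a surprise schema change becomes a red build instead of a broken dashboard.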
CI gates with dbt
dbt is a workhorse here because it’s close to transformations and plays well with pull requests.
```yaml
# models/marts/revenue/schema.yml
version: 2
models:
  - name: fct_bookings
    description: "Bookings fact table used by Finance and RevOps"
    columns:
      - name: booking_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
      - name: booking_amount_usd
        tests:
          - not_null
      - name: booked_at
        tests:
          - not_null
    tests:
      - dbt_utils.expression_is_true:
          expression: "booking_amount_usd >= 0"
```

Then in your pipeline:
```bash
dbt build --select fct_bookings+
```

Outcome you can measure: fewer broken dashboards shipped. In one cleanup, we cut “metric rollback” incidents from ~6/week to <1/week in a month by making `dbt build` mandatory before deploy.
Runtime checks with Great Expectations
For ingestion and raw zones, Great Expectations is great at catching upstream weirdness (duplicates, null spikes, out-of-range values):
```python
# expectations/orders_raw.py
import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("s3://bucket/orders.csv")

validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_be_between("total", min_value=0, max_value=50000)

results = validator.validate()
if not results["success"]:
    raise SystemExit("Data quality failed")
```

Pair runtime checks with freshness monitoring (e.g., Monte Carlo, BigQuery scheduled queries, Snowflake tasks + alerts). Freshness is the silent killer of business trust.
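A freshness monitor does not need a vendor to get started. A minimal sketch, assuming the contract’s `freshness_minutes` SLO and a `max(event_ts)` pulled from the warehouse (the query itself is omitted here):

```python
from datetime import datetime, timedelta, timezone

def freshness_breached(max_event_ts, slo_minutes, now=None):
    """True when the newest row is older than the freshness SLO.
    In production, max_event_ts would come from something like
    SELECT max(event_ts) FROM analytics.customer_events."""
    now = now or datetime.now(timezone.utc)
    return now - max_event_ts > timedelta(minutes=slo_minutes)

# Frozen clock for a deterministic example.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = freshness_breached(now - timedelta(minutes=95), slo_minutes=60, now=now)
fresh = freshness_breached(now - timedelta(minutes=10), slo_minutes=60, now=now)
```

Run it on a schedule, page on `True`, and log every result; the same records later feed your SLO-compliance KPI.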
Compliance and security: the controls auditors actually care about
Auditors don’t care that you have a “data policy doc.” They care that the system enforces it, and you can prove it.
The controls that consistently matter:
- Least privilege access (default-deny)
- Separation of duties (prod access is controlled; changes are reviewed)
- Encryption at rest and in transit (usually table stakes)
- Audit logs retained and queryable
- Data retention and deletion workflows (GDPR/CCPA)
- Masking/tokenization for sensitive fields
Example: masking in Snowflake
```sql
-- Snowflake example
CREATE TAG data_classification;

CREATE MASKING POLICY mask_pii AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_APPROVED_ROLE') THEN val
    ELSE '***MASKED***'
  END;

ALTER TABLE analytics.customer_events
  MODIFY COLUMN email SET TAG data_classification = 'PII';

ALTER TABLE analytics.customer_events
  MODIFY COLUMN email SET MASKING POLICY mask_pii;
```

Infrastructure-as-code: make drift painful
If you’re still hand-editing warehouse permissions, you’re signing up for access drift.
- Use `Terraform` modules for roles, grants, and tags
- Require PR review for permission changes
- Log approvals (ticket references in PR templates)
The measurable win here is time-to-audit-evidence. Good governance reduces “two-week scramble” into “here’s the dashboard of controls.”
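Once audit logs are centralized into a common shape, “who touched PII last quarter” is a filter, not a project. A sketch over a hypothetical normalized log record (field names here are illustrative, not any vendor’s schema):

```python
from datetime import datetime, timezone

def pii_access_report(audit_log, since):
    """Filter normalized audit-log entries down to PII reads after `since`.
    Each entry is a dict with user, dataset, classification, ticket, ts."""
    return [
        e for e in audit_log
        if e["classification"] == "PII" and e["ts"] >= since
    ]

log = [
    {"user": "alice", "dataset": "analytics.customer_events",
     "classification": "PII", "ticket": "JIRA-123",
     "ts": datetime(2025, 1, 2, tzinfo=timezone.utc)},
    {"user": "bob", "dataset": "analytics.fct_bookings",
     "classification": "NONE", "ticket": "",
     "ts": datetime(2025, 1, 3, tzinfo=timezone.utc)},
]
report = pii_access_report(log, datetime(2025, 1, 1, tzinfo=timezone.utc))
```

The ticket reference on each row is what turns “who accessed it” into “who accessed it and why,” which is the question auditors actually ask.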
Operating model: run governance like SRE (with KPIs people respect)
Governance collapses when it’s everyone’s side quest. The operating model needs a heartbeat.
Cadence that works in practice
- Weekly: triage new data access requests (aim for <24h turnaround)
- Bi-weekly: review top data incidents and SLO breaches
- Monthly: review governance KPIs with Finance/RevOps/Product
- Quarterly: access recertification for sensitive datasets
KPIs that tie to business value
Track leading indicators (quality and reliability) and lagging indicators (business impact):
- Freshness SLO compliance (% of time critical tables meet freshness)
- Data incident rate (per week) and MTTR (time to restore trusted numbers)
- Change failure rate for data pipelines (bad deploys / total deploys)
- Access request lead time (hours/days)
- Forecast variance improvements after stabilizing core metrics
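The first KPI on that list is just arithmetic over your freshness-check history. A minimal sketch, assuming each scheduled check recorded a pass/fail boolean:

```python
def slo_compliance(checks):
    """Percent of freshness checks that passed over a reporting window.
    `checks` is a list of booleans: True means the table met its SLO."""
    return 100.0 * sum(checks) / len(checks) if checks else 0.0

# Hypothetical week of hourly checks: 160 passes out of 168.
compliance = slo_compliance([True] * 160 + [False] * 8)
```

Report it per Tier 1 dataset, not as one blended number, so an executive can see exactly which table keeps breaking its promise.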
One pattern I’ve seen work repeatedly: define “Tier 1 datasets” (the ones executives use weekly) and give them SLOs first. Don’t pretend everything is Tier 1.
If your governance program can’t tell you whether the revenue fact table is fresh, accurate, and access-controlled, it’s not governance. It’s theater.
A realistic 30/60/90 plan (what we do at GitPlumbers)
When GitPlumbers comes into a messy data org—legacy ETL, AI-generated pipelines, tribal SQL—we don’t start with a tool migration. We start by restoring trust.
First 30 days: stop the bleeding
- Identify 5–10 Tier 1 datasets and owners
- Implement CI quality gates (`dbt test` / `dbt build`) on those models
- Tag PII columns and implement masking where supported
- Centralize audit logs and verify you can answer access questions
60 days: make it enforceable
- Add contracts to pipelines (schema + freshness + deprecation)
- Default-deny access for sensitive schemas; implement JIT approvals
- Publish lineage for Tier 1 flows (e.g., `OpenLineage` + `DataHub`)
90 days: make it sustainable
- Define SLOs and on-call for Tier 1 datasets
- Run postmortems on data incidents
- Add automated access reviews and retention/deletion workflows
Typical measurable outcomes we see when teams do this seriously:
- 50–80% reduction in recurring data incidents
- 2–5× faster audit evidence collection
- 30–60% reduction in analyst time spent “reconciling” numbers
- Noticeable executive trust rebound (which is the real KPI)
If you want compliance and speed, build guardrails you can’t ignore
The teams that win aren’t the ones with the longest governance doc. They’re the ones with automatic controls and a culture that treats data like production.
If you’re stuck in the loop of broken metrics, permission sprawl, and audit panic, GitPlumbers can help you put governance on rails—contracts, tests, access policies, and the operating model to keep it running.
- See how we approach messy data estates: Data Engineering Services
- When AI-generated pipelines are part of the problem: AI Code Rescue
- Want a quick gut-check? Data Platform Reliability Assessment
Key takeaways
- Treat governance as a **reliability system**: contracts + automated checks + incident response, not a committee.
- Start with **data products** and **clear ownership**; everything else (quality, access, lineage) hangs off that.
- Make compliance enforceable with **classification + policy-as-code** (RBAC/ABAC) and audited workflows.
- Use CI/CD gates (`dbt test`, Great Expectations) to stop bad data from reaching dashboards.
- Measure governance outcomes with business-facing KPIs: **freshness**, **accuracy**, **incident rate**, and **time-to-approve access**.
Implementation checklist
- Inventory critical metrics and pipelines; identify top 10 “decision datasets” and who uses them
- Define data products with owners, SLAs/SLOs, and a published contract
- Implement classification tags (PII/PHI/PCI) and default-deny access controls
- Automate quality checks in CI/CD and at runtime (freshness, uniqueness, referential integrity)
- Centralize audit logs and access review workflows (quarterly or automated)
- Publish lineage and documentation to a catalog; make it part of the definition of done
- Create an incident playbook for data outages with on-call, triage, and postmortems
- Track governance KPIs monthly and tie them to business outcomes (forecast accuracy, churn, revenue ops efficiency)
Questions we hear from teams
- Do we need a data catalog before we start governance?
- No. Start with Tier 1 datasets, ownership, contracts, and automated checks. A catalog (e.g., `DataHub`, `Collibra`) becomes valuable once you’re publishing reliable metadata and lineage—otherwise you’re cataloging chaos.
- What’s the fastest path to compliance improvements?
- Classify sensitive fields (PII/PHI/PCI), move to default-deny access, implement masking, and ensure audit logs are retained and queryable. Those four steps typically deliver the biggest audit-risk reduction quickly.
- How do we prevent governance from slowing teams down?
- Make guardrails self-service: templates for contracts, standardized role modules in `Terraform`, automated CI checks with `dbt`, and JIT access approvals. Measure access request lead time and treat delays as defects.
- How do you handle AI-generated data pipelines and “vibe-coded” SQL?
- Wrap them in contracts and tests immediately. Then refactor incrementally: add lineage, enforce linting and PR review, and replace brittle transformations with `dbt` models plus explicit tests. The goal is to make correctness repeatable, not heroic.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
