Lineage, Governance & Contracts
When a dashboard shows the wrong number, how do you find out which upstream table caused it? Column-level lineage gives you the dependency graph from raw source to analytical output. Data governance adds access control, PII classification, and compliance. Data contracts formalize the agreement between the team that produces a dataset and every team that consumes it — making breaking changes visible before they break things.
How It Works
[Interactive diagram: column-level lineage for fct_orders.revenue, traced back to its source columns]
Column-level lineage lets you answer "which source column feeds this metric?" — essential when debugging incorrect numbers.
Table-level lineage tells you which tables feed which tables. Column-level lineage goes further: it tells you which specific column in which source table feeds a specific column in a downstream table, and what transformations were applied at each step.
The reason a dashboard shows the wrong number is almost always a change somewhere upstream that propagated silently through the dependency graph. Without lineage, finding the root cause means tracing every pipeline manually — a process that takes hours or days on a mature data stack. Lineage makes the dependency graph machine-readable so that impact analysis (what breaks if I change this?) and root cause analysis (where did this wrong number come from?) can be answered in seconds instead of hours.
Column-level lineage is built from the SQL or dbt model definitions. When fct_orders.revenue is defined as:
-- int_orders_with_refunds
SELECT
    o.order_id,
    o.amount - COALESCE(r.refund_amount, 0) AS net_revenue
FROM stg_orders o
LEFT JOIN stg_refunds r ON o.order_id = r.order_id

A lineage tool can parse this SQL and record: fct_orders.revenue ← int_orders.net_revenue ← {stg_orders.amount, stg_refunds.refund_amount} ← {orders.amount, refunds.refund_amount}.
This makes two operations tractable:
Impact analysis (downstream): "If I rename stg_orders.order_amount, which downstream columns break?" Lineage shows every downstream model that references that column — preventing silent breakage.
Root cause analysis (upstream): "Revenue in the dashboard is wrong. Where did it come from?" Lineage traces the path back to the source column and the exact transformation at each step.
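Both operations are traversals of the same dependency graph. As a minimal sketch (the edges below are hand-written to mirror the fct_orders.revenue example; a real tool derives them from parsed SQL):

```python
from collections import defaultdict, deque

# Column-level lineage: each column -> the upstream columns it reads from
upstream = {
    "fct_orders.revenue": ["int_orders.net_revenue"],
    "int_orders.net_revenue": ["stg_orders.amount", "stg_refunds.refund_amount"],
    "stg_orders.amount": ["orders.amount"],
    "stg_refunds.refund_amount": ["refunds.refund_amount"],
}

# Invert the edges once so impact analysis is the same walk in reverse
downstream = defaultdict(list)
for col, sources in upstream.items():
    for src in sources:
        downstream[src].append(col)

def walk(start, edges):
    """Breadth-first traversal; returns every column reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Root cause: every source column feeding fct_orders.revenue
print(walk("fct_orders.revenue", upstream))
# Impact: every downstream column broken by changing orders.amount
print(walk("orders.amount", downstream))
```

Once the edges exist, both questions are answered in milliseconds; the hard part, extracting the edges, is what lineage tooling automates.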
Column-level lineage requires SQL parsing, not just dependency tracking, because table-level lineage cannot answer "which source column does this calculated metric depend on?" Parsing the SELECT statement is the only way to extract the column-to-column mapping. This is why tools like dbt, which define all transformations as SQL, make better lineage sources than Spark jobs with Python logic: SQL is structured enough to parse reliably, while arbitrary Python code is not.
Metadata catalogs
A data catalog is a searchable registry of datasets with their schemas, owners, descriptions, and usage stats. Teams use it to discover what data exists before building new pipelines (and avoid duplicating existing work).
Catalogs are built from two sources:
- Active metadata: schema from the warehouse, query history (who uses this table), dbt model documentation
- Passive metadata: column descriptions, owner annotations, PII tags added manually by data teams
Key catalog capabilities:
- Full-text search across column names and descriptions
- Popularity signals (which tables are queried most frequently)
- Lineage visualization
- PII/sensitivity classification surfaced alongside column names
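These capabilities compose. A toy sketch of full-text search ranked by popularity (the entries and the queries_30d field are illustrative, not any particular catalog's data model):

```python
# Hypothetical catalog entries; a real catalog scrapes these from the
# warehouse schema and query logs
CATALOG = [
    {"table": "analytics.fct_orders", "description": "Order facts with net revenue",
     "columns": ["order_id", "customer_id", "revenue"], "queries_30d": 4200},
    {"table": "analytics.dim_customers", "description": "Customer dimension",
     "columns": ["customer_id", "customer_email"], "queries_30d": 1800},
    {"table": "scratch.orders_backup_2021", "description": "One-off backup",
     "columns": ["order_id"], "queries_30d": 3},
]

def search(term):
    """Full-text match over names, descriptions, and columns, ranked by
    30-day query count so trusted tables surface first."""
    term = term.lower()
    hits = [
        e for e in CATALOG
        if term in e["table"].lower()
        or term in e["description"].lower()
        or any(term in c.lower() for c in e["columns"])
    ]
    return sorted(hits, key=lambda e: e["queries_30d"], reverse=True)

for entry in search("order"):
    print(entry["table"], entry["queries_30d"])
```

The popularity ranking is what makes results useful: the heavily-queried fct_orders outranks the abandoned backup table.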
PII classification and access control
Personally identifiable information (PII) in your warehouse requires classification and access control. Typical classification tiers:
| Tier | Examples | Policy |
|---|---|---|
| Public | anonymized aggregates, published metrics | No restrictions |
| Internal | non-PII operational data | All data team members |
| Confidential | email addresses, names, phone numbers | Role-restricted, logged |
| Restricted | health data, government IDs, financial details | Strict access control, audit log |
Row-level security in Snowflake, BigQuery, and Databricks lets you define policies that filter rows or mask columns based on the querying user's role. An analyst can query dim_customers but sees ***@***.com in the email column unless they have the pii_access role.
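The rule a masking policy evaluates can be sketched as a pure function (the role name and mask format mirror the example above; in a real warehouse this logic is defined as policy DDL and enforced by the engine, not by application code):

```python
def mask_email(value, user_roles, required_role="pii_access"):
    """Return the raw email only to users holding the PII role;
    everyone else sees a fixed mask."""
    if value is None:
        return None
    return value if required_role in user_roles else "***@***.com"

print(mask_email("ada@example.com", {"analyst"}))                # masked
print(mask_email("ada@example.com", {"analyst", "pii_access"}))  # raw value
```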
Design Tradeoffs
Where Your Intuition Breaks
Data contracts feel like bureaucracy — they add process overhead to what could be a simple schema change. The alternative is the "breaking change discovered in production" pattern, where a column rename or type change goes unannounced and a dozen downstream consumers fail simultaneously, each requiring its own debugging session. The cost of a contract is writing a document and notifying subscribers before a change; the cost of no contract is each consumer independently discovering the breakage after deployment, often without knowing which upstream change caused it. Contracts do not prevent change — they make change visible before it breaks things. The organizational resistance to contracts is usually strongest in teams that have not yet experienced a major lineage incident; after one, the overhead looks small.
Table-level vs column-level lineage
| Approach | Effort | Detail | Use case |
|---|---|---|---|
| Table-level | Low (automatic from query history) | Which tables depend on which | Impact analysis for table drops and renames |
| Column-level | Medium (SQL parsing or annotation) | Which columns flow from which | Debugging incorrect metrics, PII propagation |
| Business-level | High (manual annotation) | Which business metric = which columns | Self-service analytics, trust |
Column-level lineage requires SQL parsing or explicit annotation. Modern tools (dbt, OpenMetadata, DataHub) parse SQL automatically for common patterns but can miss dynamic SQL or stored procedures.
Passive discovery vs data contracts
Passive discovery (catalog-based): tools scrape metadata automatically — schema from the warehouse, query history from logs, lineage from SQL parsing. Low friction to adopt, but coverage is incomplete and metadata can be stale.
Data contracts (producer-defined): the team that owns a dataset publishes an explicit contract specifying schema, semantics, SLAs, and breaking-change policy. Consumers subscribe to the contract and receive notification before it changes.
Data contracts shift responsibility: instead of consumers discovering that a column was renamed after their dashboard broke, the producer is obligated to notify all subscribers before making breaking changes. This requires organizational buy-in but produces much more reliable data ecosystems.
What breaks a contract
A breaking change is anything that invalidates existing consumers:
- Removing or renaming a column
- Changing a column's data type
- Changing the primary key structure
- Changing the grain of the table (from order-level to order-line-level)
Non-breaking changes (safe to deploy without notification):
- Adding new columns
- Adding new rows
- Improving documentation
- Performance optimizations with no schema change
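The distinction can be checked mechanically by diffing two versions of a schema. A sketch (column-level only, so grain and primary-key changes still need human review; a rename appears as a remove plus an add):

```python
def diff_schemas(old, new):
    """Classify changes between two {column: type} mappings."""
    breaking, safe = [], []
    for col, typ in old.items():
        if col not in new:
            breaking.append(f"removed column: {col}")
        elif new[col] != typ:
            breaking.append(f"type change on {col}: {typ} -> {new[col]}")
    for col in new:
        if col not in old:
            safe.append(f"added column: {col}")
    return breaking, safe

old = {"order_id": "VARCHAR", "amount": "DECIMAL(18,2)"}
new = {"order_id": "VARCHAR", "net_amount": "DECIMAL(18,2)"}
breaking, safe = diff_schemas(old, new)
print(breaking)  # the rename of amount surfaces as a breaking removal
print(safe)
```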
In Practice
Writing a data contract
A minimal data contract as a YAML file committed alongside the dbt model:
# contracts/fct_orders.yaml
dataset: analytics.fct_orders
owner: data-engineering
slack: "#data-eng"
sla:
  freshness_hours: 4
  availability: "99.5%"
schema_version: "2.1.0"
columns:
  - name: order_id
    type: VARCHAR
    description: "Unique order identifier. Primary key."
    nullable: false
    pii: false
  - name: customer_id
    type: VARCHAR
    description: "FK to dim_customers. All orders have a customer."
    nullable: false
    pii: false
  - name: revenue
    type: DECIMAL(18, 2)
    description: "Net revenue after refunds applied."
    nullable: false
    pii: false
  - name: customer_email
    type: VARCHAR
    description: "Customer email at time of order. PII."
    nullable: true
    pii: true
    pii_tier: confidential
breaking_change_policy: "14 days notice via #data-contracts-announce before removing or renaming columns"

Check this into the same repo as the dbt model. When the schema changes, the contract file changes in the same PR — making breaking changes visible in code review.
Lineage for PII propagation
When a PII column exists in a source table, lineage tells you every downstream table that inherits it. If users.email is classified as PII:
- Run a lineage traversal downstream from users.email
- Every column derived from it (joins, concatenation, COALESCE) inherits the PII tag
- Access control policies can be applied uniformly across all descendant columns
Without lineage, PII classification is a manual audit that inevitably misses tables. With lineage, classification propagates automatically as models are built.
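The propagation itself is a downstream traversal that copies the tag to every descendant (the edges here are illustrative):

```python
# downstream[col] = columns directly derived from col
downstream = {
    "users.email": ["stg_users.email"],
    "stg_users.email": ["dim_customers.customer_email",
                        "fct_orders.customer_email"],
}

def propagate_pii(col, edges, tagged=None):
    """Tag col and every column derived from it, directly or transitively."""
    tagged = tagged if tagged is not None else set()
    tagged.add(col)
    for child in edges.get(col, []):
        if child not in tagged:
            propagate_pii(child, edges, tagged)
    return tagged

print(sorted(propagate_pii("users.email", downstream)))
```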
Catalog adoption
Catalogs provide value only if teams actually use them. Two patterns that drive adoption:
Embed links in dashboards: every Looker/Tableau/Metabase dashboard links to the catalog entry for each underlying dataset. Users who notice a metric behaving unexpectedly can follow the link to see the data's owner, freshness, and lineage without filing a ticket.
Search-first culture: when a new dataset is needed, teams search the catalog before building. Catalog usage metrics (searches, page views per dataset) tell you which datasets are trusted and which are abandoned.
Production Patterns
dbt model-level contracts in schema.yml
dbt 1.5+ supports native data contracts — declare them alongside your model in schema.yml and dbt will validate the schema on every run:
# models/marts/schema.yml
version: 2

models:
  - name: fct_orders
    config:
      contract:
        enforced: true  # fail the run if schema drifts
    columns:
      - name: order_id
        data_type: varchar
        constraints:
          - type: not_null
          - type: primary_key
      - name: customer_id
        data_type: varchar
        constraints:
          - type: not_null
      - name: revenue
        data_type: numeric
        constraints:
          - type: not_null
      - name: customer_email
        data_type: varchar
        meta:
          pii: true
          pii_tier: confidential

When contract.enforced: true is set, dbt compares the compiled model's column list and types against the declaration and raises a compilation error before any SQL runs. This catches drift from refactors before it hits production.
Emitting OpenLineage events from a custom job
For pipelines outside dbt (Spark jobs, custom Python scripts), emit OpenLineage events so your lineage graph stays complete:
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, Dataset
from openlineage.client.facet import SchemaDatasetFacet, SchemaField
import uuid, datetime

client = OpenLineageClient.from_environment()  # reads OPENLINEAGE_URL
run_id = str(uuid.uuid4())
job_name = "enrich_orders"
namespace = "data-engineering"
producer = "https://github.com/your-org/pipelines"  # URI identifying the emitter

# Emit START event
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.datetime.utcnow().isoformat() + "Z",
    run=Run(runId=run_id),
    job=Job(namespace=namespace, name=job_name),
    producer=producer,
    inputs=[Dataset(
        namespace="snowflake://account.us-east-1",
        name="analytics.fct_orders",
        facets={
            "schema": SchemaDatasetFacet(fields=[
                SchemaField("order_id", "VARCHAR"),
                SchemaField("revenue", "NUMERIC"),
            ])
        },
    )],
    outputs=[Dataset(
        namespace="snowflake://account.us-east-1",
        name="analytics.fct_orders_enriched",
    )],
))

# ... run your job logic ...

# Emit COMPLETE event
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.datetime.utcnow().isoformat() + "Z",
    run=Run(runId=run_id),
    job=Job(namespace=namespace, name=job_name),
    producer=producer,
))

OpenLineage events are ingested by Marquez, DataHub, and OpenMetadata. The lineage graph updates within seconds of job completion.
Enforcing data contracts in CI
Add a contract validation step to your PR pipeline so breaking changes are caught before merge:
# scripts/validate_contract.py
import os
import sys
import yaml
from pathlib import Path
import snowflake.connector

def load_contract(path: str) -> dict:
    return yaml.safe_load(Path(path).read_text())

def fetch_live_schema(conn, dataset: str) -> dict[str, str]:
    # The contract's dataset is schema.table (e.g. analytics.fct_orders);
    # information_schema resolves against the connection's current database.
    schema, table = dataset.split(".")
    cursor = conn.cursor()
    cursor.execute(f"""
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = '{schema.upper()}'
          AND table_name = '{table.upper()}'
    """)
    return {row[0].lower(): row[1].lower() for row in cursor.fetchall()}

def base_type(t: str) -> str:
    """Strip precision like (18, 2) and map the synonyms Snowflake reports
    (VARCHAR shows up as TEXT, DECIMAL as NUMBER)."""
    aliases = {"decimal": "number", "numeric": "number",
               "varchar": "text", "string": "text"}
    base = t.lower().split("(")[0].strip()
    return aliases.get(base, base)

def validate(contract_path: str) -> list[str]:
    contract = load_contract(contract_path)
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database=os.environ.get("SNOWFLAKE_DATABASE", "ANALYTICS"),
    )
    live = fetch_live_schema(conn, contract["dataset"])
    errors = []
    for col in contract["columns"]:
        name = col["name"]
        if name not in live:
            errors.append(f"MISSING column: {name}")
        elif base_type(col["type"]) != base_type(live[name]):
            errors.append(
                f"TYPE MISMATCH on {name}: "
                f"contract={col['type']}, live={live[name]}"
            )
    return errors

if __name__ == "__main__":
    errors = validate(sys.argv[1])
    if errors:
        print("Contract violations:")
        for e in errors:
            print(f"  {e}")
        sys.exit(1)
    print("Contract valid.")

Wire this into your CI pipeline:
# .github/workflows/dbt-ci.yml (relevant step only)
- name: Validate data contracts
  run: |
    for f in contracts/*.yaml; do
      python scripts/validate_contract.py "$f"
    done
  env:
    SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
    SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
    SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}

PRs that rename or drop a column now fail CI before any reviewer approves them.