Lineage, Governance & Contracts
When a dashboard shows the wrong number, how do you find out which upstream table caused it? Column-level lineage gives you the dependency graph from raw source to analytical output. Data governance adds access control, PII classification, and compliance. Data contracts formalize the agreement between the team that produces a dataset and every team that consumes it — making breaking changes visible before they break things.
How It Works
[Interactive diagram: column-level lineage for fct_orders.revenue, traced back to its source columns]
Column-level lineage lets you answer "which source column feeds this metric?" — essential when debugging incorrect numbers.
Table-level lineage tells you which tables feed which tables. Column-level lineage goes further: it tells you which specific column in which source table feeds a specific column in a downstream table, and what transformations were applied at each step.
The reason a dashboard shows the wrong number is almost always a change somewhere upstream that propagated silently through the dependency graph. Without lineage, finding the root cause means tracing every pipeline manually — a process that takes hours or days on a mature data stack. Lineage makes the dependency graph machine-readable so that impact analysis (what breaks if I change this?) and root cause analysis (where did this wrong number come from?) can be answered in seconds instead of hours.
Column-level lineage is built from the SQL or dbt model definitions. When fct_orders.revenue is defined as:
-- int_orders_with_refunds
SELECT
    o.order_id,
    o.amount - COALESCE(r.refund_amount, 0) AS net_revenue
FROM stg_orders o
LEFT JOIN stg_refunds r ON o.order_id = r.order_id

A lineage tool can parse this SQL and record: fct_orders.revenue ← int_orders.net_revenue ← {stg_orders.amount, stg_refunds.refund_amount} ← {orders.amount, refunds.refund_amount}.
This makes two operations tractable:
Impact analysis (downstream): "If I rename stg_orders.order_amount, which downstream columns break?" Lineage shows every downstream model that references that column — preventing silent breakage.
Root cause analysis (upstream): "Revenue in the dashboard is wrong. Where did it come from?" Lineage traces the path back to the source column and the exact transformation at each step.
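Both operations are traversals of the same dependency graph. As a minimal sketch (the edges below are hand-written to mirror the fct_orders.revenue example; a real tool derives them from parsed SQL):

```python
from collections import defaultdict, deque

# Column-level lineage: each column -> the upstream columns it reads from
upstream = {
    "fct_orders.revenue": ["int_orders.net_revenue"],
    "int_orders.net_revenue": ["stg_orders.amount", "stg_refunds.refund_amount"],
    "stg_orders.amount": ["orders.amount"],
    "stg_refunds.refund_amount": ["refunds.refund_amount"],
}

# Invert the edges once so impact analysis is the same walk in reverse
downstream = defaultdict(list)
for col, sources in upstream.items():
    for src in sources:
        downstream[src].append(col)

def walk(start, edges):
    """Breadth-first traversal; returns every column reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Root cause: every source column feeding fct_orders.revenue
print(walk("fct_orders.revenue", upstream))
# Impact: every downstream column broken by changing orders.amount
print(walk("orders.amount", downstream))
```

Once the edges exist, both questions are answered in milliseconds; the hard part, extracting the edges, is what lineage tooling automates.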
Column-level lineage requires SQL parsing, not just dependency tracking, because table-level lineage cannot answer "which source column does this calculated metric depend on?" Parsing the SELECT statement is the only way to extract the column-to-column mapping. This is why tools like dbt, which define all transformations as SQL, make better lineage sources than Spark jobs with Python logic: SQL is structured enough to parse reliably, while arbitrary Python code is not.
Metadata catalogs
A data catalog is a searchable registry of datasets with their schemas, owners, descriptions, and usage stats. Teams use it to discover what data exists before building new pipelines (and avoid duplicating existing work).
Catalogs are built from two sources:
- Active metadata: schema from the warehouse, query history (who uses this table), dbt model documentation
- Passive metadata: column descriptions, owner annotations, PII tags added manually by data teams
Key catalog capabilities:
- Full-text search across column names and descriptions
- Popularity signals (which tables are queried most frequently)
- Lineage visualization
- PII/sensitivity classification surfaced alongside column names
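These capabilities compose. A toy sketch of full-text search ranked by popularity (the entries and the queries_30d field are illustrative, not any particular catalog's data model):

```python
# Hypothetical catalog entries; a real catalog scrapes these from the
# warehouse schema and query logs
CATALOG = [
    {"table": "analytics.fct_orders", "description": "Order facts with net revenue",
     "columns": ["order_id", "customer_id", "revenue"], "queries_30d": 4200},
    {"table": "analytics.dim_customers", "description": "Customer dimension",
     "columns": ["customer_id", "customer_email"], "queries_30d": 1800},
    {"table": "scratch.orders_backup_2021", "description": "One-off backup",
     "columns": ["order_id"], "queries_30d": 3},
]

def search(term):
    """Full-text match over names, descriptions, and columns, ranked by
    30-day query count so trusted tables surface first."""
    term = term.lower()
    hits = [
        e for e in CATALOG
        if term in e["table"].lower()
        or term in e["description"].lower()
        or any(term in c.lower() for c in e["columns"])
    ]
    return sorted(hits, key=lambda e: e["queries_30d"], reverse=True)

for entry in search("order"):
    print(entry["table"], entry["queries_30d"])
```

The popularity ranking is what makes results useful: the heavily-queried fct_orders outranks the abandoned backup table.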
PII classification and access control
Personally identifiable information (PII) in your warehouse requires classification and access control. Typical classification tiers:
| Tier | Examples | Policy |
|---|---|---|
| Public | anonymized aggregates, published metrics | No restrictions |
| Internal | non-PII operational data | All data team members |
| Confidential | email addresses, names, phone numbers | Role-restricted, logged |
| Restricted | health data, government IDs, financial details | Strict access control, audit log |
Row-level security in Snowflake, BigQuery, and Databricks lets you define policies that filter rows or mask columns based on the querying user's role. An analyst can query dim_customers but sees ***@***.com in the email column unless they have the pii_access role.
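The rule a masking policy evaluates can be sketched as a pure function (the role name and mask format mirror the example above; in a real warehouse this logic is defined as policy DDL and enforced by the engine, not by application code):

```python
def mask_email(value, user_roles, required_role="pii_access"):
    """Return the raw email only to users holding the PII role;
    everyone else sees a fixed mask."""
    if value is None:
        return None
    return value if required_role in user_roles else "***@***.com"

print(mask_email("ada@example.com", {"analyst"}))                # masked
print(mask_email("ada@example.com", {"analyst", "pii_access"}))  # raw value
```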
Design Tradeoffs
Where Your Intuition Breaks
Data contracts feel like bureaucracy — they add process overhead to what could be a simple schema change. The alternative is the "breaking change discovered in production" pattern, where a column rename or type change goes unannounced and a dozen downstream consumers fail simultaneously, each requiring its own debugging session. The cost of a contract is writing a document and notifying subscribers before a change; the cost of no contract is each consumer independently discovering the breakage after deployment, often without knowing which upstream change caused it. Contracts do not prevent change — they make change visible before it breaks things. The organizational resistance to contracts is usually strongest in teams that have not yet experienced a major lineage incident; after one, the overhead looks small.
Table-level vs column-level lineage
| Approach | Effort | Detail | Use case |
|---|---|---|---|
| Table-level | Low (automatic from query history) | Which tables depend on which | Impact analysis for table drops and renames |
| Column-level | Medium (SQL parsing or annotation) | Which columns flow from which | Debugging incorrect metrics, PII propagation |
| Business-level | High (manual annotation) | Which business metric = which columns | Self-service analytics, trust |
Column-level lineage requires SQL parsing or explicit annotation. Modern tools (dbt, OpenMetadata, DataHub) parse SQL automatically for common patterns but can miss dynamic SQL or stored procedures.
Passive discovery vs data contracts
Passive discovery (catalog-based): tools scrape metadata automatically — schema from the warehouse, query history from logs, lineage from SQL parsing. Low friction to adopt, but coverage is incomplete and metadata can be stale.
Data contracts (producer-defined): the team that owns a dataset publishes an explicit contract specifying schema, semantics, SLAs, and breaking-change policy. Consumers subscribe to the contract and receive notification before it changes.
Data contracts shift responsibility: instead of consumers discovering that a column was renamed after their dashboard broke, the producer is obligated to notify all subscribers before making breaking changes. This requires organizational buy-in but produces much more reliable data ecosystems.
What breaks a contract
A breaking change is anything that invalidates existing consumers:
- Removing or renaming a column
- Changing a column's data type
- Changing the primary key structure
- Changing the grain of the table (from order-level to order-line-level)
Non-breaking changes (safe to deploy without notification):
- Adding new columns
- Adding new rows
- Improving documentation
- Performance optimizations with no schema change
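The distinction can be checked mechanically by diffing two versions of a schema. A sketch (column-level only, so grain and primary-key changes still need human review; a rename appears as a remove plus an add):

```python
def diff_schemas(old, new):
    """Classify changes between two {column: type} mappings."""
    breaking, safe = [], []
    for col, typ in old.items():
        if col not in new:
            breaking.append(f"removed column: {col}")
        elif new[col] != typ:
            breaking.append(f"type change on {col}: {typ} -> {new[col]}")
    for col in new:
        if col not in old:
            safe.append(f"added column: {col}")
    return breaking, safe

old = {"order_id": "VARCHAR", "amount": "DECIMAL(18,2)"}
new = {"order_id": "VARCHAR", "net_amount": "DECIMAL(18,2)"}
breaking, safe = diff_schemas(old, new)
print(breaking)  # the rename of amount surfaces as a breaking removal
print(safe)
```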
In Practice
Writing a data contract
A minimal data contract as a YAML file committed alongside the dbt model:
# contracts/fct_orders.yaml
dataset: analytics.fct_orders
owner: data-engineering
slack: "#data-eng"
sla:
  freshness_hours: 4
  availability: "99.5%"
schema_version: "2.1.0"
columns:
  - name: order_id
    type: VARCHAR
    description: "Unique order identifier. Primary key."
    nullable: false
    pii: false
  - name: customer_id
    type: VARCHAR
    description: "FK to dim_customers. All orders have a customer."
    nullable: false
    pii: false
  - name: revenue
    type: DECIMAL(18, 2)
    description: "Net revenue after refunds applied."
    nullable: false
    pii: false
  - name: customer_email
    type: VARCHAR
    description: "Customer email at time of order. PII."
    nullable: true
    pii: true
    pii_tier: confidential
breaking_change_policy: "14 days notice via #data-contracts-announce before removing or renaming columns"

Check this into the same repo as the dbt model. When the schema changes, the contract file changes in the same PR — making breaking changes visible in code review.
Lineage for PII propagation
When a PII column exists in a source table, lineage tells you every downstream table that inherits it. If users.email is classified as PII:
- Run a lineage traversal downstream from users.email
- Every column derived from it (joins, concatenation, COALESCE) inherits the PII tag
- Access control policies can be applied uniformly across all descendant columns
Without lineage, PII classification is a manual audit that inevitably misses tables. With lineage, classification propagates automatically as models are built.
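The propagation itself is a downstream traversal that copies the tag to every descendant (the edges here are illustrative):

```python
# downstream[col] = columns directly derived from col
downstream = {
    "users.email": ["stg_users.email"],
    "stg_users.email": ["dim_customers.customer_email",
                        "fct_orders.customer_email"],
}

def propagate_pii(col, edges, tagged=None):
    """Tag col and every column derived from it, directly or transitively."""
    tagged = tagged if tagged is not None else set()
    tagged.add(col)
    for child in edges.get(col, []):
        if child not in tagged:
            propagate_pii(child, edges, tagged)
    return tagged

print(sorted(propagate_pii("users.email", downstream)))
```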
Catalog adoption
Catalogs provide value only if teams actually use them. Two patterns that drive adoption:
Embed links in dashboards: every Looker/Tableau/Metabase dashboard links to the catalog entry for each underlying dataset. Users who notice a metric behaving unexpectedly can follow the link to see the data's owner, freshness, and lineage without filing a ticket.
Search-first culture: when a new dataset is needed, teams search the catalog before building. Catalog usage metrics (searches, page views per dataset) tell you which datasets are trusted and which are abandoned.
Production Patterns
dbt model-level contracts in schema.yml
dbt 1.5+ supports native data contracts — declare them alongside your model in schema.yml and dbt will validate the schema on every run:
# models/marts/schema.yml
version: 2

models:
  - name: fct_orders
    config:
      contract:
        enforced: true  # fail the run if schema drifts
    columns:
      - name: order_id
        data_type: varchar
        constraints:
          - type: not_null
          - type: primary_key
      - name: customer_id
        data_type: varchar
        constraints:
          - type: not_null
      - name: revenue
        data_type: numeric
        constraints:
          - type: not_null
      - name: customer_email
        data_type: varchar
        meta:
          pii: true
          pii_tier: confidential

When contract.enforced: true is set, dbt compares the compiled model's column list and types against the declaration and raises a compilation error before any SQL runs. This catches drift from refactors before it hits production.
Emitting OpenLineage events from a custom job
For pipelines outside dbt (Spark jobs, custom Python scripts), emit OpenLineage events so your lineage graph stays complete:
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, Dataset
from openlineage.client.facet import SchemaDatasetFacet, SchemaField
import uuid, datetime

client = OpenLineageClient.from_environment()  # reads OPENLINEAGE_URL
run_id = str(uuid.uuid4())
job_name = "enrich_orders"
namespace = "data-engineering"
producer = "https://github.com/your-org/pipelines"  # URI identifying the emitter

# Emit START event
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.datetime.utcnow().isoformat() + "Z",
    run=Run(runId=run_id),
    job=Job(namespace=namespace, name=job_name),
    producer=producer,
    inputs=[Dataset(
        namespace="snowflake://account.us-east-1",
        name="analytics.fct_orders",
        facets={
            "schema": SchemaDatasetFacet(fields=[
                SchemaField("order_id", "VARCHAR"),
                SchemaField("revenue", "NUMERIC"),
            ])
        },
    )],
    outputs=[Dataset(
        namespace="snowflake://account.us-east-1",
        name="analytics.fct_orders_enriched",
    )],
))

# ... run your job logic ...

# Emit COMPLETE event
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.datetime.utcnow().isoformat() + "Z",
    run=Run(runId=run_id),
    job=Job(namespace=namespace, name=job_name),
    producer=producer,
))

OpenLineage events are ingested by Marquez, DataHub, and OpenMetadata. The lineage graph updates within seconds of job completion.
Enforcing data contracts in CI
Add a contract validation step to your PR pipeline so breaking changes are caught before merge:
# scripts/validate_contract.py
import os
import sys
import yaml
from pathlib import Path
import snowflake.connector

def load_contract(path: str) -> dict:
    return yaml.safe_load(Path(path).read_text())

def fetch_live_schema(conn, dataset: str) -> dict[str, str]:
    # The contract's dataset is schema.table (e.g. analytics.fct_orders);
    # information_schema resolves against the connection's current database.
    schema, table = dataset.split(".")
    cursor = conn.cursor()
    cursor.execute(f"""
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = '{schema.upper()}'
          AND table_name = '{table.upper()}'
    """)
    return {row[0].lower(): row[1].lower() for row in cursor.fetchall()}

def base_type(t: str) -> str:
    """Strip precision like (18, 2) and map the synonyms Snowflake reports
    (VARCHAR shows up as TEXT, DECIMAL as NUMBER)."""
    aliases = {"decimal": "number", "numeric": "number",
               "varchar": "text", "string": "text"}
    base = t.lower().split("(")[0].strip()
    return aliases.get(base, base)

def validate(contract_path: str) -> list[str]:
    contract = load_contract(contract_path)
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database=os.environ.get("SNOWFLAKE_DATABASE", "ANALYTICS"),
    )
    live = fetch_live_schema(conn, contract["dataset"])
    errors = []
    for col in contract["columns"]:
        name = col["name"]
        if name not in live:
            errors.append(f"MISSING column: {name}")
        elif base_type(col["type"]) != base_type(live[name]):
            errors.append(
                f"TYPE MISMATCH on {name}: "
                f"contract={col['type']}, live={live[name]}"
            )
    return errors

if __name__ == "__main__":
    errors = validate(sys.argv[1])
    if errors:
        print("Contract violations:")
        for e in errors:
            print(f"  {e}")
        sys.exit(1)
    print("Contract valid.")

Wire this into your CI pipeline:
# .github/workflows/dbt-ci.yml (relevant step only)
- name: Validate data contracts
  run: |
    for f in contracts/*.yaml; do
      python scripts/validate_contract.py "$f"
    done
  env:
    SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
    SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
    SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}

PRs that rename or drop a column now fail CI before any reviewer approves them.