Data Mesh & Platform Thinking
Centralized data teams become bottlenecks at scale — every domain waits for one team to build their pipelines. Data mesh addresses this by treating data as a product, assigning domain ownership to the teams that understand the data best, and providing shared infrastructure that makes self-serve feasible. It is an organizational and architectural pattern, not a technology choice.
How It Works
[Interactive figure: centralized vs. mesh. Centralized mode: one team owns all pipelines; every domain waits in queue. Bottleneck: 4 domains, 1 team. Pipelines built without domain knowledge. Ownership unclear when things break.]
In the centralized model, all four domains (Growth, Payments, Ops, Marketing) queue requests to a single Data Platform team. Wait times of 2–5 weeks are common — not because the data team is slow, but because the queue is long and the team lacks the domain context to prioritize correctly.
In the mesh model, each domain team owns and publishes its own data product. The Data Platform team shifts from a delivery team to an infrastructure team — providing the shared tooling (catalog, lineage, quality checks) that makes domain ownership viable.
A centralized data team at scale is structurally equivalent to a centralized operations team in a software organization — a team that every engineering team must wait on for every deployment. The solution in software was DevOps: give each team the tools and responsibility to deploy their own services. Data mesh is the same answer applied to data: give each domain team the tools and responsibility to publish their own data products, with a platform team providing the shared infrastructure that makes it feasible without requiring every domain team to become data infrastructure experts.
The four principles of data mesh
Zhamak Dehghani's data mesh (2019) defines four principles:
1. Domain ownership: data is owned and produced by the domain team that understands it best. The Payments team owns payments.transactions — they understand the schema, the edge cases, the SLA that's appropriate. They don't hand it off to a centralized team that must learn the domain to serve it.
2. Data as a product: each domain publishes data products — datasets that meet explicit quality standards. A data product has:
- A discoverable name (payments.transactions, not stg_transactions_v3_final)
- An owner and SLA (freshness, availability guarantees)
- Documentation (schema, semantics, known limitations)
- Quality checks that run automatically
3. Self-serve data infrastructure: domains can only own their data if the infrastructure is easy enough to operate. The platform team provides the scaffolding — pipeline templates, catalog registration, quality check frameworks, access control tooling — so domain teams aren't expected to build it all themselves.
4. Federated governance: governance policies (PII classification, retention rules, access tiers) are set centrally but enforced through infrastructure, not approval queues. A domain team that creates a new dataset gets automated PII scanning; they don't wait for a governance team to review it.
Federated governance had to be infrastructure-enforced rather than process-enforced precisely because process-based governance does not scale with the number of domain teams. If a central governance team must review every new dataset before publication, the governance team becomes the bottleneck that data mesh was designed to eliminate. Encoding governance rules as automated checks (PII scanner runs on every column, retention policies applied by the catalog at registration time) makes governance instantaneous and consistent without requiring a human in the loop for every dataset — which is the only way it can function in a decentralized organization.
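To make the "PII scanner runs on every column" idea concrete, here is a minimal sketch of such an automated check in Python. The name heuristics and manifest shape are illustrative assumptions, not a real platform API — a production scanner would also sample column values, not just names:

```python
import re

# Hypothetical name heuristics suggesting a column holds PII (assumption,
# not an exhaustive or standard list)
PII_NAME_PATTERNS = [r"email", r"phone", r"ssn", r"card", r"address"]

def scan_columns(columns: list) -> list:
    """Return columns that look like PII but are not tagged pii: true."""
    flagged = []
    for col in columns:
        looks_pii = any(re.search(p, col["name"].lower()) for p in PII_NAME_PATTERNS)
        if looks_pii and not col.get("pii", False):
            flagged.append(col["name"])
    return flagged

# Column entries as they might appear in a data product manifest
columns = [
    {"name": "transaction_id", "pii": False},
    {"name": "card_last_four", "pii": False},  # mistagged: should be pii: true
    {"name": "user_email", "pii": True},       # correctly tagged
]
print(scan_columns(columns))  # ['card_last_four']
```

Run in CI on every manifest change, a check like this fails the build instead of waiting on a review queue — governance as infrastructure rather than process.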
What a data product looks like
A data product is a dataset with a contract:
| Property | Example |
|---|---|
| Name | payments.transactions |
| Owner | Payments engineering team |
| SLA | Updated within 1 hour of transaction; 99.9% availability |
| Schema | Versioned; breaking changes require 14-day notice |
| Quality | Row count monitor; referential integrity; NULL checks |
| Discovery | Registered in catalog with descriptions and example queries |
| Access | Self-serve for analysts; PII columns require approval |
The key difference from a shared database table: the domain team is accountable for the product quality, not just the pipeline.
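The SLA row above ("updated within 1 hour") is mechanically checkable, which is what makes it a contract rather than a hope. A minimal sketch of a freshness check, assuming the last-update timestamp comes from warehouse metadata (the function is hypothetical, not a platform API):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def check_freshness(last_updated: datetime, sla_minutes: int,
                    now: Optional[datetime] = None) -> bool:
    """True if the dataset was updated within its SLA window."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= timedelta(minutes=sla_minutes)

# payments.transactions promises updates within 1 hour (60 minutes)
check_time = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
fresh = check_freshness(datetime(2024, 6, 1, 11, 30, tzinfo=timezone.utc), 60, check_time)
stale = check_freshness(datetime(2024, 6, 1, 9, 0, tzinfo=timezone.utc), 60, check_time)
print(fresh, stale)  # True False
```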
Design Tradeoffs
Where Your Intuition Breaks
Data mesh is frequently misread as "every team builds their own data infrastructure" — which would mean dozens of teams independently choosing data tools, building pipelines from scratch, and maintaining their own catalog integrations. That interpretation produces chaos, not autonomy. The enabler of domain ownership is the platform: shared infrastructure that is opinionated enough to handle the hard parts (lineage, quality, discovery, access control) so that domain teams only need to handle the domain-specific parts (schema, semantics, SLA). Without a strong platform team providing excellent self-serve tools, data mesh devolves into "data swamp, but distributed." The organizational shift is from "central team delivers data to consumers" to "central team enables domain teams to deliver their own data" — a fundamentally different mission for the platform team, not the absence of one.
When centralized works fine
Data mesh is an organizational response to a scaling problem. For smaller organizations, it's likely unnecessary overhead:
| Organization | Right model |
|---|---|
| Small team (1–5 data engineers) | Centralized — everyone knows the domain |
| Medium (5–20 data engineers) | Centralized with domain specialization |
| Large, multiple independent domains | Consider mesh for high-demand domains |
| Monorepo, single product | Centralized — domain boundaries are unclear |
Data mesh requires that domains have meaningful data ownership. If all data lives in one operational database, there's no natural boundary for domain teams to own.
The coordination overhead of mesh
Mesh trades centralized bottlenecks for distributed coordination. What gets harder:
Cross-domain joins: a single query that joins payments.transactions with growth.user_events with ops.fulfillment must navigate three data products with potentially different schemas, SLA guarantees, and update schedules. In a centralized team, this join is one model; in mesh, it requires coordination between three teams.
Consistent semantic definitions: "active user" means different things to Growth (logged in last 30 days) and Payments (placed an order last 90 days). Without a central team to adjudicate, you get metric proliferation. Federated governance must define canonical metrics centrally, even if data products are distributed.
Onboarding: new analysts joining need to learn which domain owns which data, how to find the catalog, what the SLA guarantees mean in practice. Centralized teams are easier to onboard to.
Platform thinking without full mesh
Even if you don't adopt data mesh organization-wide, platform thinking improves centralized data teams:
- Self-serve catalog: analysts can discover and understand data without filing tickets
- Pipeline templates: new ingestion jobs follow a standard pattern; anyone can follow it
- Automated quality checks: new models get a standard test suite by default
- Access control automation: PII access requests are automated, not manual reviews
These capabilities reduce the ticket queue even without full domain ownership.
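The "standard test suite by default" idea can be sketched as manifest-driven test generation: the platform derives baseline checks from the product spec, and domain teams only add domain-specific tests. The check names below mirror this page's examples; the generator itself is an illustrative assumption:

```python
def default_checks(spec: dict) -> list:
    """Derive a baseline test suite from a data product spec."""
    checks = [{"type": "freshness"}, {"type": "row_count"}]
    # Every column declared non-nullable gets a not_null check by default
    non_null = [c["name"] for c in spec.get("columns", [])
                if not c.get("nullable", True)]
    if non_null:
        checks.append({"type": "not_null", "columns": non_null})
    return checks

spec = {
    "columns": [
        {"name": "transaction_id", "nullable": False},
        {"name": "card_last_four", "nullable": True},
    ]
}
print(default_checks(spec))
```

The point of the pattern: baseline coverage exists even if a domain team writes zero tests themselves.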
In Practice
Building a self-serve data platform
The platform team's job is to make domain teams' jobs easy. Concrete capabilities:
Catalog registration: when a domain team creates a new dbt model, it's automatically discoverable in the catalog. Schema, owner, and freshness are populated without manual steps.
Pipeline templates: the platform provides a standard template for common ingestion patterns (Postgres CDC, Kafka topic, S3 file). Domains fill in source-specific configuration; the platform handles retry logic, monitoring, and error routing.
Quality framework: every data product registered with the platform gets a default test suite (row count monitor, freshness check). Domain teams add custom tests for their semantic rules; the platform ensures baseline coverage exists.
Access control: the platform enforces the org's access policy automatically. Creating a column tagged as PII triggers a mandatory access review workflow. Columns without PII tags are accessible to all analysts. No governance team approval queue.
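That access rule — default access for untagged columns, an approval workflow for PII — reduces to a small policy function. A sketch (the policy keys follow the manifest example later on this page; the function itself is illustrative):

```python
def column_access(column: dict, access_policy: dict) -> str:
    """Resolve a column's effective access tier from the product's access policy."""
    if column.get("pii", False):
        # PII columns route through the approval workflow
        return access_policy.get("pii_columns", "requires_approval")
    # Untagged columns fall through to the product-wide default
    return access_policy.get("default", "analysts")

policy = {"default": "analysts", "pii_columns": "requires_approval"}
print(column_access({"name": "amount_usd", "pii": False}, policy))     # analysts
print(column_access({"name": "card_last_four", "pii": True}, policy))  # requires_approval
```

Because the policy is data, the platform can evaluate it at query time without any human in the loop.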
Measuring platform success
A platform team's success is measured by how fast domain teams can ship data products independently:
- Time to first data product: how long does it take a new domain team to publish their first data product from scratch?
- Platform ticket rate: what fraction of domain data work requires a ticket to the platform team? Target: less than 20%
- Data product coverage: what percentage of important business datasets are published as data products with owners, SLAs, and quality checks?
- Incident attribution: when a data incident occurs, is the owning domain team alerted first, or does the centralized team find out first?
Adopting incrementally
Data mesh doesn't have to be all-or-nothing. A pragmatic adoption path:
- Catalog first: deploy a catalog and get every important table registered with owners and documentation. This surfaces ownership clarity without changing pipelines.
- Data contracts for high-demand datasets: start with the 10 most-queried datasets. Add explicit contracts and notify consumers before breaking changes.
- Domain self-service for new datasets: new datasets built by domain teams use platform templates and are registered automatically. Don't migrate existing centralized pipelines.
- Federated governance for PII: automate PII classification and access control. Remove manual approval queues for data access.
Incremental adoption lets you validate that domain teams can handle the responsibility before shifting accountability broadly.
Production Patterns
Data product spec as code
Each domain publishes a manifest file alongside their pipeline code. This makes data product metadata version-controlled and diff-able in PRs:
# data-products/payments/transactions.yaml
apiVersion: data-platform/v1
kind: DataProduct
metadata:
  name: payments.transactions
  domain: payments
  owner: payments-engineering
  slack: "#payments-eng"
  created: "2024-01-15"
spec:
  description: >
    All payment transactions processed through the payments service.
    One row per transaction attempt. Includes successful, failed, and
    refunded states. Updated within 5 minutes of transaction commit.
  schema_version: "3.2.0"
  breaking_change_notice_days: 14
  sla:
    freshness_minutes: 5
    availability_percent: 99.9
    row_count_daily_min: 100000
  columns:
    - name: transaction_id
      type: VARCHAR(36)
      nullable: false
      pii: false
      description: "UUID. Primary key."
    - name: user_id
      type: VARCHAR(36)
      nullable: false
      pii: false
    - name: amount_usd
      type: NUMERIC(18, 4)
      nullable: false
      pii: false
    - name: card_last_four
      type: VARCHAR(4)
      nullable: true
      pii: true
      pii_tier: confidential
  access:
    default: analysts  # all analysts can read non-PII columns
    pii_columns: requires_approval
    approval_owner: privacy-team
  quality_checks:
    - type: freshness
    - type: row_count
    - type: not_null
      columns: [transaction_id, user_id, amount_usd]
    - type: referential_integrity
      column: user_id
      references: growth.users.user_id

The platform team provides a linter (data-product lint transactions.yaml) that validates required fields, checks that referenced tables exist in the catalog, and warns if PII columns lack an access tier.
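The linter's core validations can be sketched in a few dozen lines. The rule set below is an illustrative assumption — a real linter would also resolve referenced tables against the live catalog:

```python
def lint_manifest(manifest: dict) -> list:
    """Return a list of lint errors for a data product manifest."""
    errors = []
    # Required metadata fields
    for field in ("name", "domain", "owner"):
        if field not in manifest.get("metadata", {}):
            errors.append(f"metadata.{field} is required")
    # PII columns must declare an access tier so policy can be enforced
    for col in manifest.get("spec", {}).get("columns", []):
        if col.get("pii") and "pii_tier" not in col:
            errors.append(f"column {col['name']}: pii column missing pii_tier")
    return errors

manifest = {
    "metadata": {"name": "payments.transactions", "domain": "payments"},
    "spec": {"columns": [{"name": "card_last_four", "pii": True}]},
}
print(lint_manifest(manifest))
```

Running this in CI on every manifest change keeps the contract honest before registration ever happens.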
Registering a data product via the internal platform API
When a domain team merges a new data product manifest, a CI step calls the platform API to register it in the catalog:
# scripts/register_data_product.py
import os
import sys
from pathlib import Path

import requests
import yaml

PLATFORM_API = os.environ["DATA_PLATFORM_API_URL"]
API_TOKEN = os.environ["DATA_PLATFORM_API_TOKEN"]


def register(manifest_path: str) -> None:
    spec = yaml.safe_load(Path(manifest_path).read_text())
    name = spec["metadata"]["name"]
    resp = requests.put(
        f"{PLATFORM_API}/v1/data-products/{name.replace('.', '/')}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "name": name,
            "domain": spec["metadata"]["domain"],
            "owner": spec["metadata"]["owner"],
            "slack_channel": spec["metadata"]["slack"],
            "description": spec["spec"]["description"],
            "schema_version": spec["spec"]["schema_version"],
            "sla": spec["spec"]["sla"],
            "columns": spec["spec"]["columns"],
            "access_policy": spec["spec"]["access"],
        },
        timeout=15,
    )
    if resp.status_code == 200:
        print(f"Updated existing data product: {name}")
    elif resp.status_code == 201:
        # Platform auto-provisions: catalog entry, default monitors, access policy
        print(f"Registered new data product: {name}")
    else:
        resp.raise_for_status()
    # Print catalog URL for the PR comment bot to link
    catalog_url = resp.json().get("catalog_url")
    print(f"Catalog: {catalog_url}")


if __name__ == "__main__":
    register(sys.argv[1])

Add this to CI so every merged manifest immediately appears in the catalog:
# .github/workflows/register-data-products.yml
on:
  push:
    branches: [main]
    paths: ["data-products/**/*.yaml"]

jobs:
  register:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # fetch two commits so HEAD~1 exists for the diff below
      - name: Register changed data products
        run: |
          git diff --name-only HEAD~1 HEAD -- 'data-products/**/*.yaml' \
            | xargs -I{} python scripts/register_data_product.py {}
        env:
          DATA_PLATFORM_API_URL: ${{ secrets.DATA_PLATFORM_API_URL }}
          DATA_PLATFORM_API_TOKEN: ${{ secrets.DATA_PLATFORM_API_TOKEN }}

Measuring platform adoption
Track platform health with a weekly metrics query against platform metadata tables. A platform team's goal is to shrink the ticket rate and grow self-serve coverage over time:
-- Weekly platform adoption report
-- Run in the data platform's own analytics schema
WITH data_products AS (
    SELECT
        domain,
        COUNT(*) AS total_products,
        COUNT(*) FILTER (WHERE sla_freshness_minutes IS NOT NULL)
            AS products_with_sla,
        COUNT(*) FILTER (WHERE quality_checks_enabled) AS products_with_checks,
        AVG(DATEDIFF('day', created_at, CURRENT_DATE)) AS avg_product_age_days
    FROM platform.data_product_registry
    WHERE is_active
    GROUP BY domain
),
tickets AS (
    -- Aggregate over the trailing 12 weeks; grouping by domain only
    -- keeps the final join one row per domain
    SELECT
        domain,
        COUNT(*) AS tickets_filed,
        AVG(resolve_hours) AS avg_resolve_hours
    FROM platform.support_tickets
    WHERE created_at >= DATEADD('week', -12, CURRENT_DATE)
    GROUP BY domain
),
incidents AS (
    SELECT
        owner_domain,
        COUNT(*) AS total_incidents,
        AVG(DATEDIFF('minute', detected_at, alerted_domain_at))
            AS avg_detection_to_domain_alert_min
    FROM platform.data_incidents
    WHERE created_at >= DATEADD('week', -4, CURRENT_DATE)
    GROUP BY owner_domain
)
SELECT
    dp.domain,
    dp.total_products,
    dp.products_with_sla,
    dp.products_with_checks,
    ROUND(100.0 * dp.products_with_checks / NULLIF(dp.total_products, 0), 1)
        AS quality_coverage_pct,
    t.tickets_filed,
    i.avg_detection_to_domain_alert_min
FROM data_products dp
LEFT JOIN tickets t ON dp.domain = t.domain
LEFT JOIN incidents i ON dp.domain = i.owner_domain
ORDER BY dp.total_products DESC;

Publish this report to a Slack channel each Monday. The two numbers that matter most: quality_coverage_pct (are domain teams using the quality framework?) and avg_detection_to_domain_alert_min (when something breaks, how quickly does the right team know?).
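Publishing can be as simple as formatting the report rows and POSTing them to a Slack incoming webhook. A sketch — the row shape matches the query's output columns, but the webhook URL and message format are placeholders:

```python
import json
import urllib.request

def format_report(rows: list) -> str:
    """Render weekly adoption rows as a Slack message body."""
    lines = [
        f"{r['domain']}: {r['quality_coverage_pct']}% check coverage, "
        f"{r['avg_detection_to_domain_alert_min']:.0f} min to domain alert"
        for r in rows
    ]
    return "Weekly platform adoption\n" + "\n".join(lines)

def post_report(webhook_url: str, rows: list) -> None:
    """POST the formatted report to a Slack incoming webhook."""
    payload = json.dumps({"text": format_report(rows)}).encode()
    req = urllib.request.Request(
        webhook_url, data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

rows = [{"domain": "payments", "quality_coverage_pct": 83.3,
         "avg_detection_to_domain_alert_min": 12.0}]
msg = format_report(rows)
print(msg)
```

Schedule it from the same CI system that runs the SQL, so the report and the metrics never drift apart.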