Data Mesh & Platform Thinking
Centralized data teams become bottlenecks at scale — every domain waits for one team to build their pipelines. Data mesh addresses this by treating data as a product, assigning domain ownership to the teams that understand the data best, and providing shared infrastructure that makes self-serve feasible. It is an organizational and architectural pattern, not a technology choice.
How It Works
[Interactive figure: centralized vs. mesh. Centralized mode: one team owns all pipelines; every domain waits in queue. Bottleneck: 4 domains, 1 team. Pipelines built without domain knowledge. Ownership unclear when things break.]
In the centralized model, all four domains (Growth, Payments, Ops, Marketing) queue requests to a single Data Platform team. Wait times of 2–5 weeks are common — not because the data team is slow, but because the queue is long and the team lacks the domain context to prioritize correctly.
In the mesh model, each domain team owns and publishes its own data product. The Data Platform team shifts from a delivery team to an infrastructure team — providing the shared tooling (catalog, lineage, quality checks) that makes domain ownership viable.
A centralized data team at scale is structurally equivalent to a centralized operations team in a software organization — a team that every engineering team must wait on for every deployment. The solution in software was DevOps: give each team the tools and responsibility to deploy their own services. Data mesh is the same answer applied to data: give each domain team the tools and responsibility to publish their own data products, with a platform team providing the shared infrastructure that makes it feasible without requiring every domain team to become data infrastructure experts.
The four principles of data mesh
Zhamak Dehghani's data mesh (2019) defines four principles:
1. Domain ownership: data is owned and produced by the domain team that understands it best. The Payments team owns payments.transactions — they understand the schema, the edge cases, the SLA that's appropriate. They don't hand it off to a centralized team that must learn the domain to serve it.
2. Data as a product: each domain publishes data products — datasets that meet explicit quality standards. A data product has:
- A discoverable name (payments.transactions, not stg_transactions_v3_final)
- An owner and SLA (freshness, availability guarantees)
- Documentation (schema, semantics, known limitations)
- Quality checks that run automatically
3. Self-serve data infrastructure: domains can only own their data if the infrastructure is easy enough to operate. The platform team provides the scaffolding — pipeline templates, catalog registration, quality check frameworks, access control tooling — so domain teams aren't expected to build it all themselves.
4. Federated governance: governance policies (PII classification, retention rules, access tiers) are set centrally but enforced through infrastructure, not approval queues. A domain team that creates a new dataset gets automated PII scanning; they don't wait for a governance team to review it.
Federated governance had to be infrastructure-enforced rather than process-enforced precisely because process-based governance does not scale with the number of domain teams. If a central governance team must review every new dataset before publication, the governance team becomes the bottleneck that data mesh was designed to eliminate. Encoding governance rules as automated checks (PII scanner runs on every column, retention policies applied by the catalog at registration time) makes governance instantaneous and consistent without requiring a human in the loop for every dataset — which is the only way it can function in a decentralized organization.
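To make the "PII scanner runs on every column" idea concrete, here is a minimal sketch of such an automated check in Python. The name heuristics and manifest shape are illustrative assumptions, not a real platform API — a production scanner would also sample column values, not just names:

```python
import re

# Hypothetical name heuristics suggesting a column holds PII (assumption,
# not an exhaustive or standard list)
PII_NAME_PATTERNS = [r"email", r"phone", r"ssn", r"card", r"address"]

def scan_columns(columns: list) -> list:
    """Return columns that look like PII but are not tagged pii: true."""
    flagged = []
    for col in columns:
        looks_pii = any(re.search(p, col["name"].lower()) for p in PII_NAME_PATTERNS)
        if looks_pii and not col.get("pii", False):
            flagged.append(col["name"])
    return flagged

# Column entries as they might appear in a data product manifest
columns = [
    {"name": "transaction_id", "pii": False},
    {"name": "card_last_four", "pii": False},  # mistagged: should be pii: true
    {"name": "user_email", "pii": True},       # correctly tagged
]
print(scan_columns(columns))  # ['card_last_four']
```

Run in CI on every manifest change, a check like this fails the build instead of waiting on a review queue — governance as infrastructure rather than process.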
What a data product looks like
A data product is a dataset with a contract:
| Property | Example |
|---|---|
| Name | payments.transactions |
| Owner | Payments engineering team |
| SLA | Updated within 1 hour of transaction; 99.9% availability |
| Schema | Versioned; breaking changes require 14-day notice |
| Quality | Row count monitor; referential integrity; NULL checks |
| Discovery | Registered in catalog with descriptions and example queries |
| Access | Self-serve for analysts; PII columns require approval |
The key difference from a shared database table: the domain team is accountable for the product quality, not just the pipeline.
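The SLA row above ("updated within 1 hour") is mechanically checkable, which is what makes it a contract rather than a hope. A minimal sketch of a freshness check, assuming the last-update timestamp comes from warehouse metadata (the function is hypothetical, not a platform API):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def check_freshness(last_updated: datetime, sla_minutes: int,
                    now: Optional[datetime] = None) -> bool:
    """True if the dataset was updated within its SLA window."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= timedelta(minutes=sla_minutes)

# payments.transactions promises updates within 1 hour (60 minutes)
check_time = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
fresh = check_freshness(datetime(2024, 6, 1, 11, 30, tzinfo=timezone.utc), 60, check_time)
stale = check_freshness(datetime(2024, 6, 1, 9, 0, tzinfo=timezone.utc), 60, check_time)
print(fresh, stale)  # True False
```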
Design Tradeoffs
Where Your Intuition Breaks
Data mesh is frequently misread as "every team builds their own data infrastructure" — which would mean dozens of teams independently choosing data tools, building pipelines from scratch, and maintaining their own catalog integrations. That interpretation produces chaos, not autonomy. The enabler of domain ownership is the platform: shared infrastructure that is opinionated enough to handle the hard parts (lineage, quality, discovery, access control) so that domain teams only need to handle the domain-specific parts (schema, semantics, SLA). Without a strong platform team providing excellent self-serve tools, data mesh devolves into "data swamp, but distributed." The organizational shift is from "central team delivers data to consumers" to "central team enables domain teams to deliver their own data" — a fundamentally different mission for the platform team, not the absence of one.
When centralized works fine
Data mesh is an organizational response to a scaling problem. For smaller organizations, it's likely unnecessary overhead:
| Organization | Right model |
|---|---|
| Small team (1–5 data engineers) | Centralized — everyone knows the domain |
| Medium (5–20 data engineers) | Centralized with domain specialization |
| Large, multiple independent domains | Consider mesh for high-demand domains |
| Monorepo, single product | Centralized — domain boundaries are unclear |
Data mesh requires that domains have meaningful data ownership. If all data lives in one operational database, there's no natural boundary for domain teams to own.
The coordination overhead of mesh
Mesh trades centralized bottlenecks for distributed coordination. What gets harder:
Cross-domain joins: a single query that joins payments.transactions with growth.user_events with ops.fulfillment must navigate three data products with potentially different schemas, SLA guarantees, and update schedules. In a centralized team, this join is one model; in mesh, it requires coordination between three teams.
Consistent semantic definitions: "active user" means different things to Growth (logged in last 30 days) and Payments (placed an order last 90 days). Without a central team to adjudicate, you get metric proliferation. Federated governance must define canonical metrics centrally, even if data products are distributed.
Onboarding: new analysts joining need to learn which domain owns which data, how to find the catalog, what the SLA guarantees mean in practice. Centralized teams are easier to onboard to.
Platform thinking without full mesh
Even if you don't adopt data mesh organization-wide, platform thinking improves centralized data teams:
- Self-serve catalog: analysts can discover and understand data without filing tickets
- Pipeline templates: new ingestion jobs follow a standard pattern; anyone can follow it
- Automated quality checks: new models get a standard test suite by default
- Access control automation: PII access requests are automated, not manual reviews
These capabilities reduce the ticket queue even without full domain ownership.
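The "standard test suite by default" idea can be sketched as manifest-driven test generation: the platform derives baseline checks from the product spec, and domain teams only add domain-specific tests. The check names below mirror this page's examples; the generator itself is an illustrative assumption:

```python
def default_checks(spec: dict) -> list:
    """Derive a baseline test suite from a data product spec."""
    checks = [{"type": "freshness"}, {"type": "row_count"}]
    # Every column declared non-nullable gets a not_null check by default
    non_null = [c["name"] for c in spec.get("columns", [])
                if not c.get("nullable", True)]
    if non_null:
        checks.append({"type": "not_null", "columns": non_null})
    return checks

spec = {
    "columns": [
        {"name": "transaction_id", "nullable": False},
        {"name": "card_last_four", "nullable": True},
    ]
}
print(default_checks(spec))
```

The point of the pattern: baseline coverage exists even if a domain team writes zero tests themselves.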
In Practice
Building a self-serve data platform
The platform team's job is to make domain teams' jobs easy. Concrete capabilities:
Catalog registration: when a domain team creates a new dbt model, it's automatically discoverable in the catalog. Schema, owner, and freshness are populated without manual steps.
Pipeline templates: the platform provides a standard template for common ingestion patterns (Postgres CDC, Kafka topic, S3 file). Domains fill in source-specific configuration; the platform handles retry logic, monitoring, and error routing.
Quality framework: every data product registered with the platform gets a default test suite (row count monitor, freshness check). Domain teams add custom tests for their semantic rules; the platform ensures baseline coverage exists.
Access control: the platform enforces the org's access policy automatically. Creating a column tagged as PII triggers a mandatory access review workflow. Columns without PII tags are accessible to all analysts. No governance team approval queue.
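That access rule — default access for untagged columns, an approval workflow for PII — reduces to a small policy function. A sketch (the policy keys follow the manifest example later on this page; the function itself is illustrative):

```python
def column_access(column: dict, access_policy: dict) -> str:
    """Resolve a column's effective access tier from the product's access policy."""
    if column.get("pii", False):
        # PII columns route through the approval workflow
        return access_policy.get("pii_columns", "requires_approval")
    # Untagged columns fall through to the product-wide default
    return access_policy.get("default", "analysts")

policy = {"default": "analysts", "pii_columns": "requires_approval"}
print(column_access({"name": "amount_usd", "pii": False}, policy))     # analysts
print(column_access({"name": "card_last_four", "pii": True}, policy))  # requires_approval
```

Because the policy is data, the platform can evaluate it at query time without any human in the loop.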
Measuring platform success
A platform team's success is measured by how fast domain teams can ship data products independently:
- Time to first data product: how long does it take a new domain team to publish their first data product from scratch?
- Platform ticket rate: what fraction of domain data work requires a ticket to the platform team? Target: less than 20%
- Data product coverage: what percentage of important business datasets are published as data products with owners, SLAs, and quality checks?
- Incident attribution: when a data incident occurs, is the owning domain team alerted first, or does the centralized team find out first?
Adopting incrementally
Data mesh doesn't have to be all-or-nothing. A pragmatic adoption path:
- Catalog first: deploy a catalog and get every important table registered with owners and documentation. This surfaces ownership clarity without changing pipelines.
- Data contracts for high-demand datasets: start with the 10 most-queried datasets. Add explicit contracts and notify consumers before breaking changes.
- Domain self-service for new datasets: new datasets built by domain teams use platform templates and are registered automatically. Don't migrate existing centralized pipelines.
- Federated governance for PII: automate PII classification and access control. Remove manual approval queues for data access.
Incremental adoption lets you validate that domain teams can handle the responsibility before shifting accountability broadly.
Production Patterns
Data product spec as code
Each domain publishes a manifest file alongside their pipeline code. This makes data product metadata version-controlled and diff-able in PRs:
# data-products/payments/transactions.yaml
apiVersion: data-platform/v1
kind: DataProduct
metadata:
  name: payments.transactions
  domain: payments
  owner: payments-engineering
  slack: "#payments-eng"
  created: "2024-01-15"
spec:
  description: >
    All payment transactions processed through the payments service.
    One row per transaction attempt. Includes successful, failed, and
    refunded states. Updated within 5 minutes of transaction commit.
  schema_version: "3.2.0"
  breaking_change_notice_days: 14
  sla:
    freshness_minutes: 5
    availability_percent: 99.9
    row_count_daily_min: 100000
  columns:
    - name: transaction_id
      type: VARCHAR(36)
      nullable: false
      pii: false
      description: "UUID. Primary key."
    - name: user_id
      type: VARCHAR(36)
      nullable: false
      pii: false
    - name: amount_usd
      type: NUMERIC(18, 4)
      nullable: false
      pii: false
    - name: card_last_four
      type: VARCHAR(4)
      nullable: true
      pii: true
      pii_tier: confidential
  access:
    default: analysts  # all analysts can read non-PII columns
    pii_columns: requires_approval
    approval_owner: privacy-team
  quality_checks:
    - type: freshness
    - type: row_count
    - type: not_null
      columns: [transaction_id, user_id, amount_usd]
    - type: referential_integrity
      column: user_id
      references: growth.users.user_id

The platform team provides a linter (data-product lint transactions.yaml) that validates required fields, checks that referenced tables exist in the catalog, and warns if PII columns lack an access tier.
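The linter's core validations can be sketched in a few dozen lines. The rule set below is an illustrative assumption — a real linter would also resolve referenced tables against the live catalog:

```python
def lint_manifest(manifest: dict) -> list:
    """Return a list of lint errors for a data product manifest."""
    errors = []
    # Required metadata fields
    for field in ("name", "domain", "owner"):
        if field not in manifest.get("metadata", {}):
            errors.append(f"metadata.{field} is required")
    # PII columns must declare an access tier so policy can be enforced
    for col in manifest.get("spec", {}).get("columns", []):
        if col.get("pii") and "pii_tier" not in col:
            errors.append(f"column {col['name']}: pii column missing pii_tier")
    return errors

manifest = {
    "metadata": {"name": "payments.transactions", "domain": "payments"},
    "spec": {"columns": [{"name": "card_last_four", "pii": True}]},
}
print(lint_manifest(manifest))
```

Running this in CI on every manifest change keeps the contract honest before registration ever happens.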
Registering a data product via the internal platform API
When a domain team merges a new data product manifest, a CI step calls the platform API to register it in the catalog:
# scripts/register_data_product.py
import os
import sys
from pathlib import Path

import requests
import yaml

PLATFORM_API = os.environ["DATA_PLATFORM_API_URL"]
API_TOKEN = os.environ["DATA_PLATFORM_API_TOKEN"]


def register(manifest_path: str) -> None:
    spec = yaml.safe_load(Path(manifest_path).read_text())
    name = spec["metadata"]["name"]
    resp = requests.put(
        f"{PLATFORM_API}/v1/data-products/{name.replace('.', '/')}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "name": name,
            "domain": spec["metadata"]["domain"],
            "owner": spec["metadata"]["owner"],
            "slack_channel": spec["metadata"]["slack"],
            "description": spec["spec"]["description"],
            "schema_version": spec["spec"]["schema_version"],
            "sla": spec["spec"]["sla"],
            "columns": spec["spec"]["columns"],
            "access_policy": spec["spec"]["access"],
        },
        timeout=15,
    )
    if resp.status_code == 200:
        print(f"Updated existing data product: {name}")
    elif resp.status_code == 201:
        # Platform auto-provisions: catalog entry, default monitors, access policy
        print(f"Registered new data product: {name}")
    else:
        resp.raise_for_status()
    # Print catalog URL for the PR comment bot to link
    catalog_url = resp.json().get("catalog_url")
    print(f"Catalog: {catalog_url}")


if __name__ == "__main__":
    register(sys.argv[1])

Add this to CI so every merged manifest immediately appears in the catalog:
# .github/workflows/register-data-products.yml
on:
  push:
    branches: [main]
    paths: ["data-products/**/*.yaml"]

jobs:
  register:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # fetch two commits so HEAD~1 exists for the diff below
      - name: Register changed data products
        run: |
          git diff --name-only HEAD~1 HEAD -- 'data-products/**/*.yaml' \
            | xargs -I{} python scripts/register_data_product.py {}
        env:
          DATA_PLATFORM_API_URL: ${{ secrets.DATA_PLATFORM_API_URL }}
          DATA_PLATFORM_API_TOKEN: ${{ secrets.DATA_PLATFORM_API_TOKEN }}

Measuring platform adoption
Track platform health with a weekly metrics query against platform metadata tables. A platform team's goal is to shrink the ticket rate and grow self-serve coverage over time:
-- Weekly platform adoption report
-- Run in the data platform's own analytics schema
WITH data_products AS (
    SELECT
        domain,
        COUNT(*) AS total_products,
        COUNT(*) FILTER (WHERE sla_freshness_minutes IS NOT NULL)
            AS products_with_sla,
        COUNT(*) FILTER (WHERE quality_checks_enabled) AS products_with_checks,
        AVG(DATEDIFF('day', created_at, CURRENT_DATE)) AS avg_product_age_days
    FROM platform.data_product_registry
    WHERE is_active
    GROUP BY domain
),
tickets AS (
    -- Aggregate over the trailing 12 weeks; grouping by domain only
    -- keeps the final join one row per domain
    SELECT
        domain,
        COUNT(*) AS tickets_filed,
        AVG(resolve_hours) AS avg_resolve_hours
    FROM platform.support_tickets
    WHERE created_at >= DATEADD('week', -12, CURRENT_DATE)
    GROUP BY domain
),
incidents AS (
    SELECT
        owner_domain,
        COUNT(*) AS total_incidents,
        AVG(DATEDIFF('minute', detected_at, alerted_domain_at))
            AS avg_detection_to_domain_alert_min
    FROM platform.data_incidents
    WHERE created_at >= DATEADD('week', -4, CURRENT_DATE)
    GROUP BY owner_domain
)
SELECT
    dp.domain,
    dp.total_products,
    dp.products_with_sla,
    dp.products_with_checks,
    ROUND(100.0 * dp.products_with_checks / NULLIF(dp.total_products, 0), 1)
        AS quality_coverage_pct,
    t.tickets_filed,
    i.avg_detection_to_domain_alert_min
FROM data_products dp
LEFT JOIN tickets t ON dp.domain = t.domain
LEFT JOIN incidents i ON dp.domain = i.owner_domain
ORDER BY dp.total_products DESC;

Publish this report to a Slack channel each Monday. The two numbers that matter most: quality_coverage_pct (are domain teams using the quality framework?) and avg_detection_to_domain_alert_min (when something breaks, how quickly does the right team know?).
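Publishing can be as simple as formatting the report rows and POSTing them to a Slack incoming webhook. A sketch — the row shape matches the query's output columns, but the webhook URL and message format are placeholders:

```python
import json
import urllib.request

def format_report(rows: list) -> str:
    """Render weekly adoption rows as a Slack message body."""
    lines = [
        f"{r['domain']}: {r['quality_coverage_pct']}% check coverage, "
        f"{r['avg_detection_to_domain_alert_min']:.0f} min to domain alert"
        for r in rows
    ]
    return "Weekly platform adoption\n" + "\n".join(lines)

def post_report(webhook_url: str, rows: list) -> None:
    """POST the formatted report to a Slack incoming webhook."""
    payload = json.dumps({"text": format_report(rows)}).encode()
    req = urllib.request.Request(
        webhook_url, data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

rows = [{"domain": "payments", "quality_coverage_pct": 83.3,
         "avg_detection_to_domain_alert_min": 12.0}]
msg = format_report(rows)
print(msg)
```

Schedule it from the same CI system that runs the SQL, so the report and the metrics never drift apart.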