
Data Modeling

How you structure data in a warehouse is as consequential as how you query it. Kimball's dimensional modeling (star schemas, slowly changing dimensions) optimizes for analyst comprehension and query speed. Data vault trades readability for auditability and schema flexibility. One Big Table (OBT) trades storage for join elimination. Each model is a set of tradeoffs, not a universal answer.

How It Works

Slowly Changing Dimensions — Alice Chen moves from USA to Canada

SCD Type 2 — Add row: expire the old row, insert a new row. Full history preserved.

| customer_id | name       | country | email        | valid_from | valid_to   | is_current |
|-------------|------------|---------|--------------|------------|------------|------------|
| 42          | Alice Chen | USA     | alice@co.com | 2021-03-10 | 2024-06-30 | false      |
| 42          | Alice Chen | Canada  | alice@co.com | 2024-06-30 | current    | true       |

A join on customer_id alone now matches every version of the customer; queries need a date filter to pick the right one, which is why fact tables carry a surrogate key instead.

The three SCD types handle the same attribute change differently, each with its own consequence: Type 1 rewrites history, Type 2 preserves it at the cost of query complexity, and Type 3 stores only one previous value.

When a customer changes their country, the fact tables recording their purchases don't change — but the answer to "which country did this purchase belong to?" changes depending on whether you ask about the customer's country today or at purchase time. Dimensional modeling forces a decision about this ambiguity at model design time, not at query time. Each SCD type is a different answer to "which version of the truth do you want?"

Dimensional modeling: Kimball's approach

Ralph Kimball's dimensional modeling organizes warehouse data around two table types: fact tables (events and measurements) and dimension tables (descriptive attributes). We covered the star schema structure in the warehouse lesson; here we go deeper on the modeling decisions within that structure.

Grain: the most important modeling decision is defining the grain of a fact table — what exactly does one row represent? "One row per order" is clearer and more useful than "one row per order-product combination" only if you never need to analyze line items. Wrong grain leads to double-counting or missing granularity that's impossible to recover without rebuilding. Define grain explicitly and document it.
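One way to make the grain enforceable rather than just documented: a uniqueness test on the grain columns, sketched here with dbt and the dbt_utils package (fct_order_items and its columns are illustrative).

yaml
# models/marts/schema.yml
models:
  - name: fct_order_items
    description: "Grain: one row per order per product (one line item)."
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_id
            - product_id

If a source change ever produces two rows for the same order-product pair, the test fails instead of silently double-counting downstream.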

Additive, semi-additive, and non-additive facts: revenue is fully additive — you can SUM it across any dimension (time, geography, product). inventory_balance is semi-additive — summing across products makes sense, but summing across time does not (you don't add Monday's inventory to Tuesday's). profit_margin is non-additive — summing it is meaningless, and a naive average weights every row equally regardless of size. Non-additive measures should generally not be stored; compute them from additive components (e.g., SUM(profit) / SUM(revenue)) at query time.
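The two safe aggregation patterns, as a sketch (Postgres-style SQL; table and column names are illustrative):

sql
-- Semi-additive: SUM a balance across products if you like, but across time
-- take the period-end snapshot instead of a SUM
SELECT product_id, inventory_balance
FROM fct_inventory_daily
WHERE snapshot_date = DATE '2024-06-30';  -- last day of the period

-- Non-additive: rebuild the ratio from additive components at query time
SELECT
  product_id,
  SUM(profit_cents)::float / SUM(revenue_cents) AS profit_margin  -- not AVG(profit_margin)
FROM fct_sales
GROUP BY product_id;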

Slowly Changing Dimensions

Real-world attributes change. A customer moves countries. A product changes category. A sales rep transfers regions. How you handle these changes determines whether historical reports are accurate or misleading.

Type 1 — Overwrite: update the dimension row in place. Simple, but all historical reporting now uses the current attribute value. Revenue attributed to the customer's old country disappears. Use only when historical accuracy of that attribute doesn't matter.

Type 2 — Add row: expire the old row (set valid_to = change_date, is_current = false) and insert a new row. Full history is preserved. The fact table's foreign key points to the dimension surrogate key, which is version-specific. Historical reports are accurate. The cost: queries that want "current" attributes must filter WHERE is_current = true, and JOIN logic becomes more complex.

Type 3 — Add column: add a previous_value column alongside current_value. Simple to query, but limited to one historical value. A second change overwrites the previous value column. Rarely the right choice except for cases where you truly only need "current and immediately prior."
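The mechanics of the three types, sketched as plain warehouse SQL for the customer-moves-country case (table names and dates are illustrative):

sql
-- Type 1: overwrite in place; history is gone
UPDATE dim_customer SET country = 'Canada' WHERE customer_id = 42;

-- Type 2: expire the current version, insert a new one
UPDATE dim_customer
SET valid_to = DATE '2024-06-30', is_current = false
WHERE customer_id = 42 AND is_current = true;

INSERT INTO dim_customer (customer_id, name, country, valid_from, valid_to, is_current)
VALUES (42, 'Alice Chen', 'Canada', DATE '2024-06-30', NULL, true);

-- Type 3: shift the old value into a dedicated "previous" column
UPDATE dim_customer
SET previous_country = country, country = 'Canada'
WHERE customer_id = 42;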

SCD Type 2 had to introduce surrogate keys — not because natural keys are insufficient for the application database, but because the same customer_id = 42 must now refer to two different rows in the dimension table (the version before and after the change). A fact table row pointing to customer_id = 42 is ambiguous: which version of the customer does it describe? Surrogate keys resolve this by making the foreign key version-specific: dim_customer_sk = 8017 unambiguously points to one row representing customer 42 as they were between 2023-01-15 and 2024-06-30. The added indirection is the necessary cost of having both point-in-time accuracy and the ability to join on current attributes.

Data Vault

Data Vault is an alternative modeling approach designed for auditability, flexibility, and parallel loading. Three table types replace fact/dimension: hubs (unique business keys — one row per customer_id, product_id), links (many-to-many relationships between hubs), and satellites (descriptive attributes with full history, every change creates a new row with a timestamp).

Data Vault is strictly insert-only: no rows are ever updated or deleted, and every load is an INSERT. This makes pipelines safe to parallelize and easy to audit: you can trace every piece of data to its source load. The tradeoff is query complexity: even a simple "current customer attributes" query requires joining a hub to a satellite and filtering for the latest record.
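What that query looks like against a minimal vault, assuming Snowflake-style QUALIFY (hub, satellite, and column names are illustrative):

sql
-- Current customer attributes: hub (business key) + satellite (attribute history),
-- keeping only the most recently loaded satellite row per customer
SELECT
  h.customer_id,
  s.email,
  s.country_code
FROM hub_customer h
JOIN sat_customer_details s
  ON s.hub_customer_hk = h.hub_customer_hk
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY s.hub_customer_hk
  ORDER BY s.load_date DESC
) = 1;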

Data Vault fits organizations with: frequent source system schema changes (satellite tables can evolve independently), strict audit requirements (every historical state is preserved), or many parallel ingestion teams (each source system owns its own hub/satellite set).

One Big Table (OBT)

Denormalize everything into a single wide table. No JOINs at query time — every attribute is a column. Dashboards are fast, and analysts don't need to know the schema. The cost: storage (every combination of dimension attributes is repeated for every fact row), update complexity (changing a dimension attribute requires rewriting many rows), and a table that can have hundreds of columns that become hard to manage.

OBT works well as the final layer for a specific, stable dashboard — a pre-joined, pre-aggregated table purpose-built for one use case. It does not work as a general-purpose data model.
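A minimal OBT as a dbt mart model, reusing this lesson's dim/fact names (obt_orders_dashboard and the dim_product columns are assumptions):

sql
-- models/marts/obt_orders_dashboard.sql
{{ config(materialized='table') }}

SELECT
  o.order_id,
  o.order_date,
  o.amount_cents,
  c.country_code,   -- dimension attributes repeated on every order row
  c.plan_tier,
  p.product_name,
  p.category
FROM {{ ref('fct_orders') }} o
JOIN {{ ref('dim_customer') }} c ON o.customer_sk = c.customer_sk
JOIN {{ ref('dim_product') }}  p ON o.product_sk  = p.product_sk

The dashboard queries this table with no joins; the cost is that every dimension change requires rebuilding the table.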

Design Tradeoffs

Where Your Intuition Breaks

SCD Type 2 is the "correct" approach for historical accuracy, but it introduces a query pattern that trips up almost everyone: joins that silently return wrong results. If you join fct_orders to dim_customer on the natural key (customer_id) instead of the surrogate key (dim_customer_sk), queries return multiple rows per order — one for each version of the customer. The result is inflated row counts and wrong aggregates, with no error message. The warehouse executes the query exactly as written. This is not a framework bug; it is the expected behavior of a fan-out join, and it happens because the surrogate key discipline was not enforced. In practice, most SCD Type 2 issues are discovered by analysts reporting "revenue is wrong" weeks after a dimension version change, when the fan-out join silently doubled transaction counts. Enforcing surrogate key joins in staging and testing for row count stability after dimension changes are the standard defenses.

When to use each approach

|                     | Kimball Star           | Data Vault                 | OBT                           |
|---------------------|------------------------|----------------------------|-------------------------------|
| Query simplicity    | Good                   | Complex                    | Best                          |
| Historical accuracy | Good (with SCD Type 2) | Excellent                  | Depends on build              |
| Schema flexibility  | Limited                | High                       | Low                           |
| Load parallelism    | Limited                | High                       | Low                           |
| Storage efficiency  | Good                   | Moderate                   | Poor                          |
| Best for            | BI-focused teams       | Complex audit requirements | Single dashboard optimization |

SCD Type 2 surrogate keys

When using SCD Type 2, you cannot use the natural business key (e.g., customer_id) as the foreign key in fact tables — different versions of the same customer would share the same natural key. You need a surrogate key: a system-generated integer that uniquely identifies one specific version of a dimension record. dim_customer_sk points to exactly one row in dim_customer, capturing the customer's attributes as they were at the time of the transaction.

dbt handles this with the dbt_utils.generate_surrogate_key() macro, which hashes the natural key plus the effective date to produce a stable surrogate key that survives full refreshes.

In Practice

Choosing grain

For a subscription SaaS product, the right grain for fct_subscriptions is "one row per subscription per billing period" — not "one row per subscription" (loses history) and not "one row per subscription per day" (unnecessary granularity). The billing period grain lets you aggregate MRR, churn, and expansion across any dimension without double-counting.

SCD Type 2 in practice

The most common SCD Type 2 pattern in dbt uses the snapshot command:

sql
-- snapshots/dim_customers.sql
{% snapshot customers_snapshot %}
  {{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at',
  ) }}
  SELECT * FROM {{ source('production', 'customers') }}
{% endsnapshot %}

dbt runs this on a schedule and automatically manages the dbt_valid_from and dbt_valid_to columns (a NULL dbt_valid_to marks the current version). The resulting snapshot table is used as the source for dim_customer in the mart layer.

The "conformed dimension" principle

When multiple fact tables need to join to the same dimension, use the same dimension table — don't build separate copies of dim_customer for orders, support tickets, and product events. A conformed dimension lets you join across fact tables using a shared key, enabling cross-domain analysis. It also means a schema change to the dimension only needs to happen in one place.
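The payoff, sketched as a cross-domain query (fct_tickets is a hypothetical second fact table sharing the same dim_customer):

sql
-- Orders and support tickets per country, through one conformed dim_customer
SELECT
  c.country_code,
  COUNT(DISTINCT o.order_id)  AS orders,
  COUNT(DISTINCT t.ticket_id) AS tickets
FROM dim_customer c
LEFT JOIN fct_orders  o ON o.customer_sk = c.customer_sk
LEFT JOIN fct_tickets t ON t.customer_sk = c.customer_sk
WHERE c.is_current = true
GROUP BY c.country_code
-- COUNT(DISTINCT ...) guards against the orders x tickets fan-out per customer

With separate per-domain copies of dim_customer, this query would require reconciling two possibly divergent dimension tables first.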

Production Patterns

SCD Type 2 with dbt snapshots

The dbt snapshot command manages SCD Type 2 history automatically. Configure it once; dbt maintains dbt_valid_from and dbt_valid_to on every run, with a NULL dbt_valid_to marking the current version.

sql
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
  {{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='updated_at',
      invalidate_hard_deletes=True
    )
  }}
  SELECT
    customer_id,
    email,
    country_code,
    plan_tier,
    updated_at
  FROM {{ source('production', 'customers') }}
{% endsnapshot %}

Then build dim_customer from the snapshot, generating a stable surrogate key that survives full refreshes:

sql
-- models/marts/dim_customer.sql
{{ config(materialized='table') }}
 
SELECT
  {{ dbt_utils.generate_surrogate_key(['customer_id', 'dbt_valid_from']) }} AS customer_sk,
  customer_id,
  email,
  country_code,
  plan_tier,
  dbt_valid_from  AS valid_from,
  dbt_valid_to    AS valid_to,
  dbt_valid_to IS NULL AS is_current
FROM {{ ref('customers_snapshot') }}

The surrogate key customer_sk is what fact tables use as their foreign key — it uniquely identifies one version of a customer record. Analysts querying current state filter WHERE is_current = true; historical reports join on the surrogate key without any filter.

Testing referential integrity between fact and dimension tables

Broken foreign keys are a silent correctness failure — fct_orders rows that reference a non-existent customer_sk produce NULLs in reports without any error. Enforce this with dbt relationship tests and custom singular tests.

yaml
# models/marts/schema.yml
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_sk
        tests:
          - not_null
          - relationships:
              to: ref('dim_customer')
              field: customer_sk
      - name: product_sk
        tests:
          - not_null
          - relationships:
              to: ref('dim_product')
              field: product_sk
 
  - name: dim_customer
    columns:
      - name: customer_sk
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null

For SCD Type 2 dimensions, the relationships test against dim_customer.customer_sk only passes if every foreign key in fct_orders points to an existing versioned row — including historical rows. This means a fact row with a 2022 order date will join to the customer's 2022 customer_sk, not their current one.

Add a singular test to catch orphaned facts that lack any matching dimension row across all versions:

sql
-- tests/assert_no_orphaned_orders.sql
-- Returns rows if any order references a customer that doesn't exist
-- in any version of dim_customer (not just the current version)
SELECT o.order_id, o.customer_sk
FROM {{ ref('fct_orders') }} o
LEFT JOIN {{ ref('dim_customer') }} c ON o.customer_sk = c.customer_sk
WHERE c.customer_sk IS NULL

Managing SCD Type 2 in application queries

The most common SCD Type 2 query mistake is joining a fact table to a dimension on the natural key without constraining to one version: the join matches every historical version of the customer and returns multiple rows per fact.

sql
-- WRONG: returns duplicate fact rows when a customer has multiple SCD versions
SELECT
  o.order_id,
  o.amount_cents,
  c.country_code
FROM fct_orders o
JOIN dim_customer c ON o.customer_id = c.customer_id  -- natural key matches every version
-- Missing: no constraint to one version (is_current or a valid date range)
 
-- CORRECT pattern 1: use surrogate key (already version-specific — no extra filter needed)
SELECT
  o.order_id,
  o.amount_cents,
  c.country_code
FROM fct_orders o
JOIN dim_customer c ON o.customer_sk = c.customer_sk
-- No filter needed: surrogate key uniquely identifies one SCD version
 
-- CORRECT pattern 2: get current attributes only (for "as of today" reports)
SELECT
  o.order_id,
  o.amount_cents,
  c.country_code
FROM fct_orders o
JOIN dim_customer c
  ON o.customer_id = c.customer_id  -- join on natural key
  AND c.is_current = true           -- constrain to current version only
-- Caveat: this reports current country, not country at time of order

Document which join pattern is correct for each use case in your mart's schema.yml description. The surrogate key join is the default; the natural key plus is_current join is appropriate only when you explicitly want current-state reporting regardless of historical accuracy.
