Neural-Path/Notes
30 min

The Modern Data Warehouse

Analytical queries scan billions of rows and aggregate across dimensions — a workload that row-oriented OLTP databases are fundamentally ill-suited for. The modern cloud data warehouse reorients storage around columns, separates compute from storage, and makes queries that would take hours finish in seconds.

How It Works

Star Schema

dim_date
  • date_id
  • date
  • month
  • quarter
  • year
dim_product
  • product_id
  • name
  • category
  • unit_price
dim_channel
  • channel_id
  • name
  • type
  • source
dim_customer
  • customer_id
  • name
  • country
  • segment
fact_orders
  • order_id
  • customer_id ▸
  • product_id ▸
  • date_id ▸
  • channel_id ▸
  • revenue
  • quantity

Most analytical queries need only the fact table plus one or two specific dimensions; the star schema is designed to make those joins fast and obvious.

The normalized schema that protects your OLTP database from update anomalies becomes a performance liability in analytics: every query that spans entities requires chaining JOINs through a foreign-key graph. The warehouse inverts this by accepting controlled redundancy — denormalized dimensions, columnar storage, prepartitioned data — to make the query that runs thousands of times per day faster, at the cost of making schema updates slightly more complex.

Star schema: the warehouse data model

The OLTP data model you'd use for an e-commerce app is normalized: orders references customers by foreign key, order_items references products, products references categories. This minimizes redundancy and keeps writes fast. But analytical queries that need customer name, product category, and order date all at once require chaining multiple JOINs — expensive at scale.

The warehouse uses a star schema: one central fact table surrounded by dimension tables. The fact table records events or measurements (orders, pageviews, transactions) and contains foreign keys into each dimension plus numeric measures. Dimension tables are deliberately denormalized — they flatten hierarchies and repeat values so that a query only needs to join the fact table to one or two dimensions, not traverse a chain of normalized tables.

Fact tables are append-only and immutable. Each row represents a single business event. They tend to be very wide (many foreign keys and measures) and very tall (billions of rows over time). Dimension tables are narrow and relatively small — often just millions of rows describing entities that change slowly.
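The fact/dimension pattern can be sketched with an in-memory SQLite database. SQLite is row-oriented, so this illustrates only the schema shape and the join pattern, not columnar performance; table and column names follow the diagram above, and the sample rows are invented:

```python
import sqlite3

# Build a miniature star schema: two dimensions plus a fact table of events.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT, segment TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, date TEXT, month INTEGER, quarter INTEGER, year INTEGER);
CREATE TABLE fact_orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, date_id INTEGER, revenue REAL, quantity INTEGER);
""")
cur.executemany("INSERT INTO dim_customer VALUES (?,?,?,?)",
                [(1, "Ada", "DE", "smb"), (2, "Grace", "US", "enterprise")])
cur.executemany("INSERT INTO dim_date VALUES (?,?,?,?,?)",
                [(10, "2024-01-15", 1, 1, 2024), (11, "2024-04-02", 4, 2, 2024)])
cur.executemany("INSERT INTO fact_orders VALUES (?,?,?,?,?)",
                [(100, 1, 10, 250.0, 5), (101, 2, 10, 900.0, 2), (102, 1, 11, 120.0, 1)])

# A typical analytical query: the fact table joined to only the
# one or two dimensions it actually needs.
rows = cur.execute("""
    SELECT d.quarter, c.country, SUM(f.revenue) AS revenue
    FROM fact_orders f
    JOIN dim_date d     ON f.date_id = d.date_id
    JOIN dim_customer c ON f.customer_id = c.customer_id
    GROUP BY d.quarter, c.country
    ORDER BY d.quarter, c.country
""").fetchall()
print(rows)  # [(1, 'DE', 250.0), (1, 'US', 900.0), (2, 'DE', 120.0)]
```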

Why columnar storage makes aggregations fast

In a columnar store, all values for a single column are stored together in compressed blocks. For SELECT SUM(revenue) FROM fact_orders, the engine reads only the revenue column's blocks, ignoring every other field entirely. This typically means 10–50× less data scanned than in a row-oriented store.

Three compression techniques compound the benefit:

  • Dictionary encoding: for a column like country with only ~200 distinct values across billions of rows, replace each string with a 1-byte integer key. The entire column shrinks dramatically.
  • Run-length encoding: for a column like is_deleted where long sequences of identical values are common, store (value, count) pairs instead of repeating values. false × 4,000,000 becomes a single entry.
  • Delta encoding: for sorted numeric columns like timestamp, store differences between consecutive values instead of absolute values. Small deltas compress much better than raw timestamps.

These compression schemes depend on column orientation: compression works by exploiting repetition and similarity among values, and adjacent values in the same column are far more similar to each other (all countries, all timestamps) than adjacent values in the same row (a country, a timestamp, an order amount, a boolean). This physical co-location of similar values is what makes compression ratios of 5–10× achievable; row storage cannot match it because each row interleaves columns with different data types and value ranges.
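Toy versions of the three encodings make the mechanics concrete. These are minimal sketches over Python lists, not the block-level, typed-buffer implementations real engines use:

```python
def dictionary_encode(values):
    """Replace repeated strings with small integer codes."""
    codes = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return codes, [codes[v] for v in values]

def run_length_encode(values):
    """Collapse runs of identical values into (value, count) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return [(v, n) for v, n in out]

def delta_encode(sorted_values):
    """Store the first value, then differences between neighbors."""
    return [sorted_values[0]] + [b - a for a, b in zip(sorted_values, sorted_values[1:])]

countries = ["DE", "US", "DE", "DE", "US", "FR"]
codes, encoded = dictionary_encode(countries)
print(encoded)                     # [0, 1, 0, 0, 1, 2]

flags = [False] * 5 + [True] * 2
print(run_length_encode(flags))    # [(False, 5), (True, 2)]

timestamps = [1700000000, 1700000003, 1700000007]
print(delta_encode(timestamps))    # [1700000000, 3, 4]
```

The small integer codes, run pairs, and deltas then compress far better than the raw values they replace.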

Partition pruning and clustering

Data warehouses physically organize data into partitions — groups of files that share the same partition key value (usually date). A query with WHERE date >= '2024-01-01' can completely skip partitions outside that range, reading zero bytes from them. This is called partition pruning and is often the single biggest lever for query performance.

Clustering (Snowflake) or clustering keys (BigQuery) further sort data within partitions by a secondary key (e.g., customer_id). Queries that filter on the cluster key read only the micro-partitions where matching values actually live, skipping the rest.
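A toy model of pruning: partitions keyed by day, and a query whose date filter lets whole partitions be skipped without reading a single row. The data and the scan function are invented for illustration; real engines skip files, not dictionary entries:

```python
from datetime import date

# Each partition holds (user_id, revenue) rows for one day. Clustering would
# additionally sort rows within a partition so filters on the cluster key
# can skip blocks inside it.
partitions = {
    date(2023, 12, 31): [("u1", 10.0), ("u2", 5.0)],
    date(2024, 1, 1):   [("u1", 7.0), ("u3", 2.0)],
    date(2024, 1, 2):   [("u2", 4.0)],
}

def scan(partitions, start):
    """Sum revenue for days >= start, counting how many rows were read."""
    scanned, total = 0, 0.0
    for day, rows in partitions.items():
        if day < start:
            continue                 # pruned: zero rows read from this partition
        for _user, revenue in rows:
            scanned += 1
            total += revenue
    return scanned, total

print(scan(partitions, date(2024, 1, 1)))  # (3, 13.0): the 2023 partition is never touched
```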

Compute/storage separation

In traditional databases, adding query capacity meant buying larger machines with more storage attached. Modern cloud warehouses decouple them: data lives in cheap object storage (S3, GCS), and virtual warehouses (Snowflake) or slots (BigQuery) are compute clusters that spin up on demand. You can run multiple warehouses against the same data simultaneously, scale each independently, and pay only per byte scanned or per second of compute used. A 10TB dataset no longer requires permanently running a 10TB machine.

Design Tradeoffs

Where Your Intuition Breaks

The star schema's deliberate denormalization feels wrong to anyone trained in relational database design: storing customer_country in every row of a billion-row fact table would be a textbook normalization violation. But the tradeoff is calculated: updating a country name touches one row in dim_customer, not a billion rows in fact_orders, because the fact table stores foreign keys, not denormalized strings. The denormalization lives in the dimension table itself (flattening customer → address → country into one row), and dimension tables are tiny by comparison. The design achieves the analytical query speed of full denormalization with the update simplicity of one-row-per-entity dimensions; the "star" shape is not sloppy design but a deliberate optimization for a workload where reads vastly outnumber writes.

Star vs snowflake schema

The snowflake schema further normalizes dimension tables — dim_product references dim_category, which references dim_department. This saves storage and avoids update anomalies when category names change. The tradeoff is additional JOINs on every analytical query. Most modern warehouses are fast enough at JOINs that the normalized form is rarely worth it; star schemas are preferred for simplicity and query speed.

Partition key selection

The partition key determines which queries benefit from pruning. Date is almost always the right choice for event data — most analytical queries have a time range. Partitioning by a high-cardinality field like user_id creates too many partitions (millions of tiny files), which the warehouse cannot prune efficiently. Partitioning by country when most queries span all countries provides little benefit. The partition key should match the most common filter in your queries.
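A back-of-envelope calculation shows why cardinality matters. The row count, row size, and user count here are illustrative assumptions, not figures from any specific warehouse:

```python
# Two years of event data: 2B rows at ~100 bytes/row, about 200 GB total.
ROWS = 2_000_000_000
ROW_BYTES = 100

def partition_stats(cardinality):
    """Number of partitions and average bytes per partition for a given key."""
    return cardinality, ROWS * ROW_BYTES // cardinality

# Partitioning by day over two years: ~730 partitions of ~274 MB each,
# large enough to scan efficiently and few enough to prune cheaply.
print(partition_stats(730))         # (730, 273972602)

# Partitioning by a 50M-cardinality user_id: 50M partitions of ~4 KB each,
# millions of tiny files the engine cannot manage or prune efficiently.
print(partition_stats(50_000_000))  # (50000000, 4000)
```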

Materialized views vs on-demand queries

Pre-computing common aggregations as materialized views makes dashboards faster but adds maintenance cost: materialized views must be refreshed on a schedule and drift from the underlying data between refreshes. Most cloud warehouses cache query results automatically for repeated identical queries. Materialized views are worth the complexity only when a query is run thousands of times per day by end users, not for internal analyst queries run a few times per week.

Cost models: per-query vs always-on

BigQuery charges per byte scanned (approximately $5/TB on demand). Snowflake charges per second of virtual warehouse runtime. Neither model is strictly better; it depends on query patterns. A team running hundreds of small queries benefits from BigQuery's per-query model. A team doing large batch transformations at predictable times benefits from Snowflake's model, where a warehouse runs for the duration of the batch and suspends when idle.
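A rough worked comparison of the two models. The $5/TB rate comes from the paragraph above; the Snowflake credit price and credits-per-hour figure are assumptions for illustration only, so check current pricing:

```python
TB = 1024**4

def bigquery_cost(bytes_scanned, price_per_tb=5.0):
    """On-demand pricing: pay per byte scanned, regardless of runtime."""
    return bytes_scanned / TB * price_per_tb

def snowflake_cost(runtime_seconds, credits_per_hour=1.0, price_per_credit=3.0):
    """Warehouse pricing: pay per second of compute, regardless of bytes.
    Assumed rates: 1 credit/hour (X-Small-sized warehouse), $3/credit."""
    return runtime_seconds / 3600 * credits_per_hour * price_per_credit

# 500 small queries scanning 2 GB each: per-byte pricing stays cheap.
print(round(bigquery_cost(500 * 2 * 1024**3), 2))  # 4.88

# A 2-hour nightly batch: per-second compute, idle time not billed.
print(round(snowflake_cost(2 * 3600), 2))          # 6.0
```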

In Practice

dbt project structure

The transformation layer inside a warehouse follows a three-layer convention that dbt codifies:

  • Staging (stg_*): one-to-one models over raw source tables. Light cleaning: cast types, rename columns to snake_case, filter out test records. No business logic.
  • Intermediate (int_*): joins and transformations that don't need to be exposed to end users. Complex deduplication, session reconstruction, event sequencing.
  • Marts (fct_*, dim_*): the final star schema tables consumed by BI tools and analysts. Stable names, documented, tested.

This separation means that when a source system changes its schema, you update the staging model and the rest of the DAG inherits the fix automatically.

Controlling query costs

A few habits prevent warehouse bills from surprising you:

  • Always add partition filters: WHERE date >= ... before running anything on a large fact table. Without a date filter on a multi-year dataset, you scan everything.
  • Use SELECT col1, col2 not SELECT *: columnar warehouses scan only the columns you name. SELECT * reads everything.
  • Preview before running: most warehouses show estimated bytes scanned before execution. Develop the habit of checking before submitting large queries.
  • Set query cost controls: both BigQuery and Snowflake support per-query byte limits and warehouse auto-suspend, which prevents runaway queries from burning budget overnight.

When a warehouse is not enough

Columnar cloud warehouses optimize for throughput, not latency. A dashboard that needs to return in under 200ms for 10,000 simultaneous users cannot use BigQuery as the backend — even with caching, cold query latency is measured in seconds. For latency-sensitive serving, OLAP databases like ClickHouse or Apache Druid are designed for sub-second query response at high concurrency. They sacrifice some flexibility for the performance guarantees that user-facing products need.

Production Patterns

dbt project layout: staging, intermediate, marts

A well-structured dbt project enforces the three-layer convention through directory layout and model naming. The project structure below is the de facto standard:

dbt_project/
├── models/
│   ├── staging/
│   │   ├── stripe/
│   │   │   ├── stg_stripe__charges.sql
│   │   │   └── stg_stripe__subscriptions.sql
│   │   └── postgres/
│   │       ├── stg_postgres__users.sql
│   │       └── stg_postgres__orders.sql
│   ├── intermediate/
│   │   ├── int_orders__joined.sql
│   │   └── int_users__lifetime_metrics.sql
│   └── marts/
│       ├── core/
│       │   ├── fct_orders.sql
│       │   └── dim_users.sql
│       └── finance/
│           └── fct_revenue_daily.sql
├── dbt_project.yml
└── profiles.yml

Staging models are strictly one-to-one with source tables and contain no business logic — only cleaning:

sql
-- models/staging/postgres/stg_postgres__orders.sql
WITH source AS (
    SELECT * FROM {{ source('postgres', 'orders') }}
),
 
renamed AS (
    SELECT
        id                          AS order_id,
        user_id,
        CAST(amount_cents AS FLOAT64) / 100.0   AS amount_usd,
        status,
        TIMESTAMP(created_at)       AS created_at,
        TIMESTAMP(updated_at)       AS updated_at
    FROM source
    WHERE _fivetran_deleted IS FALSE  -- filter soft-deletes from connector
)
 
SELECT * FROM renamed

Marts reference intermediate models, never staging directly, so that business logic is centralized and reusable across multiple marts.

Controlling BigQuery and Snowflake query costs

Both platforms expose cost controls that should be configured from day one, not retrofitted after a surprise bill.

BigQuery — cost is per byte scanned. Use project-level and dataset-level defaults:

sql
-- Set a per-query byte budget (1 TB limit prevents runaway queries)
-- Run this in the BigQuery console or via the API at project setup time
ALTER PROJECT `my-project`
SET OPTIONS (
    default_query_job_config = JSON '{"maximumBytesBilled": "1099511627776"}'
);

Alternatively, enforce the cap per query in application code:

python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    maximum_bytes_billed=100 * 1024**3  # 100 GB hard cap per query
)
query_job = client.query(sql, job_config=job_config)

Snowflake — cost is per second of virtual warehouse runtime. Prevent warehouses from running idle:

sql
-- Apply to every non-production warehouse
ALTER WAREHOUSE reporting_wh SET
    AUTO_SUSPEND = 60         -- suspend after 60s idle
    AUTO_RESUME = TRUE
    STATEMENT_TIMEOUT_IN_SECONDS = 300;  -- kill queries > 5 min
 
-- Use resource monitors to alert before budget is exhausted
CREATE RESOURCE MONITOR monthly_cap
    WITH CREDIT_QUOTA = 500
    TRIGGERS
        ON 75 PERCENT DO NOTIFY
        ON 90 PERCENT DO SUSPEND
        ON 100 PERCENT DO SUSPEND_IMMEDIATE;

Partitioning and clustering in SQL

Declare partitioning and clustering at table creation time in DDL; changing a table's partitioning later requires a full table rewrite (clustering keys can usually be altered in place, with existing data reclustered over time).

BigQuery — date-partitioned and clustered table:

sql
CREATE TABLE `project.dataset.fct_orders`
(
    order_id        STRING NOT NULL,
    user_id         STRING NOT NULL,
    amount_usd      FLOAT64,
    country         STRING,
    created_at      TIMESTAMP NOT NULL
)
PARTITION BY DATE(created_at)   -- partition pruning on date filters
CLUSTER BY country, user_id     -- secondary sort for clustered reads
OPTIONS (
    partition_expiration_days = 730,   -- auto-expire data > 2 years
    require_partition_filter = TRUE    -- force callers to include a date filter
);

Snowflake — clustered table with automatic micro-partition management:

sql
CREATE TABLE fct_orders (
    order_id    VARCHAR      NOT NULL,
    user_id     VARCHAR      NOT NULL,
    amount_usd  NUMBER(12,2),
    country     VARCHAR,
    created_at  TIMESTAMP_NTZ NOT NULL
)
CLUSTER BY (DATE(created_at), country);
 
-- Monitor clustering health over time
SELECT SYSTEM$CLUSTERING_INFORMATION('fct_orders', '(DATE(created_at), country)');

The require_partition_filter option in BigQuery is the most effective guardrail against accidental full-table scans — it makes unfiltered queries fail at parse time rather than execute and bill for terabytes of data.
