Data Systems Landscape
Every production system generates events — clicks, purchases, signups, errors — but raw events are useless until they reach the right kind of storage and processing layer. OLTP databases handle the moment; OLAP systems handle the meaning. Understanding why you need both, and how data flows between them, is the foundation of every data decision you'll make.
How It Works
Consider computing `SELECT SUM(revenue) FROM orders` against a small table:

| id | customer | product | revenue |
|---|---|---|---|
| 1 | Alice | Pro Plan | 120 |
| 2 | Bob | Basic | 45 |
| 3 | Carol | Pro Plan | 120 |
| 4 | Dave | Enterprise | 800 |
| 5 | Eve | Basic | 45 |

Row storage (OLTP) must physically read every cell in every row to compute `SUM(revenue)`; column storage (OLAP) reads only the revenue column.
Every production system generates events, but the database that handles those events one row at a time is the wrong tool for answering "what was total revenue last quarter?" — a question that needs to touch every row but only two columns. OLTP and OLAP systems exist because read patterns and write patterns have fundamentally incompatible storage layouts: you cannot optimize a single database for both without sacrificing one.
The core mismatch
Production databases are optimized for individual transactions. When a user places an order, the database writes one row to orders, decrements inventory, records a payment — all within a single commit that takes under 10ms. The data is stored row by row: all fields for order #1234 sit together on disk, then all fields for order #1235, and so on. This layout makes row-level reads and writes fast because any given transaction needs all the fields for one specific record.
Analytical queries do the opposite. SELECT SUM(revenue) FROM orders WHERE created_at > '2024-01-01' doesn't care about customer names or shipping addresses — it only needs revenue and created_at from every row. In a row-oriented store, fetching those two columns means reading every full row and discarding the rest. On millions of rows, this wastes enormous I/O.
OLAP (Online Analytical Processing) systems flip the storage model. Values for each column are stored contiguously on disk — all revenue values in one block, all created_at values in another. A sum over revenue becomes a single sequential read of one column. This is 10–100× less I/O than scanning the equivalent data row by row, and columnar data also compresses far better (similar values next to each other compress heavily with run-length or dictionary encoding).
Columnar storage takes this layout precisely because analytics is I/O-bound on column selection, not row access. If you store rows together, fetching one column from 10M rows means reading 10M full rows off disk and discarding 90% of the bytes. Storing columns together means the I/O for a column scan is proportional to the number of rows times the width of that one column — the only data you actually need.
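The proportionality argument above can be sketched in a few lines. This is a minimal illustration using fixed-width 8-byte values (an assumption for clarity; real engines use compressed columnar formats like Parquet), measuring how many bytes each layout must read to sum a single column:

```python
# Sketch: bytes read to sum one column under row vs. column layout.
# Fixed-width encoding is an illustrative assumption, not a real engine's format.
import struct

NUM_ROWS, NUM_COLS = 100_000, 10  # 10 columns of 8-byte integers per row

# Row layout: all 10 fields of each row sit together on "disk".
row_store = b"".join(
    struct.pack(f"<{NUM_COLS}q", *range(i, i + NUM_COLS)) for i in range(NUM_ROWS)
)

# Column layout: one column's values are contiguous; this is all we need to read.
revenue_col = b"".join(struct.pack("<q", i + 3) for i in range(NUM_ROWS))

# Summing one column from the row store touches every byte of every row:
bytes_read_row = len(row_store)    # 100_000 rows * 10 cols * 8 bytes
# The column store reads only the one column being aggregated:
bytes_read_col = len(revenue_col)  # 100_000 rows * 8 bytes

print(bytes_read_row // bytes_read_col)  # 10x less I/O with a 10-column table
```

With 100 columns instead of 10, the same arithmetic yields the 10–100× range quoted above.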
The five-layer data stack
Modern data architectures connect source systems to analytical consumers through five distinct layers:
1. Source systems — where data is born. Operational databases (Postgres, MySQL, MongoDB) store current application state. SaaS tools (Stripe, Salesforce, Segment) store business activity. These systems are optimized for the application, not for analytics.
2. Ingestion — pipelines that move data from sources into the analytical layer. This includes full table copies, incremental syncs, Change Data Capture (CDC), and event streaming. The ingestion layer handles reliability, schema changes, and backfill.
3. Storage — the data warehouse or lakehouse. Columnar, append-optimized, designed for large scans. This is where all source data lands and accumulates over time.
4. Transformation — SQL or Spark jobs that shape raw ingested data into clean, modeled tables. dbt (data build tool) is the dominant tool here: it lets analysts write modular SQL transformations with tests and documentation, treating data models like software.
5. Serving — the consumption layer: BI dashboards (Looker, Tableau, Metabase), ML feature stores, reverse ETL that pushes aggregated data back into operational systems (e.g., syncing a customer lifetime value score back into your CRM).
Message queues and object storage
Two other system types appear in most mature stacks. Message queues (Kafka, Kinesis, Pub/Sub) decouple producers from consumers: your application publishes events to a queue and downstream systems consume at their own pace, without slowing down the producer or losing events during consumer downtime. They're the backbone of real-time pipelines.
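The decoupling can be sketched with Python's standard-library `queue` standing in for the broker. A real broker like Kafka adds durability, partitioning, and replay, which this in-memory version lacks; the point is only that the producer never waits on the consumer:

```python
# Sketch: a queue decouples producer pace from consumer pace.
# queue.Queue stands in for Kafka/Kinesis; no durability is implied.
import queue
import threading

events = queue.Queue()
processed = []

def producer():
    for i in range(100):
        events.put({"event_id": i})  # returns immediately; never blocks on the consumer

def consumer():
    while True:
        evt = events.get()
        if evt is None:          # sentinel: shut down
            break
        processed.append(evt)    # drains at its own pace

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join()                         # producer finishes regardless of consumer speed
events.put(None)
c.join()
print(len(processed))            # 100 — nothing lost while the consumer lagged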
Object storage (S3, GCS, Azure Blob) provides cheap, durable, infinitely scalable storage for raw files — CSV dumps, Parquet files, JSON logs, model artifacts. Most data lakes and lakehouses use object storage as the underlying storage layer, with a table format (Delta Lake, Apache Iceberg) layered on top to provide ACID semantics and schema enforcement.
Design Tradeoffs
Where Your Intuition Breaks
The natural assumption is that a "faster database" would make the OLTP/OLAP split unnecessary — if hardware is fast enough, row storage can handle analytics too. But the split is not about raw speed; it is about I/O amplification. On a 100-column table, a query that needs 2 columns must read 50× more data than necessary with row storage, and that ratio does not improve with faster hardware. It improves only with a different storage layout. The split is not a workaround for slow disks; it is the correct response to the mathematical structure of analytical vs. transactional access patterns.

Similarly, cloud warehouses that separate compute and storage are sometimes framed as "infinitely scalable" — but the real advantage is that idle capacity costs nothing. A warehouse holding terabytes costs almost nothing until a query runs; a traditional coupled system charges for always-on compute even between queries. The economics, not the technology, are what changed.
OLTP vs OLAP at a glance
| | OLTP | OLAP |
|---|---|---|
| Storage format | Row-oriented | Columnar |
| Schema style | Normalized (3NF) | Denormalized (star/snowflake) |
| Optimized for | Point reads and writes | Aggregations, full scans |
| Concurrency | Thousands of short transactions | Dozens of long-running queries |
| Typical latency | < 10ms | Seconds to minutes |
| Data age | Current application state | Historical records |
| Examples | Postgres, MySQL, MongoDB | BigQuery, Snowflake, Redshift |
Freshness vs cost
Real-time pipelines (Kafka → Flink → warehouse) are significantly more expensive to build and operate than hourly batch loads. Most business analytics — revenue trends, cohort analysis, funnel metrics — can tolerate data that's 1–4 hours old. Match the freshness requirement to the actual business need, not the theoretical ideal. The "we need real-time" instinct is often driven by how cool it sounds, not what decisions it enables.
Schema-on-write vs schema-on-read
Warehouses enforce schema at write time: columns are defined before data loads, and malformed rows are rejected. Data lakes store raw files and apply schema at query time. Schema-on-write catches errors early and enables faster queries with better compression; schema-on-read is more flexible but means bad data sits silently in storage until someone queries it. Most modern lakehouses (Delta Lake, Iceberg) offer a middle ground: schema evolution with enforcement.
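The two timing models can be contrasted in a short sketch. The schema and helper names here are illustrative, not any particular tool's API; the point is where the validation cost lands:

```python
# Sketch: schema-on-write rejects bad rows at load time;
# schema-on-read accepts everything and surfaces errors only at query time.
SCHEMA = {"order_id": int, "amount": float}  # illustrative schema

def write_with_schema(store, record):
    """Warehouse-style: malformed rows are rejected before they land."""
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"bad {field!r} in {record}")
    store.append(record)

def read_with_schema(raw_store):
    """Lake-style: raw records accepted as-is; bad data found only when queried."""
    good, bad = [], []
    for record in raw_store:
        if all(isinstance(record.get(f), t) for f, t in SCHEMA.items()):
            good.append(record)
        else:
            bad.append(record)  # sat silently in storage until this query
    return good, bad

warehouse, lake = [], []
write_with_schema(warehouse, {"order_id": 1, "amount": 9.5})      # accepted
lake.extend([{"order_id": 1, "amount": 9.5}, {"order_id": "x"}])  # both land
good, bad = read_with_schema(lake)
print(len(good), len(bad))  # 1 1 — the malformed row surfaces only at read time
```

A lakehouse table format sits between these: writes are validated against an evolving schema rather than rejected outright or accepted blindly.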
Compute/storage separation
Traditional databases coupled compute and storage on the same machines — scaling one meant scaling both. Modern cloud warehouses (BigQuery, Snowflake, Redshift Serverless) separate them: data lives in cheap object storage, and compute clusters spin up on demand per query. You pay only for what you run, and you can scale read capacity independently of storage. This changes the economics of analytics significantly — a large historical dataset no longer requires permanently running large machines.
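A back-of-envelope comparison makes the economic shift concrete. All prices below are placeholder assumptions for illustration, not any vendor's actual rates:

```python
# Sketch: monthly cost of separated compute/storage vs. an always-on coupled system.
# All dollar figures are illustrative assumptions, not real vendor pricing.
STORAGE_PER_TB_MONTH = 20.0      # object storage: cheap, always on
SCAN_PER_TB = 5.0                # compute billed per TB scanned, on demand
ALWAYS_ON_SERVER_MONTH = 2000.0  # coupled system sized for peak load

tb_stored = 10
tb_scanned_per_month = 40        # e.g. a few dashboards refreshing daily

separated = tb_stored * STORAGE_PER_TB_MONTH + tb_scanned_per_month * SCAN_PER_TB
coupled = ALWAYS_ON_SERVER_MONTH  # charged even when no query is running

print(separated, coupled)  # 400.0 2000.0 under these assumed rates
```

The gap widens as query load gets burstier: the coupled system's cost is fixed at peak capacity, while the separated system's compute cost tracks actual scans.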
In Practice
When to move analytics off your operational database
Signs you've outgrown running analytics on Postgres or MySQL:
- Aggregations over more than ~500K rows are noticeably slow
- Dashboard queries regularly time out during peak application traffic
- You've added a read replica specifically to offload analytics queries
- Your team starts asking not to run certain queries during business hours
- You need more than 90 days of historical data but retention is hurting DB performance
A good rule of thumb: if you're doing GROUP BY on full table scans more than a few times a day, it's time for a warehouse.
Typical data flow
For a subscription SaaS product, the data flow might look like:
- Source: `subscriptions`, `events`, `users` tables in Postgres (OLTP)
- Ingestion: Fivetran or Airbyte syncs new/updated rows to Snowflake every hour
- Transformation: dbt models transform raw tables into `fct_subscriptions`, `dim_users`, `fct_events` — cleaned, typed, and joined
- Serving: a BI tool queries Snowflake directly; an ML pipeline reads from a feature store that's populated from the same modeled tables
The total lag from a user action to it appearing in a dashboard is typically 1–2 hours in this pattern. For most business decisions, that's fine.
Common pitfalls
Running analytics directly on production: works until it doesn't. A slow dashboard query during a traffic spike degrades your application for real users. Separate the systems.
Over-building too early: a company at 50K users probably doesn't need Kafka, Spark, and a lakehouse. A read replica for reporting and a weekly Postgres dump to BigQuery may be entirely sufficient. The cost of operating a full modern data stack is real — in tooling, in engineering time, and in cognitive overhead.
Ignoring query costs: columnar cloud warehouses charge per data scanned. Running SELECT * on a multi-terabyte table is expensive. Partitioning, clustering, and selecting only needed columns matter from day one.
Treating the warehouse as an application database: the warehouse is eventually consistent, optimized for batch reads, and not designed for point lookups at application latency. Don't route your app's user-facing queries there.
Production Patterns
Choosing the right system for the workload
The decision tree is simple in practice. Start from the query shape, not the data volume.
```python
# Pseudocode decision logic — encode this as a team checklist, not runtime code
def choose_system(query):
    if query.is_point_lookup and query.latency_sla_ms < 50:
        return "OLTP (Postgres / MySQL)"
    if query.involves_full_table_scan or query.has_aggregation:
        return "OLAP warehouse (BigQuery / Snowflake)"
    if query.needs_sub_100ms and query.is_high_concurrency:
        return "OLAP DB (ClickHouse / Druid)"
    if query.is_streaming_join:
        return "Stream processor (Flink / Spark Structured Streaming)"
```

In a production data stack, the ingestion hop from OLTP to OLAP is typically handled by a managed connector. Fivetran and Airbyte both support incremental syncs that track a high-watermark cursor column (`updated_at`) and page through new rows without touching existing data:
```python
# Conceptual incremental sync using cursor-based pagination
import datetime

import psycopg2
import psycopg2.extras

def sync_incremental(conn, last_synced_at: datetime.datetime):
    # RealDictCursor returns rows as dicts, so row["updated_at"] works below
    # (the default cursor returns plain tuples)
    cursor = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    cursor.execute(
        """
        SELECT id, user_id, amount, created_at, updated_at
        FROM orders
        WHERE updated_at > %s
        ORDER BY updated_at ASC
        LIMIT 10000
        """,
        (last_synced_at,),
    )
    rows = cursor.fetchall()
    # write rows to warehouse staging table ...
    if rows:
        return rows[-1]["updated_at"]  # new watermark
    return last_synced_at
```

Use Change Data Capture (Debezium → Kafka) instead of cursor-based sync when you need sub-minute freshness or when rows are hard-deleted rather than soft-deleted.
Query anti-patterns on the wrong system type
Running analytical queries on an OLTP database under load causes cascading failures. The most dangerous pattern is a GROUP BY with no index support:
```sql
-- Bad: runs on production Postgres, acquires shared locks, causes timeouts
SELECT
  DATE_TRUNC('day', created_at) AS day,
  COUNT(*) AS orders,
  SUM(amount) AS revenue
FROM orders
-- no WHERE clause — full table scan on 50M rows
GROUP BY 1
ORDER BY 1 DESC;
```

The same query on BigQuery with proper partitioning costs pennies and returns in seconds:
```sql
-- Good: runs on BigQuery, partition pruning eliminates 95% of data
SELECT
  DATE(created_at) AS day,
  COUNT(*) AS orders,
  SUM(amount) AS revenue
FROM `project.dataset.fct_orders`
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY 1
ORDER BY 1 DESC;
```

Signs you are running the wrong query on the wrong system: query time > 1s on Postgres for analytical work; `EXPLAIN` shows a `Seq Scan` on a table over 1M rows; autovacuum falling behind because analytical queries hold long-running transactions.
Controlling the cost of mixing OLTP and OLAP reads
When an application backend reads from both systems in the same request — for example, fetching live account state from Postgres and then enriching it with a lifetime-value score from the warehouse — latency and cost risks compound:
```python
import asyncio

import asyncpg
from google.cloud import bigquery

async def get_user_profile(user_id: str):
    # Parallel fetch: OLTP for live state, OLAP for aggregated metrics
    pg_task = fetch_account_state(user_id)  # < 10ms expected
    bq_task = fetch_ltv_score(user_id)      # 200ms–2s; cache this
    account, ltv = await asyncio.gather(pg_task, bq_task)
    return {**account, "lifetime_value": ltv}

async def fetch_ltv_score(user_id: str):
    # Never run this live on BigQuery per request — pre-compute and cache.
    # Run a nightly job that writes results to a Redis or Postgres lookup table.
    raise NotImplementedError("Read from Redis cache, not BigQuery directly")
```

The production rule: never query the warehouse synchronously in a user-facing request path. Pre-compute aggregated scores on a schedule (hourly or daily), write them back to a fast lookup store (Redis, Postgres, Firestore), and serve from there. This is reverse ETL — pushing derived analytics back into operational systems — and it prevents warehouse latency from bleeding into application latency.