Data Systems Landscape
Every production system generates events — clicks, purchases, signups, errors — but raw events are useless until they reach the right kind of storage and processing layer. OLTP databases handle the moment; OLAP systems handle the meaning. Understanding why you need both, and how data flows between them, is the foundation of every data decision you'll make.
How It Works
Consider computing `SELECT SUM(revenue) FROM orders` against a small table:

| id | customer | product | revenue |
|---|---|---|---|
| 1 | Alice | Pro Plan | 120 |
| 2 | Bob | Basic | 45 |
| 3 | Carol | Pro Plan | 120 |
| 4 | Dave | Enterprise | 800 |
| 5 | Eve | Basic | 45 |

Row storage (OLTP) must physically read every cell in every row to compute `SUM(revenue)`; column storage (OLAP) reads only the revenue column.
Every production system generates events, but the database that handles those events one row at a time is the wrong tool for answering "what was total revenue last quarter?" — a question that needs to touch every row but only two columns. OLTP and OLAP systems exist because read patterns and write patterns have fundamentally incompatible storage layouts: you cannot optimize a single database for both without sacrificing one.
The core mismatch
Production databases are optimized for individual transactions. When a user places an order, the database writes one row to orders, decrements inventory, records a payment — all within a single commit that takes under 10ms. The data is stored row by row: all fields for order #1234 sit together on disk, then all fields for order #1235, and so on. This layout makes row-level reads and writes fast because any given transaction needs all the fields for one specific record.
Analytical queries do the opposite. SELECT SUM(revenue) FROM orders WHERE created_at > '2024-01-01' doesn't care about customer names or shipping addresses — it only needs revenue and created_at from every row. In a row-oriented store, fetching those two columns means reading every full row and discarding the rest. On millions of rows, this wastes enormous I/O.
OLAP (Online Analytical Processing) systems flip the storage model. Values for each column are stored contiguously on disk — all revenue values in one block, all created_at values in another. A sum over revenue becomes a single sequential read of one column. This is 10–100× less I/O than scanning the equivalent data row by row, and columnar data also compresses far better (similar values next to each other compress heavily with run-length or dictionary encoding).
Columnar storage takes this layout precisely because analytics is I/O-bound on column selection, not row access. If you store rows together, fetching one column from 10M rows means reading 10M full rows off disk and discarding 90% of the bytes. Storing columns together means the I/O for a column scan is proportional to the number of rows times the width of that one column — the only data you actually need.
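The proportionality argument above can be sketched in a few lines. This is a minimal illustration using fixed-width 8-byte values (an assumption for clarity; real engines use compressed columnar formats like Parquet), measuring how many bytes each layout must read to sum a single column:

```python
# Sketch: bytes read to sum one column under row vs. column layout.
# Fixed-width encoding is an illustrative assumption, not a real engine's format.
import struct

NUM_ROWS, NUM_COLS = 100_000, 10  # 10 columns of 8-byte integers per row

# Row layout: all 10 fields of each row sit together on "disk".
row_store = b"".join(
    struct.pack(f"<{NUM_COLS}q", *range(i, i + NUM_COLS)) for i in range(NUM_ROWS)
)

# Column layout: one column's values are contiguous; this is all we need to read.
revenue_col = b"".join(struct.pack("<q", i + 3) for i in range(NUM_ROWS))

# Summing one column from the row store touches every byte of every row:
bytes_read_row = len(row_store)    # 100_000 rows * 10 cols * 8 bytes
# The column store reads only the one column being aggregated:
bytes_read_col = len(revenue_col)  # 100_000 rows * 8 bytes

print(bytes_read_row // bytes_read_col)  # 10x less I/O with a 10-column table
```

With 100 columns instead of 10, the same arithmetic yields the 10–100× range quoted above.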
The five-layer data stack
Modern data architectures connect source systems to analytical consumers through five distinct layers:
1. Source systems — where data is born. Operational databases (Postgres, MySQL, MongoDB) store current application state. SaaS tools (Stripe, Salesforce, Segment) store business activity. These systems are optimized for the application, not for analytics.
2. Ingestion — pipelines that move data from sources into the analytical layer. This includes full table copies, incremental syncs, Change Data Capture (CDC), and event streaming. The ingestion layer handles reliability, schema changes, and backfill.
3. Storage — the data warehouse or lakehouse. Columnar, append-optimized, designed for large scans. This is where all source data lands and accumulates over time.
4. Transformation — SQL or Spark jobs that shape raw ingested data into clean, modeled tables. dbt (data build tool) is the dominant tool here: it lets analysts write modular SQL transformations with tests and documentation, treating data models like software.
5. Serving — the consumption layer: BI dashboards (Looker, Tableau, Metabase), ML feature stores, reverse ETL that pushes aggregated data back into operational systems (e.g., syncing a customer lifetime value score back into your CRM).
Message queues and object storage
Two other system types appear in most mature stacks. Message queues (Kafka, Kinesis, Pub/Sub) decouple producers from consumers: your application publishes events to a queue and downstream systems consume at their own pace, without slowing down the producer or losing events during consumer downtime. They're the backbone of real-time pipelines.
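The decoupling can be sketched with Python's standard-library `queue` standing in for the broker. A real broker like Kafka adds durability, partitioning, and replay, which this in-memory version lacks; the point is only that the producer never waits on the consumer:

```python
# Sketch: a queue decouples producer pace from consumer pace.
# queue.Queue stands in for Kafka/Kinesis; no durability is implied.
import queue
import threading

events = queue.Queue()
processed = []

def producer():
    for i in range(100):
        events.put({"event_id": i})  # returns immediately; never blocks on the consumer

def consumer():
    while True:
        evt = events.get()
        if evt is None:          # sentinel: shut down
            break
        processed.append(evt)    # drains at its own pace

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join()                         # producer finishes regardless of consumer speed
events.put(None)
c.join()
print(len(processed))            # 100 — nothing lost while the consumer lagged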
Object storage (S3, GCS, Azure Blob) provides cheap, durable, infinitely scalable storage for raw files — CSV dumps, Parquet files, JSON logs, model artifacts. Most data lakes and lakehouses use object storage as the underlying storage layer, with a table format (Delta Lake, Apache Iceberg) layered on top to provide ACID semantics and schema enforcement.
Design Tradeoffs
Where Your Intuition Breaks
The natural assumption is that a "faster database" would make the OLTP/OLAP split unnecessary — if hardware is fast enough, row storage can handle analytics too. But the split is not about raw speed; it is about I/O amplification. On a 100-column table, a query that needs 2 columns must read 50× more data than necessary with row storage, and that ratio does not improve with faster hardware. It improves only with a different storage layout. The split is not a workaround for slow disks; it is the correct response to the mathematical structure of analytical vs. transactional access patterns.

Similarly, cloud warehouses that separate compute and storage are sometimes framed as "infinitely scalable" — but the real advantage is that idle capacity costs nothing. A warehouse holding terabytes costs almost nothing until a query runs; a traditional coupled system charges for always-on compute even between queries. The economics, not the technology, are what changed.
OLTP vs OLAP at a glance
| | OLTP | OLAP |
|---|---|---|
| Storage format | Row-oriented | Columnar |
| Schema style | Normalized (3NF) | Denormalized (star/snowflake) |
| Optimized for | Point reads and writes | Aggregations, full scans |
| Concurrency | Thousands of short transactions | Dozens of long-running queries |
| Typical latency | < 10ms | Seconds to minutes |
| Data age | Current application state | Historical records |
| Examples | Postgres, MySQL, MongoDB | BigQuery, Snowflake, Redshift |
Freshness vs cost
Real-time pipelines (Kafka → Flink → warehouse) are significantly more expensive to build and operate than hourly batch loads. Most business analytics — revenue trends, cohort analysis, funnel metrics — can tolerate data that's 1–4 hours old. Match the freshness requirement to the actual business need, not the theoretical ideal. The "we need real-time" instinct is often driven by how cool it sounds, not what decisions it enables.
Schema-on-write vs schema-on-read
Warehouses enforce schema at write time: columns are defined before data loads, and malformed rows are rejected. Data lakes store raw files and apply schema at query time. Schema-on-write catches errors early and enables faster queries with better compression; schema-on-read is more flexible but means bad data sits silently in storage until someone queries it. Most modern lakehouses (Delta Lake, Iceberg) offer a middle ground: schema evolution with enforcement.
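The two timing models can be contrasted in a short sketch. The schema and helper names here are illustrative, not any particular tool's API; the point is where the validation cost lands:

```python
# Sketch: schema-on-write rejects bad rows at load time;
# schema-on-read accepts everything and surfaces errors only at query time.
SCHEMA = {"order_id": int, "amount": float}  # illustrative schema

def write_with_schema(store, record):
    """Warehouse-style: malformed rows are rejected before they land."""
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"bad {field!r} in {record}")
    store.append(record)

def read_with_schema(raw_store):
    """Lake-style: raw records accepted as-is; bad data found only when queried."""
    good, bad = [], []
    for record in raw_store:
        if all(isinstance(record.get(f), t) for f, t in SCHEMA.items()):
            good.append(record)
        else:
            bad.append(record)  # sat silently in storage until this query
    return good, bad

warehouse, lake = [], []
write_with_schema(warehouse, {"order_id": 1, "amount": 9.5})      # accepted
lake.extend([{"order_id": 1, "amount": 9.5}, {"order_id": "x"}])  # both land
good, bad = read_with_schema(lake)
print(len(good), len(bad))  # 1 1 — the malformed row surfaces only at read time
```

A lakehouse table format sits between these: writes are validated against an evolving schema rather than rejected outright or accepted blindly.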
Compute/storage separation
Traditional databases coupled compute and storage on the same machines — scaling one meant scaling both. Modern cloud warehouses (BigQuery, Snowflake, Redshift Serverless) separate them: data lives in cheap object storage, and compute clusters spin up on demand per query. You pay only for what you run, and you can scale read capacity independently of storage. This changes the economics of analytics significantly — a large historical dataset no longer requires permanently running large machines.
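A back-of-envelope comparison makes the economic shift concrete. All prices below are placeholder assumptions for illustration, not any vendor's actual rates:

```python
# Sketch: monthly cost of separated compute/storage vs. an always-on coupled system.
# All dollar figures are illustrative assumptions, not real vendor pricing.
STORAGE_PER_TB_MONTH = 20.0      # object storage: cheap, always on
SCAN_PER_TB = 5.0                # compute billed per TB scanned, on demand
ALWAYS_ON_SERVER_MONTH = 2000.0  # coupled system sized for peak load

tb_stored = 10
tb_scanned_per_month = 40        # e.g. a few dashboards refreshing daily

separated = tb_stored * STORAGE_PER_TB_MONTH + tb_scanned_per_month * SCAN_PER_TB
coupled = ALWAYS_ON_SERVER_MONTH  # charged even when no query is running

print(separated, coupled)  # 400.0 2000.0 under these assumed rates
```

The gap widens as query load gets burstier: the coupled system's cost is fixed at peak capacity, while the separated system's compute cost tracks actual scans.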
In Practice
When to move analytics off your operational database
Signs you've outgrown running analytics on Postgres or MySQL:
- Aggregations over more than ~500K rows are noticeably slow
- Dashboard queries regularly time out during peak application traffic
- You've added a read replica specifically to offload analytics queries
- Your team starts asking not to run certain queries during business hours
- You need more than 90 days of historical data but retention is hurting DB performance
A good rule of thumb: if you're doing GROUP BY on full table scans more than a few times a day, it's time for a warehouse.
Typical data flow
For a subscription SaaS product, the data flow might look like:
- Source: `subscriptions`, `events`, `users` tables in Postgres (OLTP)
- Ingestion: Fivetran or Airbyte syncs new/updated rows to Snowflake every hour
- Transformation: dbt models transform raw tables into `fct_subscriptions`, `dim_users`, `fct_events` — cleaned, typed, and joined
- Serving: a BI tool queries Snowflake directly; an ML pipeline reads from a feature store that's populated from the same modeled tables
The total lag from a user action to it appearing in a dashboard is typically 1–2 hours in this pattern. For most business decisions, that's fine.
Common pitfalls
Running analytics directly on production: works until it doesn't. A slow dashboard query during a traffic spike degrades your application for real users. Separate the systems.
Over-building too early: a company at 50K users probably doesn't need Kafka, Spark, and a lakehouse. A read replica for reporting and a weekly Postgres dump to BigQuery may be entirely sufficient. The cost of operating a full modern data stack is real — in tooling, in engineering time, and in cognitive overhead.
Ignoring query costs: columnar cloud warehouses charge per data scanned. Running SELECT * on a multi-terabyte table is expensive. Partitioning, clustering, and selecting only needed columns matter from day one.
Treating the warehouse as an application database: the warehouse is eventually consistent, optimized for batch reads, and not designed for point lookups at application latency. Don't route your app's user-facing queries there.
Production Patterns
Choosing the right system for the workload
The decision tree is simple in practice. Start from the query shape, not the data volume.
```python
# Pseudocode decision logic — encode this as a team checklist, not runtime code
def choose_system(query):
    if query.is_point_lookup and query.latency_sla_ms < 50:
        return "OLTP (Postgres / MySQL)"
    if query.involves_full_table_scan or query.has_aggregation:
        return "OLAP warehouse (BigQuery / Snowflake)"
    if query.needs_sub_100ms and query.is_high_concurrency:
        return "OLAP DB (ClickHouse / Druid)"
    if query.is_streaming_join:
        return "Stream processor (Flink / Spark Structured Streaming)"
```

In a production data stack, the ingestion hop from OLTP to OLAP is typically handled by a managed connector. Fivetran and Airbyte both support incremental syncs that track a high-watermark cursor column (`updated_at`) and page through new rows without touching existing data:
```python
# Conceptual incremental sync using cursor-based pagination
import datetime

import psycopg2
import psycopg2.extras

def sync_incremental(conn, last_synced_at: datetime.datetime):
    # RealDictCursor returns rows as dicts, so row["updated_at"] works below
    # (the default cursor returns plain tuples)
    cursor = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    cursor.execute(
        """
        SELECT id, user_id, amount, created_at, updated_at
        FROM orders
        WHERE updated_at > %s
        ORDER BY updated_at ASC
        LIMIT 10000
        """,
        (last_synced_at,),
    )
    rows = cursor.fetchall()
    # write rows to warehouse staging table ...
    if rows:
        return rows[-1]["updated_at"]  # new watermark
    return last_synced_at
```

Use Change Data Capture (Debezium → Kafka) instead of cursor-based sync when you need sub-minute freshness or when rows are hard-deleted rather than soft-deleted.
Query anti-patterns on the wrong system type
Running analytical queries on an OLTP database under load causes cascading failures. The most dangerous pattern is a GROUP BY with no index support:
```sql
-- Bad: runs on production Postgres, acquires shared locks, causes timeouts
SELECT
  DATE_TRUNC('day', created_at) AS day,
  COUNT(*) AS orders,
  SUM(amount) AS revenue
FROM orders
-- no WHERE clause — full table scan on 50M rows
GROUP BY 1
ORDER BY 1 DESC;
```

The same query on BigQuery with proper partitioning costs pennies and returns in seconds:
```sql
-- Good: runs on BigQuery, partition pruning eliminates 95% of data
SELECT
  DATE(created_at) AS day,
  COUNT(*) AS orders,
  SUM(amount) AS revenue
FROM `project.dataset.fct_orders`
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY 1
ORDER BY 1 DESC;
```

Signs you are running the wrong query on the wrong system: query time > 1s on Postgres for analytical work; `EXPLAIN` shows a `Seq Scan` on a table over 1M rows; autovacuum falling behind because analytical queries hold long-running transactions.
Controlling the cost of mixing OLTP and OLAP reads
When an application backend reads from both systems in the same request — for example, fetching live account state from Postgres and then enriching it with a lifetime-value score from the warehouse — latency and cost risks compound:
```python
import asyncio

import asyncpg
from google.cloud import bigquery

async def get_user_profile(user_id: str):
    # Parallel fetch: OLTP for live state, OLAP for aggregated metrics
    pg_task = fetch_account_state(user_id)  # < 10ms expected
    bq_task = fetch_ltv_score(user_id)      # 200ms–2s; cache this
    account, ltv = await asyncio.gather(pg_task, bq_task)
    return {**account, "lifetime_value": ltv}

async def fetch_ltv_score(user_id: str):
    # Never run this live on BigQuery per request — pre-compute and cache.
    # Run a nightly job that writes results to a Redis or Postgres lookup table.
    raise NotImplementedError("Read from Redis cache, not BigQuery directly")
```

The production rule: never query the warehouse synchronously in a user-facing request path. Pre-compute aggregated scores on a schedule (hourly or daily), write them back to a fast lookup store (Redis, Postgres, Firestore), and serve from there. This is reverse ETL — pushing derived analytics back into operational systems — and it prevents warehouse latency from bleeding into application latency.