Neural-Path/Notes

Data Lakes & Lakehouses

Object storage is cheap, scalable, and format-agnostic — which is exactly why the original data lake failed. Without schema enforcement and ACID transactions, lakes became swamps. The lakehouse pattern adds warehouse-quality table semantics (Delta Lake, Apache Iceberg) over object storage without giving up the flexibility and cost profile that made lakes attractive.

How It Works

Delta Lake — transaction log + time travel

[Interactive figure: a _delta_log/ directory of commits over Parquet data files in S3. Each commit marks data files as added or removed; only files the log lists as active at the selected version are visible in the snapshot. Files marked as removed are not deleted from storage immediately, they are simply excluded from the current snapshot, and every earlier commit remains fully queryable — that's time travel.]

The original data lake's promise — cheap, flexible, format-agnostic storage for everything — collided with a fundamental problem: without ACID transactions, multiple writers corrupt each other's data, there is no way to recover from a bad write, and readers see partial state during a write. The lakehouse pattern fixes this by adding a transaction log over the same cheap object storage, giving you warehouse-quality consistency without warehouse-level cost or vendor lock-in.

Why the original data lake failed

The premise was compelling: dump everything into S3 in raw form, apply schema later when you need it. No rigid schema means no schema migration problems. No ETL pipeline means no latency. Any tool can read any file directly.

In practice, raw data accumulates without structure. Data scientists discover that the user_id field uses three different formats depending on which team wrote the data. Two pipelines write to the same prefix concurrently and corrupt each other's files. Someone runs DELETE FROM events WHERE test=true and half the event data disappears with no recovery path. There is no history of what the data looked like yesterday. The lake becomes a swamp: technically accessible but practically untrusted.

What a table format adds

A table format (Delta Lake, Apache Iceberg, Apache Hudi) sits between raw files in object storage and query engines. It provides ACID semantics without a separate server — everything is metadata files stored in the same object storage bucket as the data files.

The core mechanism is a transaction log: a directory of JSON or Parquet files recording every operation that modified the table. Each commit appends a new log entry listing which data files were added and which were removed. The "current state" of the table is derived by replaying the log: start with an empty set, apply each add/remove operation in order, and what remains is the set of active files.

Atomicity: a commit either adds all its files to the log or adds none — there is no partial commit. Concurrent writers use optimistic concurrency: each writer reads the current log version, performs its write, then attempts to commit. If another writer committed in between, the commit fails and the writer retries.

The transaction log is stored as immutable, numbered files in the same object storage bucket as the data — not in a separate coordination server — because the whole point is to avoid any infrastructure beyond S3. An object-store PUT is all-or-nothing, so a log entry is never visible half-written; writers racing to create the same numbered entry are arbitrated by a put-if-absent primitive (or, on stores that historically lacked one, an external commit coordinator), and the "latest version" of the table is simply the highest-numbered log file present. This design means any query engine (Spark, Trino, DuckDB) that can read object storage can implement the Delta/Iceberg protocol without a running Delta server, which is what makes the formats genuinely engine-neutral.
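The put-if-absent commit race can be sketched with the local filesystem standing in for the object store (a toy model; the file naming, payload, and retry policy are illustrative, not the Delta protocol):

```python
import os

def try_commit(log_dir, version, entry):
    """Attempt to publish commit `version` as a new log file.
    O_CREAT | O_EXCL fails if the file already exists (the local-filesystem
    analogue of an object store's put-if-absent), so exactly one of any set
    of racing writers wins a given version number."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # lost the race: re-read the log, rebase, and retry
    with os.fdopen(fd, "w") as f:
        f.write(entry)
    return True
```

A writer that gets back False re-reads the log to pick up the winning commit, checks its write for conflicts, and retries at the next version number.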

Isolation: readers always see a consistent snapshot. A reader that starts before a write completes never sees partial results — it continues reading from the snapshot version it started with.

Time travel: because old log entries are never deleted, you can query the table as it existed at any past commit or timestamp. This enables debugging (what was in this table yesterday before the pipeline ran?), auditing, and reproducible ML training datasets.
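The replay and time-travel mechanics fit in a few lines of plain Python (a toy model; real Delta/Iceberg log entries also carry schema, file statistics, and protocol versions):

```python
# Toy transaction log: each commit lists files added and files removed.
log = [
    {"version": 0, "add": ["part-00001.parquet", "part-00002.parquet"], "remove": []},
    {"version": 1, "add": ["part-00003.parquet"], "remove": ["part-00002.parquet"]},
    {"version": 2, "add": ["part-00004.parquet", "part-00005.parquet"], "remove": []},
]

def snapshot(log, as_of_version=None):
    """Replay add/remove entries in order; what remains is the active file set."""
    active = set()
    for commit in log:
        if as_of_version is not None and commit["version"] > as_of_version:
            break  # time travel: stop replaying at the requested version
        active |= set(commit["add"])
        active -= set(commit["remove"])
    return active

print(sorted(snapshot(log)))                   # current state (version 2)
print(sorted(snapshot(log, as_of_version=0)))  # the table as it looked at version 0
```

Stopping the replay early is all time travel is: the old log entries still exist, so any past snapshot can be reconstructed.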

File formats: Parquet, ORC, Avro

The actual data files in a lakehouse are typically Parquet — the same columnar file format used by cloud warehouses internally. Parquet files embed their own schema in the file footer, support predicate pushdown (a query engine can read the row group statistics and skip entire file sections without decompressing data), and compress well with Snappy or Zstandard.
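Predicate pushdown reduces to an interval-overlap test against footer statistics. A sketch with made-up file names and a single timestamp column (real Parquet footers carry per-column min/max, null counts, and more):

```python
# Each "row group" carries min/max stats for a column, as a Parquet footer does.
row_groups = [
    {"file": "part-00001.parquet", "min_ts": "2024-03-01", "max_ts": "2024-03-07"},
    {"file": "part-00002.parquet", "min_ts": "2024-03-08", "max_ts": "2024-03-14"},
    {"file": "part-00003.parquet", "min_ts": "2024-03-15", "max_ts": "2024-03-21"},
]

def prune(row_groups, lo, hi):
    """Keep only row groups whose [min, max] range can overlap the query range;
    everything else is skipped without decompressing any data."""
    return [rg["file"] for rg in row_groups
            if not (rg["max_ts"] < lo or rg["min_ts"] > hi)]

# A query for 2024-03-10..2024-03-12 touches only the middle row group.
print(prune(row_groups, "2024-03-10", "2024-03-12"))
```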

ORC (Optimized Row Columnar) is similar to Parquet and dominant in Hive/Hadoop ecosystems. Avro is row-oriented and better suited for write-heavy streaming pipelines where you need fast appends, not columnar scans.

Schema evolution — adding, renaming, or dropping columns — is tracked in the transaction log and handled at read time by the query engine. New columns appear as null in older files. Dropped columns are ignored when present in older files.
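Read-time schema evolution is essentially a projection of each file's rows onto the current schema. A toy sketch (column names and rows are illustrative):

```python
# Current table schema; "country" was added in a later commit.
schema = ["event_id", "event_type", "country"]

old_file_rows = [{"event_id": "e1", "event_type": "click"}]                   # pre-evolution file
new_file_rows = [{"event_id": "e2", "event_type": "view", "country": "DE"}]   # post-evolution file

def read_with_schema(rows, schema):
    """Project every row onto the current schema: columns missing from an
    older file read as None; columns dropped from the schema are ignored."""
    return [{col: row.get(col) for col in schema} for row in rows]

print(read_with_schema(old_file_rows + new_file_rows, schema))
```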

Design Tradeoffs

Where Your Intuition Breaks

The "data lake vs data warehouse" framing suggests a binary choice. The lakehouse reveals the actual variable: whether ACID semantics are provided by the storage layer (object storage + table format) or by a proprietary engine (managed warehouse). The cost difference is not about the data itself — Parquet files cost the same bytes whether in S3 directly or backing Snowflake's internal storage. The difference is whether you pay for a managed layer on top and whether that layer locks data into a proprietary format. A common mistake is choosing a managed warehouse for simplicity at small scale, then discovering that exporting petabytes of data for ML training is slow and expensive because the data is in a format only the warehouse can read efficiently. The lakehouse trades some operational simplicity for format openness — the right tradeoff depends on how much of your workload is ML versus SQL analytics.

Lakehouse vs managed warehouse

|                      | Data Lakehouse                   | Managed Warehouse                 |
|----------------------|----------------------------------|-----------------------------------|
| Storage cost         | Very cheap (S3/GCS rates)        | Higher (vendor markup on storage) |
| Compute              | Separate (Spark, Trino, DuckDB)  | Bundled (virtual warehouses)      |
| Raw file access      | Yes — any engine reads Parquet   | No — proprietary internal format  |
| ML training data     | Excellent — Spark can read directly | Awkward — must export first    |
| Operational overhead | Higher — you manage compute      | Lower — fully managed             |
| SQL ecosystem        | Good (Trino, Spark SQL)          | Excellent (native SQL, GUI)       |
| Time travel          | Yes (Delta/Iceberg)              | Limited or per-vendor             |

The lakehouse shines when you need both analytics and ML training data from the same storage, or when storage costs matter at scale (petabyte-scale data in a managed warehouse can be 5–10× more expensive than equivalent object storage). The managed warehouse wins when you want a BI-focused team to be self-sufficient without operating Spark clusters.

Delta Lake vs Apache Iceberg vs Apache Hudi

All three solve the same problem with slightly different designs. Delta Lake (originated at Databricks) has the deepest Spark integration and the largest community. Apache Iceberg has stronger multi-engine support (Spark, Trino, Flink, Presto all treat it as a first-class format) and a more flexible partition evolution model — you can change the partition scheme without rewriting data. Apache Hudi (originated at Uber) emphasizes low-latency upserts, making it well-suited for CDC pipelines that need to merge change streams into the lake efficiently.

For new projects without a specific constraint, Iceberg's engine neutrality gives the most flexibility.

The small file problem

Each Spark or Flink job that writes to a lakehouse creates multiple Parquet files — typically one per partition per task. A streaming pipeline writing every minute creates 1,440 files per day per partition. Thousands of small files degrade query performance (the engine spends more time opening files than reading data) and inflate object storage LIST request costs.

Compaction solves this: a periodic job reads and rewrites multiple small files into fewer larger files, updating the transaction log to reflect the new layout. Delta Lake's OPTIMIZE command and Iceberg's rewrite_data_files procedure automate this. Running compaction daily on streaming tables is standard practice.
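The core of a compaction planner is a bin-packing pass over file sizes. A sketch of the idea (real OPTIMIZE implementations also respect partition boundaries and clustering; sizes here are illustrative):

```python
def plan_compaction(file_sizes_mb, target_mb=256):
    """Greedily group small files into bins of roughly target_mb each.
    Each bin becomes one rewritten output file in the new transaction-log commit."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# 1,440 one-minute micro-batches of ~2MB each collapse into 12 output files.
plan = plan_compaction([2] * 1440, target_mb=256)
print(len(plan))
```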

In Practice

When to choose a lakehouse over a warehouse

Choose the lakehouse path when:

  • You need a single store for both analytics and ML training — ML engineers can read Parquet directly with Spark or PyArrow, no export step needed
  • Storage costs dominate — petabyte-scale data is meaningfully cheaper in object storage than in managed warehouse storage
  • You already operate Spark or Trino for other workloads
  • You need event-level raw data retained indefinitely for replay or reprocessing

Stick with the managed warehouse when:

  • Your team is primarily SQL analysts — the operational overhead of Spark isn't worth it
  • Sub-second BI dashboard latency matters — managed warehouses have better caching and optimization for dashboard workloads
  • Data volume is under ~10TB — the cost advantage of object storage is minimal at smaller scale

Partition strategy for lakehouses

Partition by the most common query filter — almost always date — using a directory hierarchy like year=2024/month=03/day=15/. Avoid over-partitioning: if your queries never filter by user_id, partitioning by it creates millions of tiny directories with no pruning benefit. A partition should contain at least 128MB–1GB of data; if partitions are smaller, consolidate.
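Deriving the Hive-style partition path from an event timestamp is mechanical (bucket name is illustrative):

```python
import datetime

def partition_path(base, ts):
    """Hive-style date partition directory for an event timestamp."""
    return f"{base}/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"

print(partition_path("s3://my-lakehouse/events", datetime.datetime(2024, 3, 15, 9, 30)))
```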

For Iceberg, hidden partitioning lets you partition by days(event_time) without exposing the partition column to queries — the engine handles pruning transparently, and you can change the granularity without breaking existing queries.

Compaction scheduling

A simple compaction strategy: run nightly after the last streaming job completes. Target file sizes of 256MB–512MB Parquet files per partition. Keep the last 7 days of uncompacted data accessible (for streaming watermark compatibility) and compact everything older. Retain transaction log history for at least 30 days to support time travel debugging.

Production Patterns

Setting up a Delta Lake table with OPTIMIZE and VACUUM

Creating a Delta Lake table with correct partitioning and then running maintenance jobs on a schedule is the minimum viable lakehouse operation:

python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable
 
spark = SparkSession.builder \
    .appName("lakehouse-setup") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
 
# Create a partitioned Delta table on S3
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id    STRING NOT NULL,
        user_id     STRING,
        event_type  STRING,
        properties  MAP<STRING, STRING>,
        event_time  TIMESTAMP NOT NULL,
        event_date  DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
    )
    USING DELTA
    PARTITIONED BY (event_date)
    LOCATION 's3://my-lakehouse/events/'
    TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true',
        'delta.logRetentionDuration'       = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")

Run compaction (OPTIMIZE) and cleanup (VACUUM) as a nightly Spark job after all streaming writes have completed for the day:

python
from delta.tables import DeltaTable
import datetime
 
dt = DeltaTable.forPath(spark, "s3://my-lakehouse/events/")
 
# Compact small files into ~256MB Parquet files within each partition.
# Z-ordering co-locates rows with the same user_id on disk — queries that
# filter on user_id can then skip most file blocks via predicate pushdown.
# The .where() predicate must reference the partition column (event_date).
dt.optimize() \
  .where(f"event_date < '{datetime.date.today()}'") \
  .executeZOrderBy("user_id", "event_type")
 
# Remove files no longer referenced by any snapshot older than 7 days.
# The retention period must be >= delta.deletedFileRetentionDuration.
dt.vacuum(retentionHours=168)  # 7 days

autoOptimize.optimizeWrite coalesces task output before writing, reducing the number of small files created by streaming jobs without requiring a separate compaction step. Enable it on all streaming Delta tables.

Handling schema evolution

Delta Lake supports additive schema evolution automatically and non-additive changes (column drops, type changes) via explicit migration:

python
from pyspark.sql.functions import col
 
# Adding a new column — safe, enabled with mergeSchema
new_events_df = spark.read.json("s3://raw/events/2024-03-15/")
 
new_events_df.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("s3://my-lakehouse/events/")
 
# Renaming a column (here, fixing a typo'd name) requires an explicit
# ALTER TABLE and column mapping ('delta.columnMapping.mode' = 'name')
spark.sql("""
    ALTER TABLE events
    RENAME COLUMN properies TO properties
""")
 
# Changing a column type cannot be done in place: rewrite the table,
# overriding schema enforcement with overwriteSchema
spark.read.table("events") \
    .withColumn("user_id", col("user_id").cast("bigint")) \
    .write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("events")

For pipelines reading Delta tables, always use spark.read.format("delta").load(path) rather than reading Parquet files directly — the Delta reader applies the current schema and handles column additions/removals in older files transparently.

Z-ordering for query acceleration

Z-ordering rewrites files within a partition so that rows with similar values for the Z-order columns are stored in the same file blocks. Predicate pushdown then skips entire files whose block statistics show no matching values. The effect is most pronounced on high-cardinality columns that are frequently used in WHERE filters but are not the partition key:

python
# Z-order on user_id within each date partition.
# Before: a query for user_id = 'u123' scans all files in the partition.
# After: Parquet row-group statistics show min/max user_id per file block;
#        the engine can skip most blocks for point-user queries.
 
dt.optimize() \
  .where("event_date >= '2024-01-01'") \
  .executeZOrderBy("user_id")
 
# Verify the improvement with Delta history
spark.sql("DESCRIBE HISTORY delta.`s3://my-lakehouse/events/`") \
     .select("version", "timestamp", "operation", "operationMetrics") \
     .show(truncate=False)

Z-ordering is CPU-intensive and rewrites data in place — run it on a schedule proportional to how often the data changes, not on every write. For append-only event tables, nightly Z-ordering on the previous day's partition is sufficient. For tables with frequent updates or deletes, Z-order weekly and rely on autoCompact to handle daily maintenance.
