
Dataset Versioning & Reproducibility

"The model was better last month" is an unfalsifiable statement without dataset versioning. Reproducing an ML result requires knowing the exact dataset the model was trained on, the preprocessing applied, and the code version — not approximations. Treating datasets as immutable, content-addressed artifacts (like code commits) makes every training run reproducible and every regression debuggable.

How It Works

Dataset version graph (example): v2.0 (hash c1a5f7), 138,000 rows, created Mar 15. Change from the previous version: −7k rows, filtered low-quality labels (confidence < 0.7). Linked model runs: classifier-v3 (acc 0.87) and classifier-v3-large (acc 0.89).

Every dataset version is immutable and content-addressed: once created, the rows in v2.0 never change, and adding, removing, or filtering rows creates a new version. Rolling back a model means re-training on the exact dataset hash it was originally trained on.

Whether "the model was better last month" means anything depends on whether the dataset changed between then and now. Dataset versioning is the ML equivalent of source control: without it, you cannot reproduce a training run, diagnose a regression, or compare two models trained on different data. Content-addressed, immutable versions give every training run a permanent, verifiable link to the exact data it used.

Dataset versions as immutable snapshots

Each version stores:

  • Content hash: a deterministic fingerprint of the data (SHA-256 or MD5 of the file contents), so two versions with the same hash are byte-identical
  • Provenance metadata: what changed from the previous version and why
  • Linked model runs: which experiments were trained on this exact version

This mirrors how Git treats commits: immutable, content-addressed, with parent pointers.
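
To make the content-addressing concrete, here is a minimal Python sketch (toy in-memory data; a real implementation would hash file bytes in chunks, as the verification example later in this page does):

```python
import hashlib

def content_hash(data: bytes) -> str:
    # Deterministic fingerprint: same bytes -> same hash, always.
    return hashlib.sha256(data).hexdigest()

a = content_hash(b"id,label\n1,cat\n2,dog\n")
b = content_hash(b"id,label\n1,cat\n2,dog\n")  # regenerated copy, identical bytes
c = content_hash(b"id,label\n1,cat\n")         # one row removed -> new version

print(a == b)  # True: byte-identical data maps to the same version
print(a == c)  # False: any row change produces a new content address
```

Filenames and timestamps play no role: only the bytes determine the version identity.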

DVC: dataset version control

DVC (Data Version Control) extends Git to track large files. Datasets are stored in object storage (S3, GCS) and version-tracked via small .dvc pointer files committed to Git:

bash
# Add a dataset to version control
dvc add data/training_v2.parquet
git add data/training_v2.parquet.dvc .gitignore
git commit -m "dataset: add v2.0 with quality filtering"
 
# Push dataset to remote storage
dvc push
 
# Later: reproduce exactly this dataset version
git checkout main~1
dvc pull

The .dvc file contains just the content hash and storage path. dvc pull fetches the exact bytes corresponding to that hash from remote storage.
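
As a sketch, a .dvc pointer file contains little more than that hash, a size, and a path. The field values below are invented (the hash is reused from the example above for continuity), but the shape follows DVC's pointer-file format:

```yaml
# data/training_v2.parquet.dvc -- values illustrative
outs:
- md5: c1a5f7...          # content hash of the dataset (truncated here)
  size: 1849302112        # bytes, used for quick change detection
  path: training_v2.parquet
```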

DVC uses small pointer files committed to Git, rather than storing large files in Git itself, because Git keeps full file contents in its object store: committing a 100GB dataset would force every clone to download 100GB. Pointer files separate the versioning (cheap, git-native) from the storage (object storage, fetched on demand). Hashing content rather than timestamps ensures that byte-identical datasets map to the same storage object, avoiding redundant copies when a dataset is regenerated from the same source.

What makes a training run reproducible

Full reproducibility requires pinning:

  • Dataset: pin the content hash (DVC .dvc file)
  • Code: pin the git commit SHA (git log --oneline -1)
  • Dependencies: pin exact package versions (requirements.txt with == pins, or a lock file)
  • Hyperparameters: pin all config values (config file committed to git)
  • Random seeds: pin NumPy and PyTorch seeds (torch.manual_seed(42), logged)
  • Hardware: record the GPU type, which affects numerical precision (logged in run metadata)

Missing any one of these can prevent exact reproduction. For practical purposes, most teams aim for statistical reproducibility (same model quality) rather than bit-exact reproducibility (identical weights).
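
As a sketch, the checklist can be captured programmatically at the start of each run. The function name and dict keys below are my own; in practice the manifest would be logged to the experiment tracker:

```python
import hashlib
import subprocess
from pathlib import Path

def pin_manifest(dataset_path: str, config_path: str, seed: int) -> dict:
    """Collect the pins needed to reproduce a run (illustrative sketch)."""
    def sha256(p: str) -> str:
        return hashlib.sha256(Path(p).read_bytes()).hexdigest()

    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode().strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # not inside a git repo

    return {
        "dataset_sha256": sha256(dataset_path),
        "config_sha256": sha256(config_path),
        "git_commit": commit,
        "random_seed": seed,
    }
```

Hardware and dependency pins would be added the same way (e.g. from run metadata and the lock file); the point is that every pin is collected once, up front, and stored with the run.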

Design Tradeoffs

Where Your Intuition Breaks

Reproducibility feels like a research concern — something for academic papers, not production ML. In practice, most production ML incidents are regressions: a model that was working degrades without explanation. Diagnosing regressions requires being able to re-train the previous version exactly (same code, same data, same hyperparameters) and compare it to the current version in a controlled way. Without dataset versioning, this is impossible — you cannot know whether the regression came from a code change, a data change, or a distribution shift. The teams that treat reproducibility as a first-class concern are the ones that can diagnose and fix regressions in hours rather than days of archaeological investigation.

Versioning granularity

File-level versioning (DVC default): the entire dataset file is a single version. Simple, but a one-row change creates a new copy of the whole file if stored naively.

Row-level versioning (Delta Lake, Iceberg): each row has metadata (added in version X, deleted in version Y). Time-travel queries let you reconstruct the dataset as of any point in time. More complex but efficient for large datasets with frequent small changes.

Snapshot-based versioning: at regular intervals, snapshot the full dataset. Snapshots are immutable. Between snapshots, changes accumulate. Practical for daily or weekly cadences.
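
Row-level versioning is easy to model in miniature. In this toy sketch (not Delta Lake's or Iceberg's actual metadata format), each row records the version that added it and, if applicable, the version that deleted it; a time-travel query is just a filter:

```python
# Toy row-level version metadata: "added"/"deleted" are version numbers.
rows = [
    {"id": 1, "added": 1, "deleted": None},  # present since v1
    {"id": 2, "added": 1, "deleted": 2},     # removed in v2
    {"id": 3, "added": 2, "deleted": None},  # added in v2
]

def as_of(rows: list, version: int) -> list:
    """Time-travel query: row ids visible at the given version."""
    return [r["id"] for r in rows
            if r["added"] <= version
            and (r["deleted"] is None or r["deleted"] > version)]

print(as_of(rows, 1))  # [1, 2]
print(as_of(rows, 2))  # [1, 3]
```

No row is ever overwritten: "deleting" a row only stamps it with a deletion version, which is what makes every historical state reconstructible.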

Storage costs for immutable datasets

Immutability means you never overwrite — you only add. Storage grows with each version. Mitigation strategies:

  • Content deduplication: identical files across versions share storage. DVC's content-addressed cache deduplicates at the file level, and some storage systems additionally deduplicate at the block level.
  • Delta storage: only store rows that changed between versions. Apache Iceberg's table format does this natively.
  • Retention policies: archive versions older than 90 days that have no active model runs linked to them.

Reproducing experiments without full dataset storage

If full dataset storage is prohibitive, store the generation script instead of the dataset. A deterministic script that produces the dataset from a fixed raw source (with a hash check on the raw source) is functionally equivalent. This works as long as the raw source is also versioned and immutable.
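
A minimal sketch of the pattern, with an invented placeholder transform: the important property is that regeneration fails loudly if the raw source has drifted from its pinned hash:

```python
import hashlib
from pathlib import Path

def regenerate(raw_path: str, pinned_raw_sha256: str, out_path: str) -> None:
    """Rebuild a derived dataset from an immutable raw source (sketch)."""
    raw = Path(raw_path).read_bytes()
    actual = hashlib.sha256(raw).hexdigest()
    if actual != pinned_raw_sha256:
        raise ValueError(
            f"raw source changed: {actual} != {pinned_raw_sha256}")
    # Deterministic transform (placeholder): keep non-empty lines, sorted,
    # so the same raw bytes always yield byte-identical output.
    lines = sorted(l for l in raw.decode().splitlines() if l.strip())
    Path(out_path).write_text("\n".join(lines) + "\n")
```

The script itself is versioned in git, so commit SHA plus pinned raw-source hash together identify the derived dataset without storing it.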

In Practice

Linking datasets to model runs in MLflow

When training a model, log the dataset version as a run parameter:

python
import mlflow
import subprocess
 
# Get current dataset hash
dataset_hash = subprocess.run(
    ["sha256sum", "data/training.parquet"],
    capture_output=True, text=True
).stdout.split()[0]
 
with mlflow.start_run():
    mlflow.log_param("dataset_version", "v2.0")
    mlflow.log_param("dataset_hash", dataset_hash[:12])
    mlflow.log_param("dataset_rows", 138000)
 
    # ... training ...
 
    mlflow.log_metric("val_accuracy", val_acc)

Later, if clf-v2.0 is outperforming clf-v3.0, you can look up exactly which dataset it was trained on and retrain v3.0 on the same data to isolate whether the regression is a dataset issue or a code issue.

Validating dataset integrity before training

Before training, verify the dataset hasn't been modified:

python
import hashlib
 
def verify_dataset(path: str, expected_hash: str) -> None:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    actual = h.hexdigest()
    if actual != expected_hash:
        raise ValueError(f"Dataset integrity check failed: {actual} != {expected_hash}")
 
verify_dataset("data/training.parquet", "a1b2c3d4...")

This catches silent corruption from partial uploads or storage errors before it can affect training.

Production Patterns

DVC pipeline stages with params.yaml

DVC pipelines make preprocessing reproducible and cacheable. Define stages in dvc.yaml and pull hyperparameters from a versioned params.yaml — both files are committed to git:

yaml
# params.yaml
preprocess:
  min_label_count: 50
  test_split: 0.15
  random_seed: 42
  max_sequence_length: 512
 
train:
  learning_rate: 3e-4
  batch_size: 256
  max_epochs: 10
yaml
# dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/events.parquet
    params:
      - preprocess.min_label_count
      - preprocess.test_split
      - preprocess.random_seed
    outs:
      - data/processed/train.parquet
      - data/processed/val.parquet
 
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/train.parquet
      - data/processed/val.parquet
    params:
      - train.learning_rate
      - train.batch_size
      - train.max_epochs
    outs:
      - models/clf.pkl
    metrics:
      - metrics/eval.json:
          cache: false
python
# src/preprocess.py
import dvc.api
import pandas as pd
from sklearn.model_selection import train_test_split
 
params = dvc.api.params_show()["preprocess"]
 
df = pd.read_parquet("data/raw/events.parquet")
df = df[df["label"].map(df["label"].value_counts()) >= params["min_label_count"]]
 
train, val = train_test_split(
    df,
    test_size=params["test_split"],
    random_state=params["random_seed"],
    stratify=df["label"],
)
train.to_parquet("data/processed/train.parquet", index=False)
val.to_parquet("data/processed/val.parquet",   index=False)

Run the full pipeline: dvc repro. DVC skips stages whose inputs haven't changed — only re-runs what's stale.

Tagging a dataset version and linking to a model artifact

When a training run completes, tag the dataset version in git and record the link in MLflow so every model has a traceable data lineage:

python
import subprocess
import hashlib
import mlflow
import pandas as pd

def compute_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def tag_and_log_dataset(
    dataset_path: str,
    version_tag: str,
    run_id: str,
) -> None:
    sha = compute_sha256(dataset_path)
    row_count = len(pd.read_parquet(dataset_path))
 
    # Tag the current git commit with this dataset version
    subprocess.run(["git", "tag", "-a", version_tag, "-m",
                    f"dataset={version_tag} sha256={sha[:12]}"], check=True)
 
    # Link dataset metadata to the MLflow run
    with mlflow.start_run(run_id=run_id):
        mlflow.log_param("dataset_path",    dataset_path)
        mlflow.log_param("dataset_version", version_tag)
        mlflow.log_param("dataset_sha256",  sha[:16])
        mlflow.log_param("dataset_rows",    row_count)
        mlflow.log_param("git_commit",
            subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip())
 
# After training completes:
# tag_and_log_dataset("data/processed/train.parquet", "dataset-v2.1", mlflow_run_id)
# subprocess.run(["git", "push", "origin", "dataset-v2.1"])

This creates a two-way link: from the model run you can find the exact dataset, and from the git tag you can find which runs used that data.

Reproducing an experiment from a git commit and DVC lock file

dvc.lock is auto-generated by dvc repro and records the exact input hashes and output hashes for every stage. Commit it alongside dvc.yaml:

bash
# dvc.lock (auto-generated — commit this file)
# Shows exact content hashes for every dep and out in each stage
 
# To reproduce experiment from 3 weeks ago:
git checkout a3f8c21                    # the commit tied to that experiment
dvc pull                                # fetch the exact data artifacts for that commit
dvc repro --no-commit                   # re-run pipeline with pinned params and data
python
# Programmatic reproduction: verify outputs match expected hashes
import yaml, hashlib
 
def verify_dvc_lock(lockfile: str = "dvc.lock") -> None:
    with open(lockfile) as f:
        lock = yaml.safe_load(f)
 
    for stage_name, stage in lock["stages"].items():
        for out in stage.get("outs", []):
            path = out["path"]
            expected_md5 = out["md5"]
 
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            actual_md5 = h.hexdigest()
 
            if actual_md5 != expected_md5:
                raise RuntimeError(
                    f"Stage '{stage_name}' output '{path}' "
                    f"hash mismatch: expected {expected_md5}, got {actual_md5}"
                )
            print(f"  {stage_name}/{path}: OK")
 
verify_dvc_lock()

Run this check as a pre-training gate in CI. If any stage output doesn't match the lock file, the pipeline is not reproducible from that commit and the job should fail before wasting GPU hours.
