Neural-Path/Notes

ML Metadata & Lineage

When a deployed model's accuracy drops, the investigation starts with questions: which dataset was it trained on? What hyperparameters? Which preprocessing code? Without ML metadata — the structured record of every experiment run, artifact, and deployment — these questions take hours to answer instead of minutes. ML lineage extends this to the artifact graph: tracing a production endpoint back to the exact datasets, code, and experiments that produced it.

How It Works

ML artifact lineage: each artifact traces back to its inputs

Every model and serving endpoint links back to the exact datasets and hyperparameters used to produce it.

Lineage answers "what data and code produced this model?" — essential for audits, debugging regressions, and reproducing results.

In the example graph, the serving endpoint /predict v2.1 was produced by clf-v2.1, which was trained in run/exp-47b, which consumed orders_v2.1 and users_v3.0. From the serving endpoint, the full ancestry is one query away, with no ticket to the owning team required.

ML development produces a graph of artifacts — datasets, code versions, model weights, evaluation results, deployed endpoints — connected by experiment runs that consumed some artifacts and produced others. Without explicit logging, this graph exists only in someone's memory. When that person leaves, or when a model misbehaves six months after training, the graph is gone. Metadata logging makes it permanent and queryable.

The ML artifact lineage graph

This graph structure answers four questions that arise constantly in production ML:

  1. "Which dataset was this model trained on?" (audit, reproducibility)
  2. "Which models used dataset X?" (impact analysis before changing X)
  3. "Why did the model regress?" (compare lineage against the previous passing model)
  4. "Is this model approved for production?" (compliance, review chain)
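To make the graph structure concrete, here is a minimal sketch in plain Python of the lineage graph as adjacency lists, with traversals that answer questions 1 and 2. The node names mirror the example above; real tracking servers store this graph in a database and expose it through search APIs, but the queries reduce to these traversals.

```python
# Hypothetical lineage graph: each node maps to its direct inputs.
EDGES = {
    "endpoint:/predict v2.1": ["model:clf-v2.1"],
    "model:clf-v2.1": ["run:exp-47b"],
    "run:exp-47b": ["dataset:orders_v2.1", "dataset:users_v3.0"],
}

def ancestors(node: str) -> set[str]:
    """All upstream artifacts of a node (question 1: what produced this?)."""
    seen: set[str] = set()
    stack = list(EDGES.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(EDGES.get(n, []))
    return seen

def consumers(artifact: str) -> list[str]:
    """All nodes that directly or transitively used an artifact
    (question 2: impact analysis before changing it)."""
    return [n for n in EDGES if artifact in ancestors(n)]
```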

What gets logged per experiment run

A complete experiment record captures:

Inputs: dataset version + hash, code commit SHA, hyperparameters (learning rate, batch size, architecture config), random seeds

Outputs: model artifact path + hash, evaluation metrics (val_accuracy, AUC, F1), training curves (loss per epoch), system metrics (GPU hours, peak memory)

Context: who ran it, when, on which cluster, job ID for log retrieval

python
import subprocess
import mlflow
 
with mlflow.start_run(run_name="exp-47b") as run:
    # Log inputs
    mlflow.log_param("dataset_version", "orders_v2.1+users_v3.0")
    mlflow.log_param("learning_rate", 3e-4)
    mlflow.log_param("epochs", 15)
    mlflow.set_tag("git_commit", subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True
    ).stdout.strip())
 
    # ... training loop ...
 
    # Log outputs
    mlflow.log_metric("val_accuracy", 0.87)
    mlflow.log_metric("val_auc", 0.93)
    mlflow.log_artifact("model.pkl")    # model artifact stored and linked
 
    print(f"Run ID: {run.info.run_id}")

MLflow Model Registry

The MLflow Model Registry adds lifecycle management on top of experiment tracking: models move through Staging → Production → Archived stages via explicit, logged transitions that teams can gate on approval.

python
from mlflow.tracking import MlflowClient
 
client = MlflowClient()
 
# Register a model from a run
model_uri = f"runs:/{run_id}/model"
model_version = mlflow.register_model(model_uri, "clf_orders")
 
# Promote to production after validation
client.transition_model_version_stage(
    name="clf_orders",
    version=model_version.version,
    stage="Production",
)

The registry maintains who promoted each version and when — an audit trail for regulated industries.

The staged promotion workflow (Staging → Production → Archived) had to be separate from experiment tracking because training and deploying are different operations with different risk profiles. Experiment tracking records what happened; the model registry controls what is allowed to happen next. Staging requires passing validation; Production requires explicit approval; Archived means the model can no longer be deployed. This separation ensures that a model cannot reach production without passing through a documented review process, which is what regulatory compliance requires — not just a log of what was trained, but a controlled chain of custody for what was deployed.
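The gate can be enforced in code rather than by convention. Here is a sketch, assuming an approved_by tag set by the reviewer; the tag name and the gate itself are conventions of these notes, not built-in MLflow features.

```python
def approval_gate(tags: dict) -> None:
    """Refuse promotion unless the producing run carries an explicit approval."""
    if not tags.get("approved_by"):
        raise PermissionError(
            "run has no 'approved_by' tag; refusing Production transition"
        )

def promote_to_production(name: str, version: str) -> None:
    # Imported lazily so the gate logic itself is testable without a server.
    from mlflow.tracking import MlflowClient
    client = MlflowClient()
    mv = client.get_model_version(name=name, version=version)
    approval_gate(client.get_run(mv.run_id).data.tags)
    client.transition_model_version_stage(
        name=name, version=version, stage="Production",
        archive_existing_versions=True,
    )
```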

Design Tradeoffs

Where Your Intuition Breaks

ML metadata logging is often treated as a nice-to-have that gets added "when things are stable." The first time a production model regresses and the team cannot tell whether it was a data change, a code change, or a hyperparameter change — because runs were not logged consistently — the priority shifts. The problem is that retroactively adding metadata logging to a team that has been running undisciplined experiments for months requires first establishing what the baseline was, which is impossible without historical records. The correct time to add logging is at the beginning of a project, when the overhead is low and the discipline can be established as a norm. Treating every training run as potentially the one that will be deployed — and logging it accordingly — is the practice that makes production ML debuggable.

Experiment tracking tools

| Tool | Strengths | Weaknesses |
| --- | --- | --- |
| MLflow | Open-source, self-hostable, broad ecosystem | UI is functional but dated |
| Weights & Biases | Best-in-class UI, rich visualizations | Proprietary, per-seat cost |
| Neptune | Good for team collaboration | Less adoption |
| Custom (Postgres + S3) | Full control | Maintenance burden |

MLflow is the most common choice for teams that want open source and self-hosting. W&B dominates among teams doing research-style iterative experimentation where visualization quality matters.

Logging granularity vs overhead

More metadata is better for debugging, but excessive logging adds overhead:

  • Logging every gradient norm at every step: high value, adds ~5% training overhead
  • Logging full prediction distributions on validation: high value, large storage cost
  • Logging config files: near-zero cost, very high value (always do this)
  • Logging random number generator state at every step: possible but rarely worth it

Reasonable defaults: log metrics every N steps, log artifacts (model checkpoints) every M epochs, log full config once at run start.
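Those defaults translate into a few lines of cadence logic in the training loop. A sketch, where N, M, and the metric names are placeholders and the mlflow calls match the logging API used earlier in these notes:

```python
LOG_EVERY_N_STEPS = 50        # N: metric logging cadence
CHECKPOINT_EVERY_M_EPOCHS = 5 # M: artifact logging cadence

def should_log_metrics(step: int) -> bool:
    return step % LOG_EVERY_N_STEPS == 0

def should_checkpoint(epoch: int) -> bool:
    return epoch > 0 and epoch % CHECKPOINT_EVERY_M_EPOCHS == 0

def train(config: dict, num_epochs: int, steps_per_epoch: int) -> None:
    import mlflow  # lazy import: the predicates above stay testable
    with mlflow.start_run():
        mlflow.log_params(config)  # full config once, at run start
        step = 0
        for epoch in range(num_epochs):
            for _ in range(steps_per_epoch):
                loss = 0.0  # placeholder for the real optimizer step
                if should_log_metrics(step):
                    mlflow.log_metric("train_loss", loss, step=step)
                step += 1
            if should_checkpoint(epoch):
                mlflow.log_artifact(f"checkpoint_epoch_{epoch}.pt")
```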

Implicit vs explicit lineage

Implicit lineage (inferred): track which datasets exist and which runs happened, infer relationships from timestamps and naming conventions. Fragile — breaks when naming conventions change.

Explicit lineage (declared): code explicitly logs "this run used dataset X version Y." Requires discipline but is reliable and queryable.

Explicit lineage with a lightweight logging library (MLflow, Neptune) is the right default. Reserve implicit lineage inference for cases where the code can't be modified.
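A lightweight way to make the declaration explicit and verifiable is to pin each dataset by content hash as well as by name. A sketch, where the tag naming scheme is this note's convention; with MLflow you would pass the result to mlflow.set_tags:

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(path: str) -> str:
    """Content hash, so 'dataset X version Y' is verifiable, not just a name."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]

def declare_dataset(name: str, version: str, path: str) -> dict:
    """Explicit lineage declaration: tags to attach to the current run."""
    return {
        f"dataset.{name}.version": version,
        f"dataset.{name}.sha256": dataset_fingerprint(path),
    }
```

The hash catches the failure mode naming conventions cannot: a file labeled v2.1 whose contents silently changed.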

In Practice

Reproducing a past experiment

Given a model version in production and an accuracy regression, reproduce the previous passing experiment:

bash
# List the experiment runs that produced candidate models
# (the CLI lists runs per experiment; filter by metric via the UI
#  or mlflow.search_runs in Python)
mlflow runs list --experiment-id 3

# Get all parameters for a specific run
mlflow runs describe --run-id <run_id>

# Parameters recorded for that run include:
# dataset_version: orders_v2.0+users_v3.0
# learning_rate: 3e-4
# epochs: 15
# git_commit: b7d9e4
 
# Check out that commit
git checkout b7d9e4
 
# Pull that dataset version
dvc checkout  # restores the dataset at that git state
 
# Re-run training
python train.py --lr 3e-4 --epochs 15

The combination of MLflow + DVC + Git makes this a 5-minute operation instead of a 2-day investigation.

Model cards and documentation

Each registered model version should have a model card documenting:

  • Intended use and out-of-scope uses
  • Training data description and known biases
  • Evaluation results across demographic groups (if applicable)
  • Limitations and failure modes

Model cards are increasingly required by regulation (EU AI Act) and by platform policies (publishing to Hugging Face Hub). Write them at registration time, not retroactively.
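A card can be generated and attached at registration time so it is never skipped. A sketch, where the template and field names are this note's convention, not a standard:

```python
CARD_TEMPLATE = """\
# Model Card: {name} v{version}

## Intended use
{intended_use}

## Training data
{training_data}

## Limitations and failure modes
{limitations}
"""

def render_model_card(name: str, version: str, *, intended_use: str,
                      training_data: str, limitations: str) -> str:
    """Render a minimal model card as markdown."""
    return CARD_TEMPLATE.format(
        name=name, version=version, intended_use=intended_use,
        training_data=training_data, limitations=limitations,
    )
```

Write the result to MODEL_CARD.md and log it with mlflow.log_artifact("MODEL_CARD.md") inside the training run, so the card rides along with the model version.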

Production Patterns

MLflow autolog with custom tags

mlflow.autolog() captures framework-level metrics automatically (loss curves, optimizer config, validation scores for scikit-learn, PyTorch, XGBoost, etc.), but production runs need additional tags that autolog cannot infer: dataset version, triggering CI job, business context.

python
import subprocess
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
 
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("orders_fraud_detection")
 
# Enable framework-level autolog — captures estimator params,
# training metrics, and the model artifact automatically.
mlflow.sklearn.autolog(log_input_examples=True, log_model_signatures=True)
 
DATASET_VERSION = "orders_v2.1+users_v3.0"
GIT_SHA = subprocess.run(
    ["git", "rev-parse", "--short", "HEAD"],
    capture_output=True, text=True,
).stdout.strip()
 
with mlflow.start_run(run_name="gbm_fraud_v47") as run:
    # Custom tags that autolog cannot infer
    mlflow.set_tags({
        "dataset_version":  DATASET_VERSION,
        "git_commit":       GIT_SHA,
        "triggered_by":     "ci/github-actions",
        "business_unit":    "payments",
        "data_cutoff_date": "2024-03-01",
    })
 
    model = GradientBoostingClassifier(n_estimators=400, learning_rate=3e-2)
    model.fit(X_train, y_train)  # autolog records val metrics automatically
 
    print(f"Run ID: {run.info.run_id}")
    print(f"Artifact URI: {run.info.artifact_uri}")

Tag every run with dataset_version — it is the single most important tag for audit and impact analysis. Without it, you cannot answer "which models were trained on the dataset we just found a bug in?"

Registering a model version and transitioning stages

Model promotion should be explicit and logged. The registry transition creates an audit trail (who promoted, when) and gates serving-layer rollout.

python
import mlflow
from mlflow.tracking import MlflowClient
 
client = MlflowClient(tracking_uri="http://mlflow.internal:5000")
MODEL_NAME = "fraud_detector"
 
# Register the model artifact from an existing run (run_id, DATASET_VERSION,
# and GIT_SHA carry over from the training run above)
model_uri = f"runs:/{run_id}/model"
mv = mlflow.register_model(model_uri, MODEL_NAME)
print(f"Registered version {mv.version} — state: {mv.current_stage}")
 
# Add a human-readable description to this version
client.update_model_version(
    name=MODEL_NAME,
    version=mv.version,
    description=(
        f"GBM trained on {DATASET_VERSION}, git={GIT_SHA}. "
        "Validation AUC 0.934. Approved by: rzeng 2024-03-15."
    ),
)
 
# Transition to Staging for integration testing
client.transition_model_version_stage(
    name=MODEL_NAME, version=mv.version, stage="Staging",
    archive_existing_versions=False,  # keep previous staging version during canary
)
 
# After integration tests pass, promote to Production
client.transition_model_version_stage(
    name=MODEL_NAME, version=mv.version, stage="Production",
    archive_existing_versions=True,   # archive the old production version
)
 
# Load directly from registry in serving layer
prod_model = mlflow.sklearn.load_model(f"models:/{MODEL_NAME}/Production")

Set archive_existing_versions=True on the Production transition to prevent multiple versions from simultaneously being in Production — a common source of silent serving inconsistencies.
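That invariant is worth asserting in CI as well. A sketch of the check, written over plain (version, stage) pairs so the logic is testable without a tracking server; the pairs would come from MlflowClient.search_model_versions:

```python
def production_versions(versions: list[tuple[str, str]]) -> list[str]:
    """versions: (version, current_stage) pairs from the registry."""
    return [v for v, stage in versions if stage == "Production"]

def assert_single_production(name: str, versions: list[tuple[str, str]]) -> None:
    """Raise unless exactly one version of the model is in Production."""
    prod = production_versions(versions)
    if len(prod) != 1:
        raise RuntimeError(
            f"{name}: expected exactly one Production version, "
            f"found {len(prod)}: {prod}"
        )

# With MLflow, build the pairs as:
# versions = [(mv.version, mv.current_stage)
#             for mv in client.search_model_versions(f"name='{MODEL_NAME}'")]
```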

Querying the tracking server: find all runs for a dataset version

When a dataset bug is discovered, the immediate question is: which models were trained on it? A parameterized search against the MLflow tracking server answers this in seconds.

python
from mlflow.tracking import MlflowClient
from mlflow.entities import ViewType

client = MlflowClient(tracking_uri="http://mlflow.internal:5000")

DATASET_VERSION = "orders_v2.1+users_v3.0"

# Search across all active experiments for runs that used this dataset version
experiment_ids = [
    e.experiment_id
    for e in client.search_experiments(view_type=ViewType.ACTIVE_ONLY)
]
runs = client.search_runs(
    experiment_ids=experiment_ids,
    filter_string=f"tags.dataset_version = '{DATASET_VERSION}'",
    order_by=["start_time DESC"],
)

print(f"Found {len(runs)} runs trained on {DATASET_VERSION}:\n")
for r in runs:
    in_registry = client.search_model_versions(f"run_id='{r.info.run_id}'")
    registry_info = (
        f"  → registered as {in_registry[0].name} v{in_registry[0].version}"
        f" ({in_registry[0].current_stage})"
        if in_registry else "  → not registered"
    )
    val_auc = r.data.metrics.get("val_auc")
    auc_str = f"{val_auc:.3f}" if val_auc is not None else "n/a"
    print(
        f"Run {r.info.run_id[:8]}  "
        f"AUC={auc_str}  "
        f"started={r.info.start_time}"
    )
    print(registry_info)

This query drives impact analysis: for each registered production model that used the flagged dataset, initiate a rollback or retrain workflow. Without the dataset_version tag on every run, this query is impossible and the investigation is manual.

Automating lineage verification in CI

Prevent untagged runs from reaching the registry by validating tags before registration. Add this as a step in your training CI job:

python
import mlflow
from mlflow.tracking import MlflowClient

def assert_run_lineage(run_id: str, required_tags: list[str]) -> None:
    """Raise if any required lineage tag is missing from a completed run."""
    client = MlflowClient()
    run = client.get_run(run_id)
    missing = [t for t in required_tags if t not in run.data.tags]
    if missing:
        raise ValueError(
            f"Run {run_id} is missing required lineage tags: {missing}. "
            "Set these before registering the model."
        )
 
REQUIRED_TAGS = ["dataset_version", "git_commit", "triggered_by"]
assert_run_lineage(run.info.run_id, REQUIRED_TAGS)
 
# Only reached if all tags present
mlflow.register_model(f"runs:/{run.info.run_id}/model", MODEL_NAME)
