ML Metadata & Lineage
When a deployed model's accuracy drops, the investigation starts with questions: which dataset was it trained on? What hyperparameters? Which preprocessing code? Without ML metadata — the structured record of every experiment run, artifact, and deployment — these questions take hours to answer instead of minutes. ML lineage extends this to the artifact graph: tracing a production endpoint back to the exact datasets, code, and experiments that produced it.
How It Works
The ML artifact lineage graph: every model and serving endpoint links back to the exact datasets and hyperparameters used to produce it.
Lineage answers "what data and code produced this model?" — essential for audits, debugging regressions, and reproducing results.
Tracing one path through the example graph: the serving endpoint /predict v2.1 was produced by model clf-v2.1, which was trained in run/exp-47b, which consumed datasets orders_v2.1 and users_v3.0. The full ancestry is one query away — no ticket to another team required.
ML development produces a graph of artifacts — datasets, code versions, model weights, evaluation results, deployed endpoints — connected by experiment runs that consumed some artifacts and produced others. Without explicit logging, this graph exists only in someone's memory. When that person leaves, or when a model misbehaves six months after training, the graph is gone. Metadata logging makes it permanent and queryable.
The ML artifact lineage graph
This graph structure answers four questions that arise constantly in production ML:
- "Which dataset was this model trained on?" (audit, reproducibility)
- "Which models used dataset X?" (impact analysis before changing X)
- "Why did the model regress?" (compare lineage against the previous passing model)
- "Is this model approved for production?" (compliance, review chain)
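All four are reachability queries over the artifact graph. A minimal sketch, using a hypothetical edge list that mirrors the exp-47b example above (real systems store these edges in the tracking server):

```python
from collections import defaultdict

# Hypothetical lineage edges: (input_artifact, run, output_artifact)
EDGES = [
    ("orders_v2.1", "run/exp-47b", "clf-v2.1"),
    ("users_v3.0", "run/exp-47b", "clf-v2.1"),
    ("clf-v2.1", "deploy/exp-47b", "/predict v2.1"),
]

parents = defaultdict(set)   # artifact -> artifacts it was derived from
children = defaultdict(set)  # artifact -> artifacts derived from it
for src, _run, dst in EDGES:
    parents[dst].add(src)
    children[src].add(dst)

def ancestors(artifact):
    """Upstream closure: 'what data and code produced this model?'"""
    seen, stack = set(), [artifact]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def descendants(artifact):
    """Downstream closure: 'which models used dataset X?'"""
    seen, stack = set(), [artifact]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

print(sorted(ancestors("/predict v2.1")))  # ['clf-v2.1', 'orders_v2.1', 'users_v3.0']
print(sorted(descendants("orders_v2.1")))  # ['/predict v2.1', 'clf-v2.1']
```

The "why did the model regress?" query is then a diff of two ancestor sets, and "is this model approved?" is a tag lookup on the endpoint's parent model version.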
What gets logged per experiment run
A complete experiment record captures:
- Inputs: dataset version + hash, code commit SHA, hyperparameters (learning rate, batch size, architecture config), random seeds
- Outputs: model artifact path + hash, evaluation metrics (val_accuracy, AUC, F1), training curves (loss per epoch), system metrics (GPU hours, peak memory)
- Context: who ran it, when, on which cluster, job ID for log retrieval
```python
import subprocess

import mlflow

with mlflow.start_run(run_name="exp-47b") as run:
    # Log inputs
    mlflow.log_param("dataset_version", "orders_v2.1+users_v3.0")
    mlflow.log_param("learning_rate", 3e-4)
    mlflow.log_param("epochs", 15)
    mlflow.set_tag("git_commit", subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True,
    ).stdout.strip())

    # ... training loop ...

    # Log outputs
    mlflow.log_metric("val_accuracy", 0.87)
    mlflow.log_metric("val_auc", 0.93)
    mlflow.log_artifact("model.pkl")  # model artifact stored and linked
    print(f"Run ID: {run.info.run_id}")
```

MLflow Model Registry
The MLflow Model Registry adds lifecycle management on top of experiment tracking: models move through Staging → Production → Archived states with explicit transitions that require approval.
```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model from a run (run_id comes from the training run above)
model_uri = f"runs:/{run_id}/model"
model_version = mlflow.register_model(model_uri, "clf_orders")

# Promote to production after validation
client.transition_model_version_stage(
    name="clf_orders",
    version=model_version.version,
    stage="Production",
)
```

The registry maintains who promoted each version and when — an audit trail for regulated industries.
The staged promotion workflow (Staging → Production → Archived) had to be separate from experiment tracking because training and deploying are different operations with different risk profiles. Experiment tracking records what happened; the model registry controls what is allowed to happen next. Staging requires passing validation; Production requires explicit approval; Archived means the model can no longer be deployed. This separation ensures that a model cannot reach production without passing through a documented review process, which is what regulatory compliance requires — not just a log of what was trained, but a controlled chain of custody for what was deployed.
Design Tradeoffs
Where Your Intuition Breaks
ML metadata logging is often treated as a nice-to-have that gets added "when things are stable." The first time a production model regresses and the team cannot tell whether it was a data change, a code change, or a hyperparameter change — because runs were not logged consistently — the priority shifts. The problem is that retroactively adding metadata logging to a team that has been running undisciplined experiments for months requires first establishing what the baseline was, which is impossible without historical records. The correct time to add logging is at the beginning of a project, when the overhead is low and the discipline can be established as a norm. Treating every training run as potentially the one that will be deployed — and logging it accordingly — is the practice that makes production ML debuggable.
Experiment tracking tools
| Tool | Strengths | Weaknesses |
|---|---|---|
| MLflow | Open-source, self-hostable, broad ecosystem | UI is functional but dated |
| Weights & Biases | Best-in-class UI, rich visualizations | Proprietary, per-seat cost |
| Neptune | Good for team collaboration | Less adoption |
| Custom (Postgres + S3) | Full control | Maintenance burden |
MLflow is the most common choice for teams that want open-source and self-hosting. W&B dominates teams doing research-style iterative experimentation where visualization quality matters.
Logging granularity vs overhead
More metadata is better for debugging, but excessive logging adds overhead:
- Logging every gradient norm at every step: high value, adds ~5% training overhead
- Logging full prediction distributions on validation: high value, large storage cost
- Logging config files: near-zero cost, very high value (always do this)
- Logging random number generator state at every step: possible but rarely worth it
Reasonable defaults: log metrics every N steps, log artifacts (model checkpoints) every M epochs, log full config once at run start.
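Those defaults take a few lines to encode. A sketch with an in-memory sink standing in for `mlflow.log_metric` / `mlflow.log_artifact` (the `ThrottledLogger` class and the fake training loop are hypothetical):

```python
class ThrottledLogger:
    """Apply the defaults above: metrics every N steps, checkpoints every M epochs."""

    def __init__(self, metric_every_n_steps=50, checkpoint_every_m_epochs=5):
        self.n = metric_every_n_steps
        self.m = checkpoint_every_m_epochs
        self.metrics = []       # stand-in for mlflow.log_metric
        self.checkpoints = []   # stand-in for mlflow.log_artifact

    def on_step(self, step, loss):
        if step % self.n == 0:
            self.metrics.append((step, loss))

    def on_epoch_end(self, epoch, checkpoint_path):
        if (epoch + 1) % self.m == 0:
            self.checkpoints.append(checkpoint_path)

logger = ThrottledLogger()
steps_per_epoch = 100
for epoch in range(15):
    for i in range(steps_per_epoch):
        step = epoch * steps_per_epoch + i
        logger.on_step(step, loss=1.0 / (step + 1))  # fake loss value
    logger.on_epoch_end(epoch, f"checkpoint_epoch{epoch + 1}.pt")

print(len(logger.metrics))   # 30 metric points instead of 1500
print(logger.checkpoints)    # checkpoints at epochs 5, 10, 15
```

The full config is still logged once, unthrottled, at run start — it is the cheapest and most valuable record of all.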
Implicit vs explicit lineage
Implicit lineage (inferred): track which datasets exist and which runs happened, infer relationships from timestamps and naming conventions. Fragile — breaks when naming conventions change.
Explicit lineage (declared): code explicitly logs "this run used dataset X version Y." Requires discipline but is reliable and queryable.
Explicit lineage with a lightweight logging library (MLflow, Neptune) is the right default. Reserve implicit lineage inference for cases where the code can't be modified.
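The difference is easy to demonstrate. A sketch contrasting the two approaches (the run-naming convention, the regex, and the `run_record` dict are all hypothetical):

```python
import re

# Implicit: infer the dataset from a run-name convention like
# "train_<dataset>_v<version>_<label>". Works until the convention drifts.
def infer_dataset(run_name):
    m = re.search(r"_([a-z]+_v[\d.]+)_", run_name)
    return m.group(1) if m else None

print(infer_dataset("train_orders_v2.1_expA"))    # orders_v2.1
print(infer_dataset("retrain-orders-2.1-final"))  # None — naming drifted, lineage lost

# Explicit: the run declares its inputs; nothing to infer, nothing to drift.
run_record = {
    "run_id": "exp-47b",
    "tags": {"dataset_version": "orders_v2.1+users_v3.0"},
}
print(run_record["tags"]["dataset_version"])      # orders_v2.1+users_v3.0
```

The implicit version silently returns nothing the moment someone renames a run; the explicit version fails loudly only if the tag was never set — which is exactly the failure a CI check can catch.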
In Practice
Reproducing a past experiment
Given a model version in production and an accuracy regression, reproduce the previous passing experiment:
```shell
# List runs in the experiment; to filter by metric, use the UI or the
# Python API: mlflow.search_runs(filter_string="metrics.val_accuracy > 0.86")
mlflow runs list --experiment-id 3

# Get all parameters for the run that produced the current model
mlflow runs describe --run-id <run_id>
# Output includes (abridged):
#   dataset_version: orders_v2.0+users_v3.0
#   learning_rate: 3e-4
#   epochs: 15
#   git_commit: b7d9e4

# Check out that commit
git checkout b7d9e4

# Pull that dataset version
dvc checkout  # restores the dataset tracked at that git state

# Re-run training
python train.py --lr 3e-4 --epochs 15
```

The combination of MLflow + DVC + Git makes this a 5-minute operation instead of a 2-day investigation.
Model cards and documentation
Each registered model version should have a model card documenting:
- Intended use and out-of-scope uses
- Training data description and known biases
- Evaluation results across demographic groups (if applicable)
- Limitations and failure modes
Model cards are increasingly required by regulation (EU AI Act) and by platform policies (publishing to Hugging Face Hub). Write them at registration time, not retroactively.
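Cards are more likely to get written if they are generated at registration time from structured fields. A minimal sketch (the `render_model_card` helper and all field values are hypothetical; persist the result with `mlflow.log_artifact` or attach it to the registry version description):

```python
def render_model_card(name, version, intended_use, out_of_scope,
                      training_data, limitations):
    """Render a markdown model card covering the fields listed above."""
    lines = [
        f"# Model card: {name} v{version}",
        "",
        "## Intended use",
        intended_use,
        "",
        "## Out-of-scope uses",
        *[f"- {u}" for u in out_of_scope],
        "",
        "## Training data",
        training_data,
        "",
        "## Limitations and failure modes",
        *[f"- {l}" for l in limitations],
    ]
    return "\n".join(lines)

card = render_model_card(
    name="fraud_detector",
    version=3,
    intended_use="Rank card transactions for manual fraud review.",
    out_of_scope=["Automated account closure without human review"],
    training_data="orders_v2.1 + users_v3.0; US transactions only (known geographic bias).",
    limitations=["Untested on transactions under $1",
                 "Degrades on merchants unseen in training"],
)
print(card)
# Persist alongside the model, e.g.:
#   pathlib.Path("MODEL_CARD.md").write_text(card)
#   mlflow.log_artifact("MODEL_CARD.md")
```

Because the renderer takes structured fields, a CI step can refuse registration when any field is empty — the same gating pattern used for lineage tags below.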
Production Patterns
MLflow autolog with custom tags
mlflow.autolog() captures framework-level metrics automatically (loss curves, optimizer config, validation scores for scikit-learn, PyTorch, XGBoost, etc.), but production runs need additional tags that autolog cannot infer: dataset version, triggering CI job, business context.
```python
import subprocess

import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier

mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("orders_fraud_detection")

# Enable framework-level autolog — captures estimator params,
# training metrics, and the model artifact automatically.
mlflow.sklearn.autolog(log_input_examples=True, log_model_signatures=True)

DATASET_VERSION = "orders_v2.1+users_v3.0"
GIT_SHA = subprocess.run(
    ["git", "rev-parse", "--short", "HEAD"],
    capture_output=True, text=True,
).stdout.strip()

with mlflow.start_run(run_name="gbm_fraud_v47") as run:
    # Custom tags that autolog cannot infer
    mlflow.set_tags({
        "dataset_version": DATASET_VERSION,
        "git_commit": GIT_SHA,
        "triggered_by": "ci/github-actions",
        "business_unit": "payments",
        "data_cutoff_date": "2024-03-01",
    })
    model = GradientBoostingClassifier(n_estimators=400, learning_rate=3e-2)
    model.fit(X_train, y_train)  # autolog records params and training metrics

    print(f"Run ID: {run.info.run_id}")
    print(f"Artifact URI: {run.info.artifact_uri}")
```

Tag every run with dataset_version — it is the single most important tag for audit and impact analysis. Without it, you cannot answer "which models were trained on the dataset we just found a bug in?"
Registering a model version and transitioning stages
Model promotion should be explicit and logged. The registry transition creates an audit trail (who promoted, when) and gates serving-layer rollout.
```python
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow.internal:5000")
MODEL_NAME = "fraud_detector"

# Register the model artifact from an existing run
# (run_id, DATASET_VERSION, GIT_SHA carry over from the training script above)
model_uri = f"runs:/{run_id}/model"
mv = mlflow.register_model(model_uri, MODEL_NAME)
print(f"Registered version {mv.version} — state: {mv.current_stage}")

# Add a human-readable description to this version
client.update_model_version(
    name=MODEL_NAME,
    version=mv.version,
    description=(
        f"GBM trained on {DATASET_VERSION}, git={GIT_SHA}. "
        "Validation AUC 0.934. Approved by: rzeng 2024-03-15."
    ),
)

# Transition to Staging for integration testing
client.transition_model_version_stage(
    name=MODEL_NAME, version=mv.version, stage="Staging",
    archive_existing_versions=False,  # keep previous staging version during canary
)

# After integration tests pass, promote to Production
client.transition_model_version_stage(
    name=MODEL_NAME, version=mv.version, stage="Production",
    archive_existing_versions=True,  # archive the old production version
)

# Load directly from the registry in the serving layer
prod_model = mlflow.sklearn.load_model(f"models:/{MODEL_NAME}/Production")
```

Set archive_existing_versions=True on the Production transition to prevent multiple versions from simultaneously being in Production — a common source of silent serving inconsistencies.
Querying the tracking server: find all runs for a dataset version
When a dataset bug is discovered, the immediate question is: which models were trained on it? A parameterized search against the MLflow tracking server answers this in seconds.
```python
from mlflow.entities import ViewType
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow.internal:5000")
DATASET_VERSION = "orders_v2.1+users_v3.0"

# Search across all active experiments for runs that used this dataset version
experiment_ids = [
    e.experiment_id
    for e in client.search_experiments(view_type=ViewType.ACTIVE_ONLY)
]
runs = client.search_runs(
    experiment_ids=experiment_ids,
    filter_string=f"tags.dataset_version = '{DATASET_VERSION}'",
    order_by=["start_time DESC"],
)

print(f"Found {len(runs)} runs trained on {DATASET_VERSION}:\n")
for r in runs:
    versions = client.search_model_versions(f"run_id='{r.info.run_id}'")
    registry_info = (
        f" → registered as {versions[0].name} v{versions[0].version}"
        f" ({versions[0].current_stage})"
        if versions else " → not registered"
    )
    auc = r.data.metrics.get("val_auc")
    auc_str = f"{auc:.3f}" if auc is not None else "n/a"
    print(f"Run {r.info.run_id[:8]} AUC={auc_str} started={r.info.start_time}")
    print(registry_info)
```

This query drives impact analysis: for each registered production model that used the flagged dataset, initiate a rollback or retrain workflow. Without the dataset_version tag on every run, this query is impossible and the investigation is manual.
Automating lineage verification in CI
Prevent untagged runs from reaching the registry by validating tags before registration. Add this as a step in your training CI job:
```python
import mlflow
from mlflow.tracking import MlflowClient

def assert_run_lineage(run_id: str, required_tags: list[str]) -> None:
    """Raise if any required lineage tag is missing from a completed run."""
    client = MlflowClient()
    run = client.get_run(run_id)
    missing = [t for t in required_tags if t not in run.data.tags]
    if missing:
        raise ValueError(
            f"Run {run_id} is missing required lineage tags: {missing}. "
            "Set these before registering the model."
        )

REQUIRED_TAGS = ["dataset_version", "git_commit", "triggered_by"]
assert_run_lineage(run.info.run_id, REQUIRED_TAGS)

# Only reached if all tags are present
mlflow.register_model(f"runs:/{run.info.run_id}/model", MODEL_NAME)
```