
Feature Stores

The most common silent killer of ML systems is training-serving skew — the model was trained on one version of a feature and is served a different one. Feature stores solve this by providing a single system that computes features once, stores them in an offline store for training (with point-in-time correct joins) and an online store for low-latency serving. Feast and Tecton are the two dominant implementations.

How It Works

Batch feature computation → offline store → point-in-time join prevents label leakage.

Raw event logs → feature pipeline → offline store (S3 / BigQuery) → point-in-time join (the join uses the label timestamp, not query time).

Example features and how they are computed:

  • user_30d_purchase_count: batch, refreshed hourly
  • user_lifetime_value: batch, refreshed every 24 hours
  • item_avg_rating: batch, refreshed hourly
  • user_last_seen_seconds: real-time

Feature registry holds definitions — offline and online stores share the same spec.

The key insight: the training and serving paths use the same feature definitions, stored once. The offline store materializes historical values for training; the online store holds the latest values for real-time lookup. The feature registry ensures that user_30d_purchase_count means exactly the same thing in both contexts.

Training-serving skew is the ML equivalent of a test environment that doesn't match production: the model learns from one version of reality and is deployed into a different one. Feature stores exist because the only reliable way to prevent this is to eliminate the two-implementation problem entirely — define features once, compute them once, and serve that same computation for both training and inference.

The training-serving skew problem

Without a feature store, two things happen independently: the training pipeline computes features one way (usually with access to full historical data), and the serving pipeline computes the same features another way (usually with different code, different libraries, different data sources). The gap between these two implementations is training-serving skew, and it silently degrades model performance in production without raising any errors.
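As a concrete (hypothetical) illustration, imagine the serving team re-implements the training pipeline's 30-day purchase count and the window silently drifts by one day. All names and data here are made up:

```python
from datetime import datetime, timedelta, timezone

# Toy event log: (user_id, purchase_timestamp)
purchases = [
    (1, datetime(2024, 11, 1, tzinfo=timezone.utc)),
    (1, datetime(2024, 11, 20, tzinfo=timezone.utc)),
    (1, datetime(2024, 10, 25, 12, tzinfo=timezone.utc)),
]

def purchase_count_training(user_id, as_of):
    # Training pipeline: strict 30-day window
    cutoff = as_of - timedelta(days=30)
    return sum(1 for uid, ts in purchases if uid == user_id and cutoff <= ts <= as_of)

def purchase_count_serving(user_id, as_of):
    # Serving reimplementation: "about a month" silently became 31 days
    cutoff = as_of - timedelta(days=31)
    return sum(1 for uid, ts in purchases if uid == user_id and cutoff <= ts <= as_of)

as_of = datetime(2024, 11, 25, tzinfo=timezone.utc)
print(purchase_count_training(1, as_of))  # 2
print(purchase_count_serving(1, as_of))   # 3 -- same feature, different value, no error raised
```

Neither implementation is wrong in isolation; the bug is the gap between them, which is exactly what a single shared feature definition eliminates.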

Point-in-time correct joins

For supervised learning, features must be computed using only data available at the label timestamp. Consider predicting whether a user will churn: the label is "churned within 30 days after time T." The features must be the state of the world at time T, not at training time (which might be weeks later).

Without point-in-time correct joins, a model trained on user features computed at training time will have access to future data it won't have at serving time — a form of data leakage that produces optimistic offline metrics and disappointing online performance.

python
# Feast point-in-time join example
from feast import FeatureStore
 
store = FeatureStore(repo_path=".")
training_df = store.get_historical_features(
    entity_df=entity_df,      # contains entity_id and event_timestamp
    features=[
        "user_features:user_30d_purchase_count",
        "user_features:user_lifetime_value",
        "item_features:item_avg_rating",
    ],
).to_df()
# Each row gets features as of that row's event_timestamp

Offline vs online store

                  Offline store                   Online store
Storage           S3, BigQuery, Snowflake         Redis, DynamoDB, Cassandra
Access pattern    Batch (training, backfill)      Single-key lookup
Freshness         Minutes to hours (batch job)    Real-time or near-real-time
Latency           Seconds to minutes              Milliseconds
Use case          Training dataset generation     Serving predictions

Materialization is the process of computing features and writing them to both stores. The offline store gets the full history; the online store gets only the latest values.

The offline/online split uses two different storage systems rather than one because the access patterns are incompatible. Offline training needs the full history for any entity at any timestamp (a time-range scan); online serving needs only the latest value for a specific entity in under 10 ms (a point lookup). No single storage system optimizes for both: batch columnar stores are slow for point lookups, while Redis is fast for point lookups but expensive per GB and not designed for historical scans. The two-store architecture is not complexity for its own sake; it is the forced consequence of serving two incompatible query patterns.
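A toy sketch (not a real feature store) makes the two access patterns concrete: the offline side keeps full history per entity and answers point-in-time reads, while the online side keeps only the latest value for O(1) lookup.

```python
import bisect

offline = {}  # entity_id -> sorted list of (timestamp, value): full history
online = {}   # entity_id -> latest value only

def materialize(entity_id, timestamp, value):
    history = offline.setdefault(entity_id, [])
    bisect.insort(history, (timestamp, value))
    # The online store only ever holds the most recent value
    if history[-1] == (timestamp, value):
        online[entity_id] = value

def point_in_time_lookup(entity_id, as_of):
    """Offline pattern: latest value at or before as_of (time-range scan)."""
    history = offline.get(entity_id, [])
    i = bisect.bisect_right(history, (as_of, float("inf")))
    return history[i - 1][1] if i else None

materialize("user:1", 100, 5)
materialize("user:1", 200, 7)
print(online["user:1"])                     # 7  (serving: latest only)
print(point_in_time_lookup("user:1", 150))  # 5  (training: value as of t=150)
```

In a real system the `offline` dict is a columnar table partitioned by time and the `online` dict is Redis or DynamoDB, but the split in query shape is the same.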

Design Tradeoffs

Where Your Intuition Breaks

Point-in-time correct joins sound like a technical detail, but they represent the single most common source of optimistic offline evaluation metrics in ML. A model trained without point-in-time correctness has implicitly seen the future: features computed at training time include data that did not exist at the label timestamp. The model learns patterns that cannot be reproduced at serving time, producing offline AUC/RMSE that overstates production performance. The mismatch is especially insidious because it is invisible in standard evaluation: the metrics look good, the model passes review, and the performance gap only appears in production. Feature stores enforce point-in-time correctness as a primitive, which is why organizations with mature ML pipelines treat them as infrastructure rather than optional tooling.
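A synthetic demonstration of the effect, on entirely made-up data: a feature that leaks the label (because it was computed after the label event, from activity the label itself caused) looks highly predictive offline, while the feature that was honestly available at the label timestamp does not.

```python
import random

random.seed(0)
n = 2000
labels = [random.randint(0, 1) for _ in range(n)]

# Leaky feature: computed at training time, it includes post-label
# activity that is driven by the label itself
leaky = [y + random.gauss(0, 0.5) for y in labels]

# Honest feature: what was actually knowable at the label timestamp
honest = [random.gauss(0, 1) for _ in range(n)]

def accuracy(feature, labels, threshold=0.5):
    # Trivial threshold classifier, enough to show the gap
    return sum((x > threshold) == bool(y) for x, y in zip(feature, labels)) / len(labels)

print(f"leaky:  {accuracy(leaky, labels):.2f}")   # looks great offline
print(f"honest: {accuracy(honest, labels):.2f}")  # roughly chance
```

The leaky model's offline metric is real in the sense that the evaluation code ran correctly; it is fictional in the sense that the feature cannot exist at serving time.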

Batch vs real-time features

Most features are batch features — they can be computed once per day or per hour and cached. user_30d_purchase_count doesn't change by the millisecond; a 1-hour lag is acceptable.

Some features require real-time computation: seconds_since_last_login or cart_items_added_last_5_minutes must be computed at serving time, not materialized ahead of time. These stream-computed features require a real-time feature pipeline (Flink, Spark Streaming) feeding directly into the online store.

Hybrid architectures combine both: slow-changing features (user demographics, historical aggregates) are batch-materialized; fast-changing features (session activity, recent clicks) are computed at request time and merged before model inference.
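A sketch of that merge step, with illustrative names (in production the batch lookup would be an online-store read and the session would come from a stream or session cache):

```python
import time

def get_batch_features(user_id):
    # Stand-in for an online-store lookup (e.g. Redis) of materialized features
    return {"user_30d_purchase_count": 4, "user_lifetime_value": 312.5}

def compute_realtime_features(session):
    # Computed at request time from live session state, never materialized
    return {
        "seconds_since_last_login": time.time() - session["last_login_ts"],
        "cart_items_added_last_5_minutes": len(session["recent_cart_events"]),
    }

def build_feature_vector(user_id, session):
    # Merge slow-changing batch features with fast-changing request-time
    # features before handing the vector to the model
    features = get_batch_features(user_id)
    features.update(compute_realtime_features(session))
    return features
```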

Feature freshness vs serving cost

Fresh features require frequent materialization runs and higher online store throughput. The cost scales with:

  • Materialization frequency: hourly vs daily vs real-time
  • Online store capacity: number of entity-feature combinations × bytes per value
  • Lookup throughput: QPS × features per request

For a recommendation system with 10M users × 50 features × 4 bytes/value, the online store needs ~2GB of memory at minimum. At 1000 QPS and 10 features per request, the online store handles 10,000 key lookups per second.
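The arithmetic above generalizes into a quick sizing helper; a back-of-envelope sketch only, since real deployments add overhead for keys, TTLs, and replication:

```python
def online_store_sizing(num_entities, features_per_entity, bytes_per_value,
                        qps, features_per_request):
    # Minimum value storage, ignoring key and metadata overhead
    memory_gb = num_entities * features_per_entity * bytes_per_value / 1e9
    # Feature lookups per second the store must sustain
    lookups_per_sec = qps * features_per_request
    return memory_gb, lookups_per_sec

mem, lookups = online_store_sizing(10_000_000, 50, 4, 1000, 10)
print(f"{mem:.1f} GB minimum, {lookups} lookups/s")  # 2.0 GB minimum, 10000 lookups/s
```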

When NOT to use a feature store

Feature stores add operational complexity. Justified when:

  • Multiple models use the same features (amortizes the cost of computation)
  • Training-serving skew is a confirmed problem
  • Feature computation is expensive (shared computation is valuable)
  • Regulatory compliance requires feature audit trails

Skip a feature store when you have one model, simple features, and a small team. A well-tested feature computation function called from both training and serving code is simpler and often good enough.
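A minimal sketch of that alternative: one function, imported by both the training job and the serving handler, so there is exactly one implementation to test. Names and signatures are illustrative.

```python
from datetime import datetime, timedelta

def user_30d_purchase_count(purchases, user_id, as_of):
    """Single source of truth for this feature.

    Training calls it with each example's label timestamp as `as_of`;
    serving calls it with the current time. Same code, no skew.
    """
    cutoff = as_of - timedelta(days=30)
    return sum(
        1 for uid, ts in purchases
        if uid == user_id and cutoff <= ts <= as_of
    )
```

This pattern breaks down once many models share expensive features or point-in-time joins over large histories are needed, which is the point at which a feature store earns its complexity.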

In Practice

Defining features with Feast

python
# feature_repo/features.py
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64
from datetime import timedelta

user = Entity(name="user_id", join_keys=["user_id"])

user_stats_source = FileSource(
    path="s3://my-bucket/user_stats.parquet",
    timestamp_field="event_timestamp",
)

user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=7),
    schema=[
        Field(name="user_30d_purchase_count", dtype=Int64),
        Field(name="user_lifetime_value", dtype=Float32),
    ],
    online=True,
    source=user_stats_source,
)

Apply the definitions to the registry with feast apply, then materialize to the online store with feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S").

Serving features at inference time

python
# At serving time
store = FeatureStore(repo_path=".")
 
feature_vector = store.get_online_features(
    features=["user_features:user_30d_purchase_count", "user_features:user_lifetime_value"],
    entity_rows=[{"user_id": 12345}],
).to_dict()
 
prediction = model.predict([[
    feature_vector["user_30d_purchase_count"][0],
    feature_vector["user_lifetime_value"][0],
]])

The online lookup adds ~2–5ms latency for Redis-backed stores — acceptable for most serving scenarios.

Monitoring feature freshness

Feature freshness degradation is silent — if the materialization job fails, the online store serves stale values without error. Monitor:

  • last_updated_at per feature view, exported as a metric
  • An alert when materialization falls more than 2× behind its scheduled interval
  • A periodic shadow comparison of online-store values against a reference computation
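The staleness check can be sketched as below, assuming you record a last_updated_at timestamp per feature view at materialization time; the interval, names, and alerting hook are illustrative:

```python
from datetime import datetime, timedelta, timezone

SCHEDULED_INTERVAL = timedelta(hours=1)  # how often materialization should run

def check_freshness(feature_view, last_updated_at, now=None):
    """Return False (and alert) if materialization is more than 2x behind schedule."""
    now = now or datetime.now(timezone.utc)
    lag = now - last_updated_at
    if lag > 2 * SCHEDULED_INTERVAL:
        # Emit to your alerting system (PagerDuty, Datadog, etc.) instead of printing
        print(f"STALE: {feature_view} last materialized {lag} ago")
        return False
    return True
```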

Production Patterns

Feast feature view with a stream source

Most production feature stores combine a batch source (for training) with a stream source (for the online store). Define both in a single feature view so Feast can materialize from either path:

python
# feature_repo/features.py
from feast import Entity, FeatureView, Field, PushSource, BigQuerySource
from feast.types import Float32, Int64
from datetime import timedelta
 
user = Entity(name="user_id", join_keys=["user_id"])
 
# Batch source: historical data for point-in-time joins
batch_source = BigQuerySource(
    table="myproject.features.user_stats",
    timestamp_field="event_timestamp",
)
 
# Push source: real-time updates from Kafka → Flink → Feast push API
push_source = PushSource(
    name="user_stats_push",
    batch_source=batch_source,
)
 
user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=7),
    schema=[
        Field(name="user_30d_purchase_count", dtype=Int64),
        Field(name="user_lifetime_value",      dtype=Float32),
        Field(name="user_session_count_1h",    dtype=Int64),
    ],
    online=True,
    source=push_source,
)

Apply and materialize:

bash
feast apply
# Incremental materialize: only processes new data since last run
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")

Online feature retrieval at serving time

Retrieve features in the hot path with minimal overhead. Batch the entity lookup to amortize Redis round-trip cost:

python
from feast import FeatureStore
from typing import List, Dict, Any
 
store = FeatureStore(repo_path="/app/feature_repo")
 
FEATURE_REFS = [
    "user_features:user_30d_purchase_count",
    "user_features:user_lifetime_value",
    "item_features:item_avg_rating",
    "item_features:item_purchase_rate_7d",
]
 
def fetch_features(user_ids: List[int], item_ids: List[int]) -> Dict[str, Any]:
    entity_rows = [
        {"user_id": uid, "item_id": iid}
        for uid, iid in zip(user_ids, item_ids)
    ]
    return store.get_online_features(
        features=FEATURE_REFS,
        entity_rows=entity_rows,
    ).to_dict()
 
# In a FastAPI handler:
# features = fetch_features([user_id], [item_id])
# score = model.predict([[
#     features["user_30d_purchase_count"][0],
#     features["user_lifetime_value"][0],
#     features["item_avg_rating"][0],
#     features["item_purchase_rate_7d"][0],
# ]])

For latency-sensitive paths, initialize FeatureStore once at startup (not per request) — the Redis connection pool is reused across calls.

Point-in-time correct training dataset generation

Generate a training dataset where every row's features are as of that row's label timestamp. The entity_df must include event_timestamp per row:

python
import pandas as pd
from feast import FeatureStore
 
store = FeatureStore(repo_path="/app/feature_repo")
 
# entity_df: one row per training example, with the timestamp of the label event
entity_df = pd.DataFrame({
    "user_id":          [1001, 1002, 1003],
    "item_id":          [501,  502,  503],
    "event_timestamp":  pd.to_datetime([
        "2024-11-01 12:00:00",
        "2024-11-03 09:30:00",
        "2024-11-05 17:45:00",
    ], utc=True),
    "label":            [1, 0, 1],   # churn, purchase, etc.
})
 
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=FEATURE_REFS,
).to_df()
 
# Each row now has features as of its event_timestamp — no future data leakage
print(training_df.head())

Schedule this generation job daily via Airflow or Prefect, tagging each output file with the run date and feature view versions. Log the resulting parquet path and row count to MLflow so every model run is linked to its exact training data.

Detecting training-serving skew in production

Skew is silent — flag it before it becomes a regression:

python
import numpy as np
from scipy.stats import ks_2samp
 
def detect_skew(
    training_values: np.ndarray,
    serving_values: np.ndarray,
    feature_name: str,
    p_threshold: float = 0.01,
) -> None:
    stat, p_value = ks_2samp(training_values, serving_values)
    if p_value < p_threshold:
        # Emit to your alerting system (PagerDuty, Datadog, etc.)
        print(f"SKEW ALERT: {feature_name} KS={stat:.3f} p={p_value:.4f}")
 
# Run nightly: sample last 24h of serving logs, compare to training distribution
