
Data Flywheels & Feedback Loops

The best training data comes from production. Every prediction your model makes is an opportunity to collect a label — if you log the right signals, sample intelligently, and close the loop back into training. Data flywheels (more usage → better labels → better model → more usage) are the compounding advantage that separates mature ML systems from one-shot deployments.

How It Works

Data flywheel — self-reinforcing feedback loop

More users → more signal → better training data → better model → more users.

[Diagram: Model (v1.4 deployed) → predictions → Users & Signals (clicks, ratings, dwell time) → feedback signals → Interaction Logs (2.1M events/day) → label pipeline → Labels & Scores (implicit + human review) → curated data → Training Data (+180k labeled examples) → re-train → Model v1.5]

The flywheel compounds: each model improvement attracts more users, generating more signal, enabling better training data for the next version.

The flywheel is a self-reinforcing feedback loop: the model serves predictions to users, users interact with those predictions (clicking, purchasing, dwelling, rating), those interactions are logged as signals, the signals are transformed into labels, new training data is assembled, and the model is retrained. Because the retrained model is better, it attracts more users, generating more signal. Each cycle compounds the advantage.

The key insight is that production traffic is a data collection engine. A model that has served 10 million queries has collected 10 million labeled examples — if the logging pipeline is in place. A team that treats production as deployment-only leaves that signal on the floor.

The flywheel must be built around an asynchronous join between prediction events and feedback events, rather than synchronous label collection at serving time, because the most valuable signals (purchase, churn, long-term retention) arrive after the prediction has been served and forgotten. Synchronous labeling works only for immediate signals like clicks; the broader class of high-quality labels requires the asynchronous log-join architecture. This is why the request_id linking predictions to feedback is the single most critical field in the schema.

Implicit vs explicit feedback

Explicit feedback: the user directly labels the prediction. Star ratings, thumbs up/down, "was this answer helpful?", A/B test conversions. High quality, low volume — most users don't rate.

Implicit feedback: inferred from user behavior. Click-through rate (user clicked → relevant), dwell time (user read for 30s → engaged), purchase (user bought → relevant recommendation), skip (user skipped track → not relevant). High volume, lower quality — behavioral signals are noisy and confounded.

Most production systems rely primarily on implicit feedback because it requires no user effort and scales with usage. Explicit feedback is reserved for bootstrapping and high-stakes corrections.

What gets logged

A complete feedback logging schema captures:

python
{
  "request_id":    "req_7f3a9c",          # links prediction to feedback
  "user_id":       "u_4821",
  "timestamp":     "2024-03-15T14:23:07Z",
  "model_version": "clf-v2.3",
  "features": {
    "query":         "machine learning tutorial",
    "user_country":  "US",
    "session_depth": 3,
  },
  "prediction": {
    "ranked_items": ["doc_44", "doc_891", "doc_12"],
    "scores":       [0.92, 0.78, 0.61],
  },
  # Collected asynchronously when signal arrives:
  "feedback": {
    "clicked_item":  "doc_44",
    "click_delay_ms": 1840,
    "dwell_seconds":  47,
    "session_converted": True,
  }
}

The request_id is the join key that links the prediction record to the feedback record. Without it, you cannot close the loop.

From signal to label

Raw signals are not labels. Transforming signals into training labels requires defining a labeling function — the rule that maps behavioral signals to a target value:

Signal                          | Labeling function               | Label
Clicked item in top position    | click and position <= 3         | relevant = 1
Purchased after recommendation  | purchased within 1 hour of rec  | relevant = 1
Scrolled past without click     | viewed and not clicked          | relevant = 0
Rated 4–5 stars                 | rating >= 4                     | positive = 1
Refund within 24 hours          | refunded and within_24h         | quality = 0
Labeling functions encode domain assumptions. They should be versioned — a change to the labeling function changes the semantics of all historical labels.

Design Tradeoffs

Where Your Intuition Breaks

More engagement data does not automatically produce a better model. Without position bias correction, the flywheel amplifies whatever the current model already does well rather than improving toward the actual objective. Teams often run the flywheel for months and observe steadily rising click-through rates while recommendation quality — measured by downstream retention, diversity, or satisfaction surveys — stagnates or degrades. The flywheel maximizes the proxy metric (clicks) not the true objective (user value). Position bias correction and explicit exploration budgets are not optional refinements; they determine whether the flywheel improves what users actually want or simply reinforces what they have already been shown. A flywheel without these safeguards is a filter bubble engine, not a quality improvement system.

Implicit signals: position bias and selection bias

Implicit feedback suffers from two systematic biases that corrupt the training signal:

Position bias: items shown at the top receive more clicks regardless of quality. If the model recommends item A first and item B second, item A gets more clicks — but this is partly because of its position, not its relevance. A model trained on raw click data learns to surface whatever the previous model surfaced, reinforcing the existing ranking.

Selection bias: only items the model showed can receive feedback. Items that were never shown can never be clicked. The model has no signal about items it systematically suppresses.

Mitigations:

  • Inverse propensity scoring (IPS): down-weight clicks on high-position items. If item at position 1 has 3× the click rate from position alone, weight its click by 1/3.
  • Randomized exploration: occasionally insert randomly-selected items (epsilon-greedy or Thompson sampling). Random insertions have no position bias, providing unbiased signal. Costs a small fraction of engagement.
  • Counterfactual evaluation: log the scores and rankings at serving time to reconstruct what the model believed when it acted. Essential for offline evaluation against logged data.
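The first two mitigations can be sketched in a few lines. The propensity table and epsilon value below are illustrative assumptions, not measured values:

```python
import random

# Assumed click propensities by position, typically estimated from
# historical randomization experiments (illustrative numbers only)
POSITION_PROPENSITY = {1: 0.60, 2: 0.35, 3: 0.20}

def ips_weight(position: int) -> float:
    """Inverse propensity weight: clicks at easy positions count less."""
    return 1.0 / POSITION_PROPENSITY.get(position, 0.10)

def rank_with_exploration(scored_items: list[tuple[str, float]],
                          epsilon: float = 0.05) -> list[str]:
    """Epsilon-greedy: with probability epsilon, promote one random item
    into the top slot so it can collect position-unbiased feedback."""
    ranked = [item for item, _ in sorted(scored_items, key=lambda x: -x[1])]
    if random.random() < epsilon and len(ranked) > 1:
        i = random.randrange(1, len(ranked))
        ranked.insert(0, ranked.pop(i))
    return ranked
```

At training time, each clicked example is weighted by `ips_weight(rank)`, and the exploration slots provide the unbiased clicks that keep the propensity estimates honest.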

Loop latency: how fast does signal close?

Tight loop (minutes to hours): session signals (click, add-to-cart, skip). Low latency, high volume. Used in real-time recommendation retraining (streaming pipelines, online learning).

Medium loop (days): purchase/conversion signals that require user action after the session. Standard batch retraining.

Loose loop (weeks to months): long-term outcome signals (retention, churn, lifetime value). Captures true quality but makes attribution difficult — many intervening events between prediction and outcome.

Most systems operate at medium-loop cadence for the core training signal, supplemented by tight-loop signals for real-time feature updates (session state, recency).

Cold-start problem

New models, new items, and new users all lack feedback signal. The cold-start problem is structural: the flywheel requires data to start turning, but data only accumulates after the system is deployed.

New model: can be evaluated offline against logged data from the previous model using counterfactual metrics. Or A/B tested with a small traffic slice.

New items: assigned a prior score from content features (metadata, description embedding) until sufficient feedback accumulates. Explicit exploration budget — always show new items to some fraction of users.

New users: cold-start users receive population-average recommendations until enough behavioral signal accumulates for personalization. Explicit first-session signals (onboarding preferences, initial ratings) speed up the ramp.
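One common way to handle the new-item case is a pseudocount blend: score by the content prior when the item is new, and converge to the observed click-through rate as impressions accumulate. A sketch, where the `prior_strength` pseudocount is an assumed tuning knob:

```python
def blended_score(content_prior: float,
                  feedback_clicks: int,
                  feedback_impressions: int,
                  prior_strength: int = 50) -> float:
    """Bayesian-style blend: behaves like the content prior when the
    item is new, converges to the empirical CTR as impressions grow."""
    return (content_prior * prior_strength + feedback_clicks) / \
           (prior_strength + feedback_impressions)

print(blended_score(0.10, 0, 0))        # brand-new item: pure prior
print(blended_score(0.10, 400, 2000))   # mature item: close to CTR 0.20
```

The same idea applies to new users: `content_prior` becomes the population-average score, and the blend shifts toward personal behavior as their session history grows.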

Feedback loops and model collapse

Without intervention, the flywheel amplifies: popular items get shown more, receive more clicks, get more signal, become even more dominant. Over time the system converges to a narrow set of popular items, reducing diversity and missing long-tail quality.

Diversity constraints: in ranking systems, impose diversity in the candidate set (across categories, publishers, recency). Not a data pipeline concern, but it shapes what signal gets collected.

Exploration vs exploitation: allocate a fraction of serving traffic to exploration — showing items the model is uncertain about. Exploration generates signal on underexplored items, preventing premature convergence.
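A diversity constraint can be as simple as a greedy re-rank that caps how many items of one category appear near the top. A sketch under that assumed policy:

```python
def diversify(ranked: list[dict], max_per_category: int = 2) -> list[dict]:
    """Greedy re-rank: keep score order but cap how many items of any
    one category appear, pushing the overflow to the tail."""
    counts: dict[str, int] = {}
    head, tail = [], []
    for item in ranked:
        cat = item["category"]
        if counts.get(cat, 0) < max_per_category:
            counts[cat] = counts.get(cat, 0) + 1
            head.append(item)
        else:
            tail.append(item)
    return head + tail
```

Because the capped items still appear in the tail, they remain eligible for impressions and can keep collecting signal.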

In Practice

Building the logging pipeline

The prediction logging pipeline must be in place before the model reaches production — you cannot retroactively collect signal for predictions that weren't logged.

python
import json
import uuid
from datetime import datetime, timezone
from kafka import KafkaProducer
 
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
 
def log_prediction(user_id: str, features: dict, predictions: list[dict], model_version: str) -> str:
    """Log a prediction event. Returns request_id for downstream feedback join."""
    request_id = str(uuid.uuid4())
    event = {
        "request_id":    request_id,
        "user_id":       user_id,
        "model_version": model_version,
        "timestamp":     datetime.now(timezone.utc).isoformat(),
        "features":      features,
        "predictions":   predictions,
    }
    producer.send("prediction_events", event)
    return request_id  # return to caller so it can be passed to frontend
 
def log_feedback(request_id: str, signal_type: str, signal_value: dict) -> None:
    """Log a feedback signal linked to a prediction."""
    event = {
        "request_id":  request_id,
        "signal_type": signal_type,     # "click", "purchase", "rating", "skip"
        "signal_value": signal_value,
        "timestamp":    datetime.now(timezone.utc).isoformat(),
    }
    producer.send("feedback_events", event)

The request_id is returned by the serving layer and stored in the frontend/app. When the user action fires, the client sends request_id + action back to the feedback endpoint.

Joining predictions and feedback

A daily Spark job joins prediction events to feedback events and materializes training examples:

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when
 
spark = SparkSession.builder.appName("flywheel_join").getOrCreate()
 
predictions = spark.read.parquet("s3://bucket/prediction_events/date=2024-03-15/")
feedback    = spark.read.parquet("s3://bucket/feedback_events/date=2024-03-15/")
 
# Join on request_id, left join — unclicked items get label 0
labeled = (
    predictions
    .join(feedback, "request_id", "left")
    .withColumn("label",
        when(col("signal_type") == "click", lit(1))
        .when(col("signal_type") == "purchase", lit(1))
        .otherwise(lit(0))
    )
    # Apply position bias correction (IPS weight); "rank" is the item's
    # 1-based serving position, logged at prediction time
    .withColumn("ips_weight",
        when(col("rank") == 1, lit(0.33))   # 3× click rate at rank 1
        .when(col("rank") == 2, lit(0.50))
        .otherwise(lit(1.0))
    )
)
 
labeled.write.parquet(
    "s3://bucket/training_data/flywheel_v1/date=2024-03-15/",
    mode="overwrite",
)

Measuring flywheel health

Three metrics tell you whether the flywheel is turning effectively:

Label rate: fraction of predictions that receive any feedback signal within the observation window. Label rate < 5% suggests the signal collection is broken or the product has low engagement.

Signal freshness: median delay between prediction timestamp and feedback timestamp. For click signals this should be seconds. For purchase signals, minutes to hours. Stale signal delays retraining and reduces relevance.

Signal quality: the correlation between your implicit label (click) and an explicit ground-truth label (rating, purchase, long-term retention). High signal quality means click-trained models actually improve the outcome you care about. Low quality means the flywheel is spinning but not in the right direction.

python
# Compute flywheel health metrics over a trailing 7-day window
from pyspark.sql.functions import avg, stddev

# feedback_delay_seconds is assumed to be computed upstream from the
# prediction and feedback timestamps
metrics = labeled.agg(
    avg(col("label").cast("double")).alias("label_rate"),
    avg(col("feedback_delay_seconds")).alias("avg_signal_latency_s"),
    stddev(col("label").cast("double")).alias("label_stddev"),
).collect()[0]
 
print(f"Label rate:          {metrics.label_rate:.1%}")
print(f"Avg signal latency:  {metrics.avg_signal_latency_s:.0f}s")

Track these metrics on a dashboard. A drop in label rate often signals an upstream bug — missing request_id in the client, a broken feedback endpoint — not a product problem. Catching it early prevents a training data gap from silently degrading the next model version.

Closing the loop: retraining on flywheel data

The flywheel data pipeline feeds directly into the training pipeline from the ML Data Pipelines lesson:

prediction_events + feedback_events
       ↓
  daily Spark join (flywheel_join job)
       ↓
  labeled training examples (S3)
       ↓
  DVC version (content-hash snapshot)
       ↓
  training job (MLflow run, logs dataset version)
       ↓
  model registry (Staging → Production)
       ↓
  serving layer (generates new predictions)
       ↓
  (repeat)

The loop cadence is set by business requirements. Daily retraining is standard for recommendation systems with fast-changing content. Weekly retraining is common for slower-moving domains. The key is that the loop is automatic — human intervention should be required only to approve model promotion or investigate quality regressions, not to manually run the pipeline.

Production Patterns

Collecting implicit feedback signals from user interactions

The event logging contract must be established before launch. Retrofitting logging into a live service is possible but error-prone — missed events during the migration create gaps that silently corrupt label distributions.

python
import json
import uuid
from datetime import datetime, timezone
from dataclasses import dataclass, asdict
from kafka import KafkaProducer
 
producer = KafkaProducer(
    bootstrap_servers=["kafka-broker-1:9092", "kafka-broker-2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",          # wait for all replicas before confirming
    retries=5,
    linger_ms=10,        # batch small events for throughput
)
 
@dataclass
class ImplicitFeedbackEvent:
    request_id: str
    user_id: str
    session_id: str
    signal_type: str        # "click", "dwell", "skip", "add_to_cart", "purchase"
    signal_value: dict      # type-specific payload
    item_id: str
    item_rank: int          # position in the ranked list (for IPS correction)
    model_version: str
    timestamp: str = ""
 
    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()
 
def emit_feedback(event: ImplicitFeedbackEvent) -> None:
    """Emit to Kafka; client-side timeout protects serving latency."""
    future = producer.send("feedback_events", asdict(event))
    try:
        future.get(timeout=0.05)   # 50ms max — don't block serving
    except Exception:
        pass  # log to internal metrics, never raise to caller
 
# Example: click signal from the frontend callback
def handle_item_click(request_id: str, user_id: str, session_id: str,
                      item_id: str, rank: int, dwell_ms: int,
                      model_version: str) -> None:
    emit_feedback(ImplicitFeedbackEvent(
        request_id=request_id,
        user_id=user_id,
        session_id=session_id,
        signal_type="click",
        signal_value={"dwell_ms": dwell_ms},
        item_id=item_id,
        item_rank=rank,
        model_version=model_version,
    ))

Never let feedback emission block the response path. Fire-and-forget with a tight timeout keeps p99 serving latency unaffected by Kafka backpressure.

Routing low-confidence predictions to a human review queue

High-uncertainty predictions are the most valuable for the flywheel — they are exactly the cases where the model needs more signal. Routing them to humans closes the loop faster for the hardest examples.

python
from dataclasses import dataclass
import redis
import json
import time
 
review_queue = redis.Redis(host="redis.internal", port=6379, db=0)
 
@dataclass
class ReviewTask:
    request_id: str
    user_id: str
    features: dict
    model_prediction: dict   # {"label": "fraud", "score": 0.61}
    model_version: str
    created_at: float
 
CONFIDENCE_THRESHOLD = 0.75  # route predictions below this to human review
REVIEW_QUEUE_KEY = "human_review:pending"
REVIEW_QUEUE_MAX = 5_000     # cap queue depth — sample if overwhelmed
 
def route_prediction(request_id: str, user_id: str, features: dict,
                     prediction: dict, model_version: str) -> str:
    """
    Returns "model" or "human". Side-effect: enqueues low-confidence
    predictions for human labeling.
    """
    score = prediction.get("score", 1.0)
    confidence = max(score, 1 - score)   # probability of the predicted class (binary)
 
    if confidence >= CONFIDENCE_THRESHOLD:
        return "model"
 
    # Low confidence — enqueue for human review if queue has capacity
    queue_depth = review_queue.llen(REVIEW_QUEUE_KEY)
    if queue_depth < REVIEW_QUEUE_MAX:
        task = ReviewTask(
            request_id=request_id,
            user_id=user_id,
            features=features,
            model_prediction=prediction,
            model_version=model_version,
            created_at=time.time(),
        )
        review_queue.rpush(REVIEW_QUEUE_KEY, json.dumps(task.__dict__))
 
    # Model still serves the prediction — human review runs asynchronously
    return "model"
 
def consume_review_task() -> ReviewTask | None:
    """Worker: pop one task from the queue for human labeling UI."""
    raw = review_queue.lpop(REVIEW_QUEUE_KEY)
    if raw is None:
        return None
    data = json.loads(raw)
    return ReviewTask(**data)

Surfacing low-confidence predictions to annotators in a purpose-built labeling UI (Label Studio, Argilla) keeps labeler cognitive load low. Present the model's predicted label alongside the uncertainty score so annotators can quickly confirm or correct.

A/B testing model updates with shadow deployment

Shadow deployment runs the new model on live traffic without serving its predictions to users. This validates real-world performance before any user impact.

python
import random
import asyncio
import httpx
from datetime import datetime, timezone
from dataclasses import dataclass
# reuses the Kafka `producer` defined in the logging section above
 
@dataclass
class ShadowConfig:
    control_endpoint: str    # current production model
    shadow_endpoint: str     # new candidate model
    shadow_fraction: float   # fraction of requests also sent to shadow (0.0–1.0)
    experiment_id: str
 
async def serve_with_shadow(
    features: dict, config: ShadowConfig
) -> dict:
    """
    Always returns the control model's prediction.
    Fires a shadow request asynchronously for a fraction of traffic.
    """
    async with httpx.AsyncClient(timeout=0.5) as client:
        control_response = await client.post(config.control_endpoint, json=features)
        prediction = control_response.json()

    if random.random() < config.shadow_fraction:
        # Fire-and-forget: the shadow coroutine opens its own client so it
        # can outlive this request's client context
        asyncio.create_task(
            _fire_shadow(config.shadow_endpoint, features, config.experiment_id)
        )

    return prediction

async def _fire_shadow(endpoint: str, features: dict,
                       experiment_id: str) -> None:
    try:
        async with httpx.AsyncClient(timeout=0.5) as client:
            resp = await client.post(endpoint, json=features)
        # Emit comparison event for offline analysis
        _emit_shadow_comparison(experiment_id, features, resp.json())
    except Exception:
        pass  # shadow failures are silent — never affect serving
 
def _emit_shadow_comparison(experiment_id: str, features: dict,
                             shadow_pred: dict) -> None:
    producer.send("shadow_comparisons", {
        "experiment_id": experiment_id,
        "features":      features,
        "shadow_pred":   shadow_pred,
        "timestamp":     datetime.now(timezone.utc).isoformat(),
    })

Analyze shadow comparisons offline: do the control and shadow models disagree significantly? On which feature subspaces? Disagreements surface before any user impact. Once offline metrics confirm the shadow model is better, promote it to production via the model registry and retire the shadow deployment.
