Recommender Systems
Recommender systems power the majority of content and product discovery on the internet — from video feeds to e-commerce listings. The core challenge is learning user preferences from sparse, implicit feedback and scaling retrieval to millions of candidates in milliseconds.
Theory
[Figure: two-tower architecture. A user tower f_θ(u) and an item tower map users and items into a shared embedding space; ANN retrieval finds the nearest items, producing rankings for each user.]
Recommender systems learn that users like items similar to ones they've already interacted with, without anyone defining what "similar" means. Matrix factorization finds latent factors (genres, moods, styles) that explain user-item interactions, and the dot product between a user's factor vector and an item's factor vector predicts their compatibility. The diagram above shows the two-tower architecture: separate encoders for users and items, trained so that matching pairs land close together in embedding space.
Collaborative Filtering and Matrix Factorization
The interaction matrix R ∈ ℝ^(|U| × |I|) stores observed user–item signals (clicks, ratings, watch time). Matrix factorization decomposes it into low-rank embeddings:

r̂_ui = p_uᵀ q_i

where p_u ∈ ℝ^k is the user latent vector and q_i ∈ ℝ^k is the item latent vector. The dot product is the natural similarity measure for matrix factorization because it decomposes into a sum of factor-wise agreement: each dimension of p_u and q_i represents the strength of a latent factor, and the dot product sums how well the user's factor profile aligns with the item's across all factors. Cosine similarity (dot product on normalized vectors) removes magnitude bias but loses the ability to represent user enthusiasm and item popularity separately — a high-magnitude user vector can naturally predict higher scores for all items, encoding that the user rates generously.

The squared-error objective with L2 regularization:

min_{P,Q} Σ_{(u,i) ∈ observed} (r_ui − p_uᵀ q_i)² + λ(‖p_u‖² + ‖q_i‖²)
This is optimized via Alternating Least Squares (ALS) or SGD.
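The SGD variant is simple to sketch: for each observed rating, nudge p_u and q_i along the gradient of the regularized squared error. A minimal NumPy illustration on toy data (the ratings, learning rate, and regularization strength below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 2
# Toy observed ratings as (user, item, rating) triples
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 3, 2.0), (3, 4, 1.0)]

P = 0.1 * rng.standard_normal((n_users, k))  # user latent vectors p_u
Q = 0.1 * rng.standard_normal((n_items, k))  # item latent vectors q_i
lr, lam = 0.05, 0.01

for epoch in range(300):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                   # prediction error r_ui - p_u.q_i
        P[u] += lr * (err * Q[i] - lam * P[u])  # gradient step on p_u
        Q[i] += lr * (err * P[u] - lam * Q[i])  # gradient step on q_i

# After training, P[u] @ Q[i] approximates each observed rating
```

Each update is the negative gradient of (r_ui − p_uᵀq_i)² + λ(‖p_u‖² + ‖q_i‖²) with respect to one factor vector, holding the other fixed.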
Bayesian Personalized Ranking (BPR)
For implicit feedback (clicks, not ratings), the pairwise BPR objective is more appropriate. Given that user u interacted with item i but not j:

L_BPR = − Σ_{(u,i,j)} ln σ(p_uᵀ q_i − p_uᵀ q_j) + λ‖Θ‖²

where σ is the sigmoid function. BPR optimizes for correct ranking of observed over unobserved items, which matches the actual recommendation objective better than pointwise regression.
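A bare-bones NumPy sketch of BPR's SGD updates on (user, positive, sampled-negative) triples (the triples and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 3, 6, 2
P = 0.1 * rng.standard_normal((n_users, k))
Q = 0.1 * rng.standard_normal((n_items, k))
lr, lam = 0.05, 0.01

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# (user, positive item, sampled negative item) triples
triples = [(0, 1, 4), (1, 2, 5), (0, 1, 3), (2, 0, 4)]

for _ in range(500):
    for u, i, j in triples:
        x_uij = P[u] @ (Q[i] - Q[j])   # pairwise score difference
        g = sigmoid(-x_uij)            # gradient weight of -ln σ(x_uij)
        P[u] += lr * (g * (Q[i] - Q[j]) - lam * P[u])
        Q[i] += lr * (g * P[u] - lam * Q[i])
        Q[j] += lr * (-g * P[u] - lam * Q[j])

# After training, each positive item outranks its sampled negative for that user
```

Note that the gradient weight σ(−x_uij) shrinks as the pair becomes correctly ranked, so already-separated pairs contribute little.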
Two-Tower Neural Retrieval
Modern large-scale systems use separate encoder networks for users and items, trained to place similar user–item pairs close in embedding space:

score(u, i) = f_θ(u)ᵀ g_φ(i)

where f_θ is the user tower and g_φ is the item tower.
Training uses in-batch negatives (other items in the same batch serve as negatives), with optional hard negative mining. At serving time, item embeddings are pre-computed and indexed in an Approximate Nearest Neighbor (ANN) structure (HNSW, FAISS) for sub-millisecond retrieval.
The softmax loss with in-batch negatives:

L = − (1/B) Σ_{i=1}^{B} log [ exp(s_ii / τ) / Σ_{j=1}^{B} exp(s_ij / τ) ]

where s_ij = score(u_i, item_j) and τ is a temperature parameter.
Ranking Metrics
NDCG@K (Normalized Discounted Cumulative Gain):

DCG@K = Σ_{k=1}^{K} (2^{r_k} − 1) / log₂(k + 1),   NDCG@K = DCG@K / IDCG@K

where r_k ∈ {0, 1} is the relevance of the item ranked at position k and IDCG@K is the DCG of the ideal ordering. Recall@K measures what fraction of relevant items appear in the top K.
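Both metrics are a few lines of NumPy (the example lists are made up):

```python
import numpy as np

def dcg_at_k(relevances, k):
    rel = np.asarray(relevances, dtype=float)[:k]
    # Gain 2^r - 1, discount log2(position + 1), positions starting at 1
    return np.sum((2.0 ** rel - 1.0) / np.log2(np.arange(2, rel.size + 2)))

def ndcg_at_k(relevances, k):
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_items, relevant_items, k):
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / len(relevant_items)

# Binary relevance of a ranked list: relevant items at positions 1 and 3
ndcg = ndcg_at_k([1, 0, 1, 0, 0], k=5)                         # ≈ 0.92
rec = recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=3)  # 2/3
```

The discount means the same relevant item is worth half as much at position 3 (log₂ 4 = 2) as at position 1 (log₂ 2 = 1).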
Walkthrough
Scenario: A streaming service with 10M users and 500K movies wants to recommend the top 20 for each user's home screen.
Step 1 — Candidate Retrieval (Two-Tower)
Train separate encoder towers on user features (watch history, demographics) and item features (genre, cast, metadata). Index item embeddings with FAISS. At request time, query with the user embedding → get top-500 ANN results in ~10ms.
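The retrieval step reduces to a maximum-inner-product search over the item embedding table. A brute-force NumPy stand-in (with made-up sizes; FAISS's flat inner-product index computes exactly this, while HNSW/IVF indexes approximate it for speed at catalog scale):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 500, 64  # scaled-down stand-in for the 500K-item catalog
item_embeddings = rng.standard_normal((n_items, dim)).astype(np.float32)
user_embedding = rng.standard_normal(dim).astype(np.float32)

# Score every item by dot product with the user embedding
scores = item_embeddings @ user_embedding

# Top-K via argpartition (O(n)) then sort only the K winners
k = 5
candidates = np.argpartition(-scores, k)[:k]
candidates = candidates[np.argsort(-scores[candidates])]
```

At 500K items the brute-force pass is already sub-second on one core; ANN indexes buy the ~10ms budget at billions of items.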
Step 2 — Ranking
A full-featured cross-attention or gradient-boosted ranking model scores the 500 candidates using features that can't fit in the retrieval tower (freshness, diversity penalties, context). Outputs top-20.
Step 3 — Reranking / Business Rules
Apply hard filters (content ratings, geographic licensing), diversity constraints (no more than 3 from same franchise), and freshness boosts for new releases.
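The reranking stage is mostly deterministic bookkeeping over the scored candidates. A sketch with hypothetical candidate fields (score, licensed_regions, franchise, is_new_release) and an illustrative additive freshness bonus:

```python
from collections import defaultdict

def rerank(candidates, user_region, max_per_franchise=3, k=20):
    """Hard filters, then a freshness boost, then a per-franchise diversity cap."""
    # Hard filter: geographic licensing (content-rating filters would go here too)
    eligible = [dict(c) for c in candidates if user_region in c["licensed_regions"]]
    # Freshness boost: small additive bonus for new releases (illustrative value)
    for c in eligible:
        if c.get("is_new_release"):
            c["score"] += 0.1
    eligible.sort(key=lambda c: c["score"], reverse=True)
    # Diversity: keep at most max_per_franchise items from any one franchise
    counts, reranked = defaultdict(int), []
    for c in eligible:
        if counts[c["franchise"]] < max_per_franchise:
            counts[c["franchise"]] += 1
            reranked.append(c)
    return reranked[:k]
```

Hard filters must run before the cap: an unlicensed item should never consume one of a franchise's slots.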
Matrix Factorization Baseline:
For a 4-user × 5-movie matrix with k=2 latent factors, ALS alternates:
- Fix Q, solve for each p_u in closed form: p_u = (QᵀQ + λI)⁻¹ Qᵀ r_u
- Fix P, solve for each q_i similarly
This converges in ~20 iterations for dense matrices and is the backbone of many production systems.
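The two closed-form updates translate directly into NumPy. A sketch on a made-up dense 4×5 rating matrix (the formula above assumes every entry is observed):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k, lam = 4, 5, 2, 0.1
# Toy dense rating matrix with values 1-5 (illustrative data)
R = rng.integers(1, 6, size=(n_users, n_items)).astype(float)

P = rng.standard_normal((n_users, k))
Q = rng.standard_normal((n_items, k))

for _ in range(20):
    # Fix Q, solve all p_u = (Q^T Q + lam*I)^-1 Q^T r_u at once
    A = Q.T @ Q + lam * np.eye(k)
    P = np.linalg.solve(A, Q.T @ R.T).T
    # Fix P, solve all q_i symmetrically
    B = P.T @ P + lam * np.eye(k)
    Q = np.linalg.solve(B, P.T @ R).T

rmse = np.sqrt(np.mean((R - P @ Q.T) ** 2))
```

Because each half-step solves its ridge-regression subproblem exactly, the regularized objective decreases monotonically, which is why ALS needs no learning rate.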
Implicit Feedback Matrix Factorization with implicit:
import numpy as np
import scipy.sparse as sp
from implicit.als import AlternatingLeastSquares
# Build sparse user-item interaction matrix (1 = interacted, 0 = not)
users = np.array([u["user_id"] for u in interactions])
items = np.array([u["item_id"] for u in interactions])
data = np.ones(len(interactions), dtype=np.float32)
R = sp.csr_matrix((data, (users, items)), shape=(n_users, n_items))
# ALS with confidence weighting: c_ui = 1 + alpha * r_ui
# Higher interaction count → stronger confidence, not certainty of preference
model = AlternatingLeastSquares(factors=64, regularization=0.01,
iterations=20, alpha=40)
model.fit(R)
# Retrieve top-10 candidates for user 42
ids, scores = model.recommend(42, R[42], N=10,
                              filter_already_liked_items=True)

Two-Tower Retrieval in PyTorch:
import torch
import torch.nn as nn
import torch.nn.functional as F
class Tower(nn.Module):
"""Shared MLP architecture for both user and item towers."""
def __init__(self, input_dim: int, embed_dim: int = 64):
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, 128), nn.ReLU(),
nn.Linear(128, embed_dim),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return F.normalize(self.net(x), dim=-1) # unit sphere → dot product = cosine
user_tower = Tower(input_dim=user_feature_dim)
item_tower = Tower(input_dim=item_feature_dim)
def in_batch_softmax_loss(u_emb: torch.Tensor, i_emb: torch.Tensor,
temperature: float = 0.07) -> torch.Tensor:
"""
Positive pairs: diagonal (user i matched with item i in the batch).
Negatives: all other items in the same batch.
Scales well — a batch of 256 gives 255 negatives per example for free.
"""
logits = (u_emb @ i_emb.T) / temperature # (B, B)
    labels = torch.arange(len(u_emb), device=u_emb.device)  # diagonal = positives
return F.cross_entropy(logits, labels)
optimizer = torch.optim.Adam(
list(user_tower.parameters()) + list(item_tower.parameters()), lr=1e-3
)
for batch in dataloader:
u_emb = user_tower(batch["user_features"])
i_emb = item_tower(batch["item_features"])
loss = in_batch_softmax_loss(u_emb, i_emb)
loss.backward()
optimizer.step()
optimizer.zero_grad()
# At serving time: pre-compute all item embeddings once, index with FAISS/HNSW
# item_embeddings = item_tower(all_item_features) → index → ANN lookup per requestAnalysis & Evaluation
Where Your Intuition Breaks
A common misconception is that matrix factorization and collaborative filtering require explicit ratings data. In reality, collaborative filtering works on any signal that reflects user preferences — clicks, dwell time, purchases, skips — not just explicit ratings. Explicit ratings (1–5 stars) are sparse and biased toward items users chose to rate; implicit feedback (interactions) is dense and reflects actual consumption behavior. BPR and two-tower models are specifically designed for implicit feedback and typically outperform rating-based methods in production because implicit signals are more plentiful and less noisy than explicit ratings at scale.
Architecture Comparison
| Approach | Retrieval Speed | Personalization | Cold Start | Scale |
|---|---|---|---|---|
| Collaborative filtering (ALS) | Pre-computed | Strong (history) | Poor | Millions |
| Content-based filtering | Fast | Moderate | Good | Any |
| Two-tower neural | ANN (~10ms) | Strong | Moderate (features) | Billions |
| Hybrid / multi-stage | Multi-stage | Best | Best | Billions |
Common Pitfalls
Popularity bias: Models learn to recommend already-popular items because they dominate training data. Mitigate with inverse propensity scoring or popularity-debiased training.
Exposure bias: Users can only interact with items they were shown. Unshown relevant items appear as negatives. Counterfactual evaluation helps here.
Position bias: Clicks on items ranked higher happen more often regardless of quality. Use position-aware training (position as a feature with a "no-position" serving trick).
Feedback loops: Recommending popular items makes them more popular, reducing diversity over time. Monitor catalog coverage and novelty metrics alongside NDCG.
Evaluation Strategy
Offline: Hold out the last interaction per user (temporal split — never random split, which leaks future data). Compute NDCG@10, Recall@50.
Online: A/B test on click-through rate, session length, and return rate. Offline NDCG improvements of 0.5% can be meaningful; always validate online.
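The offline protocol's leave-last-out split can be sketched in a few lines (assuming interactions arrive as (user_id, item_id, timestamp) tuples):

```python
def leave_last_out(interactions):
    """Temporal split: hold out each user's most recent interaction for test."""
    latest = {}
    for u, i, t in interactions:
        if u not in latest or t > latest[u][2]:
            latest[u] = (u, i, t)
    test = set(latest.values())
    train = [x for x in interactions if x not in test]
    return train, sorted(test)

logs = [(1, "a", 10), (1, "b", 20), (2, "c", 5), (2, "d", 1)]
train, test = leave_last_out(logs)
```

A random split would put some of user 1's earlier interactions in test while training on the later ones, leaking future behavior into the model.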