Reinforcement Learning Fundamentals
Reinforcement learning (RL) is the mathematical framework for sequential decision-making under uncertainty. Unlike supervised learning — which learns from labeled data — an RL agent learns by interacting with an environment, receiving scalar reward signals, and optimizing long-run cumulative return. RL underpins modern LLM alignment (RLHF, GRPO) and game-playing agents.
Theory
[Interactive diagram: a 4×4 grid world. The agent starts at s0; s15 is the goal (+1 reward) and s5 is a trap (−1 reward). Cell values show V(s) from the backup V(s) = r + γ·max_a V(s'), with γ = 0.9 discounting future rewards.]
An RL agent learns to make decisions the way you learned to ride a bike — not from labeled examples but from the consequences of actions. Try something, see what happens, adjust. The MDP formalizes this: states capture what the agent knows, actions are what it can do, and rewards are the feedback signal. The diagram above shows the agent-environment loop that every RL algorithm is solving.
Markov Decision Process (MDP)
An MDP is a tuple (S, A, P, R, γ) where:
- S — state space
- A — action space
- P(s'|s, a) — transition probability: probability of reaching state s' from state s by taking action a
- R(s, a, s') — reward function: scalar signal received after each transition
- γ ∈ [0, 1) — discount factor, weighting future rewards
A policy π(a|s) maps states to probability distributions over actions. The return G_t = Σ_{k=0}^∞ γ^k R_{t+k+1} is the cumulative discounted reward from time t. The goal is to find π* that maximizes expected return:

J(π) = E_π[G_0]
The geometric discount factor is not merely a technical convenience — it is the formal requirement for the sum to be finite and well-defined. Without discounting (γ = 1), the sum over an infinite horizon diverges for any non-zero constant reward, making the value function undefined and the optimization problem intractable. Discounting also encodes an economic preference: a reward now is worth more than the same reward later. Setting γ close to 1 makes the agent long-sighted (good for tasks where delayed rewards matter); γ close to 0 makes it myopic (good for tasks where immediate rewards dominate).
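A quick numeric check of this claim (a minimal sketch; `discounted_return` is a throwaway helper, not from the text):

```python
# Discounted return of a constant reward r = 1 per step.
# With gamma < 1 the sum converges to 1 / (1 - gamma);
# with gamma = 1 it grows without bound as the horizon grows.
def discounted_return(gamma: float, horizon: int, reward: float = 1.0) -> float:
    return sum(reward * gamma**t for t in range(horizon))

print(discounted_return(0.9, 1000))   # ≈ 10.0 = 1 / (1 - 0.9)
print(discounted_return(1.0, 1000))   # 1000.0 — grows linearly with the horizon
```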
Value Functions and the Bellman Equation
The state-value function V^π(s) measures expected return from state s following policy π:

V^π(s) = E_π[G_t | S_t = s] = E_π[Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s]
The action-value function (Q-function) Q^π(s, a) conditions on taking action a then following π:

Q^π(s, a) = E_π[G_t | S_t = s, A_t = a]
The Bellman expectation equation expresses V^π recursively:

V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V^π(s')]
The Bellman optimality equation for the optimal value function V*:

V*(s) = max_a Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V*(s')]
The optimal policy is then π*(s) = argmax_a Q*(s, a).
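For a fixed policy, the Bellman expectation equation is a linear system, so on a tiny MDP V^π can be solved in closed form. A sketch with a hypothetical two-state chain (the transition matrix and rewards are invented for illustration):

```python
import numpy as np

# Two-state MDP under a fixed policy: P[i, j] = probability of s_j from s_i,
# R[i] = expected immediate reward in s_i. Bellman expectation in matrix form:
#   V = R + gamma * P V   =>   V = (I - gamma * P)^{-1} R
gamma = 0.9
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])          # state 1 is absorbing
R = np.array([1.0, 0.0])            # reward only in state 0
V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(V)  # V(s1) = 0; V(s0) = 1 / (1 - 0.45) ≈ 1.818
```

The same backup as the recursive equation, just written as one matrix solve instead of iterating to a fixed point.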
Q-Learning (Temporal Difference)
Q-learning is an off-policy TD algorithm. After taking action a in state s, observing reward r and next state s', the Q-value update is:

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
where α is the learning rate. In the tabular setting, Q-learning converges to Q* given sufficient exploration and an appropriately decaying learning rate. Deep Q-Network (DQN) replaces the table with a neural network Q_θ(s, a), adding experience replay and a target network to stabilize training.
Policy Gradient Theorem
Policy gradient methods directly optimize the policy parameters θ by gradient ascent on expected return J(θ) = E[G_0]:

∇_θ J(θ) = E_π[Σ_t ∇_θ log π_θ(A_t|S_t) G_t]
This is the REINFORCE estimator. In practice, a baseline b(s) (usually V^π(s)) is subtracted to reduce variance without biasing the gradient:

∇_θ J(θ) = E_π[Σ_t ∇_θ log π_θ(A_t|S_t) (G_t − b(S_t))]
The term G_t − b(S_t) is the advantage A^π(s, a): how much better action a is than the average.
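The baselined estimator can be checked on a single toy step. A sketch in PyTorch where the logits, sampled action, return `G`, and `baseline` are all made-up numbers:

```python
import torch

# REINFORCE with baseline for one step of a 3-action policy.
# loss = -(G - b) * log pi(a|s); autograd then yields the baselined estimator.
logits = torch.tensor([0.2, 0.5, -0.1], requires_grad=True)
log_probs = torch.log_softmax(logits, dim=-1)

action, G, baseline = 1, 2.0, 0.5        # hypothetical action, return, b(S_t)
advantage = G - baseline                  # G_t - b(S_t) > 0: action was good
loss = -advantage * log_probs[action]     # negative for gradient *descent*
loss.backward()
print(logits.grad)  # descent step raises logit of action 1, lowers the others
```

With a positive advantage the gradient pushes probability mass toward the taken action; a negative advantage would push it away.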
Proximal Policy Optimization (PPO)
PPO (Schulman et al., 2017) is the workhorse of modern RL and RLHF fine-tuning. It uses a clipped surrogate objective to prevent destructively large policy updates:

L^CLIP(θ) = E_t[min(r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t)]
where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the importance ratio and ε (typically 0.1–0.2) controls how far the new policy can deviate from the old one.
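The clipped surrogate is only a few lines in PyTorch. A sketch with invented log-probabilities and advantages (`ppo_clip_loss` is an illustrative helper, not a library function):

```python
import torch

# PPO clipped surrogate on a batch of (hypothetical) transitions.
def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  adv: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)               # r_t(theta)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()         # negate: we minimize

logp_old = torch.tensor([-1.0, -0.5, -2.0])
logp_new = torch.tensor([-0.2, -0.6, -1.0])  # first and last ratios exceed 1+eps
adv = torch.tensor([1.0, -1.0, 0.5])
print(ppo_clip_loss(logp_new, logp_old, adv))
```

Where the ratio exceeds 1+ε with positive advantage, the clipped branch wins the min, so the gradient through those samples vanishes — exactly the "don't move too far" behavior the objective is designed for.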
Walkthrough
Environment: 4×4 grid world. States s0–s15 (row-major). s15 = goal (+1 reward). s5 = trap (−1 reward). All other transitions give 0 reward. γ = 0.9.
Value Iteration (Dynamic Programming):
Initialize V(s) = 0 for all states. Repeat until convergence:

V(s) ← max_a Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V(s')]
For a deterministic grid world, P(s'|s,a) = 1 for the intended next state. After ~20 iterations, V* converges. States adjacent to the goal have high V*, states adjacent to the trap have negative V*.
Optimal Policy Extraction:
π*(s) = argmax_a [R(s,a) + γ V*(s'(s,a))]
The resulting policy routes around the trap: from states in columns 0 and 1, the agent moves right and down; from states near the trap, it detours via the top row.
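The sweep described above can be sketched directly. This reuses the same deterministic grid dynamics (goal s15, trap s5, γ = 0.9), with a local `step` helper mirroring the walkthrough:

```python
import numpy as np

# Value iteration on the 4x4 grid world: s15 goal (+1), s5 trap (-1), gamma = 0.9.
GAMMA = 0.9
DELTAS = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # up, right, down, left

def step(s: int, a: int) -> tuple[int, float, bool]:
    row, col = s // 4, s % 4
    dr, dc = DELTAS[a]
    s_next = min(max(row + dr, 0), 3) * 4 + min(max(col + dc, 0), 3)
    if s_next == 15: return s_next, 1.0, True
    if s_next == 5:  return s_next, -1.0, True
    return s_next, 0.0, False

V = np.zeros(16)
for _ in range(50):                            # sweeps until convergence
    for s in range(16):
        if s in (5, 15):                       # terminal states keep V = 0
            continue
        V[s] = max(r + GAMMA * V[s2] * (not done)
                   for s2, r, done in (step(s, a) for a in range(4)))
print(V.reshape(4, 4).round(2))
```

After convergence, V(s) = γ^(d−1) where d is the shortest trap-avoiding path length to the goal: 1.0 next to the goal, 0.9⁵ ≈ 0.59 at s0.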
Q-Learning Episode (ε-greedy exploration, ε=0.1):
s=0, a=→, r=0, s'=1 → Q(0,→) ← Q(0,→) + α[0 + 0.9·max_a Q(1,a) − Q(0,→)]
s=1, a=→, r=0, s'=2 → Q(1,→) ← ...
... eventually reaches s15
After ~1000 episodes, learned Q-values closely approximate Q*.
Q-learning implementation for the 4×4 grid world:
```python
import numpy as np

# 4x4 grid world (states 0-15, row-major)
# Goal: state 15 (+1 reward, terminal). Trap: state 5 (-1 reward, terminal).
N_STATES, N_ACTIONS = 16, 4  # actions: 0=up, 1=right, 2=down, 3=left
Q = np.zeros((N_STATES, N_ACTIONS))
DELTAS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # row/col delta per action

def step(s: int, a: int) -> tuple[int, float, bool]:
    row, col = s // 4, s % 4
    dr, dc = DELTAS[a]
    row = np.clip(row + dr, 0, 3)
    col = np.clip(col + dc, 0, 3)
    s_next = row * 4 + col
    if s_next == 15: return s_next, +1.0, True
    if s_next == 5: return s_next, -1.0, True
    return s_next, 0.0, False

alpha, gamma, epsilon = 0.1, 0.9, 0.1
for episode in range(5_000):
    s = 0
    for _ in range(200):  # episode length cap
        if np.random.rand() < epsilon:  # explore
            a = np.random.randint(N_ACTIONS)
        else:  # exploit
            a = Q[s].argmax()
        s_next, r, done = step(s, a)
        # TD update: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]
        td_target = r + gamma * Q[s_next].max() * (not done)
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
        if done:
            break

# Optimal policy: greedy w.r.t. learned Q
ARROWS = ['↑', '→', '↓', '←']
policy = Q.argmax(axis=1)
print("Optimal policy:")
for row in range(4):
    print(' '.join(ARROWS[policy[row * 4 + col]] for col in range(4)))
# G=goal, T=trap cells will show the detour clearly
```

DQN for continuous observation spaces (architecture sketch):
```python
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import random

class DQN(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Two networks: online (trained every step) and target (updated every C steps).
# The target network provides stable TD targets — without it, the TD target
# moves with the online network, creating a moving-target instability
# (the "deadly triad" of function approximation + bootstrapping + off-policy).
online_net = DQN(obs_dim=4, n_actions=2)  # CartPole-v1
target_net = DQN(obs_dim=4, n_actions=2)
target_net.load_state_dict(online_net.state_dict())
target_net.requires_grad_(False)

replay_buffer = deque(maxlen=10_000)
optimizer = optim.Adam(online_net.parameters(), lr=1e-3)

def update(batch_size: int = 64, gamma: float = 0.99) -> float:
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s_next, done = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.long)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    q_vals = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(s_next).max(dim=1).values
        target = r + gamma * next_q * (1 - done)
    loss = nn.functional.mse_loss(q_vals, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Analysis & Evaluation
Where Your Intuition Breaks
The intuition: RL always requires millions of environment interactions, making it impractical for real-world applications. In reality, sample efficiency varies enormously by algorithm family. Model-based RL (Dyna, MuZero) can achieve high performance with 10–100× fewer environment interactions by learning an internal model of the environment and planning within it. Offline RL (Conservative Q-Learning, IQL) trains entirely from a pre-collected dataset of transitions, requiring zero new environment interactions. The "millions of samples" intuition comes from model-free on-policy methods (PPO, A3C), not from RL as a whole. The right algorithm depends on whether a simulator is available, how expensive environment interaction is, and whether offline data exists.
RL Algorithm Taxonomy
| Category | Algorithm | Sample Efficiency | Stability | Use Case |
|---|---|---|---|---|
| Dynamic Programming | Value Iteration, Policy Iteration | N/A (model-based) | High | Tabular MDPs |
| Model-free value | Q-learning, DQN, Double DQN | Low | Medium | Discrete actions |
| Policy gradient | REINFORCE, A2C, A3C | Low | Low | Continuous actions |
| Actor-Critic | PPO, SAC, TD3 | Medium | High | Most practical tasks |
| Model-based | Dyna, MuZero | High | Variable | Complex environments |
Common Pitfalls
Reward hacking: Agents find unexpected ways to maximize the reward signal that violate the designer's intent. The reward function must capture all of what you care about. This is why RLHF adds a KL penalty — to prevent the model from drifting too far from the supervised baseline while optimizing reward.
Deadly triad: Combining function approximation (neural networks) + bootstrapping (TD updates) + off-policy learning causes instability. DQN addresses this with target networks and experience replay; PPO's on-policy updates avoid it.
Sparse rewards: Long-horizon tasks where reward only arrives at the end (e.g., winning a game after 1000 moves) are extremely hard to learn. Solutions: reward shaping, hindsight experience replay (HER), curriculum learning.
Over-exploration vs exploitation: ε-greedy is simple but wastes samples on random actions indefinitely. Entropy regularization (soft actor-critic) encourages exploration more efficiently.
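A quick look at what the entropy bonus measures (the logits and coefficient here are made up):

```python
import torch

# Entropy bonus: the objective gains a term beta * H(pi(.|s)), rewarding
# policies that keep probability mass spread out instead of collapsing early.
logits = torch.tensor([[2.0, 0.1, -1.0]])     # a fairly peaked policy
probs = torch.softmax(logits, dim=-1)
log_probs = torch.log_softmax(logits, dim=-1)
entropy = -(probs * log_probs).sum(dim=-1)    # H = -sum p log p
beta = 0.01                                   # typical small coefficient
print(entropy)  # between 0 (deterministic) and ln(3) ≈ 1.099 (uniform)
```

Unlike ε-greedy's fixed random-action rate, the bonus shrinks naturally as the policy becomes confident, so exploration fades where it is no longer useful.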
RL in LLM Context
PPO and its variants (GRPO, REINFORCE leave-one-out) are used in RLHF to optimize language model outputs against a reward model trained on human preferences. The key adaptations:
- State = conversation context + partial generation
- Action = next token from vocabulary
- Reward = scalar from reward model on complete response
- KL constraint keeps the policy close to the SFT base model
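One common way to implement the last two bullets is to fold a per-token KL penalty into the reward, with the reward-model score added at the final token. A hypothetical sketch (`shaped_rewards`, the coefficient, and all numbers are illustrative):

```python
import torch

# Hypothetical RLHF reward shaping: reward-model score on the full response,
# minus a per-token KL penalty toward the frozen SFT reference policy.
def shaped_rewards(logp_policy: torch.Tensor, logp_ref: torch.Tensor,
                   rm_score: float, kl_coef: float = 0.1) -> torch.Tensor:
    # logp_policy/logp_ref: log-probs of the sampled tokens under each model
    kl = logp_policy - logp_ref                 # per-token KL estimate
    rewards = -kl_coef * kl                     # penalize drift at every token
    rewards[-1] = rewards[-1] + rm_score        # scalar RM score at last token
    return rewards

logp_policy = torch.tensor([-1.2, -0.8, -2.0])
logp_ref = torch.tensor([-1.0, -1.0, -1.9])
print(shaped_rewards(logp_policy, logp_ref, rm_score=0.7))
```

Tokens where the policy assigns more probability than the reference (positive KL term) are penalized; the sparse preference signal arrives only at the end of the response.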
The PPO clip objective and the KL constraint together limit how far each update moves the policy, which helps prevent catastrophic forgetting of language modeling capabilities during RL fine-tuning.