Reinforcement Learning Fundamentals
Reinforcement learning (RL) is the mathematical framework for sequential decision-making under uncertainty. Unlike supervised learning — which learns from labeled data — an RL agent learns by interacting with an environment, receiving scalar reward signals, and optimizing long-run cumulative return. RL underpins modern LLM alignment (RLHF, GRPO) and game-playing agents.
Theory
[Interactive diagram: a 4×4 grid world. The agent starts at s0; s15 is the goal (+1 reward) and s5 is a trap (−1 reward). Cell values show V(s) from the backup V(s) = r + γ·max_a V(s'), with γ = 0.9 discounting future rewards.]
An RL agent learns to make decisions the way you learned to ride a bike — not from labeled examples but from the consequences of actions. Try something, see what happens, adjust. The MDP formalizes this: states capture what the agent knows, actions are what it can do, and rewards are the feedback signal. The diagram above shows the agent-environment loop that every RL algorithm is solving.
Markov Decision Process (MDP)
An MDP is a tuple (S, A, P, R, γ) where:
- S — state space
- A — action space
- P(s'|s, a) — transition probability: probability of reaching state s' from state s by taking action a
- R(s, a, s') — reward function: scalar signal received after each transition
- γ ∈ [0, 1) — discount factor, weighting future rewards
A policy π(a|s) maps states to probability distributions over actions. The return G_t = Σ_{k=0}^∞ γ^k R_{t+k+1} is the cumulative discounted reward from time t. The goal is to find π* that maximizes expected return:

J(π) = E_π[G_0]
The geometric discount factor is not merely a technical convenience — it is the formal requirement for the sum to be finite and well-defined. Without discounting (γ = 1), the sum over an infinite horizon diverges for any non-zero constant reward, making the value function undefined and the optimization problem intractable. Discounting also encodes an economic preference: a reward now is worth more than the same reward later. Setting γ close to 1 makes the agent long-sighted (good for tasks where delayed rewards matter); γ close to 0 makes it myopic (good for tasks where immediate rewards dominate).
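A quick numeric check of this claim (a minimal sketch; `discounted_return` is a throwaway helper, not from the text):

```python
# Discounted return of a constant reward r = 1 per step.
# With gamma < 1 the sum converges to 1 / (1 - gamma);
# with gamma = 1 it grows without bound as the horizon grows.
def discounted_return(gamma: float, horizon: int, reward: float = 1.0) -> float:
    return sum(reward * gamma**t for t in range(horizon))

print(discounted_return(0.9, 1000))   # ≈ 10.0 = 1 / (1 - 0.9)
print(discounted_return(1.0, 1000))   # 1000.0 — grows linearly with the horizon
```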
Value Functions and the Bellman Equation
The state-value function V^π(s) measures expected return from state s following policy π:

V^π(s) = E_π[G_t | S_t = s] = E_π[Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s]
The action-value function (Q-function) Q^π(s, a) conditions on taking action a then following π:

Q^π(s, a) = E_π[G_t | S_t = s, A_t = a]
The Bellman expectation equation expresses V^π recursively:

V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V^π(s')]
The Bellman optimality equation for the optimal value function V*:

V*(s) = max_a Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V*(s')]
The optimal policy is then π*(s) = argmax_a Q*(s, a).
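For a fixed policy, the Bellman expectation equation is a linear system, so on a tiny MDP V^π can be solved in closed form. A sketch with a hypothetical two-state chain (the transition matrix and rewards are invented for illustration):

```python
import numpy as np

# Two-state MDP under a fixed policy: P[i, j] = probability of s_j from s_i,
# R[i] = expected immediate reward in s_i. Bellman expectation in matrix form:
#   V = R + gamma * P V   =>   V = (I - gamma * P)^{-1} R
gamma = 0.9
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])          # state 1 is absorbing
R = np.array([1.0, 0.0])            # reward only in state 0
V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(V)  # V(s1) = 0; V(s0) = 1 / (1 - 0.45) ≈ 1.818
```

The same backup as the recursive equation, just written as one matrix solve instead of iterating to a fixed point.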
Q-Learning (Temporal Difference)
Q-learning is an off-policy TD algorithm. After taking action a in state s, observing reward r and next state s', the Q-value update is:

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
where α is the learning rate. In the tabular setting, Q-learning converges to Q* given sufficient exploration and an appropriately decaying learning rate. Deep Q-Network (DQN) replaces the table with a neural network Q_θ(s, a), adding experience replay and a target network to stabilize training.
Policy Gradient Theorem
Policy gradient methods directly optimize the policy parameters θ by gradient ascent on expected return J(θ) = E[G_0]:

∇_θ J(θ) = E_π[Σ_t ∇_θ log π_θ(A_t|S_t) G_t]
This is the REINFORCE estimator. In practice, a baseline b(s) (usually V^π(s)) is subtracted to reduce variance without biasing the gradient:

∇_θ J(θ) = E_π[Σ_t ∇_θ log π_θ(A_t|S_t) (G_t − b(S_t))]
The term G_t − b(S_t) is the advantage A^π(s, a): how much better action a is than the average.
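The baselined estimator can be checked on a single toy step. A sketch in PyTorch where the logits, sampled action, return `G`, and `baseline` are all made-up numbers:

```python
import torch

# REINFORCE with baseline for one step of a 3-action policy.
# loss = -(G - b) * log pi(a|s); autograd then yields the baselined estimator.
logits = torch.tensor([0.2, 0.5, -0.1], requires_grad=True)
log_probs = torch.log_softmax(logits, dim=-1)

action, G, baseline = 1, 2.0, 0.5        # hypothetical action, return, b(S_t)
advantage = G - baseline                  # G_t - b(S_t) > 0: action was good
loss = -advantage * log_probs[action]     # negative for gradient *descent*
loss.backward()
print(logits.grad)  # descent step raises logit of action 1, lowers the others
```

With a positive advantage the gradient pushes probability mass toward the taken action; a negative advantage would push it away.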
Proximal Policy Optimization (PPO)
PPO (Schulman et al., 2017) is the workhorse of modern RL and RLHF fine-tuning. It uses a clipped surrogate objective to prevent destructively large policy updates:

L^CLIP(θ) = E_t[min(r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t)]
where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the importance ratio and ε (typically 0.1–0.2) controls how far the new policy can deviate from the old one.
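The clipped surrogate is only a few lines in PyTorch. A sketch with invented log-probabilities and advantages (`ppo_clip_loss` is an illustrative helper, not a library function):

```python
import torch

# PPO clipped surrogate on a batch of (hypothetical) transitions.
def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  adv: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)               # r_t(theta)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()         # negate: we minimize

logp_old = torch.tensor([-1.0, -0.5, -2.0])
logp_new = torch.tensor([-0.2, -0.6, -1.0])  # first and last ratios exceed 1+eps
adv = torch.tensor([1.0, -1.0, 0.5])
print(ppo_clip_loss(logp_new, logp_old, adv))
```

Where the ratio exceeds 1+ε with positive advantage, the clipped branch wins the min, so the gradient through those samples vanishes — exactly the "don't move too far" behavior the objective is designed for.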
Walkthrough
Environment: 4×4 grid world. States s0–s15 (row-major). s15 = goal (+1 reward). s5 = trap (−1 reward). All other transitions give 0 reward. γ = 0.9.
Value Iteration (Dynamic Programming):
Initialize V(s) = 0 for all states. Repeat until convergence:

V(s) ← max_a Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V(s')]
For a deterministic grid world, P(s'|s,a) = 1 for the intended next state. After ~20 iterations, V* converges. States adjacent to the goal have high V*, states adjacent to the trap have negative V*.
Optimal Policy Extraction:
π*(s) = argmax_a [R(s,a) + γ V*(s'(s,a))]
The resulting policy routes around the trap: from states in columns 0 and 1, the agent moves right and down; from states near the trap, it detours via the top row.
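The sweep described above can be sketched directly. This reuses the same deterministic grid dynamics (goal s15, trap s5, γ = 0.9), with a local `step` helper mirroring the walkthrough:

```python
import numpy as np

# Value iteration on the 4x4 grid world: s15 goal (+1), s5 trap (-1), gamma = 0.9.
GAMMA = 0.9
DELTAS = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # up, right, down, left

def step(s: int, a: int) -> tuple[int, float, bool]:
    row, col = s // 4, s % 4
    dr, dc = DELTAS[a]
    s_next = min(max(row + dr, 0), 3) * 4 + min(max(col + dc, 0), 3)
    if s_next == 15: return s_next, 1.0, True
    if s_next == 5:  return s_next, -1.0, True
    return s_next, 0.0, False

V = np.zeros(16)
for _ in range(50):                            # sweeps until convergence
    for s in range(16):
        if s in (5, 15):                       # terminal states keep V = 0
            continue
        V[s] = max(r + GAMMA * V[s2] * (not done)
                   for s2, r, done in (step(s, a) for a in range(4)))
print(V.reshape(4, 4).round(2))
```

After convergence, V(s) = γ^(d−1) where d is the shortest trap-avoiding path length to the goal: 1.0 next to the goal, 0.9⁵ ≈ 0.59 at s0.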
Q-Learning Episode (ε-greedy exploration, ε=0.1):
s=0, a=→, r=0, s'=1 → Q(0,→) ← Q(0,→) + α[0 + 0.9·max_a Q(1,a) − Q(0,→)]
s=1, a=→, r=0, s'=2 → Q(1,→) ← ...
... eventually reaches s15
After ~1000 episodes, learned Q-values closely approximate Q*.
Q-learning implementation for the 4×4 grid world:
```python
import numpy as np

# 4x4 grid world (states 0-15, row-major)
# Goal: state 15 (+1 reward, terminal). Trap: state 5 (-1 reward, terminal).
N_STATES, N_ACTIONS = 16, 4  # actions: 0=up, 1=right, 2=down, 3=left
Q = np.zeros((N_STATES, N_ACTIONS))
DELTAS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # row/col delta per action

def step(s: int, a: int) -> tuple[int, float, bool]:
    row, col = s // 4, s % 4
    dr, dc = DELTAS[a]
    row = np.clip(row + dr, 0, 3)
    col = np.clip(col + dc, 0, 3)
    s_next = row * 4 + col
    if s_next == 15: return s_next, +1.0, True
    if s_next == 5: return s_next, -1.0, True
    return s_next, 0.0, False

alpha, gamma, epsilon = 0.1, 0.9, 0.1
for episode in range(5_000):
    s = 0
    for _ in range(200):  # episode length cap
        if np.random.rand() < epsilon:  # explore
            a = np.random.randint(N_ACTIONS)
        else:  # exploit
            a = Q[s].argmax()
        s_next, r, done = step(s, a)
        # TD update: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]
        td_target = r + gamma * Q[s_next].max() * (not done)
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
        if done:
            break

# Optimal policy: greedy w.r.t. learned Q
ARROWS = ['↑', '→', '↓', '←']
policy = Q.argmax(axis=1)
print("Optimal policy:")
for row in range(4):
    print(' '.join(ARROWS[policy[row * 4 + col]] for col in range(4)))
# G=goal, T=trap cells will show the detour clearly
```

DQN for continuous observation spaces (architecture sketch):
```python
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import random

class DQN(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Two networks: online (trained every step) and target (updated every C steps).
# The target network provides stable TD targets — without it, the TD target
# moves with the online network, creating a moving-target instability
# (the "deadly triad" of function approximation + bootstrapping + off-policy).
online_net = DQN(obs_dim=4, n_actions=2)  # CartPole-v1
target_net = DQN(obs_dim=4, n_actions=2)
target_net.load_state_dict(online_net.state_dict())
target_net.requires_grad_(False)

replay_buffer = deque(maxlen=10_000)
optimizer = optim.Adam(online_net.parameters(), lr=1e-3)

def update(batch_size: int = 64, gamma: float = 0.99) -> float:
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s_next, done = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.long)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    q_vals = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(s_next).max(dim=1).values
        target = r + gamma * next_q * (1 - done)
    loss = nn.functional.mse_loss(q_vals, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Analysis & Evaluation
Where Your Intuition Breaks
The intuition: RL always requires millions of environment interactions, making it impractical for real-world applications. In reality, sample efficiency varies enormously by algorithm family. Model-based RL (Dyna, MuZero) can achieve high performance with 10–100× fewer environment interactions by learning an internal model of the environment and planning within it. Offline RL (Conservative Q-Learning, IQL) trains entirely from a pre-collected dataset of transitions, requiring zero new environment interactions. The "millions of samples" intuition comes from model-free on-policy methods (PPO, A3C), not from RL as a whole. The right algorithm depends on whether a simulator is available, how expensive environment interaction is, and whether offline data exists.
RL Algorithm Taxonomy
| Category | Algorithm | Sample Efficiency | Stability | Use Case |
|---|---|---|---|---|
| Dynamic Programming | Value Iteration, Policy Iteration | N/A (model-based) | High | Tabular MDPs |
| Model-free value | Q-learning, DQN, Double DQN | Low | Medium | Discrete actions |
| Policy gradient | REINFORCE, A2C, A3C | Low | Low | Continuous actions |
| Actor-Critic | PPO, SAC, TD3 | Medium | High | Most practical tasks |
| Model-based | Dyna, MuZero | High | Variable | Complex environments |
Common Pitfalls
Reward hacking: Agents find unexpected ways to maximize the reward signal that violate the designer's intent. The reward function must capture all of what you care about. This is why RLHF adds a KL penalty — to prevent the model from drifting too far from the supervised baseline while optimizing reward.
Deadly triad: Combining function approximation (neural networks) + bootstrapping (TD updates) + off-policy learning causes instability. DQN addresses this with target networks and experience replay; PPO's on-policy updates avoid it.
Sparse rewards: Long-horizon tasks where reward only arrives at the end (e.g., winning a game after 1000 moves) are extremely hard to learn. Solutions: reward shaping, hindsight experience replay (HER), curriculum learning.
Over-exploration vs exploitation: ε-greedy is simple but wastes samples on random actions indefinitely. Entropy regularization (soft actor-critic) encourages exploration more efficiently.
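A quick look at what the entropy bonus measures (the logits and coefficient here are made up):

```python
import torch

# Entropy bonus: the objective gains a term beta * H(pi(.|s)), rewarding
# policies that keep probability mass spread out instead of collapsing early.
logits = torch.tensor([[2.0, 0.1, -1.0]])     # a fairly peaked policy
probs = torch.softmax(logits, dim=-1)
log_probs = torch.log_softmax(logits, dim=-1)
entropy = -(probs * log_probs).sum(dim=-1)    # H = -sum p log p
beta = 0.01                                   # typical small coefficient
print(entropy)  # between 0 (deterministic) and ln(3) ≈ 1.099 (uniform)
```

Unlike ε-greedy's fixed random-action rate, the bonus shrinks naturally as the policy becomes confident, so exploration fades where it is no longer useful.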
RL in LLM Context
PPO and its variants (GRPO, REINFORCE leave-one-out) are used in RLHF to optimize language model outputs against a reward model trained on human preferences. The key adaptations:
- State = conversation context + partial generation
- Action = next token from vocabulary
- Reward = scalar from reward model on complete response
- KL constraint keeps the policy close to the SFT base model
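One common way to implement the last two bullets is to fold a per-token KL penalty into the reward, with the reward-model score added at the final token. A hypothetical sketch (`shaped_rewards`, the coefficient, and all numbers are illustrative):

```python
import torch

# Hypothetical RLHF reward shaping: reward-model score on the full response,
# minus a per-token KL penalty toward the frozen SFT reference policy.
def shaped_rewards(logp_policy: torch.Tensor, logp_ref: torch.Tensor,
                   rm_score: float, kl_coef: float = 0.1) -> torch.Tensor:
    # logp_policy/logp_ref: log-probs of the sampled tokens under each model
    kl = logp_policy - logp_ref                 # per-token KL estimate
    rewards = -kl_coef * kl                     # penalize drift at every token
    rewards[-1] = rewards[-1] + rm_score        # scalar RM score at last token
    return rewards

logp_policy = torch.tensor([-1.2, -0.8, -2.0])
logp_ref = torch.tensor([-1.0, -1.0, -1.9])
print(shaped_rewards(logp_policy, logp_ref, rm_score=0.7))
```

Tokens where the policy assigns more probability than the reference (positive KL term) are penalized; the sparse preference signal arrives only at the end of the response.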
The PPO clip objective and the KL constraint together limit how far each update moves the policy, which helps prevent catastrophic forgetting of language modeling capabilities during RL fine-tuning.