
Reinforcement Learning Fundamentals

Reinforcement learning (RL) is the mathematical framework for sequential decision-making under uncertainty. Unlike supervised learning — which learns from labeled data — an RL agent learns by interacting with an environment, receiving scalar reward signals, and optimizing long-run cumulative return. RL underpins modern LLM alignment (RLHF, GRPO) and game-playing agents.

Theory

[Interactive diagram: Markov Decision Process on a 4×4 grid world, states s0–s15 (row-major). ★ s15 = goal (+1 reward), ✕ s5 = trap (−1 reward); γ = 0.9 discounts future rewards. The panel tracks the agent's position and the current state's value estimate V(s) = r + γ·max_a V(s′).]

An RL agent learns to make decisions the way you learned to ride a bike — not from labeled examples but from the consequences of actions. Try something, see what happens, adjust. The MDP formalizes this: states capture what the agent knows, actions are what it can do, and rewards are the feedback signal. The diagram above shows the agent-environment loop that every RL algorithm is solving.

Markov Decision Process (MDP)

An MDP is a tuple (S, A, P, R, γ) where:

  • S — state space
  • A — action space
  • P(s'|s, a) — transition probability: probability of reaching state s' from state s by taking action a
  • R(s, a, s') — reward function: scalar signal received after each transition
  • γ ∈ [0, 1) — discount factor, weighting future rewards

A policy π(a|s) maps states to probability distributions over actions. The goal is to find π* that maximizes expected cumulative discounted reward:

G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

The geometric discount factor γ^k is not merely a technical convenience: it is the formal requirement for the sum G_t to be finite and well-defined. Without discounting (γ = 1), the sum over an infinite horizon diverges for any non-zero constant reward, making the value function undefined and the optimization problem intractable. Discounting also encodes an economic preference: a reward now is worth more than the same reward later. Setting γ close to 1 makes the agent long-sighted (good for tasks where delayed rewards matter); close to 0 makes it myopic (good for tasks where immediate rewards dominate).
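The divergence claim is easy to verify numerically. A minimal sketch (the function name is illustrative), accumulating the return backward so each step costs one multiply-add:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = sum_k gamma^k * r_k by backward accumulation."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0] * 50))        # ≈ 9.95, approaching 1/(1-0.9) = 10
print(discounted_return([1.0] * 50, 1.0))   # 50.0: the undiscounted sum grows without bound
```

With γ = 0.9 the return is bounded by 1/(1−γ) = 10 no matter how long the horizon; with γ = 1 it grows linearly forever.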

Value Functions and the Bellman Equation

The state-value function V^π(s) measures expected return from state s following policy π:

V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right]

The action-value function (Q-function) Q^π(s, a) conditions on taking action a then following π:

Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right]

The Bellman expectation equation expresses V^π recursively:

V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\!\left[R(s,a,s') + \gamma V^\pi(s')\right]

The Bellman optimality equation for the optimal value function V*:

V^*(s) = \max_a \sum_{s'} P(s'|s,a)\!\left[R(s,a,s') + \gamma V^*(s')\right]

The optimal policy is then π*(s) = argmax_a Q*(s, a).

Q-Learning (Temporal Difference)

Q-learning is an off-policy TD algorithm. After taking action a in state s, observing reward r and next state s', the Q-value update:

Q(s, a) \leftarrow Q(s, a) + \alpha \underbrace{\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]}_{\delta_t \text{ (TD error)}}

where α is the learning rate. Q-learning converges to Q* under tabular settings. Deep Q-Network (DQN) replaces the table with a neural network Q_θ(s, a), adding experience replay and a target network to stabilize training.

Policy Gradient Theorem

Policy gradient methods directly optimize the policy parameters θ by gradient ascent on expected return J(θ) = E[G_0]:

\nabla_\theta J(\theta) = \mathbb{E}_\pi\!\left[\sum_t \nabla_\theta \ln \pi_\theta(A_t|S_t) \cdot G_t\right]

This is the REINFORCE estimator. In practice, a baseline b(s) (usually V^π(s)) is subtracted to reduce variance without biasing the gradient:

\nabla_\theta J(\theta) = \mathbb{E}_\pi\!\left[\nabla_\theta \ln \pi_\theta(A_t|S_t) \cdot (G_t - b(S_t))\right]

The term G_t − b(S_t) is the advantage A^π(s, a): how much better action a is than the average.
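The estimator above is compact enough to write out directly. A hedged sketch of one REINFORCE-with-baseline gradient step in PyTorch; the tensors stand in for a single collected episode, and the mean-return baseline is the simplest choice (in practice a learned V^π(s) is used):

```python
import torch

# Hypothetical data for one 10-step episode.
log_probs = torch.randn(10, requires_grad=True)   # log π_θ(a_t|s_t), t = 0..9
returns   = torch.rand(10)                        # Monte-Carlo returns G_t

baseline   = returns.mean()                       # simplest baseline: mean return
advantages = returns - baseline                   # same expectation, lower variance

# Ascent on J(θ) == descent on −J(θ). The advantage is detached: it scales
# the gradient but is not itself differentiated.
loss = -(log_probs * advantages.detach()).sum()
loss.backward()                                   # ∇_θ J accumulated in log_probs.grad
```

Because the baseline is the mean return, the per-step gradient contributions sum to zero here, which is exactly the variance-reduction effect: positive-advantage actions are pushed up, negative-advantage actions pushed down, with no net drift.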

Proximal Policy Optimization (PPO)

PPO (Schulman et al., 2017) is the workhorse of modern RL and RLHF fine-tuning. It uses a clipped surrogate objective to prevent destructively large policy updates:

\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta) \hat{A}_t,\; \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\,\hat{A}_t\right)\right]

where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the importance ratio and ε (typically 0.1–0.2) controls how far the new policy can deviate from the old one.
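The clip objective is a few lines of tensor code. A sketch on a toy batch of three actions; the log-probabilities and advantages are hypothetical stand-ins for rollout data collected under π_θ_old:

```python
import torch

eps      = 0.2
logp_new = torch.tensor([-0.9, -1.2, -0.4], requires_grad=True)  # log π_θ
logp_old = torch.tensor([-1.0, -1.0, -1.0])                      # log π_θ_old
adv      = torch.tensor([ 1.0, -0.5,  2.0])                      # Â_t

ratio     = torch.exp(logp_new - logp_old)             # importance ratio r_t(θ)
unclipped = ratio * adv
clipped   = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
loss      = -torch.min(unclipped, clipped).mean()      # maximize the surrogate
loss.backward()
```

The third action has ratio exp(0.6) ≈ 1.82, well outside the clip range: the min selects the clipped branch, the clamp saturates, and the gradient through that sample is exactly zero. That is the mechanism by which PPO refuses to push the policy further in a direction it has already moved too far.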

Walkthrough

Environment: 4×4 grid world. States s0–s15 (row-major). s15 = goal (+1 reward). s5 = trap (−1 reward). All other transitions give 0 reward. γ = 0.9.

Value Iteration (Dynamic Programming):

Initialize V(s) = 0 for all states. Repeat until convergence:

V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)\!\left[R + \gamma V(s')\right]

For a deterministic grid world, P(s'|s,a) = 1 for the intended next state. After ~20 iterations, V* converges. States adjacent to the goal have high V*, states adjacent to the trap have negative V*.

Optimal Policy Extraction:

π*(s) = argmax_a [R(s,a) + γ V*(s'(s,a))]

The resulting policy routes around the trap: from states in column 0 and 1, the agent moves right and down; from states near the trap, it detours via the top row.
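The value-iteration sweep and greedy policy extraction can be sketched in a few lines. This reuses the same grid layout as the Q-learning implementation below; the `step` helper here is a condensed stand-in for the environment dynamics:

```python
import numpy as np

GAMMA  = 0.9
DELTAS = [(-1, 0), (0, 1), (1, 0), (0, -1)]           # up, right, down, left

def step(s, a):
    row, col = divmod(s, 4)
    dr, dc = DELTAS[a]
    row, col = min(max(row + dr, 0), 3), min(max(col + dc, 0), 3)
    s_next = row * 4 + col
    reward = 1.0 if s_next == 15 else (-1.0 if s_next == 5 else 0.0)
    return s_next, reward, s_next in (5, 15)

V = np.zeros(16)
for sweep in range(100):                              # ~20 sweeps suffice here
    V_new = V.copy()
    for s in range(16):
        if s in (5, 15):                              # terminal states keep V = 0
            continue
        V_new[s] = max(r + GAMMA * V[s2] * (not done)
                       for s2, r, done in (step(s, a) for a in range(4)))
    if np.abs(V_new - V).max() < 1e-10:               # converged
        V = V_new
        break
    V = V_new

# Greedy policy extraction: π*(s) = argmax_a [R(s,a) + γ V*(s')].
greedy = [int(max(range(4), key=lambda a: step(s, a)[1]
              + GAMMA * V[step(s, a)[0]] * (not step(s, a)[2])))
          for s in range(16)]
print(V.reshape(4, 4).round(2))
```

Because rewards only arrive at the two terminal states, exact values propagate one step per sweep and the iteration converges in well under 20 sweeps: V*(s14) = 1 (one move from the goal), V*(s0) = 0.9⁵ ≈ 0.59 (six moves away along a trap-free shortest path).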

Q-Learning Episode (ε-greedy exploration, ε=0.1):

s=0, a=→, r=0, s'=1 → Q(0,→) ← Q(0,→) + α[0 + 0.9·max_a' Q(1,a') − Q(0,→)]
s=1, a=→, r=0, s'=2 → Q(1,→) ← Q(1,→) + α[0 + 0.9·max_a' Q(2,a') − Q(1,→)]
... eventually reaches s15

After ~1000 episodes, learned Q-values closely approximate Q*.

Q-learning implementation for the 4×4 grid world:

python
import numpy as np
 
# 4x4 grid world (states 0-15, row-major)
# Goal: state 15 (+1 reward, terminal). Trap: state 5 (-1 reward, terminal).
N_STATES, N_ACTIONS = 16, 4   # actions: 0=up, 1=right, 2=down, 3=left
Q = np.zeros((N_STATES, N_ACTIONS))
 
DELTAS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # row/col delta per action
 
def step(s: int, a: int) -> tuple[int, float, bool]:
    row, col = s // 4, s % 4
    dr, dc   = DELTAS[a]
    row      = np.clip(row + dr, 0, 3)
    col      = np.clip(col + dc, 0, 3)
    s_next   = row * 4 + col
    if s_next == 15: return s_next, +1.0, True
    if s_next ==  5: return s_next, -1.0, True
    return s_next,  0.0, False
 
alpha, gamma, epsilon = 0.1, 0.9, 0.1
 
for episode in range(5_000):
    s = 0
    for _ in range(200):                          # episode length cap
        if np.random.rand() < epsilon:            # explore
            a = np.random.randint(N_ACTIONS)
        else:                                     # exploit
            a = Q[s].argmax()
 
        s_next, r, done = step(s, a)
 
        # TD update: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]
        td_target = r + gamma * Q[s_next].max() * (not done)
        Q[s, a]  += alpha * (td_target - Q[s, a])
 
        s = s_next
        if done:
            break
 
# Optimal policy: greedy w.r.t. learned Q
ARROWS = ['↑', '→', '↓', '←']
policy = Q.argmax(axis=1)
print("Optimal policy:")
for row in range(4):
    print(' '.join(ARROWS[policy[row * 4 + col]] for col in range(4)))
# Arrows at the terminal cells (s5, s15) are meaningless; the rest route around the trap

DQN for continuous observation spaces (architecture sketch):

python
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import random
 
class DQN(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128),    nn.ReLU(),
            nn.Linear(128, n_actions),
        )
 
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
 
# Two networks: online (trained every step) and target (updated every C steps).
# The target network provides stable TD targets — without it, the TD target
# moves with the online network, creating a moving-target instability
# (the "deadly triad" of function approximation + bootstrapping + off-policy).
online_net = DQN(obs_dim=4, n_actions=2)    # CartPole-v1
target_net = DQN(obs_dim=4, n_actions=2)
target_net.load_state_dict(online_net.state_dict())
target_net.requires_grad_(False)
 
replay_buffer = deque(maxlen=10_000)
optimizer = optim.Adam(online_net.parameters(), lr=1e-3)
 
def update(batch_size: int = 64, gamma: float = 0.99) -> float:
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s_next, done = zip(*batch)
 
    s      = torch.tensor(s,      dtype=torch.float32)
    a      = torch.tensor(a,      dtype=torch.long)
    r      = torch.tensor(r,      dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    done   = torch.tensor(done,   dtype=torch.float32)
 
    q_vals = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(s_next).max(dim=1).values
        target = r + gamma * next_q * (1 - done)
 
    loss = nn.functional.mse_loss(q_vals, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Analysis & Evaluation

Where Your Intuition Breaks

"RL always requires millions of environment interactions, making it impractical for real-world applications." Not quite: sample efficiency varies enormously by algorithm family. Model-based RL (Dyna, MuZero) can achieve high performance with 10–100× fewer environment interactions by learning an internal model of the environment and planning within it. Offline RL (Conservative Q-Learning, IQL) trains entirely from a pre-collected dataset of transitions, requiring zero new environment interactions. The "millions of samples" intuition comes from model-free on-policy methods (PPO, A3C), not from RL as a whole. The right algorithm depends on whether a simulator is available, how expensive environment interaction is, and whether offline data exists.

RL Algorithm Taxonomy

Category            | Algorithm                          | Sample Efficiency | Stability | Use Case
Dynamic Programming | Value Iteration, Policy Iteration  | N/A (model-based) | High      | Tabular MDPs
Model-free value    | Q-learning, DQN, Double DQN        | Low               | Medium    | Discrete actions
Policy gradient     | REINFORCE, A2C, A3C                | Low               | Low       | Continuous actions
Actor-Critic        | PPO, SAC, TD3                      | Medium            | High      | Most practical tasks
Model-based         | Dyna, MuZero                       | High              | Variable  | Complex environments

Common Pitfalls

Reward hacking: Agents find unexpected ways to maximize the reward signal that violate the designer's intent. The reward function must capture all of what you care about. This is why RLHF adds a KL penalty — to prevent the model from drifting too far from the supervised baseline while optimizing reward.

Deadly triad: Combining function approximation (neural networks) + bootstrapping (TD updates) + off-policy learning causes instability. DQN addresses this with target networks and experience replay; PPO's on-policy updates avoid it.

Sparse rewards: Long-horizon tasks where reward only arrives at the end (e.g., winning a game after 1000 moves) are extremely hard to learn. Solutions: reward shaping, hindsight experience replay (HER), curriculum learning.
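Of the listed remedies, reward shaping has a clean safe form: potential-based shaping (Ng et al., 1999) adds F(s, s′) = γΦ(s′) − Φ(s) to the reward, which provably leaves the optimal policy unchanged. A sketch using a hypothetical distance-to-goal potential on the 4×4 grid from the walkthrough:

```python
GAMMA = 0.9

def phi(s: int) -> float:
    """Potential: negative Manhattan distance to the goal (state 15)."""
    row, col = divmod(s, 4)
    return -((3 - row) + (3 - col))

def shaped_reward(r: float, s: int, s_next: int) -> float:
    """Potential-based shaping F(s, s') = γΦ(s') − Φ(s) added to the raw reward."""
    return r + GAMMA * phi(s_next) - phi(s)

print(shaped_reward(0.0, 0, 1))   # positive: stepping toward the goal earns a dense bonus
```

The agent now receives informative feedback on every transition instead of only at the goal, without the policy-distorting risk of ad-hoc bonuses.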

Over-exploration vs exploitation: ε-greedy is simple but wastes samples on random actions indefinitely. Entropy regularization (soft actor-critic) encourages exploration more efficiently.
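The entropy bonus is one extra term in the loss. A minimal sketch with hypothetical logits for a 3-action policy; the coefficient β = 0.01 is a typical but arbitrary choice:

```python
import torch

logits  = torch.tensor([2.0, 0.5, -1.0], requires_grad=True)
probs   = torch.softmax(logits, dim=-1)
entropy = -(probs * torch.log(probs)).sum()   # H(π(·|s)), maximal for uniform π

# Subtracting β·H from the loss (i.e. adding it to the objective) nudges
# the policy toward higher entropy, keeping exploration alive.
(-0.01 * entropy).backward()
```

Unlike ε-greedy, which explores uniformly at random forever, the entropy term shrinks naturally as the optimized objective comes to dominate, so exploration decays where the policy is confidently correct.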

RL in LLM Context

PPO and its variants (GRPO, REINFORCE leave-one-out) are used in RLHF to optimize language model outputs against a reward model trained on human preferences. The key adaptations:

  • State = conversation context + partial generation
  • Action = next token from vocabulary
  • Reward = scalar from reward model on complete response
  • KL constraint keeps the policy close to the SFT base model

The PPO clip objective directly prevents catastrophic forgetting of language modeling capabilities during RL fine-tuning.
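The reward-plus-KL construction above is often implemented per token. A hedged sketch (the exact shaping varies by implementation; β and the log-prob tensors are hypothetical stand-ins for policy and reference-model outputs):

```python
import torch

beta = 0.1                                              # KL penalty coefficient
logp_policy = torch.tensor([-2.1, -1.8, -3.0, -0.5])    # log π_θ(token_t | context)
logp_ref    = torch.tensor([-2.0, -2.0, -2.5, -0.6])    # log π_ref(token_t | context)
reward_model_score = 0.7                                # scalar RM score, full response

kl_per_token = logp_policy - logp_ref                   # sample-based KL estimate
rewards = -beta * kl_per_token                          # penalty at every token
rewards[-1] += reward_model_score                       # RM score lands on the final token
```

Tokens where the policy assigns higher log-probability than the reference model are penalized, tokens where it assigns lower are rewarded, and the preference signal arrives only once, at the end of the generation.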
