GRPO: Group Relative Policy Optimization
GRPO (Shao et al., 2024) is the training method behind DeepSeek-R1's reasoning capabilities. PPO-based RL requires a critic (value network) to estimate baselines — this doubles memory and adds complexity at LLM scale. GRPO's insight: sample a group of completions for the same prompt, compute rewards for all of them, and use the group's statistics as the baseline. No separate critic. This makes RL training practical at scale and is the core method driving modern reasoning models. As of early 2025, GRPO and its variants are the dominant approach for training reasoning models.
Theory
Â_i = (r_i − μ) / σ: positive advantage → completion reinforced; negative advantage → suppressed.
PPO requires a value network to estimate "how good is this state?" — which at 7B+ parameters means doubling memory. GRPO eliminates the value network by sampling multiple completions for the same prompt and using the group's own statistics as the baseline. The key idea: a response scoring 7/10 in a group averaging 5/10 earns a strong positive advantage; the same score in a group averaging 8/10 earns a negative advantage. The group provides its own context.
Why PPO Needs a Critic
PPO estimates the advantage of an action (in its simplest one-step form) as:

Â_t = r_t + γ·V(s_{t+1}) − V(s_t)

where V(s) is a learned value function approximating expected future reward from state s. Training this critic requires a separate model (often as large as the policy), actor–critic synchronization, and additional forward/backward passes. At 7B+ parameters, the critic roughly doubles memory requirements.
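For contrast, here is a one-step TD sketch of the critic-based advantage PPO relies on (illustrative NumPy; real PPO implementations typically use GAE, and `ppo_advantage` is a name chosen here, not a library function):

```python
import numpy as np

def ppo_advantage(rewards, values, gamma=0.99):
    """One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds the critic's predictions for each state, plus one
    trailing bootstrap value for the state after the final reward.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

# The `values` array is exactly the critic output that GRPO removes.
adv = ppo_advantage(rewards=[0.0, 0.0, 1.0], values=[0.4, 0.6, 0.9, 0.0])
```

Everything except `rewards` here comes from the learned critic, which is the component GRPO does away with.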
GRPO: Group Statistics as Baseline
For each prompt q, sample G completions o_1, …, o_G from the current policy and compute their rewards r_1, …, r_G. The advantage for completion i is normalized relative to the group:

Â_i = (r_i − μ) / σ
Group normalization is forced by the meaninglessness of absolute reward values. A response with reward 5 is excellent if the group averages 3 and poor if the group averages 8 — without normalization, the optimizer receives signal in units that depend on the reward scale, initialization, and prompt difficulty. Normalizing by the group's standard deviation collapses all of these to a scale-free signal: positive means "better than average at this prompt," negative means "worse than average." This lets a single learning rate work across diverse prompts and reward scales without per-prompt tuning.
Here μ = mean(r_1, …, r_G) and σ = std(r_1, …, r_G) are the group's mean and standard deviation.
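A quick numerical check of the scale-free claim (NumPy sketch; `group_advantages` is an illustrative helper, not from the paper):

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Same relative ranking, reward scales 100x apart...
a = group_advantages([1.0, 2.0, 3.0])
b = group_advantages([100.0, 200.0, 300.0])
# ...normalize to (numerically) identical scale-free advantages.
```

Because both the mean and the standard deviation are taken within the group, any affine rescaling of the reward cancels out.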
Instead of asking "is this response good in absolute terms?", GRPO asks "is this response better or worse than the other attempts at the same problem?" A response scoring 7/10 in a group where others average 5/10 gets a strong positive advantage. The same score in a group averaging 8/10 gets a weak negative advantage. The group provides its own baseline — no critic needed.
The GRPO Objective
The loss combines a clipped policy gradient (same as PPO) with a KL penalty:

J(θ) = E[ (1/G) Σ_i min(ρ_i · Â_i, clip(ρ_i, 1−ε, 1+ε) · Â_i) ] − β · D_KL(π_θ ‖ π_ref)

where ρ_i = π_θ(o_i | q) / π_old(o_i | q) is the probability ratio between the current and sampling policy.
The clip: if ρ_i deviates too far from 1 (the policy has changed substantially since sampling), the gradient contribution is clipped. This prevents destructively large updates — the same mechanism as standard PPO.
The KL penalty: β · D_KL(π_θ ‖ π_ref), where π_ref is the SFT checkpoint. It prevents the policy from drifting too far from the initial instruction-following behavior.
Token-Level vs. Response-Level KL
Two options for applying the KL constraint:
Response-level KL: compute KL once per response as the sum of per-token log-probability ratios. Straightforward but coarse.
Token-level KL (used in DeepSeek-R1): add a per-token KL penalty to the reward signal at each generation step, making the constraint more fine-grained. Avoids a second forward pass to compute full-sequence KL, making it more memory-efficient.
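The token-level variant can be sketched by folding the penalty directly into per-token rewards (illustrative helper; the simple log-ratio stands in for whichever KL estimator an implementation uses):

```python
import numpy as np

def token_level_kl_reward(token_rewards, logp_policy, logp_ref, beta=0.01):
    """Shape per-token rewards with a per-token KL penalty.

    logp_policy / logp_ref: log-probs of each generated token under the
    current policy and the reference (SFT) policy, same length as rewards.
    """
    per_token_kl = np.asarray(logp_policy) - np.asarray(logp_ref)
    return np.asarray(token_rewards) - beta * per_token_kl
```

Tokens where the policy is much more confident than the reference pay a small penalty at that exact step, rather than the whole sequence paying one lump sum.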
Walkthrough
Math Reasoning with Binary Rewards
GRPO works best on tasks with verifiable rewards — correctness can be checked programmatically without a learned reward model.
Task: solve arithmetic problems. Reward: 1.0 if the final answer is numerically correct, 0.0 otherwise.
Step 1 — Sample G completions per prompt:

```python
prompt = (
    "Solve step by step: A store sells apples for $0.50 and oranges for $0.75. "
    "You buy 6 apples and 4 oranges. What is the total cost?"
)
# Correct answer: 6*0.50 + 4*0.75 = 3.00 + 3.00 = $6.00

completions = policy.generate(
    prompt,
    num_return_sequences=8,
    temperature=0.8,
    max_new_tokens=256,
)
```

Step 2 — Assign rewards:
```python
def parse_answer(text):
    # extract final numeric answer from completion
    ...

rewards = [1.0 if parse_answer(c) == 6.00 else 0.0 for c in completions]
# e.g. [1, 0, 1, 0, 1, 1, 0, 1] → 5 correct out of 8
```

Step 3 — Compute group-normalized advantages:
```python
import numpy as np

mu = np.mean(rewards)    # 0.625
sigma = np.std(rewards)  # 0.484
advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]
# correct:   +0.77 (above average → positive advantage)
# incorrect: -1.29 (below average → negative advantage)
```

Step 4 — Policy update: completions with positive advantage are reinforced; completions with negative advantage are suppressed. The policy learns to produce the step-by-step reasoning patterns that lead to correct answers.
Analysis & Evaluation
Where Your Intuition Breaks
"Larger group size always improves GRPO — more samples give better baseline estimates." Not quite. A larger G reduces baseline variance but increases variance in the policy gradient itself: with more diverse completions, the importance weights spread further, making individual gradient steps noisier. Empirically, moderate group sizes (the 4–16 range cited below) work well across most tasks — beyond that, the cost of additional forward passes outweighs the variance reduction. The right G depends on prompt difficulty: easy prompts (high reward correlation) get by with smaller G; hard prompts (high reward variance) benefit from larger G.
GRPO vs. PPO vs. DPO
| Property | GRPO | PPO | DPO |
|---|---|---|---|
| Reward model needed | No (verifiable) or optional | Yes | No |
| Value model (critic) | No | Yes | No |
| Online sampling | Yes | Yes | No |
| Memory overhead | ~2× policy | ~4× policy | ~2× policy |
| Best reward type | Verifiable (math, code tests) | Learned reward model | Human preference pairs |
| Exploration | Yes | Yes | No (offline only) |
| Training stability | Good (group normalization) | Harder to tune | Very stable |
When GRPO outperforms DPO: reasoning tasks where the model needs to discover better solutions through exploration. DPO is bounded by its offline preference dataset; GRPO can find correct reasoning chains not in any existing dataset.
When DPO outperforms GRPO: stylistic alignment where curated preference data exists and exploration is not needed. DPO is simpler and more stable.
Key Hyperparameters
| Parameter | Typical value | Effect |
|---|---|---|
| G (group size) | 4–16 | Larger = more stable advantages, more compute |
| β (KL weight) | 0.001–0.01 | Higher = stays closer to SFT checkpoint |
| ε (clip) | 0.1–0.2 | Standard PPO clip range |
| Temperature | 0.7–1.0 | Higher = more diverse group responses |
Zero-variance groups: if all completions receive the same reward (all correct or all incorrect), σ = 0 and advantages are undefined. Add a small ε (e.g. 1e-8) to the denominator. Optionally skip the update for zero-variance groups — this happens naturally when a problem is too easy or too hard, and neither case provides useful learning signal.
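A guard for this case might look like the following (illustrative helper; returning `None` to signal "skip this group" is one convention, not the only one):

```python
import numpy as np

def safe_group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages, or None for zero-variance groups.

    All-correct or all-incorrect groups carry no learning signal,
    so the caller can simply skip the update for them.
    """
    r = np.asarray(rewards, dtype=float)
    if r.std() == 0.0:
        return None  # every completion got the same reward
    return (r - r.mean()) / (r.std() + eps)
```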
Reward Design
GRPO's effectiveness depends entirely on reward quality. Three patterns:
Outcome rewards (ORM): reward based on final answer correctness alone. Simple; sparse signal but unambiguous.
Process rewards (PRM): reward individual reasoning steps. Dense signal; requires labeled step-level data or a trained process reward model.
Format rewards: reward adherence to a required structure (e.g., <think>...</think> before the final answer). Encourages chain-of-thought before commitment. Used in DeepSeek-R1 to elicit explicit reasoning. Often combined with outcome rewards.
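A hypothetical combined format + outcome reward might look like this (the weights, regex, and function name are illustrative, not taken from DeepSeek-R1):

```python
import re

def combined_reward(completion, correct_answer, w_format=0.2, w_outcome=1.0):
    """Format bonus for <think>...</think> structure plus outcome reward
    for a correct final numeric answer. Weights are illustrative."""
    m = re.search(r"<think>.*?</think>\s*(.*)", completion, re.DOTALL)
    format_ok = m is not None
    # Only look for the answer after the reasoning block, if one exists
    answer_text = m.group(1) if format_ok else completion
    nums = re.findall(r"-?\d+(?:\.\d+)?", answer_text)
    outcome_ok = bool(nums) and abs(float(nums[-1]) - correct_answer) < 1e-6
    return w_format * format_ok + w_outcome * outcome_ok
```

A well-formatted correct answer scores 1.2, a correct answer without the reasoning structure 1.0, and a well-formatted wrong answer only 0.2, so structure alone is never enough.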
GRPO in practice:
- Verifiable rewards are the key ingredient. GRPO works because the reward signal is reliable. For subjective tasks (tone, helpfulness), a learned reward model introduces noise — consider DPO instead.
- Group size G: start at G=4 for memory constraints. G=8 provides meaningfully more stable advantage estimates. Returns diminish beyond G=16.
- Sampling temperature: use 0.8–1.0. If all G completions are identical (low temperature), group variance is zero and GRPO provides no signal.
- Monitor KL divergence: unlike DPO, GRPO can overfit to reward if β is too small. Stop training or increase β if KL divergence exceeds 5–10 nats.
- Fast-moving area: as of early 2025, GRPO variants (Dr. GRPO, DAPO, REINFORCE++) are actively being developed. The core group-normalization idea is stable; check recent literature for specific implementation choices.
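The KL-monitoring advice above can be sketched with the common k3 estimator (illustrative helper; assumes per-token log-probs under both policies are available from the rollout):

```python
import numpy as np

def mean_kl_estimate(logp_policy, logp_ref):
    """k3 estimator of KL(pi_theta || pi_ref), averaged over tokens.

    For samples drawn from the policy, exp(log_r) - log_r - 1 with
    log_r = log pi_ref - log pi_theta is nonnegative and low-variance.
    """
    log_r = np.asarray(logp_ref) - np.asarray(logp_policy)
    return float(np.mean(np.exp(log_r) - log_r - 1))
```

Logged once per batch, this gives the nats figure to compare against the 5–10 nat threshold mentioned above.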