GRPO: Group Relative Policy Optimization
GRPO (Shao et al., 2024) is the training method behind DeepSeek-R1's reasoning capabilities. PPO-based RL requires a critic (value network) to estimate baselines — this doubles memory and adds complexity at LLM scale. GRPO's insight: sample a group of completions for the same prompt, compute rewards for all of them, and use the group's statistics as the baseline. No separate critic. This makes RL training practical at scale and is the core method driving modern reasoning models. As of early 2025, GRPO and its variants are the dominant approach for training reasoning models.
Theory
Â_i = (r_i − μ) / σ: positive advantage → completion reinforced; negative advantage → suppressed.
PPO requires a value network to estimate "how good is this state?" — which at 7B+ parameters means doubling memory. GRPO eliminates the value network by sampling multiple completions for the same prompt and using the group's own statistics as the baseline. The key idea: a response scoring 7/10 in a group averaging 5/10 earns a strong positive advantage; the same score in a group averaging 8/10 earns a negative advantage. The group provides its own context.
Why PPO Needs a Critic
PPO estimates the advantage of an action (in its simplest one-step form) as:

Â_t = r_t + γ·V(s_{t+1}) − V(s_t)

where V(s) is a learned value function approximating expected future reward from state s. Training this critic requires a separate model (often as large as the policy), actor–critic synchronization, and additional forward/backward passes. At 7B+ parameters, the critic roughly doubles memory requirements.
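For contrast, here is a one-step TD sketch of the critic-based advantage PPO relies on (illustrative NumPy; real PPO implementations typically use GAE, and `ppo_advantage` is a name chosen here, not a library function):

```python
import numpy as np

def ppo_advantage(rewards, values, gamma=0.99):
    """One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds the critic's predictions for each state, plus one
    trailing bootstrap value for the state after the final reward.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

# The `values` array is exactly the critic output that GRPO removes.
adv = ppo_advantage(rewards=[0.0, 0.0, 1.0], values=[0.4, 0.6, 0.9, 0.0])
```

Everything except `rewards` here comes from the learned critic, which is the component GRPO does away with.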
GRPO: Group Statistics as Baseline
For each prompt q, sample G completions o_1, …, o_G from the current policy and compute their rewards r_1, …, r_G. The advantage for completion i is normalized relative to the group:

Â_i = (r_i − μ) / σ
Group normalization is forced by the meaninglessness of absolute reward values. A response with reward 5 is excellent if the group averages 3 and poor if the group averages 8 — without normalization, the optimizer receives signal in units that depend on the reward scale, initialization, and prompt difficulty. Normalizing by the group's standard deviation collapses all of these to a scale-free signal: positive means "better than average at this prompt," negative means "worse than average." This lets a single learning rate work across diverse prompts and reward scales without per-prompt tuning.
Here μ = mean(r_1, …, r_G) and σ = std(r_1, …, r_G) are the group's mean and standard deviation.
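A quick numerical check of the scale-free claim (NumPy sketch; `group_advantages` is an illustrative helper, not from the paper):

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Same relative ranking, reward scales 100x apart...
a = group_advantages([1.0, 2.0, 3.0])
b = group_advantages([100.0, 200.0, 300.0])
# ...normalize to (numerically) identical scale-free advantages.
```

Because both the mean and the standard deviation are taken within the group, any affine rescaling of the reward cancels out.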
Instead of asking "is this response good in absolute terms?", GRPO asks "is this response better or worse than the other attempts at the same problem?" A response scoring 7/10 in a group where others average 5/10 gets a strong positive advantage. The same score in a group averaging 8/10 gets a weak negative advantage. The group provides its own baseline — no critic needed.
The GRPO Objective
The loss combines a clipped policy gradient (same as PPO) with a KL penalty:

J(θ) = E[ (1/G) Σ_i min(ρ_i · Â_i, clip(ρ_i, 1−ε, 1+ε) · Â_i) ] − β · D_KL(π_θ ‖ π_ref)

where ρ_i = π_θ(o_i | q) / π_old(o_i | q) is the probability ratio between the current and sampling policy.
The clip: if ρ_i deviates too far from 1 (the policy has changed substantially since sampling), the gradient contribution is clipped. This prevents destructively large updates — the same mechanism as standard PPO.
The KL penalty: β · D_KL(π_θ ‖ π_ref), where π_ref is the SFT checkpoint. It prevents the policy from drifting too far from the initial instruction-following behavior.
Token-Level vs. Response-Level KL
Two options for applying the KL constraint:
Response-level KL: compute KL once per response as the sum of per-token log-probability ratios. Straightforward but coarse.
Token-level KL (used in DeepSeek-R1): add a per-token KL penalty to the reward signal at each generation step, making the constraint more fine-grained. Avoids a second forward pass to compute full-sequence KL, making it more memory-efficient.
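The token-level variant can be sketched by folding the penalty directly into per-token rewards (illustrative helper; the simple log-ratio stands in for whichever KL estimator an implementation uses):

```python
import numpy as np

def token_level_kl_reward(token_rewards, logp_policy, logp_ref, beta=0.01):
    """Shape per-token rewards with a per-token KL penalty.

    logp_policy / logp_ref: log-probs of each generated token under the
    current policy and the reference (SFT) policy, same length as rewards.
    """
    per_token_kl = np.asarray(logp_policy) - np.asarray(logp_ref)
    return np.asarray(token_rewards) - beta * per_token_kl
```

Tokens where the policy is much more confident than the reference pay a small penalty at that exact step, rather than the whole sequence paying one lump sum.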
Walkthrough
Math Reasoning with Binary Rewards
GRPO works best on tasks with verifiable rewards — correctness can be checked programmatically without a learned reward model.
Task: solve arithmetic problems. Reward: 1.0 if the final answer is numerically correct, 0.0 otherwise.
Step 1 — Sample G completions per prompt:

```python
prompt = (
    "Solve step by step: A store sells apples for $0.50 and oranges for $0.75. "
    "You buy 6 apples and 4 oranges. What is the total cost?"
)
# Correct answer: 6*0.50 + 4*0.75 = 3.00 + 3.00 = $6.00

completions = policy.generate(
    prompt,
    num_return_sequences=8,
    temperature=0.8,
    max_new_tokens=256,
)
```

Step 2 — Assign rewards:
```python
def parse_answer(text):
    # extract final numeric answer from completion
    ...

rewards = [1.0 if parse_answer(c) == 6.00 else 0.0 for c in completions]
# e.g. [1, 0, 1, 0, 1, 1, 0, 1] → 5 correct out of 8
```

Step 3 — Compute group-normalized advantages:
```python
import numpy as np

mu = np.mean(rewards)    # 0.625
sigma = np.std(rewards)  # 0.484
advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]
# correct:   +0.77 (above average → positive advantage)
# incorrect: -1.29 (below average → negative advantage)
```

Step 4 — Policy update: completions with positive advantage are reinforced; completions with negative advantage are suppressed. The policy learns to produce the step-by-step reasoning patterns that lead to correct answers.
Analysis & Evaluation
Where Your Intuition Breaks
"Larger group size always improves GRPO — more samples give better baseline estimates." Not quite. A larger G reduces baseline variance but increases variance in the policy gradient itself: with more diverse completions, the importance weights spread further, making individual gradient steps noisier. Empirically, moderate group sizes (the 4–16 range cited below) work well across most tasks — beyond that, the cost of additional forward passes outweighs the variance reduction. The right G depends on prompt difficulty: easy prompts (high reward correlation) get by with smaller G; hard prompts (high reward variance) benefit from larger G.
GRPO vs. PPO vs. DPO
| Property | GRPO | PPO | DPO |
|---|---|---|---|
| Reward model needed | No (verifiable) or optional | Yes | No |
| Value model (critic) | No | Yes | No |
| Online sampling | Yes | Yes | No |
| Memory overhead | ~2× policy | ~4× policy | ~2× policy |
| Best reward type | Verifiable (math, code tests) | Learned reward model | Human preference pairs |
| Exploration | Yes | Yes | No (offline only) |
| Training stability | Good (group normalization) | Harder to tune | Very stable |
When GRPO outperforms DPO: reasoning tasks where the model needs to discover better solutions through exploration. DPO is bounded by its offline preference dataset; GRPO can find correct reasoning chains not in any existing dataset.
When DPO outperforms GRPO: stylistic alignment where curated preference data exists and exploration is not needed. DPO is simpler and more stable.
Key Hyperparameters
| Parameter | Typical value | Effect |
|---|---|---|
| G (group size) | 4–16 | Larger = more stable advantages, more compute |
| β (KL weight) | 0.001–0.01 | Higher = stays closer to SFT checkpoint |
| ε (clip) | 0.1–0.2 | Standard PPO clip range |
| Temperature | 0.7–1.0 | Higher = more diverse group responses |
Zero-variance groups: if all completions receive the same reward (all correct or all incorrect), σ = 0 and advantages are undefined. Add a small ε (e.g. 1e-8) to the denominator. Optionally skip the update for zero-variance groups — this happens naturally when a problem is too easy or too hard, and neither case provides useful learning signal.
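A guard for this case might look like the following (illustrative helper; returning `None` to signal "skip this group" is one convention, not the only one):

```python
import numpy as np

def safe_group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages, or None for zero-variance groups.

    All-correct or all-incorrect groups carry no learning signal,
    so the caller can simply skip the update for them.
    """
    r = np.asarray(rewards, dtype=float)
    if r.std() == 0.0:
        return None  # every completion got the same reward
    return (r - r.mean()) / (r.std() + eps)
```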
Reward Design
GRPO's effectiveness depends entirely on reward quality. Three patterns:
Outcome rewards (ORM): reward based on final answer correctness alone. Simple; sparse signal but unambiguous.
Process rewards (PRM): reward individual reasoning steps. Dense signal; requires labeled step-level data or a trained process reward model.
Format rewards: reward adherence to a required structure (e.g., <think>...</think> before the final answer). Encourages chain-of-thought before commitment. Used in DeepSeek-R1 to elicit explicit reasoning. Often combined with outcome rewards.
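A hypothetical combined format + outcome reward might look like this (the weights, regex, and function name are illustrative, not taken from DeepSeek-R1):

```python
import re

def combined_reward(completion, correct_answer, w_format=0.2, w_outcome=1.0):
    """Format bonus for <think>...</think> structure plus outcome reward
    for a correct final numeric answer. Weights are illustrative."""
    m = re.search(r"<think>.*?</think>\s*(.*)", completion, re.DOTALL)
    format_ok = m is not None
    # Only look for the answer after the reasoning block, if one exists
    answer_text = m.group(1) if format_ok else completion
    nums = re.findall(r"-?\d+(?:\.\d+)?", answer_text)
    outcome_ok = bool(nums) and abs(float(nums[-1]) - correct_answer) < 1e-6
    return w_format * format_ok + w_outcome * outcome_ok
```

A well-formatted correct answer scores 1.2, a correct answer without the reasoning structure 1.0, and a well-formatted wrong answer only 0.2, so structure alone is never enough.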
GRPO in practice:
- Verifiable rewards are the key ingredient. GRPO works because the reward signal is reliable. For subjective tasks (tone, helpfulness), a learned reward model introduces noise — consider DPO instead.
- Group size G: start at G=4 for memory constraints. G=8 provides meaningfully more stable advantage estimates. Returns diminish beyond G=16.
- Sampling temperature: use 0.8–1.0. If all G completions are identical (low temperature), group variance is zero and GRPO provides no signal.
- Monitor KL divergence: unlike DPO, GRPO can overfit to reward if β is too small. Stop training or increase β if KL divergence exceeds 5–10 nats.
- Fast-moving area: as of early 2025, GRPO variants (Dr. GRPO, DAPO, REINFORCE++) are actively being developed. The core group-normalization idea is stable; check recent literature for specific implementation choices.
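The KL-monitoring advice above can be sketched with the common k3 estimator (illustrative helper; assumes per-token log-probs under both policies are available from the rollout):

```python
import numpy as np

def mean_kl_estimate(logp_policy, logp_ref):
    """k3 estimator of KL(pi_theta || pi_ref), averaged over tokens.

    For samples drawn from the policy, exp(log_r) - log_r - 1 with
    log_r = log pi_ref - log pi_theta is nonnegative and low-variance.
    """
    log_r = np.asarray(logp_ref) - np.asarray(logp_policy)
    return float(np.mean(np.exp(log_r) - log_r - 1))
```

Logged once per batch, this gives the nats figure to compare against the 5–10 nat threshold mentioned above.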