
GRPO: Group Relative Policy Optimization

GRPO (Shao et al., 2024) is the training method behind DeepSeek-R1's reasoning capabilities. PPO-based RL requires a critic (value network) to estimate baselines — this doubles memory and adds complexity at LLM scale. GRPO's insight: sample a group of completions for the same prompt, compute rewards for all of them, and use the group's statistics as the baseline. No separate critic. This makes RL training practical at scale and is the core method driving modern reasoning models. As of early 2025, GRPO and its variants are the dominant approach for training reasoning models.

Theory

[Interactive figure: a GRPO group of G = 6 completions for the prompt "3 + 5 × 2 = ?", of which 4 are correct, giving μ = 0.667 and σ = 0.471. Each completion's advantage Â_i = (r_i − μ) / σ determines its fate: positive advantage → policy reinforced; negative → suppressed.]
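The figure's numbers can be reproduced in a few lines, assuming the 4 correct completions receive reward 1 and the 2 incorrect ones receive 0:

```python
import numpy as np

# Rewards for the G = 6 completions: 4 correct, 2 incorrect
rewards = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])

mu = rewards.mean()    # 0.667
sigma = rewards.std()  # 0.471
advantages = (rewards - mu) / sigma
# correct completions:   +0.707 (reinforced)
# incorrect completions: -1.414 (suppressed)
```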

PPO requires a value network to estimate "how good is this state?" — which at 7B+ parameters means roughly doubling memory. GRPO eliminates the value network by sampling multiple completions for the same prompt and using the group's own statistics as the baseline. The figure above shows the key idea: each completion is scored relative to its siblings, so the same absolute reward can earn a positive or negative advantage depending on how the rest of the group performed. The group provides its own context.

Why PPO Needs a Critic

PPO estimates the advantage of an action as:

$$\hat{A}_t = r_t - V_\phi(s_t)$$

where $V_\phi(s_t)$ is a learned value function approximating the expected future reward from state $s_t$. Training this critic requires a separate model (often as large as the policy), actor-critic synchronization, and additional forward/backward passes. At 7B+ parameters, the critic roughly doubles memory requirements.

GRPO: Group Statistics as Baseline

For each prompt $x$, sample $G$ completions $\{y_1, \ldots, y_G\}$ from the current policy $\pi_\theta^{\text{old}}$ and compute their rewards $\{r_1, \ldots, r_G\}$. The advantage for completion $i$ is normalized relative to the group:

$$\hat{A}_i = \frac{r_i - \mu_r}{\sigma_r + \varepsilon}$$

where $\mu_r = \frac{1}{G}\sum_{j} r_j$ and $\sigma_r = \sqrt{\frac{1}{G}\sum_{j}(r_j - \mu_r)^2}$.

Group normalization is necessary because absolute reward values carry no meaning on their own. A response with reward 5 is excellent if the group averages 3 and poor if the group averages 8; without normalization, the optimizer receives signal in units that depend on the reward scale, initialization, and prompt difficulty. Dividing by the group's standard deviation collapses all of these into a scale-free signal: positive means "better than average at this prompt," negative means "worse than average." A single learning rate can then work across diverse prompts and reward scales without per-prompt tuning.
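The scale-free property is easy to check numerically: multiplying every reward in a group by a constant leaves the normalized advantages unchanged. A quick sketch:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    # group-normalized advantages: (r_i - mean) / (std + eps)
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Same ranking at two different reward scales
a = group_advantages([2.0, 4.0, 6.0])
b = group_advantages([20.0, 40.0, 60.0])
# a and b are (numerically) identical: the optimizer sees the same signal
```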

💡Intuition

Instead of asking "is this response good in absolute terms?", GRPO asks "is this response better or worse than the other attempts at the same problem?" A response scoring 7/10 in a group where others average 5/10 gets a strong positive advantage. The same score in a group averaging 8/10 gets a weak negative advantage. The group provides its own baseline — no critic needed.

The GRPO Objective

The loss combines a clipped policy gradient (same as PPO) with a KL penalty:

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \left[\min\!\left(\rho_i \hat{A}_i,\; \text{clip}(\rho_i, 1-\varepsilon, 1+\varepsilon)\,\hat{A}_i\right)\right] + \beta\, \text{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]$$

where $\rho_i = \pi_\theta(y_i|x) / \pi_\theta^{\text{old}}(y_i|x)$ is the probability ratio between the current policy and the sampling policy.

The clip: if $\rho_i$ deviates too far from 1 (the policy has changed substantially since sampling), the gradient contribution is clipped. This prevents destructively large updates — the same mechanism as standard PPO.

The KL penalty: $\beta\, \text{KL}[\pi_\theta \,\|\, \pi_{\text{ref}}]$, where $\pi_{\text{ref}}$ is the SFT checkpoint. It prevents the policy from drifting too far from the initial instruction-following behavior.
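The clipped part of the objective is only a few lines of code. A minimal sketch using sequence-level log-probabilities (the KL term is left out here, and the function name is illustrative):

```python
import numpy as np

def grpo_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # rho_i = pi_theta(y_i|x) / pi_old(y_i|x), computed in log space
    rho = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages)
    unclipped = rho * adv
    clipped = np.clip(rho, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # negative mean of the per-completion minimum; beta * KL is added separately
    return -np.mean(np.minimum(unclipped, clipped))

# Right after sampling, logp_new == logp_old, so rho = 1 and the loss
# reduces to the negative mean advantage.
```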

Token-Level vs. Response-Level KL

Two options for applying the KL constraint:

Response-level KL: compute KL once per response as the sum of per-token log-probability ratios. Straightforward but coarse.

Token-level KL (used in DeepSeek-R1): add a per-token KL penalty to the reward signal at each generation step, making the constraint more fine-grained. Avoids a second forward pass to compute full-sequence KL, making it more memory-efficient.
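One common way to implement the token-level variant is to shape each token's reward with the log-ratio between the policy and the reference model; the simple log-ratio serves as a per-token KL estimate. A sketch under that assumption (names are illustrative):

```python
import numpy as np

def shape_rewards_with_kl(token_rewards, logp_policy, logp_ref, beta=0.01):
    # per-token KL estimate: log pi_theta(y_t) - log pi_ref(y_t)
    kl = np.asarray(logp_policy) - np.asarray(logp_ref)
    # subtract the weighted penalty from each token's reward
    return np.asarray(token_rewards) - beta * kl
```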

Walkthrough

Math Reasoning with Binary Rewards

GRPO works best on tasks with verifiable rewards — correctness can be checked programmatically without a learned reward model.

Task: solve arithmetic problems. Reward: 1.0 if the final answer is numerically correct, 0.0 otherwise.

Step 1 — Sample G completions per prompt:

python
prompt = (
    "Solve step by step: A store sells apples for $0.50 and oranges for $0.75. "
    "You buy 6 apples and 4 oranges. What is the total cost?"
)
# Correct answer: 6*0.50 + 4*0.75 = 3.00 + 3.00 = $6.00
 
completions = policy.generate(
    prompt,
    num_return_sequences=8,
    temperature=0.8,
    max_new_tokens=256,
)

Step 2 — Assign rewards:

python
import re
 
def parse_answer(text):
    # extract the final numeric value from the completion
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace("$", "").replace(",", ""))
    return float(matches[-1]) if matches else None
 
rewards = [1.0 if parse_answer(c) == 6.00 else 0.0 for c in completions]
# e.g. [1, 0, 1, 0, 1, 1, 0, 1]  →  5 correct out of 8

Step 3 — Compute group-normalized advantages:

python
import numpy as np
 
mu = np.mean(rewards)    # 0.625
sigma = np.std(rewards)  # 0.484
 
advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]
# correct:   +0.77  (above average → positive advantage)
# incorrect: -1.29  (below average → negative advantage)

Step 4 — Policy update: completions with positive advantage are reinforced; completions with negative advantage are suppressed. The policy learns to produce the step-by-step reasoning patterns that lead to correct answers.
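Ignoring clipping and KL for brevity, the update in Step 4 reduces to a REINFORCE-style loss: each completion's summed token log-probabilities are weighted by its group advantage. A sketch with illustrative names:

```python
import numpy as np

def grpo_update_loss(seq_logps, advantages):
    # seq_logps: sum of token log-probs of each completion under the
    # current policy; advantages: group-normalized advantages.
    # Positive advantage -> gradient increases the completion's likelihood;
    # negative advantage -> gradient decreases it.
    return -np.mean(np.asarray(advantages) * np.asarray(seq_logps))
```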

Analysis & Evaluation

Where Your Intuition Breaks

The intuition: larger group size $G$ always improves GRPO, since more samples give a better baseline estimate. The break: a larger $G$ reduces baseline variance but increases variance in the policy gradient itself — with more diverse completions, the importance weights $\rho_i = \pi_\theta(y_i|x)/\pi_\theta^{\text{old}}(y_i|x)$ spread further, making individual gradient steps noisier. Empirically, $G \in [8, 16]$ works well across most tasks; beyond that, the cost of additional forward passes outweighs the variance reduction. The right $G$ also depends on prompt difficulty: easy prompts (highly correlated rewards) get by with smaller $G$, while hard prompts (high reward variance) benefit from larger $G$.

GRPO vs. PPO vs. DPO

| Property | GRPO | PPO | DPO |
|---|---|---|---|
| Reward model needed | No (verifiable) or optional | Yes | No |
| Value model (critic) | No | Yes | No |
| Online sampling | Yes | Yes | No |
| Memory overhead | ~2× policy | ~4× policy | ~2× policy |
| Best reward type | Verifiable (math, code tests) | Learned reward model | Human preference pairs |
| Exploration | Yes | Yes | No (offline only) |
| Training stability | Good (group normalization) | Harder to tune | Very stable |

When GRPO outperforms DPO: reasoning tasks where the model needs to discover better solutions through exploration. DPO is bounded by its offline preference dataset; GRPO can find correct reasoning chains not in any existing dataset.

When DPO outperforms GRPO: stylistic alignment where curated preference data exists and exploration is not needed. DPO is simpler and more stable.

Key Hyperparameters

| Parameter | Typical value | Effect |
|---|---|---|
| $G$ (group size) | 4–16 | Larger = more stable advantages, more compute |
| $\beta$ (KL weight) | 0.001–0.01 | Higher = stays closer to SFT checkpoint |
| $\varepsilon$ (clip) | 0.1–0.2 | Standard PPO clip range |
| Temperature | 0.7–1.0 | Higher = more diverse group responses |

Zero-variance groups: if all $G$ completions receive the same reward (all correct or all incorrect), $\sigma_r \approx 0$ and the advantages are undefined. Add $\varepsilon = 10^{-8}$ to the denominator. Optionally skip the update for zero-variance groups — this happens naturally when a problem is too easy or too hard, and neither case provides useful learning signal.
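A sketch of the corresponding filter (an illustrative helper, not from the paper):

```python
import numpy as np

def usable_group(rewards, tol=1e-6):
    # skip groups whose rewards are (near-)constant: no learning signal
    return float(np.std(rewards)) > tol

# an all-correct or all-incorrect group is skipped; a mixed group is kept
```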

Reward Design

GRPO's effectiveness depends entirely on reward quality. Three patterns:

Outcome rewards (ORM): reward based on final answer correctness alone. Simple; sparse signal but unambiguous.

Process rewards (PRM): reward individual reasoning steps. Dense signal; requires labeled step-level data or a trained process reward model.

Format rewards: reward adherence to a required structure (e.g., <think>...</think> before the final answer). Encourages chain-of-thought before commitment. Used in DeepSeek-R1 to elicit explicit reasoning. Often combined with outcome rewards.
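A combined outcome-plus-format reward might look like the sketch below; the 0.2 bonus and the regex are illustrative assumptions, not values from the DeepSeek-R1 paper:

```python
import re

def format_reward(text):
    # small bonus if the completion wraps its reasoning in <think>...</think>
    return 0.2 if re.search(r"<think>.*?</think>", text, re.DOTALL) else 0.0

def total_reward(text, is_correct):
    # outcome reward (1.0 / 0.0) plus the format bonus
    return (1.0 if is_correct else 0.0) + format_reward(text)
```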

🚀Production

GRPO in practice:

  • Verifiable rewards are the key ingredient. GRPO works because the reward signal is reliable. For subjective tasks (tone, helpfulness), a learned reward model introduces noise — consider DPO instead.
  • Group size G: start at G=4 for memory constraints. G=8 provides meaningfully more stable advantage estimates. Returns diminish beyond G=16.
  • Sampling temperature: use 0.8–1.0. If all G completions are identical (low temperature), group variance is zero and GRPO provides no signal.
  • Monitor KL divergence: unlike DPO, GRPO can overfit to the reward if $\beta$ is too small. Stop training or increase $\beta$ if the KL divergence exceeds 5–10 nats.
  • Fast-moving area: as of early 2025, GRPO variants (Dr. GRPO, DAPO, REINFORCE++) are actively being developed. The core group-normalization idea is stable; check recent literature for specific implementation choices.
