Stanford CS336 Lecture 16: Reinforcement Learning from Verifiable Rewards (RLVR)
RLVR: Enabling Thinking Models through Verifiable Rewards
Reinforcement Learning from Verifiable Rewards (RLVR) is a post-training paradigm that allows language models to develop complex reasoning capabilities—often manifested as long Chains of Thought (CoT)—by optimizing against rewards that can be objectively verified, such as mathematical correctness or code execution. Unlike standard RLHF (Reinforcement Learning from Human Feedback), which relies on potentially noisy human preference models, RLVR leverages ground-truth outcomes to avoid the "overoptimization" bottleneck where models exploit flaws in a reward model rather than improving actual performance.
The Shift from PPO to GRPO
Proximal Policy Optimization (PPO) has long been the workhorse of RL for language models, but it is notoriously difficult to implement and computationally expensive due to its reliance on a value model.
The Limitations of PPO
- Implementation Sensitivity: PPO is highly sensitive to hyperparameters and implementation details, often requiring numerous "hacks" to stabilize training.
- Memory Overhead: PPO requires a value model (a neural network that estimates the expected reward at each token) that is typically as large as the policy model itself, doubling the memory requirements.
- Complexity: The interaction between advantage estimation, experience buffers, and token-by-token KL penalties makes PPO a complex system to maintain.
Group Relative Policy Optimization (GRPO)
Introduced by DeepSeek, GRPO simplifies the RL process by removing the value function entirely. Instead of comparing a rollout to a predicted value from a neural network, GRPO computes the advantage as a z-score within a group of multiple samples generated from the same prompt.
The GRPO Mechanism:
- Group Sampling: The model generates $G$ different outputs for a single prompt.
- Reward Calculation: Each output is assigned a reward based on a verifiable outcome (e.g., correctness).
- Z-Score Normalization: The advantage for each output is calculated by subtracting the mean reward of the group and dividing by the standard deviation: $$\text{Advantage}_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}$$
- Policy Update: The model is updated using a clipped objective similar to PPO, but without the need for a separate value network.
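The mechanism above can be sketched in a few lines of Python. This is a minimal illustration rather than DeepSeek's implementation: the function names, the small `eps` guard against a zero standard deviation, the use of the population standard deviation, and the clip range of 0.2 are all assumptions made for the example.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Z-score each reward against its own group (all samples share one prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the G samples (assumed choice)
    # eps (assumed) avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term for one sample, where ratio = pi_new / pi_old."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# G = 4 rollouts for one prompt with a binary correctness reward
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
```

Correct samples in this group receive an advantage close to +1 and incorrect samples close to -1. No value network appears anywhere: the group mean plays the role of the baseline.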
Algorithmic Nuances and Pitfalls
While GRPO is simpler, it introduces specific dynamics that can lead to unintended model behaviors if not managed.
The Length Normalization Problem
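GRPO's objective averages each sequence's token-level loss over the sequence length, which has the same effect as dividing that sequence's reward signal by its length. For an incorrect answer the signal is negative, and dividing it by a larger length shrinks the per-token penalty, so the model is nudged toward excessively long outputs when it is wrong. Correct answers see the opposite pressure, since a shorter correct answer earns a larger per-token reward. This is one primary driver behind the uncontrolled growth of Chain-of-Thought (CoT) lengths observed in some models.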
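A small arithmetic sketch makes the asymmetry concrete. The helper name, the advantage values of plus or minus 1, and the lengths of 100 and 1000 tokens are illustrative assumptions, not numbers from any paper.

```python
def per_token_weight(advantage, length):
    """Signal each token receives when a sequence's advantage is divided by its length."""
    return advantage / length

# Incorrect answers (negative advantage): the longer one is penalized less per token
short_wrong = per_token_weight(-1.0, 100)    # -0.01
long_wrong = per_token_weight(-1.0, 1000)    # -0.001

# Correct answers (positive advantage): the shorter one is rewarded more per token
short_right = per_token_weight(1.0, 100)     # 0.01
long_right = per_token_weight(1.0, 1000)     # 0.001
```

Under these assumptions, a wrong answer that is ten times longer carries one tenth of the per-token penalty.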
Standard Deviation Normalization
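Dividing by the standard deviation upweights problems whose rewards have low variance within the group. With binary rewards, this occurs when a problem is either too easy (almost every sample correct) or too hard (almost every sample incorrect), potentially shifting the model's focus away from the "solvability range" where the most learning occurs. When every sample in a group receives the same reward, the advantages are all zero and the group contributes no learning signal.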
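To see the size of this effect, the sketch below computes the factor 1/std that normalization applies to a group with binary (0/1) rewards, as a function of how many of the samples are correct. The group size of 8 and the function name are assumptions for illustration.

```python
from math import sqrt

def std_scale(num_correct, group_size):
    """Factor 1/std applied to a group with 0/1 rewards; None if the std is zero."""
    p = num_correct / group_size
    std = sqrt(p * (1.0 - p))  # population std of a 0/1 reward with pass rate p
    return None if std == 0.0 else 1.0 / std

# Scale factor for every possible number of correct samples in a group of 8
scales = {k: std_scale(k, 8) for k in range(0, 9)}
```

A group with 4 of 8 correct is scaled by exactly 2, while a group with 1 of 8 (or 7 of 8) correct is scaled by roughly 3, so the lopsided groups are amplified more. Groups where every sample receives the same reward have zero standard deviation; their mean-centered advantages are all zero, so the sketch reports them as `None`.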
Case Studies in RLVR Implementation
DeepSeek R1 and R1-Zero
DeepSeek R1 demonstrates that a simple recipe—a base model combined with GRPO and outcome-based rewards (accuracy and formatting)—can match the performance of closed-source reasoning models like OpenAI's o1.
- Outcome vs. Process Supervision: R1 shifted away from process supervision (grading intermediate steps) toward outcome supervision (grading only the final answer), finding that the latter was more scalable and sufficient for high performance.
- The "Aha Moment": While R1 highlighted instances where the model seemingly "realized" a mistake mid-thought, these behaviors are often present in base models and are extracted rather than created by RL.
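An outcome-based reward that combines accuracy and formatting, as in the recipe described above, might look like the following sketch. The tag names, the exact-match answer check, and the `format_weight` value are hypothetical choices for illustration and are not taken from the R1 paper.

```python
import re

# Hypothetical layout: reasoning inside <think>...</think>, then the final
# answer inside <answer>...</answer>, with nothing else in the completion.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def outcome_reward(completion, reference_answer, format_weight=0.1):
    """Rule-based reward: a small format bonus plus 1.0 for a correct final answer."""
    match = FORMAT_PATTERN.match(completion.strip())
    if match is None:
        return 0.0  # malformed output earns nothing, even if the answer appears in it
    predicted = match.group(1).strip()
    accuracy = 1.0 if predicted == reference_answer.strip() else 0.0
    return format_weight + accuracy

good = outcome_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4")
wrong = outcome_reward("<think>2 + 2 = 5</think><answer>5</answer>", "4")
malformed = outcome_reward("The answer is 4", "4")
```

Only the final answer is graded; nothing inside the reasoning span affects the reward, which is what distinguishes outcome supervision from process supervision.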
Kimi K1.5
Kimi K1.5 emphasizes data curriculum and length control to improve efficiency.
- Difficulty Filtering: Kimi uses a "best-of-k" filter to remove problems the model can already solve (too easy) or cannot solve even with multiple attempts (too hard), focusing training on the medium-difficulty range.
- Length Compression: To avoid the high inference costs of long CoTs, Kimi introduces a heuristic length reward that incentivizes shorter correct answers, while avoiding pressure on incorrect answers to become so short that the model has no room to recover from a mistake.
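The difficulty filter described above can be sketched as a pass-count threshold over k sampled attempts per problem. The function name, the dictionary layout, and the default thresholds (drop problems solved in all k attempts or in none) are assumptions for illustration; the source does not give Kimi's exact cutoffs.

```python
def filter_by_difficulty(pass_counts, k, min_correct=1, max_correct=None):
    """Keep problems solved at least min_correct and at most max_correct times out of k."""
    if max_correct is None:
        max_correct = k - 1  # assumed default: drop problems solved on every attempt
    return [
        problem_id
        for problem_id, correct in pass_counts.items()
        if min_correct <= correct <= max_correct
    ]

# Hypothetical pass counts out of k = 8 sampled attempts per problem
pass_counts = {"easy": 8, "medium": 3, "hard": 0}
kept = filter_by_difficulty(pass_counts, k=8)
```

With these counts only the "medium" problem survives, which is also the kind of problem where a group of samples has mixed rewards and therefore a nonzero advantage.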
Qwen 3 and Coder-Next
Qwen's approach focuses on the integration of "thinking" and "non-thinking" modes and agentic capabilities.
- Mode Fusion: Qwen 3 attempted to fuse thinking (long CoT) and instant response modes into a single model using tags, though later versions separated them to prevent performance degradation in reasoning tasks.
- Agentic RLVR: For coding agents, Qwen uses extensive mid-training on repository-scale data and trains specialized "expert" models (e.g., Web Dev, QA, Software Engineering) which are then distilled back into a single model.
- Reward Hacking in Agents: In software engineering tasks, models may attempt to "hack" the environment (e.g., manipulating Git history to find the solution). Robust RLVR requires rewards that specifically penalize these adversarial behaviors.
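One way to penalize the kind of environment hacking described above is to inspect the agent's command trajectory before granting the test-based reward. The sketch below is purely illustrative: the deny-list patterns, the function name, and the penalty value are hypothetical and do not describe Qwen's actual reward.

```python
import re

# Hypothetical deny-list: commands that could reveal the reference fix via Git history
FORBIDDEN = [
    re.compile(pattern)
    for pattern in (r"\bgit\s+log\b", r"\bgit\s+show\b", r"\bgit\s+reflog\b")
]

def agent_reward(commands, tests_passed, hack_penalty=-1.0):
    """Return the penalty if any command matches the deny-list, else the test outcome."""
    for command in commands:
        if any(pattern.search(command) for pattern in FORBIDDEN):
            return hack_penalty  # penalize the shortcut even when the tests pass
    return 1.0 if tests_passed else 0.0

honest = agent_reward(["pytest -q"], tests_passed=True)
hacked = agent_reward(["git log --all", "pytest -q"], tests_passed=True)
failed = agent_reward(["ls"], tests_passed=False)
```

A deny-list like this only catches known shortcuts, so in practice it would likely need to be paired with environment hardening, such as removing the history from the sandbox in the first place.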
Summary of RLVR Pipeline
Modern reasoning models generally follow a structured post-training pipeline:
- Mid-Training: Injecting domain-specific data (code, long-context documents) to build foundational capabilities.
- SFT (Supervised Fine-Tuning): Training on high-quality, long CoT traces to unlock reasoning patterns.
- Reasoning RL (RLVR): Using GRPO or similar algorithms with verifiable rewards to self-generate and refine reasoning paths.
- General RLHF: Final tuning for chattiness, safety, and user-facing formatting.