Stanford CS336 Lecture 15: Mid-Training and Post-Training (SFT and RLHF)
Post-Training: From Base Models to Assistants
Post-training is the process of transforming a strong base model (like GPT-3) into a useful, instruction-following assistant (like ChatGPT). While pre-training provides the broad knowledge base—the "primordial soup"—post-training extracts specific behaviors, such as reliability and fine-grained control, through explicit data collection and steering.
Supervised Fine-Tuning (SFT)
SFT is the first phase of post-training, where a model is trained on high-quality input-output pairs. The primary challenge in SFT is not the algorithm—which is standard gradient descent—but the data curation.
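The SFT objective can be sketched as an ordinary next-token negative log-likelihood. The version below assumes the common practice, not stated in these notes, of computing the loss only over response tokens and masking out the prompt. The function name `sft_loss` and the probabilities are hypothetical stand-ins for real model outputs.

```python
import math

def sft_loss(token_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each correct next token.
    is_response: True where the token belongs to the response, False for
                 prompt tokens (assumption: prompt tokens are masked out).
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# Hypothetical example: 3 prompt tokens followed by 2 response tokens.
probs = [0.9, 0.8, 0.7, 0.5, 0.25]
mask = [False, False, False, True, True]
loss = sft_loss(probs, mask)  # averages -log(0.5) and -log(0.25)
```

Gradient descent on this quantity is all the "algorithm" there is; everything that differentiates one SFT run from another sits in which (prompt, response) pairs are fed in.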
Evolution of SFT Data
Data strategies have evolved from large-scale, programmatic datasets to high-quality, human-like interactions:
- FLAN: Early attempts converted existing NLP benchmarks into multitask instruction datasets. The resulting examples were often unnatural in structure and sometimes contained hallucinations, which suggests that data quality matters more than sheer scale.
- Self-Instruct & Distillation (e.g., Alpaca, Vicuna): These methods use stronger models to generate synthetic instruction-following data, which reliably induces chat-like behavior in base models.
- Human-Driven Efforts (e.g., Open Assistant): Crowdsourced efforts to create expert-level prompts and responses to match closed-source performance.
- Agentic SFT (e.g., Nemotron): The current trend shifts from simple chat to agentic behavior, incorporating tool calls and structured to-do lists into the SFT data.
Key Pitfalls in SFT Data Collection
- Style vs. Capability: Users often prefer responses with bullet points or longer lengths, even if the underlying capability hasn't improved. This creates a risk of "length hacking" where engagement signals increase without actual intelligence gains.
- The Hallucination Trap: Training a model on "tail knowledge" (facts the model doesn't already know) during SFT can induce hallucinations. When a model is forced to emit a fact it doesn't know using a specific format (e.g., "Reference: [Citation]"), it may learn to mimic the format by fabricating information.
- Safety Tuning: Safety SFT balances the "violation rate" (how often the model complies with harmful queries) against the "false refusal rate" (how often it refuses harmless queries, such as "how to kill a Python process"). This balance is often achieved with a few thousand targeted refusal examples.
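The two safety metrics can be made concrete with a small sketch. It assumes each evaluation record is a hypothetical pair `(is_harmful_prompt, model_refused)`; the record format, the function name `safety_rates`, and the data are illustrative only.

```python
def safety_rates(records):
    """Compute (violation_rate, false_refusal_rate).

    records: list of (is_harmful_prompt, model_refused) boolean pairs.
    Assumes at least one harmful and one benign prompt are present.
    """
    harmful = [refused for is_harmful, refused in records if is_harmful]
    benign = [refused for is_harmful, refused in records if not is_harmful]
    # Violation: the model answered a harmful prompt instead of refusing.
    violation_rate = sum(not refused for refused in harmful) / len(harmful)
    # False refusal: the model refused a harmless prompt.
    false_refusal_rate = sum(benign) / len(benign)
    return violation_rate, false_refusal_rate

# Hypothetical evaluation set: 4 harmful prompts, 5 benign prompts.
records = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, False), (False, False),
]
violation_rate, false_refusal_rate = safety_rates(records)
```

A model that refuses everything drives the first number to zero and the second to one, which is why the two rates have to be tracked together.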
Mid-Training: Blurring the Lines
Modern training pipelines often fold SFT-style data into the end of pre-training. During the "decay phase" (the end of pre-training), high-quality chat and SFT data are mixed with general web data. This lets instruction tuning scale beyond a small, separate SFT stage and places the highest-quality data at the point closest to deployment.
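One way to picture this is as a data-mixture schedule that changes once the decay phase begins. The sketch below is purely illustrative: the function name, the 80% decay start, the 30% high-quality fraction, and the hard switch between phases are all assumptions, not values from the lecture.

```python
def mixture_weights(step, total_steps, decay_start_frac=0.8, hq_frac=0.3):
    """Return (web_fraction, high_quality_fraction) for a training step.

    Assumption: before the decay phase the mix is pure web data; during the
    decay phase a fixed share of each batch is high-quality chat/SFT data.
    """
    in_decay_phase = step >= decay_start_frac * total_steps
    hq = hq_frac if in_decay_phase else 0.0
    return 1.0 - hq, hq

# Hypothetical 1000-step run.
early = mixture_weights(100, 1000)  # before the decay phase
late = mixture_weights(900, 1000)   # inside the decay phase
```

Real pipelines may ramp the mixture gradually or include some instruction data throughout; the hard switch here only keeps the idea easy to read.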
Reinforcement Learning from Human Feedback (RLHF)
RLHF shifts the objective from generative modeling (fitting a distribution) to reward maximization. It is used because humans are often better at rating outputs than generating them, and in some domains (like math), verification is easier than generation.
The RLHF Pipeline
- Sampling: The SFT model generates multiple candidate responses for a prompt.
- Ranking: Human raters rank these responses based on criteria like helpfulness, truthfulness, and harmlessness.
- Reward Modeling: A reward model is trained to predict these human preferences.
- Optimization: The policy is updated to maximize the reward, typically with a KL divergence penalty that keeps it from drifting too far from a reference model (usually the SFT model) and becoming degenerate.
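The reward modeling and optimization steps above can be sketched numerically. This assumes the commonly used Bradley-Terry pairwise loss for the reward model and a per-sample KL penalty against a reference policy; the notes do not specify either formulation, and the function names and the `beta` value are illustrative.

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    Assumption: Bradley-Terry preference model over scalar rewards.
    """
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward minus beta times a single-sample KL estimate.

    logp_policy, logp_ref: log-probability of the sampled response under the
    current policy and under the reference model.
    """
    return reward - beta * (logp_policy - logp_ref)

# Hypothetical numbers for one preference pair and one sampled response.
rm_loss = reward_model_loss(r_chosen=2.0, r_rejected=0.5)
shaped = kl_penalized_reward(reward=1.0, logp_policy=-10.0, logp_ref=-12.0)
```

The penalty term grows as the policy assigns more probability than the reference does to its own samples, which is what discourages drift.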
The Role of the Annotator
Annotation has shifted from low-cost crowd-working to high-cost expert labor. Specialized professionals (doctors, lawyers) are now paid significant hourly rates to provide high-fidelity feedback. The demographic and ideological makeup of annotators directly influences the model's final alignment and political leanings.
RLHF Algorithms: PPO vs. DPO
- PPO (Proximal Policy Optimization): The traditional approach. It is complex, requiring a separate reward model and on-policy sampling, which is computationally expensive.
- DPO (Direct Preference Optimization): A simpler alternative that eliminates the reward model and on-policy sampling. DPO treats RLHF as a classification problem, taking gradient steps to increase the likelihood of the preferred response and decrease the likelihood of the rejected one.
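The classification view of DPO can be written out for a single preference pair. The sketch below assumes sequence-level log-probabilities have already been computed under the policy and under a frozen reference model; the function name, argument names, and `beta=0.1` are illustrative choices.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Computes -log(sigmoid(beta * (chosen_logratio - rejected_logratio))),
    where each log-ratio compares the policy against the reference model.
    """
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-margin))

# Hypothetical log-probabilities: the policy has shifted toward the chosen response.
loss_improved = dpo_loss(-8.0, -14.0, ref_logp_chosen=-10.0, ref_logp_rejected=-10.0)
# When the policy equals the reference, the margin is zero.
loss_at_init = dpo_loss(-10.0, -10.0, ref_logp_chosen=-10.0, ref_logp_rejected=-10.0)
```

Minimizing this loss raises the chosen response's log-ratio and lowers the rejected one's, with no sampling from the policy and no separate reward model in the loop.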
Challenges in RLHF
- Over-optimization: Pushing RLHF too far can lead to "reward hacking," where the model overfits the reward model rather than improving actual utility.
- Mode Collapse: RLHF can reduce output diversity, as the model concentrates its probability mass on a narrow set of high-reward responses for each input.
- Calibration: RLHF often leaves models poorly calibrated, meaning the model's stated confidence in an answer no longer matches the probability that the answer is correct.
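Calibration can be quantified with a measure such as expected calibration error (ECE), which compares average confidence to accuracy within confidence bins. ECE is one common choice and is not named in these notes; the equal-width binning scheme and the data below are assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average of |mean confidence - accuracy| over equal-width bins.

    confidences: model confidence in [0, 1] for each answer.
    correct: whether each answer was actually right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical overconfident model: claims 90% confidence but is right half the time.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
```

A well-calibrated model would score near zero here; the gap in this example comes entirely from confidence exceeding accuracy.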