Stanford CS336 Lecture 15: Mid-Training and Post-Training (SFT and RLHF)

Post-Training: From Base Models to Assistants

Post-training is the process of transforming a strong base model (like GPT-3) into a useful, instruction-following assistant (like ChatGPT). While pre-training provides the broad knowledge base—the "primordial soup"—post-training extracts specific behaviors, such as reliability and fine-grained control, through explicit data collection and steering.

Supervised Fine-Tuning (SFT)

SFT is the first phase of post-training, where a model is trained on high-quality input-output pairs. The primary challenge in SFT is not the algorithm, which is ordinary next-token prediction trained by gradient descent, but data curation.
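
To make the "standard" part concrete, here is a minimal sketch of an SFT loss over one training example. It assumes the per-token log-probabilities have already been computed by the model, and it assumes the loss is taken only over response tokens (a common choice, not something these notes specify). The function name sft_loss and the toy numbers are illustrative.

```python
import math


def sft_loss(token_log_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log p(token_t | tokens_<t) for each position.
    is_response:     True where the token belongs to the target output,
                     False where it belongs to the prompt (assumed masking).
    """
    picked = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)


# Toy example: two prompt tokens followed by two response tokens.
log_probs = [math.log(0.9), math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True]
loss = sft_loss(log_probs, mask)
```

Only the last two positions contribute to the loss here, so changing the prompt-token probabilities would leave it unchanged.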

Evolution of SFT Data

Data strategies have evolved from large-scale, programmatic datasets to high-quality, human-like interactions:

  • FLAN: Early attempts used existing NLP benchmarks to create multitask datasets. However, these were often unnatural in structure and contained hallucinations, suggesting that sheer scale matters less than quality.
  • Self-Instruct & Distillation (e.g., Alpaca, Vicuna): These methods use stronger models to generate synthetic instruction-following data, which reliably induces chat-like behavior in base models.
  • Human-Driven Efforts (e.g., Open Assistant): Crowdsourced efforts to create expert-level prompts and responses to match closed-source performance.
  • Agentic SFT (e.g., Nemotron): The current trend shifts from simple chat to agentic behavior, incorporating tool calls and structured to-do lists into the SFT data.

Key Pitfalls in SFT Data Collection

  • Style vs. Capability: Users often prefer responses that use bullet points or run longer, even if the underlying capability hasn't improved. This creates a risk of "length hacking," where preference and engagement signals rise without any real gain in capability.
  • The Hallucination Trap: Training a model on "tail knowledge" (facts the model doesn't already know) during SFT can induce hallucinations. When a model is forced to emit a fact it doesn't know using a specific format (e.g., "Reference: [Citation]"), it may learn to mimic the format by fabricating information.
  • Safety Tuning: Safety SFT involves balancing the "violation rate" (complying with harmful queries) against the "false refusal rate" (refusing harmless queries, like "how to kill a Python process"). This balance is often achieved with a few thousand targeted refusal examples.
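
The two safety metrics above can be computed from a labeled evaluation set. The sketch below assumes each record carries two hypothetical boolean fields, "harmful" (the query should be refused) and "refused" (the model refused); the field names and the toy data are illustrative, not from the lecture.

```python
def safety_rates(records):
    """Return (violation_rate, false_refusal_rate) for a labeled eval set."""
    harmful = [r for r in records if r["harmful"]]
    benign = [r for r in records if not r["harmful"]]
    if not harmful or not benign:
        raise ValueError("need at least one harmful and one benign query")
    # Violation: a harmful query that the model answered instead of refusing.
    violation_rate = sum(not r["refused"] for r in harmful) / len(harmful)
    # False refusal: a benign query that the model refused anyway.
    false_refusal_rate = sum(r["refused"] for r in benign) / len(benign)
    return violation_rate, false_refusal_rate


records = [
    {"harmful": True, "refused": True},
    {"harmful": True, "refused": False},   # violation
    {"harmful": False, "refused": False},
    {"harmful": False, "refused": True},   # e.g. "how to kill a Python process"
    {"harmful": False, "refused": False},
    {"harmful": False, "refused": False},
]
violation_rate, false_refusal_rate = safety_rates(records)
```

Reporting both numbers together matters: a model that refuses everything has a zero violation rate but a false refusal rate of one.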

Mid-Training: Blurring the Lines

Modern training pipelines often merge SFT into the pre-training phase. During the "decay phase" (the end of pre-training, when the learning rate is annealed), high-quality chat and SFT data are mixed with general web data. This lets instruction-style data be used at a larger scale than a separate SFT stage would allow, and it concentrates the highest-quality data at the point closest to deployment.
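
One way to picture this mixing is as a schedule for the fraction of each batch drawn from chat/SFT data. The sketch below is purely illustrative: the linear ramp, the 80% start point, and the 30% cap are assumptions chosen for the example, not values from the lecture or from any particular model's recipe.

```python
def sft_fraction(step, total_steps, decay_start=0.8, max_fraction=0.3):
    """Fraction of a batch drawn from high-quality chat/SFT data.

    Zero before the decay phase begins, then a linear ramp up to
    max_fraction at the final step (all hypothetical choices).
    """
    progress = step / total_steps
    if progress < decay_start:
        return 0.0
    return max_fraction * (progress - decay_start) / (1.0 - decay_start)


# Mixture at a few points of a 1000-step run.
schedule = {step: sft_fraction(step, 1000) for step in (0, 500, 800, 900, 1000)}
```

A step function (switching to a fixed mixture at the start of the decay phase) would fit the description equally well; the ramp is just one option.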

Reinforcement Learning from Human Feedback (RLHF)

RLHF shifts the objective from generative modeling (fitting a distribution) to reward maximization. It is used because humans are often better at rating outputs than generating them, and in some domains (like math), verification is easier than generation.

The RLHF Pipeline

  1. Sampling: The SFT model generates multiple candidate responses for a prompt.
  2. Ranking: Human raters rank these responses based on criteria like helpfulness, truthfulness, and harmlessness.
  3. Reward Modeling: A reward model is trained to predict these human preferences.
  4. Optimization: The policy is updated to maximize the reward, typically subject to a KL divergence penalty that keeps it from drifting too far from a reference model (usually the SFT model it started from) and becoming degenerate.
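
Steps 3 and 4 can be sketched numerically. The code below assumes a Bradley-Terry style pairwise loss for the reward model and a per-sample KL estimate (the log-probability gap between policy and reference) for the penalty. Both are common formulations, but these notes do not specify them, and the function names and the beta value are illustrative.

```python
import math


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the preferred response scores higher."""
    return neg_log_sigmoid(reward_chosen - reward_rejected)


def kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward minus beta times a single-sample estimate of the KL term."""
    return reward - beta * (logp_policy - logp_reference)


agree = reward_model_loss(2.0, 0.0)     # reward model agrees with the rater
disagree = reward_model_loss(0.0, 2.0)  # reward model disagrees
shaped = kl_penalized_reward(1.0, logp_policy=-5.0, logp_reference=-6.0)
```

The penalty grows when the policy assigns a response much higher probability than the reference model does, which is what discourages drift.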

The Role of the Annotator

Annotation has shifted from low-cost crowd work to high-cost expert labor. Specialized professionals (doctors, lawyers) are now paid significant hourly rates to provide high-fidelity feedback. The demographic and ideological makeup of the annotator pool can shape the model's final alignment, including its political leanings.

RLHF Algorithms: PPO vs. DPO

  • PPO (Proximal Policy Optimization): The traditional approach. It is complex, requiring a separate reward model and on-policy sampling, which is computationally expensive.
  • DPO (Direct Preference Optimization): A simpler alternative that needs neither a separate reward model nor on-policy sampling; it trains directly on a fixed dataset of preference pairs. DPO treats RLHF as a classification problem, taking gradient steps that increase the likelihood of the preferred response and decrease the likelihood of the rejected one, relative to a frozen reference model.
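
The classification view of DPO can be written as a loss on one preference pair. The sketch below assumes the summed log-probabilities of each response under the policy and under a frozen reference model are already available; the argument names and the beta value are illustrative.

```python
import math


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    return neg_log_sigmoid(beta * (chosen_gain - rejected_gain))


# At initialization the policy equals the reference, so the margin is zero.
initial = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After raising the chosen response's likelihood and lowering the rejected one's.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

No reward model appears anywhere: the preference pair itself supplies the training signal, with the reference model's log-probabilities acting as the anchor.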

Challenges in RLHF

  • Over-optimization: Pushing RLHF too far can lead to "reward hacking," where the model overfits the reward model rather than improving actual utility.
  • Mode Collapse: RLHF can reduce output diversity, as the model concentrates its output distribution on a single high-reward response for each input.
  • Calibration: RLHF often leaves models uncalibrated, meaning the model's confidence in its answer does not accurately reflect the probability of correctness.
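
Calibration can be measured by comparing stated confidence with observed accuracy. The sketch below uses expected calibration error (ECE) with equal-width bins; ECE is a standard metric, but it is an assumption that it matches how the lecture measured calibration, and the bin count and toy data are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# A model that says 90% and is right 3 times out of 4.
mild = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
# A model that says 95% and is right 1 time out of 4.
overconfident = expected_calibration_error([0.95] * 4, [1, 0, 0, 0])
```

A perfectly calibrated model would score zero: among answers given with 90% confidence, 90% would be correct.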
