VibeThinker-3B: Achieving Frontier-Level Reasoning in Small Language Models

VibeThinker-3B demonstrates that verifiable reasoning capabilities can be compressed into small-scale models, achieving performance that matches or exceeds that of flagship models orders of magnitude larger. Using a Spectrum-to-Signal post-training paradigm, this 3B-parameter dense model reaches frontier-level scores on mathematics and coding benchmarks without sacrificing instruction controllability.

Performance Benchmarks in Verifiable Reasoning

VibeThinker-3B achieves state-of-the-art results on highly demanding verifiable tasks, placing it in the same performance band as first-tier reasoning systems like DeepSeek V3.2, GLM-5, and Gemini 3 Pro.

Key performance metrics include:

  • AIME26: Scored 94.3, rising to 97.1 with claim-level test-time scaling.
  • LiveCodeBench v6: Achieved a Pass@1 of 80.2.
  • LeetCode Contests: Demonstrated strong out-of-distribution generalization with a 96.1% acceptance rate on recent unseen contests.
  • IFEval: Scored 93.4, confirming that the focus on extreme reasoning does not degrade the model's ability to follow strict instructions.
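
For readers unfamiliar with the Pass@1 metric cited above, the following is a minimal sketch of the widely used unbiased pass@k estimator. The source does not state how its Pass@1 figure was computed, so the sampling setup here (n samples per problem, c of them correct) and the function names are illustrative assumptions rather than the authors' evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number that passed all tests,
    k: budget of attempts. These names are illustrative assumptions.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over the benchmark, so a reported 80.2 would correspond to a mean per-problem success rate of 0.802 under this assumed setup.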

The Spectrum-to-Signal Post-Training Pipeline

The model's capabilities are derived from a systematic optimization pipeline designed to push the boundaries of verifiable reasoning within a small-model regime. This pipeline consists of three primary phases:

  1. Curriculum-based Supervised Fine-Tuning (SFT): Initial training focused on structured learning paths.
  2. Multi-domain Reinforcement Learning (RL): Utilizing Group Relative Policy Optimization (GRPO) to refine reasoning paths across various domains.
  3. Offline Self-Distillation: Further enhancing the model's internal logic and consistency.
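
The core idea behind GRPO in the second phase can be sketched as follows: several responses are sampled for the same prompt, each is scored by a verifier, and each response's advantage is its reward normalized against the rest of its group, which removes the need for a separate value model. This is a minimal sketch of that normalization step only. The binary rewards, the epsilon term, and the use of the population standard deviation are assumptions for illustration, not details confirmed by the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own sampling group.

    rewards: verifier scores for responses sampled from one prompt
    (for example 1.0 if the final answer checks out, else 0.0).
    eps: small constant (an assumption) to avoid division by zero
    when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is also plausible
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers, two of them verified correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, while a group in which every response earns the same reward yields all-zero advantages and therefore no learning signal from that prompt.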

The Parametric Compression-Coverage Hypothesis

The development of VibeThinker-3B supports the Parametric Compression-Coverage Hypothesis. This theory posits a fundamental distinction between the types of knowledge required for different LLM capabilities:

  • Verifiable Reasoning: This capability is viewed as compressible into "compact reasoning cores," meaning high-level logic and problem-solving can be achieved with relatively few parameters.
  • Open-Domain Knowledge: General-purpose competence, factual recall, and handling long-tail scenarios require "broad parameter coverage," necessitating larger models to store the vast array of facts and concepts.

This hypothesis suggests that small models are not merely efficient alternatives for deployment, but a viable complementary path toward achieving frontier performance in specific, parameter-dense capability regimes like mathematical and logical reasoning.
