VibeThinker 3B: Scaling Reasoning in Small Language Models

VibeThinker 3B demonstrates that small language models can achieve parity with massive proprietary models on verifiable reasoning tasks by focusing on search and constraint satisfaction rather than broad knowledge storage. Developed by the Weibo AI Lab, this 3B parameter model shows that specialized post-training recipes can unlock high-level reasoning in small footprints, though it lacks the general knowledge and nuance of larger models.

The Core Thesis: Reasoning vs. Knowledge

VibeThinker 3B is built on the premise that intelligence can be split into two types: verifiable reasoning and broad knowledge.

Verifiable Reasoning: Tasks such as mathematics and coding are viewed as problems of search, constraint satisfaction, and error correction. The researchers argue that these tasks do not require massive parameter counts to store facts, but rather a robust "engine" for figuring things out.
Broad Knowledge: Tasks involving long-tail facts or general science require significant raw parameter capacity to store information.

By focusing exclusively on the former, VibeThinker 3B aims to be a reasoning engine that can be paired with external tools (like search) to compensate for its lack of internal knowledge.

Architecture and Training Pipeline

VibeThinker 3B is not trained from scratch; it is a post-trained version of the Qwen 2.5 Coder 3B base model. The team employed a "spectrum to signal" principle to refine the model's reasoning capabilities.

Two-Stage Supervised Fine-Tuning (SFT)

Broad Coverage: The first stage focuses on a wide array of math, code, STEM topics, and general chat.
Hard Problem Focus: The second stage retrains the model specifically on difficult, long-horizon problems. To prevent shallow pattern matching, the team discarded reasoning traces under 5,000 tokens and removed easy problems.
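The second-stage data filter described above can be sketched as follows. This is a minimal illustration, not the team's actual pipeline: the `Example` record, the `difficulty` score, the `min_difficulty` cutoff, and the whitespace-based token count are all assumptions made for the example. Only the 5,000-token threshold comes from the text.

```python
from dataclasses import dataclass

MIN_TRACE_TOKENS = 5_000  # threshold stated in the text


@dataclass
class Example:
    problem: str
    trace: str          # the reasoning trace used as the SFT target
    difficulty: float   # hypothetical score in [0, 1]; higher means harder


def count_tokens(text: str) -> int:
    """Whitespace split as a stand-in for the model's real tokenizer."""
    return len(text.split())


def keep_for_stage_two(ex: Example, min_difficulty: float = 0.5) -> bool:
    """Keep only long traces attached to problems that are not easy."""
    long_enough = count_tokens(ex.trace) >= MIN_TRACE_TOKENS
    hard_enough = ex.difficulty >= min_difficulty
    return long_enough and hard_enough


def filter_hard_examples(examples, min_difficulty: float = 0.5):
    """Return the subset of examples that pass both filters."""
    return [ex for ex in examples if keep_for_stage_two(ex, min_difficulty)]
```

In practice the token count would come from the base model's tokenizer, and the difficulty signal would need its own definition (for instance, a pass rate measured on the stage-one model).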

Reinforcement Learning (RL)

The model is trained with MGPO (MaxEnt-Guided Policy Optimization), a variant of GRPO (Group Relative Policy Optimization). This approach weights training examples to de-emphasize both tasks that are too simple and tasks that are too difficult for the model's current level.
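One way to picture this kind of weighting is to score each problem by the binary entropy of its pass rate, so that problems the model always solves or never solves get a weight of zero and problems solved about half the time get the highest weight. The sketch below is an assumption-laden illustration of that idea, not the published MGPO objective: the function names and the choice of binary entropy as the weight are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a pass rate p; 0 at p=0 or p=1, 1 at p=0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def problem_weight(rewards: list[int]) -> float:
    """rewards holds 0/1 correctness for each sampled rollout of one problem."""
    if not rewards:
        return 0.0
    pass_rate = sum(rewards) / len(rewards)
    return binary_entropy(pass_rate)


def weighted_group_advantages(rewards: list[int]) -> list[float]:
    """GRPO-style group-normalized advantages, scaled by the problem weight."""
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        # All rollouts agree (all right or all wrong): no learning signal.
        return [0.0] * n
    weight = problem_weight(rewards)
    return [weight * (r - mean) / std for r in rewards]
```

With this weighting, a problem where half of the rollouts succeed contributes its full advantage, while a problem with one success in sixteen rollouts is scaled down.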

Optimization Techniques

Diversity Distillation: Instead of converging on a single solution path, the model samples from multiple checkpoints and merges them to maintain diverse answering strategies.
Long-to-Short Math RL: The model is first optimized for accuracy. Once accuracy is achieved, it is rewarded for shorter correct answers and penalized for unnecessary length, mimicking the optimization seen in proprietary reasoning models.
Claim Level Reliability (CLR): This is a test-time compute technique where the model generates multiple answers and then selects the most reliable one, significantly boosting benchmark performance.
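Two of the techniques above lend themselves to a short sketch. The first function illustrates a length-aware reward in the spirit of Long-to-Short Math RL; the second uses simple answer agreement across samples as a stand-in for the reliability-based selection step. Both are illustrative assumptions: the parameter names, the linear penalty, and the use of majority voting are not taken from the source, which does not describe how reliability is actually scored.

```python
from collections import Counter


def length_aware_reward(
    is_correct: bool,
    num_tokens: int,
    max_tokens: int = 8_000,      # hypothetical length at which the penalty saturates
    length_penalty: float = 0.5,  # hypothetical maximum deduction for long answers
) -> float:
    """Wrong answers earn 0; correct answers earn more when they are shorter."""
    if not is_correct:
        return 0.0
    used = min(1.0, num_tokens / max_tokens)
    return 1.0 - length_penalty * used


def select_by_agreement(answers: list[str]) -> tuple[str, float]:
    """Pick the most frequent final answer and report its share of the samples."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)
```

Because the penalty is capped below 1, a correct long answer still scores higher than any incorrect one, which keeps accuracy as the primary objective.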

Benchmark Performance

On specific reasoning benchmarks, VibeThinker 3B performs competitively against models significantly larger than itself, including Claude Opus 4.5, Gemini 3 Pro, and DeepSeek V3.2.

Math and Coding: The model is on par with or beats several proprietary giants on the AIME benchmarks, including AIME 26.
General Knowledge: The model performs poorly on general knowledge benchmarks (such as GPQA Diamond), trailing both large open-weight models and proprietary models. This is consistent with the trade-off: it excels at logic but lacks a broad store of facts.

Practical Observations and Limitations

While VibeThinker 3B is a powerful research tool, it exhibits specific behaviors that make it unsuitable for general production use:

Inefficient Token Usage: The model often employs extremely long chains of thought even for simple logic tests that do not require deep reasoning. It lacks the flexibility to scale its thinking process based on the complexity of the task.
Knowledge Gaps: The model struggles with tasks requiring spatial or visual representation. For example, when asked to generate an SVG of a pelican on a bicycle, it consumes thousands of thinking tokens but produces a poor visual result because it lacks the internal representation of what such an image looks like.
Language Drift: The model occasionally drifts between English and Chinese during generation.
Comparison to Large Models: In long-context retrieval tasks, VibeThinker 3B requires thousands of thinking tokens to answer, whereas a larger model (like GLM 5.2) can answer almost instantly with minimal thinking, demonstrating a higher level of inherent confidence and understanding.
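The language drift noted above is easy to flag automatically when the expected output language is English. The following is a rough heuristic sketch, with a hypothetical threshold and function names chosen for illustration; it only checks the main CJK Unified Ideographs range and is not a general language detector.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the main CJK ideograph block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(letters)


def has_language_drift(text: str, threshold: float = 0.05) -> bool:
    """Flag an English-expected response that contains a notable share of Chinese."""
    return cjk_ratio(text) > threshold
```

A check like this could be used to discard or resample drifting generations during evaluation.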

"This is certainly not a model that I'd use for production... It is a research project... the ideas that they've proposed could end up working out much better for a 9B model... or even going up to the 30B models."

Conclusion

VibeThinker 3B serves as a proof-of-concept for the "reasoning engine" approach. It proves that reinforcement learning from verifiable rewards can allow a 3B model to compete with models 300x its size in structured domains, provided the goal is specialized reasoning rather than general-purpose intelligence.
