The Next AI Training Paradigm: Beyond RLVR and Toward Continual Learning

The Core Bet: RLVR and its Limitations

AI labs are currently betting that training models on millions of verifiable tasks across thousands of diverse reinforcement learning (RL) environments will produce a general problem-solving agent. The hypothesis is that scaling this approach, Reinforcement Learning with Verifiable Rewards (RLVR), will overcome data inefficiency and the lack of continual learning, much as scaling compute resolved many natural language processing problems.

However, this paradigm relies on the assumption that in-context learning (ICL) can eventually replace the need for weight updates. Proponents argue that if context windows become effectively infinite, a model can simply store all the experience gained during a deployment session without needing to distill that knowledge back into its weights.

The "Grindability" Bottleneck

Verifiability alone is not enough for rapid AI progress; a domain must also be "grindable." A grindable domain allows for thousands of parallel rollouts against a deterministic, replayable simulator starting from the same point.

  • Successes: Coding and math are highly grindable because agents can be tested in identical containers with specific software repositories.
  • Failures: Computer use (e.g., navigating Amazon or Slack) has progressed more slowly because it is not trivially grindable. Running thousands of bots on live websites leads to account bans and requires labor-intensive cloning of applications to create simulators.
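As a rough illustration of what "grindable" means in practice, the sketch below runs many rollouts in parallel against a toy simulator that always resets to the same snapshot. Everything in it (ToySimulator, rollout, grind, the digit-guessing task) is a hypothetical stand-in, not a real RL environment or any lab's infrastructure; it only shows the two properties named above, a replayable starting state and a verifiable reward.

```python
import random
from concurrent.futures import ThreadPoolExecutor


class ToySimulator:
    """Hypothetical deterministic environment: guess a hidden digit."""

    def __init__(self, snapshot_seed: int):
        # The snapshot seed fully determines the starting state,
        # so every rollout can be reset to exactly the same point.
        self.target = random.Random(snapshot_seed).randrange(10)

    def step(self, action: int) -> float:
        # Verifiable reward: 1.0 for the exact answer, otherwise 0.0.
        return 1.0 if action == self.target else 0.0


def rollout(snapshot_seed: int, policy_seed: int, max_steps: int = 3) -> float:
    """One attempt by a random policy; policy_seed makes it replayable."""
    env = ToySimulator(snapshot_seed)
    policy_rng = random.Random(policy_seed)
    for _ in range(max_steps):
        if env.step(policy_rng.randrange(10)) == 1.0:
            return 1.0
    return 0.0


def grind(snapshot_seed: int, n_rollouts: int) -> list:
    """Run many rollouts in parallel from the same snapshot."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda s: rollout(snapshot_seed, s), range(n_rollouts)))


if __name__ == "__main__":
    rewards = grind(snapshot_seed=0, n_rollouts=1000)
    print("success rate:", sum(rewards) / len(rewards))
```

A live website offers neither property: there is no snapshot to reset to, and a thousand concurrent bots change the very state they are trying to learn from.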

This distinction reveals a critical gap: many essential human skills, such as building a business, winning court cases, or devising political strategy, cannot be simulated in a data center. These environments are reset-free and non-stationary, so the model must learn from scarce real-world interactions in which outer-loop verification may take months or years.

The Necessity of Continual Learning

To achieve human-level proficiency in complex, real-world domains, AI must move beyond RLVR and implement continual learning—the ability to update weights based on deployment experience.

The Failure of Pure In-Context Learning

While in-context learning is sample-efficient, it scales poorly in memory: the KV cache grows linearly with the number of tokens held in context. Human learning does not work by recalling every observation with perfect fidelity; instead, it compresses information into intuitions and big-picture knowledge stored in the weights. Relying solely on context windows creates a "savant-type" ability to recall data, which can actually cripple the ability to understand abstractions and metaphors.
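To make the memory cost concrete, here is a back-of-the-envelope sketch of KV cache size for a standard transformer decoder. The model dimensions are hypothetical values chosen only for illustration (they are not taken from the source or from any specific model), and the formula assumes a plain attention cache with no compression, eviction, or sharing.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size in bytes.

    The leading factor 2 accounts for storing one key tensor and one
    value tensor per layer. bytes_per_value=2 assumes 16-bit storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    config = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for tokens in (8_000, 128_000, 1_000_000):
        gib = kv_cache_bytes(seq_len=tokens, **config) / 2**30
        print(f"{tokens:>9} tokens -> {gib:8.2f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so a session of a million tokens needs on the order of a hundred gibibytes for a single user, and the cost keeps growing for as long as the session does.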

The Sample Efficiency Problem

Updating weights via gradient descent is notoriously sample-inefficient. Most current online-learning models (like Cursor Tab) only work because they learn the same objective across millions of users. True continual learning requires the model to learn specific, unique information about a particular organization or problem from a single session—data that is too scarce for traditional supervised fine-tuning (SFT).

Proposed Solutions for Weight Updates

To bridge the gap between scarce real-world data and weight updates, two primary technical paths are proposed:

On-Policy Self-Distillation (OPSD)

OPSD trains the base model (the student) to match the predictions of a "teacher": the same model, but conditioned on the full context accumulated over a long session.

  • Advantage over RLVR: OPSD does not require an outer-loop verifiable reward; it only requires that the model can learn the correct behavior within its context window.
  • Advantage over SFT: Unlike SFT, which naively predicts all observed tokens, OPSD (like RL) is sparse. It only extracts the knowledge necessary to achieve the same results as the teacher, preventing the model from overwriting existing knowledge or memorizing irrelevant transcripts.
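The training signal described above can be sketched in a few lines. This is a minimal toy, not the method as any lab implements it: student_dist and teacher_dist are hypothetical stand-ins that return fixed next-token distributions in place of real model calls, and using the KL divergence from student to teacher as the loss is an assumption, since the source only says the student should match the teacher's predictions.

```python
import math
import random

VOCAB = ["yes", "no", "maybe"]


def student_dist(prefix):
    # Hypothetical stand-in: the model WITHOUT the session context.
    return [0.4, 0.4, 0.2]


def teacher_dist(session_context, prefix):
    # Hypothetical stand-in: the SAME model WITH the session context.
    return [0.8, 0.1, 0.1]


def kl(p, q):
    """KL(p || q) for two distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def opsd_loss(session_context, n_tokens, rng):
    """Mean per-token KL(student || teacher) along a student-sampled rollout."""
    prefix, total = [], 0.0
    for _ in range(n_tokens):
        p_student = student_dist(prefix)
        p_teacher = teacher_dist(session_context, prefix)
        total += kl(p_student, p_teacher)
        # On-policy: the next token is sampled from the student, so the
        # teacher is always scored on prefixes the student actually visits.
        prefix.append(rng.choices(VOCAB, weights=p_student)[0])
    return total / n_tokens


if __name__ == "__main__":
    loss = opsd_loss("long session transcript", n_tokens=5, rng=random.Random(0))
    print("mean per-token KL:", loss)
```

In a real system the loss would be minimized by gradient descent on the student's weights. The point of the sketch is only that the signal comes from the teacher's distribution on the student's own samples, not from reproducing the session transcript token by token.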

"Dreaming" (Test-Time Training)

A more speculative approach is "dreaming," where an AI builds its own internal simulation of reality to rehearse skills and try alternative strategies.

  • Precedent: EfficientZero demonstrated that a model could outperform humans in Atari games by playing dozens of simulated games in its head for every one real-world step.
  • Application: If LLMs can spend compute writing their own RL environments and training against them, this would create a fourth axis of scaling (alongside pretraining, RL, and inference-time compute).
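The idea of rehearsing inside a learned model can be illustrated with tabular Dyna-Q, a much simpler classical technique than EfficientZero and not what the source describes for LLMs. In this sketch the agent records each real transition in a learned model and then replays several imagined transitions from that model per real step. The chain environment, hyperparameters, and function names are all assumptions made for the example.

```python
import random

N_STATES, ACTIONS, GOAL = 6, (-1, +1), 5


def real_step(state, action):
    """The 'real world': a short chain with a reward at the far end."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def dyna_q(real_steps, dreams_per_step, seed=0, alpha=0.5, gamma=0.9, epsilon=0.2):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned world model: (state, action) -> (next_state, reward)

    def update(s, a, r, s2):
        # Standard Q-learning update; the goal state is terminal.
        future = 0.0 if s2 == GOAL else gamma * max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (r + future - q[(s, a)])

    state = 0
    for _ in range(real_steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            # Greedy action with random tie-breaking.
            action = max(ACTIONS, key=lambda a: (q[(state, a)], rng.random()))
        nxt, reward = real_step(state, action)
        update(state, action, reward, nxt)       # learn from real experience
        model[(state, action)] = (nxt, reward)   # remember how the world behaved
        for _ in range(dreams_per_step):         # "dream": replay imagined steps
            s, a = rng.choice(list(model))
            s2, r = model[(s, a)]
            update(s, a, r, s2)
        state = 0 if nxt == GOAL else nxt
    return q


if __name__ == "__main__":
    q = dyna_q(real_steps=500, dreams_per_step=20)
    print("value of moving right from the start:", q[(0, +1)])
```

Each real step is followed by dreams_per_step imagined updates, so compute is spent in place of real-world interaction. That trade is what matters for reset-free domains where real experience is scarce.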

The 2027-2028 Vision

The transition to the next paradigm is expected to follow a specific sequence:

  1. RLVR as a Foundation: RLVR creates an agent competent enough to iterate and handle roadblocks in unfamiliar problems.
  2. Broad Deployment: This competent agent is deployed into the real world to engage in actual work.
  3. Continual Learning Loop: Using techniques like OPSD or dreaming, the model distills the experience from these real-world sessions back into its weights.

In this future, the primary driver of AI improvement shifts from pre-release training to experience accumulated through broad economic deployment: models become smarter by learning from every interaction with every user in real time.
