verl-agent: a reinforcement learning framework for training long-horizon LLM agents with customizable memory and step-independent rollouts

What it solves

verl-agent addresses the scalability problems of training LLM agents on long-horizon tasks. Traditional methods often concatenate the entire interaction history, so the context grows with every turn and eventually hits token limits or makes training inefficient. This project provides a framework for training agents via reinforcement learning (RL) that handles multi-turn interactions without that growth in context length.

How it works

The framework implements a step-independent multi-turn rollout mechanism. Instead of appending all previous turns, it allows for fully customizable per-step input structures and history management. This means developers can define exactly what information (e.g., recent steps, summaries, or key events) is passed to the model at each step, keeping the context length nearly constant.
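The idea of a per-step input built from a bounded memory can be sketched as follows. This is a minimal illustration, not verl-agent's actual API: the names SlidingWindowMemory, build_step_prompt, and rollout_prompt_lengths are hypothetical, the "model" is a stand-in string, and prompt length is measured in characters as a rough proxy for tokens.

```python
from collections import deque


class SlidingWindowMemory:
    """Hypothetical memory module: keeps only the last `window` steps."""

    def __init__(self, window=2):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.steps = deque(maxlen=window)

    def store(self, observation, action):
        self.steps.append((observation, action))

    def fetch(self):
        if not self.steps:
            return "(none)"
        return "\n".join(f"obs: {o} | act: {a}" for o, a in self.steps)


def build_step_prompt(task, memory, observation):
    """Assemble the input for ONE step from the task, memory, and current observation."""
    return (
        f"Task: {task}\n"
        f"Recent history:\n{memory.fetch()}\n"
        f"Current observation: {observation}\n"
        "Next action:"
    )


def rollout_prompt_lengths(num_steps=30, window=2):
    """Run a toy rollout and record the prompt length at every step."""
    memory = SlidingWindowMemory(window)
    lengths = []
    for t in range(num_steps):
        observation = f"room {t:02d}"
        prompt = build_step_prompt("find the key", memory, observation)
        lengths.append(len(prompt))
        action = f"go {t:02d}"  # stand-in for the model's generated action
        memory.store(observation, action)
    return lengths


lengths = rollout_prompt_lengths()
print(lengths[0], lengths[2], lengths[-1])
```

Once the window is full, every later prompt in this toy rollout has the same length, whereas a prompt that concatenated all previous turns would keep growing with each step. Swapping the sliding window for a summary or a list of key events only requires a different `fetch` implementation.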

It integrates with the veRL library and supports a variety of RL algorithms (including the project's own GiGPO) and parallelized Gym-style environments to enable high-throughput training. It also supports both text-only and vision-language modalities.
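As a rough picture of what stepping a batch of Gym-style environments in parallel looks like, here is a small sketch. It is an assumption-laden toy, not the project's environment code: CountdownEnv and ParallelEnvs are hypothetical names, a thread pool stands in for whatever parallelism the framework actually uses, and `step` returns the classic four-value Gym tuple (observation, reward, done, info).

```python
from concurrent.futures import ThreadPoolExecutor


class CountdownEnv:
    """Toy Gym-style environment: the episode ends when the counter reaches zero."""

    def __init__(self, start):
        self.start = start
        self.counter = start

    def reset(self):
        self.counter = self.start
        return f"counter={self.counter}"

    def step(self, action):
        if action == "dec":
            self.counter -= 1
        done = self.counter <= 0
        reward = 1.0 if done else 0.0
        return f"counter={self.counter}", reward, done, {}


class ParallelEnvs:
    """Hypothetical wrapper that resets and steps several environments at once."""

    def __init__(self, envs):
        self.envs = envs
        self.pool = ThreadPoolExecutor(max_workers=len(envs))

    def reset(self):
        # pool.map preserves input order, so results line up with self.envs.
        return list(self.pool.map(lambda env: env.reset(), self.envs))

    def step(self, actions):
        results = list(
            self.pool.map(lambda pair: pair[0].step(pair[1]), zip(self.envs, actions))
        )
        # Transpose [(obs, reward, done, info), ...] into four parallel lists.
        obs, rewards, dones, infos = map(list, zip(*results))
        return obs, rewards, dones, infos


envs = ParallelEnvs([CountdownEnv(1), CountdownEnv(2)])
first_obs = envs.reset()
obs, rewards, dones, infos = envs.step(["dec", "dec"])
print(first_obs, obs, rewards, dones)
```

Each training step then produces one batch of observations for the model and consumes one batch of actions, which is what lets rollout collection scale with the number of environments.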

Who it’s for

It is designed for AI researchers and developers building reasoning agents for complex, multi-step tasks in environments ranging from embodied AI (ALFWorld) and digital interfaces (WebShop, AppWorld) to visual games (Sokoban) and tool-calling search tasks.

Highlights

  • Customizable Memory: Flexible memory modules allow developers to choose exactly what history to include for each step.
  • Long-Horizon Scalability: Keeps context length nearly constant over time, supporting tasks that require 30–50 steps.
  • Broad Algorithm Support: Includes implementations of GiGPO, GRPO, PPO, DAPO, GSPO, RLOO, and REINFORCE++.
  • Multi-Modal Capability: Supports training vision-language agents (e.g., using Qwen3-VL) for tasks requiring visual perception.
  • Efficient Training: Supports LoRA fine-tuning to reduce computational costs, enabling 7B model training on two H100 GPUs.
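To give a flavor of the group-based algorithms in the list above, the following sketch computes GRPO-style group-relative advantages: each rollout's reward is normalized against the other rollouts sampled for the same task. It is an illustration only, not the project's implementation; the function name and the `eps` parameter are hypothetical, and using the population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards against sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same task, two successes and two failures.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Successful rollouts receive a positive advantage and failed ones a negative advantage, without needing a separate learned value model.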
