AI21 Maestro: Optimizing Accuracy, Cost, and Latency in Real-World Agents

The Agent Optimization Trade-off

Optimizing AI agents typically involves a "vicious cycle" in which accuracy, cost, and latency are in constant tension: improving one usually degrades at least one of the others. Traditionally, developers rely on hardcoded heuristics to decide which models, tools, and compute-scaling strategies to use, which leaks efficiency across all three dimensions.

Strategies for Agent Performance

To improve agent performance, optimization generally falls into two primary categories: configuration and scaling.

Configuration Optimization

Configuration involves selecting the right components for the agent's harness. This includes:

  • Model Selection: Testing various LLMs to determine which performs best for a specific task.
  • Prompt Engineering: Manually tweaking prompts or using automatic prompt optimization tools such as DSPy or GEPA.
  • Tool Integration: Selecting and optimizing the combination of tools provided to the agent, as too many tools can degrade performance while too few may leave the agent incapable of solving the task.
  • Guardrails: Implementing systematic flows and safety boundaries around the execution process.

Inference-Time Compute Scaling

Scaling allows developers to "use more to get more" by increasing the compute allocated to a task at runtime.

Vertical Scaling

Vertical scaling focuses on increasing the depth of reasoning. This includes longer reasoning chains, more iterations of a ReAct loop, or critique-repair loops in which one LLM judges the output and another repairs it.
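A critique-repair loop can be sketched roughly as follows. This is a minimal illustration, not Maestro's implementation: the critic and repairer are plain callables standing in for two LLM calls, and the names critique_repair, toy_critic, and toy_repairer are hypothetical.

```python
from typing import Callable, Tuple


def critique_repair(
    draft: str,
    critic: Callable[[str], Tuple[bool, str]],
    repairer: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Judge the output, repair it on failure, and repeat up to max_rounds.

    Returns the final output and the number of repairs that were applied.
    """
    output = draft
    for round_idx in range(max_rounds):
        ok, feedback = critic(output)
        if ok:
            return output, round_idx
        output = repairer(output, feedback)
    return output, max_rounds


# Toy stand-ins for the judge LLM and the repair LLM (assumptions for the demo).
def toy_critic(text: str) -> Tuple[bool, str]:
    if not text.endswith("."):
        return False, "missing final period"
    if not text[0].isupper():
        return False, "missing capital letter"
    return True, ""


def toy_repairer(text: str, feedback: str) -> str:
    if feedback == "missing final period":
        return text + "."
    if feedback == "missing capital letter":
        return text[0].upper() + text[1:]
    return text


result, rounds = critique_repair("hello world", toy_critic, toy_repairer)
```

The max_rounds cap is what makes this a compute knob: raising it buys more chances to repair at the price of more calls and higher latency.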

Horizontal Scaling

Horizontal scaling leverages the probabilistic nature of LLMs through techniques like best-of-n sampling. By running multiple parallel samples and using an LLM-as-a-judge or a deterministic function (e.g., running code tests) to rank the results, agents can achieve significantly higher accuracy.

For example, on the BrowseComp-Plus benchmark, a lower-performing model such as MiniMax run with 8 to 16 samples can match the state-of-the-art accuracy of a high-end model such as GPT-5 run only once, while potentially offering better latency because the samples execute in parallel.
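As a rough sketch of best-of-n with a deterministic ranking function, the snippet below draws n candidates and keeps the one that passes the most tests. The generator is a toy pool of candidate functions rather than a real LLM call, and all names (best_of_n, coverage, POOL, TESTS) are illustrative assumptions. The coverage helper gives the chance that at least one of n samples is correct if each succeeds independently with probability p; it is an upper bound on what a judge can select, and real samples are rarely fully independent.

```python
from typing import Callable, List, Tuple

Candidate = Callable[[int], int]


def best_of_n(
    generate: Callable[[int], Candidate],
    score: Callable[[Candidate], float],
    n: int,
) -> Tuple[Candidate, float]:
    """Draw n candidates and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(i) for i in range(n)]
    scores = [score(c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


# Toy "samples": three attempts at implementing abs(); only the last is right.
POOL: List[Candidate] = [
    lambda x: x,
    lambda x: -x,
    lambda x: x if x >= 0 else -x,
]
TESTS = [(3, 3), (-2, 2), (0, 0)]


def toy_generate(i: int) -> Candidate:
    return POOL[i % len(POOL)]


def toy_score(candidate: Candidate) -> float:
    """Deterministic ranking: fraction of test cases the candidate passes."""
    return sum(candidate(x) == y for x, y in TESTS) / len(TESTS)


_, score_1 = best_of_n(toy_generate, toy_score, n=1)
best_3, score_3 = best_of_n(toy_generate, toy_score, n=3)
```

In a real agent the n generate calls would run concurrently, which is why wall-clock latency can stay close to that of a single sample.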

The Pareto Frontier and Ensemble Approaches

By plotting different configurations (models and tools) by cost or latency against accuracy, developers can identify the Pareto frontier: the set of configurations that no other configuration beats on accuracy without costing more or running slower. These are the configurations that give the "best bang for the buck".
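For two dimensions (cost and accuracy), the frontier can be computed with a single sorted pass, as in the sketch below. The configuration names and numbers are made up for illustration and are not benchmark results.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    cost: float      # lower is better
    accuracy: float  # higher is better


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the non-dominated configs, sorted by increasing cost.

    After sorting by cost (ties broken by higher accuracy first), a config
    is on the frontier only if it is strictly more accurate than every
    config that is cheaper or equally priced.
    """
    frontier: List[Config] = []
    best_accuracy = float("-inf")
    for cfg in sorted(configs, key=lambda c: (c.cost, -c.accuracy)):
        if cfg.accuracy > best_accuracy:
            frontier.append(cfg)
            best_accuracy = cfg.accuracy
    return frontier


# Hypothetical measurements (cost in arbitrary units).
configs = [
    Config("A", cost=1.0, accuracy=0.50),
    Config("B", cost=2.0, accuracy=0.45),  # dominated by A
    Config("C", cost=3.0, accuracy=0.70),
    Config("D", cost=3.0, accuracy=0.65),  # dominated by C
    Config("E", cost=5.0, accuracy=0.70),  # dominated by C
    Config("F", cost=8.0, accuracy=0.80),
]
frontier = pareto_frontier(configs)
frontier_names = [cfg.name for cfg in frontier]
```

Adding latency as a third axis requires a general dominance check (compare every pair on all three metrics) rather than this one-pass shortcut.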

An ensemble approach, which uses a diverse portfolio of models, pushes this frontier further. Because different models often solve different subsets of tasks, combining them lets an agent reach higher overall accuracy while reducing cost and latency by routing simpler tasks to smaller, cheaper models.
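The arithmetic behind this can be sketched with a cheap-first cascade. The sketch assumes a perfect verifier that tells us whether a model's answer is correct (a real system would need a judge or tests for this), and the solved sets and costs below are invented for illustration.

```python
from typing import Dict, List, Set, Tuple


def cascade(
    order: List[str],
    solved: Dict[str, Set[int]],
    cost: Dict[str, float],
    tasks: List[int],
) -> Tuple[float, float]:
    """Try models in the given order per task, stopping at the first success.

    Returns (accuracy, average cost per task). Assumes a perfect verifier.
    """
    total_cost = 0.0
    n_solved = 0
    for task in tasks:
        for model in order:
            total_cost += cost[model]
            if task in solved[model]:
                n_solved += 1
                break
    return n_solved / len(tasks), total_cost / len(tasks)


# Hypothetical data: which task IDs each model solves, and cost per call.
tasks = list(range(1, 9))
solved = {"small": {1, 2, 3, 4}, "large": {3, 4, 5, 6, 7}}
cost = {"small": 1.0, "large": 10.0}

large_only = cascade(["large"], solved, cost, tasks)
small_then_large = cascade(["small", "large"], solved, cost, tasks)
```

In this toy data the two models overlap only partially, so the cascade solves tasks that neither model covers alone while paying for the large model only on the tasks the small one misses.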

AI21 Maestro: Automatic Agent Optimization

Manual optimization is costly, inefficient, and not future-proof; a change in model pricing or a new model release can render months of manual tuning obsolete. AI21 Maestro automates this process through a two-part system:

1. Offline Build-Time Optimization

Maestro samples the action space (models, agents, tools) efficiently to find an optimal portfolio. It then trains an action model tasked with predicting the accuracy, cost, and latency of a specific action given a task.
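To make the idea of an action model concrete, here is a deliberately naive stand-in: a lookup table that averages logged rollouts per (task type, action) pair. Maestro's actual model and its input features are not described here, so the classes Rollout, Prediction, and TableActionModel, and the notion of a discrete task_type, are assumptions made purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Rollout:
    task_type: str
    action: str
    correct: bool
    cost: float
    latency: float


@dataclass(frozen=True)
class Prediction:
    accuracy: float
    cost: float
    latency: float


class TableActionModel:
    """Predicts accuracy, cost, and latency as per-group empirical means."""

    def __init__(self, rollouts: List[Rollout]) -> None:
        groups: Dict[Tuple[str, str], List[Rollout]] = defaultdict(list)
        for r in rollouts:
            groups[(r.task_type, r.action)].append(r)
        self._table: Dict[Tuple[str, str], Prediction] = {
            key: Prediction(
                accuracy=sum(r.correct for r in rs) / len(rs),
                cost=sum(r.cost for r in rs) / len(rs),
                latency=sum(r.latency for r in rs) / len(rs),
            )
            for key, rs in groups.items()
        }

    def predict(self, task_type: str, action: str) -> Prediction:
        # Raises KeyError for pairs that were never sampled offline.
        return self._table[(task_type, action)]


# Hypothetical offline rollouts.
rollouts = [
    Rollout("lookup", "small", True, 1.0, 2.0),
    Rollout("lookup", "small", False, 1.0, 4.0),
    Rollout("lookup", "large", True, 10.0, 6.0),
]
model = TableActionModel(rollouts)
small_pred = model.predict("lookup", "small")
```

A learned model would generalize to unseen tasks instead of raising on missing keys, but the interface (task and action in, three predicted metrics out) is the part the runtime depends on.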

2. Budget-Aware Runtime Orchestration

At inference time, the action model is plugged into a budget-aware runtime that uses its predictions to orchestrate execution paths dynamically. Instead of following a fixed harness, Maestro can execute a non-intuitive ("weird") sequence, such as running five different models in a first phase and then deciding, based on the results and the remaining budget, whether to proceed to a second wave.
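The wave-based behavior described above might look roughly like the following sketch. It is not Maestro's runtime: the greedy accuracy-per-cost selection, the use of predicted cost as the amount charged, and the names Action, plan_wave, and orchestrate are all simplifying assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    name: str
    predicted_accuracy: float
    predicted_cost: float


def plan_wave(actions: List[Action], budget: float) -> List[Action]:
    """Greedily pick actions by predicted accuracy per unit cost within budget."""
    chosen: List[Action] = []
    remaining = budget
    ranked = sorted(
        actions,
        key=lambda a: a.predicted_accuracy / a.predicted_cost,
        reverse=True,
    )
    for action in ranked:
        if action.predicted_cost <= remaining:
            chosen.append(action)
            remaining -= action.predicted_cost
    return chosen


def orchestrate(
    waves: List[List[Action]],
    budget: float,
    run: Callable[[Action], str],
    accept: Callable[[str], bool],
) -> Tuple[Optional[str], List[str], float]:
    """Run waves in order; stop at the first accepted answer or empty budget."""
    remaining = budget
    trace: List[str] = []
    for wave in waves:
        chosen = plan_wave(wave, remaining)
        # A real runtime would run the chosen actions in parallel.
        answers = [run(a) for a in chosen]
        trace.extend(a.name for a in chosen)
        remaining -= sum(a.predicted_cost for a in chosen)
        accepted = [ans for ans in answers if accept(ans)]
        if accepted:
            return accepted[0], trace, remaining
    return None, trace, remaining


# Toy setup: three cheap models fail, one expensive model succeeds.
wave_1 = [
    Action("small-a", 0.30, 1.0),
    Action("small-b", 0.25, 1.0),
    Action("small-c", 0.20, 1.0),
]
wave_2 = [Action("large", 0.70, 10.0)]


def toy_run(action: Action) -> str:
    return "right" if action.name == "large" else "wrong"


def toy_accept(answer: str) -> bool:
    return answer == "right"


tight = orchestrate([wave_1, wave_2], 12.0, toy_run, toy_accept)
roomy = orchestrate([wave_1, wave_2], 13.0, toy_run, toy_accept)
```

With the tighter budget the second wave is skipped because the large model no longer fits after the first wave, which is the kind of budget-dependent path a fixed harness cannot express.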

Application and Results

Maestro has been applied to several benchmarks and challenging tasks:

  • BrowseComp-Plus: Achieved state-of-the-art results by optimizing horizontal scaling and ensemble strategies.
  • Deep Research Bench: Utilized vertical scaling (repair loops) and the action model to determine when the next cycle of repair would be beneficial, avoiding diminishing returns.

This enables anytime generation: the agent returns the best candidate available under the current latency or budget constraint. If a task is simple, it stops early; if the task is complex, it invests more compute.
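The stopping behavior can be illustrated with a small anytime loop that always holds a best-so-far candidate and continues only while a predicted gain justifies the next step. Here predicted_gain stands in for the action model's estimate; the function names and the toy halving dynamics are assumptions for the example.

```python
from typing import Callable, Tuple


def anytime_refine(
    initial: float,
    improve: Callable[[float], float],
    predicted_gain: Callable[[float], float],
    step_cost: float,
    budget: float,
    min_gain: float,
) -> Tuple[float, float]:
    """Refine while the budget allows and the predicted gain is worthwhile.

    Returns (best candidate quality, budget spent). The current best is
    always available, so the loop can be cut off at any point.
    """
    best = initial
    spent = 0.0
    while spent + step_cost <= budget and predicted_gain(best) >= min_gain:
        candidate = improve(best)
        spent += step_cost
        best = max(best, candidate)
    return best, spent


# Toy dynamics: each repair cycle closes half of the remaining gap to 1.0,
# so returns diminish with every cycle.
def toy_improve(quality: float) -> float:
    return quality + (1.0 - quality) / 2


def toy_predicted_gain(quality: float) -> float:
    return (1.0 - quality) / 2


stopped_by_gain = anytime_refine(0.0, toy_improve, toy_predicted_gain, 1.0, 10.0, 0.1)
stopped_by_budget = anytime_refine(0.0, toy_improve, toy_predicted_gain, 1.0, 2.0, 0.1)
```

The first call stops because the next cycle is predicted to add too little, the second because the budget runs out first; in both cases a usable answer is returned.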

Key Advantages of the Maestro Approach

  • Automatic: Removes the need for manual tinkering and expensive rollouts.
  • Efficient: Samples only the relevant parts of the action space.
  • Observable: Provides a visualizer to show the trade-offs between cost, latency, and accuracy, allowing developers to choose their operating point.
  • Future-Proof: When a new model is released, the system only needs to learn that specific model's configurations rather than retraining the entire router or distilling a new model.
