Ara Khan: Evals Are Broken Use Them Anyway

The Core Conflict: Objective Metrics vs. Taste

AI development often splits into two flawed camps regarding evaluations (evals): the objective metrics camp, which takes benchmark scores (like Elo ratings or SWE-bench) at face value, and the taste/vibes camp, which dismisses numbers entirely in favor of subjective feel.

Neither approach is sufficient. Objective benchmarks can be gamed by labs to achieve high scores without improving actual utility, while relying solely on "vibes" prevents systematic improvement. The effective path lies in the middle: treating evals not as absolute truths, but as critical heuristics for iterative development.

Heuristics for Interpreting External Evals

When evaluating benchmarks provided by model labs or other companies, developers should apply three primary heuristics to avoid being misled by marketing numbers:

1. Treat Lab Evals as Approximations

Do not treat numbers from model labs (e.g., for GPT or Claude releases) as the "word of God." While they are generally decent approximations, they should be used with discernment rather than as definitive proof of superiority.

2. Prioritize Stability Over Early Adoption

In the fast-paced AI landscape, the "best" model changes every few months. Attempting to switch to the absolute frontier model the moment it is released consumes excessive mental bandwidth. The recommended approach is to let the dust settle for a few weeks before integrating a new model into a production workflow.

3. Seek Problem-Specific Benchmarks

General-purpose benchmarks often fail to reflect real-world utility. For example, SWE-bench was once a standard for coding agents but eventually became "saturated," meaning models scored so high that the benchmark no longer distinguished between quality levels. Developers should look for evals that closely mirror their specific problem domain (e.g., shopping, infrastructure, or specific coding tasks).
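As a rough illustration of what "saturated" means in practice, the sketch below flags a benchmark when every model under consideration scores near the ceiling and the gap between them is small. The function name, both thresholds, and the pass rates are assumptions made up for this example, not figures from the talk.

```python
def is_saturated(
    scores: dict[str, float],
    ceiling: float = 0.9,
    min_spread: float = 0.05,
) -> bool:
    """Return True when all models score near the top and barely differ.

    scores maps a model name to its pass rate in [0, 1]. Both thresholds
    are illustrative assumptions and should be tuned per benchmark.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    values = list(scores.values())
    spread = max(values) - min(values)
    return min(values) >= ceiling and spread < min_spread


# Hypothetical pass rates, invented for illustration only.
general_benchmark = {"model_a": 0.93, "model_b": 0.95, "model_c": 0.94}
domain_benchmark = {"model_a": 0.41, "model_b": 0.62, "model_c": 0.55}

print(is_saturated(general_benchmark))
print(is_saturated(domain_benchmark))
```

A benchmark that fails this check for the candidate models is more likely to say something useful about which one fits the problem.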

Implementing Agentic Evals for Coding Agents

Evaluating agents is fundamentally different from evaluating single-turn LLM responses. Because agents can take multiple turns, use various tools, and follow different paths, the answer space is effectively infinite.

The Shift to Real-World Tasks

Many early evals focused on trivial academic problems, such as implementing a Fibonacci sequence. These do not translate to professional software engineering. To address this, the Cline team adopted Terminal Bench (developed by Stanford and the Laude Institute), which consists of 89 real-world software engineering tasks involving database issues, race conditions, and front-end bugs.

The Agentic Evaluation Process

Unlike a conventional test with a fixed input and expected output, an agentic eval lets the agent run for extended periods (sometimes 30-45 minutes), performing web searches, installing libraries, and editing files. The path the agent takes is open-ended, but grading is deterministic: unit tests check whether the final output runs and passes.
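The split between an open-ended run and deterministic grading can be sketched as follows. This is a minimal toy, not Terminal Bench's or Cline's harness: `run_task`, the two stand-in agents, and the test command are all hypothetical, and a real agent would be an LLM loop rather than a function that writes one file.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def run_task(
    agent: Callable[[Path], None],
    test_cmd: Sequence[str],
    test_timeout_s: float = 60.0,
) -> bool:
    """Let the agent work in a scratch directory, then grade the result."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        agent(workspace)  # open-ended: any sequence of edits is allowed
        try:
            # Grading is deterministic: the exit code of the test command.
            result = subprocess.run(
                test_cmd, cwd=workspace, capture_output=True, timeout=test_timeout_s
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


# Stand-in "agents" for illustration only.
def good_agent(workspace: Path) -> None:
    (workspace / "solution.py").write_text("def add(a, b):\n    return a + b\n")


def bad_agent(workspace: Path) -> None:
    (workspace / "solution.py").write_text("def add(a, b):\n    return a - b\n")


# The unit test runs in the workspace and imports whatever the agent produced.
TEST_CMD = [sys.executable, "-c", "from solution import add; assert add(2, 3) == 5"]

print(run_task(good_agent, TEST_CMD))
print(run_task(bad_agent, TEST_CMD))
```

Because only the final state is graded, two agents that take very different paths can both pass, which is the property that makes this style of eval workable for an effectively infinite answer space.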

Key Metrics to Track

To balance quality and cost, developers should track:

  • Turn count: How many iterations the agent takes.
  • Tool calls: The number of tool invocations the agent makes.
  • Token usage: The total tokens consumed, which determines the cost of the run.
  • Execution time: The total wall-clock time for the run.
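One way to record these four metrics alongside the pass/fail outcome is a small per-run record plus an aggregate. The field names, the `summarize` helper, and the flat per-token price are assumptions for this sketch; real pricing usually differs for input and output tokens.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunMetrics:
    task_id: str
    passed: bool
    turns: int           # iterations the agent took
    tool_calls: int      # tool invocations
    tokens: int          # total tokens consumed
    wall_clock_s: float  # total execution time in seconds


def summarize(runs: list[RunMetrics], usd_per_million_tokens: float) -> dict[str, float]:
    """Aggregate quality and cost over a batch of runs."""
    if not runs:
        raise ValueError("runs must not be empty")
    total_tokens = sum(r.tokens for r in runs)
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "mean_turns": mean([r.turns for r in runs]),
        "mean_tool_calls": mean([r.tool_calls for r in runs]),
        "total_cost_usd": total_tokens / 1_000_000 * usd_per_million_tokens,
        "mean_wall_clock_s": mean([r.wall_clock_s for r in runs]),
    }


# Hypothetical runs and price, invented for illustration.
runs = [
    RunMetrics("task-01", True, turns=10, tool_calls=25, tokens=400_000, wall_clock_s=300.0),
    RunMetrics("task-02", False, turns=30, tool_calls=75, tokens=600_000, wall_clock_s=900.0),
]
summary = summarize(runs, usd_per_million_tokens=3.0)
print(summary)
```

Tracking cost next to pass rate makes it visible when a change buys a small quality gain at a large increase in turns or tokens.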

Building a Robust Eval Infrastructure

To run evals effectively and avoid interference between tasks, isolation is mandatory.

  • Containerization: Each eval task should run in an isolated container with its own dependencies and environment. This prevents one task from corrupting the environment of another.
  • Parallelization: Running evals sequentially can take hours. Using infrastructure like Modal allows for parallelized, containerized environments, significantly reducing the feedback loop time.
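The two requirements above can be sketched with the Python standard library and the Docker CLI. This is not Modal's API, which is deliberately left out here. The image name, the `run-task` entrypoint inside the image, and the `run_one` callback are hypothetical; only `docker run --rm --name` is standard Docker usage.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def docker_command(task_id: str, image: str) -> list[str]:
    """Build a `docker run` invocation giving one task its own throwaway container.

    `run-task` is a hypothetical entrypoint assumed to exist inside the image.
    """
    return ["docker", "run", "--rm", "--name", f"eval-{task_id}", image, "run-task", task_id]


def run_all(
    task_ids: list[str],
    run_one: Callable[[str], bool],
    max_workers: int = 8,
) -> dict[str, bool]:
    """Run every task concurrently and collect pass/fail per task id."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with task_ids.
        results = list(pool.map(run_one, task_ids))
    return dict(zip(task_ids, results))


# Stand-in runner for illustration; a real one might call
# subprocess.run(docker_command(task_id, image)) and check the return code.
def fake_runner(task_id: str) -> bool:
    return task_id != "task-02"


outcomes = run_all(["task-01", "task-02", "task-03"], fake_runner)
print(outcomes)
print(docker_command("task-01", "my-eval-image"))
```

Threads are enough here because each worker mostly waits on an external container; the heavy work happens outside the Python process.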

The Iterative Improvement Loop

Evals allow developers to move from philosophical guessing to engineering. By analyzing a "portfolio allocation of failures," developers can categorize errors into broad buckets (e.g., "failed to read file," "inference error," "installation loop").
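A failure portfolio of this kind can be approximated by tagging each failed run's log with a bucket and counting shares. The keyword table and the sample log lines below are invented for illustration; in practice the buckets come from reading real transcripts, and keyword matching is only a crude first pass.

```python
from collections import Counter

# Hypothetical keyword rules; the first matching bucket wins.
BUCKET_KEYWORDS: dict[str, tuple[str, ...]] = {
    "failed to read file": ("read_file",),
    "inference error": ("inference", "http 5"),
    "installation loop": ("pip install", "npm install"),
}


def bucket_for(log: str) -> str:
    """Assign a failure log to the first bucket whose keyword it contains."""
    lowered = log.lower()
    for bucket, keywords in BUCKET_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return bucket
    return "uncategorized"


def failure_portfolio(logs: list[str]) -> list[tuple[str, float]]:
    """Return (bucket, share of failures), largest share first."""
    counts = Counter(bucket_for(log) for log in logs)
    total = sum(counts.values())
    return [(bucket, n / total) for bucket, n in counts.most_common()]


# Made-up failure logs.
failed_logs = [
    "read_file: no such file or directory",
    "HTTP 500 from inference provider",
    "read_file: permission denied",
    "pip install retried 12 times",
    "segmentation fault",
]
portfolio = failure_portfolio(failed_logs)
print(portfolio)
```

Sorting by share points at the bucket where a fix is likely to recover the most failed tasks.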

The Three Zones of Improvement

  1. Zone 1: Obvious Flaws. Fixing fundamental breaks, such as a broken read_file tool or failing checkpoints. This makes the agent functional.
  2. Zone 2: Hill Climbing. This is the primary area of optimization. Developers refine prompts, adjust tool definitions, and tune retry logic to improve the agent's overall approach to problem-solving.
  3. Zone 3: The Danger Zone. This is where overfitting becomes a risk. Developers must avoid optimizing solely for the benchmark score by adding specific hacks that pass the test but degrade general performance.
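One possible guard against Zone 3, offered here as an assumption rather than something the talk prescribes, is to keep a held-out set of tasks that is never used for tuning and to compare both scores before and after each change. The function name and tolerance are hypothetical.

```python
def looks_overfit(
    bench_before: float,
    bench_after: float,
    heldout_before: float,
    heldout_after: float,
    tolerance: float = 0.01,
) -> bool:
    """Flag a change that raises the benchmark score while the held-out score drops.

    All scores are pass rates in [0, 1]. The tolerance absorbs run-to-run
    noise and is an illustrative value, not a recommendation.
    """
    benchmark_gain = bench_after - bench_before
    heldout_loss = heldout_before - heldout_after
    return benchmark_gain > tolerance and heldout_loss > tolerance


# Made-up scores: the first change helps only the benchmark.
print(looks_overfit(0.50, 0.58, 0.47, 0.42))
print(looks_overfit(0.50, 0.58, 0.47, 0.52))
```

A change that improves both sets is ordinary hill climbing; a change that improves only the tuned benchmark deserves a closer look before it ships.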

The Three-Way Alignment

Successful agent performance requires alignment between three components:

  • The Model: The underlying LLM's capability.
  • The Harness: The agent scaffolding and tool implementation.
  • The Problem: The actual task being solved.

Even a superior model will fail if the harness is poorly written. Iterative evals help identify whether a failure is due to the model's intelligence or a flaw in the agent's scaffolding.
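A simple way to separate the two causes is to rerun a failed task while changing only one component at a time. The sketch below assumes results are available for several model and harness combinations; the names, the result layout, and the sample data are hypothetical.

```python
# (model, harness) -> ids of the tasks that combination passed
Results = dict[tuple[str, str], set[str]]


def attribute_failure(task_id: str, model: str, harness: str, results: Results) -> str:
    """Guess whether a failure points at the harness or at the model."""
    if task_id in results.get((model, harness), set()):
        return "passed"
    # Same model succeeds with a different harness: the scaffolding is suspect.
    if any(task_id in passed for (m, h), passed in results.items()
           if m == model and h != harness):
        return "harness"
    # A different model succeeds with the same harness: model capability is suspect.
    if any(task_id in passed for (m, h), passed in results.items()
           if m != model and h == harness):
        return "model"
    # Nothing passes along either axis: the task itself may be the issue.
    return "unresolved"


# Made-up results for two models and two harness versions.
results: Results = {
    ("model_a", "harness_v1"): {"t1"},
    ("model_a", "harness_v2"): {"t1", "t2"},
    ("model_b", "harness_v1"): {"t1", "t3"},
    ("model_b", "harness_v2"): {"t1", "t2", "t3"},
}

for task in ["t1", "t2", "t3", "t4"]:
    print(task, attribute_failure(task, "model_a", "harness_v1", results))
```

This is a heuristic: a single passing rerun can be noise, so repeated runs per combination give a more trustworthy attribution.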
