CursorBench 3.1: Evaluating AI Coding Agents on Real-World Tasks

CursorBench 3.1 provides a real-world performance baseline for AI coding agents

CursorBench 3.1 evaluates AI agents using ambiguous, multi-file tasks derived from actual Cursor user sessions. Unlike synthetic benchmarks, it focuses on how well models handle codebase understanding, bug finding, planning, and code review in realistic environments. The primary goal is to measure agent performance on tasks that require navigating and editing several files within a single task.

Performance and Cost Rankings

According to the CursorBench 3.1 results, Fable 5 Max is the top-performing model with a score of 72.9%, followed by Fable 5 Extra High (72.0%) and Fable 5 High (70.6%). These scores come at a price, however: Fable 5 Max also has the highest average cost per task, at $18.02.

Key performance tiers from the benchmark include:

  • Top Tier (70%+): Fable 5 (Max, Extra High, High, Medium).
  • Mid Tier (60-69%): Opus 4.7 Max, GPT-5.5 Extra High, Fable 5 Low, Opus 4.8 Max, and Composer 2.5.
  • Lower Tier (<60%): Sonnet 5, Opus 4.8 (High/Medium/Low), and Gemini 3.5 Flash.

Notably, Composer 2.5 is ranked 9th with a score of 63.2%, while maintaining one of the lowest costs per task at $0.55.
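
The trade-off between score and cost can be made concrete with a little arithmetic. The sketch below is illustrative only: it uses the two models for which this article quotes both a score and a cost, treats the tier boundaries above as 70 and 60 percent, and computes "points per dollar", which is an assumed convenience metric and not something CursorBench itself reports. The names Result, tier, and points_per_dollar are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    score_pct: float  # benchmark score in percent, as quoted in the article
    cost_usd: float   # average cost per task in USD, as quoted in the article


def tier(score_pct: float) -> str:
    """Map a score to the tier labels used above.

    Assumption: "60-69%" is read as the half-open range [60, 70).
    """
    if score_pct >= 70.0:
        return "Top Tier"
    if score_pct >= 60.0:
        return "Mid Tier"
    return "Lower Tier"


def points_per_dollar(result: Result) -> float:
    """Score points obtained per dollar of average task cost (assumed metric)."""
    return result.score_pct / result.cost_usd


# Only figures stated in the article are used; no other costs are assumed.
results = [
    Result("Fable 5 Max", 72.9, 18.02),
    Result("Composer 2.5", 63.2, 0.55),
]

for r in results:
    print(f"{r.model}: {tier(r.score_pct)}, "
          f"{points_per_dollar(r):.1f} points per dollar")
# Fable 5 Max: Top Tier, 4.0 points per dollar
# Composer 2.5: Mid Tier, 114.9 points per dollar
```

On these figures, Composer 2.5 trails Fable 5 Max by 9.7 points while costing roughly 33 times less per task. A ratio like this says nothing about whether the missing points matter for a given workload, so it is best read alongside the raw scores rather than in place of them.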

Evolution from CursorBench 3.0

CursorBench 3.1 introduces several critical updates over the initial 3.0 version to better reflect professional software engineering workflows:

  • Expanded Task Scope: While version 3.0 focused primarily on edit, refactor, and bugfix problems, 3.1 adds problems specifically focused on codebase understanding, planning, and code review.
  • Refined Grading: Grading criteria for edit tasks have been improved so that scores measure performance more accurately.

Community Critique and Benchmark Validity

The release of CursorBench 3.1 has sparked significant debate among developers regarding the validity of internal benchmarks versus third-party evaluations.

Discrepancies with External Benchmarks

Several users pointed out a stark contrast between CursorBench results and independent tests. For example, while Composer 2.5 performs competitively in Cursor's internal benchmark, other evaluations show a wider gap:

"Artificial Analysis' testing shows Composer 2.5 to be pretty far behind... You look at the DeepSWE benchmark... and GPT-5.5 xhigh gets a 64, Opus 4.8 max gets 56, and Cursor 2.5 gets 16."

Concerns Over Bias and Utility

Critics argue that a benchmark created by a company to evaluate its own model (Composer 2.5) is inherently biased. Some developers suggest that the only reliable metric is a model's performance on a user's specific daily workload:

"The independent benchmarks are probably part of training data now and the models are pattern-matching against them all the time. The final test of a model... is how good it works FOR YOU."

Model-Specific Observations

Users shared qualitative experiences that contrast with the quantitative data:

  • GPT-5.5 Extra High: Praised for speed and adaptive thinking, though limited by a smaller context window than Opus.
  • Opus 4.8 Max: Described as powerful for planning and review but potentially slow, sometimes "needlessly chewing on everything."
  • Fable 5: Noted for strong adaptive thinking but criticized for potentially leaving "big, dangerous holes" in implementations if not closely monitored.
  • Composer 2.5: Some users found it lacked the critical reasoning and thinking capabilities of frontier models, describing it as a "workhorse" better suited for executing existing plans than creating them.
