Anthropic Claude Model Regressions in Tool Calling Schemas


SOTA Models are Regressing in Tool Schema Adherence

Newer state-of-the-art (SOTA) models from Anthropic, specifically Opus 4.8 and Sonnet 5, are increasingly prone to emitting malformed tool calls by inventing non-existent fields on objects inside nested arrays. While older models adhered to the provided schemas, these newer versions frequently add "slop" (made-up keys) to tool arguments, causing external harnesses to reject the calls even when the primary payload is correct.

This behavior suggests that as models grow more capable at general reasoning, they may also grow less flexible about tool shapes, favoring the specific, undocumented shapes they encountered during post-training over the schema they are actually given.

The Mechanics of Tool Call Failure

LLM tool calls are not native function calls; they are generated as text using in-band signaling. The model produces a serialized string (often resembling XML or JSON) that the client or API parses and interprets as a function call. For a file edit tool, a model might be expected to produce a JSON object containing an array of edits:

{
  "path": "some/file.py",
  "edits": [
    {
      "oldText": "text to replace",
      "newText": "replacement text"
    }
  ]
}

In failing cases with Opus 4.8 and Sonnet 5, the models produce the correct oldText and newText but append invented keys such as requireUnique, type, id, kind, matchCase, or in_file. These hallucinations typically occur at the highest-entropy point of the generation: immediately after closing a long escaped string, where the model must decide whether to close the object (}) or add another field (, "...").
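To make the failure concrete, here is a minimal sketch of the kind of strict check an external harness might apply. The function name `validate_edit_call` and the error wording are illustrative assumptions, not the code of any particular harness; only the keys `path`, `edits`, `oldText`, and `newText` come from the example above.

```python
import json

# Allowed keys, taken from the example schema above.
TOP_LEVEL_KEYS = {"path", "edits"}
EDIT_KEYS = {"oldText", "newText"}


def validate_edit_call(raw: str) -> list[str]:
    """Return a list of schema violations; an empty list means the call is valid."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top level must be an object"]

    errors = []
    for key in sorted(set(call) - TOP_LEVEL_KEYS):
        errors.append(f"unexpected top-level key: {key}")
    for key in sorted(TOP_LEVEL_KEYS - set(call)):
        errors.append(f"missing top-level key: {key}")

    edits = call.get("edits", [])
    if not isinstance(edits, list):
        return errors + ["edits must be an array"]
    for i, edit in enumerate(edits):
        if not isinstance(edit, dict):
            errors.append(f"edits[{i}] must be an object")
            continue
        for key in sorted(set(edit) - EDIT_KEYS):
            errors.append(f"edits[{i}]: unexpected key: {key}")
        for key in sorted(EDIT_KEYS - set(edit)):
            errors.append(f"edits[{i}]: missing key: {key}")
    return errors


# A call whose payload is correct but which carries one invented key.
sloppy = json.dumps({
    "path": "some/file.py",
    "edits": [{"oldText": "a", "newText": "b", "requireUnique": True}],
})
print(validate_edit_call(sloppy))
# prints ['edits[0]: unexpected key: requireUnique']
```

A harness that rejects on any non-empty result will refuse this call even though `oldText` and `newText` are both correct, which is the behavior described above.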

Root Cause: The "Slop Harness" Hypothesis

The regression is likely a training artifact resulting from reinforcement learning (RL) performed within a forgiving environment, such as the closed-source Claude Code harness.

The Role of Forgiving Clients

Claude Code's internal client is designed to be highly resilient. It employs several strategies to handle malformed output:

  • Parameter Aliasing: It accepts multiple names for the same parameter (e.g., old_str and old_string).
  • Silent Filtering: It automatically strips out unexpected keys that are not in the schema.
  • Unicode Repair: It fixes broken \uXXXX sequences and lone surrogates.
  • Retry Loops: It uses a state machine to catch leaked markup and prompt the model to try again.
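The first two strategies in the list can be sketched in a few lines. This is a hypothetical illustration, not Claude Code's implementation, which is closed source: the alias table, the `new_str`/`new_string` pair, the `path` key, and the function name `normalize_args` are all assumptions. Only the `old_str`/`old_string` pair is taken from the text. Unicode repair and retry loops are not covered here.

```python
# Hypothetical alias table: maps names a model might emit to canonical names.
ALIASES = {"old_str": "old_string", "new_str": "new_string"}
# Hypothetical flat schema for an edit tool.
SCHEMA_KEYS = {"path", "old_string", "new_string"}


def normalize_args(args: dict) -> tuple[dict, list[str]]:
    """Map aliases to canonical names and drop keys outside the schema.

    Returns the cleaned arguments plus the list of dropped keys, so a caller
    can log what was repaired instead of hiding it completely.
    """
    cleaned = {}
    dropped = []
    for key, value in args.items():
        canonical = ALIASES.get(key, key)
        if canonical not in SCHEMA_KEYS:
            dropped.append(key)
        elif canonical not in cleaned:
            cleaned[canonical] = value
        # Otherwise a value for this canonical name is already present;
        # the first one wins and the duplicate is ignored.
    return cleaned, dropped


cleaned, dropped = normalize_args(
    {"path": "some/file.py", "old_str": "a", "new_string": "b", "matchCase": True}
)
print(cleaned)  # {'path': 'some/file.py', 'old_string': 'a', 'new_string': 'b'}
print(dropped)  # ['matchCase']
```

From the model's side, the call with `old_str` and `matchCase` succeeds exactly as a schema-perfect call would, so the repair is invisible unless the dropped keys are surfaced somewhere.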

RL-Induced Bias

If a model is trained via RL in a harness that silently repairs errors, the model receives a reward for completing the task regardless of whether the tool call was perfectly schema-compliant. Consequently, there is no gradient pushing the model to avoid inventing aliases or adding stray fields. Over time, the model develops a strong prior for the specific, flat schema used by Claude Code, making alternative, nested schemas (like those used in the Pi harness) appear "off-distribution."

Mitigation Strategies

Several technical approaches can mitigate these schema regressions:

1. Strict Mode and Constrained Decoding

Enabling strict mode in the Anthropic API appears to eliminate these failures. This is likely because the server-side inference stack employs grammar-constrained decoding, masking out any token that would violate the JSON schema, so an undeclared key can never be sampled. However, strict mode imposes complexity limits on tool definitions, and a tool set that exceeds them can cause the API request itself to fail.
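As a rough sketch of what a strict-ready definition for the edit tool might look like, the dict below closes every object with `additionalProperties: False`, which is what rules out invented keys under a JSON Schema grammar. The tool name `edit`, the description, and the name and placement of the `strict` flag are assumptions here and should be confirmed against the current API documentation. The helper `open_objects` is a hypothetical local check, not part of any SDK.

```python
# Tool definition as a plain dict. The "strict" flag and its placement are an
# assumption; confirm against the current API documentation.
edit_tool = {
    "name": "edit",
    "description": "Replace exact text spans in a file.",
    "strict": True,
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "edits": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "oldText": {"type": "string"},
                        "newText": {"type": "string"},
                    },
                    "required": ["oldText", "newText"],
                    "additionalProperties": False,
                },
            },
        },
        "required": ["path", "edits"],
        "additionalProperties": False,
    },
}


def open_objects(schema: dict, where: str = "input_schema") -> list[str]:
    """List object schemas that still allow undeclared keys."""
    found = []
    if schema.get("type") == "object":
        if schema.get("additionalProperties") is not False:
            found.append(where)
        for name, sub in schema.get("properties", {}).items():
            found += open_objects(sub, f"{where}.{name}")
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        found += open_objects(schema["items"], f"{where}[]")
    return found


print(open_objects(edit_tool["input_schema"]))  # [] means every object is closed
```

Running a check like this before sending a request catches the case where a nested object was left open, since that is exactly where the invented keys tend to appear.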

2. Error-Driven Correction

Some developers have found success by returning detailed error messages to the model. When the message explains exactly why a tool call failed and how to fix it, the agent can typically correct the mistake on the next turn and keep producing valid calls for the rest of the session.
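A minimal sketch of such an error message follows. The function name `build_tool_error` and the wording are illustrative assumptions, and the shape of the result payload depends on the API or harness in use, so the field names shown should be treated as placeholders.

```python
def build_tool_error(unexpected: list[str], allowed: list[str]) -> str:
    """Build an error string that names the problem and the exact fix."""
    return (
        "Tool call rejected: unexpected key(s) "
        + ", ".join(unexpected)
        + ". Each edit accepts exactly these keys: "
        + ", ".join(allowed)
        + ". Resend the same call with the extra key(s) removed."
    )


message = build_tool_error(["requireUnique"], ["oldText", "newText"])

# A tool-result-style payload; the field names are illustrative, since the
# exact shape depends on the API or harness in use.
tool_result = {"type": "tool_result", "is_error": True, "content": message}
print(tool_result["content"])
```

The message states which keys were rejected, which keys are allowed, and what to do next, rather than returning a bare "validation failed" that leaves the model to guess.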

3. Client-Side "Self-Healing"

Similar to the Claude Code approach, some users have built extensions to patch the edit tool, allowing it to silently ignore invented fields or map aliases back to the correct schema, reducing the number of round-trip retries.

Implications for AI Agent Development

This phenomenon suggests that tool schemas are not neutral, abstract contracts. How reliably a model follows a given schema depends on how closely that schema matches the tool shapes in the model's post-training distribution.

  • Harness Dependency: The "intelligence" of an agent is increasingly the sum of the model and the harness. A model trained in a proprietary, forgiving environment may struggle in a strict, open-source one.
  • The Moat of the Closed Harness: Closed-source harnesses combined with RL fine-tuning on customer prompts create a technical moat. If frontier models perform best only in their own proprietary harnesses, third-party developers must either mimic those quirks or rely on strict constrained decoding, which may have its own quality tradeoffs.
