Ornith-1.0: Self-Improving Open-Source Models for Agentic Coding

Ornith-1.0 is a series of self-improving, open-source models specifically optimized for agentic coding. By utilizing a reinforcement learning (RL) framework that jointly optimizes both the solution rollouts and the scaffolds that drive them, Ornith-1.0 discovers more efficient search trajectories to generate higher-quality code solutions.

Model Variants and Architecture

Ornith-1.0 is available in three primary sizes, post-trained on top of Gemma 4 and Qwen 3.5. All models support a 256K (262,144-token) context window and provide an OpenAI-compatible interface.

9B-Dense: Designed for single-GPU serving and fine-tuning. It fits on a single 80GB GPU.
35B-MoE: A Mixture-of-Experts model suitable for multi-GPU serving.
397B-MoE: A large-scale Mixture-of-Experts model for high-performance multi-GPU nodes.

Weights are provided in multiple formats to accommodate different hardware, including bf16 for full precision, FP8 for memory efficiency on compatible GPUs, and GGUF for local inference via llama.cpp or Ollama.

Performance Benchmarks

Ornith-1.0 achieves state-of-the-art performance among open-source models of comparable size across several agentic coding benchmarks.

High-Scale Performance (397B Model)

The 397B MoE model competes with top-tier proprietary and open models. On SWE-bench Verified, it scores 82.4, outperforming Qwen3.5-397B (76.4) and DeepSeek-V4-Pro-1.6T (80.6). It also shows strong results on Terminal-Bench 2.1, scoring 77.5 (Terminus-2) and 78.2 (Claude Code).

Mid-Scale Performance (35B Model)

The 35B MoE model demonstrates significant gains over its baselines. On SWE-bench Verified, it scores 75.6, compared to 70 for Qwen3.5-35B. It also achieves 64.2 on Terminal-Bench 2.1 (Terminus-2), significantly higher than the 41.4 scored by Qwen3.5-35B.

Small-Scale Performance (9B Model)

The 9B Dense model outperforms several larger baselines on specific tasks. For example, on Terminal-Bench 2.1 (Terminus-2), it scores 43.1, beating the larger Gemma4-31B (42.1) and Qwen3.5-9B (21.3).

Technical Implementation and Serving

Ornith-1.0 is a reasoning model; it generates a <think> block containing a chain-of-thought trace before providing the final answer.

Deployment Runtimes

To serve Ornith-1.0, the following minimum runtime versions are required:

Transformers: $\ge$ 5.8.1
vLLM: $\ge$ 0.19.1
SGLang: $\ge$ 0.5.9

Integration with Agent Frameworks

Because the models expose an OpenAI-compatible endpoint and support tool-calling, they integrate directly with several agentic frameworks:

OpenHands: Routes via LiteLLM using the openai/Ornith-1.0 prefix.
Hermes Agent & OpenClaw: Point directly to the Ornith server via OPENAI_BASE_URL.
Coding CLIs: Optimized for terminal-based agents like OpenCode.

Community Reception and Critique

While benchmarks show strong results, community feedback from Hacker News indicates a divide between benchmark performance and real-world utility.

Critical Perspectives

Some users have reported that the model's performance in non-tool-augmented chat is poor, noting a tendency toward hallucination. One user highlighted a discrepancy between benchmark success and practical bug-finding:

"Poor performer here, only found the one bug that almost every model found, despite its performance on other benchmarks being excellent for its size."

Other critics suggest the models may be "benchmaxxed"—optimized specifically for the benchmarks they are tested on—and argue that the 9B model's VRAM requirements (fitting on an 80GB GPU) remain too high for many individual users.

Positive Perspectives

Conversely, some users have found the models to be creative in their approach to coding problems and noted that this is one of the few Qwen-based fine-tunes that has been well-received by the local LLM community for its actual utility in coding tasks.

Ornith-1.0: Self-Improving Open-Source Models for Agentic Coding

Ornith-1.0: Self-Improving Open-Source Models for Agentic Coding

Model Variants and Architecture

Performance Benchmarks

High-Scale Performance (397B Model)

Mid-Scale Performance (35B Model)

Small-Scale Performance (9B Model)

Technical Implementation and Serving

Deployment Runtimes

Integration with Agent Frameworks

Community Reception and Critique

Critical Perspectives

Positive Perspectives

Sources