Ornith-1.0: Self-Improving Open-Source Models for Agentic Coding
Ornith-1.0: Self-Improving Open-Source Models for Agentic Coding
Ornith-1.0 is a series of self-improving, open-source models specifically optimized for agentic coding. By utilizing a reinforcement learning (RL) framework that jointly optimizes both the solution rollouts and the scaffolds that drive them, Ornith-1.0 discovers more efficient search trajectories to generate higher-quality code solutions.
Model Variants and Architecture
Ornith-1.0 is available in three primary sizes, post-trained on top of Gemma 4 and Qwen 3.5. All models support a 256K (262,144-token) context window and provide an OpenAI-compatible interface.
- 9B-Dense: Designed for single-GPU serving and fine-tuning. It fits on a single 80GB GPU.
- 35B-MoE: A Mixture-of-Experts model suitable for multi-GPU serving.
- 397B-MoE: A large-scale Mixture-of-Experts model for high-performance multi-GPU nodes.
Weights are provided in multiple formats to accommodate different hardware, including bf16 for full precision, FP8 for memory efficiency on compatible GPUs, and GGUF for local inference via llama.cpp or Ollama.
Performance Benchmarks
Ornith-1.0 achieves state-of-the-art performance among open-source models of comparable size across several agentic coding benchmarks.
High-Scale Performance (397B Model)
The 397B MoE model competes with top-tier proprietary and open models. On SWE-bench Verified, it scores 82.4, outperforming Qwen3.5-397B (76.4) and DeepSeek-V4-Pro-1.6T (80.6). It also shows strong results on Terminal-Bench 2.1, scoring 77.5 (Terminus-2) and 78.2 (Claude Code).
Mid-Scale Performance (35B Model)
The 35B MoE model demonstrates significant gains over its baselines. On SWE-bench Verified, it scores 75.6, compared to 70 for Qwen3.5-35B. It also achieves 64.2 on Terminal-Bench 2.1 (Terminus-2), significantly higher than the 41.4 scored by Qwen3.5-35B.
Small-Scale Performance (9B Model)
The 9B Dense model outperforms several larger baselines on specific tasks. For example, on Terminal-Bench 2.1 (Terminus-2), it scores 43.1, beating the larger Gemma4-31B (42.1) and Qwen3.5-9B (21.3).
Technical Implementation and Serving
Ornith-1.0 is a reasoning model; it generates a <think> block containing a chain-of-thought trace before providing the final answer.
Deployment Runtimes
To serve Ornith-1.0, the following minimum runtime versions are required:
- Transformers: $\ge$ 5.8.1
- vLLM: $\ge$ 0.19.1
- SGLang: $\ge$ 0.5.9
Integration with Agent Frameworks
Because the models expose an OpenAI-compatible endpoint and support tool-calling, they integrate directly with several agentic frameworks:
- OpenHands: Routes via LiteLLM using the
openai/Ornith-1.0prefix. - Hermes Agent & OpenClaw: Point directly to the Ornith server via
OPENAI_BASE_URL. - Coding CLIs: Optimized for terminal-based agents like OpenCode.
Community Reception and Critique
While benchmarks show strong results, community feedback from Hacker News indicates a divide between benchmark performance and real-world utility.
Critical Perspectives
Some users have reported that the model's performance in non-tool-augmented chat is poor, noting a tendency toward hallucination. One user highlighted a discrepancy between benchmark success and practical bug-finding:
"Poor performer here, only found the one bug that almost every model found, despite its performance on other benchmarks being excellent for its size."
Other critics suggest the models may be "benchmaxxed"—optimized specifically for the benchmarks they are tested on—and argue that the 9B model's VRAM requirements (fitting on an 80GB GPU) remain too high for many individual users.
Positive Perspectives
Conversely, some users have found the models to be creative in their approach to coding problems and noted that this is one of the few Qwen-based fine-tunes that has been well-received by the local LLM community for its actual utility in coding tasks.