Ornith 1.0 Release Notes

Overview

Ornith 1.0 is a family of agentic coding models developed by Deep Reinforce. Its core innovation is "self-scaffolding": the model writes its own task-specific harness (or scaffold) on the fly, then uses that harness to guide its own rollouts and reach more accurate results. This shifts the responsibility for context engineering from the human developer to the model itself.

Model Family and Architecture

Ornith 1.0 consists of four models based on the Qwen 3.5 and Gemma 4 families. All models in the family are available as open weights:

  • 9B: Based on Qwen 3.5.
  • 31B: Based on Gemma 4.
  • 35B MoE: Based on Qwen 3.5.
  • 397B MoE: Based on Qwen 3.5.

These models are not new pre-trains. They are the result of mid-training and post-training focused on generating both agentic trajectories (rollouts) and the scaffolds that guide them.

Training Methodology: Two-Stage RL

Deep Reinforce used a two-stage reinforcement learning (RL) process to enable self-scaffolding. Each training step has two stages:

  1. Scaffold Proposal: The model is conditioned on a task and a previously used scaffold, then proposes a refined version of that harness.
  2. Rollout Generation: Conditioning on the new harness, the model proposes a rollout to reach the desired result.

The rewards earned by these rollouts are used to update the model's weights for both scaffold generation and rollout execution, using Group Relative Policy Optimization (GRPO).
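The two stages and the group-relative reward signal can be sketched as follows. This is a minimal illustration, not Deep Reinforce's training code: propose_scaffold, run_rollout, and score are hypothetical callables standing in for the model and the task verifier, and the advantage formula is the standard GRPO normalization (each reward minus the group mean, divided by the group standard deviation), here using the population standard deviation as an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some setups use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def two_stage_step(task, prev_scaffold, propose_scaffold, run_rollout, score,
                   group_size=4):
    """Sample a group of (scaffold, rollout) pairs and attach advantages.

    propose_scaffold, run_rollout, and score are hypothetical stand-ins
    for the policy and the verifier; they are not part of any real API.
    """
    samples = []
    for _ in range(group_size):
        scaffold = propose_scaffold(task, prev_scaffold)  # stage 1: refine harness
        rollout = run_rollout(task, scaffold)             # stage 2: act under it
        samples.append((scaffold, rollout, score(task, rollout)))

    advantages = group_relative_advantages([reward for _, _, reward in samples])
    # The same advantage would weight the update for both the scaffold
    # tokens and the rollout tokens of a sample.
    return [(sc, ro, adv) for (sc, ro, _), adv in zip(samples, advantages)]
```

Because advantages are computed relative to the group, a sample is reinforced only when its scaffold and rollout together do better than the other samples drawn for the same task. If every sample in a group earns the same reward, all advantages are zero.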

Defending Against Reward Hacking

To prevent the model from "cheating" by creating shortcuts in the harness to get high rewards without actually solving the task, Ornith 1.0 employs a three-layer defense system:

  • Immutable Environment: The sandbox, tools, and environment in which the scaffolding runs cannot be changed by the model.
  • Deterministic Monitor: A monitor tracks the scaffolding's actions and penalizes the model if it attempts to modify verification scripts or use unsanctioned tools.
  • LLM Judge: An LLM acts as a final judge with the power to veto any result that appears to have been achieved through disallowed means.
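One way the three layers could combine into a single reward is sketched below. Everything here is an assumption made for illustration: the Action record, the allowed tool names, the "verify/" path prefix, the penalty value, and the judge_vetoes callable are hypothetical, and the source does not describe how the layers are actually implemented or weighted.

```python
from dataclasses import dataclass

# Layer 1 (immutable environment) is only mirrored here by frozen,
# read-only definitions; real immutability would be enforced by the sandbox.
ALLOWED_TOOLS = frozenset({"read_file", "write_file", "run_tests"})  # hypothetical
PROTECTED_PREFIXES = ("verify/",)                                    # hypothetical


@dataclass(frozen=True)
class Action:
    tool: str
    target: str = ""
    writes: bool = False


def monitor_violations(actions):
    """Layer 2: deterministic checks over the logged actions."""
    violations = []
    for action in actions:
        if action.tool not in ALLOWED_TOOLS:
            violations.append(f"unsanctioned tool: {action.tool}")
        elif action.writes and action.target.startswith(PROTECTED_PREFIXES):
            violations.append(f"write to protected path: {action.target}")
    return violations


def final_reward(task_reward, actions, judge_vetoes, penalty=1.0):
    """Apply the monitor first, then the judge's veto (layer 3)."""
    if monitor_violations(actions):
        return -penalty      # penalize rule-breaking regardless of outcome
    if judge_vetoes(actions):
        return 0.0           # vetoed results earn nothing
    return task_reward
```

Running the cheap deterministic check before the judge means that obvious violations never reach the more expensive LLM call, while the judge remains a backstop for shortcuts the rules did not anticipate.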

Performance and Benchmarks

According to the provided benchmarks, the largest Ornith model (397B MoE) outperforms several other models, including Qwen 3.7 Max and MiniMax, and is competitive with Claude Opus. The smaller 9B and 35B MoE models also perform strongly against models of comparable or larger size, which makes the 9B model a viable option for local coding tasks on limited hardware.

Practical Applications and Demos

Ornith 1.0 handles complex, multi-step reasoning and code generation through a long chain-of-thought process. Key examples include:

  • SVG Generation: The model can successfully generate code to draw complex images, such as a pelican.
  • RAG Tasks: The model handles Retrieval-Augmented Generation questions by reasoning through provided data to find an answer.
  • Dynamic Harness Creation: When asked to create a weather forecast harness, the model autonomously identifies the need for a weather API. If told that no API keys are available, it pivots to a free source that requires no API key (e.g., the Open-Meteo API) and rewrites the script accordingly.
  • Interface Building: The model can build functional UI components, such as Gradio interfaces, to wrap around the harnesses it has created.
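For a sense of what the keyless weather harness in the example above might look like, here is a minimal sketch against Open-Meteo's public forecast endpoint, using only the Python standard library. It is not the script the model produced in the demo. The function names, the choice of daily variables, and the output format are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"


def build_forecast_url(latitude, longitude, days=3):
    """Build a forecast request URL; no API key parameter is needed."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "daily": "temperature_2m_max,temperature_2m_min",
        "timezone": "auto",
        "forecast_days": days,
    }
    return f"{OPEN_METEO_URL}?{urlencode(params)}"


def summarize_daily(payload):
    """Turn the 'daily' section of a response into one line per day."""
    daily = payload["daily"]
    return [
        f"{day}: {low} to {high} °C"
        for day, low, high in zip(
            daily["time"],
            daily["temperature_2m_min"],
            daily["temperature_2m_max"],
        )
    ]


def fetch_forecast(latitude, longitude, days=3, timeout=10):
    """Fetch and summarize a forecast (performs a network request)."""
    url = build_forecast_url(latitude, longitude, days)
    with urlopen(url, timeout=timeout) as response:
        return summarize_daily(json.load(response))
```

Keeping URL construction and response parsing separate from the network call makes the harness easy to test offline, and a UI layer such as a Gradio interface would only need to call fetch_forecast.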
