NVIDIA Nemotron 3 Ultra Release

NVIDIA Nemotron 3 Ultra Release

Nemotron 3 Ultra: A High-Performance Agentic Model

NVIDIA Nemotron 3 Ultra is a 550 billion parameter Mixture-of-Experts (MoE) model designed specifically for agentic use cases rather than general chatbot interactions. With 55 billion active parameters, the model is engineered to excel at coding, tool use, and long-horizon, multi-step actions, positioning it as a competitive open-weights alternative to frontier models like Anthropic Opus, GPT, and Gemini Pro.

Model Architecture and Specifications

Nemotron 3 Ultra utilizes a hybrid MoE architecture and supports multi-token prediction. Key technical specifications include:

  • Total Parameters: 550 Billion
  • Active Parameters: 55 Billion
  • Context Window: 1 Million tokens
  • Hardware Requirements: Due to its size, the model typically requires high-end hardware such as multiple H100s or B200 GPUs for local deployment, though it is widely available via inference providers and NVIDIA's cloud API.

Training Methodology and Open Recipes

NVIDIA has distinguished the Nemotron 3 Ultra release by publishing the training recipes and datasets used to create the model, providing a blueprint for organizations to fine-tune custom versions for specific enterprise tasks.

Multi-Teacher On-Policy Distillation

The model was developed using a multi-teacher on-policy distillation process. Instead of training a single model on all tasks, NVIDIA trained separate "teacher" models specialized in specific domains:

  • Coding
  • Tool Use
  • Instruction Following

These specialized teachers were then used to distill their knowledge into a single final model. This approach results in a model that is significantly stronger than one trained on a combined dataset from the start.

Post-Training for Agent Harnesses

To improve agentic performance, NVIDIA focused on post-training using trajectories from agent harnesses (such as OpenClaw, Hermes, or LangChain deep agents). By training on these trajectories, the model learns critical agent behaviors, including:

  • Error Correction: Learning to backtrack and fix errors when a task fails.
  • Tool Integration: Effectively utilizing tool calls and memory to complete complex tasks.
  • RL Environments: The model's capabilities were boosted through Reinforcement Learning (RL) environments, a technique NVIDIA is making public to benefit the open-source community.

Performance Benchmarks

Nemotron 3 Ultra demonstrates high efficiency and competitive performance, particularly in agent-centric benchmarks.

Agentic and General Benchmarks

  • Pinchbench: In benchmarks geared toward agent harnesses like OpenClaw, Nemotron 3 Ultra is the top-performing open-weights model, trailing only slightly behind proprietary models like Claude Opus.
  • Comparison to Large Models: Despite having fewer parameters than some trillion-parameter models (such as GLM 5.1), Nemotron 3 Ultra outperforms them in several tasks.
  • Inference Speed: According to data from the Artificial Analysis team, Nemotron 3 Ultra achieves speeds over 300 tokens per second, making it significantly faster than Kimi and GLM models.

Practical Implementation and Features

Reasoning Modes

Nemotron 3 Ultra is a reasoning model that allows users to control the "thinking" process via the API. Users can enable thinking and select from three effort levels:

  1. Low Effort: Provides short reasoning for low-cost, low-latency requirements.
  2. Default: The model autonomously decides the length of the chain-of-thought.
  3. Reasoning Budget: Allows users to set a maximum number of thinking tokens (e.g., 16,000), though the model often remains succinct regardless of the budget.

Tool Calling and Agentic Workflow

The model follows the OpenAI API format for tool definitions, ensuring compatibility with standard endpoints. In practical agentic runs, the model can:

  • Identify the correct tool for a specific query.
  • Generate precise arguments for that tool.
  • Process tool outputs to determine the next necessary action.
  • Iterate through multiple rounds of tool use before providing a final answer.

Sources