NVIDIA Nemotron 3.5 ASR Release Notes

Overview

NVIDIA Nemotron 3.5 ASR is a 600-million-parameter streaming automatic speech recognition (ASR) model designed to replace an entire speech-to-text stack with a single self-hosted model. Developed by the NVIDIA NeMo speech team, it transcribes 40 languages from a single checkpoint and is optimized for live streaming use cases where low latency is critical.

Cache-Aware Streaming for Low Latency

Nemotron 3.5 ASR uses "cache-aware streaming" to eliminate the redundant computation found in traditional buffered streaming.

The Problem with Overlapping Chunks

Traditional buffered streaming adapts a non-streaming encoder to live audio by feeding it overlapping chunks. The system transcribes a window of audio, slides the window forward, and transcribes the overlapping portion again, so the same audio passes through the encoder multiple times. This repeated processing increases compute cost and adds significant delay to the transcription.
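
To make the cost of the overlap concrete, the sketch below counts how many frames a buffered pipeline pushes through the encoder for a given window and hop. The function name and the frame counts are illustrative assumptions, not figures from the release.

```python
def buffered_frames_processed(total_frames: int, window: int, hop: int) -> int:
    """Count frames encoded when a `window`-frame buffer advances `hop` frames per step.

    Hypothetical helper for illustration only; real pipelines measure audio in
    samples or feature frames and may pad the final window.
    """
    if not 0 < hop <= window:
        raise ValueError("hop must be positive and no larger than window")
    processed = 0
    start = 0
    while True:
        end = min(start + window, total_frames)
        processed += end - start          # every frame in the window is re-encoded
        if end >= total_frames:
            break
        start += hop                      # slide forward, keeping window - hop frames of overlap
    return processed


total = 100
encoded = buffered_frames_processed(total, window=40, hop=10)
print(f"{encoded} frames encoded for {total} frames of audio "
      f"({encoded / total:.1f}x redundancy)")
```

With these assumed values, each frame is encoded close to three times on average. When the hop equals the window there is no overlap and every frame is encoded exactly once.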

The Cache-Aware Solution

Cache-aware streaming works much like a KV cache in LLM decoding. Instead of reprocessing overlaps, the model caches the encoder's self-attention states and activations and reuses them as new audio arrives. Each new chunk attends to the cached representations rather than recomputing them from raw audio. NVIDIA reports that this can improve performance by up to 17 times on an H100 GPU.
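
The idea can be illustrated with a toy encoder, sketched below under strong simplifying assumptions. The "encoder" is just a causal average over a fixed left context, and the cache holds past input frames rather than real attention states. The class and function names are hypothetical and are not part of NeMo. What the sketch does show is the key property: processing audio chunk by chunk with a cache gives the same output as a single offline pass, while touching each frame only once.

```python
from collections import deque


def encode_offline(frames, left_context):
    """Reference pass: each output averages the frame with its left context."""
    out = []
    for i in range(len(frames)):
        context = frames[max(0, i - left_context): i + 1]
        out.append(sum(context) / len(context))
    return out


class CachedStreamingEncoder:
    """Toy stand-in for a cache-aware encoder (assumption: not the real model)."""

    def __init__(self, left_context: int):
        # The cache keeps only the most recent `left_context` frames.
        self.cache = deque(maxlen=left_context)
        self.frames_processed = 0

    def step(self, chunk):
        out = []
        for frame in chunk:
            context = list(self.cache) + [frame]   # reuse cached state, add the new frame
            out.append(sum(context) / len(context))
            self.cache.append(frame)               # oldest entry is evicted automatically
            self.frames_processed += 1
        return out


frames = list(range(10))
encoder = CachedStreamingEncoder(left_context=3)
streamed = []
for i in range(0, len(frames), 2):                 # feed the audio two frames at a time
    streamed += encoder.step(frames[i:i + 2])

print(streamed == encode_offline(frames, 3))       # streaming matches the offline pass
print(encoder.frames_processed)                    # each frame was processed once
```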

Runtime Configuration and Language Support

Latency vs. Accuracy Trade-offs

Users can adjust the attention context size (chunk size) at runtime to balance latency and accuracy without needing to retrain the model. Available chunk sizes include:

80 milliseconds
160 milliseconds
320 milliseconds
560 milliseconds
Just over 1 second

Smaller chunks (e.g., 80 ms) provide faster, word-by-word output, while larger chunks (e.g., just over 1 second) transcribe full phrases at a time, with potentially higher accuracy.
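
One way an application might choose among these settings is to pick the largest chunk that still fits its latency budget. The helper below is a hypothetical sketch, not a NeMo API. It lists only the four sizes given exactly above and leaves out the largest setting, because its exact value is not stated here. It also treats chunk size as the only source of delay, which ignores compute and network time.

```python
# Chunk sizes stated exactly in these notes, in milliseconds.
# The largest setting ("just over 1 second") is omitted: its exact value is not given.
KNOWN_CHUNK_SIZES_MS = (80, 160, 320, 560)


def pick_chunk_ms(latency_budget_ms: int) -> int:
    """Return the largest known chunk size that fits the budget (hypothetical helper)."""
    candidates = [c for c in KNOWN_CHUNK_SIZES_MS if c <= latency_budget_ms]
    if not candidates:
        raise ValueError(
            f"budget of {latency_budget_ms} ms is below the smallest chunk "
            f"({min(KNOWN_CHUNK_SIZES_MS)} ms)"
        )
    return max(candidates)


print(pick_chunk_ms(200))   # a 200 ms budget selects the 160 ms chunk
```

Choosing the largest chunk that fits favors accuracy. An application that values responsiveness above all could simply use the smallest size instead.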

Multilingual Capabilities

The model's language support is tiered based on production readiness:

Out-of-the-box: 19 languages work without further adaptation, with optional automatic language detection.
Production-level: 13 additional languages are supported.
Adaptation: 8 languages (such as Thai) are pre-trained but require fine-tuning for serious production use.
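
The tiers can be captured in a small lookup table, as in the sketch below. The tier keys and the helper are hypothetical names chosen for illustration. Only the counts come from these notes, and they add up to the 40 languages covered by the single checkpoint.

```python
# Language counts per tier, taken from the list above. Key names are illustrative.
TIER_COUNTS = {
    "out_of_the_box": 19,   # usable as shipped, optional auto-detection
    "production": 13,       # additional production-level languages
    "adaptation": 8,        # pre-trained, fine-tuning recommended (e.g., Thai)
}


def needs_fine_tuning(tier: str) -> bool:
    """Return True when a language in this tier should be fine-tuned before production."""
    if tier not in TIER_COUNTS:
        raise KeyError(f"unknown tier: {tier}")
    return tier == "adaptation"


print(sum(TIER_COUNTS.values()))          # total languages across all tiers
print(needs_fine_tuning("adaptation"))
```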

Word Boosting for Domain-Specific Accuracy

Word boosting is a decode-time technique used to improve the transcription of rare words, such as product names, drug names, surnames, or technical jargon, which may not have been prevalent in the training data.

How Word Boosting Works

Unlike fine-tuning, word boosting requires no weight changes or retraining. It uses a boosting tree to generate and score candidates. By providing the model with a list of specific words or phrases and a corresponding "strength" value, the system adds a positive bias to the score of those tokens if the audio is close to the target phrase. This increases the probability that the model will predict the correct specialized term over a more common word that sounds similar.
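
The mechanism can be sketched with a small trie standing in for the boosting tree. This is a simplified illustration, not the NeMo implementation: it rescores complete candidate transcripts, whereas a real decoder would apply the bias token by token during search. All function names, candidate scores, and strength values are assumptions made for the example.

```python
_END = object()  # sentinel key marking the end of a boosted phrase in the trie


def build_boost_trie(phrases: dict) -> dict:
    """Build a word-level trie from {phrase: strength} (hypothetical helper)."""
    root = {}
    for phrase, strength in phrases.items():
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[_END] = strength
    return root


def boost_bonus(tokens: list, trie: dict) -> float:
    """Sum the strengths of every boosted phrase found in the token sequence."""
    bonus = 0.0
    for start in range(len(tokens)):
        node = trie
        for token in tokens[start:]:
            node = node.get(token)
            if node is None:
                break                      # no boosted phrase continues with this token
            if _END in node:
                bonus += node[_END]        # a full boosted phrase matched here
    return bonus


def best_candidate(candidates: list, phrases: dict) -> str:
    """Pick the candidate with the highest score after adding the boost bonus."""
    trie = build_boost_trie(phrases)
    scored = [
        (text, score + boost_bonus(text.lower().split(), trie))
        for text, score in candidates
    ]
    return max(scored, key=lambda pair: pair[1])[0]


# Made-up log-scores: the common word narrowly beats the rare product name.
candidates = [("open the neutron model", -3.0), ("open the nemotron model", -3.5)]
print(best_candidate(candidates, {}))                  # no boosting
print(best_candidate(candidates, {"nemotron": 1.0}))   # boosted term wins
```

The strength value controls how far a boosted phrase can overturn the model's own preference. In this example a strength below 0.5 would not be enough to change the result.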

Speaker Diarization and Attribution

Nemotron 3.5 ASR can be integrated into diarization pipelines to provide speaker-level attribution. This can be achieved through the NeMo framework or external models.

Key capabilities include:

Speaker Segmentation: Splitting a recording into segments and returning which speaker is talking in each (well suited to podcasts).
Embedding Capture: Capturing the embeddings of a known speaker (e.g., a user stating their name at the start of a recording) to assign that identity to the speaker throughout the transcript.
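
Assigning a known identity from a captured embedding usually comes down to a similarity comparison. The sketch below assumes that a separate speaker model has already produced one embedding per segment. The two-dimensional vectors, the 0.7 threshold, and the function names are illustrative assumptions, not values from NeMo.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_segments(segment_embeddings, enrolled, threshold=0.7):
    """Label each segment with the closest enrolled speaker, or "unknown".

    `enrolled` maps a speaker name to the embedding captured for that speaker,
    for example while they state their name at the start of the recording.
    """
    labels = []
    for embedding in segment_embeddings:
        name, score = max(
            ((n, cosine(embedding, ref)) for n, ref in enrolled.items()),
            key=lambda pair: pair[1],
            default=("unknown", 0.0),
        )
        labels.append(name if score >= threshold else "unknown")
    return labels


enrolled = {"alice": [1.0, 0.0]}             # toy enrollment embedding
segments = [[0.9, 0.1], [0.0, 1.0]]          # toy per-segment embeddings
print(label_segments(segments, enrolled))
```

The first segment is close to the enrolled embedding and takes that speaker's name. The second is orthogonal to it and stays unlabeled.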
