NVIDIA Nemotron 3.5 ASR Release Notes
Overview
NVIDIA Nemotron 3.5 ASR is a 600-million-parameter streaming automatic speech recognition (ASR) model designed to replace an entire speech-to-text stack with a single self-hosted solution. Developed by the NVIDIA NeMo speech team, the model transcribes 40 languages from a single checkpoint and is optimized for live streaming use cases where low latency is critical.
Cache-Aware Streaming for Low Latency
Nemotron 3.5 ASR uses "cache-aware streaming" to eliminate the redundant computation found in traditional buffered streaming.
The Problem with Overlapping Chunks
Systems built on traditional non-streaming encoders handle live audio by feeding it to the encoder in overlapping chunks. The system transcribes a window of audio, slides the window forward, and re-transcribes the overlapping section, so the same audio is processed multiple times. This repeated processing increases compute cost and adds significant delay to the transcription.
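The cost of overlap can be estimated with simple counting. The sketch below is illustrative only: the window and hop sizes are hypothetical values chosen for the arithmetic, not settings of any NVIDIA pipeline, and buffered_frames_processed is a made-up helper name.

```python
def buffered_frames_processed(total_frames: int, window: int, hop: int) -> int:
    """Count how many frames an encoder processes when a window of `window`
    frames slides forward by `hop` frames over `total_frames` of audio."""
    if not 0 < hop <= window:
        raise ValueError("hop must satisfy 0 < hop <= window")
    processed = 0
    start = 0
    while start < total_frames:
        end = min(start + window, total_frames)
        processed += end - start
        if end == total_frames:
            break
        start += hop
    return processed


total = 1000  # hypothetical number of audio frames in the stream

# Overlapping windows: each slide re-processes 150 of the previous 200 frames.
overlapping = buffered_frames_processed(total, window=200, hop=50)

# No overlap: every frame is processed exactly once.
non_overlapping = buffered_frames_processed(total, window=200, hop=200)

print(overlapping, non_overlapping, overlapping / total)
```

With these made-up numbers the overlapping scheme processes each frame 3.4 times on average, which is the kind of redundancy the cache-aware approach is meant to remove.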
The Cache-Aware Solution
Cache-aware streaming works much like a KV cache in LLM decoding. Instead of reprocessing overlaps, the model caches the encoder's self-attention states and intermediate activations and reuses them as new audio arrives. Because the model attends to cached representations rather than recomputing them from raw audio, each stretch of audio is encoded only once. NVIDIA reports that this can improve performance by up to 17 times on an H100 GPU.
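The caching idea can be sketched with a toy encoder. This is not the Nemotron architecture or the NeMo API: encode_frame stands in for an expensive per-frame computation, each output is assumed to depend only on the current frame and a fixed number of past frames, and all names are hypothetical. The point is that a streamer holding cached states reproduces the offline result while encoding every frame exactly once.

```python
from collections import deque


def encode_frame(x: float) -> float:
    """Stand-in for an expensive per-frame encoder computation."""
    return 2.0 * x + 1.0


def offline(frames, left_context):
    """Reference result: each output sums the current encoded frame and up to
    `left_context` encoded frames before it, with the whole input available."""
    feats = [encode_frame(x) for x in frames]
    return [sum(feats[max(0, i - left_context): i + 1]) for i in range(len(feats))]


class CachedStreamer:
    """Processes audio chunk by chunk, reusing cached encoder states."""

    def __init__(self, left_context: int):
        # The cache keeps only the most recent `left_context` encoded frames.
        self.cache = deque(maxlen=left_context)
        self.encode_calls = 0

    def step(self, chunk):
        outputs = []
        for x in chunk:
            feat = encode_frame(x)  # computed once, never recomputed
            self.encode_calls += 1
            outputs.append(sum(self.cache) + feat)
            self.cache.append(feat)
        return outputs


frames = [float(i) for i in range(10)]
streamer = CachedStreamer(left_context=3)
streamed = []
for start in range(0, len(frames), 4):  # feed chunks of 4 frames
    streamed.extend(streamer.step(frames[start:start + 4]))

print(streamed == offline(frames, left_context=3), streamer.encode_calls)
```

The streamed output matches the offline output even though no frame is ever re-encoded, because the state each new frame needs is already in the cache.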
Runtime Configuration and Language Support
Latency vs. Accuracy Trade-offs
Users can adjust the attention context size (chunk size) at runtime to balance latency and accuracy without needing to retrain the model. Available chunk sizes include:
- 80 milliseconds
- 160 milliseconds
- 320 milliseconds
- 560 milliseconds
- Just over 1 second
Smaller chunks (e.g., 80 ms) give faster, near word-by-word responses, while larger chunks (e.g., about 1 second) transcribe full phrases at a time with potentially higher accuracy.
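One way to reason about the trade-off is to pick the largest chunk size that still fits a latency budget, on the assumption that the model must receive a full chunk of audio before it can emit text for it. The helper below is a hypothetical sketch, not part of any NVIDIA API. It uses only the four chunk sizes stated exactly above; the largest option is left out because its exact value is not given here.

```python
# Chunk sizes listed above with exact values, in milliseconds.
CHUNK_SIZES_MS = (80, 160, 320, 560)


def pick_chunk_ms(latency_budget_ms: int) -> int:
    """Return the largest listed chunk size that fits within the budget."""
    candidates = [c for c in CHUNK_SIZES_MS if c <= latency_budget_ms]
    if not candidates:
        raise ValueError(
            f"no chunk size fits a budget of {latency_budget_ms} ms; "
            f"the smallest is {min(CHUNK_SIZES_MS)} ms"
        )
    return max(candidates)


def chunks_per_second(chunk_ms: int) -> float:
    """How many encoder steps one second of audio requires at this chunk size."""
    return 1000 / chunk_ms


for budget in (100, 200, 600):
    chosen = pick_chunk_ms(budget)
    print(budget, chosen, chunks_per_second(chosen))
```

Smaller chunks mean more encoder steps per second of audio as well as less context per step, which is why the choice is exposed as a runtime setting rather than fixed at training time.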
Multilingual Capabilities
The model's language support is tiered based on production readiness:
- Out-of-the-box: 19 languages are ready to use without adaptation, with optional automatic language detection.
- Production-level: 13 additional languages are supported.
- Adaptation: 8 languages (such as Thai) are pre-trained but require fine-tuning for serious production use.
Word Boosting for Domain-Specific Accuracy
Word boosting is a decode-time technique used to improve the transcription of rare words, such as product names, drug names, surnames, or technical jargon, which may not have been prevalent in the training data.
How Word Boosting Works
Unlike fine-tuning, word boosting requires no weight changes or retraining. It uses a boosting tree to generate and score candidates. By providing the model with a list of specific words or phrases and a corresponding "strength" value, the system adds a positive bias to the score of those tokens if the audio is close to the target phrase. This increases the probability that the model will predict the correct specialized term over a more common word that sounds similar.
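The scoring idea can be sketched as follows. This is a simplified illustration, not the NeMo implementation: it rescores finished candidate transcripts rather than biasing tokens during decoding, it matches whole words rather than subword tokens, and the scores, phrases, and function names are all hypothetical.

```python
_END = object()  # sentinel key marking the end of a boosted phrase in the tree


def build_tree(phrases):
    """Build a nested-dict prefix tree over the words of each boosted phrase.
    `phrases` maps a phrase to its boost strength."""
    root = {}
    for phrase, strength in phrases.items():
        node = root
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[_END] = strength
    return root


def boost_bonus(words, tree):
    """Add up the strength of every boosted phrase found in `words`."""
    bonus = 0.0
    for start in range(len(words)):
        node = tree
        for word in words[start:]:
            node = node.get(word)
            if node is None:
                break
            bonus += node.get(_END, 0.0)
    return bonus


def best_candidate(candidates, phrases):
    """Pick the candidate with the highest base score plus boost bonus.
    `candidates` is a list of (text, base_log_score) pairs."""
    tree = build_tree(phrases)
    return max(
        candidates,
        key=lambda c: c[1] + boost_bonus(c[0].lower().split(), tree),
    )[0]


# Hypothetical candidates: the common-word reading scores higher on its own.
candidates = [("take a seat a mini fen", -4.0), ("take acetaminophen", -5.5)]

print(best_candidate(candidates, {}))
print(best_candidate(candidates, {"acetaminophen": 2.0}))
```

Without boosting the more common wording wins; with a strength of 2.0 on the drug name, the specialized term overtakes it. No model weights change in either case.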
Speaker Diarization and Attribution
Nemotron 3.5 ASR can be integrated into diarization pipelines to provide speaker-level attribution. This can be achieved through the NeMo framework or external models.
Key capabilities include:
- Speaker Segmentation: Splitting a recording into segments by speaker and returning each speaker's segments separately (ideal for podcasts).
- Embedding Capture: Capturing the embeddings of a known speaker (e.g., a user stating their name at the start of a recording) to assign that identity to the speaker throughout the transcript.
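The embedding-capture idea can be sketched as a nearest-match lookup. This is a minimal illustration under stated assumptions: the embeddings are made-up three-dimensional vectors rather than the output of a real speaker model, the similarity threshold is arbitrary, and label_segments is a hypothetical helper, not a NeMo function.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_segments(segment_embeddings, enrolled, threshold=0.7):
    """Assign each segment the name of the most similar enrolled speaker,
    or "unknown" if no enrolled speaker is similar enough."""
    if not enrolled:
        return ["unknown"] * len(segment_embeddings)
    labels = []
    for emb in segment_embeddings:
        name, score = max(
            ((n, cosine(emb, ref)) for n, ref in enrolled.items()),
            key=lambda pair: pair[1],
        )
        labels.append(name if score >= threshold else "unknown")
    return labels


# Hypothetical embedding captured while a user states their name.
enrolled = {"Alice": [1.0, 0.0, 0.0]}

# Hypothetical embeddings for two later segments of the recording.
segments = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2]]

labels = label_segments(segments, enrolled)
print(labels)
```

The first segment is close to the enrolled embedding and inherits the name; the second is not, so it stays unattributed until another speaker is enrolled.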