Speech: a comprehensive framework for building and deploying ASR, TTS, and speech LLMs

What it solves

NVIDIA NeMo Speech gives researchers and developers a framework for creating, customizing, and deploying AI models for speech and audio. Its pre-trained model checkpoints and existing code let teams build complex speech systems without starting from scratch.

How it works

Built on PyTorch, the toolkit supports a wide range of model types, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Speech Large Language Models (Speech LLMs). It can be installed on top of an existing Python/PyTorch/CUDA stack or run from optimized Docker containers on high-performance hardware such as NVIDIA H100 or A100 GPUs.
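Before installing over an existing stack, it can help to record what that stack actually contains. The following is a minimal sketch, not part of NeMo itself: the helper name describe_stack is hypothetical, and it relies only on the standard library plus standard PyTorch calls, which it skips if PyTorch is absent.

```python
import importlib.util
import platform


def describe_stack() -> dict:
    """Summarize the local Python/PyTorch/CUDA stack (hypothetical helper)."""
    info = {
        "python": platform.python_version(),
        "torch": None,            # stays None when PyTorch is not installed
        "cuda_available": False,
        "gpus": [],
    }

    # Probe for PyTorch without importing it, so the check never crashes.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        # List visible GPUs by name, e.g. to confirm an H100 or A100 is present.
        info["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    for key, value in describe_stack().items():
        print(f"{key}: {value}")
```

If the report shows no PyTorch or no visible GPU, the Docker container route described above is likely the simpler starting point.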

Who it’s for

It is designed for AI researchers and PyTorch developers who specialize in audio, speech, and multimodal LLMs.

Highlights

  • Diverse Speech Capabilities: Supports ASR, TTS, and Speech LLMs, including full-duplex, natural, and interruptible conversations via Nemotron VoiceChat.
  • High Performance: Includes specialized architectures like FastConformer for streaming ASR with controllable latency.
  • Multilingual Support: Offers models like MagpieTTS and Parakeet/Canary that support multiple European and global languages.
  • Hardware Optimized: Specifically tuned for NVIDIA GPUs with support for accelerated backends like Transformer Engine and FlashAttention.
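To see whether the accelerated backends mentioned above are present in a given environment, one option is to probe for their Python packages. This is a rough sketch under stated assumptions: the import names transformer_engine and flash_attn are assumed to be the module names for Transformer Engine and FlashAttention, and find_backends is a hypothetical helper, not a NeMo API.

```python
import importlib.util

# Assumption: these are the import names of the optional backends.
OPTIONAL_BACKENDS = {
    "Transformer Engine": "transformer_engine",
    "FlashAttention": "flash_attn",
}


def find_backends(backends: dict = OPTIONAL_BACKENDS) -> dict:
    """Map each backend label to True if its module can be located."""
    return {
        label: importlib.util.find_spec(module) is not None
        for label, module in backends.items()
    }


if __name__ == "__main__":
    for label, present in find_backends().items():
        status = "installed" if present else "not found"
        print(f"{label}: {status}")
```

The check only confirms that a package can be located; it does not verify that the backend is compatible with the installed GPU, driver, or PyTorch build.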
