Speech: a comprehensive framework for building and deploying ASR, TTS, and speech LLMs

What it solves

NVIDIA NeMo Speech gives researchers and developers a single framework for creating, customizing, and deploying AI models for speech and audio. Its pre-trained model checkpoints and existing code let teams build complex speech systems without starting from scratch.

How it works

The toolkit is built on PyTorch and supports a wide range of speech models, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Speech Large Language Models (Speech LLMs). It can be installed on top of an existing Python/PyTorch/CUDA stack, or run from optimized Docker containers on high-performance hardware such as NVIDIA H100 or A100 GPUs.
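Before installing over an existing stack, it can help to check what that stack already contains. The following is a minimal sketch using only the Python standard library; the helper names `describe_stack` and `cuda_available` are hypothetical and not part of NeMo, and the top-level module name `nemo` is an assumption about how the package is imported.

```python
import importlib.util
import sys


def describe_stack(modules=("torch", "nemo")):
    """Report the Python version and which top-level modules are importable.

    The default module names are assumptions for illustration only.
    """
    report = {"python": "%d.%d" % sys.version_info[:2]}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


def cuda_available():
    """Return True only if PyTorch imports cleanly and sees a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print(describe_stack())
    print("CUDA available:", cuda_available())
```

If PyTorch or a CUDA device is missing, the Docker container route described above may be the simpler starting point.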

Who it’s for

It is designed for AI researchers and PyTorch developers who specialize in audio, speech, and multimodal LLMs.

Highlights

Diverse Speech Capabilities: Supports ASR, TTS, and Speech LLMs, including full-duplex, natural, and interruptible conversations via Nemotron VoiceChat.
High Performance: Includes specialized architectures such as FastConformer for streaming ASR with controllable latency.
Multilingual Support: Offers models such as MagpieTTS and Parakeet/Canary that support multiple European and other global languages.
Hardware Optimized: Specifically tuned for NVIDIA GPUs with support for accelerated backends like Transformer Engine and FlashAttention.
