CosyVoice: a scalable multilingual zero-shot text-to-speech synthesizer based on large language models

What it solves

CosyVoice is a text-to-speech (TTS) system built on large language models (LLMs) and designed for high-quality, zero-shot multilingual speech synthesis. It targets a hard problem: producing natural-sounding speech that preserves speaker similarity and content consistency across languages and dialects, especially in "in-the-wild" scenarios.

How it works

CosyVoice uses an LLM to generate speech from text. The latest version (Fun-CosyVoice 3.0) focuses on scaling up and post-training to improve prosody naturalness and speaker similarity. It supports zero-shot voice cloning: given a short sample of a speaker's voice, it can mimic that voice without retraining or fine-tuning on the speaker. Text normalization is handled within the system as well, so numbers and symbols are read correctly without a traditional frontend module.
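For context on what "a traditional frontend module" means here, the sketch below shows the kind of rule-based normalization that conventional TTS pipelines run before synthesis. It is a toy illustration only, not CosyVoice code: the rule tables and the names `spell_digits` and `normalize` are hypothetical, and a real frontend covers far more cases (dates, currencies, ordinals, abbreviations).

```python
import re

# Hypothetical, deliberately tiny rule tables for illustration.
_DIGIT_WORDS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]
_SYMBOL_WORDS = {"%": " percent", "&": " and "}


def spell_digits(match):
    """Read a run of digits out one digit at a time, e.g. '42' -> 'four two'."""
    return " ".join(_DIGIT_WORDS[int(d)] for d in match.group())


def normalize(text):
    """Expand digits and a few symbols into words, then collapse whitespace."""
    text = re.sub(r"[0-9]+", spell_digits, text)
    for symbol, word in _SYMBOL_WORDS.items():
        text = text.replace(symbol, word)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize("Room 42 is 50% full"))
```

Hand-written rules like these are brittle and language-specific, which is the maintenance burden a system avoids when normalization no longer depends on a separate frontend.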

Who it’s for

This project is for developers and researchers who need a scalable, high-performance TTS system capable of multilingual and cross-lingual synthesis, and for teams building production-ready audio applications that require low-latency streaming.

Highlights

Multilingual & Dialect Support: Covers 9 common languages and over 18 Chinese dialects/accents.
Zero-Shot Voice Cloning: Supports multilingual and cross-lingual voice cloning.
Low Latency: Achieves audio-out streaming latency as low as 150 ms.
Controllability: Supports pronunciation inpainting for Pinyin and English phonemes, and instructions for emotion, speed, and volume.
Deployment Options: Compatible with vLLM and Nvidia TensorRT-LLM for accelerated inference.
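
When evaluating a streaming latency figure like the one above, the number that matters is usually the time until the first audio chunk arrives. The sketch below shows one way to measure that for any chunk generator. It assumes nothing about CosyVoice's actual API: `fake_stream` is a hypothetical stand-in for a streaming synthesizer, and `first_chunk_latency` is an illustrative helper.

```python
import time


def fake_stream(n_chunks=3, delay_s=0.01):
    """Hypothetical stand-in for a streaming synthesizer; yields byte chunks."""
    for _ in range(n_chunks):
        time.sleep(delay_s)   # simulate per-chunk compute time
        yield b"\x00" * 320   # placeholder audio payload


def first_chunk_latency(stream):
    """Return (seconds until the first chunk, number of chunks consumed).

    The first element is None if the stream yields nothing.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    return first, count


if __name__ == "__main__":
    latency, chunks = first_chunk_latency(fake_stream())
    print(f"first chunk after {latency * 1000:.1f} ms, {chunks} chunks total")
```

To time a real system, replace `fake_stream()` with the synthesizer's streaming iterator and start the clock at the moment the request is issued.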
