CosyVoice: a scalable multilingual zero-shot text-to-speech synthesizer based on large language models
What it solves
CosyVoice is a text-to-speech (TTS) system built on large language models (LLMs) and designed for high-quality, zero-shot multilingual speech synthesis. It targets natural-sounding speech that preserves speaker similarity and content consistency across languages and dialects, especially in "in-the-wild" scenarios.
How it works
The system uses LLMs to generate speech. The latest version (Fun-CosyVoice 3.0) focuses on scaling up and post-training to improve prosody naturalness and speaker similarity. It supports zero-shot voice cloning: given a short reference sample, it can mimic a speaker's voice without retraining or fine-tuning. It also normalizes text itself, handling numbers and symbols without a traditional frontend module.
Who it’s for
This project is for developers and researchers who need a scalable, high-performance TTS system capable of multilingual and cross-lingual synthesis, and for teams building production-ready audio applications that require low-latency streaming.
Highlights
- Multilingual & Dialect Support: Covers 9 common languages and over 18 Chinese dialects/accents.
- Zero-Shot Voice Cloning: Supports multilingual and cross-lingual voice cloning.
- Low Latency: Achieves audio-out streaming latency as low as 150 ms.
- Controllability: Supports pronunciation inpainting for Pinyin and English phonemes, and instructions for emotion, speed, and volume.
- Deployment Options: Compatible with vLLM and NVIDIA TensorRT-LLM for accelerated inference.
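The low-latency highlight refers to how quickly the first audio chunk arrives in streaming mode. The sketch below shows one way such a figure could be measured against any chunk-yielding synthesizer. It does not call CosyVoice: `fake_streaming_tts` is a hypothetical stand-in that emits silent PCM chunks, and `first_chunk_latency`, the 16 kHz sample rate, and the 40 ms chunk size are all assumptions made for illustration.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_tts(text: str, chunk_ms: int = 40, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer.

    Yields one silent 16-bit mono PCM chunk per whitespace-separated word.
    """
    sample_rate = 16000  # assumed sample rate, not taken from the project
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2  # 2 bytes per sample
    for _ in text.split():
        time.sleep(delay_s)  # simulate per-chunk compute time
        yield b"\x00" * bytes_per_chunk


def first_chunk_latency(stream: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total chunk count).

    The stream must be lazy (for example a generator), so that the work
    for the first chunk happens inside the timed region.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _chunk in stream:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    if first is None:
        raise ValueError("stream produced no audio chunks")
    return first, count


if __name__ == "__main__":
    latency, chunks = first_chunk_latency(fake_streaming_tts("hello streaming world"))
    print(f"first chunk after {latency * 1000:.1f} ms, {chunks} chunks total")
```

To measure a real deployment, the stand-in generator would be swapped for the actual streaming call. The timer should start when the request is issued, so that model and network time are both included.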
Sources
- FunAudioLLM/CosyVoice