WhisperLiveKit: an ultra-low-latency self-hosted speech-to-text pipeline with real-time diarization and translation
What it solves
WhisperLiveKit (WLK) provides an ultra-low-latency, self-hosted speech-to-text (STT) pipeline. Standard Whisper models are designed for complete utterances, so feeding them small real-time audio chunks often loses context or cuts words off mid-syllable. WLK draws on research in simultaneous speech recognition to buffer audio intelligently and process it incrementally, producing high-quality transcription in real time.
How it works
WLK implements a backend that supports multiple concurrent users and uses Voice Activity Detection (VAD) to avoid spending compute on silence. To process audio streams, it combines several state-of-the-art streaming policies (such as AlignAtt, as used in SimulStreaming, and LocalAgreement) with several model backends (including Faster-Whisper, MLX for Apple Silicon, Voxtral, and Qwen3-ASR). It exposes these capabilities through an OpenAI-compatible REST API, a Deepgram-compatible WebSocket, and a native WebSocket for real-time streaming.
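To make the idea of a streaming policy concrete, here is a minimal sketch of a LocalAgreement-style rule with an agreement window of two: a word is committed only once two consecutive hypotheses agree on it. This is an illustration under simplifying assumptions, not WLK's implementation. The class name `LocalAgreement2` and the helper `common_prefix` are hypothetical, hypotheses are plain word lists without timestamps, and each new hypothesis is assumed to begin with the words already committed.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest shared leading run of words."""
    out: List[str] = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out


class LocalAgreement2:
    """Hypothetical helper: commit words that two consecutive hypotheses agree on."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # words already emitted, never revised
        self.previous: List[str] = []   # uncommitted tail of the last hypothesis

    def update(self, hypothesis: List[str]) -> List[str]:
        """Take the full hypothesis for the audio so far; return newly committed words."""
        # Assumption: the hypothesis starts with the words committed earlier.
        tail = hypothesis[len(self.committed):]
        agreed = common_prefix(self.previous, tail)
        self.committed.extend(agreed)
        self.previous = tail[len(agreed):]
        return agreed


policy = LocalAgreement2()
policy.update(["the", "cat"])                        # nothing to compare against yet
policy.update(["the", "cat", "sat", "on"])           # commits "the", "cat"
policy.update(["the", "cat", "sat", "in", "the"])    # commits "sat"; "on" was unstable
print(" ".join(policy.committed))
```

The trade-off is latency for stability: a word appears one decoding step later than it was first guessed, but committed text is never retracted.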
Who it’s for
This tool is designed for developers building real-time transcription services, accessibility tools for the hearing-impaired, meeting transcription software, and content creators needing automatic subtitles for podcasts or videos.
Highlights
- Multi-Backend Support: Runs on MLX (Apple Silicon), CUDA (NVIDIA), and CPU, with specialized support for the Voxtral and Qwen3-ASR models.
- Real-time Diarization: Supports speaker identification using Sortformer or Diart.
- Simultaneous Translation: Translates speech to and from 200 languages using NLLW.
- Flexible API: Offers drop-in replacements for OpenAI and Deepgram APIs, making it easy to integrate into existing workflows.
- Deployment Ready: Includes Docker support and Nginx configuration guides for production deployment.
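As a rough illustration of what "drop-in replacement" means for the OpenAI-style REST API, the sketch below builds a multipart request in the shape of OpenAI's transcription endpoint (`POST /v1/audio/transcriptions` with `file` and `model` form fields) and points it at a self-hosted server. The base URL `http://localhost:8000`, the model name `whisper-1`, and the assumption that WLK serves this exact path are all unverified placeholders; the function `build_transcription_request` is hypothetical. The request is only constructed here, not sent.

```python
import uuid
import urllib.request

# Assumption: a WLK server is reachable here; adjust to your deployment.
BASE_URL = "http://localhost:8000"


def build_transcription_request(
    audio: bytes,
    filename: str = "audio.wav",
    model: str = "whisper-1",  # placeholder model name
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST in the OpenAI transcription format."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    body = model_part + file_header + audio + closing
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/transcriptions",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


req = build_transcription_request(b"RIFF....", filename="meeting.wav")
print(req.get_method(), req.full_url)
```

With a server running, passing `req` to `urllib.request.urlopen` would send it. An existing OpenAI client could in principle be redirected the same way by changing only its base URL, which is the point of a compatible API.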
Sources
- QuentinFuxa/WhisperLiveKit