WhisperLiveKit: an ultra-low-latency self-hosted speech-to-text pipeline with real-time diarization and translation

What it solves

WhisperLiveKit (WLK) is an ultra-low-latency, self-hosted speech-to-text (STT) pipeline. Standard Whisper models struggle when fed short real-time audio chunks: they often lose context or cut words off at chunk boundaries. WLK applies research on simultaneous speech recognition, combining intelligent buffering with incremental processing to produce high-quality transcription in real time.

How it works

WLK runs as a server that handles multiple concurrent users and uses Voice Activity Detection (VAD) to reduce overhead when no one is speaking. To process audio streams, it combines streaming policies such as AlignAtt SimulStreaming and LocalAgreement with model backends including Faster-Whisper, MLX (for Apple Silicon), Voxtral, and Qwen3-ASR. These capabilities are exposed through three interfaces: an OpenAI-compatible REST API, a Deepgram-compatible WebSocket, and a native WebSocket for real-time streaming.
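
To make the idea of a streaming policy concrete, here is a minimal sketch of the LocalAgreement idea with n = 2: a word is committed only once two consecutive hypotheses agree on it, so the unstable tail of each hypothesis is held back. This is an illustration of the general technique, not WLK's own code. The class and function names are hypothetical, and the sketch assumes each new hypothesis is a full re-transcription that begins with the words already committed (a real system would also trim its audio buffer).

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest leading run of words shared by a and b."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement2:
    """Hypothetical helper: commit words confirmed by two consecutive hypotheses."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # words already emitted as final
        self.previous: List[str] = []   # unconfirmed tail of the last hypothesis

    def update(self, hypothesis: List[str]) -> List[str]:
        """Feed the latest full hypothesis; return the newly committed words."""
        # Assumption: the hypothesis starts with the committed words,
        # so only the part after them is still open for comparison.
        tail = hypothesis[len(self.committed):]
        agreed = common_prefix(self.previous, tail)
        self.committed.extend(agreed)
        # Whatever was not confirmed waits for the next hypothesis.
        self.previous = tail[len(agreed):]
        return agreed
```

Feeding the hypotheses ["the", "cat"], then ["the", "cat", "sat"], then ["the", "cat", "sat", "on"] commits nothing on the first call, "the cat" on the second, and "sat" on the third. The cost of this stability is latency: every word waits for one extra decoding pass before it is shown as final.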

Who it’s for

This tool is designed for developers building real-time transcription services, accessibility tools for the hearing-impaired, meeting transcription software, and content creators needing automatic subtitles for podcasts or videos.

Highlights

  • Multi-Backend Support: Runs on MLX (Apple Silicon), CUDA (NVIDIA), and CPU, with specialized support for Voxtral and Qwen3-ASR.
  • Real-time Diarization: Supports speaker identification using Sortformer or Diart.
  • Simultaneous Translation: Translates speech to and from 200 languages using NLLW.
  • Flexible API: Offers drop-in replacements for OpenAI and Deepgram APIs, making it easy to integrate into existing workflows.
  • Deployment Ready: Includes Docker support and Nginx configuration guides for production deployment.
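
As a rough illustration of what "drop-in replacement" for the OpenAI API could mean in practice, the sketch below builds (but does not send) a multipart POST in the shape of OpenAI's transcription endpoint, using only the Python standard library. The path /v1/audio/transcriptions and the form fields "file" and "model" follow OpenAI's own API. Everything else is an assumption: the host and port, the model name "whisper-1", and whether a given WLK deployment serves that exact path. The function name is hypothetical.

```python
import urllib.request
import uuid


def build_transcription_request(
    base_url: str,
    filename: str,
    audio: bytes,
    model: str = "whisper-1",  # assumption: placeholder model name
) -> urllib.request.Request:
    """Build a multipart/form-data POST shaped like an OpenAI transcription call."""
    boundary = uuid.uuid4().hex
    # Text part carrying the model name, then the header of the file part.
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    # Closing boundary that terminates the multipart body.
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=head + audio + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. With a server listening at an assumed address such as http://localhost:8000, an existing OpenAI client would only need its base URL changed, which is the point of offering a compatible API.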
