RealtimeSTT: a Python speech-to-text library with integrated voice activity detection and wake word support

What it solves

RealtimeSTT adds speech-to-text (STT) to Python applications. It handles voice activity detection (VAD), audio stream management, and wake word detection internally, so developers can turn speech into text with only a few lines of code.

How it works

The library centers on the AudioToTextRecorder class, which can capture audio directly from a microphone or receive audio chunks from external sources such as files or WebSockets. Transcription runs through a modular engine system that defaults to faster_whisper and also supports other engines, including kroko_onnx and whisper.cpp. VAD (via WebRTC or Silero) detects when speech starts and ends, and optional wake word detection (via Porcupine or OpenWakeWord) can trigger recording.
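As a rough illustration of the recorder-centered design, the sketch below builds a set of options and passes them to AudioToTextRecorder. The keyword names (on_recording_start, on_recording_stop, wake_words) and the text() and shutdown() methods are assumptions based on commonly documented usage and may differ between versions; recorder_options and main are hypothetical helpers, not part of the library.

```python
def on_start():
    # Called when the recorder begins capturing speech (assumed callback hook).
    print("recording started")


def on_stop():
    # Called when the recorder stops capturing speech (assumed callback hook).
    print("recording stopped")


def recorder_options(wake_word=None):
    """Collect keyword arguments for the recorder.

    Hypothetical helper; the keyword names are assumptions, so check them
    against the installed version of the library.
    """
    options = {
        "on_recording_start": on_start,
        "on_recording_stop": on_stop,
    }
    if wake_word:
        # Only request wake word activation when a word is given.
        options["wake_words"] = wake_word
    return options


def main():
    # Imported lazily so the helper above works without the library installed.
    from RealtimeSTT import AudioToTextRecorder

    recorder = AudioToTextRecorder(**recorder_options())
    try:
        # Assumed to block until one utterance is recorded and transcribed.
        print(recorder.text())
    finally:
        recorder.shutdown()
```

Calling main() requires the library, a transcription model, and a working microphone. Because the library is reported to use multiprocessing, it is safest to call main() from under an `if __name__ == "__main__":` guard.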

Who it’s for

This tool is designed for developers building AI assistants, dictation software, browser-based streaming servers, and rapid prototypes that require fast, local speech recognition.

Highlights

  • Flexible Audio Input: Supports both direct microphone access and external PCM audio chunks.
  • Multiple Engine Support: Works with several transcription engines, including faster-whisper, OpenAI Whisper, and Kroko-ONNX.
  • Integrated VAD and Wake Words: Built-in support for voice activity detection and customizable wake word activation.
  • Event-Driven Architecture: Provides callbacks for recording, VAD state, and transcription updates.
  • Web Server Example: Includes a FastAPI reference server for browser-based streaming with multi-user session isolation.
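To make the external-chunk input and the VAD idea more concrete, here is a toy sketch that splits raw 16-bit mono PCM into fixed-size chunks and flags each chunk as speech or silence. The energy threshold is a deliberately simple stand-in and is not how WebRTC or Silero VAD work; the sample rate, chunk size, threshold, and all function names are assumptions chosen for illustration.

```python
import math
import struct

SAMPLE_RATE = 16000      # assumed: 16 kHz mono
BYTES_PER_SAMPLE = 2     # assumed: signed 16-bit little-endian PCM


def iter_pcm_chunks(pcm: bytes, chunk_ms: int = 30):
    """Yield consecutive chunks of raw PCM, each chunk_ms long (last may be shorter)."""
    chunk_bytes = SAMPLE_RATE * chunk_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def rms(chunk: bytes) -> float:
    """Root-mean-square amplitude of one chunk of 16-bit samples."""
    count = len(chunk) // BYTES_PER_SAMPLE
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", chunk[:count * BYTES_PER_SAMPLE])
    return math.sqrt(sum(s * s for s in samples) / count)


def speech_flags(pcm: bytes, threshold: float = 500.0, chunk_ms: int = 30):
    """Return one boolean per chunk: True where energy reaches the threshold."""
    return [rms(chunk) >= threshold for chunk in iter_pcm_chunks(pcm, chunk_ms)]


def tone(ms: int, amplitude: int = 8000, freq: float = 440.0) -> bytes:
    """Generate a sine tone as raw PCM, used here as a stand-in for speech."""
    n = SAMPLE_RATE * ms // 1000
    return b"".join(
        struct.pack("<h", int(amplitude * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)))
        for i in range(n)
    )


silence = bytes(SAMPLE_RATE * 60 // 1000 * BYTES_PER_SAMPLE)  # 60 ms of zeros
flags = speech_flags(silence + tone(60) + silence)
```

In a real application each chunk would be handed to a recorder created with microphone capture disabled, rather than scored locally. The method for doing so is often shown as feed_audio, but treat that name as an assumption and verify it against the installed version.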
