moonshine: a low-latency on-device voice toolkit for real-time agents and streaming speech-to-text
What it solves
Moonshine Voice is a high-performance, on-device AI toolkit for building real-time voice agents and applications. Models like OpenAI's Whisper process audio in fixed 30-second input windows, which adds latency and repeats computation when they are run on live audio. Moonshine addresses this with flexible input windows and caching for streaming, making it suitable for live speech interfaces on devices ranging from high-end Macs to microcontrollers.
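To make the redundant-computation point concrete, here is a toy cost model in Python. It only counts how many audio frames an encoder would have to process under each strategy. The class and function names, the frame rate, and the chunk sizes are all assumptions made for illustration, and none of this is Moonshine's actual implementation.

```python
# Toy cost model: count how many audio frames an "encoder" must process.
# Every name and number here is hypothetical; this is not Moonshine's code.

FIXED_WINDOW = 3000  # frames per fixed window (assumed: 30 s at 100 frames/s)


def fixed_window_cost(chunk_sizes):
    """Each update re-encodes one full fixed-size window, padded if needed."""
    return FIXED_WINDOW * len(chunk_sizes)


class CachingEncoder:
    """Keeps already-encoded state so each update only processes new frames."""

    def __init__(self):
        self.cached_frames = 0     # frames whose encoding is already cached
        self.frames_processed = 0  # total encoding work done so far

    def update(self, new_frames):
        # Only the new frames are encoded; earlier ones come from the cache.
        self.frames_processed += new_frames
        self.cached_frames += new_frames
        return self.cached_frames


chunks = [50] * 10  # ten updates of 0.5 s each, 5 s of speech in total
encoder = CachingEncoder()
for n in chunks:
    encoder.update(n)

print(fixed_window_cost(chunks))  # 30000
print(encoder.frames_processed)   # 500
```

Under these assumptions, the fixed-window strategy does 60 times more encoding work for the same five seconds of speech. The real ratio depends on the model and on how often the transcript is refreshed.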
How it works
Moonshine is built on a portable C++ core library that uses ONNX Runtime for cross-platform performance. Transcribers and Intent Recognizers process audio input via Streams. The framework wraps the stages of speech processing in a single library: microphone capture, voice activity detection, speech-to-text (STT), speaker identification, and intent recognition. Event-based APIs (via TranscriptEventListeners) notify applications of speech updates in real time.
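The event-based flow described above can be sketched as a small listener pattern in Python. This is a minimal illustration, not Moonshine's API: the classes `TranscriptListener`, `CollectingListener`, and `ToyStream`, and the callback names `on_line_updated` and `on_line_completed`, are hypothetical, and each "audio chunk" is simply a word so the example runs without a model.

```python
# Hypothetical sketch of an event-based transcription pipeline.
# None of these classes or methods are Moonshine's real API.


class TranscriptListener:
    """Base listener; applications override the callbacks they need."""

    def on_line_updated(self, text):
        pass

    def on_line_completed(self, text):
        pass


class CollectingListener(TranscriptListener):
    """Records every event so the application can inspect them later."""

    def __init__(self):
        self.events = []

    def on_line_updated(self, text):
        self.events.append(("updated", text))

    def on_line_completed(self, text):
        self.events.append(("completed", text))


class ToyStream:
    """Stands in for an audio stream; each 'chunk' is already a word."""

    def __init__(self):
        self.listeners = []
        self.words = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def add_audio(self, chunk):
        # A real system would run voice activity detection and STT here.
        self.words.append(chunk)
        for listener in self.listeners:
            listener.on_line_updated(" ".join(self.words))

    def end_of_speech(self):
        # Emit the final line and reset state for the next utterance.
        line = " ".join(self.words)
        self.words = []
        for listener in self.listeners:
            listener.on_line_completed(line)


stream = ToyStream()
listener = CollectingListener()
stream.add_listener(listener)
for chunk in ["turn", "on", "the", "lights"]:
    stream.add_audio(chunk)
stream.end_of_speech()

for event in listener.events:
    print(event)
```

The application never polls for results. It registers a listener once and receives partial updates as audio arrives, followed by a single completed line, which is what lets a UI show text while the user is still speaking.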
Who it’s for
Developers building voice-driven applications, conversational agents, and IoT devices who require low-latency, private, on-device processing without relying on cloud APIs or expensive hardware.
Highlights
- On-device processing: Fast, private, and requires no API keys or accounts.
- Streaming optimization: Low latency achieved through flexible input windows and state caching.
- Broad platform support: Available for Python, iOS, Android, macOS, Linux, Windows, Raspberry Pi, and microcontrollers.
- High accuracy: The Medium Streaming model achieves a lower word-error rate (WER) than Whisper Large V3 while using about six times fewer parameters (250M vs 1.5B).
- Comprehensive toolkit: Includes STT, text-to-speech (TTS), speaker identification (diarization), and command recognition in one library.
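For readers unfamiliar with the word-error rate (WER) metric cited in the accuracy highlight, the sketch below computes it the standard way: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The function name and the example sentences are illustrative assumptions, and this is not the evaluation code behind the numbers above, which may also normalize text before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("the") and one substitution ("lights" -> "light")
# over a five-word reference.
print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))  # 0.4
```

A lower WER is better, and a value of 0.0 means the transcript matches the reference word for word.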
Sources
- moonshine-ai/moonshine