argmax-oss-swift: on-device audio inference frameworks for Apple platforms providing speech-to-text, text-to-speech, and speaker diarization
What it solves
The Argmax Open-Source SDK provides a set of turn-key frameworks for running AI audio models entirely on-device on Apple platforms (macOS and iOS). It removes the need for cloud-based APIs for common audio tasks such as transcription, speech synthesis, and speaker diarization. Because audio is processed locally, there is no network round trip and the data stays on the device.
How it works
The SDK is a collection of three specialized "Kits" built on Core ML to run optimized models on Apple silicon:
- WhisperKit: Implements OpenAI's Whisper for speech-to-text transcription and translation.
- TTSKit: Uses Qwen-TTS models for text-to-speech generation, supporting real-time streaming playback and natural language style instructions.
- SpeakerKit: Uses Pyannote for speaker diarization (determining who spoke when).
It includes a Swift CLI for testing and a local server that mimics the OpenAI Audio API, allowing developers to integrate these on-device capabilities using existing OpenAI-compatible clients.
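Because the local server mimics the OpenAI Audio API, a client can in principle talk to it with an ordinary HTTP request. The following is a minimal sketch using only the Python standard library, not the project's own client code. The `/v1/audio/transcriptions` path and the `file` and `model` form fields come from the OpenAI Audio API; the host, port, and the model name `tiny` are assumptions for illustration, and `build_transcription_request` and `transcribe` are hypothetical helper names.

```python
import json
import urllib.request
import uuid

# Assumption: address of the local server. The real host and port depend on
# how the server is started; check the project's documentation.
BASE_URL = "http://localhost:50060/v1"


def build_transcription_request(audio_bytes, filename, model, base_url=BASE_URL):
    """Build (without sending) a multipart POST for /audio/transcriptions."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # "model" form field (the value the server expects is an assumption)
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="model"\r\n\r\n',
        model.encode() + b"\r\n",
        # "file" form field carrying the raw audio bytes
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio_bytes + b"\r\n",
        # closing boundary
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url}/audio/transcriptions",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def transcribe(path, model="tiny"):
    """Send an audio file to the local server and return the decoded JSON reply."""
    with open(path, "rb") as f:
        request = build_transcription_request(f.read(), path, model)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the server running, `transcribe("meeting.wav")` would send the file and return the parsed response. An existing OpenAI client library pointed at the same base URL should behave equivalently, provided the server implements the fields that client sends.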
Who it’s for
Apple developers building apps for iOS and macOS who want to integrate high-quality speech-to-text, text-to-speech, or speaker diarization without relying on external servers.
Highlights
- On-Device Inference: Runs entirely on Apple silicon via Core ML.
- OpenAI API Compatibility: Includes a local server that implements the OpenAI Audio API for easy integration.
- Real-Time Streaming: TTSKit supports frame-by-frame audio playback as it is generated.
- Multilingual Support: Supports a wide range of languages for both transcription and speech synthesis.
- Flexible Model Selection: Offers various model sizes (e.g., Tiny to Large) to balance speed and accuracy.
Sources
- argmaxinc/argmax-oss-swift