argmax-oss-swift: on-device audio inference frameworks for Apple platforms providing speech-to-text, text-to-speech, and speaker diarization

What it solves

The Argmax Open-Source SDK is a set of turnkey frameworks for running AI audio models entirely on-device on Apple platforms (macOS and iOS). It removes the need for cloud-based APIs for common audio tasks such as transcription, speech synthesis, and speaker diarization. Because audio is processed locally, there is no network round trip and recordings never leave the device.

How it works

The SDK is a collection of three specialized "Kits" built on Core ML to run optimized models on Apple silicon:

WhisperKit: Implements OpenAI's Whisper for speech-to-text transcription and translation.
TTSKit: Uses Qwen-TTS models for text-to-speech generation, supporting real-time streaming playback and natural language style instructions.
SpeakerKit: Uses Pyannote for speaker diarization (determining who spoke when).
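
To make "who spoke when" concrete, the sketch below shows one way diarization output could be combined with a timestamped transcript: each word is assigned to the speaker segment it overlaps most. The types `SpeakerSegment` and `TimedWord` and the function `speaker(for:in:)` are hypothetical names invented for this illustration; they are not the actual SpeakerKit or WhisperKit API.

```swift
import Foundation

// Hypothetical types for illustration only; not the SDK's own API.
struct SpeakerSegment {
    let speaker: String   // label such as "A" or "B"
    let start: Double     // seconds
    let end: Double       // seconds
}

struct TimedWord {
    let text: String
    let start: Double     // seconds
    let end: Double       // seconds
}

/// Returns the label of the segment that overlaps the word the longest,
/// or nil when no segment overlaps the word at all.
func speaker(for word: TimedWord, in segments: [SpeakerSegment]) -> String? {
    var best: (speaker: String, overlap: Double)? = nil
    for segment in segments {
        // Length of the intersection of the two time intervals.
        let overlap = min(word.end, segment.end) - max(word.start, segment.start)
        if overlap > 0, overlap > (best?.overlap ?? 0) {
            best = (segment.speaker, overlap)
        }
    }
    return best?.speaker
}
```

A word that straddles a speaker change goes to whichever speaker covers more of it, which is a simple heuristic rather than the only reasonable choice.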

The SDK also includes a Swift CLI for testing and a local server that mimics the OpenAI Audio API, so developers can reach these on-device capabilities from existing OpenAI-compatible clients.
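
As a rough illustration of what an OpenAI-compatible client sends, the sketch below builds a multipart request in the shape of OpenAI's `POST /v1/audio/transcriptions` endpoint using only Foundation. The helper name `makeTranscriptionRequest` is invented here, and the base URL and model identifier passed to it are assumptions: check the server's own documentation for the address it listens on and the model names it accepts.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on Linux
#endif

/// Builds a multipart/form-data request shaped like OpenAI's
/// `POST /v1/audio/transcriptions`. `baseURL` and `model` are
/// caller-supplied placeholders, not values taken from the SDK.
func makeTranscriptionRequest(
    baseURL: URL,
    audio: Data,
    filename: String,
    model: String,
    boundary: String = "Boundary-\(UUID().uuidString)"
) -> URLRequest {
    var request = URLRequest(url: baseURL.appendingPathComponent("v1/audio/transcriptions"))
    request.httpMethod = "POST"
    request.setValue("multipart/form-data; boundary=\(boundary)",
                     forHTTPHeaderField: "Content-Type")

    var body = Data()
    // Text field: which model should handle the request.
    body.append(Data("--\(boundary)\r\n".utf8))
    body.append(Data("Content-Disposition: form-data; name=\"model\"\r\n\r\n".utf8))
    body.append(Data("\(model)\r\n".utf8))
    // File field: the raw audio bytes.
    body.append(Data("--\(boundary)\r\n".utf8))
    body.append(Data("Content-Disposition: form-data; name=\"file\"; filename=\"\(filename)\"\r\n".utf8))
    body.append(Data("Content-Type: application/octet-stream\r\n\r\n".utf8))
    body.append(audio)
    // Closing boundary ends the multipart body.
    body.append(Data("\r\n--\(boundary)--\r\n".utf8))
    request.httpBody = body
    return request
}
```

The resulting request can be sent with `URLSession`. Because only the base URL differs from a cloud setup, the same client code could in principle target either a hosted endpoint or the local server.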

Who it’s for

Apple developers building apps for iOS and macOS who want to integrate high-quality speech-to-text, text-to-speech, or speaker diarization without relying on external servers.

Highlights

On-Device Inference: Runs entirely on Apple silicon via Core ML.
OpenAI API Compatibility: Includes a local server that implements the OpenAI Audio API for easy integration.
Real-Time Streaming: TTSKit supports frame-by-frame audio playback as it is generated.
Multilingual Support: Supports a wide range of languages for both transcription and speech synthesis.
Flexible Model Selection: Offers various model sizes (e.g., Tiny to Large) to balance speed and accuracy.
