mlx-audio: an optimized audio processing library for Apple Silicon supporting TTS, STT, and STS

What it solves

MLX-Audio is an audio processing library optimized for Apple Silicon (M-series chips). It simplifies deploying audio AI tasks, such as text-to-speech, speech-to-text transcription, and speech-to-speech transformation, by running inference on the MLX framework.

How it works

The library acts as a unified interface for a wide variety of pre-trained audio models. It supports multiple architectures for Text-to-Speech (TTS), Speech-to-Text (STT), and Speech-to-Speech (STS) tasks. To optimize performance and memory usage on Mac hardware, it includes support for quantization (ranging from 3-bit to 8-bit) and provides both a Python API and a command-line interface for generation and transcription.
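To make the quantization trade-off above concrete, here is a minimal, library-independent sketch of n-bit affine quantization in plain Python. It is not MLX-Audio's implementation: the function names (quantize, dequantize, max_error, packed_bytes) and the sample weights are hypothetical, and real quantizers typically work on groups of weights rather than one whole list. It only illustrates why fewer bits mean less memory and more rounding error.

```python
def quantize(values, bits):
    """Map floats to integer codes in [0, 2**bits - 1] using one scale and offset."""
    if not 3 <= bits <= 8:
        raise ValueError("this sketch covers the 3-bit to 8-bit range only")
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 when all values are equal
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]


def max_error(values, bits):
    """Largest absolute difference between original and restored values."""
    codes, scale, lo = quantize(values, bits)
    restored = dequantize(codes, scale, lo)
    return max(abs(a - b) for a, b in zip(values, restored))


def packed_bytes(count, bits):
    """Bytes needed to store `count` codes of `bits` bits each, tightly packed."""
    return (count * bits + 7) // 8


# Hypothetical weights, chosen only for illustration.
weights = [-0.82, -0.31, -0.05, 0.0, 0.12, 0.47, 0.9]

for bits in (3, 4, 8):
    print(bits, "bits:", packed_bytes(1_000_000, bits), "bytes per million weights,",
          "max error", max_error(weights, bits))
```

Running the loop shows storage shrinking as the bit width drops while the reconstruction error grows, which is the trade-off a user makes when picking a quantization level.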

Who it’s for

It is designed for developers building audio-centric applications on macOS or iOS, as well as researchers needing a fast way to run state-of-the-art audio models on Apple hardware.

Highlights

  • Comprehensive Model Support: Integrates numerous models including Kokoro, Whisper, Qwen3-TTS/ASR, and OmniVoice.
  • Versatile Audio Tasks: Supports multilingual TTS, zero-shot voice cloning, speaker diarization, and noise suppression.
  • OpenAI-Compatible API: Includes a REST API server for easier integration into existing workflows.
  • Apple Ecosystem Integration: Optimized for M-series chips and includes a Swift package for native iOS/macOS app development.
  • Advanced Controls: Offers speech speed control, 3D audio visualization in its web interface, and streaming audio generation.
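As an illustration of the OpenAI-compatible API highlight, the sketch below builds (but does not send) a text-to-speech request in the shape of OpenAI's own /v1/audio/speech endpoint, using only the Python standard library. Everything specific here is an assumption rather than documented MLX-Audio behavior: the host and port, the model and voice names, the helper name build_speech_request, and the idea that the server mirrors this exact path and payload. Check the project's own documentation before relying on any of it.

```python
import json
import urllib.request


def build_speech_request(base_url, text, model, voice):
    """Build an OpenAI-style text-to-speech POST request without sending it."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",  # path follows OpenAI's API; assumed here
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: the address, model name, and voice name are hypothetical.
request = build_speech_request(
    "http://localhost:8000",
    "Hello from Apple Silicon.",
    model="placeholder-tts-model",
    voice="placeholder-voice",
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a compatible server running, passing the request to urllib.request.urlopen would send it, and the response body would be expected to contain the generated audio bytes. Because the request shape matches OpenAI's, existing client code can often be pointed at a local server by changing only the base URL.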
