IBM Granite Speech 4.1 Release: High-Throughput ASR Models
IBM has released Granite Speech 4.1, a suite of three 2B-parameter Automatic Speech Recognition (ASR) models designed for edge deployment. These models allow developers to choose a variant based on their specific performance bottleneck, whether it is raw accuracy, the need for speaker attribution, or extreme processing throughput.
Granite Speech 4.1 2B: The High-Accuracy Workhorse
Granite Speech 4.1 2B is the base model and currently leads the Open ASR leaderboard on Hugging Face with an average word error rate (WER) of 5.33%. Because this figure is averaged over the leaderboard's test sets, it is presented as a more reliable indicator of real-world performance than a single benchmark such as LibriSpeech.
Key Performance and Features
- Processing Speed: The model achieves an inverse real-time factor (RTFx) of approximately 231, meaning one second of compute processes about 231 seconds (nearly four minutes) of audio. An hour of audio can therefore be transcribed in roughly 16 seconds.
- Multilingual Support: It supports transcription for six languages: English, French, German, Spanish, Portuguese, and Japanese.
- Translation: It provides bidirectional speech translation between English and the other supported languages.
- Keyword Biasing: Users can pass a list of names, acronyms, or technical terms in the prompt to weight the model toward recognizing domain-specific content correctly.
- Architecture: The model uses a standard autoregressive architecture.
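The processing-speed figure above can be checked with a few lines of arithmetic. This is a minimal sketch that treats RTFx as seconds of audio processed per second of compute; the function name is illustrative and not part of any Granite tooling.

```python
def transcription_seconds(audio_seconds: float, rtfx: float) -> float:
    """Compute time needed to transcribe `audio_seconds` of audio at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


ONE_HOUR = 3600.0
BASE_RTFX = 231  # approximate figure quoted for the base 2B model

# Minutes of audio handled per second of compute.
print(round(BASE_RTFX / 60, 2))  # 3.85

# Seconds of compute needed for one hour of audio.
print(round(transcription_seconds(ONE_HOUR, BASE_RTFX), 1))  # 15.6
```

Real throughput depends on hardware, batch size, and audio length, so the result is best read as an order-of-magnitude estimate.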
Granite Speech 4.1 2B Plus: Diarization and Timestamps
The Plus variant is optimized for structured transcripts where knowing who spoke when is critical, such as in podcasts or meeting recordings.
Specialized Capabilities
- Speaker Attributed ASR (Diarization): The model provides speaker labels (e.g., "Speaker 1", "Speaker 2"), allowing users to attribute text to specific individuals.
- Word-Level Timestamps: Every word is tagged with an end time. The timestamp accuracy is reported to exceed that of many existing models, including specialized versions of Whisper.
- Incremental Decoding: The model supports passing previously transcribed text as a prefix. This is particularly useful for long-form audio that has been split into chunks, ensuring consistent speaker numbering and continuity across segments.
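The incremental decoding pattern can be sketched as a loop that feeds the transcript so far back in as the prefix for the next chunk. In this sketch, `transcribe_chunk` is a hypothetical callable standing in for a call to the Plus model; its signature is an assumption, not the model's actual API, and the way the prefix is really passed should be taken from the model card.

```python
from typing import Callable, Iterable, List

# Hypothetical signature: (audio_chunk, prefix_text) -> newly transcribed text.
TranscribeFn = Callable[[bytes, str], str]


def transcribe_long_audio(chunks: Iterable[bytes], transcribe_chunk: TranscribeFn) -> str:
    """Transcribe chunks in order, passing earlier output as the prefix each time."""
    pieces: List[str] = []
    for chunk in chunks:
        # Everything transcribed so far becomes the prefix, so the model can
        # keep speaker numbering consistent with earlier segments.
        prefix = " ".join(pieces)
        new_text = transcribe_chunk(chunk, prefix).strip()
        if new_text:
            pieces.append(new_text)
    return " ".join(pieces)


# Stand-in for the model so the loop can be run without any audio or GPU.
def fake_transcribe(chunk: bytes, prefix: str) -> str:
    return chunk.decode("utf-8")


print(transcribe_long_audio([b"Speaker 1: hello", b"Speaker 2: hi there"], fake_transcribe))
```

For very long recordings, the prefix would likely need to be truncated to fit the model's context window; that policy is left out here.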
Trade-offs
To enable these features, the Plus model makes several concessions:
- Language Support: Reduced to five languages (Japanese is dropped).
- Functionality: Translation capabilities are removed.
- Accuracy: The word error rate is slightly higher than the base 2B model.
Granite Speech 4.1 2B NAR: Extreme Throughput
Granite Speech 4.1 2B NAR is a non-autoregressive (NAR) model designed for maximum throughput, enabling the processing of massive volumes of audio in minimal time.
Non-Autoregressive LLM-based Editing (NLE)
Unlike standard autoregressive models that generate tokens sequentially, the NAR model uses a technique called Non-autoregressive LLM-based Editing (NLE). The process works in two steps:
- Drafting: A frozen, low-cost CTC encoder runs over the audio to produce a draft transcript.
- Editing: The model uses bidirectional attention to edit the draft by copying, inserting, deleting, or replacing text, which improves accuracy compared to traditional one-shot parallel prediction.
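To make the editing step concrete, the toy example below applies a list of copy, insert, delete, and replace operations to a draft token sequence. It only illustrates the vocabulary of edits; the encoding of the operations is an assumption, and the real model predicts its edits in parallel with a neural network rather than walking a hand-written list.

```python
from typing import List, Tuple

# (operation, payload). The payload is only used by "insert" and "replace".
Edit = Tuple[str, str]


def apply_edits(draft: List[str], edits: List[Edit]) -> List[str]:
    """Apply edit operations left to right over the draft tokens."""
    out: List[str] = []
    i = 0  # index of the next unread draft token
    for op, payload in edits:
        if op == "insert":
            out.append(payload)  # adds a token without consuming the draft
            continue
        if i >= len(draft):
            raise ValueError(f"'{op}' has no draft token left to consume")
        if op == "copy":
            out.append(draft[i])
        elif op == "replace":
            out.append(payload)
        elif op != "delete":
            raise ValueError(f"unknown operation: {op}")
        i += 1  # copy, replace, and delete each consume one draft token
    if i != len(draft):
        raise ValueError("edits must account for every draft token")
    return out


draft = ["the", "the", "granit", "model", "fast"]
edits = [
    ("copy", ""),
    ("delete", ""),          # drop the repeated "the"
    ("replace", "Granite"),  # fix the misrecognized word
    ("copy", ""),
    ("insert", "is"),        # add a word the draft missed
    ("copy", ""),
]
print(" ".join(apply_edits(draft, edits)))  # the Granite model is fast
```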
Performance and Trade-offs
- Throughput: On an H100 GPU with batched inference, the reported inverse real-time factor (RTFx) is 1,820, which means one hour of audio can be transcribed in approximately two seconds.
- Limitations: The NAR model does not support translation, keyword biasing, speaker attribution, or word-level timestamps.
Deployment and Implementation
All Granite Speech 4.1 models are small enough to be run on a variety of GPUs, though the NAR model typically requires Flash Attention for optimal performance. Implementation is handled via the Hugging Face Transformers library using AutoProcessor.
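A minimal loading sketch follows. The source only names AutoProcessor; pairing it with AutoModelForSpeechSeq2Seq is an assumption based on how earlier Granite Speech checkpoints were loaded, and it may not hold for every 4.1 variant (the NAR model in particular). The model ID is left as a parameter because the exact repository names are not given here and should be taken from the model cards.

```python
def load_granite_speech(model_id: str):
    """Load a processor and model from the Hugging Face Hub.

    Assumption: the checkpoint is compatible with AutoModelForSpeechSeq2Seq.
    Check the model card for the class and prompt format it actually expects.
    """
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
    return processor, model


# Usage (downloads weights, so it is left commented out):
# processor, model = load_granite_speech("<model id from the model card>")
```

Preparing inputs and generating text depend on each variant's prompt format, so those steps are best copied from the corresponding model card rather than from this sketch.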
Fine-Tuning and Customization
IBM provides notebooks for fine-tuning, allowing users to adapt the model to specific voices, accents, or highly specialized domains (such as court transcripts) by using existing transcripts as training data.
Sources
- Granite 4.1 - The Fastest ASR?