CTranslate2: a high-performance inference engine for Transformer models with advanced quantization and hardware optimization

What it solves

CTranslate2 addresses slow, memory-intensive inference for Transformer models. Its custom runtime accelerates execution and reduces memory usage on both CPU and GPU, which makes it more efficient than general-purpose deep learning frameworks for this workload.

How it works

The library implements several performance optimizations: weight quantization (FP16, BF16, INT16, INT8, and INT4/AWQ), layer fusion, padding removal, batch reordering, and in-place operations. It supports encoder-decoder models (e.g., T5, Whisper), decoder-only models (e.g., Llama, Mistral, Gemma), and encoder-only models (e.g., BERT). Before inference, a model must be converted into CTranslate2's optimized format using one of the provided converters, which cover frameworks such as Hugging Face Transformers (PyTorch), Fairseq, and OpenNMT.
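The conversion step described above can be sketched in Python. This is a minimal sketch, not the project's official recipe: it assumes the `ctranslate2`, `transformers`, and `torch` packages are installed, the helper name `convert_transformers_model` is hypothetical, and the set of quantization names is an assumption based on the documented conversion options, so check it against the version you have installed.

```python
# Quantization names assumed to be accepted at conversion time.
# This list is an assumption; verify it against your CTranslate2 version.
ASSUMED_QUANTIZATION_TYPES = {
    "int8", "int8_float32", "int8_float16", "int8_bfloat16",
    "int16", "float16", "bfloat16", "float32",
}


def convert_transformers_model(model_name, output_dir, quantization="int8"):
    """Convert a Hugging Face Transformers checkpoint to the CTranslate2 format.

    Hypothetical helper: validates the quantization name, then delegates to
    the library's TransformersConverter.
    """
    if quantization not in ASSUMED_QUANTIZATION_TYPES:
        raise ValueError(f"unexpected quantization type: {quantization!r}")

    # Imported lazily so this module loads even without ctranslate2 installed.
    from ctranslate2.converters import TransformersConverter

    converter = TransformersConverter(model_name)
    # force=True overwrites output_dir if it already exists.
    # The weights are quantized once, at conversion time.
    return converter.convert(output_dir, quantization=quantization, force=True)
```

Calling `convert_transformers_model("gpt2", "gpt2-ct2")` would download the checkpoint and write an INT8 model directory. The same step is also available from the command line through `ct2-transformers-converter` with the `--model`, `--output_dir`, and `--quantization` options.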

Who it’s for

It is intended for developers and production teams who need to deploy Transformer models with high throughput and a low memory footprint on x86-64 and ARM64 CPUs and on GPUs.

Highlights

  • Broad Model Support: Covers encoder-decoder, decoder-only, and encoder-only Transformer architectures, including Llama, Mistral, and BERT.
  • Hardware Optimization: Optimized for multiple CPU backends (Intel MKL, oneDNN, OpenBLAS, Ruy, Apple Accelerate) with automatic CPU detection.
  • Quantization: Reduces model size on disk and memory usage with minimal accuracy loss.
  • Parallel Execution: Supports parallel and asynchronous execution across multiple GPUs or CPU cores, including tensor parallelism for very large models.
  • Simple Integration: Provides Python and C++ APIs with few dependencies.
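To make the quantization, parallel execution, and Python API highlights concrete, here is a hedged sketch of loading a converted decoder-only model and generating from it. The helper names `pick_compute_type`, `load_generator`, and `generate_async`, the preference order, and the thread counts are illustrative assumptions. The prompts are expected to be already tokenized with the model's own tokenizer, since CTranslate2 works on tokens rather than raw text.

```python
def pick_compute_type(supported,
                      preferred=("int8_float16", "int8", "float16", "float32")):
    """Return the first preferred compute type that the device supports.

    The preference order is an illustrative assumption, not a library default.
    """
    for compute_type in preferred:
        if compute_type in supported:
            return compute_type
    # "default" keeps whatever type the model was converted with.
    return "default"


def load_generator(model_dir, device="cpu"):
    """Load a converted model with a compute type suited to the device."""
    # Imported lazily so this module loads even without ctranslate2 installed.
    import ctranslate2

    supported = ctranslate2.get_supported_compute_types(device)
    return ctranslate2.Generator(
        model_dir,
        device=device,
        compute_type=pick_compute_type(supported),
        inter_threads=2,  # batches processed in parallel (example value)
        intra_threads=4,  # CPU threads per batch (example value)
    )


def generate_async(generator, tokenized_prompts, max_length=64):
    """Submit prompts without blocking, then collect the generated tokens."""
    pending = generator.generate_batch(
        tokenized_prompts,
        max_length=max_length,
        sampling_topk=1,    # greedy decoding
        asynchronous=True,  # returns handles instead of final results
    )
    # .result() blocks until the corresponding generation has finished.
    return [handle.result().sequences[0] for handle in pending]
```

Keeping the compute type selection in a separate pure function makes it possible to test the fallback logic without a model or a GPU.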