TransformerEngine: a library for accelerating Transformer models on NVIDIA GPUs using low-precision numerical formats
What it solves
Transformer Engine (TE) addresses the high memory and compute demands of scaling Transformer models to hundreds of billions of parameters. It speeds up training and inference by running in low-precision numerical formats, which also lowers memory use, without sacrificing model accuracy.
How it works
TE provides a library of highly optimized building blocks and fused kernels for Transformer architectures. Its automatic mixed-precision API lets developers use low-precision formats, such as 8-bit floating point (FP8), MXFP8, and NVFP4, in their existing PyTorch or JAX workflows. The library manages the scaling factors required for low-precision training internally, so users do not have to track them by hand.
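To show why scaling factors matter, here is a minimal pure-Python sketch of per-tensor scaling into an FP8 E4M3 range (maximum finite value 448). It is not TE's implementation: the helper names (`round_to_e4m3`, `compute_scale`, `quantize`, `dequantize`) are hypothetical, and the rounding is a simplified simulation. It only illustrates how multiplying by a scale derived from the tensor's largest magnitude keeps small values from flushing to zero.

```python
import math

E4M3_MAX = 448.0       # largest finite magnitude in FP8 E4M3
E4M3_MIN_EXP = -6      # exponent of the smallest normal number
MANTISSA_BITS = 3      # explicit mantissa bits in E4M3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on a simulated E4M3 grid (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binary exponent is e - 1.
    _, e = math.frexp(mag)
    exp = max(e - 1, E4M3_MIN_EXP)          # clamp into the subnormal range
    step = 2.0 ** (exp - MANTISSA_BITS)     # spacing between representable values
    return sign * min(round(mag / step) * step, E4M3_MAX)


def compute_scale(values: list[float]) -> float:
    """Pick a scale that maps the largest magnitude onto E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return E4M3_MAX / amax if amax > 0 else 1.0


def quantize(values: list[float], scale: float) -> list[float]:
    """Scale, then round each value to the simulated FP8 grid."""
    return [round_to_e4m3(v * scale) for v in values]


def dequantize(quantized: list[float], scale: float) -> list[float]:
    """Undo the scaling to recover approximate original values."""
    return [q / scale for q in quantized]


# Small gradient-like values, well below the smallest E4M3 subnormal (2**-9).
grads = [1e-4, -3e-4, 2.5e-5]

unscaled = quantize(grads, scale=1.0)       # no scaling: values underflow
scale = compute_scale(grads)
recovered = dequantize(quantize(grads, scale), scale)

print("without scaling:", unscaled)
print("with scaling:   ", recovered)
```

Without a scale, every value in this example underflows to zero. With the scale, each value survives to within the format's roughly 6% relative rounding error. This bookkeeping is what the library takes off the user's hands.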
Who it’s for
It is designed for AI researchers and engineers building large-scale Transformer models (including LLMs, MoE architectures, and multimodal models) who are using NVIDIA GPUs (Ampere, Ada, Hopper, and Blackwell architectures).
Highlights
- Support for FP8 precision on Hopper, Ada, and Blackwell GPUs
- Support for MXFP8 and NVFP4 formats on Blackwell GPUs
- Optimized building blocks and fused kernels for Transformer models
- Integration with major frameworks like PyTorch and JAX, and LLM libraries such as DeepSpeed, Hugging Face Accelerate, and Megatron-LM
- Support for optimizations across FP16 and BF16 on Ampere GPUs and newer
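The format support listed above can be written down as a small lookup table. This is only a sketch that transcribes the highlights. The `SUPPORTED_FORMATS` table and the `supported_formats` helper are hypothetical names for illustration, not part of the TE API.

```python
# Numerical formats per GPU architecture, transcribed from the highlights above.
# FP16/BF16 apply to Ampere and newer; FP8 to Ada, Hopper, and Blackwell;
# MXFP8 and NVFP4 to Blackwell only.
SUPPORTED_FORMATS = {
    "Ampere": ["FP16", "BF16"],
    "Ada": ["FP16", "BF16", "FP8"],
    "Hopper": ["FP16", "BF16", "FP8"],
    "Blackwell": ["FP16", "BF16", "FP8", "MXFP8", "NVFP4"],
}


def supported_formats(architecture: str) -> list[str]:
    """Return the formats listed for an architecture (hypothetical helper)."""
    try:
        return list(SUPPORTED_FORMATS[architecture])
    except KeyError:
        raise ValueError(f"unknown architecture: {architecture!r}") from None


for arch in SUPPORTED_FORMATS:
    print(f"{arch}: {', '.join(supported_formats(arch))}")
```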
Sources
- NVIDIA/TransformerEngine