TransformerEngine: a library for accelerating Transformer models on NVIDIA GPUs using low-precision numerical formats

What it solves

Transformer Engine (TE) addresses the memory and compute cost of scaling Transformer models to hundreds of billions of parameters. It speeds up training and inference by running supported operations in low-precision numerical formats, which lowers memory use while aiming to preserve model accuracy.

How it works

TE provides a library of highly optimized building blocks and fused kernels for Transformer architectures. Its automatic mixed-precision API lets developers use low-precision formats, such as 8-bit floating point (FP8), MXFP8, and NVFP4, in their existing PyTorch or JAX workflows. The library internally manages the scaling factors that low-precision training requires, so users do not have to track them by hand.
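To illustrate what managing scaling factors involves, here is a minimal pure-Python sketch of per-tensor scaling driven by a short history of absolute maxima (amax). It is not TE's implementation and uses none of its API: the names round_to_e4m3, compute_scale, and DelayedScaler are hypothetical, the E4M3 rounding is simplified, and a real library performs this work in GPU kernels. The one format fact relied on is that the largest finite FP8 E4M3 value is 448.

```python
import math
from collections import deque

E4M3_MAX = 448.0       # largest finite value in FP8 E4M3
E4M3_MIN_NORMAL_EXP = -6  # exponent of the smallest normal E4M3 value
E4M3_MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on a simplified E4M3 grid, saturating at the max."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)  # saturate instead of overflowing
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so the leading bit sits at e - 1.
    _, e = math.frexp(mag)
    exp = max(e - 1, E4M3_MIN_NORMAL_EXP)  # clamp so tiny values use subnormal spacing
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)  # gap between neighbouring grid values
    return sign * min(round(mag / step) * step, E4M3_MAX)


def compute_scale(amax: float, fp8_max: float = E4M3_MAX, margin: int = 0) -> float:
    """Return a scale that maps amax near the top of the FP8 range."""
    if amax <= 0.0 or not math.isfinite(amax):
        return 1.0  # nothing observed yet, or unusable statistics
    return fp8_max / amax / (2.0 ** margin)


class DelayedScaler:
    """Hypothetical helper: derives the scale from amax values seen on earlier calls."""

    def __init__(self, history_len: int = 4):
        self.amax_history = deque(maxlen=history_len)

    def scale(self) -> float:
        # Use the largest recent amax; an empty history falls back to scale 1.0.
        return compute_scale(max(self.amax_history, default=0.0))

    def quantize(self, values):
        s = self.scale()  # based on previous calls only
        q = [round_to_e4m3(v * s) for v in values]
        # Record this tensor's amax for use by later calls.
        self.amax_history.append(max((abs(v) for v in values), default=0.0))
        return q, s


activations = [0.001, -0.02, 0.5, 3.0]
scaler = DelayedScaler()
scaler.quantize(activations)         # first call only records amax = 3.0
q, s = scaler.quantize(activations)  # second call scales by 448 / 3
restored = [v / s for v in q]        # dequantize to compare with the input
```

Deriving the scale from earlier steps rather than from the current tensor avoids an extra pass over the data before quantizing, at the cost of saturating values if the range grows suddenly between steps.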

Who it’s for

It is designed for AI researchers and engineers building large-scale Transformer models (including LLMs, MoE architectures, and multimodal models) who are using NVIDIA GPUs (Ampere, Ada, Hopper, and Blackwell architectures).

Highlights

  • Support for FP8 precision on Hopper, Ada, and Blackwell GPUs
  • Support for MXFP8 and NVFP4 formats on Blackwell GPUs
  • Optimized building blocks and fused kernels for Transformer models
  • Integration with major frameworks like PyTorch and JAX, and LLM libraries such as DeepSpeed, Hugging Face Accelerate, and Megatron-LM
  • Optimizations for FP16 and BF16 precision on Ampere and newer GPUs
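
The format support listed above can be written down as a small lookup table, for example to choose a precision per machine in a launch script. This is a sketch only: SUPPORTED_FORMATS and pick_format are hypothetical names, not part of TE, the table is transcribed from the highlights rather than queried from hardware, and the default preference order is an illustrative assumption.

```python
# Format support per GPU architecture, transcribed from the highlights above.
SUPPORTED_FORMATS = {
    "Ampere": {"FP16", "BF16"},
    "Ada": {"FP16", "BF16", "FP8"},
    "Hopper": {"FP16", "BF16", "FP8"},
    "Blackwell": {"FP16", "BF16", "FP8", "MXFP8", "NVFP4"},
}


def pick_format(arch, preferences=("NVFP4", "MXFP8", "FP8", "BF16")):
    """Return the first format in `preferences` that `arch` supports.

    The default order (lowest precision first) is an assumption for
    illustration; the right choice depends on the model's accuracy needs.
    """
    supported = SUPPORTED_FORMATS.get(arch)
    if supported is None:
        raise ValueError(f"unknown architecture: {arch!r}")
    for fmt in preferences:
        if fmt in supported:
            return fmt
    raise ValueError(f"none of {preferences!r} is supported on {arch}")


choice_per_arch = {arch: pick_format(arch) for arch in SUPPORTED_FORMATS}
```

Passing an explicit preferences tuple, such as ("FP8", "BF16"), keeps a run on a more conservative format even on hardware that supports lower precision.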
