ao: a PyTorch-native architecture optimization library for training-to-serving model quantization and sparsity
What it solves
TorchAO is a PyTorch-native library for making AI models faster and more memory-efficient. It targets the usual trade-off between model size and accuracy: users can shrink the memory footprint of large models (such as LLMs and diffusion models) and speed up both training and inference without significant quality loss.
How it works
TorchAO implements several architecture optimization techniques:
- Quantization: It converts model weights and activations to lower-precision formats (such as int4, int8, and float8), reducing memory usage and increasing throughput.
- Quantization-Aware Training (QAT): To reduce the accuracy loss caused by quantization, it lets models be trained so that they adapt to the lower precision.
- Sparsity: It uses semi-structured 2:4 sparsity to remove redundant weights, further increasing speed.
- PyTorch-native integration: It works with torch.compile() and FSDP2 for high-performance execution across various hardware (CUDA, XPU, CPU, and ARM).
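To make the quantization and sparsity ideas above concrete, here is a minimal pure-Python sketch of symmetric int8 quantization and 2:4 magnitude pruning. It illustrates the concepts only and is not TorchAO's implementation or API; the function names (quantize_int8, dequantize_int8, prune_2_4) are hypothetical, and the per-tensor scale and magnitude-based selection are assumptions chosen for simplicity.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor int8 quantization (illustrative, not TorchAO's code).

    Each weight w is approximated as q * scale, with q an integer in [-127, 127].
    """
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from int8 values and the shared scale."""
    return [v * scale for v in q]


def prune_2_4(weights: List[float]) -> List[float]:
    """2:4 semi-structured sparsity: in every group of 4 consecutive weights,
    keep the 2 with the largest magnitude and zero the other 2.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned
```

Calling quantize_int8 on a list of floats returns the integer codes and the scale; dequantize_int8 reverses the mapping with a rounding error of at most half a scale step per weight. prune_2_4 always leaves exactly half of the entries in each group of four as zeros, which is the regular pattern that hardware-accelerated sparse kernels rely on.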
Who it’s for
This library is designed for AI researchers and engineers who need to deploy large-scale models on limited hardware, accelerate pre-training of massive models, or optimize models for edge devices via ExecuTorch.
Highlights
- Training Speedups: Pre-training Llama-3.1-70B up to 1.5x faster using float8 training.
- Inference Gains: Quantizing Llama-3-8B to int4 can result in 1.89x faster inference and 58% less memory usage.
- Broad Integration: Built-in support for Hugging Face Transformers, Diffusers, vLLM, and SGLang.
- Memory Efficiency: Includes quantized optimizers (AdamW 4/8-bit) and CPU offloading to reduce VRAM requirements by up to 60%.
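As a rough back-of-the-envelope check on why lower-precision weights save memory, the sketch below computes weight storage at different bit widths. The parameter count of 8 billion is an assumed round figure, and the helper names are hypothetical. The calculation counts weights only: it ignores quantization scales, activations, the KV cache, and any layers left unquantized, so it is not expected to reproduce measured end-to-end figures such as the 58% cited above.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> float:
    """Bytes needed to store the weights alone at a given bit width."""
    return num_params * bits_per_weight / 8


def weight_reduction(bits_before: int, bits_after: int) -> float:
    """Fraction of weight storage saved when moving to a narrower format."""
    return 1 - bits_after / bits_before


NUM_PARAMS = 8_000_000_000  # assumption: a nominal "8B" model

for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    gigabytes = weight_bytes(NUM_PARAMS, bits) / 1e9
    print(f"{name}: {gigabytes:.1f} GB of weights")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts weight storage by three quarters; real deployments save less overall because the other memory consumers listed above are unaffected.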
Sources
- pytorch/ao