composer: a deep learning training library for scaling PyTorch workflows across GPU clusters

What it solves

Composer is a deep learning training library designed to simplify and accelerate training large-scale models on GPU clusters. It abstracts away the low-level details of distributed training, such as parallelism strategies and memory optimization, so that researchers and developers can focus on model architecture and experimentation.

How it works

Built on top of PyTorch, Composer uses a central Trainer abstraction that manages the training loop. This trainer integrates several key mechanisms:

  • Scalability Tools: It integrates PyTorch Fully Sharded Data Parallel (FSDP), which shards models too large for a single GPU, and Distributed Data Parallel (DDP) for data-parallel training. It also supports elastic sharded checkpointing, so training can resume on a different hardware configuration.
  • Customization via Callbacks: A callback system allows users to insert custom logic at specific events in the training loop (e.g., at the end of a batch) without modifying the core trainer.
  • Algorithmic Speedups: It provides a collection of pre-built "recipes" of algorithmic speedups to reduce training time and cost for specific model types like Stable Diffusion, BERT, and ResNet.
  • Workflow Automation: It includes features like auto-resumption from checkpoints and auto-microbatching to prevent CUDA out-of-memory (OOM) errors.
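To make the callback mechanism concrete, here is a minimal sketch in plain Python. It is not Composer's implementation or API: `ToyTrainer`, `Callback`, `LossLogger`, and the event names `batch_end` and `epoch_end` are hypothetical names chosen for illustration, and the loss is a stand-in value rather than the output of a real model.

```python
class Callback:
    """Base class: subclasses override only the events they care about."""

    def batch_end(self, state):
        pass

    def epoch_end(self, state):
        pass


class LossLogger(Callback):
    """Example callback that records the loss after every batch."""

    def __init__(self):
        self.history = []

    def batch_end(self, state):
        self.history.append((state["epoch"], state["batch"], state["loss"]))


class ToyTrainer:
    """Hypothetical trainer that owns the loop and fires events to callbacks."""

    def __init__(self, callbacks):
        self.callbacks = list(callbacks)

    def _fire(self, event, state):
        # Dispatch the named event to every registered callback.
        for callback in self.callbacks:
            getattr(callback, event)(state)

    def fit(self, batches_per_epoch, epochs):
        state = {"epoch": 0, "batch": 0, "loss": None}
        for epoch in range(epochs):
            state["epoch"] = epoch
            for batch in range(batches_per_epoch):
                state["batch"] = batch
                # Stand-in for a real forward/backward pass.
                step = epoch * batches_per_epoch + batch
                state["loss"] = 1.0 / (1 + step)
                self._fire("batch_end", state)
            self._fire("epoch_end", state)
        return state


logger = LossLogger()
ToyTrainer([logger]).fit(batches_per_epoch=3, epochs=2)
```

The point of the design is that `LossLogger` adds behavior at the end of each batch without any change to `ToyTrainer.fit`, which mirrors the idea of customizing the loop without modifying the core trainer.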

Who it’s for

Composer is intended for ML engineers and researchers who are comfortable with Python and PyTorch and who train neural networks of any size, including LLMs, diffusion models, embedding models, and CNNs, particularly at cluster scale.

Highlights

  • Cluster-Scale Training: Seamlessly scales from 1 to 512 GPUs.
  • Elastic Checkpointing: Save and resume training regardless of the number of GPUs used.
  • OOM Prevention: Automatically selects the largest microbatch size that fits in GPU memory.
  • Cloud Integration: First-class support for remote storage (S3, GCP, OCI) and popular experiment tracking tools (Weights and Biases, MLflow).
  • Data Streaming: Integrates with StreamingDataset for on-the-fly cloud blob storage downloads.
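The OOM-prevention highlight above can be illustrated with a simplified sketch. This is not Composer's actual algorithm: it assumes a halve-on-failure search, and `SimulatedOOM`, `make_step`, `find_microbatch_size`, and `train_batch` are hypothetical names. A simulated memory capacity replaces a real GPU, so the example runs anywhere.

```python
class SimulatedOOM(RuntimeError):
    """Stand-in for a CUDA out-of-memory error (hypothetical)."""


def make_step(capacity):
    """Build a fake training step that fails above `capacity` samples."""

    def step(microbatch_size):
        if microbatch_size > capacity:
            raise SimulatedOOM(f"{microbatch_size} samples exceed capacity")
        return microbatch_size  # number of samples processed

    return step


def find_microbatch_size(step, batch_size):
    """Halve the microbatch size until one step succeeds."""
    microbatch_size = batch_size
    while microbatch_size >= 1:
        try:
            step(microbatch_size)
            return microbatch_size
        except SimulatedOOM:
            microbatch_size //= 2
    raise RuntimeError("even a microbatch of 1 does not fit in memory")


def train_batch(step, batch_size, microbatch_size):
    """Process one full batch as a sequence of smaller microbatches."""
    processed = 0
    while processed < batch_size:
        chunk = min(microbatch_size, batch_size - processed)
        processed += step(chunk)
    return processed


step = make_step(capacity=20)
chosen = find_microbatch_size(step, batch_size=64)
total = train_batch(step, batch_size=64, microbatch_size=chosen)
```

Note that halving only tries sizes of the form batch size divided by a power of two, so it finds the largest such size that fits rather than the exact maximum. In a real training loop, gradients would be accumulated across the microbatches so that the optimizer still sees the full batch.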
