horovod: a distributed deep learning training framework for scaling models across multiple GPUs and hosts
What it solves
Horovod simplifies scaling deep learning training from a single GPU to multiple GPUs and multiple hosts. It avoids much of the setup that distributed training usually involves, such as managing parameter servers, so that a single-GPU training script can run in parallel across a cluster with few changes.
How it works
Horovod follows the Message Passing Interface (MPI) model for communication between workers. Instead of routing updates through a central server, it uses collective communication operations such as allreduce to average gradients across all workers. It integrates with popular frameworks including TensorFlow, Keras, PyTorch, and Apache MXNet. To use it, users initialize Horovod, pin each process to a specific GPU, scale the learning rate by the number of workers, and wrap their optimizer in a DistributedOptimizer.
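The four steps above can be sketched for PyTorch roughly as follows. This is a minimal sketch, not a complete training script: the helper names scale_learning_rate and build_distributed_optimizer are hypothetical, the linear learning-rate scaling is an assumed heuristic, and the final broadcast of initial state is an extra step commonly used with Horovod that the description above does not list.

```python
def scale_learning_rate(base_lr, world_size):
    """Assumed heuristic: scale the single-worker rate linearly with worker count."""
    return base_lr * world_size


def build_distributed_optimizer(model, base_lr=0.01):
    """Hypothetical helper: returns a Horovod-wrapped SGD optimizer for `model`."""
    # Imported lazily so scale_learning_rate can be used without Horovod installed.
    import torch
    import horovod.torch as hvd

    # 1. Initialize Horovod.
    hvd.init()

    # 2. Pin this process to one GPU, chosen by its rank on the local host.
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())
        model.cuda()

    # 3. Scale the learning rate by the number of workers.
    lr = scale_learning_rate(base_lr, hvd.size())
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    # 4. Wrap the optimizer so gradients are averaged with allreduce.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )

    # Extra step (not listed above): start every worker from rank 0's state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
    return optimizer
```

A script that calls build_distributed_optimizer would typically be started once per GPU, for example with horovodrun -np 4 python train.py, where train.py is a placeholder name.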
Who it’s for
It is designed for machine learning engineers and infrastructure teams who need to train large models on large datasets across multiple GPUs or servers while keeping the number of code changes required to scale to a minimum.
Highlights
- Multi-Framework Support: Works with TensorFlow, PyTorch, Keras, and MXNet.
- Tensor Fusion: Improves performance by batching small allreduce operations and interleaving communication with computation.
- High Scaling Efficiency: Published benchmarks report about 90% scaling efficiency for Inception V3 and ResNet-101, and 68% for VGG-16, when scaling to 512 GPUs.
- Autotuning: Includes a system to automatically optimize performance settings to reduce trial-and-error tuning.
- Flexible Deployment: Can be run via horovodrun, Docker, Kubernetes, Spark, Ray, and various HPC clusters.
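The idea behind Tensor Fusion can be illustrated with a toy simulation in plain Python. This is only a sketch of the batching concept, not Horovod's implementation: the functions allreduce_average and fused_allreduce are hypothetical, workers are modeled as lists in one process, and no real network communication or overlap with computation takes place.

```python
def allreduce_average(per_worker_buffers):
    """Simulate an averaging allreduce: every worker gets the element-wise mean."""
    n_workers = len(per_worker_buffers)
    mean = [sum(vals) / n_workers for vals in zip(*per_worker_buffers)]
    return [list(mean) for _ in range(n_workers)]


def fused_allreduce(per_worker_grads):
    """per_worker_grads holds, for each worker, a list of small flat tensors."""
    # Record each tensor's length so the fused buffer can be split again.
    sizes = [len(t) for t in per_worker_grads[0]]
    # Pack: concatenate each worker's small tensors into one flat buffer.
    fused = [[x for t in grads for x in t] for grads in per_worker_grads]
    # One collective call over the fused buffer instead of one per tensor.
    reduced = allreduce_average(fused)
    # Unpack: slice each averaged buffer back into the original tensor sizes.
    out = []
    for buf in reduced:
        tensors, start = [], 0
        for size in sizes:
            tensors.append(buf[start:start + size])
            start += size
        out.append(tensors)
    return out


grads = [
    [[1.0, 2.0], [3.0]],  # worker 0: two small gradient tensors
    [[3.0, 4.0], [5.0]],  # worker 1: same shapes, different values
]
result = fused_allreduce(grads)
print(result[0])  # [[2.0, 3.0], [4.0]]
```

Both workers end with the same averaged gradients, and the two small tensors cost a single collective call rather than two, which is the saving that fusion is meant to capture.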
Sources
- horovod/horovod