horovod: a distributed deep learning training framework for scaling models across multiple GPUs and hosts

What it solves

Horovod makes it simpler to scale deep learning training from a single GPU to many GPUs across multiple hosts. It removes much of the complexity usually associated with distributed training, such as setting up and managing parameter servers, by providing a more direct way to run a model in parallel across a cluster.

How it works

Horovod follows the Message Passing Interface (MPI) model for communication between workers. Instead of routing updates through a central server, it uses collective communication operations such as allreduce to average gradients across all workers. It integrates with TensorFlow, Keras, PyTorch, and Apache MXNet. To adopt it, users initialize Horovod, pin each GPU to a single process, scale the learning rate by the number of workers, and wrap their optimizer in a DistributedOptimizer.
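The gradient averaging described above can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Horovod's API: the function names (allreduce_average, scaled_learning_rate, sgd_step) are hypothetical, the workers are modeled as lists in one process, and the learning rate is multiplied by the worker count as an assumed convention for the "scale the learning rate" step.

```python
from typing import List


def allreduce_average(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate an averaging allreduce (hypothetical helper, not Horovod's API).

    Each inner list is one worker's gradient. After the call every worker
    holds the same element-wise mean, with no central parameter server.
    """
    num_workers = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must contribute gradients of equal length")
    # Reduce: element-wise sum over workers, then divide by the worker count.
    averaged = [sum(g[i] for g in worker_grads) / num_workers for i in range(length)]
    # "All" part of allreduce: every worker receives its own copy of the result.
    return [list(averaged) for _ in range(num_workers)]


def scaled_learning_rate(base_lr: float, num_workers: int) -> float:
    """Assumed convention: scale the single-worker learning rate by worker count."""
    return base_lr * num_workers


def sgd_step(weights: List[float], grads: List[float], lr: float) -> List[float]:
    """Plain SGD update applied identically on every worker."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Four simulated workers, each with a gradient computed on its own data shard.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = allreduce_average(grads)
lr = scaled_learning_rate(0.5, num_workers=len(grads))
new_weights = [sgd_step([1.0, 1.0], g, lr) for g in synced]
```

Because every worker applies the same averaged gradient with the same learning rate, the model replicas stay identical after each step, which is what lets the design do without a central server.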

Who it’s for

It is designed for machine learning engineers and infrastructure teams who need to train large models on large datasets across multiple GPUs or servers with minimal changes to their existing training code.

Highlights

Multi-Framework Support: Works with TensorFlow, PyTorch, Keras, and MXNet.
Tensor Fusion: Improves performance by batching small allreduce operations and interleaving communication with computation.
High Scaling Efficiency: Demonstrates high efficiency (up to 90% for certain models) when scaling across hundreds of GPUs.
Autotuning: Includes a system to automatically optimize performance settings to reduce trial-and-error tuning.
Flexible Deployment: Can be run via horovodrun, Docker, Kubernetes, Spark, Ray, and various HPC clusters.
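
The batching idea behind the Tensor Fusion highlight can be sketched as a greedy packing step. This is an illustrative simplification, not Horovod's implementation: the function name fuse_tensors, the tensor names, and the threshold value are hypothetical, and sizes are counted in elements rather than bytes.

```python
from typing import Dict, List


def fuse_tensors(tensor_sizes: Dict[str, int], threshold: int) -> List[List[str]]:
    """Greedily group tensors, in arrival order, into fused batches.

    Each batch would be sent with a single allreduce call. A batch is closed
    when adding the next tensor would exceed `threshold`; a tensor larger
    than the threshold ends up in a batch by itself.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in tensor_sizes.items():
        if current and current_size + size > threshold:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical gradient tensors and their sizes (in elements).
sizes = {"bias1": 10, "bias2": 20, "bias3": 30, "weights": 100, "bias4": 5}
batches = fuse_tensors(sizes, threshold=64)
num_allreduce_calls = len(batches)  # fewer calls than one per tensor
```

In this example five tensors are sent in three fused batches, so the fixed per-call communication overhead is paid three times instead of five.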
