torchmetrics: a scalable PyTorch metrics library for distributed training and evaluation
What it solves
TorchMetrics provides a standardized way to calculate machine learning metrics for PyTorch applications. It eliminates the boilerplate code typically required to accumulate and synchronize metrics across multiple batches and distributed devices (such as multiple GPUs or nodes), ensuring results are reproducible and scalable.
How it works
The library offers two primary ways to compute metrics:
- Module-based metrics: These act like PyTorch modules, maintaining an internal state to automatically track and accumulate data across batches. They handle synchronization across multiple devices automatically, making them compatible with CPU, single GPU, or multi-GPU setups.
- Functional metrics: These are simple Python functions that take tensors as input and return a metric value immediately, without maintaining state.
Users can also create custom metrics by subclassing torchmetrics.Metric and implementing its update and compute methods, which define how the metric accumulates state across batches and how it produces the final result.
Who it’s for
TorchMetrics is designed for PyTorch developers and machine learning engineers who need to track model performance across diverse domains (audio, text, image, etc.), including those running distributed training at scale.
Highlights
- Extensive Library: Includes over 100 built-in metrics covering classification, regression, segmentation, audio, text, and multimodal data.
- Distributed Support: Built-in automatic synchronization and accumulation for multi-device training.
- Customizable: Custom metrics are implemented by subclassing the torchmetrics.Metric base class.
- Visualization: Integrated plotting support to visualize metric progress over time.
Sources
- Lightning-AI/torchmetrics