VLMEvalKit: a unified evaluation toolkit for large vision-language models with support for 70+ benchmarks
What it solves
VLMEvalKit is a unified, open-source toolkit for evaluating Large Vision-Language Models (LVLMs). Researchers and developers no longer need to prepare data by hand across separate benchmark repositories; a single command runs evaluation across multiple benchmarks.
How it works
The toolkit adopts a generation-based evaluation approach for all models. It supports both exact matching and LLM-based answer extraction to determine accuracy. For developers adding new models, the process is simplified by requiring only the implementation of a single generate_inner() function, while the toolkit handles data downloading, preprocessing, inference, and metric calculation.
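The flow described above can be sketched in a few lines of plain Python. This is an illustration only, not VLMEvalKit's code: EchoModel, extract_choice, and accuracy are hypothetical names, the message layout (a list of dicts with "type" and "value" keys) is an assumption, and the regex rule stands in for the exact-matching step. A result of None marks the case where a toolkit would fall back to LLM-based answer extraction.

```python
import re


class EchoModel:
    """Hypothetical model wrapper; the only method it implements is generate_inner()."""

    def generate_inner(self, message, dataset=None):
        # Assumed message layout: a list of {"type": ..., "value": ...} dicts.
        text = " ".join(m["value"] for m in message if m["type"] == "text")
        # Canned replies stand in for real model generation.
        return "The answer is B." if "capital" in text else "A"


def extract_choice(prediction, choices="ABCD"):
    """Return the option letter if exactly one appears as a standalone token, else None."""
    tokens = re.findall(r"\b([A-Z])\b", prediction)
    found = {t for t in tokens if t in choices}
    # None signals that exact matching failed and another extractor is needed.
    return found.pop() if len(found) == 1 else None


def accuracy(model, samples):
    """Generate an answer per sample, extract the choice, and compare to the label."""
    correct = 0
    for sample in samples:
        message = [
            {"type": "image", "value": sample["image"]},
            {"type": "text", "value": sample["question"]},
        ]
        prediction = model.generate_inner(message)
        if extract_choice(prediction) == sample["answer"]:
            correct += 1
    return correct / len(samples)


# Made-up samples for illustration; the image paths are placeholders and are never opened.
samples = [
    {"image": "img0.jpg", "question": "Which capital is shown? A. Rome B. Paris", "answer": "B"},
    {"image": "img1.jpg", "question": "Which animal is this? A. Cat B. Dog", "answer": "B"},
]
score = accuracy(EchoModel(), samples)
```

The point of the design is the division of labor: the model side only turns a message into text, while data handling and scoring stay outside the model class.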
Who it’s for
It is designed for AI researchers and VLM developers who need to evaluate their models on standard benchmarks in a reproducible way.
Highlights
- Extensive Support: Supports over 200 Large Multimodal Models (LMMs) and 70+ image and video benchmarks.
- Distributed Inference: Integrates with LMDeploy and vLLM to support multi-node distributed inference, speeding up evaluation of large-scale or thinking models.
- Thinking Mode Support: Includes specialized handling for models with thinking mode (parsing content within <think> tags).
- Flexible Output: Supports saving prediction files in TSV format to prevent data truncation for models generating very long responses.
- Broad Compatibility: Provides detailed version recommendations for transformers, torchvision, and flash-attn to ensure model compatibility.
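To make the thinking-mode and TSV highlights concrete, here is a minimal sketch using only the Python standard library. It is not the toolkit's implementation: split_thinking and write_tsv are hypothetical helpers, and the two-column layout (index, prediction) is an assumption chosen for illustration.

```python
import csv
import io
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(response):
    """Split a raw response into (reasoning, answer); reasoning is empty if no tags exist."""
    match = THINK_RE.search(response)
    if match is None:
        return "", response.strip()
    reasoning = match.group(1).strip()
    # The answer is whatever remains once the thinking block is cut out.
    answer = (response[:match.start()] + response[match.end():]).strip()
    return reasoning, answer


def write_tsv(rows, handle):
    """Write (index, prediction) pairs as tab-separated text with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["index", "prediction"])
    for index, prediction in rows:
        writer.writerow([index, prediction])


raw = "<think>2 + 2 is 4</think>\nThe answer is 4."
reasoning, answer = split_thinking(raw)

buffer = io.StringIO()
write_tsv([(0, answer)], buffer)
tsv_text = buffer.getvalue()
```

Separating the reasoning from the final answer before scoring keeps option letters or numbers mentioned during thinking from being mistaken for the prediction.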
Sources
- open-compass/VLMEvalKit