VLMEvalKit: a unified evaluation toolkit for large vision-language models with support for 70+ benchmarks

What it solves

VLMEvalKit is a unified, open-source toolkit for evaluating Large Vision-Language Models (LVLMs). Researchers and developers no longer need to prepare data by hand from many separate benchmark repositories: a single command runs the evaluation on the selected benchmarks.

How it works

The toolkit uses generation-based evaluation for all models: each model produces a free-form response, and the answer is then extracted from that response, either by exact matching or by an LLM-based extractor, before accuracy is computed. Adding a new model requires implementing only a single generate_inner() function; the toolkit handles data downloading, preprocessing, inference, and metric calculation.
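
The flow described above can be illustrated with a small, self-contained sketch. It is not VLMEvalKit's actual code: ToyBaseModel, EchoModel, extract_choice, accuracy, and the message layout (a list of dicts with "type" and "value" keys) are hypothetical names and assumptions chosen for illustration. Only the idea of a single generate_inner() hook and an exact-matching step comes from the description.

```python
import re
from abc import ABC, abstractmethod


class ToyBaseModel(ABC):
    """Hypothetical stand-in for a toolkit base class (not the real one)."""

    def generate(self, message):
        # Shared handling lives in the base class; here it only validates input.
        if not message:
            raise ValueError("message must contain at least one item")
        return self.generate_inner(message).strip()

    @abstractmethod
    def generate_inner(self, message):
        """The single method a new model has to implement."""


class EchoModel(ToyBaseModel):
    """Dummy model that always answers with option B."""

    def generate_inner(self, message):
        return "The answer is (B). "


def extract_choice(response, choices="ABCD"):
    """Exact-matching step: return an option letter only if exactly one appears."""
    found = set(re.findall(rf"\b([{choices}])\b", response))
    # Ambiguous or missing letters return None; a real pipeline could then
    # fall back to an LLM-based extractor.
    return found.pop() if len(found) == 1 else None


def accuracy(model, samples):
    """Fraction of samples whose extracted choice equals the reference answer."""
    hits = 0
    for sample in samples:
        pred = extract_choice(model.generate(sample["message"]))
        hits += pred == sample["answer"]
    return hits / len(samples)
```

A new model in this sketch only overrides generate_inner(); input checks, answer extraction, and scoring stay in shared code, which is the division of labour the paragraph above describes.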

Who it’s for

It is designed for AI researchers and VLM developers who need to evaluate their models on standard benchmarks in a reproducible way.

Highlights

  • Extensive Support: Supports over 200 Large Multimodal Models (LMMs) and 70+ image and video benchmarks.
  • Distributed Inference: Integrates with LMDeploy and vLLM to support multi-node distributed inference for faster evaluation of large-scale or thinking models.
  • Thinking Mode Support: Includes specialized handling for models with thinking mode (parsing content within <think> tags).
  • Flexible Output: Can save prediction files in TSV format, which avoids truncating very long model responses.
  • Broad Compatibility: Provides detailed version recommendations for transformers, torchvision, and flash-attn to ensure model compatibility.
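
As an illustration of the thinking-mode handling mentioned in the highlights, the following minimal sketch separates the reasoning inside <think> tags from the final answer. The function name split_thinking and the exact behaviour (joining multiple blocks, stripping whitespace) are assumptions for illustration, not the toolkit's implementation.

```python
import re

# Non-greedy match so that several <think> blocks are captured separately;
# DOTALL lets the reasoning span multiple lines.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(response):
    """Return (reasoning, final_answer) from a raw model response."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(response))
    answer = THINK_BLOCK.sub("", response).strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>2 + 2 is 4</think>\nThe answer is 4.")
```

Only the text outside the tags would then be passed on to answer extraction, so the reasoning trace does not interfere with matching.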
