hallucination-leaderboard: a public leaderboard tracking LLM hallucination rates in summarization tasks

What it solves

This project provides a public leaderboard that tracks and compares the hallucination rates of various Large Language Models (LLMs). It specifically addresses the problem of factual inconsistency in summarization, helping users identify which models are most likely to introduce false information when summarizing a document.

How it works

The leaderboard uses Vectara's Hallucination Evaluation Model (HHEM), a specialized model trained to detect hallucinations. The process involves:

  1. Summarization Task: A curated dataset of over 7,700 articles across various domains (news, science, medicine, etc.) is fed to LLMs with a strict prompt requiring them to summarize the text using only the provided information.
  2. Evaluation: HHEM evaluates the summaries produced by the LLMs to compute a "factual consistency rate" (the percentage of summaries without hallucinations) and a "hallucination rate" (100 minus the consistency rate).
  3. Metrics: The leaderboard tracks the hallucination rate, factual consistency rate, answer rate (how often the model responded), and average summary length.
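
The four metrics above can be illustrated with a minimal Python sketch. It is not the leaderboard's actual code. The `SummaryResult` type, the `leaderboard_metrics` function, and the 0.5 decision threshold are all hypothetical. The sketch also assumes that each summary has already been given a consistency score between 0 and 1 by an evaluator such as HHEM, and that the consistency rate is computed only over summaries the model actually produced.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SummaryResult:
    """One model output for one source article (hypothetical structure)."""
    summary: Optional[str]              # None when the model declined to answer
    consistency_score: Optional[float]  # assumed evaluator score in [0, 1]


def leaderboard_metrics(results: Sequence[SummaryResult],
                        threshold: float = 0.5) -> dict:
    """Compute the four metrics from per-summary scores.

    The threshold is an assumption: a summary counts as factually
    consistent when its score is at or above it.
    """
    total = len(results)
    answered = [r for r in results if r.summary is not None]
    if total == 0 or not answered:
        raise ValueError("need at least one answered summary")

    consistent = sum(1 for r in answered if r.consistency_score >= threshold)
    consistency_rate = 100.0 * consistent / len(answered)

    return {
        "factual_consistency_rate": consistency_rate,
        # Defined in the text as 100 minus the consistency rate.
        "hallucination_rate": 100.0 - consistency_rate,
        "answer_rate": 100.0 * len(answered) / total,
        # Length measured in whitespace-separated words.
        "average_summary_length": sum(len(r.summary.split()) for r in answered)
        / len(answered),
    }


example = [
    SummaryResult("a b c", 0.9),
    SummaryResult("a b", 0.2),
    SummaryResult(None, None),      # a refusal lowers the answer rate only
    SummaryResult("a b c d", 0.7),
]
metrics = leaderboard_metrics(example)
```

Under these assumptions a refusal leaves the hallucination rate unchanged, which is why the answer rate has to be reported alongside it.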

Who it’s for

  • AI Researchers and Developers: Those looking to benchmark the factual consistency of different LLMs.
  • RAG and Agentic System Builders: Since these systems often use LLMs as summarizers of search results, this leaderboard serves as a proxy for how accurate the models are when used in such pipelines.

Highlights

  • Specialized Evaluation Model: Uses HHEM-2.3 (commercial) and provides an open-source variant (HHEM-2.1-Open).
  • Curated Dataset: Employs a private dataset of 7,700+ articles of varying complexity and length (50 to 24K words) to prevent overfitting.
  • Detailed Metrics: Beyond hallucination rates, it tracks answer rates to ensure models aren't gaming the metric by refusing to answer.
  • Broad Model Coverage: Evaluates a wide range of models from providers like OpenAI, Google, Anthropic, and Meta.
