Stanford CME296 Lecture 7: Evaluation of Text-to-Image Generation Models

Evaluating the output of text-to-image generation models is a critical step in the development lifecycle because improvement requires a reliable way to quantify quality. Evaluation typically splits into two primary dimensions: aesthetics (whether the image is physically plausible and visually pleasing) and prompt adherence (whether the image accurately reflects the objects, styles, and locations specified in the input text).

Human-Based Evaluation

Human ratings provide the most nuanced feedback but are subject to high noise and cost. The lecture identifies three primary human evaluation setups:

  • Absolute Scale (1-5): Users rate images on a scale. This is nuanced but noisy, as different humans interpret the scale differently.
  • Binary Pass Rate: Users decide if an image is "good" or "bad." This is easier for humans but lacks a reference point for absolute quality.
  • Pairwise Comparison: Users compare two images and choose the better one. This is the least noisy method because relative comparison is more intuitive than absolute grading.

The Elo Rating System

Comparing every model against every other model on a leaderboard is expensive in both compute and human effort, so leaderboards use the Elo rating system instead. Rather than relying on a simple win rate, Elo adjusts a model's rating based on the strength of its opponent. If a model wins against a strong opponent, its rating increases significantly; winning against a weak opponent yields minimal gains. This allows for a dynamic leaderboard where new models can be integrated without re-evaluating the entire set.
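The update rule described above can be sketched in a few lines. This is a minimal illustration of the standard Elo formula, not the lecture's own code: the K-factor of 32, the starting rating of 1000, and the model names are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (logistic curve, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise comparison.

    k is the K-factor (assumed value); it caps how far one result moves a rating.
    """
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical leaderboard: a new model enters at the default rating.
ratings = {"strong_model": 1200.0, "weak_model": 800.0, "new_model": 1000.0}

# Each vote is (model_a, model_b, a_won), e.g. from a human pairwise comparison.
votes = [
    ("new_model", "weak_model", True),
    ("new_model", "strong_model", True),
]

for model_a, model_b, a_won in votes:
    ratings[model_a], ratings[model_b] = elo_update(
        ratings[model_a], ratings[model_b], a_won
    )

for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

Because the gain is proportional to one minus the expected score, the win over `strong_model` moves `new_model` up far more than the win over `weak_model`, and only the two models involved in a vote are ever updated.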

Reference-Free Metrics

Reference-free metrics evaluate generated images without comparing them to a single "ground truth" image, as multiple valid images can exist for one prompt.

Fréchet Inception Distance (FID)

FID is the industry standard for quantifying aesthetics and diversity. It compares the distribution of generated images to the distribution of real images in a feature space, specifically the activations of a pre-trained Inception network used as an encoder.

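  • Mechanism: It fits a Gaussian, characterized by its mean ($\mu$) and covariance ($\Sigma$), to each set of features and calculates the Fréchet distance between the two Gaussians. For Gaussians this coincides with the 2-Wasserstein distance, and FID reports its square.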
  • Interpretation: A lower FID score indicates that the generated distribution is closer to the real distribution. Differences in means suggest style/quality gaps, while differences in covariance suggest a lack of diversity (mode collapse).
  • Limitations: FID assumes distributions are Gaussian, which is rarely true in practice, and it can be a poor proxy for actual human-perceived quality.
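For reference, the quantity described in the bullets above has a closed form when both distributions are modeled as Gaussians. The subscripts $r$ (real) and $g$ (generated) are notation assumed here for illustration:

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

The first term grows with the gap between the means (the style/quality gap in the interpretation above), and the trace term grows when the covariances differ (for example, when the generated images lack diversity). Both terms vanish when the two Gaussians are identical, which is why lower is better.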

Prompt Adherence Metrics

  • CLIPScore: Uses the CLIP model to measure the cosine similarity between the embeddings of the input text and the generated image. It is effective for general semantic matching but struggles with subtle spatial or relational details.
  • PickScore: A CLIP-based model trained specifically on human preference data to provide a holistic score combining aesthetics and adherence.
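The cosine-similarity step behind CLIPScore can be sketched as follows. The embeddings here are hand-written toy vectors standing in for the outputs of CLIP's text and image encoders, which are not loaded in this example. The rescaling weight `w = 2.5` and the clipping at zero follow the commonly cited CLIPScore formulation; the lecture summary does not mention them, so treat them as an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(text_embedding, image_embedding, w=2.5):
    """CLIPScore-style value: rescaled cosine similarity, clipped at zero."""
    return w * max(cosine_similarity(text_embedding, image_embedding), 0.0)


# Toy embeddings (hypothetical values, not real CLIP outputs).
prompt_embedding = [0.9, 0.1, 0.3]
matching_image = [0.8, 0.2, 0.4]      # points in a similar direction
unrelated_image = [-0.5, 0.9, -0.2]   # points in a different direction

print("matching :", round(clip_score(prompt_embedding, matching_image), 3))
print("unrelated:", round(clip_score(prompt_embedding, unrelated_image), 3))
```

Because the whole prompt and the whole image are each collapsed into a single vector, two prompts that differ only in a spatial relation can land on nearly the same embedding, which is one way to see why the score struggles with relational details.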

Reference-Based Metrics

Reference-based metrics are used when a specific target image exists, such as in VAE reconstruction or image editing tasks.

  • MSE (Mean Squared Error): A pixel-wise distance. It is highly sensitive to slight shifts in alignment.
  • PSNR (Peak Signal-to-Noise Ratio): Normalizes MSE relative to the maximum possible pixel value and applies a logarithm, giving a score in decibels where higher means a closer reconstruction. The log scale is intended to align better with human perception of error.
  • SSIM (Structural Similarity Index): Moves beyond pixels to compare local patches based on luminance, contrast, and structure (using Pearson correlation). It is more robust than MSE but still sensitive to large shifts.
  • LPIPS (Learned Perceptual Image Patch Similarity): Passes images through a pre-trained encoder (like VGG or AlexNet) and computes a weighted distance between feature maps. This is designed to align closely with human perceptual judgment.
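The two pixel-level metrics in the list above are simple enough to write out directly. The sketch below is a minimal illustration on a tiny grayscale image stored as nested lists; the stripe pattern and the 8-bit maximum of 255 are assumptions chosen to make the alignment sensitivity of MSE visible.

```python
import math


def mse(img_a, img_b):
    """Mean squared error between two equally sized grayscale images (lists of rows)."""
    squared_diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(img_a, img_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(squared_diffs) / len(squared_diffs)


def psnr(img_a, img_b, max_value=255.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    error = mse(img_a, img_b)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


# A 4x4 image of alternating black and white vertical stripes.
reference = [[0, 255, 0, 255] for _ in range(4)]

# The same stripes shifted by one pixel (circular shift to the left).
shifted = [row[1:] + row[:1] for row in reference]

print("MSE  (identical):", mse(reference, reference))
print("MSE  (shifted)  :", mse(reference, shifted))
print("PSNR (shifted)  :", psnr(reference, shifted), "dB")
```

The shifted image contains exactly the same stripes, yet every pixel disagrees with the reference, so MSE reaches its maximum for this pattern. This is the alignment sensitivity that patch-based and feature-based metrics such as SSIM and LPIPS try to reduce.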

MLLM-as-a-Judge

Multi-modal Large Language Models (MLLMs) are increasingly used as judges because they can provide reasoning (rationales) rather than just a scalar score.

Evolution of MLLM Evaluation

  1. TIFA (Text-to-Image Faithfulness Evaluation): Decomposes a prompt into atomic yes/no questions (e.g., "Is there a teddy bear?"). An MLLM answers each, and the final score is the proportion of correct answers. This allows for precise debugging of where a model fails.
  2. VQA Score: Formulates the evaluation as a Visual Question Answering task (e.g., "Does this figure show [prompt]?"). The score is the probability the model assigns to the token "yes."
  3. VIEScore (Visual Instruction-guided Explainable Score): Uses a concept-centric approach where the judge is given a detailed rubric (e.g., guidelines for "perceptual quality") and asked to provide a rationale before a final score, often outputting in JSON for easy parsing.
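The first two scoring rules above reduce to short computations once the judge's outputs are available. The sketch below assumes the MLLM has already been queried: the question list, the answers, and the logit values are hypothetical. It also renormalizes over just the "yes" and "no" tokens, which is a simplification of reading the "yes" probability from the full vocabulary distribution.

```python
import math


def tifa_score(judge_answers, expected_answers):
    """Fraction of atomic questions where the judge's answer matches the expected one."""
    correct = sum(
        judge_answers[question] == expected
        for question, expected in expected_answers.items()
    )
    return correct / len(expected_answers)


def yes_probability(logit_yes, logit_no):
    """Softmax over the 'yes' and 'no' logits only (a simplifying assumption)."""
    shift = max(logit_yes, logit_no)  # subtract the max for numerical stability
    exp_yes = math.exp(logit_yes - shift)
    exp_no = math.exp(logit_no - shift)
    return exp_yes / (exp_yes + exp_no)


# Hypothetical decomposition of the prompt "a brown teddy bear on a skateboard".
expected_answers = {
    "Is there a teddy bear?": "yes",
    "Is the teddy bear brown?": "yes",
    "Is there a skateboard?": "yes",
    "Is the teddy bear on the skateboard?": "yes",
}

# Hypothetical judge output for one generated image.
judge_answers = {
    "Is there a teddy bear?": "yes",
    "Is the teddy bear brown?": "yes",
    "Is there a skateboard?": "yes",
    "Is the teddy bear on the skateboard?": "no",
}

print("TIFA-style score:", tifa_score(judge_answers, expected_answers))
print("VQA-style score :", round(yes_probability(2.0, 0.5), 3))
```

The per-question answers are what make the first approach useful for debugging: here the failing question isolates the spatial relation, while the objects and the color attribute pass.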

Best Practices for MLLM Judges

  • Chain-of-Thought: Require the model to output its rationale before the score to improve accuracy.
  • Determinism: Set the temperature to zero so that results are as consistent as possible across runs.
  • Bias Mitigation: In pairwise settings, swap the order of images to prevent position bias.
  • Alignment: Calibrate the MLLM judge by comparing its ratings against human-graded samples and tuning the rubrics accordingly.
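Several of these practices can be combined in a small pairwise-judging wrapper. The sketch below is illustrative only: the `judge` callable, its JSON reply format (rationale first, then a `winner` field), and the file names are hypothetical, and two stub judges stand in for a real MLLM call, which would also be made with temperature set to zero.

```python
import json


def parse_verdict(raw_response):
    """Parse a JSON reply of the assumed form {"rationale": ..., "winner": "first" | "second"}."""
    winner = json.loads(raw_response)["winner"]
    if winner not in ("first", "second"):
        raise ValueError(f"unexpected winner value: {winner!r}")
    return winner


def debiased_compare(judge, prompt, image_a, image_b):
    """Query the judge in both orders; return 'a', 'b', or 'tie' if the verdicts disagree."""
    verdict_ab = parse_verdict(judge(prompt, image_a, image_b))
    verdict_ba = parse_verdict(judge(prompt, image_b, image_a))
    winner_ab = "a" if verdict_ab == "first" else "b"
    winner_ba = "b" if verdict_ba == "first" else "a"
    return winner_ab if winner_ab == winner_ba else "tie"


def position_biased_judge(prompt, first_image, second_image):
    """Stub judge that always prefers whichever image is shown first."""
    return json.dumps({"rationale": "The first image looks better.", "winner": "first"})


def content_judge(prompt, first_image, second_image):
    """Stub judge that prefers the (hypothetical) sharp image regardless of position."""
    winner = "first" if first_image == "sharp.png" else "second"
    return json.dumps({"rationale": "One image is visibly sharper.", "winner": winner})


prompt = "a teddy bear on a skateboard"
print(debiased_compare(position_biased_judge, prompt, "sharp.png", "blurry.png"))
print(debiased_compare(content_judge, prompt, "sharp.png", "blurry.png"))
```

Treating disagreement between the two orderings as a tie is one possible design choice; it keeps a purely position-driven preference from being counted as a win for either image.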

Technical Benchmarks

Several benchmarks target specific failure modes of image generation:

  • GenEval: Tests object counting, color attribution, and relative positioning using object detection models as judges.
  • DPG Bench: Uses a logical graph to evaluate dense prompts, checking prerequisites (e.g., if an object exists) before checking its attributes.
  • Long Text Bench: Specifically evaluates OCR capabilities, meaning the ability of a model to render readable, accurate text within an image.
  • Grounded Edits Bench: Evaluates image editing tasks based on perceptual quality and semantic consistency.
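The prerequisite idea described for DPG Bench can be sketched as a small dependency-aware scorer. This is an illustration of the general idea rather than the benchmark's actual implementation: the question identifiers, the graph, and the judge answers are hypothetical, and the graph is assumed to be acyclic.

```python
def dependency_aware_score(dependencies, raw_answers):
    """Score questions so that one only counts if all of its prerequisites pass.

    dependencies: question id -> list of prerequisite question ids (acyclic)
    raw_answers:  question id -> bool, the judge's answer taken in isolation
    """
    resolved = {}

    def resolve(question_id):
        if question_id not in resolved:
            prerequisites_ok = all(resolve(p) for p in dependencies[question_id])
            resolved[question_id] = prerequisites_ok and raw_answers[question_id]
        return resolved[question_id]

    for question_id in dependencies:
        resolve(question_id)
    return sum(resolved.values()) / len(resolved), resolved


# Hypothetical graph for the prompt "a brown dog to the left of a tree".
dependencies = {
    "dog_exists": [],
    "tree_exists": [],
    "dog_is_brown": ["dog_exists"],
    "dog_left_of_tree": ["dog_exists", "tree_exists"],
}

# Hypothetical judge answers: the dog is missing, yet the attribute and
# relation questions were still answered "yes" when asked in isolation.
raw_answers = {
    "dog_exists": False,
    "tree_exists": True,
    "dog_is_brown": True,
    "dog_left_of_tree": True,
}

score, resolved = dependency_aware_score(dependencies, raw_answers)
print("score:", score)
print("resolved:", resolved)
```

Without the dependency check, three of the four raw answers would count as passes; gating on prerequisites prevents attribute and relation questions from being credited when the object they refer to is absent.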

Summary

This lecture outlines the methodologies for evaluating text-to-image models, distinguishing between aesthetics and prompt adherence, and detailing the transition from traditional mathematical metrics to MLLM-as-a-Judge frameworks.


