Stanford CME296 Lecture 7: Evaluation of Text-to-Image Generation Models

Evaluating the output of text-to-image generation models is a critical step in the development lifecycle because improvement requires a reliable way to quantify quality. Evaluation typically splits into two primary dimensions: aesthetics (whether the image is physically plausible and visually pleasing) and prompt adherence (whether the image accurately reflects the objects, styles, and locations specified in the input text).

Human-Based Evaluation

Human ratings provide the most nuanced feedback but are subject to high noise and cost. The lecture identifies three primary human evaluation setups:

  • Absolute Scale (1-5): Users rate images on a scale. This is nuanced but noisy, as different humans interpret the scale differently.
  • Binary Pass Rate: Users decide if an image is "good" or "bad." This is easier for humans but lacks a reference point for absolute quality.
  • Pairwise Comparison: Users compare two images and choose the better one. This is the least noisy method because relative comparison is more intuitive than absolute grading.

The Elo Rating System

To avoid the computational and human cost of comparing every model against every other model in a leaderboard, the Elo rating system is used. Instead of a simple win rate, Elo adjusts a model's rating based on the strength of its opponent. If a model wins against a strong opponent, its rating increases significantly; winning against a weak opponent yields minimal gains. This allows for a dynamic leaderboard where new models can be integrated without re-evaluating the entire set.
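The update rule described above can be sketched as follows. This is a minimal illustration that assumes the conventional chess parameters (a K-factor of 32 and a 400-point logistic scale); the lecture summary does not state which values a given leaderboard uses, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A's image was preferred, 0.0 if B's was, 0.5 for a tie.
    k=32 and the 400-point scale are assumed defaults, not values from the lecture.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# An upset win (weak model beats strong model) moves ratings far more
# than an expected win (strong model beats weak model).
weak_after, strong_after = elo_update(1000.0, 1400.0, score_a=1.0)
print(f"upset win gain:    {weak_after - 1000.0:.2f}")
strong_after2, weak_after2 = elo_update(1400.0, 1000.0, score_a=1.0)
print(f"expected win gain: {strong_after2 - 1400.0:.2f}")
```

Because each update only needs the two ratings involved, a new model can enter the leaderboard with a default rating and converge after a modest number of comparisons.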

Reference-Free Metrics

Reference-free metrics evaluate generated images without comparing them to a single "ground truth" image, as multiple valid images can exist for one prompt.

Fréchet Inception Distance (FID)

FID is the industry standard for quantifying aesthetics and diversity. It compares the distribution of generated images to the distribution of real images in a latent space (specifically using the Inception network encoder).

  • Mechanism: It fits a Gaussian, characterized by its mean ($\mu$) and covariance ($\Sigma$), to the Inception features of each image set, then computes the Fréchet distance (equivalently, the 2-Wasserstein distance) between the two Gaussians.
  • Interpretation: A lower FID score indicates that the generated distribution is closer to the real distribution. Differences in means suggest style/quality gaps, while differences in covariance suggest a lack of diversity (mode collapse).
  • Limitations: FID assumes distributions are Gaussian, which is rarely true in practice, and it can be a poor proxy for actual human-perceived quality.
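For reference, the distance between two Gaussians has a closed form. The subscripts $r$ (real) and $g$ (generated) are notation introduced here for illustration; $\mu$ and $\Sigma$ are the feature means and covariances described above.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The first term captures the gap between the means and the second the mismatch between the covariances, which matches the interpretation given above. When both distributions coincide, both terms vanish and the score is zero.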

Prompt Adherence Metrics

  • CLIPScore: Uses the CLIP model to measure the cosine similarity between the embeddings of the input text and the generated image. It is effective for general semantic matching but struggles with subtle spatial or relational details.
  • PickScore: A CLIP-based model trained specifically on human preference data to provide a holistic score combining aesthetics and adherence.
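The CLIPScore computation can be sketched in a few lines once the embeddings are available. The sketch below assumes the embeddings have already been produced by a CLIP text and image encoder (the toy vectors are made up for illustration), and it uses the rescaling from the original CLIPScore formulation, a weight of 2.5 applied to the cosine similarity clipped at zero, which the lecture summary does not spell out.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def clip_score(text_emb: Sequence[float], image_emb: Sequence[float], w: float = 2.5) -> float:
    """w * max(cos(text, image), 0); w=2.5 is the assumed rescaling constant."""
    return w * max(cosine_similarity(text_emb, image_emb), 0.0)


# Toy stand-ins for CLIP embeddings (hypothetical values, not real model output).
text_embedding = [0.2, 0.9, 0.1]
image_embedding = [0.25, 0.8, 0.3]
print(clip_score(text_embedding, image_embedding))
```

Because the score reduces the prompt and the image to one vector each, two prompts that differ only in a spatial relation (for example, which object is on the left) can produce nearly identical embeddings, which is one way to understand the weakness noted above.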

Reference-Based Metrics

Reference-based metrics are used when a specific target image exists, such as in VAE reconstruction or image editing tasks.

  • MSE (Mean Squared Error): A pixel-wise distance. It is highly sensitive to slight shifts in alignment.
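  • PSNR (Peak Signal-to-Noise Ratio): Normalizes MSE relative to the maximum possible pixel value and applies a logarithm, giving a score in decibels where higher is better: $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The log scale is intended to better align with human perception of error.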
  • SSIM (Structural Similarity Index): Moves beyond pixels to compare local patches based on luminance, contrast, and structure (using Pearson correlation). It is more robust than MSE but still sensitive to large shifts.
  • LPIPS (Learned Perceptual Image Patch Similarity): Passes images through a pre-trained encoder (like VGG or AlexNet) and computes a weighted distance between feature maps. This is designed to align closely with human perceptual judgment.
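The two pixel-level metrics are simple enough to sketch directly. The example below is a minimal illustration on a tiny grayscale image stored as nested lists, assuming 8-bit pixel values (a maximum of 255); the striped test image is made up to show how a one-pixel shift can produce the worst possible score even though the two images look nearly identical to a person.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def mse(img_a: Image, img_b: Image) -> float:
    """Mean squared error over all pixels; images must have the same shape."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)


def psnr(img_a: Image, img_b: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    err = mse(img_a, img_b)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Vertical stripes, and the same stripes shifted right by one pixel.
stripes = [[0.0, 255.0, 0.0, 255.0] for _ in range(4)]
shifted = [[255.0, 0.0, 255.0, 0.0] for _ in range(4)]

print("MSE after 1-pixel shift: ", mse(stripes, shifted))
print("PSNR after 1-pixel shift:", psnr(stripes, shifted))
```

Every pixel flips between the two extremes here, so the MSE reaches its maximum and the PSNR drops to 0 dB. This is the alignment sensitivity that SSIM and LPIPS are meant to reduce.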

MLLM-as-a-Judge

Multi-modal Large Language Models (MLLMs) are increasingly used as judges because they can provide reasoning (rationales) rather than just a scalar score.

Evolution of MLLM Evaluation

  1. TIFA (Text-to-Image Faithfulness evaluation with question Answering): Decomposes a prompt into atomic yes/no questions (e.g., "Is there a teddy bear?"). An MLLM answers each, and the final score is the proportion of correct answers. This allows for precise debugging of where a model fails.
  2. VQA Score: Formulates the evaluation as a Visual Question Answering task (e.g., "Does this figure show [prompt]?"). The score is the probability the model assigns to the token "yes."
  3. VIEScore (Visual Instruction-guided Explainable Score): Uses a concept-centric approach where the judge is given a detailed rubric (e.g., guidelines for "perceptual quality") and asked to provide a rationale before a final score, often outputting in JSON for easy parsing.
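The scoring step of the first two approaches can be sketched without any model in the loop. In the sketch below, the judge's answers and the token logits are made-up inputs standing in for real MLLM output, and the function names are hypothetical; only the aggregation logic is shown.

```python
import math
from typing import Dict


def tifa_score(expected: Dict[str, str], answers: Dict[str, str]) -> float:
    """Fraction of decomposed questions the judge answered as expected."""
    correct = sum(1 for q, want in expected.items() if answers.get(q) == want)
    return correct / len(expected)


def yes_probability(token_logits: Dict[str, float], yes_token: str = "yes") -> float:
    """Softmax probability of the 'yes' token given next-token logits."""
    peak = max(token_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(logit - peak) for tok, logit in token_logits.items()}
    return exps[yes_token] / sum(exps.values())


# Prompt: "a teddy bear on a skateboard" (hypothetical decomposition).
expected = {
    "Is there a teddy bear?": "yes",
    "Is there a skateboard?": "yes",
    "Is the teddy bear on the skateboard?": "yes",
}
judge_answers = {
    "Is there a teddy bear?": "yes",
    "Is there a skateboard?": "yes",
    "Is the teddy bear on the skateboard?": "no",
}
print("TIFA-style score:", tifa_score(expected, judge_answers))

# Made-up logits over a three-token vocabulary for the VQA-style question.
print("P(yes):", yes_probability({"yes": 2.0, "no": 0.5, "maybe": -1.0}))
```

The per-question breakdown is what makes the first approach useful for debugging: here the failure is isolated to the spatial relation, not to either object.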

Best Practices for MLLM Judges

  • Chain-of-Thought: Require the model to output its rationale before the score to improve accuracy.
  • Determinism: Set the temperature to zero to ensure consistent results across runs.
  • Bias Mitigation: In pairwise settings, swap the order of images to prevent position bias.
  • Alignment: Calibrate the MLLM judge by comparing its ratings against human-graded samples and tuning the rubrics accordingly.
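The order-swapping practice can be sketched as a small wrapper around any pairwise judge. The `judge` callable and its "first"/"second" return convention are assumptions made for this illustration, not an interface from the lecture; in practice it would wrap an MLLM call made at temperature zero.

```python
from typing import Callable

# A judge takes (prompt, image_shown_first, image_shown_second)
# and returns "first" or "second". Hypothetical interface.
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, prompt: str, image_a: str, image_b: str) -> str:
    """Query the judge in both orders; only a consistent verdict counts as a win."""
    verdict_ab = judge(prompt, image_a, image_b)  # A is shown first
    verdict_ba = judge(prompt, image_b, image_a)  # A is shown second
    a_wins_ab = verdict_ab == "first"
    a_wins_ba = verdict_ba == "second"
    if a_wins_ab and a_wins_ba:
        return "A"
    if not a_wins_ab and not a_wins_ba:
        return "B"
    return "tie"  # the verdict flipped with the order, so treat it as position bias


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge that is purely position-biased."""
    return "first"


print(debiased_preference(always_first, "a red cube", "img_a.png", "img_b.png"))
```

Recording inconsistent verdicts as ties is one possible design choice; another is to discard them and track the inconsistency rate as a health metric for the judge.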

Technical Benchmarks

Several benchmarks target specific failure modes of image generation:

  • GenEval: Tests object counting, color attribution, and relative positioning using object detection models as judges.
  • DPG Bench: Uses a logical graph to evaluate dense prompts, checking prerequisites (e.g., if an object exists) before checking its attributes.
  • Long Text Bench: Specifically evaluates OCR capabilities: the ability of a model to render readable, accurate text within an image.
  • Grounded Edits Bench: Evaluates image editing tasks based on perceptual quality and semantic consistency.
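The prerequisite idea behind graph-based evaluation of dense prompts can be sketched as follows. This is an illustrative scoring rule under stated assumptions, not the benchmark's actual implementation: the question identifiers are hypothetical, the dependency graph is assumed to be acyclic, and a question is assumed to count as passed only if all of its prerequisites also pass.

```python
from typing import Dict, List


def dependency_aware_score(answers: Dict[str, bool], parents: Dict[str, List[str]]) -> float:
    """Fraction of questions that pass once prerequisites are enforced.

    answers maps a question id to the judge's raw verdict.
    parents maps a question id to the ids it depends on (acyclic by assumption).
    """
    def passes(question: str) -> bool:
        if not answers[question]:
            return False
        # An attribute check is only credited if the object it refers to exists.
        return all(passes(parent) for parent in parents.get(question, []))

    return sum(passes(q) for q in answers) / len(answers)


# The judge says there is no cat, yet answers "yes" to "is the cat black?".
answers = {"cat_exists": False, "cat_is_black": True, "dog_exists": True}
parents = {"cat_is_black": ["cat_exists"]}
print(dependency_aware_score(answers, parents))
```

Without the prerequisite check, the inconsistent "yes" on the attribute question would inflate the score to two out of three; with it, only the dog question is credited.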
