Stanford CME296 Diffusion & Large Vision Models Lecture 8 Summary

Executive Summary

This lecture is the capstone for the CME296 course. It synthesizes the mathematical foundations of image generation, tracing the evolution from diffusion and score matching to flow matching, and extends these concepts to state-of-the-art (SOTA) models, video generation, image editing, and the emerging field of diffusion-based Large Language Models (LLMs). The core takeaway is that although the field is rapidly shifting toward flow matching and transformer-based architectures (DiT), the fundamental principles of noise removal and distribution mapping remain the bedrock of modern generative AI.

The Evolution of Image Generation Paradigms

Image generation is framed as the challenge of sampling from a complex, unknown data distribution by starting from a known, simple distribution (typically Gaussian noise) and learning a process to transform one into the other.

Diffusion and Score Matching

Diffusion models define a forward process that gradually corrupts clean images into noise and learn a reverse process that removes that noise. Training aims to maximize the likelihood of the data under the model; because that likelihood is intractable to compute directly, the Evidence Lower Bound (ELBO) is often optimized instead, and it simplifies to a tractable L2 regression loss.
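The forward corruption and the L2 loss can be sketched in a few lines. This is a minimal illustration, assuming a DDPM-style variance-preserving parameterization (the summary does not name one); `alpha_bar_t`, `forward_noise`, and `noise_prediction_loss` are hypothetical names, and the "images" are random arrays standing in for real data.

```python
import numpy as np

def forward_noise(x0, alpha_bar_t, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

def noise_prediction_loss(eps_pred, eps):
    """L2 regression loss between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 8, 8))  # toy batch standing in for images
alpha_bar_t = 0.5                             # assumed schedule value at some step t
x_t, eps = forward_noise(x0, alpha_bar_t, rng)

# A perfect noise predictor gets zero loss; a predictor that always
# outputs zeros gets a loss close to the variance of eps (about 1).
loss_perfect = noise_prediction_loss(eps, eps)
loss_zero = noise_prediction_loss(np.zeros_like(eps), eps)

# Knowing the noise is enough to recover the clean image exactly.
x0_recovered = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
```

In a real model, `eps_pred` would come from a neural network that receives `x_t` and the noise level as input.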

Score matching offers an alternative perspective, focusing on the "score": the gradient of the log-probability density with respect to the image. The score acts as a compass, pointing in the direction in which the data become more probable. Denoising score matching lets a model estimate this score from noisy images and their noise levels, and it eventually converges to a formulation similar to that of diffusion.
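The link between the score and the added noise can be checked numerically. The sketch below assumes a simple additive Gaussian perturbation with a single noise level `sigma` (an illustrative choice, not necessarily the lecture's setup); the function names are hypothetical.

```python
import numpy as np

def perturb(x0, sigma, rng):
    """Add Gaussian noise of standard deviation sigma to clean samples."""
    eps = rng.standard_normal(x0.shape)
    return x0 + sigma * eps, eps

def dsm_target(x_noisy, x0, sigma):
    """Score of the Gaussian kernel N(x_noisy; x0, sigma^2 I) w.r.t. x_noisy."""
    return -(x_noisy - x0) / sigma**2

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 2))  # toy 2-D "data"
sigma = 0.3
x_noisy, eps = perturb(x0, sigma, rng)
target = dsm_target(x_noisy, x0, sigma)

# The regression target is just the rescaled, negated noise,
# which is why the objective resembles noise prediction.
same_as_noise = np.allclose(target, -eps / sigma)

# Stepping along the score moves a noisy sample toward the clean one.
stepped = x_noisy + 0.5 * sigma**2 * target
dist_before = np.linalg.norm(x_noisy - x0)
dist_after = np.linalg.norm(stepped - x0)
```

A score network would be trained to regress `target` from `x_noisy` and `sigma`; no network is trained here.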

Flow Matching

Flow matching is the current industry standard (as of 2026), treating generation as a mass transport problem. Instead of removing noise, it learns a vector field (velocity) that moves probability density from an initial distribution to a target distribution.

Microscopic View: An Ordinary Differential Equation (ODE) describes the movement of individual particles.
Macroscopic View: The continuity equation ensures no probability mass is lost during the transition.
Rectified Flow: A variant of flow matching that creates straighter paths between distributions, reducing the number of numerical solver steps required during inference and speeding up sampling.
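The particle-level (ODE) view and the benefit of straight paths can be sketched as follows. This assumes the common linear interpolation path between a noise sample and a data sample, and uses a toy target distribution concentrated on a single point `mu` so that the exact velocity field is known in closed form; all names are illustrative.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for the velocity network on the straight path."""
    return x1 - x0

def euler_sample(velocity_fn, x, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

mu = np.array([2.0, -1.0])  # toy target: all data sit at this point

def toy_velocity(x, t):
    """Exact velocity field when every straight path ends at mu."""
    return (mu - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal((5, 2))
one_step = euler_sample(toy_velocity, noise, num_steps=1)
many_steps = euler_sample(toy_velocity, noise, num_steps=50)
```

Because the paths in this toy case are perfectly straight, a single Euler step already lands on the target, which is the intuition behind the reduced step counts of rectified flow. Learned fields on real data are only approximately straight.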

Representations and Architectures

Latent Space and VAEs

Generating images in pixel space is computationally expensive and inefficient due to high dimensionality and spatial correlation. To solve this, models use autoencoders to compress images into a lower-dimensional latent space.

Variational Autoencoders (VAEs) are used to regularize this latent space, ensuring it is compact and well-structured (avoiding "spikes"). This makes it easier for the diffusion or flow model to learn the mapping from noise to data. However, recent trends (e.g., HiDream-01) suggest that scaling transformers to massive parameter counts (up to 200B) may allow for direct pixel-space generation, potentially eliminating the fidelity loss associated with VAEs.

Image Generation Architectures

U-Net: Traditionally used for its downsampling (global understanding) and upsampling (detail reconstruction) paths, connected by skip connections.
Diffusion Transformer (DiT): Replaces the U-Net with a transformer architecture to allow long-range interactions between distant image patches, which is critical for global coherence.
Multi-modal DiT: Integrates conditions (like text prompts) directly into the joint attention mechanism rather than just modulating embeddings via adaptive layer norm.
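Before a transformer can attend across distant regions, the image (or latent) is cut into patch tokens. A minimal sketch of that step, assuming a channels-last array and a square patch size; `patchify` and `unpatchify` are illustrative helper names, not the API of any particular library.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) array into (num_patches, patch*patch*C) tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "patch must divide H and W"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

def unpatchify(tokens, h, w, c, patch):
    """Inverse of patchify: rebuild the (H, W, C) array from tokens."""
    grid = tokens.reshape(h // patch, w // patch, patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, patch, cols, patch, C)
    return grid.reshape(h, w, c)

image = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
tokens = patchify(image, patch=4)          # 4 tokens of length 48
restored = unpatchify(tokens, 8, 8, 3, patch=4)
```

Each row of `tokens` would then be linearly embedded and passed through self-attention, where every patch can interact with every other patch regardless of distance.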

Training and Evaluation

Training Pipeline

Pre-training: The most expensive phase, requiring a massive, high-quality corpus to learn general image generation.
Continued Training: Fine-tuning the model on a specific domain (e.g., teddy bears) to improve specialized generation.
Tuning (DreamBooth/LoRA): Using a small set of images (5-10) to teach the model a specific subject. Low-Rank Adaptation (LoRA) keeps this efficient by freezing the pretrained weights and training only small low-rank update matrices added to them.
Distillation: Reducing the number of inference steps to cut production costs and latency.
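The efficiency of LoRA comes from the size of what is trained. The sketch below shows a single linear layer with a low-rank update, using the usual convention of a zero-initialized `B` matrix and an `alpha / rank` scale; the dimensions and names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 64, 4, 8.0

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, rank))                   # trainable, zero init

def lora_forward(x, W, A, B, alpha, rank):
    """Apply the frozen weight plus the scaled low-rank update B @ A."""
    delta = (alpha / rank) * (B @ A)
    return x @ (W + delta).T

x = rng.standard_normal((3, d_in))
base_out = x @ W.T
init_out = lora_forward(x, W, A, B, alpha, rank)  # equals base_out at init

# Only A and B would receive gradients during tuning.
trainable = A.size + B.size   # rank * (d_in + d_out)
full = W.size                 # d_in * d_out

# Simulate a trained B: the update can never exceed the chosen rank.
B_trained = rng.standard_normal((d_out, rank))
update_rank = np.linalg.matrix_rank(B_trained @ A)
```

With these toy sizes the adapter holds 512 trainable values against 4096 in the full matrix, and the gap widens as the layer grows.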

Evaluation Metrics

Elo Rating: A pairwise comparison system that accounts for the strength of the opponent model, providing a more robust ranking than simple win rates.
FID (Fréchet Inception Distance): Measures the distance between the distributions of generated and real images in the feature space of a pretrained Inception network. Lower scores indicate higher realism, though it is a proxy metric that assumes those features follow Gaussian distributions.
MLLM-as-a-Judge: Using multi-modal large language models to provide automated ratings, enabling faster iteration loops before human evaluation.
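Two of these metrics reduce to short formulas. The sketch below shows a standard Elo update and the Fréchet distance between two Gaussian fits. It assumes the conventional Elo constants (scale 400, K-factor 32) and uses random vectors in place of real Inception features, so the numbers are illustrative only.

```python
import numpy as np

def elo_expected(r_a, r_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings; score_a is 1 (A wins), 0.5 (tie) or 0."""
    e_a = elo_expected(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussian fits of two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr(sqrt(cov_a cov_b)) equals the sum of square roots of the
    # eigenvalues of the product, which are real and non-negative.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

# Elo: an upset win over a stronger model earns more points.
even_win = elo_update(1500.0, 1500.0, score_a=1.0)
upset_win = elo_update(1400.0, 1600.0, score_a=1.0)

# Frechet distance on stand-in features (not real Inception activations).
rng = np.random.default_rng(0)
real = rng.standard_normal((200, 4))
shift = np.array([1.0, 0.0, 0.0, 0.0])
fd_same = frechet_distance(real, real)
fd_shifted = frechet_distance(real, real + shift)
```

Shifting every feature vector by a constant leaves the covariance unchanged, so `fd_shifted` is just the squared length of the shift.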

Extensions to Adjacent Fields

Video Generation

Video is treated as a 3D extension of images (space + time). Key challenges include temporal consistency (ensuring objects do not spontaneously change) and computational tractability.

Causal VAEs: Use causal (asymmetrically padded) temporal convolutions so that a frame's representation depends only on the current and previous frames, allowing for streaming encoding/decoding.
Space-Time Patches: DiT architectures for video operate on 3D patches, using self-attention to ensure coherence across both spatial and temporal dimensions.
Anchor Frames: The first frame is often treated as a special anchor to provide a stable starting point for the video sequence.
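The causality constraint can be illustrated with a one-dimensional convolution over time. This is a simplified sketch that pads only on the past side and treats each frame as a flat feature vector; a real causal VAE applies the same idea inside 3D convolutions. The function name and kernel values are illustrative.

```python
import numpy as np

def causal_temporal_conv(frames, kernel):
    """Convolve over time so output t uses only frames t-K+1 .. t.

    frames: (T, D) array of per-frame features.
    kernel: (K,) weights; the last weight applies to the current frame.
    """
    k = len(kernel)
    # Pad with zeros on the past side only (asymmetric padding).
    pad = np.zeros((k - 1, frames.shape[1]))
    padded = np.concatenate([pad, frames], axis=0)
    out = np.zeros(frames.shape, dtype=np.float64)
    for t in range(frames.shape[0]):
        window = padded[t:t + k]   # frames t-K+1 .. t
        out[t] = kernel @ window
    return out

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 3))   # 8 frames, 3 features each
kernel = np.array([0.2, 0.3, 0.5])

encoded = causal_temporal_conv(video, kernel)

# Changing future frames must not affect earlier outputs.
altered = video.copy()
altered[5:] += 10.0
encoded_altered = causal_temporal_conv(altered, kernel)
```

Because outputs for the first five frames are identical in both runs, they could be emitted before the later frames arrive, which is what makes streaming possible.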

Image Editing

Rather than treating editing as a "from-scratch" generation problem (which often fails to preserve the original image's structure), new research focuses on action-based editing. This involves using a VLM to translate a user's intent into a sequence of specific editing actions (e.g., "decrease brightness by 50%") that can be executed by software like Photoshop.
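A rough sketch of the execution side of action-based editing follows. The action schema (`op`, `percent`) is entirely hypothetical, the VLM step is replaced by a hard-coded action list, and "decrease brightness by 50%" is interpreted as scaling pixel values by 0.5, which is only one possible reading.

```python
import numpy as np

def apply_actions(image, actions):
    """Apply a list of editing actions to an image with values in [0, 1]."""
    out = image.astype(np.float64).copy()
    for action in actions:
        factor = 1.0 + action["percent"] / 100.0
        if action["op"] == "brightness":
            out = out * factor
        elif action["op"] == "contrast":
            mean = out.mean()
            out = (out - mean) * factor + mean
        else:
            raise ValueError(f"unknown op: {action['op']}")
        out = np.clip(out, 0.0, 1.0)
    return out

# Stand-in for what a VLM might emit for "make the photo darker".
actions = [{"op": "brightness", "percent": -50}]

image = np.linspace(0.0, 1.0, 16).reshape(4, 4)
edited = apply_actions(image, actions)
```

Since each action is a deterministic transform of the existing pixels, the layout of the original image is preserved by construction, unlike regeneration from scratch.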

Diffusion for LLMs

To overcome the latency of autoregressive (token-by-token) generation, researchers are applying diffusion to text.

Mechanism: Instead of sequential generation, the model starts with a sequence of masked tokens (noise) and progressively unmasks them to reveal the final text.
Benefits: This can result in speedups of up to 10x and is particularly effective for "fill-in-the-middle" tasks, such as coding, where the model must generate text between two existing blocks of code.
Challenges: Text is discrete, unlike images. This requires specialized masking schemes and confidence-based remasking during inference to correct errors.
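The unmasking loop can be sketched with a toy example. Here a random-probability function stands in for a trained denoiser, so the output tokens are meaningless; the point is the control flow. At each step every masked position is predicted, the most confident predictions are committed, and the rest stay masked for the next step. The `MASK` sentinel, function names, and reveal schedule are assumptions.

```python
import numpy as np

MASK = -1  # sentinel id for a masked position (assumed convention)

def toy_model(tokens, vocab_size, rng):
    """Stand-in for a trained denoiser: random per-position probabilities."""
    logits = rng.standard_normal((len(tokens), vocab_size))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)

def unmask_decode(tokens, vocab_size, num_steps, rng):
    """Progressively fill masked positions, most confident first."""
    tokens = tokens.copy()
    for step in range(num_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = toy_model(tokens, vocab_size, rng)
        pred = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        # Commit enough positions that everything is filled by the last step.
        n_reveal = int(np.ceil(masked.size / (num_steps - step)))
        order = np.argsort(-conf[masked])
        chosen = masked[order[:n_reveal]]
        tokens[chosen] = pred[chosen]  # low-confidence positions stay masked
    return tokens

rng = np.random.default_rng(0)
prefix, suffix = [5, 3], [7, 1]       # existing text around the gap
start = np.array(prefix + [MASK] * 6 + suffix)
result = unmask_decode(start, vocab_size=10, num_steps=3, rng=rng)
```

The fixed prefix and suffix mirror the fill-in-the-middle setting: only the gap is generated, and six tokens are produced in three model calls rather than six. This sketch never revises a committed token; schemes that do so would also remask previously filled positions.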

Future Challenges and Outlook

Model Collapse: The risk that future models trained on AI-generated data will enter an "echo chamber of mistakes," drifting away from the true data distribution.
Provenance and Trust: The use of standards like C2PA (metadata) and SynthID (pixel-level watermarking) to distinguish AI-generated content from real images.
Hardware Evolution: Moving beyond matrix multiplication toward hardware optimized specifically for the attention mechanism.
Reasoning in Vision: Moving from simple image projection to deep reasoning within the visual modality, similar to the capabilities of modern LLMs.
