Stanford CS336 Lecture 17: Multimodal Language Modeling

Overview of Multimodal Modeling

Multimodal models aim to move beyond text-to-text capabilities toward "omni models" that can ingest and output any combination of modalities, including text, images, audio, and video. Because transformers are the most effective architecture at scale across these modalities, the primary technical challenge is converting non-text data into tokens (either discrete or continuous) that a transformer can process.

Foundations: CLIP and SigLIP

Modern Vision-Language Models (VLMs) rely heavily on encoders that map images into a semantic space compatible with text.

CLIP (Contrastive Language Image Pre-Training)

CLIP was designed to leverage massive amounts of noisy image-text pairs from the web. It uses a contrastive objective: given a batch of image-text pairs, the model maximizes the similarity (the dot product of the normalized embeddings) of each matched image-text pair while minimizing it for every mismatched pairing in the batch.

  • Architecture: CLIP typically uses a Vision Transformer (ViT) as the image encoder, breaking images into patches (e.g., 14x14) and processing them as tokens. The text encoder is a GPT-2 style transformer.
  • Key Result: CLIP demonstrated that zero-shot image classification, trained only on organic web data, could be competitive with supervised models trained directly on curated datasets like ImageNet.
  • Limitation: CLIP requires very large batch sizes (e.g., around 32,000) because the softmax normalizes over the entire batch. Every example's loss depends on every other example, which makes the computation difficult to decompose and parallelize across devices.
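
The contrastive objective described above can be sketched in a few lines. This is a minimal illustration rather than CLIP's actual implementation: the function names are made up for this example, the embeddings are toy 2-D vectors, and the fixed temperature of 0.07 is an assumption (in practice the temperature is typically a learned parameter).

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the batch similarity matrix.

    Row i of image_embs and row i of text_embs form a matched pair.
    The temperature value is an assumption made for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text: softmax over each row; the correct text is on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: softmax over each column.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Each `log_softmax` call needs a full row or column of the similarity matrix, which is why the loss cannot be computed one pair at a time.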

SigLIP (Sigmoid Loss for Language Image Pre-Training)

SigLIP improves upon CLIP by replacing the multi-class softmax loss with a simpler sigmoid loss. Instead of ranking one image against a batch of texts, SigLIP treats each image-text pair as a binary classification problem (aligned or not).

  • Efficiency: This change decouples the loss from the batch size, allowing for more efficient training and better performance with smaller batch sizes.
  • Parallelization: SigLIP enables better distribution across devices by rotating text embeddings across the pod to compute negatives without requiring the full batch on a single device.
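
A minimal sketch of the pairwise sigmoid loss follows, assuming unit-normalized embeddings. The function names and the `text_offset` parameter are hypothetical, and the `scale` and `bias` defaults are illustrative starting values, not tuned settings. Splitting the texts into chunks stands in for the device-by-device rotation described above: each chunk contributes an independent partial sum.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def siglip_loss_sum(image_embs, text_chunk, text_offset=0, scale=10.0, bias=-10.0):
    """Sum of binary losses between all images and one chunk of texts.

    text_offset is the batch index of the first text in the chunk, so that
    matched pairs (label +1) can be told apart from negatives (label -1).
    """
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_chunk):
            dot = sum(a * b for a, b in zip(img, txt))
            label = 1.0 if i == text_offset + j else -1.0
            total -= log_sigmoid(label * (scale * dot + bias))
    return total

images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]

# Whole batch at once.
full = siglip_loss_sum(images, texts) / len(images)

# Same loss accumulated chunk by chunk, as if each chunk came from another device.
chunked = (siglip_loss_sum(images, texts[:1], text_offset=0)
           + siglip_loss_sum(images, texts[1:], text_offset=1)) / len(images)

print(abs(full - chunked) < 1e-9)
```

Because no term depends on a batch-wide normalizer, the partial sums can be computed in any order and added together.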

Vision-Language Model (VLM) Architectures

Most open-source VLMs follow a template: a pre-trained vision encoder, a language model (LLM) as the decoder, and a "projector" or adapter that maps vision embeddings into the LLM's token space.

LLaVA (Large Language-and-Vision Assistant)

LLaVA demonstrates how to stitch together a CLIP encoder and a Vicuna (Llama-based) LLM.

  • Data Synthesis: LLaVA used GPT-4 to generate synthetic conversations, detailed descriptions, and complex reasoning based on human-annotated captions from MS COCO.
  • Training Stages:
    1. Alignment Phase: The vision encoder and LLM are frozen; only the projector (a linear matrix $W$) is trained to map image vectors into the text embedding space.
    2. Fine-tuning Phase: The vision encoder remains frozen, but the projector and LLM are trained together on the synthesized multimodal data.
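
The projector step and the two-stage freezing schedule can be sketched as follows. This is an illustrative toy, not LLaVA's code: the function names, the `TRAINABLE` table, and the tiny dimensions (3-D vision features, 2-D LLM embeddings) are assumptions made for the example, and placing the visual tokens before the text is just one simple layout.

```python
def project(patch_features, W):
    """Map each vision feature (length d_vision) into the LLM embedding space.

    W has shape d_vision x d_llm, so each output row is z @ W.
    """
    d_llm = len(W[0])
    return [[sum(z[k] * W[k][c] for k in range(len(z))) for c in range(d_llm)]
            for z in patch_features]

def build_llm_input(patch_features, W, text_embeddings):
    # Projected image vectors are placed in front of the text token embeddings.
    return project(patch_features, W) + text_embeddings

# Which components receive gradient updates in each stage (from the list above).
TRAINABLE = {
    "alignment":   {"vision_encoder": False, "projector": True, "llm": False},
    "fine_tuning": {"vision_encoder": False, "projector": True, "llm": True},
}

patch_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]   # 2 patches, d_vision = 3
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]              # d_vision x d_llm
text_embeddings = [[0.1, 0.2]]                        # 1 text token, d_llm = 2

sequence = build_llm_input(patch_features, W, text_embeddings)
print(len(sequence), sequence[0])
```

The LLM sees the projected image vectors as ordinary positions in its input sequence, which is why only a small adapter is needed to connect the two pre-trained models.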

LLaVA OneVision and AnyRes

To handle high-resolution images and OCR, LLaVA OneVision relies on AnyRes, a technique that first appeared in the earlier LLaVA-NeXT release. Instead of downsampling an image to a fixed square (e.g., 336x336), AnyRes breaks the image into multiple crops at the encoder's native resolution and concatenates the resulting vectors. This allows the model to preserve fine-grained details while adapting to different image aspect ratios.
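
The tiling idea can be sketched as below. This is a simplified illustration, not the released AnyRes algorithm: it covers the image with a plain ceil-division grid, the function names are hypothetical, the 14-pixel patch size is carried over from the CLIP discussion, and the extra downsampled overview image is an assumption that can be switched off.

```python
import math

def anyres_crops(width, height, tile=336):
    """Return (left, top, right, bottom) boxes that tile the image.

    Boxes on the right and bottom edges may extend past the image, which
    corresponds to padding the image up to a multiple of the tile size.
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return [(c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
            for r in range(rows) for c in range(cols)]

def visual_token_count(width, height, tile=336, patch=14, include_overview=True):
    # Each crop is encoded independently into (tile / patch)^2 patch tokens.
    tokens_per_crop = (tile // patch) ** 2
    n_views = len(anyres_crops(width, height, tile))
    if include_overview:
        n_views += 1  # one extra low-resolution view of the whole image
    return n_views * tokens_per_crop

crops = anyres_crops(1000, 600)
print(len(crops), visual_token_count(1000, 600))
```

For a 1000x600 image this gives a 3x2 grid of crops, so the token count grows with image size rather than staying fixed, which is the cost of preserving detail.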

The Qwen-VL Series

Qwen-VL and its successors (Qwen2-VL, Qwen3-VL) further refined the VLM pipeline:

  • Dynamic Resolution: Qwen2-VL maps each image to a variable number of tokens based on its size, merging adjacent patches to keep the context length manageable.
  • Multimodal RoPE (M-RoPE): To handle 3D spatial-temporal positions (time, height, and width), Qwen2-VL uses a multidimensional Rotary Position Embedding. Qwen3-VL improved this by interleaving the axes across the rotary dimensions so that every axis is exposed to both low and high frequencies.
  • Explicit Timestamps: Qwen3-VL introduced explicit time tokens (e.g., "0 seconds") to help the model refer to specific moments in a video.
  • Deep Fusion: Rather than using only a simple projector, later versions use a more sophisticated adapter (DeepStack) that injects vision embeddings directly into the LLM's residual stream.
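
The difference between a chunked and an interleaved M-RoPE layout can be illustrated with a short sketch. It is an assumption-laden toy: the function names are hypothetical, the rotary base of 10000 and the equal-sized blocks in the chunked layout are simplifications, and the released models' exact allocation may differ.

```python
AXES = ("time", "height", "width")

def rope_frequencies(head_dim, base=10000.0):
    # Standard RoPE: one frequency per pair of dimensions, from high to low.
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def chunked_assignment(n_freqs):
    # Contiguous blocks: time gets the highest frequencies, width the lowest.
    # Assumes n_freqs >= 3 and equal block sizes (a simplification).
    block = n_freqs // 3
    return [AXES[min(i // block, 2)] for i in range(n_freqs)]

def interleaved_assignment(n_freqs):
    # Round-robin: time, height, width, time, height, width, ...
    return [AXES[i % 3] for i in range(n_freqs)]

def frequency_range(assignment, freqs, axis):
    own = [f for a, f in zip(assignment, freqs) if a == axis]
    return min(own), max(own)

freqs = rope_frequencies(head_dim=24)
chunked = chunked_assignment(len(freqs))
interleaved = interleaved_assignment(len(freqs))

for name, layout in (("chunked", chunked), ("interleaved", interleaved)):
    for axis in AXES:
        low, high = frequency_range(layout, freqs, axis)
        print(f"{name:12s} {axis:7s} min={low:.4f} max={high:.4f}")
```

In the chunked layout the frequency ranges of the three axes do not overlap, so one axis only ever sees slow rotations. In the interleaved layout every axis spans nearly the full range.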

Native Multimodality: The Chameleon Approach

While most VLMs are "stitched" models that only output text, Meta's Chameleon attempts native multimodality by discretizing everything into tokens.

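  • VQ-VAE (Vector Quantized Variational Autoencoder): Images are mapped to a discrete codebook of roughly 8,000 prototypical vectors (8,192 in Chameleon). Each image patch is replaced by the index of its nearest codebook vector, so an image becomes a sequence of discrete tokens, just like text.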
  • Unified Training: Because images are now tokens, a single transformer can be trained to predict the next token regardless of whether it is text or an image. This allows for interleaved image-text generation.
  • Challenges: This approach suffers from training instability due to the high entropy of image tokens compared to text tokens. It also loses fine-grained information (like small text in OCR) due to discretization.
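
The discretization step can be sketched as a nearest-neighbor lookup. This is a toy illustration under stated assumptions: the 4-entry codebook, the 2-D patch vectors, the function names, and the `TEXT_VOCAB_SIZE` offset used to place image codes after the text vocabulary are all hypothetical, and a real tokenizer learns its codebook together with an encoder and decoder.

```python
def quantize(patch_vectors, codebook):
    """Replace each continuous vector with the index of its nearest code."""
    tokens = []
    for v in patch_vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def dequantize(tokens, codebook):
    # The decoder only ever sees the codebook vectors, not the originals.
    return [codebook[t] for t in tokens]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patches = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]

image_tokens = quantize(patches, codebook)
reconstructed = dequantize(image_tokens, codebook)

# Information lost by snapping each patch to its nearest prototype.
error = sum((a - b) ** 2
            for v, r in zip(patches, reconstructed)
            for a, b in zip(v, r))

# Hypothetical shared vocabulary: image codes are offset past the text ids,
# so one transformer can predict text and image tokens from the same softmax.
TEXT_VOCAB_SIZE = 100
text_tokens = [5, 9]
sequence = text_tokens + [TEXT_VOCAB_SIZE + t for t in image_tokens]

print(image_tokens, sequence, round(error, 2))
```

The nonzero reconstruction error is the information loss mentioned above: any detail finer than the spacing between codebook vectors, such as small text, cannot be recovered from the tokens.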

Summary of Technical Trade-offs

| Feature | Stitched VLMs (LLaVA/Qwen) | Native Multimodal (Chameleon) |
| --- | --- | --- |
| Input | Continuous embeddings $\rightarrow$ Projector $\rightarrow$ LLM | Discrete tokens $\rightarrow$ LLM |
| Output | Text only | Interleaved Text & Images |
| Output Method | LLM Token Prediction | LLM Token Prediction |
| Information Loss | Low (Continuous encoders) | High (Discretization) |
| Training Stability | High | Low (Entropy mismatch) |

SUMMARY:

This lecture explores the architecture of multimodal models, focusing on how vision-language models (VLMs) integrate image encoders like CLIP and SigLIP with large language models (LLMs) to achieve visual reasoning.
