Stanford CS25: Transformers United V6 - From Language Models to Native Multimodal Intelligence
Native Multimodal Intelligence: The Shift from LLMs
Native multimodal intelligence aims to build AI systems that process symbolic knowledge and multimodal information (images, audio, video) seamlessly within a single architecture. While Large Language Models (LLMs) achieved breakthroughs via next-token prediction over symbolic information, they are insufficient for interacting with the physical world, which is inherently multimodal.
Modern native multimodal models extend the LLM paradigm by tokenizing every modality. Once non-text signals are converted into tokens, whether through patchification for images or waveform transforms for audio, the whole sequence can be trained with a single autoregressive generation objective, much like a standard language model.
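Patchification can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows and non-overlapping square patches read in raster order; the `patchify` helper and the toy 4x4 image are hypothetical, chosen only for illustration, and real models then pass each patch through an encoder.

```python
def patchify(image, patch_size):
    """Split an H x W grid (list of rows) into flattened, non-overlapping patches.

    Patches are returned in raster order (left to right, top to bottom), so the
    2D image becomes a 1D sequence that can sit next to text tokens.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block row by row.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 "image" whose pixel values are 0..15 (illustrative data only).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
print(patches[0])  # top-left patch: [0, 1, 4, 5]
```

Each flattened patch plays the role of one position in the sequence, in the same way a word piece does for text.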
Architectural Paradigms for Multimodal Models
Multimodal models generally fall into two categories based on their output capabilities:
- Multimodal Input, Text Output: These models condition on multimodal sequences but calculate loss only on text tokens. This approach is used by models like Gemini, Qwen, and Kimi to enable high-level understanding and question-answering.
- Omni Models: These models, such as GPT-4o, take multimodal inputs and generate multimodal outputs (text, images, and audio).
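The "loss only on text tokens" idea from the first category can be sketched as a masked average. This is a simplified illustration, not any particular model's training code: the `masked_next_token_loss` function, the per-position log-probabilities, and the modality flags are all assumed for the example.

```python
import math


def masked_next_token_loss(token_log_probs, is_text):
    """Average negative log-likelihood over text positions only.

    token_log_probs[i] is the log-probability the model assigned to the
    correct token at position i; is_text[i] marks text positions.
    Non-text positions still condition later predictions but add no loss.
    """
    text_losses = [-lp for lp, keep in zip(token_log_probs, is_text) if keep]
    if not text_losses:
        return 0.0
    return sum(text_losses) / len(text_losses)


# Hypothetical sequence: two image tokens followed by three text tokens.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25), math.log(0.5)]
is_text = [False, False, True, True, True]
loss = masked_next_token_loss(log_probs, is_text)
```

Under this sketch, an Omni-style objective would also include the non-text positions in the loss, using whatever objective suits that modality.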
Tokenization and Discrete Representations
The Chameleon family of models tests the hypothesis that every modality can be converted into discrete tokens. For images, this involves "patchifying" the image, running a continuous encoder, and then matching embeddings to a learned vector codebook (using VQ-VAE techniques). This allows the model to generate interleaved text and images in arbitrary order.
However, discrete tokenization introduces two primary limitations:
- Information Loss: Discretization discards information, which hurts image understanding tasks compared to continuous encoders like SigLIP.
- Token Inefficiency: These models require massive amounts of training data before they can sample well-formed images.
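The codebook-matching step, and the information loss it causes, can be sketched with a nearest-neighbor lookup. This is a minimal sketch assuming squared Euclidean distance and a tiny hand-written codebook; the `quantize` helper and all values are hypothetical, and a real tokenizer learns its codebook jointly with the encoder.

```python
def quantize(embeddings, codebook):
    """Map each continuous embedding to the index of its nearest codebook vector."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    token_ids = []
    for embedding in embeddings:
        distances = [squared_distance(embedding, code) for code in codebook]
        token_ids.append(distances.index(min(distances)))
    return token_ids


# Illustrative 2-D codebook with four entries (real codebooks are learned).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1], [0.6, 0.9]]

token_ids = quantize(patch_embeddings, codebook)
print(token_ids)  # [1, 2, 0, 3]

# Decoding can only recover the codebook vectors, not the original embeddings.
reconstructed = [codebook[i] for i in token_ids]
```

The first embedding `[0.9, 0.1]` comes back as `[1.0, 0.0]`; that gap is the discretization loss the first limitation refers to.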
Unifying Autoregression and Diffusion
Transfusion addresses the limitations of discrete tokens by combining autoregressive language modeling with diffusion-based image generation in a single transformer. Text tokens follow standard next-token prediction, while image segments stay continuous and are trained with a diffusion objective.
Key architectural differences in Transfusion include:
- Causal Attention: Used for text.
- Bidirectional Attention: Used among the patches of an image, which improves performance.
Despite its superior image quality and token efficiency, Transfusion faces a "dilemma": the VAE representations that work well for generation are less effective for image understanding.
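The mixed attention pattern can be sketched as a boolean mask. This is a sketch under a stated assumption: tokens of the same image may attend to each other in both directions, and every other pair follows the usual causal rule. The `build_attention_mask` helper and the `segment_ids` encoding are hypothetical.

```python
def build_attention_mask(segment_ids):
    """Return mask[q][k] = True when query position q may attend to key position k.

    segment_ids[i] is None for a text token, or an integer image id for a
    token that belongs to that image.
    """
    n = len(segment_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q  # standard left-to-right rule
            same_image = segment_ids[q] is not None and segment_ids[q] == segment_ids[k]
            mask[q][k] = causal or same_image
    return mask


# Layout: text, text, three patches of image 0, text.
segment_ids = [None, None, 0, 0, 0, None]
mask = build_attention_mask(segment_ids)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the image rows form a fully filled square block, while the text rows stay lower-triangular.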
Scaling and Efficiency: Mixture of Transformers (MoT)
Because different modalities have different information densities, sharing one set of transformer parameters across all data can be inefficient. The Mixture of Transformers (MoT) architecture introduces modality-specific parameters for the attention projection matrices and the feed-forward layers.
How MoT Works
MoT employs deterministic routing: a text token activates the text-specific parameters, and an image token activates the image-specific parameters. After the separate QKV projections, joint attention over the full sequence lets the modalities interact, followed by modality-specific feed-forward processing.
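The routing described above can be sketched in a few lines. This is a heavily simplified illustration, not the published architecture: it assumes a single attention head, causal attention, a feed-forward "layer" that is just one linear map, and no residual connections or normalization. The `mot_layer` function and all parameter values are hypothetical.

```python
import math


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mot_layer(hidden, modalities, params):
    """Routed projections, joint attention, then routed feed-forward.

    hidden: list of token vectors; modalities: a "text"/"image" label per token;
    params[modality] holds that modality's "q", "k", "v", and "ffn" matrices.
    """
    # Deterministic routing: each token uses the weights of its own modality.
    q = [matvec(params[m]["q"], h) for h, m in zip(hidden, modalities)]
    k = [matvec(params[m]["k"], h) for h, m in zip(hidden, modalities)]
    v = [matvec(params[m]["v"], h) for h, m in zip(hidden, modalities)]

    # Joint attention: every token attends over all earlier tokens,
    # whatever their modality.
    scale = math.sqrt(len(q[0]))
    attended = []
    for i in range(len(hidden)):
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / scale for j in range(i + 1)]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j][d] for j, w in enumerate(weights)) for d in range(len(v[0]))]
        )

    # Modality-specific feed-forward step.
    return [matvec(params[m]["ffn"], a) for a, m in zip(attended, modalities)]


# Illustrative weights: text uses identity maps, image uses scaled ones.
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
params = {
    "text": {"q": identity, "k": identity, "v": identity, "ffn": identity},
    "image": {"q": identity, "k": identity, "v": double, "ffn": double},
}
hidden = [[1.0, 0.0], [0.0, 1.0]]
output = mot_layer(hidden, ["text", "image"], params)
```

Relabeling the second token as `"text"` changes its output, because the same hidden vector is then pushed through a different set of weights. The first coordinate of the image token's output is non-zero only because joint attention let it read from the text token.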
Key Findings from MoT Experiments
- Non-Text Generation: MoT significantly improves the generation of images and speech without sacrificing text performance.
- Capacity Competition: Separate parameters prevent "capacity competition" that occurs when a single transformer tries to handle fundamentally different data types.
- Asynchronous Training: MoT enables extending existing off-the-shelf text models by adding new modality parameters and freezing the text model, avoiding the need for full fine-tuning.
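The freezing idea in the last finding can be sketched as an optimizer step that skips the pretrained text parameters. This is a toy illustration with plain SGD on flat weight lists; the `sgd_step` helper, the parameter names, and the gradient values are all hypothetical.

```python
def sgd_step(params, grads, frozen, learning_rate=0.1):
    """Apply one SGD update, skipping any parameter group listed in `frozen`."""
    updated = {}
    for name, values in params.items():
        if name in frozen:
            # Pretrained text weights are copied through unchanged.
            updated[name] = list(values)
        else:
            updated[name] = [w - learning_rate * g for w, g in zip(values, grads[name])]
    return updated


# Hypothetical parameter groups: an existing text tower and a newly added image tower.
params = {"text_ffn": [0.5, -0.5], "image_ffn": [0.0, 0.0]}
grads = {"text_ffn": [1.0, 1.0], "image_ffn": [1.0, -2.0]}

new_params = sgd_step(params, grads, frozen={"text_ffn"})
```

Only `image_ffn` moves, so the text weights are exactly what they were before the new modality was added.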
The Relationship Between Understanding and Generation
Research into Omni models reveals an asymmetrical relationship between understanding and generation:
- Understanding $\rightarrow$ Generation: Stronger understanding capabilities in a base model lead to better generation, resulting in finer details and fewer hallucinations in infographics.
- Generation $\rightarrow$ Understanding: Training a model specifically for non-text generation (e.g., image generation) does not necessarily improve its image understanding capabilities.
The "Next Frame Prediction" Puzzle
While next-token prediction works for language, next-frame prediction for video does not yet yield the same emergent reasoning capabilities. Hypotheses for this include:
- Abstraction: Language is a highly compressed abstraction of human cognition and reasoning, whereas images/videos are raw sensory data.
- Loss Landscapes: The loss landscape for visual data is more complex; a model's loss may improve even if the generated output still looks poor to humans.
- Redundancy: Video frames contain high amounts of redundant information compared to the information density of text.
Future Directions in Multimodal AI
While current Omni models excel at digital information processing, significant gaps remain in physical world intelligence. Future research is focusing on:
- Spatial-Temporal Understanding: Improving real-time understanding and robotics control.
- Vision-Language-Action (VLA) Models: Using multimodal LLMs as backbones for action prediction in robotics.
- Unified Representations: Finding a single representation that effectively serves both perception and generation, potentially moving beyond the current split between VAEs (for generation) and continuous encoders (for understanding).