Sana: an efficiency-oriented framework for high-resolution image and video generation supporting 4K images and real-time streaming
What it solves
SANA is designed to make high-resolution image and video generation significantly more efficient. It addresses the high computational cost and memory requirements typically associated with generating 4K images or long-form videos, allowing these tasks to run on consumer-grade hardware (including laptop GPUs with less than 8 GB of VRAM).
How it works
SANA employs several key architectural optimizations to reduce the workload on the GPU:
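- Linear Attention: Replaces the standard softmax attention in Diffusion Transformers (DiT), whose cost grows quadratically with the number of tokens, with attention whose cost grows linearly, which makes high resolutions far cheaper to process.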
- DC-AE: A deep-compression autoencoder that downsamples images by 32x per spatial dimension (compared to the traditional 8x), drastically reducing the number of latent tokens.
- Decoder-only Text Encoder: Utilizes a modern LLM for better alignment between text prompts and generated images.
- Specialized Video Modules: Uses Block Causal Linear Attention and Causal Mix-FFN for long video generation, and sCM distillation for one-step generation (SANA-Sprint).
- Quantization: Supports 4-bit and 8-bit quantization to lower memory usage.
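To give a feel for why the compression ratio and the attention type matter, here is a rough back-of-the-envelope sketch. It is not taken from the SANA codebase: the function names, the head dimension of 32, and the simplified cost formulas are illustrative assumptions, and the patch size is left as a parameter because the summary above does not state it.

```python
def latent_tokens(height, width, downsample, patch_size=1):
    """Tokens the transformer sees for one image of height x width pixels.

    downsample: per-side autoencoder compression factor (e.g. 8 or 32).
    patch_size: extra per-side patching in the transformer (assumed 1 here).
    """
    stride = downsample * patch_size
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by downsample * patch_size")
    return (height // stride) * (width // stride)


def attention_macs(num_tokens, head_dim, linear):
    """Rough multiply-accumulate count for one attention head (constants simplified)."""
    if linear:
        # Linear attention: build a (d x d) summary from K and V once,
        # then apply it to every query, so cost scales with N * d^2.
        return 2 * num_tokens * head_dim * head_dim
    # Softmax attention: an (N x N) score matrix, then (N x N) weights times V,
    # so cost scales with N^2 * d.
    return 2 * num_tokens * num_tokens * head_dim


HEAD_DIM = 32  # illustrative assumption, not a documented SANA setting

for side in (1024, 4096):
    for factor in (8, 32):
        n = latent_tokens(side, side, factor)
        quad = attention_macs(n, HEAD_DIM, linear=False)
        lin = attention_macs(n, HEAD_DIM, linear=True)
        print(f"{side}px at {factor}x: {n} tokens, softmax/linear cost ratio ~ {quad // lin}")
```

Under these assumptions, a 4096x4096 image yields 262,144 latent positions at 8x and 16,384 at 32x, a 16x reduction before any patching. The softmax-to-linear cost ratio in this simplified model is just the token count divided by the head dimension, so the gap widens as resolution grows.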
Who it’s for
This project is for AI researchers, developers, and creators who need high-quality image and video generation but lack industrial-scale compute resources, as well as those building real-time streaming video editing tools or controllable world models for Embodied AI.
Highlights
- Extreme Efficiency: Generates 1024px images in as little as 0.1s on H100 GPUs.
- High Resolution: Supports text-to-image generation up to 4K resolution.
- Versatile Suite: Includes specialized models for one-step generation (SANA-Sprint), video generation (SANA-Video), world modeling (SANA-WM), and real-time streaming editing (SANA-Streaming).
- Broad Compatibility: Integrated with diffusers, ComfyUI, and SGLang for high-performance serving.
Sources
- NVlabs/Sana