FastVideo: a unified post-training and real-time inference framework for accelerated video generation
What it solves
FastVideo addresses the high computational cost and slow generation speeds associated with state-of-the-art video generation models. It provides a unified framework to accelerate both the post-training (fine-tuning and distillation) and the real-time inference of video Diffusion Transformers (DiTs).
How it works
FastVideo employs several optimization techniques to reduce latency and increase throughput:
- Post-Training Optimizations: It supports full and LoRA fine-tuning, as well as Distribution Matching Distillation (DMD2) and sparse distillation, which together yield denoising speedups of over 50x.
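- Attention Mechanisms: It implements specialized sparse attention backends, including Video Sparse Attention (VSA) and Sliding Tile Attention (STA), which restrict each query to a subset of key tokens and so reduce the cost of full attention over the 3D (time, height, width) grid of video tokens.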
- Inference Scaling: The framework uses sequence parallelism for distributed inference across multiple GPUs and runs on a range of hardware (H100, A100, RTX 4090) and operating systems.
- Real-time Streaming: Through its Dreamverse platform, it enables "vibe directing," allowing users to stream and edit video in real-time.
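LoRA fine-tuning, listed above, keeps each pretrained weight matrix frozen and trains only a small low-rank update next to it. The following is a minimal pure-Python sketch of the standard LoRA formulation y = Wx + (alpha / r) * B(Ax). The class and function names are hypothetical and do not reflect FastVideo's API; a real implementation would wrap the DiT's PyTorch linear layers rather than Python lists.

```python
import random

Matrix = list[list[float]]
Vector = list[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in the A (rank x d_in) and B (d_out x rank) factors."""
    return rank * (d_in + d_out)


class LoRALinear:
    """Hypothetical sketch: frozen weight W plus a trainable low-rank update B @ A."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0, seed: int = 0):
        self.weight = weight  # frozen, shape (d_out, d_in); never updated
        self.rank = rank
        self.scale = alpha / rank
        d_out, d_in = len(weight), len(weight[0])
        rng = random.Random(seed)
        # A gets small random values; B starts at zero, so the adapted layer
        # initially produces exactly the same output as the base layer.
        self.lora_a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.lora_b = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)                        # W x
        delta = matvec(self.lora_b, matvec(self.lora_a, x))  # B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]


# A 1024x1024 projection has 1,048,576 frozen weights; a rank-16 adapter trains:
print(lora_param_count(1024, 1024, 16))  # 32768
```

Only the A and B factors receive gradients, which is why LoRA fine-tuning needs far less optimizer state than full fine-tuning of the same model.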
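Sliding Tile Attention groups the video's (T, H, W) latent tokens into 3D tiles and lets each query tile attend only to key tiles inside a local window, so the attention pattern is block-sparse at tile granularity. The plain-Python sketch below computes such a tile-level mask and the fraction of query/key tile pairs it keeps. The grid size, tile shape, window size, and the choice to truncate windows at the grid border are illustrative assumptions, not FastVideo's kernel parameters or border handling. Video Sparse Attention instead selects tiles dynamically from the input, which this sketch does not model.

```python
from itertools import product


def tile_grid(shape, tile):
    """Number of tiles along each axis of a (T, H, W) token grid (ceil division)."""
    return tuple(-(-s // t) for s, t in zip(shape, tile))


def sliding_tile_mask(grid, window):
    """Map each query tile to the set of key tiles it may attend to.

    A key tile is visible when, on every axis, its tile index is within
    window // 2 of the query tile's index. Windows are simply truncated at
    the grid border here (an assumption of this sketch).
    """
    tiles = list(product(*(range(g) for g in grid)))
    half = [w // 2 for w in window]
    return {
        q: {k for k in tiles if all(abs(qi - ki) <= h for qi, ki, h in zip(q, k, half))}
        for q in tiles
    }


def kept_fraction(mask):
    """Fraction of (query tile, key tile) pairs computed, versus dense attention."""
    n = len(mask)
    return sum(len(keys) for keys in mask.values()) / (n * n)


# Hypothetical sizes: an 8x16x16 latent grid, 4x4x4 tiles, a 3x3x3 tile window.
grid = tile_grid((8, 16, 16), (4, 4, 4))     # 32 tiles
mask = sliding_tile_mask(grid, (3, 3, 3))
print(grid, len(mask), kept_fraction(mask))  # (2, 4, 4) 32 0.390625
```

Even on this tiny grid only about 39% of tile pairs are computed, and the saving grows as the grid gets larger relative to the window. Because whole tile pairs are either kept or skipped, a GPU kernel can process each kept pair as a dense block instead of masking individual tokens.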
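Sequence parallelism splits the long sequence of video tokens across GPUs. One common scheme, DeepSpeed-Ulysses-style all-to-all, keeps a contiguous shard of tokens (with all attention heads) on each GPU for the non-attention layers, then redistributes the data before attention so that each GPU holds the full sequence for a subset of heads: attention needs every token, but heads are independent. The pure-Python simulation below only tracks which (token, head) pairs land on which rank. That FastVideo follows this exact scheme is an assumption, and the function names are hypothetical.

```python
def shard_sequence(num_tokens, num_heads, world_size):
    """Sequence-sharded layout: rank r holds a contiguous token slice with every head."""
    assert num_tokens % world_size == 0
    chunk = num_tokens // world_size
    return [
        [(t, h) for t in range(r * chunk, (r + 1) * chunk) for h in range(num_heads)]
        for r in range(world_size)
    ]


def all_to_all_seq_to_head(shards, num_heads):
    """Simulated all-to-all: afterwards each rank holds every token for a subset of heads."""
    world_size = len(shards)
    assert num_heads % world_size == 0
    heads_per_rank = num_heads // world_size
    out = [[] for _ in range(world_size)]
    for shard in shards:                 # every rank sends ...
        for token, head in shard:        # ... each (token, head) element ...
            out[head // heads_per_rank].append((token, head))  # ... to the rank owning that head
    return [sorted(s, key=lambda th: (th[1], th[0])) for s in out]


shards = shard_sequence(num_tokens=8, num_heads=4, world_size=2)
heads = all_to_all_seq_to_head(shards, num_heads=4)
for rank, elems in enumerate(heads):
    print(rank, sorted({h for _, h in elems}), len({t for t, _ in elems}))
# 0 [0, 1] 8
# 1 [2, 3] 8
```

After attention, the inverse all-to-all restores the sequence-sharded layout, so each GPU holds only 1/world_size of the activations outside the attention computation.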
Who it’s for
This framework is designed for AI researchers and developers building high-performance video generation applications who need to reduce inference latency or train/distill specialized video models.
Highlights
- Massive Speedups: Capable of generating 5 seconds of video in 1.8 seconds end-to-end using FastWan-QAD.
- Comprehensive Tooling: Includes a full data preprocessing pipeline for video, image, and text.
- Scalable Training: Supports FSDP2, sequence parallelism, and selective activation checkpointing.
- Real-time Interface: Includes Dreamverse, a web UI for real-time video generation and editing.
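The end-to-end figure above can be read as a real-time factor: seconds of video produced per second of wall-clock time, where a value above 1 means generation outpaces playback. A trivial helper (the name is hypothetical) makes the arithmetic explicit:

```python
def real_time_factor(video_seconds: float, wall_clock_seconds: float) -> float:
    """Seconds of generated video per second of wall-clock time."""
    return video_seconds / wall_clock_seconds


# 5 s of video generated in 1.8 s end to end (the FastWan-QAD figure above).
print(round(real_time_factor(5.0, 1.8), 2))  # 2.78
```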
Sources
- hao-ai-lab/FastVideo