HunyuanVideo: a large-scale open-source video foundation model with a hybrid dual-to-single stream Transformer architecture
What it solves
HunyuanVideo is a large-scale open-source video foundation model designed to bridge the gap between open-source and closed-source video generation. It addresses the challenge of creating high-quality videos with strong motion diversity, precise text-video alignment, and generation stability, aiming to match or exceed the performance of leading proprietary models.
How it works
The model operates in a spatio-temporally compressed latent space produced by a Causal 3D VAE. It uses a hybrid "dual-stream to single-stream" Transformer architecture: video and text tokens are first processed independently (dual-stream), then concatenated for multimodal fusion (single-stream). Text is encoded by a pre-trained decoder-only Multimodal Large Language Model (MLLM), combined with a bidirectional token refiner, to improve instruction following and detail description. In addition, a fine-tuned Hunyuan-Large model rewrites user prompts into a form the model handles better, improving visual quality and intent comprehension.
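The dual-stream to single-stream flow described above can be illustrated with a toy sketch in plain Python. This is not the HunyuanVideo implementation: the function names, the single linear layer per stream, the identity Q/K/V attention, and the tiny dimensions are all assumptions chosen for readability. It only shows the order of operations: per-modality processing first, then concatenation and joint attention.

```python
import math
import random


def linear(tokens, weight):
    """Apply a (d_in x d_out) weight matrix to every token vector."""
    return [[sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity Q/K/V (toy simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum((e / total) * v[i] for e, v in zip(exps, tokens))
                    for i in range(d)])
    return out


def random_matrix(rng, d_in, d_out):
    """Hypothetical helper: small random weights standing in for trained ones."""
    return [[rng.gauss(0, 1 / math.sqrt(d_in)) for _ in range(d_out)]
            for _ in range(d_in)]


def dual_to_single_stream(video_tokens, text_tokens, w_video, w_text, w_shared):
    # Dual-stream phase: each modality is transformed with its own weights.
    video = linear(video_tokens, w_video)
    text = linear(text_tokens, w_text)
    # Single-stream phase: concatenate, then attend over the joint sequence
    # with shared weights so video and text tokens can exchange information.
    fused = self_attention(linear(video + text, w_shared))
    n_video = len(video_tokens)
    return fused[:n_video], fused[n_video:]


rng = random.Random(0)
d = 8  # toy hidden size, far smaller than any real model
video_tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(6)]
text_tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]
w_video, w_text, w_shared = (random_matrix(rng, d, d) for _ in range(3))

video_out, text_out = dual_to_single_stream(
    video_tokens, text_tokens, w_video, w_text, w_shared)
print(len(video_out), len(text_out), len(video_out[0]))  # 6 3 8
```

In this sketch the video outputs depend on the text tokens only through the single-stream attention step, which is the fusion point the description refers to.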
Who it’s for
AI researchers, developers, and creators who need a high-performance, open-source text-to-video generation tool capable of producing professional-grade visual and motion quality.
Highlights
- Massive Scale: One of the largest open-source video generative models with over 13 billion parameters.
- Unified Architecture: Uses a hybrid Transformer design to handle both image and video generation.
- Advanced Text Encoding: Leverages an MLLM instead of standard CLIP/T5 encoders for superior reasoning and alignment.
- Efficient Compression: Employs a 3D VAE to reduce token counts, enabling training at original resolutions and frame rates.
- Flexible Inference: Supports single-GPU, multi-GPU parallel inference (via xDiT), and FP8 quantization to reduce memory overhead.
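The token reduction mentioned under Efficient Compression can be made concrete with some arithmetic. The sketch below is illustrative only: the compression factors (4x temporal, 8x spatial, 16 latent channels), the causal treatment of the first frame, and the 1x2x2 patch size are assumptions not stated in this summary, and the function names are hypothetical. Adjust the defaults to match the actual model configuration.

```python
def latent_shape(frames, height, width, t_stride=4, s_stride=8, channels=16):
    """Latent (T, C, H, W) for a causal 3D VAE.

    Assumes the first frame is encoded on its own and the remaining
    frames are compressed in groups of `t_stride` (an assumption here).
    """
    if (frames - 1) % t_stride or height % s_stride or width % s_stride:
        raise ValueError("input size must align with the compression strides")
    return ((frames - 1) // t_stride + 1, channels,
            height // s_stride, width // s_stride)


def token_count(latent, patch_t=1, patch_hw=2):
    """Number of Transformer tokens after patchifying the latent (assumed patch size)."""
    t, _, h, w = latent
    return (t // patch_t) * (h // patch_hw) * (w // patch_hw)


# Example: a 129-frame clip at 720x1280.
shape = latent_shape(129, 720, 1280)
tokens = token_count(shape)
pixels = 129 * 720 * 1280

print(shape)   # (33, 16, 90, 160)
print(tokens)  # 118800
print(f"pixel positions per token: {pixels / tokens:.0f}")
```

Under these assumed factors, the Transformer attends over roughly a thousandth as many positions as there are pixel locations in the clip, which is what makes full-resolution, full-frame-rate training tractable.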
Sources
- Tencent-Hunyuan/HunyuanVideo