HunyuanVideo: a large-scale open-source video foundation model with a hybrid dual-to-single stream Transformer architecture

What it solves

HunyuanVideo is a large-scale open-source video foundation model designed to bridge the gap between open-source and closed-source video generation. It addresses the challenge of creating high-quality videos with strong motion diversity, precise text-video alignment, and generation stability, aiming to match or exceed the performance of leading proprietary models.

How it works

The model operates in a spatio-temporally compressed latent space produced by a Causal 3D VAE. It employs a "Dual-stream to Single-stream" hybrid Transformer architecture: video and text tokens are first processed independently (dual-stream) and then concatenated for multimodal fusion (single-stream). For text encoding, it uses a pre-trained decoder-only Multimodal Large Language Model (MLLM) combined with a bidirectional token refiner to improve instruction following and detail description. Additionally, a fine-tuned Hunyuan-Large model rewrites user prompts into a form the model handles better, which improves visual quality and intent comprehension.
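
The dual-to-single stream flow described above can be sketched in a few lines of NumPy. This is a toy illustration, not the model's actual code: the hidden size, block counts, single-head attention, and all function names (attention_block, init_params, hybrid_forward) are assumptions chosen for readability, and real blocks also include MLPs, normalization, and timestep modulation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # hypothetical hidden size, far smaller than the real model


def attention_block(tokens, w_q, w_k, w_v):
    """Single-head self-attention with a residual connection (no mask)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return tokens + weights @ v


def make_weights():
    """Random query/key/value projections for one block."""
    return [rng.normal(scale=DIM ** -0.5, size=(DIM, DIM)) for _ in range(3)]


def init_params(n_dual=2, n_single=2):
    """Separate parameters per modality, then one shared stack."""
    return {
        "video": [make_weights() for _ in range(n_dual)],
        "text": [make_weights() for _ in range(n_dual)],
        "single": [make_weights() for _ in range(n_single)],
    }


def hybrid_forward(video_tokens, text_tokens, params):
    # Dual-stream phase: each modality runs through its own blocks.
    for w_video, w_text in zip(params["video"], params["text"]):
        video_tokens = attention_block(video_tokens, *w_video)
        text_tokens = attention_block(text_tokens, *w_text)
    # Single-stream phase: concatenate, then share blocks across modalities.
    fused = np.concatenate([video_tokens, text_tokens], axis=0)
    for w_shared in params["single"]:
        fused = attention_block(fused, *w_shared)
    # Keep only the video positions as the output of this sketch.
    return fused[: video_tokens.shape[0]]


params = init_params()
video = rng.normal(size=(12, DIM))  # 12 toy video latent tokens
text = rng.normal(size=(5, DIM))    # 5 toy text tokens
out = hybrid_forward(video, text, params)
print(out.shape)  # (12, 16)
```

In this sketch the text can only influence the video tokens once the two sequences are concatenated, which is the point of the single-stream phase.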

Who it’s for

AI researchers, developers, and creators who need a high-performance, open-source text-to-video generation tool capable of producing professional-grade visual and motion quality.

Highlights

  • Massive Scale: One of the largest open-source video generative models with over 13 billion parameters.
  • Unified Architecture: Uses a hybrid Transformer design to handle both image and video generation.
  • Advanced Text Encoding: Leverages an MLLM instead of standard CLIP/T5 encoders for superior reasoning and alignment.
  • Efficient Compression: Employs a Causal 3D VAE to compress video in both space and time, reducing the token count and enabling training at original resolutions and frame rates.
  • Flexible Inference: Supports single-GPU, multi-GPU parallel inference (via xDiT), and FP8 quantization to reduce memory overhead.
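
To make the compression and quantization highlights concrete, here is a back-of-envelope calculation. The compression ratios (4x temporal, 8x spatial), the 1x2x2 patch size, and the 129-frame 720x1280 clip are illustrative assumptions rather than figures stated above, and the helper names are hypothetical. The memory estimate covers weights only and uses the 13 billion parameter count as a lower bound.

```python
def latent_shape(frames, height, width, t_ratio=4, s_ratio=8):
    """Latent (frames, height, width) for a causal 3D VAE.

    Assumes the first frame is encoded on its own and the remaining
    frames are compressed in groups of `t_ratio`.
    """
    if (frames - 1) % t_ratio or height % s_ratio or width % s_ratio:
        raise ValueError("input size must match the compression ratios")
    return ((frames - 1) // t_ratio + 1, height // s_ratio, width // s_ratio)


def token_count(latent, patch=(1, 2, 2)):
    """Number of Transformer tokens after patchifying the latent grid."""
    t, h, w = latent
    pt, ph, pw = patch
    return (t // pt) * (h // ph) * (w // pw)


def weight_memory_gb(n_params, bytes_per_param):
    """Memory for the weights alone, ignoring activations and caches."""
    return n_params * bytes_per_param / 1e9


latent = latent_shape(frames=129, height=720, width=1280)
print(latent)                     # (33, 90, 160)
print(token_count(latent))        # 118800
print(129 * 720 * 1280)           # 118886400 raw pixel positions
print(weight_memory_gb(13e9, 2))  # 26.0 (16-bit weights)
print(weight_memory_gb(13e9, 1))  # 13.0 (FP8 weights)
```

Under these assumptions the Transformer sees roughly a thousand times fewer positions than the raw pixel grid, and storing weights in FP8 halves their footprint relative to a 16-bit format.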
