TurboDiffusion: a video generation acceleration framework that reduces diffusion latency by 100-200x

What it solves

TurboDiffusion addresses the high computational cost and slow generation speed of video diffusion models. It reduces end-to-end generation latency, with claimed speedups of 100x to 200x on a single RTX 5090 GPU while maintaining video quality.

How it works

The framework achieves acceleration through a combination of three primary techniques:

  • Attention Acceleration: Uses SageAttention and Sparse-Linear Attention (SLA) to speed up the attention mechanism.
  • Timestep Distillation: Uses rCM to distill the model so that fewer sampling steps are required.
  • Quantization: Provides checkpoints with quantized linear layers so the models run efficiently on consumer-grade GPUs such as the RTX 4090 and 5090.
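
To make the quantization idea concrete, here is a minimal sketch of quantizing the weights of one linear layer. The summary above does not say which scheme TurboDiffusion uses, so the choice of symmetric INT8 with one scale per output row is an assumption made purely for illustration, and the helper names (quantize_rows, dequantize_rows, linear) are hypothetical.

```python
# Illustrative sketch only. Assumption: symmetric INT8 quantization with one
# scale per output row. This is NOT taken from TurboDiffusion's code.

def quantize_rows(weights):
    """Map each row of a float matrix to integers in [-127, 127] plus a scale."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_rows(q_rows, scales):
    """Recover approximate float weights from integers and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

def linear(weights, x):
    """Bias-free linear layer: one dot product per output row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Toy weights and input (hypothetical values).
W = [[0.5, -1.2, 0.033], [2.5, 0.0, -0.64]]
x = [1.0, 2.0, 3.0]

Wq, scales = quantize_rows(W)
y_ref = linear(W, x)                          # full-precision output
y_q = linear(dequantize_rows(Wq, scales), x)  # output from quantized weights
max_err = max(abs(a - b) for a, b in zip(y_ref, y_q))
print("max abs output error:", max_err)
```

Each weight is stored as a small integer plus a shared per-row scale, which cuts memory and bandwidth at the cost of a bounded rounding error. A real implementation would also run the matrix multiply in low precision on the GPU rather than dequantizing first.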

Who it’s for

This project is for developers and researchers working with video generation models (specifically the Wan series) who need to generate high-quality videos (480p or 720p) in seconds rather than minutes or hours.

Highlights

  • Massive Speedup: Reduces generation time from 184 s to 1.9 s for certain models on an RTX 5090.
  • SLA Support: Includes SageSLA, a fast SLA forward pass based on SageAttention.
  • Flexible Modalities: Supports both Text-to-Video (T2V) and Image-to-Video (I2V) generation.
  • Hardware Optimized: Offers specific configurations and checkpoints for both high-end data center GPUs (H100) and consumer GPUs (RTX 5090/4090).
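
As a quick check on the latency figures in the first highlight, the implied speedup is the baseline time divided by the accelerated time. The function name below is hypothetical; the two numbers are the ones quoted above.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds

# Latencies quoted in the highlight above: 184 s before, 1.9 s after.
ratio = speedup(184.0, 1.9)
print(f"speedup ~ {ratio:.1f}x")  # prints "speedup ~ 96.8x"
```

For this particular pair of measurements the ratio comes out just under 100x, at the lower end of the 100-200x range stated for the framework.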
