LTX-2: a DiT-based audio-video foundation model with synchronized sound and production-ready controls

What it solves

LTX-2 is a foundation model that unifies several video generation capabilities in a single system. It targets high-fidelity, production-ready video with synchronized audio, precise camera control, and complex edits such as lip dubbing and regional regeneration.

How it works

LTX-2 is built on a Diffusion Transformer (DiT) architecture and generates video through a multi-stage pipeline. It supports several generation modes, including text-to-video and image-to-video, and LoRAs add specific controls such as camera movement, pose, or HDR output. Spatial and temporal upscalers increase resolution and frame rate, and the Gemma 3 text encoder processes prompts.
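As a rough illustration of the multi-stage idea, the sketch below tracks only the shape of a clip as it passes through a spatial and a temporal upscaler. It is not the LTX-2 API: the `Clip` class, the stage functions, the 2x factors, and the base resolution and frame rate are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Clip:
    """Shape metadata for a clip (hypothetical; no pixel data in this sketch)."""
    width: int
    height: int
    fps: int
    seconds: float

    @property
    def frames(self) -> int:
        # Total frame count implied by frame rate and duration.
        return round(self.fps * self.seconds)


# A stage takes a clip description and returns a new one.
Stage = Callable[[Clip], Clip]


def spatial_upscaler(factor: int) -> Stage:
    """Return a stage that multiplies width and height by `factor`."""
    def apply(clip: Clip) -> Clip:
        return replace(clip, width=clip.width * factor, height=clip.height * factor)
    return apply


def temporal_upscaler(factor: int) -> Stage:
    """Return a stage that multiplies the frame rate by `factor`."""
    def apply(clip: Clip) -> Clip:
        return replace(clip, fps=clip.fps * factor)
    return apply


def run_pipeline(clip: Clip, stages: List[Stage]) -> Clip:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        clip = stage(clip)
    return clip


# Hypothetical base output of the first (generation) stage.
base = Clip(width=768, height=512, fps=24, seconds=5.0)
final = run_pipeline(base, [spatial_upscaler(2), temporal_upscaler(2)])

print(base.frames, "->", final.frames, "frames")
print(f"{base.width}x{base.height} -> {final.width}x{final.height}")
```

Modeling each stage as a function from clip to clip keeps the ordering explicit, so an upscaler can be added, removed, or reordered without touching the others.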

Who it’s for

This project is for AI researchers, video producers, and developers who need professional-grade video generation tools with fine-grained control over motion, audio, and visual quality.

Highlights

  • Unified Capabilities: Combines synchronized audio-video generation, text-to-video, and image-to-video in one model.
  • Diverse Pipelines: Offers specialized pipelines for keyframe interpolation, audio-to-video (A2Vid), lip dubbing, and HDR output.
  • Control LoRAs: Provides LoRAs for camera control (dolly, jib, static) and motion tracking.
  • Lighter Inference: Includes a distilled version of the model for faster generation with significantly fewer steps.
  • Optimization: Supports FP8 quantization and FlashAttention 4 for high-performance inference on modern GPUs.
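To give a sense of why FP8 quantization matters for inference, the sketch below estimates weight memory at different precisions. FP8 stores each weight in one byte, against two for BF16 and four for FP32. The parameter count is a hypothetical placeholder, not LTX-2's actual size, and the estimate ignores activations, scale factors, and any layers kept at higher precision.

```python
# Bytes needed to store a single weight at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Hypothetical parameter count, used only for illustration.
HYPOTHETICAL_PARAMS = 10_000_000_000

for dtype in ("fp32", "bf16", "fp8"):
    gib = weight_memory_gib(HYPOTHETICAL_PARAMS, dtype)
    print(f"{dtype}: {gib:.1f} GiB")
```

Under these assumptions, moving from BF16 to FP8 halves the weight footprint; the real saving depends on which parts of the model are actually quantized.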
