vllm-omni: a high-throughput serving framework for any-to-any multimodal and diffusion models
What it solves
vLLM-Omni extends the vLLM framework to serve omni-modality models, going beyond text-only autoregressive generation. It efficiently serves models that process and generate multiple types of data (text, image, video, and audio), and it supports non-autoregressive architectures such as Diffusion Transformers (DiT).
How it works
vLLM-Omni uses a fully disaggregated architecture based on an "OmniConnector" and dynamic resource allocation across stages. It leverages vLLM's efficient KV cache management for autoregressive tasks and implements pipelined stage execution to overlap processing and increase throughput. It provides a heterogeneous pipeline abstraction to manage complex multimodal workflows and supports various parallelism strategies (tensor, pipeline, data, and expert).
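The idea of pipelined stage execution can be illustrated with a minimal, generic sketch. This is not vLLM-Omni's implementation or API: the function names (`run_stage`, `run_pipeline`, `understand`, `generate`) are hypothetical, and plain Python threads and queues stand in for whatever the framework uses to connect stages. The point is only that each stage runs concurrently, so a later stage can work on one request while an earlier stage has already moved on to the next.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the request stream


def run_stage(fn, inbox, outbox):
    """Pull items from inbox, apply fn, and push results to outbox until the sentinel arrives."""
    while True:
        item = inbox.get()
        if item is _SENTINEL:
            outbox.put(_SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(stages, requests):
    """Send requests through the stages in order; each stage runs in its own thread."""
    # One queue in front of every stage, plus one for the final results.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for r in requests:
        queues[0].put(r)
    queues[0].put(_SENTINEL)

    results = []
    while True:
        out = queues[-1].get()
        if out is _SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for an autoregressive stage and a generation stage.
def understand(prompt):
    return {"prompt": prompt, "tokens": prompt.split()}


def generate(state):
    return f"{len(state['tokens'])} tokens -> output for '{state['prompt']}'"


outputs = run_pipeline([understand, generate], ["a red fox", "rain at night"])
for line in outputs:
    print(line)
```

Because each stage here is a single FIFO worker, results come out in request order. A real serving system would add batching, backpressure, and per-stage device placement, none of which this sketch attempts.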
Who it’s for
Developers and researchers who need to deploy and serve large-scale omni-modal models, TTS models, or diffusion-based image and video generation models with high performance and OpenAI-compatible APIs.
Highlights
- Omni-modality support: Handles text, image, video, and audio processing and generation.
- Broad architecture support: Supports both autoregressive and non-autoregressive (DiT) models.
- High performance: Features pipelined execution and efficient KV cache management.
- Hardware flexibility: Compatible with CUDA, ROCm, MUSA, NPU, and XPU backends.
- Wide model compatibility: Supports popular models such as Qwen3-Omni, Cosmos, FLUX, and various TTS models.
Sources
- vllm-project/vllm-omni