vllm-ascend: a hardware plugin for running vLLM seamlessly on Ascend NPUs

What it solves

vllm-ascend lets the vLLM inference engine run on Ascend NPUs (Neural Processing Units). Because it plugs in through a hardware-pluggable interface, Ascend-specific code does not have to be tightly coupled to the core vLLM codebase, and users can deploy a wide variety of open-source models on Ascend hardware.

How it works

The project is a community-maintained hardware plugin that implements the decoupled interface described in the hardware-pluggable RFC. vLLM talks to the Ascend NPU backend through that interface, so hardware-specific details do not require changes to the core engine's logic.
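
The decoupling described above can be pictured with a small, self-contained sketch. It is not vllm-ascend's or vLLM's actual code: the names Platform, AscendPlatform, PLUGINS, register_plugin and resolve_platform are all hypothetical, and the registry is hard-coded here, whereas a real plugin system would typically discover installed backends at runtime. The point is only that the core depends on an interface, and a backend supplies an implementation from outside the core.

```python
from typing import Callable, Dict, Optional


class Platform:
    """Hypothetical minimal interface the core engine programs against."""

    device_name = "cpu"

    def allocate(self, num_bytes: int) -> str:
        # A real backend would reserve device memory; this only reports it.
        return f"{self.device_name}: allocated {num_bytes} bytes"


class AscendPlatform(Platform):
    """Stand-in for an out-of-tree backend (hypothetical, no NPU calls)."""

    device_name = "npu"


# Plugin name -> factory. Hard-coded for illustration only (assumption).
PLUGINS: Dict[str, Callable[[], Optional[Platform]]] = {}


def register_plugin(name: str, factory: Callable[[], Optional[Platform]]) -> None:
    """Let a backend register itself without the core importing it directly."""
    PLUGINS[name] = factory


def resolve_platform() -> Platform:
    """Return the first platform a plugin offers, else the built-in default."""
    for factory in PLUGINS.values():
        platform = factory()
        if platform is not None:
            return platform
    return Platform()


# The backend package registers itself; the core never names it.
register_plugin("ascend", AscendPlatform)
print(resolve_platform().allocate(1024))
```

With this shape, adding or removing a backend changes only what is registered; resolve_platform and everything that calls it stay the same.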

Who it’s for

Developers and AI engineers who run Ascend hardware (such as the Atlas 800I A2/A3 or Atlas A2/A3 Training series) and want to use vLLM's high-performance inference for their models.

Highlights

  • Broad Model Support: Supports Transformer-like models, Mixture-of-Experts (MoE), Embedding models, and Multi-modal LLMs.
  • Hardware Compatibility: Compatible with Atlas 800I A2/A3, Atlas A2/A3 Training series, and Atlas 300I Duo (experimental).
  • Decoupled Architecture: Uses a plugin-based approach to keep the Ascend integration separate from the main vLLM core.
  • Enterprise Ready: Built on CANN and PyTorch-NPU, the Ascend software stack, for production deployments on Ascend NPUs.
