vllm-ascend: a hardware plugin for running vLLM seamlessly on Ascend NPUs
What it solves
vllm-ascend lets the vLLM inference engine run on Ascend NPUs (Neural Processing Units). Because it plugs in through a hardware-pluggable interface, Ascend-specific code does not have to be tightly coupled into the core vLLM codebase, and users can deploy a wide variety of open-source models on Ascend hardware.
How it works
The project is a community-maintained hardware plugin. It implements the decoupled interface described in the hardware-pluggable RFC, so vLLM can drive the Ascend NPU backend without the core engine carrying logic for each hardware-specific detail.
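The decoupling idea can be illustrated with a minimal sketch. Everything below is hypothetical and written only for illustration: the names `Platform`, `register_platform`, `resolve_platform`, and `FakeNpuPlatform` are assumptions, not the actual vLLM or vllm-ascend API. The point is only that the core looks a backend up by name, while the backend registers itself from a separate package.

```python
from typing import Callable, Dict, Protocol


class Platform(Protocol):
    """Hypothetical interface the core engine expects from any backend."""

    device_name: str

    def allocate(self, num_bytes: int) -> str: ...


# Hypothetical registry owned by the core engine: platform name -> factory.
_PLATFORMS: Dict[str, Callable[[], Platform]] = {}


def register_platform(name: str, factory: Callable[[], Platform]) -> None:
    """Called by a plugin package; the core never imports the plugin directly."""
    _PLATFORMS[name] = factory


def resolve_platform(name: str) -> Platform:
    """Core-side lookup that stays identical for every backend."""
    try:
        return _PLATFORMS[name]()
    except KeyError:
        raise RuntimeError(f"no plugin registered for platform {name!r}") from None


# --- plugin side (in practice this would live in a separate package) ---
class FakeNpuPlatform:
    """Stand-in backend; a real plugin would talk to the device runtime."""

    device_name = "npu"

    def allocate(self, num_bytes: int) -> str:
        return f"{self.device_name}: reserved {num_bytes} bytes"


register_platform("npu", FakeNpuPlatform)

# --- core side: no hardware-specific branches are needed here ---
platform = resolve_platform("npu")
message = platform.allocate(1024)
print(message)
```

With this shape, adding another backend means shipping another package that calls `register_platform`; the core-side lookup does not change.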
Who it’s for
Developers and AI engineers who run Ascend hardware (such as the Atlas 800I A2/A3 or the Atlas A2/A3 Training series) and want to use vLLM's high-performance inference for their models.
Highlights
- Broad Model Support: Supports Transformer-like models, Mixture-of-Experts (MoE), Embedding models, and Multi-modal LLMs.
- Hardware Compatibility: Compatible with Atlas 800I A2/A3, Atlas A2/A3 Training series, and Atlas 300I Duo (experimental).
- Decoupled Architecture: Uses a plugin-based approach to keep the Ascend integration separate from the main vLLM core.
- Enterprise Ready: Integrated with CANN and PyTorch-NPU for production-grade performance on Ascend NPUs.
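Since the plugin sits on top of several Python packages, a quick environment check can help before a first run. The sketch below assumes the import names `vllm`, `vllm_ascend`, and `torch_npu`; treat these as assumptions and confirm them against the versions you install. It only checks Python packages, not the CANN toolkit itself, which is installed outside Python.

```python
import importlib.util

# Assumed import names; verify them against the packages you actually install.
REQUIRED_MODULES = ("vllm", "vllm_ascend", "torch_npu")


def find_missing(modules=REQUIRED_MODULES):
    """Return the modules that cannot be located, without importing them."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing()
if missing:
    print("missing:", ", ".join(missing))
else:
    print("all required modules are importable")
```

Using `find_spec` rather than a plain `import` keeps the check cheap, because it locates each package without loading any device runtime.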
Sources
- vllm-project/vllm-ascend