FastDeploy: a production-ready LLM and VLM deployment toolkit with PD separation and broad hardware acceleration

FastDeploy: a production-ready LLM and VLM deployment toolkit with PD separation and broad hardware acceleration

What it solves

FastDeploy addresses the complexity of deploying Large Language Models (LLMs) and Vision Language Models (VLMs) in production environments. It provides a production-ready toolkit that optimizes resource utilization, increases throughput, and ensures service level objectives (SLO) are met across a wide variety of hardware platforms.

How it works

Built on PaddlePaddle, FastDeploy implements several high-performance inference techniques:

  • PD Separation: A load-balanced Prefill-Decode separation strategy that allows for dynamic role switching and context caching to optimize throughput.
  • KV Cache Management: Uses a lightweight high-performance transmission library with intelligent selection of NVLink or RDMA for efficient cache transfer.
  • Acceleration Techniques: Employs speculative decoding, Multi-Token Prediction (MTP), and chunked prefilling to speed up generation.
  • Quantization: Supports multiple formats including W8A16, W8A8, W4A16, W4A8, W2A16, and FP8 to reduce memory footprint and increase speed.
  • API Compatibility: Offers an OpenAI-compatible API and is compatible with vLLM interfaces for easier integration.

Who it’s for

It is designed for developers and engineers who need to deploy LLMs and VLMs (such as ERNIE, Qwen, and DeepSeek) into production on diverse hardware, including NVIDIA GPUs and various specialized accelerators like Kunlunxin XPU, Hygon DCU, and Intel Gaudi.

Highlights

  • Broad Hardware Support: Compatible with NVIDIA, Kunlunxin, Hygon, Iluvatar, Enflame, Metax, and Intel Gaudi.
  • Production-Grade Features: Includes load-balanced PD separation and global cache pooling.
  • vLLM Compatibility: Allows for single-command deployment with vLLM-compatible interfaces.
  • Extensive Model Support: Supports a wide range of models including Qwen3-VL, DeepSeek V3, and ERNIE series.

Sources