lmdeploy: a high-throughput toolkit for compressing and serving LLMs and VLMs with dual inference engines
What it solves
LMDeploy is a toolkit for compressing, deploying, and serving Large Language Models (LLMs) and Vision Language Models (VLMs). It targets the high compute cost and latency of LLM inference by combining high-throughput serving with efficient quantization.
How it works
The project provides two inference engines: TurboMind, which is optimized for maximum performance, and a PyTorch engine developed in Python, which lowers the barrier for developers and enables rapid experimentation. To raise request throughput, it uses persistent batching (also known as continuous batching), a blocked KV cache, tensor parallelism, and high-performance CUDA kernels.
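To make two of these techniques concrete, here is a minimal, self-contained sketch of a blocked KV cache combined with continuous batching. It is a toy model in plain Python, not LMDeploy's implementation: the names `BlockedKVCache` and `run_continuous_batching`, the block size, and the step-based scheduler are all assumptions chosen for illustration.

```python
from collections import deque


class BlockedKVCache:
    """Toy allocator (hypothetical): each sequence owns fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # ids of unused blocks
        self.block_table: dict[str, list[int]] = {}  # seq id -> block ids
        self.seq_len: dict[str, int] = {}            # seq id -> tokens stored

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; return False if no block is free."""
        length = self.seq_len.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                return False
            self.block_table.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_len[seq_id] = length + 1
        return True

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)


def run_continuous_batching(requests: dict[str, int], cache: BlockedKVCache,
                            max_batch: int) -> list[tuple[str, int]]:
    """Simulate decoding; `requests` maps seq id -> tokens to generate.

    Returns (seq id, step at which it finished) in completion order.
    """
    waiting = deque(requests.items())
    running: dict[str, int] = {}  # seq id -> tokens still to generate
    finished: list[tuple[str, int]] = []
    step = 0
    while waiting or running:
        # Admit new requests as soon as a batch slot is free,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch:
            seq_id, remaining = waiting.popleft()
            running[seq_id] = remaining
        # One decode step: every running sequence produces one token.
        for seq_id in list(running):
            if not cache.append_token(seq_id):
                raise RuntimeError("KV cache exhausted")
            running[seq_id] -= 1
            if running[seq_id] == 0:
                cache.release(seq_id)  # blocks become reusable immediately
                del running[seq_id]
                finished.append((seq_id, step))
        step += 1
    return finished
```

With `max_batch=2` and requests `{"a": 1, "b": 3, "c": 2}`, sequence `c` joins the batch at step 1, right after `a` finishes, rather than waiting for `b` to complete. Because memory is handed out block by block, a short sequence never reserves space for the maximum possible length.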
Who it’s for
It is intended for developers and AI engineers who need to deploy LLMs and VLMs in production environments, as well as researchers who want to experiment with new model architectures and features.
Highlights
- High Throughput: Reports up to 1.8x higher request throughput than vLLM.
- Extensive Model Support: Supports a vast array of LLMs (e.g., Llama, Qwen, DeepSeek, Mistral, Phi) and VLMs (e.g., InternVL, LLaVA, Qwen-VL).
- Effective Quantization: Supports weight-only and KV cache quantization (including AWQ), with 4-bit inference running up to 2.4x faster than FP16.
- Distribution Server: Facilitates easy deployment of multi-model services across multiple machines and cards.
- Hardware Compatibility: Supports NVIDIA GPUs (including RTX 50 series) and Huawei Ascend platforms.
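As a rough illustration of the weight-only 4-bit quantization mentioned in the highlights, the sketch below quantizes one group of weights to 16 levels and reconstructs them. It uses plain round-to-nearest with a per-group scale and offset, which is a simplification and not AWQ or LMDeploy's actual kernels; the function names and the group contents are assumptions.

```python
def quantize_group(weights: list[float]) -> tuple[list[int], float, float]:
    """Map a group of weights to 4-bit codes (0..15) plus a scale and offset."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # 15 intervals for 16 levels; avoid zero scale
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: list[int], scale: float, offset: float) -> list[float]:
    """Reconstruct approximate weights from the 4-bit codes."""
    return [c * scale + offset for c in codes]


group = [-0.30, -0.10, 0.00, 0.45]          # illustrative values only
codes, scale, offset = quantize_group(group)
restored = dequantize_group(codes, scale, offset)
```

Each weight is stored in 4 bits instead of the 16 used by FP16, at the cost of a small per-group scale and offset and a reconstruction error of at most half a quantization step.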
Sources
- InternLM/lmdeploy