lorax: a multi-LoRA inference server that scales to thousands of fine-tuned LLMs on a single GPU
lorax: a multi-LoRA inference server that scales to thousands of fine-tuned LLMs on a single GPU
What it solves
LoRAX (LoRA eXchange) reduces the cost and complexity of serving many fine-tuned Large Language Models (LLMs) by allowing thousands of them to be hosted on a single GPU. It eliminates the need to dedicate a separate GPU for every fine-tuned version of a base model, which would otherwise be prohibitively expensive.
How it works
LoRAX uses a shared base model and dynamically loads task-specific LoRA adapters per request. It employs several optimization techniques to maintain high performance:
- Dynamic Adapter Loading: Adapters from HuggingFace, Predibase, or local filesystems are loaded just-in-time without blocking other requests.
- Heterogeneous Continuous Batching: Requests for different adapters are packed into the same batch to keep throughput and latency stable.
- Adapter Exchange Scheduling: An asynchronous system prefetches and offloads adapters between GPU and CPU memory to optimize throughput.
- Optimized Inference: It utilizes tensor parallelism, quantization, and specialized CUDA kernels like Flash-Attention, Paged Attention, and SGMV for high efficiency.
Who it’s for
Developers and ML engineers who need to deploy and serve multiple fine-tuned LLMs at scale without sacrificing performance or reducing the majority of the GPU memory footprint.
Highlights
- Massive Scalability: Serve thousands of fine-tuned models on a single GPU.
- OpenAI Compatible: Supports multi-turn chat via an OpenAI-compatible API.
- Production Ready: Includes Docker images, Helm charts for Kubernetes, and Open Telemetry for distributed tracing.
- Flexible Model Support: Compatible with Llama, Mistral, and Qwen architectures, supporting adapters trained via PEFT and Ludwig.
Sources
- undefinedpredibase/lorax