lorax: a multi-LoRA inference server that scales to thousands of fine-tuned LLMs on a single GPU

lorax: a multi-LoRA inference server that scales to thousands of fine-tuned LLMs on a single GPU

What it solves

LoRAX (LoRA eXchange) reduces the cost and complexity of serving many fine-tuned Large Language Models (LLMs) by allowing thousands of them to be hosted on a single GPU. It eliminates the need to dedicate a separate GPU for every fine-tuned version of a base model, which would otherwise be prohibitively expensive.

How it works

LoRAX uses a shared base model and dynamically loads task-specific LoRA adapters per request. It employs several optimization techniques to maintain high performance:

  • Dynamic Adapter Loading: Adapters from HuggingFace, Predibase, or local filesystems are loaded just-in-time without blocking other requests.
  • Heterogeneous Continuous Batching: Requests for different adapters are packed into the same batch to keep throughput and latency stable.
  • Adapter Exchange Scheduling: An asynchronous system prefetches and offloads adapters between GPU and CPU memory to optimize throughput.
  • Optimized Inference: It utilizes tensor parallelism, quantization, and specialized CUDA kernels like Flash-Attention, Paged Attention, and SGMV for high efficiency.

Who it’s for

Developers and ML engineers who need to deploy and serve multiple fine-tuned LLMs at scale without sacrificing performance or reducing the majority of the GPU memory footprint.

Highlights

  • Massive Scalability: Serve thousands of fine-tuned models on a single GPU.
  • OpenAI Compatible: Supports multi-turn chat via an OpenAI-compatible API.
  • Production Ready: Includes Docker images, Helm charts for Kubernetes, and Open Telemetry for distributed tracing.
  • Flexible Model Support: Compatible with Llama, Mistral, and Qwen architectures, supporting adapters trained via PEFT and Ludwig.

Sources