llm-d: what it is, what problem it solves & why it's gaining traction

What it solves

llm-d addresses the complexity and performance tuning required to deploy large language models (LLMs) at scale in production. It eliminates the "heavy lifting" of optimizing inference across different hardware accelerators and infrastructure providers, helping users achieve state-of-the-art (SOTA) performance and reliability on Kubernetes.

How it works

llm-d acts as an orchestration and optimization layer that sits above model servers like vLLM and SGLang. It integrates with Kubernetes to provide a distributed inference serving stack with several key technical optimizations:

  • Intelligent Routing: Uses prefix-cache-aware and load-aware balancing, along with scheduling based on predicted latency, to reduce latency and increase throughput.
  • KV-Cache Management: Offloads the KV cache in tiers to CPU memory or disk and indexes it globally, which increases the effective working set size for multi-turn conversations.
  • Disaggregated Serving: Serves very large models more efficiently by separating the prefill and decode phases and using wide expert parallelism over fast interconnects.
  • Operational Tools: Provides SLO-aware autoscaling, intelligent flow control for multi-tenant environments, and OpenAI-compatible Batch APIs for offline processing.
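
To make the routing idea in the list above concrete, here is a minimal Python sketch of prefix-cache-aware and load-aware replica selection. It is an illustration under stated assumptions, not llm-d's actual scheduler: the Replica class, the block_hashes, prefix_hit_ratio and pick_replica helpers, the block size, and the scoring weights are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical view of one model-server replica as seen by a router."""
    name: str
    queue_depth: int  # requests currently waiting on this replica
    cached_prefixes: set = field(default_factory=set)  # hashes of cached prefix blocks


def block_hashes(tokens, block_size=4):
    """Hash every block-aligned prefix so identical prefixes give identical hashes."""
    return [
        hash(tuple(tokens[:end]))
        for end in range(block_size, len(tokens) + 1, block_size)
    ]


def prefix_hit_ratio(replica, hashes):
    """Fraction of leading prefix blocks that the replica already holds."""
    if not hashes:
        return 0.0
    hits = 0
    for h in hashes:
        if h not in replica.cached_prefixes:
            break  # a prefix cache is only useful up to the first miss
        hits += 1
    return hits / len(hashes)


def pick_replica(replicas, tokens, cache_weight=2.0, load_weight=1.0, max_queue=16):
    """Pick the replica with the best cache-reuse versus queueing trade-off.

    The weights and the linear score are assumptions made for illustration.
    """
    hashes = block_hashes(tokens)

    def score(replica):
        load = min(replica.queue_depth, max_queue) / max_queue
        return cache_weight * prefix_hit_ratio(replica, hashes) - load_weight * load

    return max(replicas, key=score)


prompt = list(range(12))  # stand-in token IDs
hashes = block_hashes(prompt)
warm = Replica("warm", queue_depth=4, cached_prefixes=set(hashes[:2]))
cold = Replica("cold", queue_depth=0)
chosen = pick_replica([warm, cold], prompt)  # the warm replica wins despite its queue
```

In this sketch a replica that already holds most of the prompt's prefix can win even with a moderate queue, while a saturated replica with little cached prefix loses to an idle one. A production scheduler would also draw on signals such as predicted latency, which this example omits.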

Who it’s for

It is designed for engineers and organizations serving high-volume, real-world LLM traffic on Kubernetes across various hardware accelerators (such as NVIDIA, AMD, Intel, and Google TPUs).

Highlights

  • CNCF Sandbox Project: Founded by a consortium including Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA.
  • Significant Performance Gains: Demonstrated up to 3x higher output throughput and 70% higher tokens/sec through prefix-cache routing and prefill/decode disaggregation.
  • Hardware Agnostic: Optimized for a wide range of accelerators and infrastructure providers.
  • Production-Ready Recipes: Includes benchmarked guides and Helm charts to simplify the deployment of optimized baselines.
