Mooncake: what it is, what problem it solves & why it's gaining traction
What it solves
Traditional LLM serving runs prefill and decoding on the same nodes, so the two phases compete for the same resources. Mooncake separates them into a prefill cluster and a decoding cluster (PD disaggregation). It also puts the otherwise underutilized CPU, DRAM, and SSD resources of the GPU cluster to work as a disaggregated KVCache pool. The result is higher throughput and better memory efficiency, especially in long-context scenarios where colocated serving struggles to meet latency Service Level Objectives (SLOs).
How it works
Mooncake implements a KVCache-centric architecture centered around three main components:
- Transfer Engine (TE): A high-performance data transfer framework that provides a unified interface for moving tensors across different storage and network environments. It uses RDMA, topology-aware routing, and multi-NIC bandwidth aggregation to achieve significantly faster data movement than standard TCP.
- Mooncake Store: A distributed key-value cache storage engine built on the Transfer Engine. It manages reusable KV caches and model weights across clusters using a multi-tier hierarchy (DRAM and SSD/NVMe). Its storage scales elastically and is independent of inference-engine restarts, so cached data outlives the engine process that produced it.
- Mooncake EP and Process Group (PG): These extend the system to support large-scale Mixture-of-Experts (MoE) inference by providing fault-tolerant expert-parallel dispatch and a PyTorch-compatible distributed process-group backend that can recover from rank failures without restarting the entire service.
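To make the KVCache-centric flow more concrete, here is a minimal Python sketch of a two-tier cache pool shared by a prefill step and a decode step. It is a toy model under stated assumptions, not Mooncake's API: the names `TieredKVPool`, `prefix_key`, `prefill`, and `decode` are hypothetical, the "DRAM" and "SSD" tiers are plain in-process dictionaries, and the LRU demotion policy is chosen only for illustration.

```python
from collections import OrderedDict
import hashlib


def prefix_key(token_ids):
    """Hash a token prefix so identical prefixes map to the same cache key."""
    raw = ",".join(map(str, token_ids)).encode()
    return hashlib.sha256(raw).hexdigest()


class TieredKVPool:
    """Hypothetical two-tier pool: a small 'DRAM' LRU in front of an 'SSD' dict."""

    def __init__(self, dram_capacity):
        self.dram_capacity = dram_capacity
        self.dram = OrderedDict()  # most recently used entries sit at the end
        self.ssd = {}

    def put(self, key, kv_block):
        self.dram[key] = kv_block
        self.dram.move_to_end(key)
        # Demote least recently used entries once DRAM is over capacity.
        while len(self.dram) > self.dram_capacity:
            old_key, old_block = self.dram.popitem(last=False)
            self.ssd[old_key] = old_block

    def get(self, key):
        if key in self.dram:
            self.dram.move_to_end(key)
            return self.dram[key], "dram"
        if key in self.ssd:
            # Promote on hit so hot prefixes migrate back to DRAM.
            block = self.ssd.pop(key)
            self.put(key, block)
            return block, "ssd"
        return None, "miss"


def prefill(pool, token_ids):
    """Stand-in for a prefill node: 'compute' KV for the prompt and publish it."""
    key = prefix_key(token_ids)
    cached, tier = pool.get(key)
    if cached is None:
        cached = [f"kv({t})" for t in token_ids]  # placeholder for real tensors
        pool.put(key, cached)
    return key, tier


def decode(pool, key):
    """Stand-in for a decode node: fetch the KV that prefill published."""
    block, tier = pool.get(key)
    if block is None:
        raise KeyError("KV cache not found; prefill must run first")
    return len(block), tier


pool = TieredKVPool(dram_capacity=2)
key_a, first_tier = prefill(pool, [1, 2, 3])
prefill(pool, [4, 5])
prefill(pool, [6, 7, 8, 9])  # pushes the first prompt's KV down to "SSD"
tokens_a, fetch_tier = decode(pool, key_a)
print(first_tier, tokens_a, fetch_tier)
```

In this sketch the decode step never recomputes the prompt: it looks up the block that prefill published, and a block demoted to the slower tier is promoted again when it is reused. A real deployment would move tensors between machines over the network rather than through a shared Python object.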
Who it’s for
Mooncake is designed for developers and operators of large-scale LLM services (such as Kimi) and inference engine maintainers (like those for vLLM, SGLang, and TensorRT-LLM) who need to maximize throughput and resource utilization for high-traffic, long-context AI applications.
Highlights
- High-Performance Transfer: Achieves up to 190 GB/s bandwidth in 8×400 Gbps RoCE networks, significantly faster than TCP.
- PD Disaggregation: Decouples prefill and decoding to optimize resource usage and throughput.
- Multi-Tier Storage: Leverages DRAM and SSD/NVMe for a massive, elastic KVCache pool.
- Fault-Tolerant MoE: Supports expert-parallel inference that can route around failed ranks to maintain service availability.
- Broad Ecosystem Integration: Integrated into vLLM, SGLang, TensorRT-LLM, and the PyTorch ecosystem.
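As a rough sense of scale for the transfer figure above, the following Python sketch works through the arithmetic. The 8×400 Gbps and 190 GB/s numbers come from the highlights; everything else is an assumption made for illustration: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values), the 128K-token context, and the reading of GB as 10^9 bytes. The helper names `kv_bytes_per_token` and `transfer_seconds` are hypothetical.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    # The factor of 2 covers the separate key and value tensors in each layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def transfer_seconds(total_bytes, bandwidth_gb_per_s):
    # Assumes decimal gigabytes (1 GB = 1e9 bytes).
    return total_bytes / (bandwidth_gb_per_s * 1e9)


# Aggregate line rate of 8 NICs at 400 Gbps each, converted from bits to bytes.
line_rate_gb_per_s = 8 * 400 / 8
measured_gb_per_s = 190  # figure quoted in the highlights
utilization = measured_gb_per_s / line_rate_gb_per_s

# Hypothetical model shape, chosen only to make the numbers concrete.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)
context_tokens = 128 * 1024
total = per_token * context_tokens
seconds = transfer_seconds(total, measured_gb_per_s)

print(f"line rate: {line_rate_gb_per_s:.0f} GB/s, utilization: {utilization:.1%}")
print(f"KV cache: {total / 2**30:.0f} GiB, transfer time: {seconds * 1000:.1f} ms")
```

Under these assumptions the aggregate line rate is 400 GB/s, so 190 GB/s is a little under half of it, and a 16 GiB KV cache for one long-context request would move in roughly 90 ms at the quoted peak. Real transfers depend on the model, batch layout, and network conditions.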
Sources
- kvcache-ai/Mooncake