ktransformers: a CPU-GPU hybrid inference and fine-tuning framework optimized for ultra-large MoE models
What it solves
KTransformers addresses the hardware limitations of running ultra-large Mixture-of-Experts (MoE) models. It lets users run high-performance inference and fine-tuning on consumer-grade hardware by splitting work across the CPU and GPU, which reduces reliance on large amounts of expensive GPU VRAM.
How it works
The framework employs a hybrid computing approach where workloads are split between the GPU and CPU. Key technical implementations include:
- Heterogeneous Expert Placement: "Hot" experts are kept on the GPU for speed, while "cold" experts are offloaded to the CPU.
- CPU-Optimized Kernels: It uses Intel AMX and AVX512/AVX2 optimized kernels to accelerate INT4/INT8 quantized inference on the CPU.
- Memory Management: It implements NUMA-aware memory management for MoE inference and a 3-layer (GPU-CPU-Disk) prefix cache reuse system.
- SFT Integration: It integrates with LLaMA-Factory to fine-tune large MoE models significantly faster than traditional ZeRO-Offload methods.
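The hot/cold expert split described above can be sketched in a few lines. This is a minimal illustration, not KTransformers' actual placement logic: the function name `place_experts`, the `gpu_slots` budget, and the idea of ranking experts by router activation counts from a trace are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def place_experts(routed_expert_ids: Iterable[int],
                  num_experts: int,
                  gpu_slots: int) -> Dict[int, str]:
    """Assign each expert to "gpu" or "cpu" based on how often it was routed to.

    Hypothetical helper: the most frequently activated ("hot") experts fill
    the available GPU slots; every other ("cold") expert stays on the CPU.
    """
    counts = Counter(routed_expert_ids)
    # Rank experts by activation count, breaking ties by lower expert id.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    hot = set(ranked[:max(gpu_slots, 0)])
    return {e: ("gpu" if e in hot else "cpu") for e in range(num_experts)}


if __name__ == "__main__":
    # Toy routing trace: expert ids chosen by the router over several tokens.
    trace = [0, 3, 3, 1, 3, 0, 2, 3, 0]
    print(place_experts(trace, num_experts=4, gpu_slots=2))
```

With the toy trace, experts 3 and 0 are activated most often, so they take the two GPU slots and experts 1 and 2 stay on the CPU. A real system would also have to account for per-layer experts, memory per expert, and changing activation patterns.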
Who it’s for
- Researchers and developers working with ultra-large MoE models (like DeepSeek-V3/R1) who lack enterprise-grade GPU clusters.
- Users wanting to run cutting-edge LLMs on consumer hardware (e.g., RTX 4090s).
- ML engineers looking for efficient ways to fine-tune large models using hybrid CPU/GPU memory.
Highlights
- Hybrid Inference: Supports CPU-GPU heterogeneous computing to run massive models on limited VRAM.
- Broad Hardware Support: Compatible with NVIDIA GPUs, AMD GPUs (ROCm), Intel Arc GPUs, and Ascend NPUs.
- Quantization: Supports INT4/INT8 on CPU and GPTQ/FP8 on GPU.
- Fine-Tuning Speed: Offers 6-12x training speedup for MoE SFT workloads compared to ZeRO-Offload.
- Framework Integration: Clean Python API for integration with SGLang.
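To make the INT8 quantization mentioned in the highlights concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is only an illustration of the general idea. The function names are hypothetical, and the project's AMX/AVX kernels may well use a different scheme (for example per-group scales or INT4 packing).

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Fall back to a scale of 1.0 for an all-zero (or empty) tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the quantized integers."""
    return [v * scale for v in q]


if __name__ == "__main__":
    w = [0.8, -1.0, 0.1, 0.0]
    q, scale = quantize_int8(w)
    w_hat = dequantize_int8(q, scale)
    # Rounding error per weight is bounded by half a quantization step.
    print(max(abs(a - b) for a, b in zip(w, w_hat)) <= scale / 2 + 1e-12)
```

Storing one byte per weight instead of two or four is what lets large expert weights fit in CPU memory, at the cost of a bounded rounding error per weight.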
Sources
- kvcache-ai/ktransformers