ktransformers: a CPU-GPU hybrid inference and fine-tuning framework optimized for ultra-large MoE models

What it solves

KTransformers addresses the hardware cost of running ultra-large Mixture-of-Experts (MoE) models. It runs high-performance inference and fine-tuning on consumer-grade hardware by splitting work between the CPU and GPU, which reduces the amount of expensive GPU VRAM required.

How it works

The framework splits each workload between the GPU and the CPU. Key techniques include:

Heterogeneous Expert Placement: "Hot" experts are kept on the GPU for speed, while "cold" experts are offloaded to the CPU.
CPU-Optimized Kernels: Kernels optimized for Intel AMX and AVX512/AVX2 accelerate INT4/INT8 quantized inference on the CPU.
Memory Management: It implements NUMA-aware memory management for MoE inference and a 3-layer (GPU-CPU-Disk) prefix cache reuse system.
SFT Integration: It integrates with LLaMA-Factory to enable fine-tuning of large MoE models, with a stated 6-12x speedup over ZeRO-Offload for MoE SFT workloads.
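
The heterogeneous expert placement described above can be illustrated with a small sketch. This is not KTransformers code or its API. It assumes that "hot" means "most frequently selected by the router", which the summary does not state, and the names place_experts, gpu_slots, and trace are hypothetical.

```python
from collections import Counter


def place_experts(activation_counts, gpu_slots):
    """Map each expert id to "gpu" or "cpu".

    Assumption (not from the KTransformers source): an expert is "hot"
    when it is among the gpu_slots most frequently routed experts.
    """
    # Rank by activation count, descending; break ties by expert id
    # so the result is deterministic.
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    hot = set(ranked[:gpu_slots])
    return {e: ("gpu" if e in hot else "cpu") for e in activation_counts}


# Hypothetical routing trace: one chosen expert id per token.
trace = [0, 3, 3, 1, 3, 0, 2, 3, 0, 5]

# Start every expert at zero so experts that were never routed still get a placement.
counts = Counter({e: 0 for e in range(6)})
counts.update(trace)

placement = place_experts(counts, gpu_slots=2)
print(placement)
```

With this trace, experts 3 and 0 are selected most often and take the two GPU slots, while the remaining experts stay on the CPU. A real system would also have to weigh expert size, transfer cost, and changes in routing over time, none of which this sketch models.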

Who it’s for

Researchers and developers working with ultra-large MoE models (like DeepSeek-V3/R1) who lack enterprise-grade GPU clusters.
Users wanting to run cutting-edge LLMs on consumer hardware (e.g., RTX 4090s).
ML engineers looking for efficient ways to fine-tune large models using hybrid CPU/GPU memory.

Highlights

Hybrid Inference: Supports CPU-GPU heterogeneous computing to run massive models on limited VRAM.
Broad Hardware Support: Compatible with NVIDIA GPUs, AMD GPUs (ROCm), Intel Arc GPUs, and Ascend NPUs.
Quantization: Supports INT4/INT8 on CPU and GPTQ/FP8 on GPU.
Fine-Tuning Speed: Offers 6-12x training speedup for MoE SFT workloads compared to ZeRO-Offload.
Framework Integration: Provides a clean Python API for integration with SGLang.
