chitu: a production-grade inference engine supporting diverse hardware and hybrid CPU-GPU deployment for massive LLMs

What it solves

Chitu is a production-grade large model inference engine designed to bridge the gap between small-scale AI experiments and large-scale enterprise deployments. It addresses the need for a high-performance inference framework that is flexible across different hardware configurations and stable enough to handle concurrent production traffic.

How it works

Chitu scales from CPU-only and single-GPU setups to large clusters. It implements efficient operators for online conversion of quantized weights (for example, FP4 to FP8 or BF16), which lets massive models such as DeepSeek-R1 671B run on limited hardware. It also supports heterogeneous CPU+GPU hybrid inference.
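To make the FP4 conversion idea concrete, here is a minimal sketch of decoding 4-bit floating-point codes into ordinary floats. It assumes the common E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit), two codes packed per byte with the low nibble first, and a single per-block scale factor. The function names, the packing order, and the scaling scheme are illustrative assumptions, not Chitu's actual operators or storage format.

```python
# Illustrative sketch only: not Chitu's kernel or API.
# Assumed format: FP4 E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
# The 3 low bits select one of eight magnitudes; the top bit is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(code: int) -> float:
    """Decode a single 4-bit E2M1 code (0..15) to a Python float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def unpack_fp4(packed: bytes, scale: float = 1.0) -> list[float]:
    """Expand packed FP4 codes to floats, applying a per-block scale.

    Assumption: each byte holds two codes, low nibble first.
    """
    values = []
    for byte in packed:
        values.append(decode_fp4(byte & 0x0F) * scale)  # low nibble
        values.append(decode_fp4(byte >> 4) * scale)    # high nibble
    return values
```

A production operator would perform the same lookup in a fused GPU kernel and write FP8 or BF16 outputs directly; the pure-Python loop above only shows the arithmetic.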

Who it’s for

It is built for enterprises and developers who need to deploy LLMs (such as DeepSeek, Qwen, GLM, and Kimi) across diverse hardware environments, including NVIDIA GPUs and Chinese-made AI chips such as Ascend, Moore Threads, MetaX (Muxi), and Hygon (Haiguang).

Highlights

  • Broad Hardware Compatibility: Supports NVIDIA GPUs, domestic Chinese AI chips, and CPU-only deployments.
  • Scalable Deployment: Flexible scaling from single-card setups to large-scale clusters.
  • Advanced Quantization: Includes efficient operators for FP4 and FP8 online conversion to support very large models.
  • Heterogeneous Inference: Supports hybrid CPU+GPU inference to run massive models on a single card.
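A rough back-of-the-envelope calculation shows why low-bit weights and CPU offload matter for a model of this size. The sketch below counts weight storage only (no quantization scale factors, KV cache, or activations), takes the 671B parameter count from the model name, and uses a hypothetical 80 GB accelerator as the GPU budget. The helper names are illustrative and the figures are estimates, not measurements from Chitu.

```python
# Weight-only memory estimate; ignores scale factors, KV cache, activations.
PARAMS = 671e9  # parameter count taken from the model name "671B"


def weight_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


def cpu_offload_gb(bits_per_param: float, gpu_gb: float) -> float:
    """Weights that do not fit in GPU memory and must live in CPU memory."""
    return max(0.0, weight_gb(bits_per_param) - gpu_gb)


if __name__ == "__main__":
    for name, bits in [("BF16", 16), ("FP8", 8), ("FP4", 4)]:
        # 80 GB is a hypothetical single-card budget, used for illustration.
        print(f"{name}: {weight_gb(bits):.1f} GB weights, "
              f"{cpu_offload_gb(bits, 80):.1f} GB left for CPU memory")
```

Under these assumptions the weights take about 1342 GB in BF16, 671 GB in FP8, and 335.5 GB in FP4, so even at 4 bits most of the model must sit outside a single card, which is the case hybrid CPU+GPU inference targets.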
