zml: a production inference stack that decouples AI workloads from proprietary hardware

What it solves

ZML is a production inference stack designed to decouple AI workloads from proprietary hardware. It lets developers run the same model code on different hardware accelerators without rewriting it for each platform.

How it works

Built with Zig, MLIR, and Bazel, ZML compiles models directly for multiple hardware backends. Supported accelerators include NVIDIA CUDA, AMD ROCm, Intel oneAPI, Google TPU, and AWS Trainium/Inferentia 2, with the goal of high performance on whichever target is chosen.
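
To make the "one model, many backends" idea concrete, here is a minimal Python sketch. It is not ZML's API (ZML itself is written in Zig and lowers through MLIR). The names `Target`, `Model`, `Executable`, and `compile_for` are hypothetical, and the "compilation" step is a stand-in that only tags each operation with its target. The point it illustrates is that the model is described once and the accelerator is chosen separately.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    # Accelerator families named in the text; this enum is hypothetical.
    CUDA = "cuda"
    ROCM = "rocm"
    ONEAPI = "oneapi"
    TPU = "tpu"
    TRAINIUM = "trainium"


@dataclass(frozen=True)
class Model:
    # The model is described once, with no reference to any accelerator.
    name: str
    ops: tuple[str, ...]


@dataclass(frozen=True)
class Executable:
    model: str
    target: Target
    kernels: tuple[str, ...]


def compile_for(model: Model, target: Target) -> Executable:
    # Stand-in for a real compiler pipeline: tag each op with the target name.
    kernels = tuple(f"{target.value}:{op}" for op in model.ops)
    return Executable(model.name, target, kernels)


mlp = Model("mlp", ("matmul", "relu", "matmul"))

# One model definition, one build per target.
builds = {target: compile_for(mlp, target) for target in Target}
print(builds[Target.ROCM].kernels)
```

Under this arrangement, adding a target means extending the compiler side, while the model definition stays untouched.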

Who it’s for

It is intended for developers and engineers building production AI inference systems who want to avoid hardware lock-in and maintain a single codebase for multi-platform deployment.

Highlights

  • Multi-Hardware Support: Native compilation for NVIDIA, AMD, Intel, and TPU/Trainium accelerators.
  • Unified Codebase: Run models on many hardware targets from one codebase.
  • LLM Support: Out-of-the-box support for Llama 3.1/3.2, Qwen 3.5, and LFM 2.5.
  • Flexible Loading: Ability to load models from Hugging Face, S3, or local directories via a VFS layer.
  • High Performance: Models are compiled directly for the target hardware, aiming for peak execution speed.
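
The "Flexible Loading" item describes a VFS layer that hides where model files live. The Python sketch below shows the general pattern, assuming a scheme-based dispatch; it is not ZML's implementation. The function names (`register`, `read`, `split_uri`) and the `s3://` prefix handling are illustrative assumptions, and the remote backend is an in-memory stand-in so that the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Maps a URI scheme to a function that returns the file's bytes.
Opener = Callable[[str], bytes]
_OPENERS: dict[str, Opener] = {}


def register(scheme: str, opener: Opener) -> None:
    _OPENERS[scheme] = opener


def split_uri(uri: str) -> tuple[str, str]:
    # "s3://bucket/key" -> ("s3", "bucket/key"); a bare path counts as local.
    scheme, sep, rest = uri.partition("://")
    return (scheme, rest) if sep else ("file", uri)


def read(uri: str) -> bytes:
    scheme, rest = split_uri(uri)
    if scheme not in _OPENERS:
        raise ValueError(f"no backend registered for scheme {scheme!r}")
    return _OPENERS[scheme](rest)


# Local files: read straight from disk.
register("file", lambda rest: Path(rest).read_bytes())

# Stand-in for object storage; a real backend would call an S3 client here.
fake_bucket = {"models/demo/config.json": b'{"layers": 2}'}
register("s3", lambda rest: fake_bucket[rest])

with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "config.json"
    local.write_bytes(b'{"layers": 2}')
    local_bytes = read(str(local))

remote_bytes = read("s3://models/demo/config.json")

# The caller sees the same bytes regardless of where the file lives.
print(local_bytes == remote_bytes)
```

The calling code only ever sees `read(uri)`, so supporting another source (for example a model hub) would mean registering one more opener rather than changing the loader.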
