cactus: what it is, what problem it solves & why it's gaining traction

cactus: what it is, what problem it solves & why it's gaining traction

What it solves

Cactus is a hybrid edge-cloud AI engine designed to enable fast, low-memory AI inference on mobile devices and wearables. It addresses the hardware constraints of ARM-based devices by optimizing memory usage and computation speed, while providing a seamless fallback to cloud models for complex queries.

How it works

Cactus uses a specialized stack of components to optimize on-device performance:

  • Cactus Kernels: High-performance ARM SIMD kernels for operations like matrix multiplication and attention.
  • Cactus Graph: A zero-copy computation graph that reduces RAM usage by up to 10x compared to other engines.
  • Cactus Quants: A quantization method (from 4-bit to 1-bit) where 4-bit uniform quantization matches the accuracy of f16.
  • Cactus Transpiler: A tool to convert PyTorch models into the Cactus runtime graph.
  • Hybrid Routing: Automatically routes requests to the cloud if the local model's confidence falls below a specific threshold.

Who it’s for

  • Mobile and Wearable Developers: Those building apps that require on-device AI (text, vision, and speech) with minimal RAM and high speed.
  • AI Researchers: Those needing an efficient way to deploy PyTorch models to ARM devices.

Highlights

  • Multimodal Support: A single engine for language, vision, and speech models.
  • Cactus Quants: 4-bit quantization that maintains f16 accuracy.
  • Zero-Copy Memory: Significantly lower RAM overhead for mobile devices.
  • Cloud Fallback: Automatic routing to cloud models based on confidence thresholds.
  • OpenAI-Compatible API: Local HTTP server for easy integration.
  • Broad Device Support: Optimized for Apple, Samsung, and Pixel devices, with bindings for Swift, Kotlin, Flutter, React Native, Python, and Rust.

Sources