executorch: a unified on-device AI inference engine for deploying PyTorch models to mobile and embedded hardware

What it solves

ExecuTorch provides a unified way to deploy PyTorch AI models on devices ranging from smartphones to microcontrollers. It removes the need for manual C++ rewrites and intermediate format conversions (such as ONNX or TFLite) and avoids vendor lock-in, so developers can move from research to production with the same PyTorch APIs.

How it works

ExecuTorch uses ahead-of-time (AOT) compilation to prepare models for the edge. The process involves three main steps:

  1. Export: The PyTorch model graph is captured using torch.export().
  2. Compile: The model is quantized, optimized, and partitioned across the target hardware backends, then serialized to a .pte file.
  3. Execute: The lightweight C++ runtime (with a base footprint of 50KB) loads and runs the .pte file on the device.
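
The export and compile steps above can be sketched in Python. This is a minimal sketch, not a definitive recipe: it assumes the executorch and torch packages are installed, it uses the XNNPACK partitioner as one example CPU backend, and it skips quantization. The helper name export_to_pte and the default output path are hypothetical, and the exact import paths may differ between ExecuTorch releases.

```python
def export_to_pte(model, example_inputs, out_path="model.pte"):
    """Export an eager PyTorch module to a .pte file (illustrative helper)."""
    # Imported lazily so the heavy dependencies are only required when called.
    import torch
    from executorch.exir import to_edge_transform_and_lower
    from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
        XnnpackPartitioner,
    )

    # Step 1 (Export): capture the model graph with torch.export.
    # example_inputs is a tuple of tensors matching the model's forward() args.
    exported_program = torch.export.export(model.eval(), example_inputs)

    # Step 2 (Compile): lower to the edge dialect, hand supported subgraphs
    # to the chosen backend, then produce the ExecuTorch program.
    program = to_edge_transform_and_lower(
        exported_program,
        partitioner=[XnnpackPartitioner()],
    ).to_executorch()

    # Serialize the program; the C++ runtime loads this file on the device.
    with open(out_path, "wb") as f:
        f.write(program.buffer)
    return out_path
```

Under these assumptions, a call such as export_to_pte(MyModel(), (torch.randn(1, 3, 224, 224),)) would write model.pte, where MyModel stands in for any exportable torch.nn.Module. Step 3 (Execute) happens in the C++ runtime on the device and is not shown here.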

Models are lowered to a standardized Core ATen operator set. Partitioners then delegate the subgraphs a backend supports to specialized hardware such as NPUs or GPUs, and any remaining operators fall back to the CPU.
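
The idea of delegation with CPU fallback can be illustrated with a toy model. This is not ExecuTorch's partitioner API: the Segment class, the partition function, the "npu" target, and the operator names are all hypothetical, and the graph is simplified to a linear list of operators, whereas real graphs are DAGs.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Segment:
    target: str     # "npu" (hypothetical delegate) or "cpu" (fallback)
    ops: List[str]  # operators assigned to this segment, in graph order


def partition(ops: List[str], supported: Set[str], backend: str = "npu") -> List[Segment]:
    """Group consecutive ops into delegated or CPU-fallback segments."""
    segments: List[Segment] = []
    for op in ops:
        # Ops the backend cannot run stay on the CPU.
        target = backend if op in supported else "cpu"
        if segments and segments[-1].target == target:
            # Extend the current segment to keep delegated subgraphs large.
            segments[-1].ops.append(op)
        else:
            segments.append(Segment(target, [op]))
    return segments


# Illustrative operator names, not actual Core ATen identifiers.
graph = ["conv2d", "relu", "topk", "linear", "softmax"]
npu_ops = {"conv2d", "relu", "linear"}

plan = partition(graph, npu_ops)
for seg in plan:
    print(seg.target, seg.ops)
```

Merging adjacent supported operators into one segment reflects why partitioning works on subgraphs rather than single operators: fewer, larger delegated regions mean fewer hand-offs between the accelerator and the CPU.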

Who it’s for

AI developers and engineers who need to deploy LLMs, vision, speech, and multimodal models to mobile devices (Android/iOS) and embedded systems (Linux/Windows/MCU) across various hardware backends (Apple, Qualcomm, ARM, MediaTek, etc.).

Highlights

  • Native PyTorch Export: Direct export from PyTorch without intermediate formats.
  • Tiny Runtime: Minimal 50KB base footprint for extreme portability.
  • Broad Hardware Support: 12+ open-source acceleration backends including CoreML, Vulkan, and XNNPACK.
  • Production-Proven: Powers on-device AI for Meta's Instagram, WhatsApp, and Quest 3.
  • Advanced Deployment Tools: Built-in support for quantization (via torchao), memory planning, and dynamic shapes.
