GenieX: an on-device inference runtime for running LLMs and VLMs locally on Qualcomm Snapdragon hardware

What it solves

GenieX provides a simplified way to run Large Language Models (LLMs) and Vision-Language Models (VLMs) locally on Qualcomm Snapdragon devices. It removes the complexity of hardware acceleration, allowing developers to leverage the Hexagon NPU, Adreno GPU, or CPU without needing deep expertise in chip-specific optimization.

How it works

GenieX acts as an on-device inference runtime that supports two primary execution paths:

llama.cpp runtime: Allows users to run almost any GGUF model from Hugging Face across the NPU, GPU, or CPU.
Qualcomm AI Engine Direct runtime: Executes pre-compiled model bundles from the Qualcomm AI Hub specifically for the NPU to achieve maximum performance.

It provides a unified C SDK that is exposed through multiple interfaces, including a CLI, Python library (mirroring the Hugging Face transformers API), an OpenAI-compatible server, Docker containers, and a Kotlin/Java SDK for Android.

Who it’s for

Developers building AI applications for Windows ARM64, Android, and Linux ARM64 devices powered by Qualcomm Snapdragon processors.

Highlights

Broad Model Support: Compatible with GGUF models from Hugging Face and optimized bundles from Qualcomm AI Hub.
Multi-Compute Support: Ability to dispatch workloads to the NPU, GPU, or CPU.
OpenAI Compatibility: Includes a local server that allows existing OpenAI clients to work without code changes.
Cross-Platform: Supports Windows ARM64, Android, and Linux ARM64.

GenieX: an on-device inference runtime for running LLMs and VLMs locally on Qualcomm Snapdragon hardware

GenieX: an on-device inference runtime for running LLMs and VLMs locally on Qualcomm Snapdragon hardware

What it solves

How it works

Who it’s for

Highlights

Sources