llama.cpp: a high-performance C/C++ inference engine for running quantized LLMs locally across diverse hardware
llama.cpp: a high-performance C/C++ inference engine for running quantized LLMs locally across diverse hardware
What it solves
llama.cpp provides a way to run large language models (LLMs) locally with minimal setup and high performance across a wide variety of hardware, including consumer-grade CPUs and GPUs. It removes the need for heavy dependencies by providing a plain C/C++ implementation of LLM inference.
How it works
The project implements LLM inference in C/C++ and uses the ggml library to optimize performance. It supports a vast array of hardware backends (such as Metal for Apple Silicon, CUDA for NVIDIA, and Vulkan for general GPUs) and employs integer quantization (ranging from 1.5-bit to 8-bit) to reduce memory usage and increase speed. It can also perform hybrid CPU+GPU inference to run models that are larger than the available VRAM.
Who it’s for
Developers and AI enthusiasts who want to run LLMs locally on their own hardware—ranging from MacBooks to NVIDIA GPUs and RISC-V architectures—without relying on cloud providers or complex installation processes.
Highlights
- Broad Hardware Support: Optimized for Apple Silicon, x86 (AVX/AMX), RISC-V, NVIDIA, AMD, and Intel GPUs.
- Quantization: Supports multiple integer quantization levels (1.5-bit to 8-bit) to fit large models on smaller devices.
- Extensive Model Support: Compatible with a huge range of text-only and multimodal models, including LLaMA, Mistral, Gemma, and LLaVA.
- Dependency-Free: Written in plain C/C++, making it easy to build and deploy across different platforms.
- OpenAI-Compatible API: Includes a server mode (
llama-server) that provides a REST API compatible with OpenAI's format.
Sources
- undefinedggml-org/llama.cpp