distributed-llama: a distributed inference engine that clusters home devices to run large LLMs via tensor parallelism

What it solves

Distributed Llama accelerates LLM inference by connecting multiple home devices (such as PCs, Mac Minis, or Raspberry Pis) into a single cluster. This works around the limited hardware resources of a single machine: large models (like Llama 3.1 405B) that would not fit in one device's RAM can run because the memory load is shared across the cluster.

How it works

The project uses tensor parallelism to split the neural network across multiple nodes over an Ethernet connection. It employs a root-worker architecture:

  • Root Node: Manages model loading, weight distribution, and synchronization of the neural network state. It also acts as a worker by processing its own slice of the network.
  • Worker Nodes: Process their assigned slice of the neural network and require no model-specific configuration.
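
As a rough illustration of the idea, the sketch below splits the rows of one weight matrix across simulated nodes, lets each node compute its slice of the output, and has the root gather the results. It is a minimal sketch, not the project's actual implementation: the function names (`matvec`, `split_rows`, `tensor_parallel_matvec`) are hypothetical, the row-wise split is one common way to do tensor parallelism, and the network transfer is replaced by a plain loop.

```python
# Illustrative sketch only; names and slicing scheme are assumptions,
# not distributed-llama's real code.

def matvec(rows, x):
    """Multiply a weight matrix (a list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def split_rows(weights, n_nodes):
    """Give each node a contiguous, equal-sized slice of the output rows."""
    if len(weights) % n_nodes != 0:
        raise ValueError("row count must divide evenly across nodes")
    step = len(weights) // n_nodes
    return [weights[i * step:(i + 1) * step] for i in range(n_nodes)]


def tensor_parallel_matvec(weights, x, n_nodes):
    """Root-side view: slice the weights, compute per node, gather the parts."""
    slices = split_rows(weights, n_nodes)
    # In a real cluster each slice would live on a different machine and x
    # would be sent over the network; here the nodes are simulated in a loop.
    partials = [matvec(node_rows, x) for node_rows in slices]
    # Concatenating the per-node outputs restores the full result vector.
    return [y for part in partials for y in part]


weights = [[1, 0, 2], [0, 1, 0], [3, 1, 1], [2, 2, 2]]
x = [1, 2, 3]
print(tensor_parallel_matvec(weights, x, 2))  # [7, 2, 8, 12]
```

Each simulated node holds only its own rows, which is why the weights (and their memory cost) can be spread over several devices while the combined output matches the single-machine result.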

The number of nodes must be a power of two (1, 2, 4, ..., 2^n), and RAM usage is split across all connected devices.
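
To make the node-count rule and the memory split concrete, here is a small back-of-the-envelope sketch. The helper names are hypothetical, and the estimate rests on loud assumptions: weights are divided evenly across nodes, each weight takes a fixed number of bits (4 in the example), and the KV cache, activations, and quantization overhead are ignored. Real per-node usage will be higher.

```python
# Rough estimate under stated assumptions; not a measurement of the project.

def is_power_of_two(n):
    """Return True for 1, 2, 4, 8, ... using a bit trick."""
    return n > 0 and (n & (n - 1)) == 0


def per_node_weight_gb(n_params, bits_per_weight, n_nodes):
    """Approximate weight memory per node in GB, assuming an even split."""
    if not is_power_of_two(n_nodes):
        raise ValueError("node count must be a power of two")
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / n_nodes / 1e9


# Hypothetical example: a 405B-parameter model at 4 bits per weight.
for nodes in (1, 2, 4, 8):
    print(nodes, "node(s):", round(per_node_weight_gb(405_000_000_000, 4, nodes), 1), "GB each")
```

Under these assumptions the weights alone need about 202.5 GB on one machine, but only about 25.3 GB per device on an eight-node cluster.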

Who it’s for

It is designed for users who have multiple spare devices and want to run large LLMs locally without investing in high-end enterprise hardware. It supports Linux, macOS, and Windows, and is optimized for ARM and x86_64 AVX2 CPUs, as well as GPUs via Vulkan.

Highlights

  • Cross-platform support: Runs on Linux, macOS, and Windows, on hardware ranging from PCs to Raspberry Pis.
  • Broad model support: Compatible with Llama 3.1, 3.2, 3.3, DeepSeek R1 Distill, and Qwen 3 models.
  • Hardware flexibility: Supports CPU (ARM/x86_64) and GPU (Vulkan) inference.
  • Flexible deployment: Includes a CLI chat, a benchmark tool, and an API server.
