AMD Strix Halo RDMA Cluster Setup Guide
Distributed inference for large language models (LLMs) typically requires high-bandwidth, low-latency interconnects to synchronize tensor data. For the AMD Strix Halo platform, using RDMA over Converged Ethernet (RoCE v2) via Intel E810 NICs reduces inter-node latency from ~70-100µs (TCP/IP) to ~5µs, effectively making two separate nodes behave as a single machine for Tensor Parallelism (TP).
Architecture and Core Concepts
To achieve distributed inference on Strix Halo, three primary software layers orchestrate the workload:
- vLLM: The inference engine that splits models across nodes using Tensor Parallelism.
- Ray: The distributed computing framework that manages the control plane, orchestrating worker processes across the cluster.
- RCCL (ROCm Collective Communication Library): The AMD equivalent of NVIDIA’s NCCL. It manages the data plane, handling the high-speed synchronization of tensor data between GPUs.
RoCE v2 is critical here because it allows RCCL to write data directly from one node’s memory to another, bypassing the CPU and OS kernel, which is essential for maintaining interactive token generation speeds.
Hardware Requirements
Building a two-node Strix Halo cluster requires the following hardware:
- Compute Nodes: 2x Framework Desktop Mainboards with AMD Ryzen AI MAX+ "Strix Halo" and 128GB of Unified Memory.
- Network Interface Cards (NICs): 2x Intel Ethernet Controller E810-CQDA1 (or similar 100GbE QSFP28 cards).
- Interconnect: Direct Attach Copper (DAC) cable (e.g., QSFP28 DAC) for a direct node-to-node connection without a switch.
- Physical Interface: Since the Framework motherboard PCIe slot is physically x4, a PCIe x4-to-x16 riser/extender is required to seat the x16 NICs.
Host Configuration (Fedora 43)
Both nodes must run Fedora 43 (verified on kernels 6.18.5-200.fc43.x86_64 and 6.18.6-200.fc43.x86_64).
Driver and Package Installation
Install the core RDMA userspace tools using DNF:
sudo dnf install rdma-core libibverbs-utils perftest
The setup utilizes the in-kernel ice (Ethernet) and irdma (RDMA) drivers; proprietary Intel drivers are not required.
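Whether both in-kernel drivers are actually loaded can be checked from the `lsmod` output. The following is a minimal sketch; the helper name `missing_modules` is hypothetical and not part of any package.

```shell
# Reads `lsmod`-style output on stdin and prints each required
# module (ice, irdma) that does not appear in it.
missing_modules() {
  loaded="$(cat)"
  for mod in ice irdma; do
    printf '%s\n' "$loaded" | grep -q "^${mod} " || echo "$mod"
  done
}

# Only query the running kernel if lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
  lsmod | missing_modules
fi
```

No output means both modules are loaded. If `irdma` is listed as missing, `sudo modprobe irdma` is the usual way to load it.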
Network Setup
Assign static IPs and enable Jumbo Frames (MTU 9000) to reduce CPU overhead. For a subnet of 192.168.100.0/30:
- Node 1: 192.168.100.1/30
- Node 2: 192.168.100.2/30
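The guide does not say which tool was used to apply these addresses. As one possible sketch, the snippet below prints plain `ip` commands for each node so they can be reviewed before running. The interface name `enp1s0f0` and the helper `node_net_cmds` are assumptions, and settings applied this way do not persist across reboots.

```shell
# Assumed interface name; find the real one with `ip -br link`.
IFACE="${IFACE:-enp1s0f0}"

# Print (not run) the addressing and MTU commands for node 1 or node 2.
node_net_cmds() {
  node="$1"
  iface="$2"
  echo "sudo ip addr add 192.168.100.${node}/30 dev ${iface}"
  echo "sudo ip link set dev ${iface} mtu 9000 up"
}

node_net_cmds 1 "$IFACE"   # on Node 1; pass 2 on Node 2
```

Once both ends are configured, `ping -M do -s 8972 -c 3 192.168.100.2` from Node 1 is a quick way to confirm that jumbo frames pass unfragmented (8972 bytes of payload plus 28 bytes of IP and ICMP headers equals 9000).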
Verify the link state using rdma link, which should show the state as ACTIVE and LINK_UP.
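This check can be scripted. The sketch below assumes the usual one-line-per-port `rdma link` output in which `state` is immediately followed by `physical_state`; the helper name `rdma_link_ok` is hypothetical.

```shell
# Succeeds if any line on stdin reports an ACTIVE link that is physically up.
rdma_link_ok() {
  grep -q 'state ACTIVE physical_state LINK_UP'
}

# Only query real hardware if the rdma tool is installed.
if command -v rdma >/dev/null 2>&1; then
  if rdma link | rdma_link_ok; then
    echo "RDMA link is up"
  else
    echo "RDMA link is not ready"
  fi
fi
```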
BIOS and Kernel Optimization
To maximize the available unified memory for the iGPU and optimize RDMA performance, apply the following settings:
- BIOS: Set iGPU Memory Allocation to the minimum (512MB).
- Kernel Parameters: Append the following to GRUB_CMDLINE_LINUX in /etc/default/grub:
iommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856
Parameter Breakdown:
- iommu=pt: Enables Pass-Through mode to reduce overhead for the NIC and iGPU.
- pci=realloc: Reallocates PCI BARs to map large address spaces.
- pcie_aspm=off: Disables Active State Power Management to prevent latency spikes.
- amdgpu.gttsize and ttm.pages_limit: Cap the GPU GTT size at ~124 GiB, allowing the GPU to address system RAM as VRAM.
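The two memory values follow from the 124 GiB target by simple arithmetic. The sketch below derives them, assuming `amdgpu.gttsize` is given in MiB and `ttm.pages_limit` counts 4 KiB pages; the helper names `gtt_params` and `cmdline_has` are hypothetical.

```shell
# Derive both memory parameters from a target size in GiB.
# Assumes gttsize is in MiB and pages_limit counts 4 KiB pages.
gtt_params() {
  gib="$1"
  gtt_mib=$(( gib * 1024 ))
  pages=$(( gib * 1024 * 1024 * 1024 / 4096 ))
  echo "amdgpu.gttsize=${gtt_mib} ttm.pages_limit=${pages}"
}

# Succeeds if every parameter in $2.. appears as a whole word in the
# kernel command line passed as $1; otherwise names the first missing one.
cmdline_has() {
  cmdline=" $1 "
  shift
  for p in "$@"; do
    case "$cmdline" in
      *" $p "*) ;;
      *) echo "missing: $p"; return 1 ;;
    esac
  done
  return 0
}

gtt_params 124
# After rebooting with the new parameters:
# cmdline_has "$(cat /proc/cmdline)" iommu=pt pci=realloc pcie_aspm=off $(gtt_params 124)
```

On Fedora, edits to /etc/default/grub normally take effect only after regenerating the configuration with `sudo grub2-mkconfig -o /boot/grub2/grub.cfg` and rebooting.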
Software Installation and vLLM Deployment
The RCCL Patch
A critical requirement for this setup is the use of a custom-built librccl.so library. Upstream ROCm packages currently lack support for gfx1151 (Strix Halo) RDMA.
Toolbox Setup
Run ./refresh_toolbox.sh on both nodes. This script pulls the patched image and configures the container to expose /dev/dri, /dev/kfd, and /dev/infiniband, while setting --ulimit memlock=-1 for DMA memory pinning.
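A quick preflight run inside the container can confirm that the device nodes and the memlock limit described above are in place. This is a sketch only; the helper name `missing_paths` is hypothetical and is not part of refresh_toolbox.sh.

```shell
# Print each path from the argument list that does not exist.
missing_paths() {
  for p in "$@"; do
    [ -e "$p" ] || echo "$p"
  done
}

# Device nodes the container is expected to expose.
missing_paths /dev/dri /dev/kfd /dev/infiniband

# Should report "unlimited" when memlock=-1 took effect.
echo "memlock limit: $(ulimit -l 2>/dev/null)"
```

No paths printed and a memlock limit of `unlimited` suggest the container is set up as intended.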
Running the Cluster
- Orchestration: Use the start-vllm-cluster TUI utility to configure IPs and start the Ray cluster. Node 1 is designated as the