Running Local AI on AMD Hardware
Running Local AI on AMD Hardware
Local AI as a Strategic Necessity
Local AI is becoming essential because open-weight models have closed the performance gap with frontier models to within three to six months. While frontier lab token costs may appear lower on paper, the rise of AI agents and reasoning-heavy workloads consumes tokens at a scale that significantly increases operational costs. Local execution provides a critical solution for users requiring data privacy, total control over their AI stack, and a way to avoid the escalating costs of agentic API calls.
Hardware Configuration for AI Workloads
High-performance local AI requires substantial VRAM and processing power. The hardware stack evaluated in this demonstration includes:
- CPU: AMD Ryzen Threadripper 9980X
- GPU: AMD Radeon AI Pro R9700 with 32GB of VRAM
This configuration allows for the execution of high-quality models with minimal quantization compromises. For smaller models, 8-bit quantization or full resolution can be used, while larger models typically run effectively at 4-bit quantization.
LLM Execution and Performance
Running Large Language Models (LLMs) on AMD hardware is streamlined through tools like LM Studio and Ollama.
LM Studio and Ollama
LM Studio now ships with a ROCm runtime, allowing it to recognize AMD GPUs natively. Using a Qwen 3.6 mixture-of-experts model on the Radeon AI Pro R9700, performance reached approximately 160 tokens per second. This speed is sufficient for both human reading and the rapid iterations required by AI agents.
Model Capabilities
Local setups support a wide range of capabilities, including:
- Reasoning: Toggling reasoning abilities on and off.
- Vision: Processing visual inputs.
- Document Analysis: Loading and chatting with local documents.
- Context Windows: Adjusting context window sizes (e.g., 64K) while maintaining high token throughput.
The ROCm Software Stack
ROCm (Radeon Open Compute) is the foundational layer that enables deep learning frameworks to run on AMD hardware. It serves as the primary alternative to NVIDIA's CUDA.
Compatibility and Integration
ROCm and its translation layer, HIP, have matured to the point where software compatibility is no longer a primary barrier. Key integrations include:
- PyTorch: Official ROCm wheels are available, allowing users to install PyTorch via pip and run existing code with minimal changes.
- Transformers Library: Fully compatible with ROCm for model inference and deployment.
- Unsloth: Provides specific guides for fine-tuning LLMs on AMD GPUs.
ROCm supports not only inference but also full model training and fine-tuning from scratch.
Generative Media with ComfyUI
AMD GPUs are capable of running complex generative media pipelines via ComfyUI. By selecting the ROCm version of ComfyUI, users can execute various generative tasks:
- Image Generation: Rapid text-to-image and image-to-image generation.
- Video Generation: Support for models such as LTX 2 and Wan 2.2.
- Other Modalities: Support for audio models and image-to-3D models.
Optimizing Performance with Linux
While Windows (via WSL) is supported, native Linux installations provide the most robust support for the ROCm stack.
Linux Advantages
Installing Linux allows for the use of the latest ROCm versions (e.g., ROCm 7.2), which may not be available on Windows. This environment enables deeper integration with PyTorch, allowing developers to:
- Direct GPU Access: Verify device names and allocate tensors directly to the Radeon GPU.
- Custom Training: Train models (e.g., a ResNet model on the CIFAR-10 dataset) using a Gradio interface for predictions.
- Advanced Inference: Run full-resolution models, such as Gemma 4, using the Transformers library or serve them via vLLM for agentic workflows.
By leveraging a native Linux environment, developers move beyond simple chat interfaces to full-scale AI development and deployment on local AMD hardware.
Sources
- undefinedRunning Local AI on AMD