whichllm: a hardware-aware recommendation engine that ranks the best local LLMs based on system specs and real-world benchmarks

What it solves

Choosing the best local Large Language Model (LLM) for a given machine is hard because a model that fits in VRAM is not necessarily the highest-quality or fastest option. whichllm addresses this by detecting system hardware automatically and ranking HuggingFace models by benchmark performance, estimated speed, and hardware compatibility rather than by size alone.

How it works

The tool analyzes the user's GPU, CPU, and RAM to estimate VRAM requirements (including weights, KV cache, and overhead) and generation speed based on memory bandwidth. It fetches live model data from the HuggingFace API and merges it with multiple benchmark sources (such as LiveBench, Chatbot Arena, and Open LLM Leaderboard). A scoring engine then ranks models by combining benchmark quality, model size, quantization penalties, and runtime fit (e.g., full GPU vs. partial offload).
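The estimation step described above can be illustrated with a rough sketch. This is not whichllm's actual code: the ModelSpec fields, the fixed 1 GiB overhead, and the 0.6 bandwidth-efficiency factor are assumptions chosen for illustration. It uses the common approximations that VRAM is weights plus KV cache plus overhead, and that generation speed is bounded by how fast the weights can be read from memory for each token.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelSpec:
    """Hypothetical description of a quantized model (not a whichllm type)."""
    params_billion: float   # parameter count, in billions
    bits_per_weight: float  # e.g. about 4.5 for a typical 4-bit quantization
    n_layers: int
    n_kv_heads: int
    head_dim: int


def weights_bytes(m: ModelSpec) -> float:
    # Each parameter takes bits_per_weight / 8 bytes.
    return m.params_billion * 1e9 * m.bits_per_weight / 8


def kv_cache_bytes(m: ModelSpec, context_tokens: int, bytes_per_elem: int = 2) -> int:
    # Leading 2 = one key tensor plus one value tensor per layer;
    # bytes_per_elem=2 assumes an fp16 cache.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * context_tokens * bytes_per_elem


def estimate_vram_gib(m: ModelSpec, context_tokens: int, overhead_gib: float = 1.0) -> float:
    # overhead_gib is an assumed flat allowance for runtime buffers.
    return (weights_bytes(m) + kv_cache_bytes(m, context_tokens)) / GIB + overhead_gib


def estimate_tokens_per_sec(m: ModelSpec, bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    # Decoding reads roughly all weights once per token, so throughput is
    # about (usable bandwidth) / (weight bytes). efficiency is an assumption.
    return efficiency * bandwidth_gb_s * 1e9 / weights_bytes(m)


# Example: an 8B-parameter model at about 4.5 bits per weight, 8192-token context.
spec = ModelSpec(params_billion=8, bits_per_weight=4.5, n_layers=32, n_kv_heads=8, head_dim=128)
vram = estimate_vram_gib(spec, context_tokens=8192)
speed = estimate_tokens_per_sec(spec, bandwidth_gb_s=1000.0)
print(f"VRAM: {vram:.2f} GiB, speed: {speed:.0f} tokens/s")
```

A model whose estimate exceeds available VRAM would fall into the partial-offload case mentioned above, where some layers run from system RAM and the effective bandwidth drops.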

Who it’s for

It is designed for users running local LLMs who want to optimize their hardware usage, as well as people planning hardware purchases who want to simulate specific GPUs to see which models they could run.

Highlights

  • Hardware Auto-detection: Supports NVIDIA, AMD, Intel, and Apple Silicon.
  • Evidence-Based Ranking: Uses real benchmark scores with confidence-based dampening instead of simple size heuristics.
  • One-Command Execution: Includes whichllm run to download and chat with a recommended model, and whichllm snippet to generate ready-to-use Python code.
  • Hardware Planning: Features plan and upgrade commands to determine the GPU needed for a specific model or compare current hardware against potential upgrades.
  • Live Data: Integrates directly with the HuggingFace API for up-to-date model recommendations.
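
To make the "confidence-based dampening" idea in the Evidence-Based Ranking item more concrete, here is a minimal sketch of one way such a score could be computed. The formula, the prior of 50, and the fit factors are all assumptions for illustration, not whichllm's actual scoring engine: a benchmark score is pulled toward a neutral prior when the evidence is weak, then scaled down by a quantization penalty and by how well the model fits the hardware.

```python
# Hypothetical multipliers for runtime fit; the real tool's values are unknown.
FIT_FACTOR = {"full_gpu": 1.0, "partial_offload": 0.7, "cpu_only": 0.4}


def dampen(score: float, confidence: float, prior: float = 50.0) -> float:
    """Shrink a benchmark score toward a neutral prior when confidence is low.

    confidence is clamped to [0, 1]: 1 keeps the score unchanged,
    0 replaces it with the prior.
    """
    confidence = min(max(confidence, 0.0), 1.0)
    return prior + confidence * (score - prior)


def rank_score(benchmark: float, confidence: float, quant_penalty: float, fit: str) -> float:
    """Combine dampened quality, quantization penalty, and runtime fit."""
    return dampen(benchmark, confidence) * (1.0 - quant_penalty) * FIT_FACTOR[fit]


# A well-benchmarked model that fits fully on the GPU...
a = rank_score(benchmark=80.0, confidence=1.0, quant_penalty=0.05, fit="full_gpu")
# ...versus a higher-scoring but weakly evidenced model that needs offloading.
b = rank_score(benchmark=90.0, confidence=0.2, quant_penalty=0.0, fit="partial_offload")
print(a > b)
```

Under these assumed weights, the well-evidenced model that fits entirely in VRAM outranks the nominally stronger one, which is the behavior a size-only heuristic would miss.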

Sources