GLM5.2 Performance on AMD MI355X: Achieving High Throughput at Lower Cost
AMD MI355X delivers superior performance-per-dollar for GLM5.2
Wafer has demonstrated that the AMD Instinct MI355X can serve the GLM5.2 model with an aggregate throughput of 2626 tokens per second per node (tok/s/node) at 2.4 requests per second (RPS). This configuration achieves approximately 80% of the performance of an NVIDIA B200, while the hardware cost per GPU is estimated to be 2.75x lower than the B300.
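The performance-per-dollar claim can be made concrete with a rough calculation. The sketch below simply multiplies the two figures quoted above. It assumes both are measured against the same NVIDIA baseline, which the text does not establish (performance is quoted against the B200, cost against the B300), so the result is an order-of-magnitude illustration rather than a measured value. The function and variable names are illustrative.

```python
# Figures quoted in the text above.
relative_performance = 0.80  # MI355X throughput as a fraction of the NVIDIA baseline
cost_ratio = 2.75            # NVIDIA baseline cost per GPU / MI355X cost per GPU


def perf_per_dollar_advantage(relative_performance: float, cost_ratio: float) -> float:
    """Return how many times more throughput per hardware dollar the cheaper GPU gives.

    ASSUMPTION: both inputs are measured against the same baseline GPU.
    """
    return relative_performance * cost_ratio


advantage = perf_per_dollar_advantage(relative_performance, cost_ratio)
print(f"Throughput per hardware dollar: {advantage:.2f}x the baseline")
```

This covers hardware purchase cost only. Power, hosting, and engineering time are not included.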
Performance Benchmarks
Under a workload of 20k input tokens and 1k output tokens per request with a 60% cache hit rate, the MI355X sustained the following throughput as the request rate increased, saturating at 2.4 RPS:
| Sustained RPS | Aggregate tok/s/node | TTFT p50 / p95 | Success |
|---|---|---|---|
| 0.5 | 449 | 0.59s / 0.60s | 100% |
| 1.0 | 974 | 0.60s / 0.81s | 100% |
| 1.5 | 1913 | 0.62s / 1.03s | 100% |
| 2.0 | 1944 | 0.62s / 1.05s | 100% |
| 2.25 | 2089 | 0.63s / 1.23s | 100% |
| 2.4 (saturation) | 2626 | 0.81s / 2.22s | 100% |
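One way to sanity-check the table is to divide aggregate throughput by the request rate, which gives the number of tokens accounted for per request. The sketch below does this for each row. It assumes the throughput column counts generated (output) tokens only, which the source does not state explicitly.

```python
# (sustained RPS, aggregate tok/s/node) pairs copied from the table above.
rows = [
    (0.5, 449),
    (1.0, 974),
    (1.5, 1913),
    (2.0, 1944),
    (2.25, 2089),
    (2.4, 2626),
]


def tokens_per_request(rps: float, tok_per_s: float) -> float:
    """Tokens accounted for per request at a steady request rate."""
    return tok_per_s / rps


# ASSUMPTION: the throughput column counts output tokens only.
implied = {rps: tokens_per_request(rps, tps) for rps, tps in rows}

for rps, tokens in implied.items():
    print(f"{rps:>5} RPS -> {tokens:7.1f} tokens per request")
```

Under that assumption, values close to 1,000 are consistent with the 1k-output workload. A row that deviates noticeably is worth re-checking against the original measurements.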
Additionally, in a single-stream test (10k input / 1.5k output tokens), the MI355X achieved 213 tok/s.
Technical Optimization Path
Achieving these results required overcoming several software and framework hurdles, as AMD's ROCm stack often lacks the "day-0" support provided by NVIDIA's CUDA ecosystem.
Quantization and Framework Selection
Wafer utilized AMD Quark to quantize the base bf16 GLM-5.2 model to MXFP4. This quantization was found to be effectively lossless compared to the official FP8 quantization, with minimal impact on benchmarks such as GSM8K and GPQA-Diamond.
For the inference engine, sglang was selected over vLLM and ATOM because it offered the least friction for native MXFP4 support while still producing coherent model output.
Enabling Speculative Decoding
Speculative decoding did not work out of the box in the sglang ROCm image and required two specific fixes:
- Weight Mapping Fix: A mismatch between the MTP (Multi-Token Prediction) head's module prefix and the main decoder stack caused quantization lookup failures. By duplicating the layer 78 entries in the Quark un-quantized list under the decoder name used by sglang, Wafer unblocked speculative decode, resulting in a nearly 3x gain in single-stream throughput.
- ROCm Guard Implementation: Deep speculative decode (e.g., the 5/1/6 config) was blocked by a fused multi-step metadata kernel that lacked a ROCm guard. Adding an `#ifdef USE_ROCM` guard resolved this issue.
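The weight mapping fix amounts to listing the same un-quantized modules under a second name. The following is a minimal sketch of that idea, not Wafer's actual patch. The module names, the `mtp.` and `model.` prefixes, and the `add_prefix_aliases` helper are all hypothetical, and the real Quark configuration layout may differ.

```python
def add_prefix_aliases(
    exclude: list[str],
    layer_index: int,
    source_prefix: str,
    alias_prefix: str,
) -> list[str]:
    """Return the exclude list plus copies of one layer's entries under alias_prefix.

    Entries for `layer_index` that start with `source_prefix` are duplicated
    with `alias_prefix` substituted, so a lookup by either name finds them.
    """
    marker = f"{source_prefix}layers.{layer_index}."
    aliases = [
        alias_prefix + name[len(source_prefix):]
        for name in exclude
        if name.startswith(marker)
    ]
    seen = set(exclude)
    # Keep the original entries and append only aliases not already present.
    return exclude + [alias for alias in aliases if alias not in seen]


# Hypothetical un-quantized list: layer 78 is the MTP head in this sketch.
unquantized = ["lm_head", "mtp.layers.78.eh_proj", "mtp.layers.78.enorm"]

patched = add_prefix_aliases(
    unquantized, layer_index=78, source_prefix="mtp.", alias_prefix="model."
)
print(patched)
```

Duplicating rather than renaming keeps the original entries valid for any tool that still expects the first naming scheme.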
Throughput and Kernel Tuning
To maximize aggregate throughput, Wafer shifted from a Tensor Parallelism 8 (TP8) configuration to a TP4×DP2 (Data Parallelism) configuration.
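As an illustration of what the change means for GPU layout: with tensor parallelism, every GPU in a group holds a shard of each layer and the whole group cooperates on each request, while data-parallel replicas serve separate requests independently. The sketch below partitions a node into replicas. It assumes 8 GPUs per node (implied by TP8) and contiguous rank assignment, and `partition_gpus` is a hypothetical helper, not an sglang function.

```python
def partition_gpus(num_gpus: int, tp_size: int, dp_size: int) -> list[list[int]]:
    """Split GPU ranks into dp_size replicas of tp_size GPUs each.

    ASSUMPTION: ranks are assigned contiguously; real launchers may differ.
    """
    if tp_size * dp_size != num_gpus:
        raise ValueError("tp_size * dp_size must equal num_gpus")
    return [
        list(range(replica * tp_size, (replica + 1) * tp_size))
        for replica in range(dp_size)
    ]


tp8 = partition_gpus(num_gpus=8, tp_size=8, dp_size=1)
tp4_dp2 = partition_gpus(num_gpus=8, tp_size=4, dp_size=2)

print("TP8:    ", tp8)      # one replica spanning all eight GPUs
print("TP4xDP2:", tp4_dp2)  # two independent replicas of four GPUs each
```

Smaller tensor-parallel groups generally mean less cross-GPU communication per token, at the price of each replica having fewer GPUs to hold the model and its cache.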
Furthermore, the team discovered that GLM-5.2's fp4 MoE (Mixture of Experts) was defaulting to a slow FlyDSL heuristic fallback on the sglang image. By manually tuning the MoE kernel selection for GLM's specific fp4 shapes (model_dim 6144, moe_inter 2048, E=256, topk=8), throughput was increased to the final 2626 tok/s/node.
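Conceptually, this kind of tuning replaces a generic heuristic with a shape-specific entry. The sketch below shows the lookup-with-fallback pattern using the shape quoted above. The kernel identifiers and the `select_kernel` function are hypothetical placeholders and do not reflect the actual sglang or FlyDSL interfaces.

```python
from typing import NamedTuple


class MoEShape(NamedTuple):
    """Problem shape that determines which MoE kernel performs best."""
    model_dim: int
    moe_inter: int
    num_experts: int
    topk: int


# Hypothetical table of tuned entries; the value is a placeholder kernel id.
TUNED_KERNELS: dict[MoEShape, str] = {
    MoEShape(model_dim=6144, moe_inter=2048, num_experts=256, topk=8): "tuned_fp4_moe",
}


def select_kernel(shape: MoEShape, fallback: str = "heuristic_fallback") -> str:
    """Return the tuned kernel for this shape, or the generic fallback."""
    return TUNED_KERNELS.get(shape, fallback)


glm_shape = MoEShape(6144, 2048, 256, 8)
other_shape = MoEShape(4096, 1024, 64, 4)

print(select_kernel(glm_shape))    # shape has a tuned entry
print(select_kernel(other_shape))  # shape falls back to the heuristic
```

The point of the pattern is that a shape missing from the table still runs, only more slowly, which is how an untuned model can silently land on the fallback path.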
Industry Implications and Community Perspective
This implementation suggests that the "CUDA moat" is eroding as agentic coding and manual optimization can bridge the gap between hardware capabilities and software support.
Community Counterpoints
While the technical achievement is notable, community members on Hacker News raised several critical points regarding the real-world applicability of these benchmarks:
- Quantization Quality: Some users argued that FP4 quantization is rarely truly lossless in practice and can lead to "lobotomized" models that lose frontier-level quality.
- Benchmark Validity: Critics noted that the 60% cache hit rate and the use of speculative decoding significantly influence the results, questioning if these represent typical production workloads.
- Metric Gaps: Discussion highlighted the absence of performance-per-watt metrics, which are critical for data center operators outside the US where electricity costs are higher.
- Production Viability: Some questioned whether these optimizations are primarily "benchmark hacking" for single-stream traffic rather than a scalable production strategy.