Wafer Reveals GLM-5.2 Inference Performance on AMD MI355X
Infrastructure/Platform | Sat Jul 04 2026 | 1 source
Wafer achieved 2626 tok/s/node running GLM-5.2 on AMD MI355X, reaching 80% of B200 performance at less than half the cost.
Analysis
[Wafer] published GLM-5.2 inference benchmarks on AMD MI355X [1]
- 20k input / 1k output
- 60% cache hit workload
- aggregate throughput of 2626 tok/s/node at 2.4 rps
- meets the TTFT ≤ 5 s requirement
- 80% of B200 performance at less than half the cost
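The cost-performance claim above can be turned into a throughput-per-dollar figure. The following is a minimal sketch, not Wafer's own calculation: the function name and constants are hypothetical, and it assumes "over 2x cheaper" means the MI355X cost is at most 0.5x the B200 cost for the same workload.

```python
def perf_per_dollar_ratio(relative_perf: float, relative_cost: float) -> float:
    """Throughput per dollar of system A relative to system B.

    relative_perf: A's throughput divided by B's throughput.
    relative_cost: A's cost divided by B's cost.
    """
    if relative_cost <= 0:
        raise ValueError("relative_cost must be positive")
    return relative_perf / relative_cost


# From the summary: MI355X delivers 80% of B200 performance.
MI355X_RELATIVE_PERF = 0.80
# Assumption: "over 2x cheaper" is read as cost below 0.5x of B200,
# so 0.5 is the conservative upper bound on relative cost.
MI355X_RELATIVE_COST_UPPER_BOUND = 0.5

ratio_lower_bound = perf_per_dollar_ratio(
    MI355X_RELATIVE_PERF, MI355X_RELATIVE_COST_UPPER_BOUND
)
print(f"MI355X throughput per dollar vs B200: at least {ratio_lower_bound:.2f}x")
```

Under that reading, the MI355X delivers at least 1.6x the throughput per dollar of the B200 on this workload. The exact figure depends on the actual price ratio, which the summary does not state.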
[Wafer] applied MXFP4 quantization via AMD Quark [1]
- quantized bf16 GLM-5.2 to MXFP4
- effectively lossless compared to z-ai's official FP8
- validated on GSM8K, GPQA-Diamond, and tau2 benchmarks
- tau2 macro score actually improved by 0.015
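For readers unfamiliar with MXFP4, the sketch below illustrates the number format itself as defined in the OCP Microscaling specification: blocks of 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 float. This is not AMD Quark's API or Wafer's pipeline, neither of which the summary describes. The function and constant names are hypothetical, and the rounding is simplified (ties go to the smaller magnitude).

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2    # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32  # number of elements sharing one scale in MX formats


def quantize_block(block):
    """Fake-quantize one block; return (shared_scale, dequantized_values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared power-of-two scale, chosen so the largest magnitude lands
    # near the top of the E2M1 range after division.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for x in block:
        # Scale down, then saturate at the largest E2M1 magnitude (6.0).
        mag = min(abs(x) / scale, E2M1_VALUES[-1])
        # Simplified rounding: nearest grid value, ties to the smaller one.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return scale, out


# Illustrative input: a few nonzero weights padded to one full block.
block = [1.0, -0.5, 0.3] + [0.0] * (BLOCK_SIZE - 3)
scale, dequantized = quantize_block(block)
print(scale, dequantized[:3])
```

Values that already sit on the scaled E2M1 grid (here 1.0 and -0.5) survive unchanged, while 0.3 snaps to the nearest grid point. Rounding of this kind is the error that accuracy benchmarks such as those listed above are meant to catch.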
[sglang] selected as the inference framework for serving MXFP4 + GLM-5.2 [1]
- vLLM does not support the MXFP4 + GlmMoeDsa path
- ATOM shows degraded output quality on long contexts
- sglang provides native support with minimal friction
[AMD Instinct MI355X] emerged as a low-cost inference alternative to Blackwell [1]
- GPU unit price averages 2.75x lower than B300
- competitive with Blackwell at the silicon level
- ROCm stack's lack of day-0 support remains a weakness
- gap narrowing through agent-based kernel optimization