Wafer Reveals GLM5.2 Inference Performance on AMD MI355X

Infrastructure/Platform | Sat, Jul 4, 2026 (UTC) | 1 source

Wafer achieved 2626 tok/s/node running GLM5.2 on AMD MI355X at less than half the cost of Blackwell.

Analysis

[Wafer] published GLM5.2 inference benchmarks on AMD MI355X ^[1]

workload of 20k input / 1k output tokens per request
60% cache hit rate
aggregate throughput of 2626 tok/s/node at 2.4 rps
meets the TTFT ≤ 5 s requirement
80% of B200 performance at over 2x lower cost
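The cost claim can be restated as cost per token. This is a minimal sketch, assuming per-node throughput and per-node price both compare as stated (0.8x B200 throughput, at most 0.5x the price); the function names are hypothetical, and the source gives no actual hourly prices, so `node_price_per_hour` is left as an input.

```python
def relative_cost_per_token(throughput_ratio: float, price_ratio: float) -> float:
    """Cost per token of system A relative to system B.

    throughput_ratio: A's per-node throughput divided by B's
    price_ratio: A's per-node price divided by B's (same time unit)
    """
    if throughput_ratio <= 0:
        raise ValueError("throughput_ratio must be positive")
    return price_ratio / throughput_ratio


def cost_per_million_tokens(node_price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per million tokens for a node billed per hour (hypothetical helper).

    The hourly price is an input you supply; it is not a reported figure.
    """
    tokens_per_hour = tokens_per_second * 3600
    return node_price_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # MI355X vs B200 using the stated ratios: 0.8x throughput, 0.5x price (upper bound)
    ratio = relative_cost_per_token(0.8, 0.5)
    print(f"MI355X cost per token ~ {ratio:.3f}x of B200")  # prints 0.625x
    print(f"Performance per dollar ~ {1 / ratio:.2f}x")      # prints 1.60x
```

Because the price gap is stated as "over 2x", 0.625 is an upper bound on relative cost per token, and 1.6x a lower bound on performance per dollar. `cost_per_million_tokens` can take the reported 2626 tok/s/node together with whatever node price actually applies.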

[Wafer] applied MXFP4 quantization via AMD Quark ^[1]

quantized the bf16 GLM-5.2 checkpoint to MXFP4
effectively lossless compared to z-ai's official FP8 checkpoint
validated on GSM8K, GPQA-Diamond, and tau2 benchmarks
slightly improved the tau2 macro score (+0.015)
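MXFP4 is the OCP Microscaling format in which each block of 32 values shares one 8-bit power-of-two (E8M0) scale and each element is stored as 4-bit FP4 (E2M1), for about 4.25 bits per value. The sketch below fake-quantizes (quantizes, then dequantizes) a list of floats using the spec's shared-exponent rule. It is a generic illustration, not AMD Quark's API or Wafer's recipe; the helper names are hypothetical, rounding ties go toward the smaller magnitude for simplicity rather than to even, and the scale exponent is not clamped to the E8M0 range.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1 (OCP Microscaling spec)
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_EMAX = 2      # exponent of the largest E2M1 value: 6 = 1.5 * 2**2
BLOCK_SIZE = 32   # number of elements sharing one E8M0 scale


def round_to_fp4(x: float) -> float:
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Ties go to the smaller magnitude (a simplification, not round-half-to-even).
    """
    mag = min(abs(x), FP4_E2M1_VALUES[-1])
    nearest = min(FP4_E2M1_VALUES, key=lambda v: abs(v - mag))
    return math.copysign(nearest, x)


def quantize_block(block: list[float]) -> tuple[int, list[float]]:
    """Fake-quantize one block; returns (shared exponent, dequantized values)."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale that maps the block max into [4, 8); values above 6 saturate.
    shared_exp = math.floor(math.log2(max_abs)) - FP4_EMAX
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_fp4(v / scale) * scale for v in block]


def mxfp4_fake_quant(values: list[float]) -> list[float]:
    """Quantize-then-dequantize a flat list, one 32-element block at a time."""
    out: list[float] = []
    for start in range(0, len(values), BLOCK_SIZE):
        _, deq = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(deq)
    return out


if __name__ == "__main__":
    print(mxfp4_fake_quant([0.1, -0.25, 0.7, 1.3]))  # [0.125, -0.25, 0.75, 1.5]
    # Storage: 32 four-bit elements plus one 8-bit scale per block
    print((4 * BLOCK_SIZE + 8) / BLOCK_SIZE, "bits per element")  # 4.25
```

In the example, the block maximum 1.3 sets a scale of 2^-2, so 1.3 maps to 5.2, which falls between the E2M1 codes 4 and 6 and rounds to 6 (1.5 after dequantization). Errors of this size are why lossless claims for MXFP4 are checked against benchmarks such as GSM8K and GPQA-Diamond rather than assumed.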

[sglang] selected as the inference framework for serving MXFP4 + GLM-5.2 ^[1]

vLLM does not support the MXFP4 + GlmMoeDsa path
ATOM shows output quality degradation with long contexts
sglang provides native support with minimal friction

[AMD Instinct MI355X] emerged as a low-cost inference alternative to Blackwell ^[1]

GPU unit price on average 2.75x lower than B300
competitive with Blackwell at the silicon level
the ROCm stack's lack of day-0 support remains a weakness
the gap is narrowing through agent-based kernel optimization

Sources

[1] GLM5.2 on AMD MI355X at 2626 tok/s/node at over 2x lower cost than Blackwell - Hacker News