AMD Instinct MI355X GPU Sets a New Bar for DeepSeek Inference
Jun 10, 2026
DeepSeek V4 was one of the most demanding open-model inference launches to date, stressing every part of the serving stack: sparse attention, MoE execution, quantization, multi-token prediction, scheduling, graph capture, and distributed inference. In less than a month, AMD Instinct™ MI355X GPU performance improved by more than 100x on the same silicon through systematic kernel and framework engineering (Figure 1). Today, MI355X GPU delivers leading per-GPU throughput and strong cost-per-token economics on DeepSeek models.
Production Serving Performance
For production inference, the important question is not peak tokens per second alone. It is the throughput a system can deliver while maintaining the interactivity users expect. On DeepSeek V4 Pro, MI355X GPU with ATOM matches or exceeds B200 and B300 with Dynamo vLLM on per-GPU throughput across most of the interactivity range (Figure 2). Results are based on a PR submitted to InferenceX, not yet merged upstream.
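One way to make this comparison concrete is to read off, for each system, the highest per-GPU throughput among the benchmark points that still meet an interactivity floor. The sketch below is a minimal illustration, assuming interactivity is measured as output tokens per second per user and that each benchmark point is an (interactivity, per-GPU throughput) pair. The `SweepPoint` type, the `best_throughput_at` function, and the sample values are hypothetical placeholders, not measurements from Figure 2.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SweepPoint:
    """One benchmark operating point (hypothetical structure)."""
    interactivity: float       # output tokens/s per user (assumed definition)
    throughput_per_gpu: float  # output tokens/s per GPU


def best_throughput_at(points: Iterable[SweepPoint],
                       min_interactivity: float) -> Optional[float]:
    """Return the highest per-GPU throughput among points that meet the
    interactivity floor, or None if no point qualifies."""
    eligible = [p.throughput_per_gpu
                for p in points
                if p.interactivity >= min_interactivity]
    return max(eligible) if eligible else None


# Placeholder values for illustration only; NOT benchmark results.
sweep = [
    SweepPoint(interactivity=20.0, throughput_per_gpu=1000.0),
    SweepPoint(interactivity=50.0, throughput_per_gpu=600.0),
    SweepPoint(interactivity=100.0, throughput_per_gpu=200.0),
]

print(best_throughput_at(sweep, min_interactivity=50.0))
```

Comparing two systems at the same floor, rather than at their individual peaks, is what keeps the comparison tied to the user experience each application requires.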
Cost per Token and Fleet Economics
Cost per token is where infrastructure decisions become business decisions. The right evaluation uses a transparent ownership model at the interactivity target each application requires.
On DeepSeek R1 0528 (FP4, 8K/1K, MTP enabled), MI355X GPU (with MoRI + SGLang + MTP) delivers equivalent or lower cost per million output tokens compared to GB300, GB200, B200, and B300 (running Dynamo + TRT + MTP) at production interactivity levels under a hyperscaler ownership cost model (Figure 3).
On DeepSeek V4 Pro (FP4, 8K/1K), MI355X GPU (with ATOM) delivers equivalent or lower cost per million output tokens compared to GB300, GB200, B200, and B300 (running Dynamo + vLLM) at production interactivity levels under a hyperscaler ownership cost model (Figure 4). Results are based on a PR submitted to InferenceX, not yet merged upstream.
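The arithmetic behind a cost-per-million-tokens figure can be sketched as follows. This is a simplified illustration, assuming the ownership model reduces to a single all-in hourly cost per GPU and that throughput is the per-GPU output rate at the chosen interactivity target. The function name, parameters, and input values are hypothetical and do not reproduce the ownership model behind Figures 3 and 4.

```python
import math


def cost_per_million_output_tokens(hourly_cost_per_gpu: float,
                                   throughput_per_gpu: float) -> float:
    """Dollars per one million output tokens for a single GPU.

    hourly_cost_per_gpu: assumed all-in ownership cost in $/GPU-hour.
    throughput_per_gpu:  output tokens/s per GPU at the interactivity target.
    """
    if throughput_per_gpu <= 0:
        raise ValueError("throughput_per_gpu must be positive")
    tokens_per_hour = throughput_per_gpu * 3600.0
    return hourly_cost_per_gpu / tokens_per_hour * 1_000_000.0


# Placeholder inputs for illustration only; NOT real pricing or benchmark data.
cost = cost_per_million_output_tokens(hourly_cost_per_gpu=3.60,
                                      throughput_per_gpu=1000.0)
print(f"${cost:.2f} per million output tokens")
```

Because throughput sits in the denominator, a system with a higher hourly cost can still come out cheaper per token if it sustains proportionally more throughput at the same interactivity target.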
The Path Forward
The MI355X GPU improvement curve showed that the software stack can move fast when kernel engineering, framework integration, and benchmark feedback loops are aligned. Support for vLLM and SGLang is a top priority for AMD Instinct GPUs. ATOM, which is open source, gives us a speed-of-light path for experimentation and optimization before upstreaming improvements to vLLM and SGLang.
Day 0 is one day. Production is every day after that. MI355X GPU wins where it matters.