AMD Instinct MI355X GPU Sets a New Bar for DeepSeek Inference

Jun 10, 2026

DeepSeek V4 was one of the most demanding open-model inference launches to date, stressing every part of the serving stack: sparse attention, MoE execution, quantization, multi-token prediction, scheduling, graph capture, and distributed inference. In less than a month, AMD Instinct™ MI355X GPU performance improved by more than 100x on the same silicon through systematic kernel and framework engineering (Figure 1). Today, the MI355X GPU delivers leading per-GPU throughput and strong cost-per-token economics on DeepSeek models.

Figure 1: DeepSeek V4 Pro throughput frontier on the MI355X GPU over the first several weeks. Source: InferenceX open-source benchmarks.

Production Serving Performance

For production inference, peak tokens per second alone is not the important question. What matters is the throughput a system can deliver while maintaining the interactivity users expect. On DeepSeek V4 Pro, the MI355X GPU with ATOM matches or exceeds B200 and B300 with Dynamo + vLLM on per-GPU throughput across most of the interactivity range (Figure 2). These results are based on a PR submitted to InferenceX that has not yet been merged upstream.

Figure 2: Throughput per GPU vs. Interactivity. DeepSeek V4 Pro 1.6T • FP4 • 8K/1K • Unofficial. Source: InferenceX, PR submitted, not yet merged upstream.
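
To make the idea of "throughput at an interactivity target" concrete, here is a minimal sketch of how one might read a single operating point off a throughput-versus-interactivity sweep. It assumes interactivity is measured in output tokens per second per user. The `OperatingPoint` type, the `best_throughput_at` helper, and every number in the sweep are hypothetical and are not taken from InferenceX or from the figures in this post.

```python
from typing import NamedTuple, Optional, Sequence


class OperatingPoint(NamedTuple):
    """One benchmark configuration (hypothetical structure)."""
    interactivity: float       # assumed unit: output tokens/s per user
    throughput_per_gpu: float  # output tokens/s per GPU


def best_throughput_at(
    points: Sequence[OperatingPoint], min_interactivity: float
) -> Optional[OperatingPoint]:
    """Return the highest-throughput point that still meets the target.

    Returns None when no configuration reaches the requested interactivity.
    """
    eligible = [p for p in points if p.interactivity >= min_interactivity]
    if not eligible:
        return None
    return max(eligible, key=lambda p: p.throughput_per_gpu)


# Made-up sweep: throughput falls as interactivity rises.
sweep = [
    OperatingPoint(interactivity=20.0, throughput_per_gpu=4000.0),
    OperatingPoint(interactivity=40.0, throughput_per_gpu=2500.0),
    OperatingPoint(interactivity=80.0, throughput_per_gpu=1000.0),
    OperatingPoint(interactivity=120.0, throughput_per_gpu=400.0),
]

choice = best_throughput_at(sweep, min_interactivity=50.0)
print(choice)
```

Comparing two systems this way means comparing the selected points at the same target, rather than comparing each system's peak throughput.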

Cost per Token and Fleet Economics

Cost per token is where infrastructure decisions become business decisions. The right evaluation uses a transparent ownership model at the interactivity target each application requires.
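
As an illustration of the arithmetic behind a cost-per-token comparison, the sketch below converts an hourly per-GPU ownership cost and a per-GPU throughput into cost per million output tokens. The function name and the example inputs are hypothetical. They do not reproduce the hyperscaler ownership cost model used in the figures, which this post does not spell out.

```python
def cost_per_million_tokens(
    gpu_cost_per_hour: float, tokens_per_second_per_gpu: float
) -> float:
    """Cost in dollars per one million output tokens for a single GPU.

    gpu_cost_per_hour: assumed all-in ownership cost per GPU-hour (USD).
    tokens_per_second_per_gpu: sustained output throughput at the chosen
        interactivity target.
    """
    if tokens_per_second_per_gpu <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_second_per_gpu * 3600.0
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000.0


# Made-up inputs: $2.00 per GPU-hour at 1,000 output tokens/s per GPU.
example_cost = cost_per_million_tokens(2.00, 1000.0)
print(f"${example_cost:.4f} per million output tokens")
```

Because throughput sits in the denominator, the throughput value should be the one measured at the interactivity target the application needs, not the system's peak.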

Starting with DeepSeek R1 0528 (FP4, 8K/1K, MTP enabled), the MI355X GPU (with MoRI + SGLang + MTP) delivers equivalent or lower cost per million output tokens compared to GB300, GB200, B200, and B300 (running Dynamo + TRT + MTP) at production interactivity levels under a hyperscaler ownership cost model (Figure 3).

On DeepSeek V4 Pro (FP4, 8K/1K), the MI355X GPU (with ATOM) delivers equivalent or lower cost per million output tokens compared to GB300, GB200, B200, and B300 (running Dynamo + vLLM) at production interactivity levels under a hyperscaler ownership cost model (Figure 4). These results are based on a PR submitted to InferenceX that has not yet been merged upstream.

Figure 3: Cost per Million Output Tokens vs. Interactivity. DeepSeek R1 0528 • FP4 • 8K/1K • MTP • Owning - Hyperscaler Cost Model. Source: InferenceX open-source benchmarks. Lower is better.
Figure 4: Cost per Million Output Tokens vs. Interactivity. DeepSeek V4 Pro • FP4 • 8K/1K • Owning - Hyperscaler Cost Model • Unofficial. Source: InferenceX, PR submitted, not yet merged upstream. Lower is better.

The Path Forward

The MI355X GPU improvement curve showed that the software stack can move fast when kernel engineering, framework integration, and benchmark feedback loops are aligned. vLLM and SGLang are top priorities for AMD Instinct GPUs. ATOM, which is open source, gives us a speed-of-light path for experimentation and optimization before upstreaming improvements to vLLM and SGLang.

Day 0 is one day. Production is every day after that. The MI355X GPU wins where it matters.
