AMD Instinct MI355X GPU Sets a New Bar for DeepSeek Inference

Jun 10, 2026

DeepSeek V4 was one of the most demanding open-model inference launches to date, stressing every part of the serving stack: sparse attention, MoE execution, quantization, multi-token prediction, scheduling, graph capture, and distributed inference. In less than a month, AMD Instinct™ MI355X GPU performance improved by more than 100x on the same silicon through systematic kernel and framework engineering (Figure 1). Today, MI355X GPU delivers leading per-GPU throughput and strong cost-per-token economics on DeepSeek models.

Figure 1: DeepSeek V4 Pro throughput frontier on the MI355X GPU over the first several weeks. Source: InferenceX open-source benchmarks.

Production Serving Performance

For production inference, the important question is not peak tokens per second alone. It is the throughput a system can deliver while maintaining the interactivity users expect. On DeepSeek V4 Pro, MI355X GPU with ATOM matches or exceeds B200 and B300 with Dynamo vLLM on per-GPU throughput across most of the interactivity range (Figure 2). Results are based on a PR submitted to InferenceX, not yet merged upstream.
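Reading throughput at a fixed interactivity target can be sketched in a few lines. The example below is only an illustration under stated assumptions: it treats interactivity as output tokens per second per user, treats a benchmark frontier as a list of (interactivity, throughput per GPU) points, and interpolates linearly between neighboring points. The function name `throughput_at` and the sample points are hypothetical and are not InferenceX data.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

# A frontier point is (interactivity, throughput_per_gpu).
# Assumption: interactivity is in tokens/s/user, throughput in tokens/s/GPU.
Point = Tuple[float, float]


def throughput_at(points: List[Point], target: float) -> Optional[float]:
    """Linearly interpolate per-GPU throughput at a target interactivity.

    Returns None when the target lies outside the measured range, since
    extrapolating a serving frontier would be guesswork.
    """
    pts = sorted(points)
    if not pts:
        return None
    xs = [p[0] for p in pts]
    if target < xs[0] or target > xs[-1]:
        return None
    i = bisect_left(xs, target)
    if xs[i] == target:
        return pts[i][1]
    (x0, y0), (x1, y1) = pts[i - 1], pts[i]
    return y0 + (y1 - y0) * (target - x0) / (x1 - x0)


# Hypothetical frontier, for illustration only (not measured results).
example_frontier = [(20.0, 1000.0), (40.0, 600.0), (80.0, 200.0)]
print(throughput_at(example_frontier, 30.0))
```

Comparing two systems at the same target, rather than at each system's own peak, is what keeps the comparison tied to the user experience an application requires.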

Figure 2: Throughput per GPU vs. Interactivity. DeepSeek V4 Pro 1.6T • FP4 • 8K/1K • Unofficial. Source: InferenceX, PR submitted, not yet merged upstream.

Cost per Token and Fleet Economics

Cost per token is where infrastructure decisions become business decisions. The right evaluation uses a transparent ownership model at the interactivity target each application requires.

Starting with DeepSeek R1 0528 (FP4, 8K/1K, MTP enabled), MI355X GPU (with MoRI + SGLang + MTP) delivers equivalent or lower cost per million output tokens compared to GB300, GB200, B200, and B300 (running Dynamo + TRT + MTP) at production interactivity levels under a hyperscaler ownership cost model (Figure 3).

On DeepSeek V4 Pro (FP4, 8K/1K), MI355X GPU (with ATOM) delivers equivalent or lower cost per million output tokens compared to GB300, GB200, B200, and B300 (running Dynamo + vLLM) at production interactivity levels under a hyperscaler ownership cost model (Figure 4). Results are based on a PR submitted to InferenceX, not yet merged upstream.
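The arithmetic behind a cost-per-token figure can be sketched as follows. This is a simplified illustration, not the hyperscaler ownership cost model used in the figures: it assumes a single amortized hourly cost per GPU and a steady per-GPU output throughput at the chosen interactivity target. The function name and the dollar and throughput values are hypothetical.

```python
def cost_per_million_output_tokens(
    gpu_cost_per_hour: float,
    output_tokens_per_sec_per_gpu: float,
) -> float:
    """Convert an hourly per-GPU cost and a per-GPU output throughput
    into dollars per million output tokens.

    Assumption: gpu_cost_per_hour is a flat, fully amortized figure;
    real ownership models include more terms than this sketch does.
    """
    if output_tokens_per_sec_per_gpu <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = output_tokens_per_sec_per_gpu * 3600.0
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs for illustration only: $2.00 per GPU-hour and
# 1,000 output tokens/s per GPU at the chosen interactivity target.
example_cost = cost_per_million_output_tokens(2.00, 1000.0)
print(f"${example_cost:.4f} per million output tokens")
```

Because throughput per GPU falls as the interactivity target rises, the same hourly cost yields a different cost per token at each target, which is why the comparison is made at the interactivity level the application needs.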

Figure 3: Cost per Million Output Tokens vs. Interactivity. DeepSeek R1 0528 • FP4 • 8K/1K • MTP • Owning - Hyperscaler Cost Model. Source: InferenceX open-source benchmarks. Lower is better.

Figure 4: Cost per Million Output Tokens vs. Interactivity. DeepSeek V4 Pro • FP4 • 8K/1K • Owning - Hyperscaler Cost Model • Unofficial. Source: InferenceX, PR submitted, not yet merged upstream. Lower is better.

The Path Forward

The MI355X GPU improvement curve showed that the software stack can move fast when kernel engineering, framework integration, and benchmark feedback loops are aligned. vLLM and SGLang are top priorities for AMD Instinct GPUs. ATOM, which is open source, gives us a speed-of-light path for experimentation and optimization before upstreaming improvements to vLLM and SGLang.

Day 0 is one day. Production is every day after that. MI355X GPU wins where it matters.

Article By

Ramine Roane

Andy Luo
