Single Node and Distributed Inference Performance on AMD Instinct MI355X GPU
Jan 06, 2026
Introduction
The rapid growth of GenAI and LLM workloads, from agentic workflows and multi-step tool use to retrieval-augmented reasoning, demands inference infrastructure that is fast, adaptable, and highly optimized.
AMD addressed these challenges on the AMD Instinct™ MI355X GPU by combining three tightly integrated elements:
- Highly optimized and fused kernels for modern LLM primitives such as MLA attention, sparse MoE experts, and block-scale GEMMs, tuned specifically for Instinct GPUs (a minimal block-scaling sketch follows this list)
- A lightweight AMD inference engine, ATOM, that orchestrates scheduling, batching, and KV-cache management and can be run directly by users; it is open source at https://github.com/ROCm/ATOM/
- High-performance distributed inference support, optimized for MoE dispatch, expert aggregation, and KV traffic across nodes
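To make the block-scale GEMM primitive above more concrete, the sketch below shows the block-wise FP8 quantization that such kernels consume: the weight matrix is split into 128x128 tiles, and each tile gets its own scale so its values fit the FP8 E4M3 dynamic range. This is a minimal plain-PyTorch illustration under assumed conventions (128x128 blocks and an E4M3 maximum of 448, as popularized by DeepSeek-V3); it is not the fused Instinct kernels or ATOM internals.

```python
import torch

FP8_E4M3_MAX = 448.0   # largest magnitude representable in float8_e4m3fn
BLOCK = 128            # per-block scaling granularity (assumed, DeepSeek-V3 style)

def blockwise_fp8_quant(w: torch.Tensor):
    """Quantize a 2-D weight with one scale per BLOCK x BLOCK tile.

    Returns the FP8 tensor and a (rows/BLOCK, cols/BLOCK) grid of scales that a
    block-scale GEMM kernel would consume alongside the quantized weights.
    """
    rows, cols = w.shape
    assert rows % BLOCK == 0 and cols % BLOCK == 0
    # View as (row_blocks, BLOCK, col_blocks, BLOCK) so dims 1 and 3 index within a tile.
    tiles = w.reshape(rows // BLOCK, BLOCK, cols // BLOCK, BLOCK)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_E4M3_MAX
    q = (tiles / scale).to(torch.float8_e4m3fn).reshape(rows, cols)
    return q, scale[:, 0, :, 0]

# Example: quantize a 512x512 weight and check the reconstruction error.
w = torch.randn(512, 512)
q, scales = blockwise_fp8_quant(w)
tiles = q.reshape(4, BLOCK, 4, BLOCK).to(torch.float32)
deq = (tiles * scales.reshape(4, 1, 4, 1)).reshape(512, 512)
print("max abs error:", (w - deq).abs().max().item())
```

Because each tile carries its own scale, quantization error stays local to that tile, which is one reason block-scaled formats hold up well on MoE expert weights whose magnitudes can vary widely from expert to expert.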
While AMD continues to invest in general-purpose inference frameworks such as vLLM and SGLang, ATOM provides the most direct path to peak MI355X GPU performance for modern reasoning and MoE-heavy workloads, which increasingly dominate frontier LLM architectures.
Over the past several months, AMD has implemented numerous optimizations to improve both single-node and multi-node distributed inference performance for DeepSeek-R1 on the MI355X GPU.
Single Node Inference
DeepSeek-R1 is a reasoning-focused, open-source model that combines MLA attention with sparse MoE experts, placing heavy demands on both compute efficiency and memory bandwidth.
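A quick back-of-the-envelope calculation shows why MLA eases the memory-bandwidth pressure: instead of caching full per-head keys and values, MLA caches a compressed latent plus a small decoupled RoPE key. The sketch below uses the published DeepSeek-V3/R1 dimensions (61 layers, 128 heads of width 128, a 512-dim KV latent, a 64-dim RoPE key) and assumes 16-bit cache entries; it is an illustrative comparison, not a measured result.

```python
# Back-of-the-envelope KV-cache comparison for DeepSeek-R1 (DeepSeek-V3 dimensions).
LAYERS = 61            # num_hidden_layers
HEADS = 128            # num_attention_heads
HEAD_DIM = 128         # per-head key/value dimension
KV_LORA_RANK = 512     # MLA compressed KV latent
ROPE_HEAD_DIM = 64     # decoupled RoPE key dimension
BYTES = 2              # assume bf16/fp16 cache entries

# Standard MHA would cache full K and V for every head in every layer.
mha_per_token = 2 * HEADS * HEAD_DIM * LAYERS * BYTES

# MLA caches only the compressed latent plus the shared RoPE key per layer.
mla_per_token = (KV_LORA_RANK + ROPE_HEAD_DIM) * LAYERS * BYTES

print(f"MHA : {mha_per_token / 1e6:.2f} MB per token")
print(f"MLA : {mla_per_token / 1e6:.3f} MB per token")
print(f"reduction: {mha_per_token / mla_per_token:.0f}x")
```

The roughly 50x smaller per-token cache is what lets long generations and high concurrency fit in HBM while keeping attention reads bandwidth-friendly.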
Across concurrency levels from 4 to 64, the MI355X GPU running ATOM consistently delivers strong inference performance compared to NVIDIA Blackwell B200 systems using existing inference frameworks. These results are driven by kernel-level optimizations such as fused MoE execution, MLA attention fusion, and reduced memory movement across the inference pipeline.
At higher concurrency levels (32 and 64), which are critical for minimizing cost per token in large-scale deployments, the MI355X GPU with ATOM demonstrates particularly strong throughput, matching or exceeding B200 systems running SGLang in these high-throughput regimes.
The MI355X GPU maintains this performance advantage across a range of sequence length configurations, including interactive (1K input / 1K output), throughput-oriented (8K/1K), and long-generation (1K/8K) workloads.
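For readers who want to run a similar sweep themselves, the minimal harness below drives an OpenAI-compatible /v1/completions endpoint at a fixed concurrency level and reports aggregate output-token throughput. It assumes the serving engine (ATOM, vLLM, or SGLang) is already running on localhost:8000 and exposes that standard route; the model name, prompt, and batch-style round structure are placeholder simplifications, not the methodology behind the published numbers.

```python
import asyncio, time
import aiohttp  # pip install aiohttp

URL = "http://localhost:8000/v1/completions"      # assumed OpenAI-compatible server
MODEL = "deepseek-r1"                             # placeholder model name
PROMPT = "Summarize the benefits of MoE models."  # stand-in for a fixed-ISL prompt
OUTPUT_TOKENS = 1024                              # OSL for the 1K-output configurations

async def one_request(session: aiohttp.ClientSession) -> int:
    """Send a single completion request and return the number of generated tokens."""
    payload = {"model": MODEL, "prompt": PROMPT,
               "max_tokens": OUTPUT_TOKENS, "temperature": 0.0}
    async with session.post(URL, json=payload) as resp:
        body = await resp.json()
    return body["usage"]["completion_tokens"]

async def sweep(concurrency: int, rounds: int = 4) -> float:
    """Keep `concurrency` requests in flight per round and return output tokens/second."""
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        total = 0
        for _ in range(rounds):
            results = await asyncio.gather(*(one_request(session)
                                             for _ in range(concurrency)))
            total += sum(results)
        return total / (time.perf_counter() - start)

if __name__ == "__main__":
    for level in (4, 8, 16, 32, 64):
        tps = asyncio.run(sweep(level))
        print(f"concurrency {level:3d}: {tps:8.1f} output tok/s")
```

Varying the prompt length and max_tokens approximates the 1K/1K, 8K/1K, and 1K/8K shapes discussed above; a production benchmark would additionally hold concurrency constant with request replacement and report per-request latency percentiles.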
Multi-Node Distributed Inference
The MI355X GPU also demonstrates competitive performance in multi-node distributed inference for MoE-based workloads.
By combining optimized kernel execution with efficient distributed handling of MoE dispatch, expert aggregation, and KV-cache traffic, the MI355X GPU scales effectively across nodes for various workloads.
In latency-sensitive configurations such as the 1K/1K ISL/OSL case, the MI355X GPU in a 3-node, 1P2D (one prefill node, two decode nodes), EP8 configuration delivers higher throughput per GPU than NVL72 with Dynamo while maintaining similar interactivity, which is beneficial for real-time, user-facing applications. More broadly, the MI355X GPU supports advanced distributed inference strategies, including expert parallelism and prefill/decode disaggregation, and handles MoE dispatch, expert aggregation, and KV-cache traffic efficiently as MoE-heavy workloads scale across nodes.
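To see what MoE dispatch and expert aggregation involve at the bookkeeping level, the sketch below spreads the routed experts of a DeepSeek-style MoE layer across 8 expert-parallel ranks and counts how many token-to-expert assignments each rank receives for one batch. The 256-expert, top-8 routing shape matches the published DeepSeek-V3/R1 architecture; the random router and round-robin placement are simplifications for illustration, not ATOM's implementation.

```python
import torch

NUM_EXPERTS = 256   # routed experts per MoE layer (DeepSeek-V3/R1 shape)
TOP_K = 8           # experts activated per token
EP_RANKS = 8        # expert-parallel group size (EP8)
TOKENS = 4096       # tokens in the current batch

# Round-robin placement: expert e lives on rank e % EP_RANKS (32 experts per rank).
expert_to_rank = torch.arange(NUM_EXPERTS) % EP_RANKS

# Stand-in router: random top-k expert IDs per token (a real router uses gating scores).
topk_ids = torch.randint(0, NUM_EXPERTS, (TOKENS, TOP_K))

# Dispatch: each (token, expert) assignment is sent to the rank owning that expert,
# then the per-expert outputs are gathered back and combined (expert aggregation).
dest_rank = expert_to_rank[topk_ids]                       # (TOKENS, TOP_K)
per_rank_load = torch.bincount(dest_rank.flatten(), minlength=EP_RANKS)

for r, load in enumerate(per_rank_load.tolist()):
    print(f"rank {r}: {load} token-expert assignments "
          f"({load / (TOKENS * TOP_K):.1%} of dispatch traffic)")
```

In a real deployment the dispatch and aggregation steps become all-to-all exchanges across ranks, so keeping this per-rank load balanced is a key ingredient of efficient expert-parallel scaling.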
What is Next?
AMD will continue optimizing existing and emerging AI workloads, including:
- Design space exploration for larger expert-parallel configurations
- Additional operator fusion for large MoE models
- Long-context inference and serving optimizations
Summary
This study demonstrates that the AMD Instinct MI355X GPU delivers competitive and often leading LLM inference performance for modern reasoning-focused MoE models.
For single-node inference, the MI355X GPU running ATOM achieves strong throughput and interactivity across a wide range of concurrency levels and sequence lengths. These results are enabled by optimized kernel execution for key LLM primitives such as MLA attention, sparse MoE experts, and block-scale GEMMs.
In distributed inference, the MI355X GPU scales efficiently for MoE workloads. In latency-sensitive configurations such as the 1K/1K ISL/OSL case, the MI355X GPU in a 1P2D, EP8 configuration delivers higher throughput per GPU than NVL72 with Dynamo at similar interactivity, making it well suited for real-time inference scenarios. The platform also supports advanced parallelism strategies, including expert parallelism and prefill/decode disaggregation, enabling strong performance across both latency-oriented and throughput-oriented deployments.
All optimizations discussed in this study have been upstreamed into the open-source community. Users can reproduce and extend these results and deploy ATOM as a standalone inference engine or as a backend for frameworks such as vLLM and SGLang.
Taken together, these results show that the MI355X GPU is well positioned to support the next wave of GenAI inference workloads, where scalability, efficiency, and software openness increasingly define platform competitiveness.
Footnotes
Disclaimer:
All performance analysis conducted by AMD as of December 2025. This analysis is provided for informational purposes only and should not be relied upon to make a purchasing decision.