MLPerf 6.0: AMD Instinct™ MI355X GPUs Surpass 1M Tokens/Sec, Power New Workloads and Demonstrate Distributed Inference

Apr 01, 2026

Liquid-cooled multi-GPU server with MI355X GPUs alongside text highlighting MLPerf 6.0 GenAI performance results

AMD MLPerf Inference 6.0: What’s New and Why It Matters

In its MLPerf Inference 6.0 submission, AMD did not simply revisit familiar benchmarks with a faster GPU. It expanded into first-time workloads, crossed the 1-million-tokens-per-second threshold at multinode scale, and showed that partners can reproduce the results across a broader ecosystem.

That combination matters because our customers no longer evaluate inference platforms on one metric alone. They want competitive single-node performance, efficient scale-out, faster bring-up on new models, reproducible results across partner systems and confidence that the software stack can keep pace. MLPerf Inference 6.0 let us show all of that in one submission.

Just as important, we showed that these results are not isolated. A broad partner ecosystem submitted results across four AMD Instinct™ GPU types that closely reproduced numbers submitted by AMD, and the first three-GPU heterogeneous MLPerf submission demonstrated that AMD hardware and AMD ROCm™ software can orchestrate meaningful inference throughput even across systems in different geographies.

AMD Instinct MI355X GPUs: Designed for Inference from the Ground Up

AMD Instinct™ MI355X GPUs are built on the AMD CDNA™ 4 architecture using a 3nm process, pack 185 billion transistors, add FP4 and FP6 support, and pair all of that with up to 288GB of HBM3E memory.

MI355X GPU diagram highlighting CDNA4, 3nm process, 185B transistors, FP4/FP6 support, and HBM3E memory
Infographic lists MI355X benefits: 10PF FP4/FP6 performance, 288GB HBM3E, supports 520B models, UBB8 deployment

With up to 10 petaflops of FP4 and FP6 performance, support for models up to 520 billion parameters on a single GPU and an industry-standard UBB8 node available in both air-cooled and direct liquid-cooled configurations, AMD Instinct MI355X GPUs are built to deliver more than speed. They are also designed to deliver large-model capacity and deployment readiness in one platform.
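As a rough illustration of the sizing math behind the 520-billion-parameter claim, here is a minimal Python sketch following the methodology described in footnote 4: model weights at a given number of bytes per parameter, plus 10% overhead, measured against GPU memory capacity. The function name and the FP4 example are ours, not part of the submission:

```python
import math

def gpus_needed(params_billion: float, bytes_per_param: float,
                gpu_mem_gb: float = 288.0, overhead: float = 0.10) -> int:
    """Minimum GPU count, by memory capacity alone, to hold the weights."""
    needed_gb = params_billion * bytes_per_param * (1 + overhead)
    return math.ceil(needed_gb / gpu_mem_gb)

# A 520B-parameter model at FP4 (0.5 bytes/parameter):
# 520 * 0.5 * 1.1 = 286 GB, which fits in one 288GB MI355X GPU.
print(gpus_needed(520, 0.5))  # -> 1

# The same model at FP16 (2 bytes/parameter) needs ~1,144 GB of weights.
print(gpus_needed(520, 2.0))  # -> 4
```

This is capacity arithmetic only; a real deployment also needs headroom for KV cache and activations.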

Defining Moments from the AMD MLPerf Inference 6.0 Submission

MLPerf Inference 6.0 results from AMD go well beyond a single proof point, revealing meaningful progress across performance, model coverage, scale and reproducibility. Several breakthroughs stand out:

1. AMD Breaks the 1M Tokens/Sec Barrier in MLPerf Inference

One of the biggest milestones in this round is that for the first time, AMD surpassed 1 million tokens per second in the MLPerf Inference benchmark. AMD crossed that threshold on Llama 2 70B in both Server and Offline benchmarks, and on GPT-OSS-120B in Offline – all at multinode scale on AMD Instinct MI355X GPUs.

The industry increasingly evaluates inference at cluster scale, where aggregate throughput and time-to-serve determine whether infrastructure is ready for deployment. Surpassing 1 million tokens per second demonstrates production-class inference throughput.

Infographic shows MI355X surpassing 1 million tokens/sec in MLPerf 6.0 across Llama 2-70B and GPT-OSS-120B workloads
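To make the scale concrete with back-of-the-envelope math (the per-stream rate here is an illustrative assumption, not an MLPerf figure): if a single served user stream decodes at roughly 50 tokens per second, then

\[
\frac{1{,}000{,}000\ \text{tokens/s}}{50\ \text{tokens/s per stream}} \approx 20{,}000\ \text{concurrent streams}
\]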

For customers, this milestone has clear benefits:

  • Higher aggregate throughput for serving large user populations and larger models.
  • A clear proof point that AMD Instinct MI355X GPUs can sustain performance when deployments move beyond a single box.
  • Validation that first-time workloads like GPT-OSS can be enabled quickly and still scale to meaningful production output.
  • A stronger foundation for the next generation of multinode and rack-scale inference deployments.

2. AMD Instinct MI355X GPUs Deliver a Clear Generational Leap vs. Previous Gen

AMD also demonstrated a major generational uplift on Llama 2 70B Server. The AMD Instinct MI355X GPU delivered 100,282 tokens per second, 3.1x the throughput of the previously submitted AMD Instinct™ MI325X GPU results.

Bar chart shows MI355X FP4 reaching 100,282 tokens/sec vs MI325X FP8 at 32,028, up to 3.1x higher throughput
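The headline ratio follows directly from the two submitted throughputs:

\[
\frac{100{,}282\ \text{tokens/s (MI355X, FP4)}}{32{,}028\ \text{tokens/s (MI325X, FP8)}} \approx 3.13\times
\]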

That is a meaningful jump in six months, and it reflects the power of the full stack: AMD CDNA 4 architecture, high compute density, support for FP4 and FP6, large HBM3E capacity and AMD ROCm™ software optimizations tuned for modern large language model inference.

3. Llama 2 70B Shows Broad Single-Node Competitiveness

On Llama 2 70B, the most recognized large language model benchmark in MLPerf, the AMD Instinct MI355X Platform delivered highly competitive single-node results against both NVIDIA B200 and B300 GPUs. Against B200, the AMD Instinct MI355X platform tied in Offline, delivered 97% of Server performance and reached 119% of Interactive benchmark performance. Against B300 single-node, the AMD Instinct MI355X platform reached 93% in Server, 92% in Offline and 104% in Interactive.

Bar charts show MI355X vs NVIDIA B200/B300 with up to 119% interactive performance and near parity in other modes

Especially important is the breadth of the results. This is not a one-scenario story. AMD shows competitiveness across batch throughput in Offline, sustained throughput in Server and responsiveness in Interactive.

4. GPT-OSS-120B Demonstrates Fast First-Time Model Bring-Up

GPT-OSS-120B is among the most exciting parts of this Inference 6.0 submission because it was run in MLPerf for the first time. First-time model enablement is difficult: the model must be brought up, optimized, validated for accuracy and pushed to competitive performance within the MLPerf submission window.

Bar charts show MI355X vs NVIDIA B200/B300 with 102–115% offline/server gains and 82–91% vs B300

Even with that complexity, the AMD Instinct MI355X platform delivered 111% of NVIDIA B200 Offline performance and 115% of NVIDIA B200 Server performance at single-node scale. Against NVIDIA B300 single-node, the AMD Instinct MI355X platform reached a competitive 91% in Offline and 82% in Server.

5. Wan-2.2-t2v Extends AMD into All-New Text-to-Video Inference

MLPerf Inference 6.0 also let AMD expand beyond large language models (LLMs) into text-to-video generation with a first-time submission on Wan-2.2-t2v. This benchmark has two tests: Offline and Single Stream. For this submission we focused our effort on the Single Stream scenario, and as a result our submission is in the Open category rather than Closed (which requires both Offline and Single Stream). However, our Single Stream run did satisfy the Closed submission rules and can therefore be directly compared with scores in the Closed division.

Even so, the result is impressive for a first-time AMD effort on a brand-new workload category: The AMD Instinct MI355X platform achieved 93% of NVIDIA B200 single-node performance and 87% of NVIDIA B300 single-node performance in Single Stream. After the deadline, additional tuning moved Single Stream to 108% of B200 and parity with B300, while unofficial Offline results reached 111% of B200 and 88% of B300. Post-deadline numbers were not part of the official MLPerf submission and were not verified by MLCommons, but they clearly show how quickly performance improved once we had more time to tune.  

Bar charts compare MI355X vs NVIDIA B200 and B300 showing up to 108% and tie or 87% performance in single stream tests
Bar charts show unofficial MI355X results with 111% vs B200 and 88% vs B300 in offline inference

For customers, the importance of this result goes beyond the percentages themselves. It shows that AMD is expanding model coverage from LLMs into newer multimodal and generative video workloads and doing so with competitive day-one performance.

That matters because generative AI is not standing still. The models customers want to deploy are becoming broader, more multimodal and more specialized, and AMD is showing it can keep pace.

6. Multinode Inference Shows Efficient Scale-Out 

Interest in multinode inference is rising as models get larger, deployments become more demanding and the industry lays the groundwork for rack-scale systems such as the AMD Helios solution. Our MLPerf Inference 6.0 submission shows that the AMD Instinct MI355X is ready for that transition.

On Llama 2 70B, we scaled from one node to 11 nodes and stayed remarkably close to ideal linear scaling.

Bar chart of Llama 2-70B multi-node scaling showing 1,042,110 tokens/sec offline, 1,016,380 server, 785,522 interactive

At 11 nodes and 87 AMD Instinct MI355X GPUs, we delivered 1,042,110 tokens per second in Offline, 1,016,380 tokens per second in Server and 785,522 tokens per second in Interactive. Scale-out efficiency reached 93% in Offline, 93% in Server and 98% in Interactive. Offline scale-out is the more standard path, but Server and Interactive are harder because they must maintain latency requirements as the cluster grows, which makes these results especially compelling.
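For readers who want to check the arithmetic, here is a minimal Python sketch of one plausible way to derive these efficiency figures: per-GPU throughput at cluster scale divided by per-GPU throughput on a single 8-GPU node. The per-GPU normalization is our assumption; the raw throughputs are the published ones.

```python
def scaleout_efficiency(cluster_tput: float, cluster_gpus: int,
                        node_tput: float, node_gpus: int = 8) -> float:
    """Ratio of per-GPU throughput at scale to per-GPU throughput on one node."""
    return (cluster_tput / cluster_gpus) / (node_tput / node_gpus)

# Llama 2 70B Server: 87 GPUs delivering 1,016,380 tokens/s, against
# the single-node (8-GPU) result of 100,282 tokens/s.
eff = scaleout_efficiency(1_016_380, 87, 100_282)
print(f"{eff:.0%}")  # -> 93%, matching the reported Server efficiency
```

The same calculation applies to the GPT-OSS-120B cluster results described below.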

This is exactly the kind of stepping stone we need as inference moves toward AMD Helios and future rack-scale deployments.

Multinode results continue on GPT-OSS-120B, and they are even more interesting because this was our first GPT-OSS multinode submission. The question was not just whether we could enable the model, but whether we could scale it efficiently across a real cluster. At 12 nodes and 94 AMD Instinct MI355X GPUs, we delivered 1,031,070 tokens per second in Offline and 900,054 tokens per second in Server. Just as important, we stayed close to ideal 12x scaling, with 92% efficiency in Offline and 93% in Server. That made GPT-OSS the second model on which we crossed the 1-million-tokens-per-second mark at multinode scale.

Bar chart showing MI355X multi-node scaling from 8 to 94 GPUs with 92–93% efficiency and up to 1,031,070 tokens/sec offline

For customers, these scale-out results provide important proof points:

  • Predictable multinode scaling lets inference clusters grow without losing efficiency as workloads and model sizes increase.
  • Strong server-scale efficiency builds confidence for real-time inference, not just high-throughput batch processing.
  • Higher scale-out efficiency improves GPU utilization, helping lower cost per token and maximizing infrastructure investment.
  • Proven multinode performance gives customers a stronger path from pilot deployments to production-scale AI infrastructure.

Ecosystem Scale and Reproducibility Across Partner Submissions

Another major highlight for AMD in the MLPerf Inference 6.0 submission is ecosystem momentum. This round tied the record for the most partners submitting on AMD Instinct hardware, with nine: Cisco, Dell, Giga Computing, HPE, MangoBoost, MiTAC, Oracle, Supermicro and Red Hat.

Those submissions spanned four AMD Instinct GPU types: MI300X, MI325X, MI350X and MI355X. That breadth shows the ecosystem is not limited to one flagship configuration; it covers multiple generations and multiple deployment models across OEM, ODM and cloud-style platforms.

Infographic shows 9 partners including Dell, Cisco, HPE achieving consistent MI355X performance within 4% across submissions

This reproducibility is especially powerful. On AMD Instinct MI355X GPUs, partner results landed within 4% of the submissions made by AMD, and some landed within 1%, even on workloads run for the first time. That is a very strong signal that these numbers are not fragile lab artifacts; they are reproducible across real partner systems thanks to predictable AMD hardware and AMD ROCm™ software.

For customers, that means more than choice. It means confidence that the performance AMD demonstrates can be re-created across the broader ecosystem, reducing deployment risk and accelerating time to production.

First 3-GPU Heterogeneous Submission Demonstrates Flexible Inference Across Geographies

One of the most forward-looking results is the first MLPerf heterogeneous submission built across three AMD Instinct GPU types: MI300X, MI325X and MI355X. Submitted by Dell and MangoBoost, the configuration reached 141,521 tokens per second on Llama 2 70B Server and 151,843 tokens per second on Llama 2 70B Offline.

Diagram shows MI300X, MI325X, MI355X GPUs combined achieving 141,521 server and 151,843 offline tokens/sec for Llama 2-70B

An especially important detail is geography. The AMD Instinct MI355X platform was located in Dell’s lab in the United States, while the Instinct MI300X and MI325X platforms were in Korea. That makes this more than a mixed-generation inference story; it is also a proof point for orchestration across systems in different geographies.
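The submission does not disclose how the Dell/MangoBoost orchestration layer balances traffic, but a common approach to mixed-generation serving is throughput-weighted request routing. The sketch below is purely illustrative; the pool names and capacity weights are hypothetical assumptions, not measured values:

```python
import random

# Hypothetical serving pools and relative capacities (tokens/s); these
# weights are illustrative, not figures from the MLPerf submission.
pools = {
    "mi355x-us": 100_000,
    "mi325x-kr":  32_000,
    "mi300x-kr":  24_000,
}

def pick_pool() -> str:
    """Route a request to a pool with probability proportional to its
    capacity, so faster generations absorb proportionally more traffic."""
    names, weights = zip(*pools.items())
    return random.choices(names, weights=weights, k=1)[0]

# Distribute 10,000 requests and inspect the resulting split.
counts = {name: 0 for name in pools}
for _ in range(10_000):
    counts[pick_pool()] += 1
print(counts)  # roughly 64% / 21% / 15%, tracking the capacity weights
```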

For customers, the value is obvious:

  • Existing AMD Instinct GPU deployments can be extended rather than ripped out and replaced.
  • Mixed-generation hardware can be orchestrated intelligently to preserve utilization and throughput.
  • AMD ROCm software helps make non-uniform environments more practical, predictable and cost-effective.
  • AMD is demonstrating a path toward flexible, future-ready infrastructure instead of one-shot refresh cycles.

How AMD ROCm Software Drives Performance, Scale and Model Enablement

Every major result in the AMD MLPerf Inference 6.0 submission is connected by one common thread: AMD ROCm software. It turned AMD Instinct MI355X hardware into a deployable platform for competitive single-GPU inference, cluster-scale throughput, heterogeneous orchestration and first-time model bring-up.

In the submission, AMD ROCm software helped enable efficient FP4 execution, optimized GPU-to-GPU communication for multinode scaling, dynamic workload distribution for heterogeneous inference and day-zero model readiness for models like Llama, Wan and GPT-OSS.

Diagram lists ROCm features: optimized FP4 performance, multi-node scaling, GPU-CPU communication, orchestration, model readiness
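As a concrete look at the FP4 data type that much of this work leans on: FP4 in modern inference stacks is commonly the OCP E2M1 format (1 sign bit, 2 exponent bits, 1 mantissa bit), which encodes only sixteen values, so weights take a quarter of the memory of FP16. The decoder below is a sketch assuming E2M1 encoding; it is illustrative, not ROCm code:

```python
def decode_fp4_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 value: 1 sign, 2 exponent, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exp == 0:                       # subnormal: 0 or 0.5
        return sign * 0.5 * mantissa
    return sign * (1.0 + 0.5 * mantissa) * 2.0 ** (exp - 1)

# The eight positive magnitudes an E2M1 value can take:
print([decode_fp4_e2m1(c) for c in range(8)])
# -> [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

In practice, such 4-bit elements are paired with per-block scale factors (as in the OCP microscaling formats) to recover dynamic range.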

For customers, AMD ROCm software delivers several practical advantages:

  • Optimized model performance for modern generative AI data types and inference kernels.
  • Seamless multinode scaling with efficient communication and orchestration.
  • Heterogeneous workload distribution across AMD Instinct GPU generations.
  • Faster readiness for emerging models and new workload categories.

That is why ROCm software is such an important part of the AMD story. It is not merely helping AMD post one strong benchmark result; it enables performance, scalability, flexibility and reproducibility across the entire Instinct portfolio.

Annual Cadence Is Building Toward AMD Instinct MI400 Series and Helios Rack-Scale Solutions

The broader context behind these results is momentum. AMD is executing on an annual cadence for Instinct GPUs – and that consistency matters. AMD Instinct MI300X GPUs established a strong generative AI foothold in 2023. AMD Instinct MI325X GPUs extended that foundation in 2024 with higher compute and HBM3E. In 2025, the AMD Instinct MI350 Series, including the MI355X GPU, pushed the platform forward again with new AI data types, larger-model capacity and the inference gains shown throughout this submission.

Roadmap diagram shows AMD Instinct GPUs from MI300X (2023) to MI400 (2026) with CDNA3 to CDNA5 architecture evolution

And in 2026, AMD plans to advance to the AMD Instinct™ MI400 Series GPUs based on next-generation AMD CDNA™ 5 architecture, extending this annual cadence into the next era of rack-scale AI and laying the foundation for the AMD Helios rack-scale solution. For customers, that provides confidence that AMD is not simply delivering a strong result today but building a long-term inference platform that can scale with model size, workload diversity and production deployment requirements.

Final Takeaway

The AMD MLPerf Inference 6.0 submission marks a major step forward for AMD and its generative AI story. Across all-new workloads, AMD Instinct MI355X GPUs delivered highly competitive single-node results, incredibly efficient multinode scale-out performance, first-time model bring-up on GPT-OSS-120B and Wan-2.2-t2v, and a milestone of delivering more than 1 million tokens per second at cluster scale.

At the center of all of this is AMD ROCm software, and a disciplined annual roadmap cadence. From the AMD Instinct MI300X GPU to the MI325X GPU to the MI350 Series featuring the AMD Instinct MI355X GPU, AMD is moving quickly, expanding model support, and building the software and systems foundation needed for future rack-scale AI deployments. With the AMD Helios rack-scale solution powered by the AMD Instinct™ MI400 Series and future AMD Instinct™ generations on the horizon, MLPerf Inference 6.0 reinforces a clear message: AMD is not just participating in the generative AI inference transition, it is helping define what production-ready GenAI infrastructure looks like.

For readers interested in the deeper technical story behind these results, two new AMD ROCm blogs provide additional detail on the MLPerf Inference v6.0 submission. “AMD Instinct™ GPUs MLPerf Inference v6.0 Submission” explores the hardware and software work behind these results, while “Reproducing the AMD MLPerf Inference v6.0 submission result” walks through how to replicate the benchmarks on AMD Instinct hardware, reinforcing AMD’s commitment to openness, transparency and reproducibility in AI inference.

Footnotes
  1. For additional information about AMD MLPerf 6.0 results as of April 1, 2026, visit: https://mlcommons.org/benchmarks/inference-datacenter/.
  2. MI350-006 - Calculations by AMD Performance Labs in May 2025, based on the published memory capacity specifications of the AMD Instinct™ MI350X / MI355X OAM GPU vs. an NVIDIA Hopper H200 GPU. Server manufacturers may vary configurations, yielding different results. MI350-006
  3. MI350-004 - Based on calculations by AMD Performance Labs in May 2025, to determine the peak theoretical precision performance of eight (8) AMD Instinct™ MI355X and MI350X GPUs (Platform) and eight (8) AMD Instinct MI325X, MI300X, MI250X and MI100 GPUs (Platform) using the FP16, FP8, FP6 and FP4 datatypes with Matrix. Server manufacturers may vary configurations, yielding different results. Results may vary based on use of the latest drivers and optimizations. MI350-004
  4. MI350-012 - Based on calculations by AMD as of April 17, 2025, using the published memory specifications of the AMD Instinct MI350X / MI355X GPUs (288GB) vs. MI300X (192GB) vs. MI325X (256GB). Calculations performed with the FP16 precision datatype at (2) bytes per parameter, to determine the minimum number of GPUs (based on memory size) required to run the following LLMs: OPT (130B parameters), GPT-3 (175B parameters), BLOOM (176B parameters), Gopher (280B parameters), PaLM 1 (340B parameters), Generic LM (420B, 500B, 520B, 1.047T parameters), Megatron-LM (530B parameters), LLaMA (405B parameters) and Samba (1T parameters). Results based on GPU memory size versus memory required by the model at defined parameters, plus 10% overhead. Server manufacturers may vary configurations, yielding different results. Results may vary based on GPU memory configuration, LLM size, and potential variance in GPU memory access or the server operating environment. *All data based on FP16 datatype. For FP8 = X2. For FP4 = X4. MI350-012.
  5. All results are from MLPerf Inference v6.0, which is under embargo until April 1, 2026, 8am PST. The results (with submission IDs as shown) will be published on https://mlcommons.org/benchmarks/inference-datacenter/ on 4/1/2026. Results verified by MLCommons Association. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

GENERAL DISCLAIMER​

The information contained herein is for informational purposes only and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18u.

© 2026 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Instinct, and combinations thereof are trademarks of Advanced Micro Devices, Inc. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information. Other product names used in this publication are for identification purposes only and may be trademarks of their respective owners. Certain AMD technologies may require third-party enablement or activation. Supported features may vary by operating system. Please confirm with the system manufacturer for specific features. No technology or product can be completely secure.

 
