This page summarizes performance measurements on AMD Instinct™ GPUs running popular AI models.
The results on this page cover both inference and training benchmarks, organized as follows:
- AI Inference: vLLM
- AI Training: PyTorch, Megatron-LM, and JAX MaxText
The hardware platforms include AMD Instinct™ MI355X, MI325X, and MI300X GPUs, with benchmark insights provided for each framework where data is available.
The data in the following tables serves as a reference point to help users evaluate observed performance; it should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.
AI Inference
vLLM
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103
- Release date: November 3, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04, amdgpu driver 6.8.5
Throughput Measurements
The table below shows throughput measurements for a client-server scenario under maximum load, where a local inference client issues requests at an infinite rate so the server is kept fully saturated.
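For orientation, the following minimal sketch shows how such a maximum-load measurement can be approximated with vLLM's offline Python API. This is not the harness used to produce the numbers below; the model path and request shape mirror the first table row, and details such as exact token accounting or extra engine flags for FP8 checkpoints may differ in the official benchmark.

```python
# Hedged sketch: offline maximum-load throughput with vLLM's Python API.
# Assumes a ROCm build of vLLM (e.g. the Docker container listed above).
import time
from vllm import LLM, SamplingParams

MODEL = "amd/Llama-3.1-70B-Instruct-FP8-KV"   # mirrors the first table row
INPUT_LEN, OUTPUT_LEN, NUM_PROMPTS = 128, 2048, 3200

# Submitting every prompt at once keeps the engine saturated, which
# approximates the "infinite request rate" scenario described above.
llm = LLM(model=MODEL, tensor_parallel_size=8, max_num_seqs=3200)
prompts = [" ".join(["hello"] * INPUT_LEN)] * NUM_PROMPTS  # ~INPUT_LEN tokens each
params = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generated tokens/s: {generated / elapsed:.1f}")
```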
| Model | Precision | TP¹ Size | Input | Output | No. Prompts | Max. Seqs | Throughput² |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 13279.6 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 4096 | 1500 | 1500 | 11449.7 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 500 | 2000 | 2000 | 2000 | 11347.4 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 2048 | 2048 | 1500 | 1500 | 7651.7 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 3816.8 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 4096 | 1500 | 1500 | 3099.6 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 500 | 2000 | 2000 | 2000 | 3026.1 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 2048 | 2048 | 500 | 500 | 2196.4 |
Latency Measurements
The table below shows latency measurements: the end-to-end time from when the system receives an input to when the model produces its complete result.
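As a rough illustration (again using vLLM's offline Python API rather than the official harness), end-to-end latency for one batch-size/input/output configuration can be measured like this:

```python
# Hedged sketch: end-to-end batch latency with vLLM's Python API.
# Assumes the same ROCm vLLM setup as the throughput sketch above.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="amd/Llama-3.1-70B-Instruct-FP8-KV", tensor_parallel_size=8)
batch = [" ".join(["hello"] * 128)] * 8        # batch size 8, ~128 input tokens
params = SamplingParams(max_tokens=2048, ignore_eos=True)

llm.generate(batch, params)                    # warm-up pass
start = time.perf_counter()
llm.generate(batch, params)                    # timed pass: input -> full result
print(f"end-to-end latency: {time.perf_counter() - start:.3f} s")
```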
| Model | Precision | TP¹ Size | Batch Size | Input | Output | Latency² |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 16.154 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 2 | 128 | 2048 | 18.041 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 4 | 128 | 2048 | 18.322 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 8 | 128 | 2048 | 20.800 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 16 | 128 | 2048 | 21.850 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 32 | 128 | 2048 | 25.513 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 64 | 128 | 2048 | 32.539 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 128 | 2048 | 45.193 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 2048 | 2048 | 16.256 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 2 | 2048 | 2048 | 18.084 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 4 | 2048 | 2048 | 18.851 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 8 | 2048 | 2048 | 20.930 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 16 | 2048 | 2048 | 23.079 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 32 | 2048 | 2048 | 26.873 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 64 | 2048 | 2048 | 34.585 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 2048 | 51.856 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 48.138 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 2 | 128 | 2048 | 48.366 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 4 | 128 | 2048 | 49.790 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 8 | 128 | 2048 | 53.546 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 16 | 128 | 2048 | 55.685 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 32 | 128 | 2048 | 67.445 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 64 | 128 | 2048 | 86.597 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 128 | 2048 | 120.387 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 2048 | 2048 | 48.555 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 2 | 2048 | 2048 | 48.348 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 4 | 2048 | 2048 | 49.828 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 8 | 2048 | 2048 | 53.415 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 16 | 2048 | 2048 | 57.398 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 32 | 2048 | 2048 | 68.519 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 64 | 2048 | 2048 | 90.234 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 2048 | 130.518 |
Reproduce these results on your system by following these instructions:
Previous Versions
This table lists previous versions of the ROCm vLLM inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Docker image tag |
| rocm/vllm:rocm7.0.0_vllm_0.11.1_20251024 (latest) |
| rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006 |
| rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812 |
| rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715 |
| rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702 |
| rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605 |
| rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521 |
| rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513 |
| rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415 |
| rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325 |
| rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6 |
| rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 |
| rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50 |
AI Training
The tables below show training performance data: text-generation training throughput measured on AMD Instinct™ platforms at a given sequence length and batch size, reported as tokens per second per GPU (see the sketch after this list). Results are provided for:
- PyTorch
- Megatron-LM
- JAX MaxText
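As a point of reference, the sketch below (illustrative only, not part of any benchmark harness) shows how tokens per second per GPU relates to the per-GPU batch size, sequence length, and training step time behind the tables:

```python
# Illustrative helper: relate the tables' tokens/sec/GPU metric to the
# per-GPU batch size, sequence length, and measured step time. The step
# time used below is back-derived from the table, not a measured value.
def tokens_per_sec_per_gpu(batch_per_gpu: int, seq_len: int, step_time_s: float) -> float:
    """Tokens processed by one GPU per second of training."""
    return batch_per_gpu * seq_len / step_time_s

# Example: the MI355X Llama 3.1 8B FP8 row (batch 8, sequence length 8192)
# implies a step time of roughly 2.34 s for ~28,035 tokens/sec/GPU.
print(tokens_per_sec_per_gpu(8, 8192, 2.3376))  # ~28035.6
```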
PyTorch
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx950
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
| Llama 3.1 8B | FP8 | 8 | 8192 | 0 | 1 | 1 | 1 | 28,035 |
| Llama 3.1 8B | BF16 | 5 | 8192 | 0 | 1 | 1 | 1 | 20,158 |
| Llama 3.1 70B | FP8 | 6 | 8192 | 1 | 1 | 1 | 1 | 3,570 |
| Llama 3.1 70B | BF16 | 8 | 8192 | 1 | 1 | 1 | 1 | 2,281 |
Fine-tuning
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
| Llama 3.1 70B SFT | FP8 | 8 | 8192 | 1 | 1 | 1 | 1 | 3,546 |
| Llama 3.1 70B SFT | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 2,161 |
| Llama 3.1 70B LoRA | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 2,594 |
Reproduce these results on your system by following these instructions:
Results on AMD Instinct™ MI325X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx942
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
| Llama 3.1 8B | FP8 | 7 | 8192 | 0 | 1 | 1 | 1 | 14,984 |
| Llama 3.1 8B | BF16 | 6 | 8192 | 0 | 1 | 1 | 1 | 11,144 |
| Llama 3.1 70B | FP8 | 5 | 8192 | 1 | 1 | 1 | 1 | 1,716 |
| Llama 3.1 70B | BF16 | 6 | 8192 | 1 | 1 | 1 | 1 | 1,150 |
Fine-tuning
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
| Llama 3.1 70B SFT | FP8 | 8 | 8192 | 1 | 1 | 1 | 1 | 1,597 |
| Llama 3.1 70B SFT | BF16 | 8 | 8192 | 1 | 1 | 1 | 1 | 1,037 |
| Llama 3.1 70B LoRA | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 1,286 |
Reproduce these results on your system by following these instructions:
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx942
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-12
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
| Llama 3.1 8B | FP8 | 5 | 8192 | 0 | 1 | 1 | 1 | 12,216 |
| Llama 3.1 8B | BF16 | 4 | 8192 | 0 | 1 | 1 | 1 | 9,186 |
| Llama 3.1 70B | FP8 | 3 | 8192 | 1 | 1 | 1 | 1 | 1,307 |
| Llama 3.1 70B | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 887 |
Fine-tuning
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
| Llama 3.1 70B SFT | FP8 | 4 | 8192 | 1 | 1 | 1 | 1 | 1,343 |
| Llama 3.1 70B SFT | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 855 |
| Llama 3.1 70B LoRA | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1,053 |
Reproduce these results on your system by following these instructions:
Previous Versions
This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Image version | ROCm version | PyTorch version |
| v25.9 (latest) | 7.0.7 | Primus 0.3.0 |
| v25.8 | 6.4.3 | 2.8.0a0+gitd06a406 |
| v25.7 | 6.4.2 | 2.8.0a0+gitd06a406 |
| v25.6 | 6.3.4 | 2.8.0a0+git7d205b2 |
| v25.5 | 6.3.4 | 2.7.0a0+git637433 |
| v25.4 | 6.3.0 | 2.7.0a0+git637433 |
Megatron-LM
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx950
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
| Llama 3.1 8B | 1 | FP8 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 32,451 |
| Llama 3.1 8B | 1 | BF16 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 21,908 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8191 | 1 | 1 | 1 | 1 | - | 2,074 |
| Llama 3.3 70B | 1 | BF16 | 6 | 8191 | 1 | 1 | 1 | 1 | - | 2,024 |
| Mixtral 8x7B | 1 | BF16 | 4 | 4096 | 0 | 1 | 1 | 1 | 8 | 13,008 |
Reproduce these results on your system by following these instructions:
Results on AMD Instinct™ MI325X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx942
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
| Llama 3.1 8B | 1 | FP8 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 16,678 |
| Llama 3.1 8B | 1 | BF16 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 11,803 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8191 | 1 | 1 | 1 | 1 | - | 1,091 |
| Llama 3.3 70B | 1 | BF16 | 5 | 8191 | 1 | 1 | 1 | 1 | - | 1,052 |
| Mixtral 8x7B | 1 | BF16 | 4 | 4096 | 0 | 1 | 1 | 1 | 8 | 6,511 |
Reproduce these results on your system by following these instructions:
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx942
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
For multi-node runs: Dual Intel Xeon Platinum 8480+ processor-based server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 79007700, Ubuntu® 22.04, Host GPU driver ROCm 6.3.0-39.
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
| Llama 3.1 8B | 1 | FP8 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 14,208 |
| Llama 3.1 8B | 1 | BF16 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 9,782 |
| Llama 3.1 8B | 8 | FP8 | 2 | 8192 | 0 | 1 | 1 | 1 | - | 13,328 |
| Llama 3.1 70B | 1 | BF16 | 3 | 8191 | 1 | 1 | 1 | 1 | - | 827 |
| Llama 3.3 70B | 1 | BF16 | 2 | 8191 | 1 | 1 | 1 | 1 | - | 822 |
| Mixtral 8x7B | 1 | BF16 | 2 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,430 |
Reproduce these results on your system by following these instructions:
Previous Versions
This table lists previous versions of the Megatron-LM training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Image version | ROCm version | PyTorch version |
| v25.9 (latest) | 7.0.0 | Primus 0.3.0, PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7 |
| v25.8 | 6.4.3 | 2.8.0a0+gitd06a406 |
| v25.7 | 6.4.2 | 2.8.0a0+gitd06a406 |
| v25.6 | 6.4.1 | 2.8.0a0+git7d205b2 |
| v25.5 | 6.3.4 | 2.8.0a0+gite2f9759 |
| v25.4 | 6.3.0 | 2.7.0a0+git637433 |
JAX MaxText
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/jax-training:maxtext-v25.9
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
| Llama 3.1 8B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 21,306 |
| Llama 3.1 8B | 1 | FP8 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 26,756 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 2,440 |
| Llama 3.1 70B | 1 | FP8 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 3,793 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 2,441 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 10,597 |
Reproduce these results on your system by following these instructions:
Results on AMD Instinct™ MI325X Platform
The following results are based on:
- Docker container: rocm/jax-training:maxtext-v25.9
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
| Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 10,292 |
| Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,178 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,178 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,519 |
Reproduce these results on your system by following these instructions:
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/jax-training:maxtext-v25.9
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
For multi-node runs: Dual AMD EPYC 9654 processor-based server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 3.10, Ubuntu® 22.04, Host GPU driver ROCm 6.3.1-48.
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
| Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 8,587 |
| Llama 3.1 8B | 8 | BF16 | 15 | 8192 | 1 | 1 | 1 | 1 | 1 | 7,813 |
| Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 949 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 949 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 4,622 |
Reproduce these results on your system by following these instructions:
Previous Versions
This table lists previous versions of the ROCm JAX MaxText Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Image version | ROCm version | JAX version |
| v25.9 (latest) | 7.0.01 | 0.6.2 |
| v25.7 | 6.4.1 | 0.6.0, 0.5.0 |
| v25.5 | 6.3.4 | 0.4.35 |
| v25.4 | 6.3.0 | 0.4.31 |
Footnotes
1. TP stands for Tensor Parallelism.
2. Throughput is measured in tokens per second; latency is measured in seconds.