Performance Results with AMD ROCm™ Software

This page summarizes performance measurements on AMD Instinct™ GPUs running popular AI models.

This page covers both inference and training benchmarks, organized as follows:

AI Inference: vLLM
AI Training: PyTorch, Megatron-LM, and JAX MaxText

The hardware platforms include Instinct MI355X/MI325X/MI300X GPUs, with benchmark insights provided for each framework where data is available.

The data in the following tables is a reference point to help users evaluate the performance they observe. It should not be taken as the peak performance that AMD GPUs and ROCm™ software can deliver.

AI Inference

vLLM

Results on AMD Instinct™ MI300X Platform

The following results are based on:

Docker container: rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
Release date: October 6, 2025
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04, amdgpu driver 6.8.5

Throughput Measurements

The table below shows throughput in a client-server scenario under maximum load: a local inference client sends requests to the server at an infinite rate, that is, with no delay between requests.

Model	Precision	TP¹ Size	Input	Output	No. Prompts	Max. Seqs	Throughput²
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV)	FP8	8	128	2048	3200	3200	13212.5
			128	4096	1500	1500	11312.8
			500	2000	2000	2000	11376.7
			2048	2048	1500	1500	7252.1
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV)	FP8	8	128	2048	1500	1500	4201.7
			128	4096	1500	1500	3176.3
			500	2000	2000	2000	2992.0
			2048	2048	500	500	2153.7
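
As a rough aid for reading the table, the sketch below divides a few of the reported throughput figures by the TP size to get an average per-GPU rate. It assumes the Throughput column is the aggregate for the whole 8-GPU tensor-parallel group, which the page does not state explicitly; the helper name `per_gpu_throughput` is illustrative only.

```python
# A few rows copied from the MI300X throughput table: (model, input, output, tokens/s).
ROWS = [
    ("Llama 3.1 70B", 128, 2048, 13212.5),
    ("Llama 3.1 70B", 2048, 2048, 7252.1),
    ("Llama 3.1 405B", 128, 2048, 4201.7),
    ("Llama 3.1 405B", 2048, 2048, 2153.7),
]
TP_SIZE = 8  # every row in the table uses a tensor-parallel size of 8


def per_gpu_throughput(tokens_per_second: float, tp_size: int = TP_SIZE) -> float:
    """Average the reported throughput over the GPUs in the TP group.

    Assumption: the reported figure is the aggregate for all tp_size GPUs.
    """
    return tokens_per_second / tp_size


for model, inp, out, tput in ROWS:
    rate = per_gpu_throughput(tput)
    print(f"{model:15s} in={inp:4d} out={out:4d} {rate:8.1f} tokens/s/GPU")
```

A per-GPU figure like this is only a convenience for comparing against other deployments; with tensor parallelism the GPUs work jointly on each request, so it is not a measurement of any single GPU.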

Latency Measurements

The table below shows latency, measured as the time from when the system receives an input to when the model produces its result.

Model	Precision	TP¹ Size	Batch Size	Input	Output	Latency²
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV)	FP8	8	1	128	2048	15.882
			2	128	2048	17.934
			4	128	2048	18.487
			8	128	2048	20.251
			16	128	2048	22.307
			32	128	2048	29.933
			64	128	2048	32.359
			128	128	2048	45.419
			1	2048	2048	15.959
			2	2048	2048	18.177
			4	2048	2048	18.684
			8	2048	2048	20.716
			16	2048	2048	23.136
			32	2048	2048	26.969
			64	2048	2048	34.359
			128	2048	2048	52.351
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV)	FP8	8	1	128	2048	49.098
			2	128	2048	51.009
			4	128	2048	52.979
			8	128	2048	55.675
			16	128	2048	58.982
			32	128	2048	67.889
			64	128	2048	86.844
			128	128	2048	117.440
			1	2048	2048	49.033
			2	2048	2048	51.316
			4	2048	2048	52.947
			8	2048	2048	55.863
			16	2048	2048	60.103
			32	2048	2048	69.632
			64	2048	2048	89.826
			128	2048	2048	126.433
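
To relate the latency figures to a generation rate, the sketch below converts a latency value into output tokens per second. It rests on two assumptions that the page does not confirm: that latency is reported in seconds, and that it covers generating all output tokens for every sequence in the batch. The function name is illustrative.

```python
def output_tokens_per_second(batch_size: int, output_tokens: int, latency_s: float) -> float:
    """Output tokens generated per second across the whole batch.

    Assumptions: latency_s is in seconds and spans the full generation
    of output_tokens tokens for each of the batch_size sequences.
    """
    return batch_size * output_tokens / latency_s


# Llama 3.1 70B rows with input 128 and output 2048: (batch size, latency).
LLAMA_70B = [(1, 15.882), (8, 20.251), (32, 29.933), (128, 45.419)]

for batch, latency in LLAMA_70B:
    rate = output_tokens_per_second(batch, 2048, latency)
    print(f"batch={batch:3d} latency={latency:7.3f} -> {rate:8.1f} output tokens/s")
```

Under these assumptions the table shows the usual trade-off: latency grows with batch size, but much more slowly than the batch itself, so the overall generation rate rises.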

Reproduce these results on your system by following these instructions:

Inference Performance with vLLM on AMD GPUs User Guide

Previous Versions

This table lists previous versions of the ROCm vLLM inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Docker image tag	Components	Resources
rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006 (latest)	ROCm 7.0.0 vLLM 0.10.2 PyTorch 2.9.0	Documentation Docker Hub
rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812	ROCm 6.4.1 vLLM 0.10.0 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715	ROCm 6.4.1 vLLM 0.9.1 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702	ROCm 6.4.1 vLLM 0.9.1 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605	ROCm 6.4.1 vLLM 0.9.0.1 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521	ROCm 6.3.1 vLLM 0.8.5 (0.8.6.dev) PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513	ROCm 6.3.1 vLLM 0.8.5 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415	ROCm 6.3.1 vLLM 0.8.3 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325	ROCm 6.3.1 vLLM 0.7.3 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6	ROCm 6.3.1 vLLM 0.6.6 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4	ROCm 6.2.1 vLLM 0.6.4 PyTorch 2.5.0	Documentation Docker Hub
rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50	ROCm 6.2.0 vLLM 0.4.3 PyTorch 2.4.0	Documentation Docker Hub
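
The newer image tags in this table appear to follow a `rocm<version>_vllm_<version>_<YYYYMMDD>` pattern. The sketch below extracts those fields with a regular expression. The pattern is inferred from the tags listed here, not from a documented naming scheme, and older tags (for example the `_instinct_` and `_mi300_` ones) deliberately do not match. `parse_tag` is a hypothetical helper.

```python
import re
from datetime import datetime
from typing import Optional

# Pattern inferred from the recent tags in the table; not an official naming scheme.
TAG_PATTERN = re.compile(
    r"^rocm/vllm:rocm(?P<rocm>\d+(?:\.\d+)*)"
    r"_vllm_(?P<vllm>\d+(?:\.\d+)*)"
    r"_(?P<stamp>\d{8})$"
)


def parse_tag(tag: str) -> Optional[dict]:
    """Return the ROCm version, vLLM version, and date stamp, or None if no match."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        return None
    stamp = datetime.strptime(match["stamp"], "%Y%m%d").date()
    return {"rocm": match["rocm"], "vllm": match["vllm"], "date": stamp}


print(parse_tag("rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006"))
print(parse_tag("rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415"))  # older format: None
```

Because the tag is only a label, the Components column remains the authoritative source for what an image contains.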

AI Training

The tables below show training performance on AMD Instinct™ platforms: text-generation training throughput, measured at the sequence length and batch size listed in each row and reported in tokens per second per GPU.
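
For readers unfamiliar with the metric, the sketch below shows one common way tokens per second per GPU relates to batch size, sequence length, and step time. It assumes the Batch Size column is a per-GPU batch size and that every sequence is filled to the full sequence length; the page does not state either, so treat the derived step time as illustrative rather than measured. Both function names are hypothetical.

```python
def tokens_per_sec_per_gpu(per_gpu_batch: int, seq_len: int, step_time_s: float) -> float:
    """Tokens one GPU processes per second, given the time of one training step."""
    return per_gpu_batch * seq_len / step_time_s


def implied_step_time_s(per_gpu_batch: int, seq_len: int, tokens_per_sec: float) -> float:
    """Invert the formula: the step time implied by a reported throughput."""
    return per_gpu_batch * seq_len / tokens_per_sec


# Example row from the MI355X PyTorch results on this page:
# Llama 3.1 8B, FP8, batch size 8, sequence length 8192, 28,035 tokens/sec/GPU.
step = implied_step_time_s(8, 8192, 28035)
print(f"implied step time: {step:.2f} s (assuming a per-GPU batch size)")
```

If the batch size were instead a global value shared by all eight GPUs, the implied step time would be eight times shorter, which is why the assumption matters.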

PyTorch
Megatron-LM
JAX MaxText

PyTorch

Results on AMD Instinct™ MI355X Platform

The following results are based on:

Docker container: rocm/primus:v25.9_gfx950
Release date: October 17, 2025
Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.

Model	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	Tokens/sec/GPU
Llama 3.1 8B	FP8	8	8192	0	1	1	1	28,035
Llama 3.1 8B	BF16	5	8192	0	1	1	1	20,158
Llama 3.1 70B	FP8	6	8192	1	1	1	1	3,570
Llama 3.1 70B	BF16	8	8192	1	1	1	1	2,281

Fine-tuning

Model	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	Tokens/sec/GPU
Llama 3.1 70B SFT	FP8	8	8192	1	1	1	1	3,546
Llama 3.1 70B SFT	BF16	16	8192	1	1	1	1	2,161
Llama 3.1 70B LoRA	BF16	16	8192	1	1	1	1	2,594

Reproduce these results on your system by following these instructions:

Training Performance with PyTorch on AMD GPUs User Guide

Results on AMD Instinct™ MI325X Platform

The following results are based on:

Docker container: rocm/primus:v25.9_gfx942
Release date: October 17, 2025
Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48

Model	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	Tokens/sec/GPU
Llama 3.1 8B	FP8	7	8192	0	1	1	1	14,984
Llama 3.1 8B	BF16	6	8192	0	1	1	1	11,144
Llama 3.1 70B	FP8	5	8192	1	1	1	1	1,716
Llama 3.1 70B	BF16	6	8192	1	1	1	1	1,150

Fine-tuning

Model	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	Tokens/sec/GPU
Llama 3.1 70B SFT	FP8	8	8192	1	1	1	1	1,597
Llama 3.1 70B SFT	BF16	8	8192	1	1	1	1	1,037
Llama 3.1 70B LoRA	BF16	16	8192	1	1	1	1	1,286

Reproduce these results on your system by following these instructions:

Training Performance with PyTorch on AMD GPUs User Guide

Results on AMD Instinct™ MI300X Platform

The following results are based on:

Docker container: rocm/primus:v25.9_gfx942
Release date: October 17, 2025
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-12

Model	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	Tokens/sec/GPU
Llama 3.1 8B	FP8	5	8192	0	1	1	1	12,216
Llama 3.1 8B	BF16	4	8192	0	1	1	1	9,186
Llama 3.1 70B	FP8	3	8192	1	1	1	1	1,307
Llama 3.1 70B	BF16	4	8192	1	1	1	1	887

Fine-tuning

Model	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	Tokens/sec/GPU
Llama 3.1 70B SFT	FP8	4	8192	1	1	1	1	1,343
Llama 3.1 70B SFT	BF16	4	8192	1	1	1	1	855
Llama 3.1 70B LoRA	BF16	4	8192	1	1	1	1	1,053

Reproduce these results on your system by following these instructions:

Training Performance with PyTorch on AMD GPUs User Guide
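
To compare the three platforms, the sketch below computes throughput ratios from the PyTorch pre-training tables above. The batch sizes, CPUs, and driver versions differ between the platforms, so the ratios describe these specific configurations and are not a like-for-like hardware comparison. The `speedup` helper is illustrative.

```python
# Pre-training throughput (tokens/sec/GPU) copied from the PyTorch tables above.
PRETRAIN = {
    "MI355X": {("Llama 3.1 8B", "FP8"): 28035, ("Llama 3.1 8B", "BF16"): 20158,
               ("Llama 3.1 70B", "FP8"): 3570, ("Llama 3.1 70B", "BF16"): 2281},
    "MI325X": {("Llama 3.1 8B", "FP8"): 14984, ("Llama 3.1 8B", "BF16"): 11144,
               ("Llama 3.1 70B", "FP8"): 1716, ("Llama 3.1 70B", "BF16"): 1150},
    "MI300X": {("Llama 3.1 8B", "FP8"): 12216, ("Llama 3.1 8B", "BF16"): 9186,
               ("Llama 3.1 70B", "FP8"): 1307, ("Llama 3.1 70B", "BF16"): 887},
}


def speedup(platform: str, baseline: str, key: tuple) -> float:
    """Ratio of reported throughput on `platform` to that on `baseline`."""
    return PRETRAIN[platform][key] / PRETRAIN[baseline][key]


for key in PRETRAIN["MI300X"]:
    model, precision = key
    ratios = ", ".join(
        f"{p}: {speedup(p, 'MI300X', key):.2f}x" for p in ("MI325X", "MI355X")
    )
    print(f"{model} {precision:4s} vs MI300X -> {ratios}")
```

The same dictionary layout could be extended with the fine-tuning rows if those comparisons are of interest.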

Previous Versions

This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Image version	ROCm version	PyTorch version	Resources
v25.9 (latest)	7.0.0	Primus 0.3.0 PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7	Primus PyTorch Training documentation PyTorch training (legacy) documentation Docker Hub (gfx950) Docker Hub (gfx942)
v25.8	6.4.3	2.8.0a0+gitd06a406	Primus PyTorch Training documentation PyTorch training (legacy) documentation
v25.7	6.4.2	2.8.0a0+gitd06a406	Documentation Docker Hub
v25.6	6.3.4	2.8.0a0+git7d205b2	Documentation Docker Hub
v25.5	6.3.4	2.7.0a0+git637433	Documentation Docker Hub
v25.4	6.3.0	2.7.0a0+git637433	Documentation Docker Hub

Megatron-LM

Results on AMD Instinct™ MI355X Platform

The following results are based on:

Docker container: rocm/primus:v25.9_gfx950
Release date: October 17, 2025
Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.

Model	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	EP	Tokens/sec/GPU
Llama 3.1 8B	1	FP8	4	8191	0	1	1	1	-	32,451
Llama 3.1 8B	1	BF16	4	8191	0	1	1	1	-	21,908
Llama 3.1 70B	1	BF16	4	8191	1	1	1	1	-	2,074
Llama 3.3 70B	1	BF16	6	8191	1	1	1	1	-	2,024
Mixtral 8x7B	1	BF16	4	4096	0	1	1	1	8	13,008

Reproduce these results on your system by following these instructions:

Training Performance with Megatron-LM on AMD GPUs User Guide

Results on AMD Instinct™ MI325X Platform

The following results are based on:

Docker container: rocm/primus:v25.9_gfx942
Release date: October 17, 2025
Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48

Model	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	EP	Tokens/sec/GPU
Llama 3.1 8B	1	FP8	2	8191	0	1	1	1	-	16,678
Llama 3.1 8B	1	BF16	4	8191	0	1	1	1	-	11,803
Llama 3.1 70B	1	BF16	4	8191	1	1	1	1	-	1,091
Llama 3.3 70B	1	BF16	5	8191	1	1	1	1	-	1,052
Mixtral 8x7B	1	BF16	4	4096	0	1	1	1	8	6,511

Reproduce these results on your system by following these instructions:

Training Performance with Megatron-LM on AMD GPUs User Guide

Results on AMD Instinct™ MI300X Platform

The following results are based on:

Docker container: rocm/primus:v25.9_gfx942
Release date: October 17, 2025
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
For multi-node runs, server: Dual Intel Xeon Platinum 8480+ processors with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 79007700, Ubuntu® 22.04, Host GPU driver ROCm 6.3.0-39.

Model	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	EP	Tokens/sec/GPU
Llama 3.1 8B	1	FP8	2	8191	0	1	1	1	-	14,208
Llama 3.1 8B	1	BF16	2	8191	0	1	1	1	-	9,782
Llama 3.1 8B	8	FP8	2	8192	0	1	1	1	-	13,328
Llama 3.1 70B	1	BF16	3	8191	1	1	1	1	-	827
Llama 3.3 70B	1	BF16	2	8191	1	1	1	1	-	822
Mixtral 8x7B	1	BF16	2	4096	0	1	1	1	8	5,430

Reproduce these results on your system by following these instructions:

Training Performance with Megatron-LM on AMD GPUs User Guide
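
The MI300X table includes the same Llama 3.1 8B FP8 configuration on 1 node and on 8 nodes, which allows a rough scaling estimate. The sketch below computes per-GPU scaling efficiency and the aggregate cluster throughput. It assumes 8 GPUs per node, as in the server descriptions. The two runs use different host servers and slightly different sequence lengths (8191 and 8192), so the result is indicative only. Function names are illustrative.

```python
GPUS_PER_NODE = 8  # per the server descriptions above (8x AMD MI300X per server)


def scaling_efficiency(multi_node_per_gpu: float, single_node_per_gpu: float) -> float:
    """Fraction of single-node per-GPU throughput retained in the multi-node run."""
    return multi_node_per_gpu / single_node_per_gpu


def aggregate_throughput(per_gpu: float, nodes: int, gpus_per_node: int = GPUS_PER_NODE) -> float:
    """Total tokens/sec across all GPUs in the job."""
    return per_gpu * nodes * gpus_per_node


# Llama 3.1 8B, FP8, Megatron-LM on MI300X: 14,208 (1 node) and 13,328 (8 nodes) tokens/sec/GPU.
eff = scaling_efficiency(13328, 14208)
total = aggregate_throughput(13328, nodes=8)
print(f"scaling efficiency: {eff:.1%}, aggregate: {total:,.0f} tokens/s")
```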

Previous Versions

This table lists previous versions of the Megatron-LM training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Image version	ROCm version	PyTorch version	Resources
v25.9 (latest)	7.0.0	Primus 0.3.0 PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7	Primus Megatron documentation Megatron-LM (legacy) documentation Docker Hub (gfx950) Docker Hub (gfx942)
v25.8	6.4.3	2.8.0a0+gitd06a406	Primus Megatron documentation Megatron-LM (legacy) documentation Docker Hub (py310)
v25.7	6.4.2	2.8.0a0+gitd06a406	Primus Megatron documentation Megatron-LM (legacy) documentation Docker Hub (py310)
v25.6	6.4.1	2.8.0a0+git7d205b2	Documentation Docker Hub (py312) Docker Hub (py310)
v25.5	6.3.4	2.8.0a0+gite2f9759	Documentation Docker Hub (py312) Docker Hub (py310)
v25.4	6.3.0	2.7.0a0+git637433	Documentation Docker Hub

JAX MaxText

Results on AMD Instinct™ MI355X Platform

The following results are based on:

Docker container: rocm/jax-training:maxtext-v25.9
Release date: October 17, 2025
Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.

Model	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	EP	Tokens/sec/GPU
Llama 3.1 8B	1	BF16	7	8192	1	1	1	1	1	21,306
Llama 3.1 8B	1	FP8	4	8192	1	1	1	1	1	26,756
Llama 3.1 70B	1	BF16	4	8192	1	1	1	1	1	2,440
Llama 3.1 70B	1	FP8	7	8192	1	1	1	1	1	3,793
Llama 3.3 70B	1	BF16	7	8192	1	1	1	1	1	2,441
Mixtral 8x7B	1	BF16	12	4096	0	1	1	1	8	10,597

Reproduce these results on your system by following these instructions:

Training Performance with JaxMaxText on AMD GPUs User Guide

Results on AMD Instinct™ MI325X Platform

The following results are based on:

Docker container: rocm/jax-training:maxtext-v25.9
Release date: October 17, 2025
Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48

Model	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	EP	Tokens/sec/GPU
Llama 3.1 8B	1	BF16	4	8192	1	1	1	1	1	10,292
Llama 3.1 70B	1	BF16	7	8192	1	1	1	1	1	1,178
Llama 3.3 70B	1	BF16	7	8192	1	1	1	1	1	1,178
Mixtral 8x7B	1	BF16	12	4096	0	1	1	1	8	5,519

Reproduce these results on your system by following these instructions:

Training Performance with JaxMaxText on AMD GPUs User Guide

Results on AMD Instinct™ MI300X Platform

The following results are based on:

Docker container: rocm/jax-training:maxtext-v25.9
Release date: October 17, 2025
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
For multi-node runs, server: Dual AMD EPYC 9654 processors with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 3.10, Ubuntu® 22.04, Host GPU driver ROCm 6.3.1-48.

Model	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	EP	Tokens/sec/GPU
Llama 3.1 8B	1	BF16	4	8192	1	1	1	1	1	8,587
Llama 3.1 8B	8	BF16	15	8192	1	1	1	1	1	7,813
Llama 3.1 70B	1	BF16	7	8192	1	1	1	1	1	949
Llama 3.3 70B	1	BF16	7	8192	1	1	1	1	1	949
Mixtral 8x7B	1	BF16	12	4096	0	1	1	1	8	4,622

Reproduce these results on your system by following these instructions:

Training Performance with JaxMaxText on AMD GPUs User Guide

Previous Versions

This table lists previous versions of the ROCm JAX MaxText Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Image version	ROCm version	JAX version	Resources
v25.9 (latest)	7.0.01	0.6.2	Documentation Docker Hub
v25.7	6.4.1	0.6.0, 0.5.0	Documentation Docker Hub (JAX 0.6.0) Docker Hub (JAX 0.5.0)
v25.5	6.3.4	0.4.35	Documentation Docker Hub
v25.4	6.3.0	0.4.31	Documentation Docker Hub

Footnotes

¹ TP stands for Tensor Parallelism.
² Throughput is measured in tokens/second.
