Performance Results with AMD ROCm™ Software

This page summarizes performance measurements on AMD Instinct™ GPUs for popular AI models.

The data in the following tables is a reference point to help users evaluate observed performance. It should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.

AI Inference

Throughput Measurements

The table below shows throughput for a client-server scenario under maximum load, where a local inference client is fed requests at an infinite rate.

This result is based on the Docker container (rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715), which was released on July 16, 2025.

Model	Precision	TP Size	Input	Output	Num Prompts	Max Num Seqs	Throughput (tokens/s)
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV)	FP8	8	128	2048	3200	3200	12638.9
			128	4096	1500	1500	10756.8
			500	2000	2000	2000	10691.7
			2048	2048	1500	1500	7354.9
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV)	FP8	8	128	2048	1500	1500	3912.8
			128	4096	1500	1500	3084.7
			500	2000	2000	2000	2935.9
			2048	2048	500	500	2191.5

TP stands for Tensor Parallelism.

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.1 + amdgpu driver 6.8.5

Reproduce these results on your system by following the instructions in measuring inference performance with vLLM on the AMD GPUs user guide.
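As a rough way to read the throughput table, the sketch below divides the Llama 3.1 70B figures by the Llama 3.1 405B figures for each input/output shape. The numbers are copied from the table; the dictionary and function names are illustrative only. Note that Num Prompts and Max Num Seqs differ between some of the paired rows, so the ratios are indicative rather than a controlled comparison.

```python
# Throughput (tokens/s) copied from the table above, keyed by (input, output) length.
throughput_70b = {
    (128, 2048): 12638.9,
    (128, 4096): 10756.8,
    (500, 2000): 10691.7,
    (2048, 2048): 7354.9,
}
throughput_405b = {
    (128, 2048): 3912.8,
    (128, 4096): 3084.7,
    (500, 2000): 2935.9,
    (2048, 2048): 2191.5,
}


def throughput_ratio(a, b):
    """Return a[shape] / b[shape] for every shape present in both tables."""
    return {shape: a[shape] / b[shape] for shape in a if shape in b}


ratios = throughput_ratio(throughput_70b, throughput_405b)
for (inp, out), ratio in ratios.items():
    print(f"input={inp:>4} output={out:>4}: 70B throughput is {ratio:.2f}x the 405B throughput")
```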

Latency Measurements

The table below shows latency measurements. Latency is typically the time from when the system receives an input to when the model produces a result.

This result is based on the Docker container (rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715), which was released on July 16, 2025.

Model	Precision	TP Size	Batch Size	Input	Output	Latency (sec)
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV)	FP8	8	1	128	2048	17.236
			2	128	2048	18.057
			4	128	2048	18.45
			8	128	2048	19.677
			16	128	2048	22.072
			32	128	2048	24.932
			64	128	2048	33.287
			128	128	2048	46.484
			1	2048	2048	17.5
			2	2048	2048	18.055
			4	2048	2048	18.858
			8	2048	2048	20.161
			16	2048	2048	22.347
			32	2048	2048	25.966
			64	2048	2048	35.324
			128	2048	2048	52.394
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV)	FP8	8	1	128	2048	48.453
			2	128	2048	49.268
			4	128	2048	51.136
			8	128	2048	54.226
			16	128	2048	57.274
			32	128	2048	68.901
			64	128	2048	88.631
			128	128	2048	117.027
			1	2048	2048	48.362
			2	2048	2048	49.121
			4	2048	2048	52.347
			8	2048	2048	54.471
			16	2048	2048	57.841
			32	2048	2048	70.538
			64	2048	2048	91.452
			128	2048	2048	125.471

Reproduce these results on your system by following the instructions in measuring inference performance with ROCm vLLM Docker on the AMD GPUs user guide.
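One way to relate the latency table to throughput is to convert each row into an aggregate generation rate: batch size times output length, divided by latency. The sketch below does this for two Llama 3.1 70B rows. It assumes the reported latency covers the whole batch end to end (prompt processing plus generation), which the page does not state explicitly, and the function name is illustrative.

```python
def generated_tokens_per_second(batch_size, output_tokens, latency_s):
    """Aggregate generated tokens per second across the whole batch.

    Assumption: latency_s is the end-to-end time for the full batch,
    so prompt processing time is included in the denominator.
    """
    return batch_size * output_tokens / latency_s


# (batch size, output tokens, latency in seconds) for Llama 3.1 70B, input length 128.
rows = [(1, 2048, 17.236), (128, 2048, 46.484)]
for batch_size, output_tokens, latency_s in rows:
    rate = generated_tokens_per_second(batch_size, output_tokens, latency_s)
    print(f"batch={batch_size:>3}: about {rate:.0f} generated tokens/s")
```

Under that assumption, growing the batch from 1 to 128 raises latency by less than 3x while the aggregate rate grows far more, which is the usual latency versus throughput trade-off.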

Previous versions

This table lists previous versions of the ROCm vLLM inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Docker image tag	Components	Resources
rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715	ROCm 6.4.1 vLLM 0.9.1 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702	ROCm 6.4.1 vLLM 0.9.1 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605	ROCm 6.4.1 vLLM 0.9.0.1 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521	ROCm 6.3.1 vLLM 0.8.5 (0.8.6.dev) PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513	ROCm 6.3.1 vLLM 0.8.5 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415	ROCm 6.3.1 vLLM 0.8.3 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325	ROCm 6.3.1 vLLM 0.7.3 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6	ROCm 6.3.1 vLLM 0.6.6 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4	ROCm 6.2.1 vLLM 0.6.4 PyTorch 2.5.0	Documentation Docker Hub
rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50	ROCm 6.2.0 vLLM 0.4.3 PyTorch 2.4.0	Documentation Docker Hub
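The more recent tags in this table appear to follow the pattern rocm<ROCm version>_vllm_<vLLM version>_<date>. The sketch below extracts those fields with a regular expression. The pattern is inferred from the tags listed here, not from a documented naming scheme, and the function name is hypothetical. Older tags use other layouts and are reported as unparsed; the Components column remains the authoritative version source.

```python
import re

# Pattern inferred from tags such as rocm6.4.1_vllm_0.9.1_20250715 (an assumption).
_TAG_PATTERN = re.compile(
    r"rocm(?P<rocm>\d+(?:\.\d+)*)_vllm_(?P<vllm>\d+(?:\.\d+)*)_(?P<date>\d{8})"
)


def parse_vllm_tag(image):
    """Return {'rocm', 'vllm', 'date'} for a matching image tag, else None."""
    _, _, tag = image.partition(":")
    match = _TAG_PATTERN.fullmatch(tag)
    return match.groupdict() if match else None


print(parse_vllm_tag("rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715"))
print(parse_vllm_tag("rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415"))  # older layout
```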

AI Training

The tables below show training performance data for text generation models on the AMD Instinct™ platform, each measured at a specific sequence length and batch size. Throughput is reported in TFLOPS/s/GPU.

For FLUX, the result is image generation training throughput for the FLUX.1-dev model at the largest batch size that runs without going out of memory, reported in frames per second (FPS) per GPU.

PyTorch training results on the AMD Instinct™ MI300X platform

This result is based on the Docker container (rocm/pytorch-training:v25.5), which was released on April 15, 2025.

Models	Precision	Batch Size	Sequence Length	TFLOPS/s/GPU
Llama 3.1 70B with FSDP	BF16	4	8192	426.79
Llama 3.1 8B with FSDP	BF16	3	8192	542.94
Llama 3.1 8B with FSDP	FP8	3	8192	737.40
Llama 3.1 8B with FSDP	BF16	6	4096	523.79
Llama 3.1 8B with FSDP	FP8	6	4096	735.44
Mistral 7B with FSDP	BF16	3	8192	483.17
Mistral 7B with FSDP	FP8	4	8192	723.30
FLUX	BF16	10	-	4.51 (FPS/GPU)*

*Note: FLUX performance is measured in FPS/GPU rather than TFLOPS/s/GPU.

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 ROCm 6.3 (pre-release)

Reproduce these results on your system by following the instructions in measuring training performance with ROCm PyTorch Docker on the AMD GPUs user guide.
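The MI300X table lists Llama 3.1 8B with FSDP in both BF16 and FP8 at the same batch size and sequence length, so the FP8 gain can be read off directly. The sketch below computes that ratio from the table values; the variable names are illustrative.

```python
# TFLOPS/s/GPU for Llama 3.1 8B with FSDP on MI300X, keyed by sequence length.
# Each BF16/FP8 pair uses the same batch size (3 at 8192, 6 at 4096).
llama31_8b_mi300x = {
    8192: {"BF16": 542.94, "FP8": 737.40},
    4096: {"BF16": 523.79, "FP8": 735.44},
}

speedups = {
    seq_len: results["FP8"] / results["BF16"]
    for seq_len, results in llama31_8b_mi300x.items()
}

for seq_len, speedup in speedups.items():
    print(f"sequence length {seq_len}: FP8 is {speedup:.2f}x BF16")
```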

PyTorch training results on the AMD Instinct MI325X platform

This result is based on the Docker container (rocm/pytorch-training:v25.5), which was released on April 15, 2025.

Models	Precision	Batch Size	Sequence Length	TFLOPS/s/GPU
Llama 3.1 70B with FSDP	BF16	7	8192	526.13
Llama 3.1 8B with FSDP	BF16	3	8192	643.01
Llama 3.1 8B with FSDP	FP8	5	8192	893.68
Llama 3.1 8B with FSDP	BF16	8	4096	625.96
Llama 3.1 8B with FSDP	FP8	10	4096	894.98
Mistral 7B with FSDP	BF16	5	8192	590.23
Mistral 7B with FSDP	FP8	6	8192	860.39

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 ROCm 6.3 (pre-release)

Reproduce these results on your system by following the instructions in measuring training performance with ROCm PyTorch Docker on the AMD GPUs user guide.

Previous versions

This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Date	Image version	ROCm version	PyTorch version	Resources
3/11/2025	25.4	6.3.0	2.7.0a0+git637433	Documentation	Docker Hub

Megatron-LM training results on the AMD Instinct™ MI300X platform

This result is based on the Docker container (rocm/megatron-lm:v25.5), which was released on April 25, 2025.

Sequence length 8192
Model	# of nodes	Sequence length	MBS	GBS	Data Type	TP	PP	CP	TFLOPs/s/GPU
llama3.1-8B	1	8192	2	128	FP8	1	1	1	697.91
llama3.1-8B	2	8192	2	256	FP8	1	1	1	690.33
llama3.1-8B	4	8192	2	512	FP8	1	1	1	686.74
llama3.1-8B	8	8192	2	1024	FP8	1	1	1	675.50
Sequence length 4096
Model	# of nodes	Sequence length	MBS	GBS	Data Type	TP	PP	CP	TFLOPs/s/GPU
llama2-7B	1	4096	4	256	FP8	1	1	1	689.90
llama2-7B	2	4096	4	512	FP8	1	1	1	682.04
llama2-7B	4	4096	4	1024	FP8	1	1	1	676.83
llama2-7B	8	4096	4	2048	FP8	1	1	1	686.25
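Because these tables report per-GPU throughput at 1, 2, 4, and 8 nodes, a simple scaling efficiency can be derived by dividing each multi-node figure by the single-node figure. The sketch below does this for the llama3.1-8B rows. It is a reading aid built on the table values; the function name and the choice of the single-node run as baseline are assumptions of this example.

```python
# TFLOPs/s/GPU for llama3.1-8B (FP8, sequence length 8192), keyed by node count.
llama31_8b = {1: 697.91, 2: 690.33, 4: 686.74, 8: 675.50}


def scaling_efficiency(per_gpu_throughput, baseline_nodes=1):
    """Per-GPU throughput at each node count relative to the baseline run."""
    baseline = per_gpu_throughput[baseline_nodes]
    return {nodes: value / baseline for nodes, value in per_gpu_throughput.items()}


efficiency = scaling_efficiency(llama31_8b)
for nodes, eff in efficiency.items():
    print(f"{nodes} node(s): {eff:.1%} of single-node per-GPU throughput")
```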

For DeepSeek-V2-Lite with 16B parameters, the table below shows training performance data: text generation training throughput on the AMD Instinct™ MI300X platform with GEMM tuning enabled. It focuses on TFLOPS per second per GPU.

This result is based on the Docker container (rocm/megatron-lm:v25.5), which was released on April 25, 2025.

Model	# of GPUs	Sequence length	MBS	GBS	Data Type	Recompute	TFLOPs/s/GPU
DeepSeek-V2-Lite		4096		256	BF16	None	10570

Reproduce these results on your system by following the instructions in measuring training performance with ROCm Megatron-LM Docker on the AMD GPUs user guide.

Previous versions

This table lists previous versions of the Megatron-LM training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Date	Image version	ROCm version	PyTorch version	Resources
3/18/2025	25.4	6.3.0	2.7.0a0+git637433	Documentation	Docker Hub
