This page summarizes performance measurements on AMD Instinct™ GPUs for popular AI models.

The data in the following tables is provided as a reference point to help users evaluate observed performance. It should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.

AI Inference

Throughput Measurements

The table below shows throughput measurements from a client-server scenario in which a local inference client feeds requests at an infinite rate, placing the server under maximum load.

These results are based on the Docker container (rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415), which was released on April 29, 2025.

| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|---|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16896.6 |
| | | | 128 | 4096 | 1500 | 1500 | 13943.8 |
| | | | 500 | 2000 | 2000 | 2000 | 13512.8 |
| | | | 2048 | 2048 | 1500 | 1500 | 8444.5 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4359.9 |
| | | | 128 | 4096 | 1500 | 1500 | 3430.9 |
| | | | 500 | 2000 | 2000 | 2000 | 3226.8 |
| | | | 2048 | 2048 | 500 | 500 | 2228.2 |

TP stands for Tensor Parallelism.

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.2.2 + amdgpu driver 6.8.5

Reproduce these results on your system by following the instructions in the measuring inference performance with ROCm vLLM Docker on AMD GPUs user guide.
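
As a rough illustration of the workload behind these numbers, the following minimal sketch uses the vLLM offline Python API to generate a fixed number of prompts at a given input/output length and report tokens per second. The prompt construction, token counts, and tuning values are illustrative assumptions, not the exact benchmark harness used for the table above.

```python
# Minimal offline throughput sketch with the vLLM Python API (illustrative;
# not the exact harness behind the table above). Assumes 8 GPUs and that the
# FP8 model weights are available locally or on the Hugging Face Hub.
import time
from vllm import LLM, SamplingParams

MODEL = "amd/Llama-3.1-70B-Instruct-FP8-KV"   # model ID from the table
INPUT_LEN, OUTPUT_LEN, NUM_PROMPTS = 128, 2048, 3200

llm = LLM(
    model=MODEL,
    tensor_parallel_size=8,   # TP size from the table
    max_num_seqs=3200,        # maximum concurrent sequences
)

# Illustrative fixed-length prompts; a real benchmark would control token
# counts exactly (for example, by passing pre-tokenized prompts).
prompts = ["Hello " * INPUT_LEN for _ in range(NUM_PROMPTS)]
params = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {generated / elapsed:.1f} generated tokens/s")
```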

Latency Measurements

The table below shows latency measurements, which assess the time from when the system receives an input to when the model produces its result.

These results are based on the Docker container (rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415), which was released on April 29, 2025.

| Model | Precision | TP Size | Batch Size | Input | Output | Latency (sec) |
|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.427 |
| | | | 2 | 128 | 2048 | 16.661 |
| | | | 4 | 128 | 2048 | 17.326 |
| | | | 8 | 128 | 2048 | 18.679 |
| | | | 16 | 128 | 2048 | 20.642 |
| | | | 32 | 128 | 2048 | 23.260 |
| | | | 64 | 128 | 2048 | 30.498 |
| | | | 128 | 128 | 2048 | 42.952 |
| | | | 1 | 2048 | 2048 | 15.677 |
| | | | 2 | 2048 | 2048 | 16.715 |
| | | | 4 | 2048 | 2048 | 17.684 |
| | | | 8 | 2048 | 2048 | 19.444 |
| | | | 16 | 2048 | 2048 | 22.282 |
| | | | 32 | 2048 | 2048 | 26.545 |
| | | | 64 | 2048 | 2048 | 36.651 |
| | | | 128 | 2048 | 2048 | 55.949 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 45.294 |
| | | | 2 | 128 | 2048 | 46.166 |
| | | | 4 | 128 | 2048 | 47.867 |
| | | | 8 | 128 | 2048 | 51.065 |
| | | | 16 | 128 | 2048 | 54.304 |
| | | | 32 | 128 | 2048 | 63.078 |
| | | | 64 | 128 | 2048 | 81.906 |
| | | | 128 | 128 | 2048 | 108.097 |
| | | | 1 | 2048 | 2048 | 46.003 |
| | | | 2 | 2048 | 2048 | 46.596 |
| | | | 4 | 2048 | 2048 | 49.273 |
| | | | 8 | 2048 | 2048 | 53.762 |
| | | | 16 | 2048 | 2048 | 59.629 |
| | | | 32 | 2048 | 2048 | 73.753 |
| | | | 64 | 2048 | 2048 | 103.530 |
| | | | 128 | 2048 | 2048 | 151.785 |

TP stands for Tensor Parallelism.

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.2.2 + amdgpu driver 6.8.5 

Reproduce these results on your system by following the instructions in the measuring inference performance with ROCm vLLM Docker on AMD GPUs user guide.
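
For context, the sketch below shows one way to time a single batched generation with the vLLM Python API, which is roughly what an end-to-end latency measurement captures. The batch construction and prompt contents are illustrative assumptions rather than the exact benchmark script behind the table.

```python
# Minimal end-to-end latency sketch with the vLLM Python API (illustrative;
# not the exact benchmark script used for the table above).
import time
from vllm import LLM, SamplingParams

MODEL = "amd/Llama-3.1-70B-Instruct-FP8-KV"   # model ID from the table
BATCH_SIZE, INPUT_LEN, OUTPUT_LEN = 8, 128, 2048

llm = LLM(model=MODEL, tensor_parallel_size=8)

# Illustrative fixed-length prompts; exact token counts would normally be
# enforced with pre-tokenized inputs.
prompts = ["Hello " * INPUT_LEN for _ in range(BATCH_SIZE)]
params = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)

# Warm-up iteration so one-time initialization costs are excluded.
llm.generate(prompts, params)

start = time.perf_counter()
llm.generate(prompts, params)
print(f"Batch latency: {time.perf_counter() - start:.3f} s")
```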

Previous versions

This table lists previous versions of the ROCm vLLM inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

| Date | ROCm version | vLLM version | PyTorch version | Resources |
|---|---|---|---|---|
| 4/10/2025 | 6.3.1 | 0.8.3 | 2.7.0 | Documentation, Docker Hub |
| 3/25/2025 | 6.3.1 | 0.7.3 | 2.7.0 | Documentation, Docker Hub |
| 3/11/2025 | 6.3.1 | 0.7.3 | 2.7.0 | Documentation, Docker Hub |
| 2/5/2025 | 6.3.1 | 0.6.6 | 2.7.0 | Documentation, Docker Hub |
| 11/7/2024 | 6.2.1 | 0.6.4 | 2.5.0 | Documentation, Docker Hub |
| 9/4/2024 | 6.2.0 | 0.4.3 | 2.4.0 | Documentation, Docker Hub |

AI Training

The tables below show training performance data, where the AMD Instinct™ platform measures text-generation training throughput at a given sequence length and batch size. The metric is TFLOPS per GPU.

For FLUX, image-generation training throughput is measured with the FLUX.1-dev model at the largest batch size that fits in memory, and the metric is frames per second (FPS) per GPU.

PyTorch training results on the AMD Instinct™ MI300X platform

These results are based on the Docker container (rocm/pytorch-training:v25.5), which was released on April 15, 2025.

| Models | Precision | Batch Size | Sequence Length | TFLOPS/s/GPU |
|---|---|---|---|---|
| Llama 3.1 70B with FSDP | BF16 | 4 | 8192 | 426.79 |
| Llama 3.1 8B with FSDP | BF16 | 3 | 8192 | 542.94 |
| Llama 3.1 8B with FSDP | FP8 | 3 | 8192 | 737.40 |
| Llama 3.1 8B with FSDP | BF16 | 6 | 4096 | 523.79 |
| Llama 3.1 8B with FSDP | FP8 | 6 | 4096 | 735.44 |
| Mistral 7B with FSDP | BF16 | 3 | 8192 | 483.17 |
| Mistral 7B with FSDP | FP8 | 4 | 8192 | 723.30 |
| FLUX | BF16 | 10 | - | 4.51 (FPS/GPU)* |

*Note: FLUX performance is measured in FPS/GPU rather than TFLOPS/s/GPU.

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.
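
As a hedged illustration of how a TFLOPS/s/GPU figure can be estimated, the snippet below applies the common approximation of roughly 6 FLOPs per parameter per trained token (forward plus backward) to a measured token throughput. The parameter count, step time, and the 6x factor are assumptions for illustration; the numbers in the table come from the benchmark harness in the user guide, which may count FLOPs differently (for example, including attention terms).

```python
# Rough TFLOPS/s/GPU estimate from measured training throughput.
# Assumptions (illustrative): ~6 FLOPs per parameter per trained token
# (forward + backward), dense model, no activation recomputation.
def tflops_per_gpu(params: float, tokens_per_step: int,
                   step_time_s: float, num_gpus: int) -> float:
    flops_per_step = 6.0 * params * tokens_per_step
    return flops_per_step / step_time_s / num_gpus / 1e12

# Example: Llama 3.1 8B (~8.03e9 parameters), per-GPU batch size 3,
# sequence length 8192, 8 GPUs, hypothetical 3.0 s per optimizer step.
tokens = 3 * 8192 * 8          # global tokens processed per step
print(f"{tflops_per_gpu(8.03e9, tokens, 3.0, 8):.1f} TFLOPS/s/GPU")
```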

PyTorch training results on the AMD Instinct MI325X platform

These results are based on the Docker container (rocm/pytorch-training:v25.5), which was released on April 15, 2025.

| Models | Precision | Batch Size | Sequence Length | TFLOPS/s/GPU |
|---|---|---|---|---|
| Llama 3.1 70B with FSDP | BF16 | 7 | 8192 | 526.13 |
| Llama 3.1 8B with FSDP | BF16 | 3 | 8192 | 643.01 |
| Llama 3.1 8B with FSDP | FP8 | 5 | 8192 | 893.68 |
| Llama 3.1 8B with FSDP | BF16 | 8 | 4096 | 625.96 |
| Llama 3.1 8B with FSDP | FP8 | 10 | 4096 | 894.98 |
| Mistral 7B with FSDP | BF16 | 5 | 8192 | 590.23 |
| Mistral 7B with FSDP | FP8 | 6 | 8192 | 860.39 |

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.

Previous versions

This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

| Date | Image version | ROCm version | PyTorch version | Resources |
|---|---|---|---|---|
| 3/11/2025 | 25.4 | 6.3.0 | 2.7.0a0+git637433 | Documentation, Docker Hub |

Megatron-LM training results on the AMD Instinct™ MI300X platform

These results are based on the Docker container (rocm/megatron-lm:v25.4), which was released on March 18, 2025.

Sequence length 8192

| Model | # of nodes | Sequence length | MBS | GBS | Data Type | TP | PP | CP | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| llama3.1-8B | 1 | 8192 | 2 | 128 | FP8 | 1 | 1 | 1 | 697.91 |
| llama3.1-8B | 2 | 8192 | 2 | 256 | FP8 | 1 | 1 | 1 | 690.33 |
| llama3.1-8B | 4 | 8192 | 2 | 512 | FP8 | 1 | 1 | 1 | 686.74 |
| llama3.1-8B | 8 | 8192 | 2 | 1024 | FP8 | 1 | 1 | 1 | 675.50 |

MBS is micro batch size, GBS is global batch size, TP is tensor parallel size, PP is pipeline parallel size, and CP is context parallel size.

Sequence length 4096

| Model | # of nodes | Sequence length | MBS | GBS | Data Type | TP | PP | CP | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| llama2-7B | 1 | 4096 | 4 | 256 | FP8 | 1 | 1 | 1 | 689.90 |
| llama2-7B | 2 | 4096 | 4 | 512 | FP8 | 1 | 1 | 1 | 682.04 |
| llama2-7B | 4 | 4096 | 4 | 1024 | FP8 | 1 | 1 | 1 | 676.83 |
| llama2-7B | 8 | 4096 | 4 | 2048 | FP8 | 1 | 1 | 1 | 686.25 |

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
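
To make the batch-size columns concrete, the relationship among MBS, GBS, the parallelism degrees, and gradient accumulation can be sketched as below. The helper function and the worked example are illustrative assumptions, not part of the Megatron-LM benchmark harness.

```python
# Relationship between global batch size (GBS), micro batch size (MBS),
# data-parallel size, and gradient-accumulation steps in a Megatron-style
# setup. Illustrative helper; not part of the benchmark harness.
GPUS_PER_NODE = 8

def grad_accum_steps(gbs: int, mbs: int, num_nodes: int,
                     tp: int, pp: int, cp: int) -> int:
    world_size = num_nodes * GPUS_PER_NODE
    dp = world_size // (tp * pp * cp)          # data-parallel replicas
    assert gbs % (mbs * dp) == 0, "GBS must be divisible by MBS * DP"
    return gbs // (mbs * dp)

# Example row from the table: llama3.1-8B, 4 nodes, MBS=2, GBS=512, TP=PP=CP=1.
# 32 GPUs -> DP=32, so 512 / (2 * 32) = 8 gradient-accumulation steps.
print(grad_accum_steps(gbs=512, mbs=2, num_nodes=4, tp=1, pp=1, cp=1))
```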

For DeepSeek-V2-Lite with 16B parameters, the table below shows training performance data, where the AMD Instinct™ MI300X platform measures text-generation training throughput with GEMM tuning enabled. The metric is TFLOPS per GPU.

This result is based on the Docker container (rocm/megatron-lm:v25.4), which was released on March 18, 2025.

| Model | # of GPUs | Sequence length | MBS | GBS | Data Type | TP | PP | CP | EP | SP | Recompute | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V2-Lite | 8 | 4096 | 4 | 256 | BF16 | 1 | 1 | 1 | 8 | On | None | 10570 |

EP is expert parallel size, and SP indicates whether sequence parallelism is enabled.

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm Megatron-LM Docker on AMD GPUs user guide.
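
The DeepSeek-V2-Lite result above was collected with GEMM tuning on. One way to enable GEMM tuning in a ROCm PyTorch build is PyTorch's TunableOp mechanism; the sketch below is a minimal, hedged example that assumes the PYTORCH_TUNABLEOP_* environment variables supported by recent PyTorch ROCm builds, and it is not the exact tuning flow used by the Megatron-LM Docker image.

```python
# Minimal GEMM-tuning sketch using PyTorch TunableOp on ROCm (illustrative;
# the Megatron-LM container may drive tuning differently). The environment
# variables below are assumed to be supported by the installed PyTorch build.
import os

os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"    # use tuned GEMM kernels
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"     # search for the best kernels
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"

import torch  # import after setting the environment so TunableOp sees it

# Run a representative GEMM shape so its tuning result is recorded and
# can be reused on subsequent runs.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
c = a @ b
torch.cuda.synchronize()
print("GEMM done; TunableOp typically writes results to the configured file at exit")
```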

Previous versions

This table lists previous versions of the Megatron-LM training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

| Date | Image version | ROCm version | PyTorch version | Resources |
|---|---|---|---|---|
| 3/18/2025 | 25.4 | 6.3.0 | 2.7.0a0+git637433 | Documentation, Docker Hub |