Performance Results with AMD ROCm™ Software

This page summarizes performance measurements on AMD Instinct™ GPUs running popular AI models.

The results on this page cover both inference and training benchmarks, organized by framework:

AI Inference: vLLM, xDiT
AI Training: PyTorch, Megatron-LM, and JAX MaxText

The hardware platforms include Instinct MI355X/MI325X/MI300X GPUs, with benchmark insights provided for each framework where data is available.

The data in the following tables are a reference point to help users evaluate observed performance. They should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.

AI Inference

vLLM
xDiT

vLLM

Results on AMD Instinct™ MI300X Platform

The following results are based on:

Docker container: rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210
Release date: December 11, 2025
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04, amdgpu driver 6.14.14

Throughput Measurements

The table below shows throughput for a client-server scenario under maximum load: a local inference client sends requests to the server at an infinite rate, that is, without pacing between requests.

Model	Precision	TP¹ Size	Input	Output	No. Prompts	Max. Seqs	Throughput²
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV)	FP8	8	128	2048	3200	3200	13562.4
			128	4096	1500	1500	11800.9
			500	2000	2000	2000	11249.5
			2048	2048	1500	1500	7753.1
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV)	FP8	8	128	2048	1500	1500	3822.8
			128	4096	1500	1500	3085.8
			500	2000	2000	2000	3059.9
			2048	2048	500	500	2192.3

¹ TP stands for Tensor Parallelism.

² Throughput is measured in tokens/second.
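
To compare these figures with per-GPU numbers elsewhere on this page, the table's throughput can be divided by the tensor-parallel size. The sketch below does this for the rows above. It assumes that the reported throughput is the total for a single model replica spread across all 8 GPUs in the TP group; the function name `per_gpu_throughput` is illustrative and not part of any AMD or vLLM tool.

```python
# Rows copied from the MI300X throughput table above:
# (model, input tokens, output tokens, throughput in tokens/second).
ROWS = [
    ("Llama 3.1 70B", 128, 2048, 13562.4),
    ("Llama 3.1 70B", 128, 4096, 11800.9),
    ("Llama 3.1 70B", 500, 2000, 11249.5),
    ("Llama 3.1 70B", 2048, 2048, 7753.1),
    ("Llama 3.1 405B", 128, 2048, 3822.8),
    ("Llama 3.1 405B", 128, 4096, 3085.8),
    ("Llama 3.1 405B", 500, 2000, 3059.9),
    ("Llama 3.1 405B", 2048, 2048, 2192.3),
]

TP_SIZE = 8  # tensor-parallel size listed in the table


def per_gpu_throughput(tokens_per_second: float, tp_size: int = TP_SIZE) -> float:
    """Average tokens/second attributable to each GPU in the TP group.

    Assumption: the table value is the total for one replica that spans
    tp_size GPUs, so a plain division gives the per-GPU share.
    """
    return tokens_per_second / tp_size


for model, n_in, n_out, tput in ROWS:
    print(f"{model:15s} in={n_in:4d} out={n_out:4d} "
          f"{per_gpu_throughput(tput):8.1f} tokens/s per GPU")
```

This is only a normalization for comparison; it does not imply that a single GPU could sustain that rate on its own.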

Latency Measurements

The table below shows latency measurements, that is, the time from when the system receives an input to when the model produces a result.

Model	Precision	TP¹ Size	Batch Size	Input	Output	Latency²
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV)	FP8	8	1	128	2048	16.015
			2	128	2048	18.683
			4	128	2048	19.245
			8	128	2048	20.468
			16	128	2048	22.137
			32	128	2048	25.571
			64	128	2048	32.987
			128	128	2048	46.426
			1	2048	2048	16.421
			2	2048	2048	19.035
			4	2048	2048	20.221
			8	2048	2048	21.483
			16	2048	2048	24.350
			32	2048	2048	29.776
			64	2048	2048	40.625
			128	2048	2048	63.671
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV)	FP8	8	1	128	2048	48.618
			2	128	2048	50.980
			4	128	2048	52.760
			8	128	2048	55.864
			16	128	2048	58.795
			32	128	2048	69.482
			64	128	2048	89.384
			128	128	2048	122.601
			1	2048	2048	49.106
			2	2048	2048	51.664
			4	2048	2048	54.220
			8	2048	2048	58.904
			16	2048	2048	65.389
			32	2048	2048	83.387
			64	2048	2048	115.575
			128	2048	2048	177.779

¹ TP stands for Tensor Parallelism.

² Latency is measured in seconds.
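
A latency figure can be turned into an effective generation rate for a rough comparison across batch sizes. The sketch below assumes that the reported latency is the end-to-end time for the whole batch and that every sequence in the batch generates the full output length; both are assumptions about the measurement, not statements from this page. The function name is illustrative.

```python
def generated_tokens_per_second(batch_size: int, output_tokens: int,
                                latency_s: float) -> float:
    """Output tokens produced per second across the whole batch.

    Assumes latency_s covers the complete batch and that each sequence
    emits exactly output_tokens tokens.
    """
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return batch_size * output_tokens / latency_s


# Two rows from the Llama 3.1 70B block above (input length 128):
# (batch size, output tokens, latency in seconds).
SAMPLES = [(1, 2048, 16.015), (128, 2048, 46.426)]

for batch, n_out, latency in SAMPLES:
    rate = generated_tokens_per_second(batch, n_out, latency)
    print(f"batch={batch:3d}: {rate:8.1f} generated tokens/s")
```

Under these assumptions, the larger batch yields a much higher aggregate rate even though each individual request waits longer.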

Reproduce these results on your system by following these instructions:

Inference Performance with vLLM

Previous Versions

This table lists previous versions of the ROCm vLLM inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Docker image tag	Components	Resources
rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210 (latest)	ROCm 7.0.0 vLLM 0.11.2 PyTorch 2.9.0	Documentation Docker Hub
rocm/vllm:rocm7.0.0_vllm_0.11.1_20251024	ROCm 7.0.0 vLLM 0.11.1 PyTorch 2.9.0	Documentation Docker Hub
rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006	ROCm 7.0.0 vLLM 0.10.2 PyTorch 2.9.0	Documentation Docker Hub
rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812	ROCm 6.4.1 vLLM 0.10.0 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715	ROCm 6.4.1 vLLM 0.9.1 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702	ROCm 6.4.1 vLLM 0.9.1 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605	ROCm 6.4.1 vLLM 0.9.0.1 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521	ROCm 6.3.1 vLLM 0.8.5 (0.8.6.dev) PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513	ROCm 6.3.1 vLLM 0.8.5 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415	ROCm 6.3.1 vLLM 0.8.3 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325	ROCm 6.3.1 vLLM 0.7.3 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6	ROCm 6.3.1 vLLM 0.6.6 PyTorch 2.7.0	Documentation Docker Hub
rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4	ROCm 6.2.1 vLLM 0.6.4 PyTorch 2.5.0	Documentation Docker Hub
rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50	ROCm 6.2.0 vLLM 0.4.3 PyTorch 2.4.0	Documentation Docker Hub

xDiT

Results on AMD Instinct™ MI355X Platform

The following results are based on:

Docker container: rocm/pytorch-xdit:v25.12
Release date: Dec 8, 2025
Server: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.3 LTS, Host GPU driver ROCm 7.10.0_preview.

Models	Precision	Batch Size	Configuration	Latency¹
Hunyuan Video	BF16	1	720p, 129 Frames, 50 steps	86.74
Wan2.1	BF16	1	720p, 80 Frames, 40 steps	71.60
Wan2.2	BF16	1	720p, 80 Frames, 40 steps	66.69
Flux.1	BF16	1	1024x1240, 25 steps	0.94

¹ Latency is measured in seconds.

Reproduce these results on your system by following these instructions:

xDiT Diffusion Inference on AMD GPUs User Guide

Results on the AMD Instinct™ MI300X platform

The following results are based on:

Docker container: rocm/pytorch-xdit:v25.12
Release date: Dec 8, 2025
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.10.0_preview.

Models	Precision	Batch Size	Configuration	Latency¹
Hunyuan Video	BF16	1	720p, 129 Frames, 50 steps	181.05
Wan2.1	BF16	1	720p, 80 Frames, 40 steps	151.25
Wan2.2	BF16	1	720p, 80 Frames, 40 steps	142.17
Flux.1	BF16	1	1024x1240, 25 steps	1.33

¹ Latency is measured in seconds.
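
Because the MI355X and MI300X xDiT tables use the same models, precision, and configurations, their latencies can be compared directly. The sketch below computes the latency ratio for each model. Note that the two servers also differ in host CPU and BIOS, so the ratio reflects the whole platform rather than the GPU alone; the dictionary and function names are illustrative.

```python
# Latency in seconds, copied from the two xDiT tables on this page.
MI355X_LATENCY_S = {
    "Hunyuan Video": 86.74,
    "Wan2.1": 71.60,
    "Wan2.2": 66.69,
    "Flux.1": 0.94,
}
MI300X_LATENCY_S = {
    "Hunyuan Video": 181.05,
    "Wan2.1": 151.25,
    "Wan2.2": 142.17,
    "Flux.1": 1.33,
}


def latency_ratio(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


SPEEDUP = {
    model: latency_ratio(MI300X_LATENCY_S[model], MI355X_LATENCY_S[model])
    for model in MI355X_LATENCY_S
}

for model, ratio in SPEEDUP.items():
    print(f"{model:14s} MI355X platform is {ratio:.2f}x faster than MI300X platform")
```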

Reproduce these results on your system by following these instructions:

xDiT Diffusion Inference on AMD GPUs User Guide

Previous versions

This table lists previous versions of the xDiT diffusion inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Docker image tag	Components	Resources
rocm/pytorch-xdit:v25.12 (latest)	ROCm 7.10.0 preview TheRock 3e3f834	Documentation Docker Hub
rocm/pytorch-xdit:v25.11	ROCm 7.10.0 preview TheRock 3e3f834	Documentation Docker Hub
rocm/pytorch-xdit:v25.10	ROCm 7.9.0 preview TheRock 7afbe45	Documentation Docker Hub

AI Training

The tables below show training performance on AMD Instinct™ platforms: text-generation training throughput, in tokens per second per GPU, measured at the sequence length and batch size listed for each configuration.

PyTorch
Megatron-LM
JaxMaxText

PyTorch

Results on AMD Instinct™ MI355X Platform

The following results are based on:

Docker container: rocm/primus:v26.3
Release date: Jun 9, 2026
Server: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
Multi-node: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.

Model	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	Tokens/sec/GPU
Llama 3.1 8B	1	FP8	8	8192	FALSE	1	1	1	30,254
Llama 3.1 8B	1	BF16	6	8192	FALSE	1	1	1	21,763
Llama 3.1 70B	1	FP8	6	8192	TRUE	1	1	1	3,663
Llama 3.1 70B	4	FP8	6	8192	TRUE	1	1	1	3,805
Llama 3.1 70B	1	BF16	8	8192	TRUE	1	1	1	2,294
Llama 3.1 405B	8	FP8	3	8192	TRUE	1	1	1	636

Reproduce these results on your system by following these instructions:

Training Performance with PyTorch on AMD GPUs User Guide

Results on AMD Instinct™ MI325X Platform

The following results are based on:

Docker container: rocm/primus:v26.3
Release date: Jun 9, 2026
Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.

Model	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	Tokens/sec/GPU
Llama 3.1 8B	1	FP8	7	8192	FALSE	1	1	1	15,662
Llama 3.1 8B	1	BF16	6	8192	FALSE	1	1	1	11,642
Llama 3.1 70B	1	FP8	3	8192	TRUE	1	1	1	1,775
Llama 3.1 70B	8	FP8	3	8192	TRUE	1	1	1	1,748
Llama 3.1 70B	1	BF16	3	8192	TRUE	1	1	1	1,167
Llama 3.1 405B	8	FP8	4	8192	TRUE	1	1	1	301.74

Reproduce these results on your system by following these instructions:

Training Performance with PyTorch on AMD GPUs User Guide

Results on AMD Instinct™ MI300X Platform

The following results are based on:

Docker container: rocm/primus:v26.3
Release date: Jun 9, 2026
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.

Model	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	Tokens/sec/GPU
Llama 3.1 8B	FP8	5	8192	FALSE	1	1	1	11,980
Llama 3.1 8B	BF16	4	8192	FALSE	1	1	1	8,922
Llama 3.1 70B	FP8	3	8192	TRUE	1	1	1	1,284
Llama 3.1 70B	BF16	4	8192	TRUE	1	1	1	869

Reproduce these results on your system by following these instructions:

Training Performance with PyTorch on AMD GPUs User Guide

Previous Versions

This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Image version	ROCm version	PyTorch version	Resources
v26.3 (latest)	7.2.1	PyTorch 2.10.0+git94c6e04	Primus PyTorch training documentation Docker Hub
v26.2	7.2.0	PyTorch 2.10.0a0+git449b176	Primus Megatron documentation Docker Hub
v26.1	7.1.0	PyTorch 2.10.0.dev20251112+rocm7.1	Primus PyTorch training documentation PyTorch training (legacy) documentation Docker Hub
v25.11	7.1.0	PyTorch 2.10.0.dev20251112+rocm7.1	Primus PyTorch Training documentation PyTorch training (legacy) documentation Docker Hub
v25.10	7.1.0	PyTorch 2.10.0.dev20251112+rocm7.1	Primus PyTorch Training documentation PyTorch training (legacy) documentation Docker Hub
v25.9	7.0.0	Primus 0.3.0 PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7	Primus PyTorch Training documentation PyTorch training (legacy) documentation Docker Hub (gfx950) Docker Hub (gfx942)
v25.8	6.4.3	2.8.0a0+gitd06a406	Primus PyTorch Training documentation PyTorch training (legacy) documentation
v25.7	6.4.2	2.8.0a0+gitd06a406	Documentation Docker Hub
v25.6	6.3.4	2.8.0a0+git7d205b2	Documentation Docker Hub
v25.5	6.3.4	2.7.0a0+git637433	Documentation Docker Hub
v25.4	6.3.0	2.7.0a0+git637433	Documentation Docker Hub

Megatron-LM

Results on AMD Instinct™ MI355X Platform

The following results are based on:

Docker container: rocm/primus:v26.3
Release date: Jun 9, 2026
Server: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
Multi-node: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.

Model	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	EP/VPP	Tokens/sec/GPU
Llama 3.1 8B	1	FP8	4	8192	FALSE	1	1	1	-	32,031
Llama 3.1 8B	1	BF16	4	8192	FALSE	1	1	1	-	22,894
Llama 3.1 8B	8	FP8	4	8192	FALSE	1	1	1	-	30,581
Llama 3.1 70B	1	BF16	4	8192	TRUE	1	1	1	-	2,130
Llama 3.1 70B	8	FP8	4	8192	FALSE	1	1	1	-	3,207
Llama 3.1 70B	8	BF16	4	8192	FALSE	1	1	1	-	2,002
Llama 3.3 70B	1	BF16	6	8192	TRUE	1	1	1	-	2,031
Mixtral 8x7B	1	BF16	4	4096	FALSE	1	1	1	8	13,381
Mixtral 8x22B	4	BF16	1	8192	FALSE	1	1	4	8/2	3,299
DeepSeekV2 Lite	1	BF16	12	4096	FALSE	1	1	1	8	41,284
Qwen3-30B	1	BF16	8	4096	FALSE	-	1	1	8	24,623
Qwen3-30B	1	FP8	8	4096	FALSE	-	1	1	8	25,489
Qwen3-235B	8	FP8	4	4096	FALSE	-	1	1	8/4	4,709
GPT-OSS-20B	1	BF16	8	4096	FALSE	-	1	1	8	21,333
GPT-OSS-20B	1	FP8	8	4096	FALSE	-	1	1	8	21,065
GPT-OSS-120B	8	BF16	8	4096	FALSE	-	1	1	8/2	12,183

Reproduce these results on your system by following these instructions:

Training Performance with Megatron-LM on AMD GPUs User Guide
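
The per-GPU figures above can be used to estimate aggregate cluster throughput and multi-node scaling efficiency. The sketch below uses the two Llama 3.1 8B FP8 rows from the MI355X Megatron-LM table (1 node and 8 nodes, same batch size, sequence length, and parallelism settings) and assumes 8 GPUs per node, as stated in the server description. The function names are illustrative.

```python
GPUS_PER_NODE = 8  # from the server description: 8x AMD MI355X per node


def aggregate_throughput(tokens_per_sec_per_gpu: float, nodes: int,
                         gpus_per_node: int = GPUS_PER_NODE) -> float:
    """Total tokens/second across every GPU in the job."""
    return tokens_per_sec_per_gpu * nodes * gpus_per_node


def scaling_efficiency(multi_node_per_gpu: float,
                       single_node_per_gpu: float) -> float:
    """Fraction of single-node per-GPU throughput kept when scaling out."""
    return multi_node_per_gpu / single_node_per_gpu


# Llama 3.1 8B, FP8, batch size 4, sequence length 8192 (table above).
SINGLE_NODE = 32031  # tokens/sec/GPU on 1 node
EIGHT_NODES = 30581  # tokens/sec/GPU on 8 nodes

print(f"1 node : {aggregate_throughput(SINGLE_NODE, 1):,.0f} tokens/s total")
print(f"8 nodes: {aggregate_throughput(EIGHT_NODES, 8):,.0f} tokens/s total")
print(f"scaling efficiency: {scaling_efficiency(EIGHT_NODES, SINGLE_NODE):.1%}")
```

The same two functions apply to any pair of rows that differ only in node count; rows with different batch sizes or parallelism settings are not directly comparable this way.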

Results on AMD Instinct™ MI325X Platform

The following results are based on:

Docker container: rocm/primus:v26.3
Release date: Jun 9, 2026
Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
Multi-node: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD Instinct MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 1.5, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2.60402-120~22.04.

Model	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	EP	Tokens/sec/GPU
Llama 3.1 8B	1	FP8	2	8192	FALSE	1	1	1	-	16,759
Llama 3.1 8B	8	FP8	2	8192	FALSE	1	1	1	-	15,355
Llama 3.1 8B	1	BF16	4	8192	FALSE	1	1	1	-	12,158
Llama 3.1 70B	1	BF16	3	8192	TRUE	1	1	1	-	1,099
Llama 3.1 70B	8	FP8	4	8192	TRUE	1	1	1	-	1,438
Llama 3.1 70B	8	BF16	1	8192	TRUE	1	1	1	-	1,076
Llama 3.3 70B	1	BF16	3	8192	TRUE	1	1	1	-	1,098
Mixtral 8x7B	1	BF16	4	4096	FALSE	1	1	1	8	7,177
DeepSeekV2 Lite	1	BF16	10	4096	FALSE	1	1	1	8	22,216
Qwen3-30B	1	BF16	2	4096	FALSE	-	1	1	8	10,088
Qwen3-30B	1	FP8	2	4096	FALSE	-	1	1	8	10,969
GPT-OSS-20B	1	BF16	4	4096	FALSE	1	1	1	8	13,952
GPT-OSS-20B	1	FP8	4	4096	FALSE	1	1	1	8	14,652
GPT-OSS-120B	8	BF16	6	4096	FALSE	1	1	2	8	5,273

Reproduce these results on your system by following these instructions:

Training Performance with Megatron-LM on AMD GPUs User Guide

Results on AMD Instinct™ MI300X Platform

The following results are based on:

Docker container: rocm/primus:v26.3
Release date: Jun 9, 2026
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
Multi-node: Dual Intel Xeon Platinum 8480+ processor-based server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 79007700, Ubuntu® 22.04, Host GPU driver ROCm 6.3.0-39.

Model	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	EP	Tokens/sec/GPU
Llama 3.1 8B	1	FP8	2	8192	FALSE	1	1	1	-	12,970
Llama 3.1 8B	1	BF16	2	8192	FALSE	1	1	1	-	9,386
Llama 3.1 70B	1	BF16	3	8192	TRUE	1	1	1	-	788
Llama 3.3 70B	1	BF16	2	8192	TRUE	1	1	1	-	809
Mixtral 8x7B	1	BF16	2	4096	FALSE	1	1	1	8	5,433
DeepSeekV2 Lite	1	BF16	4	4096	FALSE	1	1	1	8	16,421
Qwen3-30B	1	BF16	2	4096	FALSE	1	1	1	8	8,810
Qwen3-30B	1	FP8	2	4096	FALSE	1	1	1	8	8,994
GPT-OSS-20B	1	BF16	4	4096	FALSE	1	1	1	1	11,490
GPT-OSS-20B	1	FP8	4	4096	FALSE	1	1	1	8	12,175

Reproduce these results on your system by following these instructions:

Training Performance with Megatron-LM on AMD GPUs User Guide

Previous Versions

Image version	ROCm version	PyTorch version	Resources
v26.3 (latest)	7.2.0	PyTorch 2.10.0+git94c6e04	Primus Megatron documentation Docker Hub
v26.2	7.2.0	PyTorch 2.10.0+git94c6e04	Primus Megatron documentation Docker Hub
v26.1	7.1.0	PyTorch 2.10.0.dev20251112+rocm7.1	Primus Megatron documentation Megatron-LM (legacy) documentation Docker Hub
v25.11	7.1.0	PyTorch 2.10.0.dev20251112+rocm7.1	Primus Megatron documentation Megatron-LM (legacy) documentation Docker Hub
v25.10	7.1.0	PyTorch 2.10.0.dev20251112+rocm7.1	Primus Megatron documentation Megatron-LM (legacy) documentation Docker Hub
v25.9	7.0.0	Primus 0.3.0 PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7	Primus Megatron documentation Megatron-LM (legacy) documentation Docker Hub (gfx950) Docker Hub (gfx942)
v25.8	6.4.3	2.8.0a0+gitd06a406	Primus Megatron documentation Megatron-LM (legacy) documentation Docker Hub (py310)
v25.7	6.4.2	2.8.0a0+gitd06a406	Primus Megatron documentation Megatron-LM (legacy) documentation Docker Hub (py310)
v25.6	6.4.1	2.8.0a0+git7d205b2	Documentation Docker Hub (py312) Docker Hub (py310)
v25.5	6.3.4	2.8.0a0+gite2f9759	Documentation Docker Hub (py312) Docker Hub (py310)
v25.4	6.3.0	2.7.0a0+git637433	Documentation Docker Hub

JaxMaxText

Results on AMD Instinct™ MI355X Platform

The following results are based on:

Docker container: rocm/jax-training:maxtext-v26.3
Release date: Jun 9, 2026
Server: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.1.1.
Multi-node: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.

Models	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	EP	Tokens/Sec/GPU
Llama 3.1 8B	1	BF16	9	8192	TRUE	1	1	1	1	20,869
Llama 3.1 8B	1	FP8	9	8192	TRUE	1	1	1	1	26,998
Llama 3.1 8B	8	BF16	9	8192	TRUE	1	1	1	1	20,092
Llama 3.1 70B	1	BF16	10	8192	TRUE	1	1	1	1	2,322
Llama 3.1 70B	1	FP8	10	8192	TRUE	1	1	1	1	3,803
Llama 3.1 70B	8	BF16	10	8192	TRUE	1	1	1	1	2,219
Llama 3.3 70B	1	BF16	10	8192	TRUE	1	1	1	1	2,331
Mixtral 8x7B	1	BF16	11	4096	FALSE	1	1	1	8	11,679

Reproduce these results on your system by following these instructions:

Training Performance with JaxMaxText on AMD GPUs User Guide

Results on AMD Instinct™ MI325X Platform

The following results are based on:

Docker container: rocm/jax-training:maxtext-v26.3
Release date: Jun 9, 2026
Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.

Models	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	EP	Tokens/Sec/GPU
Llama 3.1 8B	1	BF16	4	8192	TRUE	1	1	1	1	10,652
Llama 3.1 8B	1	FP8	4	8192	TRUE	1	1	1	1	13,666
Llama 3.1 70B	1	BF16	7	8192	TRUE	1	1	1	1	1,217
Llama 3.1 70B	1	FP8	7	8192	TRUE	1	1	1	1	1,835
Llama 3.3 70B	1	BF16	7	8192	TRUE	1	1	1	1	1,217
Llama 3.3 70B	1	FP8	7	8192	TRUE	1	1	1	1	1,836
Mixtral 8x7B	1	BF16	9	4096	FALSE	1	1	1	8	5,607

Reproduce these results on your system by following these instructions:

Training Performance with JaxMaxText on AMD GPUs User Guide

Results on AMD Instinct™ MI300X Platform

The following results are based on:

Docker container: rocm/jax-training:maxtext-v26.1
Release date: Jan 21, 2026
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.1.1.
Multi-node: Dual AMD EPYC 9654 processor-based server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 3.10, Ubuntu® 22.04, Host GPU driver ROCm 6.3.1-48.

Models	# nodes	Precision	Batch Size	Sequence Length	FSDP	TP	CP	PP	EP	Tokens/Sec/GPU
Llama 3.1 8B	1	BF16	4	8192	TRUE	1	1	1	1	8,720
Llama 3.1 8B	1	FP8	4	8192	TRUE	1	1	1	1	11,138
Llama 3.1 70B	1	FP8	5	8192	TRUE	1	1	1	1	1,472
Llama 3.1 70B	1	BF16	5	8192	TRUE	1	1	1	1	963
Llama 3.3 70B	1	BF16	5	8192	TRUE	1	1	1	1	962
Mixtral 8x7B	1	BF16	12	4096	FALSE	1	1	1	8	5,382

Reproduce these results on your system by following these instructions:

Training Performance with JaxMaxText on AMD GPUs User Guide

Previous Versions

This table lists previous versions of the ROCm JAX MaxText Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Image version	ROCm version	JAX version	Resources
v26.2 (latest)	7.1.1	0.8.2	Documentation Docker Hub
v26.1	7.1.1	0.8.2	Documentation Docker Hub
v25.11	7.1.0	0.7.1	Documentation Docker Hub
v25.9	7.0.01	0.6.2	Documentation Docker Hub
v25.7	6.4.1	0.6.0, 0.5.0	Documentation Docker Hub (JAX 0.6.0) Docker Hub (JAX 0.5.0)
v25.5	6.3.4	0.4.35	Documentation Docker Hub
v25.4	6.3.0	0.4.31	Documentation Docker Hub
