This page summarizes performance measurements on AMD Instinct™ GPUs running popular AI models.

The results on this page cover both inference and training benchmarks, organized by framework:

  • AI Inference: vLLM
  • AI Training: PyTorch, Megatron-LM, and JAX MaxText

The hardware platforms include Instinct MI355X/MI325X/MI300X GPUs, with benchmark insights provided for each framework where data is available.

The data in the following tables is a reference point to help users evaluate observed performance. It should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.

AI Inference

vLLM

Results on AMD Instinct™ MI300X Platform

The following results are based on:

  • Docker container: rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
  • Release date: October 6, 2025
  • Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04, amdgpu driver 6.8.5

Throughput Measurements

The table below shows throughput measurements for a client-server scenario under maximum load, where a local inference client is fed requests at an infinite rate. A minimal measurement sketch follows the table.

| Model | Precision | TP¹ Size | Input | Output | No. Prompts | Max. Seqs | Throughput² |
|---|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 13212.5 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 4096 | 1500 | 1500 | 11312.8 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 500 | 2000 | 2000 | 2000 | 11376.7 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 2048 | 2048 | 1500 | 1500 | 7252.1 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4201.7 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 4096 | 1500 | 1500 | 3176.3 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 500 | 2000 | 2000 | 2000 | 2992.0 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 2048 | 2048 | 500 | 500 | 2153.7 |
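
As a rough, assumption-level illustration of this kind of maximum-load measurement, the sketch below drives vLLM's offline Python API with one large batch of prompts so the engine stays saturated, then derives output tokens per second. The model name matches the table; the prompt text and counts are placeholders, not the harness used to produce these numbers.

```python
# Minimal throughput sketch using vLLM's offline Python API (illustrative, not
# the benchmark harness). Submitting every prompt at once approximates an
# infinite request rate: the engine stays under maximum load until the batch drains.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="amd/Llama-3.1-70B-Instruct-FP8-KV", tensor_parallel_size=8)
params = SamplingParams(max_tokens=2048, ignore_eos=True)  # force full 2048-token outputs

prompts = ["Summarize the history of GPU computing."] * 3200  # placeholder prompts

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"Throughput: {generated / elapsed:.1f} output tokens/s")
```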

Latency Measurements

The table below shows latency measurements: the time from when the system receives an input to when the model produces its result. A minimal measurement sketch follows the table.

| Model | Precision | TP¹ Size | Batch Size | Input | Output | Latency³ |
|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.882 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 2 | 128 | 2048 | 17.934 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 4 | 128 | 2048 | 18.487 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 8 | 128 | 2048 | 20.251 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 16 | 128 | 2048 | 22.307 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 32 | 128 | 2048 | 29.933 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 64 | 128 | 2048 | 32.359 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 128 | 2048 | 45.419 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 2048 | 2048 | 15.959 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 2 | 2048 | 2048 | 18.177 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 4 | 2048 | 2048 | 18.684 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 8 | 2048 | 2048 | 20.716 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 16 | 2048 | 2048 | 23.136 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 32 | 2048 | 2048 | 26.969 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 64 | 2048 | 2048 | 34.359 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 2048 | 52.351 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 49.098 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 2 | 128 | 2048 | 51.009 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 4 | 128 | 2048 | 52.979 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 8 | 128 | 2048 | 55.675 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 16 | 128 | 2048 | 58.982 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 32 | 128 | 2048 | 67.889 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 64 | 128 | 2048 | 86.844 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 128 | 2048 | 117.440 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 2048 | 2048 | 49.033 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 2 | 2048 | 2048 | 51.316 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 4 | 2048 | 2048 | 52.947 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 8 | 2048 | 2048 | 55.863 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 16 | 2048 | 2048 | 60.103 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 32 | 2048 | 2048 | 69.632 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 64 | 2048 | 2048 | 89.826 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 2048 | 126.433 |
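
To measure latency in the same spirit as the table above, one generation call is timed for a fixed batch size and output length, as in the hedged sketch below (a single warm-up and a single timed iteration for brevity; a real measurement would average several runs).

```python
# Minimal batch-latency sketch with vLLM's offline Python API (illustrative,
# not the benchmark harness). Times one generation pass for a fixed batch size
# and output length, mirroring the Batch Size / Input / Output columns above.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="amd/Llama-3.1-70B-Instruct-FP8-KV", tensor_parallel_size=8)
params = SamplingParams(max_tokens=2048, ignore_eos=True)

batch = ["Explain mixed-precision training."] * 8  # batch size 8, placeholder prompts

llm.generate(batch, params)  # warm-up pass, not timed
start = time.perf_counter()
llm.generate(batch, params)  # timed pass
print(f"Latency: {time.perf_counter() - start:.3f} s")
```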

Reproduce these results on your system by following these instructions.

Previous Versions

This table lists previous versions of the ROCm vLLM Docker image used for inference performance testing. For detailed information about the models available for benchmarking, see the version-specific documentation.

| Docker image tag | Components |
|---|---|
| rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006 (latest) | ROCm 7.0.0; vLLM 0.10.2; PyTorch 2.9.0 |
| rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812 | ROCm 6.4.1; vLLM 0.10.0; PyTorch 2.7.0 |
| rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715 | ROCm 6.4.1; vLLM 0.9.1; PyTorch 2.7.0 |
| rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702 | ROCm 6.4.1; vLLM 0.9.1; PyTorch 2.7.0 |
| rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605 | ROCm 6.4.1; vLLM 0.9.0.1; PyTorch 2.7.0 |
| rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521 | ROCm 6.3.1; vLLM 0.8.5 (0.8.6.dev); PyTorch 2.7.0 |
| rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513 | ROCm 6.3.1; vLLM 0.8.5; PyTorch 2.7.0 |
| rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415 | ROCm 6.3.1; vLLM 0.8.3; PyTorch 2.7.0 |
| rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325 | ROCm 6.3.1; vLLM 0.7.3; PyTorch 2.7.0 |
| rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6 | ROCm 6.3.1; vLLM 0.6.6; PyTorch 2.7.0 |
| rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 | ROCm 6.2.1; vLLM 0.6.4; PyTorch 2.5.0 |
| rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50 | ROCm 6.2.0; vLLM 0.4.3; PyTorch 2.4.0 |

AI Training

The tables below show training performance data: text-generation training throughput measured on AMD Instinct™ platforms at a given sequence length and per-GPU batch size, reported as tokens per second per GPU.
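
As a sanity check on how the metric relates to the table columns, the back-of-envelope sketch below derives tokens/sec/GPU from the per-GPU batch size, sequence length, and step time (the step time here is a made-up placeholder):

```python
# Back-of-envelope derivation of the tokens/sec/GPU metric, assuming each
# training step processes `batch_size` sequences of `seq_len` tokens per GPU.
def tokens_per_sec_per_gpu(batch_size: int, seq_len: int, step_time_s: float) -> float:
    return batch_size * seq_len / step_time_s

# Example: a per-GPU batch of 8 at sequence length 8192 with a (hypothetical)
# 2.34 s step time yields roughly 28,000 tokens/sec/GPU.
print(f"{tokens_per_sec_per_gpu(8, 8192, 2.34):,.0f} tokens/sec/GPU")
```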

PyTorch

Results on AMD Instinct™ MI355X Platform

The following results are based on:

  • Docker container: rocm/primus:v25.9_gfx950
  • Release date: October 17, 2025
  • Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.

| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | FP8 | 8 | 8192 | 0 | 1 | 1 | 1 | 28,035 |
| Llama 3.1 8B | BF16 | 5 | 8192 | 0 | 1 | 1 | 1 | 20,158 |
| Llama 3.1 70B | FP8 | 6 | 8192 | 1 | 1 | 1 | 1 | 3,570 |
| Llama 3.1 70B | BF16 | 8 | 8192 | 1 | 1 | 1 | 1 | 2,281 |
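
In these tables, FSDP = 1 means the model's parameters are sharded across the GPUs rather than replicated. The minimal PyTorch sketch below shows that sharded setup; the stock torch.nn.Transformer is a stand-in for Llama 3.1, and this is not the Primus configuration behind the numbers above.

```python
# Minimal sketch of FSDP-style parameter sharding in PyTorch (FSDP = 1 in the
# tables). Launch with `torchrun --nproc_per_node=8 fsdp_sketch.py`; the stock
# torch.nn.Transformer below is only a stand-in for Llama 3.1.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # one process per GPU; RCCL backs "nccl" on ROCm
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Transformer(d_model=1024, nhead=16).cuda()
model = FSDP(model)  # parameters are sharded across all ranks

src = torch.randn(128, 4, 1024, device="cuda")  # (seq, batch, d_model)
tgt = torch.randn(128, 4, 1024, device="cuda")
model(src, tgt).sum().backward()  # FSDP gathers/reshards parameters per layer
```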

Fine-tuning

| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 70B SFT | FP8 | 8 | 8192 | 1 | 1 | 1 | 1 | 3,546 |
| Llama 3.1 70B SFT | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 2,161 |
| Llama 3.1 70B LoRA | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 2,594 |
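
The LoRA row refers to low-rank-adapter fine-tuning, where small adapter matrices are trained while the base weights stay frozen. The sketch below uses Hugging Face PEFT purely for illustration; PEFT, the model ID, and the adapter rank are assumptions here, as the benchmark itself runs on the Primus stack.

```python
# Minimal LoRA setup sketch with Hugging Face PEFT (an assumption for
# illustration -- not the Primus fine-tuning harness used for the table above).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B")  # placeholder ID
config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)      # base weights frozen, adapters trainable
model.print_trainable_parameters()        # a tiny fraction of the 70B parameters
```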

Reproduce these results on your system by following these instructions.

Results on AMD Instinct™ MI325X Platform

The following results are based on:

  • Docker container: rocm/primus:v25.9_gfx942
  • Release date: October 17, 2025
  • Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48 

| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | FP8 | 7 | 8192 | 0 | 1 | 1 | 1 | 14,984 |
| Llama 3.1 8B | BF16 | 6 | 8192 | 0 | 1 | 1 | 1 | 11,144 |
| Llama 3.1 70B | FP8 | 5 | 8192 | 1 | 1 | 1 | 1 | 1,716 |
| Llama 3.1 70B | BF16 | 6 | 8192 | 1 | 1 | 1 | 1 | 1,150 |

Fine-tuning

| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 70B SFT | FP8 | 8 | 8192 | 1 | 1 | 1 | 1 | 1,597 |
| Llama 3.1 70B SFT | BF16 | 8 | 8192 | 1 | 1 | 1 | 1 | 1,037 |
| Llama 3.1 70B LoRA | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 1,286 |

Reproduce these results on your system by following these instructions.

Results on AMD Instinct™ MI300X Platform

The following results are based on:

  • Docker container: rocm/primus:v25.9_gfx942
  • Release date: October 17, 2025
  • Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.

| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | FP8 | 5 | 8192 | 0 | 1 | 1 | 1 | 12,216 |
| Llama 3.1 8B | BF16 | 4 | 8192 | 0 | 1 | 1 | 1 | 9,186 |
| Llama 3.1 70B | FP8 | 3 | 8192 | 1 | 1 | 1 | 1 | 1,307 |
| Llama 3.1 70B | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 887 |

Fine-tuning

| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 70B SFT | FP8 | 4 | 8192 | 1 | 1 | 1 | 1 | 1,343 |
| Llama 3.1 70B SFT | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 855 |
| Llama 3.1 70B LoRA | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1,053 |

Reproduce these results on your system by following these instructions.

Previous Versions

This table lists previous versions of the PyTorch Docker image used for training performance testing. For detailed information about the models available for benchmarking, see the version-specific documentation.

| Image version | ROCm version | PyTorch version |
|---|---|---|
| v25.9 (latest) | 7.0.0 | Primus 0.3.0; PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7 |
| v25.8 | 6.4.3 | 2.8.0a0+gitd06a406 |
| v25.7 | 6.4.2 | 2.8.0a0+gitd06a406 |
| v25.6 | 6.3.4 | 2.8.0a0+git7d205b2 |
| v25.5 | 6.3.4 | 2.7.0a0+git637433 |
| v25.4 | 6.3.0 | 2.7.0a0+git637433 |

Megatron-LM

Results on AMD Instinct™ MI355X Platform

The following results are based on:

  • Docker container: rocm/primus:v25.9_gfx950
  • Release date: October 17, 2025
  • Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.

| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | FP8 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 32,451 |
| Llama 3.1 8B | 1 | BF16 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 21,908 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8191 | 1 | 1 | 1 | 1 | - | 2,074 |
| Llama 3.3 70B | 1 | BF16 | 6 | 8191 | 1 | 1 | 1 | 1 | - | 2,024 |
| Mixtral 8x7B | 1 | BF16 | 4 | 4096 | 0 | 1 | 1 | 1 | 8 | 13,008 |

Reproduce these results on your system by following these instructions.

Results on AMD Instinct™ MI325X Platform

The following results are based on:

  • Docker container: rocm/primus:v25.9_gfx942
  • Release date: October 17, 2025
  • Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48

| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | FP8 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 16,678 |
| Llama 3.1 8B | 1 | BF16 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 11,803 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8191 | 1 | 1 | 1 | 1 | - | 1,091 |
| Llama 3.3 70B | 1 | BF16 | 5 | 8191 | 1 | 1 | 1 | 1 | - | 1,052 |
| Mixtral 8x7B | 1 | BF16 | 4 | 4096 | 0 | 1 | 1 | 1 | 8 | 6,511 |

Reproduce these results on your system by following these instructions.

Results on AMD Instinct™ MI300X Platform

The following results are based on:

  • Docker container: rocm/primus:v25.9_gfx942
  • Release date: October 17, 2025
  • Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
    For the multi-node run: Dual Intel Xeon® Platinum 8480+ processor-based server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 79007700, Ubuntu® 22.04, Host GPU driver ROCm 6.3.0-39.

| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | FP8 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 14,208 |
| Llama 3.1 8B | 1 | BF16 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 9,782 |
| Llama 3.1 8B | 8 | FP8 | 2 | 8192 | 0 | 1 | 1 | 1 | - | 13,328 |
| Llama 3.1 70B | 1 | BF16 | 3 | 8191 | 1 | 1 | 1 | 1 | - | 827 |
| Llama 3.3 70B | 1 | BF16 | 2 | 8191 | 1 | 1 | 1 | 1 | - | 822 |
| Mixtral 8x7B | 1 | BF16 | 2 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,430 |

Reproduce these results on your system by following these instructions.

Previous Versions

This table lists previous versions of the Megatron-LM Docker image used for training performance testing. For detailed information about the models available for benchmarking, see the version-specific documentation.

| Image version | ROCm version | PyTorch version |
|---|---|---|
| v25.9 (latest) | 7.0.0 | Primus 0.3.0; PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7 |
| v25.8 | 6.4.3 | 2.8.0a0+gitd06a406 |
| v25.7 | 6.4.2 | 2.8.0a0+gitd06a406 |
| v25.6 | 6.4.1 | 2.8.0a0+git7d205b2 |
| v25.5 | 6.3.4 | 2.8.0a0+gite2f9759 |
| v25.4 | 6.3.0 | 2.7.0a0+git637433 |

JAX MaxText

Results on AMD Instinct™ MI355X Platform

The following results are based on:

  • Docker container: rocm/jax-training:maxtext-v25.9
  • Release date: October 17, 2025
  • Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.

| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 21,306 |
| Llama 3.1 8B | 1 | FP8 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 26,756 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 2,440 |
| Llama 3.1 70B | 1 | FP8 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 3,793 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 2,441 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 10,597 |
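
In JAX-based training such as MaxText, the FSDP/TP/CP/PP/EP columns correspond to axes of a device mesh laid over the GPUs. The sketch below shows that mechanism on a single 8-GPU node with generic axis names; it is not MaxText's own configuration syntax.

```python
# Minimal sketch of mapping parallelism axes onto a JAX device mesh for one
# 8-GPU node (an 8-way FSDP-style sharding axis, single-member tensor axis).
# Axis names are generic, not MaxText configuration keys.
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec

devices = np.array(jax.devices()).reshape(8, 1)  # assumes 8 visible accelerators
mesh = Mesh(devices, axis_names=("fsdp", "tp"))

# Shard a weight matrix along the fsdp axis and replicate it along tp.
spec = NamedSharding(mesh, PartitionSpec("fsdp", None))
w = jax.device_put(np.zeros((8192, 8192), np.float32), spec)
print(w.sharding)  # each device holds a 1024 x 8192 shard
```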

Reproduce these results on your system by following these instructions.

Results on AMD Instinct™ MI325X Platform

The following results are based on:

  • Docker container: rocm/jax-training:maxtext-v25.9
  • Release date: October 17, 2025
  • Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48 

| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 10,292 |
| Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,178 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,178 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,519 |

Reproduce these results on your system by following these instructions.

Results on AMD Instinct™ MI300X Platform

The following results are based on:

  • Docker container: rocm/jax-training:maxtext-v25.9
  • Release date: October 17, 2025
  • Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
    For the multi-node run: Dual AMD EPYC 9654 processor-based server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 3.10, Ubuntu® 22.04, Host GPU driver ROCm 6.3.1-48.

| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 8,587 |
| Llama 3.1 8B | 8 | BF16 | 15 | 8192 | 1 | 1 | 1 | 1 | 1 | 7,813 |
| Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 949 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 949 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 4,622 |

Reproduce these results on your system by following these instructions.

Previous Versions

This table lists previous versions of the ROCm JAX MaxText Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.  

| Image version | ROCm version | JAX version |
|---|---|---|
| v25.9 (latest) | 7.0.0 | 0.6.2 |
| v25.7 | 6.4.1 | 0.6.0, 0.5.0 |
| v25.5 | 6.3.4 | 0.4.35 |
| v25.4 | 6.3.0 | 0.4.31 |

Notes

  1. TP stands for Tensor Parallelism.
  2. Throughput is measured in tokens per second.
  3. Latency is measured in seconds.