This page summarizes performance measurements on AMD Instinct™ GPUs for popular AI models.

The data in the following tables is a reference point to help users evaluate observed performance. It should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.

AI Inference

Throughput Measurements

The table below shows throughput measurements for a client-server scenario under maximum load, where a local inference client is fed requests at an infinite rate (no delay between requests).

This result is based on the Docker container (rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006), which was released on October 6, 2025.

Model Precision TP Size Input Output Num Prompts Max Num Seqs Throughput (tokens/s)
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) FP8 8 128 2048 3200 3200 13212.5
      128 4096 1500 1500 11312.8
      500 2000 2000 2000 11376.7
      2048 2048 1500 1500 7252.1
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) FP8 8 128 2048 1500 1500 4201.7
      128 4096 1500 1500 3176.3
      500 2000 2000 2000 2992.0
      2048 2048 500 500 2153.7

TP stands for Tensor Parallelism.

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04, amdgpu driver 6.8.5

Reproduce these results on your system by following the instructions in the measuring inference performance with vLLM on AMD GPUs user guide.
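
For orientation, the sketch below shows the kind of measurement this scenario describes: submit a batch of requests to vLLM's offline engine and divide the number of generated tokens by the elapsed wall-clock time. The prompt set, request count, and sampling settings are illustrative assumptions and do not match the benchmark configuration used for the table above.

    import time
    from vllm import LLM, SamplingParams  # vLLM offline inference API

    # Illustrative settings only; the published numbers come from the benchmark
    # scripts referenced in the user guide, not from this snippet.
    llm = LLM(model="amd/Llama-3.1-70B-Instruct-FP8-KV",  # model from the table above
              tensor_parallel_size=8)                     # TP size 8, as in the table

    prompts = ["Summarize the history of GPU computing."] * 128  # request batch (assumed)
    params = SamplingParams(max_tokens=2048, temperature=0.0, ignore_eos=True)

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)   # vLLM schedules and batches all requests
    elapsed = time.perf_counter() - start

    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{generated / elapsed:.1f} generated tokens/s")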

Latency Measurements

The table below shows latency measurements: the time from when the system receives an input to when the model produces a result.

This result is based on the Docker container (rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006), which was released on October 6, 2025.

Model Precision TP Size Batch Size Input Output Latency (sec)
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) FP8 8 1 128 2048 15.882
      2 128 2048 17.934
      4 128 2048 18.487
      8 128 2048 20.251
      16 128 2048 22.307
      32 128 2048 29.933
      64 128 2048 32.359
      128 128 2048 45.419
      1 2048 2048 15.959
      2 2048 2048 18.177
      4 2048 2048 18.684
      8 2048 2048 20.716
      16 2048 2048 23.136
      32 2048 2048 26.969
      64 2048 2048 34.359
      128 2048 2048 52.351
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) FP8 8 1 128 2048 49.098
      2 128 2048 51.009
      4 128 2048 52.979
      8 128 2048 55.675
      16 128 2048 58.982
      32 128 2048 67.889
      64 128 2048 86.844
      128 128 2048 117.440
      1 2048 2048 49.033
      2 2048 2048 51.316
      4 2048 2048 52.947
      8 2048 2048 55.863
      16 2048 2048 60.103
      32 2048 2048 69.632
      64 2048 2048 89.826
      128 2048 2048 126.433

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04, amdgpu driver 6.8.5
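
For reference, each latency figure above is the end-to-end time for one generation pass at a fixed batch size. A minimal sketch of such a measurement with vLLM's offline API follows; the prompt contents and warm-up handling are illustrative assumptions, not the benchmark's exact procedure.

    import time
    from vllm import LLM, SamplingParams

    llm = LLM(model="amd/Llama-3.1-70B-Instruct-FP8-KV",  # model from the table above
              tensor_parallel_size=8)

    batch_size = 8                                    # one of the batch sizes in the table
    prompts = ["A placeholder prompt."] * batch_size  # input token count is illustrative
    params = SamplingParams(max_tokens=2048, temperature=0.0, ignore_eos=True)

    llm.generate(prompts, params)                     # warm-up pass (not timed)

    start = time.perf_counter()
    llm.generate(prompts, params)                     # timed end-to-end generation
    print(f"latency: {time.perf_counter() - start:.3f} s")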

Previous versions

This table lists previous versions of the ROCm vLLM inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Docker image tag Components Resources

rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
  • ROCm 6.4.1
  • vLLM 0.10.0
  • PyTorch 2.7.0
rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
  • ROCm 6.4.1
  • vLLM 0.9.1
  • PyTorch 2.7.0
rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702
  • ROCm 6.4.1
  • vLLM 0.9.1
  • PyTorch 2.7.0
rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605
  • ROCm 6.4.1
  • vLLM 0.9.0.1
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521
  • ROCm 6.3.1
  • vLLM 0.8.5 (0.8.6.dev)
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513
  • ROCm 6.3.1
  • vLLM 0.8.5
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415
  • ROCm 6.3.1
  • vLLM 0.8.3
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
  • ROCm 6.3.1
  • vLLM 0.7.3
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
  • ROCm 6.3.1
  • vLLM 0.6.6
  • PyTorch 2.7.0
rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
  • ROCm 6.2.1
  • vLLM 0.6.4
  • PyTorch 2.5.0
rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
  • ROCm 6.2.0
  • vLLM 0.4.3
  • PyTorch 2.4.0

Reproduce these results on your system by following the instructions in the measuring inference performance with ROCm vLLM Docker on AMD GPUs user guide.

AI Training

The tables below show training performance data: text-generation training throughput measured on AMD Instinct™ platforms at the listed sequence lengths and batch sizes, reported as tokens per second per GPU.
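
The tokens-per-second-per-GPU metric follows directly from the per-step wall-clock time, since each step processes batch size × sequence length tokens on every GPU. The helper below illustrates the arithmetic; the step time is a placeholder, not a measured value.

    def tokens_per_sec_per_gpu(batch_size: int, seq_len: int, step_time_s: float) -> float:
        """Tokens processed per second on a single GPU for one training step."""
        return batch_size * seq_len / step_time_s

    # Example with a placeholder step time: a per-GPU batch of 19 sequences of
    # 8192 tokens completing a step in 15.8 s gives roughly 9,850 tokens/s/GPU.
    print(tokens_per_sec_per_gpu(batch_size=19, seq_len=8192, step_time_s=15.8))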

PyTorch training results on the AMD Instinct™ MI300X platform

This result is based on the Docker container (rocm/pytorch-training:v25.8), which was released on September 19, 2025.

Models Precision Batch Size Sequence Length FSDP TP CP PP Tokens/Sec/GPU
Llama 3.1 8B FP8 19 8192 0 1 1 1 9,823
Llama 3.1 8B BF16 19 8192 0 1 1 1 7,818
Llama 3.1 70B FP8 3 8192 1 1 1 1 1,257
Llama 3.1 70B BF16 4 8192 1 1 1 1 889

Fine-tuning

This result is based on the Docker container (rocm/pytorch-training:v25.8), which was released on September 19, 2025.

Models Precision Batch Size Sequence Length FSDP TP CP PP Tokens/Sec/GPU
Llama 3.1 70B SFT FP8 4 8192 1 1 1 1 1,229
Llama 3.1 70B SFT BF16 4 8192 1 1 1 1 825
Llama 3.1 70B LoRA BF16 4 8192 1 1 1 1 1,004

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-12, ROCm 6.4.3.

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.
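
The LoRA row in the fine-tuning table above trains low-rank adapter weights rather than the full model. A minimal sketch of attaching LoRA adapters with the Hugging Face peft library is shown below; the model identifier, adapter rank, and target modules are illustrative assumptions and are not the benchmark's configuration.

    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model  # Hugging Face PEFT

    # Illustrative settings only; loading a 70B model this way also assumes
    # enough GPU memory or an appropriate device map.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-70B-Instruct", torch_dtype=torch.bfloat16
    )

    lora_config = LoraConfig(
        r=16,                          # adapter rank (assumed)
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)  # only the adapter weights remain trainable
    model.print_trainable_parameters()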

PyTorch training results on the AMD Instinct™ MI325X platform

This result is based on the Docker container (rocm/pytorch-training:v25.8), which was released on September 19, 2025.

Models Precision Batch Size Sequence Length FSDP TP CP PP Tokens/Sec/GPU
Llama 3.1 8B FP8 16 8192 0 1 1 1 12,560
Llama 3.1 8B BF16 25 8192 0 1 1 1 9,683
Llama 3.1 70B FP8 5 8192 1 1 1 1 1,667
Llama 3.1 70B BF16 6 8192 1 1 1 1 1,156

Fine-tuning

This result is based on the Docker container (rocm/pytorch-training:v25.8), which was released on September 19, 2025.

Models Precision Batch Size Sequence Length FSDP TP CP PP Tokens/Sec/GPU
Llama 3.1 70B SFT FP8 16 8192 1 1 1 1 1,436
Llama 3.1 70B SFT BF16 16 8192 1 1 1 1 1,005
Llama 3.1 70B LoRA BF16 16 8192 1 1 1 1 1,213

Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48, ROCm 6.4.3.

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.

Previous versions

This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Image version ROCm version PyTorch version Resources
v25.8 (latest) 6.4.3 2.8.0a0+gitd06a406
v25.7 6.4.2 2.8.0a0+gitd06a406
v25.6 6.3.4 2.8.0a0+git7d205b2
v25.5 6.3.4 2.7.0a0+git637433
v25.4 6.3.0 2.7.0a0+git637433

Megatron-LM training results on the AMD Instinct™ MI300X platform

This result is based on the Docker container (rocm/megatron-lm:v25.8_py310), which was released on September 19, 2025.

Models # nodes Precision Batch Size Sequence Length FSDP TP CP PP EP Tokens/Sec/GPU
Llama 3.1 8B 1 FP8 2 8191 0 1 1 1 - 12,605
Llama 3.1 8B 1 BF16 2 8191 0 1 1 1 - 9,338
Llama 3.1 8B 8 FP8 2 8191 0 1 1 1 - 12,096
Llama 3.1 70B 1 BF16 3 8191 1 1 1 1 - 792
Llama 3.3 70B 1 BF16 3 8191 1 1 1 1 - 789
Mixtral 8x7B 1 BF16 2 4096 0 1 1 1 8 5,263

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120, ROCm 6.4.3.

For the multi-node run, server: Dual Intel Xeon Platinum 8480+ processors with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 79007700, Ubuntu® 22.04, Host GPU driver ROCm 6.3.0-39, ROCm 6.4.3.

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm Megatron-LM Docker on AMD GPUs user guide.
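
The TP, CP, PP, and EP columns correspond to Megatron-LM's tensor, context, pipeline, and expert parallelism settings. The sketch below shows how one row of the table above (the Mixtral 8x7B configuration) might map onto upstream Megatron-LM launch arguments; the ROCm container ships its own launch scripts, so flag names and required extra arguments may differ.

    import subprocess

    # Illustrative mapping of the table's parallelism columns onto upstream
    # Megatron-LM arguments; model and data arguments are omitted here.
    args = [
        "torchrun", "--nproc_per_node=8", "pretrain_gpt.py",
        "--tensor-model-parallel-size", "1",    # TP column
        "--context-parallel-size", "1",         # CP column
        "--pipeline-model-parallel-size", "1",  # PP column
        "--expert-model-parallel-size", "8",    # EP column (Mixtral 8x7B row)
        "--micro-batch-size", "2",              # Batch Size column
        "--seq-length", "4096",                 # Sequence Length column
        "--bf16",                               # Precision column
    ]
    subprocess.run(args, check=True)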

Megatron-LM training results on the AMD Instinct™ MI325X platform

This result is based on the Docker container (rocm/megatron-lm:v25.8_py310), which was released on September 19, 2025.

Models # nodes Precision Batch Size Sequence Length FSDP TP CP PP EP Tokens/Sec/GPU
Llama 3.1 8B 1 FP8 2 8191 0 1 1 1 - 14,895
Llama 3.1 8B 1 BF16 4 8191 0 1 1 1 - 11,389
Llama 3.1 70B 1 BF16 4 8191 1 1 1 1 - 1,029
Llama 3.3 70B 1 BF16 5 8191 1 1 1 1 - 1,020
Mixtral 8x7B 1 BF16 4 4096 0 1 1 1 8 6,339

Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48, ROCm 6.4.3.

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm Megatron-LM Docker on AMD GPUs user guide.

Previous versions

This table lists previous versions of the Megatron-LM training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Image version ROCm version PyTorch version Resources
v25.8 (latest) 6.4.3 2.8.0a0+gitd06a406
v25.7 6.4.2 2.8.0a0+gitd06a406
v25.6 6.4.1 2.8.0a0+git7d205b2
v25.5 6.3.4 2.8.0a0+gite2f9759
v25.4 6.3.0 2.7.0a0+git637433

JAX MaxText v0.6.0 training results on the AMD Instinct™ MI300X platform

This result is based on the Docker container (rocm/jax-training:maxtext-v25.7-jax060), which was released on September 19, 2025.

Models # nodes Precision Batch Size Sequence Length FSDP TP CP PP EP Tokens/Sec/GPU
Llama 3.1 8B 1 BF16 4 8192 1 1 1 1 1 8,661
Llama 3.1 70B 1 BF16 7 8192 1 1 1 1 1 920
Llama 3.3 70B 1 BF16 7 8192 1 1 1 1 1 920
Mixtral 8x7B 1 BF16 12 4096 0 1 1 1 8 4,564

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120, ROCm 6.4.

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm JAX MaxText Docker on AMD GPUs user guide.

JAX MaxText v0.6.0 training results on the AMD Instinct™ MI325X platform

This result is based on the Docker container (rocm/jax-training:maxtext-v25.7-jax060), which was released on September 19, 2025.

Models # nodes Precision Batch Size Sequence Length FSDP TP CP PP EP Tokens/Sec/GPU
Llama 3.1 8B 1 BF16 4 8192 1 1 1 1 1 10,492
Llama 3.1 70B 1 BF16 7 8192 1 1 1 1 1 1,141
Llama 3.3 70B 1 BF16 7 8192 1 1 1 1 1 1,141
Mixtral 8x7B 1 BF16 12 4096 0 1 1 1 8 5,410

Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48, ROCm 6.4.

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm JAX MaxText Docker on AMD GPUs user guide.

JAX MaxText v0.5.0 training results on the AMD Instinct™ MI300X platform

This result is based on the Docker container (rocm/jax-training:maxtext-v25.7), which was released on September 19, 2025.

Models # nodes Precision Batch Size Sequence Length FSDP TP CP PP EP Tokens/Sec/GPU
Llama 3.1 8B 1 BF16 4 8192 1 1 1 1 1 8,114
Llama 3.1 8B 8 BF16 4 8192 1 1 1 1 - 7,298
Llama 3.1 70B 1 BF16 7 8192 1 1 1 1 1 900
Llama 3.3 70B 1 BF16 7 8192 1 1 1 1 1 901
Mixtral 8x7B 1 BF16 12 4096 0 1 1 1 8 4,333

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120, ROCm 6.4.1.

For the multi-node run, server: Dual AMD EPYC 9654 processors with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 3.10, Ubuntu® 22.04, Host GPU driver ROCm 6.3.1-48, ROCm 6.4.

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm JAX MaxText Docker on AMD GPUs user guide.

JAX MaxText v0.5.0 training results on the AMD Instinct™ MI325X platform

This result is based on the Docker container (rocm/jax-training:maxtext-v25.7), which was released on September 19, 2025.

Models # nodes Precision Batch Size Sequence Length FSDP TP CP PP EP Tokens/Sec/GPU
Llama 3.1 8B 1 BF16 4 8192 1 1 1 1 1 9,943
Llama 3.1 70B 1 BF16 7 8192 1 1 1 1 1 1,114
Llama 3.3 70B 1 BF16 7 8192 1 1 1 1 1 1,115
Mixtral 8x7B 1 BF16 12 4096 0 1 1 1 8 5,191

Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48, ROCm 6.4.1.

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm JAX MaxText Docker on AMD GPUs user guide.

Previous versions

This table lists previous versions of the ROCm JAX MaxText Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Image version ROCm version JAX version Resources
v25.7 (latest) 6.4.1 0.6.0, 0.5.0
v25.5 6.3.4 0.4.35
v25.4 6.3.0 0.4.31