This page summarizes performance measurements on AMD Instinct™ GPUs for popular AI models.

The data in the following tables is a reference point to help users evaluate observed performance. It should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.

AI Inference

Throughput Measurements

The table below shows throughput measurements collected by feeding a local inference client requests at an infinite rate, representing a client-server scenario under maximum load.

These results are based on the Docker container (rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715), which was released on July 16, 2025.

Model Precision TP Size Input Output Num Prompts Max Num Seqs Throughput (tokens/s)
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) FP8 8 128 2048 3200 3200 12638.9
      128 4096 1500 1500 10756.8
      500 2000 2000 2000 10691.7
      2048 2048 1500 1500 7354.9
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) FP8 8 128 2048 1500 1500 3912.8
      128 4096 1500 1500 3084.7
      500 2000 2000 2000 2935.9
      2048 2048 500 500 2191.5

TP stands for Tensor Parallelism.

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.1 + amdgpu driver 6.8.5 

Reproduce these results on your system by following the instructions in the measuring inference performance with vLLM on AMD GPUs user guide.
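
For orientation, the following is a minimal sketch of how offline throughput can be measured with vLLM's Python API from inside the container. It is illustrative only and does not reproduce the exact benchmark harness behind the table above; the model ID, tensor parallel size, sequence lengths, and max_num_seqs are taken from the first table row, while the prompt construction and the throughput calculation are simplifying assumptions.

    # Illustrative offline-throughput sketch using vLLM's Python API; the published
    # numbers come from the benchmark scripts described in the linked user guide.
    import time
    from vllm import LLM, SamplingParams

    INPUT_LEN, OUTPUT_LEN, NUM_PROMPTS = 128, 2048, 3200   # first throughput row above

    llm = LLM(
        model="amd/Llama-3.1-70B-Instruct-FP8-KV",  # FP8 weights with FP8 KV cache
        tensor_parallel_size=8,                     # TP size from the table
        max_num_seqs=3200,                          # scheduler concurrency limit
        kv_cache_dtype="fp8",
    )

    # Synthetic prompts of roughly INPUT_LEN tokens; a real benchmark builds prompts
    # with exactly INPUT_LEN tokens using the model's tokenizer.
    prompts = [" ".join(["hello"] * INPUT_LEN) for _ in range(NUM_PROMPTS)]
    params = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start

    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"Throughput: {generated / elapsed:.1f} output tokens/s")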

Latency Measurements

The table below shows latency measurements, which assess the time from when the system receives an input to when the model produces a result.

These results are based on the Docker container (rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715), which was released on July 16, 2025.

Model Precision TP Size Batch Size Input Output Latency (sec)
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) FP8 8 1 128 2048 17.236
      2 128 2048 18.057
      4 128 2048 18.45
      8 128 2048 19.677
      16 128 2048 22.072
      32 128 2048 24.932
      64 128 2048 33.287
      128 128 2048 46.484
      1 2048 2048 17.5
      2 2048 2048 18.055
      4 2048 2048 18.858
      8 2048 2048 20.161
      16 2048 2048 22.347
      32 2048 2048 25.966
      64 2048 2048 35.324
      128 2048 2048 52.394
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) FP8 8 1 128 2048 48.453
      2 128 2048 49.268
      4 128 2048 51.136
      8 128 2048 54.226
      16 128 2048 57.274
      32 128 2048 68.901
      64 128 2048 88.631
      128 128 2048 117.027
      1 2048 2048 48.362
      2 2048 2048 49.121
      4 2048 2048 52.347
      8 2048 2048 54.471
      16 2048 2048 57.841
      32 2048 2048 70.538
      64 2048 2048 91.452
      128 2048 2048 125.471

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.1 + amdgpu driver 6.8.5

Reproduce these results on your system by following the instructions in the measuring inference performance with ROCm vLLM Docker on AMD GPUs user guide.
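
As a rough companion to the table above, the following is a minimal sketch of measuring end-to-end batch latency with vLLM's Python API. It is illustrative only; the batch size, input length, and output length match the first latency row, while the warm-up handling and prompt construction are simplifying assumptions, and the published latencies come from the benchmark scripts in the linked user guide.

    # Illustrative end-to-end latency sketch using vLLM's Python API.
    import time
    from vllm import LLM, SamplingParams

    BATCH_SIZE, INPUT_LEN, OUTPUT_LEN = 1, 128, 2048   # first latency row above

    llm = LLM(
        model="amd/Llama-3.1-70B-Instruct-FP8-KV",
        tensor_parallel_size=8,
        kv_cache_dtype="fp8",
    )

    prompts = [" ".join(["hello"] * INPUT_LEN) for _ in range(BATCH_SIZE)]
    params = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)

    llm.generate(prompts, params)                # warm-up iteration
    start = time.perf_counter()
    llm.generate(prompts, params)                # timed iteration: prefill + full decode
    print(f"Batch latency: {time.perf_counter() - start:.3f} s")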

Previous versions

This table lists previous versions of the ROCm vLLM inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Docker image tag Components Resources
rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
  • ROCm 6.4.1
  • vLLM 0.9.1
  • PyTorch 2.7.0
rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702
  • ROCm 6.4.1
  • vLLM 0.9.1
  • PyTorch 2.7.0
rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605
  • ROCm 6.4.1
  • vLLM 0.9.0.1
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521
  • ROCm 6.3.1
  • vLLM 0.8.5 (0.8.6.dev)
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513
  • ROCm 6.3.1
  • vLLM 0.8.5
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415
  • ROCm 6.3.1
  • vLLM 0.8.3
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
  • ROCm 6.3.1
  • vLLM 0.7.3
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
  • ROCm 6.3.1
  • vLLM 0.6.6
  • PyTorch 2.7.0
rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
  • ROCm 6.2.1
  • vLLM 0.6.4
  • PyTorch 2.5.0
rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
  • ROCm 6.2.0
  • vLLM 0.4.3
  • PyTorch 2.4.0

 

AI Training

The table below shows training performance data, where the AMD Instinct™ platform measures text generation training throughput for specific sequence lengths and batch sizes. It reports throughput in TFLOPS/s/GPU.

For FLUX, image generation training throughput is measured with the FLUX.1-dev model using the largest batch size that fits in memory, and is reported in frames per second (FPS) per GPU.

PyTorch training results on the AMD Instinct™ MI300X platform

These results are based on the Docker container (rocm/pytorch-training:v25.5), which was released on April 15, 2025.

Models Precision Batch Size Sequence Length TFLOPS/s/GPU
Llama 3.1 70B with FSDP BF16 4 8192 426.79
Llama 3.1 8B with FSDP BF16 3 8192 542.94
Llama 3.1 8B with FSDP FP8 3 8192 737.40
Llama 3.1 8B with FSDP BF16 6 4096 523.79
Llama 3.1 8B with FSDP FP8 6 4096 735.44
Mistral 7B with FSDP BF16 3 8192 483.17
Mistral 7B with FSDP FP8 4 8192 723.30
FLUX BF16 10 - 4.51 (FPS/GPU)*

*Note: FLUX performance is measured in FPS/GPU rather than TFLOPS/s/GPU.

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.
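
As a rough sanity check on numbers like these, dense-transformer training throughput is often estimated with the common approximation of about 6 FLOPs per parameter per token for a combined forward and backward pass. The short sketch below applies that approximation; the 12,000 tokens/s/GPU figure is purely hypothetical and is not taken from the table, and the published figures come from the training scripts in the linked user guide.

    # Rough model-FLOPs estimate for dense causal-LM training: ~6 FLOPs per parameter
    # per token for forward + backward (attention FLOPs add extra cost on top).
    def tflops_per_gpu(params_billion: float, tokens_per_sec_per_gpu: float) -> float:
        flops_per_sec = 6 * params_billion * 1e9 * tokens_per_sec_per_gpu
        return flops_per_sec / 1e12

    # Hypothetical example: an 8B-parameter model processing 12,000 tokens/s on one GPU
    # corresponds to roughly 6 * 8e9 * 12000 / 1e12 = 576 TFLOPS/s/GPU.
    print(f"{tflops_per_gpu(8, 12_000):.0f} TFLOPS/s/GPU (estimate)")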

PyTorch training results on the AMD Instinct MI325X platform

These results are based on the Docker container (rocm/pytorch-training:v25.5), which was released on April 15, 2025.

Models Precision Batch Size Sequence Length TFLOPS/s/GPU
Llama 3.1 70B with FSDP BF16 7 8192 526.13
Llama 3.1 8B with FSDP BF16 3 8192 643.01
Llama 3.1 8B with FSDP FP8 5 8192 893.68
Llama 3.1 8B with FSDP BF16 8 4096 625.96
Llama 3.1 8B with FSDP FP8 10 4096 894.98
Mistral 7B with FSDP BF16 5 8192 590.23
Mistral 7B with FSDP FP8 6 8192 860.39

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.

Previous versions

This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Date Image version ROCm version PyTorch version Resources
3/11/2025 25.4 6.3.0 2.7.0a0+git637433 Documentation Docker Hub

Megatron-LM training results on the AMD Instinct™ MI300X platform

These results are based on the Docker container (rocm/megatron-lm:v25.5), which was released on April 25, 2025.

Sequence length 8192
Model # of nodes Sequence length MBS GBS Data Type TP PP CP TFLOPS/s/GPU
llama3.1-8B 1 8192 2 128 FP8 1 1 1 697.91
llama3.1-8B 2 8192 2 256 FP8 1 1 1 690.33
llama3.1-8B 4 8192 2 512 FP8 1 1 1 686.74
llama3.1-8B 8 8192 2 1024 FP8 1 1 1 675.50

Sequence length 4096
Model # of nodes Sequence length MBS GBS Data Type TP PP CP TFLOPS/s/GPU
llama2-7B 1 4096 4 256 FP8 1 1 1 689.90
llama2-7B 2 4096 4 512 FP8 1 1 1 682.04
llama2-7B 4 4096 4 1024 FP8 1 1 1 676.83
llama2-7B 8 4096 4 2048 FP8 1 1 1 686.25

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
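
The batch-size columns above are related by simple bookkeeping: the data-parallel size is the total GPU count divided by TP × PP × CP, and the global batch size (GBS) equals the micro-batch size (MBS) times the data-parallel size times the number of gradient-accumulation steps. The sketch below checks this for the one-node llama3.1-8B row (8 GPUs, TP = PP = CP = 1, MBS 2, GBS 128); the gradient-accumulation count is derived here, not stated in the table.

    # Bookkeeping check for the Megatron-LM batch sizes in the one-node llama3.1-8B row.
    gpus_per_node, num_nodes = 8, 1
    tp, pp, cp = 1, 1, 1
    mbs, gbs = 2, 128

    data_parallel = (gpus_per_node * num_nodes) // (tp * pp * cp)   # 8 data-parallel replicas
    grad_accum_steps = gbs // (mbs * data_parallel)                 # 128 / (2 * 8) = 8 steps
    print(f"data-parallel size = {data_parallel}, gradient-accumulation steps = {grad_accum_steps}")

The same relationship holds for the multi-node rows, where GBS scales with the node count at a fixed MBS.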

For DeepSeek-V2-Lite with 16B parameters, the table below shows training performance data, where the AMD Instinct™ MI300X platform measures text generation training throughput with GEMM tuning enabled. It reports throughput in tokens per second per GPU.

These results are based on the Docker container (rocm/megatron-lm:v25.5), which was released on April 25, 2025.

Model # of GPUs Sequence length MBS GBS Data Type TP PP CP EP SP Recompute Tokens/s/GPU
DeepSeek-V2-Lite 8 4096 4 256 BF16 1 1 1 8 On None 10570

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm Megatron-LM Docker on AMD GPUs user guide.

Previous versions

This table lists previous versions of the Megatron-LM training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Date Image version ROCm version PyTorch version Resources
3/18/2025 25.4 6.3.0 2.7.0a0+git637433 Documentation Docker Hub