This page summarizes performance measurements on AMD Instinct™ GPUs for popular AI models.
The data in the following tables is a reference point to help users evaluate observed performance. It should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.
- AI Inference
- AI Training
AI Inference
Throughput Measurements
The table below shows throughput data for a scenario in which a local inference client feeds requests at an infinite rate, reflecting client-server throughput under maximum load.
This result is based on the Docker container (rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415), which was released on April 29, 2025.
| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|---|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16896.6 |
| | | | 128 | 4096 | 1500 | 1500 | 13943.8 |
| | | | 500 | 2000 | 2000 | 2000 | 13512.8 |
| | | | 2048 | 2048 | 1500 | 1500 | 8444.5 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4359.9 |
| | | | 128 | 4096 | 1500 | 1500 | 3430.9 |
| | | | 500 | 2000 | 2000 | 2000 | 3226.8 |
| | | | 2048 | 2048 | 500 | 500 | 2228.2 |
TP stands for Tensor Parallelism.
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.2.2 + amdgpu driver 6.8.5
Reproduce these results on your system by following the instructions in the measuring inference performance with vLLM on AMD GPUs user guide.
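To illustrate the kind of offline, maximum-load measurement described above, the sketch below drives vLLM's Python `LLM` API with parameters mirroring the first table row (model, TP size, prompt count, and token lengths). It is a minimal sketch only: the synthetic prompt construction and the throughput formula are assumptions for illustration, and the published numbers come from the benchmark workflow in the linked user guide, not from this script.

```python
# Minimal sketch of an offline throughput measurement with vLLM's Python API.
# Illustration only; the published numbers come from the benchmark workflow
# described in the linked user guide, not from this exact script.
import time

from vllm import LLM, SamplingParams

NUM_PROMPTS = 3200        # mirrors the 70B, 128-in / 2048-out row above
INPUT_TOKENS = 128
OUTPUT_TOKENS = 2048

# Synthetic prompt of roughly INPUT_TOKENS tokens (assumption: one word ~ one token).
prompt = " ".join(["hello"] * INPUT_TOKENS)
prompts = [prompt] * NUM_PROMPTS

llm = LLM(
    model="amd/Llama-3.1-70B-Instruct-FP8-KV",
    tensor_parallel_size=8,          # TP size from the table
    max_num_seqs=3200,               # Max Num Seqs from the table
)
params = SamplingParams(max_tokens=OUTPUT_TOKENS, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated / elapsed:.1f} generated tokens/s")
```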
Latency Measurements
The table below shows latency measurements: the time from when the system receives an input to when the model produces a result.
This result is based on the Docker container (rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415), which was released on April 29, 2025.
| Model | Precision | TP Size | Batch Size | Input | Output | Latency (sec) |
|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.427 |
| | | | 2 | 128 | 2048 | 16.661 |
| | | | 4 | 128 | 2048 | 17.326 |
| | | | 8 | 128 | 2048 | 18.679 |
| | | | 16 | 128 | 2048 | 20.642 |
| | | | 32 | 128 | 2048 | 23.260 |
| | | | 64 | 128 | 2048 | 30.498 |
| | | | 128 | 128 | 2048 | 42.952 |
| | | | 1 | 2048 | 2048 | 15.677 |
| | | | 2 | 2048 | 2048 | 16.715 |
| | | | 4 | 2048 | 2048 | 17.684 |
| | | | 8 | 2048 | 2048 | 19.444 |
| | | | 16 | 2048 | 2048 | 22.282 |
| | | | 32 | 2048 | 2048 | 26.545 |
| | | | 64 | 2048 | 2048 | 36.651 |
| | | | 128 | 2048 | 2048 | 55.949 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 45.294 |
| | | | 2 | 128 | 2048 | 46.166 |
| | | | 4 | 128 | 2048 | 47.867 |
| | | | 8 | 128 | 2048 | 51.065 |
| | | | 16 | 128 | 2048 | 54.304 |
| | | | 32 | 128 | 2048 | 63.078 |
| | | | 64 | 128 | 2048 | 81.906 |
| | | | 128 | 128 | 2048 | 108.097 |
| | | | 1 | 2048 | 2048 | 46.003 |
| | | | 2 | 2048 | 2048 | 46.596 |
| | | | 4 | 2048 | 2048 | 49.273 |
| | | | 8 | 2048 | 2048 | 53.762 |
| | | | 16 | 2048 | 2048 | 59.629 |
| | | | 32 | 2048 | 2048 | 73.753 |
| | | | 64 | 2048 | 2048 | 103.530 |
| | | | 128 | 2048 | 2048 | 151.785 |
TP stands for Tensor Parallelism.
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.2.2 + amdgpu driver 6.8.5
Reproduce these results on your system by following the instructions in the measuring inference performance with ROCm vLLM Docker on AMD GPUs user guide.
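As a rough illustration of the latency metric, the sketch below times a single batched `generate` call end to end, using parameters that mirror the first latency row (batch size 1, 128 input tokens, 2048 output tokens). The synthetic prompt and warm-up step are assumptions for illustration; the published latencies come from the benchmark workflow in the linked user guide.

```python
# Minimal sketch of an end-to-end latency measurement: time a single batch from
# prompt submission to completion. Illustration only; the published latencies
# come from the benchmark workflow described in the linked user guide.
import time

from vllm import LLM, SamplingParams

BATCH_SIZE = 1
INPUT_TOKENS = 128
OUTPUT_TOKENS = 2048

prompt = " ".join(["hello"] * INPUT_TOKENS)   # rough INPUT_TOKENS-token prompt
prompts = [prompt] * BATCH_SIZE

llm = LLM(model="amd/Llama-3.1-70B-Instruct-FP8-KV", tensor_parallel_size=8)
params = SamplingParams(max_tokens=OUTPUT_TOKENS, ignore_eos=True)

llm.generate(prompts, params)                 # warm-up iteration

start = time.perf_counter()
llm.generate(prompts, params)
print(f"batch latency: {time.perf_counter() - start:.3f} s")
```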
Previous versions
This table lists previous versions of the ROCm vLLM inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Date | ROCm version | vLLM version | PyTorch version | Resources |
|---|---|---|---|---|
| 4/10/2025 | 6.3.1 | 0.8.3 | 2.7.0 | Documentation |
| 3/25/2025 | 6.3.1 | 0.7.3 | 2.7.0 | Documentation |
| 3/11/2025 | 6.3.1 | 0.7.3 | 2.7.0 | Documentation |
| 2/5/2025 | 6.3.1 | 0.6.6 | 2.7.0 | Documentation |
| 11/7/2024 | 6.2.1 | 0.6.4 | 2.5.0 | Documentation |
| 9/4/2024 | 6.2.0 | 0.4.3 | 2.4.0 | |
AI Training
The tables below show training performance data measured on AMD Instinct™ platforms: text generation training throughput at a given sequence length and batch size, reported as TFLOPS per second per GPU.
For FLUX, image generation training throughput is measured with the FLUX.1-dev model at the largest batch size that does not run out of memory, and is reported as frames per second per GPU.
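For context on the TFLOPS/s/GPU metric, a common back-of-the-envelope estimate for dense transformer training is roughly 6 FLOPs per parameter per trained token (forward plus backward pass). The sketch below applies that approximation with hypothetical numbers; the benchmark's own FLOP accounting may differ (for example, by including attention FLOPs), so treat this only as a sanity check, not the formula behind the tables.

```python
# Back-of-the-envelope training-throughput estimate, assuming ~6 FLOPs per
# parameter per token (forward + backward) for a dense transformer. The
# benchmark's own FLOP accounting may differ; this is only a sanity check.
def tflops_per_gpu(params_billion: float, tokens_per_second: float, num_gpus: int) -> float:
    flops_per_token = 6.0 * params_billion * 1e9
    return flops_per_token * tokens_per_second / num_gpus / 1e12

# Hypothetical example: an 8B-parameter model processing 80,000 tokens/s on 8 GPUs.
print(f"{tflops_per_gpu(8.0, 80_000, 8):.1f} TFLOPS/s/GPU")  # ~480
```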
PyTorch training results on the AMD Instinct™ MI300X platform
This result is based on the Docker container (rocm/pytorch-training:v25.5), which was released on April 15, 2025.
| Models | Precision | Batch Size | Sequence Length | TFLOPS/s/GPU |
|---|---|---|---|---|
| Llama 3.1 70B with FSDP | BF16 | 4 | 8192 | 426.79 |
| Llama 3.1 8B with FSDP | BF16 | 3 | 8192 | 542.94 |
| Llama 3.1 8B with FSDP | FP8 | 3 | 8192 | 737.40 |
| Llama 3.1 8B with FSDP | BF16 | 6 | 4096 | 523.79 |
| Llama 3.1 8B with FSDP | FP8 | 6 | 4096 | 735.44 |
| Mistral 7B with FSDP | BF16 | 3 | 8192 | 483.17 |
| Mistral 7B with FSDP | FP8 | 4 | 8192 | 723.30 |
| FLUX | BF16 | 10 | - | 4.51 (FPS/GPU)* |
*Note: FLUX performance is measured in FPS/GPU rather than TFLOPS/s/GPU.
Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.
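The FSDP entries in the table shard parameters, gradients, and optimizer state across the GPUs with PyTorch's Fully Sharded Data Parallel wrapper. The sketch below shows the basic wrapping pattern with BF16 mixed precision on a toy model; it is an assumption-laden illustration and omits the real checkpoints, data pipeline, FP8 path, and tuning behind the published numbers.

```python
# Minimal FSDP wrapping sketch (toy model, BF16 mixed precision). Launch with,
# for example, `torchrun --nproc_per_node=8 fsdp_sketch.py`. Illustrative only;
# it omits the dataset, FP8 path, and tuning used for the published results.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    model = torch.nn.Sequential(                    # stand-in for the real model
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

    # Shard the model; construct the optimizer after wrapping.
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                       reduce_dtype=torch.bfloat16),
    )
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    batch = torch.randn(4, 4096, device="cuda")     # dummy input batch
    loss = model(batch).float().pow(2).mean()       # dummy loss
    loss.backward()
    optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```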
PyTorch training results on the AMD Instinct MI325X platform
This result is based on the Docker container (rocm/pytorch-training:v25.5), which was released on April 15, 2025.
| Models | Precision | Batch Size | Sequence Length | TFLOPS/s/GPU |
|---|---|---|---|---|
| Llama 3.1 70B with FSDP | BF16 | 7 | 8192 | 526.13 |
| Llama 3.1 8B with FSDP | BF16 | 3 | 8192 | 643.01 |
| Llama 3.1 8B with FSDP | FP8 | 5 | 8192 | 893.68 |
| Llama 3.1 8B with FSDP | BF16 | 8 | 4096 | 625.96 |
| Llama 3.1 8B with FSDP | FP8 | 10 | 4096 | 894.98 |
| Mistral 7B with FSDP | BF16 | 5 | 8192 | 590.23 |
| Mistral 7B with FSDP | FP8 | 6 | 8192 | 860.39 |
Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.
Previous versions
This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Date | Image version | ROCm version | PyTorch version | Resources |
|---|---|---|---|---|
| 3/11/2025 | 25.4 | 6.3.0 | 2.7.0a0+git637433 | Docker Hub |
Megatron-LM training results on the AMD Instinct™ MI300X platform
This result is based on the Docker container (rocm/megatron-lm:v25.4), which was released on March 18, 2025.
Sequence length 8192

| Model | # of nodes | Sequence length | MBS | GBS | Data Type | TP | PP | CP | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| llama3.1-8B | 1 | 8192 | 2 | 128 | FP8 | 1 | 1 | 1 | 697.91 |
| llama3.1-8B | 2 | 8192 | 2 | 256 | FP8 | 1 | 1 | 1 | 690.33 |
| llama3.1-8B | 4 | 8192 | 2 | 512 | FP8 | 1 | 1 | 1 | 686.74 |
| llama3.1-8B | 8 | 8192 | 2 | 1024 | FP8 | 1 | 1 | 1 | 675.50 |
Sequence length 4096

| Model | # of nodes | Sequence length | MBS | GBS | Data Type | TP | PP | CP | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| llama2-7B | 1 | 4096 | 4 | 256 | FP8 | 1 | 1 | 1 | 689.90 |
| llama2-7B | 2 | 4096 | 4 | 512 | FP8 | 1 | 1 | 1 | 682.04 |
| llama2-7B | 4 | 4096 | 4 | 1024 | FP8 | 1 | 1 | 1 | 676.83 |
| llama2-7B | 8 | 4096 | 4 | 2048 | FP8 | 1 | 1 | 1 | 686.25 |
Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
For DeepSeek-V2-Lite (16B parameters), the table below shows training performance data measured on the AMD Instinct™ MI300X platform: text generation training throughput with GEMM tuning enabled, reported as TFLOPS per second per GPU.
This result is based on the Docker container (rocm/megatron-lm:v25.4), which was released on March 18, 2025.
| Model | # of GPUs | Sequence length | MBS | GBS | Data Type | TP | PP | CP | EP | SP | Recompute | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V2-Lite | 8 | 4096 | 4 | 256 | BF16 | 1 | 1 | 1 | 8 | On | None | 10570 |
Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm Megatron-LM Docker on AMD GPUs user guide.
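For reference on how the batch-size columns relate, Megatron-style launchers typically derive the data-parallel size as the world size divided by TP × PP × CP, and the number of gradient-accumulation steps as GBS / (MBS × data-parallel size). The sketch below only checks that arithmetic against one table row; it is not the launch configuration used for these results.

```python
# Consistency check for the batch-size columns in the Megatron-LM tables:
# DP = world_size / (TP * PP * CP), grad_accum = GBS / (MBS * DP).
# Illustrative arithmetic only, not the launch script used for these results.
def grad_accum_steps(world_size: int, tp: int, pp: int, cp: int, mbs: int, gbs: int) -> int:
    dp = world_size // (tp * pp * cp)       # data-parallel replicas
    assert gbs % (mbs * dp) == 0, "GBS must be divisible by MBS * DP"
    return gbs // (mbs * dp)

# llama3.1-8B, 1 node x 8 GPUs, TP=PP=CP=1, MBS=2, GBS=128 -> 8 accumulation steps
print(grad_accum_steps(world_size=8, tp=1, pp=1, cp=1, mbs=2, gbs=128))
```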
Previous versions
This table lists previous versions of the Megatron-LM training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Date | Image version | ROCm version | PyTorch version | Resources |
|---|---|---|---|---|
| 3/18/2025 | 25.4 | 6.3.0 | 2.7.0a0+git637433 | Docker Hub |