This page summarizes performance measurements on AMD Instinct™ GPUs for popular AI models.
The data in the following tables is a reference point to help users evaluate observed performance; it should not be taken as the peak performance that AMD GPUs and ROCm™ software can deliver.
- AI Inference
- AI Training
AI Inference
Throughput Measurements
The table below shows throughput measurements for a client-server scenario under maximum load, where a local inference client feeds requests to the server at an infinite rate.
These results are based on the Docker container (rocm6.4.1_vllm_0.10.1_20250909), which was released on September 9, 2025.
Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 13818.7 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 4096 | 1500 | 1500 | 11612.0 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 500 | 2000 | 2000 | 2000 | 11408.7 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 2048 | 2048 | 1500 | 1500 | 7800.5 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4134.0 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 4096 | 1500 | 1500 | 3177.6 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 500 | 2000 | 2000 | 2000 | 3034.1 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 2048 | 2048 | 500 | 500 | 2214.2 |
TP stands for Tensor Parallelism.
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.1 + amdgpu driver 6.8.5
Reproduce these results on your system by following the instructions in the measuring inference performance with vLLM on AMD GPUs user guide.
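The user guide above is the authoritative procedure. For orientation only, the following is a minimal sketch of an offline throughput run using vLLM's Python API; the model ID, TP size, prompt count, and token lengths mirror the first table row, while the real benchmark uses the guide's scripts and flags (additional FP8/KV-cache options may be required).

```python
# Hedged sketch of an offline vLLM throughput run (not the official benchmark).
# Values mirror the first row of the table above; the guide's scripts and
# flags (including any FP8 KV-cache options) are the reference procedure.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Llama-3.1-70B-Instruct-FP8-KV",
    tensor_parallel_size=8,    # TP Size column
    max_num_seqs=3200,         # Max Num Seqs column
)

prompts = ["Hello " * 128] * 3200          # ~128 input tokens x 3200 prompts
params = SamplingParams(max_tokens=2048,   # Output column
                        ignore_eos=True)   # force full-length generations

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {generated / elapsed:.1f} generated tokens/s")
```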
Latency Measurements
The table below shows latency measurements, which assess the time from when the system receives an input to when the model produces its result.
These results are based on the Docker container (rocm6.4.1_vllm_0.10.1_20250909), which was released on September 9, 2025.
Model | Precision | TP Size | Batch Size | Input | Output | Latency (sec) |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.254 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 2 | 128 | 2048 | 18.157 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 4 | 128 | 2048 | 18.549 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 8 | 128 | 2048 | 20.547 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 16 | 128 | 2048 | 22.164 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 32 | 128 | 2048 | 25.426 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 64 | 128 | 2048 | 33.297 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 128 | 2048 | 45.792 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 2048 | 2048 | 15.299 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 2 | 2048 | 2048 | 18.194 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 4 | 2048 | 2048 | 18.942 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 8 | 2048 | 2048 | 20.526 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 16 | 2048 | 2048 | 23.211 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 32 | 2048 | 2048 | 26.516 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 64 | 2048 | 2048 | 34.824 |
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 2048 | 52.211 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 47.150 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 2 | 128 | 2048 | 50.933 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 4 | 128 | 2048 | 52.521 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 8 | 128 | 2048 | 55.233 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 16 | 128 | 2048 | 59.065 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 32 | 128 | 2048 | 68.786 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 64 | 128 | 2048 | 88.094 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 128 | 2048 | 118.512 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 2048 | 2048 | 47.675 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 2 | 2048 | 2048 | 50.788 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 4 | 2048 | 2048 | 52.405 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 8 | 2048 | 2048 | 55.459 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 16 | 2048 | 2048 | 59.923 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 32 | 2048 | 2048 | 70.388 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 64 | 2048 | 2048 | 91.218 |
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 2048 | 127.004 |
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.1 + amdgpu driver 6.8.5
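Each row of the latency table corresponds to one (batch size, input, output) operating point. The guide's latency benchmark is the reference procedure; as a hedged orientation under the same container and model assumptions, a single point can be timed roughly like this:

```python
# Hedged sketch: end-to-end latency for one (batch size, input, output)
# point, mirroring a single table row. Real runs use the benchmark scripts
# from the user guide; prompt token counts here are only approximate.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="amd/Llama-3.1-70B-Instruct-FP8-KV",
          tensor_parallel_size=8)

batch_size = 8
prompts = ["Hello " * 128] * batch_size     # ~128 input tokens per prompt
params = SamplingParams(max_tokens=2048,    # Output column
                        ignore_eos=True)

llm.generate(prompts, params)               # warm-up pass
start = time.perf_counter()
llm.generate(prompts, params)               # timed pass
print(f"Batch latency: {time.perf_counter() - start:.3f} s")
```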
Previous versions
This list shows previous versions of the ROCm vLLM inference Docker image used for inference performance testing. For detailed information about the models available for benchmarking in each release, see the version-specific documentation.
- rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812 (latest)
- rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
- rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702
- rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605
- rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521
- rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513
- rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415
- rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
- rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
- rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
- rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
Reproduce these results on your system by following the instructions in the measuring inference performance with ROCm vLLM Docker on AMD GPUs user guide.
AI Training
The tables below show training performance data, measuring text-generation training throughput on AMD Instinct™ platforms at a given sequence length and batch size. The key metric is tokens per second per GPU.
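The metric reduces to simple accounting. A hedged sketch of that arithmetic follows, assuming the Batch Size column is the per-GPU batch and back-computing the (unpublished) step time from the first MI300X row:

```python
# Hedged arithmetic: Tokens/Sec/GPU from the table columns, assuming
# "Batch Size" is per-GPU and step_time_s is the measured time per step.
def tokens_per_sec_per_gpu(batch_size_per_gpu: int, seq_len: int,
                           step_time_s: float) -> float:
    return batch_size_per_gpu * seq_len / step_time_s

# Llama 3.1 8B FP8 on MI300X: 9,823 tok/s/GPU at batch 19 and seq 8192
# implies a step time of about 19 * 8192 / 9823 ≈ 15.8 s under this model.
print(tokens_per_sec_per_gpu(19, 8192, 15.8))  # ≈ 9851, close to the table
```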
PyTorch training results on the AMD Instinct™ MI300X platform
These results are based on the Docker container (rocm/pytorch-training:v25.8), which was released on September 19, 2025.
Models | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/Sec/GPU |
Llama 3.1 8B | FP8 | 19 | 8192 | 0 | 1 | 1 | 1 | 9,823 |
Llama 3.1 8B | BF16 | 19 | 8192 | 0 | 1 | 1 | 1 | 7,818 |
Llama 3.1 70B | FP8 | 3 | 8192 | 1 | 1 | 1 | 1 | 1,257 |
Llama 3.1 70B | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 889 |
Fine-tuning
These results are based on the Docker container (rocm/pytorch-training:v25.8), which was released on September 19, 2025.
Models | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/Sec/GPU |
Llama 3.1 70B SFT | FP8 | 4 | 8192 | 1 | 1 | 1 | 1 | 1,229 |
Llama 3.1 70B SFT | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 825 |
Llama 3.1 70B LoRA | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1,004 |
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-12, ROCm 6.4.3.
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.
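A note on the fine-tuning rows: SFT updates all model weights, while LoRA trains small low-rank adapters against a frozen base, shrinking gradient and optimizer-state traffic, which helps explain why the LoRA row outpaces BF16 full SFT at the same batch size. The exact fine-tuning stack inside the Docker image is documented in the guide; as a generic, hedged illustration (the model ID, rank, and target modules are placeholders, not the benchmark's configuration), a Hugging Face PEFT setup looks like:

```python
# Hedged illustration of a LoRA setup with Hugging Face PEFT; the rank,
# target modules, and model ID are placeholders, not the benchmark's config.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                      # adapter rank
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # small fraction of params vs. full SFT
```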
PyTorch training results on the AMD Instinct™ MI325X platform
These results are based on the Docker container (rocm/pytorch-training:v25.8), which was released on September 19, 2025.
Models | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/Sec/GPU |
Llama 3.1 8B | FP8 | 16 | 8192 | 0 | 1 | 1 | 1 | 12,560 |
Llama 3.1 8B | BF16 | 25 | 8192 | 0 | 1 | 1 | 1 | 9,683 |
Llama 3.1 70B | FP8 | 5 | 8192 | 1 | 1 | 1 | 1 | 1,667 |
Llama 3.1 70B | BF16 | 6 | 8192 | 1 | 1 | 1 | 1 | 1,156 |
Fine-tuning
These results are based on the Docker container (rocm/pytorch-training:v25.8), which was released on September 19, 2025.
Models | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/Sec/GPU |
Llama 3.1 70B SFT | FP8 | 16 | 8192 | 1 | 1 | 1 | 1 | 1,436 |
Llama 3.1 70B SFT | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 1,005 |
Llama 3.1 70B LoRA | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 1,213 |
Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48, ROCm 6.4.3.
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.
Previous versions
This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
Image version | ROCm version | PyTorch version |
v25.8 (latest) | 6.4.3 | 2.8.0a0+gitd06a406 |
v25.7 | 6.4.2 | 2.8.0a0+gitd06a406 |
v25.6 | 6.3.4 | 2.8.0a0+git7d205b2 |
v25.5 | 6.3.4 | 2.7.0a0+git637433 |
v25.4 | 6.3.0 | 2.7.0a0+git637433 |
Megatron-LM training results on the AMD Instinct™ MI300X platform
These results are based on the Docker container (rocm/megatron-lm:v25.8_py310), which was released on September 19, 2025.
Models | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/Sec/GPU |
Llama 3.1 8B | 1 | FP8 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 12,605 |
Llama 3.1 8B | 1 | BF16 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 9,338 |
Llama 3.1 8B | 8 | FP8 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 12,096 |
Llama 3.1 70B | 1 | BF16 | 3 | 8191 | 1 | 1 | 1 | 1 | - | 792 |
Llama 3.3 70B | 1 | BF16 | 3 | 8191 | 1 | 1 | 1 | 1 | - | 789 |
Mixtral 8x7B | 1 | BF16 | 2 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,263 |
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120, ROCm 6.4.3.
For the multi-node run, Server: Dual Intel Xeon Platinum 8480+ processors with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 79007700, Ubuntu® 22.04, Host GPU driver ROCm 6.3.0-39, ROCm 6.4.3.
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm Megatron-LM Docker on AMD GPUs user guide.
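For reading the parallelism columns: tensor (TP), context (CP), and pipeline (PP) parallelism shard the model, expert parallelism (EP) shards MoE experts, and whatever GPUs remain form the data-parallel dimension. A hedged sketch of that accounting, assuming 8 GPUs per node as in the server notes:

```python
# Hedged accounting for the parallelism columns: GPUs not consumed by
# tensor (TP), context (CP), or pipeline (PP) parallelism replicate the
# model as data-parallel ranks. (EP shards MoE experts within that layout.)
def data_parallel_size(num_gpus: int, tp: int, cp: int, pp: int) -> int:
    assert num_gpus % (tp * cp * pp) == 0, "parallel dims must divide GPU count"
    return num_gpus // (tp * cp * pp)

# All Llama rows above use TP = CP = PP = 1, so a single 8-GPU node runs
# 8 data-parallel replicas; the 8-node Llama 3.1 8B row runs 64.
print(data_parallel_size(8, 1, 1, 1))   # 8
print(data_parallel_size(64, 1, 1, 1))  # 64
```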
Megatron-LM training results on the AMD Instinct™ MI325X platform
These results are based on the Docker container (rocm/megatron-lm:v25.8_py310), which was released on September 19, 2025.
Models | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/Sec/GPU |
Llama 3.1 8B | 1 | FP8 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 14,895 |
Llama 3.1 8B | 1 | BF16 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 11,389 |
Llama 3.1 70B | 1 | BF16 | 4 | 8191 | 1 | 1 | 1 | 1 | - | 1,029 |
Llama 3.3 70B | 1 | BF16 | 5 | 8191 | 1 | 1 | 1 | 1 | - | 1,020 |
Mixtral 8x7B | 1 | BF16 | 4 | 4096 | 0 | 1 | 1 | 1 | 8 | 6,339 |
Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48, ROCm 6.4.3.
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm Megatron-LM Docker on AMD GPUs user guide.
Previous versions
This table lists previous versions of the Megatron-LM training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
Image version | ROCm version | PyTorch version |
v25.8 (latest) | 6.4.3 | 2.8.0a0+gitd06a406 |
v25.7 | 6.4.2 | 2.8.0a0+gitd06a406 |
v25.6 | 6.4.1 | 2.8.0a0+git7d205b2 |
v25.5 | 6.3.4 | 2.8.0a0+gite2f9759 |
v25.4 | 6.3.0 | 2.7.0a0+git637433 |
JAX MaxText v0.6.0 training results on the AMD Instinct™ MI300X platform
These results are based on the Docker container (rocm/jax-training:maxtext-v25.7-jax060), which was released on September 19, 2025.
Models | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/Sec/GPU |
Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 8,661 |
Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 920 |
Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 920 |
Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 4,564 |
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120, ROCm 6.4.
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm JAX MaxText Docker on AMD GPUs user guide.
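MaxText runs are driven by its configuration files rather than hand-written loops, but the Tokens/Sec/GPU metric itself is simple accounting around a timed train step. A hedged JAX sketch follows; step_fn, state, and batch are hypothetical stand-ins for a compiled train step and its inputs, not MaxText internals:

```python
# Hedged sketch of per-device throughput accounting in JAX; step_fn, state,
# and batch are hypothetical stand-ins, not MaxText internals.
import time
import jax

def tokens_per_sec_per_gpu(step_fn, state, batch,
                           global_batch_size: int, seq_len: int) -> float:
    state = jax.block_until_ready(step_fn(state, batch))  # compile + warm up
    start = time.perf_counter()
    state = jax.block_until_ready(step_fn(state, batch))  # timed step
    elapsed = time.perf_counter() - start
    return global_batch_size * seq_len / (elapsed * jax.device_count())
```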
JAX MaxText v0.6.0 training results on the AMD Instinct™ MI325X platform
These results are based on the Docker container (rocm/jax-training:maxtext-v25.7-jax060), which was released on September 19, 2025.
Models | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/Sec/GPU |
Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 10,492 |
Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,141 |
Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,141 |
Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,410 |
Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48, ROCm 6.4.
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm JAX MaxText Docker on AMD GPUs user guide.
JAX MaxText v0.5.0 training results on the AMD Instinct™ MI300X platform
These results are based on the Docker container (rocm/jax-training:maxtext-v25.7), which was released on September 19, 2025.
Models | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/Sec/GPU |
Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 8,114 |
Llama 3.1 8B | 8 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | - | 7,298 |
Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 900 |
Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 901 |
Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 4,333 |
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120, ROCm 6.4.1.
For the multi-node run, Server: Dual AMD EPYC 9654 processors with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 3.10, Ubuntu® 22.04, Host GPU driver ROCm 6.3.1-48, ROCm 6.4.
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm JAX MaxText Docker on AMD GPUs user guide.
JAX MaxText v0.5.0 training results on the AMD Instinct™ MI325X platform
These results are based on the Docker container (rocm/jax-training:maxtext-v25.7), which was released on September 19, 2025.
Models | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/Sec/GPU |
Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 9,943 |
Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,114 |
Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,115 |
Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,191 |
Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48, ROCm 6.4.1.
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm JAX MaxText Docker on AMD GPUs user guide.
Previous versions
This table lists previous versions of the ROCm JAX MaxText Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
Image version | ROCm version | JAX version |
v25.7 (latest) | 6.4.1 | 0.6.0, 0.5.0 |
v25.5 | 6.3.4 | 0.4.35 |
v25.4 | 6.3.0 | 0.4.31 |