This page summarizes performance measurements on AMD Instinct™ GPUs running popular AI models.
The results on this page cover both inference and training benchmarks, organized as follows:
- AI Inference: vLLM
- AI Training: PyTorch, Megatron-LM, and JAX MaxText
The hardware platforms include Instinct MI355X/MI325X/MI300X GPUs, with benchmark insights provided for each framework where data is available.
The data in the following tables is a reference point to help users evaluate observed performance; it should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.
AI Inference
vLLM
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
- Release date: October 6, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04, amdgpu driver 6.8.5
Throughput Measurements
The table below shows throughput in a client-server scenario under maximum load, where a local inference client feeds requests to the server at an infinite rate.
| Model | Precision | TP¹ Size | Input | Output | No. Prompts | Max. Seqs | Throughput² |
|---|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 13212.5 |
| | | | 128 | 4096 | 1500 | 1500 | 11312.8 |
| | | | 500 | 2000 | 2000 | 2000 | 11376.7 |
| | | | 2048 | 2048 | 1500 | 1500 | 7252.1 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4201.7 |
| | | | 128 | 4096 | 1500 | 1500 | 3176.3 |
| | | | 500 | 2000 | 2000 | 2000 | 2992.0 |
| | | | 2048 | 2048 | 500 | 500 | 2153.7 |
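As an illustration of this setup, the sketch below uses vLLM's offline Python API inside the container listed above. It is a minimal sketch, not the official harness: the published numbers come from vLLM's benchmark scripts, which pin exact prompt token counts and define the exact throughput accounting. The model, tensor-parallel size, request count, and output length mirror the first table row.

```python
# Minimal offline-throughput sketch with vLLM (illustrative only).
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Llama-3.1-70B-Instruct-FP8-KV",
    tensor_parallel_size=8,   # TP Size column
    kv_cache_dtype="fp8",     # FP8 KV cache, as in the FP8-KV checkpoint
    max_num_seqs=3200,        # Max. Seqs column
)

# Submitting all prompts at once approximates the "infinite request rate"
# client described above; ignore_eos forces the full 2048-token output.
prompts = ["The quick brown fox jumps over the lazy dog. " * 16] * 3200
params = SamplingParams(max_tokens=2048, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"generated-token throughput: {generated / elapsed:.1f} tokens/s")
```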
Latency Results
The table below shows latency measurements: the time from when the system receives an input to when the model produces the complete result.
| Model | Precision | TP¹ Size | Batch Size | Input | Output | Latency³ |
|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.882 |
| | | | 2 | 128 | 2048 | 17.934 |
| | | | 4 | 128 | 2048 | 18.487 |
| | | | 8 | 128 | 2048 | 20.251 |
| | | | 16 | 128 | 2048 | 22.307 |
| | | | 32 | 128 | 2048 | 29.933 |
| | | | 64 | 128 | 2048 | 32.359 |
| | | | 128 | 128 | 2048 | 45.419 |
| | | | 1 | 2048 | 2048 | 15.959 |
| | | | 2 | 2048 | 2048 | 18.177 |
| | | | 4 | 2048 | 2048 | 18.684 |
| | | | 8 | 2048 | 2048 | 20.716 |
| | | | 16 | 2048 | 2048 | 23.136 |
| | | | 32 | 2048 | 2048 | 26.969 |
| | | | 64 | 2048 | 2048 | 34.359 |
| | | | 128 | 2048 | 2048 | 52.351 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 49.098 |
| | | | 2 | 128 | 2048 | 51.009 |
| | | | 4 | 128 | 2048 | 52.979 |
| | | | 8 | 128 | 2048 | 55.675 |
| | | | 16 | 128 | 2048 | 58.982 |
| | | | 32 | 128 | 2048 | 67.889 |
| | | | 64 | 128 | 2048 | 86.844 |
| | | | 128 | 128 | 2048 | 117.440 |
| | | | 1 | 2048 | 2048 | 49.033 |
| | | | 2 | 2048 | 2048 | 51.316 |
| | | | 4 | 2048 | 2048 | 52.947 |
| | | | 8 | 2048 | 2048 | 55.863 |
| | | | 16 | 2048 | 2048 | 60.103 |
| | | | 32 | 2048 | 2048 | 69.632 |
| | | | 64 | 2048 | 2048 | 89.826 |
| | | | 128 | 2048 | 2048 | 126.433 |
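A corresponding latency measurement can be sketched the same way: time one batch of requests from submission to completion, after a warm-up pass. Again a minimal illustration rather than the official harness; real runs pin the input to exactly 128 or 2048 tokens.

```python
# Minimal end-to-end latency sketch with vLLM (illustrative only).
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Llama-3.1-70B-Instruct-FP8-KV",
    tensor_parallel_size=8,
    kv_cache_dtype="fp8",
)

batch_size = 8                          # Batch Size column
prompts = ["Summarize ROCm in one line."] * batch_size
params = SamplingParams(max_tokens=2048, ignore_eos=True)

llm.generate(prompts, params)           # warm-up pass, discarded
start = time.perf_counter()
llm.generate(prompts, params)
print(f"end-to-end latency: {time.perf_counter() - start:.3f} s")
```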
Reproduce these results on your system by following the accompanying benchmarking instructions.
Previous Versions
This table lists previous versions of the ROCm vLLM Docker image used for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Docker image tag | Components | Resources |
|---|---|---|
| rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006 (latest) | | |
| rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812 | | |
| rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715 | | |
| rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702 | | |
| rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605 | | |
| rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521 | | |
| rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513 | | |
| rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415 | | |
| rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325 | | |
| rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6 | | |
| rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 | | |
| rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50 | | |
AI Training
The tables below show training performance data: text-generation training throughput measured on AMD Instinct™ platforms at a given sequence length and batch size, reported in tokens per second per GPU (a sketch after this list shows how the metric relates to the table columns). Results are grouped by framework:
- PyTorch
- Megatron-LM
- JAX MaxText
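As a worked example of the metric, the following sketch relates tokens/sec/GPU to the batch size and sequence length columns, under the assumption that the reported batch size is per GPU (the benchmarking harness defines the exact accounting):

```python
def tokens_per_sec_per_gpu(batch_size: int, seq_len: int, step_time_s: float) -> float:
    """Tokens processed by one GPU per second, for one training step.

    Assumes batch_size is the per-GPU micro-batch size (an assumption
    about how the tables account for parallelism).
    """
    return batch_size * seq_len / step_time_s

# Example: the first Llama 3.1 8B FP8 row below (batch 8, sequence 8192,
# 28,035 tokens/sec/GPU) implies a step time of roughly
# 8 * 8192 / 28035 ≈ 2.34 s.
print(tokens_per_sec_per_gpu(8, 8192, 2.34))  # ≈ 28,000
```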
PyTorch
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx950
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1
Pre-training
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | FP8 | 8 | 8192 | 0 | 1 | 1 | 1 | 28,035 |
| Llama 3.1 8B | BF16 | 5 | 8192 | 0 | 1 | 1 | 1 | 20,158 |
| Llama 3.1 70B | FP8 | 6 | 8192 | 1 | 1 | 1 | 1 | 3,570 |
| Llama 3.1 70B | BF16 | 8 | 8192 | 1 | 1 | 1 | 1 | 2,281 |
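In these tables, FSDP = 1 indicates fully sharded data parallelism (parameters, gradients, and optimizer state sharded across the 8 GPUs), while FSDP = 0 is plain data parallelism; TP, CP, and PP are the tensor-, context-, and pipeline-parallel degrees. The sketch below shows what FSDP wrapping looks like in plain PyTorch, using a toy two-layer model as a stand-in rather than the Llama architecture benchmarked here:

```python
# Minimal FSDP sketch; launch with `torchrun --nproc_per_node=8 fsdp_sketch.py`.
# The two-layer model is illustrative, not Llama 3.1.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()
model = FSDP(model)  # shards params/grads/optimizer state across ranks

x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()      # gradients are reduce-scattered across ranks

dist.destroy_process_group()
```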
Fine-tuning
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 70B SFT | FP8 | 8 | 8192 | 1 | 1 | 1 | 1 | 3,546 |
| Llama 3.1 70B SFT | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 2,161 |
| Llama 3.1 70B LoRA | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 2,594 |
Reproduce these results on your system by following the accompanying benchmarking instructions.
Results on AMD Instinct™ MI325X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx942
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48
Pre-training
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | FP8 | 7 | 8192 | 0 | 1 | 1 | 1 | 14,984 |
| Llama 3.1 8B | BF16 | 6 | 8192 | 0 | 1 | 1 | 1 | 11,144 |
| Llama 3.1 70B | FP8 | 5 | 8192 | 1 | 1 | 1 | 1 | 1,716 |
| Llama 3.1 70B | BF16 | 6 | 8192 | 1 | 1 | 1 | 1 | 1,150 |
Fine-tuning
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 70B SFT | FP8 | 8 | 8192 | 1 | 1 | 1 | 1 | 1,597 |
| Llama 3.1 70B SFT | BF16 | 8 | 8192 | 1 | 1 | 1 | 1 | 1,037 |
| Llama 3.1 70B LoRA | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 1,286 |
Reproduce these results on your system by following the accompanying benchmarking instructions.
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx942
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120
Pre-training
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | FP8 | 5 | 8192 | 0 | 1 | 1 | 1 | 12,216 |
| Llama 3.1 8B | BF16 | 4 | 8192 | 0 | 1 | 1 | 1 | 9,186 |
| Llama 3.1 70B | FP8 | 3 | 8192 | 1 | 1 | 1 | 1 | 1,307 |
| Llama 3.1 70B | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 887 |
Fine-tuning
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 70B SFT | FP8 | 4 | 8192 | 1 | 1 | 1 | 1 | 1,343 |
| Llama 3.1 70B SFT | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 855 |
| Llama 3.1 70B LoRA | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1,053 |
Reproduce these results on your system by following the accompanying benchmarking instructions.
Previous Versions
This table lists previous versions of the PyTorch Docker image used for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Image version | ROCm version | PyTorch version | Resources |
|---|---|---|---|
| v25.9 (latest) | 7.0.0 | Primus 0.3.0 | |
| v25.8 | 6.4.3 | 2.8.0a0+gitd06a406 | |
| v25.7 | 6.4.2 | 2.8.0a0+gitd06a406 | |
| v25.6 | 6.3.4 | 2.8.0a0+git7d205b2 | |
| v25.5 | 6.3.4 | 2.7.0a0+git637433 | |
| v25.4 | 6.3.0 | 2.7.0a0+git637433 | |
Megatron-LM
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx950
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | FP8 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 32,451 |
| Llama 3.1 8B | 1 | BF16 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 21,908 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8191 | 1 | 1 | 1 | 1 | - | 2,074 |
| Llama 3.3 70B | 1 | BF16 | 6 | 8191 | 1 | 1 | 1 | 1 | - | 2,024 |
| Mixtral 8x7B | 1 | BF16 | 4 | 4096 | 0 | 1 | 1 | 1 | 8 | 13,008 |
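For orientation, the parallelism columns correspond roughly to upstream Megatron-LM launch flags as shown below; the flag names are from upstream Megatron-LM, and the Primus wrapper used in these containers may expose them under different config keys. Values mirror the Mixtral 8x7B row above:

```python
# Hedged mapping from table columns to upstream Megatron-LM flags
# (Primus config keys may differ); values from the Mixtral 8x7B row.
megatron_flags = {
    "--micro-batch-size": 4,              # Batch Size
    "--seq-length": 4096,                 # Sequence Length
    "--tensor-model-parallel-size": 1,    # TP
    "--context-parallel-size": 1,         # CP
    "--pipeline-model-parallel-size": 1,  # PP
    "--expert-model-parallel-size": 8,    # EP: experts sharded across 8 GPUs
}
print(" ".join(f"{flag} {value}" for flag, value in megatron_flags.items()))
```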
Reproduce these results on your system by following the accompanying benchmarking instructions.
Results on AMD Instinct™ MI325X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx942
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | FP8 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 16,678 |
| Llama 3.1 8B | 1 | BF16 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 11,803 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8191 | 1 | 1 | 1 | 1 | - | 1,091 |
| Llama 3.3 70B | 1 | BF16 | 5 | 8191 | 1 | 1 | 1 | 1 | - | 1,052 |
| Mixtral 8x7B | 1 | BF16 | 4 | 4096 | 0 | 1 | 1 | 1 | 8 | 6,511 |
Reproduce these results on your system by following the accompanying benchmarking instructions.
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx942
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120
For multi-node runs, server: Dual Intel Xeon Platinum 8480+ processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 79007700, Ubuntu® 22.04, Host GPU driver ROCm 6.3.0-39.
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | FP8 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 14,208 |
| Llama 3.1 8B | 1 | BF16 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 9,782 |
| Llama 3.1 8B | 8 | FP8 | 2 | 8192 | 0 | 1 | 1 | 1 | - | 13,328 |
| Llama 3.1 70B | 1 | BF16 | 3 | 8191 | 1 | 1 | 1 | 1 | - | 827 |
| Llama 3.3 70B | 1 | BF16 | 2 | 8191 | 1 | 1 | 1 | 1 | - | 822 |
| Mixtral 8x7B | 1 | BF16 | 2 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,430 |
Reproduce these results on your system by following the accompanying benchmarking instructions.
Previous Versions
This table lists previous versions of the Megatron-LM Docker image used for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Image version | ROCm version | PyTorch version | Resources |
|---|---|---|---|
| v25.9 (latest) | 7.0.0 | Primus 0.3.0, PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7 | |
| v25.8 | 6.4.3 | 2.8.0a0+gitd06a406 | |
| v25.7 | 6.4.2 | 2.8.0a0+gitd06a406 | |
| v25.6 | 6.4.1 | 2.8.0a0+git7d205b2 | |
| v25.5 | 6.3.4 | 2.8.0a0+gite2f9759 | |
| v25.4 | 6.3.0 | 2.7.0a0+git637433 | |
JAX MaxText
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/jax-training:maxtext-v25.9
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 21,306 |
| Llama 3.1 8B | 1 | FP8 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 26,756 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 2,440 |
| Llama 3.1 70B | 1 | FP8 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 3,793 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 2,441 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 10,597 |
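In JAX terms, the FSDP = 1 column corresponds to sharding model state along a named mesh axis, which MaxText drives through config keys such as ici_fsdp_parallelism. The following is a minimal, self-contained sketch of an FSDP-style mesh over the local devices; the weight shape is illustrative:

```python
# Minimal JAX sharding sketch: build a 1-D "fsdp" mesh over all local
# devices and shard a weight matrix along its first dimension.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())          # e.g. the 8 GPUs of one node
mesh = Mesh(devices, axis_names=("fsdp",))

w = jax.device_put(
    jnp.zeros((8192, 4096)),               # illustrative weight shape
    NamedSharding(mesh, P("fsdp", None)),  # rows split across the fsdp axis
)
print(w.sharding)                          # shows the per-device layout
```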
Reproduce these results on your system by following the accompanying benchmarking instructions.
Results on AMD Instinct™ MI325X Platform
The following results are based on:
- Docker container: rocm/jax-training:maxtext-v25.9
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 10,292 |
| Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,178 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,178 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,519 |
Reproduce these results on your system by following the accompanying benchmarking instructions.
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/jax-training:maxtext-v25.9
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
For multi-node runs, server: Dual AMD EPYC 9654 processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 3.10, Ubuntu® 22.04, Host GPU driver ROCm 6.3.1-48.
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 8,587 |
| Llama 3.1 8B | 8 | BF16 | 15 | 8192 | 1 | 1 | 1 | 1 | 1 | 7,813 |
| Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 949 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 949 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 4,622 |
Reproduce these results on your system by following the accompanying benchmarking instructions.
Previous Versions
This table lists previous versions of the ROCm JAX MaxText Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Image version | ROCm version | JAX version | Resources |
|---|---|---|---|
| v25.9 (latest) | 7.0.0 | 0.6.2 | |
| v25.7 | 6.4.1 | 0.6.0, 0.5.0 | |
| v25.5 | 6.3.4 | 0.4.35 | |
| v25.4 | 6.3.0 | 0.4.31 | |
Notes
1. TP stands for Tensor Parallelism.
2. Throughput is measured in tokens/second.
3. Latency is measured in seconds.