This page summarizes performance measurements on AMD Instinct™ GPUs running popular AI models.
The results on this page cover both inference and training benchmarks, organized as follows:
- AI Inference: vLLM
- AI Training: PyTorch, Megatron-LM, and JAX MaxText
The hardware platforms include AMD Instinct™ MI355X, MI325X, and MI300X GPUs, with benchmark insights provided for each framework where data is available.
The data in the following tables serves as a reference point to help users evaluate observed performance; it should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.
AI Inference
vLLM
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103
- Release date: November 3, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04, amdgpu driver 6.8.5
Throughput Measurements
The table below shows throughput measurements for a client-server scenario under maximum load, where a local inference client issues requests at an infinite rate so the server is kept fully saturated.
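For orientation, the following minimal sketch shows how such a maximum-load measurement can be approximated with vLLM's offline Python API. This is not the harness used to produce the numbers below; the model path and request shape mirror the first table row, and details such as exact token accounting or extra engine flags for FP8 checkpoints may differ in the official benchmark.

```python
# Hedged sketch: offline maximum-load throughput with vLLM's Python API.
# Assumes a ROCm build of vLLM (e.g. the Docker container listed above).
import time
from vllm import LLM, SamplingParams

MODEL = "amd/Llama-3.1-70B-Instruct-FP8-KV"   # mirrors the first table row
INPUT_LEN, OUTPUT_LEN, NUM_PROMPTS = 128, 2048, 3200

# Submitting every prompt at once keeps the engine saturated, which
# approximates the "infinite request rate" scenario described above.
llm = LLM(model=MODEL, tensor_parallel_size=8, max_num_seqs=3200)
prompts = [" ".join(["hello"] * INPUT_LEN)] * NUM_PROMPTS  # ~INPUT_LEN tokens each
params = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generated tokens/s: {generated / elapsed:.1f}")
```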
| Model | Precision | TP¹ Size | Input | Output | No. Prompts | Max. Seqs | Throughput² |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 13279.6 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 4096 | 1500 | 1500 | 11449.7 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 500 | 2000 | 2000 | 2000 | 11347.4 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 2048 | 2048 | 1500 | 1500 | 7651.7 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 3816.8 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 4096 | 1500 | 1500 | 3099.6 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 500 | 2000 | 2000 | 2000 | 3026.1 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 2048 | 2048 | 500 | 500 | 2196.4 |
Latency Measurements
The table below shows latency measurements: the end-to-end time from when the system receives an input to when the model produces its complete result.
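As a rough illustration (again using vLLM's offline Python API rather than the official harness), end-to-end latency for one batch-size/input/output configuration can be measured like this:

```python
# Hedged sketch: end-to-end batch latency with vLLM's Python API.
# Assumes the same ROCm vLLM setup as the throughput sketch above.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="amd/Llama-3.1-70B-Instruct-FP8-KV", tensor_parallel_size=8)
batch = [" ".join(["hello"] * 128)] * 8        # batch size 8, ~128 input tokens
params = SamplingParams(max_tokens=2048, ignore_eos=True)

llm.generate(batch, params)                    # warm-up pass
start = time.perf_counter()
llm.generate(batch, params)                    # timed pass: input -> full result
print(f"end-to-end latency: {time.perf_counter() - start:.3f} s")
```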
| Model | Precision | TP¹ Size | Batch Size | Input | Output | Latency² |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 16.154 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 2 | 128 | 2048 | 18.041 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 4 | 128 | 2048 | 18.322 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 8 | 128 | 2048 | 20.800 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 16 | 128 | 2048 | 21.850 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 32 | 128 | 2048 | 25.513 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 64 | 128 | 2048 | 32.539 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 128 | 2048 | 45.193 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 2048 | 2048 | 16.256 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 2 | 2048 | 2048 | 18.084 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 4 | 2048 | 2048 | 18.851 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 8 | 2048 | 2048 | 20.930 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 16 | 2048 | 2048 | 23.079 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 32 | 2048 | 2048 | 26.873 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 64 | 2048 | 2048 | 34.585 |
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 2048 | 51.856 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 48.138 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 2 | 128 | 2048 | 48.366 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 4 | 128 | 2048 | 49.790 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 8 | 128 | 2048 | 53.546 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 16 | 128 | 2048 | 55.685 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 32 | 128 | 2048 | 67.445 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 64 | 128 | 2048 | 86.597 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 128 | 2048 | 120.387 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 2048 | 2048 | 48.555 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 2 | 2048 | 2048 | 48.348 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 4 | 2048 | 2048 | 49.828 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 8 | 2048 | 2048 | 53.415 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 16 | 2048 | 2048 | 57.398 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 32 | 2048 | 2048 | 68.519 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 64 | 2048 | 2048 | 90.234 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 2048 | 130.518 |
Reproduce these results on your system by following these instructions:
Previous Versions
This table lists previous versions of the ROCm vLLM inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Docker image tag |
| rocm/vllm:rocm7.0.0_vllm_0.11.1_20251024 (latest) |
| rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006 |
| rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812 |
| rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715 |
| rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702 |
| rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605 |
| rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521 |
| rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513 |
| rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415 |
| rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325 |
| rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6 |
| rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 |
| rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50 |
AI Training
The tables below show training performance data: text-generation training throughput measured on AMD Instinct™ platforms at a given sequence length and batch size, reported as tokens per second per GPU (see the sketch after this list). Results are provided for:
- PyTorch
- Megatron-LM
- JAX MaxText
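As a point of reference, the sketch below (illustrative only, not part of any benchmark harness) shows how tokens per second per GPU relates to the per-GPU batch size, sequence length, and training step time behind the tables:

```python
# Illustrative helper: relate the tables' tokens/sec/GPU metric to the
# per-GPU batch size, sequence length, and measured step time. The step
# time used below is back-derived from the table, not a measured value.
def tokens_per_sec_per_gpu(batch_per_gpu: int, seq_len: int, step_time_s: float) -> float:
    """Tokens processed by one GPU per second of training."""
    return batch_per_gpu * seq_len / step_time_s

# Example: the MI355X Llama 3.1 8B FP8 row (batch 8, sequence length 8192)
# implies a step time of roughly 2.34 s for ~28,035 tokens/sec/GPU.
print(tokens_per_sec_per_gpu(8, 8192, 2.3376))  # ~28035.6
```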
PyTorch
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx950
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
| Llama 3.1 8B | FP8 | 8 | 8192 | 0 | 1 | 1 | 1 | 28,035 |
| Llama 3.1 8B | BF16 | 5 | 8192 | 0 | 1 | 1 | 1 | 20,158 |
| Llama 3.1 70B | FP8 | 6 | 8192 | 1 | 1 | 1 | 1 | 3,570 |
| Llama 3.1 70B | BF16 | 8 | 8192 | 1 | 1 | 1 | 1 | 2,281 |
Fine-tuning
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
| Llama 3.1 70B SFT | FP8 | 8 | 8192 | 1 | 1 | 1 | 1 | 3,546 |
| Llama 3.1 70B SFT | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 2,161 |
| Llama 3.1 70B LoRA | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 2,594 |
Reproduce these results on your system by following these instructions:
Results on AMD Instinct™ MI325X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx942
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
| Llama 3.1 8B | FP8 | 7 | 8192 | 0 | 1 | 1 | 1 | 14,984 |
| Llama 3.1 8B | BF16 | 6 | 8192 | 0 | 1 | 1 | 1 | 11,144 |
| Llama 3.1 70B | FP8 | 5 | 8192 | 1 | 1 | 1 | 1 | 1,716 |
| Llama 3.1 70B | BF16 | 6 | 8192 | 1 | 1 | 1 | 1 | 1,150 |
Fine-tuning
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
| Llama 3.1 70B SFT | FP8 | 8 | 8192 | 1 | 1 | 1 | 1 | 1,597 |
| Llama 3.1 70B SFT | BF16 | 8 | 8192 | 1 | 1 | 1 | 1 | 1,037 |
| Llama 3.1 70B LoRA | BF16 | 16 | 8192 | 1 | 1 | 1 | 1 | 1,286 |
Reproduce these results on your system by following these instructions:
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx942
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-12
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
| Llama 3.1 8B | FP8 | 5 | 8192 | 0 | 1 | 1 | 1 | 12,216 |
| Llama 3.1 8B | BF16 | 4 | 8192 | 0 | 1 | 1 | 1 | 9,186 |
| Llama 3.1 70B | FP8 | 3 | 8192 | 1 | 1 | 1 | 1 | 1,307 |
| Llama 3.1 70B | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 887 |
Fine-tuning
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
| Llama 3.1 70B SFT | FP8 | 4 | 8192 | 1 | 1 | 1 | 1 | 1,343 |
| Llama 3.1 70B SFT | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 855 |
| Llama 3.1 70B LoRA | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1,053 |
Reproduce these results on your system by following these instructions:
Previous Versions
This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Image version | ROCm version | PyTorch version |
| v25.9 (latest) | 7.0.7 | Primus 0.3.0 |
| v25.8 | 6.4.3 | 2.8.0a0+gitd06a406 |
| v25.7 | 6.4.2 | 2.8.0a0+gitd06a406 |
| v25.6 | 6.3.4 | 2.8.0a0+git7d205b2 |
| v25.5 | 6.3.4 | 2.7.0a0+git637433 |
| v25.4 | 6.3.0 | 2.7.0a0+git637433 |
Megatron-LM
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx950
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
| Llama 3.1 8B | 1 | FP8 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 32,451 |
| Llama 3.1 8B | 1 | BF16 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 21,908 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8191 | 1 | 1 | 1 | 1 | - | 2,074 |
| Llama 3.3 70B | 1 | BF16 | 6 | 8191 | 1 | 1 | 1 | 1 | - | 2,024 |
| Mixtral 8x7B | 1 | BF16 | 4 | 4096 | 0 | 1 | 1 | 1 | 8 | 13,008 |
Reproduce these results on your system by following these instructions:
Results on AMD Instinct™ MI325X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx942
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
| Llama 3.1 8B | 1 | FP8 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 16,678 |
| Llama 3.1 8B | 1 | BF16 | 4 | 8191 | 0 | 1 | 1 | 1 | - | 11,803 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8191 | 1 | 1 | 1 | 1 | - | 1,091 |
| Llama 3.3 70B | 1 | BF16 | 5 | 8191 | 1 | 1 | 1 | 1 | - | 1,052 |
| Mixtral 8x7B | 1 | BF16 | 4 | 4096 | 0 | 1 | 1 | 1 | 8 | 6,511 |
Reproduce these results on your system by following these instructions:
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/primus:v25.9_gfx942
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
For multi-node runs: Dual Intel Xeon Platinum 8480+ processor-based server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 79007700, Ubuntu® 22.04, Host GPU driver ROCm 6.3.0-39.
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
| Llama 3.1 8B | 1 | FP8 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 14,208 |
| Llama 3.1 8B | 1 | BF16 | 2 | 8191 | 0 | 1 | 1 | 1 | - | 9,782 |
| Llama 3.1 8B | 8 | FP8 | 2 | 8192 | 0 | 1 | 1 | 1 | - | 13,328 |
| Llama 3.1 70B | 1 | BF16 | 3 | 8191 | 1 | 1 | 1 | 1 | - | 827 |
| Llama 3.3 70B | 1 | BF16 | 2 | 8191 | 1 | 1 | 1 | 1 | - | 822 |
| Mixtral 8x7B | 1 | BF16 | 2 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,430 |
Reproduce these results on your system by following these instructions:
Previous Versions
This table lists previous versions of the Megatron-LM training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Image version | ROCm version | PyTorch version |
| v25.9 (latest) | 7.0.0 | Primus 0.3.0, PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7 |
| v25.8 | 6.4.3 | 2.8.0a0+gitd06a406 |
| v25.7 | 6.4.2 | 2.8.0a0+gitd06a406 |
| v25.6 | 6.4.1 | 2.8.0a0+git7d205b2 |
| v25.5 | 6.3.4 | 2.8.0a0+gite2f9759 |
| v25.4 | 6.3.0 | 2.7.0a0+git637433 |
JAX MaxText
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/jax-training:maxtext-v25.9
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
| Llama 3.1 8B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 21,306 |
| Llama 3.1 8B | 1 | FP8 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 26,756 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 2,440 |
| Llama 3.1 70B | 1 | FP8 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 3,793 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 2,441 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 10,597 |
Reproduce these results on your system by following these instructions:
Results on AMD Instinct™ MI325X Platform
The following results are based on:
- Docker container: rocm/jax-training:maxtext-v25.9
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
| Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 10,292 |
| Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,178 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,178 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,519 |
Reproduce these results on your system by following these instructions:
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/jax-training:maxtext-v25.9
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
For multi-node runs: Dual AMD EPYC 9654 processor-based server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 3.10, Ubuntu® 22.04, Host GPU driver ROCm 6.3.1-48.
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
| Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 8,587 |
| Llama 3.1 8B | 8 | BF16 | 15 | 8192 | 1 | 1 | 1 | 1 | 1 | 7,813 |
| Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 949 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 949 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 4,622 |
Reproduce these results on your system by following these instructions:
Previous Versions
This table lists previous versions of the ROCm JAX MaxText Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Image version | ROCm version | JAX version |
| v25.9 (latest) | 7.0.01 | 0.6.2 |
| v25.7 | 6.4.1 | 0.6.0, 0.5.0 |
| v25.5 | 6.3.4 | 0.4.35 |
| v25.4 | 6.3.0 | 0.4.31 |
Footnotes
1. TP stands for Tensor Parallelism.
2. Throughput is measured in tokens per second; latency is measured in seconds.