This page summarizes performance measurements on AMD Instinct™ GPUs running popular AI models.
The results cover both inference and training benchmarks, organized as follows:
- AI Inference: vLLM, xDiT
- AI Training: PyTorch, Megatron-LM, and JAX MaxText
The hardware platforms include Instinct MI355X/MI325X/MI300X GPUs, with benchmark insights provided for each framework where data is available.
The data in the following tables is a reference point to help users evaluate observed performance; it should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.
AI Inference
- vLLM
- xDiT
vLLM
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210
- Release date: December 11, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04, amdgpu driver 6.14.14
Throughput Measurements
The table below shows throughput in a client-server scenario under maximum load, where a local inference client is fed requests at an infinite rate. (A minimal sketch of this kind of measurement follows the table footnotes.)
| Model | Precision | TP¹ Size | Input | Output | No. Prompts | Max. Seqs | Throughput² |
|---|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 13562.4 |
| | | | 128 | 4096 | 1500 | 1500 | 11800.9 |
| | | | 500 | 2000 | 2000 | 2000 | 11249.5 |
| | | | 2048 | 2048 | 1500 | 1500 | 7753.1 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 3822.8 |
| | | | 128 | 4096 | 1500 | 1500 | 3085.8 |
| | | | 500 | 2000 | 2000 | 2000 | 3059.9 |
| | | | 2048 | 2048 | 500 | 500 | 2192.3 |
¹ TP stands for Tensor Parallelism.
² Throughput is measured in tokens/second.
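For reference, the following is a minimal offline sketch of this kind of throughput measurement using vLLM's Python API, with values taken from the first table row. AMD's harness uses vLLM's own benchmark scripts, so flags, prompt construction, and scheduling details may differ; treat this as illustrative only.

```python
# Minimal offline-throughput sketch with vLLM; model name, TP size, prompt
# shape, and sequence limits come from the first table row above.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Llama-3.1-70B-Instruct-FP8-KV",  # model from the table
    tensor_parallel_size=8,                     # TP size from the table
    max_num_seqs=3200,                          # "Max. Seqs" column
)

# 3200 prompts of roughly 128 input tokens each ("No. Prompts" / "Input").
prompts = ["Hello " * 128] * 3200
params = SamplingParams(max_tokens=2048, ignore_eos=True)  # "Output" column

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated / elapsed:.1f} output tokens/s")
```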
Latency Measurements
The table below shows latency measurements: the time from when the system receives an input to when the model produces its result. (A timing sketch follows the table footnotes.)
| Model | Precision | TP¹ Size | Batch Size | Input | Output | Latency² |
|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 16.015 |
| | | | 2 | 128 | 2048 | 18.683 |
| | | | 4 | 128 | 2048 | 19.245 |
| | | | 8 | 128 | 2048 | 20.468 |
| | | | 16 | 128 | 2048 | 22.137 |
| | | | 32 | 128 | 2048 | 25.571 |
| | | | 64 | 128 | 2048 | 32.987 |
| | | | 128 | 128 | 2048 | 46.426 |
| | | | 1 | 2048 | 2048 | 16.421 |
| | | | 2 | 2048 | 2048 | 19.035 |
| | | | 4 | 2048 | 2048 | 20.221 |
| | | | 8 | 2048 | 2048 | 21.483 |
| | | | 16 | 2048 | 2048 | 24.350 |
| | | | 32 | 2048 | 2048 | 29.776 |
| | | | 64 | 2048 | 2048 | 40.625 |
| | | | 128 | 2048 | 2048 | 63.671 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 48.618 |
| | | | 2 | 128 | 2048 | 50.980 |
| | | | 4 | 128 | 2048 | 52.760 |
| | | | 8 | 128 | 2048 | 55.864 |
| | | | 16 | 128 | 2048 | 58.795 |
| | | | 32 | 128 | 2048 | 69.482 |
| | | | 64 | 128 | 2048 | 89.384 |
| | | | 128 | 128 | 2048 | 122.601 |
| | | | 1 | 2048 | 2048 | 49.106 |
| | | | 2 | 2048 | 2048 | 51.664 |
| | | | 4 | 2048 | 2048 | 54.220 |
| | | | 8 | 2048 | 2048 | 58.904 |
| | | | 16 | 2048 | 2048 | 65.389 |
| | | | 32 | 2048 | 2048 | 83.387 |
| | | | 64 | 2048 | 2048 | 115.575 |
| | | | 128 | 2048 | 2048 | 177.779 |
¹ TP stands for Tensor Parallelism.
² Latency is measured in seconds.
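Below is a minimal end-to-end timing sketch for a single (batch size, input, output) cell of the table above, again using vLLM's Python API. AMD's numbers come from vLLM's own latency benchmark, so treat this as an approximation of what each cell measures.

```python
# Time one batch of fixed-shape requests; values come from one table cell.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="amd/Llama-3.1-70B-Instruct-FP8-KV", tensor_parallel_size=8)

batch_size = 8                            # "Batch Size" column
prompts = ["word " * 2048] * batch_size   # roughly 2048 input tokens each
params = SamplingParams(max_tokens=2048, ignore_eos=True)  # "Output" column

llm.generate(prompts, params)             # warm-up iteration

start = time.perf_counter()
llm.generate(prompts, params)
print(f"batch latency: {time.perf_counter() - start:.3f} s")
```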
Reproduce these results on your system by following these instructions.
Previous Versions
This table lists previous versions of the ROCm vLLM Docker image used for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Docker image tag | Components | Resources |
|---|---|---|
| rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210 (latest) | | |
| rocm/vllm:rocm7.0.0_vllm_0.11.1_20251024 | | |
| rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006 | | |
| rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812 | | |
| rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715 | | |
| rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702 | | |
| rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605 | | |
| rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521 | | |
| rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513 | | |
| rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415 | | |
| rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325 | | |
| rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6 | | |
| rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 | | |
| rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50 | | |
xDiT
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/pytorch-xdit:v25.11
- Release date: Nov 24, 2025
- Server: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.3 LTS, Host GPU driver ROCm 7.10.0_preview.
| Models | Precision | Batch Size | Configuration | Latency¹ |
|---|---|---|---|---|
| Hunyuan Video | BF16 | 1 | 720p, 129 Frames, 50 steps | 86.89 |
| Wan2.1 | BF16 | 1 | 720p, 80 Frames, 40 steps | 70.06 |
| Wan2.2 | BF16 | 1 | 720p, 80 Frames, 40 steps | 74.45 |
| Flux.1 | BF16 | 1 | 1024x1240, 25 steps | 0.87 |

¹ Latency is measured in seconds.
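xDiT parallelizes diffusion-transformer pipelines of this kind across GPUs. As a point of reference, the following is a single-GPU timing sketch for the Flux.1 row using the Hugging Face diffusers FluxPipeline, which is an assumption on our part; it is not the xDiT multi-GPU setup, and the model ID is illustrative.

```python
# Single-GPU timing sketch for a Flux.1 latency cell (illustrative only).
import time
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # illustrative model ID
    torch_dtype=torch.bfloat16,       # BF16 precision column
).to("cuda")  # ROCm builds of PyTorch expose AMD GPUs via the cuda device

prompt = "a photo of an astronaut"
# 1024x1024 is used here because Flux resolutions must be multiples of 16.
pipe(prompt, height=1024, width=1024, num_inference_steps=25)  # warm-up

start = time.perf_counter()
pipe(prompt, height=1024, width=1024, num_inference_steps=25)
print(f"latency: {time.perf_counter() - start:.2f} s")
```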
Reproduce these results on your system by following these instructions.
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/pytorch-xdit:v25.11
- Release date: Nov 24, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.10.0_preview.
| Models | Precision | Batch Size | Configuration | Latency¹ |
|---|---|---|---|---|
| Hunyuan Video | BF16 | 1 | 720p, 129 Frames, 50 steps | 184.67 |
| Wan2.1 | BF16 | 1 | 720p, 80 Frames, 40 steps | 152.24 |
| Wan2.2 | BF16 | 1 | 720p, 80 Frames, 40 steps | 161.72 |
| Flux.1 | BF16 | 1 | 1024x1240, 25 steps | 1.45 |

¹ Latency is measured in seconds.
Reproduce these results on your system by following these instructions.
Previous Versions
This table lists previous versions of the ROCm xDiT Docker image used for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Docker image tag | Components | Resources |
|---|---|---|
| rocm/pytorch-xdit:v25.11 (latest) | | |
| rocm/pytorch-xdit:v25.10 | | |
AI Training
The tables below show training performance data: text-generation training throughput on AMD Instinct™ platforms, measured at a given sequence length and batch size and reported in tokens per second per GPU (see the worked example after this list).
- PyTorch
- Megatron-LM
- JAX MaxText
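To make the Tokens/sec/GPU column concrete, here is a minimal sketch of the arithmetic behind one cell, using the Llama 3.1 8B FP8 row of the MI355X PyTorch table below. The per-iteration step time is a hypothetical value chosen to roughly reproduce the published number, not a measurement from this page.

```python
# Tokens/sec/GPU from the run configuration of one table row.
micro_batch = 8        # per-GPU "Batch Size" column
seq_len     = 8192     # "Sequence Length" column
step_time_s = 2.27     # hypothetical per-iteration time; not from the table

# Each GPU processes micro_batch * seq_len tokens per optimizer step.
print(f"{micro_batch * seq_len / step_time_s:,.0f} tokens/sec/GPU")  # ~28,870
```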
PyTorch
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/primus:v25.10
- Release date: December 8, 2025
- Server: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | FP8 | 8 | 8192 | 0 | 1 | 1 | 1 | 28,889 |
| Llama 3.1 8B | BF16 | 5 | 8192 | 0 | 1 | 1 | 1 | 21,210 |
| Llama 3.1 70B | FP8 | 6 | 8192 | 1 | 1 | 1 | 1 | 3,669 |
| Llama 3.1 70B | BF16 | 8 | 8192 | 1 | 1 | 1 | 1 | 2,261 |
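The FSDP column indicates whether fully sharded data parallelism is enabled (1) or plain data parallelism is used (0). A minimal sketch with stock PyTorch FSDP follows; the Primus container's actual wrapper and sharding policies may differ, so treat this as illustrative only.

```python
# Sketch of what FSDP=1 means: shard parameters, gradients, and optimizer
# state across the node's GPUs instead of replicating them on each one.
# Launch with: torchrun --nproc-per-node=8 fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # RCCL on ROCm is selected via "nccl"
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Transformer().cuda()  # small stand-in for Llama 3.1 70B
model = FSDP(model)                    # FSDP=1 in the table
# With FSDP=0, the model would be wrapped in DistributedDataParallel instead.
```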
Reproduce these results on your system by following these instructions.
Results on AMD Instinct™ MI325X Platform
The following results are based on:
- Docker container: rocm/primus:v25.10
- Release date: December 8, 2025
- Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | FP8 | 7 | 8192 | 0 | 1 | 1 | 1 | 15,338 |
| Llama 3.1 8B | BF16 | 6 | 8192 | 0 | 1 | 1 | 1 | 11,333 |
| Llama 3.1 70B | FP8 | 5 | 8192 | 1 | 1 | 1 | 1 | 1,738 |
| Llama 3.1 70B | BF16 | 6 | 8192 | 1 | 1 | 1 | 1 | 1,168 |
Reproduce these results on your system by following these instructions.
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/primus:v25.10
- Release date: December 8, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-12.
| Model | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | FP8 | 5 | 8192 | 0 | 1 | 1 | 1 | 12,600 |
| Llama 3.1 8B | BF16 | 4 | 8192 | 0 | 1 | 1 | 1 | 9,428 |
| Llama 3.1 70B | FP8 | 3 | 8192 | 1 | 1 | 1 | 1 | 1,374 |
| Llama 3.1 70B | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 918 |
Reproduce these results on your system by following these instructions.
Previous Versions
This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Image version | ROCm version | PyTorch version | Resources |
|---|---|---|---|
| v25.10 (latest) | 7.1.0 | 2.10.0.dev20251112+rocm7.1 | |
| v25.9 | 7.0.0 | Primus 0.3.0 | |
| v25.8 | 6.4.3 | 2.8.0a0+gitd06a406 | |
| v25.7 | 6.4.2 | 2.8.0a0+gitd06a406 | |
| v25.6 | 6.3.4 | 2.8.0a0+git7d205b2 | |
| v25.5 | 6.3.4 | 2.7.0a0+git637433 | |
| v25.4 | 6.3.0 | 2.7.0a0+git637433 | |
Megatron-LM
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/primus:v25.10
- Release date: December 8, 2025
- Server: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | FP8 | 4 | 8192 | 0 | 1 | 1 | 1 | - | 33,002 |
| Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 0 | 1 | 1 | 1 | - | 22,335 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | - | 2,159 |
| Llama 3.3 70B | 1 | BF16 | 6 | 8192 | 1 | 1 | 1 | 1 | - | 2,023 |
| Mixtral 8x7B | 1 | BF16 | 4 | 4096 | 0 | 1 | 1 | 1 | 8 | 13,828 |
| DeepSeekV2 Lite | 1 | BF16 | 10 | 4096 | 0 | 1 | 1 | 1 | 8 | 31,536 |
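For reference, the parallelism columns relate to the 8-GPU world size as in the sketch below (Megatron-LM convention: the data-parallel degree is the world size divided by the product of TP, CP, and PP; the EP degree applies only to the MoE expert layers).

```python
# Worked example using the Mixtral 8x7B row of the table above.
world_size = 8                     # 1 node x 8 GPUs
tp, cp, pp, ep = 1, 1, 1, 8        # TP / CP / PP / EP columns
dp = world_size // (tp * cp * pp)  # -> 8-way data parallelism
assert ep <= world_size            # experts are sharded 8 ways across GPUs
print(f"data-parallel degree: {dp}")
```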
Reproduce these results on your system by following these instructions.
Results on AMD Instinct™ MI325X Platform
The following results are based on:
- Docker container: rocm/primus:v25.10
- Release date: December 8, 2025
- Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48
For multi-node runs, the server is: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD Instinct MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 1.5, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2.60402-120~22.04.
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | FP8 | 2 | 8192 | 0 | 1 | 1 | 1 | - | 16,074 |
| Llama 3.1 8B | 8 | FP8 | 2 | 8192 | 0 | 1 | 1 | 1 | - | 16,332 |
| Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 0 | 1 | 1 | 1 | - | 11,449 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | - | 1,108 |
| Llama 3.1 70B | 8 | FP8 | 4 | 8192 | 1 | 1 | 1 | 1 | - | 1,698 |
| Llama 3.1 70B | 8 | BF16 | 1 | 8192 | 1 | 1 | 1 | 1 | - | 1,156 |
| Llama 3.3 70B | 1 | BF16 | 5 | 8192 | 1 | 1 | 1 | 1 | - | 1,064 |
| Mixtral 8x7B | 1 | BF16 | 4 | 4096 | 0 | 1 | 1 | 1 | 8 | 6,538 |
| DeepSeekV2 Lite | 1 | BF16 | 10 | 4096 | 0 | 1 | 1 | 1 | 8 | 18,103 |
Reproduce these results on your system by following these instructions.
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/primus:v25.10
- Release date: December 8, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
For multi-node runs, the server is: Dual Intel Xeon Platinum 8480+ processors with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 79007700, Ubuntu® 22.04, Host GPU driver ROCm 6.3.0-39.
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | FP8 | 2 | 8192 | 0 | 1 | 1 | 1 | - | 13,763 |
| Llama 3.1 8B | 1 | BF16 | 2 | 8192 | 0 | 1 | 1 | 1 | - | 9,571 |
| Llama 3.1 70B | 1 | BF16 | 3 | 8192 | 1 | 1 | 1 | 1 | - | 856 |
| Llama 3.3 70B | 1 | BF16 | 2 | 8192 | 1 | 1 | 1 | 1 | - | 850 |
| Mixtral 8x7B | 1 | BF16 | 2 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,519 |
| DeepSeekV2 Lite | 1 | BF16 | 4 | 4096 | 0 | 1 | 1 | 1 | 8 | 15,211 |
Reproduce these results on your system by following these instructions.
Previous Versions
This table lists previous versions of the Megatron-LM training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Image version | ROCm version | PyTorch version | Resources |
|---|---|---|---|
| v25.10 (latest) | 7.1.0 | 2.10.0.dev20251112+rocm7.1 | |
| v25.9 | 7.0.0 | Primus 0.3.0, PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7 | |
| v25.8 | 6.4.3 | 2.8.0a0+gitd06a406 | |
| v25.7 | 6.4.2 | 2.8.0a0+gitd06a406 | |
| v25.6 | 6.4.1 | 2.8.0a0+git7d205b2 | |
| v25.5 | 6.3.4 | 2.8.0a0+gite2f9759 | |
| v25.4 | 6.3.0 | 2.7.0a0+git637433 | |
JAX MaxText
Results on AMD Instinct™ MI355X Platform
The following results are based on:
- Docker container: rocm/jax-training:maxtext-v25.9
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 21,306 |
| Llama 3.1 8B | 1 | FP8 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 26,756 |
| Llama 3.1 70B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 2,440 |
| Llama 3.1 70B | 1 | FP8 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 3,793 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 2,441 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 10,597 |
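For context, MaxText expresses these parallelism degrees as named axes of a JAX device mesh. Below is a minimal sketch assuming a single 8-GPU node and simplified axis names, mirroring the FSDP-sharded, TP=1 rows above; MaxText's real mesh uses more axes and its own configuration names.

```python
# Build an 8-device mesh with an 8-way "fsdp" axis and a trivial "tensor"
# axis, as a simplified analogue of the table's FSDP=1, TP=1 configuration.
import jax
from jax.sharding import Mesh
from jax.experimental import mesh_utils

devices = mesh_utils.create_device_mesh((8, 1))      # (fsdp, tensor)
mesh = Mesh(devices, axis_names=("fsdp", "tensor"))
print(mesh.shape)  # {'fsdp': 8, 'tensor': 1}
```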
Reproduce these results on your system by following these instructions.
Results on AMD Instinct™ MI325X Platform
The following results are based on:
- Docker container: rocm/jax-training:maxtext-v25.9
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.1-48
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 10,292 |
| Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,178 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 1,178 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 5,519 |
Reproduce these results on your system by following these instructions.
Results on AMD Instinct™ MI300X Platform
The following results are based on:
- Docker container: rocm/jax-training:maxtext-v25.9
- Release date: October 17, 2025
- Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
For multi-node runs, the server is: Dual AMD EPYC 9654 processors with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 3.10, Ubuntu® 22.04, Host GPU driver ROCm 6.3.1-48.
| Model | # nodes | Precision | Batch Size | Sequence Length | FSDP | TP | CP | PP | EP | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 1 | BF16 | 4 | 8192 | 1 | 1 | 1 | 1 | 1 | 8,587 |
| Llama 3.1 8B | 8 | BF16 | 15 | 8192 | 1 | 1 | 1 | 1 | 1 | 7,813 |
| Llama 3.1 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 949 |
| Llama 3.3 70B | 1 | BF16 | 7 | 8192 | 1 | 1 | 1 | 1 | 1 | 949 |
| Mixtral 8x7B | 1 | BF16 | 12 | 4096 | 0 | 1 | 1 | 1 | 8 | 4,622 |
Reproduce these results on your system by following these instructions.
Previous Versions
This table lists previous versions of the ROCm JAX MaxText Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.
| Image version | ROCm version | JAX version | Resources |
|---|---|---|---|
| v25.9 (latest) | 7.0.01 | 0.6.2 | |
| v25.7 | 6.4.1 | 0.6.0, 0.5.0 | |
| v25.5 | 6.3.4 | 0.4.35 | |
| v25.4 | 6.3.0 | 0.4.31 | |