This page summarizes performance measurements on AMD Instinct™ GPUs for popular AI models.

The data in the following tables is provided as a reference point to help users evaluate observed performance. It should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.

AI Inference

Throughput Measurements

The table below shows throughput measurements from a client-server scenario in which a local inference client feeds requests at an infinite rate, placing the server under maximum load.

These results are based on the Docker container (rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415), which was released on April 29, 2025.

| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|---|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16896.6 |
| | | | 128 | 4096 | 1500 | 1500 | 13943.8 |
| | | | 500 | 2000 | 2000 | 2000 | 13512.8 |
| | | | 2048 | 2048 | 1500 | 1500 | 8444.5 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4359.9 |
| | | | 128 | 4096 | 1500 | 1500 | 3430.9 |
| | | | 500 | 2000 | 2000 | 2000 | 3226.8 |
| | | | 2048 | 2048 | 500 | 500 | 2228.2 |

TP stands for Tensor Parallelism.

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.2.2 + amdgpu driver 6.8.5

Reproduce these results on your system by following the instructions in the measuring inference performance with ROCm vLLM Docker on AMD GPUs user guide.
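
As a rough illustration of the workload behind these numbers, the following minimal sketch uses the vLLM offline Python API to generate a fixed number of prompts at a given input/output length and report tokens per second. The prompt construction, token counts, and tuning values are illustrative assumptions, not the exact benchmark harness used for the table above.

```python
# Minimal offline throughput sketch with the vLLM Python API (illustrative;
# not the exact harness behind the table above). Assumes 8 GPUs and that the
# FP8 model weights are available locally or on the Hugging Face Hub.
import time
from vllm import LLM, SamplingParams

MODEL = "amd/Llama-3.1-70B-Instruct-FP8-KV"   # model ID from the table
INPUT_LEN, OUTPUT_LEN, NUM_PROMPTS = 128, 2048, 3200

llm = LLM(
    model=MODEL,
    tensor_parallel_size=8,   # TP size from the table
    max_num_seqs=3200,        # maximum concurrent sequences
)

# Illustrative fixed-length prompts; a real benchmark would control token
# counts exactly (for example, by passing pre-tokenized prompts).
prompts = ["Hello " * INPUT_LEN for _ in range(NUM_PROMPTS)]
params = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {generated / elapsed:.1f} generated tokens/s")
```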

Latency Measurements

The table below shows latency measurements, which assess the time from when the system receives an input to when the model produces its result.

These results are based on the Docker container (rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415), which was released on April 29, 2025.

| Model | Precision | TP Size | Batch Size | Input | Output | Latency (sec) |
|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 15.427 |
| | | | 2 | 128 | 2048 | 16.661 |
| | | | 4 | 128 | 2048 | 17.326 |
| | | | 8 | 128 | 2048 | 18.679 |
| | | | 16 | 128 | 2048 | 20.642 |
| | | | 32 | 128 | 2048 | 23.260 |
| | | | 64 | 128 | 2048 | 30.498 |
| | | | 128 | 128 | 2048 | 42.952 |
| | | | 1 | 2048 | 2048 | 15.677 |
| | | | 2 | 2048 | 2048 | 16.715 |
| | | | 4 | 2048 | 2048 | 17.684 |
| | | | 8 | 2048 | 2048 | 19.444 |
| | | | 16 | 2048 | 2048 | 22.282 |
| | | | 32 | 2048 | 2048 | 26.545 |
| | | | 64 | 2048 | 2048 | 36.651 |
| | | | 128 | 2048 | 2048 | 55.949 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 45.294 |
| | | | 2 | 128 | 2048 | 46.166 |
| | | | 4 | 128 | 2048 | 47.867 |
| | | | 8 | 128 | 2048 | 51.065 |
| | | | 16 | 128 | 2048 | 54.304 |
| | | | 32 | 128 | 2048 | 63.078 |
| | | | 64 | 128 | 2048 | 81.906 |
| | | | 128 | 128 | 2048 | 108.097 |
| | | | 1 | 2048 | 2048 | 46.003 |
| | | | 2 | 2048 | 2048 | 46.596 |
| | | | 4 | 2048 | 2048 | 49.273 |
| | | | 8 | 2048 | 2048 | 53.762 |
| | | | 16 | 2048 | 2048 | 59.629 |
| | | | 32 | 2048 | 2048 | 73.753 |
| | | | 64 | 2048 | 2048 | 103.530 |
| | | | 128 | 2048 | 2048 | 151.785 |

TP stands for Tensor Parallelism.

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.2.2 + amdgpu driver 6.8.5 

Reproduce these results on your system by following the instructions in the measuring inference performance with ROCm vLLM Docker on AMD GPUs user guide.
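
For context, the sketch below shows one way to time a single batched generation with the vLLM Python API, which is roughly what an end-to-end latency measurement captures. The batch construction and prompt contents are illustrative assumptions rather than the exact benchmark script behind the table.

```python
# Minimal end-to-end latency sketch with the vLLM Python API (illustrative;
# not the exact benchmark script used for the table above).
import time
from vllm import LLM, SamplingParams

MODEL = "amd/Llama-3.1-70B-Instruct-FP8-KV"   # model ID from the table
BATCH_SIZE, INPUT_LEN, OUTPUT_LEN = 8, 128, 2048

llm = LLM(model=MODEL, tensor_parallel_size=8)

# Illustrative fixed-length prompts; exact token counts would normally be
# enforced with pre-tokenized inputs.
prompts = ["Hello " * INPUT_LEN for _ in range(BATCH_SIZE)]
params = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)

# Warm-up iteration so one-time initialization costs are excluded.
llm.generate(prompts, params)

start = time.perf_counter()
llm.generate(prompts, params)
print(f"Batch latency: {time.perf_counter() - start:.3f} s")
```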

Previous versions

This table lists previous versions of the ROCm vLLM inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

| Date | ROCm version | vLLM version | PyTorch version | Resources |
|---|---|---|---|---|
| 4/10/2025 | 6.3.1 | 0.8.3 | 2.7.0 | Documentation, Docker Hub |
| 3/25/2025 | 6.3.1 | 0.7.3 | 2.7.0 | Documentation, Docker Hub |
| 3/11/2025 | 6.3.1 | 0.7.3 | 2.7.0 | Documentation, Docker Hub |
| 2/5/2025 | 6.3.1 | 0.6.6 | 2.7.0 | Documentation, Docker Hub |
| 11/7/2024 | 6.2.1 | 0.6.4 | 2.5.0 | Documentation, Docker Hub |
| 9/4/2024 | 6.2.0 | 0.4.3 | 2.4.0 | Documentation, Docker Hub |

AI Training

The tables below show training performance data, where the AMD Instinct™ platform measures text-generation training throughput at a given sequence length and batch size. The metric is TFLOPS per GPU.

For FLUX, image-generation training throughput is measured with the FLUX.1-dev model at the largest batch size that fits in memory, and the metric is frames per second (FPS) per GPU.

PyTorch training results on the AMD Instinct™ MI300X platform

These results are based on the Docker container (rocm/pytorch-training:v25.5), which was released on April 15, 2025.

| Models | Precision | Batch Size | Sequence Length | TFLOPS/s/GPU |
|---|---|---|---|---|
| Llama 3.1 70B with FSDP | BF16 | 4 | 8192 | 426.79 |
| Llama 3.1 8B with FSDP | BF16 | 3 | 8192 | 542.94 |
| Llama 3.1 8B with FSDP | FP8 | 3 | 8192 | 737.40 |
| Llama 3.1 8B with FSDP | BF16 | 6 | 4096 | 523.79 |
| Llama 3.1 8B with FSDP | FP8 | 6 | 4096 | 735.44 |
| Mistral 7B with FSDP | BF16 | 3 | 8192 | 483.17 |
| Mistral 7B with FSDP | FP8 | 4 | 8192 | 723.30 |
| FLUX | BF16 | 10 | - | 4.51 (FPS/GPU)* |

*Note: FLUX performance is measured in FPS/GPU rather than TFLOPS/s/GPU.

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.
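
As a hedged illustration of how a TFLOPS/s/GPU figure can be estimated, the snippet below applies the common approximation of roughly 6 FLOPs per parameter per trained token (forward plus backward) to a measured token throughput. The parameter count, step time, and the 6x factor are assumptions for illustration; the numbers in the table come from the benchmark harness in the user guide, which may count FLOPs differently (for example, including attention terms).

```python
# Rough TFLOPS/s/GPU estimate from measured training throughput.
# Assumptions (illustrative): ~6 FLOPs per parameter per trained token
# (forward + backward), dense model, no activation recomputation.
def tflops_per_gpu(params: float, tokens_per_step: int,
                   step_time_s: float, num_gpus: int) -> float:
    flops_per_step = 6.0 * params * tokens_per_step
    return flops_per_step / step_time_s / num_gpus / 1e12

# Example: Llama 3.1 8B (~8.03e9 parameters), per-GPU batch size 3,
# sequence length 8192, 8 GPUs, hypothetical 3.0 s per optimizer step.
tokens = 3 * 8192 * 8          # global tokens processed per step
print(f"{tflops_per_gpu(8.03e9, tokens, 3.0, 8):.1f} TFLOPS/s/GPU")
```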

PyTorch training results on the AMD Instinct MI325X platform

These results are based on the Docker container (rocm/pytorch-training:v25.5), which was released on April 15, 2025.

| Models | Precision | Batch Size | Sequence Length | TFLOPS/s/GPU |
|---|---|---|---|---|
| Llama 3.1 70B with FSDP | BF16 | 7 | 8192 | 526.13 |
| Llama 3.1 8B with FSDP | BF16 | 3 | 8192 | 643.01 |
| Llama 3.1 8B with FSDP | FP8 | 5 | 8192 | 893.68 |
| Llama 3.1 8B with FSDP | BF16 | 8 | 4096 | 625.96 |
| Llama 3.1 8B with FSDP | FP8 | 10 | 4096 | 894.98 |
| Mistral 7B with FSDP | BF16 | 5 | 8192 | 590.23 |
| Mistral 7B with FSDP | FP8 | 6 | 8192 | 860.39 |

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.

Previous versions

This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

| Date | Image version | ROCm version | PyTorch version | Resources |
|---|---|---|---|---|
| 3/11/2025 | 25.4 | 6.3.0 | 2.7.0a0+git637433 | Documentation, Docker Hub |

Megatron-LM training results on the AMD Instinct™ MI300X platform

These results are based on the Docker container (rocm/megatron-lm:v25.4), which was released on March 18, 2025.

Sequence length 8192

| Model | # of nodes | Sequence length | MBS | GBS | Data Type | TP | PP | CP | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| llama3.1-8B | 1 | 8192 | 2 | 128 | FP8 | 1 | 1 | 1 | 697.91 |
| llama3.1-8B | 2 | 8192 | 2 | 256 | FP8 | 1 | 1 | 1 | 690.33 |
| llama3.1-8B | 4 | 8192 | 2 | 512 | FP8 | 1 | 1 | 1 | 686.74 |
| llama3.1-8B | 8 | 8192 | 2 | 1024 | FP8 | 1 | 1 | 1 | 675.50 |

MBS is micro batch size, GBS is global batch size, TP is tensor parallel size, PP is pipeline parallel size, and CP is context parallel size.

Sequence length 4096

| Model | # of nodes | Sequence length | MBS | GBS | Data Type | TP | PP | CP | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| llama2-7B | 1 | 4096 | 4 | 256 | FP8 | 1 | 1 | 1 | 689.90 |
| llama2-7B | 2 | 4096 | 4 | 512 | FP8 | 1 | 1 | 1 | 682.04 |
| llama2-7B | 4 | 4096 | 4 | 1024 | FP8 | 1 | 1 | 1 | 676.83 |
| llama2-7B | 8 | 4096 | 4 | 2048 | FP8 | 1 | 1 | 1 | 686.25 |

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
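
To make the batch-size columns concrete, the relationship among MBS, GBS, the parallelism degrees, and gradient accumulation can be sketched as below. The helper function and the worked example are illustrative assumptions, not part of the Megatron-LM benchmark harness.

```python
# Relationship between global batch size (GBS), micro batch size (MBS),
# data-parallel size, and gradient-accumulation steps in a Megatron-style
# setup. Illustrative helper; not part of the benchmark harness.
GPUS_PER_NODE = 8

def grad_accum_steps(gbs: int, mbs: int, num_nodes: int,
                     tp: int, pp: int, cp: int) -> int:
    world_size = num_nodes * GPUS_PER_NODE
    dp = world_size // (tp * pp * cp)          # data-parallel replicas
    assert gbs % (mbs * dp) == 0, "GBS must be divisible by MBS * DP"
    return gbs // (mbs * dp)

# Example row from the table: llama3.1-8B, 4 nodes, MBS=2, GBS=512, TP=PP=CP=1.
# 32 GPUs -> DP=32, so 512 / (2 * 32) = 8 gradient-accumulation steps.
print(grad_accum_steps(gbs=512, mbs=2, num_nodes=4, tp=1, pp=1, cp=1))
```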

For DeepSeek-V2-Lite with 16B parameters, the table below shows training performance data, where the AMD Instinct™ MI300X platform measures text-generation training throughput with GEMM tuning enabled. The metric is TFLOPS per GPU.

This result is based on the Docker container (rocm/megatron-lm:v25.4), which was released on March 18, 2025.

| Model | # of GPUs | Sequence length | MBS | GBS | Data Type | TP | PP | CP | EP | SP | Recompute | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V2-Lite | 8 | 4096 | 4 | 256 | BF16 | 1 | 1 | 1 | 8 | On | None | 10570 |

EP is expert parallel size, and SP indicates whether sequence parallelism is enabled.

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm Megatron-LM Docker on AMD GPUs user guide.
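
The DeepSeek-V2-Lite result above was collected with GEMM tuning on. One way to enable GEMM tuning in a ROCm PyTorch build is PyTorch's TunableOp mechanism; the sketch below is a minimal, hedged example that assumes the PYTORCH_TUNABLEOP_* environment variables supported by recent PyTorch ROCm builds, and it is not the exact tuning flow used by the Megatron-LM Docker image.

```python
# Minimal GEMM-tuning sketch using PyTorch TunableOp on ROCm (illustrative;
# the Megatron-LM container may drive tuning differently). The environment
# variables below are assumed to be supported by the installed PyTorch build.
import os

os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"    # use tuned GEMM kernels
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"     # search for the best kernels
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"

import torch  # import after setting the environment so TunableOp sees it

# Run a representative GEMM shape so its tuning result is recorded and
# can be reused on subsequent runs.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
c = a @ b
torch.cuda.synchronize()
print("GEMM done; TunableOp typically writes results to the configured file at exit")
```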

Previous versions

This table lists previous versions of the Megatron-LM training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

| Date | Image version | ROCm version | PyTorch version | Resources |
|---|---|---|---|---|
| 3/18/2025 | 25.4 | 6.3.0 | 2.7.0a0+git637433 | Documentation, Docker Hub |