This page summarizes performance measurements on AMD Instinct™ GPUs running popular AI models.

The results on this page cover both inference and training benchmarks. They are organized by framework:

  • AI Inference: vLLM, xDiT
  • AI Training: PyTorch, Megatron-LM, and JAX MaxText

The hardware platforms include Instinct MI355X/MI325X/MI300X GPUs, with benchmark insights provided for each framework where data is available.

The data in the following tables is a reference point to help users evaluate observed performance. It should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.

AI Inference

vLLM

Results on AMD Instinct™ MI300X Platform

The following results are based on:

  • Docker container: rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210
  • Release date: December 11, 2025
  • Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04, amdgpu driver 6.14.14

Throughput Measurements

The table below shows throughput in a client-server scenario under maximum load: a local inference client is fed requests at an infinite rate, so the server is never waiting for work.

Model Precision TP Size¹ Input Output No. Prompts Max. Seqs Throughput²
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) FP8 8 128 2048 3200 3200 13562.4
128 4096 1500 1500 11800.9
500 2000 2000 2000 11249.5
2048 2048 1500 1500 7753.1
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) FP8 8 128 2048 1500 1500 3822.8
128 4096 1500 1500 3085.8
500 2000 2000 2000 3059.9
2048 2048 500 500 2192.3

¹ TP stands for Tensor Parallelism.
² Throughput is measured in tokens/second.
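
The relationship between the table columns can be sketched as a quick sanity check. This is a minimal illustration, not the benchmark's own code: the function names are hypothetical, and it assumes that throughput counts output tokens only, which the table does not state.

```python
def tokens_per_second(num_prompts: int, tokens_per_prompt: int, elapsed_s: float) -> float:
    """Aggregate throughput: total tokens divided by wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_prompts * tokens_per_prompt / elapsed_s


def implied_runtime_s(num_prompts: int, tokens_per_prompt: int, throughput: float) -> float:
    """Invert the relation above to estimate how long a run lasted."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return num_prompts * tokens_per_prompt / throughput


# First row of the table: 3200 prompts, 2048 output tokens each, 13562.4 tokens/s.
# Counting output tokens only is an assumption made for this illustration.
runtime = implied_runtime_s(3200, 2048, 13562.4)
print(f"implied run time: {runtime:.1f} s")
```

If the reported figure also includes input tokens, the implied run time is longer by the ratio of total tokens to output tokens.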

Latency results

The table below shows latency measurements. Latency is typically assessed as the time from when the system receives an input to when the model produces a result.

Model Precision TP Size¹ Batch Size Input Output Latency²
Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) FP8 8 1 128 2048 16.015
2 128 2048 18.683
4 128 2048 19.245
8 128 2048 20.468
16 128 2048 22.137
32 128 2048 25.571
64 128 2048 32.987
128 128 2048 46.426
1 2048 2048 16.421
2 2048 2048 19.035
4 2048 2048 20.221
8 2048 2048 21.483
16 2048 2048 24.350
32 2048 2048 29.776
64 2048 2048 40.625
128 2048 2048 63.671
Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) FP8 8 1 128 2048 48.618
2 128 2048 50.980
4 128 2048 52.760
8 128 2048 55.864
16 128 2048 58.795
32 128 2048 69.482
64 128 2048 89.384
128 128 2048 122.601
1 2048 2048 49.106
2 2048 2048 51.664
4 2048 2048 54.220
8 2048 2048 58.904
16 2048 2048 65.389
32 2048 2048 83.387
64 2048 2048 115.575
128 2048 2048 177.779

¹ TP stands for Tensor Parallelism.
² Latency is measured in seconds.
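
Two derived figures are often useful when reading a latency table like the one above: the average time per generated token and the aggregate output rate of a batch. The sketch below uses hypothetical helper names and assumes the reported latency is the end-to-end time for the whole batch, including prompt processing, so the per-token figure is an average rather than a pure decode time.

```python
def seconds_per_output_token(latency_s: float, output_tokens: int) -> float:
    """Average wall-clock time per generated token for one sequence."""
    return latency_s / output_tokens


def output_tokens_per_second(batch_size: int, output_tokens: int, latency_s: float) -> float:
    """Aggregate output rate when the whole batch finishes in latency_s."""
    return batch_size * output_tokens / latency_s


# (batch size, output length, latency in seconds) for Llama 3.1 70B, input 128,
# copied from the first and last rows of that block in the table.
rows = [(1, 2048, 16.015), (128, 2048, 46.426)]

for batch, out_len, latency in rows:
    per_token_ms = 1000 * seconds_per_output_token(latency, out_len)
    rate = output_tokens_per_second(batch, out_len, latency)
    print(f"batch={batch:>3}: {per_token_ms:.2f} ms/token, {rate:.0f} output tokens/s")
```

The comparison shows the usual trade-off: a larger batch raises per-request latency while increasing the total number of tokens produced per second.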

Reproduce these results on your system by following these instructions:

Previous Versions

This table lists previous versions of the ROCm vLLM inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Docker image tag Components Resources
rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210 (latest)
  • ROCm 7.0.0
  • vLLM 0.11.2
  • PyTorch 2.9.0
rocm/vllm:rocm7.0.0_vllm_0.11.1_20251024
  • ROCm 7.0.0
  • vLLM 0.11.1
  • PyTorch 2.9.0
rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
  • ROCm 7.0.0
  • vLLM 0.10.2
  • PyTorch 2.9.0
rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
  • ROCm 6.4.1
  • vLLM 0.10.0
  • PyTorch 2.7.0
rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
  • ROCm 6.4.1
  • vLLM 0.9.1
  • PyTorch 2.7.0
rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702
  • ROCm 6.4.1
  • vLLM 0.9.1
  • PyTorch 2.7.0
rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605
  • ROCm 6.4.1
  • vLLM 0.9.0.1
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521
  • ROCm 6.3.1
  • vLLM 0.8.5 (0.8.6.dev)
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513
  • ROCm 6.3.1
  • vLLM 0.8.5
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415
  • ROCm 6.3.1
  • vLLM 0.8.3
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
  • ROCm 6.3.1
  • vLLM 0.7.3
  • PyTorch 2.7.0
rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
  • ROCm 6.3.1
  • vLLM 0.6.6
  • PyTorch 2.7.0
rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
  • ROCm 6.2.1
  • vLLM 0.6.4
  • PyTorch 2.5.0
rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
  • ROCm 6.2.0
  • vLLM 0.4.3
  • PyTorch 2.4.0
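
Most of the recent tags above follow the pattern `rocm<version>_vllm_<version>_<YYYYMMDD>`. As a rough sketch, such tags can be split into their parts with a regular expression. The `parse_tag` helper and `ImageTag` type are hypothetical names, and reading the trailing eight digits as a build date is an assumption. Older tags that use a different layout are deliberately not matched.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional

# Matches tags such as rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210
TAG_PATTERN = re.compile(
    r"^rocm/vllm:rocm(?P<rocm>\d+(?:\.\d+)*)"
    r"_vllm_(?P<vllm>\d+(?:\.\d+)*)"
    r"_(?P<date>\d{8})$"
)


class ImageTag(NamedTuple):
    rocm: str
    vllm: str
    build_date: date  # assumption: the trailing digits encode a build date


def parse_tag(tag: str) -> Optional[ImageTag]:
    """Return the tag components, or None if the tag uses another layout."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        return None
    build = datetime.strptime(match.group("date"), "%Y%m%d").date()
    return ImageTag(match.group("rocm"), match.group("vllm"), build)


print(parse_tag("rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210"))
print(parse_tag("rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415"))  # older layout
```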

xDiT

Results on AMD Instinct™ MI355X Platform

The following results are based on:

  • Docker container: rocm/pytorch-xdit:v25.12
  • Release date: Dec 8, 2025
  • Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.3 LTS, Host GPU driver ROCm 7.10.0_preview.

Models Precision Batch Size Configuration Latency¹
Hunyuan Video BF16 1 720p, 129 Frames, 50 steps 86.74
Wan2.1 BF16 1 720p, 80 Frames, 40 steps 71.60
Wan2.2 BF16 1 720p, 80 Frames, 40 steps 66.69
Flux.1 BF16 1 1024x1240, 25 steps 0.94

¹ Latency is measured in seconds.

Reproduce these results on your system by following these instructions:

Results on AMD Instinct™ MI300X Platform

The following results are based on:

  • Docker container: rocm/pytorch-xdit:v25.12
  • Release date: Dec 8, 2025
  • Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.10.0_preview.

Models Precision Batch Size Configuration Latency¹
Hunyuan Video BF16 1 720p, 129 Frames, 50 steps 181.05
Wan2.1 BF16 1 720p, 80 Frames, 40 steps 151.25
Wan2.2 BF16 1 720p, 80 Frames, 40 steps 142.17
Flux.1 BF16 1 1024x1240, 25 steps 1.33

¹ Latency is measured in seconds.
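
The two xDiT latency tables can be compared directly by dividing the MI300X latency by the MI355X latency for each model. The sketch below copies the values from the tables. The helper name is hypothetical, and because the two runs used different host systems, the ratio is an end-to-end platform comparison rather than a pure GPU comparison.

```python
# Latencies in seconds, copied from the two xDiT tables above.
MI355X = {"Hunyuan Video": 86.74, "Wan2.1": 71.60, "Wan2.2": 66.69, "Flux.1": 0.94}
MI300X = {"Hunyuan Video": 181.05, "Wan2.1": 151.25, "Wan2.2": 142.17, "Flux.1": 1.33}


def latency_speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster the new platform finishes the same workload."""
    return baseline_s / new_s


speedups = {model: latency_speedup(MI300X[model], MI355X[model]) for model in MI300X}

for model, factor in speedups.items():
    print(f"{model}: {factor:.2f}x")
```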

Reproduce these results on your system by following these instructions:

Previous Versions

This table lists previous versions of the xDiT inference Docker image for inference performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Docker image tag Components Resources
rocm/pytorch-xdit:v25.12 (latest)
rocm/pytorch-xdit:v25.11
rocm/pytorch-xdit:v25.10

AI Training

The tables below show training performance data for text generation models on AMD Instinct™ platforms. Each row reports the training throughput measured for a specific sequence length and batch size, expressed in tokens per second per GPU.
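
As a minimal sketch of how a tokens-per-second-per-GPU figure is commonly derived, the function below divides the tokens consumed in one training step by the step time and the number of GPUs. The function name and all example values are hypothetical and are not taken from the tables. The tables do not state whether "Batch Size" is per GPU or global, so the sketch takes a global batch size explicitly.

```python
def tokens_per_sec_per_gpu(
    global_batch_size: int, seq_len: int, step_time_s: float, num_gpus: int
) -> float:
    """Tokens processed per second by each GPU, averaged over one step."""
    if step_time_s <= 0 or num_gpus <= 0:
        raise ValueError("step_time_s and num_gpus must be positive")
    tokens_per_step = global_batch_size * seq_len
    return tokens_per_step / (step_time_s * num_gpus)


# Hypothetical run: 64 sequences of 8192 tokens per step, 2.0 s per step, 8 GPUs.
print(tokens_per_sec_per_gpu(global_batch_size=64, seq_len=8192, step_time_s=2.0, num_gpus=8))
```

Normalizing by GPU count is what makes single-node and multi-node rows comparable in the same column.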

PyTorch

Results on AMD Instinct™ MI355X Platform

The following results are based on:

  • Docker container: rocm/primus:v26.3
  • Release date: Jun 9, 2026
  • Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
    Multi-node: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
Model # nodes Precision Batch Size Sequence Length FSDP TP CP PP Tokens/sec/GPU
Llama 3.1 8B 1 FP8 8 8192 FALSE 1 1 1 30,254
Llama 3.1 8B 1 BF16 6 8192 FALSE 1 1 1 21,763
Llama 3.1 70B 1 FP8 6 8192 TRUE 1 1 1 3,663
Llama 3.1 70B 4 FP8 6 8192 TRUE 1 1 1 3,805
Llama 3.1 70B 1 BF16 8 8192 TRUE 1 1 1 2,294
Llama 3.1 405B 8 FP8 3 8192 TRUE 1 1 1 636

Reproduce these results on your system by following these instructions:

Results on AMD Instinct™ MI325X Platform

The following results are based on:

  • Docker container: rocm/primus:v26.3
  • Release date: Jun 9, 2026
  • Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
Model # nodes Precision Batch Size Sequence Length FSDP TP CP PP Tokens/sec/GPU
Llama 3.1 8B 1 FP8 7 8192 FALSE 1 1 1 15,662
Llama 3.1 8B 1 BF16 6 8192 FALSE 1 1 1 11,642
Llama 3.1 70B 1 FP8 3 8192 TRUE 1 1 1 1,775
Llama 3.1 70B 8 FP8 3 8192 TRUE 1 1 1 1,748
Llama 3.1 70B 1 BF16 3 8192 TRUE 1 1 1 1,167
Llama 3.1 405B 8 FP8 4 8192 TRUE 1 1 1 301.74
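
Because throughput is reported per GPU, a single-node row and a multi-node row for the same model can be compared to estimate scaling efficiency. The sketch below uses the two Llama 3.1 70B FP8 rows from the MI325X table above (1 node and 8 nodes, same batch size). The helper name is hypothetical, and the ratio is only a rough indicator, since any other differences between the two runs are not documented here.

```python
def scaling_efficiency(single_node_per_gpu: float, multi_node_per_gpu: float) -> float:
    """Per-GPU throughput retained when scaling out; 1.0 means perfect scaling."""
    return multi_node_per_gpu / single_node_per_gpu


# Llama 3.1 70B, FP8, batch size 3: 1,775 tokens/sec/GPU on 1 node, 1,748 on 8 nodes.
efficiency = scaling_efficiency(1775, 1748)
print(f"scaling efficiency: {efficiency:.1%}")
```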

Reproduce these results on your system by following these instructions:

Results on AMD Instinct™ MI300X Platform

The following results are based on:

  • Docker container: rocm/primus:v26.3
  • Release date: Jun 9, 2026
  • Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
Model Precision Batch Size Sequence Length FSDP TP CP PP Tokens/sec/GPU
Llama 3.1 8B FP8 5 8192 FALSE 1 1 1 11,980
Llama 3.1 8B BF16 4 8192 FALSE 1 1 1 8,922
Llama 3.1 70B FP8 3 8192 TRUE 1 1 1 1,284
Llama 3.1 70B BF16 4 8192 TRUE 1 1 1 869
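
The FP8 and BF16 rows in the MI300X table above can be put side by side to see the throughput ratio between the two precisions. This is an illustrative calculation with a hypothetical helper name. Note that the paired rows use different batch sizes, so the ratio reflects the full configuration and not the precision change alone.

```python
# Tokens/sec/GPU copied from the MI300X PyTorch table above.
RESULTS = {
    "Llama 3.1 8B": {"FP8": 11_980, "BF16": 8_922},
    "Llama 3.1 70B": {"FP8": 1_284, "BF16": 869},
}


def fp8_over_bf16(row: dict) -> float:
    """Throughput ratio of the FP8 configuration to the BF16 configuration."""
    return row["FP8"] / row["BF16"]


ratios = {model: fp8_over_bf16(row) for model, row in RESULTS.items()}

for model, ratio in ratios.items():
    print(f"{model}: FP8 is {ratio:.2f}x the BF16 throughput")
```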

Reproduce these results on your system by following these instructions:

Previous Versions

This table lists previous versions of the PyTorch training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Image version ROCm version PyTorch version Resources
v26.3 (latest) 7.2.1 2.10.0+git94c6e04
v26.2 7.2.0 2.10.0a0+git449b176
v26.1 7.1.0 2.10.0.dev20251112+rocm7.1
v25.11 7.1.0 2.10.0.dev20251112+rocm7.1
v25.10 7.1.0 2.10.0.dev20251112+rocm7.1
v25.9 7.0.0 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7 (Primus 0.3.0)
v25.8 6.4.3 2.8.0a0+gitd06a406
v25.7 6.4.2 2.8.0a0+gitd06a406
v25.6 6.3.4 2.8.0a0+git7d205b2
v25.5 6.3.4 2.7.0a0+git637433
v25.4 6.3.0 2.7.0a0+git637433

Megatron-LM

Results on AMD Instinct™ MI355X Platform

The following results are based on:

  • Docker container: rocm/primus:v26.3
  • Release date: Jun 9, 2026
  • Server: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
    Multi-node: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
Model # nodes Precision Batch Size Sequence Length FSDP TP CP PP EP/VPP Tokens/sec/GPU
Llama 3.1 8B 1 FP8 4 8192 FALSE 1 1 1 - 32,031
Llama 3.1 8B 1 BF16 4 8192 FALSE 1 1 1 - 22,894
Llama 3.1 8B 8 FP8 4 8192 FALSE 1 1 1 - 30,581
Llama 3.1 70B 1 BF16 4 8192 TRUE 1 1 1 - 2,130
Llama 3.1 70B 8 FP8 4 8192 FALSE 1 1 1 - 3,207
Llama 3.1 70B 8 BF16 4 8192 FALSE 1 1 1 - 2,002
Llama 3.3 70B 1 BF16 6 8192 TRUE 1 1 1 - 2,031
Mixtral 8x7B 1 BF16 4 4096 FALSE 1 1 1 8 13,381
Mixtral 8x22B 4 BF16 1 8192 FALSE 1 1 4 8/2 3,299
DeepSeekV2 Lite 1 BF16 12 4096 FALSE 1 1 1 8 41,284
Qwen3-30B 1 BF16 8 4096 FALSE - 1 1 8 24,623
Qwen3-30B 1 FP8 8 4096 FALSE - 1 1 8 25,489
Qwen3-235B 8 FP8 4 4096 FALSE - 1 1 8/4 4,709
GPT-OSS-20B 1 BF16 8 4096 FALSE - 1 1 8 21,333
GPT-OSS-20B 1 FP8 8 4096 FALSE - 1 1 8 21,065
GPT-OSS-120B 8 BF16 8 4096 FALSE - 1 1 8/2 12,183
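
To turn a per-GPU figure from the table above into an aggregate cluster throughput, multiply by the total number of GPUs. The sketch below assumes 8 GPUs per node, matching the server description for this platform. The helper name is hypothetical.

```python
def aggregate_tokens_per_sec(per_gpu: int, nodes: int, gpus_per_node: int = 8) -> int:
    """Total tokens per second across all GPUs in the job."""
    return per_gpu * nodes * gpus_per_node


# Rows from the MI355X Megatron-LM table: (model, nodes, tokens/sec/GPU).
rows = [
    ("Llama 3.1 8B FP8", 1, 32_031),
    ("Qwen3-235B FP8", 8, 4_709),
]

for model, nodes, per_gpu in rows:
    total = aggregate_tokens_per_sec(per_gpu, nodes)
    print(f"{model}: {total:,} tokens/s on {nodes * 8} GPUs")
```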

Reproduce these results on your system by following these instructions:

Results on AMD Instinct™ MI325X Platform

The following results are based on:

  • Docker container: rocm/primus:v26.3
  • Release date: Jun 9, 2026
  • Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
    For multi-node runs, Server: Dual AMD EPYC 9575F 64-core processor-based production server with 8x AMD Instinct MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 1.5, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2.60402-120~22.04.
Model # nodes Precision Batch Size Sequence Length FSDP TP CP PP EP Tokens/sec/GPU
Llama 3.1 8B 1 FP8 2 8192 FALSE 1 1 1 - 16,759
Llama 3.1 8B 8 FP8 2 8192 FALSE 1 1 1 - 15,355
Llama 3.1 8B 1 BF16 4 8192 FALSE 1 1 1 - 12,158
Llama 3.1 70B 1 BF16 3 8192 TRUE 1 1 1 - 1,099
Llama 3.1 70B 8 FP8 4 8192 TRUE 1 1 1 - 1,438
Llama 3.1 70B 8 BF16 1 8192 TRUE 1 1 1 - 1,076
Llama 3.3 70B 1 BF16 3 8192 TRUE 1 1 1 - 1,098
Mixtral 8x7B 1 BF16 4 4096 FALSE 1 1 1 8 7,177
DeepSeekV2 Lite 1 BF16 10 4096 FALSE 1 1 1 8 22,216
Qwen3-30B 1 BF16 2 4096 FALSE - 1 1 8 10,088
Qwen3-30B 1 FP8 2 4096 FALSE - 1 1 8 10,969
GPT-OSS-20B 1 BF16 4 4096 FALSE 1 1 1 8 13,952
GPT-OSS-20B 1 FP8 4 4096 FALSE 1 1 1 8 14,652
GPT-OSS-120B 8 BF16 6 4096 FALSE 1 1 2 8 5,273

Reproduce these results on your system by following these instructions:

Results on AMD Instinct™ MI300X Platform

The following results are based on:

  • Docker container: rocm/primus:v26.3
  • Release date: Jun 9, 2026
  • Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.4.2-120.
    For multi-node runs, Server: Dual Intel Xeon Platinum 8480+ Processors with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 79007700, Ubuntu® 22.04, Host GPU driver ROCm 6.3.0-39.
Model # nodes Precision Batch Size Sequence Length FSDP TP CP PP EP Tokens/sec/GPU
Llama 3.1 8B 1 FP8 2 8192 FALSE 1 1 1 - 12,970
Llama 3.1 8B 1 BF16 2 8192 FALSE 1 1 1 - 9,386
Llama 3.1 70B 1 BF16 3 8192 TRUE 1 1 1 - 788
Llama 3.3 70B 1 BF16 2 8192 TRUE 1 1 1 - 809
Mixtral 8x7B 1 BF16 2 4096 FALSE 1 1 1 8 5,433
DeepSeekV2 Lite 1 BF16 4 4096 FALSE 1 1 1 8 16,421
Qwen3-30B 1 BF16 2 4096 FALSE 1 1 1 8 8,810
Qwen3-30B 1 FP8 2 4096 FALSE 1 1 1 8 8,994
GPT-OSS-20B 1 BF16 4 4096 FALSE 1 1 1 1 11,490
GPT-OSS-20B 1 FP8 4 4096 FALSE 1 1 1 8 12,175

Reproduce these results on your system by following these instructions:

Previous Versions

This table lists previous versions of the Megatron-LM training Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.

Image version ROCm version PyTorch version Resources
v26.3 (latest) 7.2.0 2.10.0+git94c6e04
v26.2 7.2.0 2.10.0+git94c6e04
v26.1 7.1.0 2.10.0.dev20251112+rocm7.1
v25.11 7.1.0 2.10.0.dev20251112+rocm7.1
v25.10 7.1.0 2.10.0.dev20251112+rocm7.1
v25.9 7.0.0 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7 (Primus 0.3.0)
v25.8 6.4.3 2.8.0a0+gitd06a406
v25.7 6.4.2 2.8.0a0+gitd06a406
v25.6 6.4.1 2.8.0a0+git7d205b2
v25.5 6.3.4 2.8.0a0+gite2f9759
v25.4 6.3.0 2.7.0a0+git637433

JAX MaxText

Results on AMD Instinct™ MI355X Platform

The following results are based on:

  • Docker container: rocm/jax-training:maxtext-v26.3
  • Release date: Jun 9, 2026
  • Server: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.1.1.
    Multi-node: Dual AMD EPYC 9575F 96-core processor-based production server with 8x AMD MI355X (288GB HBM3E 1400W) GPUs, 1 NUMA node per socket, System BIOS 1.4a, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
Model # nodes Precision Batch Size Sequence Length FSDP TP CP PP EP Tokens/sec/GPU
Llama 3.1 8B 1 BF16 9 8192 TRUE 1 1 1 1 20,869
Llama 3.1 8B 1 FP8 9 8192 TRUE 1 1 1 1 26,998
Llama 3.1 8B 8 BF16 9 8192 TRUE 1 1 1 1 20,092
Llama 3.1 70B 1 BF16 10 8192 TRUE 1 1 1 1 2,322
Llama 3.1 70B 1 FP8 10 8192 TRUE 1 1 1 1 3,803
Llama 3.1 70B 8 BF16 10 8192 TRUE 1 1 1 1 2,219
Llama 3.3 70B 1 BF16 10 8192 TRUE 1 1 1 1 2,331
Mixtral 8x7B 1 BF16 11 4096 FALSE 1 1 1 8 11,679

Reproduce these results on your system by following these instructions:

Results on AMD Instinct™ MI325X Platform

The following results are based on:

  • Docker container: rocm/jax-training:maxtext-v26.3
  • Release date: Jun 9, 2026
  • Server: Dual AMD EPYC 9655 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 1 NUMA node per socket, System BIOS 3B03, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.0.1.
Model # nodes Precision Batch Size Sequence Length FSDP TP CP PP EP Tokens/sec/GPU
Llama 3.1 8B 1 BF16 4 8192 TRUE 1 1 1 1 10,652
Llama 3.1 8B 1 FP8 4 8192 TRUE 1 1 1 1 13,666
Llama 3.1 70B 1 BF16 7 8192 TRUE 1 1 1 1 1,217
Llama 3.1 70B 1 FP8 7 8192 TRUE 1 1 1 1 1,835
Llama 3.3 70B 1 BF16 7 8192 TRUE 1 1 1 1 1,217
Llama 3.3 70B 1 FP8 7 8192 TRUE 1 1 1 1 1,836
Mixtral 8x7B 1 BF16 9 4096 FALSE 1 1 1 8 5,607

Reproduce these results on your system by following these instructions:

Results on AMD Instinct™ MI300X Platform

The following results are based on:

  • Docker container: rocm/jax-training:maxtext-v26.1
  • Release date: Jan 21, 2026
  • Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 7.1.1.
    For multi-node runs, Server: Dual AMD EPYC 9654 Processors with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 3.10, Ubuntu® 22.04, Host GPU driver ROCm 6.3.1-48.
Model # nodes Precision Batch Size Sequence Length FSDP TP CP PP EP Tokens/sec/GPU
Llama 3.1 8B 1 BF16 4 8192 TRUE 1 1 1 1 8,720
Llama 3.1 8B 1 FP8 4 8192 TRUE 1 1 1 1 11,138
Llama 3.1 70B 1 FP8 5 8192 TRUE 1 1 1 1 1,472
Llama 3.1 70B 1 BF16 5 8192 TRUE 1 1 1 1 963
Llama 3.3 70B 1 BF16 5 8192 TRUE 1 1 1 1 962
Mixtral 8x7B 1 BF16 12 4096 FALSE 1 1 1 8 5,382

Reproduce these results on your system by following these instructions:

Previous Versions

This table lists previous versions of the ROCm JAX MaxText Docker image for training performance testing. For detailed information about available models for benchmarking, see the version-specific documentation.  

Image version ROCm version JAX version Resources
v26.2 (latest) 7.1.1 0.8.2
v26.1 7.1.1 0.8.2
v25.11 7.1.0 0.7.1
v25.9 7.0.01 0.6.2
v25.7 6.4.1 0.6.0, 0.5.0
v25.5 6.3.4 0.4.35
v25.4 6.3.0 0.4.31
Footnotes
  1. TP stands for Tensor Parallelism.
  2. Throughput is measured in tokens/second.