
Supercharging AI and HPC
AMD Instinct™ MI300 Series accelerators are uniquely well-suited to power even the most demanding AI and HPC workloads, offering exceptional compute performance, large memory density with high bandwidth, and support for specialized data formats.

Under the Hood
AMD Instinct MI300 Series accelerators are built on the AMD CDNA™ 3 architecture, which offers Matrix Core Technologies and supports a broad range of precision capabilities, from highly efficient INT8 and FP8 (including sparsity support for AI) to the most demanding FP64 for HPC.

Meet the Series
Explore AMD Instinct MI300 Series accelerators, AMD Instinct MI300 Series Platforms, and AMD Instinct MI300A APUs.

Meet the AMD Instinct™ MI325X Accelerators
AMD Instinct™ MI325X GPU accelerators set a new standard in AI performance with the 3rd Gen AMD CDNA™ architecture, delivering incredible performance and efficiency for training and inference. With industry-leading 256 GB of HBM3E memory and 6 TB/s of memory bandwidth, they optimize performance and help reduce TCO.1
304 GPU Compute Units
256 GB HBM3E Memory
6 TB/s Peak Theoretical Memory Bandwidth
Specs Comparisons (MI325X OAM 256 GB vs. H200 SXM 141 GB)
AI Performance (Peak TFLOPs): Up to 1.3X the AI performance vs. competitive accelerators2, 3
HPC Performance (Peak TFLOPs): Up to 2.4X the HPC performance vs. competitive accelerators3
Memory Capacity & Bandwidth: 1.8X the memory capacity and 1.2X the memory bandwidth vs. competitive accelerators1

AMD Instinct™ MI300X Accelerators
AMD Instinct MI300X Series accelerators are designed to deliver leadership performance for Generative AI workloads and HPC applications.
304 GPU Compute Units
192 GB HBM3 Memory
5.3 TB/s Peak Theoretical Memory Bandwidth
AI Performance (Peak TFLOPs): Up to 1.3X the AI performance vs. competitive accelerators6
HPC Performance (Peak TFLOPs): Up to 2.4X the HPC performance vs. competitive accelerators7
Memory Capacity & Bandwidth: 2.4X the memory capacity and 1.6X the peak theoretical memory bandwidth vs. competitive accelerators8

AMD Instinct Platforms
The AMD Instinct MI325X Platform integrates 8 fully connected MI325X GPU OAM modules onto an industry-standard OCP design via 4th-Gen AMD Infinity Fabric™ links, delivering up to 2 TB of HBM3E capacity for low-latency AI processing. This ready-to-deploy platform can accelerate time-to-market and reduce development costs when adding MI325X accelerators to existing AI rack and server infrastructure.
8 MI325X GPU OAM modules
2 TB Total HBM3E Memory
48 TB/s Peak Theoretical Aggregate Memory Bandwidth
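
The platform totals follow directly from the per-GPU figures above. A minimal illustrative sketch of that arithmetic (not AMD code; the variable names are made up):

```python
# Illustrative arithmetic: MI325X Platform totals derived from per-GPU specs.
GPUS_PER_PLATFORM = 8
HBM3E_PER_GPU_GB = 256         # MI325X memory capacity per OAM module
BANDWIDTH_PER_GPU_TBS = 6.0    # MI325X peak theoretical memory bandwidth

total_memory_tb = GPUS_PER_PLATFORM * HBM3E_PER_GPU_GB / 1024   # 2.0 TB HBM3E
aggregate_bw_tbs = GPUS_PER_PLATFORM * BANDWIDTH_PER_GPU_TBS    # 48.0 TB/s
print(f"{total_memory_tb:.0f} TB HBM3E, {aggregate_bw_tbs:.0f} TB/s aggregate")
```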

AMD Instinct MI300A APUs
AMD Instinct MI300A accelerated processing units (APUs) combine the power of AMD Instinct accelerators and AMD EPYC™ processors with shared memory to enable enhanced efficiency, flexibility, and programmability. They are designed to accelerate the convergence of AI and HPC, helping advance research and propel new discoveries.
228 GPU Compute Units
24 “Zen 4” x86 CPU Cores
128 GB Unified HBM3 Memory
5.3 TB/s Peak Theoretical Memory Bandwidth
AI Performance (Peak TFLOPs)11
HPC Performance (Peak TFLOPs): Up to 1.8X the HPC performance vs. competitive accelerators12
Memory Capacity & Bandwidth: 2.4X the memory capacity and 1.6X the peak theoretical memory bandwidth vs. competitive accelerators13
Advancing Exascale Computing
AMD Instinct accelerators power some of the world’s top supercomputers, including Lawrence Livermore National Laboratory’s El Capitan system. See how this two-exaflop supercomputer will use AI to run first-of-its-kind simulations and advance scientific research.

AMD ROCm™ Software
AMD ROCm™ software includes a broad set of programming models, tools, compilers, libraries, and runtimes for AI models and HPC workloads targeting AMD Instinct accelerators.

Find Solutions
Find a partner offering AMD Instinct accelerator-based solutions.


Stay Informed
Sign up to receive the latest data center news and server content.
Footnotes
- MI325-001A - Calculations conducted by AMD Performance Labs as of September 26th, 2024, based on current specifications and/or estimation. The AMD Instinct™ MI325X OAM accelerator will have 256GB HBM3E memory capacity and 6 TB/s GPU peak theoretical memory bandwidth performance. Actual results based on production silicon may vary.
The highest published results on the Nvidia Hopper H200 (141GB) SXM GPU accelerator resulted in 141GB HBM3E memory capacity and 4.8 TB/s GPU memory bandwidth performance. https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446
The highest published results on the Nvidia Blackwell HGX B100 (192GB) 700W GPU accelerator resulted in 192GB HBM3E memory capacity and 8 TB/s GPU memory bandwidth performance.
The highest published results on the Nvidia Blackwell HGX B200 (192GB) GPU accelerator resulted in 192GB HBM3E memory capacity and 8 TB/s GPU memory bandwidth performance.
Nvidia Blackwell specifications at https://resources.nvidia.com/en-us-blackwell-architecture?_gl=1*1r4pme7*_gcl_aw*R0NMLjE3MTM5NjQ3NTAuQ2p3S0NBancyNkt4QmhCREVpd0F1NktYdDlweXY1dlUtaHNKNmhPdHM4UVdPSlM3dFdQaE40WkI4THZBaWFVajFy
- MI325-002 - Calculations conducted by AMD Performance Labs as of May 28th, 2024 for the AMD Instinct™ MI325X GPU resulted in 1307.4 TFLOPS peak theoretical half precision (FP16), 1307.4 TFLOPS peak theoretical Bfloat16 format precision (BF16), 2614.9 TFLOPS peak theoretical 8-bit precision (FP8), 2614.9 TOPs INT8 floating-point performance. Actual performance will vary based on final specifications and system configuration.
Published results on Nvidia H200 SXM (141GB) GPU: 989.4 TFLOPS peak theoretical half precision tensor (FP16 Tensor), 989.4 TFLOPS peak theoretical Bfloat16 tensor format precision (BF16 Tensor), 1,978.9 TFLOPS peak theoretical 8-bit precision (FP8), 1,978.9 TOPs peak theoretical INT8 floating-point performance. BFLOAT16 Tensor Core, FP16 Tensor Core, FP8 Tensor Core and INT8 Tensor Core performance were published by Nvidia using sparsity; for the purposes of comparison, AMD converted these numbers to non-sparsity/dense by dividing by 2, and these numbers appear above.
Nvidia H200 source: https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446 and https://www.anandtech.com/show/21136/nvidia-at-sc23-h200-accelerator-with-hbm3e-and-jupiter-supercomputer-for-2024
Note: Nvidia H200 GPUs have the same published FLOPs performance as H100 products https://resources.nvidia.com/en-us-tensor-core/. MI325-02
- MI325-008 - Calculations conducted by AMD Performance Labs as of October 2nd, 2024 for the AMD Instinct™ MI325X (1000W) GPU designed with AMD CDNA™ 3 5nm | 6nm FinFET process technology at 2,100 MHz peak boost engine clock resulted in 163.4 TFLOPs peak theoretical double precision Matrix (FP64 Matrix), 81.7 TFLOPs peak theoretical double precision (FP64), 163.4 TFLOPs peak theoretical single precision Matrix (FP32 Matrix), 163.4 TFLOPs peak theoretical single precision (FP32), 653.7 TFLOPS peak theoretical TensorFloat-32 (TF32), 1307.4 TFLOPS peak theoretical half precision (FP16). Actual performance may vary based on final specifications and system configuration.
Published results on Nvidia H200 SXM (141GB) GPU: 66.9 TFLOPs peak theoretical double precision tensor (FP64 Tensor), 33.5 TFLOPs peak theoretical double precision (FP64), 66.9 TFLOPs peak theoretical single precision (FP32), 494.7 TFLOPs peak TensorFloat-32 (TF32), 989.5 TFLOPS peak theoretical half precision tensor (FP16 Tensor). TF32 Tensor Core performance was published by Nvidia using sparsity; for the purposes of comparison, AMD converted this number to non-sparsity/dense by dividing by 2, and this number appears above.
Nvidia H200 source: https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446 and https://www.anandtech.com/show/21136/nvidia-at-sc23-h200-accelerator-with-hbm3e-and-jupiter-supercomputer-for-2024
Note: Nvidia H200 GPUs have the same published FLOPs performance as H100 products https://resources.nvidia.com/en-us-tensor-core/.
*Nvidia H200 GPUs don’t support FP32 Tensor.
- Measurements conducted by AMD Performance Labs as of November 11th, 2023 on the AMD Instinct™ MI300X (750W) GPU designed with AMD CDNA™ 3 5nm | 6nm FinFET process technology at 2,100 MHz peak boost engine clock resulted in 653.7 TFLOPS peak theoretical TensorFloat-32 (TF32), 1307.4 TFLOPS peak theoretical half precision (FP16), 1307.4 TFLOPS peak theoretical Bfloat16 format precision (BF16), 2614.9 TFLOPS peak theoretical 8-bit precision (FP8), 2614.9 TOPs INT8 floating-point performance. The MI300X is expected to be able to take advantage of fine-grained structure sparsity providing an estimated 2x improvement in math efficiency, resulting in 1,307.4 TFLOPS peak theoretical TensorFloat-32 (TF32), 2,614.9 TFLOPS peak theoretical half precision (FP16), 2,614.9 TFLOPS peak theoretical Bfloat16 format precision (BF16), 5,229.8 TFLOPS peak theoretical 8-bit precision (FP8), 5,229.8 TOPs INT8 floating-point performance with sparsity. The results calculated for the AMD Instinct™ MI250X (560W) 128GB HBM2e OAM accelerator designed with AMD CDNA™ 2 5nm FinFET process technology at 1,700 MHz peak boost engine clock resulted in TF32* (N/A), 383.0 TFLOPS peak theoretical half precision (FP16), 383.0 TFLOPS peak theoretical Bfloat16 format precision (BF16), FP8* (N/A), 383.0 TOPs INT8 floating-point performance. *AMD Instinct MI200 Series GPUs don’t support TF32, FP8 or sparsity. MI300-16
- Measurements by internal AMD Performance Labs as of June 2, 2023 on current specifications and/or internal engineering calculations. Large Language Model (LLM) run or calculated with FP16 precision to determine the minimum number of GPUs needed to run the Falcon-7B (7B, 40B parameters), LLaMA (13B, 33B parameters), OPT (66B parameters), GPT-3 (175B parameters), BLOOM (176B parameter), and PaLM (340B, 540B parameters) models. Calculated estimates based on GPU-only memory size versus memory required by the model at defined parameters plus 10% overhead. Calculations rely on published and sometimes preliminary model memory sizes. GPT-3, BLOOM, and PaLM results estimated on MI300X due to system/part availability. Tested result configurations: AMD Lab system consisting of 1x EPYC 9654 (96-core) CPU with 1x AMD Instinct™ MI300X (192GB HBM3, OAM Module) 750W accelerator.
Results (FP16 precision):
Model        Parameters    Total Memory Required    MI300X Required
Falcon-7B    7 billion     15.4 GB                  1 (actual)
LLaMA        13 billion    44 GB                    1 (actual)
LLaMA        33 billion    72.5 GB                  1 (actual)
Falcon-40B   40 billion    88 GB                    1 (actual)
OPT          66 billion    145.2 GB                 1 (actual)
GPT-3        175 billion   385 GB                   3 (calculated)
BLOOM        176 billion   387.2 GB                 3 (calculated)
PaLM         340 billion   748 GB                   4 (calculated)
PaLM         540 billion   1188 GB                  7 (calculated)
Calculated estimates may vary based on final model size; actual results and estimates may vary due to the actual overhead required and to the use of system memory beyond that of the GPU. Server manufacturers may vary configuration offerings yielding different results. MI300-07 (An illustrative sketch of the calculated estimates appears after these footnotes.)
- Measurements conducted by AMD Performance Labs as of November 11th, 2023 on the AMD Instinct™ MI300X (750W) GPU designed with AMD CDNA™ 3 5nm | 6nm FinFET process technology at 2,100 MHz peak boost engine clock resulted in 653.7 TFLOPS peak theoretical TensorFloat-32 (TF32), 1307.4 TFLOPS peak theoretical half precision (FP16), 1307.4 TFLOPS peak theoretical Bfloat16 format precision (BF16), 2614.9 TFLOPS peak theoretical 8-bit precision (FP8), 2614.9 TOPs INT8 floating-point performance. The MI300X is expected to be able to take advantage of fine-grained structure sparsity providing an estimated 2x improvement in math efficiency, resulting in 1,307.4 TFLOPS peak theoretical TensorFloat-32 (TF32), 2,614.9 TFLOPS peak theoretical half precision (FP16), 2,614.9 TFLOPS peak theoretical Bfloat16 format precision (BF16), 5,229.8 TFLOPS peak theoretical 8-bit precision (FP8), 5,229.8 TOPs INT8 floating-point performance with sparsity. Published results on Nvidia H100 SXM (80GB) 700W GPU resulted in 989.4 TFLOPs peak TensorFloat-32 (TF32) with sparsity, 1,978.9 TFLOPS peak theoretical half precision (FP16) with sparsity, 1,978.9 TFLOPS peak theoretical Bfloat16 format precision (BF16) with sparsity, 3,957.8 TFLOPS peak theoretical 8-bit precision (FP8) with sparsity, 3,957.8 TOPs peak theoretical INT8 floating-point performance with sparsity. Nvidia H100 source: https://resources.nvidia.com/en-us-tensor-core. MI300-17
- Measurements conducted by AMD Performance Labs as of November 11th, 2023 on the AMD Instinct™ MI300X (750W) GPU designed with AMD CDNA™ 3 5nm | 6nm FinFET process technology at 2,100 MHz peak boost engine clock resulted in 163.4 TFLOPs peak theoretical double precision Matrix (FP64 Matrix), 81.7 TFLOPs peak theoretical double precision (FP64), 163.4 TFLOPs peak theoretical single precision Matrix (FP32 Matrix), 163.4 TFLOPs peak theoretical single precision (FP32), 653.7 TFLOPS peak theoretical TensorFloat-32 (TF32), 1307.4 TFLOPS peak theoretical half precision (FP16), 1307.4 TFLOPS peak theoretical Bfloat16 format precision (BF16), 2614.9 TFLOPS peak theoretical 8-bit precision (FP8), 2614.9 TOPs INT8 floating-point performance. Published results on Nvidia H100 SXM (80GB) GPU resulted in 66.9 TFLOPs peak theoretical double precision tensor (FP64 Tensor), 33.5 TFLOPs peak theoretical double precision (FP64), 66.9 TFLOPs peak theoretical single precision (FP32), 494.7 TFLOPs peak TensorFloat-32 (TF32)*, 989.4 TFLOPS peak theoretical half precision tensor (FP16 Tensor), 133.8 TFLOPS peak theoretical half precision (FP16), 989.4 TFLOPS peak theoretical Bfloat16 tensor format precision (BF16 Tensor), 133.8 TFLOPS peak theoretical Bfloat16 format precision (BF16), 1,978.9 TFLOPS peak theoretical 8-bit precision (FP8), 1,978.9 TOPs peak theoretical INT8 floating-point performance. Nvidia H100 source: https://resources.nvidia.com/en-us-tensor-core/ * Nvidia H100 GPUs don’t support FP32 Tensor. MI300-18
- Calculations conducted by AMD Performance Labs as of November 17, 2023, for the AMD Instinct™ MI300X OAM accelerator 750W (192 GB HBM3) designed with AMD CDNA™ 3 5nm FinFET process technology resulted in 192 GB HBM3 memory capacity and 5.325 TB/s peak theoretical memory bandwidth performance. MI300X memory bus interface is 8,192 bits and memory data rate is 5.2 Gbps for total peak memory bandwidth of 5.325 TB/s (8,192 bits memory bus interface * 5.2 Gbps memory data rate/8). The highest published results on the Nvidia Hopper H200 (141GB) SXM GPU accelerator resulted in 141GB HBM3e memory capacity and 4.8 TB/s GPU memory bandwidth performance. https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446 The highest published results on the Nvidia Hopper H100 (80GB) SXM5 GPU accelerator resulted in 80GB HBM3 memory capacity and 3.35 TB/s GPU memory bandwidth performance. https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet MI300-05A (An illustrative recomputation of this bandwidth arithmetic appears after these footnotes.)
- Measurements conducted by AMD Performance Labs as of November 18th, 2023 on the AMD Instinct™ MI300X (192 GB HBM3) 750W GPU designed with AMD CDNA™ 3 5nm | 6nm FinFET process technology at 2,100 MHz peak boost engine clock resulted in 1307.4 TFLOPS peak theoretical half precision (FP16), 1307.4 TFLOPS peak theoretical Bfloat16 format precision (BF16). The MI300X is expected to be able to take advantage of fine-grained structure sparsity providing an estimated 2x improvement in math efficiency, resulting in 2,614.9 TFLOPS peak theoretical half precision (FP16), 2,614.9 TFLOPS peak theoretical Bfloat16 format precision (BF16) floating-point performance with sparsity. Published results on Nvidia H100 SXM (80GB HBM3) 700W GPU resulted in 1,978.9 TFLOPS peak theoretical half precision (FP16) with sparsity, 1,978.9 TFLOPS peak theoretical Bfloat16 format precision (BF16) with sparsity floating-point performance. Nvidia H100 source: https://resources.nvidia.com/en-us-tensor-core/ AMD Instinct™ MI300X AMD CDNA 3 technology-based accelerators include up to eight AMD Infinity Fabric links providing up to 1,024 GB/s peak aggregate theoretical GPU peer-to-peer (P2P) transport rate bandwidth performance per GPU OAM module. MI300-25
- Measurements conducted by AMD Performance Labs as of November 11th, 2023 on the AMD Instinct™ MI300A (760W) APU designed with AMD CDNA™ 3 5nm | 6nm FinFET process technology at 2,100 MHz peak boost engine clock resulted in 122.6 TFLOPS peak theoretical double precision Matrix (FP64 Matrix), 61.3 TFLOPS peak theoretical double precision (FP64), 122.6 TFLOPS peak theoretical single precision matrix (FP32 Matrix), 122.6 TFLOPS peak theoretical single precision (FP32), 490.3 TFLOPS peak theoretical TensorFloat-32 (TF32), 980.6 TFLOPS peak theoretical half precision (FP16), 980.6 TFLOPS peak theoretical Bfloat16 format precision (BF16), 1961.2 TFLOPS peak theoretical 8-bit precision (FP8), 1961.2 TOPs INT8 floating-point performance. The results calculated for the AMD Instinct™ MI250X (560W) 128GB HBM2e OAM accelerator designed with AMD CDNA™ 2 5nm FinFET process technology at 1,700 MHz peak boost engine clock resulted in 95.7 TFLOPS peak theoretical double precision Matrix (FP64 Matrix), 47.9 TFLOPS peak theoretical double precision (FP64), 95.7 TFLOPS peak theoretical single precision matrix (FP32 Matrix), 47.9 TFLOPS peak theoretical single precision (FP32), TF32* (N/A), 383.0 TFLOPS peak theoretical half precision (FP16), 383.0 TFLOPS peak theoretical Bfloat16 format precision (BF16), FP8* (N/A), 383.0 TOPs INT8 floating-point performance. Server manufacturers may vary configuration offerings yielding different results. * MI200 Series GPUs don’t support TF32, FP8 or sparsity. MI300-10
- Measurements conducted by AMD Performance Labs as of November 11th, 2023 on the AMD Instinct™ MI300A (750W) APU designed with AMD CDNA™ 3 5nm | 6nm FinFET process technology at 2,100 MHz peak boost engine clock resulted in 490.29 TFLOPS peak theoretical TensorFloat-32 (TF32), 980.58 TFLOPS peak theoretical half precision (FP16), 980.58 TFLOPS peak theoretical Bfloat16 format precision (BF16), 1,961.16 TFLOPS peak theoretical 8-bit precision (FP8), 1,961.16 TOPs INT8 floating-point performance. The MI300A is expected to be able to take advantage of fine-grained structure sparsity providing an estimated 2x improvement in math efficiency, resulting in 980.58 TFLOPS peak theoretical TensorFloat-32 (TF32), 1,961.16 TFLOPS peak theoretical half precision (FP16), 1,961.16 TFLOPS peak theoretical Bfloat16 format precision (BF16), 3,922.33 TFLOPS peak theoretical 8-bit precision (FP8), 3,922.33 TOPs INT8 floating-point performance with sparsity. Published results on Nvidia H100 SXM5 (80GB) GPU resulted in 989.4 TFLOPs peak TensorFloat-32 (TF32) Tensor Core with sparsity, 1,978.9 TFLOPS peak theoretical half precision (FP16) Tensor Core with sparsity, 1,978.9 TFLOPS peak theoretical Bfloat16 format precision (BF16) Tensor Core with sparsity, 3,957.8 TFLOPS peak theoretical 8-bit precision (FP8) Tensor Core with sparsity, 3,957.8 TOPs peak theoretical INT8 Tensor Core with sparsity floating-point performance. Nvidia H100 source: https://resources.nvidia.com/en-us-tensor-core/ Server manufacturers may vary configuration offerings yielding different results. MI300-21
- Measurements conducted by AMD Performance Labs as of November 11th, 2023 on the AMD Instinct™ MI300A (760W) APU designed with AMD CDNA™ 3 5nm | 6nm FinFET process technology at 2,100 MHz peak boost engine clock resulted in 122.6 TFLOPs peak theoretical double precision Matrix (FP64 Matrix), 61.3 TFLOPs peak theoretical double precision (FP64), 122.6 TFLOPs peak theoretical single precision Matrix (FP32 Matrix), 122.6 TFLOPs peak theoretical single precision (FP32), 490.29 TFLOPS peak theoretical TensorFloat-32 (TF32), 980.58 TFLOPS peak theoretical half precision (FP16), 980.58 TFLOPS peak theoretical Bfloat16 format precision (BF16), 1,961.16 TFLOPS peak theoretical 8-bit precision (FP8), 1,961.16 TOPs INT8 floating-point performance. Published results on Nvidia H100 SXM (80GB) 700W GPU resulted in 66.9 TFLOPs peak theoretical double precision tensor (FP64 Tensor), 33.5 TFLOPs peak theoretical double precision (FP64), 66.9 TFLOPs peak theoretical single precision (FP32), 494.7 TFLOPs peak TensorFloat-32 (TF32)*, 989.4 TFLOPS peak theoretical half precision tensor (FP16 Tensor), 133.8 TFLOPS peak theoretical half precision (FP16), 989.4 TFLOPS peak theoretical Bfloat16 tensor format precision (BF16 Tensor), 133.8 TFLOPS peak theoretical Bfloat16 format precision (BF16), 1,978.9 TFLOPS peak theoretical 8-bit precision (FP8), 1,978.9 TOPs peak theoretical INT8 floating-point performance. Nvidia H100 source: https://resources.nvidia.com/en-us-tensor-core/ Server manufacturers may vary configuration offerings yielding different results. * Nvidia H100 GPUs don’t support FP32 Tensor. MI300-20
- Calculations conducted by AMD Performance Labs as of November 7, 2023, for the AMD Instinct™ MI300A APU accelerator 760W (128 GB HBM3) designed with AMD CDNA™ 3 5nm FinFET process technology resulted in 128 GB HBM3 memory capacity and 5.325 TB/s peak theoretical memory bandwidth performance. MI300A memory bus interface is 8,192 bits (1024 bits x 8 die) and memory data rate is 5.2 Gbps for total peak memory bandwidth of 5.325 TB/s (8,192 bits memory bus interface * 5.2 Gbps memory data rate/8). The highest published results on the Nvidia Hopper H200 (141GB) SXM GPU accelerator resulted in 141GB HBM3e memory capacity and 4.8 TB/s GPU memory bandwidth performance. https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446 The highest published results on the Nvidia Hopper H100 (80GB) SXM GPU accelerator resulted in 80GB HBM3 memory capacity and 3.35 TB/s GPU memory bandwidth performance. https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet Server manufacturers may vary configuration offerings yielding different results. MI300-12
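
As noted in footnote MI300-07 above, the "calculated" rows of the results table follow from a simple rule: 2 bytes per FP16 parameter, plus 10% overhead, divided by the MI300X's 192 GB capacity. A minimal illustrative sketch of that estimate (not AMD code; names are made up):

```python
import math

# Illustrative sketch of footnote MI300-07's "calculated" estimates: FP16 weights
# (2 bytes per parameter) plus 10% overhead, divided by MI300X 192 GB HBM3 capacity.
# The "actual" rows in the table came from measured runs, not from this formula.
MI300X_MEMORY_GB = 192

def min_mi300x_gpus(params_billion: float) -> tuple[float, int]:
    required_gb = params_billion * 2 * 1.10  # 2 GB per billion params, +10% overhead
    return required_gb, math.ceil(required_gb / MI300X_MEMORY_GB)

for name, params in [("GPT-3", 175), ("BLOOM", 176), ("PaLM", 340), ("PaLM", 540)]:
    gb, gpus = min_mi300x_gpus(params)
    print(f"{name} ({params}B): {gb:.1f} GB -> {gpus}x MI300X")
# GPT-3 (175B): 385.0 GB -> 3x MI300X
# BLOOM (176B): 387.2 GB -> 3x MI300X
# PaLM (340B): 748.0 GB -> 4x MI300X
# PaLM (540B): 1188.0 GB -> 7x MI300X
```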
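
Similarly, the peak theoretical memory bandwidth figures in footnotes MI300-05A and MI300-12 follow from bus width times data rate. A one-function illustrative check (the helper name is made up):

```python
# Illustrative check of the footnotes' bandwidth arithmetic:
# TB/s = bus width (bits) * data rate (Gbps) / 8 bits-per-byte / 1000 GB-per-TB.
def peak_bandwidth_tbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    return bus_width_bits * data_rate_gbps / 8 / 1000

print(peak_bandwidth_tbs(8192, 5.2))  # MI300X / MI300A: 5.3248, i.e. ~5.325 TB/s
```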