

# AMD EPYC™ 9004 Series Architecture Overview

| Publication | Revision | Issue Date |
|-------------|----------|------------|
| 58015       | 1.3      | June, 2023 |

#### © 2023 Advanced Micro Devices, Inc. All rights reserved.

The information contained herein is for informational purposes only and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD's products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale.

#### **Trademarks**

AMD, the AMD Arrow logo, AMD EPYC, 3D V-Cache, and combinations thereof are trademarks of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG Corporation. Other product names and links to external sites used in this publication are for identification purposes only and may be trademarks of their respective companies.

\* Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

| Date       | Version | Changes                                               |
|------------|---------|-------------------------------------------------------|
| July, 2022 | 0.1     | Initial NDA partner release                           |
| Sep, 2022  | 0.2     | Misc. updates                                         |
| Nov, 2022  | 1.0     | Initial public release                                |
| Dec, 2022  | 1.1     | Minor errata corrections                              |
| Mar, 2023  | 1.2     | Added 97xx and AMD 3D V-Cache™ technology information |
| Jun, 2023  | 1.3     | Second public release                                 |

#### **Audience**

This guide provides a high-level technical overview of 4th Gen AMD EPYC™ 9004 Series Processor internal IP.

#### **Author**

Chris Karamatas (with support from the AMD FAE team and Anthony Hernandez)

Note: All of the settings described in this Architecture Guide apply to all AMD EPYC 9004 Series Processors of all core counts with or without AMD 3D V-Cache™ except where explicitly noted otherwise.




# **Table of Contents**

| Chapter 1 | AMD EPYC™ 9004 Series Processors                         |
|-----------|----------------------------------------------------------|
| 1.1       | General Specifications                                   |
| 1.2       | Model-Specific Features                                  |
| 1.3       | Operating Systems                                        |
| 1.4       | Processor Layout                                         |
| 1.5       | "Zen 4" Core                                             |
| 1.6       | Core Complex (CCX)                                       |
| 1.7       | Core Complex Dies (CCDs)                                 |
| 1.8       | AMD 3D V-Cache™ Technology                               |
| 1.9       | I/O Die (Infinity Fabric™)                               |
| 1.10      | Memory and I/O                                           |
| 1.11      | Visualizing AMD EPYC 9004 Series Processors (Family 19h) |
| 1.11.1    | Models 91xx-96xx ("Genoa")                               |
| 1.11.2    | Models 97xx ("Bergamo")                                  |
| 1.12      | NUMA Topology                                            |
| 1.12.1    | NUMA Settings                                            |
| 1.13      | Dual-Socket Configurations                               |
| Chapter 2 | Processor Identification                                 |
| 2.1       | CPUID Instruction                                        |
| 2.2       | New Software-Visible Features                            |
| 2.2.1     | AVX-512                                                  |
| Chapter 3 | Resources                                                |
| 3.1       | Resources                                                |




#### Chapter 1

# AMD EPYC™ 9004 Series Processors

AMD EPYC™ 9004 Series Processors represent the fourth generation of AMD EPYC server-class processors. This generation of AMD EPYC processors features AMD's latest "Zen 4"-based compute cores, next-generation Infinity Fabric, and next-generation memory and I/O technology, and uses the new SP5 socket/packaging.

#### 1.1 General Specifications

AMD EPYC 9004 Series Processors offer a variety of configurations with varying numbers of cores, Thermal Design Points (TDPs), frequencies, cache sizes, etc. that complement AMD's existing server portfolio with further improvements to performance, power efficiency, and value. Table 1-1 lists the features common to all AMD EPYC 9004 Series Processors.

| Feature                                                      | Specification         |
|--------------------------------------------------------------|-----------------------|
| Compute cores                                                | "Zen 4"-based         |
| Core process technology                                      | 5nm                   |
| Maximum cores per Core Complex (CCX)                         | 8                     |
| Max memory per socket                                        | 6 TB                  |
| Max # of memory channels                                     | 12 DDR5               |
| Max memory speed                                             | 4800 MT/s DDR5        |
| Max Compute Express Link (CXL) lanes                         | 64 lanes CXL 1.1+     |
| Max Peripheral Component Interconnect Express (PCIe®) lanes  | 128 lanes PCIe® Gen 5 |

Table 1-1: Common features of all AMD EPYC 9004 Series Processors

#### 1.2 Model-Specific Features

Different models of 4th Gen AMD EPYC processors have different feature sets, as shown in Table 1-2.

| AMD EPYC 9004 Series Processor (Family 19h) Features by Model |                |                |
|---------------------------------------------------------------|----------------|----------------|
| Codename                                                      | "Genoa"\*      | "Bergamo"\*    |
| Model #                                                       | 91xx-96xx      | 97xx           |
| Max number of Core Complex Dies (CCDs)                        | 12             | 8              |
| Number of Core Complexes (CCXs) per CCD                       | 1              | 2              |
| Max number of cores (threads)                                 | 96 (192)       | 128 (256)      |
| Max L3 cache size (per CCX)                                   | 1 GB (96 MB)◆  | 256 MB (16 MB) |
| Max Processor Frequency                                       | 4.4 GHz◆◆      | 3.15 GHz       |

Includes ◆AMD 3D V-Cache (9xx4X) and ◆◆high-frequency (9xx4F) models.

Table 1-2: AMD EPYC 9004 Series Processors features by model

\*GD-122: The information contained herein is for informational purposes only, and is subject to change without notice. Timelines, roadmaps, and/or product release dates shown herein are plans only and subject to change. "Genoa" and "Bergamo" are codenames for AMD architectures, and are not product names.

#### 1.3 Operating Systems

AMD recommends using the latest available targeted OS version and updates. Please see <u>AMD EPYC™ Processors Minimum Operating System (OS) Versions</u> for detailed OS version information.

#### 1.4 Processor Layout

AMD EPYC 9004 Series Processors incorporate compute cores, memory controllers, I/O controllers, RAS (Reliability, Availability, and Serviceability), and security features into an integrated System on a Chip (SoC). The AMD EPYC 9004 Series Processor retains the proven Multi-Chip Module (MCM) Chiplet architecture of prior successful AMD EPYC processors while making further improvements to the SoC components.

The SoC includes the Core Complex Dies (CCDs), which contain Core Complexes (CCXs), which contain the "Zen 4"-based cores. The CCDs surround the central high-speed I/O Die (and interconnect via the Infinity Fabric). The following sections describe each of these components.



Figure 1-1: AMD EPYC 9004 configuration with 12 Core Complex Dies (CCD) surrounding a central I/O Die (IOD)

#### 1.5 "Zen 4" Core

AMD EPYC 9004 Series Processors are based on the new "Zen 4" compute core. The "Zen 4" core is manufactured using a 5nm process and is designed to provide an Instructions per Cycle (IPC) uplift and frequency improvements over prior generation "Zen" cores. Each core has a larger L2 cache and improved cache effectiveness over the prior generation. Each "Zen 4" core includes:

- Up to 32 KB of 8-way L1 I-cache and 32 KB of 8-way L1 D-cache
- Up to 1 MB of private unified (Instruction/Data) L2 cache. All caches use a 64 B cache line size.

Each core supports Simultaneous Multithreading (SMT), which allows 2 separate hardware threads to run independently, sharing the corresponding core's L2 cache.
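As a rough illustration of the cache figures above, the following Python sketch derives the number of cache lines and sets implied by a 32 KB, 8-way L1 cache with 64 B lines, and the number of lines in a 1 MB L2 cache. The function names are hypothetical, and because the L2 associativity is not stated here, only the L2 line count is computed.

```python
KIB = 1024
LINE_BYTES = 64  # cache line size stated for all caches


def cache_lines(size_bytes: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of cache lines that fit in a cache of the given size."""
    return size_bytes // line_bytes


def cache_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets in a set-associative cache (lines divided by ways)."""
    return cache_lines(size_bytes, line_bytes) // ways


if __name__ == "__main__":
    print("L1 lines:", cache_lines(32 * KIB))
    print("L1 sets :", cache_sets(32 * KIB, ways=8))
    print("L2 lines:", cache_lines(1024 * KIB))
```

Under these figures, each L1 cache holds 512 lines arranged as 64 sets of 8 ways.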



#### 1.6 Core Complex (CCX)

Figure 1-2 shows a Core Complex (CCX) where up to eight "Zen 4"-based cores share an L3 or Last Level Cache (LLC). Enabling Simultaneous Multithreading (SMT) allows a single CCX to support up to 16 concurrent hardware threads.



Figure 1-2: Top view of 8 compute cores sharing an L3 cache (91xx-96xx models)

#### 1.7 Core Complex Dies (CCDs)

The Core Complex Die (CCD) in an AMD EPYC 9004 Series Processor may contain either one or two CCXs, depending on the processor (91xx-96xx "Genoa" vs. 97xx "Bergamo"), as shown in Figure 1-3.



Figure 1-3: 2 CCXs in a single 4th Gen AMD EPYC 97xx CCD

Each Core Complex Die (CCD) in a 97xx model AMD EPYC 9004 Series Processor contains two CCXs (Figure 1-3), while each CCD in a 91xx-96xx model contains one:

| AMD EPYC 9004 Series Processor | 91xx-96xx  | 97xx |
|--------------------------------|------------|------|
| # of CCXs within a CCD         | 1          | 2    |

Table 1-3: CCXs per CCD by AMD EPYC model
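The relationship between CCDs, CCXs, and cores can be checked with simple arithmetic. The sketch below is illustrative only: the `Layout` class and constant names are hypothetical, while the CCD and CCX counts and the eight-core CCX limit come from Tables 1-1 through 1-3.

```python
from dataclasses import dataclass

CORES_PER_CCX = 8      # maximum cores per CCX (Table 1-1)
THREADS_PER_CORE = 2   # with SMT enabled


@dataclass(frozen=True)
class Layout:
    """Maximum chiplet layout for one model range."""
    name: str
    max_ccds: int
    ccxs_per_ccd: int

    def max_cores(self) -> int:
        return self.max_ccds * self.ccxs_per_ccd * CORES_PER_CCX

    def max_threads(self) -> int:
        return self.max_cores() * THREADS_PER_CORE


GENOA = Layout("91xx-96xx", max_ccds=12, ccxs_per_ccd=1)
BERGAMO = Layout("97xx", max_ccds=8, ccxs_per_ccd=2)

if __name__ == "__main__":
    for layout in (GENOA, BERGAMO):
        print(layout.name, layout.max_cores(), "cores,",
              layout.max_threads(), "threads")
```

The results match the maximum core and thread counts listed in Table 1-2.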

You can disable cores in BIOS using one or both of the following approaches:

- Reduce the cores per L3 from 8 down to 7, 6, 5, 4, 3, 2, or 1 while keeping the number of CCDs constant. This approach increases the effective cache-per-core ratio but reduces the number of cores sharing the cache.
- Reduce the number of active CCDs while keeping the cores per CCD constant. This approach maintains the advantages of cache sharing between the cores while maintaining the same cache per core ratio.
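The trade-off between the two approaches can be made concrete with a small calculation. This is a minimal sketch that assumes a standard 91xx-96xx part with one CCX per CCD and 32 MB of L3 per CCX; the function names are hypothetical.

```python
L3_PER_CCX_MB = 32  # ASSUMPTION: standard (non-3D V-Cache) 91xx-96xx part


def l3_per_core_mb(cores_per_l3: int, l3_mb: int = L3_PER_CCX_MB) -> float:
    """L3 capacity available per active core within one CCX."""
    if not 1 <= cores_per_l3 <= 8:
        raise ValueError("cores per L3 must be between 1 and 8")
    return l3_mb / cores_per_l3


def total_cores(active_ccds: int, cores_per_l3: int) -> int:
    """Active cores on a part with one CCX per CCD."""
    return active_ccds * cores_per_l3


if __name__ == "__main__":
    # Two ways to reach 48 active cores on a 12-CCD part:
    print("12 CCDs x 4 cores:", total_cores(12, 4), "cores,",
          l3_per_core_mb(4), "MB L3 per core")
    print(" 6 CCDs x 8 cores:", total_cores(6, 8), "cores,",
          l3_per_core_mb(8), "MB L3 per core")
```

Both configurations leave 48 cores active, but the first doubles the L3 capacity per core while the second keeps eight cores sharing each L3.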

#### 1.8 AMD 3D V-Cache™ Technology

AMD EPYC 9xx4X Series Processors include AMD 3D V-Cache™ die stacking technology that enables denser, more efficient chiplet integration. The AMD 3D Chiplet architecture stacks L3 cache tiles vertically to provide up to 96 MB of L3 cache per die (and up to 1 GB of L3 cache per socket) while still providing socket compatibility with all AMD EPYC™ 9004 Series Processor models.

AMD EPYC 9004 Series Processors with AMD 3D V-Cache technology employ industry-leading logic stacking based on a copper-to-copper hybrid bonding "bumpless" chip-on-wafer process. This process enables over 200X the interconnect density of current 2D technologies (and over 15X the interconnect density of other 3D technologies using solder bumps), which translates to lower latency, higher bandwidth, and greater power and thermal efficiency.



Figure 1-4: Side view of vertically-stacked central L3 SRAM tiles

| AMD EPYC 9004 Series Processors | 9xx4  | 9xx4X OPNs (with 3D V-Cache)    |
|---------------------------------|-------|---------------------------------|
| Max Shared L3 Cache (per CCD)   | 32 MB | 96 MB                           |

Table 1-4: L3 cache by processor model

Different OPNs may also have different numbers of cores within the CCX. However, for any given part, all CCXs will always contain the same number of cores.



#### 1.9 I/O Die (Infinity Fabric™)

The CCDs connect to memory, I/O, and each other through an updated I/O Die (IOD). This central AMD Infinity Fabric™ provides the data path and control support to interconnect CCXs, memory, and I/O. Each CCD connects to the IOD via a dedicated high-speed Global Memory Interconnect (GMI) link. The IOD helps maintain cache coherency and also provides the interface to extend the data fabric to a potential second processor via its xGMI links, or G-links. AMD EPYC 9004 Series Processors support up to 4 xGMI links (G-links) with speeds up to 32 Gbps. The IOD exposes DDR5 memory channels, PCIe® Gen5, CXL 1.1+, and Infinity Fabric links.

All dies (chiplets) interconnect with each other via AMD Infinity Fabric technology. Figure 1-5 (which corresponds to Figure 1-1, above) shows the layout of a 96-core AMD EPYC 9654 processor. The AMD EPYC 9654 has 12 CCDs, with each CCD connecting to the IOD via its own GMI connection.



Figure 1-5: AMD EPYC 9654 processor internals interconnect via AMD Infinity Fabric (12 CCD processor shown)

AMD also provides "wide" OPNs (e.g. AMD EPYC 9334) where each CCD connects to two GMI3 interfaces, thereby allowing double the Core-to-I/O die bandwidth.



Figure 1-6: Standard vs. Wide GMI links

The IOD provides twelve unified memory controllers that support DDR5 memory. The IOD also presents 4 'P-links' that the system OEM/designer can configure to support various I/O interfaces, such as PCIe Gen5, and/or CXL 1.1+.

#### 1.10 Memory and I/O

Each UMC can support up to 2 DIMMs per channel (DPC) for a maximum of 24 DIMMs per socket. OEM server configurations may allow either 1 DIMM per channel or 2 DIMMs per channel. 4th Gen AMD EPYC processors can support up to 6TB of DDR5 memory. Having additional and faster memory channels compared to previous generations of AMD EPYC processors provides additional memory bandwidth to feed high-core-count processors. Memory interleaving on 2, 4, 6, 8, 10, and 12 channels helps optimize for a variety of workloads and memory configurations.
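The memory figures above can be related to one another with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the 8 bytes per transfer is an assumption (a 64-bit data path per DDR5 channel) that this guide does not state. The bandwidth result is a theoretical peak, not a measured value.

```python
CHANNELS_PER_SOCKET = 12
DIMMS_PER_CHANNEL = 2
MAX_MEMORY_TB = 6
TRANSFER_RATE_MT_S = 4800
BYTES_PER_TRANSFER = 8  # ASSUMPTION: 64-bit data path per channel


def max_dimms_per_socket() -> int:
    return CHANNELS_PER_SOCKET * DIMMS_PER_CHANNEL


def dimm_size_gb_for_max_memory() -> float:
    """DIMM capacity needed to reach the maximum memory with every slot filled."""
    return MAX_MEMORY_TB * 1024 / max_dimms_per_socket()


def peak_bandwidth_gb_s(channels: int = CHANNELS_PER_SOCKET) -> float:
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    return channels * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1000


if __name__ == "__main__":
    print("DIMMs per socket:", max_dimms_per_socket())
    print("DIMM size for 6 TB:", dimm_size_gb_for_max_memory(), "GB")
    print("Peak bandwidth:", peak_bandwidth_gb_s(), "GB/s")
```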

Each processor may have a set of 4 P-links and 4 G-links. An OEM motherboard design can use a G-link either to connect to a second 4th Gen AMD EPYC processor or to provide additional PCIe Gen5 lanes. 4th Gen AMD EPYC processors support up to eight x16 links, that is, 128 lanes of high-speed PCIe Gen5 in single-socket platforms and up to 160 lanes in dual-socket platforms. Further, OEMs may configure 32 of these 128 lanes as SATA lanes and/or configure 64 lanes as CXL 1.1+. In summary, these links can support:

- Up to 4 G-links of AMD Infinity Fabric connectivity for 2P designs.
- Up to eight x16 links (128 lanes) of PCIe Gen 5 connectivity to peripherals in 1P designs (and up to 160 lanes in 2-socket designs).
- Up to 64 lanes (4 P-links) that can be dedicated to Compute Express Link (CXL) 1.1+ connectivity to extended memory.
- Up to 32 I/O lanes that can be configured as SATA disk controllers.



#### 1.11 Visualizing AMD EPYC 9004 Series Processors (Family 19h)

This section depicts AMD EPYC 9004 Series Processors that have been set up with four nodes per socket (NPS=4). Please see Section 1.12, "NUMA Topology," for more information about nodes.

#### 1.11.1 Models 91xx-96xx ("Genoa")

4th Gen AMD EPYC 9004 processors with model numbers 91xx-96xx have up to 12 CCDs that each contain a single CCX, as shown below.



Figure 1-7: The AMD EPYC 9004 SoC consists of up to 12 CCDs and a central IOD for 91xx-96xx models, including "X" OPNs


#### 1.11.2 Models 97xx ("Bergamo")

4th Gen AMD EPYC 9004 Series Processors with model numbers 97xx have up to 8 CCDs that each contain two CCXs, as shown below.



Figure 1-8: The AMD EPYC 9004 System on Chip (SoC) consists of up to 8 CCDs and a central IOD for 97xx models

#### 1.12 NUMA Topology

AMD EPYC 9004 Series Processors use a Non-Uniform Memory Access (NUMA) architecture where different latencies may exist depending on the proximity of a processor core to memory and I/O controllers. Using resources within the same NUMA node provides uniformly good performance, while using resources in differing nodes increases latencies.

#### 1.12.1 NUMA Settings

A user can adjust the system **NUMA Nodes Per Socket** (NPS) BIOS setting to optimize this NUMA topology for their specific operating environment and workload. For example, setting NPS=4 (as depicted in Section 1.11, "Visualizing AMD EPYC 9004 Series Processors (Family 19h)") divides the processor into quadrants, where each quadrant has 3 CCDs, 3 UMCs, and 1 I/O hub. The shortest distance is between the cores, memory, and I/O peripherals within the same quadrant. The longest distance is between a core and a memory controller or I/O hub in the diagonally opposite quadrant (or on the other processor in a 2P configuration). The locality of cores, memory, and I/O hubs and devices in a NUMA-based system is an important factor when tuning for performance.



The NPS setting also controls the interleave pattern of the memory channels within the NUMA Node. Each memory channel within a given NUMA node is interleaved. The number of channels interleaved decreases as the NPS setting gets more granular. For example:

- A setting of NPS=4 partitions the processor into four NUMA nodes per socket with each logical quadrant configured as its own NUMA domain. Memory is interleaved across the memory channels associated with each quadrant. PCIe devices will be local to one of the four processor NUMA domains, depending on the IOD quadrant that has the corresponding PCIe root complex for that device.
- A setting of NPS=2 configures each processor into two NUMA domains. Half of the cores and half of the memory channels are grouped into one NUMA domain, and the remaining cores and memory channels are grouped into a second NUMA domain. Memory is interleaved across the six memory channels in each NUMA domain. PCIe devices will be local to one of the two NUMA nodes depending on the half that has the PCIe root complex for that device.
- A setting of NPS=1 indicates a single NUMA node per socket. This setting configures all memory channels on the processor into a single NUMA node. All processor cores, all attached memory, and all PCIe devices connected to the SoC are in that one NUMA node. Memory is interleaved across all memory channels on the processor into a single address space.
- A setting of NPS=0 indicates a single NUMA domain for the entire system (across both sockets in a two-socket configuration). This setting configures all memory channels on the system into a single NUMA node. Memory is interleaved across all memory channels on the system into a single address space. All processor cores across all sockets, all attached memory, and all PCIe devices connected to either processor are in that single NUMA domain.
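The NPS settings described above can be summarized as a small lookup. The following is a minimal sketch, assuming a fully populated 12-CCD, 12-channel processor; the `nps_layout` function and its dictionary keys are hypothetical names chosen for illustration.

```python
CHANNELS_PER_SOCKET = 12
CCDS_PER_SOCKET = 12  # ASSUMPTION: a 12-CCD part


def nps_layout(nps: int, sockets: int = 1) -> dict:
    """Return NUMA node count and per-node resources for an NPS setting."""
    if nps not in (0, 1, 2, 4):
        raise ValueError("NPS must be 0, 1, 2, or 4")
    if nps == 0:
        # One NUMA node spans every socket in the system.
        return {
            "numa_nodes": 1,
            "channels_per_node": CHANNELS_PER_SOCKET * sockets,
            "ccds_per_node": CCDS_PER_SOCKET * sockets,
        }
    return {
        "numa_nodes": nps * sockets,
        "channels_per_node": CHANNELS_PER_SOCKET // nps,
        "ccds_per_node": CCDS_PER_SOCKET // nps,
    }


if __name__ == "__main__":
    for nps in (4, 2, 1):
        print(f"NPS={nps}:", nps_layout(nps))
    print("NPS=0 (2P):", nps_layout(0, sockets=2))
```

The number of channels interleaved per node shrinks from 12 (NPS=1) to 6 (NPS=2) to 3 (NPS=4), which is the pattern the list above describes.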

You may also be able to further improve the performance of certain environments by using the **LLC (L3 Cache) as NUMA** BIOS setting to associate workloads with compute cores that all share a single LLC. Enabling this setting presents each shared L3 cache (that is, each CCX) as a separate NUMA node. For 91xx-96xx models, which have one L3 cache per CCD, a single AMD EPYC 9004 Series Processor with 12 CCDs can have up to 12 NUMA nodes when this setting is enabled.

Thus, a single EPYC 9004 Series Processor may support a variety of NUMA configurations ranging from one to twelve NUMA nodes per socket.

Note: If software needs to understand NUMA topology or core enumeration, it is imperative to use documented Operating System (OS) APIs, well-defined interfaces, and commands. Do not rely on past assumptions about settings such as APICID or CCX ordering.
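As one example of querying the OS rather than assuming an ordering, Linux exposes NUMA node membership through sysfs files such as `/sys/devices/system/node/node0/cpulist`. The Python sketch below reads those files; the helper names are hypothetical, and other operating systems provide their own interfaces.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a Linux cpulist string such as '0-3,8,10-11' into CPU numbers."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map each NUMA node number to its CPUs, as reported by the kernel."""
    nodes: dict[int, list[int]] = {}
    for node_dir in Path(sysfs_root).glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            node_id = int(node_dir.name[len("node"):])
            nodes[node_id] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node {node_id}: {len(cpus)} CPUs")
```

On systems without this sysfs directory, `numa_nodes()` returns an empty dictionary.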

#### 1.13 Dual-Socket Configurations

AMD EPYC 9004 Series Processors support single- or dual-socket system configurations. Processors with a 'P' suffix in their name are optimized for single-socket configurations only (see the "Processor Identification" chapter). Dual-socket configurations require both processors to be identical. You cannot use two different processor Ordering Part Numbers (OPNs) in a single dual-socket system.



Figure 1-9: Two EPYC 9004 Processors connect through 4 xGMI links (NPS1)

In dual-socket systems, two identical EPYC 9004 Series SoCs are connected via their corresponding External Global Memory Interconnect (xGMI) links. This creates a high-bandwidth, low-latency interconnect between the two processors. System manufacturers can elect to use either 3 or 4 of these Infinity Fabric links depending upon I/O and bandwidth system design objectives.

The Infinity Fabric links utilize the same physical connections as the PCIe lanes on the system. Each link uses up to 16 PCIe lanes. A typical dual-socket system will reconfigure 64 PCIe lanes (4 links) from each socket for Infinity Fabric connections. This leaves each socket with 64 remaining PCIe lanes, meaning that the system has a total of 128 PCIe lanes. In some cases, a system designer may want to expose more PCIe lanes for the system by reducing the number of Infinity Fabric G-links from 4 to 3. In these cases, the designer may allocate up to 160 lanes for PCIe (80 per socket) by utilizing only 48 lanes per socket for Infinity Fabric links instead of 64.
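The lane arithmetic in the preceding paragraph can be sketched as follows. The function name is hypothetical; the lane counts are those given above.

```python
LANES_PER_SOCKET = 128  # high-speed lanes available on each socket
LANES_PER_LINK = 16     # lanes consumed by each Infinity Fabric link


def pcie_lanes_2p(g_links: int) -> int:
    """PCIe lanes left in a dual-socket system using 3 or 4 G-links."""
    if g_links not in (3, 4):
        raise ValueError("dual-socket designs use either 3 or 4 G-links")
    fabric_lanes_per_socket = g_links * LANES_PER_LINK
    return 2 * (LANES_PER_SOCKET - fabric_lanes_per_socket)


if __name__ == "__main__":
    for links in (4, 3):
        print(f"{links} G-links: {pcie_lanes_2p(links)} PCIe lanes")
```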

A dual-socket system has a total of 24 memory channels, or 12 per socket. Different OPNs can be configured to support a variety of NUMA domains.

#### Chapter 2

# Processor Identification

Figure 2-1 shows the processor naming convention for AMD EPYC 9004 Series Processors and how to use this convention to identify particular processor models:



Figure 2-1: AMD EPYC SoC naming convention

#### 2.1 CPUID Instruction

Software uses the CPUID instruction (Fn0000_0001_EAX) to identify the processor. The instruction returns the following values:

- Family: 19h identifies the "Zen 4" architecture
- Model: Varies with product. For example, EPYC Model 10h corresponds to an "A" part "Zen 4" CPU.
  - 91xx-96xx (including "X" OPNs): Family 19h, Models 10h-1Fh
  - 97xx: Family 19h, Models A0h-AFh
- Stepping: May be used to further identify minor design changes

For example, CPUID values for Family, Model, and Stepping (decimal) of 25, 17, 1 correspond to a "B1" part "Zen 4" CPU.
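The decoding of these fields can be sketched in Python using the standard CPUID Fn0000_0001 EAX bit layout (stepping in bits 3:0, base model in 7:4, base family in 11:8, extended model in 19:16, extended family in 27:20). The function names are hypothetical, and the sample EAX values are constructed for illustration to match the example above rather than read from hardware.

```python
def decode_fms(eax: int) -> tuple[int, int, int]:
    """Decode (family, model, stepping) from CPUID Fn0000_0001 EAX."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    if base_family == 0xF:
        # Extended fields apply only when the base family is Fh.
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family = base_family
        model = base_model
    return family, model, stepping


def model_range(family: int, model: int) -> str:
    """Classify a Family 19h model number using the ranges listed above."""
    if family == 0x19 and 0x10 <= model <= 0x1F:
        return "91xx-96xx"
    if family == 0x19 and 0xA0 <= model <= 0xAF:
        return "97xx"
    return "other"


if __name__ == "__main__":
    sample_eax = 0x00A10F11  # illustrative value: Family 19h, Model 11h, Stepping 1
    family, model, stepping = decode_fms(sample_eax)
    print(family, model, stepping, model_range(family, model))
```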

#### 2.2 New Software-Visible Features

AMD EPYC 9004 Series Processors introduce several new features that enhance performance, update the ISA, provide additional security, and improve system reliability and availability. Some of the new features include:

- 5-level Paging
- AVX-512 instructions on a 256-bit datapath, including BFLOAT16 and VNNI support.
- Fast Short REP STOSB and REP CMPSB

Not all operating systems or hypervisors support all features. Please refer to your OS or hypervisor documentation for specific releases to identify support for these features.

Please also see the latest version of the AMD64 Architecture Programmer's Manuals or Processor Programming Reference (PPR) for AMD Family 19h.

#### 2.2.1 AVX-512

AVX-512 is a set of instructions that support 512-bit register-width data operations (i.e., single instruction, multiple data [SIMD]). AMD EPYC 9004 Series Processors implement AVX-512 by "double-pumping" the 256-bit data path that exists throughout the "Zen 4" core: the two 256-bit halves of each 512-bit operation execute on sequential clock cycles. This means that running AVX-512 instructions on AMD EPYC 9004 Series Processors causes neither drops in effective frequency nor increased power consumption. On the contrary, many workloads run more energy-efficiently with AVX-512 than with 256-bit AVX.
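The idea of double-pumping can be pictured with a purely conceptual model: a 512-bit vector of sixteen 32-bit lanes is processed as two passes through an 8-lane (256-bit) unit. The sketch below is not cycle-accurate and does not describe the actual hardware pipeline; all names are hypothetical.

```python
LANES_512 = 16  # sixteen 32-bit lanes in a 512-bit vector
LANES_256 = 8   # eight 32-bit lanes in a 256-bit data path


def add_256(a: list[float], b: list[float]) -> list[float]:
    """One pass through a 256-bit-wide adder (eight lanes)."""
    if len(a) != LANES_256 or len(b) != LANES_256:
        raise ValueError("expected 8 lanes per operand")
    return [x + y for x, y in zip(a, b)]


def add_512_double_pumped(a: list[float], b: list[float]) -> list[float]:
    """A 512-bit add modeled as two back-to-back 256-bit passes."""
    if len(a) != LANES_512 or len(b) != LANES_512:
        raise ValueError("expected 16 lanes per operand")
    low = add_256(a[:LANES_256], b[:LANES_256])    # first pass: lanes 0-7
    high = add_256(a[LANES_256:], b[LANES_256:])   # second pass: lanes 8-15
    return low + high


if __name__ == "__main__":
    a = [float(i) for i in range(16)]
    b = [10.0] * 16
    print(add_512_double_pumped(a, b))
```

The result is identical to a single full-width operation; only the number of passes through the narrower data path differs.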

Other AVX-512 support includes:

- Vector Neural Network Instructions (VNNI), which are used in deep learning models and accelerate neural network inference by providing hardware support for convolution operations.
- Brain Floating Point 16-bit (BFLOAT16) numeric format. This format is used in Machine Learning applications that require high performance but must also conserve memory and bandwidth. BFLOAT16 support doubles the number of SIMD operands over 32-bit single precision FP, allowing twice the amount of data to be processed using the same memory bandwidth. BFLOAT16 preserves the dynamic range of 32-bit single precision (it keeps the same 8-bit exponent) at the expense of mantissa precision.
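The BFLOAT16 trade-off can be demonstrated in a few lines of Python. A BFLOAT16 value is the upper 16 bits of the corresponding 32-bit single-precision encoding. This sketch converts by simple truncation, which is a simplification chosen for clarity (conversion instructions typically round instead); the function names are hypothetical.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16_bits(x: float) -> int:
    """Keep the upper 16 bits of the float32 encoding (truncation)."""
    return float32_bits(x) >> 16


def from_bfloat16_bits(bits: int) -> float:
    """Expand a 16-bit BFLOAT16 pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


if __name__ == "__main__":
    for value in (1.0, 1.001, 3.0e38):
        roundtrip = from_bfloat16_bits(to_bfloat16_bits(value))
        print(f"{value!r} -> {roundtrip!r}")
```

Very large values survive the conversion because the exponent is untouched, while small differences such as 1.001 versus 1.0 are lost because only 7 explicit mantissa bits remain.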

#### Chapter 3

# Resources

#### 3.1 Resources

Please see the following resources for additional information about AMD EPYC 9004 Series processors:

- AMD EPYC™ 9004 Series Server Processors
- AMD64 Architecture Programmer's Manual
- AMD EPYC™ Tech Docs and White Papers
- BIOS & Workload Tuning Guide for AMD EPYC™ 9004 Series Processors (available from AMD EPYC Tuning Guides)
- <u>Memory Population Guidelines for AMD Family 19h Models 10h–1Fh</u> (login required; please review the latest version if multiple versions are present)
- <u>Socket SP5 Platform NUMA Topology for AMD Family 19h Models 10h–1Fh</u> (login required; please review the latest version if multiple versions are present)
