4. Performance Characterization Using AMDuProfPcm

4.1. Overview

The System Analysis tool, AMDuProfPcm, helps you track, analyze, and understand performance metrics for AMD processors. It collects metrics from hardware performance monitoring counters and operating system interfaces, and displays them as time-series plots, graphs, or CSV reports. The tool also supports basic performance modeling using roofline plots.

4.2. Performance Monitoring Counters (PMC)

Performance Monitoring Counters (PMC) are hardware resources capable of counting events occurring inside the processor. To count a specific event, a counter is configured or loaded with a unique value referred to as an event configuration.

PMC is categorized as follows:

4.3. Key Features

4.3.1. Multiplexing

Multiplexing is required when the number of events to be monitored exceeds the number of available PMCs. Event configurations are grouped so that metrics computed from multiple events remain valid. The groups are loaded onto the PMCs one at a time in round-robin fashion, and each group resides on the PMCs for a duration before the next group replaces it. While an event group is not resident on the PMCs (because other groups occupy them), the counts for its events are extrapolated from the last known measurement or sample. As a result, multiplexing usually introduces noise in the collected data.
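The round-robin scheme can be illustrated with a small simulation. This is a minimal sketch, not AMDuProfPcm's implementation: the function names, the group names, and the proportional time scaling used for extrapolation are all assumptions made for illustration (Linux perf scales multiplexed counts in a similar way, by enabled time over running time).

```python
from itertools import cycle


def multiplex(groups, slices):
    """Simulate round-robin residency: exactly one group per time slice.

    groups: dict mapping a (hypothetical) group name to its event rate
            in events per slice.
    slices: total number of time slices in the profile.
    Returns a dict mapping group name -> (raw_count, resident_slices).
    """
    raw = {name: 0 for name in groups}
    resident = {name: 0 for name in groups}
    order = cycle(groups)
    for _ in range(slices):
        name = next(order)          # next group takes over the counters
        raw[name] += groups[name]   # events are only seen while resident
        resident[name] += 1
    return {name: (raw[name], resident[name]) for name in groups}


def extrapolate(raw_count, resident_slices, total_slices):
    """Scale a raw count up to the full profile duration (assumed scheme)."""
    if resident_slices == 0:
        return 0.0
    return raw_count * total_slices / resident_slices


measured = multiplex({"group_a": 100, "group_b": 40, "group_c": 7}, slices=9)
estimates = {
    name: extrapolate(raw, res, 9) for name, (raw, res) in measured.items()
}
print(estimates)
```

With constant event rates the estimate matches the true total. When the rate changes while a group is not resident, the estimate drifts from the true value, which is the noise described above.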

4.3.2. Profiling Data Collection and Reporting

AMDuProfPcm collects profiling data in two modes:

4.3.2.1. Perf Mode (Only available on Linux)

PMCs are managed by the perf subsystem on Linux, and AMDuProfPcm uses the perf subsystem to access them. The perf subsystem enables sharing of the hardware resources (PMCs), which allows multiple instances of AMDuProfPcm and other profilers to run simultaneously. On Linux, this is the default mode of data collection.

Multiplexing interval in perf mode can be set by:

The availability of /sys/devices/amd_l3/perf_event_mux_interval_ms and /sys/devices/amd_df/perf_event_mux_interval_ms depends on the kernel version and the amd_uncore module.
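Because the sysfs file may be absent, a script can probe for it before relying on it. The following is a minimal sketch under stated assumptions: the helper name read_mux_interval_ms and its sysfs_root parameter are hypothetical, and only the amd_l3 path mentioned above is used.

```python
from pathlib import Path


def read_mux_interval_ms(pmu="amd_l3", sysfs_root="/sys/devices"):
    """Return the perf multiplexing interval in ms for a PMU.

    Returns None when the file is missing, unreadable, or not an integer,
    which can happen depending on the kernel version and loaded modules.
    """
    path = Path(sysfs_root) / pmu / "perf_event_mux_interval_ms"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


interval = read_mux_interval_ms()
if interval is None:
    print("amd_l3 multiplexing interval: not available on this system")
else:
    print(f"amd_l3 multiplexing interval: {interval} ms")
```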

4.3.2.2. MSR Mode

PMCs are accessed by directly configuring the PMC MSRs. Root privileges are required for this mode. Older Linux kernels might not support data collection with L3 PMCs and DF PMCs in Perf mode. In such scenarios, you can use MSR mode.

To use MSR mode based on the various operating systems:

4.3.3. Data Collection and Reporting

AMDuProfPcm collects and reports the data in the following modes:

AMDuProfPcm reports data in the following formats:

4.3.4. Process Tracking (Only available in Perf mode)

AMDuProfPcm can profile multiple processes, collecting data exclusively for the attached processes or threads (specified with -p <pid,..> or --tid <tid,..>) or for the launched application (specified on the command line). Other processes running concurrently in the background are excluded from data collection. This feature is available only in Perf mode, only when gathering Core PMC metrics, and only with system-wide aggregation (the -A system option).

Table 4.1 Feature Support Matrix for AMD Zen Family Processors

Feature/AMD Zen Family    Zen2    Zen3    Zen4    Zen5
HTML Report               No      Yes     Yes     Yes
Virtualization            No      Yes     Yes     Yes
CSV Report                Yes     Yes     Yes     Yes
Roofline                  Yes     Yes     Yes     Yes

4.4. Prerequisite(s)

4.4.1. Windows

To profile L3 and DF counters while Hyper-V is enabled, switch to system mode using the following command. After execution, reboot the system.

Note

Run these commands in PowerShell.

4.4.2. Linux

In Perf mode, AMD processors based on Zen4 and later require Linux kernel version 6.0 or later to collect DF metrics. For data collection, set /proc/sys/kernel/perf_event_paranoid to 0 or lower. For example: echo 0 > /proc/sys/kernel/perf_event_paranoid.

In MSR mode, AMDuProfPcm requires the msr kernel module to be loaded to access the MSRs from userspace. After it is loaded, either write permission for the /dev/cpu*/msr devices or root privileges are required.

The AMDPcmSetCapability.sh script is provided for PCM. After executing the script, you can use the --msr option without root privileges.

Note

Use a new terminal for profiling without root after setting the capability.

To load the msr module, use the command:

modprobe msr

To disable the NMI watchdog, write 0 to /proc/sys/kernel/nmi_watchdog. For example: echo 0 > /proc/sys/kernel/nmi_watchdog.

The roofline plotting script (AMDuProfModelling.py) requires Python 3.x and the Python module matplotlib.
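The Linux settings above can be checked from a script before profiling. The sketch below is illustrative only: the function name check_linux_prereqs and its root parameter are hypothetical, it inspects only the files named in this section, and it probes a single msr device (/dev/cpu/0/msr) as a proxy for the msr module being loaded.

```python
from pathlib import Path


def check_linux_prereqs(root="/"):
    """Report which of the documented Linux settings appear to be in place.

    root is a hypothetical parameter that lets the check run against a
    fake directory tree; on a real system leave it as "/".
    """
    base = Path(root)

    def read_int(relative):
        # Missing or unreadable files are reported as None, not as errors.
        try:
            return int((base / relative).read_text().strip())
        except (OSError, ValueError):
            return None

    paranoid = read_int("proc/sys/kernel/perf_event_paranoid")
    watchdog = read_int("proc/sys/kernel/nmi_watchdog")
    return {
        # Perf mode: perf_event_paranoid should be 0 or lower.
        "perf_event_paranoid_ok": paranoid is not None and paranoid <= 0,
        # NMI watchdog is disabled when the file contains 0.
        "nmi_watchdog_disabled": watchdog == 0,
        # MSR mode: the msr module exposes per-CPU msr devices.
        "msr_device_present": (base / "dev/cpu/0/msr").exists(),
    }


for name, ok in check_linux_prereqs().items():
    print(f"{name}: {'yes' if ok else 'no'}")
```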

4.4.3. FreeBSD

AMDuProfPcm uses the cpuctl module and requires either root privileges or read-write permissions for the /dev/cpuctl* devices.

4.4.3.1. Synopsis

AMDuProfPcm [<COMMANDS>] [<OPTIONS>] -- <PROGRAM> [<ARGS>]

4.4.3.2. Common Usages

$ AMDuProfPcm -h
# AMDuProfPcm -m ipc -c core=0 -d 10 -o /tmp/pmcdata.txt
# AMDuProfPcm -m memory -a -d 10 -o /tmp/memdata.txt -- /tmp/myapp.exe

4.5. AMDuProfPcm Command Line Options

Here is a list of all command line options.

Table 4.2 AMDuProfPcm Command Line Options

Option

Description

-h

Displays help information on the console/terminal.

-m <metric,...>

Metrics to report. The supported metric groups and the corresponding metrics are Platform, OS, and Hypervisor specific. Run ./AMDuProfPcm -h to get the list of supported metrics. The following metric groups are supported:

Core

  • ipc: reports metrics such as CEF, Utilization, CPI, and IPC

  • fp: reports GFLOPS

  • dc: advanced caching metrics such as DC refills by source (supported only on AMD Zen 3, AMD Zen 4, and AMD Zen 5 processors)

  • l1: L1 cache related metrics (DC access and IC Fetch miss ratio)

  • l2: L2D and L2I cache related access/hit/miss metrics

  • swpfdc: software prefetch data cache from various nodes and CCX (supported only on AMD Zen 3, AMD Zen 4, and AMD Zen 5 processors)

  • hwpfdc: hardware prefetch data cache from various nodes and CCX (supported only on AMD Zen 3, AMD Zen 4, and AMD Zen 5 processors)

  • pipeline_util: top-down metrics to visualize the bottlenecks in the CPU pipeline (supported only on AMD Zen 4 and AMD Zen 5 processors)

  • avx_imix: SSE/AVX instruction mix percentage

  • cache_miss: L1 Data and Instruction cache misses, L2 Data and Code Read misses

  • tlb: translation lookaside buffer misses

L3

  • l3: L3 cache metrics like L3 Access, L3 Miss, and Average Miss latency

DF Metrics

  • memory: approximate memory read and write bandwidths in GB/s for all the channels

  • pcie: PCIe bandwidth in GB/s (supported only on AMD Zen 2, AMD Zen 4, and AMD Zen 5 processors)

  • xgmi: approximate xGMI outbound databytes in GB/s for all the remote links

  • dma: DMA bandwidth in GB/s (supported only on AMD Zen 4 and AMD Zen 5 processors)

  • ccm_bw: Core complex inbound and outbound bandwidth in GB/s (supported only on AMD Zen 4 processors and later)

  • cxl: Compute Express Link bandwidth in GB/s (supported only on AMD Zen 5 processors and later)

-c <core|ccx|l3|numa|package>=<n>

Collect from the specified core, ccx, l3, numa, or package. The default is core=0.

If ccx or l3 is specified:

  • Core events will be collected from all the cores of this ccx.

  • l3 events will be collected for this ccx.

  • df events will be collected for the package this ccx belongs to.

If package is specified:

  • Core events will be collected for all the cores of this package.

  • l3 events will be collected for all the ccxs of this package.

  • df events will be collected for this package.

If numa is specified:

  • Core events will be collected for all the cores of this NUMA node.

  • l3 events will be collected for all the ccxs of this NUMA node.

  • df events will be collected for this NUMA node.

-a

Collect from all the cores.

Note

Options -c and -a cannot be used together.

-C

Prints the cumulative data at the end of the profile duration. Otherwise, all the samples are reported as timeseries data.

-A <system,package,ccd,ccx,core>

Prints aggregated metrics at various component levels. The following granularities are supported:

  • system: samples from all the cores in the system will be aggregated.

  • package: samples from all the cores in the package will be aggregated and reported for all the packages available in the system; applicable for multi-package systems.

  • ccd: samples from all the cores in CCD will be aggregated and reported for all the CCDs.

  • ccx: samples from all the cores in CCX will be aggregated and reported for all the CCXs.

    Note

    CCX is only applicable to Core and L3 metrics.

  • core: samples from all the logical cores on which samples are collected will be reported without aggregation.

    Note

    • Option -a should be used along with this option to collect samples from all the cores.

    • Comma separated list of components can be specified.

-I <ms>

Print the metrics at regular intervals. By default, it is enabled with an interval of 1000 ms.

-i <config file>

User-defined XML config file that specifies the Core, L3, and DF counters to monitor. Refer to the sample files in the <install-dir>/bin/Data/Config/ directory for the format.

Note

  • Options -i and -m cannot be used together.

  • If option -i is used, all the events mentioned in the user defined config file will be collected.

--start-delay <n>

Start profiling after the specified duration (n in milliseconds).

--wait-for-signal

Wait for SIGUSR1 to start profiling. Use SIGINT/SIGTERM to terminate.

Note

Supported only on Linux.

-u, --os-user <n>

Override events to collect for kernel/user-only mode.

Available options are:

  • 0 - Count no events.

  • 1 - Count user events.

  • 2 - Count OS events.

  • 3 - Count all events.

Note

Applicable only for Core metrics.

--tid <tid,..>

Profile existing threads. Thread IDs are separated by comma.

Note

Available only for Core metrics. Not applicable with --msr option.

-d <seconds>

Profile duration to run.

-t <multiplex interval in ms>

The interval at which the PMC count values are read. The minimum is 100 ms.

-o <output file>

The output file name. The output is in CSV format.

-P <n>

Sets the precision of the reported metrics. The default value is 2.

-q

Hide CPU topology section in the output report.

-r

Force resets the MSRs.

-k

Prefixes pkg in package level counters.

-s

Displays time stamp in the time series report.

-l

Lists the supported raw PMC events.

-z <pmc-event>

Prints the name, description, and available unit masks for the event.

-w <dir>

Specifies the working directory. The default will be the path of the launched application.

-n

Print cpu topology.

-v

Print version.

-X

Collect data using perf subsystem without root privileges. This option is enabled by default and will be deprecated soon.

Note

This is only supported on Linux.

-p <process ID>

Specify the target process ID to monitor.

Note

This is only supported with the option -X on Linux.

-f <util:<n>>

Filter the roofline data based on the utilization. For example, -f util:90 will filter all data points with less than 90% utilization.

Note

This is applicable only with the roofline command.

--html

Collect data and generate HTML report.

--msr

Collect data using MSR mode. Requires root privilege.

--percentile

Generate a custom percentile in the HTML report. The default is the 95th percentile.

--collect-xgmi

Collect xgmi data.

--collect-pcie

Collect pcie data.

--collect-power

Collect power data.

--per-core

Prints metrics at core level (no aggregation).

--per-die

Prints aggregated metrics at die level.

--per-socket

Prints aggregated metrics at socket level.

--report-roofline

Report roofline data under the ‘profile’ command.

--no-aggr

Prints non-aggregated metrics data:

  • Core metrics will be reported at core level.

  • l3 metrics will be reported at ccx level.

  • df metrics will be printed at package/aid level.

--collect-clk

Collect clock data.

--collect-guest

Count only the guest events.

Note

This is applicable only in the host when hypervisor is enabled.

--collect-host

Count only the host events. (Default behavior is to collect host and guest data).

Note

This is applicable only in the host when hypervisor is enabled.

--read-smbios

Read memory speed and total memory channels from SMBIOS.

Note

This is applicable only for roofline command.

--show-used-ccx

Show only the CCXs where the target application ran in the heatmap’s L3 Cache section in the html report.

Note

Not applicable with --msr option.

-O <output dir path>

Path to create the output directory.
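Several options in Table 4.2 are mutually exclusive (-c with -a, and -i with -m). When AMDuProfPcm is launched from a script, these rules can be checked before the tool is started. The following is a minimal sketch: the helper build_pcm_args and its parameter names are hypothetical, it covers only a handful of the documented options, and it only assembles an argument list without running the tool.

```python
def build_pcm_args(metrics=None, config_file=None, core=None, all_cores=False,
                   duration=None, output_dir=None, program=None):
    """Assemble an AMDuProfPcm argument list from a few documented options.

    Raises ValueError for the option combinations that Table 4.2 marks
    as not usable together.
    """
    if metrics and config_file:
        raise ValueError("options -i and -m cannot be used together")
    if core is not None and all_cores:
        raise ValueError("options -c and -a cannot be used together")

    args = ["AMDuProfPcm"]
    if metrics:
        args += ["-m", ",".join(metrics)]   # -m <metric,...>
    if config_file:
        args += ["-i", config_file]         # -i <config file>
    if core is not None:
        args += ["-c", f"core={core}"]      # -c core=<n>
    if all_cores:
        args.append("-a")
    if duration is not None:
        args += ["-d", str(duration)]       # -d <seconds>
    if output_dir:
        args += ["-O", output_dir]          # -O <output dir path>
    if program:
        args += ["--", *program]            # -- <PROGRAM> [<ARGS>]
    return args


print(" ".join(build_pcm_args(metrics=["ipc", "l2"], all_cores=True,
                              duration=30, output_dir="/tmp")))
```

The returned list can be passed to subprocess.run on a system where AMDuProfPcm is on the PATH.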

4.5.1. Commands

Here is a list of all the commands.

Table 4.3 AMDuProfPcm Commands

Command     Description
hreport     Create html report from csv report.
compare     Compare two collected sessions and create an html comparison report.
roofline    Collects data required for generating roofline model.
top         Collect and report real time timeseries data in a tabular format.
profile     Collect and generate report for timeseries, cumulative and roofline data together.

4.5.2. Examples

4.5.2.1. Linux and FreeBSD

Table 4.4 Linux and FreeBSD Examples

Description

Command

Collect IPC data from core 0 for the duration of 60 seconds

# ./AMDuProfPcm -m ipc -c core=0 -d 60 -o /tmp/pcmdata.csv

Collect IPC/L3 metrics for CCX=0 for the duration of 60 seconds

# ./AMDuProfPcm -m ipc,l3 -c ccx=0 -d 60 -o /tmp/pcmdata.csv

Collect only the memory bandwidth across all the UMCs for the duration of 60 seconds and save the output in /tmp/pcmdata.csv file

# ./AMDuProfPcm -m memory -a -d 60 -o /tmp/pcmdata.csv

Collect IPC data for 60 seconds from all the cores

# ./AMDuProfPcm -m ipc -a -d 60 -o /tmp/pcmdata.csv

Collect IPC data from core 0 and run the program in core 0

# ./AMDuProfPcm -m ipc -c core=0 -o /tmp/pcmdata.csv -- /usr/bin/taskset -c 0 <application>

Collect IPC data from cores 0-7 and run the application on cores 0-3

# ./AMDuProfPcm -m ipc -c core=0-7 -o /tmp/pcmdata.csv -- /usr/bin/taskset -c 0-3 <application>

Collect IPC and L2 data from core 0, report the cumulative data (not timeseries), and run the program on core 0

# ./AMDuProfPcm -m ipc,l2 -c core=0 -o /tmp/pcmdata.csv -C -- /usr/bin/taskset -c 0 <application>

List the supported raw Core PMC events

# ./AMDuProfPcm -l

Print the name, description, and the available unit masks for the specified event

# ./AMDuProfPcm -z pmcx03

Compare two sessions

./AMDuProfPcm compare <output dir path 1>,<output dir path 2>

Note

Not applicable with --msr option.

Collect and generate roofline HTML report

./AMDuProfPcm roofline -O /tmp/ <application>

Collect default set of metrics, roofline and power data from all the cores of the system for 60 seconds and generate timeseries, cumulative, and roofline CSV and HTML reports

./AMDuProfPcm profile --report-roofline --collect-power -O /tmp -d 60

Collect ipc metrics from all the cores in the system for 30 seconds and generate timeseries and cumulative CSV and HTML reports in the output directory

./AMDuProfPcm profile -m ipc -O /tmp -a -d 30

4.5.2.2. Windows Commands

Core Metrics

Table 4.5 Core Metrics: Tasks and Related Commands

Task Description

Command

Collect IPC and L2 data from all the cores and report the aggregated data at the system and package level

C:\> AMDuProfPcm.exe -m ipc,l2 -a -O C:\tmp -d 30 -A system,package

Collect IPC and L2 data from all the cores and report the cumulative data (not timeseries)

C:\> AMDuProfPcm.exe -m ipc,l2 -a -O C:\tmp -C -d 30

Collect IPC and L2 data from all the cores and report the cumulative data (not timeseries), aggregated at the system and package level

C:\> AMDuProfPcm.exe -m ipc,l2 -a -O C:\tmp -C -A system,package -d 30

Collect IPC and L2 data from all the cores in CCX=0 and report the cumulative data (not timeseries)

C:\> AMDuProfPcm.exe -m ipc,l2 -c ccx=0 -O C:\tmp -C -d 30

Collect IPC data for 30 seconds from all the cores in the system

C:\> AMDuProfPcm.exe -m ipc -a -d 30 -O C:\tmp

Collect IPC data from core 0 and run the program

C:\> AMDuProfPcm.exe -m ipc -c core=0 -O C:\tmp myapp.exe

Collect IPC data from core 0 for the duration of 30 seconds

C:\> AMDuProfPcm.exe -m ipc -c core=0 -d 30 -O C:\tmp

Collect IPC/L2 metrics for all the cores in CCX=0 for the duration of 30 seconds

C:\> AMDuProfPcm.exe -m ipc,l2 -c ccx=0 -d 30 -O C:\tmp

Get the list of supported metrics

C:\> AMDuProfPcm.exe -h

L3 Metrics

Table 4.6 L3 Metrics: Tasks and Related Commands

Task Description

Command

Collect L3 data from ccx=0 for the duration of 30 seconds

C:\> AMDuProfPcm.exe -m l3 -c ccx=0 -d 30 -O C:\tmp

Collect L3 data from all the CCXs and report for the duration of 30 seconds

C:\> AMDuProfPcm.exe -m l3 -a -d 30 -O C:\tmp

Collect L3 data from all the CCXs and aggregate at system and package level and report for the duration of 30 seconds

C:\> AMDuProfPcm.exe -m l3 -a -d 30 -A system,package -O C:\tmp

Collect L3 data from all the CCXs and aggregate at system and package level and report for the duration of 30 seconds; also report for the individual CCXs

C:\> AMDuProfPcm.exe -m l3 -a -d 30 -A system,package,ccx -O C:\tmp

Collect L3 data from all the CCXs for the duration of 30 seconds and report the cumulative data (no timeseries data)

C:\> AMDuProfPcm.exe -m l3 -a -d 30 -C -O C:\tmp

Collect L3 data from all the CCXs and aggregate at system and package level and report cumulative data (no timeseries data)

C:\> AMDuProfPcm.exe -m l3 -a -d 30 -A system,package -C -O C:\tmp

Collect IPC data from core 0 for the duration of 30 seconds

C:\> AMDuProfPcm.exe -m ipc -c core=0 -d 30 -O C:\tmp

Memory Bandwidth

Table 4.7 Memory Bandwidth: Tasks and Related Commands

Task Description

Command

Report memory bandwidth for all the memory channels for the duration of 60 seconds and save the output in the C:\tmp directory

C:\> AMDuProfPcm.exe -m memory -a -d 60 -O C:\tmp

Report total memory bandwidth aggregated at the system level for the duration of 60 seconds and save the output in the C:\tmp directory

C:\> AMDuProfPcm.exe -m memory -a -d 60 -O C:\tmp -A system

Report total memory bandwidth aggregated at the system level and also report for every memory channel

C:\> AMDuProfPcm.exe -m memory -a -d 60 -O C:\tmp -A system,package

Report total memory bandwidth aggregated at the system level and also report for all the available memory channels, as cumulative metric values instead of timeseries data

C:\> AMDuProfPcm.exe -m memory -a -d 60 -O C:\tmp -C -A system,package

Raw Event Count Dump

Table 4.8 Raw Event Count Dump: Tasks and Related Commands

Task Description

Command

Monitor events from core 0 and dump the raw event counts for every sample in timeseries manner, no metrics report will be generated

C:\> AMDuProfPcm.exe -m ipc -d 60 c:\tmp\pcmdata_dump.csv

Monitor events from all the cores and dump the raw event counts for every sample in timeseries manner, no metrics report will be generated

C:\> AMDuProfPcm.exe -m ipc -a -d 60 c:\tmp\pcmdata_dump.csv

Custom Config File

Config files are available for supported processors at <uprof-install-dir>\bin\Data\Config\. The default config file name for a specific processor, identified by family and model number, has the format <Family>_<Model Range>.conf. For example, 0x19_0x1.conf is used for all processors with family 0x19 and model numbers from 0x10 to 0x1f. Config files with the RL_ prefix are used for the roofline command.

These files can be copied and modified to add user-specific events of interest and the formulas used to compute metrics. All the metrics defined in the file will be monitored and reported.

C:\> AMDuProfPcm.exe -i <custom config file> -a -d 60 -O C:\tmp
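The default config file name can be derived from the processor's family and model numbers. The sketch below is an illustration only: the helper name default_config_name is hypothetical, and it assumes the model range is the model number with its low four bits dropped, which is inferred from the single example above (0x19_0x1.conf covering models 0x10 to 0x1f) and may not hold for every processor.

```python
def default_config_name(family, model):
    """Build '<Family>_<Model Range>.conf' from family and model numbers.

    Assumption: the model range is the model number shifted right by
    4 bits, so models 0x10 through 0x1f all map to range 0x1.
    """
    return f"0x{family:x}_0x{model >> 4:x}.conf"


print(default_config_name(0x19, 0x11))
```

Check the files actually present in the Config directory before relying on a derived name.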

Miscellaneous

Table 4.9 Miscellaneous: Tasks and Related Commands

Task Description

Command

List the supported raw Core PMC events

C:\> AMDuProfPcm.exe -l

Print the name, description, and the available unit masks for the specified event

C:\> AMDuProfPcm.exe -z pmcx03

4.6. Metrics

The performance metrics for AMD EPYC™ Zen 2, Zen 3, Zen 4, and Zen 5 core architecture processors are listed here.

4.6.1. Performance Metrics for AMD EPYC™ Zen 2 Core Architecture Processors

Table 4.10 Performance Metrics for AMD EPYC™ Zen 2

Metric Group

Metric Details

ipc

  • Utilization (%): Percentage of time the core was running, that is non-idle time.

  • Eff Freq: Core Effective Frequency (CEF) without halted cycles over the sampling period, reported in GHz. The metric is based on CEF = (APERF / TSC) * P0Freq. APERF is incremented in proportion to the actual number of core cycles while the core is in C0 state.

  • IPC: Instructions Per Cycle (IPC) is the average number of instructions retired per CPU cycle. This is measured using Core PMC events PMCx0C0 [Retired Instructions] and PMCx076 [CPU Clocks not Halted]. These PMC events are counted in both OS and User mode.

  • CPI: Cycles Per Instruction (CPI) is the multiplicative inverse of IPC metric. This is one of the basic performance metrics indicating how cache misses, branch mis-predictions, memory latencies, and other bottlenecks are affecting the execution of an application. A lower CPI value is better.

  • Branch Mis-prediction Ratio: The ratio between mis-predicted branches and retired branch instructions.

fp

  • Retired SSE/AVX Flops (GFLOPs): The number of retired SSE/AVX FLOPs.

  • Mixed SSE/AVX Stalls: This metric is in per thousand instructions (PTI).

l1

  • IC(32B) Fetch Miss Ratio: Instruction cache fetch miss ratio.

  • DC Access: All data cache (DC) accesses. This metric is in PTI.

l2

  • L2 Access: All the L2 cache accesses. This metric is in PTI.

  • L2 Access from IC Miss: The L2 cache accesses from IC miss. This metric is in PTI.

  • L2 Access from DC Miss: The L2 cache accesses from DC miss. This metric is in PTI.

  • L2 Access from HWPF: The L2 cache accesses from L2 hardware pre-fetching. This metric is in PTI.

  • L2 Miss: All the L2 cache misses. This metric is in PTI.

  • L2 Miss from IC Miss: The L2 cache misses from IC miss. This metric is in PTI.

  • L2 Miss from DC Miss: The L2 cache misses from DC miss. This metric is in PTI.

  • L2 Miss from HWPF: The L2 cache misses from L2 hardware pre-fetching. This metric is in PTI.

  • L2 Hit: All the L2 cache hits. This metric is in PTI.

  • L2 Hit from IC Miss: The L2 cache hits from IC miss. This metric is in PTI.

  • L2 Hit from DC Miss: The L2 cache hits from DC miss. This metric is in PTI.

  • L2 Hit from HWPF: The L2 cache hits from L2 hardware pre-fetching. This metric is in PTI.

tlb

  • L1 ITLB Miss: The number of instruction fetches that miss in the L1 Instruction Translation Lookaside Buffer (ITLB) but hit in the L2-ITLB, plus the ITLB reloads originating from the page table walker. Table walk requests are made for L1-ITLB misses and L2-ITLB misses. This metric is in PTI.

  • L2 ITLB Miss: The number of ITLB reloads from page table walker due to L1-ITLB and L2-ITLB misses. This metric is in PTI.

  • L1 DTLB Miss: The number of L1 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops. This event counts both L2-DTLB hit and L2-DTLB miss. This metric is in PTI.

  • L2 DTLB Miss: The number of L2 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops. This metric is in PTI.

l3

  • L3 Access: The count of L3 cache accesses.

  • L3 Miss: The L3 cache miss. This metric is in PTI.

  • L3 Miss (%): The L3 cache miss percentage.

  • Ave L3 Miss Latency: Average L3 miss latency in core cycles.

Memory

Memory Read and Write bandwidth in GB/s for all the channels

  • Mem Ch-A RdBw (GB/s)

  • Mem Ch-A WrBw (GB/s)

xgmi

Approximate xGMI outbound data bytes in GB/s for all the remote links

  • xGMI1 BW (GB/s)

  • xGMI2 BW (GB/s)

  • xGMI3 BW (GB/s)

pcie

Approximate PCIe bandwidth in GB/s

  • PCIe0 (GB/s)

  • PCIe1 (GB/s)

  • PCIe2 (GB/s)

  • PCIe3 (GB/s)
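The derived metrics in the table follow from a few raw counts. The sketch below works through the formulas stated above (CEF, IPC, CPI) and the per-thousand-instructions (PTI) normalization. The function names and the sample counter values are made up for illustration; they are not output from AMDuProfPcm.

```python
def effective_freq_ghz(aperf_delta, tsc_delta, p0_freq_ghz):
    """CEF = (APERF / TSC) * P0Freq, using counter deltas over one sample."""
    return aperf_delta / tsc_delta * p0_freq_ghz


def ipc(retired_instructions, unhalted_cycles):
    """Instructions per cycle: PMCx0C0 count divided by PMCx076 count."""
    return retired_instructions / unhalted_cycles


def cpi(retired_instructions, unhalted_cycles):
    """Cycles per instruction: the multiplicative inverse of IPC."""
    return unhalted_cycles / retired_instructions


def per_thousand_instructions(event_count, retired_instructions):
    """Normalize an event count to events per thousand instructions (PTI)."""
    return event_count * 1000 / retired_instructions


# Hypothetical sample values, chosen only to keep the arithmetic simple.
print(effective_freq_ghz(3_000_000_000, 2_000_000_000, 2.0))  # 3.0 GHz
print(ipc(4_000_000, 2_000_000))                              # 2.0
print(cpi(4_000_000, 2_000_000))                              # 0.5
print(per_thousand_instructions(5_000, 4_000_000))            # 1.25
```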

4.6.2. Performance Metrics for AMD EPYC™ Zen 3 Core Architecture Processors

Table 4.11 Performance Metrics for AMD EPYC™ Zen 3

Metric Group

Metric Details

ipc

  • Utilization (%): Percentage of time the core was running, that is non-idle time.

  • Eff Freq: Core Effective Frequency (CEF) without halted cycles over the sampling period, reported in GHz. The metric is based on CEF = (APERF / TSC) * P0Freq. APERF is incremented in proportion to the actual number of core cycles while the core is in C0 state.

  • IPC: Instructions Per Cycle (IPC) is the average number of instructions retired per CPU cycle. This is measured using Core PMC events PMCx0C0 [Retired Instructions] and PMCx076 [CPU Clocks not Halted]. These PMC events are counted in both OS and User mode.

  • CPI: Cycles Per Instruction (CPI) is the multiplicative inverse of IPC metric. This is one of the basic performance metrics indicating how cache misses, branch mis-predictions, memory latencies, and other bottlenecks are affecting the execution of an application. A lower CPI value is better.

  • Branch Mis-prediction Ratio: The ratio between mis-predicted branches and retired branch instructions.

fp

  • Retired SSE/AVX Flops (GFLOPs): The number of retired SSE/AVX FLOPs.

  • Mixed SSE/AVX Stalls: This metric is in per thousand instructions (PTI).

l1

  • IC(32B) Fetch Miss Ratio: Instruction cache fetch miss ratio.

  • Op Cache (64B) Fetch Miss Ratio: Operation cache fetch miss ratio.

  • IC Access: All instruction cache accesses. This metric is in PTI.

  • IC Miss: The instruction cache miss. This metric is in PTI.

  • DC Access: All data cache (DC) accesses. This metric is in PTI.

l2

  • L2 Access: All the L2 cache accesses. This metric is in PTI.

  • L2 Access from IC Miss: The L2 cache accesses from IC miss. This metric is in PTI.

  • L2 Access from DC Miss: The L2 cache accesses from DC miss. This metric is in PTI.

  • L2 Access from HWPF: The L2 cache accesses from L2 hardware pre-fetching. This metric is in PTI.

  • L2 Miss: All the L2 cache misses. This metric is in PTI.

  • L2 Miss from IC Miss: The L2 cache misses from IC miss. This metric is in PTI.

  • L2 Miss from DC Miss: The L2 cache misses from DC miss. This metric is in PTI.

  • L2 Miss from HWPF: The L2 cache misses from L2 hardware pre-fetching. This metric is in PTI.

  • L2 Hit: All the L2 cache hits. This metric is in PTI.

  • L2 Hit from IC Miss: The L2 cache hits from IC miss. This metric is in PTI.

  • L2 Hit from DC Miss: The L2 cache hits from DC miss. This metric is in PTI.

  • L2 Hit from HWPF: The L2 cache hits from L2 hardware pre-fetching. This metric is in PTI.

tlb

  • L1 ITLB Miss: The number of instruction fetches that miss in the L1 Instruction Translation Lookaside Buffer (ITLB) but hit in the L2-ITLB, plus the ITLB reloads originating from the page table walker. Table walk requests are made for L1-ITLB misses and L2-ITLB misses. This metric is in PTI.

  • L2 ITLB Miss: The number of ITLB reloads from page table walker due to L1-ITLB and L2-ITLB misses. This metric is in PTI.

  • L1 DTLB Miss: The number of L1 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops. This event counts both L2-DTLB hit and L2-DTLB miss. This metric is in PTI.

  • L2 DTLB Miss: The number of L2 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops. This metric is in PTI.

  • All TLBs Flushed: All the TLBs flushed. This metric is in PTI.

dc

  • DC Fills from Same CCX: The number of DC fills from local L2 cache to the core or different L2 cache in the same CCX or L3 cache that belongs to the CCX. This metric is in PTI.

  • DC Fills from different CCX in same node: The number of DC fills from cache of different CCX in the same package (node). This metric is in PTI.

  • DC Fills from Local Memory: The number of DC fills from DRAM or IO connected in the same package (node). This metric is in PTI.

  • DC Fills from Remote CCX Cache: The number of DC fills from cache of CCX in the different package (node). This metric is in PTI.

  • DC Fills from Remote Memory: The number of DC fills from DRAM or IO connected in the different package (node). This metric is in PTI.

  • All DC Fills: The total number of DC fills from all the data sources. This metric is in PTI.

l3

  • L3 Access: The count of L3 cache accesses.

  • L3 Miss: The L3 cache miss. This metric is in PTI.

  • L3 Miss (%): The L3 cache miss percentage.

  • Ave L3 Miss Latency: Average L3 miss latency in core cycles.

Memory

Memory Read and Write bandwidth in GB/s for all the channels

  • Mem Ch-A RdBw (GB/s)

  • Mem Ch-A WrBw (GB/s)

xgmi

Approximate xGMI outbound data bytes in GB/s for all the remote links

  • xGMI1 BW (GB/s)

  • xGMI2 BW (GB/s)

  • xGMI3 BW (GB/s)

swpfdc

Software prefetch data cache from various nodes and CCX

  • SwPfDC Fills from DRAM or IO connected in remote node (pti)

  • SwPfDC Fills from CCX Cache in remote node (pti)

  • SwPfDC Fills from DRAM or IO connected in local node (pti)

  • SwPfDC Fills from Cache of another CCX in local node (pti)

  • SwPfDC Fills from L3 or different L2 in same CCX (pti)

  • SwPfDC Fills from L2 (pti)

hwpfdc

Hardware prefetch data cache from various nodes and CCX

  • HwPfDC Fills from DRAM or IO connected in remote node (pti)

  • HwPfDC Fills from CCX Cache in remote node (pti)

  • HwPfDC Fills from DRAM or IO connected in local node (pti)

  • HwPfDC Fills from Cache of another CCX in local node (pti)

  • HwPfDC Fills from L3 or different L2 in same CCX (pti)

  • HwPfDC Fills From L2 (pti)

4.6.3. Performance Metrics for AMD EPYC™ Zen 4 and later Core Architecture Processors

Table 4.12 Performance Metrics for AMD EPYC™ Zen 4 and later Versions

Metric Group

Metric Details

ipc

  • Utilization (%): Percentage of time the core was running, that is non-idle time.

  • System Time (%): Percentage of time in kernel mode.

  • User Time (%): Percentage of time in user mode.

  • System instruction (%): Percentage of retired instructions in kernel mode.

  • User instructions (%): Percentage of retired instructions in user mode.

  • Eff Freq: Core Effective Frequency (CEF) without halted cycles over the sampling period, reported in GHz. The metric is based on CEF = (APERF / TSC) * P0Freq. APERF is incremented in proportion to the actual number of core cycles while the core is in C0 state.

  • IPC (Sys + User): Instructions Per Cycle (IPC) is the average number of instructions retired per CPU cycle. This is measured using Core PMC events PMCx0C0 [Retired Instructions] and PMCx076 [CPU Clocks not Halted]. These PMC events are counted in both OS and User mode.

  • IPC (Sys): Instructions in kernel mode per cpu cycles in kernel mode. IPC of kernel mode.

  • IPC (User): Instructions in user mode per cpu cycles in user mode. IPC of user mode.

  • CPI (Sys + User): Cycles Per Instruction (CPI) is the multiplicative inverse of IPC metric. This is one of the basic performance metrics indicating how cache misses, branch mis-predictions, memory latencies, and other bottlenecks are affecting the execution of an application. A lower CPI value is better.

  • CPI (Sys): Cycles per instruction for kernel mode.

  • CPI (User): Cycles per instruction for user mode.

  • Giga Instructions Per Sec: The number of retired giga instructions per second.

  • Locked Instructions (pti): The number of retired lock instructions in PTI.

  • Retired Branches (pti): The number of retired branch instructions in PTI.

  • Retired Branches Mispredicted (pti): The number of retired mis-predicted branch instructions in PTI.

fp

  • Retired SSE/AVX Flops (GFLOPs): The number of retired SSE/AVX FLOPs.

  • FP Dispatch Faults (PTI): The number of floating point instruction dispatch faults. This metric is in per thousand instructions (PTI).

avx_imix

  • Packed 512-bit FP Ops Retired (pto): The number of retired floating-point operations executed using packed 512-bit AVX-512 instructions. Each instruction operates on multiple elements in a 512-bit ZMM register (e.g., 8 doubles or 16 floats).

  • Packed 512-bit FP Ops Retired (%): Percentage of retired packed 512-bit floating-point operations out of all retired floating-point operations. Indicates the proportion of AVX-512 utilization.

  • Packed 256-bit FP Ops Retired (pto): The number of retired floating-point operations executed using packed 256-bit AVX instructions (e.g., AVX/AVX2). Each instruction operates on multiple elements in a 256-bit YMM register (e.g., 4 doubles or 8 floats).

  • Packed 256-bit FP Ops Retired (%): Percentage of retired packed 256-bit floating-point operations out of retired floating-point operations.

  • Packed 128-bit FP Ops Retired (pto): The number of retired floating-point operations executed using packed 128-bit SSE instructions. Each instruction operates on multiple elements in a 128-bit XMM register (e.g., 2 doubles or 4 floats).

  • Packed 128-bit FP Ops Retired (%): Percentage of retired packed 128-bit floating-point operations out of retired floating-point operations.

  • Scalar/MMX/x87 FP Ops Retired (%): Percentage of retired scalar floating-point operations, MMX instructions, and x87 floating-point instructions out of retired floating-point operations.

  • Scalar/MMX/x87-bit FP Ops Retired (pto): The number of retired scalar floating-point operations, MMX instructions, and x87 floating-point instructions. These are non-SIMD operations, typically less efficient than packed operations.

  • SSE/AVX Instructions Retired (pti): The number of retired SSE/AVX instructions per thousand instructions.

  • MMX Instructions Retired (pti): The number of retired MMX instructions per thousand instructions.

  • x87 Instructions Retired (pti): The number of retired x87 instructions per thousand instructions.
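
To make the register-width examples and the percentage metrics concrete, the following sketch computes the number of elements per register and the share of each operation class. The function names and the sample counts are hypothetical. It assumes that the four classes listed above together account for all retired floating-point operations.

```python
def lanes_per_register(register_bits, element_bits):
    """Number of elements a packed register holds, e.g. 512 / 64 = 8 doubles."""
    return register_bits // element_bits


def fp_width_mix(ops_by_width):
    """Return each class's share (%) of all retired floating-point operations."""
    total = sum(ops_by_width.values())
    if total == 0:
        return {name: 0.0 for name in ops_by_width}
    return {name: 100.0 * count / total for name, count in ops_by_width.items()}


# Hypothetical retired-operation counts per class (not measured data).
sample = {
    "packed_512": 600,
    "packed_256": 200,
    "packed_128": 100,
    "scalar_mmx_x87": 100,
}

print(lanes_per_register(512, 64), "doubles per ZMM register")
for name, pct in fp_width_mix(sample).items():
    print(f"{name}: {pct:.1f}%")
```

A high scalar share alongside a low packed share suggests that the workload is not being vectorized.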

l1

  • IC(32B) Fetch Miss Ratio: Instruction cache fetch miss ratio.

  • Op Cache Fetch Miss Ratio: Operation cache (64B) fetch miss ratio.

  • IC Access (PTI): All instruction cache accesses. This metric is in PTI.

  • IC Miss (PTI): All instruction cache misses. This metric is in PTI.

  • DC Access (PTI): All data cache (DC) accesses. This metric is in PTI.

dc

  • All Demand DC Fills (pti): Total number of data cache fills triggered by demand loads across all sources.

  • Demand DC Fills From Local L2 (pti): Data cache fills satisfied from the local L2 cache.

  • Demand DC Fills From Local L3 or different L2 in same CCX (pti): Fills served by the local L3 cache or another L2 within the same Core Complex (CCX).

  • Demand DC Fills From another CCX in same node (pti): Fills sourced from a different CCX within the same processor node.

  • Demand DC Fills From Local Memory or I/O (pti): Fills coming directly from local DRAM or I/O devices.

  • Demand DC Fills From another CCX in remote node (pti): Fills obtained from a CCX located in a remote node.

  • Demand DC Fills From Remote memory or I/O (pti): Fills fetched from remote DRAM or I/O resources.

  • All DC Fills (pti): Total number of data cache fills in PTI.

  • DC Fills From Same CCX (pti): Total number of data cache fills from the same CCX in PTI.

  • DC Fills From different CCX in same node (pti): Total number of data cache fills from a different CCX in the same NUMA node in PTI.

  • DC Fills From Local Memory (pti): Total number of data cache fills from local memory in PTI.

  • DC Fills From Remote CCX Cache (pti): Total number of data cache fills from a CCX in the remote NUMA node in PTI.

  • DC Fills From Remote Memory (pti): Total number of data cache fills from remote memory in PTI.

  • Remote DRAM Reads %: Percentage of data cache fills from remote memory out of the total number of data cache fills.

l2

  • L2 Access: All the L2 cache accesses. This metric is in PTI.

  • L2 Access from IC Miss: The L2 cache accesses from IC miss. This metric is in PTI.

  • L2 Access from DC Miss: The L2 cache accesses from DC miss. This metric is in PTI.

  • L2 Access from HWPF: The L2 cache accesses from L2 hardware pre-fetching. This metric is in PTI.

  • L2 Miss: All the L2 cache misses. This metric is in PTI.

  • L2 Miss from IC Miss: The L2 cache misses from IC miss. This metric is in PTI.

  • L2 Miss from DC Miss: The L2 cache misses from DC miss. This metric is in PTI.

  • L2 Miss from HWPF: The L2 cache misses from L2 hardware pre-fetching. This metric is in PTI.

  • L2 Hit: All the L2 cache hits. This metric is in PTI.

  • L2 Hit from IC Miss: The L2 cache hits from IC miss. This metric is in PTI.

  • L2 Hit from DC Miss: The L2 cache hits from DC miss. This metric is in PTI.

  • L2 Hit from HWPF: The L2 cache hits from L2 hardware pre-fetching. This metric is in PTI.

tlb

  • L1 ITLB Miss: The number of instruction fetches that miss in the L1 Instruction Translation Lookaside Buffer (ITLB) but hit in the L2-ITLB, plus the ITLB reloads originating from the page table walker. Table walk requests are made for L1-ITLB misses and L2-ITLB misses. This metric is in PTI.

  • L2 ITLB Miss: The number of ITLB reloads from page table walker due to L1-ITLB and L2-ITLB misses. This metric is in PTI.

  • L1 DTLB Miss: The number of L1 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops. This event counts both L2-DTLB hit and L2-DTLB miss. This metric is in PTI.

  • L2 DTLB Miss: The number of L2 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops. This metric is in PTI.

  • All TLBs Flushed: All the TLBs flushed. This metric is in PTI.

l3

  • L3 Access: The count of L3 cache accesses.

  • L3 Hit %: Percentage of L3 Hit.

  • L3 Miss: The L3 cache miss. This metric is in PTI.

  • L3 Miss (%): The percentage of L3 cache accesses that miss.

  • Ave L3 Miss Latency: Average L3 miss latency in core cycles.

  • L3 Access (pti): The number of L3 cache accesses per thousand instructions.

  • L3 Miss (pti): The number of L3 cache misses per thousand instructions.

  • L3 Miss Latency From Local Memory or I/O (%): Percentage of L3 cache misses served by local DRAM or I/O, measured by latency contribution.

  • L3 Miss Latency From Remote Memory or I/O (%): Percentage of L3 cache miss latency caused by remote DRAM or I/O accesses.

  • L3 Miss Latency From another CCX in same node (%): Latency share from L3 misses resolved by a different CCX within the same node.

  • L3 Miss Latency From another CCX in remote node (%): Latency share from L3 misses resolved by a CCX in a remote node.

  • L3 Miss Latency From Local Extension Memory (CXL) (%): Percentage of latency from L3 misses served by local CXL-attached memory.

  • L3 Miss Latency From Remote Extension Memory (CXL) (%): Percentage of latency from L3 misses served by remote CXL-attached memory.

Memory

  • Total Memory Bw (GB/s): Total read and write memory bandwidth.

DRAM read and write data bytes for a local processor

  • Local DRAM Read Data Bytes (GB/s)

  • Local DRAM Write Data Bytes (GB/s)

DRAM read and write data bytes for a remote processor

  • Remote DRAM Read Data Bytes (GB/s)

  • Remote DRAM Write Data Bytes (GB/s)

Memory Read and Write bandwidth in GB/s for all the channels

  • Mem Ch-A RdBw (GB/s)

  • Mem Ch-A WrBw (GB/s)

ccm_bw

  • Local Inbound Read Data Bytes (GB/s): Local inbound data bytes to the CPU, for example, read data.

  • Local Outbound Write Data Bytes (GB/s): Local outbound data bytes from the CPU, for example, write data.

  • Remote Inbound Read Data Bytes (GB/s): Remote socket inbound data bytes to the CPU, for example, read data.

  • Remote Outbound Write Data Bytes (GB/s): Remote socket outbound data bytes from the CPU, for example, write data.

Reports data traffic to CCM at interfaces 0 and 1

  • Local Socket Inbound Data to CPU Moderator (CCM) 0 at Interface 0 (GB/s)

  • Local Socket Outbound Data from CPU Moderator (CCM) 0 at Interface 0 (GB/s)

  • Remote Socket Inbound Data to CPU Moderator (CCM) 0 at Interface 0 (GB/s)

  • Remote Socket Outbound Data from CPU Moderator (CCM) 0 at Interface 0 (GB/s)

xgmi

  • xGMI Outbound Data Bytes (GB/s): Total outbound data bytes in gigabytes per second.

dma (not available in AMD Zen 1, AMD Zen 2, and AMD Zen 3 processors)

  • Total Upstream DMA Read Write Data Bytes (GB/s): Total upstream DMA including read and write.

  • Local Upstream DMA Read Data Bytes (GB/s): Local upstream DMA read data bytes.

  • Local Upstream DMA Write Data Bytes (GB/s): Local upstream DMA write data bytes.

  • Remote Upstream DMA Read Data Bytes (GB/s): Remote socket upstream DMA read data bytes.

  • Remote Upstream DMA Write Data Bytes (GB/s): Remote socket upstream DMA write data bytes.

pcie

Approximate PCIe bandwidth in GB/s

  • PCIe0 (GB/s)

  • PCIe1 (GB/s)

  • PCIe2 (GB/s)

  • PCIe3 (GB/s)

PCIe bandwidth for read and write transactions, for the local and remote nodes, in total and per quad

  • Total PCIE Bandwidth (GB/s)

  • Total PCIE Rd Bandwidth (GB/s)

  • Total PCIE Wr Bandwidth (GB/s)

  • Total PCIE Bandwidth Local (GB/s)

  • Total PCIE Bandwidth Remote (GB/s)

  • Total PCIE Rd Bandwidth Local (GB/s)

  • Total PCIE Wr Bandwidth Local (GB/s)

  • Total PCIE Rd Bandwidth Remote (GB/s)

  • Total PCIE Wr Bandwidth Remote (GB/s)

  • Quad 0 PCIE Rd Bandwidth Local (GB/s)

  • Quad 0 PCIE Wr Bandwidth Local (GB/s)

  • Quad 0 PCIE Rd Bandwidth Remote (GB/s)

  • Quad 0 PCIE Wr Bandwidth Remote (GB/s)

  • Quad 1 PCIE Rd Bandwidth Local (GB/s)

  • Quad 1 PCIE Wr Bandwidth Local (GB/s)

  • Quad 1 PCIE Rd Bandwidth Remote (GB/s)

  • Quad 1 PCIE Wr Bandwidth Remote (GB/s)

  • Quad 2 PCIE Rd Bandwidth Local (GB/s)

  • Quad 2 PCIE Wr Bandwidth Local (GB/s)

  • Quad 2 PCIE Rd Bandwidth Remote (GB/s)

  • Quad 2 PCIE Wr Bandwidth Remote (GB/s)

  • Quad 3 PCIE Rd Bandwidth Local (GB/s)

  • Quad 3 PCIE Wr Bandwidth Local (GB/s)

  • Quad 3 PCIE Rd Bandwidth Remote (GB/s)

  • Quad 3 PCIE Wr Bandwidth Remote (GB/s)

swpfdc

Software prefetch data cache fills from various nodes and CCXs

  • SwPfDC Fills from DRAM or IO connected in remote node (pti)

  • SwPfDC Fills from CCX Cache in remote node (pti)

  • SwPfDC Fills from DRAM or IO connected in local node (pti)

  • SwPfDC Fills from Cache of another CCX in local node (pti)

  • SwPfDC Fills from L3 or different L2 in same CCX (pti)

  • SwPfDC Fills from L2 (pti)

hwpfdc

Hardware prefetch data cache fills from various nodes and CCXs

  • HwPfDC Fills from DRAM or IO connected in remote node (pti)

  • HwPfDC Fills from CCX Cache in remote node (pti)

  • HwPfDC Fills from DRAM or IO connected in local node (pti)

  • HwPfDC Fills from Cache of another CCX in local node (pti)

  • HwPfDC Fills from L3 or different L2 in same CCX (pti)

  • HwPfDC Fills From L2 (pti)

pipeline_util

  • Total_Dispatch_Slots: Up to 6 instructions can be dispatched in one cycle.

  • SMT_Disp_contention: Percentage of dispatch slots that remained unused because the other thread was selected.

  • Frontend_Bound: Percentage of dispatch slots that remained unused as the front end did not supply enough instructions/operations.

  • Bad_Speculation: Percentage of dispatched operations that did not retire.

  • Backend_Bound: Percentage of dispatch slots that remained unused because of the back end stalls.

  • Retiring: Percentage of dispatch slots used by the retired operations.

  • IPC: Instructions per cycle.

  • Frontend_Bound.Latency: Percentage of dispatch slots that remained unused because of a latency bottleneck in the front end, such as Instruction Cache or ITLB misses.

  • Frontend_Bound.BW: Percentage of dispatch slots that remained unused because of a bandwidth bottleneck in the front end, such as decode bandwidth or Op Cache fetch bandwidth.

  • Bad_Speculation.Mispredicts: Percentage of dispatched ops that were flushed due to branch mis-predicts.

  • Bad_Speculation.Pipeline_Restarts: Percentage of dispatched ops that were flushed due to the pipeline restarts (resyncs).

  • Backend_Bound.Memory: Percentage of dispatch slots that remained unused because of stalls due to the memory subsystem.

  • Backend_Bound.CPU: Percentage of dispatch slots that remained unused because of stalls not related to the memory subsystem.

  • Retiring.Fastpath: Percentage of dispatch slots used by the retired fastpath operations.

  • Retiring.Microcode: Percentage of dispatch slots used by the retired microcode operations.
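
The level-1 breakdown can be sketched as a share of the total dispatch slots. This is an illustrative calculation only, not the tool's internal formula: the function name and the slot counts are hypothetical. It assumes that the total slot count is six slots per cycle, and that the five level-1 categories together account for all slots.

```python
DISPATCH_WIDTH = 6  # up to 6 instructions can be dispatched in one cycle


def level1_breakdown(cycles, slots_by_category):
    """Express each level-1 category as a percentage of total dispatch slots."""
    total_slots = DISPATCH_WIDTH * cycles
    return {
        name: 100.0 * slots / total_slots
        for name, slots in slots_by_category.items()
    }


# Hypothetical slot counts over 1000 cycles (6000 dispatch slots in total).
sample = {
    "SMT_Disp_contention": 600,
    "Frontend_Bound": 1200,
    "Bad_Speculation": 300,
    "Backend_Bound": 1800,
    "Retiring": 2100,
}

breakdown = level1_breakdown(cycles=1000, slots_by_category=sample)
for name, pct in breakdown.items():
    print(f"{name}: {pct:.1f}%")

# The largest non-retiring category points at the dominant bottleneck.
bottleneck = max((k for k in breakdown if k != "Retiring"), key=breakdown.get)
print("Dominant bottleneck:", bottleneck)
```

In this sample, Backend_Bound has the largest share, so the level-2 metrics Backend_Bound.Memory and Backend_Bound.CPU would be the next ones to inspect.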

UMC

Note

Supported only in Zen 5 servers.

  • Total Est Mem Bw (GB/s): Estimated total memory bandwidth usage, combining both reads and writes.

  • Total Est Mem RdBw (GB/s): Estimated memory bandwidth consumed by read operations.

  • Total Est Mem WrBw (GB/s): Estimated memory bandwidth consumed by write operations.

CXL

Note

Supported only in Zen 5 servers.

  • Total CXL Memory BW (GB/s): Overall memory bandwidth utilized through the CXL interface, including reads and writes.

  • Total CXL Read Memory BW (GB/s): Bandwidth consumed by memory read operations over CXL.

  • Total CXL Write Memory BW (GB/s): Bandwidth consumed by memory write operations over CXL.

Note

Memory channel metrics are available at the package level.

4.7. Interpreting Profile Data

4.7.1. Pipeline Utilization

On AMD Zen 4 and AMD Zen 5-based processors, AMDuProfPcm can monitor and report the pipeline utilization (pipeline_util) metrics, which help visualize bottlenecks in the CPU pipeline. Use the option -m pipeline_util to monitor and report the level-1 and level-2 top-down metrics.

Table 4.13 Level-1 Metrics

| Metric | Description |
|---|---|
| Total_Disp_Slots | Up to six instructions can be dispatched in one cycle. |
| SMT_Disp_contention | Unused dispatch slots as the other thread was selected. |
| Frontend_Bound | Dispatch slots that remained unused because the frontend did not supply appropriate instructions/ops. |
| Bad_Speculation | Dispatched operations that did not retire. |
| Backend_Bound | Dispatch slots that remained unused because of backend stalls. |
| Retiring | Dispatch slots used by operations that retired. |

Table 4.14 Level-2 Metrics

| Metric | Description |
|---|---|
| Frontend_Bound.Latency | Unused dispatch slots due to latency bottleneck in the frontend, such as Instruction Cache or ITLB misses. |
| Frontend_Bound.BW | Unused dispatch slots due to bandwidth bottleneck in the frontend, such as decode bandwidth or Op Cache fetch bandwidth. |
| Bad_Speculation.Mispredicts | Dispatched operations that were flushed due to branch mis-predicts. |
| Bad_Speculation.Pipeline_Restarts | Dispatched operations that were flushed due to pipeline restarts (resyncs). |
| Backend_Bound.Memory | Dispatch slots that remained unused because of stalls due to the memory subsystem. |
| Retiring.Fastpath | Dispatch slots used by fastpath operations that retired. |
| Retiring.Microcode | Dispatch slots used by microcode operations that retired. |

Due to multiplexing, the reported metrics may be inconsistent. For better results, use taskset to bind the monitored application to a specific set of cores and monitor only the cores on which the monitored application is running.

Run the following command to collect the top-down metrics:

AMDuProfPcm -m pipeline_util --msr -A system -o /tmp/myapp-td.csv -- /usr/bin/taskset -c 0 myapp.exe

The --msr option requires root privileges. Either run sudo ./AMDPcmSetCapability.sh and then open a new terminal tab, or run without the --msr option, as shown here:

sudo AMDuProfPcm -m pipeline_util -c core=0 -A system -o /tmp/myapp-td.csv -- /usr/bin/taskset -c 0 myapp.exe

Figure 4.1 Sample Top-Down Metrics report

Examples

Table 4.15 Top-Down Metrics Examples

| Task Description | Command |
|---|---|
| Timeseries monitoring of level-1 and level-2 top-down metrics (pipeline utilization) of a single-threaded program | # AMDuProfPcm -m pipeline_util -c core=1 -o /tmp/td.csv -- /usr/bin/taskset -c 1 /tmp/myapp.exe |
| Timeseries monitoring of level-1 and level-2 top-down metrics of a multi-threaded program running on all the cores | # AMDuProfPcm -m pipeline_util -a -A system -o /tmp/td.csv -- /tmp/myapp.exe |
| Cumulative monitoring of level-1 and level-2 top-down metrics of a multi-threaded program running on all the cores | # AMDuProfPcm -m pipeline_util -a -A system -C -o /tmp/td.csv -- /tmp/myapp.exe |
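
When these commands are launched from a script, it can help to assemble them programmatically so that the monitored core and the taskset binding always match. The helper below is a hypothetical sketch, not part of AMDuProf. It only composes the options that appear in the examples above (-m, -c core=N, -a, -A, -C, -o) and does not run the tool.

```python
import shlex


def build_pcm_command(metric_group, app, output_csv,
                      core=None, aggregate=None, cumulative=False):
    """Assemble an AMDuProfPcm argument list from the options used above."""
    cmd = ["AMDuProfPcm", "-m", metric_group]
    if core is not None:
        cmd += ["-c", f"core={core}"]      # monitor only the bound core
    else:
        cmd += ["-a"]                      # monitor all cores
    if aggregate:
        cmd += ["-A", aggregate]           # e.g. "system"
    if cumulative:
        cmd += ["-C"]                      # cumulative instead of timeseries
    cmd += ["-o", output_csv, "--"]
    if core is not None:
        # Bind the application to the same core that is being monitored.
        cmd += ["/usr/bin/taskset", "-c", str(core)]
    cmd.append(app)
    return cmd


single = build_pcm_command("pipeline_util", "/tmp/myapp.exe", "/tmp/td.csv", core=1)
print(shlex.join(single))
```

The returned list can be passed to subprocess.run directly, which avoids shell quoting problems when the application path contains spaces.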

4.8. Virtualization Support

Profiling capabilities of AMDuProfPcm might be limited on a virtual machine. Check the following hardware and OS primitives provided by host or guest operating system to determine the level of support.

Run the command AMDuProfCLI info --system to obtain this information and look for the following sections.

Table 4.16 Virtualization Support System Information

| Item | Description |
|---|---|
| PERF Features Availability | Availability of Core, L3, and DF PMCs on the system. If any of the PMCs are unavailable, the corresponding metrics are not supported. For example, on guest VMs, DF PMCs are not accessible for security reasons, so DF metrics such as memory, xgmi, and pcie cannot be collected. |
| Hypervisor Info | The hypervisor vendor and support status indicate whether the hypervisor is supported and whether the system is running as a host or a guest. Usually, the host has unrestricted access to all the PMCs. |
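
The dependency between PMC availability and metric groups can be sketched as a simple lookup. The mapping below is an assumption for illustration only. The text above confirms only that memory, xgmi, and pcie depend on DF PMCs and that IPC uses Core PMC events. The l3 entry is inferred from its name, and the function is hypothetical rather than an AMDuProf API.

```python
# Assumed mapping from metric group to the PMC type it needs (illustrative).
REQUIRED_PMC = {
    "ipc": "core",
    "l3": "l3",
    "memory": "df",
    "xgmi": "df",
    "pcie": "df",
}


def supported_groups(available_pmcs):
    """Return the metric groups whose required PMC type is available."""
    return sorted(
        group for group, pmc in REQUIRED_PMC.items() if pmc in available_pmcs
    )


# A guest VM that exposes Core and L3 PMCs but not DF PMCs.
print(supported_groups({"core", "l3"}))
# A host with unrestricted access to all PMC types.
print(supported_groups({"core", "l3", "df"}))
```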

4.9. Known Behavior - Issues Due to BIOS Settings

Following is the known behavior of L2 Hit/Miss from HWPF metrics based on the BIOS settings:

Roofline plots generated using AMDuProfModelling.py and saved as PDF might have improperly aligned labels for the plot lines.

4.10. Constraints and Limitations

Here is a list of constraints and limitations:

4.11. Monitoring Without Root Privileges

On Linux, use the script AMDPcmSetCapability.sh to run in MSR mode without root privileges. This option collects Core, L3, and DF PMC events on AMD Zen-based processors. Newer processors may require the latest kernel for Perf mode support.

Examples

Table 4.17 Examples to Monitor Without Root Privileges

| Task Description | Command |
|---|---|
| Cumulative reporting of IPC metrics at the end of the benchmark execution | $ AMDuProfPcm -m ipc -C -O /tmp -- /tmp/myapp.exe |
| Cumulative reporting of IPC metrics at the end of the benchmark execution, aggregate metrics at system level | $ AMDuProfPcm -m ipc -C -A system -O /tmp -- /tmp/myapp.exe |
| Cumulative reporting of IPC metrics at the end of the benchmark execution, aggregate metrics per processor package | $ AMDuProfPcm -m ipc -C -A package -O /tmp -- /tmp/myapp.exe |
| Cumulative reporting of level-1 and level-2 top-down metrics (pipeline utilization) | $ AMDuProfPcm -m pipeline_util -C -A system -O /tmp -- /tmp/myapp.exe |
| Timeseries monitoring of IPC of a benchmark, aggregate metrics at system level | $ AMDuProfPcm -m ipc -A system -O /tmp -- /tmp/myapp.exe |
| Timeseries monitoring of IPC of a benchmark, aggregate metrics per processor package | $ AMDuProfPcm -m ipc -A package -O /tmp -- /tmp/myapp.exe |
| Timeseries monitoring of IPC of a benchmark, system aggregate level | $ AMDuProfPcm -m ipc -O /tmp -- /tmp/myapp.exe |
| Timeseries monitoring of level-1 and level-2 top-down metrics (pipeline utilization) | $ AMDuProfPcm -m pipeline_util -A system -O /tmp -- /tmp/myapp.exe |
| Timeseries monitoring of memory bandwidth reporting at package and memory channels level | $ AMDuProfPcm -m memory -a -A system,package -O /tmp |