The System Analysis tool, AMDuProfPcm, helps you track, analyze, and understand performance metrics for AMD processors. It collects metrics from hardware performance monitoring counters and operating system interfaces, and displays them as time-series plots, graphs, or CSV reports. The tool also supports basic performance modeling using roofline plots.
Performance Monitoring Counters (PMC) are hardware resources capable of counting events occurring inside the processor. To count a specific event, a counter is configured or loaded with a unique value referred to as an event configuration.
PMCs are categorized as follows:
Core PMC: Every logical core has a set of performance monitoring counters referred to as Core PMCs capable of counting events inside the core. For a list of all the supported events for Core PMCs, refer to Core Performance Monitor Counters in the Processor Programming Reference, and search for the specific PPR.
L3 PMC (Uncore):
Every L3 cache has a set of performance monitoring counters referred to as L3 PMCs capable of counting events inside the L3 cache. For a list of all the supported events for L3 PMCs, see the section L3 Cache PMC Events in the Processor Programming Reference, and search for the specific PPR.
Metrics listed under option -m such as l3 are based on L3 PMC events and are referred to as L3 metrics.
DF (Data Fabric) PMC (Uncore):
Every socket has a set of performance monitoring counters referred to as DF PMCs, capable of counting events inside the Data Fabric. For a list of all the supported events for DF PMCs, see the section Data Fabric Performance Monitor Events in the Processor Programming Reference, and search for the specific PPR.
Metrics listed under option -m such as memory, xgmi, pcie, dma, ccm_bw are based on DF PMC events and are referred to as DF metrics.
Multiplexing is required when the number of events to be monitored exceeds the number of available PMCs. Event configurations are grouped so that metrics computed from multiple events remain valid. The groups are loaded onto the PMCs one at a time in round-robin fashion, and each group resides on the PMCs for a duration before being replaced by the next group. While an event group does not reside on the PMCs (because the PMCs are occupied by other groups), the counts for its events are extrapolated from the last known measurement or sample. As a result, multiplexing usually introduces noise in the collected data.
Multiplexing Interval: The interval for which an event group resides on the PMCs during multiplexing. The multiplexing interval can be set using the -t option and should be kept as low as possible for data accuracy.
Logging Interval: The interval after which a sample, containing the count values of all the monitored events, is logged. The logging interval can be set with the -I option. When multiplexing, the logging interval must be a multiple of the product of the multiplexing interval and the number of event groups. AMDuProfPcm adjusts the logging interval to the value closest to the one specified by the user that satisfies this condition.
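The logging-interval constraint can be illustrated with a little arithmetic. The sketch below is not AMDuProfPcm's actual algorithm: it assumes the tool rounds to the nearest valid multiple and never goes below one full round of all event groups. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: nearest logging interval (ms) that is a multiple of
# (multiplexing interval x number of event groups).
# Assumption: round to the nearest multiple, never below one full round.
nearest_logging_interval() {
    local requested_ms="$1" mux_ms="$2" groups="$3"
    local round_ms=$(( mux_ms * groups ))   # time for every group to run once
    local multiples=$(( (requested_ms + round_ms / 2) / round_ms ))
    if [ "$multiples" -lt 1 ]; then
        multiples=1
    fi
    echo $(( multiples * round_ms ))
}

# Example: 1000 ms requested, 100 ms multiplexing interval, 3 event groups.
# One full round takes 300 ms, so the valid candidates are 900 and 1200.
nearest_logging_interval 1000 100 3
```

With these inputs the sketch prints 900, the valid multiple closest to the requested 1000 ms.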
AMDuProfPcm collects profiling data in two modes:
Perf Mode (non-root mode, supported only on Linux)
MSR Mode (Root mode)
In Perf mode, AMDuProfPcm accesses the PMCs through the Linux perf subsystem, which manages them. The perf subsystem enables sharing of hardware resources (PMCs), which allows multiple instances of AMDuProfPcm and other profilers to run simultaneously. On Linux, Perf mode is the default mode of data collection.
Multiplexing interval in perf mode can be set by:
Writing the value (in ms) to /sys/devices/cpu/perf_event_mux_interval_ms for Core PMCs.
Writing the value (in ms) to /sys/devices/amd_l3/perf_event_mux_interval_ms for L3 PMCs.
Writing the value (in ms) to /sys/devices/amd_df/perf_event_mux_interval_ms for DF PMCs.
Availability of /sys/devices/amd_l3/perf_event_mux_interval_ms and /sys/devices/amd_df/perf_event_mux_interval_ms depends on the kernel version and the amd_uncore module.
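Because the L3 and DF files may be absent, it can help to check which multiplexing-interval files exist before writing to them. The following is a minimal sketch; the function name is hypothetical, and the sysfs root is a parameter only so that the helper can be tried against a test directory.

```shell
#!/usr/bin/env bash
# Print the perf multiplexing interval (ms) for each PMU that exposes one.
# Hypothetical helper; the default root is /sys/devices.
show_mux_intervals() {
    local root="${1:-/sys/devices}"
    local pmu file
    for pmu in cpu amd_l3 amd_df; do
        file="$root/$pmu/perf_event_mux_interval_ms"
        if [ -r "$file" ]; then
            echo "$pmu: $(cat "$file") ms"
        else
            echo "$pmu: not available"
        fi
    done
}

show_mux_intervals

# To change an interval (requires root), write the value in ms, for example:
#   echo 10 | sudo tee /sys/devices/cpu/perf_event_mux_interval_ms
```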
In MSR mode, PMCs are accessed by directly configuring the PMC MSRs. Root privileges are required for this mode. Older Linux kernels might not support collecting L3 PMC and DF PMC data in Perf mode. In such cases, use MSR mode.
How MSR mode is used depends on the operating system:
Linux: PMC MSRs are accessed via the msr kernel module. Use the --msr option to select this mode for data collection.
Windows: PMC MSRs are accessed via ioctl calls to the AMDPowerProfiler driver. This is the default mode for data collection.
FreeBSD: PMC MSRs are accessed with the cpuctl module. This is the default mode for data collection.
AMDuProfPcm collects and reports the data in the following modes:
Timeseries Mode: Profiling data is collected and reported periodically, and the logging interval can be specified with -I option. This is the default mode of data collection.
Cumulative Mode: Profiling data is collected and accumulated over the profile run and is reported for the entire profile duration. Use -C option to select this mode.
Live Mode: Profiling data is collected and printed on the console periodically. Use the top command of AMDuProfPcm to select this mode.
AMDuProfPcm reports data in the following formats:
HTML: The HTML report features graphical representations of the monitored metrics, aiding users in visualizing the data. It includes several tabs that organize relevant metrics by PMC type and significance. The report presents a variety of plots and graphs, including time series plots, heatmaps, sunburst charts, and radar graphs. To generate the HTML report, use the --html option in the command.
CSV: Profiling data is printed in comma separated value (.csv) format, which can be viewed using MS Excel or other similar applications or be used as input for custom scripts for further processing the data.
JSON: The JSON report serves as an intermediate format for generating the final HTML report from the CSV report. It organizes performance data into several key sections: hierarchy, metadata, metric-groups, metrics, and sections.
The hierarchy section defines the structural relationship between different performance metrics.
The metadata section captures environment and system details relevant to the report.
The metric-groups section logically groups related metrics together.
The metrics section records the actual performance metrics collected during the observation period.
The sections section contains configuration for graph plotting based on the collected metrics.
The JSON report is automatically generated when the --html option is specified in the command.
AMDuProfPcm can profile multiple processes, collecting data exclusively for the attached processes (specified with -p <pid,..> or --tid <tid,..>) or for the launched application (specified on the command line). Other processes running concurrently in the background are excluded from data collection. This feature is available only in Perf mode, only when gathering Core PMC metrics, and only with system-wide aggregation (the -A system option).
| Feature/AMD Zen Family | Zen2 | Zen3 | Zen4 | Zen5 |
|---|---|---|---|---|
| HTML Report | No | Yes | Yes | Yes |
| Virtualization | No | Yes | Yes | Yes |
| CSV Report | Yes | Yes | Yes | Yes |
| Roofline | Yes | Yes | Yes | Yes |
To profile L3 and DF counters while Hyper-V is enabled, switch to system mode using the following command. After execution, reboot the system.
Note
Run these commands in PowerShell.
Switch to system mode
bcdedit /set hypervisorperfmon system
Switch back to default mode
bcdedit /deletevalue hypervisorperfmon
In Perf mode, AMD processors based on Zen4 and later require Linux kernel version 6.0 or later to collect DF metrics.
For data collection, set /proc/sys/kernel/perf_event_paranoid to 0 or lower. For example: echo 0 >> /proc/sys/kernel/perf_event_paranoid.
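The current setting can be verified before starting a collection. The following is a minimal sketch; the function name is hypothetical, and the file path is a parameter only so that the check can be exercised against a test file.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for Perf mode: succeed only if
# perf_event_paranoid is 0 or lower, as required above.
check_perf_paranoid() {
    local file="${1:-/proc/sys/kernel/perf_event_paranoid}"
    local level
    if [ ! -r "$file" ]; then
        echo "cannot read $file"
        return 2
    fi
    level=$(cat "$file")
    if [ "$level" -le 0 ]; then
        echo "perf_event_paranoid=$level: OK"
    else
        echo "perf_event_paranoid=$level: set it to 0 or lower"
        return 1
    fi
}

check_perf_paranoid || true
```

Note that writing to this file requires root privileges, and the setting does not persist across reboots.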
In MSR mode, AMDuProfPcm requires the msr kernel module to be loaded to access the MSRs from user space. After it is loaded, either write permission for the /dev/cpu/*/msr devices or root privileges are required.
After you run the AMDPcmSetCapability.sh script, you can use MSR mode (the --msr option) without root privileges.
Note
Use a new terminal for profiling without root after setting the capability.
To load the msr module, use the command:
modprobe msr
To disable NMI watchdog, write 0 to /proc/sys/kernel/nmi_watchdog. For example: echo 0 >> /proc/sys/kernel/nmi_watchdog.
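The two Linux prerequisites above (msr module loaded, NMI watchdog disabled) can be checked together. This is a sketch only: the function name is hypothetical, it probes just the device node for CPU 0, and both paths are parameters so that the check can be tried against test files.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: msr device node present and NMI watchdog off.
check_linux_prereqs() {
    local msr_dev="${1:-/dev/cpu/0/msr}"
    local nmi_file="${2:-/proc/sys/kernel/nmi_watchdog}"
    local status=0
    if [ -e "$msr_dev" ]; then
        echo "msr device: present"
    else
        echo "msr device: missing (load the module with: modprobe msr)"
        status=1
    fi
    if [ -r "$nmi_file" ] && [ "$(cat "$nmi_file")" = "0" ]; then
        echo "nmi_watchdog: disabled"
    else
        echo "nmi_watchdog: enabled or unreadable"
        status=1
    fi
    return "$status"
}

check_linux_prereqs || true
```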
The roofline plotting script (AMDuProfModelling.py) requires Python 3.x and the matplotlib Python module.
On FreeBSD, AMDuProfPcm uses the cpuctl module and requires either root privileges or read-write permissions for the /dev/cpuctl* devices.
AMDuProfPcm [<COMMANDS>] [<OPTIONS>] -- <PROGRAM> [<ARGS>]
<PROGRAM> denotes the launch application to be profiled.
<ARGS> denotes the list of arguments for the launch application.
$ AMDuProfPcm -h
# AMDuProfPcm -m ipc -c core=0 -d 10 -o /tmp/pmcdata.txt
# AMDuProfPcm -m memory -a -d 10 -o /tmp/memdata.txt -- /tmp/myapp.exe
Here is a list of all command line options.
| Option | Description |
|---|---|
|  | Displays help information on the console/terminal. |
|  | Metrics to report. The supported metric groups and the corresponding metrics are Platform, OS, and Hypervisor specific. Run Core L3 DF Metrics |
|  | Collect from the specified If ccx or l3 is specified: If package is specified: If numa is specified: |
|  | Collect from all the cores. Note: Options |
|  | Prints the cumulative data at the end of the profile duration. Else, all the samples will be reported as timeseries data. |
|  | Prints aggregated metrics at various component levels. The following granularities are supported: |
|  | Print the metrics at regular intervals. By default, it is enabled with an interval of 1000 ms. |
|  | User defined XML config file that specifies Core/L3/DF counters to monitor. Refer sample files in the <install-dir>/bin/Data/Config/ directory for the format. Note |
|  | Start profiling after the specified duration (n in milliseconds). |
|  | Wait for SIGUSR1 to start profiling. Use Note: Supported only on Linux. |
|  | Override events to collect for kernel/user-only mode. Available options are: Note: Applicable only for Core metrics. |
|  | Profile existing threads. Thread IDs are separated by comma. Note: Available only for Core metrics. Not applicable with |
|  | Profile duration to run. |
|  | The interval in which pmc count values will be read, the minimum is 100 ms. |
|  | The output file name, it is in CSV format. |
|  | Sets precision of the metrics reported, the default value is 2. |
|  | Hide CPU topology section in the output report. |
|  | Force resets the MSRs. |
|  | Prefixes |
|  | Displays time stamp in the time series report. |
|  | Lists the supported raw PMC events. |
|  | Prints the name, description, and available unit masks for the event. |
|  | Specifies the working directory. The default will be the path of the launched application. |
|  | Print cpu topology. |
|  | Print version. |
|  | Collect data using perf subsystem without root privileges. This option is enabled by default and will be deprecated soon. Note: This is only supported on Linux. |
|  | Specify the target process ID to monitor. Note: This is only supported with the option |
|  | Filter the roofline data based on the utilization. For example, Note: This is applicable only with the roofline command. |
|  | Collect data and generate HTML report. |
|  | Collect data using MSR mode. Requires root privilege. |
|  | Generate custom percentile in html report. Default is 95th Percentile. |
|  | Collect xgmi data. |
|  | Collect pcie data. |
|  | Collect power data. |
|  | Prints metrics at core level (no aggregation). |
|  | Prints aggregated metrics at die level. |
|  | Prints aggregated metrics at socket level. |
|  | Report roofline data under the ‘profile’ command. |
|  | Prints non-aggregated metrics data: |
|  | Collect clock data. |
|  | Count only the guest events. Note: This is applicable only in the host when hypervisor is enabled. |
|  | Count only the host events. (Default behavior is to collect host and guest data). Note: This is applicable only in the host when hypervisor is enabled. |
|  | Read memory speed and total memory channels from SMBIOS. Note: This is applicable only for roofline command. |
|  | Show only the CCXs where the target application ran in the heatmap’s L3 Cache section in the html report. Note: Not applicable with |
|  | Path to create the output directory. |
Here is a list of all the commands.
| Command | Description |
|---|---|
|  | Create html report from csv report. |
|  | Compare two collected sessions and create an html comparison report. |
|  | Collects data required for generating roofline model. |
|  | Collect and report real time timeseries data in a tabular format. |
|  | Collect and generate report for timeseries, cumulative and roofline data together. |

| Task Description | Command |
|---|---|
| Collect IPC data from core 0 for the duration of 60 seconds |  |
| Collect IPC/L3 metrics for CCX=0 for the duration of 60 seconds |  |
| Collect only the memory bandwidth across all the UMCs for the duration of 60 seconds and save the output in |  |
| Collect IPC data for 60 seconds from all the cores |  |
| Collect IPC data from core 0 and run the program in core 0 |  |
| Collect IPC data from cores 0-7 and run the application on cores 0-3 |  |
| Collect IPC and L2 data from core 0, report the cumulative (not timeseries) data, and run the program in core 0 |  |
| List the supported raw Core PMC events |  |
| Print the name, description, and the available unit masks for the specified event |  |
| Compare two sessions | Note: Not applicable with |
| Collect and generate roofline HTML report |  |
| Collect default set of metrics, roofline and power data from all the cores of the system for 60 seconds and generate timeseries, cumulative, and roofline CSV and HTML reports |  |
| Collect ipc metrics from all the cores in the system for 30 seconds and generate timeseries and cumulative CSV and HTML reports in the output directory |  |
Core Metrics
| Task Description | Command |
|---|---|
| Collect IPC and L2 data from all the cores and report the aggregated data at the system and package level |  |
| Collect IPC and L2 data from all the cores and report the cumulative (not timeseries) data |  |
| Collect IPC and L2 data from all the cores, report the cumulative (not timeseries) data, and aggregate at system and package level |  |
| Collect IPC and L2 data from all the cores in |  |
| Collect IPC data for 30 seconds from all the cores in the system |  |
| Collect IPC data from core 0 and run the program |  |
| Collect IPC data from core 0 for the duration of 30 seconds |  |
| Collect IPC/L2 metrics for all the cores in CCX=0 for the duration of 30 seconds |  |
| Get the list of supported metrics |  |
L3 Metrics
| Task Description | Command |
|---|---|
| Collect L3 data from ccx=0 for the duration of 30 seconds |  |
| Collect L3 data from all the CCXs and report for the duration of 30 seconds |  |
| Collect L3 data from all the CCXs and aggregate at system and package level and report for the duration of 30 seconds |  |
| Collect L3 data from all the CCXs and aggregate at system and package level and report for the duration of 30 seconds; also report for the individual CCXs |  |
| Collect L3 data from all the CCXs for the duration of 30 seconds and report the cumulative data (no timeseries data) |  |
| Collect L3 data from all the CCXs and aggregate at system and package level and report cumulative data (no timeseries data) |  |
| Collect IPC data from core 0 for the duration of 30 seconds |  |
Memory Bandwidth
| Task Description | Command |
|---|---|
| Report memory bandwidth for all the memory channels for the duration of 60 seconds and save the output in |  |
| Report total memory bandwidth aggregated at the system level for the duration of 60 seconds and save the output in |  |
| Report total memory bandwidth aggregated at the system level and also report for every memory channel |  |
| Report total memory bandwidth aggregated at the system level and also report for all the available memory channels. To report cumulative metric value instead of the timeseries data |  |
Raw Event Count Dump
| Task Description | Command |
|---|---|
| Monitor events from core 0 and dump the raw event counts for every sample in timeseries manner, no metrics report will be generated |  |
| Monitor events from all the cores and dump the raw event counts for every sample in timeseries manner, no metrics report will be generated |  |
Custom Config File
Config files are available for supported processors at <uprof-install-dir>\bin\Data\Config\. The default config file name for a specific processor, identified by family and model number, has the format <Family>_<Model Range>.conf. For example, 0x19_0x1.conf is used for all processors with family 0x19 and a model number from 0x10 to 0x1f. Config files with the RL_ prefix are used for the roofline command.
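The naming convention can be sketched as a small helper. This is an inference from the single 0x19_0x1.conf example rather than a documented rule: it assumes the model range is the model number with its low 4 bits dropped and that names use lowercase hex. The function name is hypothetical, and the result should be checked against the files actually present in the Config directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the default config file name from the CPU
# family and model number.
# Assumption: <Model Range> is the model number shifted right by 4 bits.
default_config_name() {
    local family="$1" model="$2"
    printf '0x%x_0x%x.conf\n' "$(( family ))" "$(( model >> 4 ))"
}

# Models 0x10 through 0x1f of family 0x19 map to the same file.
default_config_name 0x19 0x10
default_config_name 0x19 0x1f
```

Both calls print 0x19_0x1.conf, matching the example above.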
These files can be copied and modified to add specific events of interest and formulas for computing metrics. All the metrics defined in the file will be monitored and reported.
C:\> AMDuProfPcm.exe -i <custom config file> -a -d 60 -O C:\tmp
Miscellaneous
| Task Description | Command |
|---|---|
| List the supported raw Core PMC events |  |
| Print the name, description, and the available unit masks for the specified event |  |
The performance metrics for AMD EPYC™ Zen 2, Zen 3, Zen 4, and Zen 5 core architecture processors are listed here.
| Metric Group | Metric Details |
|---|---|
| ipc |  |
| fp |  |
| l1 |  |
| l2 |  |
| tlb |  |
| l3 |  |
| Memory | Memory Read and Write bandwidth in GB/s for all the channels |
| xgmi | Approximate xGMI outbound data bytes in GB/s for all the remote links |
| pcie | Approximate PCIe bandwidth in GB/s |

| Metric Group | Metric Details |
|---|---|
| ipc |  |
| fp |  |
| l1 |  |
| l2 |  |
| tlb |  |
| dc |  |
| l3 |  |
| Memory | Memory Read and Write bandwidth in GB/s for all the channels |
| xgmi | Approximate xGMI outbound data bytes in GB/s for all the remote links |
| swpfdc | Software prefetch data cache from various nodes and CCX |
| hwpfdc | Hardware prefetch data cache from various nodes and CCX |

| Metric Group | Metric Details |
|---|---|
| ipc |  |
| fp |  |
| avx_imix |  |
| l1 |  |
| dc |  |
| l2 |  |
| tlb |  |
| l3 |  |
| Memory | DRAM read and write data bytes for a local processor. DRAM read and write data bytes for a remote processor. Memory Read and Write bandwidth in GB/s for all the channels |
| ccm_bw | Reports data traffic to CCM at interfaces 0 and 1 |
| xgmi | xGMI Outbound Data Bytes (GB/s): Total outbound data bytes in Gigabytes per second. |
| dma (not available in AMD Zen 1, AMD Zen 2, and AMD Zen 3 processors) |  |
| pcie | Approximate PCIe bandwidth in GB/s. PCIe bandwidth for read and write transactions, local, and remote node bandwidth. Per quad PCIe bandwidth |
| swpfdc | Software prefetch data cache from various nodes and CCX |
| hwpfdc | Hardware prefetch data cache from various nodes and CCX |
| pipeline_util |  |
| UMC. Note: Supported only in Zen 5 servers. |  |
| CXL. Note: Supported only in Zen 5 servers. |  |
Note
Memory channels are available with package level.
On AMD Zen4 and Zen5-based processors, AMDuProfPcm supports monitoring and reporting the pipeline utilization (pipeline_util) metrics. This feature provides pipeline_util metrics to visualize the bottlenecks in the CPU pipeline. Use the option -m pipeline_util to monitor and report the level-1 and level-2 top-down metrics.
| Metric | Description |
|---|---|
|  | Up to six instructions can be dispatched in one cycle. |
|  | Unused dispatch slots as the other thread was selected. |
|  | Dispatch slots that remained unused because the frontend did not supply appropriate instructions/ops. |
|  | Dispatched operations that did not retire. |
|  | Dispatch slots that remained unused because of backend stalls. |
|  | Dispatch slots used by operations that retired. |

| Metric | Description |
|---|---|
|  | Unused dispatch slots due to latency bottleneck in the frontend, such as Instruction Cache or ITLB misses. |
|  | Unused dispatch slots due to bandwidth bottleneck in the frontend, such as decode bandwidth or Op Cache fetch bandwidth. |
|  | Dispatched operations that were flushed due to branch mispredicts. |
|  | Dispatched operations that were flushed due to pipeline restarts (resyncs). |
|  | Dispatched slots that remained unused because of stalls due to memory subsystem. |
|  | Dispatch slots used by fastpath operations that retired. |
|  | Dispatch slots used by microcode operations that retired. |
Due to multiplexing, the reported metrics may be inconsistent. For better results, use taskset to bind the monitored application to a specific set of cores and monitor only the cores on which the monitored application is running.
Run the following command to collect the top-down metrics:
AMDuProfPcm -m pipeline_util --msr -A system -o /tmp/myapp-td.csv -- /usr/bin/taskset -c 0 myapp.exe
The --msr option requires root privileges. To use it without root, run sudo ./AMDPcmSetCapability.sh and then open a new terminal. Alternatively, run without the --msr option as shown here:
sudo AMDuProfPcm -m pipeline_util -c core=0 -A system -o /tmp/myapp-td.csv -- /usr/bin/taskset -c 0 myapp.exe
Figure 4.1 Sample Top-Down Metrics report
Examples
| Task Description | Command |
|---|---|
| Timeseries monitoring of level-1 and level-2 top-down metrics (pipeline utilization) of a single-threaded program |  |
| Timeseries monitoring of level-1 and level-2 top-down metrics of a multi-threaded program running on all the cores |  |
| Cumulative monitoring of level-1 and level-2 top-down metrics of a multi-threaded program running on all the cores |  |
Profiling capabilities of AMDuProfPcm might be limited on a virtual machine. Check the following hardware and OS primitives provided by host or guest operating system to determine the level of support.
Run the command AMDuProfCLI info --system to obtain this information and look for the following sections.
| Item | Description |
|---|---|
| PERF Features Availability | Availability of Core, L3, and DF PMCs on the system. If any of the PMCs are unavailable, the corresponding metrics will not be supported. For example, on guest VMs, DF PMCs are not accessible for security reasons, so DF metrics such as memory, xgmi, and pcie cannot be collected. |
| Hypervisor Info | The hypervisor vendor and support information can be used to determine whether the hypervisor is supported. It also indicates the mode, host or guest. Usually the host has unrestricted access to all the PMCs. |
The following is the known behavior of the L2 Hit/Miss from HWPF metrics, depending on the BIOS settings:
The AMDuProfPcm L2 Hit/Miss from HWPF metric does not collect any data when all of the following options are disabled in the BIOS:
L1 Stream HW Prefetcher
L1 Stride Prefetcher
L1 Region Prefetcher
L2 Stream HW Prefetcher
L2 Up/Down Prefetcher
The AMDuProfPcm L2 Hit/Miss from HWPF metric collects very few samples with the following BIOS settings:
L1 Stream HW Prefetcher: Disable
L1 Stride Prefetcher: Disable
L1 Region Prefetcher: Enable
L2 Stream HW Prefetcher: Disable
L2 Up/Down Prefetcher: Disable
Roofline plots generated using AMDuProfModelling.py and saved as PDF might have improperly aligned labels for the plot lines.
Here is a list of constraints and limitations:
Launching multiple instances of AMDuProfPcm in MSR mode (option --msr) with the force reset option (-r) might cause undefined behavior.
Sampling interval lower than 1 s (1000 ms) is not supported.
MSR mode has significant overhead, so the number of collected samples might not equal profile duration / sampling interval.
In MSR mode, L3 Miss % can exceed 100% when the number of L3 accesses or misses is low, because of noise and inaccuracies.
Behavior is undefined if cores are turned off during a profile run.
Heterogeneous system configurations might cause undefined behavior.
Hypervisors (from cloud vendors) might restrict the guest from collecting some events for security reasons, in which case the count values of these events are reported as 0.
Limited support for Zen1 and Zen2 AMD processors.
On Linux, use the script AMDPcmSetCapability.sh to run MSR mode without root privileges. This option collects Core, L3, and DF PMC events on AMD Zen-based processors. Newer processors may require the latest kernel for Perf mode support.
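To judge how much MSR-mode overhead affected a run, the nominal sample count (profile duration divided by sampling interval) can be compared with the number of samples found in the report. The following is a sketch with a hypothetical function name; the collected count in the example is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: nominal sample count for a run and the shortfall
# against the number of samples actually present in a report.
sample_shortfall() {
    local duration_s="$1" interval_ms="$2" collected="$3"
    local expected=$(( duration_s * 1000 / interval_ms ))
    echo "expected=$expected collected=$collected missing=$(( expected - collected ))"
}

# Example: 60 s run, 1000 ms interval, 57 samples in the report (made-up count)
sample_shortfall 60 1000 57
```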
Examples
| Task Description | Command |
|---|---|
| Cumulative reporting of IPC metrics at the end of the benchmark execution |  |
| Cumulative reporting of IPC metrics at the end of the benchmark execution, aggregate metrics at system level |  |
| Cumulative reporting of IPC metrics at the end of the benchmark execution, aggregate metrics per processor package |  |
| Cumulative reporting of level-1 and level-2 top-down metrics (pipeline utilization) |  |
| Timeseries monitoring of IPC of a benchmark, aggregate metrics at system level |  |
| Timeseries monitoring of IPC of a benchmark, aggregate metrics per processor package |  |
| Timeseries monitoring of IPC of a benchmark, system aggregate level |  |
| Timeseries monitoring of level-1 and level-2 top-down metrics (pipeline utilization) |  |
| Timeseries monitoring of memory bandwidth reporting at package and memory channels level |  |