7. Investigate Performance Issues

7.1. Profiling Support on Linux for perf_event_paranoid Values

The following table describes profiling support on Linux for different perf_event_paranoid values:

Table 7.1 Profiling perf_event_paranoid Values on Linux

| Config | Profile Scope | perf_event_paranoid -1 | perf_event_paranoid 0 | perf_event_paranoid 1 | perf_event_paranoid 2 |
|---|---|---|---|---|---|
| Core PMC Event Based Profiling | Specific application launched or attach to process | Y | Y | Y | Y |
| Core PMC Event Based Profiling | Kernel, Hypervisor | Y | Y | Y | N |
| Core PMC Event Based Profiling | Entire System | Y | Y | N | N |
| Instruction Based Sampling | Specific Application | Y | Y | Y | N |
| Instruction Based Sampling | Attach Specific Process | Y | Y | N | N |
| Instruction Based Sampling | Entire System | Y | Y | N | N |
| Time Based Profiling | Specific Application | Y | Y | Y | Y |
| Time Based Profiling | Attach Specific Process | Y | Y | N | N |
| Time Based Profiling | Entire System | Y | Y | N | N |
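As a quick way to apply Table 7.1, the following minimal sketch reads the current perf_event_paranoid value from /proc/sys/kernel/perf_event_paranoid and lists the configurations the table marks as supported. The support matrix is transcribed from the table; the function names are hypothetical and are not part of AMDuProf. Values outside the range covered by the table return an empty list.

```python
from pathlib import Path

# Support matrix transcribed from Table 7.1:
# (config, scope) -> perf_event_paranoid values marked "Y".
SUPPORT = {
    ("Core PMC Event Based Profiling",
     "Specific application launched or attach to process"): {-1, 0, 1, 2},
    ("Core PMC Event Based Profiling", "Kernel, Hypervisor"): {-1, 0, 1},
    ("Core PMC Event Based Profiling", "Entire System"): {-1, 0},
    ("Instruction Based Sampling", "Specific Application"): {-1, 0, 1},
    ("Instruction Based Sampling", "Attach Specific Process"): {-1, 0},
    ("Instruction Based Sampling", "Entire System"): {-1, 0},
    ("Time Based Profiling", "Specific Application"): {-1, 0, 1, 2},
    ("Time Based Profiling", "Attach Specific Process"): {-1, 0},
    ("Time Based Profiling", "Entire System"): {-1, 0},
}

PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_level(path=PARANOID_PATH):
    """Return the current perf_event_paranoid value, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def supported_modes(level):
    """List the (config, scope) pairs that Table 7.1 marks Y for this level."""
    return [key for key, levels in SUPPORT.items() if level in levels]


if __name__ == "__main__":
    level = read_paranoid_level()
    if level is None:
        print("perf_event_paranoid is not readable on this system")
    else:
        print(f"perf_event_paranoid = {level}")
        for config, scope in supported_modes(level):
            print(f"  {config}: {scope}")
```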

7.2. Configure Profile

To perform a collect run, first configure the profile. The profile configuration identifies all the following information used to perform a collect measurement:

Note

The additional profile data to be collected depends on the selected profile type.

7.2.1. Select Profile Target

To start a profile, click either the PROFILE page in the top navigation bar or the Profile an Application? link on the HOME page Welcome screen. The Start Profiling screen is displayed.

Select Profile Target is available in the Start Profiling window.

Figure 7.1 Start Profiling - Select Profile Target

You can select one of the following profile targets from the Select Profile Target drop-down:

Once the profile target is selected and configured with valid data, the Next button is enabled to go to the next screen of Start Profiling.

Note

The Next button is enabled only if all the selected options are valid.

7.2.2. Select Profile Type

Once the profile target is selected and configured, click the Next button. The Select Profile Configuration screen is displayed as follows:

Figure 7.2 Start Profiling - Select Profile Configuration

  1. Select one of the following tabs:

  2. Once you select a profile type, the left vertical pane within this window will list the options corresponding to the selected profile type. For CPU Profile type, all the available predefined sampling configurations will be listed.

    Modify event options are available only for the predefined configurations.

  3. Click the Advanced Options button to proceed to the Advanced Options screen and set other options such as the Call Stack Options, Profile Scheduling, Sources, Symbols, and so on.

  4. The details in the Sample Data table are persistent and saved by the tool with a name (here, it is AMDuProf-EBP-ScimarkStable). You can define this name and navigate to PROFILE > Saved Configurations to select and reuse the same configuration later.

  5. The Next and Previous buttons are available to navigate between the various screens of the Start Profiling window.

The CLI command is available at the bottom of this page, which displays the CLI version of the GUI option selected on the Select Profile Configuration page.

7.2.3. Advanced Options

Click the Advanced Options button on the Select Profile Type screen to set the advanced options for Windows or Linux and start profiling.

Linux

Figure 7.3 Start Profiling - Advanced Options for Linux

Windows

Figure 7.4 Start Profiling - Windows

You can set the following options on the Advanced Options screen and click Start Profile to begin profiling.

Table 7.2 Profiling-Advanced Options

Options

Description - Linux

Description - Windows

OpenMP Tracing Options

Enables collection of OpenMP runtime data for performance analysis.

  • Enable OpenMP Tracing: Toggle to enable OpenMP tracing.

  • Select OpenMP Trace Implementation: Dropdown to choose the OpenMP trace implementation (e.g., IMP).

Not available in Windows version.

Enable Thread Concurrency Option

Not available in Linux version.

Displays the number of threads running concurrently for the selected process to help analyze thread-level parallelism.

  • Enable Thread Concurrency: Toggle to enable thread concurrency data collection.

Call Stack Options

Configures call stack sampling for detailed call graph views and debugging.

  • Enable CSS: Toggle to enable Call Stack Sampling.

  • Call Stack Collection Mode: Method for collecting call stacks (e.g., Frame Pointers).

  • Call Stack Unwind Size: Maximum stack size (in bytes) for sample collection (default: 1024).

Configures call stack sampling for detailed call graph views and debugging.

  • Enable CSS: Toggle to enable Call Stack Sampling.

  • Call Stack Collection Mode: Method for collecting call stacks (e.g., Frame Pointers).

  • Call Stack Unwind Size: Maximum stack size (in bytes) for sample collection (default: 1).

Profile Scheduling

Controls profiling start, duration, and API-based instrumentation for data collection.

  • Limit Data Collection by Time (MS): Restricts profiling duration in milliseconds.

  • Enable Data Capture: Toggle to start capturing profile data.

  • Are you using Profile Instrumentation API?: Indicates if the application uses profiling API for control.

  • Start Profiling After (seconds): Delay before profiling starts.

  • Profile Duration (seconds): Duration for which profiling runs.

Controls profiling start, duration, and API-based instrumentation for data collection.

  • Limit Data Collection by Time (MS): Restricts profiling duration in milliseconds.

  • Enable Data Capture: Toggle to start capturing profile data.

  • Are you using Profile Instrumentation API?: Indicates if the application uses profiling API for control.

  • Start Profiling After (seconds): Delay before profiling starts.

  • Profile Duration (seconds): Duration for which profiling runs.

Sources

Specifies source file paths for code attribution and bottleneck identification.

  • Root to Sources: Path to root of source files.

  • Sources Directory: Directory containing source files; includes Browse option.

Specifies source file paths for code attribution and bottleneck identification.

  • Sources Directory: Directory containing source files. Also includes Browse option.

Symbols

Defines symbol and server locations for resolving function names accurately.

  • Symbol Configuration: Path to root of source files.

    You can specify the symbol path in Linux, by providing the .dwp file path.

Configures symbol and server locations for resolving function names accurately.

  • Use Microsoft Symbol Server: Toggle to enable Microsoft Symbol Server.

  • Symbol Search Path: Path for symbol files.

  • Add Symbol Server: Option to add additional symbol servers.

  • Minimum Threshold for Symbol Downloads: Sets threshold for downloading symbols.

Data Aggregation Option

Specify the coarseness level of data aggregation: Low, Medium, or High. The default level is Medium.

The lower the coarseness level, the more fine-grained the data plotted in the timeline views.

Specify the coarseness level of data aggregation: Low, Medium, or High. The default level is Medium.

The lower the coarseness level, the more fine-grained the data plotted in the timeline views.

7.2.4. Start Profile

Once all the options are set correctly, click the Start Profile button to start the profile and collect the profile data. After the profile initialization, the following screen is displayed.

Figure 7.5 Profile Data Collection

The time elapsed during the data collection is displayed. When the profiling is in progress, click:

7.3. AMDuProf Overhead Estimation

The Overhead Estimation feature in AMDuProf provides insights into the overhead introduced during data collection and processing while profiling.

This capability helps users understand the impact of profiling on system performance and make informed decisions when selecting profiling configurations.

Supported and Unsupported Profiling Types

Table 7.3 Overhead Support for Profiling (Other Than CPU Profiling)

| Profiling Type | Overhead Feature Support |
|---|---|
| CPU Tracing | Not Supported |
| GPU Profiling | Not Supported |
| GPU Tracing | Not Supported |
| Live Power Profile | Not Supported |
| MPI Tracing | Not Supported |
| OpenMP Tracing | Not Supported |
| OS Runtime Tracing | Not Supported |

Table 7.4 Overhead Support for Predefined Configurations (CPU Profiling)

| Profiling Type | Overhead Feature Support |
|---|---|
| Assess Performance | Supported |
| Assess Performance (Extended) | Supported |
| Cache Analysis | Supported |
| Hotspots | Supported |
| Instruction Based Sampling | Supported |
| Investigate Branching | Supported |
| Investigate CPI | Supported |
| Investigate Data Access | Supported |
| Investigate Instruction Access | Supported |
| Overview | Supported |
| Threading Analysis | Supported |
| Time Based Sampling | Supported |

For unsupported configurations, a message stating that no overhead data is available for the selected configuration is displayed.

There are three categories of overheads:

Figure 7.6 Sample High Overhead Message

There are two types of overheads:

Before proceeding with the profiling experiment, AMDuProf provides a suggestion regarding the overhead based on the current configuration. This suggestion helps users decide whether to proceed with the current configuration or modify it based on the expected overhead. There are multiple factors influencing overheads.

  1. Predefined Configurations: In predefined configurations, certain events such as overview and threading have a high overhead in both data collection and processing.

  2. Throttling Events: If there are one or more throttling events present in either predefined or custom configurations, the processing overhead will be high, and the collection overhead will be Fair.

  3. Sampling Intervals and Frequencies:

    1. Threshold for Sampling Intervals/Frequencies: A threshold exists for sampling intervals and frequencies.

    2. Higher Sampling Interval/Lower Sampling Frequency: Decreases collection and processing overhead, because fewer samples are collected.

    3. Lower Sampling Interval/Higher Sampling Frequency: Increases collection and processing overhead, because more samples are collected.

    This balance affects the efficiency of data collection and processing.

    In addition to overhead information, you get an estimate of the profiling time based on the configuration and overhead type.

    You can evaluate the overhead suggestion and decide whether to proceed with the current configuration or modify it for optimal performance. This guidance helps balance the desired profiling data and the overhead it may incur on the system.

    For configurations not supported by the overhead feature, a message stating that no overhead data is available for the selected configuration is displayed.
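To make the role of the sampling interval concrete, the sketch below estimates how many samples a timer-based profile would produce. It is only back-of-the-envelope arithmetic: the function name is hypothetical, and the assumption that every thread is busy for the whole run is a simplification. It does not reproduce how AMDuProf computes its overhead suggestion.

```python
def expected_samples(duration_s, interval_ms, threads=1):
    """Approximate sample count for a timer-based profile.

    Assumes (for illustration only) that each thread is running for the
    whole duration and is sampled once per interval.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return int(duration_s * 1000 / interval_ms) * threads


if __name__ == "__main__":
    for interval_ms in (1, 10, 100):
        count = expected_samples(duration_s=10, interval_ms=interval_ms, threads=4)
        print(f"interval {interval_ms} ms: about {count} samples")
```

The sample count scales inversely with the interval, so the amount of data to collect and process grows as the interval shrinks.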

Figure 7.7 Sample Overhead Message for Unsupported Configuration

7.4. Analyze Profile Data

When profiling stops, the collected raw profile data is processed automatically, and you can analyze the profile data using the following UI sections to identify potential performance bottlenecks:

The sections available depend on the profile type. The CPU Profile has SUMMARY, ANALYZE, MEMORY, HPC, and SOURCES pages to analyze the data.

7.4.1. Overview of Performance Hotspots

When the translation is complete, the SUMMARY page will be populated with the profile data and Hot Spots screen will be displayed. The SUMMARY page provides an overview of the hotspots for the profile session through various screens such as Hot Spots and Session Information.

In the Hot Spots screen, hotspots are displayed for functions, modules, processes, and threads. Processes and threads are displayed only if there is more than one.

The following figure shows the Hot Spots screen:

Figure 7.8 Summary - Hot Spots Screen

In the above Hot Spots screen:

Based on the selection, one donut is displayed at a time.

7.4.1.1. Summary Overview

Table 7.5 Summary Overview

Data Collected

Table Present

Description

Timing Details

OS Trace

Schedule Summary

Summary of per thread running/wait time (percentages).

  • Profile Duration

  • Parallel Time

  • Serial Time

  • Wait Time

  • Sleep Time

OS Trace

Wait Object Summary

Time spent in operations related to several types of synchronization objects, such as locks, mutexes, and condition variables.

  • Profile Duration

  • Parallel Time

  • Serial Time

  • Wait Time

  • Sleep Time

OS Trace

Wait Function Summary

Time spent in several types of pthread blocking functions, such as pthread_join.

  • Profile Duration

  • Parallel Time

  • Serial Time

  • Wait Time

  • Sleep Time

OS Trace

Syscall Summary

Time spent in syscall(s)

  • Profile Duration

  • Parallel Time

  • Serial Time

  • Wait Time

  • Sleep Time

GPU Trace

GPU Kernel Summary

Time spent per GPU kernel in execution in the enqueued device.

Profile Duration

GPU Trace

Data Transfer Summary

Time spent in GPU data copy operations.

Profile Duration

MPI Trace

MPI P2P API Summary

Time spent in various MPI P2P APIs across all ranks of the profile.

  • Profile Duration

  • Parallel Time

  • Serial Time

  • MPI Time

MPI Trace

MPI Collective API Summary

Time spent in various MPI collective communication API across all ranks of the profile.

  • Profile Duration

  • Parallel Time

  • Serial Time

  • MPI Time

CPU Profile

Hot Functions

Hottest functions based on CPU profile.

  • Profile Duration

  • Parallel Time

  • Serial Time

CPU Profile

Hot Modules

Hottest modules based on CPU profile.

  • Profile Duration

  • Parallel Time

  • Serial Time

CPU Profile

Hot Threads

Hottest threads based on CPU profile.

  • Profile Duration

  • Parallel Time

  • Serial Time

CPU Profile

Hot Processes

Hottest processes based on CPU profile.

  • Profile Duration

  • Parallel Time

  • Serial Time

7.4.1.2. OS Trace Screen

Figure 7.9 OS Trace Screen

7.4.1.3. GPU Trace Screen

Figure 7.10 GPU Trace Screen

7.4.1.4. MPI Trace Screen

Figure 7.11 MPI Trace Screen

7.4.1.5. CPU Profile

The CPU Profile screen is similar to the Summary - Hot Spots screen.

7.4.2. Thread Concurrency Graph

Click ANALYZE > Thread Concurrency to view the following graph to analyze the thread concurrency of the profiled application.

Figure 7.12 Summary - Thread Concurrency Graph

The thread concurrency graph displays the duration (in seconds) of the specific number of threads that were running simultaneously.

A bucketization approach is used for this graph. Instead of showing the elapsed time for each core, the weighted average based on the bucket size is taken. The bucket size is determined by the number of cores and the number of available pixels. This avoids horizontal scrolling.
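The quantity plotted in the thread concurrency graph can be illustrated with a small sweep over thread running intervals. This is a minimal sketch under assumed inputs: the (start, end) interval format and the function name are hypothetical, and the sketch does not reproduce AMDuProf's bucketization step.

```python
from collections import defaultdict


def concurrency_histogram(intervals):
    """Map 'number of threads running at once' to total seconds at that level.

    intervals: list of (start, end) times in seconds, one per running slice.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a thread starts running
        events.append((end, -1))    # a thread stops running
    # At equal timestamps, -1 sorts before +1, so back-to-back slices
    # are not counted as overlapping.
    events.sort()

    histogram = defaultdict(float)
    running = 0
    previous = None
    for time, delta in events:
        if previous is not None and running > 0:
            histogram[running] += time - previous
        running += delta
        previous = time
    return dict(histogram)


if __name__ == "__main__":
    # Three threads with overlapping running intervals (made-up data).
    print(concurrency_histogram([(0, 4), (1, 3), (2, 5)]))
```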

7.4.3. Function HotSpots

Click ANALYZE on the top horizontal navigation bar to go to Function Hotspots screen, which displays the hot functions across all the profiled processes and load modules as follows. You can view the following:

Figure 7.13 ANALYZE - Function Hotspots

Process and thread wise breakdown of data is available if the entries are expanded in Function Hotspots View. The Functions table lists the hot functions. The IP samples are aggregated and attributed at the function-level granularity. On the table, you can do the following:

The Filters and Options pane allows you to filter the profile data as follows:

If callstack is enabled, the unique hot call-paths for the selected function are displayed in the Functions column.

Event Timeline is a line graph showing the number of aggregated sample values over time. You can use it to identify the hot functions within a profile region. From the Select Metric drop-down, you can select the event for which the event timeline must be plotted.

Not all entries are loaded for a profile. To load more than the default number of entries, click the vertical scroll bar on the right. When the entries are expanded, process and thread-wise breakdown of data is available.

7.4.4. Process and Functions

Click ANALYZE > Grouped Metrics to display the profile data table at various program unit granularities - Process, Load Modules, Threads, and Functions. This screen contains data in two different formats as follows:

Figure 7.14 Summary - Analyze - Grouped Metrics

The upper tree represents samples grouped by Process. You can expand the tree to view the child entries for each parent (that is, for a process). The Load Modules and Threads are child entries for the selected process entry.

You can right-click to view the following options:

The lower Functions table contains samples attributed to the corresponding functions. The function entries depend on what is selected in the upper tree. For more specific data, you can select a child entry from the upper tree and the corresponding function data will be updated in the lower table. You can do any of the following:

You can use the Filters and Options pane to filter the profile data displayed by various controls.

You can use the System Modules option to Exclude or Include the profile data attributed to system modules.

Confidence level

The metrics that cannot be calculated reliably due to the low number of samples collected for a program unit are grayed out.

Not all entries are loaded for a profile. To load more than the default number of entries, click the vertical scroll bar on the right.

7.4.5. Source and Assembly

Double-click on any entry in the Functions table in the Metrics screen to load the source tab for the corresponding function in SOURCES page. If the GUI can find the path to the source file for that function, then it will try to open the file, failing which you will be prompted to locate it.

The following figure depicts the source and assembly screen.

Figure 7.15 SOURCES - Source and Assembly

The following sections are present in the SOURCES screen:

Table 7.6 Sources Screen Options

Feature

Description

Filter Pane

Lets you filter the profile data based on the following options:

  • Select View - Controls which counters are displayed in the Sources and Assembly view. Counters and their related metrics are grouped into predefined views. Choose a view (e.g., Event Count, CPU Time) from the Select View drop-down to decide what performance data appears alongside source lines and assembly instructions.

  • Value Type - Select how metric values are represented, to view the counter values as follows:

    • Sample Count is the number of samples attributed to a function.

    • Event Count is the product of sample count and sampling interval.

    • Percentage is the percentage of samples collected for a function.

  • Process - Lists all the processes on which this selected function is executed and has samples. Select the process being analyzed to filter profiling data.

  • Threads - Lists all the threads on which this selected function is executed and has samples. Choose one or multiple threads for focused analysis.

For multi-threaded or multi-process applications, if a function is executed from multiple threads or processes, each of them is listed in the Process and Threads drop-down lists in the Filters pane. Changing them updates the profile data for that selection. By default, profile data for the selected function, aggregated across all processes and all threads, is displayed.

Show Assembly

Toggle to enable or disable the assembly view for associating the source code with machine instructions.

Search Pane

Provides options to locate specific code or instructions. After providing the following search criteria, click Search to execute.

  • Search Source Code - Enter keywords or function names to find specific valid source lines.

    The source lines of the selected function are listed and the corresponding metrics are populated in various columns against each source line. If no samples are collected when a source line was executed, the metrics column will be empty.

  • Search Assembly - Enter addresses to find specific assembly instructions.

    The assembly instruction of the corresponding highlighted source line is displayed. The tree also includes the offset for each assembly instruction along with metrics.

After clicking the Search button, the Forward and Backward navigation buttons are enabled, to navigate through the search results.

Select Source Line(s) Ordering Type

Choose the ordering of source lines in the view (e.g., by metric value, by line number) for easier hotspot analysis.

HeatMap Event

Overview of the hotspots at source level.

Copy Options

Provides multiple ways to copy data for analysis and reporting.

  • Copy current cell data

  • Copy all row(s) in range selected

  • Copy data with associated source

  • Copy all associated assembly

  • Copy all associated assembly with source

  • Copy all rows with samples

Note

If the source file cannot be located or opened, only disassembly will be displayed.
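The three Value Type representations described in the Filter Pane are related by simple arithmetic. The following sketch shows that relationship for made-up sample counts. The function name, the input format, and the choice of the total across the listed functions as the percentage base are assumptions for illustration.

```python
def value_types(sample_counts, sampling_interval):
    """Return sample count, event count, and percentage per function.

    sample_counts: {function name: number of samples attributed to it}
    sampling_interval: events per sample for the profiled event
    """
    total = sum(sample_counts.values())
    result = {}
    for function, samples in sample_counts.items():
        result[function] = {
            "sample_count": samples,
            # Event count is the product of sample count and sampling interval.
            "event_count": samples * sampling_interval,
            # Assumption: percentage is relative to all samples passed in.
            "percentage": 100.0 * samples / total if total else 0.0,
        }
    return result


if __name__ == "__main__":
    table = value_types({"compute": 30, "io_wait": 10}, sampling_interval=250000)
    for function, values in table.items():
        print(function, values)
```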

7.4.6. Top-down Callstack

The Top-down Callstack view can be used to explore the call-sequence flow of the application and to analyze the time spent in a function and its callees.

Click ANALYZE > Top-down Callstack to view it as follows:

Figure 7.16 Top-down Callstack

Functions are displayed as parent-to-child entries, sorted by their inclusive sample values.

The inclusive sample value of a function covers the function and its descendants.

The Hide C++ std Library Calls option has an effect only when C++ library calls are made. It excludes such calls from the list and displays the other child entries.

In the context menu, Collapse Entries closes all the expanded entries, Expand Entries expands the child entries, and Open Source View displays the corresponding source view.

7.4.7. Flame Graph

Flame graph is a visualization of sampled call-stack traces to quickly identify the hottest code execution paths. Click ANALYZE > Flame Graph to view it as follows:

Figure 7.17 ANALYZE - Flame Graph

The x-axis of the flame graph shows the call-stack profile and the y-axis shows the stack depth. It is not plotted based on the passage of time. Each cell represents a stack frame; the more often a frame is present in the call-stack samples, the wider its cell. This screen has the following options:

Click the Zoom Graph button for a better zooming experience.

When you type a function name in the search box, a list of all the relevant matches will be displayed. Select the required function to highlight the cells corresponding to that function in the flame graph.

The Process drop-down lists all the processes for which call-stack samples are collected. Changing the process will plot the flame graph for that particular process.

For multi-threaded applications, the flame graph will be plotted for the cumulative data of all the threads by default.

The Threads drop-down lists all the threads for which call-stack samples are collected. Changing the thread will plot the flame graph for that thread.

The Select Metric drop-down lists all the metrics for which call-stack samples are collected. Changing the metric will plot the flame graph for that particular metric.
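The way call-stack samples translate into cell widths can be sketched as follows. The code folds identical stacks and counts, for every frame position, how many samples pass through it; that count is what determines the relative width of a cell. The input format (each stack as a list from root to leaf) and the function names are assumptions for illustration, not AMDuProf's internal representation.

```python
from collections import Counter


def fold_stacks(stacks):
    """Count identical call stacks; each stack is a list from root to leaf."""
    return Counter(";".join(stack) for stack in stacks)


def frame_widths(stacks):
    """Return {(depth, call-path prefix): sample count} for every cell."""
    widths = Counter()
    for stack in stacks:
        for depth in range(len(stack)):
            # A cell is identified by its depth and the path leading to it.
            widths[(depth, tuple(stack[: depth + 1]))] += 1
    return widths


if __name__ == "__main__":
    samples = [["main", "a", "b"], ["main", "a"], ["main", "c"]]
    for (depth, path), count in sorted(frame_widths(samples).items()):
        print(depth, "->".join(path), count)
```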

7.4.8. Call Graph

Click ANALYZE > Call Graph to navigate to the call graph screen. This graph is constructed using the call-stack samples and offers a butterfly view to analyze the hot call-paths as follows:

Figure 7.18 ANALYZE - Call Graph

The Function table lists all the functions with inclusive and exclusive samples. Click a function to display its caller and callee functions in a butterfly view. In addition, the parents and children of the function selected in the Function table are displayed.
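A minimal sketch of how inclusive samples, exclusive samples, and the caller and callee sides of a butterfly view can be derived from call-stack samples is shown below. The input format (each stack as a list from root to leaf) and all names are assumptions for illustration; this is not AMDuProf's implementation.

```python
from collections import Counter, defaultdict


def call_graph(stacks):
    """Derive inclusive/exclusive samples and caller/callee edges.

    stacks: list of call stacks, each ordered from root to leaf.
    """
    inclusive, exclusive = Counter(), Counter()
    callers, callees = defaultdict(Counter), defaultdict(Counter)
    for stack in stacks:
        if not stack:
            continue
        for function in set(stack):      # count recursion once per sample
            inclusive[function] += 1
        exclusive[stack[-1]] += 1        # the leaf frame was executing
        for parent, child in zip(stack, stack[1:]):
            callers[child][parent] += 1
            callees[parent][child] += 1
    return inclusive, exclusive, callers, callees


if __name__ == "__main__":
    samples = [["main", "a", "b"], ["main", "a"], ["main", "c"]]
    inclusive, exclusive, callers, callees = call_graph(samples)
    print("inclusive:", dict(inclusive))
    print("exclusive:", dict(exclusive))
    print("callers of a:", dict(callers["a"]))
    print("callees of main:", dict(callees["main"]))
```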

Options

7.4.9. All Thread Timeline

7.4.9.1. Timeline Analysis GUI in Linux

To configure threading analysis from the GUI:

  1. Navigate to the Select Profile Configuration screen.

  2. Select Predefined Configs from the tab.

  3. Select Threading Analysis from the left vertical pane.

Profile data collected from CLI or GUI can be visualized in GUI by importing the session. On importing, the following section (Thread Timeline) is displayed on the ANALYZE page.

Time-series data is plotted in timelines per entity (thread, rank, device, and so on). Trace data (if collected) is plotted only when you zoom into the timeline, to address scalability issues related to data size (trace data can have millions of records, which are not visually legible if plotted together). The entire view is broadly separated into three vertical parts: top data selectors, middle timelines, and bottom filters. You can use the timeline as follows:

Figure 7.19 Timeline Analysis GUI in Linux

The timeline section consists of:

  1. Name of each thread in timeline with Thread ID.

  2. Click the Load More button to load more threads. By default, only a small number of thread timelines are loaded to limit resource consumption. This button loads the next set of thread timelines. The next set is determined by the entries in the table below the timeline.

  3. Select the Data Source drop-down to enable selection of data to display on the timeline. Different types of data source are as follows:

  4. The Select Trace Overlay drop-down enables selection of the type of trace data to display.

  5. Trace Cutoff can be used to specify a duration in nanoseconds, which acts as a cutoff to load the trace data, that is, any traced function which takes less than the specified nanoseconds will not be displayed.

  6. Click the Reset Zoom button to reset any zoom performed earlier.

  7. Hover over any timeline to view the tool-tip containing the relevant data along with the timestamp. If trace data is also present, the relevant traced functions are shown with their start time and duration.

  8. Filter Threads/Ranks enables you to filter which thread's (or rank's) timelines must be displayed. By default, the timelines are sorted internally and the first 6 are loaded. However, from the table, you can select the required threads and click Apply Filter to apply the changes. If CPU profile data is collected, highlighting functions or modules is also possible. Each function is assigned a random color, which can be modified and highlighted in the timeline (a highlight implies there are samples from the function/module).

  9. Each entry in the filter table has the necessary data, that is, name, parent object, and samples/trace times aggregated across the profile.

  10. Click the Apply Filter button to apply a custom selection of entities or highlight entities in the timeline. If GPU acceleration is available, there is no need to click Apply Filter because the changes are reflected instantaneously.

  11. Click Deselect selected Items to deselect all the entries in the filtering table except the first one. This is useful when a custom selection is required but all timelines are already loaded.

  12. At the bottom of the filtering pane, timeline legend is displayed, which helps in identifying how each type of data source or trace is mapped to which color.

  13. The Show Core Transition button is disabled by default and works only when the CPU profiling data is collected. When enabled, a red line is displayed in each timeline to signify when a thread changes the core.
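The effect of the Trace Cutoff option described in the list above can be sketched as a simple filter over trace records. The record layout (a dictionary with a duration_ns field) and the function name are hypothetical.

```python
def apply_trace_cutoff(trace_records, cutoff_ns):
    """Keep only traced calls that lasted at least cutoff_ns nanoseconds."""
    return [r for r in trace_records if r["duration_ns"] >= cutoff_ns]


if __name__ == "__main__":
    # Made-up trace records for illustration.
    records = [
        {"function": "pthread_mutex_lock", "duration_ns": 120},
        {"function": "read", "duration_ns": 45000},
        {"function": "pthread_join", "duration_ns": 2000000},
    ]
    for record in apply_trace_cutoff(records, cutoff_ns=1000):
        print(record["function"], record["duration_ns"])
```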

Note

When you enable CPU profiling (along with other data sources), you can highlight functions and modules in the timeline across threads. The tool lists them in tables under the Filtering Threads option, ordered by CPU sample data. You can select multiple functions or modules to highlight, and they appear as overlays in the timeline view. You can also change the color for each function or module, and the overlay updates accordingly.

Figure 7.20 Function Highlights in Timeline

You can highlight tasks if the profiled application is instrumented using the Instrumentation API. The timeline displays tasks across threads, and you can select them from the Highlight Tasks tab. This tab presents task data grouped by domain and sorted by the total time spent across all task instances. To focus the view, click the Show only associated Threads button to display only those threads that contain at least one instance of the selected tasks.

7.4.9.2. Region Selection

When you select a region in the timeline view by clicking and dragging with the mouse, uProf generates aggregate data for that region. Depending on the configuration and collected data, the following types of aggregate data are displayed:

  1. Function Hotspot – visible only if CPU samples are collected

  2. Flame Graph – visible only if CPU samples are collected

  3. Wait Object Hotspot – visible only when using the Threading Analysis configuration
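The aggregation behind a region selection can be sketched as filtering samples by timestamp and counting them per function. The (timestamp, function) sample format and the function name are assumptions for illustration; the sketch covers only the Function Hotspot case.

```python
from collections import Counter


def region_hotspots(samples, region_start, region_end):
    """Aggregate CPU samples that fall inside [region_start, region_end].

    samples: iterable of (timestamp in seconds, function name) pairs.
    Returns (function, sample count) pairs, hottest first.
    """
    counts = Counter(
        function
        for timestamp, function in samples
        if region_start <= timestamp <= region_end
    )
    return counts.most_common()


if __name__ == "__main__":
    # Made-up samples for illustration.
    cpu_samples = [(0.1, "f"), (0.5, "g"), (0.6, "g"), (0.9, "f"), (1.5, "h")]
    print(region_hotspots(cpu_samples, 0.4, 1.0))
```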

Figure 7.21 Region Selection in Timeline

7.4.10. Per Thread Timeline

The Per Thread Timeline view shows all aspects of a specific thread based on the collected data. It can therefore show CPU profile samples, OS traces, GPU traces, and system metrics at the per-thread level. The selection table in the bottom pane sorts the threads by the first event in the table, and you can switch threads from the table. Function/Module highlights work in the same way as in the All Thread Timeline.

Figure 7.22 Per Thread Timeline

Each per-thread timeline displays two types of flame graphs:

The callstack flame graph is only available for the following profile configurations:

If GPU tracing data is also collected for applications using HIP APIs, this view also plots the GPU utilization, GPU Memory, and GPU power for all the GPUs to which kernels were scheduled from current thread. As a single thread can schedule kernels on multiple GPUs, one line is plotted for each such device, the device info being present in the tooltip.

Figure 7.23 Per Thread Timeline - Tooltip

7.4.11. Notes on GPU Acceleration

GPU acceleration is available only if OpenGL drivers exist on the system. This applies to both Windows and Linux. Otherwise, the tool automatically falls back to the CPU implementation, which is less performant. This is also the case when using remote X11 servers, that is, when launching the Linux UI on Windows with MobaXTerm-like tools. The minimum expected OpenGL version is 3.1 on both Linux and Windows. AMDuProf tries to detect the version automatically. If detection fails and the tool falls back to CPU rendering even though OpenGL (>= 3.1) is available, you can use the environment variables listed in the following table to tweak the behavior.

Table 7.9 Tweak Environment Variables

| Environment Variable | Purpose | Default Value |
|---|---|---|
| AMDUPROF_OPENGL_MAJOR | Specify the major version of OpenGL, minimum 3. | 3 |
| AMDUPROF_OPENGL_MINOR | Specify the minor version of OpenGL for the specified major version. | 1 |
| AMDUPROF_OPENGL_PROFILE | Specify the OpenGL profile. Valid values are core, compatibility, and none (none implies that the default profile is selected). | none |
| AMDUPROF_OPENGL_TYPE | Specify whether to use desktop OpenGL. This can be used to disable GPU accelerated graphics entirely if it does not work as intended. Valid values are desktop and none. | desktop |
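The following sketch builds an environment containing the variables from Table 7.9, which can then be passed to whatever launches the GUI. The variable names, valid values, and defaults come from the table; the helper name and its validation rules are assumptions for illustration.

```python
import os

# Valid values as listed in Table 7.9.
VALID_PROFILES = {"core", "compatibility", "none"}
VALID_TYPES = {"desktop", "none"}


def opengl_environment(major=3, minor=1, profile="none", gl_type="desktop", base=None):
    """Return a copy of the environment with the OpenGL overrides set."""
    if major < 3:
        raise ValueError("Table 7.9 gives 3 as the minimum major version")
    if profile not in VALID_PROFILES:
        raise ValueError(f"profile must be one of {sorted(VALID_PROFILES)}")
    if gl_type not in VALID_TYPES:
        raise ValueError(f"gl_type must be one of {sorted(VALID_TYPES)}")
    env = dict(os.environ if base is None else base)
    env.update({
        "AMDUPROF_OPENGL_MAJOR": str(major),
        "AMDUPROF_OPENGL_MINOR": str(minor),
        "AMDUPROF_OPENGL_PROFILE": profile,
        "AMDUPROF_OPENGL_TYPE": gl_type,
    })
    return env


if __name__ == "__main__":
    env = opengl_environment(profile="core", base={})
    for name, value in sorted(env.items()):
        print(f"{name}={value}")
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the GUI executable from a script; the path to that executable depends on the installation and is not shown here.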

7.4.12. IMIX View

IMIX view shows the summary of instruction-wise samples collected. This view is shown only for IBS profiling. Click ANALYZE > IMIX to navigate to the IMIX view.

Figure 7.24 IMIX View

7.4.13. Wait Object Hotspots

The Wait Object Hotspots view shows wait object data in detail. Several groupings are available for in-depth analysis, and each entry can be expanded level by level. Click ANALYZE > Wait Object Hotspots to navigate to the Wait Object Hotspots view. For more information, refer to Wait Time Analysis.

Figure 7.25 Wait Object Hotspots

7.5. Hotspots Analysis

Hotspots Analysis is the starting point for algorithm analysis of an application. Use Hotspots Analysis to understand the application code flow and to find the sections of code that consume the most execution time (CPU time).

7.5.1. User Mode Sampling

User mode sampling embeds an agent library into the application address space using LD_PRELOAD. The agent creates a per-thread OS timer (the default timer interval is 10 ms) that interrupts the execution of a thread by generating SIGPROF or a real-time signal. When a thread receives the signal, the agent’s signal handler collects an IP sample and, if callstack collection is enabled, the callstack for that sample. The collected data is stored in binary files for later processing.
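The sampling mechanism described above can be illustrated in miniature. The following sketch is not the AMDuProf agent. It is a Python analogue, assuming Linux and execution on the main thread, that arms a process-wide profiling timer (a simplification of the per-thread timer) at the 10 ms default interval and records which function was interrupted each time SIGPROF arrives. The names sample_hot_functions and busy_loop are hypothetical.

```python
import collections
import signal
import time


def sample_hot_functions(workload, interval_s=0.010):
    """Run workload() and count, per function name, how often SIGPROF interrupted it."""
    counts = collections.Counter()

    def on_sigprof(signum, frame):
        # 'frame' is the Python frame that was executing when the signal was handled.
        if frame is not None:
            counts[frame.f_code.co_name] += 1

    previous = signal.signal(signal.SIGPROF, on_sigprof)
    # ITIMER_PROF counts CPU time used by the process and delivers SIGPROF on expiry.
    signal.setitimer(signal.ITIMER_PROF, interval_s, interval_s)
    try:
        workload()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # disarm the timer
        signal.signal(signal.SIGPROF, previous)     # restore the old handler
    return counts


def busy_loop(cpu_seconds=0.3):
    """Burn roughly cpu_seconds of CPU time so that samples are generated."""
    deadline = time.process_time() + cpu_seconds
    total = 0
    while time.process_time() < deadline:
        total += 1
    return total
```

Calling sample_hot_functions(busy_loop) returns a counter in which busy_loop dominates. Because the handler replaces any existing SIGPROF handler for the duration of the run, the sketch also shows why an application with its own SIGPROF handler needs a different profiling signal.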

7.5.2. Callstack Stitching

In OpenMP applications, if a parallel region is executed by multiple threads (master and worker threads), each worker thread has its own calling sequence (call path or callstack), which logically starts where the master thread encountered the parallel region. During translation, AMDuProf stitches each worker thread’s call path to the master thread’s call path at the point where the parallel region started, provided the worker threads are active in the parallel region. This attributes the run time of the worker threads to the correct logical calling sequence of the program (that is, the calling sequence without OpenMP) so that AMDuProf can produce accurate flame graphs.

By default, AMDuProf detects the compiler used to build the application by reading the .comment section and stitches the worker threads’ call paths with the master thread’s call path. If AMDuProf fails to stitch the call paths, select omplib as the OpenMP implementation type for GCC-compiled applications and ompt for AOCC-, ICC-, and LLVM-compiled applications.
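The idea of stitching can be shown with plain lists of frame names. This is an illustrative sketch only, not AMDuProf’s translation logic. The function name stitch_call_path and all frame names are hypothetical, and call paths are assumed to be ordered from the outermost frame to the innermost.

```python
def stitch_call_path(master_path, region_frame, worker_path):
    """Prefix a worker's call path with the master's frames up to the parallel region.

    master_path:  master thread frames, outermost first
    region_frame: the master frame where the parallel region started
    worker_path:  worker thread frames collected inside the region, outermost first
    """
    if region_frame not in master_path:
        # Nothing to stitch onto: keep the worker path as collected.
        return list(worker_path)
    fork_index = master_path.index(region_frame)
    return list(master_path[: fork_index + 1]) + list(worker_path)


stitched = stitch_call_path(
    ["main", "solve", "omp_parallel_region"],
    "omp_parallel_region",
    ["loop_body", "kernel"],
)
```

After stitching, the worker samples in kernel are attributed to main > solve > omp_parallel_region > loop_body > kernel, the same logical path a serial run would show.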

7.5.3. Data Collection

7.5.3.1. Data Collection Using GUI

To launch the AMDuProf GUI, go to Home > Welcome page.

  1. Click Profile an Application on the Welcome page.

  2. Provide the application path, application options, working directory, and environment variables, if any. Click Next.

  3. From Predefined Configs, select Hotspots.

  4. Set the timer interval and profiling signal.

  5. From Advanced Options, select the OpenMP implementation type, callstack collection, and callstack unwind depth.

  6. Click Start Profile to start the profiling.

7.5.3.2. Data Collection Using CLI

When profile data collection completes, a session directory is generated. Use the session directory to generate the .csv report or to import the session into the GUI. For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.

Example

./AMDuProfCLI collect --config hotspots -g -o /tmp/ /tmp/ScimarkStable
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-Hotspots_Sep-05-2024_21-43-08.

Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-Hotspots_Sep-05-2024_21-43-08.
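A script that chains collection and reporting needs the session directory from the collect output. The following is a minimal sketch, assuming the output contains a "Generated data files path:" line as in the example above and that the path contains no spaces. The helper name session_dir_from_output is hypothetical.

```python
import re


def session_dir_from_output(output):
    """Return the session directory printed by a collect run, or None if absent."""
    match = re.search(r"^Generated data files path:\s*(\S+)", output, re.MULTILINE)
    if match is None:
        return None
    # The example output ends the line with a period; strip it from the path.
    return match.group(1).rstrip(".")


sample_output = (
    "Profiling started\n"
    "Profiling (data collection) completed\n"
    "Generated data files path: "
    "/tmp/AMDuProf-ScimarkStable-Hotspots_Sep-05-2024_21-43-08.\n"
)
session = session_dir_from_output(sample_output)
report_cmd = ["./AMDuProfCLI", "report", "-i", session]
```

The resulting report_cmd list can be passed to subprocess.run to generate the report for the session that was just collected.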

Hotspot Analysis Options

Table 7.10 Hotspots Analysis options

CLI Option

GUI Option

Description

--call-graph-depth <depth>

Predefined Configs > Hotspots > Advanced Options > Call Stack Unwind Depth

Provide the depth of stack frames to be collected. By default, 32 frames will be collected; if the application has a greater number of frames in a calling sequence, increase the unwind depth up to 1024.

--openmp-impl <impl>

Predefined Configs > Hotspots > Advanced Options > Select OpenMP Implementation

Provide the OpenMP implementation type used to stitch the call paths of worker threads with the master thread: ompt for OpenMP libraries that support the OMPT interface (for example, LLVM and AOCC), or omplib for the GCC OpenMP library.

Valid only on Linux, when the profile target is a launched application.

--profiling-signal <signal>

Predefined Configs > Hotspots > Profiling Signal

If the application has its own signal handler for SIGPROF, use this option to provide an unused signal in the range SIGRTMIN to SIGRTMAX.

Valid only on Linux, when the profile target is a launched application.

--timer-interval <interval>

Predefined Configs > Hotspots > Timer Interval

Provide the per-thread OS timer interval in milliseconds. The default timer interval is 10 ms.
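The options in Table 7.10 can be assembled into a collect command programmatically. The sketch below is an assumption-laden illustration: the helper name hotspots_collect_cmd is hypothetical, the argument order and the -g flag are copied from the collect example earlier in this section, and the lower bound of 1 for the unwind depth is assumed. The --profiling-signal option is left out because its value format is not described here.

```python
def hotspots_collect_cmd(app, output_dir, depth=None, openmp_impl=None,
                         timer_interval_ms=None, callstack=True,
                         cli="./AMDuProfCLI"):
    """Build an argument list for a Hotspots collect run (hypothetical helper)."""
    cmd = [cli, "collect", "--config", "hotspots"]
    if callstack:
        cmd.append("-g")  # as used in the collect example
    if depth is not None:
        if not 1 <= depth <= 1024:  # Table 7.10: unwind depth up to 1024
            raise ValueError("call graph depth must be between 1 and 1024")
        cmd += ["--call-graph-depth", str(depth)]
    if openmp_impl is not None:
        if openmp_impl not in ("ompt", "omplib"):
            raise ValueError("openmp_impl must be 'ompt' or 'omplib'")
        cmd += ["--openmp-impl", openmp_impl]
    if timer_interval_ms is not None:
        cmd += ["--timer-interval", str(timer_interval_ms)]
    cmd += ["-o", output_dir, app]
    return cmd
```

For example, hotspots_collect_cmd("/tmp/ScimarkStable", "/tmp/", depth=64, openmp_impl="omplib") produces a list that can be passed to subprocess.run without shell quoting concerns.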

7.5.4. Analyze the Data

If the data was collected using the CLI, use Import Session to import the session into the GUI for analysis.

7.5.4.1. CLI Report

Use the following CLI report command to generate the profile report in .csv format, passing the session directory path generated during collection.

AMDuProfCLI report -i <session directory>

See table AMDuProfCLI Report Command Options for a list of all the supported options.

Example

./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-Hotspots_Sep-05-2024_21-43-08
Translation started

Report generation started

Report generation completed...
Generated report file: /tmp/AMDuProf-ScimarkStable-Hotspots_Sep-05-2024_21-43-08/report.csv

Use the Thread Concurrency Graph to analyze how efficiently the application uses the processor cores, that is, how much time a given number of threads spends running on a given number of cores.

7.5.4.2. Identify the Hottest Function

Use Function HotSpots to list the functions that consume the most CPU time (self time and children time). Expand a function to see its processes, and expand a process to see its threads. All functions are sorted in descending order of CPU time.

Select a function to get all the call paths to this function and total CPU time consumed in every call path.

Double click on the function to analyze the instruction level sample attribution for that function using the Source View. See Source and Assembly.

Figure 7.26 Function Hotspots

7.5.4.3. Identify the Hot Code Paths

Use Flame Graph to identify the hottest code paths of an application. The width of each function indicates the CPU time of the function and its callees as a percentage of the total CPU time of the selected process and thread.

Figure 7.27 Flame Graph
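To make the width rule concrete, the following sketch computes the width of every flame graph box from a set of sampled callstacks. It assumes that each sample represents an equal slice of CPU time and that stacks are ordered from the outermost frame to the innermost. The function name flame_widths and the sample data are hypothetical, and this is not how AMDuProf stores its data.

```python
from collections import Counter


def flame_widths(stack_samples):
    """Map each call path prefix (one flame graph box) to its percentage of all samples."""
    total = len(stack_samples)
    if total == 0:
        return {}
    counts = Counter()
    for stack in stack_samples:
        # A sample contributes to every box on its path: the function and all its callers.
        for depth in range(1, len(stack) + 1):
            counts[tuple(stack[:depth])] += 1
    return {path: 100.0 * hits / total for path, hits in counts.items()}


samples = [
    ("main", "a", "b"),
    ("main", "a"),
    ("main", "c"),
    ("main", "a", "b"),
]
widths = flame_widths(samples)
```

Here the box for main > a spans 75% of the graph because three of the four samples have that prefix, which includes the time spent in its callee b.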

Use Top-down Callstack to analyze issues with the call-sequence flow of the application and the total CPU time spent in each function and its callees.

Figure 7.28 Top Down Call Stack

7.5.5. Troubleshoot

export AMDUPROF_MAX_PR_INSTANCES=2000000

7.5.6. Limitations

7.6. Threading Analysis

Use Threading Analysis to identify how efficiently an application uses the processor cores, the contention among application threads due to synchronization, the CPU utilization of the threads, and the wait time of the application threads.

Threading Analysis uses the user mode sampling and tracing approach. It is supported only on Linux. If the application uses libc and libpthread, these libraries must be linked dynamically.

Reference

7.6.1. User Mode Tracing

User Mode Tracing embeds an agent library into the application address space using LD_PRELOAD. It interposes the pthread APIs and a few system calls, and collects the start time and end time of each API call along with a few other metrics.

User mode sampling and tracing does not support collecting system-wide performance data or attaching to an already running process or thread.
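The interposition idea can be sketched in Python with a wrapper that records the start and end time of each call. This is only an analogy for what an LD_PRELOAD agent does at the C library level and is not AMDuProf code. The names TRACE and traced are hypothetical.

```python
import functools
import time

TRACE = []  # records of (api name, start time, end time)


def traced(func):
    """Wrap func so that every call appends its start and end time to TRACE."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the call even if the wrapped function raises.
            TRACE.append((func.__name__, start, time.monotonic()))

    return wrapper


# Interpose a sleep call: callers use traced_sleep exactly like time.sleep.
traced_sleep = traced(time.sleep)
traced_sleep(0.01)
```

Each TRACE entry gives the duration of one call, from which per-API wait time totals can be accumulated.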

7.6.2. Data Collection

7.6.2.1. Data Collection Using GUI

To launch the AMDuProf GUI, go to Home > Welcome page.

  1. Click Profile an Application on the Welcome page.

  2. Provide the application path, application options, working directory, and environment variables, if any. Click Next.

  3. From Predefined Configs, select Threading Analysis.

  4. Set the Timer Interval and Profiling Signal.

  5. From Advanced Options, select the OpenMP implementation type, callstack collection, and callstack unwind depth.

  6. Click Start Profile to start the profiling.

7.6.2.2. Data Collection Using CLI

When profile data collection completes, a session directory is generated. Use the session directory to generate the .csv report or to import the session into the GUI.

For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.

Example

./AMDuProfCLI collect --config threading -g -o /tmp/ /tmp/ScimarkStable
Profiling started

Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-Threading_Sep-05-2024_21-48-42

Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-Threading_Sep-05-2024_21-48-42.

System Call Tracing

By default, Threading Analysis traces sleep and wait system calls. Configure system call tracing with Threading Analysis to trace all supported system calls, including I/O and blocking system calls.

Use the following CLI command to collect system call trace data with Threading Analysis:

AMDuProfCLI collect --config threading --trace osrt --osrt-event syscall -o <output-dir> <application>

Threading Analysis options

Table 7.11 Threading Analysis options

CLI Option

GUI Option

Description

--call-graph-depth <depth>

Predefined Configs > Threading Analysis > Advanced Options > Call Stack Unwind Depth

Provide the depth of stack frames to be collected. By default, 32 frames will be collected; if the application has a greater number of frames in a calling sequence, increase the unwind depth up to 1024.

--openmp-impl <impl>

Predefined Configs > Threading Analysis > Advanced Options > Select OpenMP Implementation

Provide the OpenMP implementation type to stitch the call path of worker threads with master thread. ompt for tracing of OpenMP libraries supporting OMPT interface (example: LLVM, AOCC), omplib for tracing GCC OpenMP library. ompt is the default selection.

--osrt-threshold <event:threshold>

Predefined Configs > Threading Analysis

Provide the event name and threshold value.

Note

Use this option with the --trace osrt option.

--profiling-signal <signal>

Predefined Configs > Threading Analysis > Profiling Signal

If the application has its own signal handler for SIGPROF, use this option to provide an unused signal in the range SIGRTMIN to SIGRTMAX.

--timer-interval <interval>

Predefined Configs > Threading Analysis > Timer Interval

Provide the per-thread OS timer interval in milliseconds. The default timer interval is 10 ms.

--collect-sys-modules

Predefined Configs > Threading Analysis > Collect System Module Function(s)

By default, the threading config does not collect the callstack beyond the first standard library function called by the application. Use this option to remove that limit and collect the complete callstack, including the standard library functions.

7.6.3. Analyze the Data

If the data was collected using the CLI, use Import Session to import the session into the GUI for analysis.

7.6.3.1. CLI Report

Use the following CLI report command to generate the profile report in .csv format, passing the session directory path generated during collection.

AMDuProfCLI report -i <session directory>

For a list of all the supported options, refer to AMDuProfCLI Report Command Options.

Example

./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-Threading_Sep-05-2024_21-48-42
Translation started

Report generation started

Report generation completed...
Generated report file: /tmp/AMDuProf-ScimarkStable-Threading_Sep-05-2024_21-48-42/report.csv

Reference

7.6.3.2. Threading Summary

Threading Summary provides a high-level performance snapshot of an application with respect to different timing details.

Refer to Posix Thread APIs and libc System Call Wrapper APIs for the APIs traced with Threading Analysis.

Figure 7.29 GUI Threading Summary

7.6.3.3. Wait Time Analysis

A high wait time means that the application is not getting the full benefit of parallelism. Use Thread Summary to analyze each thread’s total run time, wait time, and wait time as a percentage of the thread’s total time. If a thread has been optimized, its wait time and wait time percentage should drop when you compare the profiles taken before and after the optimization. This helps to identify whether a thread is using the core effectively.
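The wait time percentage is a simple ratio. As a sketch, assuming that a thread’s total time is the sum of its run time and wait time (the exact definition used by the tool is not stated here), the following ranks threads by that percentage. The function names and thread names are hypothetical.

```python
def wait_percentage(run_time, wait_time):
    """Wait time as a percentage of total thread time (run time + wait time)."""
    total = run_time + wait_time
    return 0.0 if total == 0 else 100.0 * wait_time / total


def rank_by_wait(threads):
    """threads: {name: (run_time, wait_time)} -> [(name, wait %)] with the worst first."""
    ranked = [(name, wait_percentage(run, wait)) for name, (run, wait) in threads.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


ranking = rank_by_wait({"worker-1": (3.0, 1.0), "worker-2": (1.0, 3.0), "main": (4.0, 0.0)})
```

In this example worker-2 spends 75% of its time waiting and is the first candidate to investigate, while main never waits.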

Use Wait Object Summary to identify the performance-critical synchronization objects, that is, those with the highest wait time and wait count.

Syscall Summary provides the call count and the total time that the application spent in each system call. It helps to identify the system calls that consume the most time, which are candidates for optimization if they block or wait.

Figure 7.30 GUI Wait Time Analysis Summary

By default, Wait Object Hotspots ranks functions or processes according to their total wait time, which is the time spent blocked on locks, events, semaphores, I/O, and similar objects. You can also sort the results based on the wait count or the percentage of wait time relative to the total execution time. To focus your analysis on a specific period, you can select a time range from the top timeline. For any identified hotspot, you can expand the entry to view the associated wait object(s), the callsite, the full call stack, and—if debug information is available—the corresponding source file and line number.

Four groupings are available. Select one based on your requirements.

Figure 7.31 GUI Wait Time Analysis

Figure 7.32 CLI Threading Summary

7.6.3.4. Timeline

Per Thread Timeline displays the selected thread’s state, CPU utilization, context switch count, running callstacks, and other metrics over time. When you select a time region, the view reveals the thread’s activity and presents an aggregated flame graph that summarizes function calls within that interval.

Figure 7.33 Per Thread Timeline

7.6.4. Troubleshoot

export AMDUPROF_MAX_PR_INSTANCES=2000000

7.6.5. Limitations

7.6.6. Reference

7.6.6.1. Posix Thread APIs

Resource Wait Time APIs

Event Wait Time APIs

Spin Time APIs

Other APIs

pthread_create pthread_exit pthread_cancel

7.6.6.2. libc System Call Wrapper APIs

List of libc APIs traced with threading config.

Sleep APIs

Event Wait Time APIs

Resource Wait Time APIs

IO Sync Time APIs

IO Time APIs

Other APIs

7.7. Overview Analysis

Use Overview Analysis to get a high-level performance snapshot of an application and to identify the hottest functions with their inclusive and exclusive elapsed times, the CPU utilization of the threads, and the wait time of the application threads.

By default, Overview Analysis traces the functions defined in the application whose size is at least 128 bytes. It collects the start time and end time of each function and the time spent in its callees, and stores the data in a raw file for further processing.
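Inclusive and exclusive times follow directly from the start time, end time, and callee time of each traced call. The sketch below derives them from a list of enter and exit events for a single thread. The event format and the function name function_times are hypothetical and do not reflect the raw file layout. Events are assumed to be properly nested, and for recursive functions the inclusive total counts nested calls more than once.

```python
from collections import defaultdict


def function_times(events):
    """events: iterable of (timestamp, "enter" | "exit", name) for one thread.

    Returns {name: (call_count, inclusive_time, exclusive_time)}.
    """
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    stack = []  # open calls as [name, start_time, time_spent_in_callees]
    for timestamp, kind, name in events:
        if kind == "enter":
            stack.append([name, timestamp, 0.0])
            continue
        open_name, start, callee_time = stack.pop()
        if open_name != name:
            raise ValueError(f"unbalanced trace: exit {name} while in {open_name}")
        inclusive = timestamp - start
        entry = stats[name]
        entry[0] += 1
        entry[1] += inclusive
        entry[2] += inclusive - callee_time  # exclusive = inclusive minus callees
        if stack:
            stack[-1][2] += inclusive  # charge this call to the caller's callee time
    return {name: tuple(values) for name, values in stats.items()}


trace = [
    (0, "enter", "main"),
    (1, "enter", "f"),
    (4, "exit", "f"),
    (5, "enter", "f"),
    (7, "exit", "f"),
    (10, "exit", "main"),
]
times = function_times(trace)
```

For this trace, main has an inclusive time of 10 and an exclusive time of 5, because the two calls to f account for the other 5 time units.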

For GPU-intensive applications, Overview Analysis also traces GPU offloading, which includes kernel launch, kernel execution, and data transfer.

Overview Analysis uses the user mode sampling and tracing approach. It is supported only on Linux. If the application uses libc and libpthread, these libraries must be linked dynamically.

Reference

7.7.1. Prerequisites

Here is the list of prerequisites.

7.7.2. Data Collection

7.7.2.1. Data Collection Using GUI

To launch the AMDuProf GUI, go to Home > Welcome page.

  1. Click Profile an Application on the Welcome page.

  2. Provide the application path, application options, working directory, and environment variables, if any. Click Next.

  3. From Predefined Configs, select Overview.

  4. Set the Timer Interval and Profiling Signal.

  5. Click Start Profile to start the profiling.

7.7.2.2. Data Collection Using CLI

When profile data collection completes, a session directory is generated. Use the session directory to generate the .csv report or to import the session into the GUI.

For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.

Example

./AMDuProfCLI collect --config overview -o /tmp/ /tmp/ScimarkStable
Profiling started

Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-Overview_Sep-05-2024_21-59-08

Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-Overview_Sep-05-2024_21-59-08.

Overview Analysis options

Table 7.12 Overview Analysis options

CLI Option

GUI Option

Description

--func-size <size>

NA

By default, Overview Analysis traces functions whose size is at least 128 bytes. To trace functions of a custom size, use this option to set the function size.

--profiling-signal <signal>

Predefined Configs > Overview > Profiling Signal

If the application has its own signal handler for SIGPROF, use this option to provide an unused signal in the range SIGRTMIN to SIGRTMAX.

--timer-interval <interval>

Predefined Configs > Overview > Timer Interval

Provide the per-thread OS timer interval in milliseconds. The default timer interval is 10 ms.

7.7.3. Analyze the Data

If the data was collected using the CLI, use Import Session to import the session into the GUI for analysis.

7.7.3.1. CLI Report

Use the following CLI report command to generate the profile report in .csv format, passing the session directory path generated during collection.

AMDuProfCLI report -i <session directory>

For a list of all the supported options, refer to AMDuProfCLI Report Command Options.

Example

./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-Overview_Sep-05-2024_21-59-08
Translation started

Report generation started

Report generation completed...
Generated report file: /tmp/AMDuProf-ScimarkStable-Overview_Sep-05-2024_21-59-08/report.csv

Reference

7.7.3.2. GPU Offload Analysis

Use GPU Kernel Summary (Analyze the Data) to identify the hottest kernels launched on the GPU, along with their launch count and total execution time on the GPU cores.

Use Data Transfer Summary (Analyze the Data) to identify how much time is spent transferring data between host and device and how many times a data transfer is initiated.

Use Per Thread Timeline to analyze the following.

From this analysis, you can identify whether the application is CPU bound or GPU bound. If the application is GPU bound, use GPU Profiling for further analysis and optimization of kernel execution.

Figure 7.34 Per Thread Timeline

7.7.3.3. Function Trace Analysis

Use the Function Count Summary to identify how many times a function is executed, along with its inclusive and exclusive times.

Figure 7.35 Function Count Summary

7.7.3.4. Timeline

Use Per Thread Timeline to analyze the following from CPU.

Here, the callstack data is collected using function tracing, so it is more accurate than the sampling data reported by Hotspots and Threading Analysis. Refer to Function Tracing to trace custom functions with Overview Analysis.

Figure 7.36 Per Thread Timeline

7.7.3.5. Limitations

7.8. Time-based Profiling

In this analysis, the profile data is periodically collected based on the specified OS timer interval. It is used to identify the hotspots of the profiled applications that are consuming the most time. These hotspots are good candidates for further investigation and optimization.

7.8.1. Configuring and Starting Profile

To configure and start a profile:

  1. Click PROFILE > Start Profiling to navigate to the Select Profile Target screen.

  2. Select the required profile target and click Next. The Select Profiling screen is displayed.

  3. From the Select Profiling screen, select the Predefined Configs tab.

    Figure 7.37 Time-Based Profile – Configure

  4. Select Time-based Sampling in the left vertical pane.

  5. Click Advanced Options to enable call-stack, set symbol paths (if the debug files are in different locations) and other options. See Advanced Options for more information on this screen.

  6. Set all options and click Start Profile to begin profiling.

    After the profile initialization the profile data collection screen is displayed.

7.8.2. Analyzing Profile Data

Complete the following steps to analyze the profile data:

When the profiling stops, the collected raw profile data will be processed automatically and the Hot Spots screen of the Summary page is displayed. The hotspots are shown for the Timer samples. See Overview of Performance Hotspots for more information.

  1. Click ANALYZE on the top horizontal navigation bar to go to the Function HotSpots screen. See Function HotSpots for more information on this screen.

  2. Click ANALYZE > Metrics to display the profile data table at various granularities: Process, Load Modules, Threads, and Functions. Refer to the section Process and Functions for more information on this screen.

  3. Double-click any entry on the Functions table in the Metrics screen to load the source tab for that function in the SOURCES page. Refer to the section Source and Assembly for more information on this screen.

7.9. Micro Architecture Analysis

Micro Architecture Analysis profiling follows a statistical sampling-based approach to collect profile data to identify the performance bottlenecks in the application. Use this analysis type to understand the micro architectural bottlenecks in the application runtime.

7.9.1. Overview

AMD uProf CPU profiler follows a statistical sampling-based approach to collect profile data to identify the performance bottlenecks in the application. A few high-level features to understand the CPU profiler capabilities are listed here:

7.9.1.1. Predefined Sampling Configuration

The Predefined Sampling Configuration provides a convenient way to select a useful set of sampling events for profile analysis. The following table lists all such configurations:

Table 7.13 Predefined Sampling Configurations

Profile Type

Predefined Configuration Name

Abbreviation

Description

Event-based profile (EBP)

Assess performance

assess

Provides an overall assessment of the performance.

Event-based profile (EBP)

Assess performance (Extended)

assess_ext

Provides an overall assessment of the performance with additional metrics.

Event-based profile (EBP)

Investigate data access

data_access

To find data access operations with poor L1 data cache locality and poor DTLB behavior.

Event-based profile (EBP)

Investigate instruction access

inst_access

To find instruction fetches with poor L1 instruction cache locality and poor ITLB behavior.

Event-based profile (EBP)

Investigate branching

branch

To find poorly predicted branches and near returns.

Event-based profile (EBP)

Investigate CPI

cpi

To analyze the CPI and IPC metrics of the running application or the entire system.

IBS

Instruction based sampling

ibs

To collect the sample data using IBS Fetch and IBS OP. Precise sample attribution to instructions.

IBS

Cache Analysis

memory

To identify the false cache-line sharing issues. The profile data will be collected using IBS OP.
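The CPI and IPC metrics mentioned for the cpi configuration are ratios of unhalted cycles to retired instructions. As a minimal sketch, assuming the cycle count comes from the CYCLES_NOT_IN_HALT event and the instruction count from the RETIRED_INST event, and that both counts cover the same interval (the function name cpi_and_ipc is hypothetical):

```python
def cpi_and_ipc(cycles_not_in_halt, retired_inst):
    """Return (CPI, IPC) from an unhalted cycle count and a retired instruction count."""
    if cycles_not_in_halt <= 0 or retired_inst <= 0:
        raise ValueError("both counts must be positive")
    cpi = cycles_not_in_halt / retired_inst   # cycles per instruction
    ipc = retired_inst / cycles_not_in_halt   # instructions per cycle
    return cpi, ipc


cpi, ipc = cpi_and_ipc(cycles_not_in_halt=2_000_000, retired_inst=1_000_000)
```

A CPI of 2.0 (IPC of 0.5) in this example means that the core retires one instruction every two cycles on average. A lower CPI generally indicates better use of the pipeline.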

7.9.1.2. Predefined Core PMC Events

Listed here are some of the Core Performance events of AMD Zen processors.

Table 7.14 Predefined Core PMC Events - AMD 2nd Gen EPYC™ Processors

Event Id, Unit-mask

Event Abbreviation

Name and Description

PMCx076,0x00

CYCLES_NOT_IN_HALT

CPU clock cycles not halted

The number of CPU cycles when the thread is not in halt state.

PMCx0C0, 0x00

RETIRED_INST

Retired Instructions

The number of instructions retired from execution. This count includes exceptions and interrupts. Each exception or interrupt is counted as one instruction.

PMCx0C1, 0x00

RETIRED_MICRO_OPS

Retired Macro Operations

The number of macro-ops retired. This count includes all processor activity - instructions, exceptions, interrupts, microcode assists, and so on.

PMCx0C2, 0x00

RETIRED_BR_INST

Retired Branch Instructions

The number of branch instructions retired. This includes all types of architectural control flow changes, including exceptions and interrupts.

PMCx0C3, 0x00

RETIRED_BR_INST_MISP

Retired Branch Instructions Mispredicted

The number of retired branch instructions that were mis-predicted.

Note

Only EX direct mis-predicts and indirect target mis-predicts are counted.

PMCx003,0x08

RETIRED_SSE_AVX_FLOPS

Retired SSE/AVX Flops

The number of retired SSE/AVX flops. The number of events logged per cycle can vary from 0 to 64. This is a large increment per cycle event as it can count more than 15 events per cycle. This counts both single precision and double precision FP events.

PMCx029,0x07

L1_DC_ACCESSES_ALL

All Data cache accesses

The number of load and store ops dispatched to LS unit. This counts the dispatch of single op that performs a memory load, dispatch of single op that performs a memory store, dispatch of a single op that performs a load from and store to the same memory address.

PMCx060,0x10

L2_CACHE_ACCESS_FROM_L1_IC_MISS

L2 cache access from L1 IC miss

The L2 cache access requests due to L1 instruction cache misses.

PMCx060,0xC8

L2_CACHE_ACCESS_FROM_L1_DC_MISS

L2 cache access from L1 DC miss

The L2 cache access requests due to L1 data cache misses. This also counts hardware and software prefetches.

PMCx064,0x01

L2_CACHE_MISS_FROM_L1_IC_MISS

L2 cache miss from L1 IC miss

Counts all the Instruction cache fill requests that miss in the L2 cache.

PMCx064,0x08

L2_CACHE_MISS_FROM_L1_DC_MISS

L2 cache miss from L1 DC miss

Counts all the Data cache fill requests that miss in the L2 cache.

PMCx071,0x1F

L2_HWPF_HIT_IN_L3

L2 Prefetcher Hits in L3

Counts all L2 prefetches accepted by the L2 pipeline which miss the L2 cache and hit the L3.

PMCx072,0x1F

L2_HWPF_MISS_IN_L2_L3

L2 Prefetcher Misses in L3

Counts all L2 prefetches accepted by the L2 pipeline which miss the L2 and the L3 caches.

PMCx064,0x06

L2_CACHE_HIT_FROM_L1_IC_MISS

L2 cache hit from L1 IC miss

Counts all the Instruction cache fill requests that hit in the L2 cache.

PMCx064,0x70

L2_CACHE_HIT_FROM_L1_DC_MISS

L2 cache hit from L1 DC miss

Counts all the Data cache fill requests that hit in the L2 cache.

PMCx070,0x1F

L2_HWPF_HIT_IN_L2

L2 cache hit from L2 HW Prefetch

Counts all L2 prefetches accepted by the L2 pipeline which hit in the L2 cache.

PMCx043,0x01

L1_DEMAND_DC_REFILLS_LOCAL_L2

L1 demand DC fills from L2

The demand Data Cache (DC) fills from local L2 cache to the core.

PMCx043,0x02

L1_DEMAND_DC_REFILLS_LOCAL_CACHE

L1 demand DC fills from local CCX

The demand Data Cache (DC) fills from the cache of the same CCX or the cache of a different CCX in the same package (node).

PMCx043,0x08

L1_DEMAND_DC_REFILLS_LOCAL_DRAM

L1 demand DC fills from local Memory

The demand Data Cache (DC) fills from DRAM or IO connected in the same package (node).

PMCx043,0x10

L1_DEMAND_DC_REFILLS_REMOTE_CACHE

L1 demand DC fills from remote cache

The demand Data Cache (DC) fills from cache of CCX in the different package (node).

PMCx043,0x40

L1_DEMAND_DC_REFILLS_REMOTE_DRAM

L1 demand DC fills from remote Memory

The demand Data Cache (DC) fills from DRAM or IO connected in the different package (node).

PMCx043,0x5B

L1_DEMAND_DC_REFILLS_ALL

L1 demand DC refills from all data sources

The demand Data Cache (DC) fills from all the data sources.

PMCx060,0xFF

L2_REQUESTS_ALL

All L2 cache requests.

PMCx084,0x00

L1_ITLB_MISSES_L2_HITS

L1 TLB miss L2 TLB hit

The instruction fetches that miss in the L1 Instruction Translation Lookaside Buffer (ITLB) but hit in the L2-ITLB.

PMCx085,0x07

L2_ITLB_MISSES

L1 TLB miss L2 TLB miss

The ITLB reloads originating from page table walker. The table walk requests are made for L1-ITLB miss and L2-ITLB misses.

PMCx045,0xFF

L1_DTLB_MISSES

L1 DTLB miss

The L1 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops. This event counts both L2-DTLB hit and L2-DTLB miss.

PMCx045,0xF0

L2_DTLB_MISSES

L2 DTLB miss

The L2 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops.

PMCx047,0x00

MISALIGNED_LOADS

Misaligned Loads

The number of misaligned loads.

Note

On AMD Zen 3 core processors, this event counts the 64B (cache-line crossing) and 4K (page crossing) misaligned loads.

PMCx052,0x03

INEFFECTIVE_SW_PF

Ineffective Software Prefetches

The number of software prefetches that did not fetch data outside of the processor core. This event counts the Software PREFETCH instruction that saw a match on an already-allocated miss request buffer. Also counts the Software PREFETCH instruction that saw a DC hit.
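The unit-mask column of Table 7.14 appears to behave as a bit field: for event PMCx043, the masks of the individual data sources combine with a bitwise OR to give the 0x5B value listed for L1_DEMAND_DC_REFILLS_ALL. The following sketch checks that arithmetic. The dictionary simply restates the table rows, and the helper names combined_umask and sources_in are hypothetical.

```python
# Unit masks for PMCx043 as listed in Table 7.14.
DEMAND_DC_REFILL_UMASKS = {
    "L1_DEMAND_DC_REFILLS_LOCAL_L2": 0x01,
    "L1_DEMAND_DC_REFILLS_LOCAL_CACHE": 0x02,
    "L1_DEMAND_DC_REFILLS_LOCAL_DRAM": 0x08,
    "L1_DEMAND_DC_REFILLS_REMOTE_CACHE": 0x10,
    "L1_DEMAND_DC_REFILLS_REMOTE_DRAM": 0x40,
}


def combined_umask(names):
    """OR together the unit masks of the named data sources."""
    mask = 0
    for name in names:
        mask |= DEMAND_DC_REFILL_UMASKS[name]
    return mask


def sources_in(umask):
    """List the data sources whose bit is set in umask."""
    return [name for name, bit in DEMAND_DC_REFILL_UMASKS.items() if umask & bit]


all_sources = combined_umask(DEMAND_DC_REFILL_UMASKS)  # 0x01|0x02|0x08|0x10|0x40
local_only = combined_umask(
    ["L1_DEMAND_DC_REFILLS_LOCAL_L2",
     "L1_DEMAND_DC_REFILLS_LOCAL_CACHE",
     "L1_DEMAND_DC_REFILLS_LOCAL_DRAM"]
)
```

Reading a mask this way helps when interpreting a combined event: sources_in(0x50) returns the remote cache and remote DRAM sources, for example.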

Table 7.15 Predefined Core PMC Events - AMD 4th Gen EPYC™ Processors

Event Id, Unit-mask

Event Abbreviation

Name and Description

PMCx076,0x00

CYCLES_NOT_IN_HALT

CPU clock cycles not halted

The number of CPU cycles when the thread is not in halt state.

PMCx0C0, 0x00

RETIRED_INST

Retired Instructions

The number of instructions retired from execution. This count includes exceptions and interrupts. Each exception or interrupt is counted as one instruction.

PMCx0C1, 0x00

RETIRED_MICRO_OPS

Retired Macro Operations

The number of macro-ops retired. This count includes all processor activity - instructions, exceptions, interrupts, microcode assists, and so on.

PMCx0C2, 0x00

RETIRED_BR_INST

Retired Branch Instructions

The number of branch instructions retired. This includes all types of architectural control flow changes, including exceptions and interrupts.

PMCx0C3, 0x00

RETIRED_BR_INST_MISP

Retired Branch Instructions Mispredicted

The number of retired branch instructions that were mis-predicted.

Note

Only EX direct mis-predicts and indirect target mis-predicts are counted.

PMCx003,0x08

RETIRED_SSE_AVX_FLOPS

Retired SSE/AVX Flops

The number of retired SSE/AVX flops. The number of events logged per cycle can vary from 0 to 64. This is a large increment per cycle event as it can count more than 15 events per cycle. This counts both single precision and double precision FP events.

PMCx029,0x07

L1_DC_ACCESSES_ALL

All Data cache accesses

The number of load and store ops dispatched to LS unit. This counts the dispatch of single op that performs a memory load, dispatch of single op that performs a memory store, dispatch of a single op that performs a load from and store to the same memory address.

PMCx060,0x10

L2_CACHE_ACCESS_FROM_L1_IC_MISS

L2 cache access from L1 IC miss

The L2 cache access requests due to L1 instruction cache misses.

PMCx060,0xE8

L2_CACHE_ACCESS_FROM_L1_DC_MISS

L2 cache access from L1 DC miss

The L2 cache access requests due to L1 data cache misses. This also counts hardware and software prefetches.

PMCx064,0x01

L2_CACHE_MISS_FROM_L1_IC_MISS

L2 cache miss from L1 IC miss

Counts all the Instruction cache fill requests that miss in the L2 cache.

PMCx064,0x08

L2_CACHE_MISS_FROM_L1_DC_MISS

L2 cache miss from L1 DC miss

Counts all the Data cache fill requests that miss in the L2 cache.

PMCx071,0xFF

L2_HWPF_HIT_IN_L3

L2 Prefetcher Hits in L3

Counts all L2 prefetches accepted by the L2 pipeline which miss the L2 cache and hit the L3.

PMCx072,0xFF

L2_HWPF_MISS_IN_L2_L3

L2 Prefetcher Misses in L3

Counts all L2 prefetches accepted by the L2 pipeline which miss the L2 and the L3 caches.

PMCx064,0x06

L2_CACHE_HIT_FROM_L1_IC_MISS

L2 cache hit from L1 IC miss

Counts all the Instruction cache fill requests that hit in the L2 cache.

PMCx064,0xF0

L2_CACHE_HIT_FROM_L1_DC_MISS

L2 cache hit from L1 DC miss

Counts all the Data cache fill requests that hit in the L2 cache.

PMCx070,0xFF

L2_HWPF_HIT_IN_L2

L2 cache hit from L2 HW Prefetch

Counts all L2 prefetches accepted by the L2 pipeline which hit in the L2 cache.

PMCx043,0x01

L1_DEMAND_DC_REFILLS_LOCAL_ L2

L1 demand DC fills from L2

The demand Data Cache (DC) fills from local L2 cache to the core.

PMCx043,0x02

L1_DEMAND_DC_REFILLS_LOCAL_ CACHE

L1 demand DC fills from local CCX.

The demand Data Cache (DC) fills from same the cache of same CCX or cache of different CCX in the same package (node)

PMCx043,0x04

L1_DEMAND_DC_REFILLS_EXTERNAL_CACHE_LOCAL

L1 DC fills from local external CCX caches

The DC fills from the cache of a different CCX in the same package (node).

PMCx043,0x08

L1_DEMAND_DC_REFILLS_LOCAL_DRAM

L1 demand DC fills from local Memory

The demand Data Cache (DC) fills from DRAM or IO connected in the same package (node).

PMCx043,0x10

L1_DEMAND_DC_REFILLS_REMOTE_CACHE

L1 demand DC fills from remote cache

The demand Data Cache (DC) fills from the cache of a CCX in a different package (node).

PMCx043,0x40

L1_DEMAND_DC_REFILLS_REMOTE_DRAM

L1 demand DC fills from remote Memory

The demand Data Cache (DC) fills from DRAM or IO connected in a different package (node).

PMCx043,0x14

L1_DEMAND_DC_REFILLS_EXTERNAL_CACHE

L1 demand DC fills from external caches

The demand DC fills from the cache of a different CCX in the same or a different package (node).

PMCx043,0xDF

L1_DEMAND_DC_REFILLS_ALL

L1 demand DC refills from all data sources

The demand DC fills from all the data sources.

PMCx044,0x01

L1_DC_REFILLS_LOCAL_L2

L1 DC fills from local L2

The DC fills from the local L2 cache to the core.

PMCx044,0x02

L1_DC_REFILLS_LOCAL_CACHE

L1 DC fills from local CCX cache

The DC fills from different L2 cache in the same CCX or L3 cache that belongs to the same CCX.

PMCx044,0x08

L1_DC_REFILLS_LOCAL_DRAM

L1 DC fills from local Memory

The DC fills from DRAM or IO connected in the same package (node).

PMCx044,0x04

L1_DC_REFILLS_EXTERNAL_CACHE_LOCAL

L1 DC fills from local external CCX caches

The DC fills from the cache of different CCX in the same package (node).

PMCx044,0x10

L1_DC_REFILLS_EXTERNAL_CACHE_REMOTE

L1 DC fills from remote external CCX caches

The DC fills from the CCX cache in the different package (node).

PMCx044,0x40

L1_DC_REFILLS_REMOTE_DRAM

L1 DC fills from remote Memory

The DC fills from DRAM or IO connected in the different package (node).

PMCx044,0x14

L1_DC_REFILLS_EXTERNAL_CACHE

L1 DC fills from external CCX caches

The DC fills from cache of different CCX in the same or different package (node).

PMCx044,0x48

L1_DC_REFILLS_DRAM

L1 DC fills from local or remote Memory

The DC fills from DRAM or IO connected in the same or different package (node).

PMCx044,0x50

L1_DC_REFILLS_REMOTE_NODE

L1 DC fills from remote node

The DC fills from the CCX cache in the different package (node) or the DRAM / IO connected in the different package (node).

PMCx044,0x03

L1_DC_REFILLS_LOCAL_CACHE_L2_L3

L1 DC fills from same CCX

The DC fills from the local L2 cache to the core or different L2 cache in the same CCX or L3 cache that belongs to the same CCX.

PMCx044,0xDF

L1_DC_REFILLS_ALL

L1 DC fills from all the data sources

The DC fills from all the data sources.

PMCx060,0xFF

L2_REQUESTS_ALL

All L2 cache requests.

PMCx084,0x00

L1_ITLB_MISSES_L2_HITS

L1 TLB miss L2 TLB hit

The instruction fetches that miss in the L1 Instruction Translation Lookaside Buffer (ITLB) but hit in the L2-ITLB.

PMCx085,0x07

L2_ITLB_MISSES

L1 TLB miss L2 TLB miss

The ITLB reloads originating from page table walker. The table walk requests are made for L1-ITLB miss and L2-ITLB misses.

PMCx045,0xFF

L1_DTLB_MISSES

L1 DTLB miss

The L1 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops. This event counts both L2-DTLB hits and L2-DTLB misses.

PMCx045,0xF0

L2_DTLB_MISSES

L2 DTLB miss

The L2 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops.

PMCx078,0xFF

ALL_TLB_FLUSHES

All TLB flushes

PMCx047,0x03

MISALIGNED_LOADS

The number of misaligned loads.

Note

On AMD Zen 3 core processors, this event counts the 64 B (cache-line crossing) and 4 K (page crossing) misaligned loads.

PMCx052,0x03

INEFFECTIVE_SW_PF

Ineffective Software Prefetches

The number of software prefetches that did not fetch data outside of the processor core. This event counts the Software PREFETCH instruction that saw a match on allocated miss request buffer. Also counts the Software PREFETCH instruction that saw a DC hit.

PMCx18E,0x1F

IC_TAG_ALL_IC_ACCESS

IC Tag All Instruction Cache Access

PMCx18E,0x18

IC_TAG_IC_MISS

IC Tag Instruction Cache Miss

PMCx28F,0x07

OP_CACHE_ALL_ACCESS

All OP Cache Accesses

PMCx28F,0x04

OP_CACHE_MISS

Op Cache Miss
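
Several composite events in the table above use a unit mask that is the bitwise OR of the unit masks of narrower events on the same PMC. For example, for PMCx044 the mask 0x14 combines 0x04 and 0x10. The following minimal Python sketch checks this relationship with values copied from the table. The short source labels and the helper name `combine_umask` are illustrative assumptions and are not AMD uProf identifiers.

```python
# Unit masks for PMCx044 (L1 DC refills), copied from the table above.
# The short source labels are illustrative, not AMD uProf identifiers.
PMCX044_SOURCES = {
    "local_l2": 0x01,
    "local_ccx_cache": 0x02,
    "external_cache_local": 0x04,
    "local_dram": 0x08,
    "external_cache_remote": 0x10,
    "remote_dram": 0x40,
}


def combine_umask(*sources: str) -> int:
    """Return the bitwise OR of the unit masks of the named data sources."""
    umask = 0
    for name in sources:
        umask |= PMCX044_SOURCES[name]
    return umask


# Composite unit masks listed in the table and the sources they combine.
COMPOSITES = {
    0x03: ("local_l2", "local_ccx_cache"),
    0x14: ("external_cache_local", "external_cache_remote"),
    0x48: ("local_dram", "remote_dram"),
    0x50: ("external_cache_remote", "remote_dram"),
}

for listed, sources in COMPOSITES.items():
    combined = combine_umask(*sources)
    print(f"0x{listed:02X}: {' | '.join(sources)} -> 0x{combined:02X}")
```

The same pattern applies to the PMCx043 demand-fill events, which use the same bit positions in the table.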

Core CPU Metrics

Table 7.16 Core CPU Metrics

CPU Metric

Description

Core Effective Frequency

Core Effective Frequency (without halted cycles) over the sampling period, reported in GHz. The metric is based on APERF and MPERF MSRs. MPERF is incremented by the core at the P0 state frequency while the core is in C0 state. APERF is incremented in proportion to the actual number of core cycles while the core is in C0 state.

CPI

Cycles Per Instruction Retired (CPI) is the multiplicative inverse of the IPC metric. This is one of the basic performance metrics indicating how cache misses, branch mis-predictions, memory latencies, and other bottlenecks are affecting the execution of an application. A lower CPI value is better.

IPC

Instructions Retired Per Cycle (IPC) is the average number of instructions retired per cycle. This is measured using Core PMC events PMCx0C0 [Retired Instructions] and PMCx076 [CPU Clocks not Halted]. These PMC events are counted in both OS and User mode.

L1_DC_ACCESS_RATE

The DC access rate is the number of DC accesses divided by the total number of retired instructions.

L1_DC_MISS_RATE

The DC miss rate is the number of DC misses divided by the total number of retired instructions.

L1_DC_MISS_RATIO

The DC miss ratio is the number of DC misses divided by the total number of DC accesses.

L1_DC_MISSES (PTI)

The number of L2 cache access requests due to L1 data cache misses, per thousand retired instructions. These L2 cache access requests also include the hardware and software prefetches.

L1_DC_REFILLS_ALL (PTI)

The number of demand data cache (DC) fills per thousand retired instructions. These demand DC fills are from all the data sources, such as local L2/L3 cache, remote caches, local memory, and remote memory.

L1_DTLB_MISS_RATE

The DTLB L1 miss rate is the number of DTLB L1 misses divided by the total number of retired instructions.

L1_ITLB_MISS_RATE

The ITLB L1 miss rate is the number of ITLB L1_Miss_L2_Hits and L1_Miss_L2_Miss divided by the total number of retired instructions.

L2_CACHE_ACCESSES_FROM_IC_MISSES

The number of L2 cache access requests due to the L1 instruction cache misses per thousand retired instructions. These L2 cache access requests also include the prefetches.

L2_CACHE_MISSES_FROM_IC_MISSES

The number of L2 cache misses from L1 instruction cache misses per thousand retired instructions.

L2_DTLB_MISS_RATE

The L2 DTLB miss rate is the number of L2 DTLB misses divided by the total number of retired instructions.

L2_ITLB_MISS_RATE

The ITLB L2 miss rate is the number of ITLB L2 misses divided by the total number of retired instructions.

MISALIGNED_LOADS_RATE

The misalign rate is the number of misaligned loads divided by the total number of retired instructions.

MISALIGNED_LOADS_RATIO

The misalign ratio is the number of misaligned loads divided by the total number of DC accesses.

RETIRED_BR_INST_MISP_RATE

This metric is computed as retired mis-predicted branches divided by the total number of retired instructions.

RETIRED_BR_INST_MISP_RATIO

This metric is computed as the retired mis-predicted branches divided by the total number of retired branch instructions.

RETIRED_BR_INST_RATE

The retired branch instructions rate. This metric is computed as the retired branches divided by the total number of retired instructions.

RETIRED_INDIRECT_BR_INST_MISP (PTI)

The number of retired mis-predicted indirect branches per thousand instructions.

RETIRED_NEAR_RETURNS (PTI)

The number of retired near returns per thousand instructions.

RETIRED_NEAR_RETURNS_MISP (PTI)

The number of retired mis-predicted near returns per thousand instructions.

RETIRED_NEAR_RETURNS_MISP_RATE

This metric is computed as the retired mis-predicted near returns divided by the total number of retired instructions.

RETIRED_NEAR_RETURNS_MISP_RATIO

This metric is computed as retired mis-predicted near returns divided by the total number of retired return instructions.

RETIRED_TAKEN_BR_INST (PTI)

The number of retired taken branches per thousand instructions.

RETIRED_TAKEN_BR_INST_MISP (PTI)

The number of retired mis-predicted taken branches per thousand instructions.

RETIRED_TAKEN_BR_INST_RATE

The retired taken branches rate. This metric is computed as the retired taken branches divided by the total number of retired instructions.

STLI_OTHER

Store-to-load conflicts: A load was unable to complete due to a non-forwardable conflict with an older store. Most commonly, a load’s address range partially but not completely overlaps with an uncompleted older store. Software can avoid this problem by using the same size and alignment loads and stores when accessing the data.

Vector/SIMD code is particularly susceptible to this problem; software should construct wide vector stores by manipulating the vector elements in the registers using shuffle/blend/swap instructions prior to storing to the memory, instead of using narrow element-by-element stores.
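
The rate, ratio, and PTI metrics in the table above all follow the same arithmetic. The following minimal Python sketch shows that arithmetic on made-up counts. The class, the function names, and the sample values are hypothetical, and the effective-frequency formula (P0 frequency scaled by the ratio of the APERF and MPERF deltas) is an assumption derived from the metric description, not a confirmed description of how the tool computes it.

```python
from dataclasses import dataclass


@dataclass
class CoreCounts:
    """Raw event counts for one program unit (values below are made up)."""
    cycles_not_halted: int      # PMCx076 [CPU Clocks not Halted]
    retired_instructions: int   # PMCx0C0 [Retired Instructions]
    dc_accesses: int
    dc_misses: int


def ipc(c: CoreCounts) -> float:
    """Instructions retired per cycle."""
    return c.retired_instructions / c.cycles_not_halted


def cpi(c: CoreCounts) -> float:
    """Cycles per retired instruction, the inverse of IPC."""
    return c.cycles_not_halted / c.retired_instructions


def rate(event_count: int, c: CoreCounts) -> float:
    """'..._RATE' metrics: event count per retired instruction."""
    return event_count / c.retired_instructions


def ratio(event_count: int, base_count: int) -> float:
    """'..._RATIO' metrics: event count relative to a related event count."""
    return event_count / base_count


def pti(event_count: int, c: CoreCounts) -> float:
    """'(PTI)' metrics: event count per thousand retired instructions."""
    return 1000.0 * event_count / c.retired_instructions


def effective_frequency_ghz(p0_ghz: float, aperf_delta: int, mperf_delta: int) -> float:
    """Assumed formula: P0 frequency scaled by delta(APERF) / delta(MPERF)."""
    return p0_ghz * aperf_delta / mperf_delta


sample = CoreCounts(cycles_not_halted=2_000_000, retired_instructions=1_600_000,
                    dc_accesses=400_000, dc_misses=20_000)
print("IPC:", ipc(sample))
print("CPI:", cpi(sample))
print("DC miss rate:", rate(sample.dc_misses, sample))
print("DC miss ratio:", ratio(sample.dc_misses, sample.dc_accesses))
print("DC misses (PTI):", pti(sample.dc_misses, sample))
print("Effective frequency (GHz):", effective_frequency_ghz(3.0, 900, 1000))
```

Note that a rate is normalized by retired instructions while a ratio is normalized by a related event count, so the two are not interchangeable even when they share a numerator.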

7.9.2. Analysis with Event-based Profiling

In this profile, the CPU Profiler uses the PMCs to monitor the various micro-architectural events supported by the AMD x86-based processor. It helps to identify the CPU and memory related performance issues in the profiled applications. To analyze an aspect of the profiled application (or system), a specific set of relevant events is grouped and monitored together. The CPU Profiler provides several predefined EBP profile configurations, such as Assess Performance and Investigate Branching. You can select any of these predefined configurations to profile and analyze the runtime characteristics of your application. You can also create custom configurations of events to profile.

In this profile mode, a delay called skid occurs between the time the sampling interrupt occurs and the time the sampled instruction address is collected. Due to this delay, samples may be recorded near but not exactly at the instruction that caused the interrupt. This can lead to an inaccurate distribution of samples, where events are sometimes attributed to neighboring instructions rather than the actual source instruction.

7.9.2.1. Data Collection

7.9.2.1.1. Data Collection Using GUI

To launch the AMDuProf GUI, go to Home > Welcome page.

  1. Click Profile an Application on the Welcome page.

  2. Provide application path, application options, working directory, and environment variables, if any. Click Next.

  3. From Predefined Configs, select any Event Based configuration. For example: Assess Performance, Investigate CPI, Investigate Branching, etc.

  4. Alternatively, use Custom Config to configure events individually.

  5. From Advanced Options, select the appropriate options.

  6. Click Start Profile to start the profiling.

7.9.2.1.2. Data Collection Using CLI

Once profile data collection completes, a session directory is generated. Use the session directory to generate the CSV report or to import the session into the GUI.

For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.

Example

./AMDuProfCLI collect --config assess -g -o /tmp/ /tmp/ScimarkStable
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-EBP_Sep-05-2024_21-43-08

Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-EBP_Sep-05-2024_21-43-08.
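
When collection is scripted, the session directory can be picked up from the CLI output. The following Python sketch assembles the collect command shown in the example above and extracts the path from the "Generated data files path:" line. The helper names are hypothetical, and it is an assumption that the CLI prints this line to standard output in exactly this form.

```python
import re
import subprocess
from typing import List, Optional

# Assumes the CLI keeps reporting the session path in this form (see the example above).
_SESSION_LINE = re.compile(r"^Generated data files path:\s*(.+?)\s*$", re.MULTILINE)


def build_collect_command(cli: str, config: str, output_dir: str, app: str) -> List[str]:
    """Assemble the collect command line used in the example above."""
    return [cli, "collect", "--config", config, "-g", "-o", output_dir, app]


def parse_session_dir(cli_output: str) -> Optional[str]:
    """Return the session directory reported by the CLI, or None if absent."""
    match = _SESSION_LINE.search(cli_output)
    return match.group(1) if match else None


def collect(cli: str, config: str, output_dir: str, app: str) -> Optional[str]:
    """Run a collection and return the generated session directory."""
    result = subprocess.run(
        build_collect_command(cli, config, output_dir, app),
        capture_output=True, text=True, check=True,
    )
    return parse_session_dir(result.stdout)


if __name__ == "__main__":
    # Parse the sample transcript from the example instead of launching the CLI.
    sample_output = (
        "Profiling started\n"
        "Profiling (data collection) completed\n"
        "Generated data files path: /tmp/AMDuProf-ScimarkStable-EBP_Sep-05-2024_21-43-08\n"
    )
    print(parse_session_dir(sample_output))
```

The returned path can then be passed to the report command with -i.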

7.9.2.1.3. Microarchitecture Analysis Options
Table 7.17 Microarchitecture Analysis Options

CLI Option

GUI Option

Description

--config <config>

Predefined Configs

Predefined sampling configuration to be used to collect samples. Use the command info --list collect-configs to get the list of supported configs. Multiple occurrences of --config are allowed.

-e | --event or <predefined-event>

Predefined Configs > Custom Configs

A predefined event can be used directly with -e or --event, in which case its predefined arguments apply. Alternatively, to provide more granular parameters, specify a Timer, PMU, IBS, or predefined event with arguments in the form of comma-separated key=value pairs. The supported keys are:

  • event=<timer | ibs-fetch | ibs-op> or <PMU-event> or <predefined-event>

  • umask=<unit-mask>

  • user=<0 | 1>

  • os=<0 | 1>

  • cmask=<count-mask> (value should be in the range 0x0 to 0x7f)

  • inv=<0 | 1>

  • interval=<sampling-interval>

  • frequency=<frequency (n)> (supported only for Core PMC events, the frequency should be provided in Hz)

  • ibsop-count-control=<0 | 1> (for ibs-op event)

  • loadstore (for ibs-op event, only on Windows platform)

  • ibsop-l3miss=<0 | 1> (for IBS OP event, supported only on AMD Zen4 processors)

  • ibsfetch-l3miss=<0 | 1> (for IBS FETCH event, supported only on AMD Zen4 processors)

  • ibsop-ldlat=<LATENCY> (Filter IBS OP samples by data cache miss latency threshold in CPU cycles. LATENCY must be an integer that is a multiple of 128 and between 128 and 2048. Supported on AMD Zen5 and later processors.)

  • call-graph

    Note

    1. It is not required to provide umask with predefined event.

    2. Use the dedicated option --call-graph to specify the arguments related to the call stack sample collection.

Argument details

  • user – Enable (1) or disable (0) user space samples collection.

  • os – Enable (1) or disable (0) kernel space samples collection.

  • interval – Sample collection interval

    • For timer, it is the time interval in milliseconds.

    • For PMU and predefined events, it is the count of the event occurrences.

    • For IBS FETCH, it is the fetch count.

    • For IBS OP, it is the cycle count or the dispatch count.

  • ibsop-count-control – Choose IBS OP sampling by cycle (0) count or dispatch (1) count.

  • loadstore – Enable only the IBS OP load/store samples collection; other IBS OP samples are not collected.

  • ibsop-l3miss – Enable IBS OP sample collection only when an L3 miss occurs. For example: -e event=ibs-op,interval=100000,ibsop-l3miss=1.

  • ibsfetch-l3miss – Enable IBS FETCH sample collection only when an L3 miss occurs. For example: -e event=ibs-fetch,interval=100000,ibsfetch-l3miss=1.

  • ibsop-ldlat – Filter IBS OP samples by data cache miss latency threshold in CPU cycles. LATENCY must be an integer that is a multiple of 128 and between 128 and 2048. For example: -e event=ibs-op,interval=100000,ibsop-ldlat=256.

When these arguments are not passed, the default values are:

  • umask=0

  • cmask=0x0

  • user=1

  • os=1

  • inv=0

  • ibsop-count-control=0 (for ibs-op event)

  • ibsop-l3miss=0

  • ibsfetch-l3miss=0

  • interval=1.0 ms for timer event

  • interval=250000 for ibs-fetch, ibs-op, pmu-event, or predefined-event

Use the following commands as required:

  • info --list predefined-events for the list of supported predefined events

  • info --list pmu-events for the list of supported PMU-events. Multiple occurrences of --event (-e) are allowed.
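
The key=value form of the -e argument and the value constraints listed above can be sketched as a small builder. The following Python helper is a hypothetical illustration, not part of AMD uProf: it joins the pairs with commas, checks the documented ranges for cmask, ibsop-ldlat, and the 0/1 keys, and emits loadstore as a bare key. Writing cmask in hexadecimal is an assumption based on how the range is documented.

```python
# Keys documented above as taking only 0 or 1.
BINARY_KEYS = {"user", "os", "inv", "ibsop-count-control",
               "ibsop-l3miss", "ibsfetch-l3miss"}


def build_event_spec(event: str, **options) -> str:
    """Build the value for -e from key=value pairs (hypothetical helper).

    Underscores in Python keyword names are mapped to hyphens,
    so ibsop_ldlat=256 becomes ibsop-ldlat=256.
    """
    parts = [f"event={event}"]
    for raw_key, value in options.items():
        key = raw_key.replace("_", "-")
        if key == "loadstore":
            # Documented as a bare key without a value.
            if value:
                parts.append("loadstore")
            continue
        if key in BINARY_KEYS:
            if value not in (0, 1):
                raise ValueError(f"{key} must be 0 or 1")
            value = int(value)
        elif key == "cmask":
            if not 0x0 <= value <= 0x7F:
                raise ValueError("cmask must be in the range 0x0 to 0x7f")
            value = hex(value)  # assumption: hexadecimal notation is accepted
        elif key == "ibsop-ldlat":
            if value % 128 != 0 or not 128 <= value <= 2048:
                raise ValueError("ibsop-ldlat must be a multiple of 128 "
                                 "between 128 and 2048")
        parts.append(f"{key}={value}")
    return ",".join(parts)


print(build_event_spec("ibs-op", interval=100000, ibsop_l3miss=1))
print(build_event_spec("ibs-op", interval=100000, ibsop_ldlat=256))
```

Keys that are omitted are left to the CLI defaults listed above, so the builder only emits what the caller sets explicitly.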

7.9.2.2. Analyze Data

If data is collected using CLI, use Import Session to import the session into GUI to analyze data in GUI.

7.9.2.2.1. CLI Report

Use the following CLI report command to generate the profile report in .csv format by passing the session directory path generated in collection.

AMDuProfCLI report -i <session directory>

For a list of all the supported options, refer to AMDuProfCLI Report Command Options.

Example

./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-EBP_Sep-05-2024_21-43-08
Translation started

Report generation started

Report generation completed...
Generated report file: /tmp/AMDuProf-ScimarkStable-EBP_Sep-05-2024_21-43-08/report.csv

Use the Thread Concurrency Graph to analyze how efficiently the processor cores are utilized by the application. In other words, it shows how much time a specific number of threads runs on a specific number of cores.

Figure 7.38 Thread Concurrency Graph

Figure 7.39 Function Hotspots

Select a function to get all the call paths to this function from different threads; each call path provides the number of samples in that path. Double-click on the function to analyze the instruction level sample attribution for that function using Source View.

7.9.2.2.2. Identify the Hot Code Paths

Use Flame Graph to identify the hottest code paths of an application. The width of each function indicates the percentage of event samples of the function (and its callees) relative to the total number of samples of the selected process and thread for a specific event.
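
As a worked illustration of the width calculation, the following Python sketch computes, for each function, the share of samples in which the function or one of its callees was executing. The call stacks and sample counts are made up, and aggregating by function name across call paths is a simplification; a flame graph draws one box per call path.

```python
from collections import Counter
from typing import Dict, Tuple


def inclusive_percentages(stacks: Dict[Tuple[str, ...], int]) -> Dict[str, float]:
    """Map each function to its inclusive share of all samples, in percent.

    `stacks` maps a call stack (root first, leaf last) to its sample count.
    A function is credited once per stack, even if it recurses.
    """
    total = sum(stacks.values())
    inclusive = Counter()
    for frames, samples in stacks.items():
        for func in set(frames):
            inclusive[func] += samples
    return {func: 100.0 * count / total for func, count in inclusive.items()}


# Hypothetical samples for one process, thread, and event.
samples = {
    ("main", "compute", "kernel"): 60,
    ("main", "compute"): 20,
    ("main", "io"): 20,
}

for func, pct in sorted(inclusive_percentages(samples).items(),
                        key=lambda item: -item[1]):
    print(f"{func}: {pct:.1f}%")
```

Here `compute` is credited with both its own samples and those of its callee `kernel`, which is why its share is larger than the share of `kernel` alone.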

Use Top-Down Callstack to analyze any issues with the call-sequence flow of the application and to analyze the bottlenecks in functions and their callees.

Figure 7.40 Top-Down Callstack

7.9.2.2.3. Advisory

Confidence Threshold

A metric with a low number of samples collected for a program unit, due to either multiplexing or statistical sampling, is grayed out. A few points to remember are:

Issue Threshold

The CPI metric cells exceeding the specific threshold value (>1.0) are highlighted in pink to mark them as potential performance problems, as follows:

Figure 7.41 CPI Metric - Threshold-Based Performance
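
The two thresholds can be pictured as a simple classification of each CPI cell. The following Python sketch is only an illustration: the function name and the `min_samples` cutoff are hypothetical (the text does not state the confidence cutoff), the CPI threshold of 1.0 comes from the description above, and checking confidence before the issue threshold is an assumption.

```python
def classify_cpi_cell(cpi: float, samples: int,
                      cpi_threshold: float = 1.0,
                      min_samples: int = 50) -> str:
    """Classify a CPI cell as 'grayed-out', 'highlighted', or 'normal'.

    min_samples is a hypothetical confidence cutoff chosen for this sketch.
    """
    if samples < min_samples:
        return "grayed-out"      # too few samples to trust the metric
    if cpi > cpi_threshold:
        return "highlighted"     # potential performance problem
    return "normal"


# Made-up (function, CPI, sample count) rows.
rows = [("kernel", 1.8, 4000), ("io", 0.7, 900), ("init", 2.5, 12)]
for name, cpi, samples in rows:
    print(name, classify_cpi_cell(cpi, samples))
```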

7.9.2.3. Limitations

7.9.3. Analysis with Instruction Based Sampling

In this profile, the CPU Profiler uses the IBS HW supported by the AMD x86-based processor to observe the effect of instructions on the processor and on the memory subsystem. In IBS, HW events are linked with the instruction that caused them. The CPU Profiler also uses these HW events to derive various metrics, such as data cache latency.

7.9.3.1. Data Collection

7.9.3.1.1. Data Collection Using GUI

To launch the AMDuProf GUI, go to Home > Welcome page.

  1. Click Profile an Application on the Welcome page.

  2. Provide application path, application options, working directory, and environment variables, if any. Click Next.

  3. From Predefined Configs, select Instruction Based Sampling configuration.

    Alternatively, use Custom Config to configure IBS_FETCH, IBS_ALL_OPS events individually.

  4. From Advanced Options, select the appropriate options.

  5. Click Start Profile to start the profiling.

7.9.3.1.2. Data Collection Using CLI

Once profile data collection completes, a session directory is generated. Use the session directory to generate the CSV report or to import the session into the GUI.

For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.

Example

./AMDuProfCLI collect --config ibs -g -o /tmp/ /tmp/ScimarkStable
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-IBS_Sep-05-2024_21-43-08

Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-IBS_Sep-05-2024_21-43-08.

7.9.3.2. Analyze Data

If data is collected using CLI, use Import Session to import the session into GUI to analyze data in GUI.

7.9.3.2.1. CLI Report

Use the following CLI report command to generate the profile report in .csv format by passing the session directory path generated in collection.

AMDuProfCLI report -i <session directory>

For a list of all the supported options, refer to AMDuProfCLI Report Command Options.

Example

./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-IBS_Sep-05-2024_21-43-08
Translation started

Report generation started

Report generation completed...

The generated report file is stored as /tmp/AMDuProf-ScimarkStable-IBS_Sep-05-2024_21-43-08/report.csv.

Use the Thread Concurrency Graph to analyze how efficiently the processor cores are utilized by the application. In other words, it shows how much time a specific number of threads runs on a specific number of cores.

Figure 7.42 Thread Concurrency Graph

Use Function Hotspots to list the functions and the number of samples for the configured events. Expand the function to get its processes and further expand to get its threads.

Figure 7.43 Function Hotspots

Select a function to get all the call paths to this function from different threads; each call path provides the number of samples in that path. Double-click on the function to analyze the instruction level sample attribution for that function using Source View.

7.9.3.2.2. Identify the Hot Code Paths

Use Flame Graph to identify the hottest code paths of an application. The width of each function indicates the percentage of event samples of the function (and its callees) relative to the total number of samples of the selected process and thread for a specific event.

Figure 7.44 Flame Graph

Use Top-Down Callstack to analyze any issues with the call-sequence flow of the application and to analyze the bottlenecks in functions and their callees.

Figure 7.45 Top-Down Callstack

7.9.3.2.3. ASCII Dump of IBS Samples

For some scenarios, it would be useful to analyze the ASCII dump of IBS OP profile samples. To do so, complete the following steps:

  1. Collect the IBS OP samples. During collection, the following control knobs are available:

    -e event=ibs-op,interval=100000,loadstore,ibsop-count-control=1

    Where:

    • interval denotes the sampling interval

    • loadstore denotes collecting only the load and store ops (Windows only option)

    • ibsop-count-control=1 represents counting dispatched micro-ops (0 for counting clock cycles)

    • --data-buffer-count 1024 represents the number of per-core data buffers to allocate (Windows only option)

  2. Once the raw file is generated, run the following command to translate and get the ASCII dump of IBS OP samples:

    C:\> AMDuProfCLI.exe translate --ascii event-dump -i C:\temp\AMDuProf-IBS_<timestamp>\

    The CSV file containing the ASCII dump of the IBS OP samples is generated:

    C:\temp\AMDuProf-IBS_<timestamp>\IbsOpDump.csv

If there are too many missing records, try the following:

7.9.3.2.4. IBS Derived Events

AMD uProf translates the IBS information produced by the hardware into derived event sample counts that resemble EBP sample counts. All the IBS-derived events contain IBS in the event name and abbreviation. Although IBS-derived events and sample counts look similar to the EBP events and sample counts, the source and sampling basis for the IBS event information are different.

Arithmetic calculation should never be performed between IBS derived event sample counts and EBP event sample counts. It is not meaningful to directly compare the number of samples taken for events that represent the same hardware condition. For example, fewer IBS DC miss samples is not necessarily better than a larger quantity of EBP DC miss samples.

The following table shows the IBS fetch events:

Table 7.18 IBS Fetch Events - AMD Zen1, Zen2, and Zen3 Platforms

IBS Fetch Event

Description

IBS_FETCH

The number of all the IBS fetch samples. This derived event counts the number of all the IBS fetch samples that were collected, including IBS-killed fetch samples.

IBS_FETCH_COMPLETED

The number of completed IBS sampled fetches. A fetch is completed if the attempted fetch delivers instruction data to the instruction decoder. Although the instruction data was delivered, it may still not be used. For example, the instruction data may have been on the wrong path of an incorrectly predicted branch.

IBS_FETCH_ABORTED

The number of IBS sampled fetches that aborted. An attempted fetch is aborted if it did not complete and deliver instruction data to the decoder. An attempted fetch may abort at any point in the process of fetching instruction data. An abort may be due to a branch redirection as the result of a mispredicted branch. The number of IBS aborted fetch samples is a lower bound on the amount of unsuccessful, speculative fetch activity. It is a lower bound as the instruction data delivered by completed fetches may not be used.

IBS_FETCH_L1_ITLB_HIT

The number of IBS attempted fetch samples where the fetch operation initially hit in the L1 ITLB (Instruction Translation Lookaside Buffer).

IBS_FETCH_L1_ITLB_MISS_L2_ITLB_HIT

The number of IBS attempted fetch samples where the fetch operation initially missed in the L1 ITLB and hit in the L2 ITLB.

IBS_FETCH_L1_ITLB_MISS_L2_ITLB_MISS

The number of IBS attempted fetch samples where the fetch operation initially missed in both the L1 ITLB and the L2 ITLB.

IBS_FETCH_L1_ITLB_4K_PAGE

The number of IBS attempted fetch samples where the fetch operation produced a valid physical address (that is, address translation completed successfully) and used a 4-KByte page entry in the L1 ITLB.

IBS_FETCH_L1_ITLB_2M_PAGE

The number of IBS attempted fetch samples where the fetch operation produced a valid physical address (that is, address translation completed successfully) and used a 2 MB page entry in the L1 ITLB.

IBS_FETCH_LAT

The total latency of all IBS attempted fetch samples. Divide the total IBS fetch latency by the number of IBS attempted fetch samples to obtain the average latency of the attempted fetches that were sampled.

IBS_FETCH_L2C_MISS

The instruction fetch missed in the L2 Cache.

IBS_FETCH_ITLB_REFILL_LAT

The number of cycles when the fetch engine is stalled for an ITLB reload for the sampled fetch. If there is no reload, the latency will be 0.

Table 7.19 IBS Fetch Events - AMD Zen4 and Zen5 Platforms

IBS Fetch Event

Description

IBS_FETCH

The number of all the IBS fetch samples. This derived event counts the number of all the IBS fetch samples that were collected, including IBS-killed fetch samples.

IBS_FETCH_ATTEMPTED

The number of IBS sampled fetches that were not killed fetch attempts. This derived event measures the number of useful fetch attempts and does not include the number of IBS killed fetch samples. This event should be used to compute ratios such as the ratio of IBS fetch IC misses to attempted fetches. The number of attempted fetches should equal the sum of the number of completed fetches and the number of aborted fetches.

IBS_FETCH_COMPLETED

The number of IBS sampled fetches that completed. A fetch is completed if the attempted fetch delivers instruction data to the instruction decoder.

Although the instruction data was delivered, it may still not be used (for example, the instruction data may have been on the wrong path of an incorrectly predicted branch.)

IBS_FETCH_ABORTED

The number of IBS sampled fetches that aborted. An attempted fetch is aborted if it does not complete and deliver instruction data to the decoder. An attempted fetch may abort at any point in the process of fetching instruction data. An abort may be due to a branch redirection as the result of a mispredicted branch. The number of IBS aborted fetch samples is a lower bound on the amount of unsuccessful, speculative fetch activity. It is a lower bound as the instruction data delivered by completed fetches may not be used.

IBS_FETCH_L1_ITLB_HIT

The number of IBS attempted fetch samples where the fetch operation initially hit in the L1 ITLB (Instruction Translation Lookaside Buffer).

IBS_FETCH_L1_ITLB_MISS_L2_ITLB_HIT

The number of IBS attempted fetch samples where the fetch operation initially missed in the L1 ITLB and hit in the L2 ITLB.

IBS_FETCH_L1_IC_MISS

The number of IBS attempted fetch samples where the fetch operation initially missed in the IC (instruction cache).

IBS_FETCH_L1_IC_HIT

The number of IBS attempted fetch samples where the fetch operation initially hit in the IC.

IBS_FETCH_L1_ITLB_4K_PAGE

The number of IBS attempted fetch samples where the fetch operation produced a valid physical address (that is, address translation completed successfully) and used a 4 KB page entry in the L1 ITLB.

IBS_FETCH_L1_ITLB_2M_PAGE

The number of IBS attempted fetch samples where the fetch operation produced a valid physical address (that is, address translation completed successfully) and used a 2 MB page entry in the L1 ITLB.

IBS_FETCH_L1_ITLB_1G_PAGE

The number of IBS attempted fetch samples where the fetch operation produced a valid physical address (that is, address translation completed successfully) and used a 1 GB page entry in the L1 ITLB.

IBS_FETCH_LAT

The total latency of all IBS attempted fetch samples. Divide the total IBS fetch latency by the number of IBS attempted fetch samples to obtain the average latency of the attempted fetches that were sampled.

IBS_FETCH_L2_MISS

The instruction fetch missed in the L2 Cache.

IBS_FETCH_ITLB_REFILL_LAT

The number of cycles when the fetch engine is stalled for an ITLB reload for the sampled fetch. If there is no reload, the latency will be 0.

IBS_FETCH_OP_CACHE_MISS

The number of IBS attempted fetch samples where the Op Cache was not able to supply all the bytes for the tagged fetch.

IBS_FETCH_L3_MISS

The number of IBS attempted fetch samples where the instruction fetch missed in the L3 cache on the same CCX.

Here is a list of IBS fetch metrics.

Table 7.20 IBS Fetch Metrics

IBS Fetch Metric

Description

IBS_FETCH_LAT_AVE

The average IBS fetch latency. Calculated by dividing the IBS fetch latency by the total number of IBS fetch attempts.

IBS_FETCH_L1_ITLB_MISS_L2_ITLB_MISS_RATE_%

Percentage of IBS fetch L1 and L2 ITLB misses with respect to the total number of IBS fetch attempts.

IBS_FETCH_L1_ITLB_MISS_L2_ITLB_HIT_RATE_%

Percentage of IBS fetch L1 ITLB misses that hit in the L2 ITLB with respect to the total number of IBS fetch attempts.

IBS_FETCH_L1_IC_MISS_RATE_%

Percentage of IBS fetch L1 instruction cache misses with respect to the total number of IBS fetch attempts.
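
The fetch metrics above are simple quotients of the IBS fetch derived events. The following Python sketch computes them from a dictionary of made-up sample counts. The function name and the values are hypothetical, and using IBS_FETCH_ATTEMPTED as the "total number of IBS fetch attempts" is an assumption based on the event descriptions.

```python
from typing import Dict


def ibs_fetch_metrics(counts: Dict[str, int]) -> Dict[str, float]:
    """Derive the IBS fetch metrics from IBS fetch derived event counts."""
    # Assumption: the denominator is the attempted (not killed) fetch count.
    attempts = counts["IBS_FETCH_ATTEMPTED"]

    def percent(event: str) -> float:
        return 100.0 * counts[event] / attempts

    return {
        "IBS_FETCH_LAT_AVE": counts["IBS_FETCH_LAT"] / attempts,
        "IBS_FETCH_L1_ITLB_MISS_L2_ITLB_MISS_RATE_%":
            percent("IBS_FETCH_L1_ITLB_MISS_L2_ITLB_MISS"),
        "IBS_FETCH_L1_ITLB_MISS_L2_ITLB_HIT_RATE_%":
            percent("IBS_FETCH_L1_ITLB_MISS_L2_ITLB_HIT"),
        "IBS_FETCH_L1_IC_MISS_RATE_%": percent("IBS_FETCH_L1_IC_MISS"),
    }


# Hypothetical counts for one function.
fetch_counts = {
    "IBS_FETCH_ATTEMPTED": 2000,
    "IBS_FETCH_LAT": 50000,
    "IBS_FETCH_L1_ITLB_MISS_L2_ITLB_MISS": 10,
    "IBS_FETCH_L1_ITLB_MISS_L2_ITLB_HIT": 40,
    "IBS_FETCH_L1_IC_MISS": 100,
}

for name, value in ibs_fetch_metrics(fetch_counts).items():
    print(f"{name}: {value:.2f}")
```

Because both the numerator and the denominator are IBS sample counts, these quotients stay within the IBS sampling basis and do not mix IBS and EBP counts.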

Here is a list of IBS op events.

Table 7.21 IBS Op Events - AMD Zen1, Zen2, and Zen3 Platforms

IBS Op Event

Description

IBS_ALL_OPS

The number of all the IBS op samples collected. These op samples may be branch ops, resync ops, ops that perform load/store operations, or undifferentiated ops (for example, those ops that perform arithmetic operations, logical operations, and so on). IBS collects data for the retired ops. No data is collected for the ops that are aborted due to pipeline flushes and so on. Thus, all the sampled ops are architecturally significant and contribute to the successful execution of programs.

IBS_TAG_TO_RET or IBS_TAG_TO_RETIRE_CYCLES

The total number of tag-to-retire cycles across all the IBS op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS_COMP_TO_RET or IBS_COMP_TO_RETIRE_CYCLES

The total number of completion-to-retire cycles across all the IBS op samples. The completion-to-retire time of an op is the number of cycles from when the op completed to when the op retired.

IBS_BR

The number of IBS retired branch op samples. A branch operation is a change in the program control flow and includes unconditional and conditional branches, subroutine calls, and subroutine returns. Branch ops are used to implement AMD64 branch semantics.

IBS_MISP_BR or IBS_BR_MISP

The number of IBS samples for retired branch operations that were mispredicted. This event should be used to compute the ratio of mispredicted branch operations to all the branch operations.

IBS_TAKEN_BR

The number of IBS samples for the retired branch operations that were taken branches.

IBS_MISP_TAKEN_BR or IBS_TAKEN_BR_MISP

The number of IBS samples for the retired branch operations that were mispredicted taken branches.

IBS_RET

The number of IBS retired branch op samples where the operation was a subroutine return. These samples are a subset of all the IBS retired branch op samples.

IBS_MISP_RET or IBS_RET_MISP

The number of IBS retired branch op samples where the operation was a mispredicted subroutine return. This event should be used to compute the ratio of the mispredicted returns to all the subroutine returns.

IBS_RESYNC

The number of IBS resync op samples. A resync op is only found in certain microcoded AMD64 instructions and causes a complete pipeline flush.

Note

Not supported on Zen3 and later processors.

IBS_LOAD_STORE

The number of IBS op samples for ops that perform a load and/or store operation. Each op may perform a load operation, a store operation, or both a load and a store operation (each to the same address).

IBS_LOAD

The number of IBS op samples for ops that perform a load operation.

IBS_STORE

The number of IBS op samples for ops that perform a store operation.

IBS_L1_DTLB_HIT

The number of IBS op samples where either a load or store operation initially hit the L1 DTLB (data translation lookaside buffer).

IBS_DTLB_L1M_L2H

The number of IBS op samples where either a load or store operation initially missed in the L1 DTLB and hit the L2 DTLB.

IBS_DTLB_L1M_L2M

The number of IBS op samples where either a load or store operation initially missed in both the L1 DTLB and the L2 DTLB.

IBS_DC_MISS or IBS_L1_DC_MISS

The number of IBS op samples where either a load or store operation initially missed in the L1 DC.

IBS_DC_HIT or IBS_L1_DC_HIT

The number of IBS op samples where either a load or store operation initially hit the L1 DC.

IBS_MISALIGN_ACC or IBS_MISALIGN_ACCESS

The number of IBS op samples where either a load or store operation caused a misaligned access (for example, the load or store operation crossed a 256-bit boundary).

IBS_BANK_CONF_LOAD

The number of IBS op samples where either a load or store operation caused a bank conflict with a load operation.

Note

Not supported on Zen3 and later processors.

IBS_BANK_CONF_STORE

The number of IBS op samples where either a load or store operation caused a bank conflict with a store operation.

Note

Not supported on Zen3 and later processors.

IBS_FORWARDED

The number of IBS op samples where data for a load operation was forwarded from a store operation.

Note

Not supported on Zen3 and later processors.

IBS_STLF_CANCELLED

The number of IBS op samples where data forwarding to a load operation from a store was cancelled.

Note

Not supported on Zen3 and later processors.

IBS_UC_MEM_ACC or IBS_UC_MEM_ACCESS

The number of IBS op samples where a load or store operation accessed uncacheable (UC) memory.

IBS_WC_MEM_ACC or IBS_WC_MEM_ACCESS

The number of IBS op samples where a load or store operation accessed write combining (WC) memory.

IBS_LOCKED_OP

The number of IBS op samples where a load or store operation was a locked operation.

IBS_MAB_HIT

The number of IBS op samples where a load or store operation hit an already allocated entry in the Miss Address Buffer (MAB).

IBS_L1_DTLB_4K

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 4 KB page entry in the L1 DTLB was used for the address translation.

IBS_L1_DTLB_2M

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 2 MB page entry in the L1 DTLB was used for the address translation.

IBS_L1_DTLB_1G

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 1 GB page entry in the L1 DTLB was used for the address translation.

IBS_L2_DTLB_4K

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit the L2 DTLB, and used a 4 KB page entry for the address translation.

IBS_L2_DTLB_2M

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit the L2 DTLB, and used a 2 MB page entry for the address translation.

IBS_L2_DTLB_1G

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit the L2 DTLB, and used a 1 GB page entry for address translation.

IBS_LD_L1_DC_MISS_LAT or IBS_DC_MISS_LAT

The total L1 DC miss load latency (in processor cycles) across all the IBS op samples that performed a load operation and missed in the data cache. The miss latency is the number of clock cycles from when the L1 data cache miss was detected to when data was delivered to the core.

IBS_LOAD_RESYNC

Load resync.

Note

Not supported on Zen3 and later processors.

IBS_NB_LOCAL

The number of IBS op samples where a load operation was serviced from the local processor. Northbridge IBS data is only valid for the load operations that miss in both the L1 data cache and the L2 data cache. If a load operation crosses a cache line boundary, the IBS data reflects the access to the lower cache line.

Note

Not supported on Zen3 and later processors.

IBS_NB_REMOTE

The number of IBS op samples where a load operation was serviced from a remote processor.

Note

Not supported on Zen3 and later processors.

IBS_NB_LOCAL_L3

The number of IBS op samples where a load operation was serviced by the local L3 cache.

Note

Not supported on Zen3 and later processors.

IBS_NB_LOCAL_CACHE

The number of IBS op samples where a load operation was serviced by a cache (L1 or L2 data cache) belonging to a local core which is a sibling of the core making the memory request.

Note

Not supported on Zen3 and later processors.

IBS_LD_LOCAL_PEER_CACHE_HIT

The number of IBS op samples where load data was returned from a local L3 hit, from a different L1/L2 cache in the same CCX, or from an L1/L2/L3 hit in another CCX of the same node.

Note

Not supported on Zen3 and later processors.

IBS_NB_REMOTE_CACHE or IBS_LD_RMT_CACHE_HIT

The number of IBS op samples where a load operation was serviced by a remote L1 data cache, L2 cache, or L3 cache after traversing one or more coherent HyperTransport links.

IBS_NB_LOCAL_DRAM or IBS_LD_LOCAL_DRAM_HIT

The number of IBS op samples where a load operation was serviced by the local NUMA node’s DRAM (via the local memory controller).

IBS_NB_REMOTE_DRAM or IBS_LD_RMT_DRAM_HIT

The number of IBS op samples where a load operation was serviced by the remote NUMA node’s DRAM (after traversing one or more coherent HyperTransport links and through a remote memory controller).

IBS_NB_LOCAL_OTHER

The number of IBS op samples where a load operation was serviced from local MMIO, configuration or PCI space, or from the local APIC.

Note

Not supported on Zen3 and later processors.

IBS_NB_REMOTE_OTHER

The number of IBS op samples where a load operation was serviced from remote MMIO, configuration, or PCI space.

Note

Not supported on Zen3 and later processors.

IBS_NB_CACHE_MODIFIED

The number of IBS op samples where a load operation was serviced from local or remote cache, and the cache hit state was the Modified (M) state.

Note

Not supported on Zen3 and later processors.

IBS_NB_CACHE_OWNED

The number of IBS op samples where a load operation was serviced from local or remote cache, and the cache hit state was the Owned (O) state.

Note

Not supported on Zen3 and later processors.

IBS_NB_LOCAL_LAT

The total data cache miss latency (in processor cycles) for the load operations that were serviced by the local processor.

Note

Not supported on Zen3 and later processors.

IBS_NB_REMOTE_LAT

The total data cache miss latency (in processor cycles) for the load operations that were serviced by a remote processor.

Note

Not supported on Zen3 and later processors.
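
The descriptions of IBS_MISP_BR and IBS_MISP_RET above mention ratios that can be computed directly from the sample counts. The following Python sketch is illustrative only: the event names come from the table above, while the `counts` dictionary, its values, and the helper function names are hypothetical and do not come from a real profile or from the tool.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def branch_ratios(counts):
    """Derive misprediction ratios from raw IBS op sample counts."""
    return {
        # Mispredicted branch ops relative to all sampled branch ops.
        "branch_mispredict_ratio": ratio(counts["IBS_MISP_BR"], counts["IBS_BR"]),
        # Mispredicted returns relative to all sampled subroutine returns.
        "return_mispredict_ratio": ratio(counts["IBS_MISP_RET"], counts["IBS_RET"]),
    }


# Hypothetical sample counts, for illustration only.
counts = {
    "IBS_BR": 2000,
    "IBS_MISP_BR": 100,
    "IBS_RET": 400,
    "IBS_MISP_RET": 8,
}

for name, value in branch_ratios(counts).items():
    print(f"{name}: {value:.3f}")
```

Because IBS counts are samples rather than exact totals, ratios of this kind are estimates and become more reliable as the number of samples grows.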

Table 7.22 IBS Op Events - AMD Zen4 and Zen5 Platforms

IBS Op Event

Description

IBS_ALL_OPS

The number of all the IBS op samples that were collected. These samples may be branch ops, resync ops, ops that perform load/store operations, or undifferentiated ops (for example, ops that perform arithmetic operations, logical operations, and so on). IBS collects data for retired ops. No data is collected for ops that are aborted due to pipeline flushes and so on. Thus, all sampled ops are architecturally significant and contribute to the successful program execution.

IBS_TAG_TO_RETIRE_CYCLES

The total number of tag-to-retire cycles across all the IBS op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS_COMP_TO_RETIRE_CYCLES

The total number of completion-to-retire cycles across all the IBS op samples. The completion-to-retire time of an op is the number of cycles from when the op completed to when the op retired.

IBS_BR

The number of IBS retired branch op samples. A branch operation is a change in the program control flow and includes unconditional and conditional branches, subroutine calls, and subroutine returns. Branch ops are used to implement AMD64 branch semantics.

IBS_BR_MISP

The number of IBS samples for the retired branch operations that were mispredicted. This event should be used to compute the ratio of mispredicted branch operations to all branch operations.

IBS_TAKEN_BR

The number of IBS samples for retired branch operations that were taken branches.

IBS_TAKEN_BR_MISP

The number of IBS samples for the retired branch operations that were mispredicted taken branches.

IBS_RET

The number of IBS retired branch op samples where the operation was a subroutine return. These samples are a subset of all the IBS retired branch op samples.

IBS_RET_MISP

The number of IBS retired branch op samples where the operation was a mispredicted subroutine return. This event should be used to compute the ratio of the mispredicted returns to all the subroutine returns.

IBS_FUSED_INST_OP

Tagged operation was part of a fused instruction pair.

IBS_MICROCODE_OP

Tagged operation from microcode.

IBS_BR_TAG_TO_RETIRE_CYCLES

The total number of tag-to-retire cycles across all IBS op branch samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS_BR_MISP_TAG_TO_RETIRE_CYCLES

The total number of tag-to-retire cycles across all branch mispredict instruction op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS_TAKEN_BR_TAG_TO_RETIRE_CYCLES

The total number of tag-to-retire cycles across all branch taken op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS_RET_TAG_TO_RETIRE_CYCLES

The total number of tag-to-retire cycles across all branch return op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS_BR_COMP_TO_RETIRE_CYCLES

The total number of completion-to-retire cycles across all IBS branch samples. The completion-to-retire time of an op is the number of cycles from when the op completed to when the op retired.

IBS_BR_MISP_COMP_TO_RETIRE_CYCLES

The total number of completion-to-retire cycles across all branch mispredict instruction op samples. The completion-to-retire time of an op is the number of cycles from when the op completed to when the op retired.

IBS_TAKEN_BR_COMP_TO_RETIRE_CYCLES

The total number of completion-to-retire cycles across all IBS taken branch op samples. The completion-to-retire time of an op is the number of cycles from when the op completed to when the op retired.

IBS_RET_COMP_TO_RETIRE_CYCLES

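The total number of completion-to-retire cycles across all IBS branch return op samples. The completion-to-retire time of an op is the number of cycles from when the op completed to when the op retired.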

IBS_LOAD_STORE

The number of IBS op samples for the ops that perform a load and/or store operation. Each op may perform a load operation, a store operation, or both a load and a store operation (each to the same address).

IBS_LOAD

The number of IBS op samples for the ops that perform a load operation.

IBS_STORE

The number of IBS op samples for the ops that perform a store operation.

IBS_L1_DTLB_HIT

The number of IBS op samples where either a load or store operation initially hit in the L1 DTLB (data translation lookaside buffer).

IBS_DTLB_L1M_L2H

The number of IBS op samples where either a load or store operation initially missed in the L1 DTLB and hit in the L2 DTLB.

IBS_DTLB_L1M_L2M

The number of IBS op samples where either a load or store operation initially missed in both the L1 DTLB and the L2 DTLB.

IBS_L1_DC_MISS

The number of IBS op samples where either a load or store operation initially missed in the L1 data cache (DC).

IBS_L1_DC_HIT

The number of IBS op samples where either a load or store operation initially hit in the L1 data cache (DC).

IBS_MISALIGN_ACCESS

The number of IBS op samples where either a load or store operation caused a misaligned access (that is, the load or store operation crossed a 64 byte boundary).

IBS_UC_MEM_ACCESS

The number of IBS op samples where a load or store operation accessed uncacheable (UC) memory.

IBS_WC_MEM_ACCESS

The number of IBS op samples where a load or store operation accessed write combining (WC) memory.

IBS_LOCKED_OP

The number of IBS op samples where a load or store operation was a locked operation.

IBS_MAB_HIT

The number of IBS op samples where a load or store operation hit an allocated entry in the Miss Address Buffer (MAB).

IBS_L1_DTLB_4K

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 4 KB page entry in L1 DTLB was used for the address translation.

IBS_L1_DTLB_2M

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 2 MB page entry in L1 DTLB was used for the address translation.

IBS_L1_DTLB_1G

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 1 GB page entry in L1 DTLB was used for the address translation.

IBS_L2_DTLB_4K

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit L2 DTLB, and used a 4 KB page entry for the address translation.

IBS_L2_DTLB_2M

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit L2 DTLB, and used a 2 MB page entry for the address translation.

IBS_L2_DTLB_1G

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit L2 DTLB, and used a 1 GB page entry for the address translation.

IBS_LD_L1_DC_MISS_LAT

The total L1 DC miss load latency (in processor cycles) across all the IBS op samples that performed a load operation and missed in the data cache. The miss latency is the number of clock cycles from when the L1 data cache miss was detected to when data was delivered to the core.

IBS_ST_L1_DC_MISS

The number of IBS op samples where a store operation missed in L1 data cache.

IBS_ST_L1_DC_HIT

The number of IBS op samples where a store operation hit in L1 data cache.

IBS_LD_L1_DC_HIT

The number of IBS op samples where a load operation hit in L1 data cache.

IBS_LD_L1_DC_MISS

The number of IBS op samples where a load operation missed in L1 data cache.

IBS_LD_L2_HIT

The number of IBS op samples where a load operation hit in L2 cache.

IBS_LD_L2_MISS

The number of IBS op samples where a load operation missed in L2 cache.

IBS_LD_L2_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the L2 cache.

IBS_L1_DTLB_REFILL_LAT

The number of cycles from when an L1 DTLB refill is triggered by a tagged op to when the L1 DTLB fill has been completed.

IBS_LD_TAG_TO_RETIRE_CYCLES

The total number of tag-to-retire cycles across all IBS op load samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS_ST_TAG_TO_RETIRE_CYCLES

The total number of tag-to-retire cycles across all IBS op store samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS_LD_ST_TAG_TO_RETIRE_CYCLES

The total number of tag-to-retire cycles across all IBS op load and store samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS_UC_MEM_ACCESS_TAG_TO_RETIRE_CYCLES

The total number of tag-to-retire cycles across all IBS UC memory access op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS_WC_MEM_ACCESS_TAG_TO_RETIRE_CYCLES

The total number of tag-to-retire cycles across all IBS WC memory access op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS_MISALIGN_ACCESS_TAG_TO_RETIRE_CYCLES

The total number of tag-to-retire cycles across all IBS misalign access op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS_LD_RMT_CACHE_HIT

The number of IBS op samples where a load operation was serviced by a remote L1 data, L2, or L3 cache after traversing one or more coherent HyperTransport links.

IBS_LD_LOCAL_DRAM_HIT

The number of IBS op samples where a load operation was serviced by the local NUMA node’s DRAM (via the local memory controller).

IBS_LD_RMT_DRAM_HIT

The number of IBS op samples where a load operation was serviced by the remote NUMA node’s DRAM (after traversing one or more coherent HyperTransport links and through a remote memory controller).

IBS_LD_LOCAL_CACHE_HIT

The number of IBS op samples where a load operation was serviced by the shared L3 cache or other L1/L2 cache in the same CCX.

IBS_LD_PEER_CACHE_HIT

The number of IBS op samples where a load operation was serviced by L2/L3 cache in a different CCX of same NUMA node.

IBS_LD_DRAM_HIT

The number of IBS op samples where a load operation was serviced by the DRAM.

IBS_LD_NVDIMM_HIT

The number of IBS op samples where a load operation was serviced by the NVDIMM.

IBS_LD_NON_MAIN_MEM_HIT

The number of IBS op samples where a load operation was serviced from MMIO, configuration or PCI space, or from the local APIC.

IBS_LD_EXT_MEM_HIT

The number of IBS op samples where a load operation was serviced by Extension memory.

IBS_LD_PEER_AGENT_MEM

The number of IBS op samples where a load operation was serviced by Peer agent memory.

IBS_LD_LOCAL_NVDIMM_HIT

The number of IBS op samples where a load operation was serviced by local long-latency DIMM.

IBS_LD_RMT_NVDIMM_HIT

The number of IBS op samples where a load operation was serviced by remote long-latency DIMM.

IBS_LD_CACHE_HITM

The number of IBS op samples where a load operation was serviced from the local or remote cache, and the cache hit state was the Modified (M) state.

IBS_LD_CACHE_HIT

The number of IBS op samples where a load operation was serviced from the local or remote cache, and the cache hit state was the Owned (O) state.

IBS_LD_LOCAL_CACHE_HITM

The number of IBS op samples where a load operation was serviced from local L3 or other L2 in the same CCX, and the cache hit state was the Modified (M) state.

IBS_LD_PEER_CACHE_HITM

The number of IBS op samples where a load operation was serviced from another L3 in same NUMA node, and the cache hit state was the Modified (M) state.

IBS_LD_RMT_CACHE_HITM

The number of IBS op samples where a load operation was serviced from another L3 in different NUMA node, and the cache hit state was the Modified (M) state.

IBS_LD_LOCAL_CACHE_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the shared L3 cache or other L1/L2 in the same CCX.

IBS_LD_PEER_CACHE_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the L2/L3 cache in different CCX of the same NUMA node.

IBS_LD_RMT_CACHE_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by L2/L3 cache in different CCX on different NUMA node.

IBS_LD_LOCAL_DRAM_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the DRAM in the same NUMA node (including on socket NUMA nodes).

IBS_LD_RMT_DRAM_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the DRAM in a different NUMA node.

IBS_LD_DRAM_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the DRAM.

IBS_LD_NVDIMM_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the NVDIMM-P.

IBS_LD_LOCAL_NVDIMM_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the local NVDIMM.

IBS_LD_RMT_NVDIMM_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the remote NVDIMM.

IBS_LD_EXTN_MEM_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the extension memory.

IBS_LD_LOCAL_EXTN_MEM_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the local extension memory.

IBS_LD_RMT_EXTN_MEM_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the remote extension memory.

IBS_LD_PEER_AGENT_MEM_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the peer agent memory.

IBS_LD_LOCAL_PEER_AGENT_MEM_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the local peer agent memory.

IBS_LD_RMT_PEER_AGENT_MEM_HIT_LAT

The total latency (in processor cycles) for load operations that were serviced by the remote peer agent memory.

The total latency (in processor cycles) for load operations that were serviced by the MMIO/Config/PCI/APIC.
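
Each latency event in the table above is an accumulated total, so it is most informative when read together with the matching sample count. The following Python sketch assumes that dividing a latency total by the corresponding hit count gives a useful average per sampled load. The event names are taken from the table; the pairing of events, the function name, and the sample values are assumptions made for illustration.

```python
# Pairs of (sample-count event, total-latency event) taken from the table above.
LATENCY_PAIRS = [
    ("IBS_LD_L2_HIT", "IBS_LD_L2_HIT_LAT"),
    ("IBS_LD_LOCAL_CACHE_HIT", "IBS_LD_LOCAL_CACHE_HIT_LAT"),
    ("IBS_LD_PEER_CACHE_HIT", "IBS_LD_PEER_CACHE_HIT_LAT"),
    ("IBS_LD_RMT_CACHE_HIT", "IBS_LD_RMT_CACHE_HIT_LAT"),
    ("IBS_LD_LOCAL_DRAM_HIT", "IBS_LD_LOCAL_DRAM_HIT_LAT"),
    ("IBS_LD_RMT_DRAM_HIT", "IBS_LD_RMT_DRAM_HIT_LAT"),
]


def average_latencies(samples):
    """Return average cycles per sampled load for each data source with samples."""
    averages = {}
    for count_event, latency_event in LATENCY_PAIRS:
        count = samples.get(count_event, 0)
        if count:  # skip data sources with no sampled loads
            averages[count_event] = samples.get(latency_event, 0) / count
    return averages


# Hypothetical values, for illustration only.
samples = {
    "IBS_LD_L2_HIT": 500,
    "IBS_LD_L2_HIT_LAT": 7000,
    "IBS_LD_LOCAL_DRAM_HIT": 40,
    "IBS_LD_LOCAL_DRAM_HIT_LAT": 12000,
}

for source, cycles in average_latencies(samples).items():
    print(f"{source}: {cycles:.1f} cycles per load")
```

Comparing these averages across data sources gives a rough picture of how much more expensive a DRAM access is than a cache hit for the profiled workload.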

The following table lists the IBS op metrics for AMD Zen3, Zen4, and Zen5 platforms.

Table 7.23 IBS Op Metrics for AMD Zen3, Zen4 and AMD Zen5 Platforms

IBS Op Metric

Description

%IBS_BR

Percentage of IBS Branch operations with respect to the total IBS operations.

%IBS_BR_COMP_TO_RETIRE_CYCLES

Percentage of IBS Branch op completion to retire cycles.

%IBS_BR_MISP

Percentage of IBS Branch mispredict operations with respect to IBS branch operations.

%IBS_BR_MISP_COMP_TO_RETIRE_CYCLES

Percentage of IBS Branch mispredict op completion to retire cycles.

%IBS_BR_MISP_CYCLES

Percentage of cycles wasted due to branch mispredicts. The Tag-To-Retire cycles of branch mispredicts divided by the total Tag-To-Retire cycles of all the operations, expressed as percentage.

%IBS_BR_MISP_TAG_TO_RETIRE_CYCLES

Percentage of IBS Branch mispredict op tag to retire cycles.

%IBS_BR_TAG_TO_RETIRE_CYCLES

Percentage of IBS Branch op tag to retire cycles.

%IBS_L1_DTLB_REFILL_LAT_CYCLES

Percentage of cycles wasted due to L1 DTLB misses. The number of L1 DTLB refill latency cycles divided by the total number of Tag-To-Retire cycles of all the operations, expressed as percentage.

%IBS_LD_DRAM_HIT_LAT

Percentage of IBS load DRAM hit latency cycles with respect to the load L1 DC miss latency cycles.

%IBS_LD_EXTN_MEM_HIT_LAT

Percentage of IBS load Extension Memory hit latency cycles with respect to the load L1 DC miss latency cycles.

Note

Not supported on Zen3 processors.

%IBS_LD_L1_DC_MISS_LAT_CYCLES

Percentage of cycles wasted to fetch the data. The number of Load L1 DC misses latency cycles divided by the total number of Tag-To-Retire cycles of all the operations, expressed as percentage.

%IBS_LD_L2_HIT_LAT

Percentage of IBS load L2 hit latency cycles with respect to load L1 DC miss latency cycles.

%IBS_LD_LOCAL_CACHE_HIT_LAT

Percentage of IBS load local cache hit latency cycles with respect to the load L1 DC miss latency cycles.

%IBS_LD_LOCAL_DRAM_HIT_LAT

Percentage of IBS load local DRAM hit latency cycles with respect to the load L1 DC miss latency cycles.

%IBS_LD_NON_MAIN_MEM_HIT_LAT

Percentage of IBS load Non main memory hit latency cycles with respect to the load L1 DC miss latency cycles.

%IBS_LD_NVDIMM_HIT_LAT

Percentage of IBS load NVDIMM hit latency cycles with respect to the load L1 DC miss latency cycles.

%IBS_LD_PEER_AGENT_MEM_HIT_LAT

Percentage of IBS load Peer Agent Memory hit latency cycles with respect to the load L1 DC miss latency cycles.

Note

Not supported on Zen3 processors.

%IBS_LD_PEER_CACHE_HIT_LAT

Percentage of IBS load peer cache hit latency cycles with respect to the load L1 DC miss latency cycles.

%IBS_LD_RMT_CACHE_HIT_LAT

Percentage of IBS load remote cache hit latency cycles with respect to the load L1 DC miss latency cycles.

%IBS_LD_RMT_DRAM_HIT_LAT

Percentage of IBS load remote DRAM hit latency cycles with respect to the load L1 DC miss latency cycles.

%IBS_LOAD

Percentage of Load operations. The total number of load operations divided by the number of IBS OP samples, expressed as percentage.

%IBS_LOAD_STORE

Percentage of Load and Store operations. The total number of load and store operations divided by the number of IBS OP samples, expressed as percentage.

%IBS_RET

Percentage of IBS Branch return operations with respect to IBS branch operations.

%IBS_RET_COMP_TO_RETIRE_CYCLES

Percentage of IBS Branch return op completion to retire cycles.

%IBS_RET_TAG_TO_RETIRE_CYCLES

Percentage of IBS Branch return op tag to retire cycles.

%IBS_STORE

Percentage of Store operations. The total number of store operations divided by the number of IBS OP samples, expressed as percentage.

%IBS_TAKEN_BR

Percentage of IBS Branch taken operations with respect to IBS branch operations.

%IBS_TAKEN_BR_COMP_TO_RETIRE_CYCLES

Percentage of IBS Branch taken op completion to retire cycles.

%IBS_TAKEN_BR_TAG_TO_RETIRE_CYCLES

Percentage of IBS Branch taken op tag to retire cycles.

IBS_BR_MISP_PTI

Number of branch mispredicts per thousand operations. The number of branch mispredicts divided by the total number of operations, expressed per thousand instructions (PTI).

IBS_BR_MISP_RATE_%

Branch mispredict rate in percentage. The number of branch mispredicts divided by the total number of branch operations, expressed as percentage.

IBS_LD_DRAM_HIT_RATE_%

Percentage of load samples where the load operation was serviced by DRAM in the system. The number of IBS_LD_DRAM_HIT divided by IBS_LOAD, expressed in percentage.

IBS_LD_EXT_MEM_HIT_RATE_%

Percentage of load samples where the load operation was serviced by Extension Memory in the system. The number of IBS_LD_EXT_MEM_HIT divided by IBS_LOAD, expressed in percentage.

Note

Not supported on Zen3 processors.

IBS_LD_PEER_AGENT_MEM_RATE_%

Percentage of load samples where the load operation was serviced by Peer agent Memory in the system. The number of IBS_LD_PEER_AGENT_MEM divided by IBS_LOAD, expressed in percentage.

IBS_LD_NON_MAIN_MEM_HIT_RATE_%

Percentage of load samples where the load operation was serviced from MMIO, configuration or PCI space, or from the local APIC in the system. The number of IBS_LD_NON_MAIN_MEM_HIT divided by IBS_LOAD, expressed in percentage.

IBS_LD_L1_DC_MISS_LAT_AVE

Average Load L1 DC Miss latency cycles. The total load L1 DC miss latency cycles divided by the number of load L1 DC misses.

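
Several of the metrics in the table above are simple quotients of raw IBS op event totals. The Python sketch below works through a few of them. It is a sketch under stated assumptions rather than the tool's implementation: the mapping from each description to specific raw event names is inferred from the wording of the table, and the function names and sample values are hypothetical.

```python
def percent(part, whole):
    """Return part / whole as a percentage, or 0.0 if whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def derived_metrics(ev):
    """Compute a few derived metrics from a dict of raw IBS op event totals."""
    miss_count = ev["IBS_LD_L1_DC_MISS"]
    return {
        # Load ops as a share of all sampled ops.
        "%IBS_LOAD": percent(ev["IBS_LOAD"], ev["IBS_ALL_OPS"]),
        # Mispredicted branches as a share of all sampled branches.
        "IBS_BR_MISP_RATE_%": percent(ev["IBS_BR_MISP"], ev["IBS_BR"]),
        # Tag-to-retire cycles of mispredicts as a share of all tag-to-retire cycles.
        "%IBS_BR_MISP_CYCLES": percent(
            ev["IBS_BR_MISP_TAG_TO_RETIRE_CYCLES"], ev["IBS_TAG_TO_RETIRE_CYCLES"]
        ),
        # Loads serviced by DRAM as a share of all sampled loads.
        "IBS_LD_DRAM_HIT_RATE_%": percent(ev["IBS_LD_DRAM_HIT"], ev["IBS_LOAD"]),
        # Average latency in cycles per load that missed the L1 data cache.
        "IBS_LD_L1_DC_MISS_LAT_AVE": (
            ev["IBS_LD_L1_DC_MISS_LAT"] / miss_count if miss_count else 0.0
        ),
    }


# Hypothetical event totals, for illustration only.
events = {
    "IBS_ALL_OPS": 20000,
    "IBS_TAG_TO_RETIRE_CYCLES": 400000,
    "IBS_BR": 4000,
    "IBS_BR_MISP": 200,
    "IBS_BR_MISP_TAG_TO_RETIRE_CYCLES": 20000,
    "IBS_LOAD": 5000,
    "IBS_LD_DRAM_HIT": 50,
    "IBS_LD_L1_DC_MISS": 250,
    "IBS_LD_L1_DC_MISS_LAT": 10000,
}

for name, value in derived_metrics(events).items():
    print(f"{name}: {value:.2f}")
```

Guarding every division against a zero denominator matters in practice, because short profiles may contain no samples at all for some events.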

7.9.3.2.5. Limitations

CPU profiling in AMD uProf has the following limitations:

IMIX has the following limitations:

7.9.3.3. Cache Analysis

Cache Analysis uses IBS OP samples to detect hot false-sharing cache lines in multi-threaded applications and in multi-process applications that use shared memory.

At a high-level, this feature will report:

7.9.3.3.1. Supported Metrics

The following IBS OP derived metrics are used to generate the false cache sharing report.

Table 7.24 IBS Op Derived Metrics

IBS Op Metric

Description

IBS_LOAD_STORE

Total Loads and stores sampled

IBS_LOAD

Total Loads

IBS_STORE

Total Stores

IBS_DC_MISS_LAT

Accumulated load latencies for the loads to cache lines

IBS_LOAD_DC_L2_HIT

Load operations hit in data cache or L2 cache

IBS_NB_LOCAL_CACHE_MODIFIED

Loads that were serviced from the local cache (L3) and the cache hit state was Modified

IBS_NB_LOCAL_CACHE_OWNED

Loads that were serviced from the local cache (L3) and the cache hit state was Owned

IBS_NB_LOCAL_CACHE_MISS

Loads that were missed in local cache (L3) and serviced by remote cache, local, or remote DRAM

IBS_NB_REMOTE_CACHE_MODIFED

Loads that were serviced from the remote cache (L3) and the cache hit state was Modified

IBS_NB_REMOTE_CACHE_OWNED

Loads that were serviced from the remote cache (L3) and the cache hit state was Owned

IBS_NB_LOCAL_DRAM

Loads that hit in local memory (Memory channels attached to local socket or local CCD)

IBS_NB_REMOTE_DRAM

Loads that hit in remote memory (Memory channels attached to remote socket or other CCDs in the local socket)

IBS_STORE_DC_MISS

Store operations missed in data cache

7.9.3.3.2. Cache Analysis Using GUI

Configuring and Starting Profile

To perform cache analysis, complete the following steps:

1. Select the profile target.

2. Select Cache Analysis profile type in Predefined Configs tab.

3. Start the profile.

Analyzing the Report

After the profile completes, navigate to the Cache Analysis page in the MEMORY tab to analyze the profile data. This page shows the cache lines and their offsets with the associated metric values:

Figure 7.46 Cache Analysis

The Cache Analysis screen has the following options:

7.9.3.3.3. Cache Analysis Using CLI

The CLI has a config type called memory that collects the data required for cache analysis. Run the following command to collect the profile data:

$ AMDuProfCLI collect --config memory -o /tmp/cache_analysis <target app>

This command will launch the program and collect the profile data required to generate the cache analysis report. The raw profile data file is created in the /tmp/cache_analysis/AMDuProf-IBS_<timestamp>/ directory.
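
When the collection is driven from a script, the command line can be assembled programmatically. The Python sketch below uses only the options shown in the command above. The helper names, the `./my_app` target, and its `--threads 8` arguments are hypothetical, and picking the most recently modified AMDuProf-IBS_* directory as the session directory is an assumption about how to locate the output.

```python
import glob
import os
import shlex


def build_collect_command(output_dir, target_argv):
    """Build the argv list for a cache analysis collect run."""
    return ["AMDuProfCLI", "collect", "--config", "memory", "-o", output_dir, *target_argv]


def latest_session_dir(output_dir):
    """Return the most recently modified AMDuProf-IBS_* directory, or None."""
    candidates = glob.glob(os.path.join(output_dir, "AMDuProf-IBS_*"))
    candidates = [path for path in candidates if os.path.isdir(path)]
    return max(candidates, key=os.path.getmtime) if candidates else None


# "./my_app --threads 8" is a hypothetical target application and arguments.
command = build_collect_command("/tmp/cache_analysis", ["./my_app", "--threads", "8"])
print(shlex.join(command))

# To run the collection for real, pass the list to subprocess.run(command, check=True)
# and then call latest_session_dir("/tmp/cache_analysis") to find the session data.
```

Passing the command as a list rather than a single string avoids shell quoting problems when the target application path or its arguments contain spaces.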

Report Generation and Analysis

Use the following CLI command to generate the cache analysis report.

$ AMDuProfCLI report -i /tmp/cache_analysis/AMDuProf-IBS_<timestamp>/

This generates a CSV report at /tmp/cache_analysis/AMDuProf-IBS_<timestamp>/report.csv with the following sections.

The following figure shows the Cache Analysis summary sections.

Cache Analysis - Summary Sections.

Figure 7.47 Cache Analysis - Summary Sections#

The following figure shows the Cache Analysis detailed report.

Cache Analysis - Detailed Report.

Figure 7.48 Cache Analysis - Detailed Report#

To change the sort order during report generation, pass any of the listed metric options with the following option (for example, --sort-by event=ldst-count).

--sort-by event=<METRIC>
Table 7.25 Sort-by Metric Options#

Sort-by Metric Options

Description

ldst-count

Total Loads and stores sampled

ld-count

Total Loads

st-count

Total Stores

cache-hitm

Loads that were serviced either from the local or remote cache (L3) and the cache hit state was Modified.

lcl-cache-hitm

Loads that were serviced from the local cache (L3) and the cache hit state was Modified.

rmt-cache-hitm

Loads that were serviced from the remote cache (L3) and the cache hit state was Modified.

lcl-dram-hit

Loads that hit in local memory (memory channels attached to local socket or local CCD).

rmt-dram-hit

Loads that hit in remote memory (memory channels attached to remote socket or other CCDs in the local socket).

l3-miss

Loads that are missed in local cache (L3) and serviced by remote cache, local, or remote DRAM.

st-dc-miss

Store operations missed in data cache.

Note

You can also use the command info --list cacheline-events for a list of supported metrics for sort-by option.

7.9.3.4. Branch Analysis

AMD Zen4 processors support the Last Branch Record (LBR) CPU feature, which is useful for branch analysis. Use the uProf CLI to collect the data and generate the branch analysis report.

7.9.3.4.1. Prerequisites

A PMC event must be enabled for LBR sample collection. If no PMC event is passed, the PMCX0C0 event is enabled during LBR sample collection.

7.9.3.4.2. Configuration

CLI

  1. Collect the LBR info.

    $ AMDuProfCLI collect --branch-filter -o /tmp/ ./ScimarkStable/scimark2_64static
    
  2. Generate branch analysis report.

    $ AMDuProfCLI report --detail -i /tmp/AMDuProf-scimark2_64static-Custom_mmm-dd-yyyy_hh-mm-ss
    
7.9.3.4.3. Analyze the Data

The report generated contains a section for branch analysis. Here is a sample screenshot of the Branch Analysis Summary.

Branch Analysis Summary.

Figure 7.49 Branch Analysis Summary#

7.9.3.4.4. Limitations

Branch analysis has the following limitations:

The branch analysis summary table comprises the following columns:

Table 7.26 Branch Analysis Summary Table#

Column

Description

MISPREDICT (%)

Indicates the percentage of samples collected for the branch that were mispredicted. Calculated as: (MISPREDICT COUNT * 100) / SAMPLES.

MISPREDICT COUNT

Shows the number of mispredicted branch samples collected for the branch.

OVERHEAD (%)

Indicates which branches were taken most often, as the share of all samples attributed to this branch. Calculated as: (SAMPLES * 100) / (Total SAMPLES).

PROCESS

Shows the name and PID of the process.

SAMPLES

Shows the number of samples collected for the branch. This does not indicate the actual branches taken.

SOURCE FUNCTION

Shows the function from where the branch was taken.

SOURCE LINE

Shows the file path and line number (from where the branch was taken) of the SOURCE FUNCTION.

SOURCE MODULE

Shows the module name of the SOURCE FUNCTION.

TARGET FUNCTION

Shows the function into which the branch was taken.

TARGET LINE

Shows the file path and line number (into which the branch was taken) of the TARGET FUNCTION.

TARGET MODULE

Shows the module name of the TARGET FUNCTION.
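The two calculated columns in the table above can be reproduced by hand from the SAMPLES and MISPREDICT COUNT columns. The following is a minimal Python sketch of that arithmetic; the function name, the row names, and the sample counts are hypothetical and are not taken from an AMD uProf report.

```python
def branch_metrics(samples, mispredict_count, total_samples):
    """Return (MISPREDICT %, OVERHEAD %) for one branch row.

    Formulas follow the column descriptions:
      MISPREDICT (%) = (MISPREDICT COUNT * 100) / SAMPLES
      OVERHEAD (%)   = (SAMPLES * 100) / (Total SAMPLES)
    """
    if samples <= 0 or total_samples <= 0:
        # No samples for this branch: report zeros instead of dividing by zero.
        return 0.0, 0.0
    mispredict_pct = mispredict_count * 100 / samples
    overhead_pct = samples * 100 / total_samples
    return mispredict_pct, overhead_pct


# Hypothetical rows: (branch name, SAMPLES, MISPREDICT COUNT)
rows = [("loop_back_edge", 800, 40), ("error_check", 200, 50)]
total = sum(samples for _, samples, _ in rows)

for name, samples, mispredicts in rows:
    mp, oh = branch_metrics(samples, mispredicts, total)
    print(f"{name}: MISPREDICT={mp:.1f}% OVERHEAD={oh:.1f}%")
```

A branch with a low OVERHEAD (%) but a high MISPREDICT (%), such as the second hypothetical row, is rarely sampled but frequently mispredicted when it is.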

7.9.3.5. Virtualization Support

7.9.3.5.1. Profiling of Guest VM from Guest VM

Time based profiling can be performed on all the supported Host and Guest VMs, whereas the hardware counter profiling is completely dependent on the vPMUs exposed by the hypervisor.

7.9.3.5.2. Profiling of Guest VM from Host System (KVM Hypervisor)

This feature supports profiling of KVM guest OS kernel and kernel modules (*.ko) from the host. The following features are supported:

The following features are not supported:

  • Call stack

  • Attach to process

  • Launch application

7.9.3.5.3. Preparing Host system to Profile Guest Kernel Modules

Before beginning the profiling on the guest OS, the following files must be copied on the host machine to facilitate symbol resolution for the guest VMs:

  1. Copy /proc/kallsyms and /proc/modules from the guest OS to the host machine.

  2. Copy guest vmlinux and kernel sources in a folder on the host system.

These files should belong to the guest VM whose PID is provided as an argument to the --kvm-guest option.

7.9.3.5.4. AMD uProf CLI with Profiling Options

AMD uProf CLI contains the following options to support the guest OS profiling from the host OS:

$ ./AMDuProfCLI collect [--kvm-guest <pid>] [--guest-kallsyms <path>] [--guest-modules <path>]
[--guest-search-path <path>] ....

The following table lists the collect command options applicable to guest OS profiling.

Table 7.27 AMD uProf CLI Collect command options - Profiling Options#

Option

Arguments

Description

Description

--kvm-guest

PID of qemu-kvm process to be profiled.

Collect guest-side performance profile. This option collects KVM guest symbols information.

--guest-search-path

Path of guest vmlinux and kernel sources copied on local host.

GuestOS vmlinux and search directory. AMD uProf reads it to resolve the guest kernel module information. You can copy it from the guest OS.

--guest-modules

Path of guest /proc/modules copied on local host.

Copy of the guest OS /proc/modules file. AMD uProf reads it to get the guest kernel module information. You can copy it from the guest OS.

--guest-kallsyms

Path of guest /proc/kallsyms copied on local host.

Copy of the guest OS /proc/kallsyms file. AMD uProf reads it to get guest kernel symbols. You can copy it from the guest OS.

Examples

Get the kvm guest OS PID.

$ ps aux | grep kvm

Collecting pmcx76 event data for 10 secs (for guest kallsyms and guest kernel modules).

$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -o /tmp/cpuprof-76-guest-only -d 10 --kvm-guest 2444 --guest-kallsyms /home/amd/guest/guest-kallsyms --guest-modules /home/amd/guest/guest-module

Generate report from the collected data.

Collecting pmcx76 event data for 10 secs (for guest kallsyms).

$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -o /tmp/cpuprof-76-guest-only -d 10 --kvm-guest 2444 --guest-kallsyms /home/amd/guest/guest-kallsyms

Generate report from the collected data.

$ ./AMDuProfCLI report -i /tmp/cpuprof-76-guest-only/AMDuProf-SWP-EBP_Nov-08-2021_15-00-33

Collecting system-wide samples for pmcx76 event data for 10 secs (for guest kallsyms and guest kernel modules).

$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -o /tmp/cpuprof-76-guest-only -d 10 --kvm-guest 2444 --guest-kallsyms /home/amd/guest/guest-kallsyms --guest-modules /home/amd/guest/guest-module -a

Generate report from the collected data.

$ ./AMDuProfCLI report -i /tmp/cpuprof-76-guest-only/AMDuProf-SWP-EBP_Nov-08-2021_15-00-33

Collecting system-wide samples for pmcx76 event data for 10 secs (for guest kallsyms).

$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -o /tmp/cpuprof-76-guest-only -d 10 --kvm-guest 2444 --guest-kallsyms /home/amd/guest/guest-kallsyms -a

Generate report from the collected data.

$ ./AMDuProfCLI report -i /tmp/cpuprof-76-guest-only/AMDuProf-SWP-EBP_Nov-08-2021_15-00-33

7.10. Parallelism - OpenMP Analysis

The OpenMP API uses the fork-join model of parallel execution. The program starts with a single master thread to run the serial code. When a parallel region is encountered, multiple threads perform the implicit or explicit tasks defined by the OpenMP directives. At the end of that parallel region, the threads join at the barrier and only the master thread continues to execute.

When the threads execute the parallel region code, they should use all the available CPU cores so that CPU utilization is maximized. However, threads can end up waiting without doing useful work for several reasons:

The OpenMP analysis traces the activities and states of OpenMP threads and provides a thread state timeline for parallel regions to help analyze performance issues. To quantify and decompose scalability losses in OpenMP (and MPI) applications, see Parallel Strong Scaling Metrics (MPI + OpenMP).

Parallel Region Aggregation

A parallel region can be executed multiple times during runtime. Reporting all the instances separately might result in a lengthy report making it difficult to analyze the data. AMD uProf aggregates data of multiple instances of the same parallel region and shows it as a single entry for better analysis.
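The idea of aggregation can be pictured with a short Python sketch that groups instances by region and keeps one summary entry per region. This is only an illustration of the concept: the region identifiers, the field names, and the choice of count, total, and mean are assumptions, not the aggregation that AMD uProf performs internally.

```python
from collections import defaultdict


def aggregate_regions(instances):
    """Collapse parallel-region instances into one entry per region.

    instances: iterable of (region_id, elapsed_seconds) pairs, where
    region_id is a hypothetical key such as "file.c:42".
    Returns {region_id: {"count": n, "total": t, "mean": t / n}}.
    """
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for region_id, elapsed in instances:
        entry = totals[region_id]
        entry["count"] += 1
        entry["total"] += elapsed
    return {
        rid: {"count": e["count"], "total": e["total"], "mean": e["total"] / e["count"]}
        for rid, e in totals.items()
    }


# Hypothetical trace: the region at solver.c:42 executed twice.
trace = [("solver.c:42", 1.0), ("io.c:10", 2.0), ("solver.c:42", 3.0)]
summary = aggregate_regions(trace)
print(summary["solver.c:42"])
```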

Support Matrix

The following table shows the support matrix:

Table 7.28 Support Matrix#

Component

Supported Versions

Languages

OpenMP Spec

OpenMP v5.0

C and C++

Compiler

LLVM 8 and later

C and C++

Compiler

AOCC 2.1 and later

C, C++, and Fortran

Compiler

ICC 2025.0.4

C, C++, and Fortran

Compiler

GCC 7 and later

C, C++, and Fortran

OS

Ubuntu 18.04 LTS and later

C, C++, and Fortran

OS

RHEL 8.6 and 9

C, C++, and Fortran

OS

CentOS 8.4

C, C++, and Fortran

Prerequisite

Compile the OpenMP application using a supported compiler (on a supported platform) with the required compiler options to enable OpenMP.

7.10.1. Data Collection Using GUI

Complete the following steps to start profiling:

  1. Click Profile an Application on the Welcome page.

  2. Provide application path, application options, working directory, and environment variables, if any. Click Next.

  3. Select at least one supported predefined Configuration such as TBP/EBP/IBS along with any desired configuration and click Advanced Options.

  4. In the OpenMP Tracing Options pane, turn on the Enable OpenMP Tracing option.

    Enable OpenMP Tracing.

    Figure 7.50 Enable OpenMP Tracing#

  5. Under Select OpenMP Trace Implementation, select the implementation type. Choose:

    1. ompt (default option) for tracing of OpenMP libraries supporting OMPT interface (example: LLVM, AOCC, ICC).

    2. omplib for tracing GCC OpenMP library.

  6. If you selected ompt, select the tracing mode under Select OpenMP Tracing Mode. Choose:

    1. full for tracing all the OpenMP events.

    2. basic for basic tracing, where synchronization related OpenMP events are not traced to reduce the disk space usage.

  7. Click Start Profile to start the profiling.

7.10.2. Data Collection Using CLI

Command to collect basic trace info of an OpenMP application supporting OMPT interface:

$ AMDuProfCLI collect --trace openmp --openmp-impl ompt --openmp-scope basic -o /tmp/myapp_perf <openmp-app>

Command to profile an OpenMP application compiled with GCC OpenMP library:

$ AMDuProfCLI collect --trace openmp --openmp-impl omplib -o /tmp/myapp_perf <openmp-app>

Use the --openmp-impl option to provide OpenMP implementation type: ompt for tracing of OpenMP libraries supporting OMPT interface (example: LLVM, AOCC, ICC), omplib for tracing GCC OpenMP library. If --openmp-impl is not specified, the default selection is ompt.

Use --openmp-scope option to provide tracing scope: full for tracing all the OpenMP events, basic for basic tracing, where synchronization related OpenMP events are not traced to reduce the disk space usage. If --openmp-scope is not specified, the default selection is basic.

Note

The --openmp-scope option is only applicable with --openmp-impl ompt.

While performing regular profiling, add the option --trace openmp --openmp-impl <ompt | omplib> to enable OpenMP profiling. The command launches the program and collects the profile data required to generate the OpenMP analysis report.

Once profile data collection is complete, a session directory is generated. Use the session directory to generate the CSV report or to import the session into the GUI.

For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.

7.10.3. Analyze the Data

If the data is collected using the CLI, use Import Session to import the session into the GUI for analysis. OpenMP trace data can be collected on Linux, and the session can then be imported into the GUI or CLI on Windows.

Analyzing the GUI Views

After the session is opened, navigate to the HPC page to analyze the OpenMP tracing data. You can use the left side vertical pane on this page to navigate through the following views:

Overview shows the quick details about the runtime. The following image shows the Overview page.

HPC - Overview.

Figure 7.51 HPC - Overview#

OpenMP Parallel Regions.

Figure 7.52 OpenMP Parallel Regions#

7.10.4. Generate Profile Report

You can generate a CSV report using the AMDuProfCLI report command. Any additional option is not required for the OpenMP report generation. AMD uProf checks for the availability of any OpenMP profiling data and includes it in the report, if available.

The following command will generate a CSV report in /tmp/myapp_perf/<SESSION-DIR>/ report.csv:

$ ./AMDuProfCLI report -i /tmp/myapp_perf/<SESSION-DIR>

Note

If tracing is performed on a cluster, provide the --host all option to correctly report OpenMP data for all the hosts.

An example of the OpenMP report section in the CSV file is given here:

Sample OpenMP Report.

Figure 7.53 Sample OpenMP Report#

Analyzing the OpenMP Report

The OpenMP report includes the following sections:

7.10.5. Environment Variables

AMDUPROF_MAX_PR_INSTANCES – Set the max number of parallel regions to be traced. The default value is 2000000.

Note

Tracing a smaller number of parallel regions may result in less accurate timing details.

7.10.6. Limitations

The following features are not supported in this release:

7.11. Parallelism - MPI Trace Analysis

MPI trace analysis can be used to analyze and compute the message passing load imbalance among the ranks of an MPI application running on a cluster. It supports Open MPI, MPICH, and their derivatives.

The supported thread models are MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, and MPI_THREAD_SERIALIZED. The profile reports are generated for Point-to-Point and Collective API activity summary.

Fortran bindings are configured and built while compiling the MPI implementations. You can enable or disable the Fortran bindings based on your need for Fortran language support.

Refer to the following options to disable or enable the Fortran bindings:

MPI Trace Support Matrix

Table 7.29 Support Matrix#

Component

Supported Versions

MPI Spec

MPI v3.1 or later

MPI Libraries

  • Open MPI v4.14

  • Open MPI v5.0

  • MPICH v4.0.3

  • MPICH v4.2

  • ParaStation MPI v5.6.0

  • Intel® MPI 2021.1

Operating System

  • Ubuntu: 18.04 LTS, 20.04 LTS, 22.04.04 LTS, and 24.04

  • RHEL: 8.6 and 9

  • CentOS 8.4

Languages

C, C++, FORTRAN

MPI Implementation Support

AMD uProf supports tracing of Open MPI, MPICH, and their derivatives:

Ensure that the correct option (mpich or openmpi) is passed depending on the MPI implementation used for compiling the MPI application. Passing an incorrect option might cause undefined behavior.

Tracing Modes

The AMD uProf CLI supports the following two modes for MPI tracing:

For more information about MPI tracing options, refer to Linux Specific Options. For detailed analysis of parallel scalability bottlenecks, see Parallel Strong Scaling Metrics (MPI + OpenMP).

7.11.1. MPI Light-weight Tracing Using CLI

In LWT mode, a quick report gets generated during collection stage. This mode supports a limited set of APIs for tracing as listed in the following table. The LWT report gives an overview of the application runtime activity.

Table 7.30 List of Supported MPI APIs for Light-weight Tracing#

Sl.No

API

Sl.No

API

Sl.No

API

1

MPI_Bsend

21

MPI_Ssend

41

MPI_Ibcast

2

MPI_Recv_init

22

MPI_Iallreduce

42

MPI_Waitall

3

MPI_Bcast

23

MPI_Reduce_scatter

43

MPI_Mrecv

4

MPI_Ireduce_scatter

24

MPI_Irecv

44

MPI_Alltoallv

5

MPI_Bsend_init

25

MPI_Ssend_init

45

MPI_Igather

6

MPI_Rsend

26

MPI_Ialltoall

46

MPI_Waitany

7

MPI_Gather

27

MPI_Scan

47

MPI_Probe

8

MPI_Iscan

28

MPI_Irsend

48

MPI_Alltoallw

9

MPI_Ibsend

29

MPI_Allgather

49

MPI_Igatherv

10

MPI_Rsend_init

30

MPI_Ialltoallv

50

MPI_Waitsome

11

MPI_Gatherv

31

MPI_Scatter

51

MPI_Recv

12

MPI_Iscatter

32

MPI_Isend

52

MPI_Barrier

13

MPI_Improbe

33

MPI_Allgatherv

53

MPI_Ireduce

14

MPI_Send

34

MPI_Ialltoallw

15

MPI_Iallgather

35

MPI_Scatterv

16

MPI_Iscatterv

36

MPI_Issend

17

MPI_Imrecv

37

MPI_Ibarrier

18

MPI_Send_init

38

MPI_Wait

19

MPI_Reduce

39

MPI_Mprobe

20

MPI_Iprobe

40

MPI_Alltoall

Collect Profile Data

Example command to trace an MPI application in LWT mode using AMDuProfCLI:

$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl mpich --mpi-scope lwt -o <output_directory> <application>

After completing the tracing, the path to the session directory is displayed on the terminal. LWT report is generated immediately after completing the collection and saved as a .csv file in the session directory: <output_directory>/<SESSION_DIR>/mpi/lwt/mpi-summary.csv.

Pass the MPI implementation (MPICH or Open MPI) in the command; MPICH is the default. The following are sample commands:

$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl openmpi --mpi-scope lwt -o <output_directory> <application>
$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl mpich --mpi-scope lwt -o <output_directory> <application>

Ensure that the correct option (mpich or openmpi) is passed depending on the MPI implementation used for compiling the MPI application. Passing an incorrect option might cause undefined behavior.

An example of the LWT report section in the .csv file is as follows:

Sample LWT Report.

Figure 7.54 LWT Report Example#

7.11.2. MPI Full Tracing Using CLI

Full tracing mode traces more APIs than LWT mode. For a complete list of APIs, refer to List of Supported MPI APIs for Full Tracing. This mode is helpful for in-depth analysis of MPI application activity.

The report file for the full tracing includes multiple tables to represent the following details.

The list of supported MPI APIs is as follows:

Table 7.31 List of Supported MPI APIs for Full Tracing#

Sl.No

API

Sl.No

API

Sl.No

API

1

MPI_Pcontrol

30

MPI_Ssend_init

59

MPI_Iscatterv

2

MPI_Mrecv

31

MPI_Neighbor_alltoallv

60

MPI_Intercomm_create

3

MPI_Reduce

32

MPI_Ibarrier

61

MPI_Waitsome

4

MPI_Iallreduce

33

MPI_Test

62

MPI_Scatterv

5

MPI_Cancel

34

MPI_Rsend_init

63

MPI_Igather

6

MPI_Imrecv

35

MPI_Bcast

64

MPI_Intercomm_merge

7

MPI_Allreduce

36

MPI_Ibcast

65

MPI_Barrier

8

MPI_Ialltoall

37

MPI_Testall

66

MPI_Gather

9

MPI_Probe

38

MPI_Send_init

67

MPI_Igatherv

10

MPI_Send

39

MPI_Scan

68

MPI_Cart_create

11

MPI_Alltoall

40

MPI_Comm_create

69

MPI_Recv

12

MPI_Ialltoallv

41

MPI_Testany

70

MPI_Gatherv

13

MPI_Iprobe

42

MPI_Ibsend

71

MPI_Iallgather

14

MPI_Bsend

43

MPI_Reduce_scatter

72

MPI_Cart_sub

15

MPI_Alltoallv

44

MPI_Comm_dup

73

MPI_Irecv

16

MPI_Ialltoallw

45

MPI_Testsome

74

MPI_Allgather

17

MPI_Mprobe

46

MPI_Issend

75

MPI_Iallgatherv

18

MPI_Ssend

47

MPI_Ireduce_scatter

76

MPI_Graph_create

19

MPI_Alltoallw

48

MPI_Comm_dup_with_info

77

MPI_Sendrecv

20

MPI_Ineighbor_alltoall

49

MPI_Wait

78

MPI_Allgatherv

21

MPI_Improbe

50

MPI_Irsend

79

MPI_Ineighbor_allgather

22

MPI_Rsend

51

MPI_Iscan

80

MPI_Dist_graph_create

23

MPI_Neighbor_alltoall

52

MPI_Comm_split

81

MPI_Sendrecv_replace

24

MPI_Ineighbor_alltoallw

53

MPI_Waitall

82

MPI_Neighbor_allgather

25

MPI_Start

54

MPI_Isend

83

MPI_Ineighbor_allgatherv

26

MPI_Bsend_init

55

MPI_Iscatter

84

MPI_Dist_graph_create_adjacent

27

MPI_Neighbor_alltoallw

56

MPI_Comm_split_type

85

MPI_Recv_init

28

MPI_Ineighbor_alltoallv

57

MPI_Waitany

86

MPI_Neighbor_allgatherv

29

MPI_Startall

58

MPI_Scatter

87

MPI_Ireduce

Collect Profile Data

Example command to trace an MPI application in full tracing mode using AMD uProf CLI:

$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl mpich --mpi-scope full -o <output_directory> <application>

After completing the tracing, the path to the session directory is displayed on the terminal.

Pass the MPI implementation (MPICH or Open MPI) in the command; MPICH is the default. The following are sample commands:

$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl openmpi --mpi-scope full -o <output_directory> <application>
$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl mpich --mpi-scope full -o <output_directory> <application>

Ensure that the correct option (mpich or openmpi) is passed depending on the MPI implementation used for compiling the MPI application. Passing an incorrect option might cause undefined behavior.

Generate Profile Report

Example command to generate the report in .csv format. Pass the session directory path with the -i option:

$ ./AMDuProfCLI report -i <output_directory>/<SESSION_DIR>

After completing the report generation, the report.csv file path is displayed on the terminal.

Tables in the Report file

The following screenshots show example sections of a full tracing report file:

MPI Communicator Summary Table.

Figure 7.55 MPI Communicator Summary Table#

MPI Rank Summary Table.

Figure 7.56 MPI Rank Summary Table#

MPI Function Summary Table.

Figure 7.57 MPI Function Summary Table#

MPI Communication Matrix.

Figure 7.58 MPI Communication Matrix#

MPI Collective API Summary Table.

Figure 7.59 MPI Collective API Summary Table#

7.11.3. MPI Full Tracing Report Visualization Using GUI

Collecting using CLI and Importing to GUI

Use CLI to trace a target MPI application and generate the report using CLI. For the steps, see MPI Full Tracing Using CLI. Import the report to GUI as shown in the following figure to analyze the trace data:

Import Profile Session.

Figure 7.60 Import Profile Session#

Analyzing MPI Communication Matrix

After the import is complete, use the MPI Communication Matrix view to analyze the MPI trace data in the GUI. Navigate to HPC > MPI Communication Matrix to view the MPI communication matrix visualizer. This view displays the rank-to-rank communication summary in matrix format. The x-axis and y-axis of the matrix represent the receiver and sender ranks, respectively.

The following figure shows the MPI communication matrix:

MPI Communication Matrix.

Figure 7.61 MPI Communication Matrix#

By default, the communication matrix appears in a zoomed-out view, displaying interactions between sender and receiver ranks. You can use the mouse wheel to zoom in and out and the scroll bar to navigate horizontally and vertically. When zoomed in, the matrix also reveals the volume of data transferred between ranks in bytes.

Legend

  1. Ranks ordered row-wise and column-wise.

  2. Each cell displays the total data volume transferred from one rank to another rank.

  3. Tool-tip shows additional details when the mouse is hovered over a cell.

  4. Color-coding legend based on data volume.

  5. Sum of all the data transfers for the rank.

  6. Mean of all the data transfers for the rank.
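To make the matrix layout concrete, the following Python sketch builds a sender-by-receiver byte matrix from a list of transfers and derives a per-rank sum and mean. It is an illustration under stated assumptions: the (sender, receiver, bytes) record format is hypothetical, and the mean is taken over all receiver ranks, which may differ from how the GUI computes it.

```python
def communication_matrix(num_ranks, transfers):
    """Build matrix[sender][receiver] = total bytes sent.

    transfers: iterable of (sender_rank, receiver_rank, nbytes) tuples
    (a hypothetical record format used only for this sketch).
    """
    matrix = [[0] * num_ranks for _ in range(num_ranks)]
    for sender, receiver, nbytes in transfers:
        matrix[sender][receiver] += nbytes
    return matrix


def row_summary(matrix):
    """Per sender rank: (sum, mean) of bytes sent.

    Assumption: the mean is taken over all receiver ranks, including
    cells with no traffic.
    """
    return [(sum(row), sum(row) / len(row)) for row in matrix]


# Hypothetical transfers among three ranks.
transfers = [(0, 1, 100), (0, 2, 200), (1, 0, 50), (0, 1, 100)]
matrix = communication_matrix(3, transfers)
for rank, (total, mean) in enumerate(row_summary(matrix)):
    print(f"rank {rank}: sent {total} bytes, mean {mean:.1f} bytes per peer")
```

A row whose sum is far larger than the others points to a rank that sends disproportionately more data, which is one way load imbalance can show up in the matrix.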

Analyzing MPI Rank Timeline

Navigate to HPC > MPI Rank Timeline to view the MPI rank timeline. This view shows the MPI activities in the timeline graph as follows:

MPI Rank Timeline.

Figure 7.62 MPI Rank Timeline#

Legend

  1. Rank ID

  2. Graph of one of the following, depending on the selected data source:

    • MPI API activity (running or waiting)

    • MPI data transfer activity (receiving or sending)

    • MPI APIs called

  3. Tool-tip shows more information about the MPI activity.

  4. Displays the time range.

  5. To select the data source, such as MPI Activity. For more information, see MPI Data Source in the section MPI Full Tracing Report Visualization Using GUI.

  6. To load more rank details.

  7. To filter the ranks from the view.

  8. Trace Overlay Cutoff can be used to specify duration in nanoseconds, which acts as a cutoff to load the trace data, that is, any traced data source which takes less than the specified nanoseconds will not be displayed.

  9. Color coding legends for data source and trace overlay.
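The effect of the Trace Overlay Cutoff in the legend above can be thought of as a duration filter applied before the trace data is loaded. The following Python sketch shows that idea only; the event dictionary layout and field names are hypothetical and do not describe the AMD uProf trace format.

```python
def apply_overlay_cutoff(events, cutoff_ns):
    """Keep only events that last at least cutoff_ns nanoseconds.

    events: list of dicts with hypothetical keys "api", "start_ns", "end_ns".
    Events shorter than the cutoff are dropped, mirroring the description
    that they are not displayed.
    """
    return [e for e in events if e["end_ns"] - e["start_ns"] >= cutoff_ns]


# Hypothetical trace events for one rank.
events = [
    {"api": "MPI_Send", "start_ns": 0, "end_ns": 500},
    {"api": "MPI_Waitall", "start_ns": 500, "end_ns": 20500},
    {"api": "MPI_Iprobe", "start_ns": 20500, "end_ns": 20600},
]
visible = apply_overlay_cutoff(events, cutoff_ns=1000)
print([e["api"] for e in visible])
```

A larger cutoff reduces the amount of data drawn on the timeline at the cost of hiding short calls.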

Analyzing MPI P2P API Summary

Navigate to HPC > MPI P2P API Summary. This view summarizes the P2P APIs called by the application as follows:

MPI P2P API Summary.

Figure 7.63 MPI P2P API Summary#

Analyzing MPI Collective API Summary

Navigate to HPC > MPI Collective API Summary. This view summarizes the collective APIs called by the application as follows:

MPI Collective API Summary.

Figure 7.64 MPI Collective API Summary#

MPI Data Source

The supported MPI data sources are as follows:

MPI Activity: Classifies MPI APIs into either waiting APIs (MPI_Barrier, MPI_Wait, MPI_Waitall, MPI_Waitany, or MPI_Waitsome) or active APIs (all the other MPI functions). MPI APIs can be classified as shown in the following three tables:

Table 7.32 List of P2P and Collective Communication APIs#

P2P Send

P2P Receive

Collective Communication

  • MPI_BSEND

  • MPI_BSEND_INIT

  • MPI_IBSEND

  • MPI_IRSEND

  • MPI_ISEND

  • MPI_ISSEND

  • MPI_RSEND

  • MPI_RSEND_INIT

  • MPI_SEND

  • MPI_SEND_INIT

  • MPI_SENDRECV

  • MPI_SENDRECV_REPLACE

  • MPI_SSEND

  • MPI_SSEND_INIT

  • MPI_IMRECV

  • MPI_IRECV

  • MPI_MRECV

  • MPI_RECV

  • MPI_RECV_INIT

  • MPI_ALLGATHER

  • MPI_ALLGATHERV

  • MPI_ALLREDUCE

  • MPI_ALLTOALL

  • MPI_ALLTOALLV

  • MPI_ALLTOALLW

  • MPI_BARRIER

  • MPI_BCAST

  • MPI_GATHER

  • MPI_GATHERV

  • MPI_IALLGATHER

  • MPI_IALLGATHERV

  • MPI_IALLREDUCE

  • MPI_IALLTOALL

  • MPI_IALLTOALLV

  • MPI_IALLTOALLW

  • MPI_IBARRIER

  • MPI_IBCAST

  • MPI_IGATHER

  • MPI_IGATHERV

  • MPI_IREDUCE

  • MPI_IREDUCE_SCATTER

  • MPI_ISCAN

  • MPI_ISCATTER

  • MPI_ISCATTERV

  • MPI_REDUCE

  • MPI_REDUCE_SCATTER

  • MPI_SCAN

  • MPI_SCATTER

  • MPI_NEIGHBOR_ALLGATHER

  • MPI_NEIGHBOR_ALLGATHERV

  • MPI_NEIGHBOR_ALLTOALL

  • MPI_NEIGHBOR_ALLTOALLV

  • MPI_NEIGHBOR_ALLTOALLW

  • MPI_INEIGHBOR_ALLGATHER

  • MPI_INEIGHBOR_ALLTOALL

  • MPI_INEIGHBOR_ALLGATHERV

  • MPI_INEIGHBOR_ALLTOALLV

  • MPI_INEIGHBOR_ALLTOALLW

Table 7.33 List of Control, Request and Communication APIs#

Control API

Request API

Communication API

  • MPI_PCONTROL

  • MPI_CANCEL

  • MPI_START

  • MPI_STARTALL

  • MPI_TEST

  • MPI_TESTALL

  • MPI_TESTANY

  • MPI_TESTSOME

  • MPI_WAIT

  • MPI_WAITALL

  • MPI_WAITANY

  • MPI_WAITSOME

  • MPI_IMPROBE

  • MPI_IPROBE

  • MPI_MPROBE

  • MPI_PROBE

  • MPI_COMM_CREATE

  • MPI_COMM_DUP

  • MPI_COMM_DUP_WITH_INFO

  • MPI_COMM_SPLIT

  • MPI_COMM_SPLIT_TYPE

  • MPI_COMM_SET_NAME

  • MPI_INTERCOMM_CREATE

  • MPI_INTERCOMM_MERGE

Table 7.34 List of Topology and Environment APIs#

Topology API

Environment API

  • MPI_CART_CREATE

  • MPI_CART_SUB

  • MPI_GRAPH_CREATE

  • MPI_DIST_GRAPH_CREATE

  • MPI_DIST_GRAPH_CREATE_ADJACENT

  • MPI_ABORT

  • MPI_FINALIZE

  • MPI_INIT

  • MPI_INIT_THREAD
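The waiting versus active split used by the MPI Activity data source can be expressed as a simple lookup. The following Python sketch uses the five waiting APIs named in the MPI Data Source description; the function name and the sample call list are hypothetical.

```python
# Waiting APIs as listed in the MPI Data Source description.
WAITING_APIS = {
    "MPI_BARRIER",
    "MPI_WAIT",
    "MPI_WAITALL",
    "MPI_WAITANY",
    "MPI_WAITSOME",
}


def classify_activity(api_name):
    """Return "waiting" for the listed waiting APIs, otherwise "active"."""
    # Compare case-insensitively so MPI_Waitall and MPI_WAITALL both match.
    return "waiting" if api_name.upper() in WAITING_APIS else "active"


# Hypothetical sequence of traced calls.
calls = ["MPI_Isend", "MPI_Irecv", "MPI_Waitall", "MPI_Allreduce", "MPI_Barrier"]
for name in calls:
    print(f"{name}: {classify_activity(name)}")
```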

Limitations

7.11.4. MPI Runtime Library Mismatch (Undefined Symbol Errors)

If you see undefined symbol errors while launching an application with AMD uProf MPI agents, it usually means the profiled application was compiled against one MPI implementation or feature set, but at runtime, the loader resolves a different or incompatible MPI library. These issues are caused by linker/loader mismatches, not by AMD uProf.

Typical console examples (user reports):

libAMDOpenMpiAgent.so: undefined symbol: mpi_fortran_statuses_ignore_
libAMDMpichAgent.so: undefined symbol: MPI_UNWEIGHTED

Common Symbols Linked to Mismatches

Open MPI (C/C++ layer)

Open MPI Fortran / profiling / tools

Sentinel / special constants

These errors occur when an application is built using one MPI implementation or version (e.g., Open MPI) but, at runtime, a different or conflicting MPI library is loaded (e.g., MPICH, an older Open MPI version, or mixed library paths). Such symbol resolution failures originate from the MPI runtime loader and are not caused by AMD uProf.

Verification Steps

Run the following before profiling/tracing to ensure consistency:

which mpirun
mpirun --version
which mpicc
mpicc -show
ldd ./<application> | grep -i mpi
env | grep -E 'MPI'  # or: env | grep -E 'LD_LIBRARY_PATH|MODULE'

If Open MPI is used, optionally run:

ompi_info | grep -i 'Open MPI'

For MPICH:

mpichversion

Ensure only one MPI implementation module is loaded (if using environment modules).
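When the ldd output is long, a small script can help spot MPI libraries that resolve from more than one installation. The following Python sketch is a heuristic only: it treats any library whose name contains "mpi" as MPI-related and groups such libraries by the directory they were resolved from. The function name, the sample output, and the paths are hypothetical.

```python
import posixpath
import re


def mpi_library_dirs(ldd_output):
    """Map each directory to the MPI-related libraries that ldd resolved there.

    Heuristic: a library counts as MPI-related when its name contains "mpi".
    Lines without a resolved absolute path (for example "not found") are skipped.
    """
    dirs = {}
    for line in ldd_output.splitlines():
        # Typical line: "libmpi.so.40 => /opt/openmpi/lib/libmpi.so.40 (0x...)"
        match = re.match(r"\s*(\S+)\s+=>\s+(/\S+)", line)
        if not match:
            continue
        name, path = match.groups()
        if "mpi" in name.lower():
            dirs.setdefault(posixpath.dirname(path), []).append(name)
    return dirs


# Hypothetical ldd output showing libraries from two different MPI installs.
sample = """\
    linux-vdso.so.1 (0x00007ffd)
    libmpi.so.40 => /opt/openmpi/lib/libmpi.so.40 (0x00007f01)
    libmpi_mpifh.so.40 => /usr/lib/mpich/lib/libmpi_mpifh.so.40 (0x00007f02)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f03)
"""
found = mpi_library_dirs(sample)
if len(found) > 1:
    print("MPI libraries resolved from multiple directories:", sorted(found))
```

More than one directory in the result is only a hint that build and run environments may differ; some installations legitimately split MPI libraries across directories.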

Recommended Remedy

To avoid MPI-related symbol resolution issues, align the build and run environments:

  1. Unload conflicting modules, for example by running module purge followed by module load openmpi/<version>.

  2. Rebuild the application with the same MPI implementation you intend to trace.

  3. Prefer rpath or correctly ordered LD_LIBRARY_PATH over ad‑hoc injection.

  4. Use absolute path to the intended mpirun when multiple versions exist.

Temporary Workaround (Not Recommended Long-Term)

As a short-term workaround you can force the correct libmpi to load first:

export LD_PRELOAD=/path/to/libmpi.so
mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl mpich --mpi-scope full -o <output_directory> <application>

This resolves the undefined symbol by preloading the intended library. However, LD_PRELOAD injects a shared object ahead of normal resolution and can introduce subtle conflicts or harder debugging if symbols overlap. Replace this workaround by fixing library path consistency (proper module setup, LD_LIBRARY_PATH, or rebuilding with correct MPI).

Recommended Practices

Summary

The undefined symbol errors represent runtime linkage issues. AMD uProf tracing requires a coherent MPI environment; once library consistency is ensured, tracing proceeds normally without LD_PRELOAD hacks.

7.12. Parallel Strong Scaling Metrics (MPI + OpenMP)

7.12.1. Overview

Parallel Strong Scaling Metrics quantify scalability loss for hybrid MPI + OpenMP (and pure MPI or pure OpenMP) applications. Each metric is an inefficiency ratio in [0, 1]: 0 means no time lost for that cause; values near 1 indicate severe degradation. The metrics measure deviation from an ideal execution that exhibits perfect strong scaling, in other words:

The metrics are based on separating the notions of work and communication. Communication can be thought of as the cost of parallelizing work. Work is the body of computation whose size remains fixed during a strong scaling study. It is defined as the sum of serial work (time outside OpenMP parallel regions and not in MPI or kernel synchronization APIs) plus OpenMP parallel region work. For OpenMP parallel regions, work excludes synchronization time, implicit barrier wait time, and other OpenMP runtime overheads.

The critical MPI rank is the rank with the largest amount of work. All this work must be completed by the critical rank before the application can terminate. The model assumes that the time this rank spends not computing (in MPI or related synchronization) indicates the extent to which communication inflates the total runtime.
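As a rough illustration of these ideas, the following Python sketch picks the critical rank from per-rank work times, estimates the fraction of wall time that rank spends not computing, and applies the load balance expression described in the metric definitions (critical rank work minus average work, normalized by runtime). The function names and input numbers are hypothetical, and the sketch is a simplified reading of the model rather than the exact computation performed by AMD uProf.

```python
def critical_rank(work_per_rank):
    """Return the index of the rank with the largest amount of work."""
    return max(range(len(work_per_rank)), key=lambda r: work_per_rank[r])


def non_work_fraction(work_per_rank, wall_time):
    """Fraction of wall time the critical rank spends not computing."""
    rank = critical_rank(work_per_rank)
    return (wall_time - work_per_rank[rank]) / wall_time


def load_balance_inefficiency(work_per_rank, wall_time):
    """(critical rank work - average work) / runtime."""
    average = sum(work_per_rank) / len(work_per_rank)
    return (max(work_per_rank) - average) / wall_time


# Hypothetical work times in seconds for three ranks and a 10 s run.
work = [6.0, 8.0, 7.0]
wall = 10.0
print("critical rank:", critical_rank(work))
print("non-work fraction:", non_work_fraction(work, wall))
print("load balance inefficiency:", load_balance_inefficiency(work, wall))
```

In this hypothetical run, rank 1 is critical, it spends about a fifth of the wall time outside work, and about a tenth of the runtime is attributable to uneven work across ranks.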

7.12.2. Hierarchy

Level 1 (Overall)

Level 2 (Components of overall parallel inefficiency)

Level 3 (Components of OpenMP Potential Gain)

7.12.3. Metric Definitions

All values are reported as fractions in the range [0, 1]. A low value indicates little or no inefficiency; a high value highlights a dominant scalability limiter.

Table 7.35 Level 1 and Level 2 Metric Definitions#

Metric

Scope

What It Measures

Low Value Means

Typical High Causes

Parallel Inefficiency

Application

Total fraction of wall time lost versus a synthetic perfectly scaled execution (perfect balance, no communication delay, no serial bottlenecks, no OpenMP losses).

Execution close to ideal strong scaling where work is perfectly distributed.

Combined rank/thread imbalance, communication delay, serial code, OpenMP overhead.

MPI Load Balance Inefficiency

MPI

Extra wall time due to uneven work distribution across ranks. Computed as the difference between the critical rank’s work and the average work, normalized by runtime.

All ranks finish work simultaneously (balanced load).

Skewed domain decomposition, data skew, uneven rank workloads, rank-specific I/O.

MPI Communication Inefficiency

MPI

Fraction of wall time that the critical rank spends in MPI.  This reflects the extent to which MPI communication/synchronization is slowing down the overall execution of the program. Note that load imbalance can also manifest as communication inefficiency at a global level, as load imbalance will typically cause the critical rank to spend more time in MPI APIs.

Critical rank nearly always computing (communication well overlapped or minimal).

Many small latency-bound messages, blocking collectives, serialization on root ranks, poor overlap.

Serial Region Inefficiency

MPI

Amdahl’s law for each rank, averaged across ranks. This represents the scalability loss from serial code that is not parallelized with OpenMP.

Negligible serial-only work per rank.

Initialization hot spots, single-threaded loops, legacy non-thread-safe code, unparallelized I/O.

OpenMP Potential Gain

OpenMP

Fraction of wall time wasted in parallel regions due to imperfect work distribution, synchronization, and runtime overhead. Computed as the difference between actual parallel region wall time and theoretical minimum time if work were perfectly balanced across threads.

Parallel regions near ideal efficiency (minimal waste).

Synchronization overhead, implicit barrier waits, unbalanced work, runtime/task management costs.
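As a minimal sketch of the two MPI definitions in the table above, the following Python snippet computes MPI Load Balance Inefficiency and MPI Communication Inefficiency from per-rank timings. The `RankTimes` type, the function name, and the sample numbers are hypothetical and used only for illustration; this is not how AMD uProf itself is implemented.

```python
from dataclasses import dataclass


@dataclass
class RankTimes:
    work: float  # seconds of work (serial work + OpenMP parallel-region work)
    mpi: float   # seconds spent inside MPI calls


def mpi_level2_metrics(ranks, runtime):
    """Return (load_balance_inefficiency, communication_inefficiency)."""
    if not ranks or runtime <= 0:
        raise ValueError("need at least one rank and a positive runtime")
    # The critical rank is the rank with the largest amount of work.
    critical = max(ranks, key=lambda r: r.work)
    average_work = sum(r.work for r in ranks) / len(ranks)
    # Load balance: critical-rank work minus average work, normalized by runtime.
    load_balance = (critical.work - average_work) / runtime
    # Communication: fraction of wall time the critical rank spends in MPI.
    communication = critical.mpi / runtime
    return load_balance, communication


# Hypothetical 4-rank run with a 10-second wall time.
ranks = [RankTimes(8.0, 2.0), RankTimes(6.0, 4.0),
         RankTimes(4.0, 6.0), RankTimes(6.0, 4.0)]
load_balance, communication = mpi_level2_metrics(ranks, runtime=10.0)
print(f"load balance: {load_balance:.2f}, communication: {communication:.2f}")
```

In this made-up run, rank 0 is the critical rank; the ranks with less work spend more time in MPI, but only the critical rank's MPI time enters the communication metric.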

Table 7.36 Level 3 OpenMP Potential Gain Components#

Metric

Scope

Specific Source of Loss

Low Value Means

Typical High Causes

Potential Gain: Sync

OpenMP

Average per-thread time in explicit synchronization (such as critical sections, explicit barriers, etc.).

Minimal blocking and contention, and little time lost in implicit barriers at the end of worksharing constructs (such imbalance time indicates uneven work distribution among threads).

Contended locks, coarse critical sections, frequent atomic updates, excessive barriers.

Potential Gain: Other

OpenMP

Average per-thread time in other OpenMP API calls not classified as Sync (scheduling operations, task management, runtime bookkeeping).

Lightweight runtime overhead and even distribution of work across threads.

Excessive fine-grained tasks, scheduling inefficiencies, migration/affinity churn, runtime parameter misconfiguration; also imbalance due to poor chunk sizing, skewed task granularity, irregular loop workloads, or static scheduling with uneven iterations.

7.12.4. Interpretation Guidelines

Indicative (heuristic) ranges for triage guidance. Actual meaningful thresholds depend on workload characteristics, problem size, and node count. These are not hard pass/fail criteria.

Table 7.37 Indicative Ranges#

Metric Value

Qualitative Status

Suggested Action

0.00 – 0.05

Excellent

Focus on algorithmic improvements or problem-size scaling; parallel losses minor.

0.05 – 0.15

Moderate

Inspect top contributing regions/ranks; targeted tuning likely beneficial.

0.15 – 0.30

High

Prioritize root-cause classification (rank/thread imbalance vs communication vs OpenMP overhead).

> 0.30

Severe

Revisit decomposition, synchronization strategy, and parallel design fundamentals.
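The indicative ranges above can be turned into a small triage helper. The sketch below is illustrative only: the function name is hypothetical, and because adjacent ranges in the table share their endpoints, it assumes a boundary value (for example 0.05) belongs to the lower band.

```python
def qualitative_status(value):
    """Map a metric value in [0, 1] to the qualitative status of Table 7.37.

    Assumption: a value exactly on a shared boundary falls in the lower band.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("metric values are fractions in [0, 1]")
    if value <= 0.05:
        return "Excellent"
    if value <= 0.15:
        return "Moderate"
    if value <= 0.30:
        return "High"
    return "Severe"


for sample in (0.02, 0.10, 0.25, 0.45):
    print(sample, qualitative_status(sample))
```

As noted above, these bands are heuristics, so a helper like this is best used to rank metrics for inspection rather than as a pass/fail gate.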

7.12.5. Preconditions & Reporting Behavior

Metrics are reported only if all of the following conditions hold:

If any condition is violated, the reported strong scaling metrics may be incorrect or undefined. Users should verify that their application configuration satisfies these prerequisites before interpreting the metric values. Future AMD uProf updates will try to cover additional scenarios and issue warnings.

7.12.6. Concepts & Attribution Model

Work Per Rank

The metrics are based on separating the notions of work and communication. Communication can be thought of as the cost of parallelizing work. Work represents (or approximates) that body of computation whose size remains fixed during a strong scaling study. In strong scaling, we keep the problem size fixed and we increase the number of ranks/threads. In perfect strong scaling, the runtime decreases in direct proportion to the number of ranks/threads added. We define work as the amount of time (in each rank and thread) spent outside of the OpenMP Runtime and outside of any MPI calls. Hence work is runtime less the cost of communication and parallelization.

MPI Communication Inefficiency Assumption

Interprets non-compute intervals on the critical (max-work) rank as communication or communication-induced waiting. Reducing communication on other ranks does not lower wall time unless those changes make a different rank the new max-work rank.

OpenMP Potential Gain

For each parallel region invocation, compute the waste as: waste = omp_parallel_wall_time - (sum_of_thread_work / num_threads). Aggregate waste across invocations, regions, and ranks, then normalize by application runtime. This quantifies how much wall time could be saved if parallel regions had perfectly balanced work with zero synchronization and runtime overhead.
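The waste formula above can be sketched in a few lines of Python. This is a simplified illustration for a single rank: the function name and the sample timings are hypothetical, and `num_threads` is assumed to equal the number of per-thread work entries recorded for each invocation.

```python
def openmp_potential_gain(invocations, runtime):
    """Aggregate parallel-region waste and normalize it by application runtime.

    invocations: iterable of (omp_parallel_wall_time, [work of each thread]).
    """
    if runtime <= 0:
        raise ValueError("runtime must be positive")
    total_waste = 0.0
    for wall_time, thread_work in invocations:
        # Theoretical minimum: the same work spread evenly over all threads.
        ideal_time = sum(thread_work) / len(thread_work)
        total_waste += wall_time - ideal_time
    return total_waste / runtime


# Hypothetical run: two parallel region invocations, 10-second runtime.
invocations = [
    (4.0, [4.0, 2.0, 2.0, 0.0]),  # unbalanced: ideal 2.0 s, waste 2.0 s
    (2.0, [1.0, 1.0, 1.0, 1.0]),  # balanced work, 1.0 s lost to overhead
]
print(openmp_potential_gain(invocations, runtime=10.0))
```

The first invocation loses time to imbalance, the second to synchronization or runtime overhead; both contribute to the same aggregate, which is why the Level 3 components are needed to tell them apart.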

Level 3 Components

Partition OpenMP Potential Gain exclusively into:

7.12.7. Usage Workflow

  1. Start with Parallel Inefficiency to gauge overall scalability health.

  2. Decompose via Level 2 metrics to identify dominant class (MPI balance, communication, serial, or OpenMP).

  3. If OpenMP Potential Gain dominates, inspect its Level 3 breakdown to distinguish synchronization vs other overhead.

  4. Correlate MPI Communication or Load Balance issues with MPI rank timelines and communication matrix views (MPI Communication Matrix).

  5. Apply focused optimizations; re-profile to confirm targeted metric reductions and validate that changes persist under scaling.
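Steps 2 and 3 of the workflow amount to picking the largest contributor at each level. A minimal sketch, assuming the metric values have already been read from a report into dictionaries (the helper name and all numbers below are hypothetical):

```python
def dominant_limiter(metrics):
    """Return the (name, value) pair with the largest inefficiency."""
    if not metrics:
        raise ValueError("no metrics supplied")
    name = max(metrics, key=metrics.get)
    return name, metrics[name]


# Hypothetical values for one profiling session.
level2 = {
    "MPI Load Balance Inefficiency": 0.04,
    "MPI Communication Inefficiency": 0.07,
    "Serial Region Inefficiency": 0.02,
    "OpenMP Potential Gain": 0.21,
}
level3 = {"Potential Gain: Sync": 0.15, "Potential Gain: Other": 0.06}

name, value = dominant_limiter(level2)
if name == "OpenMP Potential Gain":
    # Drill into the Level 3 breakdown only when OpenMP dominates.
    name, value = dominant_limiter(level3)
print(name, value)
```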

7.12.8. Optimization Focus Examples

Some guidelines in case of high inefficiency values (this is not an exhaustive list):

High MPI Load Balance Inefficiency

High MPI Communication Inefficiency

High Serial Region Inefficiency

High Potential Gain: Sync

High Potential Gain: Other

7.12.8.1. Examples

The following report is a sample parallel efficiency report for an application using OpenMP and MPI.

Sample Parallel Efficiency Report.

Figure 7.65 Sample Parallel Efficiency Report#

7.12.9. Limitations

7.12.10. Best Practices

7.13. Accelerators

7.13.1. GPU Profiling

GPU Profile is the starting point for analyzing the most time-consuming GPU kernels and the GPU usage, based on various predefined GPU H/W metrics. GPU Profile uses Radeon Open Compute (ROCm) to collect profiling data and generate raw files.

The AMD Rocprofiler library provides support to monitor GPU hardware performance events when GPU kernels are dispatched and executed. The derived performance metrics are computed and reported in the CSV format (CLI) and in GUI.

7.13.1.1. Prerequisites

Install ROCm™

Install AMD ROCm 7.1.0 on the target system to run GPU Profiling. uProf also supports backward compatibility until version 5.2.1. Supported accelerators - AMD Instinct™ MI200 and MI300A.

Complete the following procedure to install ROCm:

  1. Complete the steps in the ROCm Installation Guide to install AMD ROCm™ v7.1.0 on the host system.

  2. After AMD ROCm™ 7.1.0 installation, make sure the symbolic link of /opt/rocm/ points to /opt/rocm-7.1.0/.

$ ln -s /opt/rocm-7.1.0/ /opt/rocm/
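To confirm that the link resolves as intended, a check along the following lines may help. It is a sketch using only the Python standard library; the function name is hypothetical and the default paths are the ones named in the step above.

```python
import os


def rocm_link_ok(link="/opt/rocm", expected="/opt/rocm-7.1.0"):
    """Return True if `link` is a symbolic link that resolves to `expected`."""
    # No trailing slash on `link`: with one, the link itself would be followed.
    return os.path.islink(link) and os.path.realpath(link) == os.path.realpath(expected)


print("ROCm link OK:", rocm_link_ok())
```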

Note

Profiling might not work as expected on 5.2.1 or older versions.

7.13.1.2. Optional Settings

By default, AMDuProf uses:

7.13.1.3. Supported Options

--ip-block - Provide IP-Block of raw events to be collected.

7.13.1.4. Supported Events and Metrics

Events

Run the following command to list the supported GPU H/W events on the target system:

AMDuProfCLI info --list gpu-events

Metrics

See the Omniperf document for the extensive list of supported metrics.

7.13.1.5. GPU Profiling Using CLI

Collect Profile Data

Use the following commands to collect GPU performance data:

These commands will launch the program and collect the profile data. After the launched application is executed, the AMDuProfCLI will display the session directory path in which the raw profile data is saved.

For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.

Example

AMDuProfCLI collect --config gpu -o /tmp/ /tmp/namd
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-namd-GPUProfile_MMM-dd-yyyy_hh-mm-ss

Here, the generated session directory is /tmp/AMDuProf-namd-GPUProfile_MMM-dd-yyyy_hh-mm-ss.
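When the collect step is scripted, the session directory can be recovered from the CLI output shown above. The following sketch relies only on the "Generated data files path:" line from the example; the function name is hypothetical, and the sample text simply mirrors that example.

```python
import re

_PATH_RE = re.compile(r"^Generated data files path:\s*(.+?)\s*$", re.MULTILINE)


def session_dir_from_output(output):
    """Return the session directory printed by a collect run, or None."""
    match = _PATH_RE.search(output)
    return match.group(1) if match else None


sample_output = """Profiling started
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-namd-GPUProfile_MMM-dd-yyyy_hh-mm-ss
"""
print(session_dir_from_output(sample_output))
```

The captured text of a real run (for example from `subprocess.run(..., capture_output=True, text=True)`) could be passed in the same way, assuming the CLI writes this line to the captured stream.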

7.13.1.6. GPU Profiling Using GUI

  1. To launch the AMDuProf GUI, go to Home > Welcome page.

  2. Click Profile an Application on the Welcome page.

  3. Provide application path, application options, working directory, and environment variables, if any. Click Next.

  4. From Predefined Configs, select GPU Profile.

  5. Click Start Profile to start the profiling.

Note

Behavior is undefined when the GPU profile collection is interrupted, or the launch application is killed from another terminal.

7.13.1.7. Analyze Data

7.13.1.7.1. Using CLI

Use the following CLI report command to generate the profile report in .csv format by passing the session directory path generated in collection.

AMDuProfCLI report -i <session directory>

For a list of all the supported options, refer to AMDuProfCLI Report Command Options.

Here is an example of a report for a GPU Profile session:

Sample GPU Profile Report.csv.

Figure 7.66 Sample GPU Profile Report.csv#

Sample GPU Profile Report.csv.

Figure 7.67 Sample GPU Profile Report.csv#

7.13.1.7.2. Using GUI

If data is collected using CLI, then use Import Session to import the session into GUI to analyze data in GUI. The following are the supported views to analyze GPU Profile:

Here is a screenshot of an imported GPU Profile session in GUI:

Summary - Hotspots GUI.

Figure 7.68 GPU Offloading Analysis - Hot Spots#

Legend

Summary - Session Information GUI.

Figure 7.69 GPU Offloading Analysis - Session Information#

Legend

GPU Offloading Analysis - Analyze.

Figure 7.70 GPU Offloading Analysis - Analyze#

Legend

  1. All supported views are segregated into three categories as follows:

  2. The filters pane lets you filter the profile data by providing the following options:

  3. This section lists all launched GPU Kernels in descending order of total execution time with total launch count, Min, Max and Avg time taken by each kernel. This section also supports sorting data on all columns.

  4. Any selected kernel(s) are displayed in this label.

  5. Select the view to analyze from the drop-down.

  6. The metrics for the selected view are listed in this section.

Use these views to analyze how efficiently the application uses the GPUs, that is, how much time a specific GPU kernel took to execute, along with the H/W counter values evaluated for that kernel.

7.13.1.8. Identify the Hottest GPU Kernel

Use GPU Profile to get a list of the most time-consuming GPU Kernels. All kernels are sorted in descending order of total execution time. It also lists the kernel’s launch count, Min, Max, and Avg execution time.

Select one or more kernels to evaluate all metrics for those specific kernels.
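The kernel list described above is a per-kernel aggregation sorted by total execution time. As an illustration of that aggregation (not of AMD uProf internals), the sketch below builds the same columns from a list of kernel dispatch durations; the function name, kernel names, and durations are hypothetical.

```python
from collections import defaultdict


def kernel_summary(dispatches):
    """Aggregate (kernel_name, duration) pairs, hottest kernel first."""
    durations = defaultdict(list)
    for name, duration in dispatches:
        durations[name].append(duration)
    rows = [
        {
            "kernel": name,
            "count": len(values),
            "total": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for name, values in durations.items()
    ]
    # Descending order of total execution time.
    rows.sort(key=lambda row: row["total"], reverse=True)
    return rows


# Hypothetical dispatch durations in nanoseconds.
dispatches = [("vec_add", 100), ("matmul", 900), ("vec_add", 300), ("matmul", 700)]
for row in kernel_summary(dispatches):
    print(row)
```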

7.13.1.9. Limitations

7.13.2. GPU Offloading Analysis (GPU Tracing)

GPU offloading analysis is used to explore the traces of the function calls for a GPU compute-intensive application.

It provides an in-depth analysis of the HIP API calls, HSA API calls, the order of kernel execution, the time taken by each kernel to execute, and a data transfer summary with a per-thread timeline. It also provides an aggregated list of the hottest kernels with timing metrics.

The AMD ROCtracer library provides support to capture the runtime APIs and GPU activities such as data transfer and kernel execution. This analysis helps to visualize the ROCr, HIP API calls, and GPU activities when a HIP based application is running. It is supported only with a launch application.

7.13.3. Prerequisites

7.13.4. Optional Settings

By default, AMDuProf uses:

7.13.5. Supported Events

AMD uProf supports tracing the following ROCr runtime APIs and GPU activities. To show the collected data in CLI Report/GUI timeline view:

Table 7.38 Supported Interfaces for GPU Tracing#

Category

Event

Description

GPU

hip

HIP runtime trace

GPU

hsa

AMD ROCr runtime trace

7.13.6. Data Collection Using CLI

Use the following commands to collect the GPU Trace data.

These commands will launch the program and collect the trace data. Once the launched application is executed, the AMDuProfCLI will display the session directory path in which the raw profile data is saved.

For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.

Example

[bin]$ ./AMDuProfCLI collect --trace gpu -o /tmp/ /tmp/namd
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-namd-GpuTrace_Sep-05-2024_21-43-08

Here, the generated session directory is /tmp/AMDuProf-namd-GpuTrace_Sep-05-2024_21-43-08.

7.13.7. Data Collection Using GUI

Complete the following steps to start profiling:

  1. Click Profile an Application on the Welcome page.

  2. Provide application path, application options, working directory, and environment variables, if any. Click Next.

  3. From Custom Configs, select GPU Trace.

  4. Click Start Profile to start the profiling.

Note

Behavior is undefined when the GPU profile collection is interrupted, or the launch application is killed from another terminal.

7.13.8. Analyze the Data

7.13.8.1. Using CLI

Use the following CLI report command to generate the profile report in .csv format by passing the session directory path generated in collection.

AMDuProfCLI report -i <session directory>

For a list of all the supported options, refer to AMDuProfCLI Report Command Options.

Here is an example of a report for a GPU Trace session:

Sample GPU Trace Report.

Figure 7.71 Sample GPU Trace Report#

7.13.8.2. Using GUI

If data is collected using CLI, then use Import Session to import the session into GUI to analyze data in GUI. Below are the supported views to analyze GPU Trace.

Here is a screenshot of an imported GPU Trace session in GUI.

GPU Offloading Analysis - Hot Spots.

Figure 7.72 GPU Offloading Analysis - Hot Spots#

Legend

  1. Application Details: Gives an overview of the target application traced.

  2. GPU Kernel Launch Summary: Identifies the top 4 hottest kernels launched on the GPU, with their launch counts and total execution time on the GPU cores.

  3. Data Transfer Summary: Identifies how much time is spent in data transfer between host and device and how many times a data transfer is initiated.

GPU Offloading Analysis - Session Information.

Figure 7.73 GPU Offloading Analysis - Session Information#

Legend

  1. Profile Details

  2. System Details

  3. Target Details

  4. GPU Device Details

GPU Offloading Analysis - Analyze.

Figure 7.74 GPU Offloading Analysis - Analyze#

Legend

  1. HIP Overview: HIP API calls summary

  2. HSA Overview: HSA API calls summary

GPU Offloading Analysis - Per Thread Timeline.

Figure 7.75 GPU Offloading Analysis - Per Thread Timeline#

Use this UI to analyze the following:

  1. GPU Usage, GPU Memory usage and GPU Power of your application over the profile duration.

  2. GPU Kernels executed over the profile duration.

  3. Data transfer between host and device over the profile duration, which helps to identify the time spent in data copy.

  4. The hottest GPU kernels. Use GPU Trace to get a list of the hottest GPU kernels. All the kernels are sorted in descending order of elapsed time.

7.13.8.3. Limitations

  1. System-wide profiling data collection and profiling data collection for an already running process or thread are not supported.

  2. OpenMP tracing is currently not supported.

7.14. Other Analysis

7.14.1. Function Tracing

Function tracing in Linux is used to monitor and analyze the execution of functions. It provides insights into the functions called by an application and functions’ execution time. Function tracing introduces an additional overhead, which results in longer profiling times for an application.

Note

In high-frequency function tracing scenarios, the eBPF ring buffer may overflow, causing silent data loss. Tracing results captured under these conditions may be unreliable.

7.14.1.1. Prerequisites

Linux kernel 4.15 or later is required. From the AMDuProf installed directory, run the script AMDuProfSetup.sh with root access.

sudo ./AMDuProfSetup.sh

If you install AMD uProf using DEB installer, the script is run by the installer and the info about eBPF (Extended Berkeley Packet Filter) support on the host and function tracing support is provided.
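The kernel version requirement can be checked before running the setup script. The sketch below compares the leading major.minor numbers of the kernel release string against 4.15; the function name is hypothetical, and it does not check for eBPF support, which the setup script reports separately.

```python
import platform
import re


def kernel_at_least(release, minimum=(4, 15)):
    """Return True if a release string such as '5.15.0-91-generic' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # Compare as integers: a string comparison would rank '4.9' above '4.15'.
    return (int(match.group(1)), int(match.group(2))) >= minimum


if platform.system() == "Linux":
    print("Kernel is 4.15 or later:", kernel_at_least(platform.release()))
```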

7.14.1.2. Data Collection Using CLI

Use the following commands to collect the function trace data.

After the profile data collection is complete, a session directory is generated. Use the session directory to generate the .csv report or to import the session into the GUI.

For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.

Example

./AMDuProfCLI collect --trace func --func /tmp/ScimarkStable:* -o /tmp/ /tmp/ScimarkStable
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23

Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23.

7.14.1.3. Data Collection Using GUI

Complete the following steps to start profiling:

  1. Click Profile an Application on the Welcome page.

  2. Provide application path, application options, working directory, and environment variables, if any. Click Next.

  3. From Custom Configs, select Function Tracing.

  4. Click Start Profile to start the profiling.

7.14.1.4. Function Tracing Options

Table 7.39 Function Tracing Options#

Option

Description

--exclude-func <module:function-pattern>

Specify functions to exclude from the library or executable:

  • Function-pattern can be a function name or partial name ending with * or only * to exclude all the functions of a module.

  • Module can be a library or executable.

  • To exclude the kernel functions, use the module as kernel.

Note

It is recommended to provide the absolute path of a module.

--func <module:function-pattern>

Specify functions to trace from the library or executable:

  • Function-pattern can be a function name or partial name ending with * or only * to trace all the functions of a module.

  • Module can be a library or executable.

  • To trace the kernel functions, use the module as kernel.

Note

It is recommended to provide the absolute/full path of a module.

--func-size <size>

By default, AMDuProf traces functions whose size is greater than or equal to 128 bytes. To trace functions using a different minimum size, use this option to set the function size.

--func-threshold <value>

By default, the function threshold value is set to 1000000 ns. To trace functions whose execution time is less than the default threshold, use this option to set the function threshold value.
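To make the `<module:function-pattern>` form concrete, the following sketch mimics the matching rules described in the table: an exact function name, a partial name ending with *, or * alone for every function of a module. It is only an illustration of those rules; the helper names are hypothetical, the function names in the usage lines are made up, and the way AMDuProfCLI actually parses the option may differ.

```python
def split_spec(spec):
    """Split 'module:function-pattern' at the first colon."""
    module, sep, pattern = spec.partition(":")
    if not sep or not module or not pattern:
        raise ValueError(f"expected <module:function-pattern>, got {spec!r}")
    return module, pattern


def pattern_matches(pattern, function_name):
    """Apply the table's rules: exact name, 'prefix*', or '*' alone."""
    if pattern.endswith("*"):
        # '*' alone leaves an empty prefix, which matches every function.
        return function_name.startswith(pattern[:-1])
    return function_name == pattern


module, pattern = split_spec("/tmp/ScimarkStable:*")
print(module, pattern, pattern_matches(pattern, "main"))
print(pattern_matches("kernel_*", "kernel_measure"))
```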

7.14.1.5. Analyze the Data Using CLI

An example of the function summary report section in the .csv report file is as follows. It provides the function count, total time (function and its children), min and max function self-time and total self-time.

Function Tracing - Function Summary Report.

Figure 7.76 Function Tracing - Function Summary Report#

7.14.1.6. Analyzing Data Using GUI

If data is collected using CLI, then use Import Session to import the session into GUI to analyze data in GUI.

Function Tracing - Function Count Summary Report.

Figure 7.77 Function Tracing - Function Count Summary Report#

7.14.1.7. Limitations

7.14.2. Memory Tracing Options

Table 7.40 Memory Tracing Options#

Option

Description

--memory-threshold <threshold in bytes>

By default, AMDuProf traces memory allocations whose size is greater than or equal to 1 KB. To trace memory allocations of a custom size, use this option to set the threshold.

7.14.2.1. Data Collection using CLI

Once profile data collection is complete, a session directory is generated. Use the session directory to generate the .csv report or to import the session into the GUI.

For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.

Example

./AMDuProfCLI collect --trace memory -o /tmp/ /tmp/ScimarkStable
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23

Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23.

7.14.2.2. Analyze the Data Using CLI

Generate the csv report to analyze the data in csv format.

AMDuProfCLI report -i <session directory>

For a list of all the supported options, refer to AMDuProfCLI Report Command Options.

Example

./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23
Translation started

Report generation started

Report generation completed...
Generated report file: /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23/report.csv

Here is an example of the memory report section in the .csv report file.

Memory Tracing - Memory Report.

Figure 7.78 Memory Tracing - Memory Report#

7.14.2.3. Limitations

7.14.3. Pagefault Tracing

Pagefault tracing helps identify the total pagefaults caused by a thread and process.

7.14.3.1. Prerequisites

sudo ./AMDuProfSetup.sh

If you install AMD uProf using DEB installer, the script is run by the installer and the info about eBPF (Extended Berkeley Packet Filter) support on the host and OS tracing support is provided.

7.14.3.2. Data Collection using CLI

To trace page faults, run the following command:

AMDuProfCLI collect --trace osrt --osrt-event pagefault -o <output-dir> <application>

Example

./AMDuProfCLI collect --trace osrt --osrt-event pagefault -o /tmp/ /tmp/ScimarkStable
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23

Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23.

7.14.3.3. Analyze the Data Using CLI

Generate the csv report to analyze the data in csv format.

./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23
Translation started

Report generation started

Report generation completed...
Generated report file: /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23/report.csv

An example of the pagefault report section in the .csv report file:

Pagefault Tracing - Pagefault Report.

Figure 7.79 Pagefault Tracing - Pagefault Report#

7.14.3.4. Limitations

7.14.4. Kernel Block I/O Analysis

The Linux OS block I/O calls like insert, issue, and complete can be traced to provide the various metrics related to I/O operations performed by the application.

This analysis can be used to analyze:

Note

The kernel may continue processing queued I/O requests submitted by the profiled application even after the application exits. Therefore, kernel block I/O analysis is supported only with system-wide tracing.

7.14.4.1. Prerequisites

sudo ./AMDuProfSetup.sh

If you install AMD uProf using DEB installer, the script is run by the installer and the info about eBPF (Extended Berkeley Packet Filter) support, BTF support and OS tracing support on the host is provided.

7.14.4.2. Data Collection Using GUI

Complete the following steps to start profiling:

  1. Click Profile an Application on the Welcome page.

  2. Provide application path, application options, working directory, and environment variables, if any. Click Next.

  3. From Custom Configs, select OS Runtime Tracing.

  4. Select diskio event from trace events.

  5. Click Start Profile to start the profiling.

7.14.4.3. Data Collection Using CLI

AMDuProfCLI collect --trace osrt --osrt-event diskio -a -o <output-dir> <application>
AMDuProfCLI collect --trace osrt --osrt-event diskio -a -d 10 -o <output-dir>
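The two command forms above differ only in whether an application is launched or a duration is given. For scripted use, the argument list could be assembled as in the sketch below. It uses only the options shown in those two commands; the helper name is hypothetical, and reading `-d` as a duration in seconds is an assumption.

```python
def diskio_collect_argv(output_dir, application=None, duration=None):
    """Build the argument list for a system-wide diskio trace collection."""
    argv = ["AMDuProfCLI", "collect", "--trace", "osrt",
            "--osrt-event", "diskio", "-a"]
    if duration is not None:
        # Assumption: -d takes the collection duration in seconds.
        argv += ["-d", str(duration)]
    argv += ["-o", output_dir]
    if application:
        # The application and its own arguments come last.
        argv += list(application)
    return argv


print(" ".join(diskio_collect_argv("/tmp/", duration=10)))
print(" ".join(diskio_collect_argv("/tmp/", application=["fio", "--name=test"])))
```

A list built this way can be handed to `subprocess.run` directly, which avoids shell quoting issues in the application's own arguments.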

Example

./AMDuProfCLI collect --trace osrt --osrt-event diskio -a -o /tmp/ fio --name=test --ioengine=sync --rw=randwrite --bs=4k --numjobs=1 --size=1G --runtime=1m --time_based
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-fio-CpuTrace_Sep-06-2024_03-38-23

Here, the generated session directory is /tmp/AMDuProf-fio-CpuTrace_Sep-06-2024_03-38-23.

7.14.4.4. Analyze the Data Using CLI

Generate the csv report to analyze the data in .csv format.

./AMDuProfCLI report -i /tmp/AMDuProf-fio-CpuTrace_Sep-06-2024_03-38-23
Translation started

Report generation started

Report generation completed...
Generated report file: /tmp/AMDuProf-fio-CpuTrace_Sep-06-2024_03-38-23/report.csv

An example of the Disk I/O report section in the .csv report file is here:

Disk I/O Report.

Figure 7.80 Disk I/O Report#

7.14.4.5. Analyze the Data Using GUI

If data is collected using CLI, then use Import Session to import the session into GUI to analyze data in GUI.

Navigate to the ANALYZE page and then select Disk I/O Stats in the navigation bar:

Disk I/O Stats.

Figure 7.81 Disk I/O Stats#

7.14.4.6. Limitations

7.15. Custom Profile

Apart from the predefined configurations, you can choose the required events to profile.

7.15.1. Configuring and Starting Profile

To perform the custom profile:

  1. Click PROFILE > Start Profiling to navigate to the Select Profile Target screen.

  2. Select the required profile target and click Next.

  3. From the Select Profile Type drop-down, select one of the following:

  4. Select the Custom Configs tab and select CPU Profile from the left vertical pane.

  5. Click Advanced Options to enable call-stack, set symbol paths (if the debug files are in different locations), and set other options. Refer to the section Advanced Options for more information on this screen.

  6. Once all the options are set, the Start Profile button at the bottom will be enabled. Click it to start the profile.

After the profile initialization, the profile data collection screen is displayed.

7.15.2. Analyzing Profile Data

Complete the following steps to analyze the profile data:

  1. When the profiling stops, the collected raw profile data is processed automatically and the Hot Spots screen of the Summary page is displayed. Refer to the section Overview of Performance Hotspots for more information on this screen.

  2. Click ANALYZE on the top horizontal navigation bar to go to the Function HotSpots screen. Refer to the section Function HotSpots for more information on this screen.

  3. Click ANALYZE > Metrics to display the profile data table at various granularities - Process, Load Modules, Threads, and Functions. Refer to the section Process and Functions for more information on this screen.

  4. Double-click any entry in the Functions table on the Metrics screen to load the source tab for that function in the SOURCES page. Refer to the section Source and Assembly for more information on this screen.

7.15.3. Limitations

CPU profiling in AMD uProf has the following limitations:

IMIX has the following limitations: