The following table describes profiling support on Linux for different perf_event_paranoid values:
| Config | Profile Scope | perf_event_paranoid Values |
|---|---|---|
| Core PMC Event Based Profiling | Specific application launched or attach to process | |
| Core PMC Event Based Profiling | Kernel, Hypervisor | |
| Core PMC Event Based Profiling | Entire System | |
| Instruction Based Sampling | Specific Application | |
| Instruction Based Sampling | Attach Specific Process | |
| Instruction Based Sampling | Entire System | |
| Time Based Profiling | Specific Application | |
| Time Based Profiling | Attach Specific Process | |
| Time Based Profiling | Entire System | |
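Before choosing a configuration, it can help to check the current setting on the machine. The following is a minimal sketch, assuming the standard Linux procfs location `/proc/sys/kernel/perf_event_paranoid`; the helper name `read_perf_event_paranoid` is hypothetical and is not part of AMDuProf. It only reads the value and does not interpret it.

```python
from pathlib import Path
from typing import Optional

# Standard procfs location of the setting on Linux.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_perf_event_paranoid(path: Path = PARANOID_PATH) -> Optional[int]:
    """Return the current perf_event_paranoid level, or None if unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Not Linux, procfs not mounted, or unexpected file contents.
        return None


if __name__ == "__main__":
    level = read_perf_event_paranoid()
    print("perf_event_paranoid:", "unavailable" if level is None else level)
```

The path is a parameter so that the helper can be exercised against a temporary file on systems without procfs.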
To perform a collect run, first configure the profile by specifying a profile configuration that identifies the following information:
Profile target
Profile type: What profile data should be collected (CPU Profile, CPU Trace, GPU Trace, or Power Profile)
Monitoring events: how the data should be collected
Additional profile data (if needed): callstack samples, profile scheduling, and so on
Note
The additional profile data to be collected depends on the selected profile type.
To start a profile, click either the PROFILE page in the top navigation bar or the Profile an Application? link on the HOME page Welcome screen. The Start Profiling screen is displayed.
Select Profile Target is available in the Start Profiling window.
Figure 7.1 Start Profiling - Select Profile Target
You can select one of the following profile targets from the Select Profile Target drop-down:
Application: Select this target when you want to launch an application and profile it (or launch it and do a system-wide profile). The only compulsory option is a valid path to the executable. By default, the path to the executable becomes the working directory unless you specify a path.
System: Select this if you do not want to launch any application but want to perform either a system-wide profile or a profile of a specific set of cores.
Process(es): Select this if you want to profile an application or process that is already running. This brings up a process table, which can be refreshed. You must select at least one process from the table to start the profile.
Once the profile target is selected and configured with valid data, the Next button is enabled to go to the next screen of Start Profiling.
Note
The Next button is enabled only if all the selected options are valid.
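The validity rules for the three targets can be sketched as follows. This is an illustrative model only: the `ProfileTarget` class and its field names are hypothetical and do not reflect how the GUI is implemented. It assumes that "valid path to the executable" means an existing file and that at least one process must be selected for the Process(es) target.

```python
import os
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProfileTarget:
    """Hypothetical model of the Select Profile Target choices."""
    kind: str                      # "application", "system", or "process"
    executable: str = ""           # used when kind == "application"
    working_dir: str = ""          # optional override
    pids: List[int] = field(default_factory=list)  # used when kind == "process"

    def effective_working_dir(self) -> str:
        # Falls back to the directory of the executable when no path is given.
        return self.working_dir or os.path.dirname(os.path.abspath(self.executable))

    def is_valid(self) -> bool:
        """Mirror of the 'Next button is enabled' condition (assumed rules)."""
        if self.kind == "application":
            return os.path.isfile(self.executable)
        if self.kind == "process":
            return len(self.pids) > 0
        return self.kind == "system"
```

In this sketch, a System target is always valid because it needs no further input, while the other two targets become valid only after the mandatory field is filled in.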
Once the profile target is selected and configured, click the Next button. The Select Profile Configuration screen is displayed as follows:
Figure 7.2 Start Profiling - Select Profile Configuration
Select one of the following tabs:
Predefined Configs consists of all the predefined configurations, such as Time-based Profiling, Cache Analysis, and Assess Performance.
Live Power Profiling consists of options to perform real-time power profiling.
Custom Configs has options to perform Custom CPU Profile, CPU Tracing, and GPU Tracing.
Once you select a profile type, the left vertical pane within this window will list the options corresponding to the selected profile type. For CPU Profile type, all the available predefined sampling configurations will be listed.
Modify event options are available only for the predefined configurations.
Click the Advanced Options button to proceed to the Advanced Options screen and set other options such as Call Stack Options, Profile Scheduling, Sources, Symbols, and so on.
The details in the Sample Data table are persistent and are saved by the tool with a name (here, it is AMDuProf-EBP-ScimarkStable). You can define this name and navigate to PROFILE > Saved Configurations to select and reuse the same configuration later.
The Next and Previous buttons are available to navigate between the various screens of the Start Profiling screen.
The CLI command is available at the bottom of this page, which displays the CLI version of the GUI option selected on the Select Profile Configuration page.
Click the Advanced Options button in the Select Profile Type screen to set the advanced options for Windows or Linux and start the profiling.
Linux
Figure 7.3 Start Profiling - Advanced Options for Linux
Windows
Figure 7.4 Start Profiling - Windows
You can set the following options on the Advanced Options screen and click Start Profile to begin profiling.
| Options | Description - Linux | Description - Windows |
|---|---|---|
| OpenMP Tracing Options | Enables collection of OpenMP runtime data for performance analysis. | Not available in Windows version. |
| Enable Thread Concurrency Option | Not available in Linux version. | Displays the number of threads running concurrently for the selected process to help analyze thread-level parallelism. |
| Call Stack Options | Configures call stack sampling for detailed call graph views and debugging. | Configures call stack sampling for detailed call graph views and debugging. |
| Profile Scheduling | Controls profiling start, duration, and API-based instrumentation for data collection. | Controls profiling start, duration, and API-based instrumentation for data collection. |
| Sources | Specifies source file paths for code attribution and bottleneck identification. | Specifies source file paths for code attribution and bottleneck identification. |
| Symbols | Defines symbol and server locations for resolving function names accurately. | Configures symbol and server locations for resolving function names accurately. |
| Data Aggregation Option | Specifies the coarseness level of data aggregation. The lower the coarseness level, the more fine-grained the data plotted in the timeline views. | Specifies the coarseness level of data aggregation. The lower the coarseness level, the more fine-grained the data plotted in the timeline views. |
Once all the options are set correctly, click the Start Profile button to start the profile and collect the profile data. After the profile initialization, the following screen is displayed.
Figure 7.5 Profile Data Collection
The time elapsed during the data collection is displayed. When the profiling is in progress, click:
Stop to stop the profiling.
Cancel to cancel the profiling. This takes you back to the Select Profile Target screen.
Pause to pause the profiling. Profile data is not collected while the profile is paused. Click Resume to continue the profiling.
The Overhead Estimation feature in AMDuProf provides insights into the overhead introduced during data collection and processing while profiling.
This capability helps users understand the impact of profiling on system performance and make informed decisions when selecting profiling configurations.
Supported and Unsupported Profiling Types
| Profiling Type | Overhead Feature Support |
|---|---|
| CPU Tracing | Not Supported |
| GPU Profiling | Not Supported |
| GPU Tracing | Not Supported |
| Live Power Profile | Not Supported |
| MPI Tracing | Not Supported |
| OpenMP Tracing | Not Supported |
| OS Runtime Tracing | Not Supported |

| Profiling Type | Overhead Feature Support |
|---|---|
| Assess Performance | Supported |
| Assess Performance (Extended) | Supported |
| Cache Analysis | Supported |
| Hotspots | Supported |
| Instruction Based Sampling | Supported |
| Investigate Branching | Supported |
| Investigate CPI | Supported |
| Investigate Data Access | Supported |
| Investigate Instruction Access | Supported |
| Overview | Supported |
| Threading Analysis | Supported |
| Time Based Sampling | Supported |
For unsupported configurations, a message is displayed stating that no overhead data is available for the selected configuration.
There are three categories of overheads:
High Overhead
Fair Overhead
Low Overhead
Figure 7.6 Sample High Overhead Message
There are two types of overheads:
Collection Overhead: Refers to the time required to collect data during profiling. This time is influenced by the profiling configuration and system conditions.
Processing Overhead: Refers to the time spent processing the collected data. This is impacted by various factors, including the presence of throttling events, sampling intervals, and sampling frequencies.
Before proceeding with the profiling experiment, AMDuProf provides a suggestion regarding the overhead based on the current configuration. This suggestion helps users decide whether to proceed with the current configuration or modify it based on the expected overhead. Multiple factors influence the overhead:
Predefined Configurations: Certain predefined configurations, such as Overview and Threading Analysis, have a high overhead in both data collection and processing.
Throttling Events: If one or more throttling events are present in either predefined or custom configurations, the processing overhead will be High and the collection overhead will be Fair.
Sampling Intervals and Frequencies:
Threshold for Sampling Intervals/Frequencies: A threshold exists for sampling intervals and frequencies.
Higher Sampling Interval/Lower Sampling Frequency: Fewer samples are collected, which decreases collection and processing overhead.
Lower Sampling Interval/Higher Sampling Frequency: More samples are collected, which increases collection and processing overhead.
This balance affects the efficiency of data collection and processing.
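The effect of the sampling interval on the amount of collected data can be estimated with simple arithmetic, since the event count is the product of the sample count and the sampling interval. The following sketch assumes a steady event rate, which real workloads rarely have; the function name and the numbers in the usage note are illustrative only.

```python
def expected_samples(event_rate_per_sec: float, duration_sec: float,
                     sampling_interval: int) -> int:
    """Approximate number of samples: one sample per `sampling_interval` events."""
    if sampling_interval <= 0:
        raise ValueError("sampling_interval must be positive")
    total_events = event_rate_per_sec * duration_sec
    return int(total_events // sampling_interval)
```

For example, with an assumed 3,000,000,000 events per second over 10 seconds, an interval of 250,000 gives about 120,000 samples, while an interval of 1,000,000 gives about 30,000.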
In addition to overhead information, you get an estimate of the profiling time based on the configuration and overhead type.
You can evaluate the overhead suggestion and decide whether to proceed with the current configuration or modify it for optimal performance. This guidance helps balance the desired profiling data and the overhead it may incur on the system.
For configurations not supported by the overhead feature, a message is displayed stating that no overhead data is available for the selected configuration.
Figure 7.7 Sample Overhead Message for Unsupported Configuration
When the profiling stops, the collected raw profile data is processed automatically. You can then analyze the profile data using the following UI sections to identify potential performance bottlenecks. The sections available depend on the profile type. The CPU Profile has SUMMARY, ANALYZE, MEMORY, HPC, and SOURCES pages to analyze the data:
SUMMARY page to view an overview of the hotspots for the profile session.
ANALYZE page to examine the profile data at various granularities.
SOURCES page to examine the data at source line and assembly level.
MEMORY page to examine the cache-line data for potential false cache sharing.
HPC page to examine the OpenMP tracing data for potential load imbalance issues and to visualize the MPI API trace and OS event trace.
When the translation is complete, the SUMMARY page is populated with the profile data and the Hot Spots screen is displayed. The SUMMARY page provides an overview of the hotspots for the profile session through various screens such as Hot Spots and Session Information.
In the Hot Spots screen, hotspots are displayed for functions, modules, processes, and threads. Processes and threads are displayed only if there is more than one.
The following figure shows the Hot Spots screen:
Figure 7.8 Summary - Hot Spots Screen
In the above Hot Spots screen:
The top 5 hottest functions, processes, modules, and threads for the selected event are displayed.
The Hot Functions pie chart is interactive. You can click any section and the corresponding function’s source opens in a separate tab in the SOURCES page.
The hotspots are shown per event, and the monitored event can be selected from the drop-down in the top-right corner. You can change it to any other event to update the corresponding hotspot data.
From the Select Summary View drop-down, select one of the following:
Hot Threads
Hot Processes
Hot Functions
Hot Modules
Based on the selection, one donut is displayed at a time.
| Data Collected | Table Present | Description | Timing Details |
|---|---|---|---|
| OS Trace | Schedule Summary | Summary of per thread running/wait time (percentages). | |
| OS Trace | Wait Object Summary | Time spent in operations related to several types of synchronization objects, that is, locks, mutexes, condition variables, and so on. | |
| OS Trace | Wait Function Summary | Time spent in several types of pthread blocking functions, that is, pthread_join, and so on. | |
| OS Trace | Syscall Summary | Time spent in syscall(s) | |
| GPU Trace | GPU Kernel Summary | Time spent per GPU kernel in execution in the enqueued device. | Profile Duration |
| GPU Trace | Data Transfer Summary | Time spent in GPU data copy operations. | Profile Duration |
| MPI Trace | MPIP2P API Summary | Time spent in various MPI P2P API across all ranks of the profile. | |
| MPI Trace | MPI Collective API Summary | Time spent in various MPI collective communication API across all ranks of the profile. | |
| CPU Profile | Hot Functions | Hottest functions based on CPU profile. | |
| CPU Profile | Hot Modules | Hottest modules based on CPU profile. | |
| CPU Profile | Hot Threads | Hottest threads based on CPU profile. | |
| CPU Profile | Hot Processes | Hottest processes based on CPU profile. | |

Figure 7.9 OS Trace Screen
Figure 7.10 GPU Trace Screen
Figure 7.11 MPI Trace Screen
The CPU Profile screen is similar to the Summary - Hotspots Screen.
Click ANALYZE > Thread Concurrency to view the following graph, which shows the thread concurrency of the profiled application.
Figure 7.12 Summary - Thread Concurrency Graph
The thread concurrency graph displays the duration (in seconds) for which a specific number of threads were running simultaneously.
A bucketization approach is used for this graph. Instead of showing the elapsed time for each core, the weighted average based on the bucket size is taken. The bucket size is determined by the number of cores and the number of available pixels. This avoids horizontal scrolling.
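The quantity plotted in the thread concurrency graph, that is, how long exactly N threads were running at once, can be derived from per-thread running intervals with a sweep over start and end events. The following is a minimal sketch of that idea, not the tool's implementation; the input format (a list of `(start, end)` running intervals in seconds) is an assumption, and the bucketization step is not modeled.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def concurrency_histogram(intervals: Iterable[Tuple[float, float]]) -> Dict[int, float]:
    """Map 'number of threads running at once' to total seconds at that level."""
    events = []
    for start, end in intervals:
        if end > start:
            events.append((start, +1))   # a thread starts running
            events.append((end, -1))     # a thread stops running
    events.sort()  # at equal timestamps, stops (-1) sort before starts (+1)

    histogram: Dict[int, float] = defaultdict(float)
    running = 0
    prev_time = None
    for time, delta in events:
        # Credit the elapsed span to the concurrency level that was active.
        if prev_time is not None and running > 0 and time > prev_time:
            histogram[running] += time - prev_time
        running += delta
        prev_time = time
    return dict(histogram)
```

For three threads running over 0 to 4, 1 to 3, and 2 to 5 seconds, the sketch reports 2 seconds with one thread, 2 seconds with two threads, and 1 second with three threads running.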
Click ANALYZE on the top horizontal navigation bar to go to the Function Hotspots screen, which displays the hot functions across all the profiled processes and load modules as follows:
Figure 7.13 ANALYZE - Function Hotspots
A process-wise and thread-wise breakdown of the data is available when the entries are expanded in the Function Hotspots view. The Functions table lists the hot functions. The IP samples are aggregated and attributed at function-level granularity. On the table, you can do the following:
Double-click on a function entry to navigate to the corresponding SOURCE view of that function.
Right-click to view the following options:
Copy selected row(s) to copy the highlighted row to clipboard.
Copy all rows to copy all the rows to clipboard.
The Filters and Options pane allows you to filter the profile data as follows:
You can click the Select View drop-down to control the counters that are displayed. The relevant counters and their derived metrics are grouped in predefined views.
You can use the Value Type drop-down to display the counter values as follows:
Sample Count is the number of samples attributed to a function.
Event Count is the product of sample count and sampling interval.
Percentage is the percentage of samples collected for a function.
You can use the System Modules option to either Exclude or Include the profile data attributed to system modules.
If callstack is enabled, the unique hot call-paths for the selected function are displayed in the Functions column.
Event Timeline is a line graph showing the number of aggregated sample values over time. You can use it to identify the hot functions within a profile region. From the Select Metric drop-down, you can select the event for which the event timeline must be plotted.
Not all entries are loaded for a profile. To load more than the default number of entries, click the vertical scroll bar on the right. When the entries are expanded, a process-wise and thread-wise breakdown of the data is available.
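The three Value Type representations described above are related by simple arithmetic. The sketch below illustrates that relation for a set of per-function sample counts. The function names in the usage note are made up, and the percentage is assumed to be relative to the total of the listed samples, which may differ from the base the tool uses.

```python
from typing import Dict


def value_types(sample_counts: Dict[str, int],
                sampling_interval: int) -> Dict[str, Dict[str, float]]:
    """Derive Sample Count, Event Count, and Percentage for each function."""
    total = sum(sample_counts.values())
    table = {}
    for function, samples in sample_counts.items():
        table[function] = {
            "sample_count": samples,
            # Event Count = sample count x sampling interval.
            "event_count": samples * sampling_interval,
            # Assumed base: all samples in `sample_counts`.
            "percentage": 100.0 * samples / total if total else 0.0,
        }
    return table
```

With 750 samples in a hypothetical `multiply` function, 250 samples in `init`, and a sampling interval of 250,000, `multiply` has an event count of 187,500,000 and accounts for 75% of the samples.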
Click ANALYZE > Grouped Metrics to display the profile data table at various program unit granularities: Process, Load Modules, Threads, and Functions. This screen contains data in two different formats as follows:
Figure 7.14 Summary - Analyze - Grouped Metrics
The upper tree represents samples grouped by Process. You can expand the tree to view the child entries for each parent (that is, for each process). The Load Modules and Threads are child entries of the selected process entry.
You can right-click to view the following options:
Expand All Entries to list the modules and threads of all the processes.
Collapse All Entries to list only the top-level entries.
Copy selected row(s) to copy the highlighted row to clipboard.
Copy all rows to copy all the rows to clipboard.
The lower Functions table contains samples attributed to corresponding functions. The function entries depend on what is selected in the upper tree. For more specific data, you can select a child entry from the upper tree and the corresponding function data will be updated in the lower tree. You can do any of the following:
Double-click on a function entry to navigate to the corresponding SOURCE view.
Right-click to view the following options:
Copy selected row(s) to copy the highlighted row to clipboard.
Copy all rows to copy all the rows to clipboard.
You can use the Filters and Options pane to filter the profile data displayed by various controls.
The Select View controls the counters that are displayed. The relevant counters and their derived metrics are grouped in predefined views. You can select the views from the Select View drop-down.
The Group By drop-down is used to group the data by Process, Module, or Thread. By default, the sample data is grouped by Process.
Click the Value Type drop-down to display the counter values as follows:
Sample Count is the number of samples attributed to a function.
Event Count is the product of sample count and sampling interval.
Percentage is the percentage of samples collected for a function.
You can use the System Modules option to Exclude or Include the profile data attributed to system modules.
Confidence level: The metrics that cannot be calculated reliably due to the low number of samples collected for a program unit are grayed out.
Not all entries are loaded for a profile. To load more than the default number of entries, click the vertical scroll bar on the right.
Double-click on any entry in the Functions table in the Metrics screen to load the source tab for the corresponding function in SOURCES page. If the GUI can find the path to the source file for that function, then it will try to open the file, failing which you will be prompted to locate it.
The following figure depicts the source and assembly screen.
Figure 7.15 SOURCES - Source and Assembly
The following sections are present in the SOURCES screen:
| Feature | Description |
|---|---|
| Filter Pane | Lets you filter the profile data. For multi-threaded or multi-process applications, if a function is executed from multiple threads or processes, each of them is listed in the Process and Threads drop-down lists in the Filters pane. Changing the selection updates the profile data for that selection. By default, profile data for the selected function, aggregated across all processes and all threads, is displayed. |
| Show Assembly | Toggle to enable or disable the assembly view for associating the source code with machine instructions. |
| Search Pane | Provides options to locate specific code or instructions. After providing the search criteria, click Search to execute. After you click the Search button, the Forward and Backward navigation buttons are enabled to navigate through the search results. |
| Select Source Line(s) Ordering Type | Choose the ordering of source lines in the view (for example, by metric value or by line number) for easier hotspot analysis. |
| HeatMap Event | Overview of the hotspots at source level. |
| Copy Options | Provides multiple ways to copy data for analysis and reporting. |
Note
If the source file cannot be located or opened, only disassembly will be displayed.
The Top-down Callstack view can be used to explore the call-sequence flow of the application and to analyze the time spent in a function and its callees.
Click ANALYZE > Top-down Callstack to view it as follows:
Figure 7.16 Top-down Callstack
Functions are displayed as parent-to-child entries, sorted by their inclusive sample values.
The inclusive sample value covers a function and its descendants.
The Hide C++ std Library Calls option has an effect only when C++ library calls are made. When enabled, it excludes such calls from the list and displays the other child entries.
In the context menu, Collapse entries closes all the expanded entries, Expand entries expands the child entries, and Open Source View displays the corresponding source view.
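To make the inclusive and exclusive sample values concrete, the following sketch builds a top-down tree from call-stack samples and lists children in descending order of inclusive samples. It is an illustration under stated assumptions, not the tool's implementation: each sample is assumed to be a list of function names ordered from the outermost caller to the leaf, and the class and function names are hypothetical.

```python
from typing import Dict, List, Sequence


class CallNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.inclusive = 0   # samples in this function or anything it called
        self.exclusive = 0   # samples where this function was the leaf frame
        self.children: Dict[str, "CallNode"] = {}


def build_top_down(stacks: Sequence[Sequence[str]]) -> CallNode:
    """Each stack is one sample, ordered from outermost caller to leaf."""
    root = CallNode("[root]")
    for stack in stacks:
        root.inclusive += 1
        node = root
        for frame in stack:
            if frame not in node.children:
                node.children[frame] = CallNode(frame)
            node = node.children[frame]
            node.inclusive += 1
        node.exclusive += 1
    return root


def render(node: CallNode, depth: int = 0) -> List[str]:
    """Flatten the tree, children sorted by inclusive samples (descending)."""
    lines = [f"{'  ' * depth}{node.name} incl={node.inclusive} excl={node.exclusive}"]
    for child in sorted(node.children.values(), key=lambda c: c.inclusive, reverse=True):
        lines.extend(render(child, depth + 1))
    return lines
```

Calling `render(build_top_down(stacks))` returns one indented line per tree entry, which mirrors how the expanded view lists a parent followed by its child entries.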
A flame graph is a visualization of sampled call-stack traces that helps quickly identify the hottest code execution paths. Click ANALYZE > Flame Graph to view it as follows:
Figure 7.17 ANALYZE - Flame Graph
The x-axis of the flame graph shows the call-stack profile and the y-axis shows the stack depth. The graph is not plotted based on the passage of time. Each cell represents a stack frame. The more often a frame is present in the call-stack samples, the wider its cell. This screen has the following options:
Module-wise coloring of the cells.
Click on a cell to zoom into only that cell and its children. Use the Reset Zoom button to visualize the entire graph.
Right-click on a cell to view the following context options:
Copy Function Data to copy the function names and its metrics to clipboard.
Open Source View to navigate to the source tab of that function.
Hover the mouse over a cell to display the tool-tip showing the inclusive and exclusive number of samples of that function.
Click the Zoom Graph button for a better zooming experience.
When you type a function name in the search box, a list of all the relevant matches will be displayed. Select the required function to highlight the cells corresponding to that function in the flame graph.
The Process drop-down lists all the processes for which call-stack samples are collected. Changing the process will plot the flame graph for that particular process.
For multi-threaded applications, the flame graph will be plotted for the cumulative data of all the threads by default.
The Threads drop-down lists all the threads for which call-stack samples are collected. Changing the thread will plot the flame graph for that thread.
The Select Metric drop-down lists all the metrics for which call-stack samples are collected. Changing the metric will plot the flame graph for that particular metric.
Click ANALYZE > Call Graph to navigate to the call graph screen. This graph is constructed using the call-stack samples and offers a butterfly view to analyze the hot call-paths as follows:
Figure 7.18 ANALYZE - Call Graph
The Function table lists all the functions with inclusive and exclusive samples. Click a function to display its Caller and Callee functions in a butterfly view. In addition, the parents and children of the selected function in the Function table are displayed.
Options
The Process drop-down lists all the processes for which call-stack samples are collected. Changing the process will show the call graph for that particular process.
For multi-threaded applications, the call-graph will be plotted for the cumulative data of all the threads by default.
The Threads drop-down lists all the threads for which call-stack samples are collected. Changing the thread will plot the call graph for that thread.
The Select Metric drop-down lists the metrics for which call-stack samples are collected. Changing the metric will show the call graph for that particular metric.
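The butterfly view for a selected function can be thought of as two tallies taken from the call-stack samples: the frames that appear immediately above the function (callers) and immediately below it (callees). The sketch below illustrates this under the assumption that each sample is a list of function names ordered from the outermost caller to the leaf; it is not the tool's implementation, and the `butterfly` helper is hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def butterfly(stacks: Sequence[Sequence[str]], function: str) -> Tuple[Counter, Counter]:
    """Return (callers, callees) sample counts for `function`."""
    callers: Counter = Counter()
    callees: Counter = Counter()
    for stack in stacks:
        for i, frame in enumerate(stack):
            if frame != function:
                continue
            if i > 0:
                callers[stack[i - 1]] += 1      # frame that called `function`
            if i + 1 < len(stack):
                callees[stack[i + 1]] += 1      # frame that `function` called
    return callers, callees
```

Note that with recursive functions a single sample can contribute more than once in this simplified version.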
To configure threading analysis from the GUI:
Navigate to the Select Profile Configuration screen.
Select Predefined Configs from the tab.
Select Threading Analysis from the left vertical pane.
Profile data collected from CLI or GUI can be visualized in GUI by importing the session. On importing, the following section (Thread Timeline) is displayed on the ANALYZE page.
Time-series data is plotted in timelines per entity (thread, rank, device, and so on). To address scalability issues related to data size, trace data (if collected) is plotted only when you zoom into the timeline, because trace data can have millions of records that are not visually legible if plotted together. The view is broadly separated into three vertical parts: data selectors at the top, timelines in the middle, and filters at the bottom. You can use the timeline as follows:
Hover the cursor over a timeline to view a vertical line and a tool-tip for a specific entity, showing relevant details and the current timestamp.
If CPU profile data is collected, click and drag the mouse over the timeline to select a region across all timelines and bring up the Function Hotspot for the selected time range.
Zoom in or out horizontally in the timelines using one of the following:
The mouse wheel
The CTRL and +/- keys on the keyboard to zoom in or zoom out, respectively
Figure 7.19 Timeline Analysis GUI in Linux
The timeline section consists of:
Name of each thread in timeline with Thread ID.
Click the Load More button to load more threads. By default, only a small number of thread timelines are loaded to limit resource consumption. This button loads the next set of thread timelines, which is determined by the entries in the table below the timeline.
Select the Data Source drop-down to choose the data to display on the timeline. The types of data source are as follows:
CPU Utilization: Displays a timeline of CPU utilization (in percentage) per thread. To collect sufficient data points, the total profile duration should be at least 5 seconds. In the Threading Analysis configuration, this data is collected at fine-grained intervals in milliseconds, whereas for other configurations, it is collected at a per-second interval.
Memory Consumption: Plots the timeline for the memory consumption (in MB) categorized as physical and virtual memory consumed. This is enabled only for the Threading Analysis configuration.
Context Switches: Plots the timeline for both the voluntary context switch count (sleep, yield, and so on) and the involuntary context switch count (context switches triggered by the OS scheduler). This is enabled only for the Threading Analysis configuration.
CPU Profile Samples: Plots the timeline for the CPU samples collected for the CPU events. The CPU samples are plotted as a heatmap, with colors ranging from dark green (highest number of samples in a window) to yellow-green (lowest number of samples in a window). Gray regions indicate that no samples were collected. The following events are supported:
| Events | Availability |
|---|---|
| CPU Time | Time-based profiling is performed. |
| Cycles not in Halt | PMC event |
| Op Cycles | IBS op event is collected with |
| Retired Instructions | PMC event |
GPU Related Counts: GPU Kernel Time, Copy Time and HIP API Time are also plotted to provide time spent in GPU kernels, GPU memory copy operations, and time spent inside HIP APIs, respectively.
Thread Trace: Plots the timeline based on OS trace data, which can originate from either eBPF Tracing or User-mode Tracing. The trace data is categorized and aggregated at certain intervals to generate the time series plotted in the timelines. The following categories are created:

| Category | Description |
|---|---|
| Event Wait Time | Total time spent on I/O multiplexing (poll, select, and so on) and on waiting for a process or thread to finish (wait, pthread_join, and so on). |
| I/O Sync Time | Total time spent on I/O sync APIs (sync, fsync, and so on). |
| I/O Time | Total time spent in I/O syscalls, that is, read, write, pread, pwrite, and so on. |
| Pause Time | Total time spent in the profile paused state. You can pause the profile collection with the profile control APIs or the GUI Pause button after starting the profile collection. |
| Resource Wait Time | Total time spent on synchronization objects, reader and writer locks, thread barriers, and so on. |
| Running Time | Total active processor time, which includes the time spent in I/O operations and spin lock operations. |
| Sleep Time | Total time spent on sleep and on system calls that wait for signal delivery. |
| Spin Time | Total time spent on spin locks. |
The Select Trace Overlay drop-down enables selection of the type of trace data to display.
Thread State: Shows the current state of thread from eBPF or User-mode tracing. In the former, thread state is inferred from BPF data. In the latter, thread state is treated as Running if Running Time > 0, otherwise, Sleeping.
Thread Trace: Displays traces for the traced libpthread functions, such as pthread_mutex_lock, pthread_mutex_trylock, and so on.
Syscalls: Displays traces for traced syscall in the specific region of the timeline.
Trace Cutoff can be used to specify a duration in nanoseconds, which acts as a cutoff to load the trace data, that is, any traced function which takes less than the specified nanoseconds will not be displayed.
Click the Reset Zoom button to reset any zoom performed earlier.
Hover over any timeline to view the tool-tip containing the relevant data along with the timestamp. If trace data is also present, the tool-tip also shows the relevant traced functions with their start time and duration.
Filter Threads/Ranks enables you to filter which thread (or rank) timelines are displayed. By default, the timelines are sorted internally and the first 6 are loaded. However, from the table, you can select the required threads and click Apply Filter to apply the changes. If CPU profile data is collected, you can also highlight functions or modules. Each function is assigned a random color, which can be modified, and is highlighted in the timeline wherever there are samples from that function or module.
Each entry in the filter table has the necessary data, that is, name, parent object, and samples/trace times aggregated across the profile.
Click the Apply Filter button to apply a custom selection of entities or to highlight entities in the timeline. If GPU acceleration is available, there is no need to click Apply Filter because the changes are reflected instantaneously.
Click Deselect selected Items to deselect all the entries in the filtering table except the first one. This is useful when a custom selection is required but all timelines are already loaded.
At the bottom of the filtering pane, a timeline legend is displayed, which helps identify the color mapped to each type of data source or trace.
The Show Core Transition button is disabled by default and works only when the CPU profiling data is collected. When enabled, a red line is displayed in each timeline to signify when a thread changes the core.
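Two of the rules in the list above are simple enough to express directly: the Trace Cutoff filter and the User-mode Tracing thread state rule. The sketch below illustrates both. The `TraceRecord` type and its field names are hypothetical and do not describe the tool's data format.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TraceRecord:
    function: str        # for example, "pthread_mutex_lock"
    start_ns: int
    duration_ns: int


def apply_trace_cutoff(records: Iterable[TraceRecord], cutoff_ns: int) -> List[TraceRecord]:
    """Drop traced calls that took less than `cutoff_ns` nanoseconds."""
    return [r for r in records if r.duration_ns >= cutoff_ns]


def user_mode_thread_state(running_time_ns: int) -> str:
    """User-mode tracing rule: Running if Running Time > 0, otherwise Sleeping."""
    return "Running" if running_time_ns > 0 else "Sleeping"
```

A larger cutoff keeps only the long-running calls, which reduces the number of trace records that must be loaded and drawn.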
Note
Time-series data (from Select Data Source) is plotted as a line graph, where the x-axis is time and the height on the y-axis indicates how close the value is to the maximum value reached. For trace records, the height is always the total height of the timeline, while the width varies with the duration of the traced function.
For CPU profile samples, a heatmap-like visualization is used and the height has no significance.
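The plotting rule in the note, where the height reflects how close a value is to the maximum, can be sketched in two steps: aggregate raw samples into fixed-width time buckets, then scale each bucket by the peak. This is an assumption-laden illustration, not the tool's implementation; the bucket width, the use of a sum as the aggregate, and both helper names are choices made for the example.

```python
import math
from typing import List, Sequence, Tuple


def aggregate(samples: Sequence[Tuple[float, float]], interval: float,
              duration: float) -> List[float]:
    """Sum (timestamp, value) samples into fixed-width time buckets."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    n = max(1, math.ceil(duration / interval))
    buckets = [0.0] * n
    for timestamp, value in samples:
        if 0 <= timestamp < duration:
            buckets[min(int(timestamp // interval), n - 1)] += value
    return buckets


def relative_heights(series: Sequence[float]) -> List[float]:
    """Scale each point to the range 0..1 relative to the series maximum."""
    peak = max(series, default=0.0)
    return [v / peak if peak > 0 else 0.0 for v in series]
```

A bucket with a relative height of 1.0 corresponds to the tallest point of the plotted line.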
When you enable CPU profiling (along with other data sources), you can highlight functions and modules in the timeline across threads. The tool lists them in tables under the Filtering Threads option, ordered by CPU sample data. You can select multiple functions or modules to highlight, and they appear as overlays in the timeline view. You can also change the color for each function or module, and the overlay updates accordingly.
Figure 7.20 Function Highlights in Timeline#
Highlight tasks if the profiled application is instrumented using the Instrumentation API. The timeline displays tasks across threads, and you can select them from the Highlight Tasks tab. This tab presents task data grouped by domain and sorted by the total time spent across all task instances. To focus the view, click the Show only associated Threads button to display only those threads that contain at least one instance of the selected tasks.
When you select a region in the timeline view by clicking and dragging with the mouse, uProf generates aggregate data for that region. Depending on the configuration and collected data, the following types of aggregate data are displayed:
Function Hotspot – visible only if CPU samples are collected
Flame Graph – visible only if CPU samples are collected
Wait Object Hotspot – visible only when using the Threading Analysis configuration
Figure 7.21 Region Selection in Timeline#
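Conceptually, the Function Hotspot data for a selected region is an aggregation of the CPU samples whose timestamps fall inside the selection. The sketch below assumes a simplified sample format of (timestamp, function name) pairs, which is a hypothetical representation chosen for illustration only.

```python
from collections import Counter


def function_hotspots(samples, start, end):
    """Count samples per function inside the selected [start, end] window."""
    counts = Counter(
        function for timestamp, function in samples
        if start <= timestamp <= end
    )
    # Highest sample count first, like a hotspot table.
    return counts.most_common()


# Hypothetical samples: (timestamp in seconds, function name)
samples = [(0.5, "parse"), (1.2, "solve"), (1.8, "solve"), (2.4, "parse"), (3.1, "write")]
hotspots = function_hotspots(samples, 1.0, 2.5)
print(hotspots)
```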
The Per Thread timeline view shows all aspects of a specific thread based on the collected data: CPU profile samples, OS traces, GPU traces, and system metrics at the per-thread level. The selection table in the bottom pane sorts the threads by the first event in the table, and you can switch threads from the table. Function and module highlights work the same way as in the All thread timeline.
Figure 7.22 Per Thread Timeline#
Each per-thread timeline displays two types of flame graphs:
A callstack flame graph: shows how the call stack evolves over time for a specific thread.
A task flame graph: appears when the profiled application uses the Instrumentation API. This graph visualizes the evolution of tasks per thread. As the tasks can be nested, the flame graph may resemble a call stack when such nesting occurs.
The callstack flame graph is only available for the following profile configurations:
Native
Hotspots + CSS
Overview
TBP + Function trace
EBP + Frequency + Function trace
IBS + Function trace
Custom config + Function trace
Java
Hotspots + CSS
Python
Hotspots + CSS
If GPU tracing data is also collected for applications using HIP APIs, this view also plots the GPU utilization, GPU memory, and GPU power for all the GPUs on which the current thread scheduled kernels. Because a single thread can schedule kernels on multiple GPUs, one line is plotted for each such device, and the device information is shown in the tooltip.
Figure 7.23 Per Thread Timeline - Tooltip#
GPU acceleration is available only if OpenGL drivers exist on the system; this applies to both Windows and Linux. Otherwise, the tool automatically falls back to the CPU implementation, which is slower. (This is also the case when using remote X11 servers, for example, when launching the Linux UI on Windows with MobaXTerm-like tools.) The minimum expected OpenGL version is 3.1 on both Linux and Windows. AMDuProf tries to detect the version automatically. If detection fails and the tool falls back to CPU rendering even though OpenGL (>= 3.1) is available, you can use the environment variables listed in the following table to adjust the behavior.
| Environment Variable | Purpose | Default Value |
|---|---|---|
| | Specify major version of OpenGL, minimum 3. | |
| | Specify minor version of OpenGL for the specified major version. | |
| | Specify the OpenGL profile. Valid values are: | |
| | Specify whether to use desktop OpenGL or not. This can be used to disable GPU accelerated graphics entirely if it does not work as intended. (Valid values are | |
IMIX view shows the summary of instruction-wise samples collected. This view is shown only for IBS profiling. Click ANALYZE > IMIX to navigate to the IMIX view.
Figure 7.24 IMIX View#
The Wait Object Hotspots view shows wait object data in detail. Several groupings are available for in-depth analysis, and each entry can be expanded level by level. Click ANALYZE > Wait Object Hotspots to navigate to this view. For more information, refer to Wait Time Analysis.
Figure 7.25 Wait Object Hotspots#
Hotspots Analysis is the starting point for algorithm analysis of an application. Use Hotspots Analysis to understand the application code flow and the sections of code that consume the most execution time (CPU Time).
User mode sampling embeds an agent library into the application address space using LD_PRELOAD. The agent creates a per-thread OS timer (the default timer interval is 10 ms) that interrupts the execution of a thread by generating SIGPROF or another runtime signal. When a thread receives the signal, the agent's signal handler collects an IP sample and, if callstack collection is enabled, the callstack for that sample. The collected data is stored in binary files for later processing.
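The timer-plus-signal mechanism can be sketched in a few lines of Python on Linux. This is only an analogy, not the AMDuProf agent: it uses the process-wide ITIMER_PROF timer rather than a per-thread timer, it records the name of the interrupted Python function instead of an instruction pointer, and the names sample_hotspots and busy_loop are hypothetical.

```python
import signal
import time
from collections import Counter


def sample_hotspots(workload, interval_s=0.01):
    """Run workload() while a CPU-time timer delivers SIGPROF periodically."""
    samples = Counter()

    def on_sigprof(signum, frame):
        # The handler receives the interrupted frame: record where we were.
        if frame is not None:
            samples[frame.f_code.co_name] += 1

    previous = signal.signal(signal.SIGPROF, on_sigprof)
    signal.setitimer(signal.ITIMER_PROF, interval_s, interval_s)
    try:
        workload()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop the timer
        signal.signal(signal.SIGPROF, previous)     # restore the old handler
    return samples


def busy_loop(cpu_seconds=0.3):
    """Burn CPU time so that the profiling timer fires several times."""
    deadline = time.process_time() + cpu_seconds
    total = 0
    while time.process_time() < deadline:
        total += 1
    return total


samples = sample_hotspots(busy_loop)
print(samples.most_common(3))
```

With the 10 ms interval used here, the loop is interrupted a few dozen times; a longer interval lowers the overhead but also the number of samples.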
In OpenMP applications, if a parallel region is executed by multiple threads (master and worker threads), each worker thread has its own calling sequence (call path or callstack), which logically starts where the master thread encountered the parallel region. During translation, AMDuProf stitches each worker thread's call path to the master thread's call path at the point where the parallel region started, provided the worker threads are active in the parallel region. This allows the run time of the worker threads to be attributed to the correct logical calling sequence of the program (that is, the calling sequence without OpenMP), so that AMDuProf can produce accurate flame graphs.
By default, AMDuProf detects the compiler used to build the application by reading the .comment section and stitches the worker threads' call paths to the master thread. If AMDuProf fails to stitch the call paths, select the OpenMP implementation type omplib for GCC-compiled applications and ompt for AOCC-, ICC-, and LLVM-compiled applications.
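The idea behind call path stitching can be sketched as follows. This is a simplified model under stated assumptions, not AMDuProf's translation logic: call paths are plain lists of function names (root first), and the function that entered the parallel region is assumed to be known. All names are hypothetical.

```python
def stitch_call_path(master_path, worker_path, region_function):
    """Prefix the worker's call path with the master's path up to and
    including the function that entered the parallel region."""
    if region_function not in master_path:
        return list(worker_path)  # nothing to stitch against
    cut = master_path.index(region_function) + 1
    return list(master_path[:cut]) + list(worker_path)


master = ["main", "run_solver", "compute"]   # hypothetical master call path
worker = ["compute._omp_fn.0", "kernel"]     # hypothetical worker call path
stitched = stitch_call_path(master, worker, "compute")
print(" -> ".join(stitched))
```

After stitching, the worker's samples appear under main -> run_solver -> compute in a flame graph instead of forming a separate root.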
To launch the AMDuProf GUI, go to Home > Welcome page.
Click Profile an Application on the Welcome page.
Provide the application path, application options, working directory, and environment variables, if any. Click Next.
From Predefined Configs, select Hotspots.
Set the timer interval and profiling signal.
From Advanced Options, select the OpenMP implementation type, callstack collection, and callstack unwind depth.
Click Start Profile to start the profiling.
Hotspots collection
AMDuProfCLI collect --config hotspots -o <output-dir> <application>
Hotspots with callstack collection
AMDuProfCLI collect --config hotspots -g -o <output-dir> <application>
Hotspots with timer interval of 100 msec
AMDuProfCLI collect --config hotspots --timer-interval 100 -o <output-dir> <application>
Hotspots with profiling signal as SIGRTMIN
AMDuProfCLI collect --config hotspots --profiling-signal 34 -o <output-dir> <application>
Hotspots with callstack collection for GCC compiled OpenMP application
AMDuProfCLI collect --config hotspots -g --openmp-impl omplib -o <output-dir> <application>
Hotspots with callstack collection and unwind depth as 128
AMDuProfCLI collect --config hotspots -g --call-graph-depth 128 -o <output-dir> <application>
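The SIGRTMIN example above passes the signal as the number 34, which is the usual value of SIGRTMIN on Linux systems that use glibc. Because the number can differ between systems, it may be worth checking it before choosing a profiling signal. The helper below is a hypothetical convenience that uses Python's standard signal module on Linux; it is not part of AMDuProf.

```python
import signal


def realtime_signal_numbers():
    """Return the real-time signal numbers available on this system."""
    return list(range(int(signal.SIGRTMIN), int(signal.SIGRTMAX) + 1))


candidates = realtime_signal_numbers()
print(f"SIGRTMIN = {candidates[0]}, SIGRTMAX = {candidates[-1]}")
```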
Once profile data collection completes, a session directory is generated. Use the session directory to generate the .csv report or to import the session in the GUI. For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.
Example
./AMDuProfCLI collect --config hotspots -g -o /tmp/ /tmp/ScimarkStable
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-Hotspots_Sep-05-2024_21-43-08.
Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-Hotspots_Sep-05-2024_21-43-08.
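When scripting a collect run followed by a report run, the session directory can be picked out of the collect output. The sketch below assumes the output contains a "Generated data files path:" line formatted as in the example above; the helper name is hypothetical, and the format may differ between releases.

```python
import re


def session_directory(collect_output):
    """Extract the session directory from AMDuProfCLI collect output."""
    match = re.search(r"Generated data files path:\s*(\S+)", collect_output)
    if match is None:
        return None
    # The example output ends the path with a period; strip it.
    return match.group(1).rstrip(".")


output = """Profiling started
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-Hotspots_Sep-05-2024_21-43-08.
"""
path = session_directory(output)
print(path)
```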
Hotspot Analysis Options
| CLI Option | GUI Option | Description |
|---|---|---|
| | Predefined Configs > Hotspots > Advanced Options > Call Stack Unwind Depth | Provide the depth of stack frames to be collected. By default, 32 frames will be collected; if the application has a greater number of frames in a calling sequence, increase the unwind depth up to 1024. |
| | Predefined Configs > Hotspots > Advanced Options > Select OpenMP Implementation | Provide the OpenMP implementation type to stitch the call path of worker threads with master thread. It is valid only in Linux for launch application. |
| | Predefined Configs > Hotspots > Profiling Signal | If application has signal handler for SIGPROF, then use this option to provide unused signal from It is valid only in Linux for launch application. |
| | Predefined Configs > Hotspots > Timer Interval | Provide per thread OS timer interval in msec. Default timer interval is 10 ms. |
If data is collected using CLI, use Import Session to import the session into GUI to analyze data in GUI.
Use the following CLI report command to generate the profile report in .csv format by passing the session directory path generated in collection.
AMDuProfCLI report -i <session directory>
See table AMDuProfCLI Report Command Options for a list of all the supported options.
Example
./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-Hotspots_Sep-05-2024_21-43-08
Translation started
…
Report generation started
…
Report generation completed...
Generated report file: /tmp/AMDuProf-ScimarkStable-Hotspots_Sep-05-2024_21-43-08/report.csv
Use the Thread Concurrency Graph to analyze how efficiently the application utilizes the processor cores, that is, how much time a given number of threads spends running on a given number of cores.
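One way to picture thread concurrency data is as a histogram of how long exactly N threads were running at the same time. The sketch below derives such a histogram from per-thread running intervals. The input format and function name are hypothetical, and the sketch ignores the per-core dimension for brevity.

```python
def concurrency_histogram(intervals):
    """Map 'number of running threads' to the total time at that level.

    intervals: (start, end) pairs, one per period in which a thread ran.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a thread starts running
        events.append((end, -1))    # a thread stops running
    events.sort()

    histogram = {}
    running = 0
    previous = None
    for timestamp, delta in events:
        if previous is not None and timestamp > previous:
            histogram[running] = histogram.get(running, 0) + (timestamp - previous)
        running += delta
        previous = timestamp
    return histogram


# Hypothetical running intervals of three threads (time units are arbitrary)
running_intervals = [(0, 4), (1, 3), (2, 6)]
histogram = concurrency_histogram(running_intervals)
print(histogram)
```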
Use Function HotSpots to list the functions that consume the most CPU time (self time and children time). Expand a function to see its processes, and expand further to see its threads. All functions are sorted in descending order of CPU time.
Select a function to get all the call paths to this function and total CPU time consumed in every call path.
Double-click a function to analyze its instruction-level sample attribution in the Source View. See Source and Assembly.
Figure 7.26 Function Hotspots#
Use Flame Graph to identify the hottest code paths of an application. The width of each function indicates the CPU time of the function and its callees as a percentage of the total CPU time of the selected process and thread.
Figure 7.27 Flame Graph#
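The width rule can be sketched by counting, for every call path prefix, the samples that pass through it. The sketch assumes each sample carries one call stack listed root first and uses sample counts as a stand-in for CPU time; the data and names are hypothetical.

```python
from collections import Counter


def frame_widths(stacks):
    """Return each call path's share (in percent) of all samples.

    stacks: list of call stacks, root first, one per sample.
    """
    inclusive = Counter()
    for stack in stacks:
        # A sample counts for every prefix of its call path, so a frame's
        # count includes the samples of all its callees.
        for depth in range(1, len(stack) + 1):
            inclusive[tuple(stack[:depth])] += 1
    total = len(stacks)
    return {path: 100.0 * count / total for path, count in inclusive.items()}


stacks = [
    ["main", "solve", "dot"],
    ["main", "solve", "dot"],
    ["main", "solve"],
    ["main", "report"],
]
widths = frame_widths(stacks)
for path, width in widths.items():
    print(" -> ".join(path), f"{width:.0f}%")
```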
Use Top-down Callstack to analyze issues with the call-sequence flow of the application and the total CPU time spent in each function and its callees.
Figure 7.28 Top Down Call Stack#
In Top-Down Callstack or Flame Graph, if the ROOT node branches out to more than two nodes, increase the callstack unwind depth with --call-graph-depth. The maximum supported depth is 1024.
If the parallel-region callstack of the worker threads is not stitched with the master thread and the application creates more than 1000000 parallel regions, increase this limit by setting the environment variable AMDUPROF_MAX_PR_INSTANCES.
$ export AMDUPROF_MAX_PR_INSTANCES=2000000
MPI tracing and OpenMP tracing are not supported with Hotspots analysis.
In a cluster environment, if two or more threads or processes have the same thread ID or process ID, the behavior is undefined.
If the leaf function is highly optimized assembly code, then AMDuProf fails to find its callstack.
This profile type cannot be combined with other pre-defined configurations.
Analyzing 32-bit applications is not supported.
The Hotspots config does not report samples collected before the exec call of an application or samples belonging to a vforked process.
Linux
To report the proper callstack for system-wide profiling, a timer interval of less than 10 msec, attach-to-process or thread profiling, or a static executable (application built with -static) launched from uProf, make sure the application and its dependent libraries are compiled with frame pointers.
Callstack stitching is not supported for system-wide profiling, attach-to-process or thread profiling, or a launched application that is a static executable (built with -static).
If the launched application uses the clone() system call directly to create a thread or process, the behavior is undefined.
The behavior is undefined if --no-inherit (do not profile the child processes) is used with Hotspots analysis.
Windows
To report the proper callstack, compile the application and its dependent libraries with frame pointers.
Use Threading Analysis to identify how efficiently an application uses the processor cores, contention among the application threads due to synchronization, the CPU utilization of the threads, and the wait time of the application threads.
Threading Analysis uses the user mode sampling and tracing approach. It is supported only on Linux; if the application uses libc and libpthread, these libraries must be linked dynamically.
Reference
User Mode Tracing embeds an agent library into application address space using LD_PRELOAD. It interposes the pthread APIs and a few system calls, collects the start time and end time of an API, and a few other metrics.
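The interposition idea, wrapping an API so that each call records its start and end time, can be sketched in Python with a decorator. This is only an analogy for what a preloaded native library does with pthread functions; the names traced, trace_records, and wait_for_worker are hypothetical.

```python
import functools
import time

trace_records = []  # (api_name, start_ns, end_ns)


def traced(function):
    """Wrap a callable so that each call records its start and end time."""
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        start = time.monotonic_ns()
        try:
            return function(*args, **kwargs)
        finally:
            # Record the call even if the wrapped function raises.
            trace_records.append((function.__name__, start, time.monotonic_ns()))
    return wrapper


@traced
def wait_for_worker():
    time.sleep(0.01)  # stands in for a blocking call such as pthread_join


wait_for_worker()
name, start_ns, end_ns = trace_records[0]
print(f"{name} took {end_ns - start_ns} ns")
```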
Collecting system-wide performance data or performance data of an already running process or thread is not supported with user mode sampling and tracing.
To launch the AMDuProf GUI, go to Home > Welcome page.
Click Profile an Application on the Welcome page.
Provide application path, application options, working directory, and environment variables, if any. Click Next.
From Predefined Configs, select Threading Analysis.
Set the Timer Interval and Profiling Signal.
From Advanced Options, select the OpenMP implementation type, callstack collection, and callstack unwind depth.
Click Start Profile to start the profiling.
Threading collection
AMDuProfCLI collect --config threading -o <output-dir> <application>
Threading with callstack collection
AMDuProfCLI collect --config threading -g -o <output-dir> <application>
Threading with timer interval of 100 msec
AMDuProfCLI collect --config threading --timer-interval 100 -o <output-dir> <application>
Threading with profiling signal as SIGRTMIN
AMDuProfCLI collect --config threading --profiling-signal 34 -o <output-dir> <application>
Threading with callstack collection for GCC compiled OpenMP application
AMDuProfCLI collect --config threading -g --openmp-impl omplib -o <output-dir> <application>
Threading with callstack collection and unwind depth as 128
AMDuProfCLI collect --config threading -g --call-graph-depth 128 -o <output-dir> <application>
Threading with pthread APIs threshold of 1000000 nsec
AMDuProfCLI collect --config threading --osrt-threshold pthread:1000000 -o <output-dir> <application>
Threading with callstack collection for system modules (from the standard C++ library down to its implementation). By default, AMDuProf shows the callstack only up to the standard library.
AMDuProfCLI collect --config threading -g --collect-sys-modules -o <output-dir> <application>
Once profile data collection completes, a session directory is generated. Use the session directory to generate the .csv report or to import the session in the GUI.
For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.
Example
./AMDuProfCLI collect --config threading -g -o /tmp/ /tmp/ScimarkStable
Profiling started
…
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-Threading_Sep-05-2024_21-48-42
Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-Threading_Sep-05-2024_21-48-42.
System Call Tracing
By default, Threading Analysis traces sleep and wait system calls. Configure system call tracing with Threading Analysis to trace all supported system calls, including IO system calls and blocking system calls.
Use the following CLI command to collect the system call trace data with Threading Analysis:
AMDuProfCLI collect --config threading --trace osrt --osrt-event syscall -o <output-dir> <application>
Threading Analysis options
| CLI Option | GUI Option | Description |
|---|---|---|
| | Predefined Configs > Threading Analysis > Advanced Options > Call Stack Unwind Depth | Provide the depth of stack frames to be collected. By default, 32 frames will be collected; if the application has a greater number of frames in a calling sequence, increase the unwind depth up to 1024. |
| | Predefined Configs > Threading Analysis > Advanced Options > Select OpenMP Implementation | Provide the OpenMP implementation type to stitch the call path of worker threads with master thread. ompt for tracing of OpenMP libraries supporting OMPT interface (example: LLVM, AOCC), |
| | Predefined Configs > Threading Analysis | Provide event name and threshold value. Note: Use this option with |
| | Predefined Configs > Threading Analysis > Profiling Signal | If application has signal handler for SIGPROF, then use this option to provide unused signal from |
| | Predefined Configs > Threading Analysis > Timer Interval | Provide per thread OS timer interval in msec. Default timer interval is 10 ms. |
| | Predefined Configs > Threading Analysis > Collect System Module Function(s) | By default, threading config doesn’t collect the callstack beyond the first standard library function called by application. Use this option to disable it and collect the complete callstack which includes the standard library functions. |
If data is collected using CLI, use Import Session to import the session into GUI to analyze data in GUI.
Use the following CLI report command to generate the profile report in .csv format by passing the session directory path generated in collection.
AMDuProfCLI report -i <session directory>
For a list of all the supported options, refer to AMDuProfCLI Report Command Options.
Example
./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-Threading_Sep-05-2024_21-48-42
Translation started
…
Report generation started
…
Report generation completed...
Generated report file: /tmp/AMDuProf-ScimarkStable-Threading_Sep-05-2024_21-48-42/report.csv
Reference
Threading Summary provides a high-level performance snapshot of an application with respect to different timing details.
Elapsed Time is the wall clock time from the start to the end of application execution.
Total Time is the total time of all the threads created by an application.
CPU Time is the total active processor time of all threads, including the time spent in IO operations and spin lock operations. If you increase the number of threads to optimize the code, the elapsed time may decrease while the CPU time increases.
IO Time is the total time spent by all threads in IO system calls, except IO sync calls.
Spin Time is the total time spent by all threads in spin locks.
When an application calls IO system calls and blocking system calls (except IO sync calls), there is no guarantee that the application will actually block, so these times are added to the total CPU time.
Sleep Time is the total time spent by all threads in sleep system calls and in system calls that wait for signal delivery.
Event Wait Time is the total time spent by all threads in IO multiplexing (poll, select, and so on) and in waiting for a process or thread to finish (wait, pthread_join, and so on).
Resource Wait Time is the total time spent by all threads on synchronization objects, reader and writer locks, thread barriers, and so on.
IO Sync Time is the total time spent by all threads in IO sync APIs (sync, fsync, and so on).
Pause Time is the total time spent by all threads in the profile paused state. You can pause profile collection with the profile control APIs or the GUI pause option after starting the profile collection.
Refer to Posix Thread APIs and libc System Call Wrapper APIs for the APIs traced with Threading Analysis.
Figure 7.29 GUI Threading Summary#
A high wait time means that the application suffers from poor parallel performance. Use Thread Summary to analyze the total run time, the wait time, and the wait time as a percentage of the total time for each thread. When a thread is optimized, its wait time and wait time percentage may decrease compared with the values before optimization. This helps to identify whether a thread is using the core effectively.
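The wait time percentage described above is the ratio of a thread's wait time to its total time. A minimal sketch follows; the thread names and timings are hypothetical and do not come from a real session.

```python
def wait_percentage(wait_time, total_time):
    """Wait time as a percentage of the thread's total time."""
    if total_time <= 0:
        return 0.0
    return 100.0 * wait_time / total_time


# Hypothetical per-thread data: (wait time, total time) in seconds
threads = {"worker-1": (2.0, 8.0), "worker-2": (6.0, 8.0)}
report = {name: wait_percentage(wait, total) for name, (wait, total) in threads.items()}
print(report)
```

In this made-up data, worker-2 spends most of its lifetime waiting and would be the first candidate for a closer look.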
Use Wait Object Summary to identify performance-critical synchronization objects that have a high wait time and wait count.
Syscall Summary provides the call count and the total time spent by the application for each system call. It helps to identify the system calls that consume the most time and that can be optimized if they are blocking or waiting in nature.
Figure 7.30 GUI Wait Time Analysis Summary#
By default, Wait Object Hotspots ranks functions or processes according to their total wait time, which is the time spent blocked on locks, events, semaphores, I/O, and similar objects. You can also sort the results based on the wait count or the percentage of wait time relative to the total execution time. To focus your analysis on a specific period, you can select a time range from the top timeline. For any identified hotspot, you can expand the entry to view the associated wait object(s), the callsite, the full call stack, and—if debug information is available—the corresponding source file and line number.
Four groupings are available. Select the one that matches your requirements:
Process → Function → Wait Object → Callsite → Callstack
Wait Object → Process → Thread → Function → Callstack
Task → Wait Object → Process → Thread → Function → Callstack
Process → Wait Object → Callsite → Callstack
Figure 7.31 GUI Wait Time Analysis#
Figure 7.32 CLI Threading Summary#
Per Thread Timeline displays the selected thread’s state, CPU utilization, context switch count, running callstacks, and other metrics over time. When you select a time region, the view reveals the thread’s activity and presents an aggregated flame graph that summarizes function calls within that interval.
Figure 7.33 Per Thread Timeline#
In Top-Down Callstack or Flame Graph, if the ROOT node branches out to more than two nodes, increase the callstack unwind depth with --call-graph-depth. The maximum supported depth is 1024.
If the parallel-region callstack of the worker threads is not stitched with the master thread and the application creates more than 1000000 parallel regions, increase this limit by setting the environment variable AMDUPROF_MAX_PR_INSTANCES.
$ export AMDUPROF_MAX_PR_INSTANCES=2000000
MPI tracing and OpenMP tracing are not supported with Threading config.
System-wide profiling data collection and data collection for an already running process or thread are not supported.
Threading Analysis is not supported if the application is a static executable (built with -static), because LD_PRELOAD is ignored.
If the application uses the clone system call directly to create a thread or process, the behavior is undefined.
In a cluster environment, if two or more threads or processes have the same thread ID or process ID, the behavior is undefined.
If the leaf function is highly optimized assembly code, AMDuProf fails to find its callstack.
Threading Analysis is not supported with AMD Zen1 and AMD Zen2 architectures.
Analyzing 32-bit applications is not supported.
The behavior is undefined if --no-inherit (do not profile the child processes) is used with Threading Analysis.
Resource Wait Time APIs
pthread_mutex_lock
pthread_mutex_trylock
pthread_mutex_timedlock
pthread_barrier_wait
pthread_cond_wait
pthread_cond_timedwait
pthread_rwlock_rdlock
pthread_rwlock_tryrdlock
pthread_rwlock_timedrdlock
pthread_rwlock_wrlock
pthread_rwlock_trywrlock
pthread_rwlock_timedwrlock
sem_wait
sem_trywait
sem_timedwait
Event Wait Time APIs
pthread_join
Spin Time APIs
pthread_spin_lock
pthread_spin_trylock
Other APIs
pthread_create
pthread_exit
pthread_cancel
List of libc APIs traced with threading config.
Sleep APIs
sleep
nanosleep
clock_nanosleep
usleep
pause
sigsuspend
sigwait
sigwaitinfo
sigtimedwait
Event Wait Time APIs
poll
ppoll
select
pselect
epoll_wait
epoll_pwait
wait
waitpid
waitid
wait3
wait4
Resource Wait Time APIs
futex
flock
lock
IO Sync Time APIs
fsync
sync
syncfs
fdatasync
sync_file_range
IO Time APIs
creat
open
openat
read
pread
readv
preadv
preadv2
write
pwrite
writev
pwritev
pwritev2
lseek
sendfile
copy_file_range
truncate
ftruncate
readahead
close
Other APIs
accept
accept4
recv
recvfrom
recvmsg
recvmmsg
send
sendto
sendmsg
sendmmsg
mq_send
mq_timedsend
mq_receive
mq_timedreceive
msgsnd
msgrcv
semget
semop
semtimedop
semctl
splice
vmsplice
msync
fcntl
ioctl
epoll_create
epoll_create1
epoll_ctl
socket
bind
listen
connect
socketpair
mq_notify
mq_getattr
mq_setattr
mq_close
mq_unlink
msgget
msgctl
pipe
pipe2
shmat
shmctl
shmget
shmdt
fork
vfork
alarm
system
kill
killpg
brk
sbrk
mlock
munlock
mlock2
mlockall
munlockall
mmap
munmap
move_pages
mprotect
mremap
process_vm_readv
process_vm_writev
acct
chroot
dup
dup2
dup3
fallocate
ioperm
iopl
mount
prctl
ptrace
sigaction
swapon
swapoff
tee
umount
umount2
unshare
vhangup
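The API lists above can be read as a lookup from a traced API name to the time category it contributes to. The sketch below encodes a small subset of the listed APIs and sums hypothetical call durations per category. The data structure, function names, and trace values are illustrative assumptions, not the tool's internal representation.

```python
# Subset of the APIs listed above, keyed by the time category they feed.
TIME_CATEGORIES = {
    "Resource Wait Time": {"pthread_mutex_lock", "pthread_cond_wait", "sem_wait", "futex", "flock"},
    "Event Wait Time": {"pthread_join", "poll", "select", "epoll_wait", "waitpid"},
    "Spin Time": {"pthread_spin_lock", "pthread_spin_trylock"},
    "Sleep Time": {"sleep", "nanosleep", "usleep", "pause", "sigwait"},
    "IO Sync Time": {"fsync", "sync", "syncfs", "fdatasync", "sync_file_range"},
    "IO Time": {"open", "read", "write", "lseek", "close"},
}


def time_category(api_name):
    """Return the time category for a traced API, or 'Other'."""
    for category, names in TIME_CATEGORIES.items():
        if api_name in names:
            return category
    return "Other"


def summarize(trace):
    """Sum call durations per time category."""
    totals = {}
    for api_name, duration_ns in trace:
        category = time_category(api_name)
        totals[category] = totals.get(category, 0) + duration_ns
    return totals


# Hypothetical trace: (API name, duration in nanoseconds)
trace = [("pthread_mutex_lock", 500), ("fsync", 300), ("read", 200),
         ("pthread_mutex_lock", 100), ("mmap", 50)]
totals = summarize(trace)
print(totals)
```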
Use Overview Analysis to get a high-level performance snapshot of an application and to identify the hottest functions with their inclusive and exclusive elapsed times, the CPU utilization of the threads, and the wait time of the application threads.
By default, Overview Analysis traces the functions defined in the application whose size is greater than or equal to 128 bytes. It collects the start time and end time of each function and the time spent in its callees, and stores the data in a raw file for further processing.
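Inclusive and exclusive times can be derived from function start and end times as sketched below. The sketch assumes a single thread and a time-ordered list of enter and exit events, which is a hypothetical format rather than the raw file layout. Recursive calls would be counted once per invocation in the inclusive total.

```python
def function_times(events):
    """Compute inclusive and exclusive time per function.

    events: ("enter" | "exit", function_name, timestamp) in time order.
    """
    inclusive, exclusive = {}, {}
    stack = []  # entries: [name, start, time_spent_in_callees]
    for kind, name, timestamp in events:
        if kind == "enter":
            stack.append([name, timestamp, 0])
            continue
        entered_name, start, callee_time = stack.pop()
        elapsed = timestamp - start
        inclusive[entered_name] = inclusive.get(entered_name, 0) + elapsed
        exclusive[entered_name] = exclusive.get(entered_name, 0) + elapsed - callee_time
        if stack:
            # Charge this call's elapsed time to the caller's callee time.
            stack[-1][2] += elapsed
    return inclusive, exclusive


# Hypothetical trace: main runs for 100 units and calls solve for 60 of them.
events = [
    ("enter", "main", 0),
    ("enter", "solve", 10),
    ("exit", "solve", 70),
    ("exit", "main", 100),
]
inclusive, exclusive = function_times(events)
print(inclusive, exclusive)
```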
Overview Analysis also traces GPU offloading, including kernel launch, kernel execution, and data transfer, for GPU-intensive applications.
Overview Analysis uses the user mode sampling and tracing approach. It is supported only on Linux; if the application uses libc and libpthread, these libraries must be linked dynamically.
Reference
The prerequisites are as follows:
Linux kernel version 4.15 or later.
To profile an application with Overview Analysis, run the script AMDuProfSetup.sh from the AMDuProf installation directory with root access.
$ sudo ./AMDuProfSetup.sh
Note
When you install AMD uProf using the DEB package, the installer automatically executes the script.
If the system has AMD Instinct accelerators, install AMD ROCm™ v7.1.0 on the host system.
After installing AMD ROCm™ 7.1.0, make sure the symbolic link /opt/rocm/ points to /opt/rocm-7.1.0/.
$ ln -s /opt/rocm-7.1.0/ /opt/rocm/
Supported accelerators: AMD Instinct™ MI200 and MI300A.
To launch the AMDuProf GUI, go to Home > Welcome page.
Click Profile an Application on the Welcome page.
Provide application path, application options, working directory, and environment variables, if any. Click Next.
From Predefined Configs, select Overview.
Set the Timer Interval and Profiling Signal.
Click Start Profile to start the profiling.
Overview collection
AMDuProfCLI collect --config overview -o <output-dir> <application>
Overview with timer interval of 100 msec
AMDuProfCLI collect --config overview --timer-interval 100 -o <output-dir> <application>
Overview with profiling signal as SIGRTMIN
AMDuProfCLI collect --config overview --profiling-signal 34 -o <output-dir> <application>
Overview with function tracing threshold of 1000000 nsec
AMDuProfCLI collect --config overview --func-threshold 1000000 -o <output-dir> <application>
Overview to trace the functions of size 32 bytes
AMDuProfCLI collect --config overview --func-size 32 -o <output-dir> <application>
Once profile data collection completes, a session directory is generated. Use the session directory to generate the .csv report or to import the session in the GUI.
For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.
Example
./AMDuProfCLI collect --config overview -o /tmp/ /tmp/ScimarkStable
Profiling started
…
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-Overview_Sep-05-2024_21-59-08
Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-Overview_Sep-05-2024_21-59-08.
Overview Analysis options
| CLI Option | GUI Option | Description |
|---|---|---|
| | NA | By default, overview analysis traces the functions of size more than or equals to 128 bytes, if you want to trace functions of custom size, use this option to set the function size. |
| | Predefined Configs > Overview > Profiling Signal | If application has signal handler for SIGPROF, then use this option to provide unused signal from |
| | Predefined Configs > Overview > Timer Interval | Provide per thread OS timer interval in msec. Default timer interval is 10 ms. |
If data is collected using CLI, use Import Session to import the session into GUI to analyze data in GUI.
Use the following CLI report command to generate the profile report in .csv format by passing the session directory path generated in collection.
AMDuProfCLI report -i <session directory>
For a list of all the supported options, refer to AMDuProfCLI Report Command Options.
Example
./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-Overview_Sep-05-2024_21-59-08
Translation started
…
Report generation started
…
Report generation completed...
Generated report file: /tmp/AMDuProf-ScimarkStable-Overview_Sep-05-2024_21-59-08/report.csv
Reference
Use GPU Kernel Summary (Analyze the Data) to identify the hottest kernels launched on the GPU, along with their launch count and total execution time on the GPU cores.
Use Data Transfer Summary (Analyze the Data) to identify how much time is spent in data transfer between host and device and how many times a data transfer is initiated.
Use Per Thread Timeline to analyze the following:
GPU usage, GPU memory usage, and GPU power of your application over the profile duration.
GPU kernels executed over the profile duration.
Data transfer between host and device over the profile duration, which helps to identify the time spent in data copy.
For each kernel execution and data copy, refer to the HIP trace to check which HIP API causes this activity on the GPU. Refer to the callstack section for the CPU callstack at that time.
From this analysis, you can identify whether the application is CPU bound or GPU bound. If the application is GPU bound, use GPU Profiling for further analysis and optimization of kernel execution.
Figure 7.34 Per Thread Timeline#
Use the Function Count Summary to identify how many times a function is executed and its inclusive and exclusive times.
Figure 7.35 Function Count Summary#
Use Per Thread Timeline to analyze the following on the CPU:
CPU utilization, context switch count, memory consumption, and thread state over the profile duration.
To optimize the code, identify the sections where the thread state is waiting, and check the CPU utilization and callstack at that time (that is, which call path caused the thread to wait).
Select a region where the CPU utilization is low; the timeline lists the functions executed within that region and their CPU time.
Here, callstack data is collected using function tracing, which is more accurate than the sampling data reported by Hotspots and Threading Analysis. Refer to Function Tracing to trace custom functions with Overview Analysis.
Figure 7.36 Per Thread Timeline#
MPI tracing and OpenMP tracing are not supported with Overview Analysis.
System-wide profiling data collection and data collection for an already running process or thread are not supported.
Overview Analysis is not supported if the application is a static executable (built with -static), because LD_PRELOAD is ignored.
Callstack data collection is not supported with Overview Analysis.
If the application uses the clone system call directly to create a thread or process, the behavior is undefined.
In a cluster environment, if two or more threads or processes have the same thread ID or process ID, the behavior is undefined.
Overview Analysis is not supported with AMD Zen1 and AMD Zen2 architectures.
Profile Control APIs are not supported with Overview Analysis.
Function tracing will be skipped for non-ELF executables in Overview Analysis.
Analyzing 32-bit applications is not supported.
The behavior is undefined if --no-inherit (do not profile the child processes) is used with Overview Analysis.
In this analysis, the profile data is periodically collected based on the specified OS timer interval. It is used to identify the hotspots of the profiled applications that are consuming the most time. These hotspots are good candidates for further investigation and optimization.
To configure and start a profile:
Click PROFILE > Start Profiling to navigate to the Select Profile Target screen.
Select the required profile target and click Next. The Select Profiling screen is displayed.
From the Select Profiling screen, select the Predefined Configs tab.
Figure 7.37 Time-Based Profile – Configure#
Select Time-based Sampling in the left vertical pane.
Click Advanced Options to enable call-stack, set symbol paths (if the debug files are in different locations) and other options. See Advanced Options for more information on this screen.
Set all options and click Start Profile to begin profiling.
After the profile initialization, the profile data collection screen is displayed.
Complete the following steps to analyze the profile data:
When the profiling stops, the collected raw profile data will be processed automatically and the Hot Spots screen of the Summary page is displayed. The hotspots are shown for the Timer samples. See Overview of Performance Hotspots for more information.
Click ANALYZE on the top horizontal navigation bar to go to the Function HotSpots screen. See Function HotSpots for more information on this screen.
Click ANALYZE > Metrics to display the profile data table at various granularities: Process, Load Modules, Threads, and Functions. Refer to the section Process and Functions for more information on this screen.
Double-click any entry in the Functions table on the Metrics screen to load the source tab for that function in the SOURCES page. Refer to the section Source and Assembly for more information on this screen.
Micro Architecture Analysis profiling follows a statistical sampling-based approach to collect profile data to identify the performance bottlenecks in the application. Use this analysis type to understand the micro architectural bottlenecks in the application runtime.
AMD uProf CPU profiler follows a statistical sampling-based approach to collect profile data to identify the performance bottlenecks in the application. A few high-level features to understand the CPU profiler capabilities are listed here:
Profile data is collected using one of the following approaches:
Event Based Profiling (EBP): sampling based on Core PMC events to identify micro-architecture related performance issues in the profiled applications.
Instruction Based Sampling (IBS): precise instruction-based sampling.
Call-stack Sampling
Secondary Profile Data
Thread concurrency (Windows only, requires admin privilege)
Thread names (Windows and Linux only)
Profile Scope
Launch App— launch an application and profile that process and its children.
System-wide — profile all the running processes and/or kernel.
Attach Process — Attach to an existing application (Native applications only)
Profile mode
User/Kernel — profile data is collected when the application is running in User and/or Kernel mode.
Supported Languages:
C, C++
Java
.NET (5.0, 6.0, and Framework)
FORTRAN
Assembly applications
Supported Software Components
User-space applications
Dynamically linked/loaded modules
Drivers
OS kernel modules
Profile data is attributed at various granularities
Process, Thread, Load Module, Function, Source line, or Disassembly
C++ and Java in-lined functions
Note
uProf requires debug information from the compiler for correlating the profile data to functions and source lines.
Data and Report Files
Collected profile data is initially stored in raw data files.
Processed profile data is stored to database files used for generating the CLI report or GUI visualization.
Profile report is saved to a comma-separated-value (CSV) format file that can be viewed using any spreadsheet viewer.
AMDuProfCLI, the command-line interface, can be used to configure a profile run, collect the profile data, and generate the profile report.
Collect command to configure and collect the profile data.
Report command to process the profile data and to generate the profile report.
Profile command to collect the performance profile data, analyze it, and generate the profile report.
AMDuProf GUI can be used to:
Configure a profile run.
Start the profile run to collect the performance data.
Analyze the performance data to identify potential bottlenecks.
AMDuProf GUI has various UI elements to analyze and view the profile data at various granularities:
Hot spots summary
Session Information
Thread concurrency graph (Windows only and requires admin privileges)
Process and function analysis
Source and disassembly analysis
Top-down and bottom-up call path — visualizations to explore the function call flow of an application for analyzing the time spent on functions and its callees.
Flame Graph — callstack visualizer as a flame graph
Call Graph — call stack and caller/callee visualizer in table format
HPC — to analyze OpenMP and MPI profile data
Timeline Visualizer — timeline views for MPI API trace and OS event trace information
Cache Analysis — to analyze the hot cache lines that are false shared
Profile Control API
Selectively enable and disable profiling from the target application by instrumenting it, to limit the scope of the profiling.
The Predefined Sampling Configuration provides a convenient way to select a useful set of sampling events for profile analysis. The following table lists all such configurations:
Profile Type |
Predefined Configuration Name |
Abbreviation |
Description |
|---|---|---|---|
Event-based profile (EBP) |
Assess performance |
assess |
Provides an overall assessment of the performance. |
Event-based profile (EBP) |
Assess performance (Extended) |
assess_ext |
Provides an overall assessment of the performance with additional metrics. |
Event-based profile (EBP) |
Investigate data access |
data_access |
To find data access operations with poor L1 data cache locality and poor DTLB behavior. |
Event-based profile (EBP) |
Investigate instruction access |
inst_access |
To find instruction fetches with poor L1 instruction cache locality and poor ITLB behavior. |
Event-based profile (EBP) |
Investigate branching |
branch |
To find poorly predicted branches and near returns. |
Event-based profile (EBP) |
Investigate CPI |
cpi |
To analyze the CPI and IPC metrics of the running application or the entire system. |
IBS |
Instruction based sampling |
ibs |
To collect the sample data using IBS Fetch and IBS OP. Precise sample attribution to instructions. |
IBS |
Cache Analysis |
memory |
To identify the false cache-line sharing issues. The profile data will be collected using IBS OP |
Listed here are some of the Core Performance events of AMD Zen processors.
Event Id, Unit-mask |
Event Abbreviation |
Name and Description |
|---|---|---|
PMCx076,0x00 |
|
CPU clock cycles not halted The number of CPU cycles when the thread is not in halt state. |
PMCx0C0, 0x00 |
|
Retired Instructions The number of instructions retired from execution. This count includes exceptions and interrupts. Each exception or interrupt is counted as one instruction. |
PMCx0C1, 0x00 |
|
Retired Macro Operations The number of macro-ops retired. This count includes all processor activity - instructions, exceptions, interrupts, microcode assists, and so on. |
PMCx0C2, 0x00 |
|
Retired Branch Instructions The number of branch instructions retired. This includes all types of architectural control flow changes, including exceptions and interrupts |
PMCx0C3, 0x00 |
|
Retired Branch Instructions Mispredicted The number of retired branch instructions that were mis-predicted. Note Only EX direct mis-predicts and indirect target mis-predicts are counted. |
PMCx003,0x08 |
|
Retired SSE/AVX Flops The number of retired SSE/AVX flops. The number of events logged per cycle can vary from 0 to 64. This is a large increment per cycle event as it can count more than 15 events per cycle. This counts both single precision and double precision FP events. |
PMCx029,0x07 |
|
All Data cache accesses The number of load and store ops dispatched to LS unit. This counts the dispatch of single op that performs a memory load, dispatch of single op that performs a memory store, dispatch of a single op that performs a load from and store to the same memory address. |
PMCx060,0x10 |
|
L2 cache access from L1 IC miss The L2 cache access requests due to L1 instruction cache misses. |
PMCx060,0xC8 |
|
L2 cache access from L1 DC miss The L2 cache access requests due to L1 data cache misses. This also counts hardware and software prefetches. |
PMCx064,0x01 |
|
L2 cache miss from L1 IC miss Counts all the Instruction cache fill requests that misses in L2 cache |
PMCx064,0x08 |
|
L2 cache miss from L1 DC miss Counts all the Data cache fill requests that misses in L2 cache |
PMCx071,0x1F |
|
L2 Prefetcher Hits in L3 Counts all L2 prefetches accepted by the L2 pipeline which miss the L2 cache and hit the L3. |
PMCx072,0x1F |
|
L2 Prefetcher Misses in L3 Counts all L2 prefetches accepted by the L2 pipeline which miss the L2 and the L3 caches |
PMCx064,0x06 |
|
L2 cache hit from L1 IC miss Counts all the Instruction cache fill requests that hits in L2 cache. |
PMCx064,0x70 |
|
L2 cache hit from L1 DC miss Counts all the Data cache fill requests that hits in L2 cache. |
PMCx070,0x1F |
|
L2 cache hit from L2 HW Prefetch Counts all L2 prefetches accepted by L2 pipeline which hit in the L2 cache |
PMCx043,0x01 |
|
L1 demand DC fills from L2 The demand Data Cache (DC) fills from local L2 cache to the core. |
PMCx043,0x02 |
|
L1 demand DC fills from local CCX The demand Data Cache (DC) fills from the cache of the same CCX or the cache of a different CCX in the same package (node). |
PMCx043,0x08 |
|
L1 demand DC fills from local Memory The demand Data Cache (DC) fills from DRAM or IO connected in the same package (node). |
PMCx043,0x10 |
|
L1 demand DC fills from remote cache The demand Data Cache (DC) fills from cache of CCX in the different package (node). |
PMCx043,0x40 |
|
L1 demand DC fills from remote Memory The demand Data Cache (DC) fills from DRAM or IO connected in the different package(node). |
PMCx043,0x5B |
|
L1 demand DC refills from all data sources. The demand Data Cache (DC) fills from all the data sources. |
PMCx060,0xFF |
|
All L2 cache requests. |
PMCx084,0x00 |
|
L1 TLB miss L2 TLB hit The instruction fetches that misses in the L1 Instruction Translation Lookaside Buffer (ITLB) but hit in the L2-ITLB. |
PMCx085,0x07 |
|
L1 TLB miss L2 TLB miss The ITLB reloads originating from page table walker. The table walk requests are made for L1-ITLB miss and L2-ITLB misses. |
PMCx045,0xFF |
|
L1 DTLB miss The L1 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops. This event counts both L2-DTLB hit and L2-DTLB miss. |
PMCx045,0xF0 |
|
L2 DTLB miss The L2 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops. |
PMCx047,0x00 |
|
Misaligned Loads The number of misaligned loads. Note On AMD Zen 3 core processors, this event counts the 64B (cache-line crossing) and 4K (page crossing) misaligned loads. |
PMCx052,0x03 |
|
Ineffective Software Prefetches The number of software prefetches that did not fetch data outside of the processor core. This event counts the Software PREFETCH instruction that saw a match on an already - allocated miss request buffer. Also counts the Software PREFETCH instruction that saw a DC hit. |
Event Id, Unit-mask |
Event Abbreviation |
Name and Description |
|---|---|---|
PMCx076,0x00 |
|
CPU clock cycles not halted The number of CPU cycles when the thread is not in halt state. |
PMCx0C0, 0x00 |
|
Retired Instructions The number of instructions retired from execution. This count includes exceptions and interrupts. Each exception or interrupt is counted as one instruction. |
PMCx0C1, 0x00 |
|
Retired Macro Operations The number of macro-ops retired. This count includes all processor activity - instructions, exceptions, interrupts, microcode assists, and so on. |
PMCx0C2, 0x00 |
|
Retired Branch Instructions The number of branch instructions retired. This includes all types of architectural control flow changes, including exceptions and interrupts |
PMCx0C3, 0x00 |
|
Retired Branch Instructions Mispredicted The number of retired branch instructions that were mis-predicted. Note Only EX direct mis-predicts and indirect target mis-predicts are counted. |
PMCx003,0x08 |
|
Retired SSE/AVX Flops The number of retired SSE/AVX flops. The number of events logged per cycle can vary from 0 to 64. This is a large increment per cycle event as it can count more than 15 events per cycle. This counts both single precision and double precision FP events. |
PMCx029,0x07 |
|
All Data cache accesses The number of load and store ops dispatched to LS unit. This counts the dispatch of single op that performs a memory load, dispatch of single op that performs a memory store, dispatch of a single op that performs a load from and store to the same memory address. |
PMCx060,0x10 |
|
L2 cache access from L1 IC miss The L2 cache access requests due to L1 instruction cache misses. |
PMCx060,0xE8 |
|
L2 cache access from L1 DC miss The L2 cache access requests due to L1 data cache misses. This also counts hardware and software prefetches. |
PMCx064,0x01 |
|
L2 cache miss from L1 IC miss Counts all the Instruction cache fill requests that misses in L2 cache |
PMCx064,0x08 |
|
L2 cache miss from L1 DC miss Counts all the Data cache fill requests that misses in L2 cache |
PMCx071,0xFF |
|
L2 Prefetcher Hits in L3 Counts all L2 prefetches accepted by the L2 pipeline which miss the L2 cache and hit the L3. |
PMCx072,0xFF |
|
L2 Prefetcher Misses in L3 Counts all L2 prefetches accepted by the L2 pipeline which miss the L2 and the L3 caches |
PMCx064,0x06 |
|
L2 cache hit from L1 IC miss Counts all the Instruction cache fill requests that hits in L2 cache. |
PMCx064,0xF0 |
|
L2 cache hit from L1 DC miss Counts all the Data cache fill requests that hits in L2 cache. |
PMCx070,0xFF |
|
L2 cache hit from L2 HW Prefetch Counts all L2 prefetches accepted by L2 pipeline which hit in the L2 cache |
PMCx043,0x01 |
|
L1 demand DC fills from L2 The demand Data Cache (DC) fills from local L2 cache to the core. |
PMCx043,0x02 |
|
L1 demand DC fills from local CCX The demand Data Cache (DC) fills from the cache of the same CCX or the cache of a different CCX in the same package (node). |
PMCx043,0x04 |
|
L1 DC fills from local external CCX caches The DC fills from the cache of different CCX in the same package (node). |
PMCx043,0x08 |
|
L1 demand DC fills from local Memory The demand Data Cache (DC) fills from DRAM or IO connected in the same package (node). |
PMCx043,0x10 |
|
L1 demand DC fills from remote cache The demand Data Cache (DC) fills from cache of CCX in the different package (node). |
PMCx043,0x40 |
|
L1 demand DC fills from remote Memory The demand Data Cache (DC) fills from DRAM or IO connected in the different package(node). |
PMCx043,0x14 |
|
L1 demand DC fills from external caches The demand DC fills from the cache of different CCX in the same or different package (node). |
PMCx043,0xDF |
|
L1 demand DC refills from all data sources. The demand DC fills from all the data sources. |
PMCx044,0x01 |
|
L1DC fills from local L2 The DC fills from the local L2 cache to the core. |
PMCx044,0x02 |
|
L1DC fills from local CCX cache The DC fills from different L2 cache in the same CCX or L3 cache that belongs to the same CCX. |
PMCx044,0x08 |
|
L1 DC fills from local Memory The DC fills from DRAM or IO connected in the same package (node). |
PMCx044,0x04 |
|
L1 DC fills from local external CCX caches The DC fills from the cache of different CCX in the same package (node). |
PMCx044,0x10 |
|
L1 DC fills from remote external CCX caches The DC fills from the CCX cache in the different package (node). |
PMCx044,0x40 |
|
L1 DC fills from remote Memory The DC fills from DRAM or IO connected in the different package (node). |
PMCx044,0x14 |
|
L1 DC fills from local external CCX caches The DC fills from cache of different CCX in the same or different package (node). |
PMCx044,0x48 |
|
L1 DC fills from local Memory The DC fills from DRAM or IO connected in the same or different package (node). |
PMCx044,0x50 |
|
L1 DC fills from remote node The DC fills from the CCX cache in the different package (node) or the DRAM / IO connected in the different package (node). |
PMCx044,0x03 |
|
L1 DC fills from same CCX The DC fills from the local L2 cache to the core or different L2 cache in the same CCX or L3 cache that belongs to the same CCX. |
PMCx044,0xDF |
|
L1 DC fills from all the data sources The DC fills from all the data sources |
PMCx060,0xFF |
|
All L2 cache requests. |
PMCx084,0x00 |
|
L1 TLB miss L2 TLB hit The instruction fetches that misses in the L1 Instruction Translation Lookaside Buffer (ITLB) but hit in the L2-ITLB. |
PMCx085,0x07 |
|
L1 TLB miss L2 TLB miss The ITLB reloads originating from page table walker. The table walk requests are made for L1-ITLB miss and L2-ITLB misses. |
PMCx045,0xFF |
|
L1 DTLB miss The L1 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops. This event counts both L2-DTLB hit and L2-DTLB miss. |
PMCx045,0xF0 |
|
L2 DTLB miss The L2 Data Translation Lookaside Buffer (DTLB) misses from load store micro-ops. |
PMCx078,0xFF |
|
All TLB flushes |
PMCx047,0x03 |
|
Misaligned Loads The number of misaligned loads. Note On AMD Zen 3 core processors, this event counts the 64 B (cache-line crossing) and 4 K (page crossing) misaligned loads. |
PMCx052,0x03 |
|
Ineffective Software Prefetches The number of software prefetches that did not fetch data outside of the processor core. This event counts the Software PREFETCH instruction that saw a match on allocated miss request buffer. Also counts the Software PREFETCH instruction that saw a DC hit. |
PMCx18E,0x1F |
|
IC Tag All Instruction Cache Access |
PMCx18E,0x18 |
|
IC Tag Instruction Cache Miss |
PMCx28F, 0x07 |
|
All OP Cache Accesses |
PMCx28F, 0x04 |
|
Op Cache Miss |
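The event tables above identify each event by an event ID and a unit mask, for example `PMCx043,0x14`. The following sketch shows one way to parse such an entry and to split a combined unit mask into its single-bit sources. The helper names are hypothetical, and the reading that combined masks are bitwise ORs of the single-bit masks is inferred from the PMCx044 rows of the table rather than stated by the documentation.

```python
import re

# Matches entries such as "PMCx0C0, 0x00" or "PMCx18E,0x1F".
EVENT_RE = re.compile(r"^PMCx([0-9A-Fa-f]+)\s*,\s*0x([0-9A-Fa-f]+)$")

# Single-bit unit masks for PMCx044, copied from the table above.
PMCX044_SOURCES = {
    0x01: "local L2",
    0x02: "local CCX cache",
    0x04: "local external CCX caches",
    0x08: "local memory",
    0x10: "remote external CCX caches",
    0x40: "remote memory",
}


def parse_pmc_event(text):
    """Return (event_id, unit_mask) as integers for a table entry."""
    match = EVENT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized event specification: {text!r}")
    return int(match.group(1), 16), int(match.group(2), 16)


def decode_unit_mask(mask, bit_names):
    """Split a unit mask into named bits plus any bits not listed."""
    names = [name for bit, name in sorted(bit_names.items()) if mask & bit]
    known = 0
    for bit in bit_names:
        known |= bit
    return names, mask & ~known


event_id, mask = parse_pmc_event("PMCx044,0x14")
names, unlisted = decode_unit_mask(mask, PMCX044_SOURCES)
print(hex(event_id), hex(mask), names, hex(unlisted))
```

Decoding `0xDF` with the same helper reports bit `0x80` as unlisted, because the table does not describe that bit on its own.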
Core CPU Metrics
CPU Metric |
Description |
|---|---|
|
Core Effective Frequency (without halted cycles) over the sampling period, reported in GHz. The metric is based on APERF and MPERF MSRs. MPERF is incremented by the core at the P0 state frequency while the core is in C0 state. APERF is incremented in proportion to the actual number of core cycles while the core is in C0 state. |
|
Cycles Per Instruction Retired (CPI) is the multiplicative inverse of IPC metric. This is one of the basic performance metrics indicating how cache misses, branch mis-predictions, memory latencies, and other bottlenecks are affecting the execution of an application. Lower CPI value is better. |
|
Instructions Retired Per Cycle (IPC) is the average number of instructions retired per cycle. This is measured using Core PMC events PMCx0C0 [Retired Instructions] and PMCx076 [CPU Clocks not Halted]. These PMC events are counted in both OS and User mode. |
|
The DC access rate is the number of DC accesses divided by the total number of retired instructions. |
|
The DC miss rate is the number of DC misses divided by the total number of retired instructions. |
|
The DC miss ratio is the number of DC misses divided by the total number of DC accesses. |
|
The number of L2 cache access requests due to L1 data cache misses, per thousand retired instructions. These L2 cache access requests also include the hardware and software prefetches. |
|
The number of demand data cache (DC) fills per thousand retired instructions. These demand DC fills are from all the data sources, such as local L2/L3 cache, remote caches, local memory, and remote memory. |
|
The DTLB L1 miss rate is the number of DTLB L1 misses divided by the total number of retired instructions. |
|
The ITLB L1 miss rate is the number of ITLB L1_Miss_L2_Hits and L1_Miss_L2_Miss divided by the total number of retired instructions. |
|
The number of L2 cache access requests due to the L1 instruction cache misses per thousand retired instructions. These L2 cache access requests also include the prefetches. |
|
The number of L2 cache misses from L1 instruction cache misses per thousand retired instructions. |
|
The L2 DTLB miss rate is the number of L2 DTLB misses divided by the total number of retired instructions. |
|
The ITLB L2 miss rate is the number of ITLB L2 miss divided by the total number of retired instructions. |
|
The misalign rate is the number of misaligned loads divided by the total number of retired instructions. |
|
The misalign ratio is the number of misaligned loads divided by the total number of DC accesses. |
|
This metric is computed as retired mis-predicted branches divided by the total number of retired instructions. |
|
This metric is computed as the retired mis-predicted branches divided by the total number of retired branch instructions. |
|
The number of retired branch instructions rate. This metric is computed as the retired branches divided by the total number of retired instructions. |
|
The number of retired indirect branches per thousand instructions. |
|
The number of retired near branches per thousand instructions. |
|
The number of retired mis-predicted near branches per thousand instructions. |
|
This metric is computed as the retired mis-predicted near returns divided by the total number of retired instructions. |
|
This metric is computed as retired mis-predicted near returns divided by the total number of retired return instructions. |
|
The number of retired taken branches per thousand instructions. |
|
The number of retired mis-predicted taken branches per thousand instructions. |
|
The number of retired taken branches rate. This metric is computed as the retired taken branches divided by the total number of retired instructions. |
|
Store-to-load conflicts: A load was unable to complete due to a non-forwardable conflict with an older store. Most commonly, a load's address range partially but not completely overlaps with an uncompleted older store. Software can avoid this problem by using loads and stores of the same size and alignment when accessing the data. Vector/SIMD code is particularly susceptible to this problem; software should construct wide vector stores by manipulating the vector elements in the registers using shuffle/blend/swap instructions prior to storing to the memory, instead of using narrow element-by-element stores. |
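Several of the metrics in this table are simple ratios of event counts. The sketch below computes a few of them from the definitions given above. The dictionary keys and the sample counts are hypothetical names and values chosen for illustration, not uProf identifiers or measured data, and the event behind the DC miss count is left as an input because the table does not name it.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def core_metrics(counts):
    """Derive ratio metrics from raw event counts (hypothetical key names)."""
    cycles = counts["cycles_not_halted"]                   # PMCx076
    inst = counts["retired_inst"]                          # PMCx0C0
    branches = counts["retired_branches"]                  # PMCx0C2
    mispredicts = counts["retired_branches_mispredicted"]  # PMCx0C3
    dc_accesses = counts["dc_accesses"]                    # PMCx029
    dc_misses = counts["dc_misses"]                        # assumed input
    l2_from_dc = counts["l2_access_from_dc_miss"]          # PMCx060
    return {
        "ipc": safe_div(inst, cycles),
        "cpi": safe_div(cycles, inst),  # multiplicative inverse of IPC
        "dc_access_rate": safe_div(dc_accesses, inst),
        "dc_miss_rate": safe_div(dc_misses, inst),
        "dc_miss_ratio": safe_div(dc_misses, dc_accesses),
        "branch_mispredict_rate": safe_div(mispredicts, inst),
        "branch_mispredict_ratio": safe_div(mispredicts, branches),
        # "Per thousand retired instructions" scales the rate by 1000.
        "l2_access_from_dc_miss_pti": safe_div(1000.0 * l2_from_dc, inst),
    }


example = {
    "cycles_not_halted": 2_000_000,
    "retired_inst": 1_000_000,
    "retired_branches": 200_000,
    "retired_branches_mispredicted": 10_000,
    "dc_accesses": 400_000,
    "dc_misses": 20_000,
    "l2_access_from_dc_miss": 25_000,
}
for name, value in core_metrics(example).items():
    print(f"{name:28s} {value:.3f}")
```

Note the difference between a rate (normalized by retired instructions) and a ratio (normalized by the number of events of the same kind): the two answer different questions and should not be compared directly.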
In this profile, the CPU Profiler uses the PMCs to monitor the various micro-architectural events supported by the AMD x86-based processor. It helps to identify the CPU and memory related performance issues in the profiled applications. To analyze an aspect of the profiled application (or system), a specific set of relevant events is grouped and monitored together. The CPU Profiler provides a list of predefined event configurations, such as Assess Performance and Investigate Branching. You can select any of these predefined configurations to profile and analyze the runtime characteristics of your application. You can also create your own custom configurations of events to profile.
In this profile mode, a delay called skid occurs between the time the sampling interrupt occurs and the time the sampled instruction address is collected. Due to this delay, samples may be recorded near but not exactly at the instruction that caused the interrupt. This can lead to an inaccurate distribution of samples, where events are sometimes attributed to neighboring instructions rather than the actual source instruction.
To launch the AMDuProf GUI, go to Home > Welcome page.
Click Profile an Application on the Welcome page.
Provide application path, application options, working directory, and environment variables, if any. Click Next.
From Predefined Configs, select any Event Based configuration. For example: Assess Performance, Investigate CPI, Investigate Branching, etc.
Alternatively, use Custom Config to configure events individually.
From Advanced Options, select the appropriate options.
Click Start Profile to start the profiling.
Event based collection using Assess config
AMDuProfCLI collect --config assess -o <output-dir> <application>
Event based collection using Investigate Branch config with callstack
AMDuProfCLI collect --config branch -g -o <output-dir> <application>
Event based collection using individual predefined events
AMDuProfCLI collect -e cycles-not-in-halt -e retired-inst -o <output-dir> <application>
Event based collection using individual hardware event IDs
AMDuProfCLI collect -e event=pmcx76,interval=250000 -e event=pmcxc0,user=1,os=0,interval=250000 -o <output-dir> <application>
Once profile data collection completes, a session directory is generated. Use the session directory to generate the CSV report or to import the session into the GUI.
For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.
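When collection runs are scripted, it can help to assemble the `-e` event specifications programmatically. The following is a minimal sketch that only builds the command line and does not run it. It uses only the options shown in the commands above (`-e`, `event`, `user`, `os`, `interval`, `-g`, `-o`); the helper names and the example paths are hypothetical.

```python
import shlex


def event_spec(event, user=None, os_mode=None, interval=None):
    """Build one '-e' argument, for example 'event=pmcx76,interval=250000'."""
    fields = [f"event={event}"]
    if user is not None:
        fields.append(f"user={int(user)}")
    if os_mode is not None:
        fields.append(f"os={int(os_mode)}")
    if interval is not None:
        fields.append(f"interval={interval}")
    return ",".join(fields)


def collect_command(events, output_dir, application, callstack=False):
    """Return the argument list for an AMDuProfCLI collect invocation."""
    argv = ["AMDuProfCLI", "collect"]
    for spec in events:
        argv += ["-e", spec]
    if callstack:
        argv.append("-g")  # callstack collection, as in the examples above
    argv += ["-o", output_dir, application]
    return argv


events = [
    event_spec("pmcx76", interval=250000),
    event_spec("pmcxc0", user=True, os_mode=False, interval=250000),
]
# /tmp/out and ./app are placeholder paths for this illustration.
print(shlex.join(collect_command(events, "/tmp/out", "./app")))
```

Returning an argument list rather than a single string keeps the command safe to pass to a process launcher without shell quoting issues.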
Example
./AMDuProfCLI collect --config assess -g -o /tmp/ /tmp/ScimarkStable
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-EBP_Sep-05-2024_21-43-08
Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-EBP_Sep-05-2024_21-43-08.
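In an automated workflow, the session directory can be picked out of the collect output so that it can be passed on to the report step. The sketch below assumes the output contains a line beginning with `Generated data files path:` as in the example above; the function name is hypothetical and the exact output format may differ between uProf versions.

```python
def session_directory(output):
    """Return the session directory named in collect output, or None."""
    prefix = "Generated data files path:"
    for line in output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


# Output text copied from the example above.
collect_output = """Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-EBP_Sep-05-2024_21-43-08
"""
print(session_directory(collect_output))
```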
CLI Option |
GUI Option |
Description |
|---|---|---|
|
Predefined Configs |
Predefined sampling configuration to be used to collect samples. Use the command |
|
Predefined Configs > Custom Configs |
A predefined event can be used directly with
Argument details
When these arguments are not passed, the default values are:
Use the following commands as required:
|
If the data is collected using the CLI, use Import Session to import the session into the GUI for analysis.
Use the following CLI report command to generate the profile report in .csv format by passing the session directory path generated in collection.
AMDuProfCLI report -i <session directory>
For a list of all the supported options, refer to AMDuProfCLI Report Command Options.
Example
./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-EBP_Sep-05-2024_21-43-08
Translation started
…
Report generation started
…
Report generation completed...
Generated report file: /tmp/AMDuProf-ScimarkStable-EBP_Sep-05-2024_21-43-08/report.csv
Use the Thread Concurrency Graph to analyze how efficiently the application utilizes the processor cores, that is, how much time a specific number of threads run on a specific number of cores.
Figure 7.38 Thread Concurrency Graph#
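The quantity behind a concurrency graph can be illustrated with a small sweep over thread run intervals. This is a conceptual sketch only: it assumes the run intervals of each thread are available as (start, end) pairs, it ignores which core each thread ran on, and it does not describe how uProf derives the graph.

```python
from collections import defaultdict


def concurrency_histogram(intervals):
    """Return {number_of_running_threads: total_time}.

    intervals: list of (start, end) times during which a thread was running.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a thread starts running
        events.append((end, -1))    # a thread stops running
    events.sort()  # at equal times, stops (-1) sort before starts (+1)

    histogram = defaultdict(float)
    running = 0
    previous = None
    for time, delta in events:
        if previous is not None and time > previous:
            # 'running' threads were active for the whole span since 'previous'.
            histogram[running] += time - previous
        running += delta
        previous = time
    return dict(histogram)


# Hypothetical run intervals for three threads, in milliseconds.
print(concurrency_histogram([(0, 10), (2, 6), (4, 8)]))
```

A gap in which no thread runs is reported under key 0, so the values always sum to the span between the first start and the last end.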
Figure 7.39 Function Hotspots#
Select a function to get all the call paths to this function from different threads; each call path provides the number of samples in that path. Double-click the function to analyze the instruction-level sample attribution for that function using Source View.
Use Flame Graph to identify the hottest code paths of an application. The width of each function indicates the percentage of event samples of the function and its callees relative to the total number of samples of the selected process and thread for a specific event.
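The width calculation can be sketched as follows: every sampled call stack contributes its samples to each frame along the path, so a frame's width reflects the samples of the function plus its callees. The input format and the function name are assumptions made for this illustration.

```python
from collections import Counter


def flame_widths(stacks):
    """Return {call_path: percent_of_total_samples}.

    stacks: list of (frames, samples) where frames is ordered root first.
    """
    inclusive = Counter()
    total = 0
    for frames, samples in stacks:
        total += samples
        # Credit the samples to every prefix of the stack (inclusive count).
        for depth in range(1, len(frames) + 1):
            inclusive[tuple(frames[:depth])] += samples
    return {path: 100.0 * count / total for path, count in inclusive.items()}


# Hypothetical callstack samples for one process, thread, and event.
stacks = [
    (["main", "solve", "matmul"], 60),
    (["main", "solve"], 10),
    (["main", "io"], 30),
]
for path, width in sorted(flame_widths(stacks).items()):
    print(f"{' > '.join(path):24s} {width:5.1f}%")
```

Here `solve` is 70% wide even though only 10 samples landed in its own code, because the 60 samples in `matmul` were taken while `solve` was on the stack.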
Use Top-Down Callstack to analyze any issues with the call-sequence flow of the application and to analyze the bottlenecks in a function and its callees.
Figure 7.40 Top-Down Callstack#
Confidence Threshold
Metrics with a low number of samples collected for a program unit, due to either multiplexing or statistical sampling, are grayed out. A few points to remember are:
This is applicable to SW Timer and Core PMC based metrics.
This confidence threshold value can be set through Preferences section in SETTINGS page.
Issue Threshold
CPI metric cells that exceed the threshold value (>1.0) are highlighted in pink to mark them as potential performance problems, as follows:
Figure 7.41 CPI Metric - Threshold-Based Performance#
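The two thresholds can be pictured as a simple classification of each metric cell. In this sketch, `min_samples` stands in for the confidence threshold configured in SETTINGS (its actual unit and default are not given here), and giving the confidence check priority over the issue check is an assumption of this illustration rather than documented GUI behavior.

```python
def classify_cell(cpi, samples, min_samples, cpi_threshold=1.0):
    """Classify a CPI cell as 'low-confidence', 'issue', or 'normal'."""
    if samples < min_samples:
        return "low-confidence"   # too few samples: shown grayed out
    if cpi > cpi_threshold:
        return "issue"            # exceeds the threshold: shown highlighted
    return "normal"


# Hypothetical rows: (function, CPI, number of samples).
rows = [("matmul", 1.8, 5000), ("fft", 0.7, 3000), ("init", 2.5, 3)]
for name, cpi, samples in rows:
    print(name, classify_cell(cpi, samples, min_samples=10))
```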
CPU profiling expects that the executable binaries of the profiled application are not compressed or obfuscated by any software protector tool, for example, VMProtect.
In the case of AMD EPYC™ 1st generation B1 parts, only one PMC register is used at a time for Core PMC event-based profiling (EBP).
In this profile, the CPU Profiler uses the IBS HW supported by the AMD x86-based processor to observe the effect of instructions on the processor and on the memory subsystem. In IBS, HW events are linked with the instruction that caused them. The CPU Profiler also uses these HW events to derive various metrics, such as data cache latency.
To launch the AMDuProf GUI, go to Home > Welcome page.
Click Profile an Application on the Welcome page.
Provide application path, application options, working directory, and environment variables, if any. Click Next.
From Predefined Configs, select Instruction Based Sampling configuration.
Alternatively, use Custom Config to configure IBS_FETCH, IBS_ALL_OPS events individually.
From Advanced Options, select the appropriate options.
Click Start Profile to start the profiling.
Collection using Instruction Based Sampling config
AMDuProfCLI collect --config ibs -o <output-dir> <application>
Collection using Instruction Based Sampling config with callstack
AMDuProfCLI collect --config ibs -g -o <output-dir> <application>
Collection using individual IBS events
AMDuProfCLI collect -e event=IBS_ALL_OPS,interval=250000 -e event=IBS_FETCH,user=1,os=0,interval=250000 -o <output-dir> <application>
Once profile data collection completes, a session directory is generated. Use the session directory to generate the CSV report or to import the session into the GUI.
For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.
Example
./AMDuProfCLI collect --config ibs -g -o /tmp/ /tmp/ScimarkStable
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-IBS_Sep-05-2024_21-43-08
Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-IBS_Sep-05-2024_21-43-08.
If the data is collected using the CLI, use Import Session to import the session into the GUI for analysis.
Use the following CLI report command to generate the profile report in .csv format by passing the session directory path generated in collection.
AMDuProfCLI report -i <session directory>
For a list of all the supported options, refer to AMDuProfCLI Report Command Options.
Example
./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-IBS_Sep-05-2024_21-43-08
Translation started
…
Report generation started
…
Report generation completed...
Generated report file is stored as /tmp/AMDuProf-ScimarkStable-IBS_Sep-05-2024_21-43-08/report.csv
Use the Thread Concurrency Graph to analyze how efficiently the application utilizes the processor cores, that is, how much time a specific number of threads run on a specific number of cores.
Figure 7.42 Thread Concurrency Graph#
Use Function Hotspots to list the functions and the number of samples for the configured events. Expand the function to get its processes and further expand to get its threads.
Figure 7.43 Function Hotspots#
Select a function to get all the call paths to this function from different threads; each call path provides the number of samples in that path. Double-click the function to analyze the instruction-level sample attribution for that function using Source View.
Use Flame Graph to identify the hottest code paths of an application. The width of each function indicates the percentage of event samples of the function and its callees relative to the total number of samples of the selected process and thread for a specific event.
Figure 7.44 Flame Graph#
Use Top-Down Callstack to analyze any issues with the call-sequence flow of the application and to analyze the bottlenecks in a function and its callees.
Figure 7.45 Top-Down Callstack#
For some scenarios, it would be useful to analyze the ASCII dump of IBS OP profile samples. To do so, complete the following steps:
Where:
– interval denotes the sampling interval
– loadstore collects only the load and store ops (Windows-only option)
– ibsop-count-control=1 counts dispatched micro-ops (0 counts clock cycles)
– --data-buffer-count 1024 sets the number of per-core data buffers to allocate (Windows-only option)
To collect the IBS OP samples:
Once the raw file is generated, run the following command to translate and get the ASCII dump of IBS OP samples:
C:\> AMDuProfCLI.exe translate --ascii event-dump -i C:\temp\AMDuProf-IBS_<timestamp>\
The CSV file containing the ASCII dump of the IBS OP samples is generated at:
C:\temp\AMDuProf-IBS_<timestamp>\IbsOpDump.csv
During collection the following control knobs are available:
-e event=ibs-op,interval=100000,loadstore,ibsop-count-control=1
If there are too many missing records, try the following:
Increase the sampling interval.
Increase the data buffer count.
Reduce the number of cores profiled.
AMD uProf translates the IBS information produced by the hardware into derived event sample counts that resemble EBP sample counts. All the IBS-derived events contain IBS in the event name and abbreviation. Although IBS-derived events and sample counts look similar to the EBP events and sample counts, the source and sampling basis for the IBS event information are different.
Never perform arithmetic between IBS-derived event sample counts and EBP event sample counts. It is not meaningful to directly compare the number of samples taken for events that represent the same hardware condition. For example, a smaller number of IBS DC miss samples is not necessarily better than a larger number of EBP DC miss samples.
The following table shows the IBS fetch events:
IBS Fetch Event |
Description |
|---|---|
|
The number of all the IBS fetch samples. This derived event counts the number of all the IBS fetch samples that were collected, including IBS-killed fetch samples. |
|
The number of completed IBS sampled fetches. A fetch is completed if the attempted fetch delivers instruction data to the instruction decoder. Although the instruction data was delivered, it may still not be used. For example, the instruction data may have been on the wrong path of an incorrectly predicted branch. |
|
The number of IBS sampled fetches that aborted. An attempted fetch is aborted if it does not complete and deliver instruction data to the decoder. An attempted fetch may abort at any point in the process of fetching instruction data. An abort may be due to a branch redirection as the result of a mispredicted branch. The number of IBS aborted fetch samples is a lower bound on the amount of unsuccessful, speculative fetch activity. It is a lower bound because the instruction data delivered by completed fetches may not be used. |
|
The number of IBS attempted fetch samples where the fetch operation initially hit in the L1 ITLB (Instruction Translation Lookaside Buffer). |
|
The number of IBS attempted fetch samples where the fetch operation initially missed in the L1 ITLB and hit in the L2 ITLB. |
|
The number of IBS attempted fetch samples where the fetch operation initially missed in both the L1 ITLB and the L2 ITLB. |
|
The number of IBS attempted fetch samples where the fetch operation produced a valid physical address (that is, address translation completed successfully) and used a 4 KB page entry in the L1 ITLB. |
|
The number of IBS attempted fetch samples where the fetch operation produced a valid physical address (that is, address translation completed successfully) and used a 2 MB page entry in the L1 ITLB. |
|
The total latency of all IBS attempted fetch samples. Divide the total IBS fetch latency by the number of IBS attempted fetch samples to obtain the average latency of the attempted fetches that were sampled. |
|
The number of IBS attempted fetch samples where the instruction fetch missed in the L2 cache. |
|
The number of cycles when the fetch engine is stalled for an ITLB reload for the sampled fetch. If there is no reload, the latency will be 0. |
IBS Fetch Event |
Description |
|---|---|
|
The number of all the IBS fetch samples. This derived event counts the number of all the IBS fetch samples that were collected, including IBS-killed fetch samples. |
|
The number of IBS sampled fetches that were not killed fetch attempts. This derived event measures the number of useful fetch attempts and does not include the number of IBS killed fetch samples. This event should be used to compute ratios such as the ratio of IBS fetch IC misses to attempted fetches. The number of attempted fetches should equal the sum of the number of completed fetches and the number of aborted fetches. |
|
The number of IBS sampled fetches that completed. A fetch is completed if the attempted fetch delivers instruction data to the instruction decoder. Although the instruction data was delivered, it may still not be used (for example, the instruction data may have been on the wrong path of an incorrectly predicted branch.) |
|
The number of IBS sampled fetches that aborted. An attempted fetch is aborted if it does not complete and deliver instruction data to the decoder. An attempted fetch may abort at any point in the process of fetching instruction data. An abort may be due to a branch redirection as the result of a mispredicted branch. The number of IBS aborted fetch samples is a lower bound on the amount of unsuccessful, speculative fetch activity. It is a lower bound as the instruction data delivered by completed fetches may not be used. |
|
The number of IBS attempted fetch samples where the fetch operation initially hit in the L1 ITLB (Instruction Translation Lookaside Buffer). |
|
The number of IBS attempted fetch samples where the fetch operation initially missed in the L1 ITLB and hit in the L2 ITLB. |
|
The number of IBS attempted fetch samples where the fetch operation initially missed in the IC (instruction cache). |
|
The number of IBS attempted fetch samples where the fetch operation initially hit in the IC. |
|
The number of IBS attempted fetch samples where the fetch operation produced a valid physical address (that is, address translation completed successfully) and used a 4 KB page entry in the L1 ITLB. |
|
The number of IBS attempted fetch samples where the fetch operation produced a valid physical address (that is, address translation completed successfully) and used a 2 MB page entry in the L1 ITLB. |
|
The number of IBS attempted fetch samples where the fetch operation produced a valid physical address (that is, address translation completed successfully) and used a 1 GB page entry in the L1 ITLB. |
|
The total latency of all IBS attempted fetch samples. Divide the total IBS fetch latency by the number of IBS attempted fetch samples to obtain the average latency of the attempted fetches that were sampled. |
|
The number of IBS attempted fetch samples where the instruction fetch missed in the L2 cache. |
|
The number of cycles when the fetch engine is stalled for an ITLB reload for the sampled fetch. If there is no reload, the latency will be 0. |
|
The number of IBS attempted fetch samples where the Op Cache was not able to supply all the bytes for the tagged fetch. |
|
The number of IBS attempted fetch samples where the instruction fetch missed in the L3 cache on the same CCX. |
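The relationship stated in the table, that attempted fetches equal completed plus aborted fetches, can be checked with a sketch like the one below. The function name and the sample counts are hypothetical; real values would come from a profile report.

```python
def fetch_breakdown(attempted, completed, aborted):
    """Check attempted == completed + aborted and return the aborted share in percent."""
    consistent = attempted == completed + aborted
    aborted_pct = 100.0 * aborted / attempted if attempted else 0.0
    return consistent, aborted_pct

# Hypothetical sample counts, not measured data.
consistent, aborted_pct = fetch_breakdown(attempted=10_000, completed=9_200, aborted=800)
print(consistent, f"{aborted_pct:.1f}%")
```

Because data delivered by completed fetches may still go unused, the aborted share is only a lower bound on unsuccessful speculative fetch activity.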
Here is a list of IBS fetch metrics.
IBS Fetch Metric |
Description |
|---|---|
|
The average IBS fetch latency. Calculated by dividing the IBS fetch latency by the total number of IBS fetch attempts. |
|
Percentage of IBS fetch L1 and L2 ITLB misses with respect to the total number of IBS fetch attempts. |
|
Percentage of IBS fetch L1 ITLB miss and L2 ITLB hits with respect to the total number of IBS fetch attempts. |
|
Percentage of IBS fetch L1 instruction cache misses with respect to the total number of IBS fetch attempts. |
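As a rough illustration of how the fetch metrics relate to the fetch events, the sketch below computes them from a dictionary of sample counts. The dictionary keys and the values are hypothetical placeholders, not AMD uProf event names; each metric uses the number of IBS fetch attempts as the denominator, as described in the table.

```python
def percent(part, whole):
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0

def fetch_metrics(counts):
    """Derive IBS fetch metrics from raw IBS fetch sample counts."""
    attempts = counts["attempted"]
    return {
        "avg_fetch_latency": counts["latency"] / attempts if attempts else 0.0,
        "l1_l2_itlb_miss_pct": percent(counts["l1_l2_itlb_miss"], attempts),
        "l1_itlb_miss_l2_hit_pct": percent(counts["l1_itlb_miss_l2_hit"], attempts),
        "ic_miss_pct": percent(counts["ic_miss"], attempts),
    }

# Hypothetical counts for illustration only.
metrics = fetch_metrics({
    "attempted": 2_000,
    "latency": 50_000,
    "l1_l2_itlb_miss": 20,
    "l1_itlb_miss_l2_hit": 60,
    "ic_miss": 100,
})
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```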
Here is a list of IBS op events.
IBS Op Event |
Description |
|---|---|
|
The number of all the IBS op samples collected. These op samples may be branch ops, resync ops, ops that perform load/store operations, or undifferentiated ops (for example, those ops that perform arithmetic operations, logical operations, and so on). IBS collects data for the retired ops. No data is collected for the ops that are aborted due to pipeline flushes and so on. Thus, all the sampled ops are architecturally significant and contribute to the successful execution of programs. |
|
The total number of tag-to-retire cycles across all the IBS op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired. |
|
The total number of completion-to-retire cycles across all the IBS op samples. The completion-to-retire time of an op is the number of cycles from when the op completed to when the op retired. |
|
The number of IBS retired branch op samples. A branch operation is a change in the program control flow and includes unconditional and conditional branches, subroutine calls, and subroutine returns. Branch ops are used to implement AMD64 branch semantics. |
|
The number of IBS samples for retired branch operations that were mispredicted. This event should be used to compute the ratio of mispredicted branch operations to all the branch operations. |
|
The number of IBS samples for the retired branch operations that were taken branches. |
|
The number of IBS samples for the retired branch operations that were mispredicted taken branches. |
|
The number of IBS retired branch op samples where the operation was a subroutine return. These samples are a subset of all the IBS retired branch op samples. |
|
The number of IBS retired branch op samples where the operation was a mispredicted subroutine return. This event should be used to compute the ratio of the mispredicted returns to all the subroutine returns. |
|
The number of IBS resync op samples. A resync op is only found in certain microcoded AMD64 instructions and causes a complete pipeline flush. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples for ops that perform either a load and/or store operation. Each op may perform a load operation, a store operation, or both a load and store operation (each to the same address). |
|
The number of IBS op samples for ops that perform a load operation. |
|
The number of IBS op samples for ops that perform a store operation. |
|
The number of IBS op samples where either a load or store operation initially hit the L1 DTLB (data translation lookaside buffer). |
|
The number of IBS op samples where either a load or store operation initially missed in the L1 DTLB and hit the L2 DTLB. |
|
The number of IBS op samples where either a load or store operation initially missed in both the L1 DTLB and the L2 DTLB. |
|
The number of IBS op samples where either a load or store operation initially missed in the L1 DC. |
|
The number of IBS op samples where either a load or store operation initially hit the L1 DC. |
|
The number of IBS op samples where either a load or store operation caused a misaligned access (that is, the load or store operation crossed a 256-bit boundary). |
|
The number of IBS op samples where either a load or store operation caused a bank conflict with a load operation. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples where either a load or store operation caused a bank conflict with a store operation. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples where data for a load operation was forwarded from a store operation. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples where data forwarding to a load operation from a store was cancelled. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples where a load or store operation accessed uncacheable (UC) memory. |
|
The number of IBS op samples where a load or store operation accessed write combining (WC) memory. |
|
The number of IBS op samples where a load or store operation was a locked operation. |
|
The number of IBS op samples where a load or store operation hit an already allocated entry in the Miss Address Buffer (MAB). |
|
The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 4 KB page entry in the L1 DTLB was used for the address translation. |
|
The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 2 MB page entry in the L1 DTLB was used for the address translation. |
|
The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 1 GB page entry in the L1 DTLB was used for the address translation. |
|
The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit the L2 DTLB, and used a 4 KB page entry for the address translation. |
|
The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit the L2 DTLB, and used a 2 MB page entry for the address translation. |
|
The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit the L2 DTLB, and used a 1 GB page entry for address translation. |
|
The total L1 DC miss load latency (in processor cycles) across all the IBS op samples that performed a load operation and missed in the data cache. The miss latency is the number of clock cycles from when the L1 data cache miss was detected to when data was delivered to the core. |
|
Load resync. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples where a load operation was serviced from the local processor. Northbridge IBS data is only valid for the load operations that miss in both the L1 data cache and the L2 data cache. If a load operation crosses a cache line boundary, the IBS data reflects the access to the lower cache line. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples where a load operation was serviced from a remote processor. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples where a load operation was serviced by the local L3 cache. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples where a load operation was serviced by a cache (L1 or L2 data cache) belonging to a local core which is a sibling of the core making the memory request. Note Not supported on Zen3 and later processors. |
|
IBS Load data returned from local L3 hit or different L1/L2 of same CCX or L1/L2/L3 hit in other CCX of same node. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples where a load operation was serviced by a remote L1 data cache, L2 cache, or L3 cache after traversing one or more coherent HyperTransport links. |
|
The number of IBS op samples where a load operation was serviced by the local NUMA node’s DRAM (via the local memory controller). |
|
The number of IBS op samples where a load operation was serviced by the remote NUMA node’s DRAM (after traversing one or more coherent HyperTransport links and through a remote memory controller). |
|
The number of IBS op samples where a load operation was serviced from local MMIO, configuration or PCI space, or from the local APIC. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples where a load operation was serviced from remote MMIO, configuration, or PCI space. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples where a load operation was serviced from local or remote cache, and the cache hit state was the Modified (M) state. Note Not supported on Zen3 and later processors. |
|
The number of IBS op samples where a load operation was serviced from local or remote cache, and the cache hit state was the Owned (O) state. Note Not supported on Zen3 and later processors. |
|
The total data cache miss latency (in processor cycles) for the load operations that were serviced by the local processor. Note Not supported on Zen3 and later processors. |
|
The total data cache miss latency (in processor cycles) for the load operations that were serviced by a remote processor. Note Not supported on Zen3 and later processors. |
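The ratios suggested in the table, mispredicted branches to all branches and mispredicted returns to all returns, can be computed as in the sketch below. The variable names and counts are hypothetical. Both operands of each ratio are IBS-derived counts, which keeps the arithmetic within a single sampling basis.

```python
def ratio(numerator, denominator):
    """Return numerator/denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

# Hypothetical IBS op sample counts, not measured data.
branch_samples = 4_000
mispredicted_branches = 120
return_samples = 500
mispredicted_returns = 10

branch_misp_ratio = ratio(mispredicted_branches, branch_samples)
return_misp_ratio = ratio(mispredicted_returns, return_samples)
print(f"branch mispredict ratio: {branch_misp_ratio:.3f}")
print(f"return mispredict ratio: {return_misp_ratio:.3f}")
```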
IBS Op Event |
Description |
|---|---|
|
The number of all the IBS op samples that were collected. These samples may be branch ops, resync ops, ops that perform load/store operations, or undifferentiated ops (for example, the ops that perform arithmetic operations, logical operations, and so on). IBS collects data for retired ops. No data is collected for ops that are aborted due to pipeline flushes and so on. Thus, all sampled ops are architecturally significant and contribute to the successful program execution. |
|
The total number of tag-to-retire cycles across all the IBS op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired. |
|
The total number of completion-to-retire cycles across all the IBS op samples. The completion-to-retire time of an op is the number of cycles from when the op completed to when the op retired. |
|
The number of IBS retired branch op samples. A branch operation is a change in program control flow and includes unconditional and conditional branches, subroutine calls, and subroutine returns. Branch ops are used to implement AMD64 branch semantics. |
|
The number of IBS samples for the retired branch operations that were mispredicted. This event should be used to compute the ratio of mispredicted branch operations to all branch operations. |
|
The number of IBS samples for retired branch operations that were taken branches. |
|
The number of IBS samples for the retired branch operations that were mispredicted taken branches. |
|
The number of IBS retired branch op samples where the operation was a subroutine return. These samples are a subset of all the IBS retired branch op samples. |
|
The number of IBS retired branch op samples where the operation was a mispredicted subroutine return. This event should be used to compute the ratio of the mispredicted returns to all the subroutine returns. |
|
Tagged operation was part of a fused instruction pair. |
|
Tagged operation from microcode. |
|
The total number of tag-to-retire cycles across all IBS op branch samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired. |
|
The total number of tag-to-retire cycles across all branch mispredict instruction op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired. |
|
The total number of tag-to-retire cycles across all branch taken op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired. |
|
The total number of tag-to-retire cycles across all branch return op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired. |
|
The total number of completion-to-retire cycles across all IBS branch samples. The completion-to-retire time of an op is the number of cycles from when the op completed to when the op retired. |
|
The total number of completion-to-retire cycles across all branch mispredict instruction op samples. The completion-to-retire time of an op is the number of cycles from when the op completed to when the op retired. |
|
The total number of completion-to-retire cycles across all IBS taken samples. The completion-to-retire time of an op is the number of cycles from when the op completed to when the op retired. |
|
IBS branch return op completion-to-retire cycles. |
|
The number of IBS op samples for the ops that perform either a load and/or store operation. Each op may perform a load operation, a store operation, or both a load and store operation (each to the same address). |
|
The number of IBS op samples for the ops that perform a load operation. |
|
The number of IBS op samples for the ops that perform a store operation. |
|
The number of IBS op samples where either a load or store operation initially hit in the L1 DTLB (data translation lookaside buffer). |
|
The number of IBS op samples where either a load or store operation initially missed in the L1 DTLB and hit in the L2 DTLB. |
|
The number of IBS op samples where either a load or store operation initially missed in both the L1 DTLB and the L2 DTLB. |
|
The number of IBS op samples where either a load or store operation initially missed in the L1 data cache (DC). |
|
The number of IBS op samples where either a load or store operation initially hit in the L1 data cache (DC). |
|
The number of IBS op samples where either a load or store operation caused a misaligned access (that is, the load or store operation crossed a 64-byte boundary). |
|
The number of IBS op samples where a load or store operation accessed uncacheable (UC) memory. |
|
The number of IBS op samples where a load or store operation accessed write combining (WC) memory. |
|
The number of IBS op samples where a load or store operation was a locked operation. |
|
The number of IBS op samples where a load or store operation hit an allocated entry in the Miss Address Buffer (MAB). |
|
The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 4 KB page entry in L1 DTLB was used for the address translation. |
|
The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 2 MB page entry in L1 DTLB was used for the address translation. |
|
The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 1 GB page entry in L1 DTLB was used for the address translation. |
|
The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit L2 DTLB, and used a 4 KB page entry for the address translation. |
|
The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit L2 DTLB, and used a 2 MB page entry for the address translation. |
|
The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit L2 DTLB, and used a 1 GB page entry for the address translation. |
|
The total L1 DC miss load latency (in processor cycles) across all the IBS op samples that performed a load operation and missed in the data cache. The miss latency is the number of clock cycles from when the L1 data cache miss was detected to when data was delivered to the core. |
|
The number of IBS op samples where a store operation missed in L1 data cache. |
|
The number of IBS op samples where a store operation hit in L1 data cache. |
|
The number of IBS op samples where a load operation hit in L1 data cache. |
|
The number of IBS op samples where a load operation missed in data cache. |
|
The number of IBS op samples where a load operation hit in L2 cache. |
|
The number of IBS op samples where a load operation missed in L2 Cache. |
|
The total latency (in processor cycles) for load operations that were serviced by the L2 cache. |
|
The number of cycles from when a L1 DTLB refill is triggered by a tagged op to when the L1 DTLB fill has been completed. |
|
The total number of tag-to-retire cycles across all IBS op load samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired. |
|
The total number of tag-to-retire cycles across all IBS op store samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired. |
|
The total number of tag-to-retire cycles across all IBS op load and store samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired. |
|
The total number of tag-to-retire cycles across all IBS UC memory access op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired. |
|
The total number of tag-to-retire cycles across all IBS WC memory access op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired. |
|
The total number of tag-to-retire cycles across all IBS misalign access op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired. |
|
The number of IBS op samples where a load operation was serviced by a remote L1 data, L2, or L3 cache after traversing one or more coherent HyperTransport links. |
|
The number of IBS op samples where a load operation was serviced by the local NUMA node’s DRAM (via the local memory controller). |
|
The number of IBS op samples where a load operation was serviced by the remote NUMA node’s DRAM (after traversing one or more coherent HyperTransport links and through a remote memory controller). |
|
The number of IBS op samples where a load operation was serviced by the shared L3 cache or other L1/L2 cache in the same CCX. |
|
The number of IBS op samples where a load operation was serviced by L2/L3 cache in a different CCX of same NUMA node. |
|
The number of IBS op samples where a load operation was serviced by the DRAM. |
|
The number of IBS op samples where a load operation was serviced by the NVDIMM. |
|
The number of IBS op samples where a load operation was serviced from MMIO, configuration or PCI space, or from the local APIC. |
|
The number of IBS op samples where a load operation was serviced by Extension memory. |
|
The number of IBS op samples where a load operation was serviced by Peer agent memory. |
|
The number of IBS op samples where a load operation was serviced by local long-latency DIMM. |
|
The number of IBS op samples where a load operation was serviced by remote long-latency DIMM. |
|
The number of IBS op samples where a load operation was serviced from the local or remote cache, and the cache hit state was the Modified (M) state. |
|
The number of IBS op samples where a load operation was serviced from the local or remote cache, and the cache hit state was the Owned (O) state. |
|
The number of IBS op samples where a load operation was serviced from local L3 or other L2 in the same CCX, and the cache hit state was the Modified (M) state. |
|
The number of IBS op samples where a load operation was serviced from another L3 in same NUMA node, and the cache hit state was the Modified (M) state. |
|
The number of IBS op samples where a load operation was serviced from another L3 in different NUMA node, and the cache hit state was the Modified (M) state. |
|
The total latency (in processor cycles) for load operations that were serviced by the shared L3 cache or other L1/L2 in the same CCX. |
|
The total latency (in processor cycles) for load operations that were serviced by the L2/L3 cache in different CCX of the same NUMA node. |
|
The total latency (in processor cycles) for load operations that were serviced by L2/L3 cache in different CCX on different NUMA node. |
|
The total latency (in processor cycles) for load operations that were serviced by the DRAM in the same NUMA node (including on socket NUMA nodes). |
|
The total latency (in processor cycles) for load operations that were serviced by the DRAM in a different NUMA node. |
|
The total latency (in processor cycles) for load operations that were serviced by the DRAM. |
|
The total latency (in processor cycles) for load operations that were serviced by the NVDIMM-P. |
|
The total latency (in processor cycles) for load operations that were serviced by the local NVDIMM. |
|
The total latency (in processor cycles) for load operations that were serviced by the remote NVDIMM. |
|
The total latency (in processor cycles) for load operations that were serviced by the extension memory. |
|
The total latency (in processor cycles) for load operations that were serviced by the local extension memory. |
|
The total latency (in processor cycles) for load operations that were serviced by the remote extension memory. |
|
The total latency (in processor cycles) for load operations that were serviced by the peer agent memory. |
|
The total latency (in processor cycles) for load operations that were serviced by the local peer agent memory. |
|
The total latency (in processor cycles) for load operations that were serviced by the remote peer agent memory. |
|
The total latency (in processor cycles) for load operations that were serviced by the MMIO/Config/PCI/APIC. |
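Several rows in this table report a total latency for loads serviced by a given data source, while other rows report the number of samples serviced by that source. Dividing one by the other gives an average latency per source. The sketch below assumes such pairs are available; the source labels and numbers are hypothetical.

```python
def average_latency(total_cycles, samples):
    """Average latency in cycles per sampled load, or 0.0 when there are no samples."""
    return total_cycles / samples if samples else 0.0

# Hypothetical (sample count, total latency cycles) pairs keyed by data source.
sources = {
    "L2 cache": (900, 13_500),
    "DRAM": (100, 30_000),
}
averages = {
    name: average_latency(cycles, count)
    for name, (count, cycles) in sources.items()
}
for name, avg in averages.items():
    print(f"{name}: {avg:.1f} cycles per sampled load")
```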
Here is a list of IBS op metrics for AMD Zen3, Zen4, and Zen5 platforms.
IBS Op Metric |
Description |
|---|---|
|
Percentage of IBS Branch operations with respect to the total IBS operations. |
|
Percentage of IBS Branch op completion to retire cycles. |
|
Percentage of IBS Branch mispredict operations with respect to IBS branch operations. |
|
Percentage of IBS Branch mispredict op completion to retire cycles. |
|
Percentage of cycles wasted due to branch mispredicts. The Tag-To-Retire cycles of branch mispredicts divided by the total Tag-To-Retire cycles of all the operations, expressed as percentage. |
|
Percentage of IBS Branch mispredict op tag to retire cycles. |
|
Percentage of IBS Branch op tag to retire cycles. |
|
Percentage of cycles wasted due to L1 DTLB misses. The number of L1 DTLB refill latency cycles divided by the total number of Tag-To-Retire cycles of all the operations, expressed as percentage. |
|
Percentage of IBS load DRAM hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load Extension Memory hit latency cycles with respect to the load L1 DC miss latency cycles. Note Not supported on Zen3 processors. |
|
Percentage of cycles wasted to fetch the data. The number of Load L1 DC misses latency cycles divided by the total number of Tag-To-Retire cycles of all the operations, expressed as percentage. |
|
Percentage of IBS load L2 hit latency cycles with respect to load L1 DC miss latency cycles. |
|
Percentage of IBS load local cache hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load local DRAM hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load Non main memory hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load NVDIMM hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load Peer Agent Memory hit latency cycles with respect to the load L1 DC miss latency cycles. Note Not supported on Zen3 processors. |
|
Percentage of IBS load peer cache hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load remote cache hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load remote DRAM hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of Load operations. The total number of load operations divided by the number of IBS OP samples, expressed as percentage. |
|
Percentage of Load and Store operations. The total number of load and store operations divided by the number of IBS OP samples, expressed as percentage. |
|
Percentage of IBS Branch return operations with respect to IBS branch operations. |
|
Percentage of IBS Branch return op completion to retire cycles. |
|
Percentage of IBS Branch return op tag to retire cycles. |
|
Percentage of Store operations. The total number of store operations divided by the number of IBS OP samples, expressed as percentage. |
|
Percentage of IBS Branch taken operations with respect to IBS branch operations. |
|
Percentage of IBS Branch taken op completion to retire cycles. |
|
Percentage of IBS Branch taken op tag to retire cycles. |
|
Number of Branch mispredicts per thousand operations. The number of branch mispredicts divided by the total number of branch operations, expressed as Per-Thousand-Instructions. |
|
Branch mispredict rate in percentage. The number of branch mispredicts divided by the total number of branch operations, expressed as percentage. |
|
Percentage of load samples where the load operation was serviced by DRAM in the system. The number of IBS_LD_DRAM_HIT divided by IBS_LOAD, expressed in percentage. |
|
Percentage of load samples where the load operation was serviced by Extension Memory in the system. The number of IBS_LD_EXT_MEM_HIT divided by IBS_LOAD, expressed in percentage. Note Not supported on Zen3 processors. |
|
Percentage of load samples where the load operation was serviced by Peer agent Memory in the system. The number of IBS_LD_EXT_MEM_HIT divided by IBS_LOAD, expressed in percentage. |
|
Percentage of load samples where the load operation was serviced from MMIO, configuration or PCI space, or from the local APIC in the system. The number of IBS_LD_NON_MAIN_MEM_HIT divided by IBS_LOAD, expressed in percentage. |
|
Average Load L1 DC Miss latency cycles. The total load L1 DC miss latency cycles divided by the number of load L1 DC misses. |
|
Percentage of cycles wasted to fetch the data. The number of Load L1 DC misses latency cycles divided by the total number of Tag-To-Retire cycles of all the operations, expressed as percentage. |
|
Percentage of IBS load L2 hit latency cycles with respect to load L1 DC miss latency cycles. |
|
Percentage of IBS load local cache hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load peer cache hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load remote cache hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load local DRAM hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load remote DRAM hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load DRAM hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load NVDIMM hit latency cycles with respect to the load L1 DC miss latency cycles. |
|
Percentage of IBS load Extension Memory hit latency cycles with respect to the load L1 DC miss latency cycles. Note Not supported on Zen3 processors. |
|
Percentage of IBS load Peer Agent Memory hit latency cycles with respect to the load L1 DC miss latency cycles. Note Not supported on Zen3 processors. |
|
Percentage of IBS load Non main memory hit latency cycles with respect to the load L1 DC miss latency cycles. |
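As a rough illustration of how the ratio-style metrics in this table are formed, the following Python sketch computes a few of them from raw counts. The function and parameter names are hypothetical and are not AMD uProf identifiers; only the formulas are taken from the descriptions above, and the counts are made up.

```python
def percentage(part, whole):
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def load_pct(loads, ibs_op_samples):
    # Percentage of load operations: loads divided by IBS OP samples.
    return percentage(loads, ibs_op_samples)


def dram_hit_pct(ibs_ld_dram_hit, ibs_load):
    # IBS_LD_DRAM_HIT divided by IBS_LOAD, expressed as a percentage.
    return percentage(ibs_ld_dram_hit, ibs_load)


def avg_l1_dc_miss_latency(total_miss_latency_cycles, l1_dc_misses):
    # Total load L1 DC miss latency cycles divided by the number of misses.
    return total_miss_latency_cycles / l1_dc_misses if l1_dc_misses else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(load_pct(250, 1000))
    print(dram_hit_pct(20, 250))
    print(avg_l1_dc_miss_latency(12000, 80))
```

Guarding against a zero denominator is a design choice of this sketch; the report itself may represent such cases differently.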
CPU profiling in AMD uProf has the following limitations:
CPU profiling expects that the executable binaries of the profiled application are not compressed or obfuscated by software protector tools such as VMProtect.
In case of AMD EPYC™ 1st generation B1 parts, only one PMC register is used at a time for Core PMC event-based profiling (EBP).
IMIX has the following limitations:
The IMIX view or report is supported only for IBS profile type.
If any module/binary has fewer than 10 samples, it is not shown in the IMIX report. Such a small number of samples is not useful for IMIX analysis.
Linux kernel module .ko files are not shown in the IMIX view or report.
Cache Analysis uses IBS OP samples to detect hot false-sharing cache lines in multi-threaded applications and in multi-process applications that use shared memory.
At a high level, this feature reports:
The cache lines where there is a potential false sharing
Offsets where those accesses occur, readers and writers to those offsets
PID, TID, Function Name, Source File, and Line Number for those reader and writers
Load latency for the loads to those cache lines
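To make the idea of cache lines, offsets, readers, and writers concrete, here is a minimal Python sketch that groups sampled memory accesses by cache line and flags lines that look like false-sharing candidates. It is not how AMD uProf implements the analysis: the 64-byte line size, the sample tuple layout, and the flagging rule (more than one offset, more than one thread, at least one store) are all assumptions made for illustration.

```python
from collections import defaultdict

CACHE_LINE_SIZE = 64  # assumption: 64-byte cache lines


def split_address(addr):
    """Return (cache line base address, offset within the line)."""
    return addr & ~(CACHE_LINE_SIZE - 1), addr & (CACHE_LINE_SIZE - 1)


def potential_false_sharing(samples):
    """Group samples by cache line and flag candidate lines.

    samples: iterable of hypothetical (tid, address, is_store) tuples.
    Returns {line_base: {offset: {"readers": set, "writers": set}}} for
    lines with more than one accessed offset, more than one accessing
    thread, and at least one store.
    """
    lines = defaultdict(
        lambda: defaultdict(lambda: {"readers": set(), "writers": set()})
    )
    for tid, addr, is_store in samples:
        base, offset = split_address(addr)
        lines[base][offset]["writers" if is_store else "readers"].add(tid)

    flagged = {}
    for base, offsets in lines.items():
        threads = set()
        has_writer = False
        for access in offsets.values():
            threads |= access["readers"] | access["writers"]
            has_writer = has_writer or bool(access["writers"])
        if len(offsets) > 1 and len(threads) > 1 and has_writer:
            flagged[base] = {off: dict(acc) for off, acc in offsets.items()}
    return flagged


if __name__ == "__main__":
    demo = [(1, 0x1000, True), (2, 0x1008, False), (3, 0x2000, False)]
    for base, offsets in potential_false_sharing(demo).items():
        print(hex(base), offsets)
```

In the demo, threads 1 and 2 touch different offsets of the same line and one of them writes, so that line is flagged; the line touched only by thread 3 is not.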
The following IBS OP derived metrics are used to generate false cache sharing report.
IBS Op Metric |
Description |
|---|---|
|
Total Loads and stores sampled |
|
Total Loads |
|
Total Stores |
|
Accumulated load latencies for the loads to cache lines |
|
Load operations hit in data cache or L2 cache |
|
Loads that were serviced from the local cache (L3) and the cache hit state was |
|
Loads that were serviced from the local cache (L3) and the cache hit state was Owned |
|
Loads that were missed in local cache (L3) and serviced by remote cache, local, or remote DRAM |
|
Loads that were serviced from the remote cache (L3) and the cache hit state was |
|
Loads that were serviced from the remote cache (L3) and the cache hit state was |
|
Loads that hit in local memory (Memory channels attached to local socket or local CCD) |
|
Loads that hit in remote memory (Memory channels attached to remote socket or other CCDs in the local socket) |
|
Store operations missed in data cache |
Configuring and Starting Profile
To perform cache analysis, complete the following steps:
1. Select the profile target.
2. Select the Cache Analysis profile type in the Predefined Configs tab.
3. Start the profile.
Analyzing the Report
After the profile completes, navigate to the Cache Analysis page in the MEMORY tab to analyze the profile data. This page shows the cache lines and their offsets with the associated metric values:
Figure 7.46 Cache Analysis#
The Cache Analysis screen has the following options:
Group By drop-down decides how the cache-line samples are grouped in the detailed table. It has the option Cache Line Offset.
ValueType drop-down allows you to show the value in sample count.
The CLI has a config type called memory for collecting the cache analysis data. Run the following command to collect the profile data:
$ AMDuProfCLI collect --config memory -o /tmp/cache_analysis <target app>
This command launches the program and collects the profile data required to generate the cache analysis report. The raw profile data file is created in the /tmp/cache_analysis/AMDuProf-IBS_<timestamp>/ directory.
Report Generation and Analysis
Use the following CLI command to generate the cache analysis report.
$ AMDuProfCLI report -i /tmp/cache_analysis/AMDuProf-IBS_<timestamp>/
This generates a CSV report in /tmp/cache_analysis/AMDuProf-IBS_<timestamp>/report.csv with the following sections:
SHARED DATA CACHELINE SUMMARY: Lists the summary values of all the metrics.
SHARED DATA CACHELINE REPORT: Lists the cache lines and the associated summary values of the metrics.
SHARED DATA CACHELINE DETAIL REPORT: Lists the following:
The cache lines having a potential false sharing
Offsets where those accesses occur, readers and writers to those offsets
PID, TID, Function Name, Source File, and Line Number for those reader and writers
Load latency for the loads to those cache lines
Supported metrics
The following figure shows the Cache Analysis summary sections.
Figure 7.47 Cache Analysis - Summary Sections#
The following figure shows the Cache Analysis detailed report.
Figure 7.48 Cache Analysis - Detailed Report#
Use any of the listed metric options with the following option (for example, --sort-by event=ldst-count) to change the sort order during report generation.
--sort-by event=<METRIC>
Sort-by Metric Options |
Description |
|---|---|
|
Total Loads and stores sampled |
|
Total Loads |
|
Total Stores |
|
Loads that were serviced either from the local or remote cache (L3) and the cache hit state was Modified. |
|
Loads that were serviced from the local cache (L3) and the cache hit state was Modified. |
|
Loads that were serviced from the remote cache (L3) and the cache hit state was Modified. |
|
Loads that hit in local memory (memory channels attached to local socket or local CCD). |
|
Loads that hit in remote memory (memory channels attached to remote socket or other CCDs in the local socket). |
|
Loads that are missed in local cache (L3) and serviced by remote cache, local, or remote DRAM. |
|
Store operations missed in data cache. |
Note
You can also use the command info --list cacheline-events for a list of supported metrics for sort-by option.
AMD Zen4 processors support the Last Branch Record (LBR) CPU feature, which is useful for branch analysis. Use the uProf CLI to collect the data and generate the branch analysis report.
A PMC event must be enabled for LBR sample collection. If no PMC event is passed, the PMCX0C0 event is enabled during LBR sample collection.
CLI
Collect the LBR info.
$ AMDuProfCLI collect --branch-filter -o /tmp/ ./ScimarkStable/scimark2_64static
Generate branch analysis report.
$ AMDuProfCLI report --detail -i /tmp/AMDuProf-scimark2_64static-Custom_mmm-dd-yyyy_hh-mm-ss
The report generated contains a section for branch analysis. Here is a sample screenshot of the Branch Analysis Summary.
Figure 7.49 Branch Analysis Summary#
Branch analysis has the following limitations:
Branch analysis is supported only on Linux platform.
Branch analysis is not supported for Java apps.
The branch analysis summary table comprises the following columns:
Column |
Description |
|---|---|
MISPREDICT (%) |
Indicates the percentage of samples for the branch that were mispredicted. Calculated as: (MISPREDICT COUNT * 100)/SAMPLES |
MISPREDICT COUNT |
Shows the number of mispredicted branch samples collected for the branch. |
OVERHEAD (%) |
Indicates which branches were taken most often. Calculated as: (SAMPLES * 100)/(Total SAMPLES). |
PROCESS |
Shows the name and PID of the process. |
SAMPLES |
Shows the number of samples collected for the branch. This does not indicate the actual branches taken. |
SOURCE FUNCTION |
Shows the function from where the branch was taken. |
SOURCE LINE |
Shows the file path and line number (from where the branch was taken) of the SOURCE FUNCTION. |
SOURCE MODULE |
Shows the module name of the SOURCE FUNCTION. |
TARGET FUNCTION |
Shows the function into which the branch was taken. |
TARGET LINE |
Shows the file path and line number (into which the branch was taken) of the TARGET FUNCTION. |
TARGET MODULE |
Shows the module name of the TARGET FUNCTION. |
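The two percentage columns can be reproduced from the raw sample counts. The following Python sketch applies the MISPREDICT (%) and OVERHEAD (%) formulas from the table to a few made-up rows; the function names and the sample data are hypothetical and only serve to show the arithmetic.

```python
def mispredict_pct(mispredict_count, samples):
    # MISPREDICT (%) = (MISPREDICT COUNT * 100) / SAMPLES
    return mispredict_count * 100 / samples if samples else 0.0


def overhead_pct(samples, total_samples):
    # OVERHEAD (%) = (SAMPLES * 100) / (Total SAMPLES)
    return samples * 100 / total_samples if total_samples else 0.0


# Hypothetical rows: (source function, target function, samples, mispredicts)
rows = [
    ("main", "compute", 600, 30),
    ("compute", "helper", 300, 3),
    ("helper", "compute", 100, 0),
]

total = sum(r[2] for r in rows)
table = [
    (src, dst, overhead_pct(s, total), mispredict_pct(m, s))
    for src, dst, s, m in rows
]

for src, dst, overhead, mispredict in table:
    print(f"{src} -> {dst}: overhead {overhead:.1f}%, mispredict {mispredict:.1f}%")
```

Because SAMPLES counts collected samples rather than actual branches taken, both percentages are statistical estimates.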
Time based profiling can be performed on all the supported Host and Guest VMs, whereas the hardware counter profiling is completely dependent on the vPMUs exposed by the hypervisor.
This feature supports profiling of KVM guest OS kernel and kernel modules (*.ko) from the host. The following features are supported:
Collection of PMU samples on guest OS
Profiling of guest OS and/or host OS
System wide profiling to profile KVM-guest and other running processes.
The following features are not supported:
Call stack
Attach to process
Launch application
Before beginning the profiling on the guest OS, the following files must be copied on the host machine to facilitate symbol resolution for the guest VMs:
Copy /proc/kallsyms and /proc/modules from the guest OS to the host machine.
Copy guest vmlinux and kernel sources in a folder on the host system.
These files should belong to the guest VM whose PID is provided as an argument to the --kvm-guest option.
AMD uProf CLI contains the following options to support the guest OS profiling from the host OS:
$ ./AMDuProfCLI collect [--kvm-guest <pid>] [--guest-kallsyms <path>] [--guest-modules <path>]
[--guest-search-path <path>] ....
The following table lists the collect command options applicable to guest OS profiling.
Arguments |
Option |
Description |
|---|---|---|
|
PID of |
Collect guest-side performance profile. This option collects KVM guest symbols information. |
|
Path of guest vmlinux and kernel sources copied on local host. |
Guest OS vmlinux and search directory. AMD uProf reads it to resolve the guest kernel module information. You can copy it from the guest OS. |
|
Path of |
Copy of the guest OS /proc/modules file. AMD uProf reads it to get the guest kernel module information. You can copy it from the guest OS. |
|
Path of |
Copy of the guest OS /proc/kallsyms file. AMD uProf reads it to get guest kernel symbols. You can copy it from the guest OS. |
Examples
Get the kvm guest OS PID.
$ ps aux | grep kvm
Collecting pmcx76 event data for 10 secs (for guest kallsyms and guest kernel modules).
$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -o /tmp/cpuprof-76-guest-only -d 10 --kvm-guest 2444 --guest-kallsyms /home/amd/guest/guest-kallsyms --guest-modules /home/amd/guest/guest-module
Generate report from the collected data.
Collecting pmcx76 event data for 10 secs (for guest kallsyms).
$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -o /tmp/cpuprof-76-guest-only -d 10 --kvm-guest 2444 --guest-kallsyms /home/amd/guest/guest-kallsyms
Generate report from the collected data.
$ ./AMDuProfCLI report -i /tmp/cpuprof-76-guest-only/AMDuProf-SWP-EBP_Nov-08-2021_15-00-33
Collecting system-wide samples for pmcx76 event data for 10 secs (for guest kallsyms and guest kernel modules).
$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -o /tmp/cpuprof-76-guest-only -d 10 --kvm-guest 2444 --guest-kallsyms /home/amd/guest/guest-kallsyms --guest-modules /home/amd/guest/guest-module -a
Generate report from the collected data.
$ ./AMDuProfCLI report -i /tmp/cpuprof-76-guest-only/AMDuProf-SWP-EBP_Nov-08-2021_15-00-33
Collecting system-wide samples for pmcx76 event data for 10 secs (for guest kallsyms).
$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -o /tmp/cpuprof-76-guest-only -d 10 --kvm-guest 2444 --guest-kallsyms /home/amd/guest/guest-kallsyms -a
Generate report from the collected data.
$ ./AMDuProfCLI report -i /tmp/cpuprof-76-guest-only/AMDuProf-SWP-EBP_Nov-08-2021_15-00-33
The OpenMP API uses the fork-join model of parallel execution. The program starts with a single master thread to run the serial code. When a parallel region is encountered, multiple threads perform the implicit or explicit tasks defined by the OpenMP directives. At the end of that parallel region, the threads join at the barrier and only the master thread continues to execute.
When the threads execute the parallel region code, they should utilize all the available CPU cores, and CPU utilization should be maximized. However, threads can wait without doing useful work for several reasons:
Idle: The thread is in an inactive state, waiting for an event to occur.
Sync: If locks are used inside the parallel region, threads can wait on synchronization locks to acquire the shared resource. This also includes threads waiting at an implicit barrier.
Overhead: The thread management overhead.
OpenMP analysis traces the activities performed by OpenMP threads and their states, and provides a thread state timeline for parallel regions to help analyze performance issues. Use the Parallel Strong Scaling Metrics (MPI + OpenMP) to quantify and decompose scalability losses in OpenMP (and MPI) applications.
Parallel Region Aggregation
A parallel region can be executed multiple times during runtime. Reporting all the instances separately might result in a lengthy report making it difficult to analyze the data. AMD uProf aggregates data of multiple instances of the same parallel region and shows it as a single entry for better analysis.
Support Matrix
The following table shows the support matrix:
Component |
Supported Versions |
Languages |
|---|---|---|
OpenMP Spec |
OpenMP v5.0 |
C and C++ |
Compiler |
LLVM 8 and later |
C and C++ |
Compiler |
AOCC 2.1 and later |
C, C++, and Fortran |
Compiler |
ICC 2025.0.4 |
C, C++, and Fortran |
Compiler |
GCC 7 and later |
C, C++, and Fortran |
OS |
Ubuntu 18.04 LTS and later |
C, C++, and Fortran |
OS |
RHEL 8.6 and 9 |
C, C++, and Fortran |
OS |
CentOS 8.4 |
C, C++, and Fortran |
Prerequisite
Compile the OpenMP application using a supported compiler (on a supported platform) with the required compiler options to enable OpenMP.
Complete the following steps to start profiling:
Click Profile an Application on the Welcome page.
Provide application path, application options, working directory, and environment variables, if any. Click Next.
Select at least one supported predefined Configuration such as TBP/EBP/IBS along with any desired configuration and click Advanced Options.
In the OpenMP Tracing Options pane, turn on the Enable OpenMP Tracing option.
Figure 7.50 Enable OpenMP Tracing#
Select the Select OpenMP Trace Implementation type. Choose:
ompt (default option) for tracing of OpenMP libraries supporting OMPT interface (example: LLVM, AOCC, ICC).
omplib for tracing GCC OpenMP library.
If you have selected ompt, next Select OpenMP Tracing Mode. Choose:
full for tracing all the OpenMP events.
basic for basic tracing, where synchronization related OpenMP events are not traced to reduce the disk space usage.
Click Start Profile to start the profiling.
Command to collect basic trace info of an OpenMP application supporting OMPT interface:
$ AMDuProfCLI collect --trace openmp --openmp-impl ompt --openmp-scope basic -o /tmp/myapp_perf <openmp-app>
Command to profile an OpenMP application compiled with GCC OpenMP library:
$ AMDuProfCLI collect --trace openmp --openmp-impl omplib -o /tmp/myapp_perf <openmp-app>
Use the --openmp-impl option to provide OpenMP implementation type: ompt for tracing of OpenMP libraries supporting OMPT interface (example: LLVM, AOCC, ICC), omplib for tracing GCC OpenMP library. If --openmp-impl is not specified, the default selection is ompt.
Use --openmp-scope option to provide tracing scope: full for tracing all the OpenMP events, basic for basic tracing, where synchronization related OpenMP events are not traced to reduce the disk space usage. If --openmp-scope is not specified, the default selection is basic.
Note
This option is only applicable with --openmp-impl ompt.
While performing regular profiling, add the options --trace openmp --openmp-impl <ompt | omplib> to enable OpenMP profiling. This launches the program and collects the profile data required to generate the OpenMP analysis report.
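When collect runs are scripted, the option rules described above can be checked before the profiler is launched. The following Python sketch assembles the argument list for an OpenMP tracing run and rejects --openmp-scope when the implementation is not ompt. The helper name, its parameters, and the application path are hypothetical; only the flags and their allowed values are taken from the text above.

```python
import shlex

VALID_IMPLS = ("ompt", "omplib")
VALID_SCOPES = ("full", "basic")


def openmp_collect_args(app, output_dir, impl="ompt", scope=None, cli="AMDuProfCLI"):
    """Build an argument list for an OpenMP tracing collect run.

    Hypothetical helper; it only assembles options named in this section.
    """
    if impl not in VALID_IMPLS:
        raise ValueError(f"unknown --openmp-impl value: {impl}")
    if scope is not None:
        if scope not in VALID_SCOPES:
            raise ValueError(f"unknown --openmp-scope value: {scope}")
        if impl != "ompt":
            # --openmp-scope is only applicable with --openmp-impl ompt.
            raise ValueError("--openmp-scope applies only with --openmp-impl ompt")

    args = [cli, "collect", "--trace", "openmp", "--openmp-impl", impl]
    if scope is not None:
        args += ["--openmp-scope", scope]
    args += ["-o", output_dir, app]
    return args


if __name__ == "__main__":
    # Print the command line instead of running it.
    print(shlex.join(openmp_collect_args("./my_omp_app", "/tmp/myapp_perf", scope="basic")))
```

The resulting list can be passed to subprocess.run on a machine where AMDuProfCLI is installed; the sketch only prints the command so that it can be inspected first.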
Once profile data collection is complete, a session directory is generated. Use the session directory to generate the CSV report or to import the session into the GUI.
For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.
If the data is collected using the CLI, use Import Session to import the session into the GUI for analysis. OpenMP trace data can be collected on Linux, and the session can then be imported into the GUI or CLI on Windows.
Analyzing the GUI Views
After the session is opened, navigate to the HPC page to analyze the OpenMP tracing data. You can use the left side vertical pane on this page to navigate through the following views:
Overview shows the quick details about the runtime. The following image shows the Overview page.
Figure 7.51 HPC - Overview#
Overview has the following fields:
Profile Time: Total profile duration.
Time Inside Parallel Region: Total time spent inside parallel regions.
Time Outside Parallel Region: Total time spent outside parallel regions.
Parallel Time %: Percentage of total time spent inside parallel regions.
Total threads created: Total number of threads created.
CPU Time: Total time spent utilizing the CPU.
Work Time: Time spent working in the parallel region.
Sync Time: Total time spent at barrier, lock contention and other synchronization activities.
Barrier/Wait Time: Time spent waiting at explicit or implicit barrier by the threads.
Lock Contention Time: Time spent by threads waiting on locks or ordered parallel loops.
Other Sync Time: Time spent in synchronization activities other than waiting at barriers or for locks.
Overhead Time: Total time spent in atomic, reduction and other thread management operations.
Reduction Time: Time spent on reduction operations.
Atomic Time: Time spent on performing atomic operations.
Other Overhead Time: Time spent on other thread management overhead.
Idle Time: Time spent in an inactive state, waiting for an event to occur.
OpenMP Parallel Regions shows the summary of all the parallel regions. This tab is useful to quickly understand which parallel region might be load imbalanced, i.e., a region with less total work time with respect to its total elapsed time. Double-click on the region names to open the region-wise thread details page.
Figure 7.52 OpenMP Parallel Regions#
OpenMP Parallel Regions has the following columns:
Avg Idle Time (secs): Average time spent by the parallel region threads in an inactive state, waiting for an event to occur. It is computed as the difference between the total elapsed time of a parallel region and the sum of Avg Sync Time, Avg Overhead Time and Avg Work Time of that parallel region.
Avg Sync Time (secs): Average time spent by the parallel region threads waiting on the synchronization locks to acquire the shared resource and waiting in barriers. It is computed as the sum of sync time of all the threads within the parallel region divided by the thread count of that parallel region.
Avg Overhead Time (secs): Average time spent in atomic, reduction and other thread management operations. It is computed as the sum of overhead time of all the threads within the parallel region divided by the thread count of that parallel region.
Avg Work Time (secs): Average time spent by the parallel region threads working. It is computed as the sum of work time of all the threads within the parallel region divided by the thread count of that parallel region.
Total Elapsed Time (secs): Time spent in the parallel region.
Thread Count: Number of threads in the parallel region.
Instance Count: Number of instances of the parallel region executed.
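The relationship between these columns can be sketched in a few lines of Python. The function below follows the column definitions above: each average is the per-thread sum divided by the thread count, and Avg Idle Time is the elapsed time minus the other three averages. The function name, the input layout, and the numbers are assumptions for illustration and do not come from AMD uProf.

```python
def region_metrics(elapsed, sync_times, overhead_times, work_times):
    """Compute per-region averages from per-thread times (all in seconds).

    Each *_times list holds one entry per thread of the parallel region.
    """
    thread_count = len(work_times)
    avg_sync = sum(sync_times) / thread_count
    avg_overhead = sum(overhead_times) / thread_count
    avg_work = sum(work_times) / thread_count
    # Avg Idle Time: elapsed time minus the sum of the other averages.
    avg_idle = elapsed - (avg_sync + avg_overhead + avg_work)
    return {
        "avg_idle": avg_idle,
        "avg_sync": avg_sync,
        "avg_overhead": avg_overhead,
        "avg_work": avg_work,
        "thread_count": thread_count,
    }


if __name__ == "__main__":
    # Hypothetical two-thread region with 10 seconds of elapsed time.
    print(region_metrics(10.0, [1.0, 3.0], [0.5, 0.5], [8.0, 6.0]))
```

A region whose average work time is small relative to its elapsed time is a candidate for load-imbalance investigation.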
Region-Wise Thread Details page helps in understanding how each thread spent its time in the parallel region. If a thread spends too much time on non-work activity, the parallel region should be optimized further to improve the work time of each thread in that region. It has the following columns:
Thread No.: Serial number of the thread.
Thread Id: Thread identifier.
Total Idle Time (secs): Time spent by the thread in an inactive state, waiting for an event to occur.
Total Sync Time (secs): Time spent by the thread waiting on the synchronization locks to acquire the shared resource.
Total Overhead Time (secs): Time spent by the thread in atomic, reduction and other thread management operations.
Total Work Time (secs): Time spent by the thread working.
You can generate a CSV report using the AMDuProfCLI report command. No additional option is required for OpenMP report generation. AMD uProf checks for the availability of any OpenMP profiling data and includes it in the report, if available.
The following command generates a CSV report in /tmp/myapp_perf/<SESSION-DIR>/report.csv:
$ ./AMDuProfCLI report -i /tmp/myapp_perf/<SESSION-DIR>
Note
If tracing is performed on a cluster, provide the --host all option to correctly report OpenMP data for all the hosts.
An example of the OpenMP report section in the CSV file is given here:
Figure 7.53 Sample OpenMP Report#
Analyzing the OpenMP Report
The OpenMP report includes the following sections:
OpenMP OVERVIEW has the following fields:
Profile Time (s): Total profile duration.
Time Inside Parallel Region (s): Total time spent inside parallel regions.
Time Outside Parallel Region (s): Total time spent outside parallel regions.
Parallel Region Time %: Percentage of total time spent inside parallel regions.
Total threads created: Total number of threads created.
CPU Time (seconds): Total time spent utilizing the CPU.
Work Time: Time spent working in the parallel region.
Sync Time: Total time spent at barrier, lock contention and other synchronization activities.
Barrier/Wait Time: Time spent waiting at explicit or implicit barrier by the threads.
Lock Contention Time: Time spent by threads waiting on locks or ordered parallel loops.
Other Sync Time: Time spent in synchronization activities other than waiting at barriers or for locks.
Overhead Time: Total time spent in atomic, reduction and other thread management operations.
Reduction Time: Time spent on reduction operations.
Atomic Time: Time spent on performing atomic operations.
Other Overhead Time: Time spent on other thread management overhead.
Idle Time (seconds): Time spent in an inactive state, waiting for an event to occur.
OpenMP PARALLEL-REGION METRIC helps in understanding the imbalanced region, that is, a region with less total work time with respect to its total elapsed time. It has the following columns:
Avg Idle Time (s): Average time spent by the parallel region threads in an inactive state, waiting for an event to occur. It is computed as the difference between the total elapsed time of a parallel region and the sum of Avg Sync Time, Avg Overhead Time and Avg Work Time of that parallel region.
Avg Sync Time (s): Average time spent by the parallel region threads waiting on the synchronization locks to acquire the shared resource. It is computed as the sum of sync time of all the threads within the parallel region divided by the thread count of that parallel region.
Avg Overhead Time (s): Average time spent in atomic, reduction and other thread management operations. It is computed as the sum of overhead time of all the threads within the parallel region divided by the thread count of that parallel region.
Avg Work Time (s): Average time spent by the parallel region threads working. It is computed as the sum of work time of all the threads within the parallel region divided by the thread count of that parallel region.
Total Elapsed Time (s): Time spent in the parallel region.
Thread Count: Number of threads in the parallel region.
Instance Count: Number of instances of the parallel region executed.
OpenMP THREAD METRIC helps in understanding how each thread spent its time in the parallel region. If a thread spends too much time on non-work activity, the parallel region should be optimized further to improve the work time of each thread in that region. It has the following columns:
ThreadNum: Serial number of the thread.
ThreadId: Thread identifier.
Total Idle Time (s): Time spent by the thread in an inactive state, waiting for an event to occur.
Total Sync Time (s): Time spent by the thread waiting on the synchronization locks to acquire the shared resource.
Total Overhead Time (s): Time spent by the thread in atomic, reduction and other thread management operations.
Total Work Time (s): Time spent by the thread working.
AMDUPROF_MAX_PR_INSTANCES – Set the max number of parallel regions to be traced. The default value is 2000000.
Note
Tracing a smaller number of parallel regions may result in less accurate timing details.
The following features are not supported in this release:
OpenMP profiling with system-wide profiling scope.
Nested parallel regions.
GPU offloading and related constructs.
Callstack for individual OpenMP threads.
OpenMP profiling on Windows and FreeBSD platforms.
Profiling applications with static linkage of OpenMP libraries.
omp_control_tool routine to start/pause/end OpenMP tracing is currently not supported for GCC compiled applications.
While tracing LLVM/AOCC compiled applications, calling the omp_control_tool routine to start/pause/end OpenMP tracing might not work as expected if it is the very first OpenMP routine in the application. It is recommended to call an entry point such as omp_get_max_threads() or omp_get_num_procs(), or to execute a construct such as #pragma omp parallel, before calling the omp_control_tool routine.
Attaching to running OpenMP application.
Profiling applications compiled with GCC OpenMP library and using multiple non-OpenMP threads.
Tracing of target, teams, and taskloop constructs is not supported for applications compiled with the GCC OpenMP library.
MPI trace analysis can be used to analyze and compute the message passing load imbalance among the ranks of an MPI application running on a cluster. It supports OpenMPI, MPICH, and their derivatives.
The supported thread models are MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, and MPI_THREAD_SERIALIZED. The profile reports are generated for Point-to-Point and Collective API activity summary.
Fortran bindings are configured and built while compiling the MPI implementations. You can enable or disable the Fortran bindings based on your need for Fortran language support.
Refer to the following options to enable or disable the Fortran bindings:
OpenMPI
--enable-mpi-fortran[=VALUE]
--disable-mpi-fortran
By default, OpenMPI will attempt to build all the 3 Fortran bindings: mpif.h, mpi module, and mpi_f08 module.
MPICH
--disable-fortran
By default, the Fortran bindings are enabled. You can use this option to disable it.
MPI Trace Support Matrix
Component |
Supported Versions |
|---|---|
MPI Spec |
MPI v3.1 or later |
MPI Libraries |
|
Operating System |
|
Languages |
C, C++, FORTRAN |
MPI Implementation Support
AMD uProf supports tracing of Open MPI and MPICH and the derivatives:
--trace mpi --mpi-impl mpich for MPICH and derivatives (default option)
--trace mpi --mpi-impl openmpi for Open MPI
Ensure that the correct option (mpich or openmpi) is passed depending on the MPI implementation used for compiling the MPI application. Passing incorrect option might cause undefined behavior.
Tracing Modes
The AMDuProf CLI supports the following 2 modes for MPI tracing:
LWT – Light-weight tracing is useful for quick analysis of an application. The report gets generated in .csv format on-the-fly during collection stage. Use CLI option --mpi-scope lwt for LWT mode.
FULL – Full tracing is useful for in-depth analysis. This mode requires post-processing for report generation in .csv format. Use CLI option --mpi-scope full for FULL mode.
For more information about MPI tracing options, refer to Linux Specific Options. For detailed analysis of parallel scalability bottlenecks, see Parallel Strong Scaling Metrics (MPI + OpenMP).
In LWT mode, a quick report gets generated during collection stage. This mode supports a limited set of APIs for tracing as listed in the following table. The LWT report gives an overview of the application runtime activity.
Sl.No |
API |
Sl.No |
API |
Sl.No |
API |
|---|---|---|---|---|---|
1 |
MPI_Bsend |
21 |
MPI_Ssend |
41 |
MPI_Ibcast |
2 |
MPI_Recv_init |
22 |
MPI_Iallreduce |
42 |
MPI_Waitall |
3 |
MPI_Bcast |
23 |
MPI_Reduce_scatter |
43 |
MPI_Mrecv |
4 |
MPI_Ireduce_scatter |
24 |
MPI_Irecv |
44 |
MPI_Alltoallv |
5 |
MPI_Bsend_init |
25 |
MPI_Ssend_init |
45 |
MPI_Igather |
6 |
MPI_Rsend |
26 |
MPI_Ialltoall |
46 |
MPI_Waitany |
7 |
MPI_Gather |
27 |
MPI_Scan |
47 |
MPI_Probe |
8 |
MPI_Iscan |
28 |
MPI_Irsend |
48 |
MPI_Alltoallw |
9 |
MPI_Ibsend |
29 |
MPI_Allgather |
49 |
MPI_Igatherv |
10 |
MPI_Rsend_init |
30 |
MPI_Ialltoallv |
50 |
MPI_Waitsome |
11 |
MPI_Gatherv |
31 |
MPI_Scatter |
51 |
MPI_Recv |
12 |
MPI_Iscatter |
32 |
MPI_Isend |
52 |
MPI_Barrier |
13 |
MPI_Improbe |
33 |
MPI_Allgatherv |
53 |
MPI_Ireduce |
14 |
MPI_Send |
34 |
MPI_Ialltoallw |
||
15 |
MPI_Iallgather |
35 |
MPI_Scatterv |
||
16 |
MPI_Iscatterv |
36 |
MPI_Issend |
||
17 |
MPI_Imrecv |
37 |
MPI_Ibarrier |
||
18 |
MPI_Send_init |
38 |
MPI_Wait |
||
19 |
MPI_Reduce |
39 |
MPI_Mprobe |
||
20 |
MPI_Iprobe |
40 |
MPI_Alltoall |
Collect Profile Data
Example of a command to LWT trace an MPI application using AMDuProfCLI:
$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl mpich --mpi-scope lwt -o <output_directory> <application>
After completing the tracing, the path to the session directory is displayed on the terminal. LWT report is generated immediately after completing the collection and saved as a .csv file in the session directory: <output_directory>/<SESSION_DIR>/mpi/lwt/mpi-summary.csv.
Pass the MPI implementation (MPICH or Open MPI) in the command; MPICH is the default. The following are sample commands:
$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl openmpi --mpi-scope lwt -o <output_directory> <application>
$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl mpich --mpi-scope lwt -o <output_directory> <application>
Ensure that the correct option (mpich or openmpi) is passed depending on the MPI implementation used for compiling the MPI application. Passing an incorrect option might cause undefined behavior.
An example of the LWT report section in the .csv file is as follows:
Figure 7.54 LWT Report Example#
Full tracing mode traces more APIs than LWT tracing. For a complete list of APIs, refer to List of Supported MPI APIs for Full Tracing. This mode is helpful for in-depth analysis of MPI application activity.
The report file for the full tracing includes multiple tables to represent the following details.
MPI communicator summary table consists of the following columns:
Ranks: Member rank IDs.
Communicator Size: Number of the member ranks.
Elapsed Time: Time spent by the MPI APIs in the communicator.
MPI rank summary table consists of the following columns:
Rank: Rank ID.
PID: Process ID.
MPI Time (seconds): Total time spent on the MPI APIs.
MPI Time (%): Percentage of MPI Time with respect to the total MPI time of all the ranks.
Wait Time (seconds): Time spent by the rank waiting.
Wait Time (%): Percentage of the rank wait time with respect to the application runtime.
Call Count: Number of times MPI APIs are called.
Volume (bytes): Volume of data in bytes sent or received.
Volume (%): Percentage of volume with respect to the total volume sent or received by all the ranks.
Elapsed Time (seconds): Application runtime.
Time (%): Percentage of elapsed time with respect to the total elapsed time.
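The percentage columns of the rank summary follow directly from the per-rank totals. The sketch below is only an illustration of those definitions: the `rank_summary` helper, its dictionary keys, and the input numbers are hypothetical and are not part of AMD uProf.

```python
def rank_summary(ranks, app_runtime_s):
    """Derive the percentage columns of a rank summary from raw per-rank totals.

    ranks: list of dicts with keys "rank", "mpi_time_s", "wait_time_s",
           "volume_bytes" (hypothetical field names).
    app_runtime_s: application runtime in seconds.
    """
    total_mpi = sum(r["mpi_time_s"] for r in ranks)
    total_vol = sum(r["volume_bytes"] for r in ranks)
    rows = []
    for r in ranks:
        rows.append({
            "rank": r["rank"],
            # MPI Time (%): share of the total MPI time of all the ranks
            "mpi_time_pct": 100.0 * r["mpi_time_s"] / total_mpi if total_mpi else 0.0,
            # Wait Time (%): wait time relative to the application runtime
            "wait_time_pct": 100.0 * r["wait_time_s"] / app_runtime_s if app_runtime_s else 0.0,
            # Volume (%): share of the total volume of all the ranks
            "volume_pct": 100.0 * r["volume_bytes"] / total_vol if total_vol else 0.0,
        })
    return rows


# Hypothetical values for two ranks
ranks = [
    {"rank": 0, "mpi_time_s": 3.0, "wait_time_s": 1.0, "volume_bytes": 600},
    {"rank": 1, "mpi_time_s": 1.0, "wait_time_s": 0.5, "volume_bytes": 200},
]
for row in rank_summary(ranks, app_runtime_s=10.0):
    print(row)
```

A rank with a large MPI Time (%) but a small Volume (%) is spending its MPI time on something other than moving data, which usually makes Wait Time (%) the next column to check.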
MPI function summary table consists of the following columns:
Function: MPI API name.
PID: Process ID.
Min Time (seconds): Minimum time spent in this API across all the ranks.
Max Time (seconds): Maximum time spent in this API across all the ranks.
Average Time (seconds): Average time spent on the API per rank.
MPI Time (%): Percentage of the time spent on this API with respect to the total time spent on all the MPI APIs.
Volume (bytes): Volume of data in bytes sent or received.
Call Count: Number of times MPI APIs are called.
Total Time (seconds): Total time spent in the API in all the ranks.
Refer to the following list of non-collective communication APIs:
Communication Matrix consists of the following columns:
Rank: Directed arrow from Sender rank ID to Receiver rank ID.
MPI Time (seconds): Total time spent on the APIs sending data from the sender rank to the receiver rank.
MPI Time (%): Percentage of MPI time with respect to the total MPI Time spent on all the APIs.
Volume (bytes): Total volume of data sent from the sender rank to the receiver rank.
Volume (%): Percentage of volume with respect to the total volume transferred between all the ranks.
Transfer: Number of transfers from the sender rank to the receiver rank.
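To make the Communication Matrix columns concrete, the following sketch aggregates a list of point-to-point transfers into per sender/receiver totals. It assumes a hypothetical input format of `(sender, receiver, bytes)` tuples; the `communication_matrix` helper is illustrative and does not reflect how AMD uProf stores or computes this data.

```python
from collections import defaultdict


def communication_matrix(transfers):
    """Aggregate transfers into per (sender, receiver) totals.

    transfers: iterable of (sender_rank, receiver_rank, nbytes) tuples.
    Returns {(sender, receiver): {"volume", "volume_pct", "transfers"}}.
    """
    volume = defaultdict(int)
    count = defaultdict(int)
    for sender, receiver, nbytes in transfers:
        volume[(sender, receiver)] += nbytes
        count[(sender, receiver)] += 1
    total = sum(volume.values())
    return {
        pair: {
            "volume": vol,  # Volume (bytes)
            "volume_pct": 100.0 * vol / total if total else 0.0,  # Volume (%)
            "transfers": count[pair],  # Transfer
        }
        for pair, vol in volume.items()
    }


# Hypothetical transfers between three ranks
transfers = [(0, 1, 100), (0, 1, 300), (1, 0, 400), (2, 0, 200)]
matrix = communication_matrix(transfers)
for (sender, receiver), cell in sorted(matrix.items()):
    print(f"{sender} -> {receiver}: {cell}")
```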
Collective Events Summary table consists of the following columns:
Function: API name.
Min Time (seconds): Minimum time spent on this API across all ranks.
Max Time (seconds): Maximum time spent on this API across all ranks.
Average Time (seconds): Average time spent on this API per rank.
Input Volume (Bytes): Total data in bytes received by all the ranks involved in this API call.
Output Volume (Bytes): Total data sent by all the ranks involved in this API call.
Calls: Number of times this API is called.
Total Time (seconds): Total time spent in the API in all the ranks.
Refer to List of P2P and Collective Communication APIs for a list of APIs covered for this section.
The list of supported MPI APIs is as follows:
| Sl.No | API | Sl.No | API | Sl.No | API |
|---|---|---|---|---|---|
| 1 | MPI_Pcontrol | 30 | MPI_Ssend_init | 59 | MPI_Iscatterv |
| 2 | MPI_Mrecv | 31 | MPI_Neighbor_alltoallv | 60 | MPI_Intercomm_create |
| 3 | MPI_Reduce | 32 | MPI_Ibarrier | 61 | MPI_Waitsome |
| 4 | MPI_Iallreduce | 33 | MPI_Test | 62 | MPI_Scatterv |
| 5 | MPI_Cancel | 34 | MPI_Rsend_init | 63 | MPI_Igather |
| 6 | MPI_Imrecv | 35 | MPI_Bcast | 64 | MPI_Intercomm_merge |
| 7 | MPI_Allreduce | 36 | MPI_Ibcast | 65 | MPI_Barrier |
| 8 | MPI_Ialltoall | 37 | MPI_Testall | 66 | MPI_Gather |
| 9 | MPI_Probe | 38 | MPI_Send_init | 67 | MPI_Igatherv |
| 10 | MPI_Send | 39 | MPI_Scan | 68 | MPI_Cart_create |
| 11 | MPI_Alltoall | 40 | MPI_Comm_create | 69 | MPI_Recv |
| 12 | MPI_Ialltoallv | 41 | MPI_Testany | 70 | MPI_Gatherv |
| 13 | MPI_Iprobe | 42 | MPI_Ibsend | 71 | MPI_Iallgather |
| 14 | MPI_Bsend | 43 | MPI_Reduce_scatter | 72 | MPI_Cart_sub |
| 15 | MPI_Alltoallv | 44 | MPI_Comm_dup | 73 | MPI_Irecv |
| 16 | MPI_Ialltoallw | 45 | MPI_Testsome | 74 | MPI_Allgather |
| 17 | MPI_Mprobe | 46 | MPI_Issend | 75 | MPI_Iallgatherv |
| 18 | MPI_Ssend | 47 | MPI_Ireduce_scatter | 76 | MPI_Graph_create |
| 19 | MPI_Alltoallw | 48 | MPI_Comm_dup_with_info | 77 | MPI_Sendrecv |
| 20 | MPI_Ineighbor_alltoall | 49 | MPI_Wait | 78 | MPI_Allgatherv |
| 21 | MPI_Improbe | 50 | MPI_Irsend | 79 | MPI_Ineighbor_allgather |
| 22 | MPI_Rsend | 51 | MPI_Iscan | 80 | MPI_Dist_graph_create |
| 23 | MPI_Neighbor_alltoall | 52 | MPI_Comm_split | 81 | MPI_Sendrecv_replace |
| 24 | MPI_Ineighbor_alltoallw | 53 | MPI_Waitall | 82 | MPI_Neighbor_allgather |
| 25 | MPI_Start | 54 | MPI_Isend | 83 | MPI_Ineighbor_allgatherv |
| 26 | MPI_Bsend_init | 55 | MPI_Iscatter | 84 | MPI_Dist_graph_create_adjacent |
| 27 | MPI_Neighbor_alltoallw | 56 | MPI_Comm_split_type | 85 | MPI_Recv_init |
| 28 | MPI_Ineighbor_alltoallv | 57 | MPI_Waitany | 86 | MPI_Neighbor_allgatherv |
| 29 | MPI_Startall | 58 | MPI_Scatter | 87 | MPI_Ireduce |
Collect Profile Data
The following example command performs full tracing of an MPI application using AMDuProfCLI:
$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl mpich --mpi-scope full -o <output_directory> <application>
After completing the tracing, the path to the session directory is displayed on the terminal.
Pass the MPI implementation (MPICH or Open MPI) in the command using the --mpi-impl option; MPICH is the default. The following are sample commands:
$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl openmpi --mpi-scope full -o <output_directory> <application>
$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl mpich --mpi-scope full -o <output_directory> <application>
Ensure that the correct option (mpich or openmpi) is passed depending on the MPI implementation used for compiling the MPI application. Passing an incorrect option might cause undefined behavior.
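When collection is driven from a script, it can help to reject an unexpected --mpi-impl or --mpi-scope value before anything is launched. The sketch below only assembles the command line shown above as an argument list; the `build_mpi_trace_command` helper, the application name `./my_app`, and the output path `/tmp/uprof_out` are hypothetical, and the accepted values are limited to the ones that appear in this section.

```python
import shlex

# Values taken from the sample commands in this section
VALID_IMPLS = {"mpich", "openmpi"}
VALID_SCOPES = {"lwt", "full"}


def build_mpi_trace_command(num_procs, application, output_dir,
                            mpi_impl="mpich", mpi_scope="full",
                            cli="./AMDuProfCLI"):
    """Return the mpirun argument list for an MPI tracing run."""
    if mpi_impl not in VALID_IMPLS:
        raise ValueError(f"unsupported --mpi-impl value: {mpi_impl!r}")
    if mpi_scope not in VALID_SCOPES:
        raise ValueError(f"unsupported --mpi-scope value: {mpi_scope!r}")
    return [
        "mpirun", "-np", str(num_procs),
        cli, "collect", "--trace", "mpi",
        "--mpi-impl", mpi_impl, "--mpi-scope", mpi_scope,
        "-o", output_dir, application,
    ]


# Hypothetical application and output directory
cmd = build_mpi_trace_command(4, "./my_app", "/tmp/uprof_out", mpi_impl="openmpi")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting issues in the application path.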
Generate Profile Report
The following example command generates the report in .csv format. Pass the session directory path with the -i option:
$ ./AMDuProfCLI report -i <output_directory>/<SESSION_DIR>
After completing the report generation, the report.csv file path is displayed on the terminal.
Tables in the Report file
The following screenshots show example sections of a full tracing report file:
Figure 7.55 MPI Communicator Summary Table#
Figure 7.56 MPI Rank Summary Table#
Figure 7.57 MPI Function Summary Table#
Figure 7.58 MPI Communication Matrix#
Figure 7.59 MPI Collective API Summary Table#
Collecting using CLI and Importing to GUI
Use the CLI to trace the target MPI application and to generate the report. For the steps, see MPI Full Tracing Using CLI. Import the report into the GUI as shown in the following figure to analyze the trace data:
Figure 7.60 Import Profile Session#
Analyzing MPI Communication Matrix
After the import is complete, use the MPI Communication Matrix view to analyze the MPI trace data in the GUI. Navigate to HPC > MPI Communication Matrix to view the MPI communication matrix visualizer. This view displays a rank-to-rank communication summary in matrix format. The x-axis and y-axis of the matrix represent the receiver and sender ranks, respectively.
The following figure shows the MPI communication matrix:
Figure 7.61 MPI Communication Matrix#
By default, the communication matrix appears in a zoomed-out view, displaying interactions between sender and receiver ranks. You can use the mouse wheel to zoom in and out and the scroll bar to navigate horizontally and vertically. When zoomed in, the matrix also reveals the volume of data transferred between ranks in bytes.
Legend
Ranks ordered row-wise and column-wise.
Each cell displays the total data volume transferred from one rank to another rank.
Tool-tip shows additional details when the mouse is hovered over a cell.
Color-coding legend based on data volume.
Sum of all the data transfers for the rank.
Mean of all the data transfers for the rank.
Analyzing MPI Rank Timeline
Navigate to HPC > MPI Rank Timeline to view the MPI rank timeline. This view shows the MPI activities in the timeline graph as follows:
Figure 7.62 MPI Rank Timeline#
Legend
Rank ID
Graph of one of the following, depending on the selected data source: MPI API activity (running or waiting), MPI data transfer activity (receiving or sending), or MPI APIs called.
Tool-tip shows more information about the MPI activity.
Displays the time range.
To select the data source MPI Activity. For more information, see MPI Data Source in the section MPI Full Tracing Report Visualization Using GUI.
To load more rank details.
To filter the ranks from the view.
Trace Overlay Cutoff specifies a duration in nanoseconds that acts as a cutoff for loading the trace data: any traced item that takes less than the specified duration is not displayed.
Color coding legends for data source and trace overlay.
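The effect of Trace Overlay Cutoff can be pictured as a simple duration filter. The sketch below assumes a hypothetical event representation with `start_ns` and `end_ns` fields; it illustrates the described behavior and is not the GUI's actual implementation.

```python
def apply_overlay_cutoff(events, cutoff_ns):
    """Keep only events that last at least cutoff_ns nanoseconds."""
    return [e for e in events if e["end_ns"] - e["start_ns"] >= cutoff_ns]


# Hypothetical traced events
events = [
    {"api": "MPI_Send", "start_ns": 0, "end_ns": 500},
    {"api": "MPI_Wait", "start_ns": 600, "end_ns": 5600},
    {"api": "MPI_Recv", "start_ns": 6000, "end_ns": 7000},
]
for event in apply_overlay_cutoff(events, cutoff_ns=1000):
    print(event["api"])
```

With a cutoff of 1000 ns, the 500 ns `MPI_Send` event is dropped, while the events lasting 5000 ns and exactly 1000 ns are kept.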
Analyzing MPI P2P API Summary
Navigate to HPC > MPI P2P API Summary. This view summarizes the P2P APIs called by the application as follows:
Figure 7.63 MPI P2P API Summary#
Analyzing MPI Collective API Summary
Navigate to HPC > MPI Collective API Summary. This view summarizes the collective APIs called by the application as follows:
Figure 7.64 MPI Collective API Summary#
MPI Data Source
The supported MPI data sources are as follows:
An MPI Activity that classifies MPI APIs into either waiting APIs (MPI_Barrier, MPI_Wait, MPI_Waitall, MPI_Waitany, or MPI_Waitsome) or active APIs (all the other MPI functions). MPI APIs can be classified as shown in the following three tables:
P2PSend |
P2PReceive |
Collective Communication |
|---|---|---|
|
|
|
Control API |
RequestAPI |
Communication API |
|---|---|---|
|
|
|
Topology API |
Environment API |
|---|---|
|
|
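The waiting/active split described for the MPI Activity data source amounts to a set lookup. The following sketch uses only the five waiting APIs named above; the `classify_mpi_activity` helper is a hypothetical name used for illustration.

```python
# Waiting APIs as listed for the MPI Activity data source
WAITING_APIS = frozenset({
    "MPI_Barrier", "MPI_Wait", "MPI_Waitall", "MPI_Waitany", "MPI_Waitsome",
})


def classify_mpi_activity(api_name):
    """Return "waiting" for the listed waiting APIs, otherwise "active"."""
    return "waiting" if api_name in WAITING_APIS else "active"


for api in ("MPI_Send", "MPI_Waitall", "MPI_Allreduce", "MPI_Barrier"):
    print(api, classify_mpi_activity(api))
```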
Limitations
The MPI environment parameters, such as Total number of ranks and Number of ranks running on each node, are currently supported only for Open MPI.
MPI tracing with system-wide profiling scope is not supported.
MPI profiling and MPI tracing are not supported with ProfileControlAPIs.
MPI Tracing is not supported with Hotspots, Threading, and Overview profile configs.
If you see undefined symbol errors while launching an application with AMD uProf MPI agents, it usually means the profiled application was compiled against one MPI implementation or feature set, but at runtime, the loader resolves a different or incompatible MPI library. These issues are caused by linker/loader mismatches, not by AMD uProf.
Typical console examples (user reports):
libAMDOpenMpiAgent.so: undefined symbol: mpi_fortran_statuses_ignore_
libAMDMpichAgent.so: undefined symbol: MPI_UNWEIGHTED
Common Symbols Linked to Mismatches
Open MPI (C/C++ layer)
ompi_mpi_comm_world, ompi_mpi_comm_self, ompi_mpi_group_empty
ompi_mpi_op_sum, ompi_mpi_op_max
ompi_mpi_uint64_t, ompi_mpi_int64_t, ompi_mpi_long_double, etc.
Open MPI Fortran / profiling / tools
mpi_fortran_statuses_ignore_, mpi_fortran_status_ignore_, mpi_fortran_bottom_
ompi_mpi_real8, ompi_mpi_complex16, ompi_mpi_integer8, ompi_mpi_logical
PMPI_Init, PMPI_Allreduce, MPI_T_init_thread, MPI_T_pvar_get_index, etc.
Sentinel / special constants
MPI_STATUS_IGNORE, MPI_STATUSES_IGNORE
MPI_BOTTOM (Fortran: mpi_fortran_bottom_), MPI_UNWEIGHTED, etc.
These errors occur when an application is built using one MPI implementation or version (e.g., Open MPI) but, at runtime, a different or conflicting MPI library is loaded (e.g., MPICH, an older Open MPI version, or mixed library paths). Such symbol resolution failures originate from the MPI runtime loader and are not caused by AMD uProf.
Verification Steps
Run the following before profiling/tracing to ensure consistency:
which mpirun
mpirun --version
which mpicc
mpicc -show
ldd ./<application> | grep -i mpi
env | grep -E 'MPI' # or: env | grep -E 'LD_LIBRARY_PATH|MODULE'
If Open MPI is used, optional:
ompi_info | grep -i 'Open MPI'
For MPICH:
mpichversion
Ensure only one MPI implementation module is loaded (if using environment modules).
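The `ldd` check can also be scripted. The sketch below parses `ldd` output and applies a rough heuristic: it flags the binary when the resolved library paths mention both "openmpi" and "mpich". Both helper names and the sample output are hypothetical, and the path-substring test is an assumption that will not hold for every installation layout, so treat a clean result as a hint rather than proof.

```python
def mpi_libraries(ldd_output):
    """Return {library name: resolved path or None} for MPI-looking entries."""
    libs = {}
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue
        name, _, rest = line.partition("=>")
        name = name.strip()
        if "mpi" not in name.lower():
            continue
        target = rest.strip()
        if not target or target.startswith("not found"):
            libs[name] = None  # the loader could not resolve this library
        else:
            libs[name] = target.split()[0]
    return libs


def looks_mixed(libs):
    """Heuristic (assumption): paths mention both Open MPI and MPICH."""
    paths = " ".join(p.lower() for p in libs.values() if p)
    return "openmpi" in paths and "mpich" in paths


# Hypothetical ldd output for an application with mixed MPI libraries
sample = """\
\tlibmpi.so.40 => /opt/openmpi/lib/libmpi.so.40 (0x00007f1c2a000000)
\tlibmpifort.so.12 => /usr/lib/mpich/lib/libmpifort.so.12 (0x00007f1c29c00000)
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c29800000)
"""
libs = mpi_libraries(sample)
print(libs)
print("mixed MPI libraries:", looks_mixed(libs))
```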
Recommended Remedy
Align build and run environments:
To avoid MPI-related symbol resolution issues, ensure consistency between build and run environments:
Unload conflicting modules, for example, by running module purge followed by module load openmpi/<version>.
Rebuild the application with the same MPI implementation you intend to trace.
Prefer rpath or correctly ordered LD_LIBRARY_PATH over ad‑hoc injection.
Use absolute path to the intended mpirun when multiple versions exist.
Temporary Workaround (Not Recommended Long-Term)
As a short-term workaround you can force the correct libmpi to load first:
export LD_PRELOAD=/path/to/libmpi.so
mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl mpich --mpi-scope full -o <output_directory> <application>
This resolves the undefined symbol by preloading the intended library. However, LD_PRELOAD injects a shared object ahead of normal resolution and can introduce subtle conflicts or harder debugging if symbols overlap. Replace this workaround by fixing library path consistency (proper module setup, LD_LIBRARY_PATH, or rebuilding with correct MPI).
Recommended Practices
Avoid mixing MPICH and Open MPI components in the same run.
For consistency, use the compiler wrapper, mpirun, and libmpi.so from the same installation prefix.
Record the MPI version in your profiling session notes for reproducibility.
Re-run verification after environment changes (e.g. module load/unload, container activation).
Summary
The undefined symbol errors represent runtime linkage issues. AMD uProf tracing requires a coherent MPI environment; once library consistency is ensured, tracing proceeds normally without LD_PRELOAD hacks.
Parallel Strong Scaling Metrics quantify scalability loss for hybrid MPI + OpenMP (and pure MPI or pure OpenMP) applications. Each metric is an inefficiency ratio in [0, 1]: 0 means no time lost for that cause; values near 1 indicate severe degradation. The metrics measure deviation from an ideal execution that exhibits perfect strong scaling, that is, an execution that:
Distributes work perfectly across ranks and threads.
Adds no communication delay along the critical path.
Eliminates serial (non-parallelizable) regions.
Executes OpenMP parallel regions with perfectly balanced work and zero synchronization or runtime overhead.
The metrics are based on separating the notions of work and communication. Communication can be thought of as the cost of parallelizing work. Work is the body of computation whose size remains fixed during a strong scaling study. It is defined as the sum of serial work (time outside OpenMP parallel regions and not in MPI or kernel synchronization APIs) plus OpenMP parallel region work. For OpenMP parallel regions, work excludes synchronization time, implicit barrier wait time, and other OpenMP runtime overheads.
The critical MPI rank is the rank with the largest amount of work. All this work must be completed by the critical rank before the application can terminate. The model assumes that the time this rank spends not computing (in MPI or related synchronization) indicates the extent to which communication inflates the total runtime.
Level 1 (Overall)
Parallel Inefficiency
Level 2 (Components of overall parallel inefficiency)
MPI Load Balance Inefficiency
MPI Communication Inefficiency
Serial Region Inefficiency
OpenMP Potential Gain
Level 3 (Components of OpenMP Potential Gain)
Potential Gain: Sync
Potential Gain: Other
All values are reported as fractions in the range [0, 1]. A low value indicates little or no inefficiency; a high value highlights a dominant scalability limiter.
| Metric | Scope | What It Measures | Low Value Means | Typical High Causes |
|---|---|---|---|---|
| Parallel Inefficiency | Application | Total fraction of wall time lost versus a synthetic perfectly scaled execution (perfect balance, no communication delay, no serial bottlenecks, no OpenMP losses). | Execution close to ideal strong scaling where work is perfectly distributed. | Combined rank/thread imbalance, communication delay, serial code, OpenMP overhead. |
| MPI Load Balance Inefficiency | MPI | Extra wall time due to uneven work distribution across ranks. Computed as the difference between the critical rank’s work and the average work, normalized by runtime. | All ranks finish work simultaneously (balanced load). | Skewed domain decomposition, data skew, uneven rank workloads, rank-specific I/O. |
| MPI Communication Inefficiency | MPI | Fraction of wall time that the critical rank spends in MPI. This reflects the extent to which MPI communication/synchronization is slowing down the overall execution of the program. Note that load imbalance can also manifest as communication inefficiency at a global level, as load imbalance will typically cause the critical rank to spend more time in MPI APIs. | Critical rank nearly always computing (communication well overlapped or minimal). | Many small latency-bound messages, blocking collectives, serialization on root ranks, poor overlap. |
| Serial Region Inefficiency | MPI | Amdahl’s law for each rank, averaged across ranks. This represents the scalability loss from serial code that is not parallelized with OpenMP. | Negligible serial-only work per rank. | Initialization hot spots, single-threaded loops, legacy non-thread-safe code, unparallelized I/O. |
| OpenMP Potential Gain | OpenMP | Fraction of wall time wasted in parallel regions due to imperfect work distribution, synchronization, and runtime overhead. Computed as the difference between actual parallel region wall time and theoretical minimum time if work were perfectly balanced across threads. | Parallel regions near ideal efficiency (minimal waste). | Synchronization overhead, implicit barrier waits, unbalanced work, runtime/task management costs. |
| Metric | Scope | Specific Source of Loss | Low Value Means | Typical High Causes |
|---|---|---|---|---|
| Potential Gain: Sync | OpenMP | Average per-thread time in explicit synchronization (such as critical sections, explicit barriers, etc.). | Minimal blocking and contention, and little imbalance time spent in implicit barriers at the end of worksharing constructs (which indicates uneven work distribution among threads). | Contended locks, coarse critical sections, frequent atomic updates, excessive barriers. |
| Potential Gain: Other | OpenMP | Average per-thread time in other OpenMP API calls not classified as Sync (scheduling operations, task management, runtime bookkeeping). | Lightweight runtime overhead and even distribution of work across threads. | Excessive fine-grained tasks, scheduling inefficiencies, migration/affinity churn, runtime parameter misconfiguration; also imbalance due to poor chunk sizing, skewed task granularity, irregular loop workloads, static scheduling with uneven iterations. |
Indicative (heuristic) ranges for triage guidance. Actual meaningful thresholds depend on workload characteristics, problem size, and node count. These are not hard pass/fail criteria.
| Metric Value | Qualitative Status | Suggested Action |
|---|---|---|
| 0.00 – 0.05 | Excellent | Focus on algorithmic improvements or problem-size scaling; parallel losses minor. |
| 0.05 – 0.15 | Moderate | Inspect top contributing regions/ranks; targeted tuning likely beneficial. |
| 0.15 – 0.30 | High | Prioritize root-cause classification (rank/thread imbalance vs communication vs OpenMP overhead). |
| > 0.30 | Severe | Revisit decomposition, synchronization strategy, and parallel design fundamentals. |
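For scripted triage, the indicative ranges above can be encoded as a small lookup. The table does not say which band a shared boundary value (0.05, 0.15, 0.30) belongs to, so the sketch below assumes it belongs to the lower band, consistent with the last row starting strictly above 0.30. The `triage_status` helper is a hypothetical name.

```python
def triage_status(value):
    """Map a metric value in [0, 1] to the qualitative status from the table.

    Assumption: a boundary value belongs to the lower band.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"metric value out of range [0, 1]: {value}")
    if value <= 0.05:
        return "Excellent"
    if value <= 0.15:
        return "Moderate"
    if value <= 0.30:
        return "High"
    return "Severe"


for v in (0.02, 0.10, 0.30, 0.45):
    print(v, triage_status(v))
```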
Metrics are reported only if all of the following conditions hold:
No nested OpenMP parallel regions.
No MPI API calls inside OpenMP parallel regions.
Uniform OpenMP thread count across all MPI ranks.
If any condition is violated, the reported strong scaling metrics may be incorrect or undefined. Users should verify that their application configuration satisfies these prerequisites before interpreting the metric values. Future AMD uProf updates will try to cover additional scenarios and issue warnings.
Work Per Rank
The metrics are based on separating the notions of work and communication. Communication can be thought of as the cost of parallelizing work. Work represents (or approximates) that body of computation whose size remains fixed during a strong scaling study. In strong scaling, we keep the problem size fixed and we increase the number of ranks/threads. In perfect strong scaling, the runtime decreases in direct proportion to the number of ranks/threads added. We define work as the amount of time (in each rank and thread) spent outside of the OpenMP Runtime and outside of any MPI calls. Hence work is runtime less the cost of communication and parallelization.
MPI Communication Inefficiency Assumption
Interprets non-compute intervals on the critical (max-work) rank as communication or communication-induced waiting. Reducing communication on other ranks does not lower wall time unless those changes make a different rank the new max-work rank.
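The two MPI metrics that depend on the critical rank can be worked through on a small example. The sketch below follows the definitions given in this section (load balance: critical rank work minus average work, normalized by runtime; communication: fraction of wall time the critical rank spends in MPI). The per-rank numbers and the `mpi_level2_metrics` helper are hypothetical, and the values AMD uProf reports may be computed with additional detail not described here.

```python
def mpi_level2_metrics(work_s, mpi_s, runtime_s):
    """Compute the critical rank and two Level 2 MPI metrics.

    work_s[i]: work time of rank i in seconds.
    mpi_s[i]:  time rank i spends in MPI in seconds.
    runtime_s: application wall time in seconds.
    """
    # The critical rank is the rank with the largest amount of work
    critical = max(range(len(work_s)), key=lambda i: work_s[i])
    avg_work = sum(work_s) / len(work_s)
    return {
        "critical_rank": critical,
        "load_balance_inefficiency": (work_s[critical] - avg_work) / runtime_s,
        "communication_inefficiency": mpi_s[critical] / runtime_s,
    }


# Hypothetical four-rank run with a 10-second wall time
work = [6.0, 8.0, 4.0, 6.0]
mpi = [4.0, 2.0, 6.0, 4.0]
print(mpi_level2_metrics(work, mpi, runtime_s=10.0))
```

In this example, rank 2 spends the most time in MPI (6 s), yet only the 2 s spent by the critical rank (rank 1) counts toward communication inefficiency; the rest of the time on rank 2 is idle time caused by load imbalance.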
OpenMP Potential Gain
For each parallel region invocation, compute the waste as: waste = omp_parallel_wall_time - (sum_of_thread_work / num_threads). Aggregate waste across invocations, regions, and ranks, then normalize by application runtime. This quantifies how much wall time could be saved if parallel regions had perfectly balanced work with zero synchronization and runtime overhead.
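The waste formula can be checked by hand on a small case. The sketch below applies it to two hypothetical parallel region invocations on a single rank; the `openmp_potential_gain` helper and the timings are illustrative only.

```python
def openmp_potential_gain(invocations, runtime_s):
    """Aggregate per-invocation waste and normalize by application runtime.

    invocations: list of (region_wall_time_s, per_thread_work_s) pairs, where
                 per_thread_work_s is a list with one work time per thread.
    """
    waste = 0.0
    for wall_s, thread_work in invocations:
        # waste = omp_parallel_wall_time - (sum_of_thread_work / num_threads)
        waste += wall_s - sum(thread_work) / len(thread_work)
    return waste / runtime_s


# Hypothetical invocations with four threads each
invocations = [
    (2.0, [2.0, 1.0, 1.0, 2.0]),  # ideal time 1.5 s, waste 0.5 s
    (1.0, [0.5, 0.5, 0.5, 0.5]),  # ideal time 0.5 s, waste 0.5 s
]
print(openmp_potential_gain(invocations, runtime_s=10.0))
```

The first invocation loses time to imbalance (two threads finish early), while the second is perfectly balanced and loses time only to synchronization or runtime overhead; the formula counts both kinds of loss the same way.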
Level 3 Components
Partition OpenMP Potential Gain exclusively into:
Sync: Average per-thread time in explicit synchronization primitives and implicit barriers (end-of-worksharing waits).
Other: Average per-thread time in other OpenMP runtime APIs.
Start with Parallel Inefficiency to gauge overall scalability health.
Decompose via Level 2 metrics to identify dominant class (MPI balance, communication, serial, or OpenMP).
If OpenMP Potential Gain dominates, inspect its Level 3 breakdown to distinguish synchronization vs other overhead.
Correlate MPI Communication or Load Balance issues with MPI rank timelines and communication matrix views (MPI Communication Matrix).
Apply focused optimizations; re-profile to confirm targeted metric reductions and validate that changes persist under scaling.
Some guidelines for high inefficiency values (this is not an exhaustive list):
High MPI Load Balance Inefficiency
Refine domain partitioning for more even per-rank workloads.
Implement dynamic work redistribution or load balancing schemes.
Rebalance rank-local I/O or computational hot spots.
High MPI Communication Inefficiency
Overlap communication with computation using non-blocking primitives.
Leverage non-blocking collectives and persistent communication.
Consolidate small messages; reduce latency-bound exchange patterns.
Ensure load balance across ranks is good.
High Serial Region Inefficiency
Parallelize residual serial loops or initialization phases.
Refactor sequential bottlenecks; introduce task parallelism where feasible.
Profile serial hot spots and target with OpenMP directives.
High Potential Gain: Sync
Use finer-grained locks or shorten critical sections.
Eliminate unnecessary barriers; prefer implicit ones only when needed.
Replace heavy locking with atomics or lock-free patterns.
High Potential Gain: Other
Limit excessive tiny task creation to reduce overhead; batch work.
Set thread affinity (proc bind / places) to reduce migration.
Apply dynamic or guided scheduling for irregular loops.
To report Parallel Strong Scaling metrics with both OpenMP and MPI trace:
mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl mpich --mpi-scope full --trace openmp --openmp-impl ompt --openmp-scope full -o <output_directory> <application>
To report Parallel Strong Scaling metrics with only OpenMP trace:
./AMDuProfCLI collect --trace openmp --openmp-impl omplib -o <output_directory> <application>
To report Parallel Strong Scaling metrics with only MPI trace:
mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi --mpi-impl openmpi --mpi-scope full -o <output_directory> <application>
Generate the report:
./AMDuProfCLI report -i <session directory>
The following report is a sample parallel efficiency report for an application using OpenMP and MPI.
Figure 7.65 Sample Parallel Efficiency Report#
Assumes no core oversubscription (HPC-style dedicated cores with one thread per core or hardware thread).
Modeling assumption (max-work rank): Wall time is approximated as the largest per-rank work time plus its non-work (communication/overhead) intervals. Pathological communication patterns or non-deterministic bottlenecks may weaken this approximation.
Very short runs or tiny parallel regions yield unstable ratios due to insufficient statistical weight and measurement noise.
Threshold guidance is heuristic; validate against strong-scaling studies for production workloads.
Metrics are still produced for cases with nested OpenMP parallelism, heterogeneous thread counts per rank, or MPI API calls inside OpenMP parallel regions, but they may be unreliable.
Measure at realistic production problem sizes to avoid small-scale artifacts and initialization noise.
Pair metric inspection with timeline views, region summaries, and flame graphs for root-cause context.
Track metric deltas across optimization iterations; convergence toward lower dominant components is more meaningful than absolute values alone.
Validate improvements on multiple node counts to ensure gains persist under scaling and do not shift bottlenecks.
Combine with trace analysis for deeper understanding of communication patterns and synchronization hotspots.
GPU Profile is the starting point for analyzing the most time-consuming GPU kernels and the GPU usage based on various predefined GPU H/W metrics. GPU Profile uses Radeon Open Compute (ROCm) to collect profiling data and generate raw files.
The AMD Rocprofiler library provides support to monitor GPU hardware performance events when GPU kernels are dispatched and executed. The derived performance metrics are computed and reported in the CSV format (CLI) and in GUI.
Install ROCm™
Install AMD ROCm 7.1.0 on the target system to run GPU Profiling. AMD uProf is also backward compatible with ROCm versions down to 5.2.1. Supported accelerators: AMD Instinct™ MI200 and MI300A.
Complete the following procedure to install ROCm:
Complete the steps in the ROCm Installation Guide to install AMD ROCm™ v7.1.0 on the host system.
After installing AMD ROCm™ 7.1.0, make sure that the symbolic link /opt/rocm/ points to /opt/rocm-7.1.0/.
$ ln -s /opt/rocm-7.1.0 /opt/rocm
Note
Profiling might not work as expected on 5.2.1 or older versions.
By default, AMDuProf uses:
ROCm version pointed to by the /opt/rocm/ symbolic link. To specify a different ROCm path, you must export it using AMDUPROF_ROCM_PATH before launching AMD uProf. For example:
export AMDUPROF_ROCM_PATH=/opt/rocm-7.1.0/
ROCm libraries from /opt/rocm/lib. If AMDUPROF_ROCM_PATH is specified, the specified path or library will be used. To change this path, you must export it using AMDUPROF_ROCM_LIB_PATH before launching AMD uProf. For example:
export AMDUPROF_ROCM_LIB_PATH=/opt/rocm-7.1.0/lib
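The defaults and overrides described above can be summarized as a small resolution function. This is a sketch of the documented behavior, not AMD uProf's implementation: the `resolve_rocm_paths` helper is hypothetical, and giving AMDUPROF_ROCM_LIB_PATH precedence over the `lib` directory under AMDUPROF_ROCM_PATH is an assumption.

```python
import os
import posixpath


def resolve_rocm_paths(env=None):
    """Return (rocm_path, rocm_lib_path) from the environment.

    Assumption: AMDUPROF_ROCM_LIB_PATH, when set, takes precedence over the
    lib directory under the ROCm path.
    """
    env = os.environ if env is None else env
    rocm_path = env.get("AMDUPROF_ROCM_PATH") or "/opt/rocm/"
    lib_path = env.get("AMDUPROF_ROCM_LIB_PATH") or posixpath.join(rocm_path, "lib")
    return rocm_path, lib_path


print(resolve_rocm_paths({}))
print(resolve_rocm_paths({"AMDUPROF_ROCM_PATH": "/opt/rocm-7.1.0/"}))
```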
--ip-block - Provide IP-Block of raw events to be collected.
Events
Run the following command to list the supported GPU H/W events on the target system:
AMDuProfCLI info --list gpu-events
Metrics
See the Omniperf document for the extensive list of supported metrics.
Collect Profile Data
Use the following commands to collect GPU performance data:
GPU Profile with All supported Views collection
AMDuProfCLI collect --config gpu -o <output-dir> <application>
GPU Profile with System Speed of Light view
AMDuProfCLI collect --config gpu_sol -o <output-dir> <application>
GPU Profile with specific IP-Block collection
AMDuProfCLI collect --config gpu --ip-block SQ -o <output-dir> <application>
These commands will launch the program and collect the profile data. After the launched application is executed, the AMDuProfCLI will display the session directory path in which the raw profile data is saved.
For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.
Example
AMDuProfCLI collect --config gpu -o /tmp/ /tmp/namd
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-namd-GPUProfile_MMM-dd-yyyy_hh-mm-ss
Here, the generated session directory is /tmp/AMDuProf-namd-GPUProfile_MMM-dd-yyyy_hh-mm-ss.
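When the collect and report steps are chained in a script, the session directory can be picked out of the captured CLI output. The sketch below assumes the output contains a line beginning with "Generated data files path:", as in the example above; the `session_directory` helper is a hypothetical name, and the message text may differ between AMD uProf versions.

```python
import re


def session_directory(cli_output):
    """Return the path after "Generated data files path:", or None if absent."""
    match = re.search(r"^Generated data files path:\s*(\S.*?)\s*$",
                      cli_output, re.MULTILINE)
    return match.group(1) if match else None


# Output copied from the example above
output = """Profiling started
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-namd-GPUProfile_MMM-dd-yyyy_hh-mm-ss
"""
print(session_directory(output))
```

The returned path is the value to pass to the report command with the -i option.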
To launch the AMDuProf GUI, go to Home > Welcome page.
Click Profile an Application on the Welcome page.
Provide application path, application options, working directory, and environment variables, if any. Click Next.
From Predefined Configs, select GPU Profile.
Click Start Profile to start the profiling.
Note
Behavior is undefined when the GPU profile collection is interrupted or the launched application is killed from another terminal.
Use the following CLI report command to generate the profile report in .csv format by passing the session directory path generated in collection.
AMDuProfCLI report -i <session directory>
For a list of all the supported options, refer to AMDuProfCLI Report Command Options.
Here is an example of a report for a GPU Profile session:
Figure 7.66 Sample GPU Profile Report.csv#
Figure 7.67 Sample GPU Profile Report.csv#
If data is collected using CLI, then use Import Session to import the session into GUI to analyze data in GUI. The following are the supported views to analyze GPU Profile:
Here is a screenshot of an imported GPU Profile session in GUI:
Figure 7.68 GPU Offloading Analysis - Hot Spots#
Legend
Target Application and Profile summary.
Hotspots Summary gives an overview of the GPU kernels, including the kernel launch count, total execution time (s), and so on.
System Speed of Light: System Speed-of-Light summarizes some of the key metrics from various sections of the profiling report.
Figure 7.69 GPU Offloading Analysis - Session Information#
Legend
Profile Details
System Details
Target Details
GPU Device Details
Figure 7.70 GPU Offloading Analysis - Analyze#
Legend
All supported views are segregated into three categories as follows:
Overview
System Speed of Light: System Speed-of-Light summarizes some of the key metrics from various sections of profiling report.
Compute
Command Processor: The Command Processor handles interaction with the AMDGPU kernel driver (the Linux kernel on the CPU) and with user-space HSA clients when commands are submitted to HSA queues.
Shader Processor Input: Shader Processor passes dispatches from command processor to workgroup manager onto the CU.
Wavefront Launch Stats: The wavefront launch stats panel gives general information about the kernel launch.
Compute Units Instruction Mix: This panel breaks down the types of instructions executed by the kernel.
Compute Units Compute Pipeline: This section reports the number of floating-point and integer operations executed on the VALU and MFMA units in various precisions.
Cache
Memory Chart Analysis: Metrics related to memory, including caches and the types of reads and writes.
Local Data Share: The LDS statistics panel gives a more detailed view of the H/W.
Instruction Cache: Detail on the hit/miss statistics of the L1 Instruction (L1I) cache.
Scalar L1D Cache: The Scalar L1 Data cache (sL1D) can cache data accessed from scalar load instructions.
Texture Address and Texture Data: The Texture Address shares in-depth understanding of memory instructions, write and atomic data from CU and passes this info to data processing unit.
Vector L1 cache: The vector L1 data (vL1D) cache is local to each CU on the accelerator and handles vector memory operations issued by a wavefront.
L2 Cache: This panel shares performance, accesses, misses and delays in accessing L2 cache.
L2 Cache Per Channel: This panel shares aggregated L2 cache performance.
The filters pane lets you filter the profile data by providing the following options:
Kernel Filter: Evaluate GPU Metrics per view for selected kernel(s).
GPU ID Filter: Filter out kernels launched on a specific GPU device.
Normalization Filter: Evaluate GPU Metrics in all views according to selected normalization. Supported normalizations are:
Per Wave: The total value of the measured counter or metric that occurred per kernel invocation divided by the total number of wavefronts launched in the kernel.
Per Cycle: The total value of the measured counter or metric that occurred per kernel invocation divided by the kernel cycles, that is, the total number of cycles the kernel executed as measured by the command processor.
Per Second: The total value of the measured counter or metric that occurred per kernel invocation divided by the kernel time, that is, the total runtime of the kernel in seconds, as measured by the command processor.
Per Kernel: The total value of the measured counter or metric that occurred per kernel invocation.
Note
Per Wave is the default normalization.
All Kernel Info: Resets all filters.
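The four normalizations are divisions of the same per-invocation total by different denominators. The following is a minimal Python sketch of that arithmetic. The function name `normalize` and all counter values are hypothetical and chosen only for illustration; they are not AMD uProf output or API.

```python
def normalize(total, mode, *, wavefronts=None, cycles=None, seconds=None):
    """Normalize a per-kernel-invocation counter total (illustrative only)."""
    if mode == "per_kernel":
        # Per Kernel: the raw total for the invocation.
        return total
    divisors = {
        "per_wave": wavefronts,   # wavefronts launched in the kernel
        "per_cycle": cycles,      # kernel cycles seen by the command processor
        "per_second": seconds,    # kernel runtime in seconds
    }
    if mode not in divisors:
        raise ValueError(f"unknown normalization: {mode}")
    divisor = divisors[mode]
    if not divisor:
        raise ValueError(f"{mode} needs a non-zero divisor")
    return total / divisor


# Hypothetical counter total for one kernel invocation.
total = 8192
print(normalize(total, "per_wave", wavefronts=256))   # 32.0
print(normalize(total, "per_cycle", cycles=4096))     # 2.0
print(normalize(total, "per_second", seconds=0.5))    # 16384.0
print(normalize(total, "per_kernel"))                 # 8192
```

Because Per Wave is the default, a value shown without changing the filter corresponds to the first call.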
This section lists all launched GPU Kernels in descending order of total execution time with total launch count, Min, Max and Avg time taken by each kernel. This section also supports sorting data on all columns.
The selected kernels are displayed in this label.
Select the view to analyze from the drop-down.
The metrics for the selected view are listed in this section.
Use these views to analyze how efficiently the application uses the GPUs, that is, how long a specific GPU kernel took to execute and how the hardware counters evaluated for that kernel.
Use GPU Profile to get a list of the most time-consuming GPU Kernels. All kernels are sorted in descending order of total execution time. It also lists the kernel’s launch count, Min, Max, and Avg execution time.
Select one or more kernels to evaluate all metrics for those kernels.
System-wide profiling and profiling of an already running process or thread are not supported.
Profiling of MPI applications is not supported for MI300A.
GPU offloading analysis is used to explore the traces of the function calls for a GPU compute-intensive application.
It provides an in-depth analysis of the HIP API calls, HSA API calls, the order of kernel execution, the time taken by each kernel to execute, and a data transfer summary with a per-thread timeline. It also provides an aggregated list of the hottest kernels with timing metrics.
The AMD ROCtracer library provides support to capture the runtime APIs and GPU activities such as data transfer and kernel execution. This analysis helps to visualize the ROCr and HIP API calls and GPU activities while a HIP-based application is running. It is supported only when AMD uProf launches the application.
Install ROCm™
Install AMD ROCm 7.1.0 on the target system to enable and run GPU Tracing. uProf supports backward compatibility from ROCm version 5.2.1. Supported accelerators include AMD Instinct™ MI200 and AMD Instinct™ MI300A.
Complete the following procedure to install ROCm:
Complete the steps in the ROCm Installation Guide to install AMD ROCm™ v7.1.0 on the host system.
After installing AMD ROCm™ 7.1.0, make sure that the symbolic link /opt/rocm/ points to /opt/rocm-7.1.0/.
$ ln -s /opt/rocm-7.1.0/ /opt/rocm/
Note
Tracing might not work as expected on 5.2.1 or older versions.
By default, AMDuProf uses:
The ROCm version pointed to by the /opt/rocm/ symbolic link. To specify a different ROCm path, export it using AMDUPROF_ROCM_PATH before launching AMD uProf. For example:
export AMDUPROF_ROCM_PATH=/opt/rocm-7.1.0/
The ROCm libraries from /opt/rocm/lib. If AMDUPROF_ROCM_PATH is specified, the libraries from the specified path are used instead. To change the library path, export it using AMDUPROF_ROCM_LIB_PATH before launching AMD uProf. For example:
export AMDUPROF_ROCM_LIB_PATH=/opt/rocm-7.1.0/lib
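The precedence described above can be sketched in Python as follows. This is only an illustration of the documented order, not AMD uProf's actual lookup code. The function name `resolve_rocm_paths` is hypothetical, and the sketch assumes that the libraries live in a `lib` subdirectory of the path given in AMDUPROF_ROCM_PATH.

```python
import os

DEFAULT_ROCM_PATH = "/opt/rocm"  # default symbolic link named in this section


def resolve_rocm_paths(env=None):
    """Return (rocm_path, lib_path) following the documented precedence."""
    env = os.environ if env is None else env
    # AMDUPROF_ROCM_PATH overrides the /opt/rocm symbolic link.
    rocm_path = os.path.normpath(env.get("AMDUPROF_ROCM_PATH") or DEFAULT_ROCM_PATH)
    # AMDUPROF_ROCM_LIB_PATH overrides the library directory.
    # Assumption: otherwise the libraries are in <rocm_path>/lib.
    lib_path = env.get("AMDUPROF_ROCM_LIB_PATH") or os.path.join(rocm_path, "lib")
    return rocm_path, os.path.normpath(lib_path)


if __name__ == "__main__":
    rocm, lib = resolve_rocm_paths()
    print("ROCm path:", rocm, "->", os.path.realpath(rocm))
    print("ROCm libraries:", lib)
```

Printing the resolved real path is a quick way to confirm which ROCm installation the /opt/rocm/ link points to before launching a trace.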
AMD uProf supports tracing the following ROCr runtime APIs and GPU activities and shows the collected data in the CLI report and the GUI timeline view:
| Category | Event | Description |
|---|---|---|
| GPU | hip | HIP runtime trace |
| GPU | hsa | AMD ROCr runtime trace |
Use the following command to collect the GPU Trace data.
GPU Trace collection
AMDuProfCLI collect --trace gpu -o <output-dir> <application>
This command launches the program and collects the trace data. After the launched application finishes, AMDuProfCLI displays the session directory path in which the raw profile data is saved.
For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.
Example
[bin]$ ./AMDuProfCLI collect --trace gpu -o /tmp/ /tmp/namd
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-namd-GpuTrace_Sep-05-2024_21-43-08
Here, the generated session directory is /tmp/AMDuProf-namd-GpuTrace_Sep-05-2024_21-43-08.
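When collection is scripted, the session directory can be picked out of the captured CLI output. The following Python sketch assumes that the output contains a line of the form `Generated data files path: <dir>`, exactly as in the example above. The helper name `session_dir_from_output` is hypothetical and not part of AMD uProf.

```python
import re

# Matches the line that the collect run prints, as shown in the example above.
_SESSION_RE = re.compile(
    r"^Generated data files path:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE
)


def session_dir_from_output(cli_output):
    """Return the session directory printed by a collect run, or None."""
    match = _SESSION_RE.search(cli_output)
    return match.group(1) if match else None


sample = """Profiling started
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-namd-GpuTrace_Sep-05-2024_21-43-08
"""
print(session_dir_from_output(sample))
# prints: /tmp/AMDuProf-namd-GpuTrace_Sep-05-2024_21-43-08
```

The returned path can then be passed to the report command with `-i`.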
Complete the following steps to start profiling:
Click Profile an Application on the Welcome page.
Provide application path, application options, working directory, and environment variables, if any. Click Next.
From Custom Configs, select GPU Trace.
Click Start Profile to start the profiling.
Note
Behavior is undefined when the GPU profile collection is interrupted or the launched application is killed from another terminal.
Use the following CLI report command to generate the profile report in .csv format by passing the session directory path generated in collection.
AMDuProfCLI report -i <session directory>
For a list of all the supported options, refer to AMDuProfCLI Report Command Options.
Here is an example of a report for a GPU Trace session:
Figure 7.71 Sample GPU Trace Report#
If data is collected using CLI, then use Import Session to import the session into GUI to analyze data in GUI. Below are the supported views to analyze GPU Profile.
Here is a screenshot of an imported GPU Profile session in GUI.
Figure 7.72 GPU Offloading Analysis - Hot Spots#
Legend
Application Details: Gives an overview of the target application traced.
GPU Kernel Launch Summary: Identifies the top four hottest kernels launched on the GPU, with their launch counts and total execution time on the GPU cores.
Data Transfer Summary: Shows how much time was spent in data transfers between host and device and how many data transfers were initiated.
Figure 7.73 GPU Offloading Analysis - Session Information#
Legend
Profile Details
System Details
Target Details
GPU Device Details
Figure 7.74 GPU Offloading Analysis - Analyze#
Legend
HIP Overview: HIP API calls summary
HSA Overview: HSA API calls summary
Figure 7.75 GPU Offloading Analysis - Per Thread Timeline#
Use this UI to analyze the following:
GPU Usage, GPU Memory usage and GPU Power of your application over the profile duration.
GPU Kernels executed over the profile duration.
Data transfers between host and device over the profile duration, which help identify the time spent in data copies.
Identify the hottest GPU kernel.
Use GPU Trace to get a list of the hottest GPU kernels. All the kernels are sorted in descending order of elapsed time.
System-wide profiling and profiling of an already running process or thread are not supported.
OpenMP tracing is currently not supported.
Function tracing in Linux is used to monitor and analyze the execution of functions. It provides insights into the functions called by an application and functions’ execution time. Function tracing introduces an additional overhead, which results in longer profiling times for an application.
Note
In high-frequency function tracing scenarios, the eBPF ring buffer may overflow, causing silent data loss. Tracing results captured under these conditions may be unreliable.
Linux kernel 4.15 or later is required.
From the AMDuProf installed directory, run the script AMDuProfSetup.sh with root access.
sudo ./AMDuProfSetup.sh
If you install AMD uProf using DEB installer, the script is run by the installer and the info about eBPF (Extended Berkeley Packet Filter) support on the host and function tracing support is provided.
Use the following commands to collect the function trace data.
Function Tracing
AMDuProfCLI collect --trace func --func <module:pattern> -o <output-dir> <application>
Trace functions of size 256 bytes or more
AMDuProfCLI collect --trace func --func <module:pattern> --func-size 256 -o <output-dir> <application>
Trace functions with threshold of 50000 ns
AMDuProfCLI collect --trace func --func <module:pattern> --func-threshold 50000 -o <output-dir> <application>
Trace kernel functions
AMDuProfCLI collect --trace func --func kernel:vfs_* -o <output-dir> <application>
Function tracing without child process inheritance is recommended to reduce performance overhead.
AMDuProfCLI collect --trace func --func <module:pattern> --no-inherit -o <output-dir> <application>
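The variants above differ only in a few optional flags, so a script can assemble them from one helper. The following Python sketch builds the argument list using only the flags shown in the commands above. The helper name `build_func_trace_cmd` and its parameter names are hypothetical, and the sketch only constructs the command; it does not run it.

```python
def build_func_trace_cmd(application, output_dir, pattern, *, func_size=None,
                         func_threshold_ns=None, inherit=True,
                         cli="AMDuProfCLI"):
    """Assemble the argv list for a function-tracing collect run."""
    cmd = [cli, "collect", "--trace", "func", "--func", pattern]
    if func_size is not None:
        cmd += ["--func-size", str(func_size)]            # size in bytes
    if func_threshold_ns is not None:
        cmd += ["--func-threshold", str(func_threshold_ns)]  # threshold in ns
    if not inherit:
        cmd.append("--no-inherit")                         # skip child processes
    cmd += ["-o", output_dir, application]
    return cmd


cmd = build_func_trace_cmd("/tmp/ScimarkStable", "/tmp/", "kernel:vfs_*",
                           func_threshold_ns=50000, inherit=False)
print(" ".join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` without a shell keeps patterns such as `kernel:vfs_*` from being expanded as file globs.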
After the profile data collection is complete, a session directory is generated. Use the session directory to generate the .csv report or to import the session into the GUI.
For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.
Example
./AMDuProfCLI collect --trace func --func /tmp/ScimarkStable:* -o /tmp/ /tmp/ScimarkStable
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23
Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23.
Complete the following steps to start profiling:
Click Profile an Application on the Welcome page.
Provide application path, application options, working directory, and environment variables, if any. Click Next.
From Custom Configs, select Function Tracing.
Click Start Profile to start the profiling.
| Option | Description |
|---|---|
| | Specify functions to exclude from the library or executable. Note: It is recommended to provide the absolute path of a module. |
| --func | Specify functions to trace from the library or executable. Note: It is recommended to provide the absolute (full) path of a module. |
| --func-size | By default, AMDuProf traces functions of size 128 bytes or more. Use this option to set a different minimum function size. |
| --func-threshold | By default, the function threshold value is set to 1000000 ns. To trace functions whose execution time is less than the default threshold, use this option to set the function threshold value. |
An example of the function summary report section in the .csv report file is as follows. It provides the function count, total time (function and its children), min and max function self-time and total self-time.
Figure 7.76 Function Tracing - Function Summary Report#
If data is collected using CLI, then use Import Session to import the session into GUI to analyze data in GUI.
Figure 7.77 Function Tracing - Function Count Summary Report#
Function tracing does not support the following:
System-wide data collection
Non-ELF executables
Shared libraries loaded using dlopen()
Analysis of 32-bit applications
Running function tracing on an HPC cluster requires root access
| Option | Description |
|---|---|
| --memory-threshold | By default, AMDuProf traces memory allocations of 1 KB or more. To trace memory allocations with a different minimum size, use this option to set the threshold. |
Memory tracing with default threshold value >=1KB
AMDuProfCLI collect --trace memory -o <output-dir> <application>
Memory tracing with threshold value >=4KB
AMDuProfCLI collect --trace memory --memory-threshold 4096 -o <output-dir> <application>
Memory tracing without child process inheritance is recommended to reduce performance overhead.
AMDuProfCLI collect --trace memory --no-inherit -o <output-dir> <application>
Once the profile data collection is complete, a session directory is generated. Use the session directory to generate the .csv report or to import the session into the GUI.
For a list of all the supported options, refer to AMDuProfCLI Collect Command Options.
Example
./AMDuProfCLI collect --trace memory -o /tmp/ /tmp/ScimarkStable
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23
Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23.
Generate the csv report to analyze the data in csv format.
AMDuProfCLI report -i <session directory>
For a list of all the supported options, refer to AMDuProfCLI Report Command Options.
Example
./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23
Translation started
…
Report generation started
…
Report generation completed...
Generated report file: /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23/report.csv
Here is an example of the memory report section in the .csv report file.
Figure 7.78 Memory Tracing - Memory Report#
Memory tracing does not support the following:
System-wide data collection.
Analysis of 32-bit applications.
Memory tracing is available only in CLI.
You cannot import a memory tracing CLI session into the GUI.
Running memory tracing on an HPC cluster requires root access.
Pagefault tracing helps identify the total number of page faults caused by each thread and process.
Linux kernel 4.15 or later is required.
From the AMDuProf installed directory, run the script AMDuProfSetup.sh with root access.
sudo ./AMDuProfSetup.sh
If you install AMD uProf using DEB installer, the script is run by the installer and the info about eBPF (Extended Berkeley Packet Filter) support on the host and OS tracing support is provided.
To trace page faults, run the following command:
AMDuProfCLI collect --trace osrt --osrt-event pagefault -o <output-dir> <application>
Example
./AMDuProfCLI collect --trace osrt --osrt-event pagefault -o /tmp/ /tmp/ScimarkStable
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23
Here, the generated session directory is /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23.
Generate the csv report to analyze the data in csv format.
./AMDuProfCLI report -i /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23
Translation started
…
Report generation started
…
Report generation completed...
Generated report file: /tmp/AMDuProf-ScimarkStable-CpuTrace_Sep-06-2024_03-38-23/report.csv
An example of the pagefault report section in the .csv report file:
Figure 7.79 Pagefault Tracing - Pagefault Report#
PageFault tracing does not support the following:
System-wide data collection.
Analysis of 32-bit applications.
PageFault tracing is available only in CLI.
Running PageFault tracing on an HPC cluster requires root access.
The Linux OS block I/O calls like insert, issue, and complete can be traced to provide the various metrics related to I/O operations performed by the application.
This analysis can be used to analyze:
Time taken to complete the I/O operations
IOPS - Number of block I/O operations per second
Read or Write bytes of block I/O operation
Block I/O bandwidth
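To make the relationship between these metrics concrete, the following Python sketch derives them from a list of completed block I/O records. The `BlockIO` record layout, the `summarize` function, and all values are hypothetical; they do not reflect the schema of the AMD uProf report.

```python
from dataclasses import dataclass


@dataclass
class BlockIO:
    issue_ns: int      # time the request was issued to the device
    complete_ns: int   # time the request completed
    nbytes: int        # bytes read or written
    is_write: bool


def summarize(ios, duration_s):
    """Derive completion time, IOPS, bytes, and bandwidth from I/O records."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    latencies = [io.complete_ns - io.issue_ns for io in ios]
    total_bytes = sum(io.nbytes for io in ios)
    return {
        "avg_latency_ns": sum(latencies) / len(latencies) if latencies else 0.0,
        "iops": len(ios) / duration_s,
        "read_bytes": sum(io.nbytes for io in ios if not io.is_write),
        "write_bytes": sum(io.nbytes for io in ios if io.is_write),
        "bandwidth_bytes_per_s": total_bytes / duration_s,
    }


# Four hypothetical requests observed over a 2-second window.
ios = [
    BlockIO(0, 200_000, 4096, True),
    BlockIO(100_000, 400_000, 4096, True),
    BlockIO(500_000, 600_000, 8192, False),
    BlockIO(700_000, 900_000, 4096, True),
]
stats = summarize(ios, duration_s=2.0)
print(stats["iops"])                   # 2.0
print(stats["bandwidth_bytes_per_s"])  # 10240.0
print(stats["avg_latency_ns"])         # 200000.0
```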
Note
The kernel may continue processing queued I/O requests submitted by the profiled application even after the application exits. Therefore, kernel block I/O analysis is supported only with system-wide tracing.
Linux kernel 5.4 or later is required.
BTF support is needed from kernel. AMD uProf expects kernel BTF information to be present at default location: /sys/kernel/btf/vmlinux.
From the AMDuProf installed directory, run the script AMDuProfSetup.sh with root access.
sudo ./AMDuProfSetup.sh
If you install AMD uProf using DEB installer, the script is run by the installer and the info about eBPF (Extended Berkeley Packet Filter) support, BTF support and OS tracing support on the host is provided.
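Before running the setup script, the kernel prerequisites stated in this guide can be checked with a short script. The following Python sketch encodes the minimum kernel versions given for function tracing, pagefault tracing, and disk I/O analysis, plus the default BTF location. The names `MIN_KERNEL`, `kernel_version`, and `unmet_prerequisites` are hypothetical, and the check is a convenience only; it does not replace AMDuProfSetup.sh.

```python
import os
import platform
import re

# Minimum kernel versions as stated in this guide.
MIN_KERNEL = {"function": (4, 15), "pagefault": (4, 15), "diskio": (5, 4)}
BTF_PATH = "/sys/kernel/btf/vmlinux"  # default BTF location expected by uProf


def kernel_version(release):
    """Parse (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def unmet_prerequisites(feature, release=None, btf_path=BTF_PATH):
    """Return a list of unmet prerequisites for the given tracing feature."""
    release = platform.release() if release is None else release
    need = MIN_KERNEL[feature]
    problems = []
    if kernel_version(release) < need:
        problems.append(f"kernel {release} is older than {need[0]}.{need[1]}")
    # Only disk I/O analysis needs kernel BTF information.
    if feature == "diskio" and not os.path.exists(btf_path):
        problems.append(f"kernel BTF information not found at {btf_path}")
    return problems


if __name__ == "__main__":
    for feature in MIN_KERNEL:
        print(feature, unmet_prerequisites(feature) or "OK")
```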
Complete the following steps to start profiling:
Click Profile an Application on the Welcome page.
Provide application path, application options, working directory, and environment variables, if any. Click Next.
From Custom Configs, select OS Runtime Tracing.
Select diskio event from trace events.
Click Start Profile to start the profiling.
Trace kernel block IO data for launched application.
AMDuProfCLI collect --trace osrt --osrt-event diskio -a -o <output-dir> <application>
Trace system wide kernel block IO data for 10 sec.
AMDuProfCLI collect --trace osrt --osrt-event diskio -a -d 10 -o <output-dir>
Example
./AMDuProfCLI collect --trace osrt --osrt-event diskio -a -o /tmp/ fio --name=test --ioengine=sync --rw=randwrite --bs=4k --numjobs=1 --size=1G --runtime=1m --time_based
Profiling started
………
Profiling (data collection) completed
Generated data files path: /tmp/AMDuProf-fio-CpuTrace_Sep-06-2024_03-38-23
Here, the generated session directory is /tmp/AMDuProf-fio-CpuTrace_Sep-06-2024_03-38-23.
Generate the csv report to analyze the data in .csv format.
./AMDuProfCLI report -i /tmp/AMDuProf-fio-CpuTrace_Sep-06-2024_03-38-23
Translation started
…
Report generation started
…
Report generation completed...
Generated report file: /tmp/AMDuProf-fio-CpuTrace_Sep-06-2024_03-38-23/report.csv
An example of the Disk I/O report section in the .csv report file is here:
Figure 7.80 Disk I/O Report#
If data is collected using CLI, then use Import Session to import the session into GUI to analyze data in GUI.
Navigate to the ANALYZE page and then select Disk I/O Stats in the navigation bar:
Figure 7.81 Disk I/O Stats#
Disk I/O analysis does not support the following:
System-wide data collection.
Analysis of 32-bit applications.
Disk I/O analysis is supported only on Linux kernels built with BTF support.
Running Disk I/O analysis on an HPC cluster requires root access.
Apart from the predefined configurations, you can choose the required events to profile.
To perform the custom profile:
Click PROFILE > Start Profiling to navigate to the Select Profile Target screen.
Select the required profile target and click Next.
From the Select Profile Type drop-down, select one of the following:
The CPU Tracing Mode drop-down consists of the options OS Trace and User Mode Trace. On Linux, OS Trace is enabled (with supported events) only in root/ADMIN mode and on Windows, it’s enabled with the supported event Schedule. User Mode Trace is enabled only for Application Analysis on Linux.
CPU Trace looks as follows:
Figure 7.82 CPU Trace#
Figure 7.83 CPU Trace#
GPU Trace looks as follows:
Figure 7.84 GPU Trace#
Multiple categories from the custom configs can be added together, for example, CPU Profile + CPU Trace. When multiple categories are selected, they are listed as breadcrumbs under Added Categories, where you can deselect the unwanted categories. The corresponding CLI command is generated below them. The custom configs screen looks similar to the following:
Figure 7.85 Custom Config - Added Categories#
Select the Custom Configs tab and select CPU Profile from the left vertical pane.
Click Advanced Options to enable call-stack collection, set symbol paths (if the debug files are in different locations), and set other options. Refer to the section Advanced Options for more information on this screen.
Once all the options are set, the Start Profile button at the bottom will be enabled. Click it to start the profile.
After the profile initialization, the profile data collection screen is displayed.
Complete the following steps to analyze the profile data:
When the profiling stops, the collected raw profile data is processed automatically and the Hot Spots screen of the Summary page is displayed. Refer to the section Overview of Performance Hotspots for more information on this screen.
Click ANALYZE on the top horizontal navigation bar to go to the Function HotSpots screen. Refer to the section Function HotSpots for more information on this screen.
Click ANALYZE > Metrics to display the profile data table at various granularities: Process, Load Modules, Threads, and Functions. Refer to the section Process and Functions for more information on this screen.
Double-click any entry in the Functions table on the Metrics screen to load the source tab for that function in the SOURCES page. Refer to the section Source and Assembly for more information on this screen.
CPU profiling in AMD uProf has the following limitations:
CPU profiling requires that the executable binaries of the profiled application are not compressed or obfuscated by any software protector tools, for example, VMProtect.
In case of AMD EPYC™ 1st generation B1 parts, only one PMC register is used at a time for Core PMC event-based profiling (EBP).
Runtime environments like Cygwin, mingw, and msys2 are not supported.
IMIX has the following limitations:
The IMIX view or report is supported only for IBS profile type.
If a module or binary has fewer than 10 samples, it is not shown in the IMIX report, because so few samples are not useful for IMIX analysis.
Linux kernel module .ko files are not shown in the IMIX view or report.