Sampling profilers work on the principle that the parts of a program that consume the most time (or that trigger the most occurrences of the sampling event) accumulate the largest number of samples, because they have a higher probability of being executed when the CPU Profiler takes a sample.
The time between two consecutive samples is the sampling interval. For example, in time-based profiling (TBP), if the sampling interval is 1 millisecond, roughly 1,000 TBP samples are collected every second for each processor core.
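The relationship between the interval and the sample rate can be sketched as follows. This is only illustrative arithmetic; the function name and the `cores` parameter are assumptions made for the example and are not part of the tool.

```python
def tbp_samples_per_second(interval_ms: float, cores: int = 1) -> float:
    """Approximate number of timer samples collected per second.

    interval_ms: sampling interval in milliseconds.
    cores: number of processor cores being sampled (hypothetical parameter).
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return (1000.0 / interval_ms) * cores


print(tbp_samples_per_second(1.0))      # 1000.0 (one core, 1 ms interval)
print(tbp_samples_per_second(1.0, 8))   # 8000.0 (eight cores, 1 ms interval)
print(tbp_samples_per_second(10.0, 8))  # 800.0  (eight cores, 10 ms interval)
```

Increasing the interval by a factor of ten reduces the number of samples, and with it the collection overhead, by roughly the same factor.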
The meaning of the sampling interval depends on the resource used as the sampling event:
OS timer: the sampling interval is a time in milliseconds.
PMC events: the sampling interval is the number of occurrences of that sampling event.
IBS: the sampling interval is the number of processed instructions after which an instruction is tagged.
A smaller sampling interval increases both the number of samples collected and the data collection overhead. Because the profile data is collected on the same system on which the workload is running, more frequent sampling makes profiling more intrusive. A very small sampling interval can also cause system instability.
When a sampling point occurs upon the expiry of the sampling interval for a sampling event, the interrupt handler collects various profile data, such as the Instruction Pointer, Process ID, Thread ID, and call stack.
If the number of monitored PMC events is less than or equal to the number of available performance counters, each event can be assigned to a counter and monitored 100% of the time. In a single-profile measurement, if the number of monitored events is larger than the number of available counters, the CPU Profiler time-shares the available HW PMC counters. This is called event counter multiplexing. It allows more events to be monitored, but it decreases the actual number of samples for each event and thus reduces the data accuracy. The CPU Profiler auto-scales the sample counts to compensate for event counter multiplexing. For example, if an event is monitored 50% of the time, the CPU Profiler scales the number of event samples by a factor of 2.
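The scaling described above can be illustrated with a small sketch. It assumes the counters are shared equally among the events, which is a simplification made for this example; the actual scheduling used by the CPU Profiler is not described here, and the function names are hypothetical.

```python
def monitored_fraction(num_events: int, num_counters: int) -> float:
    """Fraction of time each event is counted, assuming equal time-sharing."""
    if num_events <= 0 or num_counters <= 0:
        raise ValueError("event and counter counts must be positive")
    if num_events <= num_counters:
        return 1.0  # every event has its own counter
    return num_counters / num_events


def scale_samples(raw_samples: int, fraction: float) -> float:
    """Scale a raw sample count to compensate for multiplexing."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return raw_samples / fraction


# Hypothetical case: 12 events sharing 6 counters are each monitored 50% of the time.
fraction = monitored_fraction(num_events=12, num_counters=6)
print(fraction)                      # 0.5
print(scale_samples(500, fraction))  # 1000.0, a scale factor of 2
```

The scaled value is an estimate: it assumes the event rate during the unmonitored periods matches the rate during the monitored ones, which is why multiplexing reduces accuracy.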
The following profile types are classified based on the hardware or software sampling events used to collect the profile data.
In a time-based profile (TBP), the profile data is periodically collected based on the specified OS timer interval. It is used to identify the hotspots of the profiled applications.
In an event-based profile (EBP), the CPU Profiler uses the PMCs to monitor the various micro-architectural events supported by the AMD x86-based processor. It helps to identify the CPU and memory related performance issues in the profiled applications. To analyze a particular aspect of the profiled application (or system), a specific set of relevant events is grouped and monitored together. The CPU Profiler provides several such predefined EBP configurations, for example Assess Performance and Investigate Branching. You can select any of these predefined configurations to profile and analyze the runtime characteristics of your application. You can also create custom configurations of events to profile.
In EBP mode, a delay called skid occurs between the time at which the sampling interrupt occurs and the time at which the sampled instruction address is collected. Skid spreads the samples over the instructions near the one that actually triggered the sampling interrupt. As a result, the distribution of samples is inaccurate and events are often attributed to the wrong instructions.
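The effect of skid on sample attribution can be illustrated with a toy model. The instruction indices and skid values below are invented for the example and do not come from real hardware.

```python
from collections import Counter


def attribute_with_skid(trigger_indices, skids):
    """Count samples per reported instruction index.

    Each sample is reported at (triggering instruction index + skid),
    where skid is the delay expressed in instructions.
    """
    return Counter(index + skid for index, skid in zip(trigger_indices, skids))


# Six events, all actually caused by the instruction at index 10.
trigger_indices = [10] * 6
# Hypothetical skid values (in instructions) for each sample.
skids = [0, 1, 2, 1, 3, 2]

observed = attribute_with_skid(trigger_indices, skids)
print(sorted(observed.items()))  # [(10, 1), (11, 2), (12, 2), (13, 1)]
```

Only one of the six samples lands on the instruction that caused the events; the rest are attributed to its neighbors.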
In an instruction-based sampling (IBS) profile, the CPU Profiler uses the IBS HW supported by the AMD x86-based processor to observe the effect of instructions on the processor and on the memory subsystem. In IBS, HW events are linked with the instruction that caused them. The CPU Profiler also uses these HW events to derive various metrics, such as data cache latency.
A custom profile allows a combination of HW PMC events, OS timer, and IBS sampling events.
The Predefined Sampling Configuration provides a convenient way to select a useful set of sampling events for profile analysis. The following table lists all such configurations:
| Profile Type | Predefined Configuration Name | Abbreviation | Description |
|---|---|---|---|
| User mode sampling and tracing | Overview Analysis | overview | To get a high-level performance snapshot of an application, identifying the hottest functions with their inclusive and exclusive elapsed times and the CPU utilization of the threads. |
| | Hotspots Analysis | hotspots | To understand the application code flow and the sections of code that consume the most execution time (CPU Time). |
| | Threading Analysis | threading | To identify how efficiently an application uses the processor cores, contention among the application threads due to synchronization, and CPU utilization of the threads. Note: This configuration is available only on Linux. It is supported only on AMD Zen3, AMD Zen4, and AMD Zen5 processors. |
| Time-based profile (TBP) | Time-based profile | tbp | To identify where the programs are consuming time. |
| Event-based profile (EBP) | Assess performance | assess | Provides an overall assessment of the performance. |
| | Assess performance (Extended) | assess_ext | Provides an overall assessment of the performance with additional metrics. |
| | Investigate data access | data_access | To find data access operations with poor L1 data cache locality and poor DTLB behavior. |
| | Investigate instruction access | inst_access | To find instruction fetches with poor L1 instruction cache locality and poor ITLB behavior. |
| | Investigate branching | branch | To find poorly predicted branches and near returns. |
| | Investigate CPI | cpi | To analyze the CPI and IPC metrics of the running application or the entire system. |
| IBS | Instruction based sampling | ibs | To collect the sample data using IBS Fetch and IBS OP. Precise sample attribution to instructions. |
| | Cache Analysis | memory | To identify the false cache-line sharing issues. The profile data will be collected using IBS OP. |
Note
The AMDuProf GUI uses the name of the predefined configuration in the above table.
The abbreviation is used with AMDuProfCLI collect command’s --config option.
The supported predefined configurations and the sampling events used in them depend on the processor family and model.
A View is a set of sampled event data and computed performance metrics either displayed in the GUI or in the text report generated by the CLI. Each predefined sampling configuration has a list of associated predefined views.
Following is the list of predefined view configurations for Assess Performance:
| View Configuration | Abbreviation | Description |
|---|---|---|
| Branch assessment | br_assess | You can use this view to find code with a high branch density and poorly predicted branches. |
| Data access assessment | dc_assess | Provides information about data cache (DC) access including DC miss rate and DC miss ratio. |
| IPC assessment | ipc_assess | Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI. |
| Misaligned access assessment | misalign_assess | You can use this view to identify regions of code that access misaligned data. |
| Overall Assessment | triage_assess | This view gives the overall picture of performance, including the instructions per clock cycle (IPC), data cache accesses/misses, mis-predicted branches, and misaligned data access. You can use it to find the possible issues for a deeper investigation. |
The following table lists the predefined view configurations for Threading.

| View Configuration | Abbreviation | Description |
|---|---|---|
| IPC assessment | ipc_assess | Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI. Note: This configuration is available only on Linux. It is supported only on AMD Zen3 and AMD Zen4 processors. |
| Time based hotspots | timer | Use this view to find hotspots where the program is spending most of its time. |
| All events | all | Use this view to report all collected events and possible computed metrics. |
The following table lists the predefined view configurations for Overview.

| View Configuration | Abbreviation | Description |
|---|---|---|
| IPC assessment | ipc_assess | Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI. Note: This configuration is available only on Linux. It is supported only on AMD Zen3 and AMD Zen4 processors. |
| Time based hotspots | timer | Use this view to find hotspots where the program is spending most of its time. |
| All events | all | Use this view to report all collected events and possible computed metrics. |
The following table lists the predefined view configurations for Investigate Data Access.
| View Configuration | Abbreviation | Description |
|---|---|---|
| Data access assessment | dc_assess | Provides information about data cache (DC) access including DC miss rate and DC miss ratio. |
| Data access report | dc_focus | You can use this view to analyze L1 Data Cache (DC) behavior and compare misses versus refills. |
| DTLB report | dtlb_focus | Provides information about L1 DTLB access and miss rates. |
| IPC assessment | ipc_assess | Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI. |
| Misaligned access assessment | misalign_assess | Identify regions of code that access misaligned data. |
The following table lists the predefined view configurations for Investigate Branch.
| View Configuration | Abbreviation | Description |
|---|---|---|
| Investigate Branching | Branch | You can use this view to find code with a high branch density and poorly predicted branches. |
| IPC assessment | ipc_assess | Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI. |
| Branch assessment | br_assess | You can use this view to find code with a high branch density and poorly predicted branches. |
| Taken branch report | taken_focus | You can use this view to find the code with a high number of taken branches. |
| Near return report | return_focus | You can use this view to find code with poorly predicted near returns. |
The following table lists the predefined view configurations for Assess Performance (Extended).
| View Configuration | Abbreviation | Description |
|---|---|---|
| Assess Performance (Extended) | triage_assess_ext | This view gives an overall picture of performance. You can use it to find possible issues for deeper investigation. |
| IPC assessment | ipc_assess | Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI. |
| Branch assessment | br_assess | You can use this view to find code with a high branch density and poorly predicted branches. |
| Data access assessment | dc_assess | Provides information about data cache (DC) access including DC miss rate and DC miss ratio. |
| Misaligned access assessment | misalign_assess | Identify regions of code that access misaligned data. |
The following table lists the predefined view configurations for Investigate Instruction Access.
| View Configuration | Abbreviation | Description |
|---|---|---|
| IPC assessment | ipc_assess | Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI. |
| Instruction cache report | ic_focus | You can use this view to identify regions of code that miss in the Instruction Cache (IC). |
| ITLB report | itlb_focus | You can use this view to analyze and break out ITLB miss rates by levels L1 and L2. |
The following table lists the predefined view configurations for Investigate CPI.

| View Configuration | Abbreviation | Description |
|---|---|---|
| IPC assessment | ipc_assess | Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI. |
The following table lists the predefined view configurations for Instruction Based Sampling.
| View Configuration | Abbreviation | Description |
|---|---|---|
| IBS fetch overall | ibs_fetch_overall | You can use this view to display an overall summary of the IBS fetch sample data. |
| IBS fetch instruction cache | ibs_fetch_ic | You can use this view to display a summary of IBS attempted fetch Instruction Cache (IC) miss data. |
| IBS fetch instruction TLB | ibs_fetch_itlb | You can use this view to display a summary of IBS attempted fetch ITLB misses. |
| IBS fetch page translations | ibs_fetch_page | You can use this view to display a summary of the IBS L1 ITLB page translations for attempted fetches. |
| IBS All ops | ibs_op_overall | You can use this view to display a summary of all IBS Op samples. |
| IBS MEM all load/store | ibs_op_ls | You can use this view to display a summary of IBS Op load/store data. |
| IBS MEM data cache | ibs_op_ls_dc | You can use this view to display a summary of DC behavior derived from IBS Op load/store samples. |
| IBS MEM data TLB | ibs_op_ls_dtlb | You can use this view to display a summary of DTLB behavior derived from IBS Op load/store data. |
| IBS MEM locked ops and access by type | ibs_op_ls_memacc | You can use this view to display the uncacheable (UC) memory access, write combining (WC) memory access, and locked load/store operations. |
| IBS MEM translations by page size | ibs_op_ls_page | You can use this view to display a summary of DTLB address translations broken out by page size. |
| IBS MEM forwarding and bank conflicts | ibs_op_ls_expert | You can use this view to display the memory access bank conflicts, data forwarding, and Missed Address Buffer (MAB) hits. |
| IBS BR branch | ibs_op_branch | You can use this view to display the IBS retired branch op measurements including mis-predicted and taken branches. |
| IBS BR return | ibs_op_return | You can use this view to display the IBS return op measurements including the return mis-prediction ratio. |
| IBS NB local/remote access | ibs_op_nb_access | You can use this view to display the number and latency of local and remote accesses. |
| IBS NB cache state | ibs_op_nb_cache | You can use this view to display the cache owned (O) and modified (M) state for NB cache service requests. |
| IBS NB request breakdown | ibs_op_nb_service | You can use this view to display the breakdown of NB access requests. |
| Views in AMD Zen3 and later processors | | |
| IBS fetch overall | ibs_fetch_overall | You can use this view to display an overall summary of the IBS fetch sample data. |
| IBS fetch instruction cache | ibs_fetch_ic | You can use this view to display a summary of IBS attempted fetch Instruction Cache (IC) miss data (Not supported in AMD Zen3 processors). |
| IBS fetch instruction TLB | ibs_fetch_itlb | You can use this view to display a summary of IBS attempted fetch ITLB misses. |
| IBS fetch page translations | ibs_fetch_page | You can use this view to display a summary of the IBS L1 ITLB page translations for attempted fetches. |
| IBS Branch Analysis | ibs_op_branch | You can use this view to display the IBS retired branch op measurements including mis-predicted and taken branches. |
| IBS Load Op Analysis | ibs_op_ld | You can use this view to analyze the memory load performance issues of an application. |
| IBS Load Op Analysis (ext) | ibs_op_ld_ext | You can use this view to analyze the memory load performance issues of an application. |
| IBS Branch Overview | mibs_op_br_overview | You can use this view to analyze the branch metrics. |
| IBS Load Latency Analysis | ibs_op_ld_lat | You can use this view to analyze the memory load latency performance issues of an application. |
| IBS Memory Overview | ibs_op_ls_overview | You can use this view to understand the memory access pattern of an application. |
| IBS Perf Overview | ibs_op_overview | You can use this view to understand the performance characteristics of an application. |
| New Views added in Zen4 and Zen5 Processors | | |
| Front End Bottlenecks | ibs_fetch_front_bottle | You can use this view to show the front end bottlenecks. |
| Backend Bound Bottlenecks | ibs_op_backend_bottle | You can use this view to show the backend bound bottlenecks. |
| Bad Speculations | ibs_op_bad_speculation | You can use this to view bad speculations. |
Note
The AMDuProf GUI uses the View configuration name of the predefined configuration mentioned in the above table.
The abbreviation is used in the CLI generated report file.
The supported predefined configurations and the sampling events used in them depend on the processor family and model.
AMD uProf uses the debug information generated by the compiler to show the correct function names in the various analysis views and to correlate the collected samples with source statements in the Source page. Without debug information, the results of the CPU Profiler are less descriptive and display only the assembly code.
When using Microsoft Visual C++ to compile the application in release mode, set the following options before compiling the application to ensure that the debug information is generated and saved in a program database file (with a .pdb extension). To set the compiler option to generate the debug information for an x64 application in release mode, complete the following steps:
Right-click the project and select Properties from the menu.
From the Configuration drop-down, select Active(Release).
From the Platform drop-down, select Active(Win32) or Active(x64).
In the project pane on the left, expand Configuration Properties.
Expand C/C++ and select General.
In the work pane, select Debug Information Format.
From the drop-down, select Program Database (/Zi) or Program Database for Edit & Continue (/ZI).
Figure 6.1 AMDTClassicMatMul Property Page
In the project pane, expand Linker and then select Debugging.
From the Generate Debug Info drop-down, select /DEBUG.
To generate debug information with inline functions for a Release build on Windows using Microsoft Visual C++, you need to configure the compiler and linker settings properly. Complete the following steps:
Open Project Properties: Right-click the project and select Properties from the menu.
Select Configuration and Platform
From the Configuration drop-down, select Active(Release).
From the Platform drop-down, select Active(x64), or Win32 if targeting 32-bit.
Set Debug Information Format
Navigate to Configuration Properties → C/C++ → General
In the work pane, set the Debug Information Format to Program Database (/Zi), or to Program Database for Edit & Continue (/ZI) if you need Edit & Continue in Debug mode.
Figure 6.2 AMDTClassicMatMul Property Page
Configure Optimization Settings
Navigate to Configuration Properties → C/C++ → Optimization
Set the following options:
Optimization: Maximum Optimization (Favor Speed) (/O2)
Inline Function Expansion: Any Suitable (/Ob2)
Enable Intrinsic Functions: Yes (/Oi)
Whole Program Optimization: Yes (/GL)
Figure 6.3 AMDTClassicMatMul Property Page
Configure Linker Debugging Settings
Navigate to Configuration Properties → Linker → Debugging and set Generate Debug Info to Yes(/DEBUG).
Figure 6.4 AMDTClassicMatMul Property Page
Configure Linker Optimization Settings
Navigate to Configuration Properties → Linker → Optimization
Set the following values:
References: No (/OPT:NOREF)
Enable COMDAT Folding: No (/OPT:NOICF)
Link Time Code Generation: Use Link Time Code Generation (/LTCG)
Figure 6.5 AMDTClassicMatMul Property Page
On Linux, the application must be compiled with the -g option so that the compiler generates debug information. Modify the Makefile or the respective build scripts accordingly.
The AMD uProf workflow has the following phases:
Collect: Run the application program and collect the profile data.
Translate: Process the profile data to aggregate, correlate, and organize it into a database.
Analyze: View and analyze the performance data to identify the bottlenecks.
Important concepts of the collect phase are explained in this section.
The profile target is one of the following for which profile data will be collected:
Application: Launch the application and profile that process and its children.
System: Profile all the running processes and/or kernel.
Process: Attach to a running application. For more information, refer to Code Profiling.
The profile type defines the type of profile data collected and how the data should be collected. The following profile types are supported:
CPU Profile
CPU Trace
GPU Profile
GPU Trace
System-wide Power Profile
The data collection is defined by Sampling Configuration:
Sampling Configuration identifies the set of Sampling Events, their Sampling Interval, and mode. Sampling Event is a resource used to trigger a sampling point at which a sample (profile data) will be collected. Sampling Interval defines the number of the occurrences of the sampling event after which an interrupt will be generated to collect the sample. Mode defines when to count the occurrences of the sampling event – in User mode and/or OS mode.
Sampled Data — the profile data that is collected when the interrupt is generated (upon the expiry of the sampling interval of a sampling event).
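The pieces of a Sampling Configuration described above can be modeled as a small data structure. This is a conceptual sketch only: the class names, field names, and event names are hypothetical and do not reflect how AMD uProf stores its configurations.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SamplingEvent:
    name: str               # hypothetical event label
    interval: int           # occurrences (or milliseconds for a timer) between samples
    user_mode: bool = True  # count occurrences in User mode
    os_mode: bool = True    # count occurrences in OS mode


@dataclass
class SamplingConfiguration:
    name: str
    events: List[SamplingEvent] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty if none were found."""
        issues = []
        if not self.events:
            issues.append("no sampling events")
        for event in self.events:
            if event.interval <= 0:
                issues.append(f"{event.name}: interval must be positive")
            if not (event.user_mode or event.os_mode):
                issues.append(f"{event.name}: no counting mode selected")
        return issues


config = SamplingConfiguration(
    name="example",
    events=[
        SamplingEvent("timer", interval=1),
        SamplingEvent("retired_instructions", interval=250000, os_mode=False),
    ],
)
print(config.problems())  # []
```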
The following table shows the type of profile data collected and sampling events for a profile type:
| Profile Type | Type of Profile Data Collected | Sampling Events |
|---|---|---|
| GPU Tracing | Runtime Trace: HIP and HSA | Not applicable |
| GPU Profiling | Perfmon Metrics | Not applicable |
| CPU Tracing | Collects pthread API, system calls, function trace, page faults and memory allocations | Not applicable |
| CPU Profiling | | |
For CPU Profiling, there are numerous micro-architecture specific events available to monitor. The tool groups related events of interest into sets called Predefined Sampling Configurations. For example, Assess Performance is one such configuration, used to get an overall assessment of the performance and to find potential issues for investigation. For more information, see Predefined View Configuration.
A Custom Sampling Configuration is the one in which you can define a sampling configuration with events of interest.
A profile configuration identifies all the information used to collect the measurement. It contains the information about profile target, sampling configuration, data to sample, and profile scheduling details.
The GUI saves these profile configuration details under a default name (for example, AMDuProf-TBP-Classic); you can also specify a name of your own. Because performance analysis is iterative, the configuration is persistent (until you delete it), so you can reuse it for future data collection runs.
A profile session represents a single performance experiment for a profile configuration. The tool saves all the profile and translated data (in a database) in the folder <profile config name>-<timestamp>.
Once the profile data is collected, uProf processes the data to aggregate and attribute the samples to the respective processes, threads, load modules, functions, and instructions. This aggregated data is then written into an SQLite database that is used during the Analyze phase. This translation of the raw profile data happens when the CLI generates the profile report or the GUI generates the visualization.
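The idea of aggregating raw samples into an SQLite database can be sketched as follows. The table layout, column names, and sample values are invented for illustration; they are not the schema that uProf actually writes.

```python
import sqlite3

# Hypothetical raw samples: (process id, thread id, load module, function).
raw_samples = [
    (100, 1, "app", "multiply"),
    (100, 1, "app", "multiply"),
    (100, 2, "app", "init"),
    (100, 1, "libc.so", "memcpy"),
    (100, 2, "app", "multiply"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples (pid INTEGER, tid INTEGER, module TEXT, function TEXT)"
)
conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", raw_samples)

# Attribute samples to functions and rank them by sample count.
rows = conn.execute(
    "SELECT function, COUNT(*) AS n FROM samples "
    "GROUP BY function ORDER BY n DESC, function"
).fetchall()
conn.close()

print(rows)  # [('multiply', 3), ('init', 1), ('memcpy', 1)]
```

The same grouping can be repeated per process, thread, or load module by changing the `GROUP BY` column.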
The collected raw profile data is processed to aggregate and attribute to the respective processes, threads, load modules, functions, and instructions. The debug information for the launched application generated by the compiler is needed to correlate the samples to functions and source lines.
This phase is performed automatically in the GUI after the profiling is stopped. In the CLI, the report command implicitly processes (translates) the raw profile data and generates the report in CSV format. The CLI also provides the translate command, which performs only the translation; the translated data files can then be imported into the GUI for visualization.
A View is a set of sampled event data and computed performance metrics either displayed in the GUI pages or in the text report generated by the CLI. Each predefined sampling configuration has a list of associated predefined views.
The tool can filter the displayed data down to a specific set of metrics; such a set is called a Predefined View. For example, the IPC assessment view lists metrics such as CPU Clocks, Retired Instructions, IPC, and CPI. For more information, see Predefined View Configuration.
The CLI option --export-session generates a compressed archive containing the essential session files. The compressed archive can be easily transported to another system, where the GUI can be used to analyze the performance data.
This feature streamlines the process of transferring and utilizing session files across multiple systems, enhancing accessibility and enabling smooth workflow continuity.
Complete the following procedure to export a session:
Generate a compressed archive with the translate, report, or profile command. A .zip file is generated.
Copy the .zip file to another system and decompress it.
The decompressed session directory can be imported to GUI for data visualization and analysis. To import the decompressed session and to analyze the performance data, refer to Importing Profile Database.
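The decompression step can be done with any zip tool. As a minimal sketch, the Python standard library can extract the archive and locate the session file to point the GUI at. The archive name and directory layout below are assumptions made for the demo; the content of a real exported archive may differ.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_session(archive_path, dest_dir):
    """Decompress an exported session archive and return the destination path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
    return dest


def find_session_files(root):
    """Recursively list session.uprof files below root."""
    return sorted(Path(root).rglob("session.uprof"))


# Demo with a stand-in archive; a real one would come from --export-session.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "exported_session.zip"  # hypothetical file name
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("my_session/session.uprof", "placeholder")
    dest = extract_session(archive_path, Path(tmp) / "imported")
    found = [p.relative_to(dest).as_posix() for p in find_session_files(dest)]

print(found)  # ['my_session/session.uprof']
```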
Generate a compressed archive with the translate command:
./AMDuProfCLI translate <options> --export-session <options> -i <session_dir>
Generate a compressed archive with the report command:
./AMDuProfCLI report <options> --export-session <options> -i <session_dir>
Generate a compressed archive with the profile command:
./AMDuProfCLI profile <options> --export-session <options>
Example
Launch the application AMDTClassicMatMul.exe and collect the Time-Based Profile (TBP) samples and generate a report with the export session option enabled.
To analyze a session exported using the CLI, click HOME > Import Session in the GUI to go to the Import Profile Session page.
Figure 6.6 Import Profile Page
Using the Import Profile page, you can import the processed profile data collected using the CLI or the processed profile data saved in GUI’s profile session storage path. You must do the following:
Specify the path containing the session.uprof file in the Profile Data File box.
Binary Path: If the profile run is performed in a system and the corresponding raw profile data is imported in another system, you must specify the path(s) in which binary files can be located.
Source Path: Specify the source path(s) in which the source files can be located. Sub-directories of this path are not searched for source files.
Root Path to Sources: Specify the path to the root of multiple source directories. The entire directory and sub-directories present in that path will be searched to locate any source files.
Note
The search might take time as all the sub-directories will be searched recursively.
Force Database Regeneration: Enable this option to forcefully regenerate the database file while importing.
Use Cached Source/Binary/Symbol Files: Enable this option to reuse cached source, binary, and symbol files.
AMD uProf can connect to a remote system, trigger the collection and translation of data on that system, and then visualize the data in the local GUI.
Note
CLI does not support remote profiling.
AMD uProf uses a separate AMDProfilerService binary that can be launched as an application server on the remote target; the local GUI can then connect to this server. By default, authorization must be set up on the server before the local GUI can connect.
Complete the following steps:
Locate the local GUI client ID.
Authorize the client ID on the remote target to connect to AMDProfilerService.
Launch AMDProfilerService with appropriate options/permissions on remote target.
Specify the connection details in the local GUI to connect to the remote target.
Local GUI updates itself and displays the remote data (including settings, session history, available events for profiling/tracing, and so on).
Proceed to import session/profile on the remote target.
When you are done with remote target, disconnect to update the local data in GUI.
Support
Remote profiling from Windows (host/local platform) to Linux (target/remote platform) is supported.
To set up authorization:
Navigate to PROFILE > Remote Profile and locate the Client ID.
Figure 6.7 Client ID
Copy the Client ID (alphanumeric value).
On the remote target, navigate to the AMD uProf bin directory and execute the following command:
AMDProfilerService --add <client_id>
This will authorize the client to connect to this remote target. To revoke the authorization, execute the following command:
AMDProfilerService --clear-user <client_id>
Specify the binding IP address to launch AMDProfilerService as an application server:
AMDProfilerService --ip 127.0.0.1
This IP address should be one of the IP addresses of the target/remote machine on which AMDProfilerService is launched.
If the target/remote machine has multiple IP addresses, the ping command can be used on the host/local machine to determine which IP address (of the remote machine) is reachable from the local machine. The reachable IP address can be passed to the --ip option.
AMDProfilerService also has an interactive mode for selecting the IP address. To launch the application server in interactive mode, use:
AMDProfilerService
Then select the correct IP address.
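As a complement to ping, a TCP connection attempt from the local machine can confirm that a candidate address and port are reachable once the service is running. The following is a generic sketch using the Python standard library; it is not part of AMD uProf, and the addresses and port in the usage comment are placeholders.

```python
import socket


def first_reachable(addresses, port, timeout=2.0):
    """Return the first address that accepts a TCP connection on port, or None."""
    for address in addresses:
        try:
            # create_connection resolves the address and performs the TCP handshake.
            with socket.create_connection((address, port), timeout=timeout):
                return address
        except OSError:
            continue  # unreachable, refused, or timed out; try the next address
    return None


# Usage (placeholder values):
# print(first_reachable(["192.168.1.1", "10.0.0.5"], 32768))
```

A successful connection only shows that something is listening on that address and port; it does not verify that the listener is AMDProfilerService or that the client is authorized.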
You can also specify any of the following options:
| Option | Description |
|---|---|
| | Specify the IP address. Note: This is a required option. |
| | Specify the port number. |
| | Flag to enable IPv6 Networking. |
| | Specify the log file path. |
| | Skip the authorization. Note: This option must be used with caution as it will skip the authorization. |
| | To print supported HTTP API version. Note: Use this to check compatibility with GUI. |
| | Get the version information. |
| | Add the Client ID. |
| | Remove a particular Client ID. |
| | Remove all registered clients. |
| | Specify the maximum depth for recursive file search operations. Note: This option is applicable only for importing a session from the GUI. |
| | Specify the maximum duration (in seconds) for recursive file search operations. Note: This option is applicable only for importing a session from the GUI. |
| | To skip any user prompt. |
Example of a remote profiling connection establishment:
Figure 6.8 Remote Profiling Connection Establishment
Example of an IP selection:
Figure 6.9 IP Address Selection
AMDProfilerService supports IPv6 networking. To enable IPv6 support, use the --ipv6 flag on the command line:
AMDProfilerService --ipv6 --ip fe80::6a05:caff:fe51:8a7f%enp97s0
Interactive mode is also supported with IPv6 addressing. To use the interactive mode with IPv6 support, run:
AMDProfilerService --ipv6
Figure 6.10 IP Address Selection
To connect to the remote target:
Once AMDProfilerService is launched on the remote target, go to the Remote Profile page and specify the IP address, port number, and optional name for the remote target as follows:
Click Connect.
The remote target data is displayed after a few seconds. From this point on, all the profiling and session import steps are identical to those for local profiling. After the connection is established, the provided IP address, port, and name are saved:
Figure 6.11 Remote Target Data
Double-click any table entry containing an IP address to load the corresponding details and connect to the required remote target.
After the connection is established, the title bar reflects the connection to the remote target and the Disconnect button in the Remote Profile page is enabled.
Figure 6.12 Disconnect Button Enabled
Here is a list of the limitations:
Once connected to a remote target, all the Browse buttons in the GUI will remain disabled. You can copy/paste or type the URI paths wherever required.
If you have not closed the GUI after profiling locally and then try to connect to a remote target, the GUI may occasionally crash. It is therefore recommended to close the GUI after local profiling if a remote connection is desired.
If local data is not required and you connect to the same remote target frequently, use the following command to connect directly to the remote target (if it is running):
AMDuProf --ip <ip_address> --port <port>
For example, AMDuProf --ip 192.168.1.1 --port 32768.
A client (GUI instance) can connect to an AMDProfilerService instance. However, if a user launches multiple instances of the GUI, only one of them will succeed in connecting. Different users can connect to the same AMDProfilerService because they have different client IDs.
Multiple instances of AMDProfilerService can be launched. However, all of them must be on different ports even if they are bound to the same IP address.
Remote profiling connection establishment might fail if the firewall on the target system is enabled. In such cases, disable the firewall or add an exception for AMDProfilerService to the firewall rules of the target system and try reconnecting. Another reason for failure could be that the port number is unavailable. This can happen due to network configuration, firewall settings, or another program blocking usable ports.
Profiling of MPI applications is not supported with remote profiling.