AMD uProf’s command line interface, AMDuProfCLI, provides options to collect profile data and to generate reports for analyzing it.
AMDuProfCLI [--version] [--help] COMMAND [<options>] [<PROGRAM>] [<ARGS>]
The following commands are supported:
| Command | Description |
|---|---|
| collect | Runs the given program and collects the profile samples. |
| translate | Processes the raw profile datafile and generates the profile DB. |
| report | Processes the raw profile datafile and generates the profile report. |
| compare | Processes multiple profile-data and generates a comparison report. |
| timechart | Power Profiling: collects and reports system characteristics, such as power, thermal, and frequency metrics. |
| info | Displays the generic information about system and topology. |
| profile | Collects the performance profile data, analyzes it, and generates the profile report. |
For more information on the workflow, refer to the section “Workflow and Key Concepts”. To run the command line interface AMDuProfCLI, run the following binaries as per the OS:
OS |
Description |
|---|---|
Windows |
|
Linux |
|
Linux: If installed using the .tar file |
|
FreeBSD |
|
To profile and analyze the performance of a native (C, C++, and Fortran) application, you must complete the following steps:
Prepare the application. See Preparing an Application for Profiling.
Use the AMDuProfCLI collect command to collect the samples for the application.
Note
Run AMD uProf on FreeBSD with the sudo command or root privileges.
Use the AMDuProfCLI report command to generate a report in a readable format for analysis.
Preparing the application means building the launch application with debug information, which is needed to correlate the samples to functions and source lines.
The collect command launches the application (if given) and collects the profile data for the given profile type and sampling configuration. It generates the raw data file (.prd on Windows, .pdata on FreeBSD, and .caperf on Linux) and other miscellaneous files.
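As a rough illustration of the per-OS raw data file extensions listed above, the following Python sketch looks up the raw files of a collected session. The helper name `raw_data_files` is hypothetical, and the recursive search is an assumption because the layout of the session directory is not described here.

```python
import platform
from pathlib import Path

# Raw data file extensions per OS, as listed in the paragraph above.
RAW_DATA_EXTENSIONS = {"Windows": ".prd", "FreeBSD": ".pdata", "Linux": ".caperf"}


def raw_data_files(session_dir, system=None):
    """Return the raw profile data files found under session_dir."""
    system = system or platform.system()
    extension = RAW_DATA_EXTENSIONS[system]
    # Assumption: search recursively, because the exact directory layout
    # of a session is not described in this section.
    return sorted(Path(session_dir).rglob(f"*{extension}"))
```

Passing `system="Linux"` explicitly makes the lookup independent of the machine on which the script runs, which can help when a session is copied to another host for analysis.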
The report command processes the collected raw profile data, aggregating the samples and attributing them to the respective processes, threads, load modules, functions, and instructions. It then writes them into a database and generates a report in CSV format.
The following figure shows how to run a time-based profile and generate a report for the application AMDTClassicMatMul.exe.
Figure 9.1 Collect and Report Commands
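The two-step collect and report workflow described above can be sketched as a small Python helper that assembles both command lines. This is a minimal sketch, not part of AMD uProf: the helper names are hypothetical, the CLI is assumed to be on PATH as `AMDuProfCLI`, and `<SESSION-DIR>` is a placeholder for the session subdirectory created under the output directory.

```python
import shlex

CLI = "AMDuProfCLI"  # assumption: the binary is on PATH under this name


def collect_cmd(output_dir, program, args=(), config="tbp"):
    """Build the argument list for a collect run (hypothetical helper)."""
    return [CLI, "collect", "--config", config, "-o", output_dir, program, *args]


def report_cmd(session_dir):
    """Build the argument list for a report run on a collected session."""
    return [CLI, "report", "-i", session_dir]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch also works
    # on machines where AMD uProf is not installed.
    print(shlex.join(collect_cmd("/tmp/cpuprof-tbp", "AMDTClassicMatMul-bin")))
    print(shlex.join(report_cmd("/tmp/cpuprof-tbp/<SESSION-DIR>")))
```

The argument lists can be passed to `subprocess.run` unchanged once the real session directory name is known.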
To get the list of supported predefined sampling configurations that can be used with collect command’s --config option, run the command: AMDuProfCLI info --list collect-configs.
A sample output is as follows:
Figure 9.2 Supported Predefined Configurations on Linux
Figure 9.3 Supported Predefined Configurations on Windows
The generated report contains the following sections:

| Report section | Description |
|---|---|
| EXECUTION | Information about the target launch application |
| PROFILE DETAILS | Details about the current session, such as profile type, scope, and sampling events |
| MONITORED EVENTS | List of the profiled events and the corresponding sampling intervals |
| 10 HOTTEST FUNCTIONS | List of the top 10 hot functions and the metrics attributed to them |
| TAKEN BRANCH ANALYSIS SUMMARY | List of the top 10 hot branches |
| 10 HOTTEST PROCESSES | List of the top 10 hot processes and the metrics attributed to them |
| 10 HOTTEST MODULES | List of the top 10 hot modules and the metrics attributed to them |
| 10 HOTTEST THREADS | List of the top 10 hot threads and the metrics attributed to them |
| PROFILE REPORT FOR PROCESS | The metrics attributed to the profiled process. This section is shown when the --detail option is used for report generation. It contains other subsections. |
To collect power profile counter values:
Run the AMDuProfCLI timechart command with the --list option to get the list of supported counter categories.
Run the AMDuProfCLI timechart command with the --event option to specify the required counters, then collect and report them.
The following timechart run lists the supported counter categories:
Figure 9.4 Output of timechart --list Command
The following timechart run collects the profile samples and writes them to a file:
Figure 9.5 Timechart Execution
The above run collects the power and frequency counters on all the devices on which these counters are supported and writes them to the output file specified with the -o option. The given application is launched before profiling begins, and data is collected until the application terminates.
The collect command collects the performance profile data and writes it to raw data files in the specified output directory. These files can then be analyzed using the AMDuProfCLI report command or the AMDuProf GUI.
AMDuProfCLI collect [--help] [<options>] [<PROGRAM>] [<ARGS>]
where,
<PROGRAM>: Denotes the launch application to be profiled.
<ARGS>: Denotes the list of arguments for the launch application.
$ AMDuProfCLI collect --config <config> <PROGRAM> [<ARGS>]
$ AMDuProfCLI collect [--config <config> | -e <event>] [-a] [-d <duration>] [<PROGRAM>]
The following table lists the collect command options.
Option |
Description |
|---|---|
|
Displays the help information on the console/terminal. |
|
Base directory path in which collected data files will be saved. A new subdirectory will be created in this directory. |
|
Predefined sampling configuration to be used to collect samples. Use the command |
|
Specify this option to collect GPU Profile data for specific IP Block. List Of IP Block:
|
|
A predefined event, which has predefined arguments, can be used directly with -e, --event. Alternatively, to provide more granular parameters, specify a Timer, PMU, or IBS event, or a predefined event, with arguments in the form of comma-separated key=value pairs. The supported keys are:
Note
Argument details
When these arguments are not passed, then the default values are:
|
|
Profile the existing processes by attaching to a running process. The process IDs are separated by comma. Note
|
|
System Wide Profile (SWP). If this flag is not set, the command line tool profiles only the launched application or the process IDs attached with |
|
Comma separated list of CPUs to profile. The ranges of CPUs can be specified with ‘-’, for example: 0-3. This option is not supported while profiling MPI applications. Note On Windows, the selected cores should belong to only one processor group. For example, 0-63, 64-127, and so on. |
|
Profile only for the specified duration n in seconds. |
|
Sampling interval for PMC events. Note This interval will override the sampling interval specified with individual events. |
|
Set the core affinity of the launched application to be profiled. Comma separated list of core-ids. Ranges of core-ids can also be specified, for example, 0-3. The default affinity is all the available cores. This option is not supported while profiling MPI applications. |
|
Do not profile the children of the launched application (processes launched by the profiled application). |
|
Terminate the launched application after the profile data collection ends. Only the launched application process will be killed. Its children (if any) may continue to execute. |
|
Start delay n in seconds. Start profiling after the specified duration. When n is 0, there is no impact. |
|
Start with profiling paused indefinitely. The target application resumes the profiling using the profile control APIs. This option must be used only when the launched application is instrumented to control the profile data collection using the resume and pause APIs (defined in the AMDProfileControl APIs). |
|
Specify the working directory. The default is the current working directory. |
|
Specify the path where the log file should be created. If this option is not provided, the log file will be created either in path set by The log file name will be of the format $USER-AMDuProfCLI.log (on Linux, FreeBSD) or %USERNAME%-AMDuProfCLI.log (on Windows). |
|
Capture the timestamp of the log records. |
|
Stop the profiling when the collected data file size (in MB) crosses the specified limit. Note This option may be deprecated in future releases. |
|
Enable data collection at the specified frequency. Note This frequency will override the sampling frequency specified with the individual events. |
|
Use this option to set the environment variables. |
|
OS Support: Windows Enables callstack Sampling. Specify the Unwind Interval (I) in milliseconds and Unwind Depth (D) value. Specify the Scope (S) by choosing one of the following:
|
|
OS Support: Linux Enables callstack sampling. Specify (F) to collect/ignore missing frames due to omission of frame pointers by compiler:
When Note Passing a large When |
|
Same as passing --call-graph fp (Linux, FreeBSD). Same as passing --call-graph 1:128:user:fp (Windows). |
|
OS Support: Windows Set callstack collection mode.
Default mode is fp. |
|
OS Support: Linux Callstack collection mode. Default mode is fp.
|
|
OS Support: Windows Set callstack scope type. Scope type should contain one of these options:
Default scope type is |
|
OS Support: Windows Set callstack unwind interval. Interval must be within the range [1 - 100]. Default interval is 1 ms. |
|
OS Support: Windows Set callstack unwind depth. Depth must be within the range [2 - 392]. Default depth is 128. |
|
OS Support: Linux Set callstack unwind depth. Depth must be within the range [2 - 1024]. Default depth is 32. |
|
OS Support: Linux Callstack size. Default size is 1024 bytes. When mode = fpo | dwarf, the size must be within [16 - 32768] and specifies the maximum stack size (in bytes) to collect per call stack sample. When mode = fp, the size is not applicable and is ignored if passed. |
|
OS Support: Windows Collect the thread run time information to report thread concurrency. Thread concurrency shows how long a specific number of threads run simultaneously. |
|
OS Support: Windows Size (number of pages per core) of the buffer used for data collection by the driver. The default size is 512 pages per core. |
|
OS Support: Windows Stop the profiling when the collected data file size (in MB) crosses the specified limit. When used with the option |
|
OS Support: Windows Specify the profile data collection mode as a ring buffer. The collection limit can be set using the option |
|
OS Support: Linux Profile existing threads by attaching to a running thread. The thread IDs are separated by comma. |
|
OS Support: Linux To trace a target domain. TARGET can be one or more of the following:
Use Note Applicable to per process profiling. Not applicable to:
|
|
OS Support: Linux Provide MPI implementation type: openmpi for tracing OpenMPI library, mpich for tracing MPICH and its derivative libraries. Default selection is mpich. Note Use this option with |
|
OS Support: Linux Provide tracing scope: lwt for light-weight tracing, full for complete tracing. Default scope type is full. Note Use this option with |
|
OS Support: Linux Provide OpenMP implementation type: ompt for tracing of OpenMP libraries supporting OMPT interface (example: LLVM, AOCC), omplib for tracing GCC OpenMP library. Default selection is Note Use this option with |
|
OS Support: Linux Provide tracing scope: full for complete tracing, basic for basic tracing, where synchronization related OpenMP events are not traced to reduce the disk space usage. Default selection is basic. Note
|
|
OS Support: Linux Provide event names. Use command Note Use this option with |
|
OS Support: Linux Provide event name and threshold value. Note Use this option with |
|
OS Support: Linux Specify functions to trace from the library or executable.
This option will be deprecated in a future release. Recommended to use |
|
OS Support: Linux Provide minimum function size to trace. Default function size is 128 bytes. This option will be deprecated in a future release. Recommended to use Note Use this option with |
|
OS Support: Linux Specify functions to trace from the library or executable:
Note It is recommended to provide the absolute/full path of a module. |
|
OS Support: Linux Specify functions to exclude from the library or executable:
Note It is recommended to provide the absolute/full path of a module. |
|
OS Support: Linux Set the kernel memory mapped data buffer to size. The size can be specified in pages or with a suffix Bytes (B/b), Kilo bytes (K/k), Megabytes (M/m), and Gigabytes (G/g). |
|
OS Support: Linux Pass this option while collecting CPU Profiling data of an MPI application. For MPI tracing, use the collect command with the --trace option. |
|
OS Support: Linux Specify the PID of qemu-kvm process to be profiled to collect guest-side performance profile. |
|
OS Support: Linux Specify the path of guest /proc/kallsyms copied on the local host. AMD uProf reads it to get the guest kernel symbols. |
|
OS Support: Linux Specify the path of guest /proc/modules copied to the local host. AMD uProf reads it to get the guest kernel module information. |
|
OS Support: Linux Specify the path of guest vmlinux and kernel sources copied on the local host. AMD uProf reads it to resolve the guest kernel module information. |
|
OS Support: Linux Capture LBR data. You can also specify the branch filter type:
Note
|
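The CPU list and core affinity options in the table above accept comma-separated core IDs with `-` ranges such as `0-3`. The following Python sketch shows how such a list expands and how the Windows processor-group restriction could be checked before invoking the tool. Both function names are hypothetical, and the group size of 64 is an assumption inferred from the `0-63, 64-127` example in the table.

```python
def expand_cpu_list(spec):
    """Expand a list such as "0-3,8" into sorted, unique core IDs."""
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"invalid range: {part}")
            cores.update(range(first, last + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def in_single_processor_group(cores, group_size=64):
    """Return True if all cores fall into one group of group_size cores.

    Assumption: groups are consecutive blocks of 64 cores (0-63, 64-127, ...).
    """
    return len({core // group_size for core in cores}) <= 1
```

For instance, `expand_cpu_list("60-70")` spans two such groups, so `in_single_processor_group` returns `False` for it.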
Launch AMDTClassicMatMul.exe and collect the samples for CYCLES_NOT_IN_HALT and RETIRED_INST events:
C:\> AMDuProfCLI.exe collect -e cycles-not-in-halt -e retired-inst --interval 1000000 -o c:\Temp\cpuprof-custom AMDTClassicMatMul.exe
C:\> AMDuProfCLI.exe collect -e event=cycles-not-in-halt,interval=250000 -e event=retired-inst,interval=500000 -o c:\Temp\cpuprof-custom AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect the Time-Based Profile (TBP) samples:
C:\> AMDuProfCLI.exe collect -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and do Assess Performance profile for 10 seconds:
C:\> AMDuProfCLI.exe collect --config assess -o c:\Temp\cpuprof-assess -d 10 AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect the IBS samples in the SWP mode:
C:\> AMDuProfCLI.exe collect --config ibs -a -o c:\Temp\cpuprof-ibs-swp AMDTClassicMatMul.exe
Collect the TBP samples in SWP mode for 10 seconds:
C:\> AMDuProfCLI.exe collect -a -o c:\Temp\cpuprof-tbp-swp -d 10
Launch AMDTClassicMatMul.exe and collect TBP with callstack sampling:
C:\> AMDuProfCLI.exe collect --config tbp -g -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect TBP with callstack sampling (unwind FPO optimized stack):
C:\> AMDuProfCLI.exe collect --config tbp --call-graph 1:64:user:fpo -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
C:\> AMDuProfCLI.exe collect --config tbp --call-graph-mode fpo --call-graph-type user -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
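The Windows `--call-graph` value used above packs the unwind interval, unwind depth, scope, and mode into one colon-separated string. A minimal Python sketch of assembling and range-checking that value follows. The function name is hypothetical; the limits and defaults are the ones given in the option table ([1 - 100] ms, [2 - 392], and `1:128:user:fp` as the `-g` equivalent). Scope and mode are passed through unchecked because their full lists of values are not shown here.

```python
def windows_call_graph(interval_ms=1, depth=128, scope="user", mode="fp"):
    """Build the I:D:S:F value for --call-graph on Windows."""
    if not 1 <= interval_ms <= 100:
        raise ValueError("unwind interval must be within [1 - 100] ms")
    if not 2 <= depth <= 392:
        raise ValueError("unwind depth must be within [2 - 392]")
    # Scope and mode are not validated in this sketch.
    return f"{interval_ms}:{depth}:{scope}:{mode}"
```

Calling `windows_call_graph(1, 64, "user", "fpo")` reproduces the value used in the example above.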
Launch AMDTClassicMatMul.exe and collect TBP with callstack sampling (unwind FPO optimized stack disabled):
C:\> AMDuProfCLI.exe collect --config tbp --call-graph-mode fp -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect the samples for PMCx076 and PMCx0C0:
C:\> AMDuProfCLI.exe collect -e event=pmcx76,interval=250000 -e event=pmcxc0,user=1,os=0,interval=250000 -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect the samples for IBS OP with an interval of 50000:
C:\> AMDuProfCLI.exe collect -e event=ibs-op,interval=50000 -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect TBP samples with thread concurrency and thread names:
C:\> AMDuProfCLI.exe collect --config tbp --thread thread=concurrency,name -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
Collect samples for PMCx076 and PMCx0C0, but collect the call graph info only for PMCx0C0:
C:\> AMDuProfCLI.exe collect -e event=pmcx76,interval=250000 -e event=pmcxc0,interval=250000,call-graph -o c:\Temp\cpuprof-pmc AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect the samples for the predefined events RETIRED_INST and L1_DC_REFILLS.ALL:
C:\> AMDuProfCLI.exe collect -e event=RETIRED_INST,interval=250000 -e event=L1_DC_REFILLS.ALL,user=1,os=0,interval=250000 -o C:\Temp\cpuprof-pmc AMDTClassicMatMul.exe
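The `-e` arguments in the examples above are comma-separated key=value pairs, with `call-graph` appearing as a bare flag. The following Python sketch formats such arguments. The helper names are hypothetical, and only keys that appear in these examples (`event`, `interval`, `user`, `os`, `cmask`, `call-graph`) are known to be accepted; the sketch does not validate keys.

```python
def event_spec(event, **params):
    """Format one -e argument as comma-separated key=value pairs."""
    parts = [f"event={event}"]
    for key, value in params.items():
        key = key.replace("_", "-")  # call_graph -> call-graph
        # A value of True marks a bare flag such as call-graph.
        parts.append(key if value is True else f"{key}={value}")
    return ",".join(parts)


def event_args(*specs):
    """Turn several event specs into repeated -e arguments."""
    args = []
    for spec in specs:
        args += ["-e", spec]
    return args
```

Keyword arguments keep their order, so `event_spec("pmcxc0", user=1, os=0, interval=250000)` yields the same string as the PMCx0C0 example above.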
Launch AMDTClassicMatMul.exe, collect the TBP and Assess Performance samples:
C:\> AMDuProfCLI.exe collect --config tbp --config assess -o c:\Temp\cpuprof-tbp-assess AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect the samples for PMCx076 and PMCx0C0 events with count-mask enabled:
C:\> AMDuProfCLI.exe collect -e event=pmcx076,cmask=0x0 -e event=pmcx0c0,cmask=0x7f,interval=250000 -o c:\Temp\cpuprof-pmc AMDTClassicMatMul.exe
Launch AMDTClassicMatMul-bin and collect the samples for CYCLES_NOT_IN_HALT and RETIRED_INST events:
$ ./AMDuProfCLI collect -e cycles-not-in-halt -e retired-inst --interval 1000000 -o /tmp/cpuprof-custom AMDTClassicMatMul-bin
$ ./AMDuProfCLI collect -e event=cycles-not-in-halt,interval=250000 -e event=retired-inst,interval=500000 -o /tmp/cpuprof-custom AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect the TBP samples:
$ ./AMDuProfCLI collect -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and do Assess Performance profile for 10 seconds:
$ ./AMDuProfCLI collect --config assess -o /tmp/cpuprof-assess -d 10 AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect the IBS samples in the SWP mode:
$ ./AMDuProfCLI collect --config ibs -a -o /tmp/cpuprof-ibs-swp AMDTClassicMatMul-bin
Collect the TBP samples in SWP mode for 10 seconds:
$ ./AMDuProfCLI collect -a -o /tmp/cpuprof-tbp-swp -d 10
Launch AMDTClassicMatMul-bin and collect TBP with callstack sampling:
$ ./AMDuProfCLI collect --config tbp -g -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect TBP with callstack sampling (unwind FPO optimized stack):
$ ./AMDuProfCLI collect --config tbp --call-graph-mode fpo --call-graph-size 512 -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
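On Linux, the call stack size applies only to the `fpo` and `dwarf` modes and must lie within [16 - 32768] bytes; with `fp` it is ignored. A minimal Python sketch of that rule, using the two option names from the example above, could look as follows. The function name is hypothetical, and dropping the size for `fp` mirrors the "ignored if passed" behavior rather than anything the tool requires.

```python
def linux_callstack_args(mode="fp", size=None):
    """Build --call-graph-mode / --call-graph-size arguments for Linux."""
    args = ["--call-graph-mode", mode]
    if mode in ("fpo", "dwarf") and size is not None:
        if not 16 <= size <= 32768:
            raise ValueError("call stack size must be within [16 - 32768] bytes")
        args += ["--call-graph-size", str(size)]
    # For mode fp the size is not applicable, so it is left out here.
    return args
```

When no size is given, the option is omitted so that the tool's default of 1024 bytes applies.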
Launch AMDTClassicMatMul-bin and collect the samples for PMCx076 and PMCx0C0:
$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -e event=pmcxc0,user=1,os=0,interval=250000 -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect the samples for IBS OP with interval 50000:
$ ./AMDuProfCLI collect -e event=ibs-op,interval=50000 -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
Attach to a thread and collect TBP samples for 10 seconds:
$ AMDuProfCLI collect --config tbp -o /tmp/cpuprof-tbp-attach -d 10 --tid <TID>
Collect basic OpenMP trace info of an OpenMP application compiled with GCC OpenMP library:
$ AMDuProfCLI collect --trace openmp --openmp-impl omplib -o /tmp/cpuprof-omp <path-to-openmp-exe>
Launch AMDTClassicMatMul-bin and collect the memory accesses for false cache sharing:
$ AMDuProfCLI collect --config memory -o /tmp/cpuprof-mem AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect the threading configuration to analyze hotspots, thread state, and wait object analysis among threads:
$ AMDuProfCLI collect --config threading -o /tmp/cpuprof-threading AMDTClassicMatMul-bin
Collect MPI profiling information:
$ mpirun -np 4 ./AMDuProfCLI collect --config assess --mpi --output-dir /tmp/cpuprof-mpi /tmp/namd <parameters>
Collect the samples for PMCx076 and PMCx0C0, but collect the call graph info only for PMCx0C0:
$ AMDuProfCLI collect -e event=pmcx76,interval=250000 -e event=pmcxc0,interval=250000,call-graph -o /tmp/cpuprof-pmc AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect the samples for the predefined events RETIRED_INST and L1_DC_REFILLS.ALL:
$ AMDuProfCLI collect -e event=RETIRED_INST,interval=250000 -e event=L1_DC_REFILLS.ALL,user=1,os=0,interval=250000 -o /tmp/cpuprof-pmc AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect pthread runtime trace with the default threshold:
$ AMDuProfCLI collect --trace osrt --osrt-event pthread -o /tmp/cpuprof-os AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect syscall taking more than or equal to 1µs:
$ AMDuProfCLI collect --trace osrt --osrt-event syscall --osrt-threshold syscall:1000000 -o /tmp/cpuprof-os AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect the GPU Traces for hip and hsa domain:
$ AMDuProfCLI collect --trace gpu -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin, collect the TBP samples and GPU Traces for hip and hsa domain:
$ AMDuProfCLI collect --config tbp --trace gpu -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect the GPU samples:
$ AMDuProfCLI collect --config gpu -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect GPU samples for the SQ block:
$ AMDuProfCLI collect --config gpu --ip-block SQ -o /tmp/gpuprof-gpu AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect trace data for all functions in AMDTClassicMatMul-bin:
$ AMDuProfCLI collect --trace osrt --osrt-event function --osrt-funcs AMDTClassicMatMul-bin:* -o /tmp/cpuprof-os AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect trace data for all functions in AMDTClassicMatMul-bin whose size is greater than or equal to 64 bytes:
$ AMDuProfCLI collect --trace osrt --osrt-event function --osrt-func-size 64 --osrt-threshold function:10000 --osrt-funcs AMDTClassicMatMul-bin:* -o /tmp/cpuprof-os AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and perform branch analysis with the default filter type:
$ AMDuProfCLI collect --branch-filter -o /tmp/cpuprof-ebp-branch AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect samples for the event PMCXC0:
$ AMDuProfCLI collect -e event=pmcxc0,interval=250000 --branch-filter u,k,any -o /tmp/cpuprof-ebp-branch AMDTClassicMatMul-bin
The report command generates a report in readable format by processing the raw profile data files or from the (processed) database files available in the specified directory.
AMDuProfCLI report [--help] [<options>]
$ AMDuProfCLI report -i <session-dir path>
Option |
Description |
|---|---|
|
Displays this help information on the console/terminal. |
|
Path to the directory containing collected data. |
|
Generate detailed report. |
|
Specify the report to be generated. The supported report options are:
This option is applicable only with |
|
Print the callgraph. Use with the option |
|
Cutoff to limit the number of processes, threads, modules, and functions to be reported. n is the minimum number of entries to be reported in various report sections. The default value is 10. Note
|
|
Report only the events present in the given view file. Use the command
|
|
Show inline functions for C, C++ executables. Note Using this option will increase the time taken to generate the report. |
|
Generate detailed function report of the system module functions (if debug info is available) with the source statements. This option only works with |
|
Source file directories (semicolon separated paths). Multiple use of |
|
Report only the assembly instructions having samples. This option only works with |
|
Choose the syntax of assembly instructions. The supported options are att and intel. If this option is not used:
|
|
Generate the function report with only assembly instructions. This option works only with the --detail option. |
|
Report all the assembly instructions of a function with and without samples. This option only works with the |
|
Specify the Timer, PMC, or IBS event on which the reported profile data will be sorted with arguments in the form of comma separated key=value pairs. The supported keys are:
When both event and metric are enabled, event takes priority over metric. Use the command Details about the arguments:
|
|
Use this option to configure the sample aggregation interval which is useful when the session is imported to GUI.
Aggregation INTERVAL can also be specified as a numeric value in milliseconds. |
|
Restricts report generation to the time interval between T1 and T2. Where, T1 and T2 are time in seconds from profile start time. |
|
Generate instruction MIX report. It is only supported for IBS config and IBS events profiling. It is only supported for the native binaries. |
|
IMIX report generation. Supported group-by options are:
|
|
Ignore samples from system modules. |
|
Show percentage of samples instead of actual samples. |
|
Show the number of samples. This option is enabled by default. |
|
Show the number of events that occurred. |
|
Show all the cachelines in the report sections for cache analysis. By default, only the cachelines accessed by more than one process/thread are listed. Supported only for memory config report on Windows and Linux platforms. |
|
Show the shared cachelines accessed by more than one process/thread for cache analysis. Set n to the number of shared cacheline addresses to be reported. Use this option for false cache sharing analysis. |
|
Binary file path, multiple usage of |
|
Source file path, multiple usage of |
|
Debug Symbol paths (semicolon separated). Multiple use of |
|
Write a report to a file. If the path has a .csv extension, it is assumed to be a file path and used as it is. If the .csv extension is not used, then the path is assumed to be a directory and the report file is generated in the directory with the default name. |
|
Print the report to a console or terminal. |
|
Perform the re-translation of collected data files with a different set of translation options. |
|
Use this option to generate ASCII dump of IBS OP profile samples.
Note This option might delay the translation. |
|
Remove the raw data files to recover the disk space. |
|
Use this option to show Python interpreter functions in the callgraph/flamegraph when translation is performed on Python profiled data (on Linux). |
|
Create a compressed archive of the required session files which can be used in other system for analysis. |
|
Specify the path where the log file should be created. If this option is not provided, the log file will be created either in the path set by The log file name will be of the format |
|
Capture the timestamp of the log records. |
|
OS Support: Windows Symbol Server directories (semicolon separated paths). For example, Microsoft Symbol Server. Multiple use of |
|
OS Support: Windows The path to store the symbol files downloaded from the Symbol Servers. |
|
OS Support: Windows Download symbols using the Microsoft Symsrv. By default, AMD symbol downloader will be used. |
|
OS Support: Linux This option is used along with the --input-dir option. Generates the report for a specific host. The supported options are:
Note If |
|
OS Support: Linux Generate report only for specific profiling category. Comma separated multiple categories can be specified. If this option is not used, then report for all categories gets generated. Multiple instances of Supported categories are:
Example: --category cpu,mpi,trace,gputrace,gpuprof
--category mpi --category cpu --category trace --category gputrace --category gpuprof
|
|
OS Support: Linux Specify the time interval in seconds to list the function count detail report. If this option is not specified, the function count will be generated for the entire profile duration. |
|
This option is used along with the
Note If |
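The report output path rule in the table above (a path ending in .csv is used as the report file, anything else is treated as a directory) can be sketched in Python as follows. The function name is hypothetical, and treating the extension case-insensitively is an assumption. The default report file name is not stated here, so the sketch only classifies the path.

```python
from pathlib import PurePath


def classify_report_path(path):
    """Return "file" if path would be used as the report file, else "directory"."""
    # Assumption: the .csv check is case-insensitive.
    if PurePath(path).suffix.lower() == ".csv":
        return "file"
    return "directory"
```

A script can use this check to know in advance whether the report will land at the given path or under it.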
Generate report from the raw datafile:
C:\> AMDuProfCLI.exe report -i c:\Temp\cpuprof-tbp\<SESSION-DIR>
Generate IMIX report from the raw datafile:
C:\> AMDuProfCLI.exe report --imix -i c:\Temp\cpuprof-imix\<SESSION-DIR>
Generate report from the raw datafile sorted on a PMC event:
C:\> AMDuProfCLI.exe report -s event=pmcxc0,user=1,os=0 -i c:\Temp\cpuprof-ebp\<SESSION-DIR>
Generate report from the raw datafile sorted on the ibs-op event:
C:\> AMDuProfCLI.exe report -s event=ibs-op -i c:\Temp\cpuprof-ibs\<SESSION-DIR>
Generate report from the raw datafile for power samples:
C:\> AMDuProfCLI.exe report -i c:\Temp\pwrprof-swp\<SESSION-DIR>
Generate report with Symbol Server paths:
C:\> AMDuProfCLI.exe report --symbol-path C:\AppSymbols;C:\DriverSymbols --symbol-server http://msdl.microsoft.com/download/symbols --symbol-cache-dir C:\symbols -i c:\Temp\cpuprof-tbp\<SESSION-DIR>
Generate report from the raw datafile on one of the predefined views:
C:\> AMDuProfCLI.exe report --view ipc_assess -i c:\Temp\pwrprof-swp\<SESSION-DIR>
Generate report from the raw datafile providing the source and binary paths:
C:\> AMDuProfCLI.exe report --bin-path Examples\AMDTClassicMatMul\bin\ --src-path Examples\AMDTClassicMatMul\ -i c:\Temp\cpuprof-tbp\<SESSION-DIR>
Generate report from the raw datafile:
$ AMDuProfCLI report -i /tmp/cpuprof-tbp/<SESSION-DIR>
Generate IMIX report from the raw datafile:
$ AMDuProfCLI report --imix -i /tmp/cpuprof-imix/<SESSION-DIR>
Generate report from the raw datafile sorted on a PMC event:
$ AMDuProfCLI report -s event=pmcxc0,user=1,os=0 -i /tmp/cpuprof-ebp/<SESSION-DIR>
Generate report from the raw datafile sorted on the ibs-op event:
$ AMDuProfCLI report -s event=ibs-op -i /tmp/cpuprof-ibs/<SESSION-DIR>
Generate Trace report from the raw datafile:
$ AMDuProfCLI report -i /tmp/cpuprof-os/<SESSION-DIR> --category trace
Generate GPU Trace report from the raw datafile:
$ AMDuProfCLI report -i /tmp/cpuprof-gpu/<SESSION-DIR> --category gputrace
Generate GPU Profile report from the raw datafile:
$ AMDuProfCLI report -i /tmp/cpuprof-gpu/<SESSION-DIR> --category gpuprof
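The `--category` option used in the examples above can be given either once with a comma-separated list or several times. The following Python sketch builds both forms. The helper name is hypothetical, and the set of category names is taken only from the example in the option table (cpu, mpi, trace, gputrace, gpuprof), so it may be incomplete.

```python
# Category names as they appear in the --category example; possibly incomplete.
KNOWN_CATEGORIES = ("cpu", "mpi", "trace", "gputrace", "gpuprof")


def category_args(categories, repeated=False):
    """Build --category arguments in comma-separated or repeated form."""
    unknown = [c for c in categories if c not in KNOWN_CATEGORIES]
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    if repeated:
        args = []
        for category in categories:
            args += ["--category", category]
        return args
    return ["--category", ",".join(categories)]
```

The comma-separated form contains no spaces, which keeps the list a single shell argument.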
The translate command processes the raw profile data and generates the samples info database files. These databases can be imported to GUI or CLI and used for generating the report.
AMDuProfCLI translate [<options>]
$ AMDuProfCLI translate -i <session-dir path>
The following table lists the AMDuProfCLI translate command options:
Option |
Description |
|---|---|
|
Binary file path. Multiple use of |
|
Capture the timestamp of the log records. |
|
Create a compressed archive of required session files which can be used in other system for analysis |
|
Debug symbol path. Multiple instances of |
|
Displays the help information. |
|
Inline function extraction for C and C++ executables. Note Using this option will increase the time taken to generate the report. |
|
OS Support: Linux Path to the file containing kallsyms info. If no path is provided, it defaults to /proc/kallsyms. |
|
OS Support: Linux Path to the Linux kernel debug info file. If no path provided, it searches for the debug info file in the default download path. |
|
OS Support: Linux Process only a specific profiling category. Multiple comma-separated categories can be specified. If this option is not used, the raw data files of all categories are processed. Multiple instances of --category are allowed. The supported categories are:
Example: --category cpu,mpi,trace,gputrace,gpuprof or --category mpi --category cpu --category trace --category gputrace --category gpuprof
|
|
OS Support: Linux This option is used with the --input-dir option. It processes samples belonging to a specific host. The supported options are:
Note If |
|
OS Support: Windows Download symbols using the Microsoft Symsrv. By default, AMD symbol downloader will be used. |
|
OS Support: Windows Links to Symbol Server. For example: Microsoft Symbol Server. Multiple instances of |
|
OS Support: Windows Path to save the symbols downloaded from the Symbol Servers. |
|
Path to the directory containing collected data. |
|
Re-translate the collected data files with a different set of translation options. |
|
Use this option to generate ASCII dump of IBS OP profile samples.
Note This option might delay the translation. |
|
Remove the raw data files to recover the disk space |
|
Restricts the processing to the time interval between T1 and T2, where T1, T2 are time in seconds from profile start time. |
|
Specify the path where the log file should be created. If this option is not provided, the log file will be created either in the path set by The log file name will be of the format |
|
Use this option to configure the sample aggregation interval which is useful when the session is imported to GUI.
Aggregation INTERVAL can also be specified as a numeric value in milliseconds. |
|
Use this option to show Python interpreter functions in the callgraph/flamegraph when translation is performed on Python profiled data (on Linux). |
Process all the raw data files
> AMDuProfCLI.exe translate -i c:\Temp\cpuprof-tbp\<SESSION-DIR>
Process the raw data files with Symbol Server paths
> AMDuProfCLI.exe translate --symbol-path C:\AppSymbols;C:\DriverSymbols --symbol-server http://msdl.microsoft.com/download/symbols --symbol-cache-dir C:\symbols -i c:\Temp\cpuprof-tbp\<SESSION-DIR>
Process the raw data files with the source and binary path
> AMDuProfCLI.exe translate --bin-path Examples\AMDTClassicMatMul\bin\ --src-path Examples\AMDTClassicMatMul\ -i c:\Temp\cpuprof-tbp\<SESSION-DIR>
Process all the raw data files
$ AMDuProfCLI translate -i /tmp/cpuprof-tbp/<SESSION-DIR>
Process the trace raw data file
$ AMDuProfCLI translate -i /tmp/cpuprof-os/<SESSION-DIR> --category trace
Process the GPU Trace raw data file
$ AMDuProfCLI translate -i /tmp/cpuprof-gpu/<SESSION-DIR> --category gputrace
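When several sessions sit under one base directory, the translate invocations above can be generated in a loop. The following is a minimal dry-run sketch, not part of AMD uProf: the helper name `print_translate_cmds` and the one-subdirectory-per-session layout are assumptions for illustration. It only prints the commands, using the `-i` and `--category` options shown in the examples above (`--category` is listed as Linux only).

```shell
#!/bin/sh
# Dry-run sketch: print one translate command per session directory.
# print_translate_cmds is a hypothetical helper; it assumes every
# subdirectory of the base directory is a session directory.

print_translate_cmds() {
    base_dir=$1
    category=$2   # optional; empty means all categories are processed
    for session in "$base_dir"/*/; do
        # Skip the literal pattern when the glob matched nothing.
        [ -d "$session" ] || continue
        session=${session%/}
        if [ -n "$category" ]; then
            printf 'AMDuProfCLI translate -i %s --category %s\n' "$session" "$category"
        else
            printf 'AMDuProfCLI translate -i %s\n' "$session"
        fi
    done
}
```

For example, `print_translate_cmds /tmp/cpuprof-os trace` prints one `translate ... --category trace` line per session directory; review the output before running the commands.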
The timechart command collects and reports system characteristics, such as power, thermal, and frequency metrics, and generates a text or CSV report.
Note
The timechart command is supported only on Windows and Linux.
AMDuProfCLI timechart [--help] [--list] [<options>] [<PROGRAM>] [<ARGS>]
where,
<PROGRAM>: Denotes the application to be launched before starting the power metrics collection.
<ARGS>: Denotes the list of arguments for the launch application.
$ AMDuProfCLI timechart --list
$ AMDuProfCLI timechart -e <event> -d <duration> [<PROGRAM>] [<ARGS>]
Options |
Description |
|---|---|
|
Collect counters for the specified combination of device type and/or category type. Use the command timechart --list to see the supported devices and categories. Note Multiple occurrences of |
|
Displays all the supported devices and categories. |
|
Displays this help information. |
|
Output directory path. |
|
Output file format. Supported formats are:
|
|
Profile duration n in seconds |
|
Sampling interval n in milliseconds. The minimum value is 10 ms. Note If not specified, the default interval is 1000 ms. |
|
Set the working directory for the launched target application. |
|
The core affinity. Comma separated list of core-ids. Ranges of core-ids can also be specified, for example, 0-3. The default affinity is all the available cores. The affinity is set for the launched application. |
Collect all the power counter values for a duration of 10 seconds with sampling interval of 100 milliseconds.
C:\> AMDuProfCLI.exe timechart --event power --interval 100 --duration 10
Collect all the frequency counter values for 10 seconds, sampling them every 500 milliseconds and dumping the results into a .csv file.
C:\> AMDuProfCLI.exe timechart --event frequency -o C:\Temp\output --interval 500 --duration 10
Collect all the frequency counter values at core 0 to 3 for 10 seconds, sampling them every 500 milliseconds and dumping the results into a text file.
C:\> AMDuProfCLI.exe timechart --event core=0-3,frequency -o C:\Temp\PowerOutput --interval 500 --duration 10 --format txt
Collect all the power counter values for a duration of 10 seconds with sampling interval of 100 milliseconds.
$ ./AMDuProfCLI timechart --event power --interval 100 --duration 10
Collect all the frequency counter values for 10 seconds, sampling them every 500 milliseconds and dumping the results into a .csv file.
$ ./AMDuProfCLI timechart --event frequency -o /tmp/PowerOutput --interval 500 --duration 10
Collect all the frequency counter values at core 0 to 3 for 10 seconds, sampling them every 500 milliseconds and dumping the results into a text file.
$ ./AMDuProfCLI timechart --event core=0-3,frequency -o /tmp/PowerOutput --interval 500 --duration 10 --format txt
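The interval rules from the options table (minimum 10 ms, default 1000 ms) can be checked before launching a collection. The sketch below is a hypothetical wrapper, not an AMD uProf feature: `timechart_cmd` and its error messages are invented for illustration, and it only prints the command built from the `--event`, `--interval`, and `--duration` options used in the examples above.

```shell
#!/bin/sh
# Dry-run sketch: validate the sampling interval, then print a timechart command.
# timechart_cmd is a hypothetical helper name.

timechart_cmd() {
    event=$1
    duration=$2
    interval=${3:-1000}   # 1000 ms is the documented default

    # Reject non-numeric input before the numeric comparison.
    case $interval in
        *[!0-9]*)
            echo "interval must be a whole number of milliseconds" >&2
            return 1 ;;
    esac
    # The documented minimum is 10 ms.
    if [ "$interval" -lt 10 ]; then
        echo "interval must be at least 10 ms" >&2
        return 1
    fi
    printf 'AMDuProfCLI timechart --event %s --interval %s --duration %s\n' \
        "$event" "$interval" "$duration"
}
```

Calling `timechart_cmd power 10 100` prints the first Linux example without the `./` prefix, while `timechart_cmd power 10 5` is rejected because 5 ms is below the minimum.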
The diff command streamlines the process of comparing multiple profile reports by automating the manual comparison of events. It processes the raw profile data, processed files, or database files to generate a markdown comparison report for the collected profiles. The generated markdown file includes detailed function data providing comprehensive insights into the compared profiles.
Furthermore, the diff command can also be used to generate a single profile report by specifying only the base profile path. This simplifies the generation of individual reports, making it more convenient and efficient.
During profile comparison, there is always a single base profile and multiple non-base profiles. Valid comparison results are obtained only for the functions that exist in both the base profile and non-base profiles.
By default, the comparison results are displayed in the source view. For each function, the source view table provides information such as File, Line, Source Code, Address, Instruction, Code Byte, and Events. This comprehensive view enables a detailed analysis of the compared profiles.
Note
To obtain meaningful and accurate comparison results, it is important to ensure that the base profile and non-base profiles have matching functions available for comparison.
AMDuProfCLI diff [--help] [<options>]
AMDuProfCLI compare [--help] [<options>]
AMDuProfCLI diff --baseline <base session-dir path> --with <non-base session-dir path> -o <output-dir>
To ensure accurate and meaningful profile comparisons, the following conditions must be met:
Same Events: The profiles being compared should have collected the same events. This ensures that the comparison is performed on relevant and comparable data.
Same Profile Duration (if specified): If the duration (-d) option is specified, the profiles being compared should have the same duration. This ensures consistency in the time span covered by the profiles.
Not a System Wide Profile: System-wide profiles cannot be compared directly. Therefore, only individual process or thread-level profiles are eligible for comparison.
Same Profile Data Limit (if used): If the --limit-size or --limit-data option is used during profiling, the profiles being compared should have the same data limit set. This ensures consistency in the size of profile data collected.
Same Inline Function Profiling (--inline): If the --inline option is used to profile inline functions, the profiles being compared should have used the same inline function profiling setting. This ensures consistent handling of inline functions during the comparison.
Option |
Description |
|---|---|
|
In the cases where the function names have changed in the non-base profile, specify the function names in the non-base profile that should be compared with the corresponding function names in the base profile. Specify different functions using the pipe symbol | as a separator. For each set of functions, you can use a comma to separate the function names between the base profile and the non-base profile. |
|
Path to the directory containing collected data. The profile data in this directory will be treated as the base profile against which all other profiles will be compared. |
|
Binary file path for the base profile. This will be considered for the non-base profiles if the corresponding bin path is not specified separately. Multiple usage of |
|
Binary file path for the first non-base profile. Multiple usage of |
|
Binary file path for the second non-base profile. Multiple usage of |
|
Binary file path for the third non-base profile. Multiple usage of |
|
Cut-off to limit the number of functions to be reported. ‘n’ is the maximum number of entries to be reported in various report sections. The default value is 10. |
|
Use this option to create comparison report in HTML format. If not specified, the default comparison report format Markdown will be used to generate the report. |
|
Path where the markdown comparison report will be generated |
|
Comparison results will be displayed in terms of percentages. |
|
Specify the Timer, PMC, or IBS event on which the reported profile data will be sorted with arguments in the form of comma separated key=value pairs. The supported keys are:
Use the command
Multiple occurrences of --sort-by (-s) are not allowed. |
|
Source file directories (semicolon separated paths) for base profile. This will be considered for the non-base profiles if the corresponding file directories are not specified separately. Multiple use of |
|
Source file directories (semicolon separated paths) for the first non-base profile. Multiple use of |
|
Source file directories (semicolon separated paths) for the second non-base profile. Multiple use of |
|
Source file directories (semicolon separated paths) for the third non-base profile. Multiple use of |
|
Comparison report will also be displayed in the terminal or command line interface apart from saving to a file. |
|
Specify the type of comparison to be performed. The supported comparison types are:
The default comparison type is name. |
|
Compare only the events present in the given view file. Use the command |
|
Path to the directory containing collected data. Each profile specified with |
|
Displays this help information on the console/terminal. |
|
Path to the directory containing collected data. Multiple occurrences of -i are allowed. The first occurrence of -i is considered the base session, while all subsequent occurrences of -i are treated as non-base sessions. Note When using |
Use the following commands to:
Generate a comparison report in HTML from base profile-data to its successor profile-data with delta shown in percentage
C:\> AMDuProfCLI.exe compare --baseline c:\Temp\cpuprof-tbp\<BASE-SESSION-DIR> --with c:\Temp\cpuprof-tbp\<SUCCESSOR-SESSION-DIR> --type name --show-percentage --html -o c:\Temp\cpuprof-tbp
Generate a comparison report of base profile data with subsequent profile data
C:\> AMDuProfCLI.exe diff --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR> -o c:\Temp\cpuprof-tbp
Generate a comparison report using the -i option
C:\> AMDuProfCLI.exe diff -i c:\Temp\cpuprof-tbp\<BASE-DIR> -i c:\Temp\cpuprof-tbp\<NON-BASE-DIR> -o c:\Temp\cpuprof-tbp
Generate a comparison report without ignoring the unique entries across sessions
C:\> AMDuProfCLI.exe diff --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --type order -o c:\Temp\cpuprof-tbp
Generate a comparison report of base profile data with subsequent profile data sorted on ibs-op event
C:\> AMDuProfCLI.exe diff --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --type name -s ibs-op -o c:\Temp\cpuprof-tbp
Generate a comparison report with delta shown in percentage
C:\> AMDuProfCLI.exe compare --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --type name --show-percentage -o c:\Temp\cpuprof-tbp
Generate a comparison report of base profile data with successor profile data with changed function names across sessions
C:\> AMDuProfCLI.exe compare --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --alias CalculateSum,CalculateUpdatedSum|enhanceOutput,optimizeOutput -o c:\Temp\cpuprof-tbp
Generate a comparison report of base profile data with multiple successor profile data
C:\> AMDuProfCLI.exe diff -i c:\Temp\cpuprof-tbp\<BASE-DIR> -i c:\Temp\cpuprof-tbp\<NON-BASE-DIR1> -i c:\Temp\cpuprof-tbp\<NON-BASE-DIR2> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR3> -o c:\Temp\cpuprof-tbp
Generate a comparison report on one of the predefined views
C:\> AMDuProfCLI.exe diff -i c:\Temp\cpuprof-tbp\<BASE-DIR> -i c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --view ipc_assess -o c:\Temp\cpuprof-tbp
Generate a comparison report providing the source and binary paths
C:\> AMDuProfCLI.exe diff -i c:\Temp\cpuprof-tbp\<BASE-DIR> -i c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --bin-path Examples\AMDTClassicMatMul\bin\ --src-path Examples\AMDTClassicMatMul\ --bin-path1 Examples\AMDTClassicMatMulMod\bin\ --src-path1 Examples\AMDTClassicMatMulMod\ -o c:\Temp\cpuprof-tbp
Generate comparison report in html from base profile-data to its successor profile-data with delta shown in percentage
AMDuProfCLI compare --baseline /tmp/cpuprof-tbp/<BASE-SESSION-DIR> --with /tmp/cpuprof-tbp/<SUCCESSOR-SESSION-DIR> --type name --show-percentage --html -o /tmp/cpuprof-tbp/
Figure 9.6 Diff HTML report generated with --html option
Generate a comparison report of base profile data with subsequent profile data
$ AMDuProfCLI diff --baseline /tmp/cpuprof-tbp/<BASE-DIR> --with /tmp/cpuprof-tbp/<NON-BASE-DIR> -o /tmp/cpuprof-tbp
Generate a comparison report of base profile data with subsequent profile data sorted on PMC event
$ AMDuProfCLI diff --baseline /tmp/cpuprof-tbp/<BASE-DIR> --with /tmp/cpuprof-tbp/<NON-BASE-DIR> -s event=pmcxc0,user=1,os=0 -o /tmp/cpuprof-tbp
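The --alias value pairs function names with a comma and separates pairs with the pipe symbol, which a POSIX shell would otherwise interpret as a pipeline operator. The sketch below shows one way to assemble and quote such a value; `build_alias` is a hypothetical helper, not part of AMDuProfCLI, and `<OUTPUT-DIR>` is a placeholder. The script only prints the resulting command.

```shell
#!/bin/sh
# Sketch: join "base,non-base" function-name pairs with '|' for --alias.
# build_alias is a hypothetical helper name.

build_alias() {
    alias_value=
    for pair in "$@"; do
        if [ -z "$alias_value" ]; then
            alias_value=$pair
        else
            alias_value="$alias_value|$pair"
        fi
    done
    printf '%s\n' "$alias_value"
}

# Single-quote the value in the printed command so that the shell
# does not treat '|' as a pipe when the command is run.
alias_arg=$(build_alias "CalculateSum,CalculateUpdatedSum" "enhanceOutput,optimizeOutput")
printf "AMDuProfCLI diff --baseline <BASE-DIR> --with <NON-BASE-DIR> --alias '%s' -o <OUTPUT-DIR>\n" "$alias_arg"
```

The function names are taken from the Windows --alias example above; replace them with the renamed functions in your own profiles.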
The profile command collects the performance profile data, processes it, and generates a profile report in a readable format. It is an alternative to running the collect and report commands separately.
AMDuProfCLI profile [--help] [<options>] [<PROGRAM>] [<ARGS>]
where,
<PROGRAM>: Denotes the launch application to be profiled.
<ARGS>: Denotes the list of arguments for the launch application.
$ AMDuProfCLI profile <PROGRAM> [<ARGS>]
$ AMDuProfCLI profile [--config <config> | -e <event>] [-a] [-d <duration>] [<PROGRAM>]
The following table lists the profile command options:
Option |
Description |
|---|---|
|
Set the core affinity of the launched application to be profiled. Comma separated list of core-ids. Ranges of core-ids can also be specified, for example, 0-3. The default affinity is all the available cores. This option is not supported while profiling MPI applications. |
|
Use this option to configure the sample aggregation interval which is useful when the session gets imported to GUI.
Aggregation INTERVAL can also be specified as numeric value in milliseconds. |
|
Use this option to generate ASCII dump of IBS OP profile samples.
Note This option might delay the translation. |
|
Binary file path. Multiple usage of |
|
OS Support: Linux Use this option to capture LBR data. Specify the branch filter type:
When the above filters are not set, the default filter type will be Note
|
|
OS Support: Linux Enables callstack sampling. Specify (F) to collect/ignore missing frames due to omission of frame pointers by compiler:
When Note Passing a large When |
|
OS Support: Windows Enables callstack Sampling. Specify the Unwind Interval (I) in milliseconds and Unwind Depth (D) value. Specify the Scope (S) by choosing one of the following:
|
|
OS Support: Windows Set callstack unwind depth. Depth must be within the range [2 - 392]. Default depth is 128. |
|
OS Support: Linux Set callstack unwind depth. Depth must be within the range [2 - 1024]. Default depth is 32. This option is applicable for Hotspots and Threading configurations; for any other configuration, this option is ignored. |
|
OS Support: Windows Set callstack unwind interval. Interval must be within the range [1 - 100]. Default interval is 1 ms. |
|
OS Support: Linux Callstack collection mode. Default mode is
|
|
OS Support: Windows Set callstack collection mode.
|
|
OS Support: Linux Callstack Size. Default size is 1024 bytes. When mode = When mode = |
|
OS Support: Windows Set callstack scope type. Scope type should contain one of these options:
Default scope type is |
|
OS Support: Linux Process only a specific profiling category. Multiple categories can be specified as a comma-separated list. If this option is not used, the raw data files of all categories are processed. Multiple instances of --category are allowed. The supported categories are:
Example: category cpu, mpi, trace, gputrace, gpuprof --category mpi --category cpu --category trace --category gputrace --category gpuprof
|
|
Predefined sampling configuration to be used to collect samples. Use the command |
|
Cut-off to limit the number of functions to be reported. n is the maximum number of entries to be reported in various report sections. The default value is 10. Note
|
|
Generate detailed report. |
|
Report only the assembly instructions having samples. This option only works with the |
|
Report all the assembly instructions of a function with and without samples. This option only works with the |
|
Generate the function report with only assembly instructions. |
|
Choose the syntax of assembly instructions. Supported options are att or intel. If this option is not used, the default style used is intel. |
|
Capture the timestamp of the log records. |
|
Use this option to set the environment variables. |
|
OS Support: Linux Specify functions to exclude from the library, executable, or kernel:
Note It is recommended to provide the absolute path of a module |
|
Use this option to create a compressed archive of required session files which can be used in other system for analysis. |
|
Enable data collection at the specified frequency ‘n’ (in Hz) for Core PMC events. Note This frequency will override the sampling frequency specified with individual events. |
|
OS Support: Linux Specify functions to trace from the library, executable, or kernel: function-pattern can be a function name or partial name ending with ‘*’ or only ‘*’ to trace all the functions of a module. Module can be a library or executable. To trace the kernel functions, replace the module with ‘kernel’. Note It is recommended to provide the absolute/full path of a module. |
|
OS Support: Linux Specify the time interval in seconds to list the function count detail report. If this option is not specified, function count will be generated for the entire profile duration. |
|
Specify the report to be generated. The supported report options are:
This option is applicable only with |
|
OS Support: Linux Specify the path of guest /proc/kallsyms copied on the local host. AMD uProf reads it to get the guest kernel symbol. |
|
OS Support: Linux Specify the path of guest /proc/modules copied to the local host. AMD uProf reads it to get the guest kernel module information. |
|
OS Support: Linux Specify the path of guest vmlinux and kernel sources copied on the local host. AMD uProf reads it to resolve the guest kernel module information. |
|
OS Support: Linux This option is used along with the --input-dir option. Generates the report belonging to a specific host. The supported options are:
Note If --host is not used, only the processes belonging to the system from which the report is generated are reported. If the system is a master node in a cluster, the report is generated for the lexicographically first host in that cluster. |
|
Ignore samples from system modules. |
|
Report Instruction Mix (only for native binaries). Default is module-wise IMIX. |
|
IMIX report generation. Supported group-by options are:
|
|
Inline function extraction for C and C++ executables. Note Using this option will increase the time taken to generate the report. |
|
Sampling interval for PMC events. Note This interval will override the sampling interval specified with individual events. |
|
OS Support: Linux Specify the PID of qemu-kvm process to be profiled to collect guest-side performance profile. |
|
OS Support: Windows Download symbols using the Microsoft Symsrv. By default, AMD symbol downloader will be used. |
|
Show the shared cachelines accessed by more than one process/thread for cache analysis. Set n to the number of shared cacheline addresses to be reported. Use this option for false cache sharing analysis. |
|
OS Support: Windows Stop the profiling when the collected data file size (in MB) crosses the specified limit. When used with the option |
|
Stop the profiling when the collected data file size (in MB) crosses the specified limit. Note This option may be deprecated in future releases. |
|
Specify the path where the log file should be created. If this option is not provided, the log file will be created either in path set by The log file name will be of the format |
|
Do not profile the children of the launched application (processes launched by the profiled application). |
|
Use this option to perform only collection and translation. |
|
OS Support: Linux Provide OpenMP implementation type:
Note Use this option with |
|
OS Support: Linux Provide tracing scope.
Note Use this option with |
This option is only applicable with |
OS Support: Linux Provide event names. Use command Note Use this option with |
|
OS Support: Linux Specify functions to exclude from the library or executable.
|
|
OS Support: Linux Provide minimum function size to trace. Default function size is 128 bytes. This option will be deprecated in a future release. Recommended to use Note Use this option with |
|
OS Support: Linux Specify functions to trace from the library or executable.
|
|
OS Support: Linux Provide event name and threshold value. Note Use this option with |
|
OS Support: Windows Specify the profile data collection mode as a ring buffer. The collection limit can be set using the option |
|
Use this option to show Python interpreter functions in the callgraph/flamegraph when translation is performed on Python profiled data (on Linux). |
|
Removes the raw data files to reclaim the disk space. |
|
Write a report to a file. If the path has a .csv extension, it is assumed to be a file path and used as it is. If the .csv extension is not used, the path is assumed to be a directory and the report file is generated in the directory with the default name |
|
Perform the re-translation of collected data files with a different set of translation options. |
|
Show all cachelines in report sections for cache analysis. By default, only cachelines accessed by more than one process/thread are listed. Use this option for false cache sharing analysis. |
|
Show the number of events occurred. |
|
Show percentage of samples instead of actual samples. |
|
Show the number of samples. This option is enabled by default. |
|
Generate detailed function report of the system module functions (if debug info is available) with the source statements. This option only works with the --detail option. |
|
Source file path, multiple usage of |
|
Source file directories (semicolon separated paths). Multiple use of --src-path is allowed. |
|
Start delay n in seconds. Start profiling after the specified duration. When ‘n’ is 0, there is no impact. |
|
Profiling paused indefinitely. The target application resumes the profiling using the profile control APIs. This option must be used only when the launched application is instrumented to control the profile data collection using the resume and pause APIs (see AMDPowerProfileAPI Library for definitions). |
|
Print the report to a console or terminal. |
|
Collect the thread run time information to report thread concurrency. Thread concurrency shows how long a specific number of threads run simultaneously. |
|
OS Support: Windows Path to save the symbols downloaded from the Symbol Servers. |
|
Debug Symbol paths (semicolon separated). Multiple use of --symbol-path is allowed. |
|
OS Support: Windows Symbol Server directories (semicolon separated paths). For example, Microsoft Symbol Server. Multiple use of |
|
OS Support: Windows
|
|
OS Support: Linux Profile existing threads by attaching to a running thread. The thread IDs are separated by comma. |
|
Restricts the processing to the time interval between T1 and T2, where T1, T2 are time in seconds from profile start time. |
|
OS Support: Linux To trace a target domain. TARGET can be one or more of the following:
Use Note Applicable to per process profiling. Not applicable to:
|
|
Compare only the events present in the given view file. Use the command |
|
OS Support: Linux Path to the Linux kernel debug info file. If no path provided, it searches for the debug info file in the default download path. |
|
System Wide Profile (SWP): If this flag is not set, the command line tool will profile only the launched application or the Process IDs attached with |
|
Terminate the launched application after the profile data collection ends. Only the launched application process will be killed. Its children (if any) may continue to execute. |
|
Comma separated list of CPUs to profile. The ranges of CPUs can be specified with ‘-’, for example, 0-3. This option is not supported with MPI profiling. Note On Windows, the selected cores should belong to only one processor group. For example, 0-63, 64-127, and so on. |
|
Profile only for the specified duration n in seconds. |
|
A predefined event, which has predefined arguments, can be used directly with -e, --event. Alternatively, for more granular parameters, specify a Timer, PMU, IBS event, or a predefined event with arguments in the form of comma separated key=value pairs. The supported keys are:
Note
Argument details
When these arguments are not passed, then the default values are:
|
|
Same as passing --call-graph fp (Linux, FreeBSD). Same as passing --call-graph 1:128:user:fp (Windows). |
|
Displays this help information on the console/terminal. |
|
OS Support: Windows Size (number of pages per core) of the buffer used for data collection by the driver. The default size is 512 pages per core. |
|
OS Support: Linux Set the size of the kernel memory mapped data buffer. The size can be specified in pages or with a suffix: Bytes (B/b), Kilobytes (K/k), Megabytes (M/m), or Gigabytes (G/g). |
|
Base directory path in which collected data files will be saved. A new sub-directory will be created in this directory. |
|
Profile the existing processes by attaching to a running process. The process IDs are separated by comma. Note
|
|
Specify the Timer, PMC, or IBS event on which the reported profile data will be sorted with arguments in the form of comma separated key=value pairs. The supported keys are:
When both event and metric are enabled, event takes priority over metric. Use the command Details about the arguments:
|
|
Specify the working directory. The default is the current working directory. |
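The CPU list format described in the table (comma separated IDs, ranges written with ‘-’) and the Windows note about processor groups can be checked with a small script before profiling. This is a sketch under stated assumptions: the function names are hypothetical, and the group width of 64 is inferred from the "0-63, 64-127" example rather than from a stated rule.

```shell
#!/bin/sh
# Sketch: expand a CPU list such as "0-3,8" and check that all CPUs fall in
# one processor group. Function names are hypothetical; the group width of
# 64 is an assumption based on the "0-63, 64-127" example.

expand_cpu_list() {
    list=$1
    result=
    old_ifs=$IFS
    IFS=,
    set -- $list          # split the list on commas
    IFS=$old_ifs
    for item in "$@"; do
        case $item in
            *-*) first=${item%-*}; last=${item#*-} ;;
            *)   first=$item; last=$item ;;
        esac
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            result="$result $cpu"
            cpu=$((cpu + 1))
        done
    done
    # Trim the leading space.
    printf '%s\n' "${result# }"
}

same_processor_group() {
    group=
    for cpu in $(expand_cpu_list "$1"); do
        this_group=$((cpu / 64))
        if [ -z "$group" ]; then
            group=$this_group
        elif [ "$group" -ne "$this_group" ]; then
            return 1
        fi
    done
    return 0
}
```

With these helpers, `same_processor_group "0-63"` succeeds, while `same_processor_group "60-70"` fails because CPUs 60 to 63 and 64 to 70 land in different groups.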
Launch application AMDTClassicMatMul.exe and collect the samples for CYCLES_NOT_IN_HALT and RETIRED_INST events and generate report
C:\> AMDuProfCLI.exe profile -e cycles-not-in-halt -e retired-inst --interval 1000000 -o c:\Temp\cpuprof-custom AMDTClassicMatMul.exe
C:\> AMDuProfCLI.exe profile -e event=cycles-not-in-halt,interval=250000 -e event=retired-inst,interval=500000 -o c:\Temp\cpuprof-custom AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe, collect IBS samples, and generate a thread-wise IMIX report
C:\> AMDuProfCLI.exe profile --config ibs --imix --imix-group-by thread -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and perform Assess Performance profile for 10 seconds and generate report
C:\> AMDuProfCLI.exe profile --config assess -o c:\Temp\cpuprof-assess -d 10 AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect the IBS samples in the SWP mode and generate report sorted on ibs-op event
C:\> AMDuProfCLI.exe profile --config ibs -a -s event=ibs-op -o c:\Temp\cpuprof-ibs-swp AMDTClassicMatMul.exe
Collect the TBP samples in SWP mode for 10 seconds and generate report
C:\> AMDuProfCLI.exe profile -a -o c:\Temp\cpuprof-tbp-swp -d 10
Launch AMDTClassicMatMul.exe, collect TBP with callstack sampling and generate report
C:\> AMDuProfCLI.exe profile --config tbp -g -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe, collect TBP with callstack sampling (unwind FPO optimized stack) and generate report
C:\> AMDuProfCLI.exe profile --config tbp --call-graph-mode fpo --call-graph-type user -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect the samples for PMCx076 and PMCx0C0 and generate report sorted on pmcxc0 event
C:\> AMDuProfCLI.exe profile -e event=pmcx76,interval=250000 -e event=pmcxc0,user=1,os=0,interval=250000 -s event=pmcxc0 -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect the samples for IBS OP with an interval of 50000 and generate report sorted on ibs-op event
C:\> AMDuProfCLI.exe profile -e event=ibs-op,interval=50000 -s event=ibs-op -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe, collect TBP samples with thread concurrency and thread name information, and generate report
C:\> AMDuProfCLI.exe profile --config tbp --thread thread=concurrency,name -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
Collect samples for PMCx076 and PMCx0C0, but collect the call graph info only for PMCx0C0 and generate report
C:\> AMDuProfCLI.exe profile -e event=pmcx76,interval=250000 -e event=pmcxc0,interval=250000,call-graph -o c:\Temp\cpuprof-pmc AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect the samples for predefined event RETIRED_INST and L1_DC_REFILLS.ALL events and generate report
C:\> AMDuProfCLI.exe profile -e event=RETIRED_INST,interval=250000 -e event=L1_DC_REFILLS.ALL,user=1,os=0,interval=250000 -o c:\Temp\cpuprof-pmc AMDTClassicMatMul.exe
Launch AMDTClassicMatMul.exe and collect the TBP, Assess Performance samples, and generate report
C:\> AMDuProfCLI.exe profile --config tbp --config assess -o c:\Temp\cpuprof-tbp-assess AMDTClassicMatMul.exe
Launch AMDTClassicMatMul-bin and collect the samples for CYCLES_NOT_IN_HALT and RETIRED_INST events and generate report
$ ./AMDuProfCLI profile -e cycles-not-in-halt -e retired-inst --interval 1000000 -o /tmp/cpuprof-custom AMDTClassicMatMul-bin
$ ./AMDuProfCLI profile -e event=cycles-not-in-halt,interval=250000 -e event=retired-inst,interval=500000 -o /tmp/cpuprof-custom AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect the IBS samples and generate thread-wise IMIX report from the raw data file
$ ./AMDuProfCLI profile --config ibs --imix --imix-group-by thread -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and perform Assess Performance profile for 10 seconds and generate report
$ ./AMDuProfCLI profile --config assess -o /tmp/cpuprof-assess -d 10 AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect the IBS samples in the SWP mode and generate report sorted on ibs-op event
$ ./AMDuProfCLI profile --config ibs -a -s event=ibs-op -o /tmp/cpuprof-ibs-swp AMDTClassicMatMul-bin
Collect the TBP samples in SWP mode for 10 seconds and generate report
$ ./AMDuProfCLI profile -a -o /tmp/cpuprof-tbp-swp -d 10
Launch AMDTClassicMatMul-bin and collect TBP with callstack sampling and generate report
$ ./AMDuProfCLI profile --config tbp -g -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect TBP with callstack sampling (unwind FPO optimized stack) and generate report
$ ./AMDuProfCLI profile --config tbp --call-graph-mode fpo --call-graph-size 512 -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin. Collect the samples for PMCx076 and PMCx0C0 and generate report
$ ./AMDuProfCLI profile -e event=pmcx76,interval=250000 -e event=pmcxc0,user=1,os=0,interval=250000 -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect the samples for IBS OP with interval 50000 and generate report sorted on ibs-op event
$ ./AMDuProfCLI profile -e event=ibs-op,interval=50000 -s event=ibs-op -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
Attach to a thread, collect TBP samples for 10 seconds, and generate report
$ AMDuProfCLI profile --config tbp -o /tmp/cpuprof-tbp-attach -d 10 --tid <TID>
Collect basic OpenMP trace info of an OpenMP application compiled with GCC OpenMP library and generate the report
$ AMDuProfCLI profile --trace openmp --openmp-impl omplib -o /tmp/cpuprof-omp <path-to-openmp-exe>
Collect the samples for PMCx076 and PMCx0C0, but collect the call graph info only for PMCx0C0 and generate report
$ AMDuProfCLI profile -e event=pmcx76,interval=250000 -e event=pmcxc0,interval=250000,callgraph -o /tmp/cpuprof-pmc AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect pthread runtime trace with default threshold
$ AMDuProfCLI collect --trace osrt --osrt-event pthread -o /tmp/cpuprof-os AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin, collect the syscalls that take 1 ms or longer, and generate report
$ AMDuProfCLI profile --trace osrt --osrt-event syscall --osrt-threshold syscall:1000000 -o /tmp/cpuprof-os AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin, collect the GPU traces, and generate a GPU trace report
$ AMDuProfCLI profile --trace gpu -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin --category gputrace
Launch AMDTClassicMatMul-bin, collect the TBP samples and GPU traces, and generate a report
$ AMDuProfCLI profile --config tbp --trace gpu -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect ‘GPU’ samples and generate report
$ AMDuProfCLI profile --config gpu -o /tmp/gpuprof-gpu AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect ‘GPU’ samples for ‘SQ’ Block
$ AMDuProfCLI profile --config gpu --ip-block SQ -o /tmp/gpuprof-gpu AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect trace data for all functions in ‘AMDTClassicMatMul-bin’
$ AMDuProfCLI profile --trace osrt --osrt-event function --osrt-funcs AMDTClassicMatMul-bin:* -o /tmp/cpuprof-os AMDTClassicMatMul-bin
Launch AMDTClassicMatMul-bin and collect trace data for all functions in ‘AMDTClassicMatMul-bin’ that have a size greater than or equal to 64
$ AMDuProfCLI profile --trace osrt --osrt-event function --osrt-func-size 64 --osrt-threshold function:10000 --osrt-funcs AMDTClassicMatMul-bin:* -o /tmp/cpuprof-os AMDTClassicMatMul-bin
The info command fetches generic information about the system, PMC event details, predefined event details, and so on.
AMDuProfCLI info [--help] [<options>]
$ AMDuProfCLI info --system
The following table lists the options of the info command:
| Option | Description |
|---|---|
| | OS Support: Linux. Displays the details of the BPF support and BCC installation. |
| --collect-config | Displays the details of the given profile configuration. |
| | OS Support: Linux. Lists the supported items for the given type. |
| --list | Lists the supported items for the given type, for example collect-configs, pmu-events, view-configs, or trace-events. |
| --pmu-event | Displays the details of the given PMU event. |
| --system | Displays the processor information of this system. |
| --view-config | Displays the details of the given view configuration used in report generation. |
| --help | Displays the help information. |
Use the following commands to:
Print the system details
C:\> AMDuProfCLI.exe info --system
Print the list of predefined profiles
C:\> AMDuProfCLI.exe info --list collect-configs
Print the list of PMU events
C:\> AMDuProfCLI.exe info --list pmu-events
Print the list of predefined report views
C:\> AMDuProfCLI.exe info --list view-configs
Print details of predefined profile such as “assess_ext”
C:\> AMDuProfCLI.exe info --collect-config assess_ext
Print the details of the pmu-event such as PMCx076
C:\> AMDuProfCLI.exe info --pmu-event pmcx76
Print details of view configuration such as ibs_op_overall
C:\> AMDuProfCLI.exe info --view-config ibs_op_overall
Print the list of trace events
C:\> AMDuProfCLI.exe info --list trace-events
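The --list queries above differ only in the type name, so they can be generated in a loop. The following is a minimal sketch for a POSIX shell on Linux, assuming AMDuProfCLI is on the PATH. The function name list_all and the DRY_RUN variable are hypothetical and not part of AMD uProf. By default the script only prints the commands.

```shell
#!/bin/sh
# List types taken from the examples above.
LIST_TYPES="collect-configs pmu-events view-configs trace-events"

# Hypothetical wrapper: print each "info --list" command, or run it when
# DRY_RUN=0 and AMDuProfCLI is available on the PATH.
list_all() {
    for list_type in $LIST_TYPES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "AMDuProfCLI info --list $list_type"
        else
            AMDuProfCLI info --list "$list_type"
        fi
    done
}

list_all
```

Setting DRY_RUN=0 before calling list_all runs the queries instead of printing them.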