9. AMD uProf CLI Options

9.1. Overview

AMD uProf’s command line interface, AMDuProfCLI, provides options to collect profile data and to generate reports for analyzing it.

AMDuProfCLI [--version] [--help] COMMAND [<options>] [<PROGRAM>] [<ARGS>]

The following commands are supported:

Table 9.1 Supported COMMANDS

Command

Description

collect

Runs the given program and collects the profile samples.

translate

Processes the raw profile datafile and generates the profile DB.

report

Processes the raw profile datafile and generates the profile report.

compare, diff

Processes multiple profile-data and generates a comparison report.

timechart

Power Profiling — collects and reports system characteristics, such as power, thermal, and frequency metrics.

info

Displays the generic information about system and topology.

profile

Collects the performance profile data, analyzes it and generates the profile report.

For more information on the workflow, refer to the section “Workflow and Key Concepts”. To run the command line interface AMDuProfCLI, run the following binaries as per the OS:

Table 9.2 Supported COMMANDS

OS

Description

Windows

C:\Program Files\AMD\AMDuProf\bin\AMDuProfCLI.exe

Linux

/opt/AMDuProf_X.Y-ZZZ/bin/AMDuProfCLI

Linux: If installed using the .tar file

./AMDuProf_Linux_x64_X.Y.ZZZ/bin/AMDuProfCLI

FreeBSD

sh ./AMDuProf_FreeBSD_x64_X.Y.ZZZ/bin/AMDuProfCLI

9.2. Starting a CPU Profile

To profile and analyze the performance of a native (C, C++, and Fortran) application, you must complete the following steps:

  1. Prepare the application. See Preparing an Application for Profiling.

  2. Use the AMDuProfCLI collect command to collect the samples for the application.

    Note

    Run AMD uProf on FreeBSD with sudo command or root privilege.

  3. Use the AMDuProfCLI report command to generate a report in a readable format for analysis.

Preparing the application means building the launch application with debug information, which is needed to correlate the samples to functions and source lines.

The collect command launches the application (if given) and collects the profile data for the given profile type and sampling configuration. It generates the raw data file (.prd on Windows, .pdata on FreeBSD, and .caperf on Linux) and other miscellaneous files.

The report command translates the collected raw profile data, then aggregates and attributes the samples to the respective processes, threads, load modules, functions, and instructions. It writes the results into a database and then generates a report in the .CSV file format.

The following figure shows how to run a time-based profile and generate a report for the application AMDTClassicMatMul.exe.

Figure 9.1 Collect and Report Commands
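The collect-then-report flow described above can also be scripted. The following Python sketch only assembles the two argument lists, using the --config, -o, and -i options documented in this chapter. The helper names, the tbp configuration, and the paths are illustrative assumptions; the session directory passed to report must be the subdirectory that collect actually creates.

```python
from typing import List, Optional


def build_collect_cmd(program: str, output_dir: str,
                      config: Optional[str] = None,
                      cli: str = "AMDuProfCLI") -> List[str]:
    """Assemble 'AMDuProfCLI collect [--config <config>] -o <dir> <PROGRAM>'."""
    cmd = [cli, "collect"]
    if config is not None:
        cmd += ["--config", config]
    cmd += ["-o", output_dir, program]
    return cmd


def build_report_cmd(session_dir: str, cli: str = "AMDuProfCLI") -> List[str]:
    """Assemble 'AMDuProfCLI report -i <session-dir path>'."""
    return [cli, "report", "-i", session_dir]


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the shape of the commands.
    print(build_collect_cmd("AMDTClassicMatMul-bin", "/tmp/cpuprof-tbp", "tbp"))
    print(build_report_cmd("/tmp/cpuprof-tbp/<session-dir>"))
```

Each list can be passed to subprocess.run(cmd, check=True) once the tool is on the PATH.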

9.2.1. List of Predefined Sample Configurations

To get the list of supported predefined sampling configurations that can be used with collect command’s --config option, run the command: AMDuProfCLI info --list collect-configs.

A sample output is as follows:

Figure 9.2 Supported Predefined Configurations on Linux

Figure 9.3 Supported Predefined Configurations on Windows

Table 9.3 Profile Report

Section

Description

EXECUTION

Information about the target launch application

PROFILE DETAILS

Details about the current session, such as profile type, scope, and sampling events

MONITORED EVENTS

List of the profiled events and the corresponding sampling intervals

10 HOTTEST FUNCTIONS

List of the top 10 hot functions and the metrics attributed to them

TAKEN BRANCH ANALYSIS SUMMARY

List of the top 10 hot branches

10 HOTTEST PROCESSES

List of the top 10 hot processes and the metrics attributed to them

10 HOTTEST MODULES

List of the top 10 hot modules and the metrics attributed to them

10 HOTTEST THREADS

List of the top 10 hot threads and the metrics attributed to them

PROFILE REPORT FOR PROCESS

The metrics attributed to the profiled process. This section is shown when the --detail option is used for report generation. It contains other subsections, such as:

  • THREAD SUMMARY: List of threads with metrics attributed to them.

  • MODULE SUMMARY: List of load modules which belong to the process with metrics attributed to them.

  • FUNCTION SUMMARY: List of functions that belong to this process for which samples are collected, with metrics attributed to them.

  • LAST BRANCH RECORD FOR PROCESS: List of collected branches for the process.

  • Function Detail Data: Source level attribution for the top functions for which samples are collected.

  • CALLGRAPH: Call graph, if callstack samples are collected.

9.3. Starting a Power Profile

9.3.1. System-wide Power Profiling (Live)

To collect power profile counter values:

  1. Run the AMDuProfCLI timechart command with the --list option to get the list of supported counter categories.

  2. Run the AMDuProfCLI timechart command with the --event option, specifying the required counters, to collect and report them.

The following timechart run lists the supported counter categories:

Figure 9.4 Output of timechart --list Command

The following timechart run collects the profile samples and writes them into a file:

Figure 9.5 Timechart Execution

The above run collects the power and frequency counters on all the devices on which these counters are supported and writes them in the output file specified with -o option. Before the profiling begins, the given application is launched and the data is collected till the application terminates.

9.4. Collect Command

The collect command collects the performance profile data and writes it to raw data files in the specified output directory. These files can then be analyzed using the AMDuProfCLI report command or the AMDuProf GUI.

9.4.1. Synopsis

AMDuProfCLI collect [--help] [<options>] [<PROGRAM>] [<ARGS>]

where,

<PROGRAM>: Denotes the launch application to be profiled.

<ARGS>: Denotes the list of arguments for the launch application.

9.4.2. Common Usages

$ AMDuProfCLI collect --config <config> <PROGRAM> [<ARGS>]
$ AMDuProfCLI collect [--config <config> | -e <event>] [-a] [-d <duration>] [<PROGRAM>]

9.4.3. Options

The following table lists the collect command options.

Table 9.4 AMDuProfCLI Collect Command Options

Option

Description

-h| --help

Displays the help information on the console/terminal.

-o| --output-dir <directory-path>

Base directory path in which collected data files will be saved. A new subdirectory will be created in this directory.

--config <config>

Predefined sampling configuration to be used to collect samples.

Use the command info --list collect-configs to get the list of supported configs. Multiple occurrences of --config are allowed.

--ip-block <block>

Specify this option to collect GPU profile data for a specific IP block. Supported IP blocks:

  • SQ

  • SQC

  • TA

  • TD

  • TCP

  • TCC

  • SPI

  • CPC

  • CPF

-e | --event or <predefined-event>

A predefined event, which has predefined arguments, can be used directly with -e or --event.

Alternatively, for providing more granular parameters, specify Timer, PMU, IBS event, or a predefined event with arguments in the form of comma separated key=value pairs. The supported keys are:

  • event=<timer | ibs-fetch | ibs-op> or <PMU-event> or <predefined-event>

  • umask=<unit-mask>

  • user=<0 | 1>

  • os=<0 | 1>

  • cmask=<count-mask> (Value should be in the range 0x0 to 0x7f)

  • inv=<0 | 1>

  • interval=<sampling-interval>

  • frequency=<frequency (n)> (Supported only for Core PMC events, the frequency should be provided in Hz)

  • ibsop-count-control=<0 | 1> (For ibs-op event. Choose IBS OP sampling by cycle(0) count or dispatch(1) count.)

  • loadstore (for ibs-op event, only on Windows platform)

  • ibsop-l3miss=<0 | 1> (Do not filter out any IBS OP samples (0), or filter out all IBS OP samples except loads/stores that miss in the L3 cache (1); supported only on AMD Zen4 and later processors.)

  • ibsfetch-l3miss=<0 | 1> (Do not filter out any IBS FETCH samples (0), or filter out all IBS FETCH samples except those that miss in the L3 cache (1); supported only on AMD Zen4 and later processors.)

  • ibsop-ldlat=<LATENCY> (Filter IBS OP samples by data cache miss latency threshold in CPU cycles. LATENCY must be an integer that is a multiple of 128 and between 128 and 2048. Supported on AMD Zen5 and later processors.)

  • call-graph

Note

  1. It is not required to provide umask with predefined event.

  2. Use the dedicated option --call-graph to specify the arguments related to the call stack sample collection.

Argument details

  • user – Enable(1) or disable(0) user space samples collection

  • os - Enable(1) or disable(0) kernel space samples collection

  • interval – Sample collection interval. For timer, it is the time interval in milliseconds. For PMU and predefined events, it is the count of the event occurrences. For IBS FETCH, it is the fetch count. For IBS OP, it is the cycle count or the dispatch count.

  • ibsop-count-control – Choose IBS OP sampling by cycle(0) count or dispatch(1) count.

  • loadstore – Enable only the IBS OP load/store samples collection, other IBS OP samples are not collected.

  • ibsop-l3miss – Do not filter out any IBS OP samples (0), or filter out all IBS OP samples except loads/stores that miss in the L3 cache (1). For example, -e event=ibs-op,interval=100000,ibsop-l3miss=1

  • ibsfetch-l3miss – Enable IBS FETCH sample collection only when an L3 miss occurs. For example, -e event=ibs-fetch,interval=100000,ibsfetch-l3miss=1

  • ibsop-ldlat – Filter IBS OP samples by data cache miss latency threshold in CPU cycles. LATENCY must be an integer that is a multiple of 128 and between 128 and 2048. For example, -e event=ibs-op,interval=100000,ibsop-ldlat=256.

When these arguments are not passed, then the default values are:

  • umask = 0

  • cmask = 0x0

  • user = 1

  • os = 1

  • inv = 0

  • ibsop-count-control = 1 (for ibs-op event)

  • ibsop-l3miss = 0

  • ibsfetch-l3miss = 0

  • interval = 1.0 ms for timer event

  • interval = 250000 (for ibs-fetch, ibs-op, PMU-event)
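The key=value form of the -e option lends itself to a small helper that checks the documented constraints before the command is run. The following Python sketch is a hypothetical convenience wrapper, not part of AMD uProf: it validates the cmask range, the ibsop-ldlat rule, and the 0/1 flags listed above, and then joins the pairs with commas. It does not check processor support or event names.

```python
from typing import Dict, Optional, Union


def build_event_spec(event: str,
                     params: Optional[Dict[str, Union[int, str]]] = None) -> str:
    """Return 'event=<name>,key=value,...' for use with -e / --event."""
    params = dict(params or {})

    # cmask must be in the range 0x0 to 0x7f; emit it in hexadecimal.
    if "cmask" in params:
        raw = params["cmask"]
        cmask = int(raw, 16) if isinstance(raw, str) else int(raw)
        if not 0x0 <= cmask <= 0x7F:
            raise ValueError("cmask must be in the range 0x0 to 0x7f")
        params["cmask"] = hex(cmask)

    # ibsop-ldlat must be a multiple of 128 between 128 and 2048.
    if "ibsop-ldlat" in params:
        latency = int(params["ibsop-ldlat"])
        if latency % 128 != 0 or not 128 <= latency <= 2048:
            raise ValueError("ibsop-ldlat must be a multiple of 128 in [128, 2048]")

    # These keys accept only 0 or 1.
    for key in ("user", "os", "inv", "ibsop-count-control",
                "ibsop-l3miss", "ibsfetch-l3miss"):
        if key in params and int(params[key]) not in (0, 1):
            raise ValueError(f"{key} must be 0 or 1")

    pairs = [f"event={event}"] + [f"{k}={v}" for k, v in params.items()]
    return ",".join(pairs)


if __name__ == "__main__":
    print(build_event_spec("ibs-op", {"interval": 100000, "ibsop-ldlat": 256}))
```

Keys that are omitted are simply left out of the string, so the tool's own defaults listed above still apply.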

-p| --pid <PID...>

Profile the existing processes by attaching to a running process. The process IDs are separated by comma.

Note

  1. A maximum of 512 processes can be attached at a time.

  2. On FreeBSD, multiple attach is not supported.

-a| --system-wide

System Wide Profile (SWP)

If this flag is not set, then the command line tool will profile only the launched application or the Process IDs attached with -p option.

-c| --cpu <core...>

Comma separated list of CPUs to profile. The ranges of CPUs can be specified with ‘-’, for example: 0-3. This option is not supported while profiling MPI applications.

Note

On Windows, the selected cores should belong to only one processor group. For example, 0-63, 64-127, and so on.
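A CPU list in this format can be expanded and checked before it is passed to -c. The Python sketch below is illustrative only: the function names are hypothetical, and the processor group size of 64 is an assumption inferred from the 0-63 and 64-127 examples in the note above.

```python
from typing import Iterable, List


def parse_cpu_list(spec: str) -> List[int]:
    """Expand a list such as '0-3,8' into sorted CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            start, end = int(low), int(high)
            if start > end:
                raise ValueError(f"invalid CPU range: {part}")
            cpus.update(range(start, end + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def in_single_processor_group(cpus: Iterable[int], group_size: int = 64) -> bool:
    """True if all CPUs fall into one group (group size of 64 is assumed)."""
    return len({cpu // group_size for cpu in cpus}) <= 1


if __name__ == "__main__":
    cpus = parse_cpu_list("0-3,8")
    print(cpus, in_single_processor_group(cpus))
```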

-d| --duration <n>

Profile only for the specified duration n in seconds.

--interval <num>

Sampling interval for PMC events.

Note

This interval will override the sampling interval specified with individual events.

--affinity <core...>

Set the core affinity of the launched application to be profiled. Specify a comma separated list of core IDs. Ranges of core IDs can be specified with ‘-’, for example, 0-3. The default affinity is all the available cores. This option is not supported while profiling MPI applications.

--no-inherit

Do not profile the children of the launched application (processes launched by the profiled application).

-b| --terminate

Terminate the launched application after the profile data collection ends. Only the launched application process will be killed. Its children (if any) may continue to execute.

--start-delay <n>

Start delay n in seconds. Start profiling after the specified duration. When n is 0, there is no impact.

--start-paused

Profiling is paused indefinitely. The target application resumes the profiling using the profile control APIs. This option must be used only when the launched application is instrumented to control the profile data collection using the resume and pause APIs (defined in the AMDProfileControl APIs).

-w| --working-dir <path>

Specify the working directory. The default is the current working directory.

--log-path <path-to-log-dir>

Specify the path where the log file should be created. If this option is not provided, the log file will be created either in path set by AMDUPROF_LOGDIR environment variable or $TEMP path (Linux, FreeBSD) or %TEMP% path (on Windows) by default.

The log file name will be of the format $USER-AMDuProfCLI.log (on Linux, FreeBSD) or %USERNAME%-AMDuProfCLI.log (on Windows).
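The lookup order described above can be mirrored in a script that needs to locate the log file after a run. This Python sketch is an approximation under stated assumptions: the source does not say whether AMDUPROF_LOGDIR takes precedence over the TEMP path, so the order used here (--log-path, then AMDUPROF_LOGDIR, then TEMP) is a guess, and the function name is hypothetical.

```python
import ntpath
import posixpath
from typing import Mapping, Optional


def expected_log_file(env: Mapping[str, str], windows: bool,
                      log_path: Optional[str] = None) -> str:
    """Guess where the AMDuProfCLI log file is written.

    Assumed precedence: --log-path, then AMDUPROF_LOGDIR, then TEMP.
    """
    path_mod = ntpath if windows else posixpath
    user = env["USERNAME"] if windows else env["USER"]
    directory = log_path or env.get("AMDUPROF_LOGDIR") or env["TEMP"]
    return path_mod.join(directory, f"{user}-AMDuProfCLI.log")


if __name__ == "__main__":
    # Hypothetical environment, used only for illustration.
    print(expected_log_file({"USER": "alice", "TEMP": "/tmp"}, windows=False))
```

In practice, os.environ can be passed as the env argument.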

--enable-logts

Capture the timestamp of the log records.

--limit-size <n>

Stop the profiling when the collected data file size (in MB) crosses the specified limit.

Note

This option may be deprecated in future releases.

--frequency <n> | --freq <n>| -F <n>

Enable data collection at the specified frequency n (in Hz) for Core PMC events.

Note

This frequency will override the sampling frequency specified with the individual events.

--env-var <key1=value1:key2=value2:...>

Use this option to set the environment variables.
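The colon-separated key=value format can be generated from a dictionary. The Python sketch below is a hypothetical helper; because the source does not describe any escaping, it assumes that keys and values containing ':' (and keys containing '=') cannot be expressed and rejects them.

```python
from typing import Dict


def build_env_var_arg(variables: Dict[str, str]) -> str:
    """Return 'key1=value1:key2=value2:...' for the --env-var option."""
    for key, value in variables.items():
        # No escaping rule is documented, so reject ambiguous input (assumption).
        if not key or "=" in key or ":" in key or ":" in value:
            raise ValueError(f"cannot encode environment variable: {key!r}")
    return ":".join(f"{key}={value}" for key, value in variables.items())


if __name__ == "__main__":
    # Hypothetical variables, shown only to illustrate the format.
    print(build_env_var_arg({"OMP_NUM_THREADS": "8", "MY_FLAG": "1"}))
```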

--call-graph <I:D:S:F>

OS Support: Windows

Enables callstack Sampling. Specify the Unwind Interval (I) in milliseconds and Unwind Depth (D) value. Specify the Scope (S) by choosing one of the following:

  • user: Collect only for the user space code.

  • kernel: Collect only for the kernel space code.

  • all: Collect for the code executed in both the user and kernel space.

Specify how to handle frames that are missing due to Frame Pointer Omission (F) by the compiler:

  • fpo: If the frame pointers are not available, collect callstack information using unwind information.

  • fp: Use the frame pointers to collect callstack information.

--call-graph <F:N>

OS Support: Linux

Enables callstack sampling. Specify (F) to collect/ignore missing frames due to omission of frame pointers by compiler:

  • fpo | dwarf: Collect the process callstack during sample collection and use the DWARF information to reconstruct callstack.

  • fp: Use the frame pointers to collect callstack information.

When F = fpo, (N) specifies the maximum stack size in bytes to collect per sample. Valid range of the stack size: 16 - 32768. If N is not a multiple of 8, it is aligned down to the nearest multiple of 8. The default value is 1024 bytes.

Note

Passing a large N value will generate a very large raw data file.

When F = fp, the value for N is not applicable and is ignored if passed.
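The alignment rule for N can be reproduced in a few lines, which is useful for predicting the stack size that will actually be used. This Python sketch is illustrative: the function names are hypothetical, and the F:N string follows the synopsis above rather than any verified tool behavior.

```python
def normalize_stack_size(n: int) -> int:
    """Validate N (16 - 32768) and align it down to a multiple of 8."""
    if not 16 <= n <= 32768:
        raise ValueError("stack size must be within 16 - 32768 bytes")
    return n - (n % 8)


def linux_call_graph_arg(mode: str, size: int = 1024) -> str:
    """Build the <F:N> value for --call-graph on Linux."""
    if mode == "fp":
        # N is not applicable with frame pointers, so it is omitted here.
        return "fp"
    if mode in ("fpo", "dwarf"):
        return f"{mode}:{normalize_stack_size(size)}"
    raise ValueError("mode must be fp, fpo, or dwarf")


if __name__ == "__main__":
    print(linux_call_graph_arg("fpo", 1023))
```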

-g

Same as passing --call-graph fp (Linux, FreeBSD).

Same as passing --call-graph 1:128:user:fp (Windows).

--call-graph-mode <fp|fpo>

OS Support: Windows

Set callstack collection mode.

  • fpo - If frame pointers are not available, collect call-stack information using unwind information.

  • fp - Use Frame pointers to collect callstack information.

Default mode is fp.

--call-graph-mode <fp|fpo|dwarf>

OS Support: Linux

Callstack collection mode. Default mode is fp.

  • fp: Use Frame pointers to collect call stack information.

  • fpo | dwarf: Collect process call stack during sample collection and use DWARF information to reconstruct the call stack.

--call-graph-type <scope type>

OS Support: Windows

Set callstack scope type. Scope type should contain one of these options:

  • user - Collect only for user space code.

  • kernel - Collect only for kernel space code.

  • all - Collect for code executed in user and kernel space.

Default scope type is user.

--call-graph-interval <num>

OS Support: Windows

Set callstack unwind interval. Interval must be within the range [1 - 100]. Default interval is 1 ms.

--call-graph-depth <num>

OS Support: Windows

Set callstack unwind depth. Depth must be within the range [2 - 392]. Default depth is 128.
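On Windows, the same settings can be combined into the I:D:S:F value accepted by --call-graph. The Python sketch below assumes that the interval and depth ranges documented for --call-graph-interval and --call-graph-depth also apply to the I and D fields; the function name is hypothetical.

```python
def windows_call_graph_arg(interval_ms: int = 1, depth: int = 128,
                           scope: str = "user", mode: str = "fp") -> str:
    """Build the <I:D:S:F> value for --call-graph on Windows."""
    if not 1 <= interval_ms <= 100:
        raise ValueError("interval must be within [1 - 100] ms")
    if not 2 <= depth <= 392:
        raise ValueError("depth must be within [2 - 392]")
    if scope not in ("user", "kernel", "all"):
        raise ValueError("scope must be user, kernel, or all")
    if mode not in ("fp", "fpo"):
        raise ValueError("mode must be fp or fpo")
    return f"{interval_ms}:{depth}:{scope}:{mode}"


if __name__ == "__main__":
    # The defaults reproduce the documented -g equivalent on Windows.
    print(windows_call_graph_arg())
```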

--call-graph-depth <num>

OS Support: Linux

Set callstack unwind depth. Depth must be within the range [2 - 1024]. Default depth is 32.

--call-graph-size <size>

OS Support: Linux

Callstack Size. Default size is 1024 bytes.

When mode = fpo | dwarf; size must be within [16 - 32768] and specifies the max stack- size (in bytes) to collect per call stack sample.

When mode = fp; the size is not applicable and ignored if passed.

--thread <thread=concurrency>

OS Support: Windows

Collect the thread run time information to report thread concurrency. Thread concurrency shows how much time a specific number of threads run simultaneously.

-m| --data-buffer-count <size>

OS Support: Windows

Size (number of pages per core) of the buffer used for data collection by the driver. The default size is 512 pages per core.

--limit-data <n>

OS Support: Windows

Stop the profiling when the collected data file size (in MB) crosses the specified limit. When used with the option --overwrite, the limit applies to the data retained before the collection is terminated. Size can be specified with the suffix Mega Bytes (M/m), Giga Bytes (G/g), or Seconds (secs).

--overwrite

OS Support: Windows

Specify the profile data collection mode as a ring buffer. The collection limit can be set using the option --limit-data. The default --limit-data is to restrict the raw data file size to 512 pages per core.

--tid <TID,..>

OS Support: Linux

Profile existing threads by attaching to a running thread. The thread IDs are separated by comma.

--trace <TARGET>

OS Support: Linux

To trace a target domain. TARGET can be one or more of the following:

  • osrt - to enable tracing of os runtime. Use command info --list trace-events for the list of trace events.

  • func - to enable tracing of functions. Use --func, --func-size and --func-threshold to configure additional options.

  • memory - to enable tracing of dynamic memory allocations. Use --memory-threshold to configure threshold.

  • mpi - to enable tracing of MPI application. Use --mpi-impl and --mpi-scope to configure additional options.

  • openmp - to enable tracing of OpenMP application. Use --openmp-impl and --openmp-scope to configure additional options.

  • gpu - to trace a target application on GPU. By default, the domain is set to hip and hsa.

Note

Applicable to per process profiling. Not applicable to:

  • System wide profiling

  • Java app profiling

  • Attach process profiling

For OpenMP tracing:

  • For ompt - application should be compiled with LLVM-8 or later, AOCC-2.1 or later, ICC-19.1 or later.

  • For omplib - application should be compiled with GCC-7 or later.

  • Supported base languages are: C, C++, and Fortran.

--mpi-impl <mpich| openmpi>

OS Support: Linux

Provide MPI implementation type: openmpi for tracing OpenMPI library, mpich for tracing MPICH and its derivative libraries. Default selection is mpich.

Note

Use this option with --trace mpi option.

--mpi-scope <lwt| full>

OS Support: Linux

Provide tracing scope: lwt for light-weight tracing, full for complete tracing. Default scope type is full.

Note

Use this option with --trace mpi option.

--openmp-impl <ompt| omplib>

OS Support: Linux

Provide OpenMP implementation type: ompt for tracing of OpenMP libraries supporting OMPT interface (example: LLVM, AOCC), omplib for tracing GCC OpenMP library. Default selection is ompt.

Note

Use this option with --trace openmp option.

--openmp-scope <full|basic>

OS Support: Linux

Provide tracing scope: full for complete tracing, basic for basic tracing, where synchronization related OpenMP events are not traced to reduce the disk space usage. Default selection is basic.

Note

  1. Use this option with --trace openmp option.

  2. This option is only applicable with --openmp-impl ompt.

--osrt-event <event1,event2...>

OS Support: Linux

Provide event names. Use command info --list trace-events for the list of trace events.

Note

Use this option with --trace osrt option.

--osrt-threshold <event:threshold>

OS Support: Linux

Provide event name and threshold value.

Note

Use this option with --trace osrt option.

--osrt-funcs <module:function-pattern>

OS Support: Linux

Specify functions to trace from the library or executable.

  • Function-pattern can be a function name or partial name ending with *. Use only * to trace all the functions of a module.

  • Module can be absolute path to library or executable.

This option will be deprecated in a future release. Recommended to use --func.

--osrt-func-size <size>

OS Support: Linux

Provide minimum function size to trace. Default function size is 128 bytes.

This option will be deprecated in a future release. Recommended to use --func-size.

Note

Use this option with --trace osrt option.

--func <module:function-pattern>

OS Support: Linux

Specify functions to trace from the library or executable:

  • Function-pattern can be a function name or partial name ending with * or only * to trace all the functions of a module.

  • Module can be a library or executable.

Note

It is recommended to provide the absolute/full path of a module.

--exclude-func <module:function-pattern>

OS Support: Linux

Specify functions to exclude from the library or executable:

  • Function-pattern can be a function name or partial name ending with * or only * to exclude all the functions of a module.

  • Module can be a library or executable.

Note

It is recommended to provide the absolute/full path of a module.

-m| --mmap-pages <size>

OS Support: Linux

Set the kernel memory mapped data buffer to size. The size can be specified in pages or with a suffix Bytes (B/b), Kilo bytes (K/k), Megabytes (M/m), and Gigabytes (G/g).

--mpi

OS Support: Linux

Pass this option while collecting CPU Profiling data of an MPI application. For MPI tracing, use the collect command with the --trace option.

--kvm-guest <pid>

OS Support: Linux

Specify the PID of qemu-kvm process to be profiled to collect guest-side performance profile.

--guest-kallsyms <path>

OS Support: Linux

Specify the path of guest /proc/kallsyms copied on the local host. AMD uProf reads it to get the guest kernel symbols.

--guest-modules <path>

OS Support: Linux

Specify the path of guest /proc/modules copied to the local host. AMD uProf reads it to get the guest kernel module information.

--guest-search-path <path>

OS Support: Linux

Specify the path of guest vmlinux and kernel sources copied on the local host. AMD uProf reads it to resolve the guest kernel module information.

--branch-filter

OS Support: Linux

Capture LBR data. You can also specify the branch filter type:

  • u: user branches

  • k: kernel branches

  • any: any branch type

  • any_call: any call branch

  • any_ret: any return branch

  • ind_call: indirect calls

  • ind_jmp: indirect jumps

  • cond: conditional branches

  • call: direct calls

Note

  1. When the above filters are not set, the default filter type will be any.

  2. This option will work only with PMC events.

  3. This is applicable to per process and attach process profiling. However, it is not applicable to Java app profiling.
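A script that assembles the --branch-filter option can check the requested types against the list above. This Python sketch is a hypothetical helper; it passes multiple types as a comma separated value, which follows the u,k,any usage in the examples of this chapter.

```python
from typing import List, Sequence

# Filter types listed in the option description above.
BRANCH_FILTERS = ("u", "k", "any", "any_call", "any_ret",
                  "ind_call", "ind_jmp", "cond", "call")


def branch_filter_args(filters: Sequence[str] = ()) -> List[str]:
    """Return the argument list for --branch-filter.

    With no filters, only the flag is returned and the tool default applies.
    """
    unknown = [name for name in filters if name not in BRANCH_FILTERS]
    if unknown:
        raise ValueError(f"unsupported branch filter type(s): {unknown}")
    args = ["--branch-filter"]
    if filters:
        args.append(",".join(filters))
    return args


if __name__ == "__main__":
    print(branch_filter_args(["u", "k", "any"]))
```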

9.4.4. Examples

9.4.4.1. Windows

Launch AMDTClassicMatMul.exe and collect the samples for CYCLES_NOT_IN_HALT and RETIRED_INST events:

C:\> AMDuProfCLI.exe collect -e cycles-not-in-halt -e retired-inst --interval 1000000 -o c:\Temp\cpuprof-custom AMDTClassicMatMul.exe
C:\> AMDuProfCLI.exe collect -e event=cycles-not-in-halt,interval=250000 -e event=retired-inst,interval=500000 -o c:\Temp\cpuprof-custom AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect the Time-Based Profile (TBP) samples:

C:\> AMDuProfCLI.exe collect -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and do Assess Performance profile for 10 seconds:

C:\> AMDuProfCLI.exe collect --config assess -o c:\Temp\cpuprof-assess -d 10 AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect the IBS samples in the SWP mode:

C:\> AMDuProfCLI.exe collect --config ibs -a -o c:\Temp\cpuprof-ibs-swp AMDTClassicMatMul.exe

Collect the TBP samples in SWP mode for 10 seconds:

C:\> AMDuProfCLI.exe collect -a -o c:\Temp\cpuprof-tbp-swp -d 10

Launch AMDTClassicMatMul.exe and collect TBP with callstack sampling:

C:\> AMDuProfCLI.exe collect --config tbp -g -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect TBP with callstack sampling (unwind FPO optimized stack):

C:\> AMDuProfCLI.exe collect --config tbp --call-graph 1:64:user:fpo -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
C:\> AMDuProfCLI.exe collect --config tbp --call-graph-mode fpo --call-graph-type user -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect TBP with callstack sampling (unwind FPO optimized stack disabled):

C:\> AMDuProfCLI.exe collect --config tbp --call-graph-mode fp -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect the samples for PMCx076 and PMCx0C0:

C:\> AMDuProfCLI.exe collect -e event=pmcx76,interval=250000 -e event=pmcxc0,user=1,os=0,interval=250000 -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect the samples for IBS OP with an interval of 50000:

C:\> AMDuProfCLI.exe collect -e event=ibs-op,interval=50000 -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and do TBP samples profile for thread concurrency, name:

C:\> AMDuProfCLI.exe collect --config tbp --thread thread=concurrency,name -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe

Collect samples for PMCx076 and PMCx0C0, but collect the call graph info only for PMCx0C0:

C:\> AMDuProfCLI.exe collect -e event=pmcx76,interval=250000 -e event=pmcxc0,interval=250000,call-graph -o c:\Temp\cpuprof-pmc AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect the samples for predefined event RETIRED_INST and L1_DC_REFILLS.ALL events:

C:\> AMDuProfCLI.exe collect -e event=RETIRED_INST,interval=250000 -e event=L1_DC_REFILLS.ALL,user=1,os=0,interval=250000 -o C:\Temp\cpuprof-pmc AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe, collect the TBP and Assess Performance samples:

C:\> AMDuProfCLI.exe collect --config tbp --config assess -o c:\Temp\cpuprof-tbp-assess AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect the samples for PMCx076 and PMCx0C0 events with count-mask enabled:

C:\> AMDuProfCLI.exe collect -e event=pmcx076,cmask=0x0 -e event=pmcx0c0,cmask=0x7f,interval=250000 -o c:\Temp\cpuprof-pmc AMDTClassicMatMul.exe

9.4.4.2. Linux

Launch AMDTClassicMatMul-bin and collect the samples for CYCLES_NOT_IN_HALT and RETIRED_INST events:

$ ./AMDuProfCLI collect -e cycles-not-in-halt -e retired-inst --interval 1000000 -o /tmp/cpuprof-custom AMDTClassicMatMul-bin
$ ./AMDuProfCLI collect -e event=cycles-not-in-halt,interval=250000 -e event=retired-inst,interval=500000 -o /tmp/cpuprof-custom AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the TBP samples:

$ ./AMDuProfCLI collect -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and do Assess Performance profile for 10 seconds:

$ ./AMDuProfCLI collect --config assess -o /tmp/cpuprof-assess -d 10 AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the IBS samples in the SWP mode:

$ ./AMDuProfCLI collect --config ibs -a -o /tmp/cpuprof-ibs-swp AMDTClassicMatMul-bin

Collect the TBP samples in SWP mode for 10 seconds:

$ ./AMDuProfCLI collect -a -o /tmp/cpuprof-tbp-swp -d 10

Launch AMDTClassicMatMul-bin and collect TBP with callstack sampling:

$ ./AMDuProfCLI collect --config tbp -g -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect TBP with callstack sampling (unwind FPO optimized stack):

$ ./AMDuProfCLI collect --config tbp --call-graph-mode fpo --call-graph-size 512 -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the samples for PMCx076 and PMCx0C0:

$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -e event=pmcxc0,user=1,os=0,interval=250000 -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the samples for IBS OP with interval 50000:

$ ./AMDuProfCLI collect -e event=ibs-op,interval=50000 -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin

Attach to a thread and collect TBP samples for 10 seconds:

$ AMDuProfCLI collect --config tbp -o /tmp/cpuprof-tbp-attach -d 10 --tid <TID>

Collect basic OpenMP trace info of an OpenMP application compiled with GCC OpenMP library:

$ AMDuProfCLI collect --trace openmp --openmp-impl omplib -o /tmp/cpuprof-omp <path-to-openmp-exe>

Launch AMDTClassicMatMul-bin and collect the memory accesses for false cache sharing:

$ AMDuProfCLI collect --config memory -o /tmp/cpuprof-mem AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the threading configuration to analyze hotspots, thread state, and wait object analysis among threads:

$ AMDuProfCLI collect --config threading -o /tmp/cpuprof-threading AMDTClassicMatMul-bin

Collect MPI profiling information:

$ mpirun -np 4 ./AMDuProfCLI collect --config assess --mpi --output-dir /tmp/cpuprof-mpi /tmp/namd <parameters>

Collect the samples for PMCx076 and PMCx0C0, but collect the call graph info only for PMCx0C0:

$ AMDuProfCLI collect -e event=pmcx76,interval=250000 -e event=pmcxc0,interval=250000,call-graph -o /tmp/cpuprof-pmc AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the samples for the predefined events RETIRED_INST and L1_DC_REFILLS.ALL:

$ AMDuProfCLI collect -e event=RETIRED_INST,interval=250000 -e event=L1_DC_REFILLS.ALL,user=1,os=0,interval=250000 -o /tmp/cpuprof-pmc AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect pthread runtime trace with default threshold:

$ AMDuProfCLI collect --trace osrt --osrt-event pthread -o /tmp/cpuprof-os AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect syscall taking more than or equal to 1µs:

$ AMDuProfCLI collect --trace osrt --osrt-event syscall --osrt-threshold syscall:1000000 -o /tmp/cpuprof-os AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the GPU Traces for hip and hsa domain:

$ AMDuProfCLI collect --trace gpu -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin, collect the TBP samples and GPU Traces for hip and hsa domain:

$ AMDuProfCLI collect --config tbp --trace gpu -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the GPU samples:

$ AMDuProfCLI collect --config gpu -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect GPU samples for the SQ block:

$ AMDuProfCLI collect --config gpu --ip-block SQ -o /tmp/gpuprof-gpu AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect trace data for all functions in AMDTClassicMatMul-bin:

$ AMDuProfCLI collect --trace osrt --osrt-event function --osrt-funcs AMDTClassicMatMul-bin:* -o /tmp/cpuprof-os AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect trace data for all functions in AMDTClassicMatMul-bin that have a size greater than or equal to 64 bytes:

$ AMDuProfCLI collect --trace osrt --osrt-event function --osrt-func-size 64 --osrt-threshold function:10000 --osrt-funcs AMDTClassicMatMul-bin:* -o /tmp/cpuprof-os AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and perform branch analysis with the default filter type:

$ AMDuProfCLI collect --branch-filter -o /tmp/cpuprof-ebp-branch AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect samples for the event PMCxC0 with the u, k, and any branch filters:

$ AMDuProfCLI collect -e event=pmcxc0,interval=250000 --branch-filter u,k,any -o /tmp/cpuprof-ebp-branch AMDTClassicMatMul-bin

9.5. Report Command

The report command generates a report in readable format by processing the raw profile data files or from the (processed) database files available in the specified directory.

9.5.1. Synopsis

AMDuProfCLI report [--help] [<options>]

9.5.2. Common Usages

$ AMDuProfCLI report -i <session-dir path>

Table 9.5 AMDuProfCLI Report Command Options

Option

Description

-h| --help

Displays this help information on the console/terminal.

-i| --input-dir <directory-path>

Path to the directory containing collected data.

--detail

Generate detailed report.

--group-by <section>

Specify the report to be generated. The supported report options are:

  • process: Report process details

  • module: Report module details

  • thread: Report thread details

This option is applicable only with --detail option. The default is group-by process.

-g

Print the callgraph. Use with the --detail or --pid (-p) option. With the --pid option, the callgraph is generated only if callstack samples were collected for the specified PIDs.

--cutoff <n>

Cutoff to limit the number of processes, threads, modules, and functions to be reported. n is the minimum number of entries to be reported in various report sections. The default value is 10.

Note

--cutoff 0 will report all the data.

--view <view-config>

Report only the events present in the given view file. Use the command info --list view-configs to get the list of supported view-configs.

--inline

Show inline functions for C, C++ executables.

Note

Using this option will increase the time taken to generate the report.

--show-sys-src

Generate detailed function report of the system module functions (if debug info is available) with the source statements. This option only works with --detail option.

--src-path <path1;...>

Source file directories (semicolon separated paths). Multiple use of --src-path is allowed.

--disasm

Report only the assembly instructions having samples. This option only works with --detail option.

--disasm-style <att | intel>

Choose the syntax of assembly instructions. The supported options are att and intel. If this option is not used:

  • intel is used by default on Windows.

  • att is used by default on Linux.

--disasm-only

Generate the function report with only assembly instructions. This option works only with the --detail option.

--disasm-full

Report all the assembly instructions of a function with and without samples. This option only works with the --detail option.

-s| --sort-by <EVENT>

Specify the Timer, PMC, or IBS event on which the reported profile data will be sorted with arguments in the form of comma separated key=value pairs.

The supported keys are:

  • event=<timer | ibs-fetch | ibs-op | pmcxNNN>, where NNN is the hexadecimal Core PMC event ID.

  • umask=<unit-mask>

  • cmask=<count-mask>

  • inv=<0 | 1>

  • user=<0 | 1>

  • os=<0 | 1>

  • metric=<cpu_time | total_cpu_time | self_time | total_time>

When both event and metric are enabled, event takes priority over metric.

Use the command info --list pmu-events for the list of supported PMC events.

Details about the arguments:

  • umask: Unit mask in decimal or hexadecimal, applicable only to the PMC events.

  • cmask: Count mask in decimal or hexadecimal, applicable only to the PMC events.

  • user, os: User and OS mode. Applicable only to the PMC events.

  • inv: Invert Count Mask, applicable only to the PMC events.

  • metric:

    • cpu_time is applicable only if the CPU_TIME event is collected.

    • total_cpu_time is applicable only with hotspots or threading analysis, if callstack collection (-g) is enabled for a dynamically linked launch application.

    • self_time and total_time are applicable only if function tracing is collected.

Multiple occurrences of --sort-by (-s) are not allowed.

--agg-interval <low | medium | high | INTERVAL>

Use this option to configure the sample aggregation interval, which is useful when the session is imported into the GUI.

The low aggregation level generates a better timeline view in the GUI but increases the database size.

Aggregation INTERVAL can also be specified as a numeric value in milliseconds.

--time-filter <T1:T2>

Restricts report generation to the time interval between T1 and T2, where T1 and T2 are times in seconds from the profile start time.

--imix

Generate the instruction mix (IMIX) report. It is supported only for IBS config and IBS event profiling, and only for native binaries.

--imix-group-by <module | thread | function>

IMIX report generation. Supported group-by options are:

  • module: Report module-wise IMIX.

  • thread: Report thread-wise IMIX.

  • function: Report function-wise IMIX

--ignore-system-module

Ignore samples from system modules.

--show-percentage

Show percentage of samples instead of actual samples.

--show-sample-count

Show the number of samples. This option is enabled by default.

--show-event-count

Show the number of events that occurred.

--show-all-cachelines

Show all the cachelines in the report sections for cache analysis. By default, only the cachelines accessed by more than one process/thread are listed.

Supported only for memory config report on Windows and Linux platforms.

--limit-cacheinfo <n>

Show the shared cachelines accessed by more than one process/thread for cache analysis. Set n to the number of shared cacheline addresses to be reported. Use this option for false cache sharing analysis.

--bin-path <path>

Binary file path, multiple usage of --bin-path is allowed.

--src-path <path>

Source file path, multiple usage of --src-path is allowed.

--symbol-path <path1;...>

Debug Symbol paths (semicolon separated). Multiple use of --symbol-path is allowed.

--report-output <path>

Write a report to a file. If the path has a .csv extension, it is assumed to be a file path and used as it is. If the .csv extension is not used, then the path is assumed to be a directory and the report file is generated in the directory with the default name.

--stdout

Print the report to a console or terminal.

--retranslate

Perform the re-translation of collected data files with a different set of translation options.

--ascii <event-dump | raw-dump>

Use this option to generate ASCII dump of IBS OP profile samples.

  • event-dump: Generate formatted event dump of IBS profile samples.

  • raw-dump: Generate raw data dump of IBS profile samples.

Note

This option might delay the translation.

--remove-raw-files

Remove the raw data files to recover the disk space.

--python-show-all

Use this option to show Python interpreter functions in the callgraph/flamegraph when translation is performed on Python profiled data (on Linux).

--export-session

Create a compressed archive of the required session files, which can be used on another system for analysis.

--log-path <path-to-log-dir>

Specify the path where the log file should be created. If this option is not provided, the log file will be created either in the path set by AMDUPROF_LOGDIR environment variable or $TEMP path (Linux, FreeBSD) or %TEMP% path (on Windows) by default.

The log file name will be of the format $USER-AMDuProfCLI.log (on Linux, FreeBSD) or %USERNAME%-AMDuProfCLI.log (on Windows).

--enable-logts

Capture the timestamp of the log records.

--symbol-server <path1;...>

OS Support: Windows

Symbol Server directories (semicolon separated paths). For example, Microsoft Symbol Server. Multiple use of --symbol-server is allowed.

--symbol-cache-dir <path>

OS Support: Windows

The path to store the symbol files downloaded from the Symbol Servers.

--legacy-symbol-downloader

OS Support: Windows

Download symbols using the Microsoft Symsrv. By default, AMD symbol downloader will be used.

--host <hostname>

OS Support: Linux

This option is used along with the --input-dir option. Generates the report for a specific host. The supported options are:

  • <hostname>: Report process belonging to a specific host.

  • all: Report all the processes.

Note

If --host is not used, only the processes belonging to the system from which the report is generated are reported. If the system is a master node in a cluster, the report is generated for the lexicographically first host in that cluster.

--category <PROFILE>

OS Support: Linux

Generate the report only for specific profiling categories. Multiple categories can be specified as a comma-separated list. If this option is not used, the report is generated for all categories. Multiple instances of --category are allowed.

Supported categories are:

  • cpu – Generate report specific to CPU Profiling.

  • mpi – Generate report specific to MPI Tracing.

  • openmp – Generate report specific to OpenMP Tracing.

  • trace – Generate report specific to trace events.

  • gputrace – Generate report specific to GPU Tracing.

  • gpuprof – Generate report specific to GPU Profiling.

Example:

--category cpu,mpi,trace,gputrace,gpuprof
--category mpi --category cpu --category trace --category gputrace --category gpuprof

--funccount-interval <funccount-interval>

OS Support: Linux

Specify the time interval in seconds to list the function count detail report. If this option is not specified, the function count will be generated for the entire profile duration.

9.5.3. Examples

9.5.3.1. Windows

Generate report from the raw datafile

C:\> AMDuProfCLI.exe report -i c:\Temp\cpuprof-tbp\<SESSION-DIR>

Generate IMIX report from the raw datafile

C:\> AMDuProfCLI.exe report --imix -i c:\Temp\cpuprof-imix\<SESSION-DIR>

Generate report from the raw datafile sorted on pmc event

C:\> AMDuProfCLI.exe report -s event=pmcxc0,user=1,os=0 -i c:\Temp\cpuprof-ebp\<SESSION-DIR>

Generate report from the raw datafile sorted on ibs-op event

C:\> AMDuProfCLI.exe report -s event=ibs-op -i c:\Temp\cpuprof-ibs\<SESSION-DIR>

Generate report from the raw datafile for power samples

C:\> AMDuProfCLI.exe report -i c:\Temp\pwrprof-swp\<SESSION-DIR>

Generate report with Symbol Server paths

C:\> AMDuProfCLI.exe report --symbol-path C:\AppSymbols;C:\DriverSymbols --symbol-server http://msdl.microsoft.com/download/symbols --symbol-cache-dir C:\symbols -i c:\Temp\cpuprof-tbp\<SESSION-DIR>

Generate report from the raw datafile on one of the predefined views

C:\> AMDuProfCLI.exe report --view ipc_assess -i c:\Temp\pwrprof-swp\<SESSION-DIR>

Generate report from the raw datafile providing the source and binary paths

C:\> AMDuProfCLI.exe report --bin-path Examples\AMDTClassicMatMul\bin\ --src-path Examples\AMDTClassicMatMul\ -i c:\Temp\cpuprof-tbp\<SESSION-DIR>

9.5.3.2. Linux

Generate report from the raw datafile

$ AMDuProfCLI report -i /tmp/cpuprof-tbp/<SESSION-DIR>

Generate IMIX report from the raw datafile

$ AMDuProfCLI report --imix -i /tmp/cpuprof-imix/<SESSION-DIR>

Generate report from the raw datafile sorted on pmc event

$ AMDuProfCLI report -s event=pmcxc0,user=1,os=0 -i /tmp/cpuprof-ebp/<SESSION-DIR>

Generate report from the raw datafile sorted on ibs-op event

$ AMDuProfCLI report -s event=ibs-op -i /tmp/cpuprof-ibs/<SESSION-DIR>

Generate Trace report from the raw datafile

$ AMDuProfCLI report -i /tmp/cpuprof-os/<SESSION-DIR> --category trace

Generate GPU Trace report from the raw datafile

$ AMDuProfCLI report -i /tmp/cpuprof-gpu/<SESSION-DIR> --category gputrace

Generate GPU Profile report from the raw datafile

$ AMDuProfCLI report -i /tmp/cpuprof-gpu/<SESSION-DIR> --category gpuprof
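When report generation is scripted, the key=value syntax accepted by --sort-by can be assembled programmatically. The following is a minimal Python sketch, not part of AMD uProf: the helper name build_sort_by and its rejection of unknown keys are assumptions made for illustration; only the key names are taken from the --sort-by description in Table 9.5.

```python
# Hypothetical helper (not part of AMD uProf): builds the value passed to
# `AMDuProfCLI report --sort-by` from the keys listed in Table 9.5.
ALLOWED_KEYS = ("event", "umask", "cmask", "inv", "user", "os", "metric")


def build_sort_by(**pairs):
    """Return a comma-separated key=value string, keeping argument order."""
    unknown = [key for key in pairs if key not in ALLOWED_KEYS]
    if unknown:
        raise ValueError(f"unsupported --sort-by key(s): {unknown}")
    return ",".join(f"{key}={value}" for key, value in pairs.items())


# Same sort specification as the "sorted on pmc event" example above.
sort_value = build_sort_by(event="pmcxc0", user=1, os=0)

# Argument list in the form expected by subprocess.run(); the session
# directory is left as the placeholder used throughout this section.
argv = ["AMDuProfCLI", "report", "-s", sort_value,
        "-i", "/tmp/cpuprof-ebp/<SESSION-DIR>"]
print(" ".join(argv))
```

Because multiple occurrences of --sort-by are not allowed, the sketch produces a single value and passes it once.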

9.6. Translate Command

The translate command processes the raw profile data and generates the samples info database files. These databases can be imported to GUI or CLI and used for generating the report.

9.6.1. Synopsis

AMDuProfCLI translate [<options>]

9.6.2. Common Usages

$ AMDuProfCLI translate -i <session-dir path>

9.6.3. Options

The following table lists the AMDuProfCLI translate command options:

Table 9.6 AMDuProfCLI Translate Command Options

Option

Description

--bin-path <path>

Binary file path. Multiple use of --bin-path is allowed.

--enable-logts

Capture the timestamp of the log records.

--export-session

Create a compressed archive of the required session files, which can be used on another system for analysis.

--symbol-path <path>

Debug symbol path. Multiple instances of --symbol-path are allowed.

-h| --help

Displays the help information.

--inline

Inline function extraction for C and C++ executables.

Note

Using this option will increase the time taken to generate the report.

--kallsyms-path <path>

OS Support: Linux

Path to the file containing kallsyms info. If no path is provided, it defaults to /proc/kallsyms.

--vmlinux-path <path>

OS Support: Linux

Path to the Linux kernel debug info file. If no path is provided, it searches for the debug info file in the default download path.

--category <PROFILE>

OS Support: Linux

Process only specific profiling categories. Multiple categories can be specified as a comma-separated list. If this option is not used, the raw data files of all categories are processed. Multiple instances of --category are allowed.

The supported categories are:

  • cpu - CPU Profiling

  • mpi - MPI Tracing

  • openmp - OpenMP Tracing

  • trace - User mode tracing

  • gputrace - GPU Tracing

  • gpuprof - GPU Profiling

Example:

--category cpu,mpi,trace,gputrace,gpuprof

--category mpi --category cpu --category trace --category gputrace --category gpuprof

--host <hostname>

OS Support: Linux

This option is used with the --input-dir option. It processes samples belonging to a specific host. The supported options are:

  • <hostname>: Translate only the processes belonging to a specific host.

  • all: Translate all processes.

Note

If --host is not used, only the processes belonging to the current system are translated. If the system is a master node in a cluster, processing is done for the lexicographically first host in that cluster.

--legacy-symbol-downloader

OS Support: Windows

Download symbols using the Microsoft Symsrv. By default, AMD symbol downloader will be used.

--symbol-server <path1;…>

OS Support: Windows

Links to Symbol Server. For example: Microsoft Symbol Server. Multiple instances of --symbol-server are allowed.

--symbol-cache-dir <path>

OS Support: Windows

Path to save the symbols downloaded from the Symbol Servers.

-i| --input-dir <directory-path>

Path to the directory containing collected data.

--retranslate

Re-translate the collected data files with a different set of translation options.

--ascii <event-dump | raw-dump>

Use this option to generate ASCII dump of IBS OP profile samples.

  • event-dump: Generate formatted event dump of IBS profile samples.

  • raw-dump: Generate raw data dump of IBS profile samples.

Note

This option might delay the translation.

--remove-raw-files

Remove the raw data files to recover the disk space.

--time-filter <T1:T2>

Restricts the processing to the time interval between T1 and T2, where T1 and T2 are times in seconds from the profile start time.

--log-path <path-to-log-dir>

Specify the path where the log file should be created. If this option is not provided, the log file will be created either in the path set by the AMDUPROF_LOGDIR environment variable or, by default, in the $TEMP path (on Linux, FreeBSD) or the %TEMP% path (on Windows).

The log file name will be of the format $USER-AMDuProfCLI.log (on Linux, FreeBSD) or %USERNAME%-AMDuProfCLI.log (on Windows).

--agg-interval <low | medium | high | INTERVAL>

Use this option to configure the sample aggregation interval, which is useful when the session is imported into the GUI.

The low aggregation level generates a better timeline view in the GUI but increases the database size.

Aggregation INTERVAL can also be specified as a numeric value in milliseconds.

--python-show-all

Use this option to show Python interpreter functions in the callgraph/flamegraph when translation is performed on Python profiled data (on Linux).

9.6.4. Examples

9.6.4.1. Windows

Process all the raw data files

> AMDuProfCLI.exe translate -i c:\Temp\cpuprof-tbp\<SESSION-DIR>

Process the raw data files with Symbol Server paths

> AMDuProfCLI.exe translate --symbol-path C:\AppSymbols;C:\DriverSymbols --symbol-server http://msdl.microsoft.com/download/symbols --symbol-cache-dir C:\symbols -i c:\Temp\cpuprof-tbp\<SESSION-DIR>

Process the raw data files with the source and binary path

> AMDuProfCLI.exe translate --bin-path Examples\AMDTClassicMatMul\bin\ --src-path Examples\AMDTClassicMatMul\ -i c:\Temp\cpuprof-tbp\<SESSION-DIR>

9.6.4.2. Linux

Process all the raw data files

$ AMDuProfCLI translate -i /tmp/cpuprof-tbp/<SESSION-DIR>

Process the trace raw data file

$ AMDuProfCLI translate -i /tmp/cpuprof-os/<SESSION-DIR> --category trace

Process the GPU Trace raw data file

$ AMDuProfCLI translate -i /tmp/cpuprof-gpu/<SESSION-DIR> --category gputrace
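The --category option accepts either one comma-separated list or repeated instances of the option. The following minimal Python sketch shows both forms side by side. It is not part of AMD uProf: the helper name category_args and its validation are assumptions for illustration; the category names come from Table 9.6.

```python
# Hypothetical helper (not part of AMD uProf): produces the --category
# arguments in either of the two forms described in Table 9.6.
SUPPORTED_CATEGORIES = ("cpu", "mpi", "openmp", "trace", "gputrace", "gpuprof")


def category_args(categories, repeated=False):
    """Return --category arguments as one comma-separated list or repeated options."""
    for name in categories:
        if name not in SUPPORTED_CATEGORIES:
            raise ValueError(f"unsupported category: {name}")
    if repeated:
        args = []
        for name in categories:
            args += ["--category", name]
        return args
    return ["--category", ",".join(categories)]


base = ["AMDuProfCLI", "translate", "-i", "/tmp/cpuprof-gpu/<SESSION-DIR>"]
print(" ".join(base + category_args(["trace", "gputrace"])))
print(" ".join(base + category_args(["trace", "gputrace"], repeated=True)))
```

The comma-separated form is joined without spaces so that the list stays a single command-line argument.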

9.7. Timechart Command

The timechart command collects the system characteristics, such as power, thermal, and frequency metrics, and generates a text or CSV report.

Note

The timechart command is supported only on Windows and Linux.

9.7.1. Synopsis

AMDuProfCLI timechart [--help] [--list] [<options>] [<PROGRAM>] [<ARGS>]

where,

<PROGRAM>: Denotes the application to be launched before starting the power metrics collection.

<ARGS>: Denotes the list of arguments for the launch application.

9.7.2. Common Usages

$ AMDuProfCLI timechart --list
$ AMDuProfCLI timechart -e <event> -d <duration> [<PROGRAM>] [<ARGS>]

Table 9.7 AMDuProfCLI Timechart Command Options

Options

Description

-e | --event <type...>

Collect counters for specified combination of device type and/or category type.

Use the command timechart --list for the list of supported devices and categories.

Note

Multiple occurrences of -e are allowed.

--list

Displays all the supported devices and categories.

-h | --help

Displays this help information.

-o | --output-dir <dir>

Output directory path.

-f | --format <fmt>

Output file format. Supported formats are:

  • txt: Text (.txt) format.

  • csv: (Default format). Comma Separated Value (.csv) format.

-d | --duration <n>

Profile duration n in seconds.

-t | --interval <n>

Sampling interval n in milliseconds. The minimum value is 10 ms.

Note

If not specified, the default interval is 1000 ms.

-w | --working-dir <dir>

Set the working directory for the launched target application.

--affinity <core...>

The core affinity, given as a comma separated list of core-ids. Ranges of core-ids can also be specified, for example, 0-3. The default affinity is all the available cores. The affinity is set for the launched application.

9.7.3. Examples

9.7.3.1. Windows

Collect all the power counter values for a duration of 10 seconds with sampling interval of 100 milliseconds.

C:\> AMDuProfCLI.exe timechart --event power --interval 100 --duration 10

Collect all the frequency counter values for 10 seconds, sampling them every 500 milliseconds and dumping the results into a .csv file.

C:\> AMDuProfCLI.exe timechart --event frequency -o C:\Temp\output --interval 500 --duration 10

Collect all the frequency counter values at core 0 to 3 for 10 seconds, sampling them every 500 milliseconds and dumping the results into a text file.

C:\> AMDuProfCLI.exe timechart --event core=0-3,frequency -o C:\Temp\PowerOutput --interval 500 --duration 10 --format txt

9.7.3.2. Linux

Collect all the power counter values for a duration of 10 seconds with sampling interval of 100 milliseconds.

$ ./AMDuProfCLI timechart --event power --interval 100 --duration 10

Collect all the frequency counter values for 10 seconds, sampling them every 500 milliseconds and dumping the results into a .csv file.

$ ./AMDuProfCLI timechart --event frequency -o /tmp/PowerOutput --interval 500 --duration 10

Collect all the frequency counter values at core 0 to 3 for 10 seconds, sampling them every 500 milliseconds and dumping the results into a text file.

$ ./AMDuProfCLI timechart --event core=0-3,frequency -o /tmp/PowerOutput --interval 500 --duration 10 --format txt
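
The constraints from Table 9.7 (a minimum sampling interval of 10 ms, a default of 1000 ms, and the txt and csv formats with csv as the default) can be checked before a timechart run is launched from a script. The following is a minimal Python sketch under those documented constraints. The function name timechart_args is hypothetical and not part of AMD uProf, and raising an error for invalid values is an assumption about how a wrapper might behave, not a description of the tool.

```python
# Hypothetical wrapper (not part of AMD uProf): validates timechart settings
# against the limits stated in Table 9.7 and returns an argument list.
MIN_INTERVAL_MS = 10
DEFAULT_INTERVAL_MS = 1000
SUPPORTED_FORMATS = ("txt", "csv")


def timechart_args(event, duration_s, interval_ms=DEFAULT_INTERVAL_MS,
                   fmt="csv", output_dir=None):
    """Build the argument list for an `AMDuProfCLI timechart` run."""
    if interval_ms < MIN_INTERVAL_MS:
        raise ValueError("sampling interval must be at least 10 ms")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    args = ["AMDuProfCLI", "timechart",
            "--event", event,
            "--interval", str(interval_ms),
            "--duration", str(duration_s),
            "--format", fmt]
    if output_dir is not None:
        args += ["--output-dir", output_dir]
    return args


# Equivalent to the last Linux example above.
print(" ".join(timechart_args("core=0-3,frequency", 10, interval_ms=500,
                              fmt="txt", output_dir="/tmp/PowerOutput")))
```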

9.8. Diff Command

The diff command streamlines the process of comparing multiple profile reports by automating the manual comparison of events. It processes the raw profile data, processed files, or database files to generate a markdown comparison report for the collected profiles. The generated markdown file includes detailed function data providing comprehensive insights into the compared profiles.

Furthermore, the diff command can also be used to generate a single profile report by specifying only the base profile path. This simplifies the generation of individual reports, making it more convenient and efficient.

During profile comparison, there is always a single base profile and multiple non-base profiles. Valid comparison results are obtained only for the functions that exist in both the base profile and non-base profiles.

By default, the comparison results are displayed in the source view. In the source view table, information, such as File, Line, Source Code, Address, Instruction, Code Byte, and Events are provided for each function. This comprehensive view enables a detailed analysis of the compared profiles.

Note

To obtain meaningful and accurate comparison results, it is important to ensure that the base profile and non-base profiles have matching functions available for comparison.

9.8.1. Synopsis

AMDuProfCLI diff [--help] [<options>]
AMDuProfCLI compare [--help] [<options>]

9.8.2. Common Usages

AMDuProfCLI diff --baseline <base session-dir path> --with <non-base session-dir path> -o <output-dir>

9.8.3. Profile Comparison Eligibility Criteria

To ensure accurate and meaningful profile comparisons, the following conditions must be met:

9.8.4. Options

Table 9.8 AMDuProfCLI Diff Command Options

Option

Description

--alias <base-fun,non-base-fun,…|base-fun-1,non-base-fun-1,…|…>

In the cases where the function names have changed in the non-base profile, specify the function names in the non-base profile that should be compared with the corresponding function names in the base profile.

Specify different functions using the pipe symbol | as a separator. For each set of functions, you can use a comma to separate the function names between the base profile and the non-base profile.

--baseline <directory-path>

Path to the directory containing collected data. The profile data in this directory will be treated as the base profile against which all other profiles will be compared.

--bin-path <path1;...>

Binary file path for the base profile. This will be considered for the non-base profiles if the corresponding bin path is not specified separately.

Multiple usage of --bin-path is allowed.

--bin-path1 <path1;...>

Binary file path for the first non-base profile. Multiple usage of --bin-path1 is allowed.

--bin-path2 <path1;...>

Binary file path for the second non-base profile. Multiple usage of --bin-path2 is allowed.

--bin-path3 <path1;...>

Binary file path for the third non-base profile. Multiple usage of --bin-path3 is allowed.

--cutoff <n>

Cut-off to limit the number of functions to be reported. ‘n’ is the maximum number of entries to be reported in various report sections. The default value is 10.

--html

Use this option to create the comparison report in HTML format. If not specified, the report is generated in the default Markdown format.

--output-dir| -o <directory-path>

Path where the markdown comparison report will be generated.

--show-percentage

Comparison results will be displayed in terms of percentages.

--sort-by| -s <EVENT>

Specify the Timer, PMC, or IBS event on which the reported profile data will be sorted with arguments in the form of comma separated key=value pairs. The supported keys are:

  • event=<timer | ibs-fetch | ibs-op | pmcxNNN>, where NNN is hexadecimal Core PMC event ID.

  • umask=<unit-mask>

  • user=<0 | 1>

  • os=<0 | 1>

Use the command info --list pmu-events for the list of supported PMC events. Details about the arguments:

  • umask — Unit mask in decimal or hexadecimal. Applicable only to the PMC events.

  • user, os — User and OS mode. Applicable only to the PMC events.

Multiple occurrences of --sort-by (-s) are not allowed.

--src-path <path1;...>

Source file directories (semicolon separated paths) for base profile. This will be considered for the non-base profiles if the corresponding file directories are not specified separately.

Multiple use of --src-path is allowed.

--src-path1 <path1;...>

Source file directories (semicolon separated paths) for the first non-base profile. Multiple use of --src-path1 is allowed.

--src-path2 <path1;...>

Source file directories (semicolon separated paths) for the second non-base profile. Multiple use of --src-path2 is allowed.

--src-path3 <path1;...>

Source file directories (semicolon separated paths) for the third non-base profile. Multiple use of --src-path3 is allowed.

--stdout

Comparison report will also be displayed in the terminal or command line interface apart from saving to a file.

--type <comparison-type>

Specify the type of comparison to be performed. The supported comparison types are:

  • name: With this type, only the top n functions from the base profile will be compared with the corresponding functions available in the non-base profiles. The comparison will focus on the similar functions between the profiles.

  • order: With this type, the top n functions from all the profiles will be displayed in the order of profiles. The order will be: base profile first, followed by the first non-base profile, second non-base profile, and so on. The comparison will still be performed with the functions present in the base profile and only for the similar functions across the profiles.

The default comparison type is name.

--view <view-config>

Compare only the events present in the given view file. Use the command info --list view-configs to get the list of supported view-configs.

--with <directory-path>

Path to the directory containing collected data. Each profile specified with --with will be considered as a non-base profile and compared against the base profile. You can use multiple instances of --with to specify multiple non-base profiles for comparison.

-h| --help

Displays this help information on the console/terminal.

-i, --input-dir <directory-path>

Path to the directory containing collected data. Multiple occurrences of -i are allowed. The first occurrence of -i is considered as the base session, while all the subsequent occurrences of -i are treated as non-base sessions.

Note

When using -i, --input-dir, you should not use the --baseline or --with options in conjunction. If you use --baseline and -i together, the --baseline option will take precedence and be considered as the base session. If the --baseline option is not present, the first occurrence of -i will automatically be considered as the base session.
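
The precedence rules in the note above can be modeled as follows. This is a minimal Python sketch, not AMD uProf code: the function name resolve_sessions is hypothetical. Two behaviors are assumptions that the note does not spell out: treating every -i directory as a non-base session when --baseline is given, and listing -i sessions before --with sessions.

```python
# Hypothetical model (not part of AMD uProf) of how the base session is
# chosen when --baseline, --with, and -i are combined.
def resolve_sessions(input_dirs=(), baseline=None, with_dirs=()):
    """Return (base, non_base) following the documented precedence."""
    inputs = list(input_dirs)
    if baseline is not None:
        # --baseline takes precedence over -i as the base session.
        base = baseline
        # Assumption: all -i directories then count as non-base sessions.
        non_base = inputs + list(with_dirs)
    elif inputs:
        # Without --baseline, the first -i is the base session.
        base = inputs[0]
        non_base = inputs[1:] + list(with_dirs)
    else:
        raise ValueError("no base session: pass --baseline or at least one -i")
    return base, non_base


print(resolve_sessions(input_dirs=["run1", "run2", "run3"]))
print(resolve_sessions(input_dirs=["run2"], baseline="run1", with_dirs=["run3"]))
```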

9.8.5. Examples

9.8.5.1. Windows

Use the following commands to:

Generate a comparison report in HTML from the base profile data to its successor profile data, with the delta shown in percentage

C:\> AMDuProfCLI.exe compare --baseline c:\Temp\cpuprof-tbp\<BASE-SESSION-DIR> --with c:\Temp\cpuprof-tbp\<SUCCESSOR-SESSION-DIR> --type name --show-percentage --html -o c:\Temp\cpuprof-tbp

Generate a comparison report of base profile data with subsequent profile data

C:\> AMDuProfCLI.exe diff --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR> -o c:\Temp\cpuprof-tbp

Generate a comparison report using the -i option

C:\> AMDuProfCLI.exe diff -i c:\Temp\cpuprof-tbp\<BASE-DIR> -i c:\Temp\cpuprof-tbp\<NON-BASE-DIR> -o c:\Temp\cpuprof-tbp

Generate a comparison report without ignoring the unique entries across sessions

C:\> AMDuProfCLI.exe diff --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --type order -o c:\Temp\cpuprof-tbp

Generate a comparison report of base profile data with subsequent profile data sorted on ibs-op event

C:\> AMDuProfCLI.exe diff --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --type name -s event=ibs-op -o c:\Temp\cpuprof-tbp

Generate a comparison report with delta shown in percentage

C:\> AMDuProfCLI.exe compare --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --type name --show-percentage -o c:\Temp\cpuprof-tbp

Generate a comparison report of base profile data with successor profile data with changed function names across sessions

C:\> AMDuProfCLI.exe compare --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --alias CalculateSum,CalculateUpdatedSum|enhanceOutput,optimizeOutput -o c:\Temp\cpuprof-tbp
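
The --alias value used above follows the separator rules from Table 9.8: a pipe between function sets and a comma between the base and non-base names within a set. The following minimal Python sketch builds that value from a list of name groups. The helper name build_alias is hypothetical and not part of AMD uProf, and the check that each group has at least two names is an assumption made for illustration.

```python
# Hypothetical helper (not part of AMD uProf): assembles the --alias value
# from groups of (base name, non-base name, ...) tuples.
def build_alias(groups):
    """Join names with ',' inside a group and groups with '|'."""
    parts = []
    for group in groups:
        if len(group) < 2:
            raise ValueError("each group needs a base and a non-base name")
        parts.append(",".join(group))
    return "|".join(parts)


alias = build_alias([
    ("CalculateSum", "CalculateUpdatedSum"),
    ("enhanceOutput", "optimizeOutput"),
])
print(alias)
```

When such a value is typed into a shell, the pipe character usually needs quoting. Passing it as one element of an argument list, for example to Python's subprocess.run, avoids shell interpretation.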

Generate a comparison report of base profile data with multiple successor profile data

C:\> AMDuProfCLI.exe diff -i c:\Temp\cpuprof-tbp\<BASE-DIR> -i c:\Temp\cpuprof-tbp\<NON-BASE-DIR1> -i c:\Temp\cpuprof-tbp\<NON-BASE-DIR2> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR3> -o c:\Temp\cpuprof-tbp

Generate a comparison report on one of the predefined views

C:\> AMDuProfCLI.exe diff -i c:\Temp\cpuprof-tbp\<BASE-DIR> -i c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --view ipc_assess -o c:\Temp\cpuprof-tbp

Generate a comparison report providing the source and binary paths

C:\> AMDuProfCLI.exe diff -i c:\Temp\cpuprof-tbp\<BASE-DIR> -i c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --bin-path Examples\AMDTClassicMatMul\bin\ --src-path Examples\AMDTClassicMatMul\ --bin-path1 Examples\AMDTClassicMatMulMod\bin\ --src-path1 Examples\AMDTClassicMatMulMod\ -o c:\Temp\cpuprof-tbp

9.8.5.2. Linux

Generate comparison report in html from base profile-data to its successor profile-data with delta shown in percentage

$ AMDuProfCLI compare --baseline /tmp/cpuprof-tbp/<BASE-SESSION-DIR> --with /tmp/cpuprof-tbp/<SUCCESSOR-SESSION-DIR> --type name --show-percentage --html -o /tmp/cpuprof-tbp/

Figure 9.6 Diff HTML report generated with the --html option

Generate a comparison report of base profile data with subsequent profile data

$ AMDuProfCLI diff --baseline /tmp/cpuprof-tbp/<BASE-DIR> --with /tmp/cpuprof-tbp/<NON-BASE-DIR> -o /tmp/cpuprof-tbp

Generate a comparison report of base profile data with subsequent profile data sorted on PMC event

$ AMDuProfCLI diff --baseline /tmp/cpuprof-tbp/<BASE-DIR> --with /tmp/cpuprof-tbp/<NON-BASE-DIR> -s event=pmcxc0,user=1,os=0 -o /tmp/cpuprof-tbp

9.9. Profile Command

The profile command collects the performance profile data, processes it, and generates a profile report in a readable format. It is an alternative to the combination of the collect and report commands.

9.9.1. Synopsis

AMDuProfCLI profile [--help] [<options>] [<PROGRAM>] [<ARGS>]

where,

<PROGRAM>: Denotes the launch application to be profiled.

<ARGS>: Denotes the list of arguments for the launch application.

9.9.2. Common Usages

$ AMDuProfCLI profile <PROGRAM> [<ARGS>]
$ AMDuProfCLI profile [--config <config> | -e <event>] [-a] [-d <duration>] [<PROGRAM>]

9.9.3. Options

The following table lists the profile command options:

Table 9.9 AMDuProfCLI Profile Command Options

Option

Description

--affinity <core-id...>

Set the core affinity of the launched application to be profiled, given as a comma separated list of core-ids. Ranges of core-ids can also be specified, for example, 0-3. The default affinity is all the available cores. This option is not supported while profiling MPI applications.

--agg-interval <low | medium | high | INTERVAL>

Use this option to configure the sample aggregation interval, which is useful when the session is imported into the GUI.

The low aggregation level generates a better timeline view in the GUI but increases the database size.

Aggregation INTERVAL can also be specified as a numeric value in milliseconds.

--ascii <event-dump | raw-dump>

Use this option to generate ASCII dump of IBS OP profile samples.

  • event-dump: Generate formatted event dump of IBS profile samples.

  • raw-dump: Generate raw data dump of IBS profile samples.

Note

This option might delay the translation.

--bin-path <path>

Binary file path. Multiple uses of --bin-path are allowed.

--branch-filter

OS Support: Linux

Use this option to capture LBR data. Specify the branch filter type:

  • u: user branches

  • k: kernel branches

  • any: any branch type

  • any_call: any call branch

  • any_ret: any return branch

  • ind_call: indirect calls

  • ind_jmp: indirect jumps

  • cond: conditional branches

  • call: direct calls

When the above filters are not set, the default filter type will be any.

Note

  • This option will work only with the PMC events.

  • This is applicable to per process and attach process profiling. However, it is not applicable to Java app profiling.

--call-graph <F:N>

OS Support: Linux

Enables callstack sampling. Specify (F) to collect/ignore missing frames due to omission of frame pointers by compiler:

  • fpo | dwarf: Collect the process callstack during sample collection and use the DWARF information to reconstruct callstack.

  • fp: Use the frame pointers to collect callstack information.

When F = fpo, (N) specifies the maximum stack size in bytes to collect per sample collection. Valid range of the stack size: 16 - 32768. If N is not a multiple of 8, it is aligned down to the nearest multiple of 8. The default value is 1024 bytes.

Note

Passing a large N value will generate a very large raw data file.

When F = fp, the value of N is not applicable and is ignored if passed.
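
The stack-size rule for F = fpo can be illustrated with a short Python sketch. It is not part of AMD uProf: the function names are hypothetical, and checking the range before aligning is an assumption, since the order is not stated here.

```python
def normalize_stack_size(n=1024):
    """Align N down to a multiple of 8, as described for --call-graph fpo:N."""
    if not 16 <= n <= 32768:
        raise ValueError("stack size must be within 16 - 32768 bytes")
    return n - (n % 8)  # align down to the nearest multiple of 8


def call_graph_arg(mode="fp", size=None):
    """Build the <F:N> value; N is only meaningful for fpo | dwarf."""
    if mode == "fp":
        return "fp"  # N is not applicable and would be ignored
    return f"{mode}:{normalize_stack_size(1024 if size is None else size)}"
```

For example, call_graph_arg("fpo", 1030) yields the value fpo:1024, because 1030 is aligned down to 1024.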

--call-graph <I:D:S:F>

OS Support: Windows

Enables callstack Sampling. Specify the Unwind Interval (I) in milliseconds and Unwind Depth (D) value. Specify the Scope (S) by choosing one of the following:

  • user: Collect only for the user space code.

  • kernel: Collect only for the kernel space code.

  • all: Collect for the code executed in the user and kernel space.

Specify whether to collect missing frames due to Frame Pointer Omission (F) by the compiler:

  • fpo: If the frame pointers are not available, collect callstack information using unwind information.

  • fp: Use the frame pointers to collect callstack information.

--call-graph-depth <num>

OS Support: Windows

Set callstack unwind depth. Depth must be within the range [2 - 392]. Default depth is 128.

--call-graph-depth <num>

OS Support: Linux

Set callstack unwind depth. Depth must be within the range [2 - 1024].

Default depth is 32. This option is applicable for Hotspots and Threading configurations, for any other configurations this option will be ignored.

--call-graph-interval <num>

OS Support: Windows

Set callstack unwind interval. Interval must be within the range [1 - 100]. Default interval is 1 ms.

--call-graph-mode <fp | fpo | dwarf>

OS Support: Linux

Callstack collection mode. Default mode is fp.

  • fp: Use Frame pointers to collect call stack information.

  • fpo | dwarf: Collect process call stack during sample collection and use DWARF information to reconstruct the call stack.

--call-graph-mode <mode>

OS Support: Windows

Set callstack collection mode.

  • fpo - If frame pointers are not available, collect call-stack information using unwind information.

  • fp - Use Frame pointers to collect callstack information. Default mode.

--call-graph-size <size>

OS Support: Linux

Callstack Size. Default size is 1024 bytes.

When mode = fpo | dwarf; size must be within [16 - 32768] and specifies the max stack-size (in bytes) to collect per call stack sample.

When mode = fp, the size is ignored and need not be passed.

--call-graph-type <scope type>

OS Support: Windows

Set callstack scope type. Scope type should contain one of these options:

  • user - Collect only for user space code.

  • kernel - Collect only for kernel space code.

  • all - Collect for code executed in user and kernel space.

Default scope type is user.

--category <PROFILE>

OS Support: Linux

Process only a specific profiling category. Multiple categories can be specified as a comma separated list. If this option is not used, the raw data files of all categories are processed. Multiple instances of --category are allowed.

The supported categories are:

  • cpu - CPU Profiling

  • mpi - MPI Tracing

  • openmp - OpenMP Tracing

  • trace - User mode tracing

  • gputrace - GPU Tracing

  • gpuprof - GPU Profiling

Example:

--category cpu,mpi,trace,gputrace,gpuprof

--category mpi --category cpu --category trace --category gputrace --category gpuprof
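
The two equivalent forms shown above (one comma separated list, or repeated --category options) can be generated with a small Python sketch. The helper name category_args is hypothetical and the code is illustrative only; the category names are the ones listed for this option.

```python
SUPPORTED_CATEGORIES = ("cpu", "mpi", "openmp", "trace", "gputrace", "gpuprof")


def category_args(categories, repeated=False):
    """Return argv fragments selecting the given profiling categories."""
    for name in categories:
        if name not in SUPPORTED_CATEGORIES:
            raise ValueError(f"unsupported category: {name}")
    if repeated:
        # One --category option per category.
        return [part for name in categories for part in ("--category", name)]
    # A single --category option with a comma separated list.
    return ["--category", ",".join(categories)]
```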

--config <config>

Predefined sampling configuration to be used to collect samples.

Use the command info --list collect-configs to get the list of supported configs. Multiple occurrences of --config are allowed.

--cutoff <n>

Cut-off to limit the number of functions to be reported. n is the maximum number of entries to be reported in various report sections. The default value is 10.

Note

--cutoff 0 will report all the data.

--detail

Generate detailed report.

--disasm

Report only the assembly instructions having samples. This option only works with the --detail option.

--disasm-full

Report all the assembly instructions of a function with and without samples. This option only works with the --detail option.

--disasm-only

Generate the function report with only assembly instructions.

--disasm-style <att | intel>

Choose the syntax of assembly instructions. Supported options are att or intel. If this option is not used, the default style used is intel.

--enable-logts

Capture the timestamp of the log records.

--env-var <key1=value1:key2=value2:...>

Use this option to set the environment variables.
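
As an illustration of the key1=value1:key2=value2 format, the following Python sketch joins a dictionary into a single --env-var value. The helper name env_var_arg is hypothetical. How the tool treats a ':' inside a value is not described here, so the sketch simply rejects such input rather than guessing at an escaping rule.

```python
def env_var_arg(variables):
    """Join environment variables into 'key1=value1:key2=value2'."""
    pairs = []
    for key, value in variables.items():
        value = str(value)
        if ":" in key or "=" in key or ":" in value:
            # ':' separates pairs and '=' separates key from value.
            raise ValueError(f"cannot encode {key!r} unambiguously")
        pairs.append(f"{key}={value}")
    return ":".join(pairs)
```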

--exclude-func <module:function-pattern>

OS Support: Linux

Specify functions to exclude from the library, executable, or kernel:

  • function-pattern can be a function name or partial name ending with * or only * to match all the functions of a module.

  • Module can be a library or executable. To trace the kernel functions, replace the module with ‘kernel’.

Note

It is recommended to provide the absolute path of a module

--export-session

Use this option to create a compressed archive of the required session files, which can be used on another system for analysis.

--frequency <n> | --freq <n> | -F <n>

Enable data collection at the specified frequency ‘n’ (in Hz) for Core PMC events.

Note

This frequency will override the sampling frequency specified with individual events.

--func <module:function-pattern>

OS Support: Linux

Specify functions to trace from the library, executable, or kernel. The function-pattern can be a function name or partial name ending with ‘*’ or only ‘*’ to trace all the functions of a module.

Module can be a library or executable. To trace the kernel functions, replace the module with ‘kernel’.

Note

It is recommended to provide the absolute/full path of a module.

--funccount-interval <funccount-interval>

OS Support: Linux

Specify the time interval in seconds to list the function count detail report. If this option is not specified, function count will be generated for the entire profile duration.

--group-by <section>

Specify the report to be generated. The supported report options are:

  • process: Report process details

  • module: Report module details

  • thread: Report thread details

This option is applicable only with --detail option. The default is group-by process.

--guest-kallsyms <path>

OS Support: Linux

Specify the path of guest /proc/kallsyms copied on the local host. AMD uProf reads it to get the guest kernel symbol.

--guest-modules <path>

OS Support: Linux

Specify the path of guest /proc/modules copied to the local host. AMD uProf reads it to get the guest kernel module information.

--guest-search-path <path>

OS Support: Linux

Specify the path of guest vmlinux and kernel sources copied on the local host. AMD uProf reads it to resolve the guest kernel module information.

--host <hostname>

OS Support: Linux

This option is used along with the --input-dir option. Generates the report belonging to a specific host. The supported options are:

  • <hostname>: Report process belonging to a specific host.

  • all: Report all the processes.

Note

If --host is not used, only the processes belonging to the system from which the report is generated are reported. If the system is a master node in a cluster, the report will be generated for the lexicographically first host in that cluster.

--ignore-system-module

Ignore samples from system modules.

--imix

Report Instruction Mix (only for native binaries). Default is module-wise IMIX.

--imix-group-by <module | thread | function>

IMIX report generation. Supported group-by options are:

  • module: Report module-wise IMIX.

  • thread: Report thread-wise IMIX.

  • function: Report function-wise IMIX.

--inline

Inline function extraction for C and C++ executables.

Note

Using this option will increase the time taken to generate the report.

--interval <num>

Sampling interval for PMC events.

Note

This interval will override the sampling interval specified with individual events.

--kvm-guest <pid>

OS Support: Linux

Specify the PID of qemu-kvm process to be profiled to collect guest-side performance profile.

--legacy-symbol-downloader

OS Support: Windows

Download symbols using the Microsoft Symsrv. By default, AMD symbol downloader will be used.

--limit-cacheinfo <n>

Show the shared cachelines accessed by more than one process/thread for cache analysis. Set n to the number of shared cacheline addresses to be reported. Use this option for false cache sharing analysis.

--limit-data <n>

OS Support: Windows

Stop the profiling when the collected data file size (in MB) crosses the specified limit. When used with the option --overwrite, the limit is before the collection is terminated. Size can be specified with the suffix Megabytes (M/m), Gigabytes (G/g), or Seconds (secs).

--limit-size <n>

Stop the profiling when the collected data file size (in MB) crosses the specified limit.

Note

This option may be deprecated in future releases.

--log-path <path-to-logdir>

Specify the path where the log file should be created. If this option is not provided, the log file will be created either in the path set by the AMDUPROF_LOGDIR environment variable or, by default, in the $TEMP path (on Linux and FreeBSD) or the %TEMP% path (on Windows).

The log file name will be of the format $USER-AMDuProfCLI.log (on Linux and FreeBSD) or %USERNAME%-AMDuProfCLI.log (on Windows).
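
The lookup described above can be sketched in Python as follows. This is an illustration, not the tool's implementation: the function name resolve_log_file is hypothetical, and the precedence (--log-path first, then AMDUPROF_LOGDIR, then TEMP) is an assumption inferred from the wording of this option.

```python
import ntpath
import posixpath


def resolve_log_file(env, log_path=None, windows=False):
    """Return the expected log file location for the given environment."""
    pathmod = ntpath if windows else posixpath
    # %USERNAME% on Windows, $USER on Linux and FreeBSD.
    user = env["USERNAME"] if windows else env["USER"]
    # Assumed precedence: option, then AMDUPROF_LOGDIR, then TEMP.
    directory = log_path or env.get("AMDUPROF_LOGDIR") or env["TEMP"]
    return pathmod.join(directory, f"{user}-AMDuProfCLI.log")
```

Passing dict(os.environ) as env reproduces the lookup for the current session.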

--no-inherit

Do not profile the children of the launched application (processes launched by the profiled application).

--no-report

Use this option to perform only collection and translation.

--openmp-impl <ompt | omplib>

OS Support: Linux

Provide OpenMP implementation type:

  • ompt for tracing of OpenMP libraries supporting OMPT interface (example: LLVM, AOCC). Default implementation type.

  • omplib for tracing GCC OpenMP library.

Note

Use this option with --trace openmp option.

--openmp-scope <full | basic>

OS Support: Linux

Provide tracing scope.

  • full for complete tracing

  • basic for basic tracing, where synchronization related OpenMP events are not traced to reduce the disk space usage. Default selection is basic.

Note

Use this option with --trace openmp option.

--osrt-event <event1,event2...>

This option is only applicable with --openmp-impl ompt.

OS Support: Linux

Provide event names. Use command info --list trace-events for the list of trace events.

Note

Use this option with --trace osrt option.

--osrt-exclude-funcs <module:function-pattern>

OS Support: Linux

Specify functions to exclude from the library or executable.

  • Function-pattern can be a function name or partial name ending with *. Use only * to trace all the functions of a module.

  • Module can be an absolute path to a library or executable.

This option will be deprecated in a future release. It is recommended to use --exclude-func instead.

--osrt-func-size <size>

OS Support: Linux

Provide minimum function size to trace. Default function size is 128 bytes.

This option will be deprecated in a future release. Recommended to use --func-size.

Note

Use this option with --trace osrt option.

--osrt-funcs <module:function-pattern>

OS Support: Linux

Specify functions to trace from the library or executable.

  • Function-pattern can be a function name or partial name ending with *. Use only * to trace all the functions of a module.

  • Module can be an absolute path to a library or executable.

This option will be deprecated in a future release. It is recommended to use --func instead.

--osrt-threshold <event:threshold>

OS Support: Linux

Provide event name and threshold value.

Note

Use this option with --trace osrt option.

--overwrite

OS Support: Windows

Specify the profile data collection mode as a ring buffer. The collection limit can be set using the option --limit-data. The default --limit-data is to restrict the raw data file size to 512 pages per core.

--python-show-all

Use this option to show Python interpreter functions in the callgraph/flamegraph when translation is performed on Python profiled data (on Linux).

--remove-raw-files

Removes the raw data files to reclaim the disk space.

--report-output <path>

Write a report to a file. If the path has a .csv extension, it is assumed to be a file path and is used as it is. If the .csv extension is not used, the path is assumed to be a directory and the report file is generated in that directory with the default name.
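
The extension rule can be expressed as a short Python sketch. The function name classify_report_output is hypothetical, and treating only a lowercase .csv suffix as a file path is an assumption; case handling is not described here.

```python
import os


def classify_report_output(path):
    """Return 'file' for a .csv path, otherwise 'directory'."""
    _, extension = os.path.splitext(path)
    return "file" if extension == ".csv" else "directory"
```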

--retranslate

Perform the re-translation of collected data files with a different set of translation options.

--show-all-cachelines

Show all cachelines in report sections for cache analysis. By default, only cachelines accessed by more than one process/thread are listed. Use this option for false cache sharing analysis.

--show-event-count

Show the number of events that occurred.

--show-percentage

Show percentage of samples instead of actual samples.

--show-sample-count

Show the number of samples. This option is enabled by default.

--show-sys-src

Generate detailed function report of the system module functions (if debug info is available) with the source statements. This option only works with the --detail option.

--src-path <path>

Source file path. Multiple uses of --src-path are allowed.

--src-path <path1;...>

Source file directories (semicolon separated paths). Multiple uses of --src-path are allowed.

--start-delay <n>

Start delay n in seconds. Start profiling after the specified duration. When ‘n’ is 0, there is no impact.

--start-paused

Profiling paused indefinitely. The target application resumes the profiling using the profile control APIs. This option must be used only when the launched application is instrumented to control the profile data collection using the resume and pause APIs (see AMDPowerProfileAPI Library for definitions).

--stdout

Print the report to a console or terminal.

--symbol-cache-dir <path>

OS Support: Windows

Path to save the symbols downloaded from the Symbol Servers.

--symbol-path <path1;...>

Debug Symbol paths (semicolon separated). Multiple uses of --symbol-path are allowed.

--symbol-server <path1;...>

OS Support: Windows

Symbol Server directories (semicolon separated paths). For example, Microsoft Symbol Server. Multiple use of --symbol-server is allowed.

--thread <thread=concurrency>

OS Support: Windows

Collect the thread run time info to report thread concurrency. Thread concurrency shows how long a specific number of threads run simultaneously.

--tid <TID,..>

OS Support: Linux

Profile existing threads by attaching to a running thread. The thread IDs are separated by comma.

--time-filter <T1:T2>

Restricts the processing to the time interval between T1 and T2, where T1, T2 are time in seconds from profile start time.

--trace <TARGET>

OS Support: Linux

To trace a target domain. TARGET can be one or more of the following:

  • osrt - to enable tracing of os runtime. Use command info --list trace-events for the list of trace events.

  • func - to enable tracing of functions. Use --func, --func-size and --func-threshold to configure additional options.

  • memory - to enable tracing of dynamic memory allocations. Use --memory-threshold to configure threshold.

  • mpi - to enable tracing of MPI application. Use --mpi-impl and --mpi-scope to configure additional options.

  • openmp - to enable tracing of OpenMP application. Use --openmp-impl and --openmp-scope to configure additional options.

  • gpu - to enable tracing of a target application on GPU. By default, the domain is set to hip and hsa.

Note

Applicable to per process profiling. Not applicable to:

  • System wide profiling

  • Java app profiling

  • Attach process profiling

For OpenMP tracing:

  • For ompt - application should be compiled with LLVM-8 or later, AOCC-2.1 or later, ICC-19.1 or later.

  • For omplib - application should be compiled with GCC-7 or later.

  • Supported base languages are: C, C++, and Fortran.

--view <view-config>

Compare only the events present in the given view file. Use the command info --list view-configs to get the list of supported view-configs.

--vmlinux-path <path>

OS Support: Linux

Path to the Linux kernel debug info file. If no path is provided, it searches for the debug info file in the default download path.

-a | --system-wide

System Wide Profile (SWP): If this flag is not set, the command line tool will profile only the launched application or the Process IDs attached with -p option.

-b | --terminate

Terminate the launched application after the profile data collection ends. Only the launched application process will be killed. Its children (if any) may continue to execute.

-c | --cpu <core...>

Comma separated list of CPUs to profile. The ranges of CPUs can be specified with ‘-’, for example, 0-3. This option is not supported with MPI profiling.

Note

On Windows, the selected cores should belong to only one processor group. For example, 0-63, 64-127, and so on.
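
The core-list syntax and the Windows processor group restriction can be checked before launching a profile. The Python sketch below is illustrative only: the function names are hypothetical, and the group size of 64 is inferred from the example ranges 0-63 and 64-127 above rather than queried from the system.

```python
def expand_core_list(spec):
    """Expand a list such as '0-3,8' into sorted core IDs."""
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if low > high:
                raise ValueError(f"invalid range: {part}")
            cores.update(range(low, high + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def in_single_processor_group(cores, group_size=64):
    """True if all cores fall into one group of `group_size` cores."""
    return len({core // group_size for core in cores}) <= 1
```

For instance, the list 60-70 spans two groups under this assumption, so it would need to be split into two profile runs on Windows.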

-d | --duration <n>

Profile only for the specified duration n in seconds.

-e | --event or <predefined-event>

A predefined event, which has predefined arguments, can be used directly with -e or --event.

Alternatively, for providing more granular parameters, specify Timer, PMU, IBS event, or a predefined event with arguments in the form of comma separated key=value pairs. The supported keys are:

  • event=<timer | ibs-fetch | ibs-op> or <PMU-event> or <predefined-event>

  • umask=<unit-mask>

  • user=<0 | 1>

  • os=<0 | 1>

  • cmask=<count-mask> (Value should be in the range 0x0 to 0x7f)

  • inv=<0 | 1>

  • interval=<sampling-interval>

  • frequency=<frequency (n)> (Supported only for Core PMC events. Frequency should be provided in Hz)

  • ibsop-count-control=<0 | 1> (for ibs-op event)

  • loadstore (for ibs-op event, only on Windows platform)

  • ibsop-l3miss (for ibs-op event, supported only on AMD “Zen4” processors)

  • ibsfetch-l3miss (for ibs-fetch event, supported only on AMD “Zen4” processors)

  • ibsop-ldlat=<LATENCY> (Filter IBS OP samples by data cache miss latency threshold in CPU cycles. LATENCY must be an integer that is a multiple of 128 and between 128 and 2048. Supported on AMD Zen5 and later processors.)

  • call-graph

Note

  1. Providing umask with predefined event is not required.

  2. Use the dedicated option --call-graph to specify the arguments related to the call stack sample collection.

Argument details

  • user – Enable(1) or disable(0) user space samples collection

  • os - Enable(1) or disable(0) kernel space samples collection

  • interval – Sample collection interval. For timer, it is the time interval in milliseconds. For PMU and predefined events, it is the count of the event occurrences. For IBS FETCH, it is the fetch count. For IBS OP, it is the cycle count or the dispatch count.

  • ibsop-count-control – Choose IBS OP sampling by cycle(0) count or dispatch(1) count.

  • loadstore – Enable only the IBS OP load/store samples collection, other IBS OP samples are not collected.

  • ibsop-l3miss – Enable IBS OP sample collection only when an L3 miss occurs (1), or do not filter out any IBS OP samples (0). For example, -e event=ibs-op,interval=100000,ibsop-l3miss=1

  • ibsfetch-l3miss – Enable IBS FETCH sample collection only when an l3 miss occurs. For example, -e event=ibs-fetch,interval=100000,ibsfetch-l3miss=1

  • ibsop-ldlat – Filter IBS OP samples by data cache miss latency threshold in CPU cycles. LATENCY must be an integer that is a multiple of 128 and between 128 and 2048. For example, -e event=ibs-op,interval=100000,ibsop-ldlat=256.

When these arguments are not passed, then the default values are:

  • umask = 0

  • cmask = 0x0

  • user = 1

  • os = 1

  • inv = 0

  • ibsop-count-control = 1 (for ibs-op event)

  • ibsop-l3miss = 0

  • ibsfetch-l3miss = 0
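
The comma separated key=value form accepted by -e can be assembled and sanity-checked with a small helper. The following Python sketch is not part of AMD uProf; build_event_spec is a hypothetical name, and it validates only the two ranges stated in this section (cmask and ibsop-ldlat). Keys that take no value, such as call-graph, are passed with the value None.

```python
def build_event_spec(event, options=None):
    """Return a value for -e such as 'event=ibs-op,interval=100000'."""
    parts = [f"event={event}"]
    for key, value in (options or {}).items():
        if key == "cmask":
            if not 0x0 <= value <= 0x7F:
                raise ValueError("cmask must be in the range 0x0 to 0x7f")
            value = hex(value)  # emit the count mask in hexadecimal
        elif key == "ibsop-ldlat":
            if value % 128 != 0 or not 128 <= value <= 2048:
                raise ValueError(
                    "ibsop-ldlat must be a multiple of 128 between 128 and 2048")
        # A value of None produces a bare key, for example 'call-graph'.
        parts.append(key if value is None else f"{key}={value}")
    return ",".join(parts)
```

Each returned string is meant to follow its own -e option on the command line.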

-g

Same as passing --call-graph fp (Linux, FreeBSD).

Same as passing --call-graph 1:128:user:fp (Windows).

-h | --help

Displays this help information on the console/terminal.

-m | --data-buffer-count <size>

OS Support: Windows

Size (number of pages per core) of the buffer used for data collection by the driver. The default size is 512 pages per core.

-m | --mmap-pages <size>

OS Support: Linux

Set the kernel memory mapped data buffer to size. The size can be specified in pages or with a suffix Bytes (B/b), Kilo bytes (K/k), Megabytes (M/m), and Gigabytes (G/g).

-o | --output-dir <directory-path>

Base directory path in which collected data files will be saved. A new sub-directory will be created in this directory.

-p | --pid <PID...>

Profile the existing processes by attaching to a running process. The process IDs are separated by comma.

Note

  1. A maximum of 512 processes can be attached at a time.

  2. On FreeBSD, multiple attach is not supported.

-s | --sort-by <EVENT>

Specify the Timer, PMC, or IBS event on which the reported profile data will be sorted with arguments in the form of comma separated key=value pairs.

The supported keys are:

  • event=<timer| ibs-fetch | ibs-op | pmcxNNN>, where NNN is hexadecimal Core PMC event ID.

  • umask=<unit-mask>

  • cmask=<count-mask>

  • inv=<0| 1>

  • user=<0| 1>

  • os=<0| 1>

  • metric=<cpu_time | total_cpu_time | self_time | total_time>

When both event and metric are enabled, event takes priority over metric.

Use the command info --list pmu-events for the list of supported PMC events.

Details about the arguments:

  • umask: Unit mask in decimal or hexadecimal, applicable only to the PMC events.

  • cmask: Count mask in decimal or hexadecimal, applicable only to the PMC events.

  • user, os: User and OS mode. Applicable only to the PMC events.

  • inv: Invert Count Mask, applicable only to the PMC events.

  • metric:

    • cpu_time is applicable only if CPU_TIME event is collected.

    • total_cpu_time is applicable only with hotspots (or) threading analysis, if callstack collection (-g) is enabled for dynamically linked launch application.

    • self_time and total_time are applicable only if function tracing is collected.

  • Multiple occurrences of --sort-by (-s) are not allowed.
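
A sort specification can be composed in the same key=value style. The Python sketch below is illustrative and uses hypothetical helper names. It formats a Core PMC event ID as pmcxNNN without zero padding (as in the pmcxc0 usage in this chapter) and, mirroring the rule that event takes priority over metric, emits only the event when both are given.

```python
def pmc_event_name(event_id):
    """Format a Core PMC event ID, for example 0xC0 -> 'pmcxc0'."""
    return f"pmcx{event_id:x}"


def sort_by_spec(event=None, metric=None, **pmc_keys):
    """Build the value for -s / --sort-by from an event or a metric."""
    if event is None and metric is None:
        raise ValueError("either event or metric is required")
    if event is not None:
        # umask, cmask, inv, user, and os apply only to PMC events.
        parts = [f"event={event}"]
        parts += [f"{key}={value}" for key, value in pmc_keys.items()]
        return ",".join(parts)
    return f"metric={metric}"
```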

-w | --working-dir <path>

Specify the working directory. The default is the current working directory.

9.9.4. Examples

9.9.4.1. Windows

Launch the application AMDTClassicMatMul.exe and collect the samples for CYCLES_NOT_IN_HALT and RETIRED_INST events and generate report

C:\> AMDuProfCLI.exe profile -e cycles-not-in-halt -e retired-inst --interval 1000000 -o c:\Temp\cpuprof-custom AMDTClassicMatMul.exe
C:\> AMDuProfCLI.exe profile -e event=cycles-not-in-halt,interval=250000 -e event=retired-inst,interval=500000 -o c:\Temp\cpuprof-custom AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe, collect IBS samples, and generate thread-wise IMIX report

C:\> AMDuProfCLI.exe profile --config ibs --imix --imix-group-by thread -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and perform Assess Performance profile for 10 seconds and generate report

C:\> AMDuProfCLI.exe profile --config assess -o c:\Temp\cpuprof-assess -d 10 AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect the IBS samples in the SWP mode and generate report sorted on ibs-op event

C:\> AMDuProfCLI.exe profile --config ibs -a -s event=ibs-op -o c:\Temp\cpuprof-ibs-swp AMDTClassicMatMul.exe

Collect the TBP samples in SWP mode for 10 seconds and generate report

C:\> AMDuProfCLI.exe profile -a -o c:\Temp\cpuprof-tbp-swp -d 10

Launch AMDTClassicMatMul.exe, collect TBP with callstack sampling and generate report

C:\> AMDuProfCLI.exe profile --config tbp -g -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe, collect TBP with callstack sampling (unwind FPO optimized stack) and generate report

C:\> AMDuProfCLI.exe profile --config tbp --call-graph-mode fpo --call-graph-type user -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect the samples for PMCx076 and PMCx0C0 and generate report sorted on pmcxc0 event

C:\> AMDuProfCLI.exe profile -e event=pmcx76,interval=250000 -e event=pmcxc0,user=1,os=0,interval=250000 -s event=pmcxc0 -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect the samples for IBS OP with an interval of 50000 and generate report sorted on ibs-op event

C:\> AMDuProfCLI.exe profile -e event=ibs-op,interval=50000 -s event=ibs-op -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and do TBP samples profile for thread concurrency, name, and generate report

C:\> AMDuProfCLI.exe profile --config tbp --thread thread=concurrency,name -o c:\Temp\cpuproftbp AMDTClassicMatMul.exe

Collect samples for PMCx076 and PMCx0C0, but collect the call graph info only for PMCx0C0 and generate report

C:\> AMDuProfCLI.exe profile -e event=pmcx76,interval=250000 -e event=pmcxc0,interval=250000,call-graph -o c:\Temp\cpuprof-pmc AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect the samples for predefined event RETIRED_INST and L1_DC_REFILLS.ALL events and generate report

C:\> AMDuProfCLI.exe profile -e event=RETIRED_INST,interval=250000 -e event=L1_DC_REFILLS.ALL,user=1,os=0,interval=250000 -o c:\Temp\cpuprof-pmc AMDTClassicMatMul.exe

Launch AMDTClassicMatMul.exe and collect the TBP, Assess Performance samples, and generate report

C:\> AMDuProfCLI.exe profile --config tbp --config assess -o c:\Temp\cpuprof-tbp-assess AMDTClassicMatMul.exe

9.9.4.2. Linux

Launch AMDTClassicMatMul-bin and collect the samples for CYCLES_NOT_IN_HALT and RETIRED_INST events and generate report

$ ./AMDuProfCLI profile -e cycles-not-in-halt -e retired-inst --interval 1000000 -o /tmp/cpuprof-custom AMDTClassicMatMul-bin
$ ./AMDuProfCLI profile -e event=cycles-not-in-halt,interval=250000 -e event=retired-inst,interval=500000 -o /tmp/cpuprof-custom AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the IBS samples and generate thread-wise IMIX report from the raw data file

$ ./AMDuProfCLI profile --config ibs --imix --imix-group-by thread -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and perform Assess Performance profile for 10 seconds and generate report

$ ./AMDuProfCLI profile --config assess -o /tmp/cpuprof-assess -d 10 AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the IBS samples in the SWP mode and generate report sorted on ibs-op event

$ ./AMDuProfCLI profile --config ibs -a -s event=ibs-op -o /tmp/cpuprof-ibs-swp AMDTClassicMatMul-bin

Collect the TBP samples in SWP mode for 10 seconds and generate report

$ ./AMDuProfCLI profile -a -o /tmp/cpuprof-tbp-swp -d 10

Launch AMDTClassicMatMul-bin and collect TBP with callstack sampling and generate report

$ ./AMDuProfCLI profile --config tbp -g -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect TBP with callstack sampling (unwind FPO optimized stack) and generate report

$ ./AMDuProfCLI profile --config tbp --call-graph-mode fpo --call-graph-size 512 -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin, collect the samples for PMCx076 and PMCx0C0, and generate report

$ ./AMDuProfCLI profile -e event=pmcx76,interval=250000 -e event=pmcxc0,user=1,os=0,interval=250000 -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the samples for IBS OP with interval 50000 and generate report sorted on ibs-op event

$ ./AMDuProfCLI profile -e event=ibs-op,interval=50000 -s event=ibs-op -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin

Attach to a thread, collect TBP samples for 10 seconds, and generate report

$ AMDuProfCLI profile --config tbp -o /tmp/cpuprof-tbp-attach -d 10 --tid <TID>

Collect basic OpenMP trace info of an OpenMP application compiled with GCC OpenMP library and generate the report

$ AMDuProfCLI profile --trace openmp --openmp-impl omplib -o /tmp/cpuprof-omp <path-to-openmp-exe>

Collect the samples for PMCx076 and PMCx0C0, but collect the call graph info only for PMCx0C0 and generate report

$ AMDuProfCLI profile -e event=pmcx76,interval=250000 -e event=pmcxc0,interval=250000,call-graph -o /tmp/cpuprof-pmc AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect pthread runtime trace with default threshold

$ AMDuProfCLI collect --trace osrt --osrt-event pthread -o /tmp/cpuprof-os AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect syscall which are taking more than or equal to 1ms and generate report

$ AMDuProfCLI profile --trace osrt --osrt-event syscall --osrt-threshold syscall:1000000 -o /tmp/cpuprof-os AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the GPU Traces and generate gpu trace report

$ AMDuProfCLI profile --trace gpu --category gputrace -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect the TBP samples, GPU Traces and generate report

$ AMDuProfCLI profile --config tbp --trace gpu -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect ‘GPU’ samples and generate report

$ AMDuProfCLI profile --config gpu -o /tmp/gpuprof-gpu AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect ‘GPU’ samples for ‘SQ’ Block

$ AMDuProfCLI profile --config gpu --ip-block SQ -o /tmp/gpuprof-gpu AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect trace data for all functions in ‘AMDTClassicMatMul-bin’

$ AMDuProfCLI profile --trace osrt --osrt-event function --osrt-funcs AMDTClassicMatMul-bin:* -o /tmp/cpuprof-os AMDTClassicMatMul-bin

Launch AMDTClassicMatMul-bin and collect trace data for all functions in ‘AMDTClassicMatMul-bin’ that have a size greater than or equal to 64 bytes

$ AMDuProfCLI profile --trace osrt --osrt-event function --osrt-func-size 64 --osrt-threshold function:10000 --osrt-funcs AMDTClassicMatMul-bin:* -o /tmp/cpuprof-os AMDTClassicMatMul-bin

9.10. Info Command

The Info command fetches the generic information about the system, PMC event details, predefined event details, and so on.

9.10.1. Synopsis

AMDuProfCLI info [--help] [<options>]

9.10.2. Common Usages

$ AMDuProfCLI info --system

9.10.3. Options

The following table lists the info command options:

Table 9.10 AMDuProfCLI INFO Command Options

Option

Description

--bpf

OS Support: Linux

Displays details of the BPF support and BCC Installation.

--collect-config <name>

Displays the details of the given profile configuration used with collect --config <name> option.

Use info --list collect-configs command for the details on the supported profile configurations.

--list <type>

OS Support: Linux

Lists the supported items for the following types:

  • trace-events: List of trace events that can be used with the collect --trace os or collect --trace user option.

  • gpu-events: List of GPU events that can be used in the gpu profile configuration.

--list <type>

Lists the supported items for the following types:

  • collect-configs: Predefined profile configurations that can be used with the collect --config option.

  • predefined-events: List of the supported predefined events that can be used with the collect --event option.

  • pmu-events: Raw PMC events that can be used with the collect --event option. Alternatively, info --pmu-event all can be used to print information about all the supported events.

  • cacheline-events: List of event aliases to be used with the report --sort-by option for cache analysis. It is supported only on Windows and Linux platforms.

  • view-configs: List of the supported data view configurations that can be used with the report --view option.

--pmu-event <event>

Displays the details of the given PMU event. Use the info --list pmu-events command for the list of supported PMU events.

--system

Displays the processor information of this system.

--view-config <name>

Displays the details of the given view configuration used with the report --view <name> option during report generation.

Use the info --list view-configs command for details on the supported data view configurations.

-h | --help

Displays the help information.

9.10.4. Examples

Use the following commands to:

Print the system details

C:\> AMDuProfCLI.exe info --system

Print the list of predefined profiles

C:\> AMDuProfCLI.exe info --list collect-configs

Print the list of PMU events

C:\> AMDuProfCLI.exe info --list pmu-events

Print the list of predefined report views

C:\> AMDuProfCLI.exe info --list view-configs

Print the details of a predefined profile, such as “assess_ext”

C:\> AMDuProfCLI.exe info --collect-config assess_ext

Print the details of a PMU event, such as PMCx076

C:\> AMDuProfCLI.exe info --pmu-event pmcx76

Print the details of a view configuration, such as ibs_op_overall

C:\> AMDuProfCLI.exe info --view-config ibs_op_overall

Print the list of trace events (supported on Linux only)

$ AMDuProfCLI info --list trace-events
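On Linux or FreeBSD, the `info --list` queries shown above can be captured to files for later reference. The following is a minimal sketch rather than an official workflow: `dump_lists`, `UPROF_CLI`, and `OUT_DIR` are hypothetical names introduced here, and the helper assumes that the CLI returns a nonzero exit status when a query fails. It covers only the list types that the options table does not restrict to a single OS.

```shell
#!/bin/sh
# Sketch: save the output of each `info --list <type>` query to a text file.
# UPROF_CLI and OUT_DIR are hypothetical helper variables, not uProf settings.
UPROF_CLI="${UPROF_CLI:-AMDuProfCLI}"
OUT_DIR="${OUT_DIR:-./uprof-info}"

dump_lists() {
    mkdir -p "$OUT_DIR" || return 1
    # List types taken from the info options table above.
    for type in collect-configs predefined-events pmu-events view-configs; do
        if "$UPROF_CLI" info --list "$type" > "$OUT_DIR/$type.txt" 2>&1; then
            echo "saved $type"
        else
            # Assumption: a nonzero exit status signals a failed query.
            echo "failed $type"
        fi
    done
}
```

Running `dump_lists` after sourcing the script writes one file per list type under `OUT_DIR`. Set `UPROF_CLI` to the full path of the binary if AMDuProfCLI is not on the `PATH`.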