11. AMD uProf Application Programming Interfaces

The AMDProfileControl APIs allow you to limit the profiling scope to a specific portion of the code within the target application. This is particularly helpful when you want to focus performance monitoring and analysis on a specific area of code within a long-running application.

Usually, while profiling an application, samples are collected for the entire control flow of the application, that is, from the start of execution until the end of execution. The control APIs enable the profiler to collect data only for a specific part of the application, for example, a CPU-intensive loop or a hot function.

To profile only the required code regions, instrument the target application with the profiling enable/disable APIs and then recompile it.

11.1. CPU Profile Control APIs

Profile Control APIs allow you to pause and resume profiling at runtime from within the application being profiled. You can call these APIs from C/C++ or Python code. Note that these APIs control profiling only—they do not pause or resume the execution of the application itself.

Figure 11.1 Pause-Resume Profiling Flow

When launching an application instrumented with Profile Control APIs, you must either use the CLI option --start-paused or keep the Enable start paused option ON in the GUI. Otherwise, the behavior is undefined.

To switch on the GUI Enable start paused option:

  1. Select the Profile Target page and click Next.

  2. Click Advanced Options.

  3. On the Advanced Options page, in the Profile Scheduling section, switch on the Enable start paused option.

There are two groups of APIs for pause/resume. APIs from different groups cannot be mixed within a single run of an application.

11.1.1. Difference between Group1 and Group2 APIs

As described earlier, Profile Control APIs are for profiling a small portion of interest within a typically long-running program.

For Group1 APIs, the profile-pause message from an amdProfilePause() call propagates through the application layers before it reaches the point where profiling actually stops. Meanwhile, the application continues to run with profiling on. Hence, some additional instructions after the pause call are also profiled.

Something similar happens with an amdProfileResume() call: the application runs a few more instructions without profiling after the call.

In general, this is not an issue, but it is undesirable when you are trying to profile short-running functions.

Group2 APIs, that is, amdProfileStrictPause() and amdProfileStrictResume(), remove this issue by pausing or resuming strictly at these calls. However, Group2 APIs do this at the cost of speed.

So, Group1 APIs are faster but slip a few instructions, while Group2 APIs are slower but do not slip instructions.

11.1.2. C/C++ API Descriptions

To use the C++ APIs, include the header AMDProfileController.h. This file is available in the include directory under the AMD uProf install path, for example: ~/Foo/AMDuProf_Linux_x64_5.x.x/include/AMDProfileController.h. Hence, the include folder ~/Foo/AMDuProf_Linux_x64_5.x.x/include/ must be in the include path while building the instrumented application.

For linking, you must include the library path of libAMDProfileController.so, whose path is: <Installation Directory>/lib/x64/shared/libAMDProfileController.so.

Hence, the C++ compilation command can be:

$ g++ -std=c++11 -g <sourcefile.cpp> -I <Installation Directory>/include -L<Installation Directory>/lib/x64/ -lAMDProfileController -lrt -pthread

For C code, the compilation command can be:

$ gcc -g <sourcefile.c> -I <Installation Directory>/include -L<Installation Directory>/lib/x64/ -lAMDProfileController -lrt -pthread

11.1.3. Static Library

The instrumented application should link with the AMDProfileController static library available in:

11.1.3.1. Windows

<AMDuProf-install-dir>\lib\x86\AMDProfileController.lib
<AMDuProf-install-dir>\lib\x64\AMDProfileController.lib

11.1.3.2. Linux

<AMDuProf-install-dir>/lib/x64/libAMDProfileController.a

11.1.4. C/C++ API Descriptions

11.1.4.1. Group1

There are two C/C++ based APIs from Group1:

Table 11.1 C/C++ API Descriptions - Group 1

| Sl. no. | API | Description |
|---|---|---|
| 1 | bool amdProfileResume(); | Resumes profiling that was paused, either by the command line option --start-paused or by a previous call to amdProfilePause() from the application. Returns true on success or false in case of failure. |
| 2 | bool amdProfilePause(); | Pauses profiling that is currently running. Returns true on success or false in case of failure. |

11.1.4.2. Group2

There are two C/C++ based APIs for Group2. For more information, refer to Difference between Group1 and Group2 APIs.

Table 11.2 C/C++ API Descriptions - Group 2

| Sl. no. | API | Description |
|---|---|---|
| 1 | bool amdProfileStrictResume(); | Resumes profiling that was paused, either by the command line option --start-paused or by a previous call to amdProfileStrictPause() from the application. Returns true on success or false in case of failure. |
| 2 | bool amdProfileStrictPause(); | Pauses profiling that is currently running. Returns true on success or false in case of failure. |

11.1.5. Python API Descriptions

Python Profile Control APIs are similar to the C/C++ APIs. The Pause/Resume APIs are supported for Python version 3.7 onwards.

To use these APIs, you must:

  1. Set the environment variable PYTHONPATH to the path of the module amd_instrument.py, located in: <uProf Install Directory>/lib/x64/python/

  2. Import the module amd_instrument in your source file, which defines the domain/task APIs.

Like C++, there are two python-based APIs from Group1:

Table 11.3 Python API Descriptions - Group 1

| Sl. no. | API | Description |
|---|---|---|
| 1 | amd_profile_resume() | Resumes profiling that was paused, either by the command line option --start-paused or by a previous call to amd_profile_pause() from the application. Returns true on success or false in case of failure. |
| 2 | amd_profile_pause() | Pauses profiling that is currently running. Returns true on success or false in case of failure. |

11.1.5.1. Group2

There are two Python based APIs for Group2. For more information, refer to Difference between Group1 and Group2 APIs.

Table 11.4 Python API Descriptions - Group 2

| Sl. no. | API | Description |
|---|---|---|
| 1 | amd_profile_strict_resume() | Resumes profiling that was paused, either by the command line option --start-paused or by a previous call to amd_profile_strict_pause() from the application. Returns true on success or false in case of failure. |
| 2 | amd_profile_strict_pause() | Pauses profiling that is currently running. Returns true on success or false in case of failure. |

These Pause/resume APIs can be called multiple times within the application.
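
As an illustration of calling the pause/resume APIs several times in one run, the following is a minimal Python sketch, not an official usage pattern. It assumes that the four functions listed in Tables 11.3 and 11.4 can be imported from amd_instrument. The helper profiled_region(), the flag USE_STRICT, and the function hot_loop() are hypothetical names introduced only for this example, and the fallback definitions are no-op stand-ins so that the sketch also runs where AMD uProf is not installed.

```python
from contextlib import contextmanager

try:
    # Assumption: amd_instrument exposes the four pause/resume functions.
    from amd_instrument import (
        amd_profile_pause,
        amd_profile_resume,
        amd_profile_strict_pause,
        amd_profile_strict_resume,
    )
except ImportError:
    # Hypothetical no-op stand-ins so the sketch runs without AMD uProf.
    def amd_profile_pause():
        return True

    def amd_profile_resume():
        return True

    amd_profile_strict_pause = amd_profile_pause
    amd_profile_strict_resume = amd_profile_resume

# Choose one group for the whole run; the two groups cannot be mixed.
USE_STRICT = False
_resume = amd_profile_strict_resume if USE_STRICT else amd_profile_resume
_pause = amd_profile_strict_pause if USE_STRICT else amd_profile_pause


@contextmanager
def profiled_region():
    """Resume profiling on entry and pause it again on exit."""
    resumed = _resume()
    try:
        yield resumed
    finally:
        _pause()


def hot_loop(n):
    """Stand-in for the code region of interest."""
    return sum(i * i for i in range(n))


# The region can be entered several times, but never nested.
with profiled_region():
    first = hot_loop(1000)

with profiled_region():
    second = hot_loop(10)
```

Selecting the group once at module level keeps every call in the run within the same group. Setting USE_STRICT to True would trade speed for strict region boundaries, as described in Difference between Group1 and Group2 APIs.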

Figure 11.2 Example Code1 - C++

The code above is explained by its comments. The functions f1(), f2(), and so on must be declared and defined properly.

11.1.5.2. Unmatched API Calls

Profile Control APIs can cause an unmatched amdTaskBegin() or amdTaskEnd() call, that is, a call for which the corresponding amdTaskEnd() or amdTaskBegin() call (respectively) is not found within the same thread in the profiled run and generated data. There can be multiple unmatched amdTaskBegin() or amdTaskEnd() calls in the generated profile data. One possible reason is incorrect use of the Pause/Resume APIs together with the Domain/Task APIs, for example, when one call of the pair falls in the paused state and the other in the resumed state. In those cases, the unmatched API calls are ignored.

Refer to Unmatched Task for additional details.

Note

  1. Profile Control APIs can be mixed with Domain/Task/Events APIs to get meaningful data.

  2. Pause/resume APIs can be spread across multiple functions or threads, that is, resume can be called from one function or thread and pause from another. In both cases, they affect profiling of the entire application.

  3. Nested Resume/Pause calls are not supported.

  4. The -static option should not be used while compiling with g++.

  5. The Profile control APIs are supported only for C/C++ and Python-based CPU applications but not supported for Fortran, MPI, OpenMP, Java, .NET applications, and GPU Profiling.

  6. Attach process is supported with Profile Control APIs. Refer to Example Steps to Attach Instrumented Process for additional details.

  7. AMDProfileControl APIs work only with AMDuProfCLI and GUI for application analysis. They do not work with:

    • Power Profiler

    • System analysis tools (uProfPcm and uProfSys)

11.2. Instrument APIs - Domain and Task

uProf provides time information at the module, process, thread, and function levels. All of these are levels defined by the programming language or the system. The Task APIs provide time information for a user-defined code block, which can contain a single instruction, multiple lines of instructions, and/or multiple function calls. A "Task" defines such a code block and is identified by a unique name. A domain, like a namespace, helps to organize tasks by qualifying them with domain names. In other words, a domain is a container of tasks.

11.2.1. Definition and Concepts

Domain: A domain defines a name and creates a namespace for tasks. Hence, each domain can contain zero or more tasks.

Task: A task is a sequence of synchronous/blocking code executed between the calls amdTaskBegin(…) and amdTaskEnd(…).

For a C++ based multi-process application, when a task item is present in multiple processes, the Hottest Task section shows the sum of the times consumed by all instances of that task across all processes. With the --detail option, the per-process sum for each task item is reported in the Task Summary section, which is per process.

Figure 11.4 Example Code2 - C++

Figure 11.5 Example Code2 - C++

This example shows how to mix Domain/Task APIs and Profile Control APIs.

Here we have four tasks: Task1, Task2, Task3 and Task4 and two domains: DomainABC and DomainABC1. Task1 is defined within paused state of application and hence will not be included in the profiled data.

Figure 11.6 Example Code2 - Python

Here, we assume that the Python functions f1() to f7() are defined. Because profiling starts with the --start-paused option, f1() and f2() are not profiled. After the call to amd_profile_resume(), f3() and f4() are profiled. The call to amd_profile_pause() then prevents f5() from being profiled. The second call to amd_profile_resume() turns profiling on again, so f6() and f7() are profiled.
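
The flow described above can be sketched in Python as follows. This is an illustrative sketch rather than the original example code: it assumes amd_profile_pause() and amd_profile_resume() can be imported from amd_instrument, and it falls back to hypothetical no-op stand-ins when the module is not available. The helper make_step() and the list calls are introduced only so that the placeholder functions f1() to f7() do something observable.

```python
try:
    # Assumption: both functions are exposed by the amd_instrument module.
    from amd_instrument import amd_profile_pause, amd_profile_resume
except ImportError:
    # Hypothetical no-op stand-ins so the sketch runs without AMD uProf.
    def amd_profile_pause():
        return True

    def amd_profile_resume():
        return True

calls = []  # records the order in which the placeholder functions ran


def make_step(name):
    """Build a placeholder function that only records its own name."""
    def step():
        calls.append(name)
    return step


f1, f2, f3, f4, f5, f6, f7 = (make_step(f"f{i}") for i in range(1, 8))


def main():
    # Launched with --start-paused, so profiling is off at this point.
    f1()
    f2()

    amd_profile_resume()  # profiling on
    f3()
    f4()

    amd_profile_pause()   # profiling off
    f5()

    amd_profile_resume()  # profiling on again
    f6()
    f7()


main()
```

All seven functions still execute; only the collection of profile samples is switched on and off around them.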

11.3. Instrument APIs - Event

Event APIs are similar to Task APIs, and events are listed under the task sections in the GUI/CLI, except that for events you do not need to specify a domain name. All events are listed under the domain User, which is reserved for events and must not be used as a user-defined domain for any task. One more difference from tasks is that an event start API call without a matching event end API call is considered a valid event of 0.0 duration, whereas an unmatched task begin call is ignored.

11.3.1. Definition and Concepts

Event: An event, in the ITT context, is a sequence of synchronous/blocking code executed between the calls amdEventStart(…) and amdEventEnd(…).

11.3.2. Overall workflow with small example

Suppose we want to find the hottest overheads for the initializations of n algorithms, where each algorithm has one or more initialization steps (function calls, computation statements, and so on). In the following figure, the left side shows the original code for algorithm1, algorithm2, …, algorithm n. The middle shows the code after instrumentation and how tasks (user-defined units of code) are generated. The right side shows the task-wise data displayed in the CLI/GUI.

Figure 11.7 Example Task Instrumentation and Output

11.3.3. How to Run and Get Task/Event detail

You start with source code in which some code snippets are to be defined as tasks. Then, by profiling with the proper configuration, you get the details of the defined tasks (time information and so on) in report.csv. Presently, this is supported in CLI mode profiling only; the GUI is not supported.

  1. Have buildable source code that you want to profile with the Task/Event APIs.

  2. Instrument the source code with the Task APIs:

    1. Include the header AMDProfileController.h in the source file where you want to create a task. The path of this header is:

      <Installation Directory>/include/AMDProfileController.h

      Example

      ~/Foo/AMDuProf_Linux_x64_5.x.x/include/AMDProfileController.h

      Hence, that folder must be in the include path while building the instrumented application.

    2. Create a domain and get the domain handle using amdDomainCreate(…).

    3. Create a string handle using amdStringHandleCreate(…), which is used to name a task.

    4. Using the domain handle and the string handle, create a task and mark its start. Both are done with a single call to amdTaskBegin(…), which is followed by the code that is part of the task.

    5. Call amdTaskEnd(…) to mark the end of the task.

    6. Add more tasks in a similar way, if required.

    7. Add events similarly using the Event APIs.

  3. For linking, include the library path of libAMDProfileController.so, whose path is: <Installation Directory>/lib/x64/shared/libAMDProfileController.so.

  4. Build the instrumented application.

  5. Run the application under the CLI profiler:

    $ AMDuProfCLI collect --config tbp -o <outputFolderPath> <applicationNameToBeProfiled>

  6. Generate the report:

    Summary report:

    $ AMDuProfCLI report -i <outputFolderPath>

    Detailed report:

    $ AMDuProfCLI report -i <outputFolderPath> --detail
    

Example code for different use cases is provided in the following directories:

11.3.3.1. Description

Currently, for the Domain and Task APIs, C/C++ and Python are supported for instrumenting code in the respective programming languages. For the Event APIs, only C++ support is available.

11.3.3.1.1. Using the Task APIs

C/C++ and Python APIs are described here for instrumenting codes.

11.3.3.2. C++ Task APIs

To use C++ APIs, include the header AMDProfileController.h. Path of this header is: <Installation Directory>/include/AMDProfileController.h. Example: ~/Foo/AMDuProf_Linux_x64_5.x.x/include/AMDProfileController.h. Hence, that folder must be in the include path, while building the instrumented application.

For linking, include the library path of libAMDProfileController.so, whose path is: <Installation Directory>/lib/x64/shared/libAMDProfileController.so.

There are four C++ based APIs for supporting domain and task.

Table 11.5 C++ Task APIs

| API and description | Parameters | Returns |
|---|---|---|
| unsigned long* amdStringHandleCreate(const char* name); Creates a handle for a string that is used as the name of the task. | const char* name: Name associated with the handle, which is used as the task name. | Handle as void* to the created string. |
| void amdTaskBegin(void* domainHandle, int taskId, int parentId, const long unsigned int* strHandleForTask); Marks the beginning of the task in this domain. | 1. void* domainHandle: a valid domain handle created by amdDomainCreate(...). 2. int taskId: currently, only 0 is to be passed. 3. int parentId: currently, only 0 is to be passed. 4. const long unsigned int* strHandleForTask: a valid string handle returned by amdStringHandleCreate(…). | Nothing |
| void amdTaskEnd(void* domainHandle); Marks the end of the task that was begun by the latest (last) runtime call of amdTaskBegin(…) with the same domain. | void* domainHandle: a valid domain handle created by amdDomainCreate(...). | Nothing |
| void* amdDomainCreate(const char* name); Creates a domain with the specified name and returns the handle as a void pointer. If called multiple times within the same scope, it returns the same handle. | const char* name: Name of the domain. It can contain spaces, periods, and so on, such as amd.com.uprof. | Handle as void* to the created domain. |

Example

void *execDomain = amdDomainCreate ("Main.ExecuteBenchmark");
void *mulStr = amdStringHandleCreate ("MultiplyMatrices");
// some activities, not within task
int matrixSize =1000;
initialize_matrices(matrixSize);

amdTaskBegin (execDomain, 0, 0, mulStr);
// some activities within task
multiply_matrices(1000);
printOutout();
amdTaskEnd (execDomain);

// some activities, not within task
CleanUp();

11.3.3.3. Python APIs

To use these Python APIs:

  1. Set environment variable PYTHONPATH to the path of the module amd_instrument.py, located in: <uProf Install Directory>/lib/x64/python/.

  2. Import the module amd_instrument, which defines the domain/task APIs.

  3. There are two groups of APIs for supporting domain and task: Python-style APIs (group 1) and C++-style APIs (group 2). APIs from one group cannot be mixed with APIs from the other group.

Python group 1 contains two APIs.

Table 11.6 Python APIs - Group 1

| API and description | Parameters | Returns |
|---|---|---|
| amd_start_task(domain_task_name): Creates the domain and the task, and begins the task. | domain_task_name: a ::-separated domain and task combination string in the format <domain>::<task>. | Nothing |
| amd_end_task(domain): Marks the end of the last (latest) task in the specified domain. | domain: the domain in which the last task is to be marked as ended. | Nothing |

Example

import amd_instrument as a
#... other code e.g. function calls and/or statements
a.amd_start_task("DomainABC::TaskPQR")

# ... blocking function calls and/or statements, considered within this task
# or non-blocking function calls and/or statements, not considered within tasks
a.amd_end_task("DomainABC")

Python group 2 contains three APIs.

Table 11.7 Python APIs - Group 2

| API and description | Parameters | Returns |
|---|---|---|
| domain_create(domain_name : str): Takes a domain name string and returns a domain, which is used in the task_begin(…) API. | domain_name: the domain name, as a string, under which the tasks are considered. | The domain that was created. |
| task_begin(domain, name): Takes a domain created and returned by the domain_create(…) API, and the task name. It defines the starting point of the task. | domain: the domain returned from the domain_create(…) call. name: the task name as a string. | Nothing |
| task_end(domain): Takes the domain created and returned by the domain_create(…) API and marks the end of the task that was last created within the same domain. | domain: the domain returned from the domain_create(…) call. | Nothing |

Example

import amd_instrument as a
#... other code e.g. function calls and/or statements
d = a.domain_create("DomainABC")
a.task_begin(d, "TaskPQR")

# ... blocking function calls and/or statements, considered within this task
# or non-blocking function calls and/or statements, not considered within tasks
a.task_end(d)

11.3.3.3.1. Using the Event APIs

C/C++ Event APIs are described here for instrumenting code.

To use the C++ APIs, include the header AMDProfileController.h. The path of this header is: <Installation Directory>/include/AMDProfileController.h. Example: ~/Foo/AMDuProf_Linux_x64_5.x.x/include/AMDProfileController.h. Hence, that folder must be in the include path while building the instrumented application.

For linking, include the library path of libAMDProfileController.so, whose path is: <Installation Directory>/lib/x64/shared/libAMDProfileController.so. There are three C++ based APIs for supporting events.

Table 11.8 C++ Event APIs

| API and description | Parameters | Returns |
|---|---|---|
| void* amdEventCreate(char* name); Creates and returns the event handle. As its argument, it takes the event name, which is a null-terminated string. If the length is bigger than the length of the null-terminated string, the name is considered to be delimited by the null character. | const char* name: Name of the event; it is associated with the returned handle. | Handle as void* to the created event. |
| int amdEventStart(void* eventHandle); Marks the start of the event. When there is no matching end call at runtime, this is considered an instantaneous event of 0.00 duration and is listed. This behavior is different from tasks, for which unmatched begin calls are ignored. Unmatched end calls are always ignored for both events and tasks. | void* eventHandle: a valid event handle created by amdEventCreate(...). | 0 on success. |
| int amdEventEnd(void* eventHandle); Marks the end of the event. As described for the API above, this call can be skipped for some events. Unmatched end calls are always ignored for both events and tasks. | void* eventHandle: a valid event handle created by amdEventCreate(...). | 0 on success. |

Example

void *execEvent = amdEventCreate ("MyNewEvent");
// some activities, not within the event
int matrixSize =2048;
initialize_matrices(matrixSize);

amdEventStart (execEvent);
// some activities within events
LU_decompose_matrix(matrixSize);
printResults();
amdEventEnd (execEvent);

// some activities, not within task
CleanUp();

11.3.4. Compiling Instrumented Target Application

Complete the following procedure to compile an instrumented target application:

  1. Include the header AMDProfileController.h in the source file where you want to create a task. The path of this header is:

    <Installation Directory>/include/AMDProfileController.h
    

    Example

    ~/Foo/AMDuProf_Linux_x64_5.x.x/include/AMDProfileController.h
    

    Hence that folder must be in the include path, while building the instrumented application.

  2. For linking, the library user must include the library path of libAMDProfileController.so, whose path is: <Installation Directory>/lib/x64/shared/libAMDProfileController.so.

  3. Build the instrumented application, based on the present compiler option(s) and commands.

11.3.5. Profiling Instrumented Target Application

Complete the following procedure to profile an instrumented target application:

  1. Run the application under the CLI profiler: AMDuProfCLI collect --config tbp -o <outputFolderPath> <applicationNameToBeProfiled>.

    The raw output is generated in the directory <Session Directory>/instrument.

  2. To generate:

    Summary report:

    AMDuProfCLI report -i <outputFolderPath>

    Detailed report:

    AMDuProfCLI report -i <outputFolderPath> --detail
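
When the collect and report steps are scripted, the command lines can be assembled programmatically. The following Python sketch only builds the argument lists from the options shown in this section; it does not run the profiler. The helper names collect_command() and report_command() are hypothetical, and the sketch assumes that AMDuProfCLI is reachable through PATH and that --start-paused is passed to the collect command.

```python
import shlex


def collect_command(output_dir, app, start_paused=False):
    """Argument list for a time-based profile (tbp) collection."""
    cmd = ["AMDuProfCLI", "collect", "--config", "tbp"]
    if start_paused:
        # Needed when the application uses the Profile Control APIs.
        cmd.append("--start-paused")
    cmd += ["-o", output_dir, app]
    return cmd


def report_command(output_dir, detail=False):
    """Argument list for the summary or detailed report."""
    cmd = ["AMDuProfCLI", "report", "-i", output_dir]
    if detail:
        cmd.append("--detail")
    return cmd


def as_shell(cmd):
    """Render an argument list as a single shell-quoted string."""
    return " ".join(shlex.quote(part) for part in cmd)


# Hypothetical output folder and application name, for illustration only.
print(as_shell(collect_command("./out", "./my_app", start_paused=True)))
print(as_shell(report_command("./out", detail=True)))
```

The returned lists can be passed directly to subprocess.run() on a system where AMD uProf is installed.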
    

11.3.6. Disabling Instrumentation APIs

In some cases, you may need a quick way to disable the Instrumentation APIs in the code, and manually commenting out each occurrence may be tedious. You can disable the Instrumentation APIs quickly using one of the following options:

  1. Compile-time disabling using the macro AMDUPROF_DISABLE_INSTRUMENT_API: Define the macro in a common header included by all the files that contain calls to the Instrumentation APIs.

  2. Runtime disabling using an environment variable: Setting the environment variable AMDUPROF_INSTRUMENT_ENABLE to 0 disables the effect of the Instrumentation APIs.
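
For runtime disabling, the environment variable has to be set in the environment of the instrumented process before it starts. The following Python sketch shows one way a launcher script might do that. The helper run_with_instrumentation() is a hypothetical name, and the child process used here is only a stand-in that prints the value it received, not an instrumented application.

```python
import os
import subprocess
import sys


def run_with_instrumentation(command, enabled):
    """Run a command with AMDUPROF_INSTRUMENT_ENABLE set to 1 or 0."""
    env = dict(os.environ)
    env["AMDUPROF_INSTRUMENT_ENABLE"] = "1" if enabled else "0"
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )


# Stand-in child process: it only echoes the variable it was given.
child = [
    sys.executable,
    "-c",
    "import os; print(os.environ['AMDUPROF_INSTRUMENT_ENABLE'])",
]
result = run_with_instrumentation(child, enabled=False)
print(result.stdout.strip())
```

Copying os.environ and overriding a single key leaves the rest of the environment, such as PYTHONPATH, unchanged for the child process.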

11.3.7. Example Steps to Attach Instrumented Process

  1. Run the application in one terminal, or as a daemon, with the environment variable AMDUPROF_INSTRUMENT_ENABLE=1 set.

    Example

    $ AMDUPROF_INSTRUMENT_ENABLE=1 python3 /TestProfileControlApi.py

  2. Get the PID of the running application using a command in another terminal.

    Example

    $ ps -u

  3. Launch the CLI and attach it to that process.

    Example

    $ <path>/AMDuProfCLI --config hotspots --timer-interval 10 -o ./ -p <pidOfTheTargetProcess>

    You can use the -d option to set the duration for which profiling should be allowed.

  4. Finally, generate the report as usual. The generated report.csv shows the top 10 (or the number selected by the provided option) hottest tasks. If the same process (say, a long-running process) is attached multiple times, multiple session files are created. Each displays the corresponding data in its report.csv file.

11.3.8. Unmatched Task/Event

11.3.8.1. Definition

Unmatched Task: An unmatched amdTaskBegin() or amdTaskEnd() is a call for which the corresponding amdTaskEnd() or amdTaskBegin() call (respectively) is not found in the generated profile data. There can be multiple unmatched amdTaskBegin() or amdTaskEnd() calls in the generated profile data.

Note that, by definition, an unmatched amdTaskBegin() or amdTaskEnd() means that no match is found in the same domain and the same thread.

Unmatched Event: An unmatched amdEventStart() or amdEventEnd() is a call for which the corresponding amdEventEnd() or amdEventStart() call (respectively) is not found in the generated profile data. There can be multiple unmatched amdEventStart() or amdEventEnd() calls in the generated profile data.

11.3.8.2. Reason for Unmatched Task/Event

The following are a few possible cases that produce an unmatched amdTaskBegin(), amdEventStart(), amdTaskEnd(), or amdEventEnd() in the generated data (that is, the number of amdTaskBegin() or amdEventStart() entries differs from the number of amdTaskEnd() or amdEventEnd() entries in the data).

  1. Case 1 - Usage of start-paused: one call of the pair was made in the paused state and the other in the resumed state.

  2. Case 2 - Attaching to the same process multiple times: one call of the pair was made in the not-attached state and the other in the attached state.

  3. Case 3 - Coding error while instrumenting: forgetting to write a balanced amdTaskBegin()/amdTaskEnd() or amdEventStart()/amdEventEnd() pair.

  4. Case 4 - Runtime condition.

    Example: an unforeseen jump or exception handling left one of the amdTaskBegin()/amdEventStart() or amdTaskEnd()/amdEventEnd() calls unreached.
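
For Python code instrumented with the group 2 task APIs, Cases 3 and 4 can often be avoided by pairing the begin and end calls structurally. The following is a minimal sketch of that idea, assuming domain_create(), task_begin(), and task_end() can be imported from amd_instrument as listed in Table 11.7. The context manager task() and the list events are hypothetical helpers for this example, and the fallback definitions are no-op stand-ins so that the sketch runs without AMD uProf.

```python
from contextlib import contextmanager

try:
    # Assumption: the group 2 task functions live in amd_instrument.
    from amd_instrument import domain_create, task_begin, task_end
except ImportError:
    # Hypothetical stand-ins so the sketch runs without AMD uProf.
    def domain_create(domain_name):
        return domain_name

    def task_begin(domain, name):
        return None

    def task_end(domain):
        return None

events = []  # local trace, used only to show that the calls stay balanced


@contextmanager
def task(domain, name):
    """Begin a task and guarantee the matching end call."""
    task_begin(domain, name)
    events.append(("begin", name))
    try:
        yield
    finally:
        # Runs on normal exit, early return, and exceptions alike.
        task_end(domain)
        events.append(("end", name))


domain = domain_create("DomainABC")
try:
    with task(domain, "TaskPQR"):
        raise ValueError("simulated failure inside the task")
except ValueError:
    pass
```

Because the end call sits in a finally clause, an exception raised inside the task body does not leave the begin call unmatched. This does not help with Cases 1 and 2, which depend on when profiling is paused or attached.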

11.3.8.3. Handling the Unmatched Task

Any unmatched task is completely ignored, that is, not considered or listed in the reported tasks.

For events, when there is no matching end call at runtime, the start call is considered an instantaneous event of 0.00 duration and is listed. This behavior is different from tasks, for which unmatched begin calls are ignored. Unmatched end calls are always ignored for both events and tasks.

11.3.9. Depth Within Domain

11.3.9.1. Definition

When tasks/events from the same thread and the same domain are nested (all events are always from the same domain, namely User), the level of nesting is described by Depth Within Domain. The following pseudocode of a small example illustrates the concept.

Figure 11.8 Depth Within Domain

Explanations

  1. .... indicates user activities like function call or some code.

  2. There are two domains here: Domain1111 and Domain2222. Domain1111 contains TaskA and TaskC, while Domain2222 contains TaskB, TaskD, and TaskE.

  3. Since the nesting is per domain, the Depth Within Domain values for the above tasks are:

    Table 11.9 Depth Within Domain

    | Task | Depth Within Domain | Enclosing Task within same Domain |
    |---|---|---|
    | TaskA | 0 | Nothing |
    | TaskB | 0 | Nothing |
    | TaskC | 1 | TaskA |
    | TaskD | 1 | TaskB |
    | TaskE | 2 | TaskD |

For events, there is no need to consider the domain, because all events are from the same domain, namely User. Hence, within the same thread, an event's Depth Within Domain increases as its nesting level increases.
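
The per-domain nesting rule can be sketched with one stack per domain. The following Python example is illustrative only and is not how uProf computes the value internally. The function depth_within_domain() is a hypothetical helper, and the event sequence is an assumed nesting for a single thread that is consistent with Table 11.9.

```python
from collections import defaultdict


def depth_within_domain(events):
    """Map each task to (depth, enclosing task) for one thread.

    events is a sequence of (action, domain, task) tuples, where action
    is "begin" or "end". An end closes the latest open task of its domain.
    """
    stacks = defaultdict(list)  # one stack of open tasks per domain
    result = {}
    for action, domain, task in events:
        stack = stacks[domain]
        if action == "begin":
            enclosing = stack[-1] if stack else None
            result[task] = (len(stack), enclosing)
            stack.append(task)
        elif action == "end" and stack:
            stack.pop()  # an end with no open task is simply ignored
    return result


# Assumed nesting, chosen to match Table 11.9.
events = [
    ("begin", "Domain1111", "TaskA"),
    ("begin", "Domain2222", "TaskB"),
    ("begin", "Domain1111", "TaskC"),
    ("begin", "Domain2222", "TaskD"),
    ("begin", "Domain2222", "TaskE"),
    ("end", "Domain2222", "TaskE"),
    ("end", "Domain2222", "TaskD"),
    ("end", "Domain1111", "TaskC"),
    ("end", "Domain2222", "TaskB"),
    ("end", "Domain1111", "TaskA"),
]
depths = depth_within_domain(events)
for name in sorted(depths):
    print(name, depths[name])
```

TaskB is textually inside TaskA in this sequence, yet its depth stays 0 because the two tasks belong to different domains. For events, the same function applies with every entry placed in the single domain User.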

Note

Depth Within Domain is not directly printed in CLI or GUI in uProf 5.2. But it is used in Flame graph.

11.3.10. Output

The report command generates a list of task hotspots in the Hotspot section of the <Session Directory>/report.csv. For example: AMDuProf-ClassicDomainTaskAPI-TBP_Sep-02-2024_16-50-28/report.csv.

The corresponding task section would look similar to this:

"10 HOTTEST TASKS (Sort Event - CPU_TIME)"
DOMAIN,TASK,TASK COUNT,ELAPSED TIME(seconds),"CPU_TIME" (seconds)
"Task0", "BASIC_TASK_MATRIX", 1, 0.0537,0.0500
"User", "Event18", 1, 0.0352,0.0400
"User", "Event14", 1, 0.0352,0.0400
"User", "Event4", 1, 0.0352,0.0400
"User", "Event10", 1, 0.0352,0.0400
"User", "Event0", 1, 0.0385,0.0400
"User", "Event8", 1, 0.0353,0.0400
"User", "Event16", 1, 0.0352,0.0300
"Task14", "BASIC_TASK_MATRIX", 1, 0.0274,0.0300
"Task8", "BASIC_TASK_MATRIX", 1, 0.0274,0.0300

With the --detail option (AMDuProfCLI report -i <Session Directory> --detail), tasks are grouped by process (grouping by modules/threads is not supported now) and appear as Task Summary in the per-process section after the Hotspot section.

TASK SUMMARY
DOMAIN,TASK,TASK COUNT,ELAPSED TIME(seconds),"CPU_TIME" (seconds)
"Task0","BASIC_TASK_MATRIX",1,0.0537,0.0500
"User","Event18",1,0.0352,0.0400
"User","Event14",1,0.0352,0.0400
"User","Event4",1,0.0352,0.0400
"User","Event10",1,0.0352,0.0400
"User","Event0",1,0.0385,0.0400
"User","Event8",1,0.0353,0.0400
"User","Event16",1,0.0352,0.0300
"Task14","BASIC_TASK_MATRIX",1,0.0274,0.0300
"User","Event12",1,0.0352,0.0300
"Task2","BASIC_TASK_MATRIX",1,0.0274,0.0300
"Task12","BASIC_TASK_MATRIX",1,0.0274,0.0300
"User","Event6",1,0.0352,0.0300
"User","Event2",1,0.0352,0.0300
"Task10","BASIC_TASK_MATRIX",1,0.0274,0.0300
"Task6","BASIC_TASK_MATRIX",1,0.0274,0.0300
"Task16","BASIC_TASK_MATRIX",1,0.0274,0.0300
"Task18","BASIC_TASK_MATRIX",1,0.0274,0.0200
"Task4","BASIC_TASK_MATRIX",1,0.0274,0.0200
"Task8","BASIC_TASK_MATRIX",1,0.0274,0.0000

Note that there are four constant columns: DOMAIN, TASK, TASK COUNT, and ELAPSED TIME(seconds).

Remaining columns are based on collect command configuration and view.
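
For post-processing, the task rows of report.csv can be read with a standard CSV parser. The following Python sketch parses a few rows copied from the samples above. The function parse_task_rows() and the dictionary keys are hypothetical names, and the sketch assumes the five-column layout shown here, in which the last column is CPU_TIME; with a different collect configuration the trailing columns may differ.

```python
import csv
import io

# A few lines copied from the sample output above.
SAMPLE = '''TASK SUMMARY
DOMAIN,TASK,TASK COUNT,ELAPSED TIME(seconds),"CPU_TIME" (seconds)
"Task0","BASIC_TASK_MATRIX",1,0.0537,0.0500
"User","Event18",1,0.0352,0.0400
"User", "Event0", 1, 0.0385,0.0400
"Task8","BASIC_TASK_MATRIX",1,0.0274,0.0000
'''


def parse_task_rows(text):
    """Return the task rows as dictionaries, skipping titles and headers."""
    rows = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for record in reader:
        if len(record) != 5:
            continue  # section title or unrelated line
        domain, task, count, elapsed, cpu_time = record
        try:
            rows.append({
                "domain": domain,
                "task": task,
                "count": int(count),
                "elapsed_s": float(elapsed),
                "cpu_time_s": float(cpu_time),
            })
        except ValueError:
            continue  # header line: the count column is not a number
    return rows


rows = parse_task_rows(SAMPLE)
hottest = max(rows, key=lambda row: row["elapsed_s"])
print(hottest["domain"], hottest["task"], hottest["elapsed_s"])
```

The skipinitialspace option lets the same function read both the Hotspot section, where fields are followed by a space, and the Task Summary section, where they are not.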

11.3.10.1. GUI Task Hotspot Summary

The following figure shows the static Task Hotspot Summary table on the Summary page, along with the time or other column used for sorting the table data.

Figure 11.9 GUI Task Hotspot Summary

11.3.11. Limitations

Here is a list of limitations:

11.4. OneAPI support in AMDuProf

AMDuProf enables profiling of applications instrumented with OneAPI, allowing developers to analyze performance metrics and execution timelines effectively.

11.4.1. Overview

OneAPI offers a unified programming model for heterogeneous computing environments and includes support for Instrumentation and Tracing Technology (ITT) APIs to enhance performance analysis.

These features integrate seamlessly with AMDuProf Profiler, allowing visualization of user-defined tasks and correlation with CPU time in profiling reports. This helps developers identify bottlenecks and optimize performance across heterogeneous workloads.

11.4.2. Supported APIs

The following APIs from the OneAPI are currently supported for profiling using uProf.

  1. Domain/Task APIs

    __itt_domain* __itt_domain_create(const char* name);
    __itt_string_handle* __itt_string_handle_create(const char* name);
    void __itt_task_begin(const __itt_domain* domain, __itt_id taskid, __itt_id parentid, __itt_string_handle* name);
    void __itt_task_end(const __itt_domain* domain);
    
  2. Event APIs

    __itt_event __itt_event_create(const char *name, int namelen);
    int __itt_event_start(__itt_event event);
    int __itt_event_end(__itt_event event);
    
  3. Collection Control APIs

    void __itt_pause(void);
    void __itt_resume(void);
    

11.4.3. How to Run

To profile application tasks using OneAPI ITT APIs and visualize them in AMDuProf reports:

  1. Prepare Your Source Code

    Ensure you have a buildable source code that will be profiled using supported OneAPI ITT APIs.

  2. Instrument the Source Code

    Add ITT instrumentation to define tasks:

    1. Include the ITT header file.

      Add ittnotify.h in the source file where tasks will be created.

      Header path:

      <Installation Directory>/sdk/include/ittnotify.h
      

      Make sure this folder is in your include path during compilation.

      Example:

      /opt/intel/oneapi/vtune/2025.4/sdk/include/ittnotify.h
      
    2. Create a domain

      Use __itt_domain_create(...) to create a domain and obtain its handle.

    3. Create a string handle

      Use __itt_string_handle_create(...) to name the task.

    4. Define and start a task

      Call __itt_task_begin(...) with the domain and string handle, then execute the code for that task.

    5. End the task

      Call __itt_task_end(...) after the task code completes.

    6. You can define multiple tasks by repeating the above steps.

  3. Link the ITT Library

    Include the path to libittnotify.a during linking:

    <Installation Directory>/sdk/lib64/libittnotify.a

  4. Build the Instrumented Application

    Compile and link the application with the ITT instrumentation.

  5. Set Environment Variable

    Point INTEL_LIBITTNOTIFY64 to AMD’s ITT implementation library:

    export INTEL_LIBITTNOTIFY64=<Installation Directory>/bin/libAMDInstrumentationDataCollector.so
    
  6. Run the application under CLI profiler

    Collect profiling data:

    $ AMDuProfCLI collect --config tbp -o <outputFolderPath> <applicationNameToBeProfiled>
    
  7. Generate the reports

    Generate a summary report or a detailed report:

    1. Summary Report

      $ AMDuProfCLI report -i <outputFolderPath>
      
    2. Detailed Report

      $ AMDuProfCLI report -i <outputFolderPath> --detail
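The instrumentation sub-steps in step 2 pair every __itt_task_begin(...) with an __itt_task_end(...). One way to keep the pair balanced, even on early returns or exceptions, is a small scope guard. The sketch below is not part of AMDuProf or the ITT API: ScopedTask and RecordingBackend are hypothetical names, and the backend is a stand-in that records calls so that the example compiles without ittnotify.h. In an instrumented build, begin() and end() would forward to __itt_task_begin(...) and __itt_task_end(...).

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in backend: records calls instead of calling ITT.
// In an instrumented build, begin() would call
// __itt_task_begin(domain, __itt_null, __itt_null, handle) and end() would
// call __itt_task_end(domain).
struct RecordingBackend {
    std::vector<std::string> log;
    void begin(const std::string& task) { log.push_back("begin:" + task); }
    void end(const std::string& task) { log.push_back("end:" + task); }
};

// Hypothetical scope guard: begins a task on construction and ends it on
// destruction, so each begin is matched by exactly one end.
template <typename Backend>
class ScopedTask {
public:
    ScopedTask(Backend& backend, std::string task)
        : backend_(backend), task_(std::move(task)) {
        backend_.begin(task_);
    }
    ~ScopedTask() { backend_.end(task_); }

    // Non-copyable: a copy would end the same task twice.
    ScopedTask(const ScopedTask&) = delete;
    ScopedTask& operator=(const ScopedTask&) = delete;

private:
    Backend& backend_;
    std::string task_;
};

// Example: two consecutive tasks; each one ends when its scope closes.
inline std::vector<std::string> run_two_tasks() {
    RecordingBackend backend;
    {
        ScopedTask<RecordingBackend> task(backend, "Task0");
        // ... code for Task0 ...
    }
    {
        ScopedTask<RecordingBackend> task(backend, "Task1");
        // ... code for Task1 ...
    }
    return backend.log;
}
```

The guard is templated on the backend only so that the pairing logic can be exercised without the ITT library; a real backend would hold the domain and string handle created in sub-steps 2 and 3.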
      

11.4.4. Compiling an Instrumented Target Application

Follow these steps to compile a OneAPI-instrumented application:

  1. Include the ITT Header

    Add the header file ittnotify.h in the source file where you want to create tasks.

    Header path

    <Installation Directory>/sdk/include/ittnotify.h
    

    Example

    /opt/intel/oneapi/vtune/2025.4/sdk/include/ittnotify.h
    

    Ensure this folder is in your include path during compilation.

  2. Link the ITT Library

    Include the path to libittnotify.a during linking:

    <Installation Directory>/sdk/lib64/libittnotify.a
    

    Example

    /opt/intel/oneapi/vtune/2025.4/sdk/lib64/libittnotify.a
    
  3. Build the Instrumented Application

    Compile and link the application using your existing compiler options and commands.

11.4.5. Profiling Instrumented Target Application

Follow these steps to profile an application instrumented with OneAPI ITT APIs:

  1. Set the Environment Variable

    Point INTEL_LIBITTNOTIFY64 to AMD’s ITT implementation library:

    export INTEL_LIBITTNOTIFY64=<Installation Directory>/bin/libAMDInstrumentationDataCollector.so
    
  2. Run the Application Under AMDuProf CLI

    Collect profiling data using the Time-Based Profiling (tbp) configuration:

    $ AMDuProfCLI collect --config tbp -o <outputFolderPath> <applicationNameToBeProfiled>
    

    The raw output is stored in the <Session Directory>/instrument directory.

  3. Generate Reports:

    1. Summary report:

      $ AMDuProfCLI report -i <outputFolderPath>
      
    2. Detailed report:

      $ AMDuProfCLI report -i <outputFolderPath> --detail
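Because the collector library is located through INTEL_LIBITTNOTIFY64 (step 1), a missing or mistyped path is easy to overlook. As an optional sanity check, a launcher or the application itself could inspect the variable before the run. The helper below is only a sketch and not an AMDuProf facility; the function names and the returned status strings are made up for illustration.

```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// Hypothetical helper: classifies a candidate value of INTEL_LIBITTNOTIFY64.
// Pass nullptr to model an unset variable. The check only confirms that the
// file can be opened for reading; it does not verify that the file is a
// valid collector library.
inline std::string describe_collector_path(const char* value) {
    if (value == nullptr || *value == '\0') {
        return "unset";
    }
    std::ifstream file(value, std::ios::binary);
    if (!file.is_open()) {
        return "unreadable";
    }
    return "ok";
}

// Reads the real environment variable and classifies it.
inline std::string check_collector_env() {
    return describe_collector_path(std::getenv("INTEL_LIBITTNOTIFY64"));
}
```

A caller might print a warning when check_collector_env() returns anything other than "ok" before starting the instrumented workload.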
      

11.4.6. Example Steps to Attach Instrumented Process

  1. Set the Environment Variable

    Point INTEL_LIBITTNOTIFY64 to AMD’s ITT implementation library:

    export INTEL_LIBITTNOTIFY64=<Installation Directory>/bin/libAMDInstrumentationDataCollector.so
    
  2. Start the Application

    Run the application in one terminal (or as a daemon) with the environment variable AMDUPROF_INSTRUMENT_ENABLE set to 1:

    Example:

    $ AMDUPROF_INSTRUMENT_ENABLE=1 ./TestDomainTaskOneAPI
    
  3. Get the process id

    In another terminal, find the PID of the running application:

    Example

    $ ps -u
    
  4. Attach AMDuProf CLI

    Launch AMDuProfCLI and attach it to that process.

    Example

    $ <path>/AMDuProfCLI collect --config hotspots --timer-interval 10 -o ./ -p <pidOfTheTargetProcess>
    

    You can use the -d option to limit how long profiling runs.

  5. Generate the Report

    Run the report command as usual. The generated report.csv includes the top 10 (or as configured) hottest tasks.

11.4.7. Output

The report command generates a list of task hotspots in the Hotspot section of the <Session Directory>/report.csv.

For example: AMDuProf-ITT_OneAPI-TBP_Nov-05-2025_09-11-20/report.csv

The corresponding task section would look similar to this:

"10 HOTTEST TASKS (Sort Event - CPU_TIME)"
DOMAIN,TASK,TASK COUNT,ELAPSED TIME(seconds),"CPU_TIME" (seconds)
"DomainAAA", "Task0", 1, 9.5736,8.7910
"DomainBBB", "Task19", 1, 8.3636,8.3640
"DomainAAA", "Task18", 1, 8.2230,8.2240
"DomainBBB", "Task17", 1, 6.1693,6.1700
"DomainAAA", "Task16", 1, 4.1271,4.1280
"DomainBBB", "Task15", 1, 3.4609,3.4620
"DomainAAA", "Task14", 1, 2.9403,2.9420
"DomainBBB", "Task13", 1, 2.5704,2.5710
"DomainAAA", "Task12", 1, 2.3202,2.3220
"DomainBBB", "Task11", 1, 2.1727,2.1730

With the --detail option (AMDuProfCLI report -i <Session Directory> --detail), tasks are grouped by process and thread (grouping by module is not supported). They appear as a Task Summary in the per-process and per-thread sections that follow the Hotspot section.

TASK SUMMARY
DOMAIN,TASK,TASK COUNT,ELAPSED TIME(seconds),"CPU_TIME" (seconds)
"DomainAAA","Task0",1,9.5736,8.7910
"DomainBBB","Task19",1,8.3636,8.3640
"DomainAAA","Task18",1,8.2230,8.2240
"DomainBBB","Task17",1,6.1693,6.1700
"DomainAAA","Task16",1,4.1271,4.1280
"DomainBBB","Task15",1,3.4609,3.4620
"DomainAAA","Task14",1,2.9403,2.9420
"DomainBBB","Task13",1,2.5704,2.5710
"DomainAAA","Task12",1,2.3202,2.3220
"DomainBBB","Task11",1,2.1727,2.1730
"DomainAAA","Task10",1,2.0951,2.0960
"DomainBBB","Task9",1,2.0675,2.0690
"DomainAAA","Task8",1,2.0634,2.0640
"DomainBBB","Task7",1,1.3972,1.3990
"DomainAAA","Task6",1,0.8767,0.8770
"DomainBBB","Task5",1,0.5070,0.5080
"DomainAAA","Task4",1,0.2568,0.2580
"DomainBBB","Task3",1,0.1093,0.1110
"DomainAAA","Task2",1,0.0317,0.0320
"DomainBBB","Task1",1,0.0039,0.0050
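Rows in the hottest-tasks and Task Summary sections share the same five columns, so they can be post-processed with a small parser. The sketch below is an illustration only: TaskRow, trim_field, and parse_task_row are hypothetical names, and the parser assumes that domain and task names contain no commas. It tolerates the optional space after each comma seen in the hottest-tasks rows.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One data row: DOMAIN, TASK, TASK COUNT, ELAPSED TIME, CPU_TIME.
struct TaskRow {
    std::string domain;
    std::string task;
    int count = 0;
    double elapsed_seconds = 0.0;
    double cpu_seconds = 0.0;
};

// Strips surrounding whitespace and one pair of double quotes, if present.
inline std::string trim_field(const std::string& field) {
    const std::size_t first = field.find_first_not_of(" \t\r");
    if (first == std::string::npos) {
        return "";
    }
    const std::size_t last = field.find_last_not_of(" \t\r");
    std::string out = field.substr(first, last - first + 1);
    if (out.size() >= 2 && out.front() == '"' && out.back() == '"') {
        out = out.substr(1, out.size() - 2);
    }
    return out;
}

// Parses one line. Returns false (leaving `row` untouched) for titles,
// header lines, and anything that is not five fields with numeric values.
inline bool parse_task_row(const std::string& line, TaskRow& row) {
    std::vector<std::string> fields;
    std::stringstream stream(line);
    std::string field;
    while (std::getline(stream, field, ',')) {
        fields.push_back(trim_field(field));
    }
    if (fields.size() != 5) {
        return false;
    }
    TaskRow parsed;
    try {
        parsed.domain = fields[0];
        parsed.task = fields[1];
        parsed.count = std::stoi(fields[2]);
        parsed.elapsed_seconds = std::stod(fields[3]);
        parsed.cpu_seconds = std::stod(fields[4]);
    } catch (const std::exception&) {
        return false;  // non-numeric field, for example the header line
    }
    row = parsed;
    return true;
}
```

Feeding every line of report.csv through parse_task_row and keeping the rows for which it returns true is enough to collect the task entries; lines from unrelated sections with five numeric-looking fields would need additional filtering by section title.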

11.4.7.1. GUI Task Hotspot Summary

The following figure shows the static Task Hotspot Summary table on the Summary page, together with the time (or other) column used to sort the table.

Figure 11.10 GUI Task Hotspot Summary

11.4.8. APIs Usage Example

  1. Domain/Task APIs:

    /* Domain API usage example with classic_multiply_matrices() to simulate work. */
    
    void Test_DomainTask_API()
    {
        std::cout << "[START] Test_DomainTask_API\n";
    
        /* Create Odd and Even domains to distribute tasks */
        __itt_domain* domainOdd = __itt_domain_create("BASIC_DOMAIN_ODD");
        __itt_domain* domainEven = __itt_domain_create("BASIC_DOMAIN_EVEN");
    
        /* Create domain & task/string handle for aggregation task */
        __itt_domain* domainAgg = __itt_domain_create("BASIC_DOMAIN_AGG");
        __itt_string_handle* taskAgg = __itt_string_handle_create("BASIC_TASK_MATRIX_AGG");
    
        /* First call of aggregation Task APIs */
        __itt_task_begin(domainAgg, __itt_null, __itt_null, taskAgg);
        classic_multiply_matrices(1);
        __itt_task_end(domainAgg);
    
        /* Logic to test Odd-Even association of tasks */
        for (int i = 0; i < 4; i++)
        {
            std::string task_name = "BASIC_TASK_MATRIX_" + std::to_string(i);
            __itt_string_handle* task = __itt_string_handle_create(task_name.c_str());
    
            if( i % 2 == 0)
                __itt_task_begin(domainEven, __itt_null, __itt_null, task);
            else
                __itt_task_begin(domainOdd, __itt_null, __itt_null, task);
    
            classic_multiply_matrices(i+1);
    
            if( i % 2 == 0)
                __itt_task_end(domainEven);
            else
                __itt_task_end(domainOdd);
        }
    
        /* Second call of aggregation Task APIs */
        __itt_task_begin(domainAgg, __itt_null, __itt_null, taskAgg);
        classic_multiply_matrices(2);
        __itt_task_end(domainAgg);
    
        std::cout << "[DONE] Test_DomainTask_API\n\n";
    }
    

    Output:

    GUI:

    Figure 11.11 GUI OneAPI-DomainTask API Summary

    CLI:

    "10 HOTTEST TASKS (Sort Event - CPU_TIME)"
    DOMAIN,TASK,TASK COUNT,ELAPSED TIME(seconds),"CPU_TIME" (seconds)
    "BASIC_DOMAIN_ODD", "BASIC_TASK_MATRIX_3", 1, 16.0178,16.0050
    "BASIC_DOMAIN_AGG", "BASIC_TASK_MATRIX_AGG", 2, 12.7786,12.0400
    "BASIC_DOMAIN_EVEN", "BASIC_TASK_MATRIX_2", 1, 12.0135,12.0140
    "BASIC_DOMAIN_ODD", "BASIC_TASK_MATRIX_1", 1, 8.0090,8.0100
    "BASIC_DOMAIN_EVEN", "BASIC_TASK_MATRIX_0", 1, 4.0046,4.0050
    

    Explanation:

    • The function classic_multiply_matrices() simulates approximately 4 seconds of work per repetition, with its parameter controlling the number of repetitions.

    • The task BASIC_TASK_MATRIX_AGG aggregates time across both start-end calls, and its TASK COUNT increases for each invocation.

    • The domains are correctly associated: BASIC_DOMAIN_ODD handles odd-numbered tasks (BASIC_TASK_MATRIX_1, BASIC_TASK_MATRIX_3), while BASIC_DOMAIN_EVEN handles even-numbered tasks (BASIC_TASK_MATRIX_0, BASIC_TASK_MATRIX_2).
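The aggregation described above can be modelled as a map keyed by domain and task name that accumulates a call count and a total time. The sketch below replays the example with the nominal 4 seconds per repetition. It is a simplified model for checking the expected numbers, not AMDuProf's implementation; TaskTotals, TaskAggregator, and replay_domain_task_example are hypothetical names.

```cpp
#include <map>
#include <string>
#include <utility>

// Accumulated values for one (domain, task) pair.
struct TaskTotals {
    int count = 0;
    double seconds = 0.0;
};

// Hypothetical model: every begin/end pair adds one to the task count and
// its duration to the running total of the matching (domain, task) entry.
class TaskAggregator {
public:
    void record(const std::string& domain, const std::string& task,
                double seconds) {
        TaskTotals& totals = totals_[std::make_pair(domain, task)];
        totals.count += 1;
        totals.seconds += seconds;
    }

    // Returns zeroed totals for a pair that was never recorded.
    TaskTotals get(const std::string& domain, const std::string& task) const {
        auto it = totals_.find(std::make_pair(domain, task));
        return it == totals_.end() ? TaskTotals{} : it->second;
    }

private:
    std::map<std::pair<std::string, std::string>, TaskTotals> totals_;
};

// Replays Test_DomainTask_API() with a nominal 4 seconds per repetition.
inline TaskAggregator replay_domain_task_example() {
    const double kSecondsPerRep = 4.0;  // nominal value from the explanation
    TaskAggregator agg;
    agg.record("BASIC_DOMAIN_AGG", "BASIC_TASK_MATRIX_AGG", 1 * kSecondsPerRep);
    for (int i = 0; i < 4; ++i) {
        const std::string domain =
            (i % 2 == 0) ? "BASIC_DOMAIN_EVEN" : "BASIC_DOMAIN_ODD";
        agg.record(domain, "BASIC_TASK_MATRIX_" + std::to_string(i),
                   (i + 1) * kSecondsPerRep);
    }
    agg.record("BASIC_DOMAIN_AGG", "BASIC_TASK_MATRIX_AGG", 2 * kSecondsPerRep);
    return agg;
}
```

In this model BASIC_TASK_MATRIX_AGG ends with a count of 2 and 12 nominal seconds, and BASIC_TASK_MATRIX_3 with 16 nominal seconds, which lines up with the reported CPU_TIME values of 12.0400 and 16.0050.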

  2. Event APIs:

    /* Event API usage example with classic_multiply_matrices() to simulate work. */
    
    void Test_Event_API()
    {
        std::cout << "[START] Test_Event_API\n";
    
        /* Create two events */
        __itt_event event1 = __itt_event_create("BASIC_EVENT_1", sizeof("BASIC_EVENT_1") - 1);
        __itt_event event2 = __itt_event_create("BASIC_EVENT_2", sizeof("BASIC_EVENT_2") - 1);
    
        /* Create an event to test aggregation */
        __itt_event eventAgg = __itt_event_create("BASIC_EVENT_AGG", sizeof("BASIC_EVENT_AGG") - 1);
    
        /* First call of aggregation Event APIs */
        __itt_event_start(eventAgg);
        classic_multiply_matrices(1);
        __itt_event_end(eventAgg);
    
        __itt_event_start(event1);
        classic_multiply_matrices(1);
        __itt_event_end(event1);
    
        __itt_event_start(event2);
        classic_multiply_matrices(2);
        __itt_event_end(event2);
    
        /* Second call of aggregation Event APIs */
        __itt_event_start(eventAgg);
        classic_multiply_matrices(2);
        __itt_event_end(eventAgg);
    
        std::cout << "[DONE] Test_Event_API\n\n";
    }
    

    Output:

    GUI:

    Figure 11.12 GUI OneAPI-Event API Summary

    CLI:

    "10 HOTTEST TASKS (Sort Event - CPU_TIME)"
    DOMAIN,TASK,TASK COUNT,ELAPSED TIME(seconds),"CPU_TIME" (seconds)
    "User", "BASIC_EVENT_AGG", 2, 12.6475,12.0140
    "User", "BASIC_EVENT_2", 1, 8.0188,8.0110
    "User", "BASIC_EVENT_1", 1, 4.0065,4.0060
    

    Explanation:

    • The function classic_multiply_matrices() simulates about 4 seconds of work per repetition. Its parameter directly controls the number of repetitions.

    • The event BASIC_EVENT_AGG combines time from both start–end calls, and its TASK COUNT reflects the total number of event invocations.
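In the example, __itt_event_create(...) receives the name length as sizeof("...") - 1, which excludes the terminating null character. That idiom only works for string literals and arrays; applied to a const char* pointer, sizeof yields the pointer size instead. A small compile-time helper can make the restriction explicit. literal_length below is a hypothetical helper for illustration and is not part of the ITT API.

```cpp
#include <cstddef>

// Hypothetical helper: number of characters in a string literal, excluding
// the terminating '\0'. It accepts only character arrays, so passing a
// const char* pointer fails to compile instead of giving a wrong length.
template <std::size_t N>
constexpr int literal_length(const char (&)[N]) {
    return static_cast<int>(N) - 1;
}

// Checked at compile time: "BASIC_EVENT_1" has 13 characters.
static_assert(literal_length("BASIC_EVENT_1") == 13,
              "length must exclude the null terminator");
```

With such a helper, the second argument in the example could be written as literal_length("BASIC_EVENT_1") in place of sizeof("BASIC_EVENT_1") - 1.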

  3. Collection Control APIs:

    /* Collection Control API usage example with classic_initialize_matrices_N() to simulate work. Profiling starts in the paused state. */
    
    void Test_Collection_Control()
    {
        std::cout << "[START] Test_Collection_Control\n";
    
        classic_initialize_matrices_1();
    
        __itt_resume();
    
        classic_initialize_matrices_2();
        classic_initialize_matrices_3();
    
        __itt_pause();
    
        classic_initialize_matrices_4();
    
        std::cout << "[DONE] Test_Collection_Control\n\n";
    }
    

    Output:

    GUI:

    Figure 11.13 GUI OneAPI-Collection Control API Summary

    Figure 11.14 GUI OneAPI-Collection Control API Hot Functions Summary

    CLI:

    Profile Start Time:,"Wed Nov 12 09:39:30 2025"
    Profile End Time:,"Wed Nov 12 09:40:40 2025"
    Profile Duration:,"70.049 seconds"
    Pause Duration:,"21.582 seconds"
    ...
    "10 HOTTEST FUNCTIONS (Sort Event - CPU_TIME)"
    FUNCTION,"CPU_TIME" (seconds),Module
    "classic_initialize_matrices_3()",12.0280,"/home/Repo/Test/ITT_Test/ITT_Test_App"
    "classic_initialize_matrices_2()",8.0060,"/home/Repo/Test/ITT_Test/ITT_Test_App"
    

    Explanation:

    • The report now includes Pause Duration, indicating how long profiling was paused, along with Profile Duration, which covers the entire profiling session including pauses.

    • Only classic_initialize_matrices_2() and classic_initialize_matrices_3() appear in the report, because profiling is in the resumed state when they are called.

    • classic_initialize_matrices_1() and classic_initialize_matrices_4() do not appear in the report, because profiling is in the paused state when they are called.

11.4.9. Limitations

Here is a list of limitations: