The AMDProfileControl APIs allow you to limit the profiling scope to a specific portion of the code within the target application. This is particularly helpful when you want to focus performance monitoring and analysis on a specific area of code in a long-running application.
By default, profiling collects samples for the entire control flow of the application, from the start of execution to the end. The control APIs let the profiler collect data only for a specific part of the application, for example, a CPU-intensive loop or a hot function.
After instrumenting the target application with the profiling enable/disable APIs, recompile it so that only the required code regions are profiled.
11.1. CPU Profile Control APIs
Profile Control APIs allow you to pause and resume profiling at runtime from within the application being profiled. You can call these APIs from C/C++ or Python code. Note that these APIs control profiling only—they do not pause or resume the execution of the application itself.
When launching an application instrumented with the Profile Control APIs, you must use the CLI option --start-paused or keep the Enable start paused option switched on in the GUI; otherwise, the behavior is undefined.
To switch on the GUI Enable start paused option:
Select the Profile Target page and click Next.
Click Advanced Options.
On the Advanced Options page, in the Profile Scheduling section, switch on the Enable start paused option.
There are two groups of APIs for pause/resume. APIs from different groups cannot be mixed within a single run of an application.
Group1: The APIs amdProfileResume() and amdProfilePause() resume and pause the profiling, respectively.
Group2: The APIs amdProfileStrictResume() and amdProfileStrictPause() resume and pause the profiling, respectively.
11.1.1. Difference between Group1 and Group2 APIs
As described earlier, the Profile Control APIs are for profiling a small portion of interest within a typically long-running program.
With the Group1 APIs, the profile-pause message issued by an amdProfilePause() call propagates through the application layers before profiling actually stops. Meanwhile, the application continues to run with profiling on, so some additional instructions after the pause call are also profiled.
Similarly, after an amdProfileResume() call, the application runs a few more instructions without profiling before profiling actually resumes.
In general, this is not an issue, but it is undesirable when you are profiling short-running functions.
The Group2 APIs, amdProfileStrictPause() and amdProfileStrictResume(), remove this issue by pausing or resuming profiling exactly at the call. They do this at the cost of speed.
In short, Group1 APIs are faster but slip a few instructions, while Group2 APIs are slower but do not slip.
11.1.2. C/C++ API Descriptions
To use the C++ APIs, include the header AMDProfileController.h. This file is available in the include directory under the AMD uProf install path, for example ~/Foo/AMDuProf_Linux_x64_5.x.x/include/AMDProfileController.h. That include folder (~/Foo/AMDuProf_Linux_x64_5.x.x/include/ in this example) must be in the include path while building the instrumented application.
For linking, add the library path of libAMDProfileController.so, which is located at <Installation Directory>/lib/x64/shared/libAMDProfileController.so.
Group1 APIs:
1. bool amdProfileResume();
Resumes profiling that was paused either by the command line option --start-paused or by a previous call to amdProfilePause() from the application.
Returns true on success or false in case of failure.
2. bool amdProfilePause();
Pauses profiling that is currently running.
Returns true on success or false in case of failure.
Group2 APIs:
1. bool amdProfileStrictResume();
Resumes profiling that was paused either by the command line option --start-paused or by a previous call to amdProfileStrictPause() from the application.
Returns true on success or false in case of failure.
2. bool amdProfileStrictPause();
Pauses profiling that is currently running.
Returns true on success or false in case of failure.
11.1.5. Python API Descriptions
The Python Profile Control APIs mirror the C/C++ APIs. The Pause/Resume APIs are supported for Python version 3.7 onwards.
To use these APIs, you must:
Set the environment variable PYTHONPATH to the path of the module amd_instrument.py, located in <uProf Install Directory>/lib/x64/python/.
Import the module amd_instrument in your source file; it defines the APIs.
Like C++, there are two Python APIs from Group1:
1. amd_profile_resume()
Resumes profiling that was paused either by the command line option --start-paused or by a previous call to amd_profile_pause() from the application.
Returns true on success or false in case of failure.
2. amd_profile_pause()
Pauses profiling that is currently running.
Returns true on success or false in case of failure.
Group2 APIs:
1. amd_profile_strict_resume()
Resumes profiling that was paused either by the command line option --start-paused or by a previous call to the strict pause API from the application.
Returns true on success or false in case of failure.
2. Strict pause API (the Python counterpart of amdProfileStrictPause())
Pauses profiling that is currently running.
Returns true on success or false in case of failure.
These Pause/resume APIs can be called multiple times within the application.
The example code is self-explanatory through its comments. The functions f1(), f2(), and so on must be declared and defined properly.
11.1.5.2. Unmatched API Calls
Profile Control APIs can cause an unmatched amdTaskBegin() or amdTaskEnd(), that is, a call for which the corresponding amdTaskEnd() or amdTaskBegin() call, respectively, is not found within the same thread in the profiled run and generated data. The generated profile data can contain multiple unmatched amdTaskBegin() or amdTaskEnd() calls. One possible reason is incorrect use of the Pause/Resume APIs together with the Domain/Task APIs, for example, when one call of the pair falls in the paused state and the other in the resumed state. In such cases, the unmatched API calls are ignored.
Profile Control APIs can be mixed with Domain/Task/Events APIs to get meaningful data.
Pause/resume APIs can be spread across multiple functions or threads, that is, resume can be called from one function or thread and pause from another. In both cases, the calls affect profiling of the entire application.
Nested Resume/Pause calls are not supported.
The -static option should not be used while compiling with g++.
The Profile Control APIs are supported only for C/C++ and Python-based CPU applications. They are not supported for Fortran, MPI, OpenMP, Java, or .NET applications, or for GPU profiling.
AMDProfileControl APIs work only with AMDuProfCLI and GUI for application analysis. They do not work with:
Power Profiler
System analysis tools (uProfPcm and uProfSys)
11.2. Instrument APIs - Domain and Task
uProf provides time information at the module, process, thread, and function levels, all of which are defined by the programming language or the system. The Domain and Task APIs add time information for a user-defined code block, which can contain a single instruction, multiple lines of instructions, or multiple function calls. A task defines such a code block and is identified by a unique name. A domain, like a namespace, helps organize tasks by qualifying each task with a domain name. In other words, a domain is a container of tasks.
11.2.1. Definition and Concepts
Domain: A domain defines a name and creates a namespace for tasks. Each domain can contain zero or more tasks.
Within a domain, task names must be unique, but the same task name can be used in different domains.
The (domain, task) combination must be unique within a thread, and this tuple uniquely identifies a Task Item. The same task item can appear multiple times within a thread; each such appearance is called a Task Instance. The time taken by all instances of a task item is summed up in the report.
Currently, nested domains are not allowed, that is, one domain cannot contain another domain.
Task: A task is a sequence of synchronous (blocking) code executed between the calls amdTaskBegin(…) and amdTaskEnd(…).
Blocking code in this definition means that each instruction or function call within the task completes before the next instruction or function call starts. For a thread launch, process launch, or other asynchronous call, the time consumed by that thread, process, or asynchronous call is not included in the task time; only the launch overhead, which contributes to the elapsed time between the calls amdTaskBegin(...) and amdTaskEnd(...), is included.
The code executed may differ from the code listed between the calls amdTaskBegin(...) and amdTaskEnd(...), because at runtime control may go to different areas of the code, as seen in the following example.
Code listed between the calls amdTaskBegin(...) and amdTaskEnd(...) contains the sequence of instructions: Instruction 73 to Instruction 78.
However, the instructions actually executed within that task, in order, are: Instruction 73, Instruction 74, Instruction 1, Instruction 2, Instruction 3, Instruction 77, Instruction 78.
A task cannot span over multiple threads, that is, it cannot start in one thread and end in another.
amdTaskEnd(...) takes the domain as a parameter, but not a task. It closes the last open task in the same domain, that is, the task started by the most recent amdTaskBegin(...) call encountered at runtime. Hence, each domain behaves like a separate stack: every amdTaskBegin(...) is pushed, and whenever amdTaskEnd(…) with the same domain is encountered, the top of the stack is popped and combined with that amdTaskEnd(...) to define a task instance.
Nested tasks are allowed, but nested domains are not allowed.
Overlapping tasks are allowed.
Incomplete tasks, where amdTaskBegin(...) is defined and encountered at the runtime, but amdTaskEnd(...) is not defined or not encountered at the runtime, are ignored.
If a task item, defined by a unique (domain, task) tuple, is repeated in multiple threads of a process, the time consumed by all of its task instances is summed up and reported in the hotspot section and in the per-process section (added with the --detail option). A task instance is any appearance of the task item at runtime; the same task item can appear multiple times even within a single thread or function.
For a C++ based multi-process application, when a task item is present in multiple processes, the Hottest Task section shows the sum of the time consumed by all task instances from all the processes. With the --detail option, the per-process sum for each task item is reported in the Task Summary section, which is per process.
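The stack-like matching and the aggregation rules described above can be sketched as a small Python model. This is only an illustration of the documented semantics, not uProf code: the functions match_tasks() and summarize(), the tuple layout of the call records, and the timestamps are all hypothetical.

```python
from collections import defaultdict

def match_tasks(calls):
    """Pair begin/end calls from ONE thread into task instances.

    calls: list of ("begin", domain, task, time) or ("end", domain, None, time).
    Returns (instances, unmatched_begins, unmatched_ends).
    """
    open_tasks = defaultdict(list)  # conceptually one stack per domain
    instances = []
    unmatched_ends = 0
    for kind, domain, task, t in calls:
        if kind == "begin":
            open_tasks[domain].append((task, t))        # push
        elif open_tasks[domain]:
            name, start = open_tasks[domain].pop()      # pop the last open task
            instances.append((domain, name, t - start))
        else:
            unmatched_ends += 1                         # end without begin: ignored
    # Begins still on a stack never saw an end: incomplete tasks, ignored.
    unmatched_begins = sum(len(stack) for stack in open_tasks.values())
    return instances, unmatched_begins, unmatched_ends

def summarize(instances):
    """Sum instance count and elapsed time per (domain, task) task item."""
    totals = defaultdict(lambda: [0, 0.0])
    for domain, name, elapsed in instances:
        totals[(domain, name)][0] += 1
        totals[(domain, name)][1] += elapsed
    return {key: tuple(value) for key, value in totals.items()}

# Hypothetical call sequence recorded in a single thread.
calls = [
    ("begin", "D1", "outer", 0.0),
    ("begin", "D1", "inner", 1.0),   # nested task in the same domain
    ("begin", "D2", "other", 1.5),   # never ended
    ("end", "D1", None, 3.0),        # closes D1::inner
    ("end", "D1", None, 5.0),        # closes D1::outer
    ("begin", "D1", "inner", 6.0),   # second instance of D1::inner
    ("end", "D1", None, 6.5),
]
instances, unmatched_begins, unmatched_ends = match_tasks(calls)
summary = summarize(instances)
```

In this model the outer task's time includes the nested inner task (inclusive time), the two instances of D1::inner are summed into one entry, and D2::other is dropped because it was never ended.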
This example shows how to mix Domain/Task APIs and Profile Control APIs.
Here we have four tasks: Task1, Task2, Task3 and Task4 and two domains: DomainABC and DomainABC1. Task1 is defined within paused state of application and hence will not be included in the profiled data.
Here, the Python functions f1() to f7() are assumed to be defined. Because profiling starts with the --start-paused option, f1() and f2() are not profiled. After the call to amd_profile_resume(), f3() and f4() are profiled. The call to amd_profile_pause() then stops profiling, so f5() is not profiled. The second call to amd_profile_resume() turns profiling on again, so f6() and f7() are profiled.
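The pause/resume walk-through above can be reproduced with a minimal sketch. It assumes that amd_instrument exposes amd_profile_resume() and amd_profile_pause() as named in this section; when the module is not importable, the sketch falls back to a pure model. The helpers resume(), pause(), and work(), and the profiling_on flag, are hypothetical and exist only to make the expected outcome visible.

```python
try:
    # Available only when PYTHONPATH points to uProf's Python module directory.
    import amd_instrument as amd
except ImportError:
    amd = None  # fall back to a pure model of the pause/resume state

profiling_on = False  # models a launch with --start-paused
profiled = []         # names of the functions that ran while profiling was on

def resume():
    """Resume profiling (real API call if available) and track the state."""
    global profiling_on
    if amd is not None:
        amd.amd_profile_resume()
    profiling_on = True

def pause():
    """Pause profiling (real API call if available) and track the state."""
    global profiling_on
    if amd is not None:
        amd.amd_profile_pause()
    profiling_on = False

def work(name):
    """Hypothetical stand-in for f1() .. f7()."""
    if profiling_on:
        profiled.append(name)

work("f1")
work("f2")      # profiling starts paused: f1, f2 not profiled
resume()
work("f3")
work("f4")      # profiled
pause()
work("f5")      # not profiled
resume()
work("f6")
work("f7")      # profiled again
```

The model switches state instantly, which corresponds to the strict (Group2) behavior; with the Group1 APIs a few instructions around each call may slip, as described in the section on the difference between the groups.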
11.3. Instrument APIs - Event
Event APIs are similar to Task APIs, and events are listed under the task sections in the GUI/CLI, except that for events you do not need to specify a domain name. All events are listed under the domain User, which is reserved for events and must not be used as a user-defined domain for any task. Another difference from tasks is that an event start API call without a matching event end API call is treated as a valid event of 0.0 duration.
11.3.1. Definition and Concepts
Event: An event, in the ITT context, is a sequence of synchronous (blocking) code executed between the calls amdEventStart(…) and amdEventEnd(…).
Like Task, events are per thread, can be nested and overlapped, but unlike tasks, events have no user specified domain.
Task and events can be mixed within a program.
When an event start API call has no matching end call at runtime, it is treated as an instantaneous event of 0.00 duration and is listed.
This behavior differs from tasks, for which unmatched begin calls are ignored. An event that is created and ended but never started (no matching start) is ignored.
Identical events, like tasks, are aggregated within and across functions, threads, and processes, even when an event is nested within itself.
Nested events and tasks are allowed, including tasks nesting events and events nesting tasks. The outer task or event includes the time of the inner task or event, that is, its time is inclusive.
If the outer task or event launches a thread and the inner task or event is inside that thread function, the outer task or event does not include the time of that inner task or event. The same is true for other asynchronous calls: only the exclusive time is reported, not the inclusive time.
11.3.2. Overall workflow with small example
Suppose we want to find the hottest overheads for the initializations of n algorithms, where each algorithm has one or more initialization steps (function calls, computation statements, and so on). In the figure, the left side shows the original code for algorithm1, algorithm2, …, algorithm n. The middle shows the code after instrumentation and how tasks (user-defined units of code) are generated. The right side shows the task-wise data displayed in the CLI/GUI.
Figure 11.7 Example Task Instrumentation and Output
11.3.3. How to Run and Get Task/Event detail
You start with source code in which some code snippets are to be defined as tasks. After profiling with the proper configuration, you get the details of the defined tasks (time information and so on) in report.csv. Presently, this is supported only in CLI mode profiling; the GUI is not supported.
1. Have your buildable source code to be profiled with Task/Event APIs.
2. Instrument the source code with task APIs.
Include the header AMDProfileController.h in the source file where you want to create a task. The path of this header is <Installation Directory>/include/AMDProfileController.h.
Hence, that folder must be in the include path while building the instrumented application.
Create domain and get domain handle using amdDomainCreate(…).
Create string handle, using amdStringHandleCreate(…), which is used to name a task.
Using domain handle and string handle create a task and mark the start of the task – both are done with single call amdTaskBegin(…) and then your code (which is part of the task) comes.
Call amdTaskEnd(…) to mark the end of the task.
You can add more tasks in the same way.
Events can be added using Event APIs similarly.
For linking, add the library path of libAMDProfileController.so, which is located at <Installation Directory>/lib/x64/shared/libAMDProfileController.so.
Currently, the Domain and Task APIs are available in C/C++ and Python for instrumenting code in those languages. For Event APIs, only C++ support is available.
11.3.3.1.1. Using the Task APIs
C/C++ and Python APIs are described here for instrumenting codes.
11.3.3.2. C++ Task APIs
To use the C++ APIs, include the header AMDProfileController.h. The path of this header is <Installation Directory>/include/AMDProfileController.h, for example ~/Foo/AMDuProf_Linux_x64_5.x.x/include/AMDProfileController.h. Hence, that folder must be in the include path while building the instrumented application.
For linking, add the library path of libAMDProfileController.so, which is located at <Installation Directory>/lib/x64/shared/libAMDProfileController.so.
There are four C++ based APIs for supporting domain and task.
amdTaskBegin(…) takes the following parameters:
1. void* domainHandle: a valid domain handle created by amdDomainCreate(...).
2. int taskId: currently, only 0 is to be passed for this.
3. int parentId: currently, only 0 is to be passed for this.
4. const long unsigned int* strHandleForTask: a valid string handle returned by amdStringHandleCreate(…).
Returns: Nothing.
void amdTaskEnd(void* domainHandle);
Marks the end of the task that was begun by the latest runtime call of amdTaskBegin(…) with the same domain.
void* domainHandle: a valid domain handle created by amdDomainCreate(...).
Returns: Nothing.
void* amdDomainCreate(const char* name);
Creates a domain with the specified name and returns its handle as a void pointer. If called multiple times within the same scope, it returns the same handle.
const char* name: name of the domain. It can contain spaces, periods, and so on, such as amd.com.uprof.
Returns: Handle as void* to the created domain.
Example
    void *execDomain = amdDomainCreate("Main.ExecuteBenchmark");
    void *mulStr = amdStringHandleCreate("MultiplyMatrices");
    // some activities, not within task
    int matrixSize = 1000;
    initialize_matrices(matrixSize);
    amdTaskBegin(execDomain, 0, 0, mulStr);
    // some activities within task
    multiply_matrices(1000);
    printOutout();
    amdTaskEnd(execDomain);
    // some activities, not within task
    CleanUp();
11.3.3.3. Python APIs
To use these Python APIs:
Set environment variable PYTHONPATH to the path of the module amd_instrument.py, located in: <uProf Install Directory>/lib/x64/python/.
Import the module amd_instrument, which defines the domain/task APIs.
There are two groups of APIs for supporting domain and task: Python APIs and C++ based APIs. APIs from one group cannot be mixed with APIs from the other group.
amd_start_task(domain_task_name)
Creates the domain and the task, and begins the task.
domain_task_name: a :: separated domain and task combination string in the format <domain>::<task>.
Returns: Nothing.
amd_end_task(domain)
Marks the end of the last (latest) task in the specified domain.
domain: the domain in which the last task is to be marked as ended.
Returns: Nothing.
Example
    import amd_instrument as a
    # ... other code, e.g. function calls and/or statements
    a.amd_start_task("DomainABC::TaskPQR")
    # ... blocking function calls and/or statements, considered within this task
    # or non-blocking function calls and/or statements, not considered within tasks
    a.amd_end_task("DomainABC")
domain_create(…)
Takes a domain name string and returns a domain, which is used in the task_begin(…) API.
domain: the domain name, as a string, under which the current task is considered.
Returns: the domain that was created.
task_begin(domain, name)
Takes a domain returned by the domain_create(…) API and a task name. It defines the starting point of the task.
domain: the domain returned from the domain_create(…) call.
name: the task name as a string.
Returns: Nothing.
task_end(domain)
Takes the domain returned by the domain_create(…) API and marks the end of the task that was last created within the same domain.
domain: the domain returned from the domain_create(…) call.
Returns: Nothing.
Example
    import amd_instrument as a
    # ... other code, e.g. function calls and/or statements
    d = a.domain_create("DomainABC")
    a.task_begin(d, "TaskPQR")
    # ... blocking function calls and/or statements, considered within this task
    # or non-blocking function calls and/or statements, not considered within tasks
    a.task_end(d)
11.3.3.3.1. Using the Event APIs
C/C++ Event APIs are described here for instrumenting codes.
To use the C++ APIs, include the header AMDProfileController.h. The path of this header is <Installation Directory>/include/AMDProfileController.h, for example ~/Foo/AMDuProf_Linux_x64_5.x.x/include/AMDProfileController.h. Hence, that folder must be in the include path while building the instrumented application.
For linking, add the library path of libAMDProfileController.so, which is located at <Installation Directory>/lib/x64/shared/libAMDProfileController.so.
There are three C++ based APIs for supporting Events.
amdEventCreate(…)
Creates and returns the event handle. As its argument, it takes the event name, which is a null-terminated string.
If the length is bigger than the length of the null-terminated string, the name is considered to be delimited by the null character.
const char* name: name of the event; it is associated with the returned handle.
Returns: Handle as void* to the created event.
int amdEventStart(void* eventHandle);
Marks the start of the event. When there is no matching end call at runtime, the event is treated as an instantaneous event of 0.00 duration and is listed. This behavior differs from tasks, for which unmatched begin calls are ignored. Unmatched end calls are always ignored, for both events and tasks.
void* eventHandle: a valid event handle created by amdEventCreate(...).
Returns: 0 on success.
int amdEventEnd(void* eventHandle);
Marks the end of the event. As described for amdEventStart(…), this call can be skipped for some events.
Unmatched end calls are always ignored, for both events and tasks.
void* eventHandle: a valid event handle created by amdEventCreate(...).
Returns: 0 on success.
Example
    void *execEvent = amdEventCreate("MyNewEvent");
    // some activities, not within the event
    int matrixSize = 2048;
    initialize_matrices(matrixSize);
    amdEventStart(execEvent);
    // some activities within the event
    LU_decompose_matrix(matrixSize);
    printResults();
    amdEventEnd(execEvent);
    // some activities, not within the event
    CleanUp();
11.3.4. Compiling Instrumented Target Application
Complete the following procedure to compile an instrumented target application:
Include the header AMDProfileController.h in the source file where you want to create a task. The path of this header is <Installation Directory>/include/AMDProfileController.h.
Hence, that folder must be in the include path while building the instrumented application.
For linking, add the library path of libAMDProfileController.so, which is located at <Installation Directory>/lib/x64/shared/libAMDProfileController.so.
Build the instrumented application, based on the present compiler option(s) and commands.
11.3.5. Profiling Instrumented Target Application
Complete the following procedure to profile an instrumented target application:
Run the application: AMDuProfCLI collect --config tbp -o <outputFolderPath> <applicationNameToBeProfiled>.
The raw output is generated in the directory <SessionDirectory>/instrument.
In some cases, you need a quick way to disable the Instrumentation APIs in the code, because manually commenting out each occurrence can be tedious.
You can disable the Instrumentation APIs quickly using one of the following options:
Compile-time disabling using the macro AMDUPROF_DISABLE_INSTRUMENT_API: Define the macro in a common header included by all the files that contain calls to Instrumentation APIs.
Runtime disabling using an environment variable: Setting the environment variable AMDUPROF_INSTRUMENT_ENABLE to 0 disables the effect of the Instrumentation APIs.
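As a rough sketch of the runtime option, a launcher script can set AMDUPROF_INSTRUMENT_ENABLE only for the process it starts, leaving the parent environment untouched. The helper run_with_instrumentation() is hypothetical, and the child command here is a stand-in that merely echoes the variable; in practice it would be the instrumented application.

```python
import os
import subprocess
import sys

def run_with_instrumentation(cmd, enabled):
    """Run cmd with AMDUPROF_INSTRUMENT_ENABLE set to "1" or "0" for the child only."""
    env = dict(os.environ)  # copy, so the parent environment is not modified
    env["AMDUPROF_INSTRUMENT_ENABLE"] = "1" if enabled else "0"
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)

# Stand-in target: a child Python process that prints the value it sees.
child = [
    sys.executable,
    "-c",
    "import os; print(os.environ['AMDUPROF_INSTRUMENT_ENABLE'])",
]
result = run_with_instrumentation(child, enabled=False)
seen = result.stdout.strip()
```

Setting the variable per launch keeps the instrumented source and binary unchanged, which is the advantage of the runtime option over the compile-time macro.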
11.3.7. Example Steps to Attach Instrumented Process
Run an application in one terminal or as a daemon with the environment variable AMDUPROF_INSTRUMENT_ENABLE=1 set.
You can use the -d option to set the duration for which profiling is allowed.
4. Finally, generate the report as usual. The generated report.csv shows the top 10 (or the number selected through the provided option) hottest tasks.
If the same process (say, a long-running process) is attached multiple times, multiple session files are created. Each displays the corresponding data in its report.csv file.
11.3.8. Unmatched Task/Event
11.3.8.1. Definition
Unmatched task: An unmatched amdTaskBegin() or amdTaskEnd() is a call for which the corresponding amdTaskEnd() or amdTaskBegin() call, respectively, is not found in the generated profile data. The generated profile data can contain multiple unmatched amdTaskBegin() or amdTaskEnd() calls.
Note that, by definition, an unmatched amdTaskBegin() or amdTaskEnd() means that no match was found in the same domain and the same thread.
Unmatched event: An unmatched amdEventStart() or amdEventEnd() is a call for which the corresponding amdEventEnd() or amdEventStart() call, respectively, is not found in the generated profile data. The generated profile data can contain multiple unmatched amdEventStart() or amdEventEnd() calls.
11.3.8.2. Reason for Unmatched Task/Event
The following are a few possible cases that produce an unmatched amdTaskBegin() or amdEventStart(), or an unmatched amdTaskEnd() or amdEventEnd(), in the generated data (that is, the number of amdTaskBegin() or amdEventStart() entries differs from the number of amdTaskEnd() or amdEventEnd() entries).
Case 1 - Use of start-paused: one call of the pair was in the paused state and the other in the resumed state.
Case 2 - Multiple attachments of the same process: one call of the pair was in the not-attached state and the other in the attached state.
Case 3 - Coding error while instrumenting: forgetting to write a balanced amdTaskBegin()/amdTaskEnd() or amdEventStart()/amdEventEnd() pair.
Case 4 - Runtime condition.
Example: an unforeseen jump or exception handling left one of the amdTaskBegin()/amdEventStart() or amdTaskEnd()/amdEventEnd() calls unreached.
11.3.8.3. Handling the Unmatched Task
Any unmatched task is completely ignored, that is, it is not considered or listed in the reported tasks.
For an event, a start call with no matching end call at runtime is treated as an instantaneous event of 0.00 duration and is listed. This behavior differs from tasks, for which unmatched begin calls are ignored.
Unmatched end calls are always ignored, for both events and tasks.
11.3.9. Depth Within Domain
11.3.9.1. Definition
When tasks or events from the same thread and the same domain are nested (all events always belong to the same domain, User), the level of nesting is described by Depth Within Domain. The following pseudo code of a small example illustrates the concept.
For events, there is no need to consider the domain, because all events are from the same domain, User. Hence, within the same thread, an event's Depth Within Domain increases as the nesting level increases.
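One way to picture Depth Within Domain is the following Python model, which counts how many tasks of the same domain are already open in the thread when a new one begins. It is a sketch of the concept only: the function depth_within_domain() and the record layout are hypothetical, and counting the outermost level as 0 is an assumption, since the starting value is not stated here.

```python
def depth_within_domain(calls):
    """Compute the nesting depth of each task begun in ONE thread.

    calls: list of ("begin", domain, name) or ("end", domain, None).
    Returns [(domain, name, depth)] in begin order.
    Assumption: the outermost task of a domain has depth 0.
    """
    open_count = {}  # number of currently open tasks per domain
    depths = []
    for kind, domain, name in calls:
        if kind == "begin":
            depth = open_count.get(domain, 0)
            depths.append((domain, name, depth))
            open_count[domain] = depth + 1
        elif open_count.get(domain, 0) > 0:
            open_count[domain] -= 1  # end closes the last open task of the domain
    return depths

calls = [
    ("begin", "D1", "A"),
    ("begin", "D2", "X"),  # different domain: does not deepen D1
    ("begin", "D1", "B"),  # nested in A within D1
    ("begin", "D1", "C"),  # nested in B within D1
    ("end", "D1", None),   # closes C
    ("end", "D1", None),   # closes B
    ("begin", "D1", "E"),  # nested in A again
    ("end", "D1", None),   # closes E
    ("end", "D2", None),   # closes X
    ("end", "D1", None),   # closes A
]
depths = depth_within_domain(calls)
```

For events the same model applies with a single domain, User, so the depth depends only on how many events are open in the thread.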
Note
Depth Within Domain is not directly printed in CLI or GUI in uProf 5.2. But it is used in Flame graph.
11.3.10. Output
The report command generates a list of task hotspots in the Hotspot section of the <SessionDirectory>/report.csv. For example: AMDuProf-ClassicDomainTaskAPI-TBP_Sep-02-2024_16-50-28/report.csv.
The corresponding task section would look similar to this:
With the --detail option in report generation (AMDuProfCLI report -i <SessionDirectory> --detail), tasks can be grouped by process (grouping by modules or threads is not supported now) and appear in the per-process section after the Hotspot section as Task Summary.
TASK COUNT - Count of task instance i.e. number of time this task was encountered in runtime of the application, over all processes and threads
ELAPSED TIME(seconds) - The wall time of the task (not to be confused with CPU TIME).
Remaining columns are based on collect command configuration and view.
11.3.10.1. GUI Task Hotspot Summary
The following figure shows static Task Hotspot Summary table where time or other column (used for sorting the data of the table) is shown in the Summary page.
Currently, the Domain/Task/Event APIs are supported on Linux only (not on Android, Windows, FreeBSD, or Yocto). For Python, versions 3.10 and later are supported.
The pause-resume feature, which is invoked with the start-paused option, works with the domain-task APIs, but the limitations of each of the two sets of APIs remain.
The group-by module option is not supported with the --detail option in the report generation command. Only thread and process (the default) are supported.
In the Python API amd_start_task(domain_task_name), the domain or task name cannot contain the substring ::, because it is used as the separator. To avoid performance overhead, this string is not validated; hence, an incorrect input string (for example, a missing :: or multiple ::) may lead to incorrect behavior.
Event APIs are not supported in Python.
In the C++ APIs, a comma (,) cannot be used in the domain, task, or event name. To avoid performance overhead, this string is not validated; hence, an incorrect input string may lead to incorrect behavior.
The attach process option is supported in the current version 5.2, but the process to be attached must be started with the environment setting AMDUPROF_INSTRUMENT_ENABLE=1.
Attaching the same process multiple times is supported.
Blocking/Synchronous function calls and/or statements within task begin and task end are considered part of the current task and non-blocking/asynchronous function calls and/or statements within task begin and task end are not considered part of the current tasks. Hence, when a new process, thread, etc., is launched from the parent process, thread etc., the child’s consumption of time, etc., would not be part of the parent.
Domain/Task/Event APIs all can be mixed in a single application.
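Because the <domain>::<task> string passed to amd_start_task(domain_task_name) is not validated by the API, a caller may want to check it once before use, outside any hot path. The helper below is a hypothetical sketch, not part of amd_instrument; its name and its strictness (rejecting stray colons next to the separator) are assumptions.

```python
def split_domain_task(domain_task_name):
    """Split "<domain>::<task>" and reject malformed strings.

    Hypothetical helper: raises ValueError when the separator is missing,
    appears more than once, or leaves an empty or ambiguous part.
    """
    parts = domain_task_name.split("::")
    if len(parts) != 2:
        raise ValueError("expected exactly one '::' separator: %r" % domain_task_name)
    domain, task = parts
    if not domain or not task:
        raise ValueError("domain and task must be non-empty: %r" % domain_task_name)
    if any(p.startswith(":") or p.endswith(":") for p in parts):
        raise ValueError("ambiguous ':' next to the separator: %r" % domain_task_name)
    return domain, task

# Valid input yields the two names; the domain is what amd_end_task(domain) expects.
domain, task = split_domain_task("DomainABC::TaskPQR")
```

Running the check once at start-up, rather than on every task begin, keeps the validation cost out of the measured code.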
11.4. OneAPI support in AMDuProf
AMDuProf enables profiling of applications instrumented with OneAPI, allowing developers to analyze performance metrics and execution timelines effectively.
11.4.1. Overview
OneAPI offers a unified programming model for heterogeneous computing environments and includes support for Instrumentation and Tracing Technology (ITT) APIs to enhance performance analysis.
Domain-Task/Event ITT APIs allow developers to define logical tasks/events within application threads, providing fine-grained visibility into execution timelines.
Collection Control APIs enable selective data capture, helping focus analysis on critical code regions.
These features integrate seamlessly with AMDuProf Profiler, allowing visualization of user-defined tasks and correlation with CPU time in profiling reports. This helps developers identify bottlenecks and optimize performance across heterogeneous workloads.
11.4.2. Supported APIs
The following APIs from the OneAPI are currently supported for profiling using uProf.
With the --detail option in report generation (AMDuProfCLI report -i <SessionDirectory> --detail), tasks are grouped by process or thread (grouping by modules is not supported) and appear in the per-process or per-thread section, respectively, after the Hotspot section as Task Summary.
The following figure shows static Task Hotspot Summary table where time or other column (used for sorting the data of the table) is shown in the Summary page.
The function classic_multiply_matrices() simulates approximately 4 seconds of work per repetition, with its parameter controlling the number of repetitions.
The task BASIC_TASK_MATRIX_AGG aggregates time across both start-end calls, and its TASK COUNT increases for each invocation.
The domains are correctly associated: BASIC_DOMAIN_ODD handles odd-numbered tasks (BASIC_TASK_MATRIX_1, BASIC_TASK_MATRIX_3), while BASIC_DOMAIN_EVEN handles even-numbered tasks (BASIC_TASK_MATRIX_0, BASIC_TASK_MATRIX_2).
Event APIs:
    /* Event API usage example with classic_multiply_matrices() to simulate work. */
    void Test_Event_API()
    {
        std::cout << "[START] Test_Event_API\n";

        /* Create two events */
        __itt_event event1 = __itt_event_create("BASIC_EVENT_1", sizeof("BASIC_EVENT_1") - 1);
        __itt_event event2 = __itt_event_create("BASIC_EVENT_2", sizeof("BASIC_EVENT_2") - 1);

        /* Create an event to test aggregation */
        __itt_event eventAgg = __itt_event_create("BASIC_EVENT_AGG", sizeof("BASIC_EVENT_AGG") - 1);

        /* First call of aggregation Event APIs */
        __itt_event_start(eventAgg);
        classic_multiply_matrices(1);
        __itt_event_end(eventAgg);

        __itt_event_start(event1);
        classic_multiply_matrices(1);
        __itt_event_end(event1);

        __itt_event_start(event2);
        classic_multiply_matrices(2);
        __itt_event_end(event2);

        /* Second call of aggregation Event APIs */
        __itt_event_start(eventAgg);
        classic_multiply_matrices(2);
        __itt_event_end(eventAgg);

        std::cout << "[DONE] Test_Event_API\n\n";
    }
The function classic_multiply_matrices() simulates about 4 seconds of work per repetition. Its parameter directly controls the number of repetitions.
The event BASIC_EVENT_AGG combines time from both start–end calls, and its TASK COUNT reflects the total number of event invocations.
Collection Control APIs:
    /* Collection Control API usage example with classic_initialize_matrices_N()
       to simulate work. Profiling starts in paused state. */
    void Test_Collection_Control()
    {
        std::cout << "[START] Test_Collection_Control\n";

        classic_initialize_matrices_1();

        __itt_resume();
        classic_initialize_matrices_2();
        classic_initialize_matrices_3();
        __itt_pause();

        classic_initialize_matrices_4();

        std::cout << "[DONE] Test_Collection_Control\n\n";
    }
Output:
GUI:
Figure 11.13 GUI OneAPI-Collection Control API Summary
Figure 11.14 GUI OneAPI-Collection Control API Hot Functions Summary
CLI:
    Profile Start Time:,"Wed Nov 12 09:39:30 2025"
    Profile End Time:,"Wed Nov 12 09:40:40 2025"
    Profile Duration:,"70.049 seconds"
    Pause Duration:,"21.582 seconds"
    ...
    "10 HOTTEST FUNCTIONS (Sort Event - CPU_TIME)"
    FUNCTION,"CPU_TIME" (seconds),Module
    "classic_initialize_matrices_3()",12.0280,"/home/Repo/Test/ITT_Test/ITT_Test_App"
    "classic_initialize_matrices_2()",8.0060,"/home/Repo/Test/ITT_Test/ITT_Test_App"
Explanation:
The report now includes Pause Duration, indicating how long profiling was paused, along with Profile Duration, which covers the entire profiling session including pauses.
Only classic_initialize_matrices_2() and classic_initialize_matrices_3() are reflected in the report, because profiling is in the resumed state when they are called.
classic_initialize_matrices_1() and classic_initialize_matrices_4() are not reflected in the report, because profiling is in the paused state when they are called.
11.4.9. Limitations
Here is a list of limitations:
Currently, only a limited set of OneAPI APIs is supported, and only the Linux platform is supported.
The pause-resume feature, which is invoked with the start-paused option, works with the domain-task APIs, but the limitations of each of the two sets of APIs remain.
A comma (,) cannot be used in the domain, task, or event name.
The group-by module option is not supported with the --detail option in the report generation command.
Attach process option is supported with the current version 5.2, but the process to be attached must be started with the environment setting AMDUPROF_INSTRUMENT_ENABLE=1.
System-wide support is not available with current version 5.2.
Blocking or synchronous function calls and statements within __itt_task_begin and __itt_task_end are considered part of the current task. In contrast, non-blocking or asynchronous function calls and statements within the same boundaries are not considered part of the current task. When a parent process or thread launches a new process or thread, uProf does not include the child's time consumption in the parent's task metrics.