Profiling

You can profile an OpenCL application like any other application, using generic profiling tools such as “gprof”. This section explains how to profile commands in the OpenCL command queue.

Additionally, you can profile hardware events such as L2 cache misses and pipeline stalls through the AET library. See the Profiling Hardware Events section for more information.

Host Side Profiling

To profile commands on the host side, specify the “CL_QUEUE_PROFILING_ENABLE” property when creating the command queue. The OpenCL runtime then records a host-side timestamp, in nanoseconds, when the command is enqueued, when it is submitted, when it starts execution, and when it finishes execution. User code can query these timestamps by calling the “clGetEventProfilingInfo” API on the event that is returned when the command is enqueued.
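The four timestamps are most useful once they are turned into per-stage durations. The sketch below is a hypothetical helper, not part of the TI OpenCL product: it assumes the four values have already been read with “clGetEventProfilingInfo” and stored under keys named after the CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END queries. The timestamp values are made up for illustration.

```python
# Hypothetical helper: converts the four per-event timestamps (nanoseconds)
# into the time spent in each stage of the command's life cycle.

def stage_durations_ns(ts):
    """Return queue, submit, execution, and total durations in nanoseconds."""
    return {
        "queued_to_submit": ts["SUBMIT"] - ts["QUEUED"],
        "submit_to_start": ts["START"] - ts["SUBMIT"],
        "start_to_end": ts["END"] - ts["START"],
        "total": ts["END"] - ts["QUEUED"],
    }

# Made-up timestamps, for illustration only.
example = {"QUEUED": 1000, "SUBMIT": 1500, "START": 4000, "END": 9000}
durations = stage_durations_ns(example)

for stage, ns in durations.items():
    print(f"{stage}: {ns} ns ({ns / 1e3:.1f} us)")
```

The "start_to_end" entry is the execution time of the command itself; the other entries show how long it waited in the queue and in submission.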

DSP Side Profiling

The OpenCL runtime also records timestamps for predefined OpenCL activities on the DSP side and uses the lightweight ULM (Usage and Load Monitor) to communicate that data back to the host. These predefined activities include the start and finish of each workgroup execution and DSP cache coherency operations. On the host side, use the dsptop utility to retrieve the information. Because dsptop must be running before your OpenCL application starts, you will probably need two windows/consoles, for example:

  1. In window 1, run dsptop -l last
  2. In window 2, launch your OpenCL application, wait for it to finish
  3. Back in window 1, type “q” to quit dsptop. dsptop should print out the information sent back from the DSP side.

Details about dsptop usage can be found by running dsptop -h and on the dsptop wiki page.

Profiling Hardware Events

Profiling hardware events is a useful capability provided through the CTools AET library. Starting from TI OpenCL product v1.1.14, this capability is integrated into the OpenCL runtime. See all the stall and memory events in the AET Profiling Events section below. Events not prefixed with AET_EVT_STALL_ or AET_EVT_MEM_ cannot be profiled. To profile hardware events, there are three choices:

  1. Profile All Possible Events
  2. Profile a Select Few Events
  3. Manually Profile 1 or 2 Events

Profile All Events

To profile all hardware events, run the profiling script as follows:

/usr/share/ti/opencl/profiling/oclaet.sh [-g] oclapp
# e.g., oclapp is ./vecadd

While the executable is running, raw profiling data is recorded into profiling/aetdata.txt, relative to the current directory. After the executable finishes, the profiling script runs a Python script that produces a JSON table, an HTML table, and, if the -g option is specified, a plot of each kernel’s profiling information. Note that generating plots may require installing additional Python packages.

Profile Select Events

To profile selected hardware events, run the profiling script as follows:

/usr/share/ti/opencl/profiling/oclaet.sh oclapp event_type event_number [event_number ...]
# event_type: 1 for stall events, 2 for memory events
# event_number is the event offset from AET_GEM_STALL_EVT_START, if
#   profiling stall cycles, or from AET_GEM_MEM_EVT_START, if profiling
#   memory events.  At each script run, you can only profile one or more
#   events of the same type.
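Because the script takes numeric offsets rather than event names, it can help to derive the arguments from the names. The following is a minimal sketch under stated assumptions: the two small lookup tables are copied from a few entries of the AET Profiling Events list in this document, and the helper oclaet_command is hypothetical (it only builds the argument list and does not run anything).

```python
# Offsets copied from a few entries of the AET Profiling Events list.
STALL_EVENTS = {
    "AET_EVT_STALL_L1D_BANK_CONFLICT": 11,
    "AET_EVT_STALL_L1D_WRITE_BUFFER_FULL": 13,
}
MEM_EVENTS = {
    "AET_EVT_MEM_L1D_RM_HITS_L2_CACHE_B": 11,
    "AET_EVT_MEM_L1D_RM_HITS_EXT_CABLE_A": 12,
}
SCRIPT = "/usr/share/ti/opencl/profiling/oclaet.sh"


def oclaet_command(app, event_names):
    """Build the oclaet.sh argument list; all events must share one type."""
    if not event_names:
        raise ValueError("at least one event is required")
    if all(name in STALL_EVENTS for name in event_names):
        event_type, table = 1, STALL_EVENTS
    elif all(name in MEM_EVENTS for name in event_names):
        event_type, table = 2, MEM_EVENTS
    else:
        raise ValueError("events must be all stall events or all memory events")
    return [SCRIPT, app, str(event_type)] + [str(table[n]) for n in event_names]


cmd = oclaet_command(
    "./vecadd",
    ["AET_EVT_MEM_L1D_RM_HITS_L2_CACHE_B", "AET_EVT_MEM_L1D_RM_HITS_EXT_CABLE_A"],
)
print(" ".join(cmd))
```

Mixing a stall event with a memory event raises an error, which mirrors the restriction that one script run can only profile events of a single type.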

Manually Profile 1 or 2 Events

To manually profile 1 stall event, or 1 to 2 memory events, simply set the environment variables described in Environment Variables, as follows.

TI_OCL_EVENT_TYPE=1 TI_OCL_EVENT_NUMBER1=13 TI_OCL_STALL_CYCLE_THRESHOLD=0 ./vecadd
TI_OCL_EVENT_TYPE=2 TI_OCL_EVENT_NUMBER1=11 TI_OCL_EVENT_NUMBER2=12 ./vecadd

Note that profiling data is appended to profiling/aetdata.txt on each profiling run. If you need fresh profiling data, remove profiling/aetdata.txt before the profiling run.
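A manual run can also be scripted. The sketch below is illustrative only: manual_profile_env and run_fresh are hypothetical helpers, and TI_OCL_EVENT_NUMBER2 is assumed to be the name of the second-event variable, by analogy with EVENT_NUMBER2 in the raw data format. The helper enforces the limits stated above (one stall event, or one to two memory events) and removes stale data before launching the application.

```python
import os
import subprocess


def manual_profile_env(event_type, event_numbers, stall_threshold=None):
    """Return the environment variables for one manual profiling run."""
    if event_type not in (1, 2):
        raise ValueError("event_type must be 1 (stall) or 2 (memory)")
    if event_type == 1 and len(event_numbers) != 1:
        raise ValueError("stall profiling takes exactly one event")
    if event_type == 2 and not 1 <= len(event_numbers) <= 2:
        raise ValueError("memory profiling takes one or two events")
    env = {"TI_OCL_EVENT_TYPE": str(event_type)}
    for index, number in enumerate(event_numbers, start=1):
        # Assumed naming: TI_OCL_EVENT_NUMBER1, TI_OCL_EVENT_NUMBER2.
        env[f"TI_OCL_EVENT_NUMBER{index}"] = str(number)
    if stall_threshold is not None:
        env["TI_OCL_STALL_CYCLE_THRESHOLD"] = str(stall_threshold)
    return env


def run_fresh(app, extra_env, data_file="profiling/aetdata.txt"):
    """Delete stale profiling data, then run app with the variables set."""
    if os.path.exists(data_file):
        os.remove(data_file)
    return subprocess.run([app], env={**os.environ, **extra_env}, check=True)


stall_env = manual_profile_env(1, [13], stall_threshold=0)
mem_env = manual_profile_env(2, [11, 12])
print(stall_env)
print(mem_env)
# run_fresh("./vecadd", mem_env)  # run only on a target with the OpenCL runtime
```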

Analyzing Profiling Data

If you profile manually, you must run the Python script yourself to obtain the JSON file or the HTML table.

python /usr/share/ti/opencl/profiling/oclaet.py -t -g profiling/aetdata.txt

The -t and -g flags tell the script to produce an HTML table and a matplotlib plot of the profiling data, respectively. If neither flag is specified, only the JSON file of raw counter data is produced. This JSON file is easier to read than the raw data dump in profiling/aetdata.txt. The current format of the raw data is:

EVENT_TYPE            # the event type
EVENT_NUMBER1         # the first event number (offset from base AET event)
EVENT_NUMBER2         # the second event number (offset from base AET event)
STALL_CYCLE_THRESHOLD # the stall cycle threshold
Core number           # the number of the core that produced this record
Counter0_Value        # hardware counter 0 value: memory event 1
Counter1_Value        # hardware counter 1 value: memory event 2 or stall event
~~~~End Core          # End of core data for Core number
...                   # MORE CORE DATA CAN FOLLOW THIS
EVENT_TYPE
EVENT_NUMBER1
EVENT_NUMBER2
STALL_CYCLE_THRESHOLD
Core number
Counter0_Value
Counter1_Value
~~~~End Core
VectorAdd             # Kernel Name
---End Kernel         # Ends Kernel Data
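The record layout above can be read with a few lines of Python. This is a minimal sketch, not the logic of oclaet.py: it assumes the file holds one bare value per line (the "#" descriptions above being documentation only), that the numeric fields are decimal integers, and that the kernel name is the single line preceding "---End Kernel". The sample values are made up.

```python
FIELDS = ("event_type", "event_number1", "event_number2",
          "stall_cycle_threshold", "core", "counter0", "counter1")


def parse_aetdata(lines):
    """Group raw lines into kernels, each with a list of per-core records."""
    kernels, cores, pending = [], [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("~~~~End Core"):
            if len(pending) != len(FIELDS):
                raise ValueError(f"expected {len(FIELDS)} values per core")
            cores.append(dict(zip(FIELDS, (int(v) for v in pending))))
            pending = []
        elif line.startswith("---End Kernel"):
            if len(pending) != 1:
                raise ValueError("expected a kernel name before ---End Kernel")
            kernels.append({"kernel": pending[0], "cores": cores})
            cores, pending = [], []
        else:
            pending.append(line)
    return kernels


# Made-up sample: two cores profiling memory events 11 and 12.
sample = """\
2
11
12
0
0
1500
42
~~~~End Core
2
11
12
0
1
1480
40
~~~~End Core
VectorAdd
---End Kernel
"""

result = parse_aetdata(sample.splitlines())
for kernel in result:
    for core in kernel["cores"]:
        print(kernel["kernel"], core["core"], core["counter0"], core["counter1"])
```

On the target, the same function could be fed the contents of profiling/aetdata.txt, for example with open("profiling/aetdata.txt") as the source of lines.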

Profiling Data Plotting Requirements

Plotting profiling data with the Python script requires the matplotlib, pandas, and seaborn Python packages. If they are not already installed on your system, you can follow the instructions below.

which pip
# install pip if it is not already on your system
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py

pip install matplotlib
pip install pandas
pip install seaborn
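Before requesting a plot, you may want to confirm that the three packages can be found by the Python interpreter. The sketch below is a hypothetical convenience check using only the standard library; it does not import the packages, it only looks them up.

```python
import importlib.util

REQUIRED = ("matplotlib", "pandas", "seaborn")

# find_spec returns None when a top-level package cannot be located.
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All plotting packages are available.")
```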

AET Profiling Events

AET_EVT_MEM_L1D_RH_SRAM_A (AET_GEM_MEM_EVT_START + 0)
L1D Read Hit SRAM A
AET_EVT_MEM_L1D_RH_SRAM_B (AET_GEM_MEM_EVT_START + 1)
L1D Read Hit SRAM B
AET_EVT_MEM_L1D_RH_CACHE_A (AET_GEM_MEM_EVT_START + 2)
L1D Read Hit Cache A
AET_EVT_MEM_L1D_RH_CACHE_B (AET_GEM_MEM_EVT_START + 3)
L1D Read Hit Cache B
AET_EVT_MEM_L1D_WH_BUF_NOT_FULL_A (AET_GEM_MEM_EVT_START + 4)
L1D Write Hit, Tag Buffer Not Full A
AET_EVT_MEM_L1D_WH_BUF_NOT_FULL_B (AET_GEM_MEM_EVT_START + 5)
L1D Write Hit, Tag Buffer Not Full B
AET_EVT_MEM_L1D_WH_BUF_FULL_A (AET_GEM_MEM_EVT_START + 6)
L1D Write Hit, Tag Buffer Full A
AET_EVT_MEM_L1D_WH_BUF_FULL_B (AET_GEM_MEM_EVT_START + 7)
L1D Write Hit, Tag Buffer Full B
AET_EVT_MEM_L1D_RM_HITS_L2_SRAM_A (AET_GEM_MEM_EVT_START + 8)
L1D Read Miss, Hits L2 SRAM A
AET_EVT_MEM_L1D_RM_HITS_L2_SRAM_B (AET_GEM_MEM_EVT_START + 9)
L1D Read Miss, Hits L2 SRAM B
AET_EVT_MEM_L1D_RM_HITS_L2_CACHE_A (AET_GEM_MEM_EVT_START + 10)
L1D Read Miss, Hits L2 Cache A
AET_EVT_MEM_L1D_RM_HITS_L2_CACHE_B (AET_GEM_MEM_EVT_START + 11)
L1D Read Miss, Hits L2 Cache B
AET_EVT_MEM_L1D_RM_HITS_EXT_CABLE_A (AET_GEM_MEM_EVT_START + 12)
L1D Read Miss, Hits External, Cacheable A
AET_EVT_MEM_L1D_RM_HITS_EXT_CABLE_B (AET_GEM_MEM_EVT_START + 13)
L1D Read Miss, Hits External, Cacheable B
AET_EVT_MEM_L1D_RM_HITS_EXT_NON_CABLE_A (AET_GEM_MEM_EVT_START + 14)
L1D Read Miss, Hits External, Non Cacheable A
AET_EVT_MEM_L1D_RM_HITS_EXT_NON_CABLE_B (AET_GEM_MEM_EVT_START + 15)
L1D Read Miss, Hits External, Non Cacheable B
AET_EVT_MEM_L1D_WM_WRT_BUF_NOT_FULL_A (AET_GEM_MEM_EVT_START + 16)
L1D Write Miss, Write Buffer Not Full A
AET_EVT_MEM_L1D_WM_WRT_BUF_NOT_FULL_B (AET_GEM_MEM_EVT_START + 17)
L1D Write Miss, Write Buffer Not Full B
AET_EVT_MEM_L1D_WM_WRT_BUF_FULL_A (AET_GEM_MEM_EVT_START + 18)
L1D Write Miss, Write Buffer Full A
AET_EVT_MEM_L1D_WM_WRT_BUF_FULL_B (AET_GEM_MEM_EVT_START + 19)
L1D Write Miss, Write Buffer Full B
AET_EVT_MEM_L1D_WM_TAG_VIC_WRT_BUF_FLUSH_A (AET_GEM_MEM_EVT_START + 20)
L1D Write Miss, Tag/Victim/Write Buffer Flush A
AET_EVT_MEM_L1D_WM_TAG_VIC_WRT_BUF_FLUSH_B (AET_GEM_MEM_EVT_START + 21)
L1D Write Miss, Tag/Victim/Write Buffer Flush B
AET_EVT_MEM_CPU_CPU_BANK_CONFLICT (AET_GEM_MEM_EVT_START + 22)
CPU - CPU Bank Conflict
AET_EVT_MEM_CPU_SNOOP_CONFLICT (AET_GEM_MEM_EVT_START + 23)
CPU - Snoop/Coherence Conflict (A or B)
AET_EVT_MEM_CPU_IDMA_EDMA_BANK_CONFLICT (AET_GEM_MEM_EVT_START + 24)
CPU - iDMA/EDMA Bank Conflict (A or B)
AET_EVT_MEM_L1P_RH_SRAM (AET_GEM_MEM_EVT_START + 25)
L1P Read Hit SRAM
AET_EVT_MEM_L1P_RH_CACHE (AET_GEM_MEM_EVT_START + 26)
L1P Read Hit Cache
AET_EVT_MEM_L1P_RM_HITS_L2_SRAM (AET_GEM_MEM_EVT_START + 27)
L1P Read Miss, Hits L2 SRAM
AET_EVT_MEM_L1P_RM_HITS_L2_CACHE (AET_GEM_MEM_EVT_START + 28)
L1P Read Miss, Hits L2 Cache
AET_EVT_MEM_L1P_RM_HITS_EXT_CABLE (AET_GEM_MEM_EVT_START + 29)
L1P Read Miss, Hits External Cacheable
AET_EVT_STALL_CPU_PIPELINE (AET_GEM_STALL_EVT_START + 0)
CPU Stall Cycles
AET_EVT_STALL_CROSS_PATH (AET_GEM_STALL_EVT_START + 1)
Stall Due to a Cross path
AET_EVT_STALL_BRANCH_TO_SPAN_EXEC_PKT (AET_GEM_STALL_EVT_START + 2)
Stall due to a branch to an execute packet that spans two fetch packets
AET_EVT_STALL_EXT_FUNC_IFACE (AET_GEM_STALL_EVT_START + 3)
Stall due to an External Functional Interface
AET_EVT_STALL_MVC (AET_GEM_STALL_EVT_START + 4)
Stall Conditions: 1) AMR write followed by addressing mode instruction and src2 register is affected by AMR Values 2) Read of emulation registers in the ECM
AET_EVT_STALL_L1P_OTHER (AET_GEM_STALL_EVT_START + 5)
Any other stall not previously listed
AET_EVT_STALL_L1P_WAIT_STATE (AET_GEM_STALL_EVT_START + 6)
Stall due to Wait states in L1P Memory
AET_EVT_STALL_L1P_MISS (AET_GEM_STALL_EVT_START + 8)
Execute Packet held off due to a Cache Miss
AET_EVT_STALL_L1D_OTHER (AET_GEM_STALL_EVT_START + 10)
Any other L1D Stall not previously listed
AET_EVT_STALL_L1D_BANK_CONFLICT (AET_GEM_STALL_EVT_START + 11)
Stall on a memory bank conflict between A and B
AET_EVT_STALL_L1D_DMA_CONFLICT (AET_GEM_STALL_EVT_START + 12)
Stall while CPU access is held off by a DMA Access
AET_EVT_STALL_L1D_WRITE_BUFFER_FULL (AET_GEM_STALL_EVT_START + 13)
Stall on a write miss on A or B while the Write Buffer is Full
AET_EVT_STALL_L1D_TAG_UD_BUF_FULL (AET_GEM_STALL_EVT_START + 14)
Stall on a Write Hit with a tag update on either A or B while the tag update buffer is full
AET_EVT_STALL_L1D_LINE_FILL_B (AET_GEM_STALL_EVT_START + 15)
Stall on a read miss on B while the read miss data is being fetched from the lower memory level
AET_EVT_STALL_L1D_LINE_FILL_A (AET_GEM_STALL_EVT_START + 16)
Stall on a read miss on A while the read miss data is being fetched from the lower memory level
AET_EVT_STALL_L1D_WRT_BUF_FLUSH (AET_GEM_STALL_EVT_START + 17)
Stall on a read Miss on Either A or B while the Write Buffer is being flushed
AET_EVT_STALL_L1D_VICTIM_BUF_FLUSH (AET_GEM_STALL_EVT_START + 18)
Stall on a read miss on either A or B while the Victim Buffer is being flushed
AET_EVT_STALL_L1D_TAG_UD_BUF_FLUSH (AET_GEM_STALL_EVT_START + 20)
Stall on a read miss on either A or B while the Tag Update Buffer is being flushed
AET_EVT_STALL_L1D_SNOOP_CONFLICT (AET_GEM_STALL_EVT_START + 21)
Stall while a CPU access is held off by a Snoop access
AET_EVT_STALL_L1D_COH_OP_CONFLICT (AET_GEM_STALL_EVT_START + 22)
Stall while a CPU access is held off by a block cache coherence operation access