Example Descriptions¶
The examples in this section are included in the installation of the OpenCL product. Not all of these examples are applicable to all supported device platforms.
These examples are intended to illustrate a technique, an extension, or a mode of operation. The following table provides a high-level map from each example name to the features that the example highlights.
The keys to the codes in the table are given in the subsequent tables.
Name | Type | Execution Model | Kernel Compile | Buffer Model | Profiling | Extensions | Techniques |
---|---|---|---|---|---|---|---|
abort_exit | S | ndr,iot,oot | B/E | read | abort,exit | ||
ccode | S | 1wi | S/F | read | C | ||
conv1d | P | ndr,1wi | B/E | map | host | C, edma | async, local, query, vec |
dgemm | P | iot | B/E | host | host | C, omp, msmc, edma, cache | |
dspheap | S | 1wi | B/F | dspheap, msmc | functor | ||
dsplib_fft | P | ndr,1wi | B/E | host | host | C | |
edmamgr | S | 1wi | B/E | read | C, edma | ||
edmabw | I | iot | B/E | host | C | async | |
float_compute | S | ndr | B/F | host | host | local, async, vec | |
mandelbrot | S | ndr | S/F | read | host | nDev | |
matmpy | S | 1wi | B/F | read | host | C, msmc | nDev, async, local |
null | I | iot | S/E | host | |||
offline | S | ndr | B/F | read | event | vec | |
offline_embed | S | ndr | B/E | read | event | vec | |
ooo | S | oot | S/E | read | host | event, native | |
ooo_callback | S | oot | S/E | read | host | event, callback | |
ooo_map | S | oot | S/E | map | host | event, native | |
platforms | I | query | |||||
tidl | P | custom | host | host | |||
sgemm | P | 1wi | B/E | map | host | C, msmc, edma, cache | local, vec |
simple | S | ndr | S/E | read | functor | ||
timeout | S | ndr,iot,oot | B/E | read | timeout | ||
vecadd | S | ndr | S/E | host | vec | ||
vecadd_mpax | S | ndr | S/E | map | extMem, query, vec | ||
vecadd_openmp | S | iot | S/F | read | event | C, omp | |
vecadd_openmp_t | S | iot | S/F | read | event | C, omp | |
vecadd_subdevice | S | ndr | S/F | host | host | vec | |
vecadd_compile_link | S | ndr | S/E | host | host | compile, link, library | |
vecadd_compile_link_loadbinary | S | ndr | B/E | host | host | compile, link, library, loadbinary |
Type | Description |
---|---|
S | Simple illustration |
P | Performance motivated |
I | Information gathering |
Buffer Model | Description |
---|---|
read | Uses enqueueReadBuffer and enqueueWriteBuffer |
map | Uses enqueueMapBuffer and enqueueUnmapMemObject |
host | Uses the CL_MEM_USE_HOST_PTR buffer creation attribute |
Kernel Compile | Description |
---|---|
S/E | Creates kernel program from Source Embedded in the host application |
S/F | Creates kernel program from Source read from a File |
B/E | Creates kernel program from Binary Embedded in the host application |
B/F | Creates kernel program from Binary read from a File |
Profiling | Description |
---|---|
event | Uses profiling timestamp information queried from OpenCL Event objects |
host | Uses the host clock_gettime function to measure elapsed time |
device | Uses __clock() or __clock64() to measure elapsed cycles on the DSP |
Execution Model | Description |
---|---|
ndr | Queues a generic NDRangeKernel with > 1 work-item per work-group |
1wi | Queues a NDRangeKernel with 1 work-item per work-group |
iot | Queues a Task (1 work-item) in an In Order Queue |
oot | Queues a Task (1 work-item) in an Out of Order Queue |
custom | Queues a builtin kernel onto a custom device |
Extensions | Description |
---|---|
C | Kernels contain calls to standard C code |
omp | Kernels contain calls to standard C code with OpenMP pragmas |
msmc | Buffers created in on-chip MSMC memory are used |
edma | Kernels use the EdmaMgr builtin functions for DMA control |
cache | Kernels use the cache re-configuration builtin functions |
dspheap | Kernels create user defined heaps on the DSP |
abort | Kernels call abort() to terminate execution |
exit | Kernels call exit() to terminate execution |
timeout | Kernels terminate if the set timeout limit expires |
Techniques | Description |
---|---|
functor | The C++ binding’s Kernel Functor object is used |
event | OpenCL Events are used to set dependencies between enqueued commands |
nDev | The OpenCL application can be dynamically partitioned across multiple OpenCL devices |
native | The OpenCL application uses native kernels on the host |
callback | The callback feature is used to asynchronously call a host function on event status change |
extMem | The extended memory capability is used to access memory beyond the 32-bit address space |
async | The async_work_group_copy functions are used to move data between memory spaces |
local | OpenCL Local Buffers are used for performance improvement |
query | OpenCL platforms and/or devices are queried for attributes |
vec | OpenCL C vector data types are used in kernels |
compile | Use of program compile API to create compiled program objects from source program objects |
library | Use of program link API to create a library from compiled program objects |
link | Use of program link API to link compiled program objects and libraries |
loadbinary | Creation of program object from linked program binary |
platforms example¶
This example uses the OpenCL C++ bindings to discover key platform and device information from the OpenCL implementation and print it to the screen. It also reports the version of the installed TI OpenCL product.
simple example¶
This is a ‘hello world’ type of example that illustrates the minimum steps needed to dispatch a kernel to a DSP device and read a buffer of data back.
mandelbrot, mandelbrot_native examples¶
The ‘mandelbrot’ example is an OpenCL demo that generates the pixels of a Mandelbrot set image. This example also uses the C++ OpenCL binding. The OpenCL kernels are called repeatedly, each call generating an image that is zoomed in from the previous image. This repeats until the zoom factor reaches 1E15.
This example illustrates several key OpenCL features:
- OpenCL queues tied to potentially multiple DSP devices and a dispatch structure that allows the DSPs to cooperatively generate pixel data,
- The event wait feature of OpenCL,
- The separation of the one-time OpenCL setup from the repetitive enqueuing of kernels, and
- The ease with which kernels can be shifted from one device type to another.
The ‘mandelbrot_native’ example is a non-OpenCL native implementation (no dispatch to the DSPs) that can be used for comparison purposes. It uses OpenMP for dispatch to each ARM core. Note: The display of the resulting Mandelbrot images is currently disabled when run on the default EVM Linux file system included in the Processor SDK. Instead, the example outputs frame information.
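For orientation, the per-pixel work that a Mandelbrot kernel performs can be sketched in plain C++ as follows. This is an illustrative escape-time loop, not the code shipped with the example: the names `mandel_iterations` and `render_frame`, the iteration cap, and the pixel-to-coordinate mapping are all assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: escape-time iteration count for the point (cr, ci).
// Returns max_iter if the point has not escaped after max_iter steps.
int mandel_iterations(double cr, double ci, int max_iter) {
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0) {
        const double t = zr * zr - zi * zi + cr;  // real part of z*z + c
        zi = 2.0 * zr * zi + ci;                  // imaginary part of z*z + c
        zr = t;
        ++n;
    }
    return n;
}

// Hypothetical frame generator: fills a width x height image centred on
// (cx, cy). `span` is the width of the view in the complex plane, so
// shrinking `span` from one frame to the next zooms in.
std::vector<int> render_frame(int width, int height, double cx, double cy,
                              double span, int max_iter) {
    assert(width > 0 && height > 0);
    std::vector<int> pixels(static_cast<std::size_t>(width) * height);
    const double step = span / width;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const double cr = cx + (x - width / 2.0) * step;
            const double ci = cy + (y - height / 2.0) * step;
            pixels[static_cast<std::size_t>(y) * width + x] =
                mandel_iterations(cr, ci, max_iter);
        }
    }
    return pixels;
}
```

Each pixel is independent of every other pixel, which is why this kind of workload can be split across several devices or cores without coordination beyond collecting the results.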
ccode example¶
This example illustrates the TI extension to OpenCL that allows OpenCL C code to call standard C code that has been compiled off-line into an object file or static library. This mechanism can be used to allow optimized C or C-callable assembly routines to be called from OpenCL C code. It can also be used to dispatch a standard C function by wrapping it in an OpenCL C wrapper. Calling C++ routines from OpenCL C is not yet supported. You should also ensure that the standard C function and the call tree resulting from it do not allocate device memory, change the cache structure, or use any resources already being used by the OpenCL runtime.
matmpy example¶
This example performs a 1K x 1K matrix multiply using both OpenCL and a native ARM OpenMP implementation (GCC libgomp). The output is the execution time for each approach (OpenCL dispatch to the DSP vs. OpenMP dispatching to the 4 ARM A15s).
offline example¶
This example performs a vector addition using an OpenCL kernel that has been pre-compiled into a device executable file. The OpenCL program reads in the file containing the pre-compiled kernel and uses it directly. If you use offline compilation to generate a .out file containing the OpenCL C program and you subsequently move the executable, you must either move the .out file as well or have the application specify a non-relative path to the .out file.
vecadd_openmp example¶
This is an OpenCL + OpenMP example. The OpenCL program runs on the host, manages data transfers, and dispatches an OpenCL wrapper kernel to the device. The OpenCL wrapper kernel uses the ccode mode (see ccode example) to call a C function that has been compiled with OpenMP options (omp). To facilitate OpenMP mode, the OpenCL wrapper kernel needs to be dispatched as an OpenCL Task to an In-Order OpenCL Queue.
vecadd_openmp_t example¶
This is another OpenCL + OpenMP example, similar to vecadd_openmp. The main difference with respect to vecadd_openmp is that this example uses OpenMP tasks within the OpenMP parallel region to distribute computation across the DSP cores.
vecadd example¶
The same functionality as the vecadd_openmp example, but expressed fully as an OpenCL application without OpenMP. Included for comparison purposes.
vecadd_mpax example¶
The same functionality as the vecadd example, but with extended buffers. The example iteratively traverses smaller chunks (sub-buffers) of large buffers. During each iteration, the smaller chunks are mapped/unmapped for read/write. The sub-buffers are then passed to the kernels for processing. This example could also be converted to use a pipelined scheme where different iterations of CPU computation and device computation are overlapped. NOTE: The size of the buffers in the example (determined by the variable ‘NumElements’) is dependent on the available CMEM block size. Currently this example is configured to use buffer sizes suited to memory configurations that can support a 1.5 GB total buffer size. The example can be modified to use more (or less) based on the platform memory configuration.
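The chunked traversal described above can be sketched in plain C++ as follows. This is only a host-side illustration of the iteration pattern, assuming a fixed chunk size in elements; the names `add_chunk` and `vecadd_chunked` are hypothetical, and the comment marks where the real example would create, map, and unmap a sub-buffer and dispatch the kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the device kernel: adds `count` elements starting at `offset`.
void add_chunk(const std::vector<float>& a, const std::vector<float>& b,
               std::vector<float>& c, std::size_t offset, std::size_t count) {
    for (std::size_t i = offset; i < offset + count; ++i)
        c[i] = a[i] + b[i];
}

// Walks the full arrays in chunks of at most `chunk_elems` elements and
// returns the number of chunks that were processed.
std::size_t vecadd_chunked(const std::vector<float>& a,
                           const std::vector<float>& b,
                           std::vector<float>& c, std::size_t chunk_elems) {
    assert(chunk_elems > 0);
    assert(a.size() == b.size() && a.size() == c.size());
    std::size_t chunks = 0;
    for (std::size_t offset = 0; offset < a.size(); offset += chunk_elems) {
        // The last chunk may be shorter than the others.
        const std::size_t count = std::min(chunk_elems, a.size() - offset);
        // In an OpenCL version, a sub-buffer covering [offset, offset + count)
        // would be mapped, filled, unmapped, and passed to the kernel here.
        add_chunk(a, b, c, offset, count);
        ++chunks;
    }
    return chunks;
}
```

Because each chunk only touches its own range, consecutive iterations are independent, which is what makes the pipelined variant mentioned above possible.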
vecadd_mpax_openmp example¶
Similar to the vecadd_mpax example, but uses OpenMP to perform the parallelization and the computation. This example also illustrates that printf() can be used in OpenMP C code for debugging.
vecadd_subdevice example¶
The same functionality as the vecadd example, but using sub devices. This example illustrates the use of sub devices using the OpenCL C API. It performs vecadd on the root device as well as equally partitioned individual sub devices and measures the time taken by each of them.
vecadd_compile_link example¶
The same functionality as the vecadd example, but using separate compile and link functionality to build a program. This example also illustrates creation of a library from compiled program objects and uses this library to create a linked program object which is then used to create the kernel.
vecadd_compile_link_loadbinary example¶
The same functionality as the vecadd_compile_link example with the additional step of creating a new program from a linked program binary. This new program is then used to create the kernel.
dsplib_fft example¶
An example to compute multiple channels of FFTs using a routine from the dsplib library. This illustrates calling a standard C library function from an OpenCL kernel. It also illustrates how to improve performance over multiple channels by moving data from DDR into internal local L2 memory with EDMA, and overlapping computation with data movement using double buffering.
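The double-buffering idea can be sketched in plain C++ as a ping-pong between two local buffers. This is a sequential stand-in, not the example's code: `fetch_channel` copies synchronously where a real implementation would start an EDMA transfer and wait on it later, `process_channel` sums the samples where the real example calls a dsplib FFT routine, and all three function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Stand-in for a DMA transfer of one channel from DDR into a local buffer.
void fetch_channel(const std::vector<float>& ddr, std::size_t ch,
                   std::size_t len, std::vector<float>& local) {
    std::copy(ddr.begin() + ch * len, ddr.begin() + (ch + 1) * len,
              local.begin());
}

// Stand-in for the per-channel computation.
float process_channel(const std::vector<float>& local) {
    return std::accumulate(local.begin(), local.end(), 0.0f);
}

// Processes every channel, alternating between two local buffers so that
// the fetch of channel ch + 1 can be issued before channel ch is computed.
std::vector<float> process_all(const std::vector<float>& ddr,
                               std::size_t channels, std::size_t len) {
    assert(ddr.size() == channels * len);
    std::vector<float> buf[2] = {std::vector<float>(len),
                                 std::vector<float>(len)};
    std::vector<float> results(channels);
    if (channels == 0) return results;
    fetch_channel(ddr, 0, len, buf[0]);  // prime the first buffer
    for (std::size_t ch = 0; ch < channels; ++ch) {
        if (ch + 1 < channels) {
            // With real asynchronous DMA this transfer would be in flight
            // while the computation below runs on the other buffer.
            fetch_channel(ddr, ch + 1, len, buf[(ch + 1) % 2]);
        }
        results[ch] = process_channel(buf[ch % 2]);
    }
    return results;
}
```

The key property is that the buffer being filled is never the buffer being computed on, so the transfer and the computation have no data hazard between them.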
ooo, ooo_map examples¶
This application illustrates several features of OpenCL.
- Using a combination of In-Order and Out-Of-Order queues
- Using native kernels on the CPU
- Using events to manage dependencies among the tasks to be executed. A JPEG in this directory illustrates the dependence graph being enforced in the application using events.
The ooo_map version additionally illustrates the use of OpenCL map and unmap operations for accessing shared memory between a host and a device. The Map/Unmap protocol can be used instead of read/write protocol on shared memory platforms.
Requires the TI_OCL_CPU_DEVICE_ENABLE environment variable to be set. For details, refer to Environment Variables.
null example¶
This application is intended to report the time overhead that OpenCL requires to submit and dispatch a kernel. A null (empty) kernel is created and dispatched so that the OpenCL profiling times queried from the OpenCL events reflect only the OpenCL overhead necessary to submit and execute the kernel on the device. This overhead is for the round trip of a single kernel dispatch. In practice, when multiple tasks are being enqueued, this overhead is pipelined with execution and can approach zero.
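As a rough illustration of how such an overhead can be broken down, the sketch below takes the four standard OpenCL event profiling timestamps (CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START, and CL_PROFILING_COMMAND_END, all in nanoseconds) and computes the interval between each stage. The `EventTimes` and `Overhead` structs and the `dispatch_overhead` function are hypothetical, and the breakdown is an assumption about how one might report the numbers, not a description of what the null example prints.

```cpp
#include <cassert>
#include <cstdint>

// Timestamps in nanoseconds, as OpenCL event profiling reports them.
struct EventTimes {
    std::uint64_t queued;  // CL_PROFILING_COMMAND_QUEUED
    std::uint64_t submit;  // CL_PROFILING_COMMAND_SUBMIT
    std::uint64_t start;   // CL_PROFILING_COMMAND_START
    std::uint64_t end;     // CL_PROFILING_COMMAND_END
};

// Elapsed nanoseconds between consecutive stages, plus the round trip.
struct Overhead {
    std::uint64_t queue_to_submit;
    std::uint64_t submit_to_start;
    std::uint64_t start_to_end;
    std::uint64_t total;
};

Overhead dispatch_overhead(const EventTimes& t) {
    // The stages must be in chronological order.
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);
    return {t.submit - t.queued, t.start - t.submit, t.end - t.start,
            t.end - t.queued};
}
```

For an empty kernel, `start_to_end` contains essentially no useful work, so `total` approximates the cost of the dispatch machinery itself.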
sgemm example¶
This example illustrates how to efficiently offload the CBLAS SGEMM routine (single precision matrix multiply) to the DSPs using OpenCL. The results obtained on the DSP are compared against a cblas_sgemm call on the ARM. The example reports performance in GFlops for both DSP and ARM variants.
dgemm example¶
This example illustrates how to efficiently offload the CBLAS DGEMM routine (double precision matrix multiply) to the DSPs using OpenCL. The results obtained on the DSP are compared against a cblas_dgemm call on the ARM. The example reports performance in GFlops for both DSP and ARM variants.
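For both GEMM examples, the comparison and the reported metric can be sketched in plain C++ as follows. This is a naive reference, not the optimized DSP code or the CBLAS routine: the row-major layout, the function names `gemm_ref`, `max_abs_diff`, and `gflops`, and the use of the conventional 2·M·N·K operation count for an M×K by K×N multiply are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive row-major reference: C (M x N) = A (M x K) * B (K x N).
template <typename T>
std::vector<T> gemm_ref(const std::vector<T>& A, const std::vector<T>& B,
                        std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<T> C(M * N, T(0));
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k) {
            const T a = A[i * K + k];
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
    return C;
}

// Largest element-wise absolute difference between two results.
template <typename T>
T max_abs_diff(const std::vector<T>& x, const std::vector<T>& y) {
    assert(x.size() == y.size());
    T worst = T(0);
    for (std::size_t i = 0; i < x.size(); ++i)
        worst = std::max(worst, static_cast<T>(std::fabs(x[i] - y[i])));
    return worst;
}

// Billions of floating-point operations per second, counting one multiply
// and one add per inner-product term.
double gflops(std::size_t M, std::size_t N, std::size_t K, double seconds) {
    assert(seconds > 0.0);
    return 2.0 * static_cast<double>(M) * static_cast<double>(N) *
           static_cast<double>(K) / seconds / 1e9;
}
```

When comparing results computed on two different processors, a small tolerance on `max_abs_diff` is usually more appropriate than exact equality, since the order of floating-point accumulation may differ.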
conv1d example¶
This example illustrates step by step how to optimize a 1D convolution kernel applied to 2D data. The results obtained on the DSP are compared against the same computation performed on the ARM. Optimization techniques include software pipelining improvement, SIMDization, and asynchronous data movement with double buffering into faster memory to overlap computation with data movement. Details can be found in Example: Optimizing 1D convolution kernel.
Note
The conv1d example is available in Processor SDK version >= 3.3.
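As a point of reference for what "a 1D convolution kernel applied to 2D data" means, an unoptimized version can be sketched in plain C++ as follows. This is not the example's code: the function name `conv1d_rows`, the zero padding at the row borders, the centred odd-length filter, and the sliding weighted-sum (correlation) form are all assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Applies an odd-length 1D filter along every row of a row-major
// rows x cols image. Samples outside a row are treated as zero.
std::vector<float> conv1d_rows(const std::vector<float>& in, int rows,
                               int cols, const std::vector<float>& filt) {
    assert(rows > 0 && cols > 0);
    assert(in.size() == static_cast<std::size_t>(rows) * cols);
    assert(filt.size() % 2 == 1);
    const int taps = static_cast<int>(filt.size());
    const int half = taps / 2;
    std::vector<float> out(in.size(), 0.0f);
    for (int r = 0; r < rows; ++r) {
        const std::size_t base = static_cast<std::size_t>(r) * cols;
        for (int c = 0; c < cols; ++c) {
            float acc = 0.0f;
            for (int k = 0; k < taps; ++k) {
                const int src = c + k - half;  // input column for this tap
                if (src >= 0 && src < cols)
                    acc += filt[k] * in[base + src];
            }
            out[base + c] = acc;
        }
    }
    return out;
}
```

The border check inside the innermost loop is the kind of branch that an optimized version would typically hoist out, so that the interior of each row becomes a straight-line loop that is easier to software pipeline and vectorize.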
edmamgr example¶
This application illustrates how to use the edmamgr API to asynchronously move data around the DSP memory hierarchy from OpenCL C kernels. The edmamgr.h header file in this directory enumerates the APIs available from the edmamgr package.
edmabw example¶
This application measures the average data transfer times between different memory regions (DDR, MSMC, L2 SRAM) for a DSP core using EDMA operations via the async_work_group_copy API. It also demonstrates the use of sub devices via the C++ API and the __dsp_frequency() builtin function within the OpenCL C kernel.
dspheap example¶
This application illustrates how to use the user defined heaps feature to allow C code called from OpenCL C code to define and use custom heaps on the DSP devices. See User Defined DSP Heap Extension.
abort_exit example¶
This example illustrates how to call abort() or exit() in kernel code for early kernel termination, and how to check the corresponding kernel event status to determine whether abort() or exit() has been called. The two extended kernel event statuses are CL_ERROR_KERNEL_ABORT_TI and CL_ERROR_KERNEL_EXIT_TI. Note that these two functions can be called from either OpenCL C code or standard C code.
Note
The latest TI RTOS migrated to use newlib-nano and disabled C++ exceptions (see limitations of newlib-nano libc). As a result, in OpenCL RTOS setup, this example won’t run to full completion. OpenCL Linux is not affected.
timeout example¶
This example illustrates how to query the OpenCL device queue properties for the timeout extension, how to create a command queue with the timeout property, how to set a timeout on a kernel, and how to query the kernel event status to determine whether a timeout has occurred. Details of the timeout extension can be found in Setting Timeout Limit on OpenCL Kernels.
Note
The latest TI RTOS migrated to use newlib-nano and disabled C++ exceptions (see limitations of newlib-nano libc). As a result, in OpenCL RTOS setup, this example won’t run to full completion. OpenCL Linux is not affected.
Note
The following examples are available only on 66AK2x:
- mandelbrot, mandelbrot_native
- vecadd_mpax, vecadd_mpax_openmp (not available on 66AK2G)