Optimization Tips
OpenCL applications consist of a host application and a set of device kernels. Some optimization techniques apply to the host code, some to the device code, and some span the boundary between the two. This section provides tips for writing OpenCL applications that perform well on TI SoCs with DSPs as accelerator devices. The tips are organized by where they apply, i.e. the host or the device.
- Optimization Techniques for Host Code
- Optimization Techniques for Device (DSP) Code
- Prefer Kernels with 1 work-item per work-group
- Use Local Buffers
- Use async_work_group_copy and async_work_group_strided_copy
- Avoid DSP writes directly to DDR
- Use the reqd_work_group_size attribute on kernels
- Use the TI OpenCL extension that allows Standard C code to be called from OpenCL C code
- Avoid OpenCL C Barriers
- Use the most efficient data type on the DSP
- Do Not Use Large Vector Types
- Consecutive memory accesses
- Prefer the CPU style of writing OpenCL code over the GPU style
- Typical Steps to Optimize Device Code
- Example: Optimizing 1D convolution kernel
- Overview
- Summary of results
- Driver code setup
- k_baseline: Ensure correct measurements
- k_baseline: Check software pipelining
- k_loop: Improve software pipelining
- k_loop_simd: Improve software pipelining with SIMDization
- k_loop_db: EDMA and double buffer k_loop
- k_loop_simd_db: EDMA and double buffer k_loop_simd
- k_loop_simd_db_extc: Use external C function for k_loop_simd_db
- Example: Optimizing 3x3 Gaussian smoothing filter
- Performance Data