Example: Optimizing 1D convolution kernel

Note

This example is available in Processor SDK version >= 3.3

Overview

In this conv1d example, shipped with the OpenCL product, we show how to optimize an OpenCL 1D convolution kernel step by step. In general, there are three areas we want to optimize:

  1. Instruction pipeline: has the loop been software pipelined at a low II (initiation interval)?
  2. SIMD efficiency: have the SIMD instructions available on the device (e.g. C66x) been fully utilized?
  3. Memory hierarchy performance: can the input and output data be EDMAed with double buffering into faster memory to overlap computation and data movement?

Summary of results

The example 1D convolution kernel is applied to each row of a 2D data set, which could represent an image, a collection of independent channels, and so on. The 1D convolution kernel/filter size is 5x1. We write a generic kernel for asymmetric filters; if your filter is symmetric, you are welcome to optimize away two of the multiplications.

We use the timing of a simple, straightforward kernel as the baseline and report the speedup of the optimized versions against that baseline. Execution time is measured on the host side with a 1920x1080 input image and reported in microseconds. The same experiments are conducted on both AM572x (2 DSP cores) and K2H (8 DSP cores) EVMs. By varying NUMCOMPUNITS in the code, we obtain the performance of dispatching the kernel to 1, 2, 4 and 8 DSP cores. Each experiment is repeated 5 times and the minimum in each category is reported here.

Each cell shows execution time (us) / speedup vs. k_baseline.

Kernel version        AM57 1 Core    AM57 2 Cores   K2H 1 Core     K2H 2 Cores    K2H 4 Cores    K2H 8 Cores
host_compute          22796 / 1.40   22796 / 0.74   25150 / 0.56   25150 / 0.65   25150 / 0.65   25150 / 0.58
k_baseline            31957 / 1.00   16997 / 1.00   13964 / 1.00   16415 / 1.00   16395 / 1.00   14595 / 1.00
k_loop                30397 / 1.05   16769 / 1.01   13084 / 1.07   14937 / 1.10   16593 / 0.99   14048 / 1.04
k_loop_simd           18075 / 1.77    9986 / 1.70    8403 / 1.66    9700 / 1.69   11867 / 1.38   10579 / 1.38
k_loop_db             12783 / 2.50    6548 / 2.60    8498 / 1.64    4407 / 3.72    2741 / 5.98    3086 / 4.73
k_loop_simd_db         9816 / 3.26    5072 / 3.35    6213 / 2.25    3254 / 5.04    2681 / 6.11    2989 / 4.88
k_loop_simd_db_extc    8200 / 3.90    4321 / 3.93    5428 / 2.57    2854 / 5.75    2800 / 5.85    3056 / 4.78

Note

This example hits the memory bandwidth constraint on K2H beyond 2 DSP cores. In other words, given the maximum rate at which data can be transferred onto the chip, there simply isn't enough computation to keep more DSP cores busy. If the computation-to-data ratio is higher, we do see benefits from using more DSPs; sgemm and dgemm, both shipped with the OpenCL product, are two such examples.

Example    AM572x 2 DSP cores    K2H 8 DSP cores
sgemm      13.378 GFlops         93.654 GFlops
dgemm       3.609 GFlops         25.891 GFlops

In what follows, we go through the optimizations that we applied to achieve the performance improvements.

Driver code setup

The 1D convolution kernel is applied to a 2D image. We wrote driver code that initializes a 2D image with random data and calls the OpenCL kernels accordingly. We chose the HD image size 1920x1080 for both input and output. The same kernel computation is performed on the host (ARM) side and its results are compared against the kernel results from the DSP to ensure correctness. Performance is measured on the host side as the elapsed time from just before enqueuing the kernel until the kernel has finished.

Initially we partitioned the OpenCL global size (1920, 1080) into NUMCOMPUNITS workgroups, each of local size (1920, 1080/NUMCOMPUNITS), so that each DSP core gets one workgroup.
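A minimal host-side sketch of this dispatch and timing, using the OpenCL C API, is shown below; queue, kernel and NUMCOMPUNITS are assumed to be set up elsewhere, and the actual driver code in the SDK may differ (for example, by using the C++ bindings).

/* Hypothetical host-side dispatch and timing (requires <CL/cl.h> and <time.h>).
   A (1920, 1080) global range is split into NUMCOMPUNITS workgroups so that
   each DSP core gets one workgroup. */
size_t gsz[2] = { 1920, 1080 };
size_t lsz[2] = { 1920, 1080 / NUMCOMPUNITS };

struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, gsz, lsz, 0, NULL, NULL);
clFinish(queue);                          /* wait until the kernel finishes */
clock_gettime(CLOCK_MONOTONIC, &t1);
double usecs = (t1.tv_sec  - t0.tv_sec ) * 1e6
             + (t1.tv_nsec - t0.tv_nsec) * 1e-3;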

k_baseline: Ensure correct measurements

TI’s OpenCL runtime lazily loads the device program upon the first enqueue of a kernel from that program, so the elapsed time of the first enqueue will be longer because it includes loading the program. To remove this program loading overhead from the kernel measurements, we can enqueue a null kernel before running the other kernels.
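One way to do this is sketched below; the kernel name k_null and the host variable names are assumptions.

/* In the .cl file, define an empty kernel (hypothetical name):
       kernel void k_null() { }
   On the host, enqueue it once so the device program is loaded
   before timing the real kernels. */
cl_int err;
cl_kernel null_k = clCreateKernel(program, "k_null", &err);
size_t one = 1;
clEnqueueNDRangeKernel(queue, null_k, 1, NULL, &one, &one, 0, NULL, NULL);
clFinish(queue);        /* block until program load + null kernel complete */
clReleaseKernel(null_k);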

k_baseline: Check software pipelining

We can look at the assembly output to see whether the compiler has successfully software pipelined the kernel. For the original 2-dimensional kernel, the compiler adds two implicit loops around the kernel body to form an OpenCL workgroup and tries to software pipeline the innermost loop. Run the compiler with the -k option to keep the assembly code:

clocl -k ti_kernels.cl

Look at the k_conv1d_5x1 function in ti_kernels.asm and search for SOFTWARE PIPELINE; we see these two lines:

;*      Searching for software pipeline schedule at ...
;*         ii = 7  Schedule found with 5 iterations in parallel

So the original kernel has already been software pipelined. A loop iteration in the innermost dimension is started every 7 cycles.

k_loop: Improve software pipelining

A closer look at the baseline source code shows that the loop is handling boundary conditions. If we peel those boundary iterations out of the main loop, the main loop may be scheduled at a lower ii. To do that, we also reduce the OpenCL kernel work space from 2D to 1D so that the innermost loop becomes explicit in the kernel code. Kernel k_loop is the result of these transformations. From the assembly file, we can see that the main loop is scheduled at ii = 3, which means an iteration is started every 3 cycles:

;*      Searching for software pipeline schedule at ...
;*         ii = 3  Schedule found with 10 iterations in parallel

Summary

  1. Make col-dimension loop explicit in the kernel, reduce OpenCL kernel work space from 2D to 1D
  2. Peel the boundary conditions and remove the boundary checks

Alternatively, you may pad the input data or reduce the output size so that the boundary conditions go away.
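A minimal sketch of such a peeled, 1D-workspace kernel is shown below. The signature, variable names, and the zero-padding boundary policy are assumptions; the shipped k_loop may differ in these details.

/* Sketch of a peeled, 1D-workspace 5x1 convolution kernel.
   in/out are row-major images with 'cols' columns; filt holds the 5 taps. */
kernel void k_loop(global const float *in, global float *out,
                   global const float *filt, int cols)
{
    int row = get_global_id(0);
    global const float *r = in  + row * cols;
    global float       *o = out + row * cols;

    /* Main loop: no boundary checks, so the compiler can software
       pipeline it at a low ii. */
    for (int x = 2; x < cols - 2; x++)
        o[x] = r[x-2]*filt[0] + r[x-1]*filt[1] + r[x]*filt[2]
             + r[x+1]*filt[3] + r[x+2]*filt[4];

    /* Peeled boundary columns (assumed policy: out-of-range inputs
       are treated as zero). */
    for (int x = 0; x < 2; x++) {
        float acc = 0.0f;
        for (int k = -2; k <= 2; k++)
            if (x + k >= 0) acc += r[x + k] * filt[k + 2];
        o[x] = acc;
    }
    for (int x = cols - 2; x < cols; x++) {
        float acc = 0.0f;
        for (int k = -2; k <= 2; k++)
            if (x + k < cols) acc += r[x + k] * filt[k + 2];
        o[x] = acc;
    }
}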

Even with the reduced ii, we did not see much execution-time improvement compared to the baseline version. One likely reason is that software pipeline stalls caused by cache misses dominate the execution time. It is time to optimize for the memory hierarchy; but before doing that, let's see if we can exploit the SIMD features available on C66 DSPs.

k_loop_simd: Improve software pipelining with SIMDization

Sometimes the compiler may not be able to auto-SIMDize the loop. We can look at the memory accesses and computations involved and perform the SIMDization by hand. Due to OpenCL C vector semantics, we have to assume that each row is properly aligned on an 8-byte boundary in order to use the float2 vector type. First we SIMDize the memory accesses and computations; next we look for opportunities to keep loaded values pipelined in registers. k_loop_simd is the result of this SIMDization. From the assembly, we can see that an unrolled iteration (corresponding to two baseline iterations) is started every 5 cycles:

;*      Searching for software pipeline schedule at ...
;*         ii = 5  Schedule found with 5 iterations in parallel

Summary

  1. Unroll col-loop by a factor of 2 by hand
  2. Data layout requirement: each row is aligned on 8-byte double word boundary
  3. SIMDize loads and stores
  4. SIMDize computation
  5. Pipeline loaded values in registers if possible
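Building on the k_loop sketch above, the SIMDized inner loop might look like the following. This is again an assumed sketch rather than the shipped k_loop_simd; r, o, filt and cols are as in the earlier sketch, and each row is assumed to start on an 8-byte boundary.

/* Two outputs per iteration: loads and stores use float2, and the
   values loaded in one iteration are reused (pipelined in registers)
   by the next.  Boundary columns are handled separately as before. */
float2 prev = vload2(0, &r[0]);              /* r[x-2], r[x-1] */
for (int x = 2; x < cols - 2; x += 2) {
    float2 cur  = vload2(0, &r[x]);          /* r[x],   r[x+1] */
    float2 next = vload2(0, &r[x + 2]);      /* r[x+2], r[x+3] */

    float2 res;
    res.s0 = prev.s0*filt[0] + prev.s1*filt[1] + cur.s0*filt[2]
           + cur.s1 *filt[3] + next.s0*filt[4];
    res.s1 = prev.s1*filt[0] + cur.s0 *filt[1] + cur.s1*filt[2]
           + next.s0*filt[3] + next.s1*filt[4];
    vstore2(res, 0, &o[x]);

    prev = cur;                              /* pipeline loaded values */
}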

k_loop_db: EDMA and double buffer k_loop

TI’s OpenCL implementation designates part of the L2 SRAM on each core as OpenCL local memory. We can use EDMA to move data from global buffers (DDR) into local buffers (L2), perform the computation on the local buffers, then store the results from the local buffers (L2) back to the global buffers (DDR). The OpenCL C kernel language has built-in async_work_group_*() functions, which we map to TI’s EDMA routines. To best utilize the asynchronous nature of EDMA, we use double buffering (ping-pong) to overlap data movement with computation.

For this particular kernel, each row requires COLS*sizeof(float) bytes for input and COLS*sizeof(float) bytes for output. With double buffering, each row therefore requires 16 * COLS bytes. Given COLS = 1920, we can fit a maximum of four rows into the 128KB local memory, or a maximum of 25 rows into the 768KB local memory:

4  * (2 * (1920*4 + 1920*4)) <= 128 * 1024
25 * (2 * (1920*4 + 1920*4)) <= 768 * 1024

To ensure that the double buffering pipeline executes at least a few times, say 8, we cap BLOCK_HEIGHT at ROWS / NUMCOMPUNITS / 8 + 1. In the kernel, before computing the current block of rows in local memory, we prefetch the next block of rows into local memory with EDMA.
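A quick sketch of that sizing logic is shown below; LOCAL_BYTES, COLS, ROWS and NUMCOMPUNITS are placeholders for the values used by the example rather than names taken from its source.

/* Pick a block height that both fits in local memory (double buffered,
   input + output) and gives the pipeline at least ~8 iterations. */
int max_fit      = LOCAL_BYTES / (2 * 2 * COLS * sizeof(float));
int block_height = ROWS / NUMCOMPUNITS / 8 + 1;
if (block_height > max_fit)
    block_height = max_fit;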

Another transformation is that the kernel now also iterates explicitly through the row dimension, as required by the double buffering. Accordingly, we set the required kernel work group size to (1, 1, 1). In the host code, we only need to specify the number of workgroups, which we set to the number of compute units, when enqueuing the NDRange kernel.

We added three arguments to the kernel: the block height, a local buffer for input, and a local buffer for output. Local buffers are allocated automatically by the OpenCL runtime; the OpenCL application code only needs to specify their sizes.
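Setting these arguments on the host might look like the sketch below; the argument indices and the BLOCK_HEIGHT and COLS names are assumptions.

/* Passing NULL with a size asks the runtime to allocate __local memory
   of that size for the kernel argument. */
int    blk        = BLOCK_HEIGHT;
size_t lbuf_bytes = 2 * BLOCK_HEIGHT * COLS * sizeof(float);  /* ping + pong */
clSetKernelArg(kernel, 4, sizeof(int), &blk);
clSetKernelArg(kernel, 5, lbuf_bytes, NULL);    /* local input buffer  */
clSetKernelArg(kernel, 6, lbuf_bytes, NULL);    /* local output buffer */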

With all these transformations, we see that the non-SIMDized k_loop_db outperforms not only k_loop but also the SIMDized k_loop_simd.

Summary

  1. Require 8-byte alignment for each row
  2. Determine the block height for double buffering
  3. Set required work group size to (1,1,1) for kernel
  4. Set OpenCL workspace to (NUMCOMPUNITS, 1, 1), each work group will figure out which rows to work on
  5. Double buffer with EDMA on input and output, computation only loads from and stores to local buffers
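The overall structure of the double-buffered kernel is sketched below. This is not the shipped k_loop_db: the signature, names, and the assumption that each work group's row count is a multiple of the block height are all hypothetical, and the boundary columns are omitted for brevity.

/* lin/lout are the __local buffers passed from the host; each holds
   two blocks (ping + pong). */
kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void k_loop_db(global const float *in, global float *out,
               global const float *filt, int cols,
               int rows_per_wg, int blk,
               local float *lin, local float *lout)
{
    int first = get_group_id(0) * rows_per_wg;   /* first row of this WG */
    int nblks = rows_per_wg / blk;
    event_t ev_in[2], ev_out[2];

    /* Prefetch block 0 into the "ping" half of the input buffer. */
    ev_in[0] = async_work_group_copy(lin, in + first * cols, blk * cols, 0);

    for (int b = 0; b < nblks; b++) {
        int cur = b & 1, nxt = cur ^ 1;

        /* Start the EDMA prefetch of the next block while we compute
           the current one. */
        if (b + 1 < nblks)
            ev_in[nxt] = async_work_group_copy(
                lin + nxt * blk * cols,
                in + (first + (b + 1) * blk) * cols, blk * cols, 0);

        wait_group_events(1, &ev_in[cur]);        /* input block ready    */
        if (b >= 2)
            wait_group_events(1, &ev_out[cur]);   /* output half reusable */

        /* Compute: same per-row convolution as k_loop, but reading from
           and writing to local (L2) buffers. */
        local float *src = lin  + cur * blk * cols;
        local float *dst = lout + cur * blk * cols;
        for (int rr = 0; rr < blk; rr++, src += cols, dst += cols)
            for (int x = 2; x < cols - 2; x++)
                dst[x] = src[x-2]*filt[0] + src[x-1]*filt[1]
                       + src[x  ]*filt[2] + src[x+1]*filt[3]
                       + src[x+2]*filt[4];

        /* Write the finished block back to DDR with EDMA. */
        ev_out[cur] = async_work_group_copy(
            out + (first + b * blk) * cols,
            lout + cur * blk * cols, blk * cols, 0);
    }

    /* Drain the last one or two outstanding output transfers. */
    wait_group_events(1, &ev_out[(nblks - 1) & 1]);
    if (nblks > 1)
        wait_group_events(1, &ev_out[nblks & 1]);
}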

k_loop_simd_db: EDMA and double buffer k_loop_simd

We apply the same EDMA and double buffering transformation to k_loop_simd, and we see a similar performance improvement over k_loop_simd.

k_loop_simd_db_extc: Use external C function for k_loop_simd_db

While we can handle this example completely in the OpenCL C language, OpenCL C sometimes has limitations in expressiveness with respect to our C66 DSP. For example, the C66 DSP can perform more EDMA transfer patterns than the async_work_group_*() OpenCL C built-in functions can represent, and the C66 DSP supports non-aligned SIMD loads and stores. When these limitations affect user applications, we can work around them in standard C functions and invoke those functions from within OpenCL C code.

We use this version as an example of how to incorporate standard C functions into OpenCL. We move the body of k_loop_simd_db into an external C function and treat the OpenCL C kernel as a simple wrapper. The external C function is compiled with the C66 C compiler, so you can use C66 C intrinsics in it. Similarly, you can reuse existing optimized C implementations and libraries, whether developed by yourself or by TI. Of course, this is a TI extension and is not applicable to OpenCL platforms from other vendors.

c_loop_simd_db_extc() in k_ext.c is the rewritten C function. Note the explicit use of EdmaMgr functions and C66 SIMD intrinsics. With this version, we get slightly better performance.
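The wrapper pattern is sketched below; the actual argument list of c_loop_simd_db_extc is an assumption. The .cl file declares the external C function, and the kernel simply forwards its arguments; the C file is compiled separately and its object file is linked back into the kernel executable (see the summary below).

/* Hypothetical wrapper in the .cl file; the body lives in k_ext.c. */
void c_loop_simd_db_extc(global const float *in, global float *out,
                         global const float *filt, int cols, int blk,
                         local float *lin, local float *lout);

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void k_loop_simd_db_extc(global const float *in, global float *out,
                         global const float *filt, int cols, int blk,
                         local float *lin, local float *lout)
{
    c_loop_simd_db_extc(in, out, filt, cols, blk, lin, lout);
}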

Summary

  1. Move kernel body to an external standard C function
  2. Use EdmaMgr_*() functions directly, which can handle non-consecutive transfers
  3. Use C66 C SIMD intrinsic built-in functions, which can handle non-aligned SIMD loads and stores
  4. Link separately compiled C object back to kernel executable