4. TI Extensions

TI has implemented two extensions to the OpenMP Accelerator Model:

  1. An API to reduce the overhead of offloading target regions
  2. A local map type that maps variables to the target's L2 scratchpad memory, which has much lower access latency than MSMC SRAM or DDR.

4.1. Reducing offload overhead

Data synchronization between the host and target device can be a significant source of overhead. This overhead has implications for the amount of computation that needs to be performed by a target region to outweigh the data synchronization overhead. On a 66AK2x SoC, the host and target device share internal and external memory. However:

  • the target device does not have a memory management unit (MMU); and
  • there is no hardware cache coherency between the target and host device.

As a result, the host accesses shared memory using virtual addresses while the target accesses it using physical addresses. Moreover, host variables can span multiple non-contiguous pages in Linux virtual memory, whereas the target device operates on contiguous physical memory. When variables are mapped from the Linux process space, they must first be copied into contiguous memory for the target to operate on. This copy is inefficient, especially for large variables. To eliminate it, the implementation provides a special-purpose dynamic memory allocation API, __malloc_ddr and __malloc_msmc. The physical memory backing these heaps is contiguous and is mapped to a contiguous range of virtual memory on the host. If host variables allocated via this API are mapped into target regions, the map clauses translate to cache management operations on the host, significantly reducing the overhead.

The following TI-provided functions can be used to allocate and free contiguous segments of memory that are accessible to all ARM and DSP cores:

  • __malloc_ddr: Allocate a buffer in contiguous DDR memory
  • __free_ddr: Free a buffer allocated with __malloc_ddr
  • __malloc_msmc: Allocate a buffer in contiguous MSMC memory
  • __free_msmc: Free a buffer allocated with __malloc_msmc

These functions have the same interface as standard malloc/free. Note that when using these functions within an OpenMP application, any required memory consistency operations are automatically handled by the supporting runtime systems.
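
Because these functions mirror the standard malloc/free interface, their declarations look roughly as sketched below. This is an illustrative sketch only; the exact header that declares them, and any qualifiers, may differ between Processor SDK releases.

void *__malloc_ddr (size_t size);   /* contiguous, ARM/DSP-shared DDR  */
void  __free_ddr   (void *ptr);     /* release a __malloc_ddr buffer   */
void *__malloc_msmc(size_t size);   /* contiguous, on-chip MSMC SRAM   */
void  __free_msmc  (void *ptr);     /* release a __malloc_msmc buffer  */

The example below uses __malloc_ddr to allocate a buffer that is then mapped into a target region: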

/* Number of elements and buffer size in bytes (example values) */
int    size    = 1024;
size_t bufsize = size * sizeof(float);

/* Allocate buffer in contiguous, device-visible DDR memory */
float* a = (float*) __malloc_ddr(bufsize);

/* Initialize buffer on the host */
for (int i = 0; i < size; ++i)
{
    a[i] = 1.0;
}

/* Because 'a' was allocated with __malloc_ddr:
   map(to:)   is a cache write-back operation on the host; no copy is performed.
   map(from:) is a cache invalidate operation on the host; no copy is performed. */
#pragma omp target map(to: a[0:size], size) map(from: a[0:size])
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < size; i++)
        a[i] *= 2.0;
}

/* Free buffer */
__free_ddr(a);

The table below lists data movement and cache operations that are performed by the Accelerator Model runtime for the various target constructs. This information is specific to TI’s implementation of the OpenMP Accelerator Model.

  Location of host variable                     | To target                               | From target
  ----------------------------------------------+-----------------------------------------+----------------------------------------
  Linux process memory (stack/heap/global)      | Memcpy from host to target memory       | Memcpy from target to host memory
  Allocated via __malloc_ddr or __malloc_msmc   | ARM cache write-back of address range   | ARM cache invalidate of address range

The same operations apply whether the data movement is triggered by map clauses on target or target data constructs or by target update directives.
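
As an illustration of the second row, the sketch below keeps a __malloc_ddr buffer mapped across a target data region and refreshes the host's view with a target update directive; with this storage, the update reduces to an ARM cache operation rather than a copy. The buffer length N and the scaling loop are example values, not taken from TI's documentation.

#define N 1024

float* buf = (float*) __malloc_ddr(N * sizeof(float));

for (int i = 0; i < N; ++i)
    buf[i] = 1.0f;

/* 'buf' comes from __malloc_ddr, so this map(to:) is an ARM cache
   write-back of the address range; no memcpy is performed. */
#pragma omp target data map(to: buf[0:N])
{
    #pragma omp target map(to: buf[0:N])
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            buf[i] *= 2.0f;
    }

    /* Make the DSP's results visible to the host: for __malloc_ddr storage
       this is an ARM cache invalidate of the range, not a copy. */
    #pragma omp target update from(buf[0:N])
}

__free_ddr(buf);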

4.2. local map type

An additional *local* map type has been added, which maps a variable to the L2 scratchpad memory on the target. In terms of data synchronization, such variables are treated like map-type alloc: they have an undefined initial value on entry to the target region, and any updates made to them within the target region are not reflected back to the host. Mapping host variables to target scratchpad memory can provide significant performance improvements; see the example following the table below.

Each DSP core has an independent L2 memory. The following table shows the possible per-core L2 memory configurations:

  Total L2 SRAM                      1MB     1MB     512KB   1MB
  L2 allocated for user scratchpad   768KB   768KB   256KB   832KB
  L2 allocated for runtime           128KB   128KB   128KB   64KB
  L2 allocated for cache             128KB   128KB   128KB   128KB

The following example uses the local map type to place a scratch buffer in L2 SRAM:

/* A 1KB scratch buffer */
char* scratch_buffer = malloc(1024);

#pragma omp target map(tofrom: a[0:size]) map(local: scratch_buffer[0:1024])
{
    /* a[] is copied to the Device */
    /* scratch_buffer[] is allocated in L2 SRAM,
       scratch_buffer[] is NOT copied to the Device */

    /* Perform operations on buffer a[] in DDR using the L2 SRAM scratch_buffer */
    operate_on(a, scratch_buffer);

} /* a[] is copied back to the Host,
     scratch_buffer[] is NOT copied back to Host */