OpenMP Acclerator Model

The OpenMP 4.0 specification enables the use of OpenMP on heterogeneous systems by adding support for a set of device constructs. The OpenMP community uses the term OpenMP Accelerator Model to refer to this set. OpenMP 4.0 defines a host device on which the OpenMP programs begin execution, and target devices onto which regions of code can be offloaded.

Following is a brief summary of some of the features of the openMP Accelerator Model. For details on using the TI OpenMP Accelerator model implementation, refer to TI OpenMP-Acc.

Device Constructs supported

This product supports the the following OpenMP 4.0 device constructs and runtime library routines:

  • #pragma omp target
  • #pragma omp declare target
  • #pragma omp target data
  • #pragma omp target update
  • omp_set_default_device()
  • omp_get_default_device()
  • omp_get_num_devices()
  • omp_is_initial_device()

#pragma omp target

The target construct is used to specify the region of code that should be offloaded for execution onto the target device. The construct also creates a device data environment by mapping host buffers to the target for the extent of the associated region. Map clauses on the target construct specify data movement from host to device before execution of the offloaded region, and device to host after execution of the offloaded region. Map types alloc, to, from and tofrom indicate data allocation and transfer direction.

float a[1024];
float b[1024];
float c[1024];
int size;

void vadd_openmp(float *a, float *b, float *c, int size)
{
    #pragma omp target map(to:a[0:size],b[0:size],size) map(from: c[0:size])
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < size; i++)
            c[i] = a[i] + b[i];

    }
}

In the example above, variables a, b, c and size initially reside in host (ARM Linux) memory. Upon encountering a target construct:

  • Space is allocated in device memory for variables a[0:size], b[0:size], c[0:size] and size.
  • Any variables annotated ‘to’ are copied from host memory → device memory.
  • The target region is executed on the device. Note that #pragma omp parallel for is used to distribute iterations of the for loop across the 8 DSP cores.
  • Any variables annotated ‘from’ are copied from device memory → host memory.

#pragma omp declare target

The declare target construct specifies that variables and functions are mapped to a device. Each function specified in a declare target region must have definitions for both the host and target device. In the code snippet below, function DSPF_sp_fftSPxSP is called from within a target region. It’s prototype must be placed in a declare target region.

#pragma omp declare target
/* There must be a host and accelerator target definition for this function */
void DSPF_sp_fftSPxSP(int N,
                      float *x, float *w, float *y,
                      unsigned char *brev,
                      int n_min, int offset, int n_max);
#pragma omp end declare target

void dsplib_fft(int N, int bufsize,
                float* x, float* w, float *y,
                int n_min, int offset, int n_max)
{
    #pragma omp target map(to: N, x[0:bufsize], w[0:bufsize], \
                               n_min, offset, n_max) \
                       map(from: y[0:bufsize])
    {
        DSPF_sp_fftSPxSP (N, x, w, y, 0, n_min, offset, n_max);
    }
}

#pragma omp target data

The target data construct creates a device data environment by mapping host buffers to the target for the extent of the associated region. Target regions may be executed within a device data environment. Furthermore, mapped buffers may be re-used by multiple enclosed target regions and are valid for the entire scope of the ‘target data’ region. Target data regions may also be nested.

/* Create device buffers for a, b, c and transfer data from Host -> Device for a,b */
#pragma omp target data map(to:a[0:size], b[0:size]) map(from:c[0:size])
{
   /* Existing device buffers are used and no data is transferred here */
   #pragma omp target
   {
       int i;
       #pragma omp parallel for
       for (i = 0; i < size; i++)
           c[i] += a[i] + b[i];
   }
} /* Device -> Host data transfer of buffer c is done here*/

#pragma omp target update

The target update construct is used to synchronize host or device buffers within a target data region as required. In the following example, buffer c is synchronized to the host in the first target update call and to the device in the second target update call. The update direction is specified using to and from clauses.

void operate_on_host(int* buffer);

#pragma omp declare target
void operate_on_device_initial(int* srcA, int* srcB, int* result);
void operate_on_device_final(int* srcA, int* srcB, int* result);
#pragma omp end declare target

/* Create device buffers for a, b, c and transfer data from Host -> Device for a,b */
#pragma omp target data map(to:a[0:size], b[0:size]) map(from:c[0:size])
{
   /* Existing device buffers are used and no data is transferred here */
   #pragma omp target
   {
      /* Here buffer c is modified with results while operating on input buffers a,b */
      operate_on_device_initial(a,b,c);
   }

   /* Copy modified target data to the host */
   #pragma omp target update from(c[0:size])

   /* Do some host side processing */
   operate_on_host(c);

   /* Synchronize c with target device data buffer */
   #pragma omp target update to(c[0:size])

   /* Offload more computation to target device */
   #pragma omp target
   {
      operator_on_device_final(a,b,c);
   }

} /* Device -> Host data transfer of buffer c is done here again*/