offloading API overview

Target construct:

A. Syntax
An OpenMP program starts executing on the host device. When a host thread encounters a target construct, the target region is executed on the target device, and the host thread waits until execution on the device completes. If no target device is available, the target region is executed by the host device instead. In C, a target region is created by placing the following directive before the region:

#pragma omp target [clause[ [,] clause] ... ] new-line

The figure below shows how threads are created on the host and on the devices for an OpenMP accelerator.

Example.1 for OpenMP offloading is available here.

This example uses the target data construct and the map clause, both of which are discussed later in this tutorial.

In order to run the first example you can use this Makefile.
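For orientation, a minimal target region might look like the sketch below. It is not the linked Example.1; the function name saxpy_target and its parameters are assumptions made for illustration. If no device is present, or the file is compiled without OpenMP support, the loop simply runs on the host.

```c
/* Hypothetical helper (not the tutorial's Example.1): computes y += a*x
   inside a target region on the default device. */
void saxpy_target(int n, float a, const float *x, float *y)
{
    /* x is only read on the device; y is read and written. */
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
    /* The host thread continues here only after the region has finished. */
}
```

Because the host thread waits at the end of the target region, y can be read on the host as soon as the function returns.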

B. Clauses

device clause

A specific target device can be selected by adding the device clause to the target construct. If no device clause is present, the default device is used as the target device (as in Example.1).

#pragma omp target device (0)

Example.2:


      #pragma omp target device(0)
      #pragma omp parallel for private(i)
      for (i = 0; i < n; i++)
          y[i] += a * x[i];

If device 0 is an Intel MIC coprocessor, the example can be compiled with the Intel compiler as shown below; the same command is used in the Makefile for Example.1.

icc -O0 -openmp example2_offload.c -o example_off
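When the device number is only known at run time, it can be checked against omp_get_num_devices() before it is passed to the device clause. The following is a sketch under assumptions: the helper names available_devices and saxpy_on_device are hypothetical, and the _OPENMP guard is only there so that the file also builds without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of target devices reported by the runtime; 0 when the code is
   built without OpenMP support. */
static int available_devices(void)
{
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

/* Hypothetical helper (not from the tutorial's examples): computes
   y += a*x on device `dev` if that device exists, otherwise on the host. */
void saxpy_on_device(int dev, int n, float a, const float *x, float *y)
{
    if (dev >= 0 && dev < available_devices()) {
        #pragma omp target device(dev) map(to: x[0:n]) map(tofrom: y[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    } else {
        /* No such device: plain host loop. */
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }
}
```

Checking the device number explicitly avoids relying on what a particular implementation does when the device clause names a device that does not exist.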

map clause

The map clause is a data-mapping attribute clause. It explicitly maps the original variables on the host device to corresponding variables in the data environment of the target device.

#pragma omp target map(to:x[0:n]) map(from:y[0:n])

The to map type indicates that, at the start of the target region, the corresponding variables on the device are initialized with the values of the original variables on the host. The from map type indicates that the corresponding variables are not initialized at the start of the target region; instead, at the end of the target region, their values are assigned to the original variables on the host device.

Map clauses tell the compiler which data has to be moved between the host and the device, and in which direction. Without the appropriate map clauses the program may not produce correct results. The figure below illustrates the data movement performed by the map clause.

There are various forms of map clause:

map(to:variables): initializes the variables on the target device with their values from the host.

map(from:variables): assigns the values of the variables on the target device to the corresponding data on the host device.

map(tofrom:variables): initializes the variables on the target device from the host and also writes them back from the target device to the corresponding data on the host device.

map(alloc:variables): allocates storage for the variables on the target device without initializing it.

map(variables): if no map type is specified, the clause is treated as map(tofrom:variables).

map(to: x[0:N]): the array section notation must be used when x is a pointer.
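The map types listed above can be combined on a single target construct. The sketch below is an illustration only; the function map_types_demo and its parameters are hypothetical. It picks the map type of each array from how the device uses it.

```c
/* Hypothetical helper illustrating the map types:
     in  : read-only on the device            -> to
     tmp : device scratch space               -> alloc
     acc : needs its host value, is updated   -> tofrom
     out : written on the device, never read  -> from            */
void map_types_demo(int n, const double *in, double *tmp,
                    double *acc, double *out)
{
    #pragma omp target map(to: in[0:n]) map(alloc: tmp[0:n]) \
                       map(tofrom: acc[0:n]) map(from: out[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        tmp[i] = in[i] * in[i];   /* scratch is written before it is read */
        acc[i] += tmp[i];         /* accumulates onto the host value */
        out[i] = 2.0 * tmp[i];    /* plain assignment, old value unused */
    }
}
```

After the call, the host sees the new contents of acc and out. The host copy of tmp should be treated as unspecified, since with a real device it is never copied back.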

Example.3 shows how the map clause can be used in a program. The runnable code can also be downloaded here.

Example.3:


  #pragma omp target map(to:x[0:n]) map(from:y[0:n])
  #pragma omp parallel for private(i)
  for (i = 0; i < n; i++)
      y[i] = a * x[i];   /* y is only written here, so map(from) is sufficient */

if clause

The if clause on the target construct indicates that the device data environment is created only if the condition is met; otherwise the region is executed by the host device. Target constructs enclosed in a target data region must use the same if condition as that region. Example.4 shows how a condition can be added to a target construct. In Example.9 the condition is applied to both the target data construct and the target construct. You can download Example.9 here.

#pragma omp target if(n>THRESHOLD) map(from: p[0:N])

Example.4:


   {
    #pragma omp target if(n>THRESHOLD) map(to:x[0:n], z[0:n]) map(from:y[0:n])
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] = x[i] * z[i];
   }

Asynchronous clauses: nowait and depend

Asynchronous execution of a target region can be accomplished by creating an explicit task around the target region, as shown in Example.5. An explicit task that contains the target region is generated when a thread encounters the task construct. The thread that executes the explicit task waits for the target region to complete, but while waiting it encounters a task scheduling point, which allows it to switch back to the encountering task or to one of the previously generated explicit tasks.

Example.5:


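/* x and y are assumed to be mapped already, e.g. by an enclosing target data region */
for (c = 0; c < n; c += CHUNKSZ)
      {
       #pragma omp target update to(x[c:CHUNKSZ])
       #pragma omp task shared(x,y)
       #pragma omp target
       #pragma omp parallel for
       for (i = c; i < c + CHUNKSZ; i++)   /* work on the current chunk only */
           y[i] += a * x[i];
      }
/* wait for all chunks before copying the result back to the host */
#pragma omp taskwait
#pragma omp target update from(y[0:n])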

The runnable version of this example is available here.

The nowait and depend clauses were added to the target construct in OpenMP 4.5 to improve support for asynchronous execution of target regions.
- nowait clause: indicates that the encountering thread does not wait for the target region to complete. The target region becomes a deferred target task, and the thread can perform other work while the target region executes.

Example.6:


#pragma omp target map(to:x[0:n]) map(from:y[0:n]) nowait
    {
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < n; i++)
        y[i] = a * x[i];
    }

The code of this example is available here.
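A target region launched with nowait has to be waited for before its results are used on the host. The sketch below shows one way to do this with taskwait; the function saxpy_async is a hypothetical helper, not part of Example.6.

```c
/* Hypothetical helper: starts y += a*x as a deferred target task, sums x
   on the host in the meantime, then waits for the target task. */
double saxpy_async(int n, double a, const double *x, double *y)
{
    double host_sum = 0.0;

    /* nowait: the encountering thread does not block here. */
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n]) nowait
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];

    /* Independent host work; it only reads x and never touches y. */
    for (int i = 0; i < n; i++)
        host_sum += x[i];

    /* y must not be read on the host before the target task has finished. */
    #pragma omp taskwait

    return host_sum;
}
```

The host loop is safe to overlap with the target task because both only read x; any host access to y belongs after the taskwait.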

- depend clause: The depend clause can be used to synchronize the target task with other tasks. The following example uses several flow dependences. Because of the first two dependences, the target task does not execute until the preceding explicit tasks have finished. The last dependence is produced by the target task: the final task does not execute until the target task finishes.

Example.7:


 #pragma omp parallel num_threads(2)
     {
      #pragma omp single
        {
          #pragma omp task depend(out:v1)
          init(v1,n);
          #pragma omp task depend(out:v2)
          init(v2,n);
          #pragma omp target nowait depend(in:v1,v2) depend(out:y) \
                                    map(to:v1,v2) map(from:y)
          #pragma omp parallel for private(i)
          for (i = 0; i < n; i++)
              y[i] = v1[i] * v2[i];
          /* runs only after the target task has produced y */
          #pragma omp task depend(in:y)
          output(y,n);
        }
     }

This example is available here for download.

Target data construct

A. Syntax
The target data construct creates a new device data environment and maps the variables listed in its map clauses to that environment. A target construct enclosed in the target data region also creates a new device data environment, which inherits the variables mapped by the target data construct.

#pragma omp target data clause[ [ [,] clause] ... ] new-line

A single target data region can also serve several target regions in order to avoid repeated data transfers. Example.8 shows how the map clauses of a target data construct can be shared by more than one target region. Example.8 is available here for download.

Example.8:


  #pragma omp target data map(to:x[0:n],k[0:n]) map(from:y[0:n],z[0:n])
    {
      int i;
      #pragma omp target
      #pragma omp parallel for private(i)
      for (i = 0; i < n; i++)
          y[i] = a * x[i];
      #pragma omp target
      #pragma omp parallel for private(i)
      for (i = 0; i < n; i++)
          z[i] = a * k[i];
    }

B. Clauses

if clause

The if clause on the target data construct indicates that the device data environment is created only if the condition is met. Target constructs enclosed in the target data region must use the same if condition.

#pragma omp target data if(N>THRESHOLD) map(from: p[0:N])

Example.9:


#pragma omp target data if(n>THRESHOLD) map(from:y[0:n])
     {
       int i;
       #pragma omp target if(n>THRESHOLD) map(to:x[0:n], z[0:n])
       #pragma omp parallel for private(i)
       for (i = 0; i < n; i++)
           y[i] = z[i] * x[i];
     }

This example can be downloaded from this link.

target enter data and target exit data constructs

A structured data construct such as target data provides persistent data on a device for one or more target constructs. Unstructured data constructs such as target enter data and target exit data, on the other hand, allow data to be created and deleted on the target device at any point in the host code. The target enter data construct uses the alloc map type to avoid copying values to the device, and the target exit data construct uses the delete map type to avoid copying data back to the host device.

#pragma omp target enter data map(alloc:x[0:len])
#pragma omp target exit data map(delete:x[0:len])

Example.10:


  void init_matrix(int n, double v[])
  {
      #pragma omp target enter data map(alloc:v[0:n])
  }
  void free_matrix(int n, double v[])
  {
      #pragma omp target exit data map(delete:v[0:n])
  }

This example can be downloaded from this link.
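To show how unstructured mapping fits into a complete data lifetime, the sketch below allocates a buffer on the device, fills it there, copies it back explicitly, and releases it. All four function names are hypothetical and are not part of Example.10.

```c
/* Hypothetical helpers illustrating an unstructured device data lifetime. */

/* Allocate device storage for v without copying host values. */
void device_buffer_create(double *v, int n)
{
    #pragma omp target enter data map(alloc: v[0:n])
}

/* Fill the buffer on the device. v is already present there, so this
   map clause does not trigger a transfer. */
void device_buffer_fill(double *v, int n, double value)
{
    #pragma omp target map(alloc: v[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        v[i] = value * i;
}

/* Copy the device contents back to the host explicitly. */
void device_buffer_fetch(double *v, int n)
{
    #pragma omp target update from(v[0:n])
}

/* Release the device storage without copying anything back. */
void device_buffer_destroy(double *v, int n)
{
    #pragma omp target exit data map(delete: v[0:n])
}
```

A typical call sequence would be create, fill, fetch, destroy. Because neither alloc nor delete moves data, the only transfer in this sequence is the explicit target update.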

Target update construct

A. Syntax
The target update construct is used to synchronize the values of mapped variables. It maintains consistency between the original variables on the host device and the corresponding data on the target device.

#pragma omp target update clause[ [ [,] clause] ... ] new-line

As shown in Example.11, after the first target region the array v1 is assigned new values on the host, and the target update construct is used before the second target region to copy the new values of v1 from the host device to the corresponding data on the target device.

Example.11 is available here for download.

Example.11:


#pragma omp target data map(to:v1[0:n],v2[0:n]) map(from:y[0:n])
     {
      #pragma omp target
      #pragma omp parallel for private(i)
      for (i = 0; i < n; i++)
          y[i] = v1[i] * v2[i];
      init(v1,n);   /* v1 changes on the host */
      #pragma omp target update to(v1[0:n])
      #pragma omp target
      #pragma omp parallel for private(i)
      for (i = 0; i < n; i++)
          y[i] += v1[i] * a;
     }

B. Clauses

if clause

When the if clause is used with the target update construct, the update happens only if the condition is met.

#pragma omp target update if (changed) to(v1[:N])

In Example.12, if v1 is modified on the host after the first target region, the variable changed is set to true. The if condition on the target update construct is then met, and the new values of v1 are copied to the device before the second target region.

Example.12:


     
#pragma omp target data map(to:v1[0:n],v2[0:n]) map(from:y[0:n],y1[0:n])
    {
      #pragma omp target
      #pragma omp parallel for
      for (i = 0; i < n; i++)
          y[i] = v1[i] * v2[i];
      bool changed = change(v1,n);   /* runs on the host */
      #pragma omp target update if(changed) to(v1[0:n])
      #pragma omp target
      #pragma omp parallel for
      for (i = 0; i < n; i++)
          y1[i] = v1[i] * a;
    }

This example is available here for download.

- Array section: A map clause on a target or target data construct can map part of an array. However, two different sections of the same array cannot be mapped in nested constructs unless the section mapped by the inner construct is a subset of the section mapped by the outer one. The example below shows a valid use of array sections.

Example.13 is available here for download.

Example.13:


void foo ()
{
    int A[30], *p;
    #pragma omp target data map( A[0:10] )
    {
        p = &A[0];
        #pragma omp target map( p[3:7] )
        {
            A[2] = 0;
            p[8] = 0;
            A[8] = 1;
        }
    }
}

Declare target construct

A. Syntax
The declare target construct serves two purposes. First, it can specify global variables that are mapped to the device data environment for the whole program. Second, it can prepare a function for a target device, so that the function can be called on the target device as well as on the host device.

#pragma omp declare target new-line
declaration-definition-seq
#pragma omp end declare target new-line

Example.14 illustrates how the static variables x and y are mapped to the device data environment for the whole program by using the declare target construct.

Example.14 is available here for download.

Example.14:


  #pragma omp declare target
  float x[N], y[N];
  #pragma omp end declare target

  #pragma omp target
  #pragma omp parallel for private(i)
  for (i = 0; i < n; i++)
      y[i] += a * x[i];

Example.15 shows how the function pfun is made available on the target device by using the declare target construct. Every function that is called inside a target region must be prepared for the device in this way.

Example.15 is available here for download.

Example.15:


  #pragma omp declare target
  double pfun(int i, double v1[], int a)
  {
      return a * v1[i];
  }
  #pragma omp end declare target

  #pragma omp target update to(x[0:n])
  #pragma omp target
  #pragma omp parallel for
  for (i = 0; i < n; i++)
      y[i] += pfun(i, x, a);   /* pfun is callable on the device */
  #pragma omp target update from(y[0:n])

B. Clauses

link clause

The link clause was introduced in OpenMP 4.5 for the declare target construct to give more control over data mapping on the target device. Static data listed in a link clause is mapped only when it is needed. If the global data does not all fit on the target device at the same time, the link clause allows each variable to be mapped only for the target regions that use it.

#pragma omp declare target link(sp,sv1,sv2)

Example.16 shows how the link clause can be applied. The variables listed in the link clause (x and y) are mapped only when they are used in the vec_mul function.

Example.16:


 double x[n], y[n], z[n];
 /* the clause form of declare target takes no matching end directive */
 #pragma omp declare target link(x,y)

    /* link variables are mapped explicitly where they are needed */
    #pragma omp target map(to:x) map(tofrom:y)
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] += a * x[i];

This example can be downloaded here.

simd clause

Combining the declare simd directive with declare target indicates that a SIMD version of the function is available on the target device as well as on the host device.

#pragma omp declare simd uniform(i) linear(k) notinbranch

In the following example, declare simd is used together with declare target in order to create a SIMD version of the function pfun() on the target device as well.

Example.17 can be downloaded from this link.

Example.17:


#pragma omp declare target
#pragma omp declare simd uniform(i) linear(j) notinbranch
double pfun(int i, int j, double v1[n][n])
{
    return a * v1[i][j];
}
#pragma omp end declare target

int main(int argc, char* argv[])
{
    #pragma omp target map(tofrom:tmp) map(to:v1[0:n][0:n])
    {
        #pragma omp parallel for reduction(+:tmp)
        for (i = 0; i < n; i++)
        {
            double tmp1 = 0;
            #pragma omp simd reduction(+:tmp1)
            for (k = 0; k < n; k++)
                tmp1 += pfun(i,k,v1);
            tmp += tmp1;   /* add the row sum to the overall result */
        }
    }
}

Teams construct

A. Syntax

#pragma omp teams [clause[ [,] clause] ... ] new-line

The target and teams constructs are used together to create a set of thread teams, called a league. The teams construct creates the league. Each team has one master thread, which executes the teams region. When a team's master thread encounters a parallel loop construct, the other threads of the team are activated, and the threads of the team share the work of the parallel region. Fig.5 illustrates a teams construct with two teams, each of which has three threads. The teams execute 6 iterations: each team in the league receives 3 iterations, and each thread in a team receives one iteration.

B. Clauses

num_teams,num_threads clauses

In the teams construct, the number of teams can be specified with the num_teams clause. The omp_get_num_teams routine returns the number of teams in the league, and the omp_get_team_num routine returns the unique number of the calling team within the teams region. The number of threads used by each team can be requested with the num_threads clause on the parallel construct inside the teams region. The example below shows how these clauses and routines can be applied in a program; it can be downloaded from here.

Example.18:


   float dotprod(float B[], float C[], int N)
   {
       float sum0 = 0.0;
       float sum1 = 0.0;
       #pragma omp target map(to: B[:N], C[:N]) map(tofrom: sum0, sum1)
       #pragma omp teams num_teams(2)
       {
           int i;
           if (omp_get_num_teams() != 2)
               abort();
           if (omp_get_team_num() == 0)
           {
               #pragma omp parallel for reduction(+:sum0)
               for (i = 0; i < N/2; i++)
                   sum0 += B[i] * C[i];
           }
           else if (omp_get_team_num() == 1)
           {
               #pragma omp parallel for reduction(+:sum1)
               for (i = N/2; i < N; i++)
                   sum1 += B[i] * C[i];
           }
       }
       return sum0 + sum1;
   }

thread_limit clause

The thread_limit clause limits the number of threads in each team. In Example.19, the number of teams in the league is set by the num_teams clause, and the number of threads in each team is limited by the block_threads variable. However, it is better to avoid the thread_limit clause where possible, because it might degrade performance. Example.21 shows a teams construct with at most eight teams, where the number of threads in each team is limited to 16.

Example.19:


 int block_size, block_threads;
 #pragma omp target map(to: B[0:N], C[0:N]) map(tofrom: sum)
 #pragma omp teams num_teams(2) thread_limit(block_threads) reduction(+:sum)

The runnable version of this code is available here.

Data sharing clause

Data in OpenMP is shared by default. In the teams construct, data-sharing attribute clauses such as private(), firstprivate() and reduction() can be used to control data access on the device. In Example.19 the reduction clause is used to avoid a race condition.

C. Teams Construct Restrictions
1. Only the distribute, parallel, parallel for and parallel sections constructs can be used directly inside the teams region.
2. A teams construct must be contained within a target construct, and that target construct must not contain any statements or directives outside the teams construct.
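The sketch below is one way to respect both restrictions while using a data-sharing clause on the teams construct. The function league_sum and the values passed to num_teams and thread_limit are assumptions chosen for illustration.

```c
/* Hypothetical helper: sums v[0..n-1] with a league of teams. */
double league_sum(int n, const double *v)
{
    double sum = 0.0;

    /* The teams construct is the only thing nested in the target
       construct, and distribute is the only construct nested in teams. */
    #pragma omp target map(to: v[0:n]) map(tofrom: sum)
    #pragma omp teams num_teams(4) thread_limit(64) reduction(+:sum)
    #pragma omp distribute
    for (int i = 0; i < n; i++)
        sum += v[i];   /* each team adds into its own private copy */

    return sum;
}
```

The reduction clause gives every team a private copy of sum and combines the copies at the end of the teams region, which avoids a race on the shared variable.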

Distribute construct

A. Syntax

#pragma omp distribute [clause[ [,] clause] ... ] new-line

The distribute construct is a new kind of worksharing construct. It distributes the loop iterations across the master threads of the teams, and when a master thread encounters a parallel region, the other threads of its team are activated. The following example shows how the target, teams and distribute constructs, together with a parallel loop, are used to execute a target region. The teams construct creates a league of teams, and the distribute construct schedules the associated loop iterations across the master threads of the teams.

Example.20:


#pragma omp target map(to:v1[0:n]) map(tofrom:sum)
    #pragma omp teams num_teams(2) thread_limit(block_threads) reduction(+:sum)
    #pragma omp distribute
    for (i = 0; i < n; i += block_size)
    {
        /* min() is assumed to be a user-defined macro or helper */
        int c = min(i+block_size, n);
        #pragma omp parallel for reduction(+:sum)
        for (j = i; j < c; j++)
            sum += a * v1[j];
    }

This code is available here.

B. Clauses

Scheduling clauses

The dist_schedule clause of the distribute parallel loop construct uses static scheduling to assign the associated loop iterations to the master threads of the teams; the schedule clause then distributes them across the threads of each team. Chunks are distributed among the teams in a round-robin manner. If no chunk size is specified in the clause, the iterations are divided into chunks of approximately equal size. Example.21 shows the dist_schedule clause with the distribute parallel loop construct: the master thread of each team receives chunks of 1024 loop iterations, and the schedule clause distributes those 1024 iterations among the threads of that team in chunks of 64 iterations.

Example.21:


#pragma omp target map(to:v1[0:n]) map(tofrom:sum)
    #pragma omp teams num_teams(8) thread_limit(16) reduction(+:sum)
    #pragma omp distribute parallel for reduction(+:sum) \
                dist_schedule(static,1024) schedule(static,64)
    for (i = 0; i < n; i++)
        sum += a * v1[i];

This code is available here for download.

Collapse clause

The collapse clause, which is used for nested loops elsewhere in OpenMP, can also be used with the distribute construct.
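As an illustration, the sketch below collapses two perfectly nested loops into one iteration space before it is distributed. The function scale_matrix and the row-major layout are assumptions made for this example.

```c
/* Hypothetical helper: multiplies every element of a rows x cols matrix,
   stored row-major in m, by factor. */
void scale_matrix(int rows, int cols, double factor, double *m)
{
    /* collapse(2) merges the i and j loops into a single iteration space
       of rows*cols iterations, which is then spread over teams and threads. */
    #pragma omp target teams distribute parallel for collapse(2) \
            map(tofrom: m[0:rows*cols])
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            m[i * cols + j] *= factor;
}
```

Collapsing tends to help when the outer loop alone has too few iterations to keep all teams busy. The loops have to be perfectly nested, with no statements between the two for headers.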

Data sharing clauses

Data sharing attributes such as private(), firstprivate() and reduction() can be used along with the Distribute construct in order to control the scoping of enclosed variables.
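A small sketch of private and firstprivate on a distribute construct follows. The function shifted_squares and its variables are hypothetical and serve only to show the scoping.

```c
/* Hypothetical helper: out[i] = (base + i)^2, computed on the device. */
void shifted_squares(int n, double base, double *out)
{
    double tmp = 0.0;

    #pragma omp target teams map(from: out[0:n])
    #pragma omp distribute private(tmp) firstprivate(base)
    for (int i = 0; i < n; i++) {
        tmp = base + i;      /* private copy: no sharing between teams */
        out[i] = tmp * tmp;  /* base was copied in by firstprivate */
    }
}
```

With private(tmp) each team works on its own uninitialized copy of tmp, while firstprivate(base) additionally initializes the copy with the value base had before the construct.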

Distribute simd Constructs

The distribute simd construct distributes the loop iterations across the master threads of the teams, like the distribute construct, but it additionally uses SIMD parallelism to execute the iterations.

Example.22:


 #pragma omp target teams map(to:v1[0:n]) map(tofrom:sum) reduction(+:sum)
 #pragma omp distribute simd reduction(+:sum)
 for (i = 0; i < n; i++)
     sum += a * v1[i];   /* the reduction avoids a race on sum */

This code is available here for download.

Distribute parallel loop simd Construct

The distribute parallel loop simd construct also distributes the loop iterations across the master threads of the teams. The iterations are then shared among the threads of each team and vectorized. Example.23 illustrates this construct and can be downloaded from this link.

Example.23:


#pragma omp target teams map(to:v1[0:n]) map(tofrom:sum) reduction(+:sum)
#pragma omp distribute parallel for simd reduction(+:sum)
for (i = 0; i < n; i++)
    sum += a * v1[i];   /* the reduction avoids a race on sum */

Composite teams, distribute and simd constructs

A. Syntax
Combined constructs are shortcuts for specifying one construct immediately nested inside another. OpenMP 4.0 also defines composite constructs that combine the distribute and teams constructs with other constructs. Here is the list of these constructs with their syntax:
- Distribute simd

#pragma omp distribute simd [clause[ [,] clause] ... ] new-line for-loops

- Distribute parallel loop

#pragma omp distribute parallel for [clause[ [,] clause] ... ] new-line for-loops

- Distribute parallel loop simd

#pragma omp distribute parallel for simd [clause[ [,] clause] ... ] new-line for-loops

- Teams Distribute Parallel Loop Construct

#pragma omp teams distribute parallel for [clause[ [,] clause] ... ] new-line for-loops

- Target Teams Distribute Parallel Loop Construct

#pragma omp target teams distribute parallel for [clause[ [,] clause] ... ] new-line for-loops

Each clause can be any clause accepted by one of the constituent directives, with identical meanings and restrictions.

An example of the combined teams and distribute constructs is available here.
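For reference, a combined construct can express a whole offloaded loop in a single directive. The following dot product is a sketch, not the linked example; the function name dot_combined is an assumption.

```c
/* Hypothetical helper: dot product of a and b using one combined
   target teams distribute parallel for directive. */
double dot_combined(int n, const double *a, const double *b)
{
    double sum = 0.0;

    /* One directive creates the target region, the league of teams,
       the distribution across teams and the parallel loop in each team. */
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(tofrom: sum) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

The map and reduction clauses are written once on the combined directive; each clause is applied to the constituent construct that accepts it.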

Target Memory and Device Pointers Routines

The following example shows how to create space on a device, transfer data to and from that space, and free the space, using API calls. The API calls execute these operations directly, without any mapping. The omp_target_alloc routine allocates space on the device and returns a pointer to it. The omp_target_free routine frees the space on the device. The example also uses the is_device_ptr clause to access that space in a target region.

Example.24:


 void get_dev_cos(double *mem, size_t s)
 {
     int h, t, i;
     double *mem_dev_cpy;
     h = omp_get_initial_device();
     t = omp_get_default_device();

     if (omp_get_num_devices() < 1 || t < 0) {
         printf(" ERROR: No device found.\n");
         exit(1);
     }

     mem_dev_cpy = omp_target_alloc(sizeof(double) * s, t);
     if (mem_dev_cpy == NULL) {
         printf(" ERROR: No space left on device.\n");
         exit(1);
     }

     /* dst src */
     omp_target_memcpy(mem_dev_cpy, mem, sizeof(double)*s, 0, 0, t, h);

     #pragma omp target is_device_ptr(mem_dev_cpy) device(t)
     #pragma omp teams distribute parallel for
     for (i = 0; i < s; i++) { mem_dev_cpy[i] = cos((double)i); } /* init data */

     /* dst src */
     omp_target_memcpy(mem, mem_dev_cpy, sizeof(double)*s, 0, 0, h, t);
     omp_target_free(mem_dev_cpy, t);
 }

This code is available here for download.