6.16. Asynchronous target Execution and Dependences#

Asynchronous execution of a target region can be accomplished by creating an explicit task around the target region. Examples with explicit tasks are shown at the beginning of this section.

As of OpenMP 4.5, the nowait clause can be used on the target directive for asynchronous execution. Examples with nowait clauses follow the explicit task examples.

This section also shows the use of depend clauses to order executions through dependences.
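
As a quick orientation before the full examples, the sketch below shows the two patterns side by side in C. It is not one of the official examples: the function process, the placeholder length SZ, and the trivial loop bodies are assumptions used only to illustrate the structure. The first region is wrapped in an explicit task, the second relies on the nowait clause, and the final taskwait waits for both asynchronous regions to complete.

#define SZ 1024

// Minimal sketch only: process, SZ, and the loop bodies are placeholders.
void process(float *a, float *b)
{
   int i;

   // Pattern 1: wrap the target region in an explicit task.
   #pragma omp task shared(a)
   #pragma omp target map(tofrom: a[0:SZ])
   for (i = 0; i < SZ; i++) a[i] = 2.0f * a[i];

   // Pattern 2: put the nowait clause on the target construct itself
   // (OpenMP 4.5 and later); the generated target task is deferrable.
   #pragma omp target nowait map(tofrom: b[0:SZ])
   for (i = 0; i < SZ; i++) b[i] = b[i] + 1.0f;

   #pragma omp taskwait   // wait for both asynchronous regions
}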

6.16.1. Asynchronous target with Tasks#

The following example shows how the task and target constructs are used to execute multiple target regions asynchronously. The task that encounters the task construct generates an explicit task that contains a target region. The thread executing the explicit task encounters a task scheduling point while waiting for the execution of the target region to complete, allowing the thread to switch back to the execution of the encountering task or one of the previously generated explicit tasks.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: async_target.1
* type: C
* version: omp_5.1
*/
#pragma omp begin declare target
float F(float);
#pragma omp end declare target

#define N 1000000000
#define CHUNKSZ 1000000
void init(float *, int);
float Z[N];
void pipedF(){
   int C, i;
   init(Z, N);
   for (C=0; C<N; C+=CHUNKSZ){
      #pragma omp task shared(Z)
      #pragma omp target map(Z[C:CHUNKSZ])
      #pragma omp parallel for
      for (i=C; i<C+CHUNKSZ; i++) Z[i] = F(Z[i]);
   }
   #pragma omp taskwait
}

The Fortran version has an interface block that contains the declare target directive. An identical directive exists in the function definition (not shown here).

!!%compiler: gfortran
!!%cflags: -fopenmp

! name: async_target.1
! type: F-free
! version: omp_4.0
module parameters
integer, parameter :: N=1000000000, CHUNKSZ=1000000
end module
subroutine pipedF()
use parameters, ONLY: N, CHUNKSZ
integer            :: C, i
real               :: z(N)

interface
   function F(z)
   !$omp declare target
     real, intent(IN) ::z
     real             ::F
   end function F
end interface

   call init(z,N)

   do C=1,N,CHUNKSZ

      !$omp task shared(z)
      !$omp target map(z(C:C+CHUNKSZ-1))
      !$omp parallel do
         do i=C,C+CHUNKSZ-1
            z(i) = F(z(i))
         end do
      !$omp end target
      !$omp end task

   end do
   !$omp taskwait
   print*, z

end subroutine pipedF

The following example shows how the task and target constructs are used to execute multiple target regions asynchronously. The task dependence ensures that the storage is allocated and initialized on the device before it is accessed.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: async_target.2
* type: C
* version: omp_5.1
*/
#include <stdlib.h>
#include <omp.h>

#pragma omp begin declare target
extern void init(float *, float *, int);
#pragma omp end declare target

extern void foo();
extern void output(float *, int);
void vec_mult(float *p, int N, int dev)
{
   float *v1, *v2;
   int i;
   #pragma omp task shared(v1, v2) depend(out: v1, v2)
   #pragma omp target device(dev) map(v1, v2)
   {
       // check whether on device dev
       if (omp_is_initial_device())
          abort();
       v1 = (float *)malloc(N*sizeof(float));
       v2 = (float *)malloc(N*sizeof(float));
       init(v1, v2, N);
   }
   foo(); // execute other work asynchronously
   #pragma omp task shared(v1, v2, p) depend(in: v1, v2)
   #pragma omp target device(dev) map(to: v1, v2) map(from: p[0:N])
   {
       // check whether on device dev
       if (omp_is_initial_device())
          abort();
       #pragma omp parallel for
       for (i=0; i<N; i++)
          p[i] = v1[i] * v2[i];
       free(v1);
       free(v2);
   }
   #pragma omp taskwait
   output(p, N);
}

The Fortran example below is similar to the C version above. Instead of pointers, though, it uses the convenience of Fortran allocatable arrays on the device. In order to preserve the arrays allocated on the device across multiple target regions, a target data region is used in this case.

If there is no shape specified for an allocatable array in a map clause, only the array descriptor (also called a dope vector) is mapped. That is, device space is created for the descriptor, and it is initially populated with host values. In this case, the v1 and v2 arrays will be in a non-associated state on the device. When space for v1 and v2 is allocated on the device in the first target region, the addresses of the allocated space are included in their descriptors.

At the end of the first target region, the arrays v1 and v2 are preserved on the device for access in the second target region. At the end of the second target region, the data in array p is copied back, but the arrays v1 and v2 are not.

A depend clause is used in the task directive to provide a wait at the beginning of the second target region, to ensure that there is no race condition with v1 and v2 in the two tasks. It would be noncompliant to use v1 and/or v2 in lieu of N in the depend clauses, because the use of non-allocated allocatable arrays as list items in a depend clause would lead to unspecified behavior.

Note - This example is not strictly compliant with the OpenMP 4.5 specification since the allocation status of allocatable arrays v1 and v2 is changed inside the target region, which is not allowed. (See the restrictions for the map clause in the Data-mapping Attribute Rules and Clauses section of the specification.) However, the intention is to relax the restrictions on mapping of allocatable variables in the next release of the specification so that the example will be compliant.

!!%compiler: gfortran
!!%cflags: -fopenmp

! name: async_target.2
! type: F-free
! version: omp_4.0
 subroutine mult(p,  N, idev)
   use omp_lib, ONLY: omp_is_initial_device
   real             :: p(N)
   real,allocatable :: v1(:), v2(:)
   integer ::  i, idev
   !$omp declare target (init)

   !$omp target data map(v1,v2)

   !$omp task shared(v1,v2) depend(out: N)
      !$omp target device(idev)
         if( omp_is_initial_device() ) &
            stop "not executing on target device"
         allocate(v1(N), v2(N))
         call init(v1,v2,N)
      !$omp end target
   !$omp end task

   call foo()  ! execute other work asynchronously

   !$omp task shared(v1,v2,p) depend(in: N)
      !$omp target device(idev) map(from: p)
         if( omp_is_initial_device() ) &
            stop "not executing on target device"
         !$omp parallel do
            do i = 1,N
               p(i) = v1(i) * v2(i)
            end do
         deallocate(v1,v2)

      !$omp end target
   !$omp end task

   !$omp taskwait

   !$omp end target data

   call output(p, N)

end subroutine

6.16.2. nowait Clause on target Construct#

The following example shows how to execute code asynchronously on a device without an explicit task. The nowait clause on a target construct allows the thread of the target task to perform other work while waiting for the target region execution to complete. Hence, the target region can execute asynchronously on the device (without requiring a host thread to idle while waiting for the target task execution to complete).

In this example the product of two vectors (arrays), v1 and v2, is formed. One half of the operations is performed on the device and the other half on the host, concurrently.

After a team of threads is formed, the primary thread generates the target task while the other threads can continue on, without a barrier, to the execution of the host portion of the vector product. The completion of the target task (asynchronous target execution) is guaranteed by the synchronization in the implicit barrier at the end of the host vector-product worksharing loop region. See the barrier glossary entry in the OpenMP specification for details.

The host loop scheduling is dynamic, to balance the host thread executions, since one thread is being used for offload generation. In situations where the target task spends little time setting up and tearing down the target execution, static scheduling may be desired; a variant of the host loop with static scheduling is sketched after the Fortran version below.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: async_target.3
* type: C
* version: omp_5.1
*/
#include <stdio.h>

#define N 1000000      //N must be even
void init(int n, float *v1, float *v2);

int main(){
   int   i, n=N;
   int   chunk=1000;
   float v1[N],v2[N],vxv[N];

   init(n, v1,v2);

   #pragma omp parallel
   {

      #pragma omp masked
      #pragma omp target teams distribute parallel for nowait \
                                map(to: v1[0:n/2]) \
                                map(to: v2[0:n/2]) \
                                map(from: vxv[0:n/2])
      for(i=0; i<n/2; i++){ vxv[i] = v1[i]*v2[i]; }

      #pragma omp for schedule(dynamic,chunk)
      for(i=n/2; i<n; i++){ vxv[i] = v1[i]*v2[i]; }

   }
   printf(" vxv[0] vxv[n-1] %f %f\n", vxv[0], vxv[n-1]);
   return 0;
}
!!%compiler: gfortran
!!%cflags: -fopenmp

! name: async_target.3
! type: F-free
! version: omp_5.1
program concurrent_async
   use omp_lib
   integer,parameter :: n=1000000  !!n must be even
   integer           :: i, chunk=1000
   real              :: v1(n),v2(n),vxv(n)

   call init(n, v1,v2)

   !$omp parallel

      !$omp masked
      !$omp target teams distribute parallel do nowait &
      !$omp&                    map(to: v1(1:n/2))   &
      !$omp&                    map(to: v2(1:n/2))   &
      !$omp&                    map(from: vxv(1:n/2))
      do i = 1,n/2;    vxv(i) = v1(i)*v2(i); end do
      !$omp end masked

      !$omp do schedule(dynamic,chunk)
      do i = n/2+1,n;  vxv(i) = v1(i)*v2(i); end do

   !$omp end parallel

   print*, " vxv(1) vxv(n) :", vxv(1), vxv(n)

end program
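
As noted above, when the target task spends little time setting up and tearing down the offload, the host portion can use static scheduling instead of dynamic scheduling. The following is a minimal sketch of such a variant of the C version; the function name vec_mult_static is an assumption, and only the schedule clause of the host loop differs from async_target.3.

#define N 1000000      // N must be even
void init(int n, float *v1, float *v2);

void vec_mult_static(void)
{
   int   i, n=N;
   float v1[N], v2[N], vxv[N];

   init(n, v1, v2);

   #pragma omp parallel
   {
      #pragma omp masked
      #pragma omp target teams distribute parallel for nowait \
                                map(to: v1[0:n/2], v2[0:n/2]) \
                                map(from: vxv[0:n/2])
      for(i=0; i<n/2; i++){ vxv[i] = v1[i]*v2[i]; }

      // Static scheduling of the host half; assumes the offloading
      // thread is not delayed long by target-task setup and teardown.
      #pragma omp for schedule(static)
      for(i=n/2; i<n; i++){ vxv[i] = v1[i]*v2[i]; }
   }
}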

6.16.3. Asynchronous target with nowait and depend Clauses#

More details on dependences can be found in Section 5.3, Task Dependences. In this example, there are three flow dependences. Because of the first two dependences, the target task does not execute until the preceding explicit tasks have finished. These dependences are produced by arrays v1 and v2, with the out dependence type in the first two tasks and the in dependence type in the target task.

The last dependence is produced by array p with the out dependence type in the target task, and the in dependence type in the last task. The last task does not execute until the target task finishes.

The nowait clause on the target construct creates a deferrable target task, allowing the encountering task to continue execution without waiting for the completion of the target task.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: async_target.4
* type: C
* version: omp_4.5
*/

extern void init(  float*, int);
extern void output(float*, int);

void vec_mult(int N)
{
   int i;
   float p[N], v1[N], v2[N];

   #pragma omp parallel num_threads(2)
   {
      #pragma omp single
      {
         #pragma omp task depend(out:v1)
         init(v1, N);

         #pragma omp task depend(out:v2)
         init(v2, N);

         #pragma omp target nowait depend(in:v1,v2) depend(out:p) \
                                      map(to:v1,v2) map( from: p)
         #pragma omp parallel for private(i)
         for (i=0; i<N; i++)
            p[i] = v1[i] * v2[i];

         #pragma omp task depend(in:p)
         output(p, N);
      }
   }
}
!!%compiler: gfortran
!!%cflags: -fopenmp

! name: async_target.4
! type: F-free
! version: omp_4.5

subroutine vec_mult(N)
   implicit none
   integer           :: i, N
   real, allocatable :: p(:), v1(:), v2(:)
   allocate( p(N), v1(N), v2(N) )

   !$omp parallel num_threads(2)

      !$omp single

         !$omp task depend(out:v1)
         call init(v1, N)
         !$omp end task

         !$omp task depend(out:v2)
         call init(v2, N)
         !$omp end task

         !$omp target nowait depend(in:v1,v2) depend(out:p) &
         !$omp&                 map(to:v1,v2)  map(from: p)
         !$omp parallel do
         do i=1,N
            p(i) = v1(i) * v2(i)
         end do
         !$omp end target


         !$omp task depend(in:p)
         call output(p, N)
         !$omp end task

     !$omp end single
   !$omp end parallel

   deallocate( p, v1, v2 )

end subroutine