12.7. Metadirectives
The metadirective directive provides a mechanism to select one of the directives specified in its when clauses, depending on one or more contexts: the implementation, the available devices and the enclosing construct. The directive in the otherwise clause is used when no directive of a when clause is selected.
In the when clause the context selector (or just selector) defines traits that are evaluated for selection of the directive that follows the selector. This “selectable” directive is called a directive variant. Traits are grouped by construct, implementation and device sets to be used by a selector of the same name.
In the first example the architecture trait arch of the device selector set specifies that if an nvptx architecture is active in the OpenMP context, then the teams loop directive variant is selected as the directive; otherwise, the parallel loop directive variant of the otherwise clause is selected as the directive. That is, if a device of nvptx architecture is supported by the implementation within the enclosing target construct, its directive variant is selected. The architecture names, such as nvptx, are implementation defined. Also, note that device as used in a target construct specifies a device number, while device, as used in the metadirective directive as a selector set, has the traits kind, isa and arch.
//%compiler: clang
//%cflags: -fopenmp
/*
* name: metadirective.1
* type: C
* version: omp_5.2
*/
#define N 100
#include <stdio.h>
int main()
{
int v1[N], v2[N], v3[N];
for(int i=0; i<N; i++){ v1[i]=(i+1); v2[i]=-(i+1); }
#pragma omp target map(to:v1,v2) map(from:v3) device(0)
#pragma omp metadirective \
when( device={arch("nvptx")}: teams loop) \
otherwise( parallel loop)
for (int i= 0; i< N; i++) v3[i] = v1[i] * v2[i];
printf(" %d %d\n",v3[0],v3[N-1]); //output: -1 -10000
return 0;
}
!!%compiler: gfortran
!!%cflags: -fopenmp
! name: metadirective.1
! type: F-free
! version: omp_5.2
program main
integer, parameter :: N= 100
integer :: v1(N), v2(N), v3(N);
do i=1,N; v1(i)=i; v2(i)=-i; enddo ! initialize
!$omp target map(to:v1,v2) map(from:v3) device(0)
!$omp metadirective &
!$omp& when( device={arch("nvptx")}: teams loop) &
!$omp& otherwise( parallel loop)
do i= 1,N; v3(i) = v1(i) * v2(i); enddo
!$omp end target
print *, v3(1),v3(N) !!output: -1 -10000
end program
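As a hedged variation on the first example, the sketch below selects on the kind trait of the device selector set instead of arch. Because arch names are implementation defined, a kind value such as gpu may be the more portable choice, although that depends on the compiler in use. The function name multiply and the fixed size N are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N 100

/* Hypothetical helper: element-wise product of two arrays of length N.
 * With OpenMP enabled, the loop is offloaded to the default device. */
void multiply(const int *v1, const int *v2, int *v3)
{
    assert(v1 != NULL && v2 != NULL && v3 != NULL);

    #pragma omp target map(to: v1[0:N], v2[0:N]) map(from: v3[0:N])
    #pragma omp metadirective \
        when(device={kind(gpu)}: teams loop) \
        otherwise(parallel loop)
    for (int i = 0; i < N; i++)
        v3[i] = v1[i] * v2[i];
}
```

Both variants compute the same values, so the result does not reveal which one was selected. When the file is compiled without OpenMP support, the pragmas are ignored and the loop runs serially.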
In the second example, the implementation selector set is specified in the when clause to distinguish between platforms. Additionally, specific architectures are specified with the device selector set.
In the code, different teams constructs are employed as determined by the metadirective directive. The number of teams is restricted by a num_teams clause and a thread limit is also set by a thread_limit clause for vendor platforms and specific architecture traits. Otherwise, just the teams construct is used without any clauses, as prescribed by the otherwise clause.
//%compiler: clang
//%cflags: -fopenmp
/*
* name: metadirective.2
* type: C
* version: omp_5.2
*/
#define N 100
#include <stdio.h>
#include <omp.h>
void work_on_chunk(int idev, int i);
int main() //Driver
{
int i,idev;
for (idev=0; idev<omp_get_num_devices(); idev++)
{
#pragma omp target device(idev)
#pragma omp metadirective \
when( implementation={vendor(nvidia)}, \
device={arch("kepler")}: \
teams num_teams(512) thread_limit(32) ) \
when( implementation={vendor(amd)}, \
device={arch("fiji" )}: \
teams num_teams(512) thread_limit(64) ) \
otherwise( \
teams)
#pragma omp distribute parallel for
for (i=0; i<N; i++) work_on_chunk(idev,i);
}
return 0;
}
!!%compiler: gfortran
!!%cflags: -fopenmp
! name: metadirective.2
! type: F-free
! version: omp_5.2
program main !!Driver
use omp_lib
implicit none
integer, parameter :: N=1000
external :: work_on_chunk
integer :: i,idev
do idev=0,omp_get_num_devices()-1
!$omp target device(idev)
!$omp begin metadirective &
!$omp& when( implementation={vendor(nvidia)}, &
!$omp& device={arch("kepler")}: &
!$omp& teams num_teams(512) thread_limit(32) ) &
!$omp& when( implementation={vendor(amd)}, &
!$omp& device={arch("fiji" )}: &
!$omp& teams num_teams(512) thread_limit(64) ) &
!$omp& otherwise( teams )
!$omp distribute parallel do
do i=1,N
call work_on_chunk(idev,i)
end do
!$omp end metadirective
!$omp end target
end do
end program
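The driver above only declares work_on_chunk, so it cannot be linked as shown. As a self-contained sketch of the same selector pattern, the hypothetical function sum_indices below replaces the call with a reduction whose result can be checked. The vendor name is taken from the example; the num_teams and thread_limit values are illustrative and not tuned for any platform.

```c
#include <assert.h>

/* Hypothetical helper: sums the integers 0..n-1.
 * Both variants carry the same reduction clause, so the result does not
 * depend on which variant is selected. */
long sum_indices(int n)
{
    assert(n >= 0);
    long total = 0;

    #pragma omp target map(tofrom: total)
    #pragma omp metadirective \
        when(implementation={vendor(nvidia)}: \
             teams num_teams(512) thread_limit(32) reduction(+:total)) \
        otherwise( \
             teams reduction(+:total))
    #pragma omp distribute parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += i;

    return total;
}
```

A call such as sum_indices(100) should return 4950 whichever teams variant is chosen, which makes the function usable as a quick check that a given compiler accepts the selector syntax.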
In the third example, a construct selector set is specified in the when clause. Here, a metadirective directive is used within a function that is also compiled as a function for a target device as directed by a declare target directive. The target directive name of the construct selector ensures that the distribute parallel for/do construct is employed for the target compilation. Otherwise, for the host-compiled version the parallel for/do simd construct is used.
In the first call to the exp_pi_diff() routine the context is a target teams construct and the distribute parallel for/do construct version of the function is invoked, while in the second call the parallel for/do simd construct version is used.
This case illustrates an important point for users who may want to hoist the target directive out of a function that contains the usual target teams distribute parallel for/do construct (for providing alternate constructs through the metadirective directive, as here). While this combined construct can be decomposed into a target construct and a teams distribute parallel for/do construct, the OpenMP 5.0 specification has the restriction: “If a teams construct is nested within a target construct, that target construct must contain no statements, declarations or directives outside of the teams construct”. So, the teams construct must immediately follow the target construct without any intervening code statements (which includes function calls). Since the target construct alone cannot be hoisted out of a function, the target teams construct has been hoisted out of the function, and the distribute parallel for/do construct is used as the directive variant of the metadirective directive within the function.
//%compiler: clang
//%cflags: -fopenmp
/*
* name: metadirective.3
* type: C
* version: omp_5.2
*/
#include <stdio.h>
#include <math.h>
#define N 1000
#pragma omp begin declare target
void exp_pi_diff(double *d, double my_pi){
#pragma omp metadirective \
when( construct={target}: distribute parallel for ) \
otherwise( parallel for simd )
for(int i = 0; i<N; i++) d[i] = exp( (M_PI-my_pi)*i );
}
#pragma omp end declare target
int main()
{
//Calculates sequence of exponentials: (M_PI-my_pi) * index
//M_PI is from math.h, and my_pi is user provided.
double d[N];
double my_pi=3.14159265358979e0;
#pragma omp target teams map(tofrom: d[0:N])
exp_pi_diff(d,my_pi);
// value should be near 1
printf("d[N-1] = %20.14f\n",d[N-1]); // ...= 1.00000000000311
exp_pi_diff(d,my_pi); // value should be near 1
printf("d[N-1] = %20.14f\n",d[N-1]); // ...= 1.00000000000311
}
!!%compiler: gfortran
!!%cflags: -fopenmp
! name: metadirective.3
! type: F-free
! version: omp_5.2
module params
integer, parameter :: N=1000
DOUBLE PRECISION, PARAMETER::M_PI=4.0d0*DATAN(1.0d0)
! 3.1415926535897932_8
end module
subroutine exp_pi_diff(d, my_pi)
use params
implicit none
integer :: i
double precision :: d(N), my_pi
!$omp declare target
!$omp metadirective &
!$omp& when( construct={target}: distribute parallel do ) &
!$omp& otherwise( parallel do simd )
do i = 1,size(d)
d(i) = exp( (M_PI-my_pi)*i )
end do
end subroutine
program main
! Calculates sequence of exponentials: (M_PI-my_pi) * index
! M_PI is computed in the usual way, as 4*atan(1); my_pi is user provided.
! Fortran Standard does not provide PI
use params
implicit none
double precision :: d(N)
double precision :: my_pi=3.14159265358979d0
!$omp target teams map(from: d)
call exp_pi_diff(d,my_pi)
!$omp end target teams
! value should be near 1
print*, "d(N) = ",d(N) ! 1.00000000000311
call exp_pi_diff(d,my_pi) ! value should be near 1
print*, "d(N) = ",d(N) ! 1.00000000000311
end program
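The nesting restriction quoted above can be made concrete with a small sketch. The function scale and its arguments are hypothetical and not part of the example. The commented-out form places a declaration between the target and teams constructs, which the restriction disallows; the active form computes the value before the target construct instead.

```c
#include <assert.h>

/* Hypothetical helper: multiplies each element of x by 2*f. */
void scale(double *x, int n, double f)
{
    assert(n >= 0);

    /* Not conforming (kept as a comment only): the declaration of g is
     * inside the target region but outside the teams construct.
     *
     *   #pragma omp target map(tofrom: x[0:n])
     *   {
     *       double g = 2.0 * f;
     *       #pragma omp teams distribute parallel for
     *       for (int i = 0; i < n; i++) x[i] *= g;
     *   }
     */

    /* Conforming: g is computed before the target construct, so the
     * teams construct follows target with nothing in between. */
    double g = 2.0 * f;
    #pragma omp target map(tofrom: x[0:n])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < n; i++)
        x[i] *= g;
}
```

The same reasoning explains why exp_pi_diff is called from inside a target teams region: the call is then a statement within the teams construct rather than between target and teams.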
The user selector set can be used in a metadirective to select directives at execution time when the condition( boolean-expr ) selector expression is not a constant expression. In this case it is a dynamic trait set, and the selection is made at run time, rather than at compile time.
In the following example the foo function employs the condition selector to choose a device for execution at run time. In the bar routine metadirectives are nested. At the outer level a selection between serial and parallel execution is performed at run time, followed by another run-time selection of the schedule kind at the inner level when the active construct trait is parallel.
(Note, the variable b in two of the “selected” constructs is declared private for the sole purpose of detecting and reporting that the construct is used. Since the variable is private, its value is unchanged outside of the construct region, whereas it is changed if the “unselected” construct is used.)
//%compiler: clang
//%cflags: -fopenmp
/*
* name: metadirective.4
* type: C
* version: omp_5.2
*/
#define N 100
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h> // for getenv
#include <omp.h>
void foo(int *a, int n, bool use_gpu)
{
int b=0; // use b to detect if run on gpu
#pragma omp metadirective \
when( user={condition(use_gpu)}: \
target teams distribute parallel for \
private(b) map(from:a[0:n]) ) \
otherwise( \
parallel for )
for (int i=0; i<n; i++) {a[i]=i; if(i==n-1) b=1;}
if(b==0) printf("PASSED 1 of 3\n");
}
void bar (int *a, int n, bool run_parallel, bool unbalanced)
{
int b=0;
#pragma omp metadirective \
when(user={condition(run_parallel)}: parallel)
{
if(omp_in_parallel() == 1 && omp_get_thread_num() == 0)
{printf("PASSED 2 of 3\n");}
#pragma omp metadirective \
when( construct={parallel}, \
user={condition(unbalanced)}: for schedule(guided) \
private(b)) \
when( construct={parallel} : for schedule(static))
for (int i=0; i<n; i++) {a[i]=i; if(i==n-1) b=1;}
}
// if guided b=0, because b is private
if(b==0) printf("PASSED 3 of 3\n");
}
void foo(int *a, int n, bool use_gpu);
void bar(int *a, int n, bool run_parallel, bool unbalanced);
int main(){
int p[N];
// App normally sets these, dependent on input parameters
bool use_gpu=true, run_parallel=true, unbalanced=true;
// Testing: set Env Var MK_FAIL to anything to fail tests
if(getenv("MK_FAIL")!=NULL) {
use_gpu=false; run_parallel=false; unbalanced=false;
}
foo(p, N, use_gpu);
bar(p, N, run_parallel,unbalanced);
}
!!%compiler: gfortran
!!%cflags: -fopenmp
! name: metadirective.4
! type: F-free
! version: omp_5.2
subroutine foo(a, n, use_gpu)
integer :: n, a(n)
logical :: use_gpu
integer :: b=0 !! use b to detect if run on gpu
!$omp metadirective &
!$omp& when(user={condition(use_gpu)}: &
!$omp& target teams distribute parallel do &
!$omp& private(b) map(from:a(1:n)) ) &
!$omp& otherwise( &
!$omp& parallel do)
do i = 1,n; a(i)=i; if(i==n) b=1; end do
if(b==0) print *, "PASSED 1 of 3" ! because b is private for the gpu run
end subroutine
subroutine bar (a, n, run_parallel, unbalanced)
use omp_lib, only : omp_in_parallel, omp_get_thread_num
integer :: n, a(n)
logical :: run_parallel, unbalanced
integer :: b=0
!$omp begin metadirective when(user={condition(run_parallel)}: parallel)
if(omp_in_parallel() .and. omp_get_thread_num() == 0) & ! omp_in_parallel is logical
print *,"PASSED 2 of 3"
!$omp metadirective &
!$omp& when(construct={parallel}, user={condition(unbalanced)}: &
!$omp& do schedule(guided) private(b)) &
!$omp& when(construct={parallel}: do schedule(static))
do i = 1,n; a(i)=i; if(i==n) b=1; end do
!$omp end metadirective
if(b==0) print *, "PASSED 3 of 3" !!if guided, b=0 since b is private
end subroutine
program meta
use omp_lib
integer, parameter :: N=100
integer :: p(N)
integer :: env_stat
!! App normally sets these, dependent on input parameters
logical :: use_gpu=.true., run_parallel=.true., unbalanced=.true.
!! Testing: set Env Var MK_FAIL to anything to fail tests
call get_environment_variable('MK_FAIL',status=env_stat)
if(env_stat /= 1) then ! status =1 when not set!
use_gpu=.false.; run_parallel=.false.; unbalanced=.false.
endif
call foo(p, N, use_gpu)
call bar(p, N, run_parallel,unbalanced)
end program
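As a smaller illustration of a dynamic user selector, the sketch below parallelizes a loop only when the trip count reaches a caller-supplied threshold. The function sum_squares and its threshold parameter are assumptions introduced for this sketch. As in the bar routine above, there is no otherwise clause, so the loop runs without any directive when the condition is false.

```c
#include <assert.h>

/* Hypothetical helper: sum of i*i for i in [0, n). */
long sum_squares(int n, int threshold)
{
    assert(n >= 0);
    long total = 0;

    /* n >= threshold is not a constant expression, so the choice between
     * the parallel for variant and no directive is made at run time. */
    #pragma omp metadirective \
        when(user={condition(n >= threshold)}: \
             parallel for reduction(+:total))
    for (int i = 0; i < n; i++)
        total += (long)i * i;

    return total;
}
```

Calling sum_squares(10, 1000) and sum_squares(10, 1) should both return 285; only the execution mode differs. A suitable threshold depends on the platform and would need to be measured.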
Metadirectives can be used in conjunction with templates as shown in the C++ code below. Here the template definition generates two versions of the Fibonacci function. The tasking boolean is used in the condition selector to enable tasking. The true form implements a parallel version with task and taskwait constructs as in the tasking.4.c code in Section 5.1. The false form implements a serial version without any tasking constructs. Note that the serial version is used in the parallel function for optimally processing numbers less than 8.
//%compiler: clang
//%cflags: -fopenmp
/*
* name: metadirective.5
* type: C++
* version: omp_5.0
*/
#include <stdio.h>
// revised Fibonacci from tasking.4.c example
template <bool tasking>
int fib(int n) {
int i, j;
if (n<2) {
return n;
} else if ( tasking && n<8 ) { // serial/taskless cutoff for n<8
return fib<false>(n);
} else {
#pragma omp metadirective \
when(user={condition(tasking)}: task shared(i))
{
i=fib<tasking>(n-1);
}
#pragma omp metadirective \
when(user={condition(tasking)}: task shared(j))
{
j=fib<tasking>(n-2);
}
#pragma omp metadirective \
when(user={condition(tasking)}: taskwait)
return i+j;
}
}
int main(int argc, char** argv) {
int n = 15;
#pragma omp parallel
#pragma omp single
{
printf("fib(%i) = %i\n", n, fib<true>(n));
}
return 0;
}
// OUTPUT:
// fib(15) = 610
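C has no templates, but a rough analog of the compile-time switch can be sketched with a preprocessor constant, since OpenMP directives in C are subject to macro replacement. The macro USE_TASKS and the helper fib_serial are hypothetical names introduced here. This is an approximation of the pattern above, not a translation of the C++ code.

```c
#include <assert.h>

#ifndef USE_TASKS
#define USE_TASKS 1   /* hypothetical switch: 1 enables tasking, 0 disables it */
#endif

/* Plain recursive version, used below the cutoff. */
static int fib_serial(int n)
{
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

/* Fibonacci with optional tasking; the condition is a constant expression,
 * so the variant can be selected at compile time. */
int fib(int n)
{
    assert(n >= 0);
    int i, j;

    if (n < 8)                       /* taskless cutoff for small n */
        return fib_serial(n);

    #pragma omp metadirective \
        when(user={condition(USE_TASKS)}: task shared(i))
    {
        i = fib(n - 1);
    }
    #pragma omp metadirective \
        when(user={condition(USE_TASKS)}: task shared(j))
    {
        j = fib(n - 2);
    }
    #pragma omp metadirective \
        when(user={condition(USE_TASKS)}: taskwait)
    return i + j;
}
```

To generate tasks that run concurrently, fib would be called from a single construct inside a parallel region, as in the main function above. Called outside a parallel region, or with USE_TASKS defined as 0, it still returns the same value.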