6.15. teams Construct and Related Combined Constructs
6.15.1. target and teams Constructs with omp_get_num_teams and omp_get_team_num Routines
The following example shows how the target and teams constructs are used to create a league of thread teams that execute a region. The teams construct creates a league of at most two teams where the primary thread of each team executes the teams region.
The omp_get_num_teams routine returns the number of teams executing in a teams region. The omp_get_team_num routine returns the team number, which is an integer between 0 and one less than the value returned by omp_get_num_teams. The following example manually distributes a loop across two teams.
//%compiler: clang
//%cflags: -fopenmp
/*
* name: teams.1
* type: C
* version: omp_4.0
*/
#include <stdlib.h>
#include <omp.h>
float dotprod(float B[], float C[], int N)
{
   float sum0 = 0.0;
   float sum1 = 0.0;
   #pragma omp target map(to: B[:N], C[:N]) map(tofrom: sum0, sum1)
   #pragma omp teams num_teams(2)
   {
      int i;
      if (omp_get_num_teams() != 2)
         abort();
      if (omp_get_team_num() == 0)
      {
         #pragma omp parallel for reduction(+:sum0)
         for (i=0; i<N/2; i++)
            sum0 += B[i] * C[i];
      }
      else if (omp_get_team_num() == 1)
      {
         #pragma omp parallel for reduction(+:sum1)
         for (i=N/2; i<N; i++)
            sum1 += B[i] * C[i];
      }
   }
   return sum0 + sum1;
}
/* Note: The variables sum0,sum1 are now mapped with tofrom, for
correct execution with 4.5 (and pre-4.5) compliant compilers.
See Devices Intro.
*/
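For context, a minimal host-side driver for this dotprod function might look like the sketch below. The main function, the array size, and the initialization values are illustrative assumptions and are not part of the original example; note also that dotprod aborts unless the runtime actually provides two teams.

#include <stdio.h>
#include <stdlib.h>

extern float dotprod(float B[], float C[], int N);

int main(void)
{
   const int N = 1024;                 /* illustrative problem size */
   float *B = malloc(N * sizeof(float));
   float *C = malloc(N * sizeof(float));

   for (int i = 0; i < N; i++) { B[i] = 1.0f; C[i] = 2.0f; }

   /* Every product is 2.0, so the dot product should be 2*N. */
   printf("dotprod = %f (expected %f)\n", dotprod(B, C, N), 2.0f * N);

   free(B);
   free(C);
   return 0;
}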
!!%compiler: gfortran
!!%cflags: -fopenmp
! name: teams.1
! type: F-free
! version: omp_4.0
function dotprod(B,C,N) result(sum)
   use omp_lib, ONLY : omp_get_num_teams, omp_get_team_num
   real :: B(N), C(N), sum, sum0, sum1
   integer :: N, i
   sum0 = 0.0e0
   sum1 = 0.0e0
   !$omp target map(to: B, C) map(tofrom: sum0, sum1)
   !$omp teams num_teams(2)
   if (omp_get_num_teams() /= 2) stop "2 teams required"
   if (omp_get_team_num() == 0) then
      !$omp parallel do reduction(+:sum0)
      do i=1,N/2
         sum0 = sum0 + B(i) * C(i)
      end do
   else if (omp_get_team_num() == 1) then
      !$omp parallel do reduction(+:sum1)
      do i=N/2+1,N
         sum1 = sum1 + B(i) * C(i)
      end do
   end if
   !$omp end teams
   !$omp end target
   sum = sum0 + sum1
end function
! Note: The variables sum0,sum1 are now mapped with tofrom, for correct
! execution with 4.5 (and pre-4.5) compliant compilers. See Devices Intro.
6.15.2. target, teams, and distribute Constructs
The following example shows how the target, teams, and distribute constructs are used to execute a loop nest in a target region. The teams construct creates a league and the primary thread of each team executes the teams region. The distribute construct schedules the subsequent loop iterations across the primary threads of each team.
The number of teams in the league is less than or equal to the variable num_teams. Each team in the league has a number of threads less than or equal to the variable block_threads. The iterations in the outer loop are distributed among the primary threads of each team.
When a team’s primary thread encounters the parallel loop construct before the inner loop, the other threads in its team are activated. The team executes the parallel region and then workshares the execution of the loop.
Each primary thread executing the teams region has a private copy of the variable sum that is created by the reduction clause on the teams construct. The primary thread and all threads in its team have a private copy of the variable sum that is created by the reduction clause on the parallel loop construct. The second private sum is reduced into the primary thread’s private copy of sum created by the teams construct. At the end of the teams region, each primary thread’s private copy of sum is reduced into the final sum that is mapped into the target region.
//%compiler: clang
//%cflags: -fopenmp
/*
* name: teams.2
* type: C
* version: omp_4.0
*/
#define min(x, y) (((x) < (y)) ? (x) : (y))
float dotprod(float B[], float C[], int N, int block_size,
              int num_teams, int block_threads)
{
   float sum = 0.0;
   int i, i0;
   #pragma omp target map(to: B[0:N], C[0:N]) map(tofrom: sum)
   #pragma omp teams num_teams(num_teams) thread_limit(block_threads) \
                     reduction(+:sum)
   #pragma omp distribute
   for (i0=0; i0<N; i0 += block_size)
      #pragma omp parallel for reduction(+:sum)
      for (i=i0; i< min(i0+block_size,N); i++)
         sum += B[i] * C[i];
   return sum;
}
/* Note: The variable sum is now mapped with tofrom, for correct
execution with 4.5 (and pre-4.5) compliant compilers. See
Devices Intro.
*/
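As a rough usage sketch, the blocking parameters of this dotprod version might be chosen as follows. The driver, the parameter values, and the initialization are illustrative assumptions, not part of the original example.

#include <stdio.h>
#include <stdlib.h>

extern float dotprod(float B[], float C[], int N, int block_size,
                     int num_teams, int block_threads);

int main(void)
{
   const int N             = 1 << 20; /* problem size                       */
   const int block_size    = 1024;    /* outer-loop chunk handled by a team */
   const int num_teams     = 8;       /* upper bound on the league size     */
   const int block_threads = 64;      /* upper bound on threads per team    */

   float *B = malloc(N * sizeof(float));
   float *C = malloc(N * sizeof(float));
   for (int i = 0; i < N; i++) { B[i] = 1.0f; C[i] = 0.5f; }

   /* Every product is 0.5, so the result should be N/2. */
   printf("dotprod = %f\n",
          dotprod(B, C, N, block_size, num_teams, block_threads));

   free(B);
   free(C);
   return 0;
}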
!!%compiler: gfortran
!!%cflags: -fopenmp
! name: teams.2
! type: F-free
! version: omp_4.0
function dotprod(B,C,N, block_size, num_teams, block_threads) result(sum)
   implicit none
   real :: B(N), C(N), sum
   integer :: N, block_size, num_teams, block_threads, i, i0
   sum = 0.0e0
   !$omp target map(to: B, C) map(tofrom: sum)
   !$omp teams num_teams(num_teams) thread_limit(block_threads) &
   !$omp&   reduction(+:sum)
   !$omp distribute
   do i0=1,N, block_size
      !$omp parallel do reduction(+:sum)
      do i = i0, min(i0+block_size-1,N)
         sum = sum + B(i) * C(i)
      end do
   end do
   !$omp end teams
   !$omp end target
end function
! Note: The variable sum is now mapped with tofrom, for correct
! execution with 4.5 (and pre-4.5) compliant compilers. See Devices Intro.
6.15.3. target teams and Distribute Parallel Loop Constructs
The following example shows how the target teams and distribute parallel loop constructs are used to execute a target region. The target teams construct creates a league of teams where the primary thread of each team executes the teams region.
The distribute parallel loop construct schedules the loop iterations across the primary threads of each team and then across the threads of each team.
//%compiler: clang
//%cflags: -fopenmp
/*
* name: teams.3
* type: C
* version: omp_4.5
*/
float dotprod(float B[], float C[], int N)
{
   float sum = 0;
   int i;
   #pragma omp target teams map(to: B[0:N], C[0:N]) \
                            defaultmap(tofrom:scalar) reduction(+:sum)
   #pragma omp distribute parallel for reduction(+:sum)
   for (i=0; i<N; i++)
      sum += B[i] * C[i];
   return sum;
}
/* Note: The variable sum is now mapped with tofrom from the defaultmap
clause on the combined target teams construct, for correct
execution with 4.5 (and pre-4.5) compliant compilers.
See Devices Intro.
*/
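As the note explains, the defaultmap clause is what makes sum map tofrom. An equivalent variant (an illustrative sketch, not part of the original example) drops defaultmap and maps the scalar explicitly:

/* Variant of teams.3: map the reduction scalar explicitly instead of
   relying on defaultmap (illustrative sketch only). */
float dotprod_explicit_map(float B[], float C[], int N)
{
   float sum = 0;
   int i;
   #pragma omp target teams map(to: B[0:N], C[0:N]) map(tofrom: sum) \
                            reduction(+:sum)
   #pragma omp distribute parallel for reduction(+:sum)
   for (i=0; i<N; i++)
      sum += B[i] * C[i];
   return sum;
}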
!!%compiler: gfortran
!!%cflags: -fopenmp
! name: teams.3
! type: F-free
! version: omp_4.5
function dotprod(B,C,N) result(sum)
   real :: B(N), C(N), sum
   integer :: N, i
   sum = 0.0e0
   !$omp target teams map(to: B, C) &
   !$omp&   defaultmap(tofrom:scalar) reduction(+:sum)
   !$omp distribute parallel do reduction(+:sum)
   do i = 1,N
      sum = sum + B(i) * C(i)
   end do
   !$omp end target teams
end function
! Note: The variable sum is now mapped with tofrom from the defaultmap
! clause on the combined target teams construct, for correct
! execution with 4.5 (and pre-4.5) compliant compilers. See Devices Intro.
6.15.4. target teams and Distribute Parallel Loop Constructs with Scheduling Clauses
The following example shows how the target teams and distribute parallel loop constructs are used to execute a target region. The teams construct creates a league of at most eight teams where the primary thread of each team executes the teams region. The number of threads in each team is less than or equal to 16.
The distribute parallel loop construct schedules the subsequent loop iterations across the primary threads of each team and then across the threads of each team.
The dist_schedule clause on the distribute parallel loop construct indicates that loop iterations are distributed to the primary thread of each team in chunks of 1024 iterations.
The schedule clause indicates that the 1024 iterations distributed to a primary thread are then assigned to the threads in its associated team in chunks of 64 iterations.
//%compiler: clang
//%cflags: -fopenmp
/*
* name: teams.4
* type: C
* version: omp_4.0
*/
#define N 1024*1024
float dotprod(float B[], float C[])
{
   float sum = 0.0;
   int i;
   #pragma omp target map(to: B[0:N], C[0:N]) map(tofrom: sum)
   #pragma omp teams num_teams(8) thread_limit(16) reduction(+:sum)
   #pragma omp distribute parallel for reduction(+:sum) \
               dist_schedule(static, 1024) schedule(static, 64)
   for (i=0; i<N; i++)
      sum += B[i] * C[i];
   return sum;
}
/* Note: The variable sum is now mapped with tofrom, for correct
execution with 4.5 (and pre-4.5) compliant compilers.
See Devices Intro.
*/
!!%compiler: gfortran
!!%cflags: -fopenmp
! name: teams.4
! type: F-free
! version: omp_4.0
module arrays
   integer,parameter :: N=1024*1024
   real :: B(N), C(N)
end module

function dotprod() result(sum)
   use arrays
   real :: sum
   integer :: i
   sum = 0.0e0
   !$omp target map(to: B, C) map(tofrom: sum)
   !$omp teams num_teams(8) thread_limit(16) reduction(+:sum)
   !$omp distribute parallel do reduction(+:sum) &
   !$omp&   dist_schedule(static, 1024) schedule(static, 64)
   do i = 1,N
      sum = sum + B(i) * C(i)
   end do
   !$omp end teams
   !$omp end target
end function
! Note: The variable sum is now mapped with tofrom, for correct
! execution with 4.5 (and pre-4.5) compliant compilers. See Devices Intro.
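To observe the chunking that the dist_schedule and schedule clauses produce, one can record which team and which thread execute each iteration and inspect the assignment on the host afterwards. The following probe is an illustrative sketch with a deliberately small iteration count and team count, not part of the original example; the round-robin assignment of 1024-iteration chunks to teams and 64-iteration pieces to threads follows from the static schedules.

#include <stdio.h>
#include <omp.h>
#define M 4096   /* small iteration count so the output is easy to inspect */

void probe_schedule(void)
{
   int team_of[M], thread_of[M];
   int i;
   #pragma omp target map(from: team_of, thread_of)
   #pragma omp teams num_teams(2) thread_limit(16)
   #pragma omp distribute parallel for \
               dist_schedule(static, 1024) schedule(static, 64)
   for (i = 0; i < M; i++) {
      team_of[i]   = omp_get_team_num();
      thread_of[i] = omp_get_thread_num();
   }
   /* Iterations 0..1023 go to one team (split into 64-iteration pieces
      among its threads), 1024..2047 to the next team, and so on
      round-robin across the league. */
   for (i = 0; i < M; i += 64)
      printf("iters %4d..%4d -> team %d, thread %d\n",
             i, i + 63, team_of[i], thread_of[i]);
}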
6.15.5. target teams and distribute simd Constructs
The following example shows how the target teams and distribute simd constructs are used to execute a loop in a target region. The target teams construct creates a league of teams where the primary thread of each team executes the teams region.
The distribute simd construct schedules the loop iterations across the primary threads of the teams and then uses SIMD parallelism to execute the iterations.
//%compiler: clang
//%cflags: -fopenmp
/*
* name: teams.5
* type: C
* version: omp_4.0
*/
extern void init(float *, float *, int);
extern void output(float *, int);
void vec_mult(float *p, float *v1, float *v2, int N)
{
   int i;
   init(v1, v2, N);
   #pragma omp target teams map(to: v1[0:N], v2[:N]) map(from: p[0:N])
   #pragma omp distribute simd
   for (i=0; i<N; i++)
      p[i] = v1[i] * v2[i];
   output(p, N);
}
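The init and output routines are declared as external in this example. To make the code runnable, they could be defined on the host along the following lines; these definitions are illustrative assumptions, not part of the original example.

#include <stdio.h>

/* Hypothetical host-side definitions of the external routines. */
void init(float *v1, float *v2, int N)
{
   for (int i = 0; i < N; i++) {
      v1[i] = (float)i;
      v2[i] = 2.0f;
   }
}

void output(float *p, int N)
{
   /* Print a few entries to confirm the element-wise product. */
   for (int i = 0; i < N && i < 5; i++)
      printf("p[%d] = %f\n", i, p[i]);
}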
!!%compiler: gfortran
!!%cflags: -fopenmp
! name: teams.5
! type: F-free
! version: omp_4.0
subroutine vec_mult(p, v1, v2, N)
   real :: p(N), v1(N), v2(N)
   integer :: i
   call init(v1, v2, N)
   !$omp target teams map(to: v1, v2) map(from: p)
   !$omp distribute simd
   do i=1,N
      p(i) = v1(i) * v2(i)
   end do
   !$omp end target teams
   call output(p, N)
end subroutine
6.15.6. target teams and Distribute Parallel Loop SIMD Constructs
The following example shows how the target teams and the distribute parallel loop SIMD constructs are used to execute a loop in a target teams region. The target teams construct creates a league of teams where the primary thread of each team executes the teams region.
The distribute parallel loop SIMD construct schedules the loop iterations across the primary threads of the teams and then across the threads of each team, where each thread uses SIMD parallelism.
//%compiler: clang
//%cflags: -fopenmp
/*
* name: teams.6
* type: C
* version: omp_4.0
*/
extern void init(float *, float *, int);
extern void output(float *, int);
void vec_mult(float *p, float *v1, float *v2, int N)
{
   int i;
   init(v1, v2, N);
   #pragma omp target teams map(to: v1[0:N], v2[:N]) map(from: p[0:N])
   #pragma omp distribute parallel for simd
   for (i=0; i<N; i++)
      p[i] = v1[i] * v2[i];
   output(p, N);
}
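A small driver for vec_mult, reusing the hypothetical init and output definitions sketched after teams.5, might look like this (again an illustrative assumption, not part of the original example):

#include <stdlib.h>

extern void vec_mult(float *p, float *v1, float *v2, int N);

int main(void)
{
   const int N = 1024;                /* illustrative problem size */
   float *p  = malloc(N * sizeof(float));
   float *v1 = malloc(N * sizeof(float));
   float *v2 = malloc(N * sizeof(float));

   vec_mult(p, v1, v2, N);   /* vec_mult calls init() and output() itself */

   free(p);
   free(v1);
   free(v2);
   return 0;
}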
!!%compiler: gfortran
!!%cflags: -fopenmp
! name: teams.6
! type: F-free
! version: omp_4.0
subroutine vec_mult(p, v1, v2, N)
   real :: p(N), v1(N), v2(N)
   integer :: i
   call init(v1, v2, N)
   !$omp target teams map(to: v1, v2) map(from: p)
   !$omp distribute parallel do simd
   do i=1,N
      p(i) = v1(i) * v2(i)
   end do
   !$omp end target teams
   call output(p, N)
end subroutine