Contents

  • 6.15.1. target and teams Constructs with omp_get_num_teams and omp_get_team_num Routines
  • 6.15.2. target, teams, and distribute Constructs
  • 6.15.3. target teams and Distribute Parallel Loop Constructs
  • 6.15.4. target teams and Distribute Parallel Loop Constructs with Scheduling Clauses
  • 6.15.5. target teams and distribute simd Constructs
  • 6.15.6. target teams and Distribute Parallel Loop SIMD Constructs

6.15. teams Construct and Related Combined Constructs

6.15.1. target and teams Constructs with omp_get_num_teams and omp_get_team_num Routines

The following example shows how the target and teams constructs are used to create a league of thread teams that execute a region. The teams construct creates a league of at most two teams where the primary thread of each team executes the teams region.

The omp_get_num_teams routine returns the number of teams executing in a teams region. The omp_get_team_num routine returns the team number, which is an integer between 0 and one less than the value returned by omp_get_num_teams. The following example manually distributes a loop across two teams.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: teams.1
* type: C
* version: omp_4.0
*/
#include <stdlib.h>
#include <omp.h>
float dotprod(float B[], float C[], int N)
{
   float sum0 = 0.0;
   float sum1 = 0.0;
   #pragma omp target map(to: B[:N], C[:N]) map(tofrom: sum0, sum1)
   #pragma omp teams num_teams(2)
   {
      int i;
      if (omp_get_num_teams() != 2)
         abort();
      if (omp_get_team_num() == 0)
      {
         #pragma omp parallel for reduction(+:sum0)
         for (i=0; i<N/2; i++)
            sum0 += B[i] * C[i];
      }
      else if (omp_get_team_num() == 1)
      {
         #pragma omp parallel for reduction(+:sum1)
         for (i=N/2; i<N; i++)
            sum1 += B[i] * C[i];
      }
   }
   return sum0 + sum1;
}

/* Note:  The variables sum0,sum1 are now mapped with tofrom, for
          correct execution with 4.5 (and pre-4.5) compliant compilers.
          See Devices Intro.
 */
!!%compiler: gfortran
!!%cflags: -fopenmp

! name: teams.1
! type: F-free
! version:    omp_4.0
function dotprod(B,C,N) result(sum)
use omp_lib, ONLY : omp_get_num_teams, omp_get_team_num
    real    :: B(N), C(N), sum,sum0, sum1
    integer :: N, i
    sum0 = 0.0e0
    sum1 = 0.0e0
    !$omp target map(to: B, C) map(tofrom: sum0, sum1)
    !$omp teams num_teams(2)
      if (omp_get_num_teams() /= 2) stop "2 teams required"
      if (omp_get_team_num() == 0) then
         !$omp parallel do reduction(+:sum0)
         do i=1,N/2
            sum0 = sum0 + B(i) * C(i)
         end do
      else if (omp_get_team_num() == 1) then
         !$omp parallel do reduction(+:sum1)
         do i=N/2+1,N
            sum1 = sum1 + B(i) * C(i)
         end do
      end if
    !$omp end teams
    !$omp end target
    sum = sum0 + sum1
end function

! Note:  The variables sum0,sum1 are now mapped with tofrom, for correct
! execution with 4.5 (and pre-4.5) compliant compilers. See Devices Intro.
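
For reference, a minimal host driver for the C version of this example might look as follows. The driver is an illustration added here, not part of the original example; the array size and initialization values are arbitrary, and the expected result follows from them.

/* Hypothetical driver for teams.1 -- added for illustration only */
#include <stdio.h>
#include <stdlib.h>

float dotprod(float B[], float C[], int N);

int main(void)
{
   int N = 1024;
   float *B = (float *)malloc(N * sizeof(float));
   float *C = (float *)malloc(N * sizeof(float));
   for (int i = 0; i < N; i++) { B[i] = 1.0f; C[i] = 2.0f; }

   /* every product is 2.0, so the expected sum is N * 2.0 = 2048.0 */
   printf("dotprod = %f\n", dotprod(B, C, N));

   free(B);
   free(C);
   return 0;
}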

6.15.2. target, teams, and distribute Constructs

The following example shows how the target, teams, and distribute constructs are used to execute a loop nest in a target region. The teams construct creates a league and the primary thread of each team executes the teams region. The distribute construct schedules the subsequent loop iterations across the primary threads of each team.

The number of teams in the league is less than or equal to the variable num_teams. Each team in the league has a number of threads less than or equal to the variable block_threads. The iterations in the outer loop are distributed among the primary threads of each team.

When a team’s primary thread encounters the parallel loop construct before the inner loop, the other threads in its team are activated. The team executes the parallel region and then workshares the execution of the loop.

Each primary thread executing the teams region has a private copy of the variable sum that is created by the reduction clause on the teams construct. The primary thread and all threads in its team have a private copy of the variable sum that is created by the reduction clause on the parallel loop construct. The second private sum is reduced into the primary thread’s private copy of sum created by the teams construct. At the end of the teams region, each primary thread’s private copy of sum is reduced into the final sum that is implicitly mapped into the target region.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: teams.2
* type: C
* version: omp_4.0
*/
#define min(x, y) (((x) < (y)) ? (x) : (y))

float dotprod(float B[], float C[], int N, int block_size,
  int num_teams, int block_threads)
{
    float sum = 0.0;
    int i, i0;
    #pragma omp target map(to: B[0:N], C[0:N]) map(tofrom: sum)
    #pragma omp teams num_teams(num_teams) thread_limit(block_threads) \
      reduction(+:sum)
    #pragma omp distribute
    for (i0=0; i0<N; i0 += block_size)
       #pragma omp parallel for reduction(+:sum)
       for (i=i0; i< min(i0+block_size,N); i++)
           sum += B[i] * C[i];
    return sum;
}
/* Note:  The variable sum is now mapped with tofrom, for correct
   execution with 4.5 (and pre-4.5) compliant compilers. See
   Devices Intro.
 */
!!%compiler: gfortran
!!%cflags: -fopenmp

! name: teams.2
! type: F-free
! version: omp_4.0
function dotprod(B,C,N, block_size, num_teams, block_threads) result(sum)
implicit none
    real    :: B(N), C(N), sum
    integer :: N, block_size, num_teams, block_threads, i, i0
    sum = 0.0e0
    !$omp target map(to: B, C) map(tofrom: sum)
    !$omp teams num_teams(num_teams) thread_limit(block_threads) &
    !$omp&  reduction(+:sum)
    !$omp distribute
       do i0=1,N, block_size
          !$omp parallel do reduction(+:sum)
          do i = i0, min(i0+block_size,N)
             sum = sum + B(i) * C(i)
          end do
       end do
    !$omp end teams
    !$omp end target
end function

! Note:  The variable sum is now mapped with tofrom, for correct
! execution with 4.5 (and pre-4.5) compliant compilers. See Devices Intro.
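
As a usage sketch, a caller of the C version above might choose the blocking parameters as follows; the values are hypothetical and added here only to illustrate how block_size, num_teams, and block_threads relate.

/* Hypothetical call site for teams.2 -- values chosen only for illustration:
   up to 256 teams, each worksharing 4096-iteration blocks among at most
   64 threads. */
float result = dotprod(B, C, N, /*block_size=*/4096,
                       /*num_teams=*/256, /*block_threads=*/64);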

6.15.3. target teams and Distribute Parallel Loop Constructs

The following example shows how the target teams and distribute parallel loop constructs are used to execute a target region. The target teams construct creates a league of teams where the primary thread of each team executes the teams region.

The distribute parallel loop construct schedules the loop iterations across the primary threads of each team and then across the threads of each team.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: teams.3
* type: C
* version: omp_4.5
*/
float dotprod(float B[], float C[], int N)
{
   float sum = 0;
   int i;
   #pragma omp target teams map(to: B[0:N], C[0:N]) \
                            defaultmap(tofrom:scalar) reduction(+:sum)
   #pragma omp distribute parallel for reduction(+:sum)
   for (i=0; i<N; i++)
      sum += B[i] * C[i];
   return sum;
}

/* Note:  The variable sum is now mapped with tofrom from the defaultmap
          clause on the combined target teams construct, for correct
          execution with 4.5 (and pre-4.5) compliant compilers.
          See Devices Intro.
 */
!!%compiler: gfortran
!!%cflags: -fopenmp

! name: teams.3
! type: F-free
! version: omp_4.5
function dotprod(B,C,N) result(sum)
   real    :: B(N), C(N), sum
   integer :: N, i
   sum = 0.0e0
   !$omp target teams map(to: B, C)  &
   !$omp&             defaultmap(tofrom:scalar) reduction(+:sum)
   !$omp distribute parallel do reduction(+:sum)
      do i = 1,N
         sum = sum + B(i) * C(i)
      end do
   !$omp end target teams
end function

! Note:  The variable sum is now mapped with tofrom from the defaultmap
!  clause on the combined target teams construct, for correct
!  execution with 4.5 (and pre-4.5) compliant compilers. See Devices Intro.
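
The defaultmap(tofrom:scalar) clause changes the OpenMP 4.5 default treatment of scalars in the target region (firstprivate) so that sum is mapped tofrom. An equivalent alternative, used by the other examples in this section, is to map just that one scalar explicitly. The following directive is a sketch added here for comparison, not part of the original example:

/* Added illustration: map only the scalar sum explicitly instead of
   changing the default for all scalars with defaultmap(tofrom:scalar). */
#pragma omp target teams map(to: B[0:N], C[0:N]) map(tofrom: sum) \
                         reduction(+:sum)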

6.15.4. target teams and Distribute Parallel Loop Constructs with Scheduling Clauses

The following example shows how the target teams and distribute parallel loop constructs are used to execute a target region. The teams construct creates a league of at most eight teams where the primary thread of each team executes the teams region. The number of threads in each team is less than or equal to 16.

The distribute parallel loop construct schedules the subsequent loop iterations across the primary threads of each team and then across the threads of each team.

The dist_schedule clause on the distribute parallel loop construct indicates that loop iterations are distributed to the primary thread of each team in chunks of 1024 iterations.

The schedule clause indicates that the 1024 iterations distributed to a primary thread are then assigned to the threads in its associated team in chunks of 64 iterations.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: teams.4
* type: C
* version: omp_4.0
*/
#define N 1024*1024
float dotprod(float B[], float C[])
{
    float sum = 0.0;
    int i;
    #pragma omp target map(to: B[0:N], C[0:N]) map(tofrom: sum)
    #pragma omp teams num_teams(8) thread_limit(16) reduction(+:sum)
    #pragma omp distribute parallel for reduction(+:sum) \
                dist_schedule(static, 1024) schedule(static, 64)
    for (i=0; i<N; i++)
        sum += B[i] * C[i];
    return sum;
}

/* Note:  The variable sum is now mapped with tofrom, for correct
          execution with 4.5 (and pre-4.5) compliant compilers.
          See Devices Intro.
 */

!!%compiler: gfortran
!!%cflags: -fopenmp

! name: teams.4
! type: F-free
! version: omp_4.0
module arrays
integer,parameter :: N=1024*1024
real :: B(N), C(N)
end module
function dotprod() result(sum)
use arrays
   real    :: sum
   integer :: i
   sum = 0.0e0
   !$omp target map(to: B, C) map(tofrom: sum)
   !$omp teams num_teams(8) thread_limit(16) reduction(+:sum)
   !$omp distribute parallel do reduction(+:sum) &
   !$omp&  dist_schedule(static, 1024) schedule(static, 64)
      do i = 1,N
         sum = sum + B(i) * C(i)
      end do
   !$omp end teams
   !$omp end target
end function

! Note:  The variable sum is now mapped with tofrom, for correct
! execution with 4.5 (and pre-4.5) compliant compilers. See Devices Intro.
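
The chunk assignments implied by these clauses can be computed directly: dist_schedule(static, 1024) hands the 1024-iteration chunks to teams round-robin by team number, and schedule(static, 64) hands the 64-iteration sub-chunks of a team’s portion to its threads round-robin by thread number. The following host-only sketch is an illustration added here, assuming 8 teams of 16 threads as in the example above:

/* Added illustration: compute which team and thread would execute
   iteration i under dist_schedule(static,1024) and schedule(static,64). */
#include <stdio.h>

void owner_of(int i, int num_teams, int team_threads)
{
   int dist_chunk = 1024, chunk = 64;
   int team   = (i / dist_chunk) % num_teams;    /* round-robin 1024-chunks  */
   int offset = i % dist_chunk;                  /* position inside the chunk */
   int thread = (offset / chunk) % team_threads; /* round-robin 64-sub-chunks */
   printf("iteration %d -> team %d, thread %d\n", i, team, thread);
}

int main(void)
{
   owner_of(0,    8, 16);   /* team 0, thread 0  */
   owner_of(1500, 8, 16);   /* team 1, thread 7  */
   owner_of(9000, 8, 16);   /* team 0, thread 12 */
   return 0;
}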

6.15.5. target teams and distribute simd Constructs

The following example shows how the target teams and distribute simd constructs are used to execute a loop in a target region. The target teams construct creates a league of teams where the primary thread of each team executes the teams region.

The distribute simd construct schedules the loop iterations across the primary threads of each team and then uses SIMD parallelism to execute the iterations.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: teams.5
* type: C
* version: omp_4.0
*/
extern void init(float *, float *, int);
extern void output(float *, int);
void vec_mult(float *p, float *v1, float *v2, int N)
{
   int i;
   init(v1, v2, N);
   #pragma omp target teams map(to: v1[0:N], v2[:N]) map(from: p[0:N])
   #pragma omp distribute simd
   for (i=0; i<N; i++)
     p[i] = v1[i] * v2[i];
   output(p, N);
}
!!%compiler: gfortran
!!%cflags: -fopenmp

! name: teams.5
! type: F-free
! version: omp_4.0
subroutine vec_mult(p, v1, v2, N)
   real    ::  p(N), v1(N), v2(N)
   integer ::  i
   call init(v1, v2, N)
   !$omp target teams map(to: v1, v2) map(from: p)
      !$omp distribute simd
         do i=1,N
            p(i) = v1(i) * v2(i)
         end do
   !$omp end target teams
   call output(p, N)
end subroutine

6.15.6. target teams and Distribute Parallel Loop SIMD Constructs

The following example shows how the target teams and the distribute parallel loop SIMD constructs are used to execute a loop in a target teams region. The target teams construct creates a league of teams where the primary thread of each team executes the teams region.

The distribute parallel loop SIMD construct schedules the loop iterations across the primary threads of each team and then across the threads of each team, where each thread uses SIMD parallelism.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: teams.6
* type: C
* version: omp_4.0
*/
extern void init(float *, float *, int);
extern void output(float *, int);
void vec_mult(float *p, float *v1, float *v2, int N)
{
   int i;
   init(v1, v2, N);
   #pragma omp target teams map(to: v1[0:N], v2[:N]) map(from: p[0:N])
   #pragma omp distribute parallel for simd
   for (i=0; i<N; i++)
     p[i] = v1[i] * v2[i];
   output(p, N);
}
!!%compiler: gfortran
!!%cflags: -fopenmp

! name: teams.6
! type: F-free
! version: omp_4.0
subroutine vec_mult(p, v1, v2, N)
   real    ::  p(N), v1(N), v2(N)
   integer ::  i
   call init(v1, v2, N)
   !$omp target teams map(to: v1, v2) map(from: p)
      !$omp distribute parallel do simd
         do i=1,N
            p(i) = v1(i) * v2(i)
         end do
   !$omp end target teams
   call output(p, N)
end subroutine
