5.7. taskloop Construct

The following example illustrates how to execute a long-running task concurrently with the tasks that a taskloop directive creates for a loop whose iterations have unbalanced amounts of work.

The grainsize clause specifies that each task is to execute at least 500 iterations of the loop (and fewer than twice that number, that is, fewer than 1000).

The nogroup clause removes the implicit taskgroup of the taskloop construct; the explicit taskgroup construct in the example ensures that the function is not exited before the long-running task and the loops have finished execution.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: taskloop.1
* type: C
* version: omp_4.5
*/
void long_running_task(void);
void loop_body(int i, int j);

void parallel_work(void) {
   int i, j;
#pragma omp taskgroup
   {
#pragma omp task
      long_running_task(); // can execute concurrently

#pragma omp taskloop private(j) grainsize(500) nogroup
      for (i = 0; i < 10000; i++) { // can execute concurrently
         for (j = 0; j < i; j++) {
            loop_body(i, j);
         }
      }
   }
}
!!%compiler: gfortran
!!%cflags: -fopenmp

! name: taskloop.1
! type: F-free
! version:    omp_4.5
subroutine parallel_work
   integer i
   integer j
!$omp taskgroup

!$omp task
   call long_running_task()
!$omp end task

!$omp taskloop private(j) grainsize(500) nogroup
   do i=1,10000
      do j=1,i
         call loop_body(i, j)
      end do
   end do
!$omp end taskloop

!$omp end taskgroup
end subroutine
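
The example above leaves long_running_task and loop_body undefined and does not show how parallel_work is called. The following is a minimal sketch of one way to drive it, assuming that parallel_work is meant to be called by a single thread inside a parallel region. The counting stubs, the parameter n, and the wrapper run_parallel_work are hypothetical names introduced only for illustration; they are not part of the original example.

```c
#include <stdio.h>

/* Hypothetical instrumentation, not part of the original example. */
static long body_calls = 0;      /* number of loop_body invocations */
static int long_task_done = 0;   /* set once long_running_task has run */

void long_running_task(void) {
    long_task_done = 1;          /* stand-in for real long-running work */
}

void loop_body(int i, int j) {
    (void)i;
    (void)j;
    #pragma omp atomic
    body_calls++;                /* stand-in for real per-iteration work */
}

/* Same structure as the example; n replaces the constant 10000. */
void parallel_work(int n) {
    int i, j;
    #pragma omp taskgroup
    {
        #pragma omp task
        long_running_task();     /* can execute concurrently */

        #pragma omp taskloop private(j) grainsize(500) nogroup
        for (i = 0; i < n; i++) {
            for (j = 0; j < i; j++) {
                loop_body(i, j);
            }
        }
    }   /* the taskgroup waits here for the task and all taskloop tasks */
}

/* Assumed driver: one thread generates the tasks, the team executes them.
   Returns the number of loop_body calls, or -1 if the long task never ran. */
long run_parallel_work(int n) {
    body_calls = 0;
    long_task_done = 0;
    #pragma omp parallel
    #pragma omp single
    parallel_work(n);
    return long_task_done ? body_calls : -1;
}
```

With n = 10000, as in the example, the returned count should be 10000 * 9999 / 2 = 49995000, because iteration i of the outer loop calls loop_body exactly i times. If the code is compiled without OpenMP support, the pragmas are ignored and the same count is produced serially.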

Because a taskloop construct encloses a loop, it is often incorrectly perceived as a worksharing construct (when it is directly nested in a parallel region).

While a worksharing construct distributes the loop iterations across all threads in a team, every thread of the team that encounters a taskloop construct generates tasks for the entire loop, so the whole loop is executed once per encountering thread.

In the example below, the first taskloop is closely nested within a parallel region, and the entire loop is executed by each of the T threads; hence the atomic increment of x1 is executed T * N times.

The loop of the second taskloop is within a single region and is executed by a single thread, so only N increments of x2 occur. (The other T - 1 threads of the parallel region participate in executing the tasks. This is the common use case for the taskloop construct.)

In the example, the code thus prints x1 = 16384 (T * N) and x2 = 1024 (N), assuming the runtime provides all T = 16 requested threads.

//%compiler: clang
//%cflags: -fopenmp

/*
* name:   taskloop.2
* type:   C
* version: omp_4.5
*/
#include <stdio.h>

#define T 16
#define N 1024

void parallel_work() {
    int x1 = 0, x2 = 0;

    #pragma omp parallel shared(x1,x2) num_threads(T)
    {
        #pragma omp taskloop
        for (int i = 0; i < N; ++i) {
            #pragma omp atomic
            x1++;          // executed T*N times
        }

        #pragma omp single
        #pragma omp taskloop
        for (int i = 0; i < N; ++i) {
            #pragma omp atomic
            x2++;          // executed N times
        }
    }

    printf("x1 = %d, x2 = %d\n", x1, x2);
}
!!%compiler: gfortran
!!%cflags: -fopenmp

! name:   taskloop.2
! type:   F-free
! version: omp_4.5
subroutine parallel_work
    implicit none
    integer :: x1, x2
    integer :: i
    integer, parameter :: T = 16
    integer, parameter :: N = 1024

    x1 = 0
    x2 = 0
    !$omp parallel shared(x1,x2) num_threads(T)
    !$omp taskloop
    do i = 1,N
        !$omp atomic
        x1 = x1 + 1     ! executed T*N times
        !$omp end atomic
    end do
    !$omp end taskloop

    !$omp single
    !$omp taskloop
    do i = 1,N
        !$omp atomic
        x2 = x2 + 1     ! executed N times
        !$omp end atomic
    end do
    !$omp end taskloop
    !$omp end single
    !$omp end parallel

    write (*,'(A,I0,A,I0)') 'x1 = ', x1, ', x2 = ',x2
end subroutine
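
To make the contrast with a worksharing construct concrete, the following sketch counts iterations for three variants in one parallel region: a worksharing for loop, a taskloop encountered by every thread, and a taskloop inside a single region. It also records the team size the runtime actually provides, since num_threads(T) is a request and fewer threads may be granted. The names counts_t and count_iterations are hypothetical and used only for illustration.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define T 16
#define N 1024

/* Hypothetical result record for this sketch. */
typedef struct {
    int team_size;   /* threads actually in the team */
    int x_for;       /* increments under a worksharing loop */
    int x_taskloop;  /* increments under taskloop in every thread */
    int x_single;    /* increments under single + taskloop */
} counts_t;

counts_t count_iterations(void) {
    int team = 1, xf = 0, xt = 0, xs = 0;

    #pragma omp parallel shared(team, xf, xt, xs) num_threads(T)
    {
        #pragma omp single
        {
#ifdef _OPENMP
            team = omp_get_num_threads();   /* actual team size */
#endif
        }

        /* Worksharing: the N iterations are divided among the team. */
        #pragma omp for
        for (int i = 0; i < N; ++i) {
            #pragma omp atomic
            xf++;
        }

        /* Every thread encounters this taskloop and runs the whole loop. */
        #pragma omp taskloop
        for (int i = 0; i < N; ++i) {
            #pragma omp atomic
            xt++;
        }

        /* Only one thread encounters this taskloop. */
        #pragma omp single
        #pragma omp taskloop
        for (int i = 0; i < N; ++i) {
            #pragma omp atomic
            xs++;
        }
    }

    counts_t c = { team, xf, xt, xs };
    return c;
}
```

Under these assumptions, x_for and x_single should both equal N, while x_taskloop should equal team_size * N. That product matches T * N only when the full team of T threads is granted.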