5.7. taskloop Construct
The following example illustrates how to execute a long-running task concurrently with the tasks that a taskloop directive creates for a loop whose iterations carry unbalanced amounts of work.
The grainsize clause specifies that each task is to execute at least 500 iterations of the loop.
The nogroup clause removes the implicit taskgroup of the taskloop construct; the explicit taskgroup construct in the example ensures that the function is not exited before the long-running task and the loops have finished execution.
//%compiler: clang
//%cflags: -fopenmp
/*
* name: taskloop.1
* type: C
* version: omp_4.5
*/
void long_running_task(void);
void loop_body(int i, int j);

void parallel_work(void) {
    int i, j;
    #pragma omp taskgroup
    {
        #pragma omp task
        long_running_task(); // can execute concurrently

        #pragma omp taskloop private(j) grainsize(500) nogroup
        for (i = 0; i < 10000; i++) { // can execute concurrently
            for (j = 0; j < i; j++) {
                loop_body(i, j);
            }
        }
    }
}
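The C version above only declares long_running_task and loop_body, and it creates tasks without opening a parallel region itself. The following is a minimal sketch of one way to drive it, not part of the original example: the stub bodies, the counters `calls` and `long_task_done`, the wrapper `run_parallel_work`, and the reduced trip count `N_OUTER` are all hypothetical names introduced here for illustration. The wrapper calls parallel_work from a single thread inside a parallel region so that the other threads of the team are available to execute the generated tasks.

```c
#include <stdio.h>

#define N_OUTER 2000            /* assumption: smaller than the 10000 used in taskloop.1 */

static long calls = 0;          /* hypothetical counter: number of loop_body invocations */
static int long_task_done = 0;  /* hypothetical flag: set by the long-running task */

/* Stub standing in for the real long-running work. */
void long_running_task(void) {
    #pragma omp atomic write
    long_task_done = 1;
}

/* Stub standing in for the real loop body; it only counts invocations. */
void loop_body(int i, int j) {
    (void)i;
    (void)j;
    #pragma omp atomic
    calls++;
}

/* Same structure as taskloop.1, with the trip count reduced to N_OUTER. */
void parallel_work(void) {
    int i, j;
    #pragma omp taskgroup
    {
        #pragma omp task
        long_running_task();

        #pragma omp taskloop private(j) grainsize(500) nogroup
        for (i = 0; i < N_OUTER; i++) {
            for (j = 0; j < i; j++) {
                loop_body(i, j);
            }
        }
    } /* end of taskgroup: the task and all taskloop tasks have completed */
}

/* Hypothetical driver: one thread creates the tasks, the team executes them.
   Returns the number of loop_body calls, or -1 if the long task did not run. */
long run_parallel_work(void) {
    calls = 0;
    long_task_done = 0;
    #pragma omp parallel
    #pragma omp single
    parallel_work();
    return long_task_done ? calls : -1;
}
```

With these stubs, run_parallel_work() should return 0 + 1 + ... + 1999 = 1999000 regardless of the number of threads, because every (i, j) pair is processed exactly once. If the file is compiled without OpenMP support, the pragmas are ignored and the same count results from serial execution.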
!!%compiler: gfortran
!!%cflags: -fopenmp
! name: taskloop.1
! type: F-free
! version: omp_4.5
subroutine parallel_work
    integer i
    integer j
    !$omp taskgroup

    !$omp task
    call long_running_task()
    !$omp end task

    !$omp taskloop private(j) grainsize(500) nogroup
    do i=1,10000
        do j=1,i
            call loop_body(i, j)
        end do
    end do
    !$omp end taskloop

    !$omp end taskgroup
end subroutine
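The effect of grainsize(500) on the number of generated tasks can be estimated with simple arithmetic. The sketch below assumes the OpenMP 4.5 rule that each task receives at least min(grainsize, number of iterations) logical iterations and fewer than twice the grainsize. The type `task_bounds_t` and the function `taskloop_task_bounds` are hypothetical helpers written for this illustration; they are not part of any OpenMP API, and the actual number of tasks within these bounds is chosen by the implementation.

```c
/* Hypothetical helper type: lower and upper bound on the number of tasks. */
typedef struct {
    long min_tasks;
    long max_tasks;
} task_bounds_t;

/* Bounds on the number of tasks for n logical iterations under grainsize(g),
   assuming each task gets at least min(g, n) and fewer than 2*g iterations. */
task_bounds_t taskloop_task_bounds(long n, long g) {
    task_bounds_t b = {0, 0};
    if (n <= 0 || g <= 0) {
        return b;                       /* nothing to distribute */
    }
    if (n <= g) {
        b.min_tasks = 1;                /* every task needs at least n iterations, */
        b.max_tasks = 1;                /* so a single task takes the whole loop   */
        return b;
    }
    long most_per_task = 2 * g - 1;     /* largest chunk allowed by the rule */
    b.min_tasks = (n + most_per_task - 1) / most_per_task;  /* ceiling division */
    b.max_tasks = n / g;                /* smallest chunk is g iterations */
    return b;
}
```

For the loop in taskloop.1 (10000 iterations, grainsize 500) this gives between 11 and 20 tasks. Because iteration i performs i inner iterations, tasks covering the later chunks carry more work than those covering the earlier ones, which is the imbalance the example refers to.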
Because a taskloop construct encloses a loop, it is often incorrectly perceived as a worksharing construct when it is directly nested in a parallel region.
A worksharing construct distributes the loop iterations across the threads of a team. A taskloop construct does not: every thread that encounters it generates tasks for the entire loop, so the whole loop is executed once per encountering thread.
In the example below the first taskloop occurs closely nested within a parallel region and the entire loop is executed by each of the T threads; hence the reduction sum is executed T * N times.
The loop of the second taskloop is within a single region and is encountered by a single thread, so only N reduction sums occur. (The other T - 1 threads of the parallel region participate in executing the tasks. This is the common use case for the taskloop construct.)
In the example, the code thus prints x1 = 16384 (T * N) and x2 = 1024 (N), provided the runtime supplies all T = 16 requested threads.
//%compiler: clang
//%cflags: -fopenmp
/*
* name: taskloop.2
* type: C
* version: omp_4.5
*/
#include <stdio.h>

#define T 16
#define N 1024

void parallel_work() {
    int x1 = 0, x2 = 0;

    #pragma omp parallel shared(x1,x2) num_threads(T)
    {
        #pragma omp taskloop
        for (int i = 0; i < N; ++i) {
            #pragma omp atomic
            x1++; // executed T*N times
        }

        #pragma omp single
        #pragma omp taskloop
        for (int i = 0; i < N; ++i) {
            #pragma omp atomic
            x2++; // executed N times
        }
    }
    printf("x1 = %d, x2 = %d\n", x1, x2);
}
!!%compiler: gfortran
!!%cflags: -fopenmp
! name: taskloop.2
! type: F-free
! version: omp_4.5
subroutine parallel_work
    implicit none
    integer :: x1, x2
    integer :: i
    integer, parameter :: T = 16
    integer, parameter :: N = 1024

    x1 = 0
    x2 = 0
    !$omp parallel shared(x1,x2) num_threads(T)
    !$omp taskloop
    do i = 1,N
        !$omp atomic
        x1 = x1 + 1    ! executed T*N times
        !$omp end atomic
    end do
    !$omp end taskloop

    !$omp single
    !$omp taskloop
    do i = 1,N
        !$omp atomic
        x2 = x2 + 1    ! executed N times
        !$omp end atomic
    end do
    !$omp end taskloop
    !$omp end single
    !$omp end parallel

    write (*,'(A,I0,A,I0)') 'x1 = ', x1, ', x2 = ',x2
end subroutine
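The contrast between taskloop and a worksharing loop can be checked directly. The sketch below is a variation on taskloop.2, not part of the original example: it adds a third counter incremented in a worksharing `for` loop and records the team size with omp_get_num_threads. The type `counts_t`, the function `count_iterations`, and its parameter `requested_threads` are hypothetical names introduced here. Since num_threads is a request and the runtime may provide a smaller team, the first counter is compared against the actual team size rather than a fixed constant.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define N 1024

/* Hypothetical result type for this sketch. */
typedef struct {
    int team;        /* threads actually in the team */
    int per_thread;  /* taskloop encountered by every thread of the team */
    int single;      /* taskloop encountered by one thread (inside single) */
    int workshare;   /* worksharing loop: iterations divided among threads */
} counts_t;

counts_t count_iterations(int requested_threads) {
    int x1 = 0, x2 = 0, x3 = 0, team = 1;

    #pragma omp parallel shared(x1, x2, x3, team) num_threads(requested_threads)
    {
        /* One thread records the real team size; stays 1 without OpenMP. */
        #pragma omp single
        {
#ifdef _OPENMP
            team = omp_get_num_threads();
#endif
        }

        /* Every thread encounters this taskloop: team * N increments. */
        #pragma omp taskloop
        for (int i = 0; i < N; ++i) {
            #pragma omp atomic
            x1++;
        }

        /* Only one thread encounters this taskloop: N increments. */
        #pragma omp single
        #pragma omp taskloop
        for (int i = 0; i < N; ++i) {
            #pragma omp atomic
            x2++;
        }

        /* Worksharing loop: the N iterations are split across the team. */
        #pragma omp for
        for (int i = 0; i < N; ++i) {
            #pragma omp atomic
            x3++;
        }
    }

    counts_t c = { team, x1, x2, x3 };
    return c;
}
```

For any team size, per_thread should equal team * N while single and workshare should both equal N. Compiled without OpenMP support, the pragmas are ignored, team stays at 1, and all three counters equal N.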