8.2. unroll Construct#

The unroll construct is a loop transformation that increases the number of loop blocks in a loop, while reducing the number of iterations. The full clause specifies that the loop is to be completely unrolled. That is, a loop block for each iteration is created, and the loop is removed. A partial clause with a unroll-factor specifies that the number of iterations will be reduced multiplicatively by the factor while the number of blocks will be increased by the same factor. Operationally, the loop is tiled by the factor, and the tiled loop is fully expanded, resulting in a single loop with multiple blocks.

Unrolling can reduce control-flow overhead and provide additional optimization opportunities for the compiler and the processor pipeline. Nevertheless, unrolling can increase the code size, and saturate the instruction cache. Hence, the trade-off may need to be assessed. Unrolling a loop does not change the code’s semantics. Also, compilers may unroll loops without explicit directives, at various optimization levels.

In the example below, the unroll construct is used without any clause, and then with a full clause, in the first two functions, respectively. When no clause is used, it is up to the implementation (compiler) to decide if and how the loop is to be unrolled. The iteration count can have a run time value. In the second function, the unroll construct uses a full clause to completely unroll the loop. A compile-time constant is required for the iteration count. The statements in the third function ( unroll_full_equivalent ) illustrates equivalent code for the full unrolling in the second function.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: unroll.1
* type: C
* version:    omp_5.1
*/
void unroll(double A[], int n)
{
   #pragma omp unroll
   for (int i = 0; i < n; ++i)
      A[i] = 0;
}

void unroll_full(double A[])
{
   #pragma omp unroll full
   for (int i = 0; i < 4; ++i)
      A[i] = 0;
}

void unroll_full_equivalent(double A[])
{
   A[0] = 0;
   A[1] = 0;
   A[2] = 0;
   A[3] = 0;
}
!!%compiler: gfortran
!!%cflags: -fopenmp

! name:       unroll.1
! type:       F-free
! version:    omp_5.1
subroutine unroll(A, n)
   implicit none
   integer          :: i,n
   double precision :: A(n)

   !$omp unroll
   do i = 1,n
      A(i) = 0.0d0
   end do
end subroutine

subroutine unroll_full(A)
   implicit none
   integer :: i
   double precision :: A(*)

   !$omp unroll full
   do i = 1,4
      A(i) = 0.0d0
   end do
end subroutine

subroutine unroll_full_equivalent(A)
   implicit none
   double precision :: A(*)

   A(1) = 0.0d0
   A(2) = 0.0d0
   A(3) = 0.0d0
   A(4) = 0.0d0
end subroutine

The next example shows cases when it is incorrect to use full unrolling.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: unroll.2
* type: C
* version:    omp_5.1
*/
void illegal_2a(double A[])
{
    #pragma omp for
    #pragma omp unroll full  // ERROR: No loop left after full unrolling.
    for (int i = 0; i < 12; ++i)
        A[i] = 0;
}

void illegal_2b(double A[])
{
    // Loop might be fully unrolled (or a partially unrolled loop
    // replacement). Hence, no canonical for-loop, resulting in
    // non-compliant code. Implementations may suggest adding a
    // "partial" clause.

    #pragma omp for          //        Requires a canonical loop
    #pragma omp unroll       // ERROR: may result in non-compliant code
    for (int i = 0; i < 12; ++i)
        A[i] = 0;
}

void illegal_2c(int n, double A[])
{
    #pragma omp unroll full  // ERROR: Constant iteration count required.
    for (int i = 0; i < n; ++i)
        A[i] = 0;
}
!!%compiler: gfortran
!!%cflags: -fopenmp

! name:       unroll.2
! type:       F-free
! version:    omp_5.1
subroutine illegal_2a(A)
   implicit none
   double precision :: A(*)
   integer :: i

   !$omp do
   !$omp unroll full  !! ERROR: No loop left after full unrolling
   do i = 1,12
      A(i) = 0.0d0
   end do
end subroutine

subroutine illegal_2b(A)
   implicit none
   double precision :: A(*)
   integer :: i

   !! Loop might be fully unrolled (or a partially unrolled loop
   !! replacement).  Hence, no canonical do-loop will exist,
   !! resulting in non-compliant code.
   !! Implementations may suggest to adding a "partial" clause.

   !$omp do           !!        Requires a canonical loop
   !$omp unroll       !! ERROR: may result in non-compliant code
   do i = 1,12
      A(i) = 0.0d0
   end do
end subroutine

subroutine illegal_2c(n, A)
   implicit none
   integer          :: i,n
   double precision :: A(*)

   !$omp unroll full  !! Full unroll requires constant iteration count
   do i = 1,n
      A(i) = 0.0d0
   end do
end subroutine

In many cases, when the iteration count is large and/or dynamic, it is reasonable to partially unroll a loop by including a partial clause. In the unroll3_partial function below, the unroll-factor value of 4 is used to create a tile size of 4 that is unrolled to create 4 unrolled statements. The equivalent “hand unrolled’’ loop code is presented in the unroll3_partial_equivalent function. If the unroll-factor is omitted, as in the unroll3_partial_nofactor function, the implementation may optimally select a factor from 1 (no unrolling) to the iteration count (full unrolling). In the latter case the construct generates a loop with a single iteration.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: unroll.3
* type: C
* version:    omp_5.1
*/
void unroll3_partial(double A[])
{
   #pragma omp unroll partial(4)
   for (int i = 0; i < 128; ++i)
       A[i] = 0;
}

void unroll3_partial_equivalent(double A[])
{
   for (int i_iv = 0; i_iv < 32; ++i_iv) {
       A[i_iv * 4 + 0] = 0;
       A[i_iv * 4 + 1] = 0;
       A[i_iv * 4 + 2] = 0;
       A[i_iv * 4 + 3] = 0;
  }
}

void unroll3_partial_nofactor(double A[])
{
   #pragma omp unroll partial
   for (int i = 0; i < 128; ++i)
       A[i] = 0;
}
!!%compiler: gfortran
!!%cflags: -fopenmp

! name:       unroll.3
! type:       F-free
! version:    omp_5.1
subroutine unroll3_partial(A)
   implicit none
   double precision :: A(*)
   integer :: i

   !$omp unroll partial(4)
   do i = 1,128
      A(i) = 0
   end do
end subroutine

subroutine unroll3_partial_equivalent(A)
   implicit none
   double precision :: A(*)
   integer :: i_iv

   do i_iv = 0, 31
      A(i_iv * 4 + 1) = 0
      A(i_iv * 4 + 2) = 0
      A(i_iv * 4 + 3) = 0
      A(i_iv * 4 + 4) = 0
  end do
end subroutine

subroutine unroll3_partial_nofactor(A)
   implicit none
   double precision :: A(*)
   integer :: i

   !$omp unroll partial
   do i = 1, 128
      A(i) = 0
   end do
end subroutine

When the iteration count is not a multiple of the unroll-factor , iterations that should not produce executions must be conditionally protected from execution. In this example, the first function unrolls a loop that has a variable iteraction count. Since the unroll construct uses a partial( 4 ) clause, the compiler will need to create code that can account for cases when the iteration count is not a multiple of 4. A brute-force, simple-to-understand approach for implementing the conditionals is shown in the unroll_partial_remainder_option1 function.

The remaining two functions show more optimal algorithms the compiler may select to implement the transformation. Optimal approaches may reduce the number of conditionals as shown in unroll_partial_remainder_option2 , and may eliminate conditionals completely by peeling off a “remainder’’ into a separate loop as in unroll_partial_remainder_option3 .

Regardless of the optimization, implementations must ensure that the semantics remain the same, especially when additional directives are applied to the unrolled loop. For the case in the unroll_partial_remainder_option3 function, the fission of the worksharing-loop construct may result in a different distribution of threads to the iterations. Since no reproducible scheduling is specified on the work-sharing construct, the worksharing-loop and unrolling are compliant.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: unroll.4
* type: C
* version:    omp_5.1
*/
void unroll_partial_remainder(int n, int A[])
{
    #pragma omp parallel for
    #pragma omp unroll partial(4)
    for (int i = 0; i < n; ++i)
        A[i] = i;
}

void unroll_partial_remainder_option1(int n, int A[])
{
    #pragma omp parallel for
    for (int i_iv = 0; i_iv < (n+3)/4; ++i_iv) {
                              A[i_iv * 4 + 0] = i_iv * 4 + 0;
        if (i_iv * 4 + 1 < n) A[i_iv * 4 + 1] = i_iv * 4 + 1;
        if (i_iv * 4 + 2 < n) A[i_iv * 4 + 2] = i_iv * 4 + 2;
        if (i_iv * 4 + 3 < n) A[i_iv * 4 + 3] = i_iv * 4 + 3;
   }
}

void unroll_partial_remainder_option2(int n, int A[])
{
    #pragma omp parallel for
    for (int i_iv = 0; i_iv < (n+3)/4; ++i_iv) {
        if (i_iv < n/4) {
            A[i_iv * 4 + 0] = i_iv * 4 + 0;
            A[i_iv * 4 + 1] = i_iv * 4 + 1;
            A[i_iv * 4 + 2] = i_iv * 4 + 2;
            A[i_iv * 4 + 3] = i_iv * 4 + 3;
        } else {
            // remainder loop
            for (int i_rem = i_iv*4; i_rem < n; ++i_rem)
                A[i_rem] = i_rem;
        }
    }
}

void unroll_partial_remainder_option3(int n, int A[])
{
    // main loop
    #pragma omp parallel for
    for (int i_iv = 0; i_iv < n/4; ++i_iv) {
        A[i_iv * 4 + 0] = i_iv * 4 + 0;
        A[i_iv * 4 + 1] = i_iv * 4 + 1;
        A[i_iv * 4 + 2] = i_iv * 4 + 2;
        A[i_iv * 4 + 3] = i_iv * 4 + 3;
    }

    // remainder loop
    #pragma omp parallel for
    for (int i_rem = (n/4)*4; i_rem < n; ++i_rem)
        A[i_rem] = i_rem;
}

#include <stdio.h>
#define NT 12

int main(){
int error=0, A[NT],C[NT];
for(int i = 0; i<NT; i++){ A[i]=0; C[i]=i; }

    for(int i = 0; i<NT; i++) A[i]=0.0;
    unroll_partial_remainder(NT,A);
    for(int i = 0; i<NT; i++) if(A[i] != C[i]) error=1;

    for(int i = 0; i<NT; i++) A[i]=0.0;
    unroll_partial_remainder_option1(NT,A);
    for(int i = 0; i<NT; i++) if(A[i] != C[i]) error=1;

    for(int i = 0; i<NT; i++) A[i]=0.0;
    unroll_partial_remainder_option2(NT,A);
    for(int i = 0; i<NT; i++) if(A[i] != C[i]) error=1;

    for(int i = 0; i<NT; i++) A[i]=0.0;
    unroll_partial_remainder_option3(NT,A);
    for(int i = 0; i<NT; i++) if(A[i] != C[i]) error=1;

  if(!error) printf("OUT: Passed\n");
  if( error) printf("OUT: Failed\n");
}
!!%compiler: gfortran
!!%cflags: -fopenmp

! name:       unroll.4
! type:       F-free
! version:    omp_5.1
subroutine unroll_partial_remainder(n, A)
   implicit none
   integer :: n, i
   integer :: A(*)

    !$omp parallel do
    !$omp unroll partial(4)
    do i = 1, n
       A(i) = i
    end do

end subroutine

subroutine unroll_partial_remainder_option1(n, A)
   implicit none
   integer :: n, i_iv
   integer :: A(*)

   !$omp parallel do
   do i_iv = 0,(n+3)/4 -1
                               A(i_iv * 4 + 1) = i_iv * 4 + 1
        if (i_iv * 4 + 2 <= n) A(i_iv * 4 + 2) = i_iv * 4 + 2
        if (i_iv * 4 + 3 <= n) A(i_iv * 4 + 3) = i_iv * 4 + 3
        if (i_iv * 4 + 4 <= n) A(i_iv * 4 + 4) = i_iv * 4 + 4
   end do

end subroutine

subroutine unroll_partial_remainder_option2(n, A)
   implicit none
   integer :: n, i_iv, i_rem
   integer :: A(*)

    !$omp parallel do
    do i_iv = 0, (n+3)/4 -1
        if (i_iv < n/4) then
            A(i_iv * 4 + 1) = i_iv * 4 + 1
            A(i_iv * 4 + 2) = i_iv * 4 + 2
            A(i_iv * 4 + 3) = i_iv * 4 + 3
            A(i_iv * 4 + 4) = i_iv * 4 + 4
        else
           !! remainder loop
           do i_rem = i_iv*4 +1, n
              A(i_rem) = i_rem
           end do
        end if
    end do

end subroutine

subroutine unroll_partial_remainder_option3(n, A)
   implicit none
   integer :: n, i_iv, i_rem
   integer :: A(*)

    !$omp parallel do
    do i_iv = 0, (n/4) -1

       A(i_iv * 4 + 1) = i_iv * 4 + 1
       A(i_iv * 4 + 2) = i_iv * 4 + 2
       A(i_iv * 4 + 3) = i_iv * 4 + 3
       A(i_iv * 4 + 4) = i_iv * 4 + 4
    end do

    !! remainder loop
    !$omp parallel do
    do i_rem = (n/4)*4 +1, n
       A(i_rem) = i_rem
    end do

end subroutine

program main
implicit none
integer, parameter :: NT=12

integer :: i
logical :: error=.false.
integer   :: A(NT), C(NT)=[ (i, i=1,NT) ]

    A(1:NT)=0
    call unroll_partial_remainder(NT, A)
    if( .not. all(A(1:NT) == C(1:NT)) ) error = .true.

    A(1:NT)=0
    call unroll_partial_remainder_option1(NT, A)
    if( .not. all(A(1:NT) == C(1:NT)) ) error = .true.

    A(1:NT)=0
    call unroll_partial_remainder_option2(NT, A)
    if( .not. all(A(1:NT) == C(1:NT)) ) error = .true.

    A(1:NT)=0
    call unroll_partial_remainder_option3(NT, A)
    if( .not. all(A(1:NT) == C(1:NT)) ) error = .true.

    if(.not. error) print*, "OUT: Passed."
    if(      error) print*, "OUT: Failed"
end program