7.1. simd and declare simd Directives

The following example illustrates the basic use of the simd construct to assure the compiler that the loop can be vectorized.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: SIMD.1
* type: C
* version: omp_4.0
*/
void star( double *a, double *b, double *c, int n, int *ioff )
{
   int i;
   #pragma omp simd
   for ( i = 0; i < n; i++ )
      a[i] *= b[i] * c[i+ *ioff];
}

!!%compiler: gfortran
!!%cflags: -fopenmp

! name: SIMD.1
! type: F-free
! version: omp_4.0
subroutine star(a,b,c,n,ioff_ptr)
   implicit none
   double precision :: a(*),b(*),c(*)
   integer          :: n, i
   integer, pointer :: ioff_ptr

   !$omp simd
   do i = 1,n
      a(i) = a(i) * b(i) * c(i+ioff_ptr)
   end do

end subroutine
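
The simd construct is a promise from the programmer rather than a check performed by the compiler, so the caller of star is responsible for passing data for which the loop iterations are independent. The following is a minimal sketch of such a call, assuming three non-overlapping arrays. The helper star_demo and the sizes N and OFF are hypothetical names introduced only for illustration, and star is repeated from SIMD.1 so that the block compiles on its own.

```c
/* star() from SIMD.1, repeated so that this sketch is self-contained. */
void star( double *a, double *b, double *c, int n, int *ioff )
{
   int i;
   #pragma omp simd
   for ( i = 0; i < n; i++ )
      a[i] *= b[i] * c[i+ *ioff];
}

enum { N = 8, OFF = 2 };   /* illustrative sizes, not part of SIMD.1 */

/* Hypothetical helper: calls star() on non-overlapping arrays and
   returns element k of the result (0 <= k < N). */
double star_demo( int k )
{
   double a[N], b[N], c[N + OFF];   /* c must hold N + OFF elements */
   int i, ioff = OFF;

   for ( i = 0; i < N; i++ )       { a[i] = 2.0; b[i] = i; }
   for ( i = 0; i < N + OFF; i++ ) { c[i] = i; }

   star( a, b, c, N, &ioff );       /* a[i] becomes 2 * i * (i + OFF) */
   return a[k];
}
```

With these inputs, star_demo(3) evaluates 2 * 3 * (3 + 2) = 30.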

When a function can be inlined within a loop, the compiler has an opportunity to vectorize the loop. When the programmer guarantees SIMD behavior of a function’s operations, characterizes the arguments of the function, and privatizes temporary variables of the loop, the compiler can often create faster vector code for the loop. In the examples below, the declare simd directive is used on the add1 and add2 functions to enable creation of their corresponding SIMD function versions for execution within the associated SIMD loop. The two functions illustrate different ways of accessing data within a function: through a single variable (add1) and as an element of a data array (add2). The add3 C function accesses its data by dereferencing pointers.

The declare simd directives also illustrate the use of uniform and linear clauses. The uniform(fact) clause indicates that the variable fact is invariant across the SIMD lanes. In the add2 function, a and b are included in the uniform list because the C pointer and the Fortran array references are constant. The i index used in the add2 function is included in a linear clause with a constant-linear-step of 1, to guarantee a unit increment of the associated loop. In the declare simd directive for the add3 C function, the linear(a,b:1) clause instructs the compiler to generate unit-stride loads across the SIMD lanes; otherwise, costly gather instructions would be generated because the access pattern of the pointer dereferences would be unknown.

In the simd constructs for the loops, the private(tmp) clause is necessary to ensure that each vector operation has its own tmp variable.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: SIMD.2
* type: C
* version: omp_4.0
*/
#include <stdio.h>

#pragma omp declare simd uniform(fact)
double add1(double a, double b, double fact)
{
   double c;
   c = a + b + fact;
   return c;
}

#pragma omp declare simd uniform(a,b,fact) linear(i:1)
double add2(double *a, double *b, int i, double fact)
{
   double c;
   c = a[i] + b[i] + fact;
   return c;
}

#pragma omp declare simd uniform(fact) linear(a,b:1)
double add3(double *a, double *b, double fact)
{
   double c;
   c = *a + *b + fact;
   return c;
}

void work( double *a, double *b, int n )
{
   int i;
   double tmp;
   #pragma omp simd private(tmp)
   for ( i = 0; i < n; i++ ) {
      tmp  = add1( a[i],  b[i], 1.0);
      a[i] = add2( a,     b, i, 1.0) + tmp;
      a[i] = add3(&a[i], &b[i], 1.0);
   }
}

int main(){
   int i;
   const int N=32;
   double a[N], b[N];

   for ( i=0; i<N; i++ ) {
      a[i] = i; b[i] = N-i;
   }

   work(a, b, N );

   for ( i=0; i<N; i++ ) {
      printf("%d %f\n", i, a[i]);
   }

   return 0;
}

!!%compiler: gfortran
!!%cflags: -fopenmp

! name: SIMD.2
! type: F-free
! version: omp_4.0
program main
   implicit none
   integer, parameter :: N=32
   integer :: i
   double precision   :: a(N), b(N)
   do i = 1,N
      a(i) = i-1
      b(i) = N-(i-1)
   end do
   call work(a, b, N )
   do i = 1,N
      print*, i,a(i)
   end do
end program

function add1(a,b,fact) result(c)
   implicit none
!$omp declare simd(add1) uniform(fact)
   double precision :: a,b,fact, c
   c = a + b + fact
end function

function add2(a,b,i, fact) result(c)
   implicit none
!$omp declare simd(add2) uniform(a,b,fact) linear(i:1)
   integer          :: i
   double precision :: a(*),b(*),fact, c
   c = a(i) + b(i) + fact
end function

subroutine work(a, b, n )
   implicit none
   double precision           :: a(n),b(n), tmp
   integer                    :: n, i
   double precision, external :: add1, add2

   !$omp simd private(tmp)
   do i = 1,n
      tmp  = add1(a(i), b(i), 1.0d0)
      a(i) = add2(a,    b, i, 1.0d0) + tmp
      a(i) = a(i) + b(i) + 1.0d0
   end do
end subroutine
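
As an aid to reading the uniform and linear clauses described above, the following sketch emulates in plain scalar C what a SIMD variant of add2 and add3 conceptually computes for one group of lanes. It is only a mental model under stated assumptions, not the code a compiler generates: the vector length VLEN and the names add2_lanes, add3_lanes, and lanes_agree are hypothetical.

```c
#define VLEN 4   /* hypothetical vector length, for illustration only */

/* uniform(a,b,fact): one value of a, b and fact serves every lane.
   linear(i:1): lane k works on index i + k. */
void add2_lanes( const double *a, const double *b, int i, double fact,
                 double out[VLEN] )
{
   int lane;
   for ( lane = 0; lane < VLEN; lane++ )
      out[lane] = a[i + lane] + b[i + lane] + fact;
}

/* linear(a,b:1): lane k receives the pointers a + k and b + k, so the
   dereferences touch consecutive elements (unit stride). */
void add3_lanes( const double *a, const double *b, double fact,
                 double out[VLEN] )
{
   int lane;
   for ( lane = 0; lane < VLEN; lane++ )
      out[lane] = *(a + lane) + *(b + lane) + fact;
}

/* Returns 1 if both emulations agree when started at the same element. */
int lanes_agree( void )
{
   double a[8], b[8], r2[VLEN], r3[VLEN];
   int i, lane;

   for ( i = 0; i < 8; i++ ) { a[i] = i; b[i] = 8 - i; }

   add2_lanes( a, b, 4, 1.0, r2 );        /* elements 4..7 by index   */
   add3_lanes( &a[4], &b[4], 1.0, r3 );   /* elements 4..7 by pointer */

   for ( lane = 0; lane < VLEN; lane++ )
      if ( r2[lane] != r3[lane] || r2[lane] != 9.0 ) return 0;
   return 1;
}
```

In both emulations the only quantity that differs between lanes is the linear argument, which is what allows consecutive lanes to be served by a single contiguous load.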

A thread that encounters a SIMD construct executes a vectorized version of the loop iterations. As with a worksharing loop, temporary variables in a loop vectorized with a SIMD construct must be privatized with a private clause, and reduction variables must be declared in a reduction clause. The example below illustrates the use of private and reduction clauses in a SIMD construct.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: SIMD.3
* type: C
* version: omp_4.0
*/
double work( double *a, double *b, int n )
{
   int i;
   double tmp, sum;
   sum = 0.0;
   #pragma omp simd private(tmp) reduction(+:sum)
   for (i = 0; i < n; i++) {
      tmp = a[i] + b[i];
      sum += tmp;
   }
   return sum;
}

!!%compiler: gfortran
!!%cflags: -fopenmp

! name: SIMD.3
! type: F-free
! version: omp_4.0
subroutine work( a, b, n, sum )
   implicit none
   integer :: i, n
   double precision :: a(n), b(n), sum, tmp

   sum = 0.0d0
   !$omp simd private(tmp) reduction(+:sum)
   do i = 1,n
      tmp = a(i) + b(i)
      sum = sum + tmp
   end do

end subroutine work
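
One way to picture why sum needs a reduction clause and tmp a private clause is to write out the lanes by hand. The sketch below is a scalar emulation under assumptions: VLEN is a hypothetical vector length, and sum_lanes and sum_lanes_demo are illustrative names. A real compiler chooses its own vector length and code shape.

```c
#define VLEN 4   /* hypothetical vector length, for illustration only */

/* Emulates reduction(+:sum) with one partial sum per lane. */
double sum_lanes( const double *a, const double *b, int n )
{
   double partial[VLEN] = { 0.0 };   /* private copy of sum per lane */
   double sum = 0.0;
   int i, lane;

   /* Full groups of VLEN iterations. */
   for ( i = 0; i + VLEN <= n; i += VLEN ) {
      for ( lane = 0; lane < VLEN; lane++ ) {
         double tmp = a[i + lane] + b[i + lane];   /* private per lane */
         partial[lane] += tmp;
      }
   }

   /* Combine the per-lane partial sums into the original variable. */
   for ( lane = 0; lane < VLEN; lane++ )
      sum += partial[lane];

   /* Remainder iterations that do not fill a whole group. */
   for ( ; i < n; i++ )
      sum += a[i] + b[i];

   return sum;
}

/* Hypothetical driver: a[i] = i and b[i] = 1 for 10 elements. */
double sum_lanes_demo( void )
{
   double a[10], b[10];
   int i;
   for ( i = 0; i < 10; i++ ) { a[i] = i; b[i] = 1.0; }
   return sum_lanes( a, b, 10 );
}
```

For this input the result is 45 + 10 = 55. Because the partial sums are combined in a different order than in the sequential loop, the floating-point result for general data may differ from the sequential one in the last bits.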

A safelen(N) clause in a simd construct assures the compiler that there are no loop-carried dependencies for vectors of size N or below. If the safelen clause is not specified, then the default safelen value is the number of loop iterations.

The safelen(16) clause in the example below guarantees that the vector code is safe for vectors up to and including size 16. Iteration i reads b[i-m], the element written m iterations earlier, so m must be 16 or greater for the code to execute correctly. If the value of m is less than 16, the behavior is undefined.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: SIMD.4
* type: C
* version: omp_4.0
*/
void work( float *b, int n, int m )
{
   int i;
   #pragma omp simd safelen(16)
   for (i = m; i < n; i++)
      b[i] = b[i-m] - 1.0f;
}

!!%compiler: gfortran
!!%cflags: -fopenmp

! name: SIMD.4
! type: F-free
! version: omp_4.0
subroutine work( b, n, m )
   implicit none
   real       :: b(n)
   integer    :: i,n,m

   !$omp simd safelen(16)
   do i = m+1, n
      b(i) = b(i-m) - 1.0
   end do
end subroutine work
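
Since the behavior is undefined when m is smaller than the safelen value, a caller that cannot guarantee the dependence distance may want to guard the SIMD version. The following is a sketch of one possible guard and is not part of the original example: work_scalar, work_checked, and safelen_demo are hypothetical names, and work is repeated from SIMD.4 so that the block compiles on its own.

```c
/* work() from SIMD.4, repeated so that this sketch is self-contained. */
void work( float *b, int n, int m )
{
   int i;
   #pragma omp simd safelen(16)
   for (i = m; i < n; i++)
      b[i] = b[i-m] - 1.0f;
}

/* Same loop without the simd construct, valid for any m >= 0. */
void work_scalar( float *b, int n, int m )
{
   int i;
   for (i = m; i < n; i++)
      b[i] = b[i-m] - 1.0f;
}

/* Hypothetical guard: use the SIMD version only when the dependence
   distance m satisfies the safelen(16) promise. Assumes m >= 0. */
void work_checked( float *b, int n, int m )
{
   if ( m >= 16 )
      work( b, n, m );
   else
      work_scalar( b, n, m );
}

/* Hypothetical check for m >= 1: starting from b[i] = i, the recurrence
   yields b[i] = (i % m) - (i / m). Returns 1 if every element matches. */
int safelen_demo( int m )
{
   enum { LEN = 64 };
   float b[LEN];
   int i;

   for ( i = 0; i < LEN; i++ ) b[i] = (float)i;
   work_checked( b, LEN, m );

   for ( i = 0; i < LEN; i++ )
      if ( b[i] != (float)(i % m - i / m) ) return 0;
   return 1;
}
```

The guard costs one comparison per call and keeps the safelen(16) promise truthful on every path that reaches the simd loop.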

The following combined worksharing-loop SIMD construct (for simd in C, do simd in Fortran) instructs the compiler to collapse the i and j loops into a single loop whose iterations are divided into chunks that are distributed among the threads of the team. Within the chunk assigned to a thread, the iterations are executed in the lanes of the vector units.

//%compiler: clang
//%cflags: -fopenmp

/*
* name: SIMD.5
* type: C
* version: omp_4.0
*/
void work( double **a, double **b, double **c, int n )
{
   int i, j;
   double tmp;
   #pragma omp for simd collapse(2) private(tmp)
   for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
         tmp = a[i][j] + b[i][j];
         c[i][j] = tmp;
      }
   }
}

!!%compiler: gfortran
!!%cflags: -fopenmp

! name: SIMD.5
! type: F-free
! version: omp_4.0
subroutine work( a, b, c,  n )
   implicit none
   integer :: i,j,n
   double precision :: a(n,n), b(n,n), c(n,n), tmp

   !$omp do simd collapse(2) private(tmp)
   do j = 1,n
      do i = 1,n
         tmp = a(i,j) + b(i,j)
         c(i,j) = tmp
      end do
   end do

end subroutine work
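
The for simd construct in work is orphaned: it distributes iterations among threads only when work is called from inside a parallel region, and otherwise the loop is executed by the single encountering thread. The sketch below shows one way to call it, under assumptions: the driver collapse_demo and the size N are hypothetical, the row-pointer arrays are built only to match the double ** interface, and work is repeated from SIMD.5 so that the block compiles on its own.

```c
/* work() from SIMD.5, repeated so that this sketch is self-contained. */
void work( double **a, double **b, double **c, int n )
{
   int i, j;
   double tmp;
   #pragma omp for simd collapse(2) private(tmp)
   for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
         tmp = a[i][j] + b[i][j];
         c[i][j] = tmp;
      }
   }
}

enum { N = 6 };   /* illustrative size, not part of SIMD.5 */

/* Hypothetical driver: returns 1 if c[i][j] == a[i][j] + b[i][j]
   for all elements after the call. */
int collapse_demo( void )
{
   static double a[N][N], b[N][N], c[N][N];
   double *pa[N], *pb[N], *pc[N];   /* row pointers for the ** interface */
   int i, j;

   for ( i = 0; i < N; i++ ) {
      pa[i] = a[i]; pb[i] = b[i]; pc[i] = c[i];
      for ( j = 0; j < N; j++ ) {
         a[i][j] = i; b[i][j] = j; c[i][j] = -1.0;
      }
   }

   /* Every thread of the team calls work(); the for simd construct
      inside divides the collapsed iteration space among them. */
   #pragma omp parallel
   work( pa, pb, pc, N );

   for ( i = 0; i < N; i++ )
      for ( j = 0; j < N; j++ )
         if ( c[i][j] != (double)(i + j) ) return 0;
   return 1;
}
```

When the code is compiled without OpenMP support, the directives are ignored and the same driver runs the loop nest sequentially with the same result.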