2.2. Creating SPMD parallelism using OpenMP parallel directive#

Starting from this section, we introduce how to use OpenMP directives to write programs. We begin with the most basic and most commonly used directive: parallel.

2.2.1. Get Started with Parallel Directive to Create Parallelism#

The parallel directive is used to mark a parallel region. When a thread encounters a parallel construct, a team of threads is created to execute the parallel region. The original thread that was executing the serial part becomes the primary thread of the new team. All threads in the team execute the parallel region together. Once a team is created, the number of threads in the team remains constant for the duration of that parallel region.

The primary thread is also known as the master thread; the term master is deprecated in newer OpenMP specifications.

When a thread team is created, one implicit task is generated for each thread in the team, and each implicit task is assigned and bound to one thread for the duration of the parallel region.

The following example from Chapter 1 shows how to use the parallel directive in C.

//%compiler: clang
//%cflags: -fopenmp

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]){
    #pragma omp parallel
    printf("%s\n", "Hello World");
    
    return 0;
}
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World

This example prints Hello World 8 times, which means 8 threads were created by default. The default number of threads is implementation-defined and usually matches the number of logical cores of the machine; 8 threads are created on the author’s computer. The following example shows how to use the num_threads clause on the parallel directive to specify the number of threads to create.

//%compiler: clang
//%cflags: -fopenmp

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]){
    #pragma omp parallel num_threads(4)
    printf("%s\n", "Hello World");
    
    return 0;
}
Hello World
Hello World
Hello World
Hello World

In this example, we use the num_threads clause to request 4 threads to execute the parallel region. When the primary thread encounters the parallel construct, three additional threads are created, and together with the primary thread they form a team of 4. Hello World is printed four times, once per thread.

The next two examples show how to use the parallel directive in Fortran; they have exactly the same meaning as the two C examples above.

!!%compiler: gfortran
!!%cflags: -fopenmp

PROGRAM Parallel_Hello_World
USE OMP_LIB

!$OMP PARALLEL

  PRINT *, "Hello World"

!$OMP END PARALLEL

END
 Hello World
 Hello World
 Hello World
 Hello World
 Hello World
 Hello World
 Hello World
 Hello World
!!%compiler: gfortran
!!%cflags: -fopenmp

PROGRAM Parallel_Hello_World
USE OMP_LIB

!$OMP PARALLEL num_threads(4)

  PRINT *, "Hello World"

!$OMP END PARALLEL

END
 Hello World
 Hello World
 Hello World
 Hello World

2.2.2. Syntax and Semantics of Parallel Directive and Its Clauses#

Through the examples shown in the last section, it is not difficult to conclude that the syntax of the parallel directive in C is:

#pragma omp parallel [clause[ [,] clause] ... ] new-line
    structured-block

The syntax of the parallel directive in Fortran is:

!$omp parallel [clause[ [,] clause] ... ]
    structured-block
!$omp end parallel

As we have already introduced in chapter 1, clauses are used to specify additional information on the directive. Ten clauses can be used with the parallel directive, listed as follows:

if([ parallel :] scalar-expression)
num_threads(integer-expression)
default(data-sharing-attribute)
private(list)
firstprivate(list)
shared(list)
copyin(list)
reduction([reduction-modifier ,] reduction-identifier : list)
proc_bind(affinity-policy)
allocate([allocator :] list)

2.2.2.1. if Clause#

The if clause is used to achieve conditional parallelism. It can be used with many directives, such as the parallel, task, and simd directives, and its effect depends on the construct to which it is applied. The syntax is as follows:

if([ directive-name-modifier :] scalar-expression)

or

if([ directive-name-modifier :] scalar-logical-expression)

The directive-name-modifier is optional and is very useful in combined constructs, which we will cover in later chapters. When the if clause is used with the parallel directive, only parallel can be used as the directive-name-modifier. Its semantics are that the parallel region is active when the scalar-expression (or scalar-logical-expression in Fortran) evaluates to true; otherwise, the parallel region that follows is inactive and is executed by a single thread. At most one if clause can appear on each parallel construct.

2.2.2.2. num_threads Clause#

The syntax for num_threads is as follows:

num_threads(integer-expression)

We used the num_threads clause in the previous example to indicate the number of threads used to execute the parallel region; that number is determined by integer-expression. At most one num_threads clause can appear on each parallel construct, because the number of threads must be uniquely determined on entry to the parallel region, whether specified explicitly by the programmer with the num_threads clause or chosen implicitly by the implementation. Of course, the integer-expression must evaluate to an integer greater than zero.

2.2.2.3. Data-Sharing Attribute Clauses#

Data-sharing attribute clauses are used to control the data-sharing attributes of variables.

There are four data-sharing attribute clauses that can be used with the parallel directive, namely default clause, private clause, firstprivate clause and shared clause. The lastprivate clause and linear clause are two other data-sharing attribute clauses. They cannot be used with parallel directives, so we will skip them for now and describe them in detail in later chapters.

We first introduce the private clause, firstprivate clause and shared clause.

2.2.2.3.1. private Clause#

The syntax of the private clause is as follows:

private(list)

As mentioned before, a set of implicit tasks, equal in number to the number of threads in the team, is generated by the primary thread when it encounters the parallel construct. The private clause is used to declare the variable(s) in the list private to a task or a SIMD lane. A private variable has a different address in the execution context of every thread. These variables are private to each thread, and a thread cannot access the private variables of other threads. Programmers can use the private clause as many times as needed.

2.2.2.3.2. firstprivate Clause#

The firstprivate clause is very similar to the private clause. They both indicate that the variables in the list are private to the thread. Its syntax is as follows:

firstprivate(list)

But unlike the private clause, each thread’s copy of a variable in the list is initialized with the value the original item had before the parallel construct. Like the private clause, the firstprivate clause can be used multiple times on a parallel construct.

2.2.2.3.3. shared Clause#

The shared clause is used to declare that the items in the list are shared by the tasks or SIMD lanes. Its syntax is as follows:

shared(list)

Shared variables have the same address in the execution context of every thread. In a parallel region, all threads or SIMD lanes can access these variables. The shared clause can also be used multiple times within a parallel construct.

2.2.2.3.4. default Clause#

The syntax of default clause:

default(data-sharing-attribute)

The default clause is used to define the default data-sharing attribute of variables in a parallel construct (it can also be used on a teams or task-generating construct). The data-sharing-attribute is one of the following:

private
firstprivate
shared
none

When the data-sharing attribute of a variable is not specified, the data-sharing attribute of this variable will be set to the attribute specified in the default clause. If we have a variable a and the data-sharing-attribute in the default clause is shared, then we can understand it like this:

if (variable a has no explicit data-sharing attribute && the clause default(shared) is present) {
    // equivalent to writing the clause shared(a)
}

A special note for the default(none) clause:

The default(none) clause means that no default data-sharing attribute is defined for variables. The compiler does not implicitly determine one for us; therefore, every variable referenced in the construct must be listed explicitly in some other data-sharing attribute clause.

Unlike the three clauses above, the default clause can appear only once on a directive.

2.2.2.3.5. Implicit Scoping Rules#

You may ask: if the default clause is not used, what data-sharing attribute does a variable have when it is not explicitly listed in any clause? In fact, OpenMP defines a set of implicit scoping rules that determine the data-sharing attribute of a variable based on how the variable is used. For example, for the threads in a team, a variable is scoped as shared if using it in the parallel region cannot result in a data race. A variable is scoped as private if, in each thread executing the parallel region, it is always written by that thread before being read by it. The detailed rules can be found in the OpenMP specification.

Although OpenMP proposes as detailed and explicit rules as possible for implicit scoping, there is still a high possibility of unpredictable errors. Therefore, it is undoubtedly the best choice for programmers to use clauses to explicitly specify the data-sharing attributes of variables. This is an important aspect of OpenMP program optimization.

2.2.2.4. copyin Clause#

The copyin clause is one of the two data copying clauses that can be used on the parallel construct or on combined parallel worksharing constructs. The other is the copyprivate clause, which is only allowed on the single construct.

Before introducing these two clauses, we need to introduce a new data-sharing attribute, named threadprivate. The difference between private and threadprivate variables can be briefly described as follows:

  • Private variables are local to a region and are placed on the stack most of the time. The lifetime of a private variable is the duration defined by the data-sharing clause. Every thread, including the primary thread, gets a private copy of the original variable, and the new variable is no longer storage-associated with the original.

  • Threadprivate variables persist across regions and are most likely placed in the heap or in thread-local storage, which can be seen as memory local to each thread. The primary thread uses the original variable; all other threads make private copies of it. The primary thread’s variable remains storage-associated with the original variable.

The syntax of the copyin Clause is as follows:

copyin(list)

It copies the value of the primary thread’s threadprivate variable into the threadprivate variable of every other member of the team executing the parallel region. It can be used multiple times within a parallel construct.

2.2.2.5. reduction Clause#

The reduction clause belongs to both the reduction scoping clauses and the reduction participating clauses. A reduction scoping clause defines the region over which a reduction is computed by a task or a SIMD lane. A reduction participating clause designates a task or SIMD lane as a participant in the reduction. The reduction clause is specifically designed for reduction operations: it allows the user to specify one or more variables that are private to each thread and are combined by the reduction operation at the end of the parallel region.

The syntax of the reduction clause is as follows:

reduction([ reduction-modifier,]reduction-identifier : list)

The reduction-modifier is optional and is used to describe the characteristics of the reduction operation. It can be one of the following:

inscan
task
default

When the reduction-modifier is inscan, the list items are updated on each iteration of an enclosed loop nest with a scan computation. In this case, each list item must also appear as a list item in an inclusive or exclusive clause on a scan directive enclosed by the construct, which determines whether the scan result stored in the current iteration includes or excludes that iteration’s scan input.

When the reduction-modifier is task, an indeterminate number of additional private copies will be generated to support task reduction, and these reduction-related copies will be initialized before they are accessed by the tasks.

When the reduction-modifier is default, or when no reduction-modifier is specified, the behavior depends on the construct on which the reduction appears. For the parallel, scope and simd constructs, one or more private copies of each list item are created for each implicit task (for parallel and scope) or SIMD lane (for simd), as if the private clause had been used. The rules for other constructs can be found in the OpenMP specification and we will not go into detail here.

The reduction-identifier is used to specify the reduction operator. According to the OpenMP specification, the reduction-identifier has the following syntax:

  • For C, a reduction-identifier is either an identifier or one of the following operators: +, - (deprecated), *, &, |, ^, && and ||.

  • For C++, a reduction-identifier is either an id-expression or one of the following operators: +, - (deprecated), *, &, |, ^, && and ||.

  • For Fortran, a reduction-identifier is either a base language identifier, or a user-defined operator, or one of the following operators: +, - (deprecated), *, .and., .or., .eqv., .neqv., or one of the following intrinsic procedure names: max, min, iand, ior, ieor.

The following two tables, also from the OpenMP specification, show implicitly declared reduction-identifiers for numeric and logical types, including the initial value settings, and semantics for the reduction-identifiers.

For C/C++:

| Identifier | Initializer | Combiner |
|---|---|---|
| + | omp_priv = 0 | omp_out += omp_in |
| - | omp_priv = 0 | omp_out += omp_in |
| * | omp_priv = 1 | omp_out *= omp_in |
| & | omp_priv = ~0 | omp_out &= omp_in |
| \| | omp_priv = 0 | omp_out \|= omp_in |
| ^ | omp_priv = 0 | omp_out ^= omp_in |
| && | omp_priv = 1 | omp_out = omp_in && omp_out |
| \|\| | omp_priv = 0 | omp_out = omp_in \|\| omp_out |
| max | omp_priv = minimal representable number in the reduction list item type | omp_out = omp_in > omp_out ? omp_in : omp_out |
| min | omp_priv = maximal representable number in the reduction list item type | omp_out = omp_in < omp_out ? omp_in : omp_out |

For Fortran:

| Identifier | Initializer | Combiner |
|---|---|---|
| + | omp_priv = 0 | omp_out = omp_in + omp_out |
| - | omp_priv = 0 | omp_out = omp_in + omp_out |
| * | omp_priv = 1 | omp_out = omp_in * omp_out |
| .and. | omp_priv = .true. | omp_out = omp_in .and. omp_out |
| .or. | omp_priv = .false. | omp_out = omp_in .or. omp_out |
| .eqv. | omp_priv = .true. | omp_out = omp_in .eqv. omp_out |
| .neqv. | omp_priv = .false. | omp_out = omp_in .neqv. omp_out |
| max | omp_priv = minimal representable number in the reduction list item type | omp_out = max(omp_in, omp_out) |
| min | omp_priv = maximal representable number in the reduction list item type | omp_out = min(omp_in, omp_out) |
| iand | omp_priv = all bits on | omp_out = iand(omp_in, omp_out) |
| ior | omp_priv = 0 | omp_out = ior(omp_in, omp_out) |
| ieor | omp_priv = 0 | omp_out = ieor(omp_in, omp_out) |

The reduction clause can be used multiple times as needed within a parallel construct.

2.2.2.6. proc_bind Clause#

The proc_bind clause is used to specify a mapping of OpenMP threads to places within the current place partition for the implicit tasks of the encountering thread. At most one proc_bind clause can appear on the directive. Its syntax is as follows:

proc_bind(affinity-policy) 

and the affinity-policy is one of the following:

primary
master [deprecated]
close
spread

When affinity-policy is specified as primary, the execution environment assigns every thread in the team to the same place as the primary thread. The master affinity policy has been deprecated; it has identical semantics to primary.

The close thread affinity policy instructs the execution environment to assign the threads in the team to places close to the place of the parent thread. Let T be the number of threads in the team and P the number of places in the parent thread’s place partition. When T is less than or equal to P, each thread is assigned to its own place; otherwise, roughly T/P threads are assigned to each place. When P does not divide T evenly, the exact number of threads at a particular place is implementation-defined. The principle of allocation is that the thread with the smallest thread number executes at the place of the parent thread, and the remaining threads are assigned in order of increasing thread number, wrapping around with respect to the place partition of the primary thread.

The spread thread affinity policy creates sub-partitions of the parent partition and distributes the threads across them, achieving a sparse distribution. When the number of threads T is less than or equal to the number of places P, the partition is divided into T sub-partitions, and the threads are placed in the first place of each sub-partition in order of increasing thread number. Conversely, when T is greater than P, T/P threads with consecutive thread numbers are allocated to each sub-partition. Sorted by thread number, the first T/P threads are allocated to the sub-partition that contains the place of the parent thread, and the remaining groups of threads are allocated in turn, wrapping around with respect to the original place partition of the primary thread.

2.2.2.7. allocate Clause#

The allocate clause is used to specify the memory allocator used to obtain storage for private variables. When it is used with the parallel construct, the syntax is as follows:

allocate([allocator:] list) 

In C/C++, the allocator is an expression of type omp_allocator_handle_t. In Fortran, the allocator is an integer expression of kind omp_allocator_handle_kind. We will not list the predefined allocators here; readers can find them in the OpenMP specification, and we explain some commonly used allocators in the examples in the following section.

The allocate clause can be used multiple times as needed within a parallel construct.

2.2.3. More Advanced Examples of Using Other Clauses and the Analysis of Performance and Improvement#

In the above section, we have introduced the syntax and semantics of all the clauses that can be used with the parallel directive. In this section, we introduce some examples of how to use these clauses.

Meanwhile, in some of the examples, we will show some ways to improve performance, and discuss some potential possibilities for improving performance.

2.2.3.1. if clause and num_threads clause#

The if clause can achieve conditional parallelism, and the num_threads clause specifies the number of threads to use. In the following example, three cases illustrate the effect of the if clause and the num_threads clause. No if clause is used in case 1, and the num_threads clause requests 4 threads, so we get output from four threads in parallel. In case 2, the scalar expression in the if clause evaluates to false, so the statement following the OpenMP directive (in general, a structured block) is executed serially by a single thread. In case 3, the scalar expression evaluates to true, so the statement following the OpenMP directive is executed in parallel by four threads.

//%compiler: clang
//%cflags: -fopenmp

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]){
    int M =10;
    //case 1
    #pragma omp parallel num_threads(4)
    printf("Hello World in case 1 from thread %d\n", omp_get_thread_num());

    printf("-------------------------------------------------\n");
    
    //case 2
    #pragma omp parallel num_threads(4) if(M > 10)
    printf("Hello World in case 2 from thread %d\n", omp_get_thread_num());

    printf("-------------------------------------------------\n");
    
    //case 3
    #pragma omp parallel num_threads(4) if(M <= 10)
    printf("Hello World in case 3 from thread %d\n", omp_get_thread_num());
    
    return 0;
}
Hello World in case 1 from thread 2
Hello World in case 1 from thread 3
Hello World in case 1 from thread 1
Hello World in case 1 from thread 0
-------------------------------------------------
Hello World in case 2 from thread 0
-------------------------------------------------
Hello World in case 3 from thread 3
Hello World in case 3 from thread 1
Hello World in case 3 from thread 0
Hello World in case 3 from thread 2

Within a parallel region, the thread number uniquely identifies each thread. A thread can obtain its own thread number by calling the omp_get_thread_num library routine.

2.2.3.2. Data-Sharing Attribute Clauses#

The following example is a little more complicated. It shows how to use the omp_get_thread_num library routine and shows how to use two other clauses, the default clause and the private clause. It assigns tasks to each thread explicitly.

//%compiler: clang
//%cflags: -fopenmp
//This example is from https://www.openmp.org/wp-content/uploads/openmp-examples-5.1.pdf. The size of array and output are changed

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void subdomain(float *x, int istart, int ipoints) {
    int i;
    for (i = 0; i < ipoints; i++)       
         x[istart+i] = istart+i;
}

void sub(float *x, int npoints) {
    int iam, nt, ipoints, istart;
    #pragma omp parallel default(shared) private(iam,nt,ipoints,istart)
    {
        iam = omp_get_thread_num();
        nt = omp_get_num_threads();
        ipoints = npoints / nt; /* size of partition */
        istart = iam * ipoints; /* starting array index */
        if (iam == nt-1) /* last thread may do more */
            ipoints = npoints - istart;
        subdomain(x, istart, ipoints);
    }
}

void print(float *x, int npoints) {
    for (int i = 0; i < npoints; i++) {
        if(i % 10 == 0)
            printf("\n");
        printf("%.5f ", x[i]);
    }
}

int main() {
    float array[100];
    sub(array, 100);
    print(array, 100);
    return 0;
}
0.00000 1.00000 2.00000 3.00000 4.00000 5.00000 6.00000 7.00000 8.00000 9.00000 
10.00000 11.00000 12.00000 13.00000 14.00000 15.00000 16.00000 17.00000 18.00000 19.00000 
20.00000 21.00000 22.00000 23.00000 24.00000 25.00000 26.00000 27.00000 28.00000 29.00000 
30.00000 31.00000 32.00000 33.00000 34.00000 35.00000 36.00000 37.00000 38.00000 39.00000 
40.00000 41.00000 42.00000 43.00000 44.00000 45.00000 46.00000 47.00000 48.00000 49.00000 
50.00000 51.00000 52.00000 53.00000 54.00000 55.00000 56.00000 57.00000 58.00000 59.00000 
60.00000 61.00000 62.00000 63.00000 64.00000 65.00000 66.00000 67.00000 68.00000 69.00000 
70.00000 71.00000 72.00000 73.00000 74.00000 75.00000 76.00000 77.00000 78.00000 79.00000 
80.00000 81.00000 82.00000 83.00000 84.00000 85.00000 86.00000 87.00000 88.00000 89.00000 
90.00000 91.00000 92.00000 93.00000 94.00000 95.00000 96.00000 97.00000 98.00000 99.00000 

In the above example, we use the default number of threads to perform assignment operations on 100 elements in the array. Tasks are evenly distributed to each thread, and when the number of tasks is not divisible by the number of threads, the remaining tasks will be completed by the last thread.

When programming in parallel, the most important and hardest part is how to assign tasks and manage threads. We already introduced that a thread can get its own id through the omp_get_thread_num routine. Another important routine is omp_get_num_threads, which returns the number of threads in the current team.

In the above example, the variable npoints represents the total number of elements in the array; the work is divided into nt parts, each of size ipoints, with each part starting at index istart. Each part is completed by one thread, and on the author’s machine a total of 8 threads execute the tasks in parallel.

The default clause is used to define the default data-sharing attributes of variables that are referenced in a parallel, teams, or task-generating construct. In the above example, default(shared) indicates that by default, the variables in the parallel region are shared variables. The private clause is used to explicitly specify variables that are private in each task or SIMD lane (SIMD will be introduced in the next chapter). In the above example, the variables iam, nt, ipoints and istart are private variables for each thread, which means a thread cannot access these variables of another thread.

The corresponding Fortran program is shown below.

!!%compiler: gfortran
!!%cflags: -fopenmp

    SUBROUTINE SUBDOMAIN(X, ISTART, IPOINTS)
        INTEGER ISTART, IPOINTS
        REAL X(0:99)

        INTEGER I

        DO 100 I=0,IPOINTS-1
            X(ISTART+I) = ISTART+I
100     CONTINUE

    END SUBROUTINE SUBDOMAIN

    SUBROUTINE SUB(X, NPOINTS)
        INCLUDE "omp_lib.h" ! or USE OMP_LIB

        REAL X(0:99)
        INTEGER NPOINTS
        INTEGER IAM, NT, IPOINTS, ISTART

!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(X,NPOINTS)

        IAM = OMP_GET_THREAD_NUM()
        NT = OMP_GET_NUM_THREADS()
        IPOINTS = NPOINTS/NT
        ISTART = IAM * IPOINTS
        IF (IAM .EQ. NT-1) THEN
            IPOINTS = NPOINTS - ISTART
        ENDIF
        CALL SUBDOMAIN(X,ISTART,IPOINTS)

!$OMP END PARALLEL
    END SUBROUTINE SUB
    
    SUBROUTINE print(X, NPOINTS)
        INTEGER I
        REAL X(0:99)
        INTEGER NPOINTS
        DO I = 0,NPOINTS-1
            IF (mod(I,10) .EQ. 0) THEN
                print*,' '
            END IF
            WRITE(*,'(1x,f9.5,$)') X(I)
        END DO
    END SUBROUTINE PRINT

    PROGRAM PAREXAMPLE
        REAL ARRAY(100)
        CALL SUB(ARRAY, 100)
        CALL PRINT(ARRAY, 100)
    END PROGRAM PAREXAMPLE
  
   0.00000   1.00000   2.00000   3.00000   4.00000   5.00000   6.00000   7.00000   8.00000   9.00000  
  10.00000  11.00000  12.00000  13.00000  14.00000  15.00000  16.00000  17.00000  18.00000  19.00000  
  20.00000  21.00000  22.00000  23.00000  24.00000  25.00000  26.00000  27.00000  28.00000  29.00000  
  30.00000  31.00000  32.00000  33.00000  34.00000  35.00000  36.00000  37.00000  38.00000  39.00000  
  40.00000  41.00000  42.00000  43.00000  44.00000  45.00000  46.00000  47.00000  48.00000  49.00000  
  50.00000  51.00000  52.00000  53.00000  54.00000  55.00000  56.00000  57.00000  58.00000  59.00000  
  60.00000  61.00000  62.00000  63.00000  64.00000  65.00000  66.00000  67.00000  68.00000  69.00000  
  70.00000  71.00000  72.00000  73.00000  74.00000  75.00000  76.00000  77.00000  78.00000  79.00000  
  80.00000  81.00000  82.00000  83.00000  84.00000  85.00000  86.00000  87.00000  88.00000  89.00000  
  90.00000  91.00000  92.00000  93.00000  94.00000  95.00000  96.00000  97.00000  98.00000  99.00000
The next C example explores how arrays and pointers behave with the firstprivate clause.

//%compiler: clang
//%cflags: -fopenmp
//This example is from https://www.openmp.org/wp-content/uploads/openmp-examples-5.1.pdf.

#include <assert.h>

int A[2][2] = {1, 2, 3, 4};

void f(int n, int B[n][n], int C[]) {
    int D[2][2] = {1, 2, 3, 4};
    int E[n][n];

    assert(n >= 2);
    E[1][1] = 4;

    #pragma omp parallel firstprivate(B, C, D, E)
    {
        assert(sizeof(B) == sizeof(int (*)[n]));
        assert(sizeof(C) == sizeof(int*));
        assert(sizeof(D) == 4 * sizeof(int));
        assert(sizeof(E) == n * n * sizeof(int));

        /* Private B and C have values of original B and C. */
        assert(&B[1][1] == &A[1][1]);
        assert(&C[3] == &A[1][1]);
        assert(D[1][1] == 4);
        assert(E[1][1] == 4);
    }
}

int main() {
    f(2, A, A[0]);
    return 0;
}

In the above example, we explored how to use the firstprivate clause. The size and values of the array or pointer appearing in the list of the firstprivate clause depend on the original arrays or pointers in the serial part. The type of A is array of two arrays of two ints. B and C are function variables, which will be adjusted according to the type defined in the function. The type of B is pointer to array of n ints, and the type of C is pointer to int. The type of D is exactly the same as A, and the type of E is array of n arrays of n ints.

In the example, we use multiple assert statements to check whether B, C, D, and E have the expected types and values, thereby verifying the behavior of the firstprivate clause.