4.6. Parallel Execution on GPU Devices

OpenMP provides directives for parallel execution on GPU devices, allowing programmers to distribute work across many threads organized into teams. The teams and distribute directives are the key constructs for expressing this two-level hierarchy. In this section, we explore the teams directive for SPMD (Single Program, Multiple Data) parallelism, the distribute directive for work sharing across teams, and how to combine these directives to exploit both levels of parallelism.

4.6.1. teams directive for SPMD parallelism

The teams directive is used to create a league of thread teams, where each team executes the structured block associated with the directive. When the teams construct is nested inside a target region (or combined with it, as target teams), the league is created on the target device. This allows for SPMD parallelism, where each team executes the same code but operates on different data.

The syntax of the teams directive is as follows:

#pragma omp teams [clause[[,] clause]...]
structured-block

The teams directive supports clauses such as num_teams and thread_limit to control the number of teams and the maximum number of threads per team, respectively.
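For instance, the following sketch caps the league at four teams with at most 128 threads each (the limits are arbitrary; team_ids is assumed to be an int array of length 4 declared on the host, and omp.h is needed for omp_get_team_num):

#pragma omp target teams num_teams(4) thread_limit(128) map(tofrom: team_ids[0:4])
{
  // At most four teams are created, each limited to 128 threads;
  // each team's initial thread records its own team number.
  int t = omp_get_team_num();
  team_ids[t] = t;
}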

Example:

#pragma omp target teams map(to: a[0:n]) map(from: b[0:n])
{
  // Each team computes its own contiguous chunk of the arrays (SPMD style).
  int nteams = omp_get_num_teams();
  int team   = omp_get_team_num();
  int chunk  = (n + nteams - 1) / nteams;
  int start  = team * chunk;
  int end    = (start + chunk < n) ? (start + chunk) : n;
  #pragma omp parallel for
  for (int i = start; i < end; i++) {
    b[i] = a[i] * 2;
  }
}

In this example, the teams directive creates a league of thread teams, and every team executes the same structured block. Each team partitions the iteration space using its team number, obtained from the omp_get_team_num and omp_get_num_teams runtime routines (declared in omp.h), so each team operates on a different chunk of the arrays. Within each team, the parallel for directive distributes that chunk's iterations among the team's threads. Computing the partition by hand like this is exactly what the distribute directive, described next, automates.

4.6.2. distribute directive for work sharing

The distribute directive is used to distribute the iterations of a loop across the teams created by the teams directive. Within each team, the assigned iterations are executed by the team's initial (master) thread unless they are further parallelized, so the construct provides work sharing at the team level. A distribute construct must appear inside a teams region, either explicitly or as part of a combined construct such as target teams distribute.

The syntax of the distribute directive is as follows:

#pragma omp distribute [clause[[,] clause]...]
for-loops

The distribute directive supports clauses such as private, firstprivate, lastprivate, collapse, and dist_schedule to control data sharing, loop collapsing, and how iterations are chunked across the teams.
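For instance, collapse(2) merges two nested loops into a single iteration space before the iterations are distributed, and dist_schedule(static, chunk) controls the chunk size handed to each team. A sketch, assuming a and b each hold n*m elements (the chunk size of 64 is arbitrary):

#pragma omp target teams distribute collapse(2) dist_schedule(static, 64) map(to: a[0:n*m]) map(from: b[0:n*m])
for (int i = 0; i < n; i++) {
  for (int j = 0; j < m; j++) {
    b[i * m + j] = a[i * m + j] * 2;
  }
}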

Example:

#pragma omp target teams distribute map(to: a[0:n]) map(from: b[0:n])
for (int i = 0; i < n; i++) {
  b[i] = a[i] * 2;
}

In this example, the distribute directive distributes the loop iterations across the teams, so each team operates on a different portion of the input and output arrays. Because there is no parallel directive inside the loop, only the initial thread of each team executes its share of the iterations; the next subsection shows how to add thread-level parallelism within each team.
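For reference, the snippet above might be embedded in a complete program as follows (a minimal sketch; the array size, the initialization, and the check at the end are arbitrary illustrative choices):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
  int n = 1 << 20;
  double *a = malloc(n * sizeof(double));
  double *b = malloc(n * sizeof(double));
  for (int i = 0; i < n; i++)
    a[i] = (double)i;

  // Distribute the iterations across the teams of the league;
  // a is copied to the device, b is copied back to the host.
  #pragma omp target teams distribute map(to: a[0:n]) map(from: b[0:n])
  for (int i = 0; i < n; i++) {
    b[i] = a[i] * 2;
  }

  printf("b[10] = %f\n", b[10]);  // expected 20.0
  free(a);
  free(b);
  return 0;
}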

4.6.3. Combining teams and distribute directives

The teams and distribute directives can be combined to achieve multi-level parallelism on GPU devices. The teams directive creates a league of thread teams, and the distribute directive distributes the work among the teams. Additionally, the parallel directive can be used within each team to further parallelize the work among the threads within a team.
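When all of the parallelism comes from a single loop, the directives can also be fused into the combined construct target teams distribute parallel for, which splits the iterations across the teams and then among the threads of each team in one step. A minimal sketch, reusing the arrays from the earlier examples:

#pragma omp target teams distribute parallel for map(to: a[0:n]) map(from: b[0:n])
for (int i = 0; i < n; i++) {
  b[i] = a[i] * 2;
}

When the two levels of parallelism come from different loops, the directives are kept separate, as in the example below.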

Example:

#pragma omp target teams distribute map(to: a[0:n*m]) map(from: b[0:n])
for (int i = 0; i < n; i++) {
  double sum = 0.0;  // per-row accumulator, private to each distribute iteration
  #pragma omp parallel for reduction(+:sum)
  for (int j = 0; j < m; j++) {
    sum += a[i * m + j];
  }
  b[i] = sum;
}

In this example, the teams and distribute directives are combined to distribute the outer loop iterations across the teams. Within each team, the parallel for directive parallelizes the inner loop among the team's threads, and the reduction clause safely accumulates the partial sums into sum before the result is stored in b[i].

4.6.4. Nested parallelism with teams and parallel directives

OpenMP supports nested parallelism, where a parallel region can be nested inside a teams region: the league of teams provides the outer (team-level) parallelism, and the parallel directive inside each team provides the inner (thread-level) parallelism. This is how nested parallelism is typically expressed on GPU devices.

Example:

#pragma omp target teams map(to: a[0:n][0:m]) map(from: b[0:n])
{
  #pragma omp distribute
  for (int i = 0; i < n; i++) {
    double sum = 0.0;  // per-row accumulator
    #pragma omp parallel for reduction(+:sum)
    for (int j = 0; j < m; j++) {
      sum += a[i][j];
    }
    b[i] = sum;
  }
}

In this example, the teams directive creates a league of thread teams, and the distribute directive distributes the outer loop iterations across the teams. Within each team, the parallel directive is used to parallelize the inner loop iterations among the threads, and a reduction is performed to compute the sum of elements in each row of the input array.

By using the teams and distribute directives, along with the parallel directive for thread-level parallelism within each team, programmers can expose both levels of parallelism and make effective use of GPU devices.

In the next section, we will discuss techniques for tuning the performance of GPU-offloaded code, including choosing the optimal number of teams and threads, optimizing data transfers and memory usage, and leveraging device-specific features.