1.3. Performance Analysis#

For parallel programs, performance is the primary concern. The relative ease of using OpenMP is a mixed blessing: we can quickly write a correct OpenMP program, yet it may fall short of the desired level of performance. Even though there are certain “best practices” for avoiding common performance problems, extra work is needed when programming with a large thread count. In this section, we briefly introduce some of the factors that affect performance and some “best practices” for improving it.

1.3.1. Factors Impacting Performance#

An OpenMP program consists of serial regions and parallel regions, so the first factor is the performance of the serial regions, which affects the performance of the whole program. There are many optimization techniques for serial code, such as eliminating data dependencies, constant folding, copy propagation, and removing dead code. Although current compilers already apply some of these optimizations to serial code, their effect is often limited. A simpler and more effective approach is to parallelize and vectorize the parts that can be parallelized or vectorized, as sketched below. For the purposes of this book, we will not go into much detail about optimizing the serial parts.
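
As a minimal illustration (the array names, sizes, and loop below are our own, not taken from any particular application), the following C fragment parallelizes a simple array update across threads and asks the compiler to vectorize each thread's chunk using the combined `parallel for simd` construct:

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];

    /* Serial initialization of the input arrays. */
    for (int i = 0; i < N; i++) {
        x[i] = 0.5 * i;
        y[i] = 1.0;
    }

    /* Split the iterations across threads and vectorize each thread's chunk. */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++) {
        y[i] += 2.0 * x[i];
    }

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}
```

Compiled with an OpenMP-enabled compiler (for example, `gcc -fopenmp`), the loop runs on all available threads; without the OpenMP flag the pragma is ignored and the code still runs correctly in serial.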

The second factor is the proportion of serial and parallel parts. It is well understood that the larger the fraction of a program that can be parallelized, the greater the room for performance improvement. Amdahl’s law (or Amdahl’s argument) gives the theoretical speedup in latency of the execution of a task at a fixed workload that can be expected of a system whose resources are improved, and it is often applied to parallel processing systems. For the speedup S describing the effect of parallel processing under a fixed load, Amdahl gave the following formula:

\[ S = \frac{1}{(1-a) + a/n} \]

where a is the proportion of the program that can be parallelized, and n is the number of parallel processing nodes. When 1-a=0 (i.e., no serial part, only parallel), the maximum speedup is S=n; when a=0 (i.e., only serial, no parallelism), the minimum speedup is S=1; and when n→∞, the speedup approaches S→1/(1-a), which is the upper limit of the speedup. For example, if 90% of a program’s code can run in parallel but 10% remains serial, the maximum speedup that can be achieved even with an infinite number of processors is only 10. This formula is widely accepted, and it shows that the maximum speedup a program can achieve is limited by its serial portion.
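
Plugging the numbers from this example into the formula (a = 0.9) shows how quickly the serial fraction dominates:

\[ S(16) = \frac{1}{(1-0.9) + 0.9/16} = 6.4, \qquad \lim_{n \to \infty} S = \frac{1}{1-0.9} = 10 \]

Even with 16 processing nodes the program runs only 6.4 times faster, and no amount of additional hardware can push it past a factor of 10.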

Next, let us analyze which factors affect performance inside the parallel regions. To do so, we can consider where the overhead of parallelism lies, and then reduce that overhead or balance it against the benefits to obtain the desired performance. Common parallel overheads are concentrated in two areas. The first is thread management, which usually includes creating, resuming, managing, suspending, destroying, and synchronizing threads. This management overhead is often large compared to the computation itself, especially synchronization: multiple threads waiting at synchronization points waste a great deal of computing resources. The remedy is to distribute tasks as evenly as possible, so that the load on each thread is balanced.
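
As a sketch of this idea (the `work` function below is a made-up stand-in for an imbalanced workload), a `schedule(dynamic)` clause hands out small chunks of iterations on demand, so threads that finish cheap iterations early pick up more work instead of idling at the implicit barrier, and the `reduction` clause avoids explicit synchronization on the shared accumulator:

```c
#include <math.h>
#include <stdio.h>

#define N 10000

/* Hypothetical task whose cost grows with i, producing an imbalanced load. */
static double work(int i) {
    double s = 0.0;
    for (int k = 0; k < i; k++) {
        s += sin((double)k);
    }
    return s;
}

int main(void) {
    double total = 0.0;

    /* Dynamic scheduling balances the uneven iterations across threads;
       the reduction avoids a critical section around the shared sum. */
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
    for (int i = 0; i < N; i++) {
        total += work(i);
    }

    printf("total = %f\n", total);
    return 0;
}
```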

The second aspect is memory access. Memory is usually the main limiting factor in the performance of shared-memory programs, and problems such as memory access conflicts can seriously degrade performance. Where data is located determines the latency and bandwidth of memory accesses, so the most commonly used remedy is to improve data locality. We can also address this by explicitly managing data placement, which will be introduced in detail in the following chapters.
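
As an illustrative sketch (the matrix–vector product below is our own toy example), assigning each thread a block of rows lets it traverse the row-major array `a` in the same order it is laid out in memory, so consecutive accesses fall within the same cache lines:

```c
#include <stdio.h>

#define N 1024

static double a[N][N], x[N], y[N];

int main(void) {
    /* Arbitrary initialization. */
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++) {
            a[i][j] = 0.001 * (i + j);
        }
    }

    /* Each thread works on whole rows, walking memory contiguously. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++) {
            sum += a[i][j] * x[j];
        }
        y[i] = sum;
    }

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```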

1.3.2. Ideas for Improving Performance#

Based on the factors that affect performance, several performance-enhancing efforts can be considered:

  • Thread load balancing

  • Minimizing synchronization

  • Using more efficient directives, for example the combined PARALLEL DO/FOR directive instead of a parallel region containing only a single worksharing DO/FOR directive

  • Expansion or merging of parallel regions

  • Improving cache hit rate

  • Optimization of loop scheduling

  • Optimization of variable data-sharing attributes

  • Reducing false sharing (see the sketch after this list)
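
As a brief sketch of the last point (the summation is our own toy example), accumulating per-thread partial results in adjacent elements of a shared array places them in the same cache line, so every update by one thread invalidates that line for the others. Keeping the accumulation in a thread-private variable, here via a `reduction` clause, avoids the problem:

```c
#include <stdio.h>

#define N 10000000

static double data[N];

int main(void) {
    for (int i = 0; i < N; i++) {
        data[i] = 1.0;
    }

    /* False-sharing-prone pattern (avoid):
       double partial[NUM_THREADS];
       partial[omp_get_thread_num()] += data[i];  // neighbors share a cache line
       A thread-private accumulator combined once at the end avoids this. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += data[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}
```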

In the following chapters of this book, we will introduce how to create parallel programs with OpenMP on three different architectures, and explain how to optimize their performance using some of the ideas mentioned above.