What is Parallel Computing? A form of computation in which: 1) large problems are divided into smaller ones; 2) the smaller ones are carried out and solved simultaneously.
Why parallel computing?
Why is parallel computing, so far, the only way to achieve high-performance computing?
Three pillars of science: theory, experiment and computer simulation.
What is Flop/s (FLOPS)? What is Top500?
So far the performance of the fastest supercomputer in the Top500 is around 93: a) Petaflops, b) Teraflops, c) Gigaflops, d) Exaflops
How to compute CPU and machine peak performance?
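A worked sketch with hypothetical numbers (the clock rate, core count, and FLOPs per cycle are illustrative, not from the course): peak performance = number of cores × clock frequency × FLOPs per core per cycle. For example, a 16-core CPU running at 2.5 GHz that can complete 16 double-precision FLOPs per core per cycle peaks at 16 × 2.5×10^9 × 16 = 640 GFLOPS, and a machine with 100 such CPUs has a peak of 64 TFLOPS.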
How to measure and calculate application performance?
How are Top500 machines ranked?
How to calculate HPC performance and power efficiency?
The differences between the HPL (Top500), HPCG, Green500, and Graph500 rankings.
What is Moore's Law, and why can we no longer enjoy the performance improvements it used to deliver over the years (the problems of traditional ILP scaling)?
Since 2002-2005, what is the hardware solution for overcoming the performance and scaling challenges from traditional ILP? (multicore)
What are the four stages of compiling a C program to create an executable?
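A sketch using a hypothetical hello.c to make the four stages concrete (file names are illustrative):
/* hello.c -- walk through the four stages of building an executable:
 *   gcc -E hello.c -o hello.i    1) preprocessing (expand #include and macros)
 *   gcc -S hello.i -o hello.s    2) compilation to assembly
 *   gcc -c hello.s -o hello.o    3) assembling into an object file
 *   gcc hello.o -o hello         4) linking with libraries into an executable
 */
#include <stdio.h>
int main(void) {
    printf("hello\n");
    return 0;
}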
Row-major/column-major storage of C multi-dimensional arrays. For example, given a 2-dimensional array int A[4][5], how are the array elements laid out in memory in C?
How to calculate the memory address of an array element? For int A[4][5], if the memory address of A is 0x012345600, what is the memory address of element A[2][4]?
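A worked sketch, assuming 4-byte int elements and 0-based indexing: in row-major order &A[i][j] = A + (i*5 + j)*sizeof(int), so &A[2][4] = 0x012345600 + (2*5 + 4)*4 = 0x012345600 + 56 = 0x012345638.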
If float A[256] stores a two-dimensional 8x32 matrix in column-major order and A is at 0x012345600, what is the memory address of the element at (6, 25)?
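A worked sketch, assuming 4-byte float elements, 0-based indices, and (6, 25) meaning row 6 and column 25 of the 8x32 matrix: in column-major order the element (i, j) sits at offset (j*8 + i)*sizeof(float), so its address is 0x012345600 + (25*8 + 6)*4 = 0x012345600 + 824 = 0x012345938.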
OpenMP multithreading memory model: shared memory. Each thread has its own private memory, and all threads have direct load/store access to the shared memory region.
Given the following OpenMP parallel program
#pragma omp parallel for num_threads(4) schedule(static, 10)
for (int i = 0; i < 100; i++) {
    a[i] = i;
}
Which thread would execute iteration 76? Which chunks of iterations would thread 2 execute? Fill in the following table (a possible filled-in answer is sketched after it).
Thread | Chunks
---|---
0 | [0-9], [40-49], [80-89]
1 |
2 |
3 |
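One possible answer, following the round-robin assignment of 10-iteration chunks that schedule(static, 10) prescribes: thread 1 gets [10-19], [50-59], [90-99]; thread 2 gets [20-29], [60-69]; thread 3 gets [30-39], [70-79]. Iteration 76 falls in chunk [70-79], so thread 3 executes it.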
Optimize the following code by removing redundant barriers:
#pragma omp parallel private(i)
{
    #pragma omp for
    for (i = 0; i < n; i++)
        a[i] += b[i];
    #pragma omp for
    for (i = 0; i < n; i++)
        c[i] += d[i];
    #pragma omp barrier
    #pragma omp for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] + c[i];
}
Solution:
#pragma omp parallel private(i)
{
    #pragma omp for nowait   /* safe: the next loop touches only c[] and d[] */
    for (i = 0; i < n; i++)
        a[i] += b[i];
    #pragma omp for nowait   /* synchronization deferred to the explicit barrier below */
    for (i = 0; i < n; i++)
        c[i] += d[i];
    #pragma omp barrier      /* all of a[] and c[] must be written before the reduction reads them */
    #pragma omp for nowait reduction(+:sum)   /* the implicit barrier at the end of the parallel region suffices */
    for (i = 0; i < n; i++)
        sum += a[i] + c[i];
}
What is the final result of IS? (2+8+9+10 = 29.) LASTPRIVATE(IS) means that the thread that executes the last iteration copies its IS value back to the master thread after the parallel region, so the value of IS depends on which thread executes the last iteration (J=10). The most common way of chunking the loop is [1-4] for thread 0, [5-7] for thread 1, and [8-10] for thread 2. The thread that executes the last iteration (10) thus executes iterations 8, 9, and 10, so IS = 2 (the initial value each thread receives from FIRSTPRIVATE(IS)) + 8 + 9 + 10 = 29. It is also possible that thread 2 executes [7-10] or thread 0 executes [1-3,10], which would yield different results for IS.
      IS = 2
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
C$OMP& LASTPRIVATE(IS) NUM_THREADS(3)
      DO J=1,10
        IS = IS + J
      END DO
C$OMP END PARALLEL DO
      print *, IS
Understand false-sharing and first-touch policy
What are the two major steps of parallel algorithm design? Tasks and decomposition, and processes and mapping.
Explain what the critical path of a task graph is and what it represents. The critical path is the longest weighted path through the graph, and it represents the shortest time in which the program can be finished.
Given a task graph, identify the critical path, and calculate the path weight and average degree of concurrency (speedup).
What is task interaction graph?
Give an example of an algorithm to which you can apply recursive decomposition and explain it in detail.
Give an example of data decomposition and explain it in detail.
What are the general design guidelines and techniques for minimizing interactions between tasks when mapping tasks onto processors?
Latency and bandwidth are two important metrics for a network. Explain the α-β model for network modeling. If a network has 7.6 us latency and 10 GB/s bandwidth, how long will it take to transfer 1 MByte of data using the α-β model?
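A worked sketch, treating 1 MByte as 10^6 bytes: T = α + n/BW = 7.6 us + 10^6 bytes / (10 × 10^9 bytes/s) = 7.6 us + 100 us = 107.6 us.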
In the network LogP model, the network EEL (end-to-end latency) includes the overheads of both the sender and the receiver, and the wire latency (L). Why do we often see that the EEL is less than the sum of the overheads and the wire latency? Draw a diagram to explain.
Explain what blocking message passing is and what non-blocking message passing is.
An MPI collective call often has different semantics on different processes, even though all the processes make and participate in the same call. Explain the semantics of the MPI_Reduce call:
int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int target, MPI_Comm comm)
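A minimal usage sketch (the variable names and the choice of MPI_SUM are illustrative, not part of the question): every rank supplies a significant sendbuf, but the reduced result is delivered only to the recvbuf of the target (root) rank.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int local = rank + 1;   /* every rank contributes one value */
    int global = 0;         /* meaningful only on the target (rank 0) after the call */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)          /* only the target rank holds the reduced result */
        printf("sum = %d\n", global);   /* 1 + 2 + ... + nprocs */
    MPI_Finalize();
    return 0;
}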
In Assignment #3, the 2-dimensional array is decomposed using row-wise distribution among all the MPI processes. E.g., for array u[256][256] and 4 processes, each MPI process will compute 256/4 = 64 rows of data. Check the MPI Exercise slide.
Why does each process need to have 65 rows (processes 0 and 3) or 66 rows (processes 1 and 2) of data? (Why do processes 0 and 3 need one more row of data, and processes 1 and 2 two more rows of data, than they actually compute?)
For boundary exchange using blocking MPI_Send/MPI_Recv, the order of the MPI_Send/MPI_Recv calls made by each process is critical for not introducing deadlock. Draw a diagram and explain in detail the MPI_Send/MPI_Recv calls used for boundary exchange in this 4-process example.
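One common deadlock-free ordering, sketched below with illustrative buffer names (only the ghost-row exchange is shown, not the assignment's computation): even-numbered ranks send first and then receive, while odd-numbered ranks receive first and then send, so every blocking MPI_Send meets an already-posted MPI_Recv.
#include <mpi.h>
#define N 256
int main(int argc, char *argv[]) {
    int rank, nprocs;
    double my_top[N], my_bot[N];        /* my first and last owned rows (illustrative) */
    double ghost_top[N], ghost_bot[N];  /* ghost rows received from the neighbors */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    for (int j = 0; j < N; j++) my_top[j] = my_bot[j] = rank;
    int up = rank - 1, down = rank + 1;
    if (rank % 2 == 0) {    /* even ranks: send first, then receive */
        if (down < nprocs) {
            MPI_Send(my_bot, N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD);
            MPI_Recv(ghost_bot, N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        if (up >= 0) {
            MPI_Send(my_top, N, MPI_DOUBLE, up, 0, MPI_COMM_WORLD);
            MPI_Recv(ghost_top, N, MPI_DOUBLE, up, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    } else {                /* odd ranks: receive first, then send */
        if (up >= 0) {
            MPI_Recv(ghost_top, N, MPI_DOUBLE, up, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(my_top, N, MPI_DOUBLE, up, 0, MPI_COMM_WORLD);
        }
        if (down < nprocs) {
            MPI_Recv(ghost_bot, N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(my_bot, N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}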
An MPI collective call often has different semantics on different processes, even though all the processes make and participate in the same call. Explain the semantics of the MPI_Allreduce call:
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
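A minimal usage sketch to contrast with MPI_Reduce (again with illustrative names): every rank contributes its local value and every rank receives the reduced result, so there is no root/target argument.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank, local, global;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 1;
    /* unlike MPI_Reduce, every rank receives the reduced value in global */
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d sees sum = %d\n", rank, global);
    MPI_Finalize();
    return 0;
}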
Explain why parallel execution time also depends on the platform and communication parameters. (Explain why analyzing parallel program execution time is more complicated than analyzing sequential programs.)
Give examples of parallel overheads
Explain each of the five metrics of parallel program analysis: Execution Time, Overhead, Speedup, Efficiency, and Cost
Fill in the following table, assuming the serial execution time is 100 (a possible filled-in solution is sketched after the table).
Num of Processors | Parallel Execution Time | Speedup | Efficiency | Cost | Parallel Overhead
---|---|---|---|---|---
2 | 60 | | | |
4 | 40 | | | |
8 | 20 | | | |
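A possible filled-in solution, assuming Cost = p × Tp, Efficiency = Speedup / p, and Parallel Overhead = p × Tp − Ts (the numbers below follow directly from these definitions):

Num of Processors | Parallel Execution Time | Speedup | Efficiency | Cost | Parallel Overhead
---|---|---|---|---|---
2 | 60 | 1.67 | 0.83 | 120 | 20
4 | 40 | 2.5 | 0.625 | 160 | 60
8 | 20 | 5 | 0.625 | 160 | 60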
Explain what is linear speedup.
Why is superlinear speedup possible? Use examples to explain.
What is Amdahl's Law? A program has three portions: a sequential portion, 30% that can be sped up 4 times, and 50% that can be accelerated 2 times. What is the overall speedup according to Amdahl's Law?
The sequential portion is 1 - 30% - 50% = 20%.
Speedup = 1 / (0.2 + 0.3/4 + 0.5/2) = 1 / 0.525 ≈ 1.9
Explain the scalability of parallel systems and the two types of scaling we talked about in class.
Explain what the isoefficiency metric is and how to use isoefficiency analysis for the sum example we discussed in class.
There are three types of hardware parallelism we mainly talk about: instruction-level parallelism, thread-level parallelism, and data-level parallelism. Give some examples of instruction-level parallelism you are aware of from computer architecture courses.
What are the three types of thread-level parallelism? What is hyperthreading?
What is Flynn's taxonomy for classifying parallel architectures? According to Flynn's taxonomy, what is a single-core architecture? What is a vector architecture? What is a GPU architecture? What is a multicore architecture? What is a multi-CPU system?
What is data-level parallelism? What architectures are designed to handle data-parallel applications effectively?
What are the two kinds of locality in a program? Given the following program, identify which statements/variables exhibit which of the two kinds of locality.
sum = 0;
for (i = 0; i < n; i++)
sum += a[i];
return sum;
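A possible answer, as a sketch: the repeated accesses to sum in every iteration exhibit temporal locality, the stride-1 accesses to a[i] exhibit spatial locality, and the instructions of the loop body are fetched repeatedly and sequentially, giving both kinds.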
What should expert programmers (us) do to write locality-friendly code? Explain using examples.
Questions about calculating misses per inner-loop iteration for the matrix multiplication example discussed in class.
Explain, using the following example, what cache coherence and false sharing are.
int A[4], B[5], C[4]; /* assuming the cache line size is 8 int elements, and A and B will NOT be in the same cache line */
#pragma omp parallel for num_threads(2)
for (int i = 0; i < 4; i++) {
    B[i] = A[i];
    C[i] = B[i];
}