What is Parallel Computing? A form of computation in which: 1) large problems are divided into smaller ones; 2) the smaller ones are carried out and solved simultaneously.
Why parallel computing?
Why is parallel computing, so far, the only way to achieve high-performance computing?
Three pillars of science: theory, experiment and computer simulation.
What is Flop/s (FLOPS)? What is Top500?
So far the performance of the fastest supercomputer in the Top500 is around 93: a) Petaflops, b) Teraflops, c) Gigaflops, d) Exaflops
How to compute CPU and machine peak performance?
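A worked sketch with hypothetical numbers (the clock rate, core count, and FLOPs per cycle are illustrative, not from the course): peak performance = number of cores × clock frequency × FLOPs per core per cycle. For example, a 16-core CPU running at 2.5 GHz that can complete 16 double-precision FLOPs per core per cycle peaks at 16 × 2.5×10^9 × 16 = 640 GFLOPS, and a machine with 100 such CPUs has a peak of 64 TFLOPS.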
How to measure and calculate application performance?
How are Top500 machines ranked?
How to calculate HPC performance and power efficiency?
The differences between the HPL (Top500), HPCG, Green500, and Graph500 rankings.
What is Moore's Law, and why can we no longer enjoy the performance improvements it used to deliver over the years (the problems of traditional ILP scaling)?
Since 2002-2005, what is the hardware solution for overcoming the performance and scaling challenges from traditional ILP? (multicore)
What are the four stages of compiling a C program to create an executable?
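A sketch using a hypothetical hello.c to make the four stages concrete (file names are illustrative):
/* hello.c -- walk through the four stages of building an executable:
 *   gcc -E hello.c -o hello.i    1) preprocessing (expand #include and macros)
 *   gcc -S hello.i -o hello.s    2) compilation to assembly
 *   gcc -c hello.s -o hello.o    3) assembling into an object file
 *   gcc hello.o -o hello         4) linking with libraries into an executable
 */
#include <stdio.h>
int main(void) {
    printf("hello\n");
    return 0;
}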
Row-major/column-major storage of C multi-dimensional arrays. For example, given a 2-dimensional array int A[4][5], how are the array elements laid out in memory in C?
How to calculate the memory address of an array element? For int A[4][5], if the memory address of A is 0x012345600, what is the memory address of element A[2][4]?
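A worked sketch, assuming 4-byte int elements and 0-based indexing: in row-major order &A[i][j] = A + (i*5 + j)*sizeof(int), so &A[2][4] = 0x012345600 + (2*5 + 4)*4 = 0x012345600 + 56 = 0x012345638.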
If float A[256] stores a two-dimensional 8x32 matrix in column-major order and A is at 0x012345600, what is the memory address of the element at (6, 25)?
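A worked sketch, assuming 4-byte float elements, 0-based indices, and (6, 25) meaning row 6 and column 25 of the 8x32 matrix: in column-major order the element (i, j) sits at offset (j*8 + i)*sizeof(float), so its address is 0x012345600 + (25*8 + 6)*4 = 0x012345600 + 824 = 0x012345938.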
OpenMP multithreading memory model: shared memory. Each thread has its own private memory, and all threads have direct load/store access to the shared memory region.
Given the following OpenMP parallel program
#pragma omp parallel for num_threads(4) schedule(static, 10)
for (int i = 0; i < 100; i++) {
    a[i] = i;
}
Which thread would execute iteration 76? Which chunks of iterations would thread 2 execute? Fill in the following table (a possible filled-in answer is sketched after it).
Thread | Chunks
---|---
0 | [0-9], [40-49], [80-89]
1 |
2 |
3 |
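One possible answer, following the round-robin assignment of 10-iteration chunks that schedule(static, 10) prescribes: thread 1 gets [10-19], [50-59], [90-99]; thread 2 gets [20-29], [60-69]; thread 3 gets [30-39], [70-79]. Iteration 76 falls in chunk [70-79], so thread 3 executes it.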
Optimize the following code by removing redundant barriers:
#pragma omp parallel private(i)
{
    #pragma omp for
    for (i = 0; i < n; i++)
        a[i] += b[i];
    #pragma omp for
    for (i = 0; i < n; i++)
        c[i] += d[i];
    #pragma omp barrier
    #pragma omp for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] + c[i];
}
Solution:
#pragma omp parallel private(i)
{
    #pragma omp for nowait   /* safe: the next loop touches only c[] and d[] */
    for (i = 0; i < n; i++)
        a[i] += b[i];
    #pragma omp for nowait   /* synchronization deferred to the explicit barrier below */
    for (i = 0; i < n; i++)
        c[i] += d[i];
    #pragma omp barrier      /* all of a[] and c[] must be written before the reduction reads them */
    #pragma omp for nowait reduction(+:sum)   /* the implicit barrier at the end of the parallel region suffices */
    for (i = 0; i < n; i++)
        sum += a[i] + c[i];
}
What is the final result of IS? (2+8+9+10 = 29.) LASTPRIVATE(IS) means that the thread that executes the last iteration copies its IS value back to the master thread after the parallel region, so the value of IS depends on which thread executes the last iteration (J=10). The most common way of chunking the loop is [1-4] for thread 0, [5-7] for thread 1, and [8-10] for thread 2. The thread that executes the last iteration (10) thus executes iterations 8, 9, and 10, so IS = 2 (the initial value each thread receives from FIRSTPRIVATE(IS)) + 8 + 9 + 10 = 29. It is also possible that thread 2 executes [7-10] or thread 0 executes [1-3,10], which would yield different results for IS.
      IS = 2
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
C$OMP& LASTPRIVATE(IS) NUM_THREADS(3)
      DO J=1,10
        IS = IS + J
      END DO
C$OMP END PARALLEL DO
      print *, IS
Understand false-sharing and first-touch policy
What are the two major steps of parallel algorithm design? Tasks and decomposition, and processes and mapping.
Explain what the critical path of a task graph is and what it represents. The critical path is the longest weighted path through the graph, and it represents the shortest time in which the program can be finished.
Given a task graph, identify the critical path, and calculate the path weight and average degree of concurrency (speedup).
What is task interaction graph?
Give an example of an algorithm to which you can apply recursive decomposition and explain it in detail.
Give an example of data decomposition and explain it in detail.
What are the general design guidelines and techniques for minimizing interactions between tasks when mapping tasks onto processors?
Latency and bandwidth are two important metrics for a network. Explain the α-β model for network modeling. If a network has 7.6 us latency and 10 GB/s bandwidth, how long will it take to transfer 1 MByte of data using the α-β model?
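A worked sketch, treating 1 MByte as 10^6 bytes: T = α + n/BW = 7.6 us + 10^6 bytes / (10 × 10^9 bytes/s) = 7.6 us + 100 us = 107.6 us.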
In the network LogP model, the network EEL (end-to-end latency) includes the overheads of both the sender and the receiver, and the wire latency (L). Why do we often see that the EEL is less than the sum of the overheads and the wire latency? Draw a diagram to explain.
Explain what blocking message passing is and what non-blocking message passing is.
An MPI collective call often has different semantics on different processes, even though all the processes make and participate in the same call. Explain the semantics of the MPI_Reduce call:
int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int target, MPI_Comm comm)
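A minimal usage sketch (the variable names and the choice of MPI_SUM are illustrative, not part of the question): every rank supplies a significant sendbuf, but the reduced result is delivered only to the recvbuf of the target (root) rank.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int local = rank + 1;   /* every rank contributes one value */
    int global = 0;         /* meaningful only on the target (rank 0) after the call */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)          /* only the target rank holds the reduced result */
        printf("sum = %d\n", global);   /* 1 + 2 + ... + nprocs */
    MPI_Finalize();
    return 0;
}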
In Assignment #3, the 2-dimensional array is decomposed using row-wise distribution among all the MPI processes. E.g., for array u[256][256] and 4 processes, each MPI process will compute 256/4 = 64 rows of data. Check the MPI Exercise slide.
Why does each process need to have 65 rows (processes 0 and 3) or 66 rows (processes 1 and 2) of data? (Why do processes 0 and 3 need one more row of data, and processes 1 and 2 two more rows of data, than they actually compute?)
For boundary exchange using blocking MPI_Send/MPI_Recv, the order of the MPI_Send/MPI_Recv calls made by each process is critical for not introducing deadlock. Draw a diagram and explain in detail the MPI_Send/MPI_Recv calls used for boundary exchange in this 4-process example.
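One common deadlock-free ordering, sketched below with illustrative buffer names (only the ghost-row exchange is shown, not the assignment's computation): even-numbered ranks send first and then receive, while odd-numbered ranks receive first and then send, so every blocking MPI_Send meets an already-posted MPI_Recv.
#include <mpi.h>
#define N 256
int main(int argc, char *argv[]) {
    int rank, nprocs;
    double my_top[N], my_bot[N];        /* my first and last owned rows (illustrative) */
    double ghost_top[N], ghost_bot[N];  /* ghost rows received from the neighbors */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    for (int j = 0; j < N; j++) my_top[j] = my_bot[j] = rank;
    int up = rank - 1, down = rank + 1;
    if (rank % 2 == 0) {    /* even ranks: send first, then receive */
        if (down < nprocs) {
            MPI_Send(my_bot, N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD);
            MPI_Recv(ghost_bot, N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        if (up >= 0) {
            MPI_Send(my_top, N, MPI_DOUBLE, up, 0, MPI_COMM_WORLD);
            MPI_Recv(ghost_top, N, MPI_DOUBLE, up, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    } else {                /* odd ranks: receive first, then send */
        if (up >= 0) {
            MPI_Recv(ghost_top, N, MPI_DOUBLE, up, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(my_top, N, MPI_DOUBLE, up, 0, MPI_COMM_WORLD);
        }
        if (down < nprocs) {
            MPI_Recv(ghost_bot, N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(my_bot, N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}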
An MPI collective call often has different semantics on different processes, even though all the processes make and participate in the same call. Explain the semantics of the MPI_Allreduce call:
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
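A minimal usage sketch to contrast with MPI_Reduce (again with illustrative names): every rank contributes its local value and every rank receives the reduced result, so there is no root/target argument.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank, local, global;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 1;
    /* unlike MPI_Reduce, every rank receives the reduced value in global */
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d sees sum = %d\n", rank, global);
    MPI_Finalize();
    return 0;
}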
Explain why parallel execution time also depends on the platform and communication parameters. (Explain why analyzing parallel program execution time is more complicated than analyzing sequential programs.)
Give examples of parallel overheads
Explain each of the five metrics of parallel program analysis: Execution Time, Overhead, Speedup, Efficiency, and Cost
Fill in the following table, assuming the serial execution time is 100 (a possible filled-in solution is sketched after the table).
Num of Processors | Parallel Execution Time | Speedup | Efficiency | Cost | Parallel Overhead
---|---|---|---|---|---
2 | 60 | | | |
4 | 40 | | | |
8 | 20 | | | |
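A possible filled-in solution, assuming Cost = p × Tp, Efficiency = Speedup / p, and Parallel Overhead = p × Tp − Ts (the numbers below follow directly from these definitions):

Num of Processors | Parallel Execution Time | Speedup | Efficiency | Cost | Parallel Overhead
---|---|---|---|---|---
2 | 60 | 1.67 | 0.83 | 120 | 20
4 | 40 | 2.5 | 0.625 | 160 | 60
8 | 20 | 5 | 0.625 | 160 | 60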
Explain what is linear speedup.
Why is superlinear speedup possible? Use examples to explain.
What is Amdahl's Law? A program has three portions: a sequential portion, 30% that can be sped up 4 times, and 50% that can be accelerated 2 times. What is the overall speedup according to Amdahl's Law?
The sequential portion is 1 - 30% - 50% = 20%.
Speedup = 1 / (0.2 + 0.3/4 + 0.5/2) = 1 / 0.525 ≈ 1.9
Explain the scalability of parallel systems and the two types of scaling we talked about in class.
Explain what the isoefficiency metric is and how to use isoefficiency analysis for the sum example we discussed in class.
There are three types of hardware parallelism we mainly talk about: instruction-level parallelism, thread-level parallelism, and data-level parallelism. Give some examples of instruction-level parallelism you are aware of from computer architecture courses.
What are the three types of thread-level parallelism? What is hyperthreading?
What is Flynn's taxonomy for classifying parallel architectures? According to Flynn's taxonomy, what is a single-core architecture? What is a vector architecture? What is a GPU architecture? What is a multicore architecture? What is a multi-CPU system?
What is data-level parallelism? What architectures are designed to handle data-parallel applications effectively?
What are the two kinds of locality in a program? Given the following program, identify which statements/variables exhibit which of the two kinds of locality.
sum = 0;
for (i = 0; i < n; i++)
sum += a[i];
return sum;
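A possible answer, as a sketch: the repeated accesses to sum in every iteration exhibit temporal locality, the stride-1 accesses to a[i] exhibit spatial locality, and the instructions of the loop body are fetched repeatedly and sequentially, giving both kinds.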
What should expert programmers (us) do to write locality-friendly code? Explain using examples.
Questions about calculating misses per inner-loop iteration for the matrix multiplication example discussed in class.
Explain, using the following example, what cache coherence and false sharing are.
int A[4], B[5], C[4]; /* assuming the cache line size is 8 int elements, and A and B will NOT be in the same cache line */
#pragma omp parallel for num_threads(2)
for (int i = 0; i < 4; i++) {
    B[i] = A[i];
    C[i] = B[i];
}