In this assignment, you will become familiar with the CUDA development workflow by implementing square matrix multiplication (A[N][N] * B[N][N] = C[N][N]) on the GPU using CUDA and the cuBLAS library, and by studying their performance. The implementation includes three versions: a naive CUDA kernel (version 1), a shared-memory CUDA kernel (version 2), and a cuBLAS-based version (version 3).
Your implementation will mainly refactor the provided skeleton code matmul.cu. Please refer to the CUDA programming guide, in which Figure 9 provides the implementation of version 1 and Figure 10 that of version 2. For version 3, you can refer to the solution in matrixMulCUBLAS-CUBLASExample.cpp; matrixMul-CUDAExample.cu is a complete file for version 2. You can find the documentation and other examples at http://docs.nvidia.com/cuda/cuda-samples/index.html. Your implementation will thus be based on matmul.cu and needs to include two CUDA kernels, for versions 1 and 2, plus code for memory allocation and data movement. For version 3, the implementation is mainly memory allocation/data movement code and a call to the sgemm procedure.
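For version 3, the sgemm call might look like the following sketch (requires #include <cublas_v2.h>; the device-pointer names d_A, d_B, d_C are illustrative, not taken from the skeleton). Note that cuBLAS assumes column-major storage, so one common trick for computing a row-major C = A * B is to swap the operand order, which yields C^T in column-major, i.e. C in row-major:

```cuda
// Hedged sketch of the cuBLAS version (version 3); names are illustrative.
cublasHandle_t handle;
cublasCreate(&handle);

const float alpha = 1.0f, beta = 0.0f;
// Swapping A and B computes C^T = B^T * A^T in column-major terms,
// which equals C = A * B when the arrays are stored row-major.
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            N, N, N,
            &alpha,
            d_B, N,
            d_A, N,
            &beta,
            d_C, N);

cublasDestroy(handle);
```

Memory allocation and the host-to-device/device-to-host copies around this call are the same as for the two kernel versions.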
The kernels for versions 1 and 2 need a 2-dimensional topology for both the threads of a block and the blocks of the grid; please use 16x16 as the block size. Each thread computes one element of matrix C. Kernels 2 and 3 already use this configuration. To simplify, we assume the matrix size N is a power of two (64, 128, 256, 512, ...).
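As a starting point, version 1 can be sketched roughly as below (a minimal, illustrative kernel assuming row-major float matrices; the function and pointer names are not from the skeleton code):

```cuda
#define BLOCK_SIZE 16

// Naive matmul kernel (version 1): each thread computes one element of C.
__global__ void matmul_kernel_v1(const float *A, const float *B,
                                 float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Launch with a 2-D grid of 16x16 blocks; since N is a power of two
// (and at least 64), N is evenly divisible by BLOCK_SIZE:
//   dim3 block(BLOCK_SIZE, BLOCK_SIZE);
//   dim3 grid(N / BLOCK_SIZE, N / BLOCK_SIZE);
//   matmul_kernel_v1<<<grid, block>>>(d_A, d_B, d_C, N);
```

Version 2 follows the same thread/block topology but stages BLOCK_SIZE x BLOCK_SIZE tiles of A and B through shared memory, as shown in Figure 10 of the CUDA programming guide.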
The provided matmul.cu file includes helper functions: the matmul_base function for the sequential implementation and the matmul_openmp function, an OpenMP-parallelized version for the CPU. You should put the three versions, along with the two CUDA kernels, in the matmul.cu file. The main function needs to be modified to drive and time the three implementations and to report timing and error information. Arrays A, B, and C are now all allocated on the host heap using malloc so that the experiments can run with larger inputs.
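For the device-side memory allocation and data movement that each GPU version needs, a minimal sketch might look like this (assuming host arrays A, B, C of N*N floats from malloc; the d_A/d_B/d_C names are illustrative):

```cuda
// Hedged sketch of host-side allocation and data movement.
size_t bytes = (size_t)N * N * sizeof(float);
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, bytes);
cudaMalloc(&d_B, bytes);
cudaMalloc(&d_C, bytes);

// Copy inputs to the device.
cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);

// ... launch a kernel (versions 1/2) or call cublasSgemm (version 3) ...

// Copy the result back and release device memory.
cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
```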
matmul.cu should be compiled with the nvcc compiler using -Xcompiler -fopenmp to enable compilation of the OpenMP version, e.g.
nvcc -Xcompiler -fopenmp matmul.cu -lpthread -lcublas -o matmul
Your executable should accept two arguments: the first, required, argument is the matrix size N for an NxN square matrix; the second, optional, argument is the number of OpenMP threads for CPU parallelization, defaulting to 5 if not provided. The output of your program should include the computation error, the time (ms), and the FLOP/s performance. Below is a screenshot of the output from running the code, to give you an idea of what the output normally includes.
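One way to time the GPU versions and derive FLOP/s is with CUDA events, as in this sketch (kernel and pointer names are illustrative; square matrix multiply performs about 2*N^3 floating-point operations):

```cuda
// Hedged sketch of timing one GPU version and reporting GFLOP/s.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matmul_kernel_v1<<<grid, block>>>(d_A, d_B, d_C, N);  // or version 2 / cuBLAS
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
double gflops = 2.0 * N * N * N / (ms * 1.0e6);  // 2N^3 flops over ms
printf("time: %.3f ms, perf: %.2f GFLOP/s\n", ms, gflops);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

The CPU versions can instead be timed with a host timer such as omp_get_wtime(), and the error can be reported as, e.g., the maximum absolute difference against the matmul_base result.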
In this exercise, you will implement the CUDA version of the Jacobi computation using the provided jacobi.cu file. The file includes skeleton code and TODO items for your implementation; please check the code for details.
Your development can be done on any machine that has an NVIDIA GPU and the CUDA driver and compiler. In this assignment, if you want to report performance results, you should use the Bridges supercomputer at PSC. Please check the XSEDE PSC Bridges Supercomputer section for details on how to submit jobs, compile, and use a GPU node to run your program.
Your submission should be a single zipped file named LastNameFirstName.zip that includes the following: matmul.cu, jacobi.cu, and one optional PDF file for your report. Please remove all other files, including executables, Excel sheets, etc. The source files contain your implementation and should compile to generate the executables.
The report, if included, is a PDF document of at most 3 pages that includes the following. The report for this assignment is optional; 20 bonus points will be given for a complete report.
Function implementation: 100 points for the implementation
1. matmul 1: 15
2. matmul 2: 15
3. matmul 3: 10
4. jacobi TODO #1: 15
5. jacobi TODO #2: 10
6. jacobi TODO #3: 5
7. jacobi TODO #4: 5
8. jacobi TODO #5: 5
9. jacobi TODO #6: 10
10. jacobi TODO #7: 5
11. jacobi TODO #8: 5
For a source file that cannot be compiled, you receive at most 60% of the function implementation points. For code that compiles but has execution errors or incorrect results, you receive at most 80% of the function implementation points.
Report: Bonus 20 points.