CSCE569

Assignment 4 – CSCE569, Spring 2018

Due: 11:55PM April 30th Monday 2018


GPU/CUDA Implementation of Dense Matrix Multiplication

In this assignment, you will become familiar with the CUDA development workflow by implementing square matrix multiplication (A[N][N] * B[N][N] = C[N][N]) on the GPU using CUDA and the CUBLAS library, and by studying its performance. The implementation includes three versions:

  1. input matrices A and B are stored in global memory, and the kernel reads data directly from global memory;
  2. input matrices A and B are staged in the shared memory of each thread block, and the computation reads data from shared memory;
  3. the implementation directly calls the sgemm routine of the CUBLAS library to perform the computation.

Your implementation will mainly refactor the provided skeleton code matmul.cu. Please refer to the CUDA programming guide, in which Figure 9 provides the implementation of version 1 and Figure 10 provides version 2. For version 3, you can refer to the solution in matrixMulCUBLAS-CUBLASExample.cpp; matrixMul-CUDAExample.cu is a complete file for version 2. You can find the documentation and other examples at http://docs.nvidia.com/cuda/cuda-samples/index.html. Your implementation, based on matmul.cu, needs to include two CUDA kernels for versions 1 and 2, plus code for memory allocation and data movement. For version 3, the implementation is mainly memory allocation/data movement and a call to the sgemm routine.
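As a rough guide for version 3, the host side amounts to allocating device memory, copying the inputs, and calling sgemm. The sketch below (variable and function names are hypothetical; error checking omitted) uses the common trick for row-major data: since CUBLAS assumes column-major storage, swapping the operand order computes B^T * A^T = (A*B)^T in column-major, which is exactly A*B in row-major.

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Sketch only: how version 3 might be structured, not the required code.
void matmul_cublas(int N, const float *A, const float *B, float *C) {
    float *d_A, *d_B, *d_C;
    size_t bytes = (size_t)N * N * sizeof(float);
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Note the swapped operand order (d_B first) for the row-major trick.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, d_B, N, d_A, N, &beta, d_C, N);

    cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
```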

The kernels for versions 1 and 2 need a two-dimensional topology of both threads within a block and blocks within the grid; please choose 16x16 for the block size. Each thread computes one element of matrix C. Kernels 2 and 3 already use this configuration. To simplify, we assume the matrix size N is a power of 2 (64, 128, 256, 512, ...).
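The two kernels described above might be sketched as follows; this mirrors the pattern of Figures 9 and 10 in the CUDA programming guide, but names and layout here are illustrative, not the required skeleton:

```cuda
#define BLOCK_SIZE 16

// Version 1: every operand is read straight from global memory.
__global__ void matmul_global(int N, const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < N; k++)
        sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;   // each thread writes one element of C
}

// Version 2: the block cooperatively stages 16x16 tiles of A and B in
// shared memory, then accumulates the partial products from the tiles.
__global__ void matmul_shared(int N, const float *A, const float *B, float *C) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < N / BLOCK_SIZE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * N + col];
        __syncthreads();                     // wait until the tile is loaded
        for (int k = 0; k < BLOCK_SIZE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                     // done reading this tile
    }
    C[row * N + col] = sum;
}

// Launch configuration: because N is a power of 2 (>= 64), N is divisible
// by BLOCK_SIZE and no boundary checks are needed.
// dim3 block(BLOCK_SIZE, BLOCK_SIZE);
// dim3 grid(N / BLOCK_SIZE, N / BLOCK_SIZE);
// matmul_shared<<<grid, block>>>(N, d_A, d_B, d_C);
```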

The matmul.cu file provided includes helper functions, the matmul_base function for the sequential implementation, and the matmul_openmp function, which is the OpenMP-parallelized version for the CPU. You should put the three versions, along with the two CUDA kernels, in the matmul.cu file. The main function needs to be modified to drive and time the three implementations and to report timing and error information. Arrays A, B, and C are now all allocated on the heap on the host using malloc, so we can run the experiments with larger inputs.
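One way (among others) to time the GPU versions from main is with CUDA events, which measure GPU-side elapsed time in milliseconds; the kernel and variable names below are illustrative:

```cuda
cudaEvent_t start, stop;
float elapsed_ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matmul_kernel<<<grid, block>>>(N, d_A, d_B, d_C);  // kernel under test
cudaEventRecord(stop);
cudaEventSynchronize(stop);                        // wait for the kernel
cudaEventElapsedTime(&elapsed_ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```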

matmul.cu should be compiled with the nvcc compiler using -Xcompiler -fopenmp to enable compilation of the OpenMP version, e.g.

nvcc -Xcompiler -fopenmp matmul.cu -lpthread -lcublas -o matmul

Your executable should accept two arguments: the first, required, argument is the matrix size N for an NxN square matrix; the second, optional, argument is the number of OpenMP threads for the CPU parallel version, with a default value of 5 if not provided. The output of your program should include the computation error, time (ms), and FLOP/s performance. Below is a screenshot of the output from running the code, so you get an idea of what the output normally includes.

GPU/CUDA Implementation of Jacobi Kernel

In this exercise, you will implement the CUDA version of the Jacobi computation using the provided jacobi.cu file. The file includes skeleton code and TODO items for your implementation. Please check the code for more details.
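For orientation only (the actual structure must follow the TODOs in the provided jacobi.cu), a typical CUDA Jacobi sweep averages the four neighbors of each interior point, writing into a separate output array so iterations do not race; the two arrays are then swapped between sweeps:

```cuda
// Illustrative 5-point Jacobi step; names and layout are assumptions,
// not the skeleton's actual interface.
__global__ void jacobi_step(int n, int m, const float *u, float *unew) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1 && j > 0 && j < m - 1) {
        unew[i * m + j] = 0.25f * (u[(i - 1) * m + j] + u[(i + 1) * m + j] +
                                   u[i * m + j - 1] + u[i * m + j + 1]);
    }
}
```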

Machines for Development and for Collecting Results

Your development can be done on any machine that has an NVIDIA GPU and the CUDA driver and compiler. In this assignment, if you want to report performance results, you should use the Bridges supercomputer at PSC. Please check the XSEDE PSC Bridges Supercomputer section for details about how to compile, submit jobs, and use a GPU node to run your program.

Submission

Your submission should be a single zipped file named LastNameFirstName.zip that includes the following: matmul.cu and jacobi.cu, and one optional PDF file for your report. Please remove all other files, including the executables, Excel sheet, etc. The source files contain your implementation and should compile to generate the executables.

The report, if included, is a PDF document of at most 3 pages that covers the following. The report for this assignment is optional; 20 bonus points will be given for a complete report.

  1. A short description of how you parallelized Jacobi.
  2. A performance report for matmul using two figures (execution time and timing breakdown, which should be collected using nvprof). An Excel sheet is provided for creating these figures once you enter the execution times you collect (the numbers in the current sheet are dummy values); the two figures will be populated and generated automatically by Excel. Please include those figures in your report.
  3. An explanation of the performance results shown in your figures, drawing meaningful conclusions.

Grading:

  1. Function implementation: 100 points for the implementation

    1. matmul 1: 15
    2. matmul 2: 15
    3. matmul 3: 10
    4. jacobi TODO #1: 15
    5. jacobi TODO #2: 10
    6. jacobi TODO #3: 5
    7. jacobi TODO #4: 5
    8. jacobi TODO #5: 5
    9. jacobi TODO #6: 10
    10. jacobi TODO #7: 5
    11. jacobi TODO #8: 5
    

    For a source file that cannot be compiled, you receive at most 60% of the function implementation points. For code that compiles but has execution errors or incorrect results, you receive at most 80% of the function implementation points.

  2. Report: Bonus 20 points.

Assignment Policy

  1. Programming assignments are to be done individually. You may discuss assignments with others, but you must code your own solutions and submit them with a write-up in your own words. Indicate clearly the name(s) of student(s) you collaborated with, if any.
  2. Although homework assignments will not be pledged, per se, the submitted solutions must be your work and not copied from other students’ assignments or other sources.
  3. You may not transmit or receive code from anyone in the class in any way: visually (by showing someone your code), electronically (by emailing, posting, or otherwise sending someone your code), verbally (by reading your code to someone), or in any other way.
  4. You may not collaborate with people who are not your classmates, TAs, or instructor in any way. For example, you may not post questions to programming forums.
  5. Any violations of these rules will be reported to the honor council. Check the syllabus for the late policy and academic conduct.