# Lecture 22: Data Level Parallelism -- Graphical Processing Unit (GPU) and Loop-Level Parallelism

**CSCE 513 Computer Architecture** 

**Department of Computer Science and Engineering** 

**Yonghong Yan** 

yanyh@cse.sc.edu

https://passlab.github.io/CSCE513

# **Topics for Data Level Parallelism (DLP)**

- Parallelism (centered around ... )
  - Instruction Level Parallelism
  - Data Level Parallelism
  - Thread Level Parallelism
- DLP Introduction and Vector Architecture - 4.1, 4.2
- SIMD Instruction Set Extensions for Multimedia
   4.3
- Graphical Processing Units (GPU)
  - 4.4
- GPU and Loop-Level Parallelism and Others – 4.4, 4.5

## **Computer Graphics**



# **Graphics Processing Unit (GPU)**



Image: http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG\_BasicsTheory.html



## **Recent GPU Architecture**

- Unified Scalar Shader Architecture
- Highly Data Parallel Stream Processing



**3D Graphics Rendering Pipeline**: Output of one stage is fed as input of the next stage. A vertex has attributes such as (x, y, z) position, color (RGB or RGBA), vertex-normal  $(n_x, n_y, n_z)$ , and texture. A primitive is made up of one or more vertices. The rasterizer raster-scans each primitive to produce a set of grid-aligned fragments, by interpolating the vertices.

Image: http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG\_BasicsTheory.html

An Introduction to Modern GPU Architecture, Ashu Rege, NVIDIA Director of Developer Technology ftp://download.nvidia.com/developer/cuda/seminar/TDCI\_Arch.pdf

#### **Unified Shader Architecture**



**FIGURE A.2.5 Basic unified GPU architecture.** Example GPU with 112 streaming processor (SP) cores organized in 14 streaming multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA GeForce 8800. The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM has eight SP cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a <sup>6</sup> shared memory. Copyright © 2009 Elsevier, Inc. All rights reserved.

# **GPU Today**

- It is a processor optimized for 2D/3D graphics, video, visual computing, and display.
- It is highly parallel, highly multithreaded multiprocessor optimized for visual computing.
- It provide real-time visual interaction with computed objects via graphics images, and video.
- It serves as both a programmable graphics processor and a scalable parallel computing platform.
  - Heterogeneous systems: combine a GPU with a CPU
- It is called as Many-core

# Latest NVIDIA Volta GV100 GPU

- Released May 2017
  - Total 84 Stream Multiprocessors (SM)
- Cores
  - 5120 FP32 cores
    - Can do FP16 also
  - 2560 FP64 cores
  - 640 Tensor cores
- Memory
  - 16G HBM2
  - L2: 6144 KB
  - Shared memory: 96KB \* 80 (SM)
  - Register File: 20,480 KB (Huge)

https://devblogs.nvidia.com/parallelforall/inside-volta/



# SM of Volta GPU

- Released May 2017
  - Total 84 SM
- Cores
  - 5120 FP32 cores
    - Can do FP16 also
  - 2560 FP64 cores
  - 640 Tensor cores
- Memory
  - 16G HBM2
  - L2: 6144 KB
  - Shared memory: 96KB \* 80 (SN)
  - Register File: 20,480 KB (Huge)

| _                              |                                 |           |           |                                     |           |           |                      | L1 Instruc                | tion                                                   | Cache                         |           |           |           |           |           |                        |          |
|--------------------------------|---------------------------------|-----------|-----------|-------------------------------------|-----------|-----------|----------------------|---------------------------|--------------------------------------------------------|-------------------------------|-----------|-----------|-----------|-----------|-----------|------------------------|----------|
| L0 Instruction Cache           |                                 |           |           |                                     |           |           | L0 Instruction Cache |                           |                                                        |                               |           |           |           |           |           |                        |          |
| Warp Scheduler (32 thread/clk) |                                 |           |           |                                     |           |           |                      |                           | War                                                    | p Sch                         | edule     | r (32 t   | hread     | /clk)     |           |                        |          |
|                                |                                 | Di        | spatcl    | h Unit                              | (32 th    | read/c    | :lk)                 |                           |                                                        | Dispatch Unit (32 thread/clk) |           |           |           |           |           |                        |          |
|                                | Register File (16,384 x 32-bit) |           |           |                                     |           |           |                      |                           | Register File (16,384 x 32-bit)                        |                               |           |           |           |           |           |                        |          |
| FP                             | 64                              | INT       | INT       | FP32                                | FP32      |           |                      |                           |                                                        | FP64                          | INT       | INT       | FP32      | FP32      |           |                        |          |
| FP                             | 64                              | INT       | INT       | FP32                                | FP32      | $\vdash$  |                      |                           |                                                        | FP64                          | INT       | INT       | FP32      | FP32      | $\vdash$  | $\left  \cdot \right $ | $\vdash$ |
| FP                             | 64                              | INT       | INT       | FP32                                | FP32      | TENSOR    | TENSOR               |                           | FP64                                                   |                               | INT       | FP32      | FP32      |           | TENSOR    |                        |          |
| FP                             | 64                              | INT       | INT       | FP32                                | FP32      |           |                      |                           | FP64                                                   |                               | INT       | FP32      | FP32      | TENSOR    |           |                        |          |
| FP                             | FP64                            |           | INT       | FP32                                | FP32      | CORE      | CORE                 |                           | FP64                                                   |                               | INT       | FP32      | FP32      | CORE      | CORE      |                        |          |
| FP                             | 64                              | INT       | INT       | FP32                                | FP32      |           |                      |                           |                                                        | FP64                          | INT       | INT       | FP32      | FP32      | Ħ         |                        |          |
| FP                             |                                 | INT       | INT       | FP32                                |           |           |                      |                           |                                                        | FP64                          | INT       | INT       |           | FP32      | H         |                        |          |
| FP                             |                                 | INT       | INT       | FP32                                |           |           |                      |                           |                                                        | FP64                          | INT       | INT       | FP32      |           | H         |                        |          |
| LD/<br>ST                      | LD/<br>ST                       | LD/<br>ST | LD/<br>ST | LD/<br>ST                           | LD/<br>ST | LD/<br>ST | LD/<br>ST            | SFU                       |                                                        | .D/ LD/<br>ST ST              | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST              | SFU      |
|                                |                                 | Wai       |           | nstruc<br>nedule                    |           |           | clk)                 |                           | L0 Instruction Cache<br>Warp Scheduler (32 thread/clk) |                               |           |           |           |           |           |                        |          |
|                                |                                 | Di        | spatcl    | h Unit                              | (32 th    | read/c    | :lk)                 |                           |                                                        | Dispatch Unit (32 thread/clk) |           |           |           |           |           |                        |          |
|                                |                                 | Reg       | ister     | File (′                             | 16,384    | 4 x 32    | -bit)                |                           | Register File (16,384 x 32-bit)                        |                               |           |           |           |           |           |                        |          |
| FP                             | 64                              | INT       | INT       | FP32                                | FP32      | $\square$ |                      | $\square \square \square$ | $\  \Gamma$                                            | FP64                          | INT       | INT       | FP32      | FP32      | H         |                        |          |
| FP                             | 64                              | INT       | INT       | FP32                                | FP32      | +         |                      |                           |                                                        | FP64                          | INT       | INT       | FP32      | FP32      | H         |                        |          |
| FP                             | 64                              | INT       | INT       | FP32                                | FP32      | H         |                      |                           |                                                        | FP64                          | INT       | INT       | FP32      | FP32      | H         |                        |          |
| FP                             | 64                              | INT       | INT       | FP32                                | FP32      | TEN       | SOR                  | TENSOR                    |                                                        | FP64                          | INT       | INT       | FP32      | FP32      | TEN       | ISOR                   | TENSOR   |
| FP                             | FP64                            | INT       | INT       | FP32                                | FP32      | CORE      | CORE                 |                           | FP64                                                   | INT                           | INT       | FP32      | FP32      | cc        | DRE       | CORE                   |          |
| FP                             | 64                              | INT       | INT       | FP32                                | FP32      | H         |                      |                           |                                                        | FP64                          | INT       | INT       | FP32      | FP32      | H         |                        |          |
| FP                             | 64                              | INT       | INT       | FP32                                | FP32      | -         |                      |                           |                                                        | FP64                          | INT       | INT       | FP32      | FP32      | H         |                        |          |
|                                | 64                              | INT       | INT       | FP32                                | FP32      | H         |                      |                           |                                                        | FP64                          | INT       | INT       | FP32      | FP32      | Ħ         |                        |          |
| FP                             | LD/                             | LD/<br>ST | LD/<br>ST | LD/<br>ST                           | LD/<br>ST | LD/<br>ST | LD/<br>ST            | SFU                       |                                                        | .D/ LD/<br>ST ST              | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST              | SFU      |
| FP<br>LD/<br>ST                | ST                              |           |           | 128KB L1 Data Cache / Shared Memory |           |           |                      |                           |                                                        |                               |           |           |           |           |           |                        |          |
| LD/                            |                                 |           |           |                                     |           |           | 128K                 | B L1 Data Cac             | :he / \$                                               | hared M                       | emory     |           | _         |           |           |                        |          |

# SM of Volta GPU

- Released May 2017
  - Total 84 SM
- Cores
  - 5120 FP32 cores
    - Can do FP16 also
  - 2560 FP64 cores
  - 640 Tensor cores
- Memory
  - 16G HBM2
  - L2: 6144 KB
  - Shared memory: 96KB \* 80 (SM)
  - Register File: 20,480 KB (Huge)

| S | SM                             |           |           |           |           |           |           |           |            |  |
|---|--------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|------------|--|
| L |                                |           |           |           |           |           |           |           | L1 Instruc |  |
| Г | L0 Instruction Cache           |           |           |           |           |           |           |           |            |  |
|   | Warp Scheduler (32 thread/clk) |           |           |           |           |           |           |           |            |  |
|   | Dispatch Unit (32 thread/clk)  |           |           |           |           |           |           |           |            |  |
|   |                                |           | Reg       | ister     | File (′   | 16,384    | 4 x 32    | ?-bit)    |            |  |
|   | FP                             | 64        | INT       | INT       | FP32      | FP32      | $\square$ |           |            |  |
|   | FP64<br>FP64<br>FP64<br>FP64   |           | INT       | INT       | FP32      | FP32      | +         |           |            |  |
|   |                                |           | INT       | INT       | FP32      | FP32      |           |           |            |  |
|   |                                |           | INT       | INT       | FP32      | FP32      |           | SOR       | TENSOR     |  |
|   |                                |           | INT       | INT       | FP32      | FP32      | CC        | DRE       | CORE       |  |
|   | FP64                           |           | INT       | INT       | FP32      | FP32      |           |           |            |  |
|   | FP64<br>FP64                   |           | INT       | INT       | FP32      | FP32      |           |           |            |  |
|   |                                |           | INT       | INT       | FP32      | FP32      | $\vdash$  |           |            |  |
|   | LD/<br>ST                      | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | SFU        |  |

#### **GPU Performance Gains Over CPU**

Theoretical GFLOP/s



11

#### **GPU Performance Gains Over CPU**



# **Programming for NVIDIA GPUs**

|                              | GPU Computing Applications                       |               |                                             |                       |                                                             |         |                        |                              |  |  |  |
|------------------------------|--------------------------------------------------|---------------|---------------------------------------------|-----------------------|-------------------------------------------------------------|---------|------------------------|------------------------------|--|--|--|
|                              |                                                  |               |                                             |                       |                                                             |         |                        |                              |  |  |  |
|                              | CUFFT<br>CUBLAS<br>CURAND<br>CUSPARSE            | CULA<br>MAGMA | Thrust<br>NPP<br>OpenCu                     |                       | A Phys                                                      |         |                        | MATLAB<br>Mathematica        |  |  |  |
|                              |                                                  |               | Progra                                      | Programming Languages |                                                             |         |                        |                              |  |  |  |
|                              | с                                                | C++           | Forti                                       | ran                   | Java<br>Python<br>Wrappers                                  | DirectC | ompute                 | Directives<br>(e.g. OpenACC) |  |  |  |
|                              |                                                  |               | CUDA-Enabled NVIDIA GPUs                    |                       |                                                             |         |                        |                              |  |  |  |
|                              | Kepler Archite<br>(compute capat                 |               | GeForce 600                                 | Series                | Quadro Keple                                                | rSeries | Tesla K20<br>Tesla K10 |                              |  |  |  |
|                              | Fermi Architecture<br>(compute capabilities 2.x) |               | GeForce 500<br>GeForce 400                  |                       | Quadro Fermi Series                                         |         | Tesla 20 Series        |                              |  |  |  |
|                              | Tesla Architec<br>(compute capat                 |               | GeForce 200<br>GeForce 9 Se<br>GeForce 8 Se | ries                  | Quadro FX Series<br>Quadro Plex Series<br>Quadro NVS Series |         | Tesla 10 Series        |                              |  |  |  |
| http://docs.n<br>programming |                                                  | uda/cuda      | Entertain                                   | ment                  | Professio<br>Graphic                                        |         |                        | Performance<br>omputing      |  |  |  |

## **CUDA**(Compute Unified Device Architecture)

#### Both an *architecture* and *programming model*

- Architecture and execution model
  - Introduced in NVIDIA in 2007
  - Get highest possible execution performance requires understanding of hardware architecture
- Programming model
  - Small set of extensions to C
  - Enables GPUs to execute programs written in C
  - Within C programs, call SIMT "kernel" routines that are executed on GPU.

# **CUDA Thread**

- Parallelism in Vector/SIMD is the combination of lanes (# PUs) and vector length
- CUDA thread is a unified term that abstract the parallelism for both programmers and GPU execution model
  - Programmer: A CUDA thread performs operations for one data element (think of this way as of now)
    - There could be thousands or millions of threads
  - A CUDA thread represents a hardware FU
    - GPU calls it a core (much simpler than a conventional CPU core)
- Hardware-level parallelism is more explicit

# **CUDA Thread Hierarchy:**

- Allows flexibility and efficiency in processing 1D, 2-D, and 3-D data on GPU.
- Linked to internal organization
- Threads in one block execute together.





Creating a number of threads which is (or slightly greater) the number of elements to be processed, and each thread launch the same daxpy function.

## **DAXPY with Device Code**

\_\_global\_\_\_void daxpy( ... )

- CUDA C/C++ keyword \_\_global\_ indicates a function that:
  - Runs on the device
  - Is called from host code
- nvcc compiler separates source code into host and device components
  - Device functions (e.g. axpy()) processed by NVIDIA compiler
  - Host functions (e.g. main()) processed by standard host compiler
    - gcc, cl.exe

# **DAXPY with Device COde**

#### axpy<<<num\_blocks,num\_threads>>>();

- Triple angle brackets mark a call from *host* code to *device* code
  - Also called a "kernel launch"
  - <<< ... >>> parameters are for thread dimensionality
- That's all that is required to execute a function on the GPU!



#### **GPU Computing – Offloading Computation**

 The GPU is connected to the CPU by a reasonable fast bus (8 GB/s is typical today): PCIe





- Terminology
  - Host: The CPU and its memory (host memory)
  - Device: The GPU and its memory (device memory)

# **Simple Processing Flow**



# **Simple Processing Flow**



# **Simple Processing Flow**





# **Offloading Computation**



#### **CUDA Programming Model for NVIDIA GPUs**

- The CUDA API is split into:
  - The CUDA Management API
  - The CUDA Kernel API
- The CUDA Management API is for a variety of operations
  - GPU memory allocation, data transfer, execution, resource creation
  - Mostly regular C function and calls
- The CUDA Kernel API is used to define the computation to be performed by the GPU
  - C extensions

# **CUDA Kernel**, i.e. Thread Functions

- A CUDA kernel:
  - Defines the operations to be performed by a single thread on the GPU
  - Just as a C/C++ function defines work to be done on the CPU
  - Syntactically, a kernel looks like C/C++ with some extensions

```
__global__ void kernel(...) {
   ...
}
```

- Every CUDA thread executes the same kernel logic (SIMT)
- Initially, the only difference between threads are their thread coordinates

# Programming View: How are CUDA threads organized?

- CUDA thread hierarchy
  - Thread Block = SIMT Groups that run concurrently on an SM
    - Can barrier sync and have shared access to the SM shared memory
  - Grid = All Thread Blocks created by the same kernel launch
    - Shared access to GPU global memory



- Launching a kernel is simple and similar to a function call.
  - kernel name and arguments
  - # of thread blocks/grid and # of threads/block to create: kernel<<<nblocks,</li>

threads\_per\_block>>>(arg1, arg2, ...);

- Threads can be configured in one-, two-, or threedimensional layouts
  - One-dimensional blocks and grids:

int nblocks = 4;

int threads\_per\_block = 8;

kernel<<<nblocks, threads\_per\_block>>>(...);



- Threads can be configured in one-, two-, or threedimensional layouts
  - Two-dimensional blocks and grids:

```
dim3 nblocks(2,2)
```

dim3 threads\_per\_block(4,2);

kernel<<<nblocks, threads\_per\_block>>>(...);



- Threads can be configured in one-, two-, or threedimensional layouts
  - Two-dimensional grid and one-dimensional blocks: dim3 nblocks(2,2); int threads\_per\_block = 8; kernel<<<nblocks, threads\_per\_block>>>(...);



- The number of blocks and threads per block is exposed through *intrinsic thread coordinate variables*:
  - Dimensions

– IDs

| Variable                                             | Meaning                                                  |
|------------------------------------------------------|----------------------------------------------------------|
| gridDim.x, gridDim.y,<br>gridDim.z                   | Number of blocks in a kernel launch.                     |
| <pre>blockIdx.x, blockIdx.y,</pre>                   | Unique ID of the block that contains the current thread. |
| blockDim.x, blockDim.y,<br>blockDim.z                | Number of threads in each block.                         |
| <pre>threadIdx.x, threadIdx.y,     threadIdx.z</pre> | Unique ID of the current thread within its block.        |

to calculate a globally unique ID for a thread inside a onedimensional grid and one-dimensional block:



- Thread coordinates offer a way to differentiate threads and identify thread-specific input data or code paths.
  - Co-relate data and computation, a mapping

```
__global___void kernel(int *arr) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < 32) {
        arr[tid] = f(arr[tid]);
    } code path for threads with tid < 32
    } else {
        arr[tid] = g(arr[tid]);
    } code path for threads with tid >= 32
}
```

Thread Divergence: useless code path is executed, but then disabled in SIMT execution model (EXE-commit, more later

# How is GPU memory managed?

- CUDA Memory Management API
  - Allocation of GPU memory
  - Transfer of data from the host to GPU memory
  - Free-ing GPU memory
  - Foo(int A[][N]) { }

| Host Function | CUDA Analogue |
|---------------|---------------|
| malloc        | cudaMalloc    |
| memcpy        | cudaMemcpy    |
| free          | cudaFree      |

- Allocate size bytes of GPU memory and store their address at \*devPtr
  - cudaError\_t cudaFree(void \*devPtr);
- Release the device memory allocation stored at devPtr
- Must be an allocation that was created using cudaMalloc

## How is GPU memory managed?

#### cudaError\_t cudaMemcpy(

void \*dst, const void \*src, size\_t count,

enum cudaMemcpyKind kind);

- Transfers count bytes from the memory pointed to by src to dst
- kind can be:
  - cudaMemcpyHostToHost,
  - cudaMemcpyHostToDevice,
  - cudaMemcpyDeviceToHost,
  - cudaMemcpyDeviceToDevice
- The locations of dst and src must match kind, e.g. if kind is cudaMemcpyHostToDevice then src must be a host array and dst must be a device array

#### How is GPU memory managed?

void \*d\_arr, \*h\_arr; h\_addr = ... ; /\* init host memory and data \*/ // Allocate memory on GPU and its address is in d\_arr cudaMalloc((void \*\*)&d\_arr, nbytes);

// Transfer data from a device to a host
cudaMemcpy(h\_arr, d\_arr, nbytes,

cudaMemcpyDeviceToHost);

```
// Free the allocated memory
cudaFree(d_arr);
```

## **CUDA Program Flow**

- At its most basic, the flow of a CUDA program is as follows:
  - 1. Allocate GPU memory
  - 2. Populate GPU memory with inputs from the host
  - 3. Execute a GPU kernel on those inputs
  - 4. Transfer outputs from the GPU back to the host
  - 5. Free GPU memory



# **Offloading Computation**



# **GPU Multi-Threading (SIMD)**

- NVIDIA calls it Single-Instruction, Multiple-Thread (SIMT)
  - Many threads execute the same instructions in lock-step
    - A warp (32 threads)
    - Each thread ≈ vector lane; 32 lanes lock step
  - Implicit synchronization after every instruction (think vector parallelism)



## **GPU Multi-Threading**

 In SIMT, all threads share instructions but operate on their own private registers, allowing threads to store threadlocal state



# **GPU Multi-Threading**

- GPUs execute many groups of SIMT threads in parallel
  - Each executes instructions independent of the others



# Warp Switching

SMs can support more concurrent SIMT groups than core count would suggest → Coarse grained multiwarpping (the term I coined)

- Similar to coarse-grained multi-threading
- Each thread persistently stores its own state in a private register set
  - Enable very efficient context switching between warps
- SIMT warps block if not actively computing
  - Swapped out for other, no worrying about losing state
    - Keeping blocked SIMT groups scheduled on an SM would waste cores



#### **Execution Model to Hardware**

#### • This leads to a nested thread hierarchy on GPUs



#### **NVIDIA PTX (Parallel Thread Execution) ISA**

- Compiler target (Not hardware ISA)
  - Similar to X86 ISA, and use virtual register
  - Both translate to internal form (micro-ops in x86)
    - X86's translation happens in hardware at runtime
    - NVIDIA GPU PTX is translated by software at load time
- Basic format (d is destination, a, b and c are operands)

opcode.type d, a, b, c;

| Туре                                    | .type Specifier       |  |  |  |
|-----------------------------------------|-----------------------|--|--|--|
| Untyped bits 8, 16, 32, and 64 bits     | .b8, .b16, .b32, .b64 |  |  |  |
| Unsigned integer 8, 16, 32, and 64 bits | .u8, .u16, .u32, .u64 |  |  |  |
| Signed integer 8, 16, 32, and 64 bits   | .s8, .s16, .s32, .s64 |  |  |  |
| Floating Point 16, 32, and 64 bits      | .f16, .f32, .f64      |  |  |  |

#### **Basic PTX Operations (ALU, MEM, and Control)**

| Group               | Instruction                                 | Example                 |                  | Meaning                      | Comments                                   |                                |                            |
|---------------------|---------------------------------------------|-------------------------|------------------|------------------------------|--------------------------------------------|--------------------------------|----------------------------|
|                     | arithmetic .type =                          | .s32, .u32, .f32, .s64, |                  |                              |                                            |                                |                            |
| Arithmetic          | add.type                                    | add.f32 d, a, b         |                  | d = a + b;                   |                                            |                                |                            |
|                     | sub.type                                    | sub.f32 d, a, b         |                  | d = a - b;                   |                                            |                                |                            |
|                     | mul.type                                    | mul.f32 d, a, b         |                  | d = a * b;                   |                                            |                                |                            |
|                     | mad.type                                    | mad.f32 d, a, b, c      |                  | d = a * b + c;               | multiply-add                               |                                |                            |
|                     | div.type                                    | div.f32 d, a, b         |                  | d = a / b;                   | multiple microinstructions                 |                                |                            |
|                     | rem.type                                    | rem.u32 d, a, b         |                  | d = a % b;                   | integer remainder                          |                                |                            |
|                     | abs.type                                    | abs.f32 d, a            |                  | <pre>memory.space = .g</pre> | <pre>lobal, .shared, .local, .const;</pre> | .type = .b8, .u8, .s8, .       | b16, .b32, .b64            |
|                     | neg.type                                    | neg.f32 d, a            |                  | <pre>ld.space.type</pre>     | ld.global.b32 d, [a+off]                   | d = *(a+off);                  | load from memory space     |
|                     | min.type                                    | min.f32 d, a, b         | Memory           | <pre>st.space.type</pre>     | st.shared.b32 [d+off], a                   | *(d+off) = a;                  | store to memory space      |
|                     | max.type                                    | max.f32 d, a, b         | Access           | <pre>tex.nd.dtyp.btype</pre> | tex.2d.v4.f32.f32 d, a, b                  | d = tex2d(a, b);               | texture lookup             |
|                     | <pre>setp.cmp.type</pre>                    | setp.lt.f32 p, a, b     |                  |                              | atom.global.add.u32 d,[a],                 |                                | atomic read-modify-write   |
|                     | numeric .cmp = eq, ne, lt, le, gt, ge; unor |                         |                  | atom.spc.op.type             | atom.global.cas.b32 d,[a],                 |                                | operation                  |
|                     | mov.type mov.b32 d, a                       |                         | atom.op = and, o |                              | , xor, add, min, max, exch, cas            |                                |                            |
|                     | selp.type                                   | selp.f32 d, a, b, p     |                  | branch                       | @p bra target                              | if (p) goto target;            | conditional branch         |
|                     | cvt.dtype.atype                             | cvt.f32.s32 d, a        | Control          | call                         | call (ret), func, (params)                 | <pre>ret = func(params);</pre> | call function              |
|                     | <pre>special .type = .f3</pre>              | 32 (some .f64)          | Flow             | ret                          | ret                                        | return;                        | return from function call  |
|                     | rcp.type                                    | rcp.f32 d, a            |                  | bar.sync                     | bar.sync d                                 | wait for threads               | barrier synchronization    |
|                     | sqrt.type                                   | sqrt.f32 d, a           |                  | exit                         | exit                                       | exit;                          | terminate thread execution |
| Special<br>Function | rsqrt.type                                  | rsqrt.f32 d, a          |                  | d = 1/sqrt(a);               | reciprocal square root                     |                                |                            |
|                     | sin.type                                    | sin.f32 d, a            | d = sin(a);      |                              | sine                                       |                                |                            |
|                     | cos.type                                    | cos.f32 d, a            |                  | d = cos(a);                  | cosine                                     |                                |                            |
|                     | lg2.type                                    | 1g2.f32 d, a            |                  | $d = \log(a)/\log(2)$        | binary logarithm                           |                                |                            |
|                     | ex2.type                                    | ex2.f32 d, a            |                  | d = 2 ** a;                  | binary exponential                         |                                |                            |
|                     | <pre>logic.type = .pred</pre>               | ,.b32, .b64             |                  |                              |                                            |                                |                            |
| Logical             | and.type                                    | and.b32 d, a, b         |                  | d = a & b;                   |                                            |                                |                            |
|                     | or.type                                     | or.b32 d, a, b          |                  | d = a   b;                   |                                            |                                |                            |
|                     | xor.type                                    | xor.b32 d, a, b         |                  | d = a ^ b;                   |                                            |                                |                            |
|                     | not.type                                    | not.b32 d, a, b         |                  | d = ~a;                      | one's complement                           |                                |                            |
|                     | cnot.type                                   | cnot.b32 d, a, b        |                  | d = (a==0)? 1:0;             | C logical not                              |                                |                            |
|                     | shl.type                                    | shl.b32 d, a, b         |                  | d = a << b;                  | shift left                                 |                                |                            |
|                     | shr.type                                    | shr.s32 d, a, b         |                  | d = a >> b;                  | shift right                                |                                | 46                         |

#### **NVIDIA PTX GPU ISA Example**

global

#### DAXPY

```
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];</pre>
```

shl.s32 R8, blockldx, 9 (512 or 29) add.s32 R8, R8, threadldx

ld.global.f64 RD0, [X+R8] ld.global.f64 RD2, [Y+R8] mul.f64 R0D, RD0, RD4

(scalar a)

add.f64 R0D, RD0, RD2 st.global.f64 [Y+R8], RD0 ; Thread Block ID \* Block size

- ; R8 = i = my CUDA thread ID
- ; RD0 = X[i]

; Product in RD0 = RD0 \* RD4

; Sum in RD0 = RD0 + RD2 (Y[i]) ; Y[i] = sum (X[i]\*a + Y[i])

# **Conditional Branching in GPU**

- Like vector, GPU branch hardware uses internal masks
- Also uses
  - Branch synchronization stack
    - Entries consist of masks for each core
    - I.e. which threads commit their results (all threads execute)
  - Instruction markers to manage when a branch diverges into multiple execution paths
    - Push on divergent branch
  - …and when paths converge
    - Act as barriers
    - Pops stack
- Per-thread-lane 1-bit predicate register, specified by programmer

# **Conditional Branching in GPU**

- Instruction lock-step execution by multi-threads
- SIMT threads can be "disabled" when they need to execute instructions different from others in their group
  - Mask and commit
- Branch divergence
  - Hurt performance and efficiency



#### **PTX Example**

- if (X[i] != 0) X[i] = X[i] - Y[i];else X[i] = Z[i];
- Id.global.f64RD0, [X+R8] ; RD0 = X[i] setp.neq.s32 P1, RD0, #0 @!P1, bra ELSE1, \*Push

ld.global.f64RD2, [Y+R8] sub.f64 RD0, RD0, RD2 st.global.f64[X+R8], RD0 @P1, bra ENDIF1, \*Comp

- - ; P1 is predicate register 1
  - ; Push old mask, set new mask bits
  - ; if P1 false, go to ELSE1
- ; RD2 = Y[i]
  - ; Difference in RD0
  - ; X[i] = RD0

; X[i] = RD0

- ; complement mask bits
- ; if P1 true, go to ENDIF1
- ELSE1: Id.global.f64 RD0, [Z+R8] ; RD0 = Z[i] st.global.f64 [X+R8], RD0
- ENDIF1: <next instruction>, \*Pop
- ; pop to restore old mask

## **NVIDIA GPU Memory Structures**



## **GPU Memory for CUDA Programming**

Sequence



## **Shared Memory Allocation**

- Shared memory can be allocated statically or dynamically
- Statically Allocated Shared Memory
  - Size is fixed at compile-time
  - Can declare many statically allocated shared memory variables
  - Can be declared globally or inside a device function
  - Can be multi-dimensional arrays

\_\_\_\_\_\_\_shared\_\_\_\_\_int s\_arr[256][256];

## **Shared Memory Allocation**

- Dynamically Allocated Shared Memory
  - Size in bytes is set at kernel launch with a third kernel launch configurable
  - Can only have one dynamically allocated shared memory array per kernel
  - Must be one-dimensional arrays

kernel<<<nblocks, threads\_per\_block,
shared\_memory\_bytes>>>(...);

## **GPU Memory**

- More complicated
- Different usage scope
- Different size, and performance
  - Latency and bandwidth
  - Read-only or R/W cache

| MEMORY   | ON/OFF CHIP | CACHED | ACCESS | SCOPE                | LIFETIME        |
|----------|-------------|--------|--------|----------------------|-----------------|
| Register | On          | n/a    | R/W    | 1 thread             | Thread          |
| Local    | Off         | †      | R/W    | 1 thread             | Thread          |
| Shared   | On          | n/a    | R/W    | All threads in block | Block           |
| Global   | Off         | †      | R/W    | All threads + host   | Host allocation |
| Constant | Off         | Yes    | R      | All threads + host   | Host allocation |
| Texture  | Off         | Yes    | R      | All threads + host   | Host allocation |



#### **GPU and Manycore Architecture**

We only INTRODUCE the programming interface and architecture

For more info:

- http://docs.nvidia.com/cuda/
- Professional CUDA C Programming, John Cheng Max
   Grossman Ty McKercher September 8, 2014, John Wiley &
   Sons
- **Other Related info** 
  - AMD GPU and OpenCL
  - Programming with Accelerator using pragma
    - OpenMP and OpenACC

- Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
  - Loop-carried dependence
- Example 1:

for (i=999; i>=0; i=i-1) x[i] = x[i] + s;

• No loop-carried dependence



- S1 and S2 use values computed by S1 and S2 in previous iteration: loop-carried dependency → serial execution

   A[i] → A[i+1], B[i] → B[i+1]
- S2 uses value computed by S1 in same iteration → not loop carried
  - $A[i+1] \rightarrow A[i+1]$

• Example 3:

S1 uses value computed by S2 in previous iteration but dependence is not circular so loop is parallel

• Transform to:

```
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];
```

- Example 4: for (i=0;i<100;i=i+1) { A[i] = B[i] + C[i]; /\* S1 \*/ D[i] = A[i] \* E[i]; /\* S2 \*/
   No need to store A[i] in S1 and then load A[i] in S2 }
- Example 5: for (i=1;i<100;i=i+1) { Y[i] = Y[i-1] + Y[i]; }

Recurrence: for exploring pipelining parallelism between iterations

## **Finding dependencies**

- Assume indices are affine:
  - a x i + b (i is loop index and a and b are constants)
- Assume:
  - Store to a x i + b, then
  - Load from c x i + d
  - *i* runs from *m* to *n*
  - Dependence exists if:
    - Given j, k such that  $m \le j \le n, m \le k \le n$
    - Store to *a* x *j* + *b*, load from *a* x *k* + *d*, and *a* x *j* + *b* = *c* x *k* + *d*

# **Finding dependencies**

- Generally cannot determine at compile time
- Test for absence of a dependence:
  - GCD test:
    - If a dependency exists, GCD(c,a) must evenly divide (d-b)
- Example:

```
for (i=0; i<100; i=i+1) {
X[2*i+3] = X[2*i] * 5.0;
}
```

a=2, b=3, c=2, and d=0, then GCD(a,c)=2, and d-b=-3. Since 2 does not divide -3, no dependence is possible.

# **Finding dependencies**

• Example 2:

for (i=0; i<100; i=i+1) {
 Y[i] = X[i] / c; /\* S1 \*/
 X[i] = X[i] + c; /\* S2 \*/
 Z[i] = Y[i] + c; /\* S3 \*/
 Y[i] = c - Y[i]; /\* S4 \*/
}</pre>

for (i=0; i<100; i=i+1 {
 T[i] = X[i] / c; /\* Y renamed to T to remov
 X1[i] = X[i] + c;/\* X renamed to X1 to ren
 Z[i] = T[i] + c;/\* Y renamed to T to rem
 Y[i] = c - T[i];</pre>

- True dependencies:
  - S1 to S3 and S1 to S4 because of Y[i], not loop carried
- Antidependence:
  - S1 to S2 based on X[i] and S3 to S4 for Y[i]
- Output dependence:
  - S1 to S4 based on Y[i]

#### Reductions

- Reduction Operation: for (i=9999; i>=0; i=i-1) sum = sum + x[i] \* y[i];
- Transform to... for (i=9999; i>=0; i=i-1) sum [i] = x[i] \* y[i]; for (i=9999; i>=0; i=i-1) finalsum = finalsum + sum[i];
- Do on p processors:

for (i=999; i>=0; i=i-1)

finalsum[p] = finalsum[p] + sum[i+1000\*p];

Note: assumes associativity!

#### **Dependency Analysis**

- Mostly done by compiler before vectorization
  - Can be conservative if compiler is not 100% sure
- For programmer:
  - Write code that can be easily analyzed by compiler for vectorization
  - Use explicit parallel model such as OpenMP or CUDA

```
15 #pragma omp parallel for \
16 shared(a,b,c,chunk) private(i) \
17 schedule(static,chunk)
18 for (i=0; i < n; i++)
19 c[i] = a[i] + b[i];
20 }</pre>
```

https://computing.llnl.gov/tutorials/openMP/

## Wrap-Ups (Vector, SIMD and GPU)

Data-level parallelism