## Friday 11/09 Class and Next Week

- Friday 11/09, 9:00AM, 2A31 (Same classroom)
- I will post assignment 4 by today
- Next week:
- Monday: No class
- Wednesday: regular time via https://webconnect.sc.edu/csce513/
» Check https://passlab.github.io/CSCE513/OnlineAdobeConnect.html for details on how to get your computer ready
- The week after next (11/19 week)
- Monday 11/19 regular class
- Wednesday 11/21, No class because of Thanksgiving
- Then there are two more weeks for this course.
- Final Exam: 9:00AM 12/12 Wednesday


# Lecture 20: Data Level Parallelism -- Introduction and Vector Architecture 

## CSCE 513 Computer Architecture

Department of Computer Science and Engineering
Yonghong Yan
yanyh@cse.sc.edu
https://passlab.github.io/CSCE513

## Very Important Terms for Instruction-Level Parallelism

- Dynamic Scheduling $\rightarrow$ Out-of-order Execution
- Speculation $\rightarrow$ In-order Commit
- Superscalar $\rightarrow$ Multiple Issue

| Techniques | Goals | Implementation | Addressing | Approaches |
| :--- | :--- | :--- | :--- | :--- |
| Dynamic <br> Scheduling | Out-of- <br> order <br> execution | Reservation <br> Stations, Load/Store <br> Buffer and CDB | Data hazards <br> (RAW, WAW, <br> WAR) | Register <br> renaming |
| Speculation | In-order <br> commit | Branch Prediction <br> (BHT/BTB) and <br> Reorder Buffer | Control <br> hazards <br> (branch, func, <br> exception) | Prediction <br> and <br> misprediction <br> recovery |
| Superscalar/VLIW | Multiple <br> issue | Software and <br> Hardware | To increase <br> IPC | By compiler <br> or hardware |

Though mostly invented earlier, these techniques are still widely used today, from embedded CPUs to server/desktop CPUs.

## Multithreading: Hyper-Threading = SMT


[Figure: issue-slot usage over time under multiprocessing and under simultaneous multithreading (SMT). Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, idle slot.]

SMT/HTT performance improvement is NOT significant! https://en.wikipedia.org/wiki/Hyper-threading

## CSCE 513 Class Contents

- Introduction to Computer Architecture (CA)
- Quantitative Analysis, Trend and Performance of CA
- Chapter 1
- Instruction Set Principles and Examples
- Appendix A
- Pipelining and Implementation, RISC-V ISA and Implementation
- Appendix C, RISC-V (riscv.org) and UCB RISC-V impl
- Memory System (Technology, Cache Organization and Optimization, Virtual Memory)
- Appendix B and Chapter 2
- Midterm covered till Memory Tech and Cache Organization
- Instruction Level Parallelism (Dynamic Scheduling, Branch Prediction, Hardware Speculation, Superscalar, VLIW and SMT)
- Chapter 3
- Data Level Parallelism (Vector, SIMD, and GPU)
- Chapter 4
- Thread Level Parallelism
- Chapter 5
- Domain-specific architecture
- Chapter 7


## Topics for Data Level Parallelism (DLP)

- Parallelism (centered around ...)
- Instruction Level Parallelism
- Data Level Parallelism
- Thread Level Parallelism
- DLP Introduction and Vector Architecture
- 4.1, 4.2
- SIMD Instruction Set Extensions for Multimedia
- 4.3
- Graphical Processing Units (GPU)
- 4.4
- GPU and Loop-Level Parallelism and Others
- 4.4, 4.5


## Flynn's Taxonomy for Classifying CA



Michael J. Flynn: http://arith.stanford.edu/~flynn/

## Flynn's Classification (1966)

## Broad classification of parallel computing systems

- based upon the number of concurrent Instruction (or control) streams and Data streams
- SISD: Single Instruction, Single Data
- conventional uniprocessor, a single core
- SIMD: Single Instruction, Multiple Data
- one instruction stream, multiple data paths
- distributed memory SIMD (MPP, DAP, CM-1 & 2, MasPar)
- shared memory SIMD (STARAN, vector computers)
- MIMD: Multiple Instruction, Multiple Data
- message passing machines (Transputers, nCube, CM-5, clusters)
- non-cache-coherent shared memory machines (BBN Butterfly, T3D)
- cache-coherent shared memory machines (Multicore, multiprocessors, Sequent, Sun Starfire, SGI Origin)
- MISD: Multiple Instruction, Single Data
- Not a practical configuration


## SIMD: Single Instruction, Multiple Data (Data Level Parallelism)

- SIMD architectures can exploit significant data-level parallelism for:
- matrix-oriented scientific computing

$$
\begin{aligned}
& \text { for }(i=999 ; i>=0 ; i=i-1) \\
& \quad x[i]=x[i]+s ;
\end{aligned}
$$

- media-oriented image and sound processors
- SIMD is more energy efficient than MIMD
- Only needs to fetch one instruction per data operation processing multiple data elements
- Makes SIMD attractive for personal mobile devices
- SIMD allows programmer to continue to think sequentially



## Hardware Implementation for SIMD/Data-Level Parallelism

- Three variations
- Vector architectures
- SIMD extensions
- Graphics Processor Units (GPUs)
- E.g. x86 processors $\rightarrow$ MIMD + SIMD
- Expect two additional cores per chip per year (MIMD)
- Each core has SIMD, and SIMD width doubles every four years
- Potential speedup from SIMD to be twice that from MIMD!


## Vector Architecture

## VLIW vs Vector

- VLIW takes advantage of instruction level parallelism (ILP) by specifying multiple (different) instructions to execute in parallel

| Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 |
| :---: | :---: | :---: | :---: | :---: | :---: |

- Vector architectures perform the same operation on multiple data elements - single instruction
- Data-level parallelism

Vector Arithmetic Instructions

ADDV v3, v1, v2


## Vector Programming Model



## Control Information

- Vector length (VL) register limits the max number of elements to be processed by a vector instruction
- VL is loaded prior to executing the vector instruction with a special instruction
- Stride for load/stores:
- Vectors may not be adjacent in memory addresses
- E.g., different dimensions of a matrix
- Stride can be specified as part of the load/store



## Basic Structure of Vector Architecture

- RV64V
- 32 vector registers, each holding 32 elements of 8 bytes (64 bits)


Figure 4.1 The basic structure of a vector architecture, RV64V, which includes a RISC-V scalar architecture. There are also 32 vector registers, and all the functional units are vector functional units. The vector and scalar registers have a significant number of read and write ports to allow multiple simultaneous vector operations.

## RV64V Vector Instructions

- Suffix
- V suffix
- VS suffix
- Load/Store
- vld
- vst
- Registers
- V registers
- VL (vector length register)
- Predicate

| Mnemonic | Name | Description |
| :---: | :---: | :---: |
| vadd | ADD | Add elements of V[rs1] and V[rs2], then put each result in V[rd] |
| vsub | SUBtract | Subtract elements of V[rs2] from V[rs1], then put each result in V[rd] |
| vmul | MULtiply | Multiply elements of V[rs1] and V[rs2], then put each result in V[rd] |
| vdiv | DIVide | Divide elements of V[rs1] by V[rs2], then put each result in V[rd] |
| vrem | REMainder | Take remainder of elements of V[rs1] by V[rs2], then put each result in V[rd] |
| vsqrt | SQuare RooT | Take square root of elements of V[rs1], then put each result in V[rd] |
| vsll | Shift Left | Shift elements of V[rs1] left by V[rs2], then put each result in V[rd] |
| vsrl | Shift Right | Shift elements of V[rs1] right by V[rs2], then put each result in V[rd] |
| vsra | Shift Right Arithmetic | Shift elements of V[rs1] right by V[rs2] while extending sign bit, then put each result in V[rd] |
| vxor | XOR | Exclusive OR elements of V[rs1] and V[rs2], then put each result in V[rd] |
| vor | OR | Inclusive OR elements of V[rs1] and V[rs2], then put each result in V[rd] |
| vand | AND | Logical AND elements of V[rs1] and V[rs2], then put each result in V[rd] |
| vsgnj | SiGN source | Replace sign bits of V[rs1] with sign bits of V[rs2], then put each result in V[rd] |
| vsgnjn | Negative SiGN source | Replace sign bits of V[rs1] with complemented sign bits of V[rs2], then put each result in V[rd] |
| vsgnjx | Xor SiGN source | Replace sign bits of V[rs1] with xor of sign bits of V[rs1] and V[rs2], then put each result in V[rd] |
| vld | Load | Load vector register V[rd] from memory starting at address R[rs1] |
| vlds | Strided Load | Load V[rd] from address at R[rs1] with stride in R[rs2] (i.e., R[rs1] + i × R[rs2]) |
| vldx | Indexed Load (Gather) | Load V[rs1] with vector whose elements are at R[rs2] + V[rs2] (i.e., V[rs2] is an index) |
| vst | Store | Store vector register V[rd] into memory starting at address R[rs1] |
| vsts | Strided Store | Store V[rd] into memory at address R[rs1] with stride in R[rs2] (i.e., R[rs1] + i × R[rs2]) |
| vstx | Indexed Store (Scatter) | Store V[rs1] into memory vector whose elements are at R[rs2] + V[rs2] (i.e., V[rs2] is an index) |
| vpeq | Compare = | Compare elements of V[rs1] and V[rs2]. When equal, put a 1 in the corresponding 1-bit element of p[rd]; otherwise, put 0 |
| vpne | Compare != | Compare elements of V[rs1] and V[rs2]. When not equal, put a 1 in the corresponding 1-bit element of p[rd]; otherwise, put 0 |
| vplt | Compare < | Compare elements of V[rs1] and V[rs2]. When less than, put a 1 in the corresponding 1-bit element of p[rd]; otherwise, put 0 |
| vpxor | Predicate XOR | Exclusive OR 1-bit elements of p[rs1] and p[rs2], then put each result in p[rd] |
| vpor | Predicate OR | Inclusive OR 1-bit elements of p[rs1] and p[rs2], then put each result in p[rd] |
| vpand | Predicate AND | Logical AND 1-bit elements of p[rs1] and p[rs2], then put each result in p[rd] |
| setvl | Set Vector Length | Set vl and the destination register to the smaller of mvl and the source register |

## Highlight of RV64V Vector Instructions

- All are R-format instructions
- .vv and .vs|.sv operands
- .vv: Vector-vector operands; .vs|.sv: Vector-scalar operands
- Vector load and store, which load or store an entire vector
- One operand is the vector register to be loaded or stored; the other operand, a GPR, is the starting address of the vector in memory.
- vlds/vsts: strided load/store; vldx/vstx: indexed load/store
- Vector-length register vl is used when the natural vector length is not equal to mvl
- Vector-type register vctype records register types
- Predicate registers $p_i$ are used when loops involve IF statements.
- We'll see them in the following example:


## Dynamic Register Typing in RV64V

| 64b |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 32b |  |  | 32b |  |  |  |  |
| 16b |  | 16b |  | 16b |  | 16b |  |
| 8b | 8b | 8b | 8b | 8b | 8b | 8b | 8b |

- A vector register can be viewed as 32 64-bit elements
- Viewing it as 64 32-bit, 128 16-bit, or even 256 8-bit elements is equally valid.

| Integer | $8,16,32$, and 64 bits | Floating point | 16,32, and 64 bits |
| :--- | :--- | :--- | :--- |

Figure 4.3 Data sizes supported for RV64V assuming it also has the single- and double-precision floating-point extensions RVS and RVD.

- Associate a data type and data size with each vector register using vctype register
- The conventional approach is for each instruction to supply the type information; RV64V does not do this and relies on the per-register type instead
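
The relationship between element width and element count can be checked with a few lines of C. This is only a sketch: it assumes the register size stated on this slide (32 elements of 64 bits, so 2048 bits), and the names `VREG_BITS` and `elements_per_vreg` are made up for the example.

```c
#include <assert.h>

/* Assumption taken from the slide: one vector register holds
 * 32 elements of 64 bits, i.e. 2048 bits of storage. */
enum { VREG_BITS = 32 * 64 };

/* Number of elements visible in one vector register when it is
 * configured for elements of the given width (8, 16, 32, or 64 bits). */
int elements_per_vreg(int element_bits)
{
    assert(element_bits == 8 || element_bits == 16 ||
           element_bits == 32 || element_bits == 64);
    return VREG_BITS / element_bits;
}
```

Narrower element types therefore raise the number of operations one vector instruction can cover without changing the register file itself.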


## DAXPY (Y = a * X + Y, 32 elements) in RV64G and RV64V

double a, X[], Y[]; // 8 bytes per element

$$
\begin{aligned}
& \text { for }(i=0 ; i<32 ; i++) \\
& \quad Y[i]=a * X[i]+Y[i] ;
\end{aligned}
$$

The starting addresses of X and Y are in x5 and x6, respectively.

- Number of instructions:
- 8 (RV64V) vs 258 (RV64G: 2 setup instructions plus 8 loop instructions × 32 iterations)
- Pipeline stalls
- 32x higher in RV64G
- Vector chaining (forwarding)
- Per vector element
- v0 $\rightarrow$ v1 $\rightarrow$ v2 $\rightarrow$ v3

RV64G:

| Label | Instruction | Operands | Comment |
| :---: | :---: | :---: | :---: |
|  | fld | f0,a | # Load scalar a |
|  | addi | x28,x5,#256 | # Last address to load |
| Loop: | fld | f1,0(x5) | # Load X[i] |
|  | fmul.d | f1,f1,f0 | # a × X[i] |
|  | fld | f2,0(x6) | # Load Y[i] |
|  | fadd.d | f2,f2,f1 | # a × X[i] + Y[i] |
|  | fsd | f2,0(x6) | # Store into Y[i] |
|  | addi | x5,x5,#8 | # Increment index to X |
|  | addi | x6,x6,#8 | # Increment index to Y |
|  | bne | x28,x5,Loop | # Check if done |

RV64V:

| Instruction | Operands | Comment |
| :--- | :--- | :--- |
| vsetdcfg | `4*FP64` | # Enable 4 DP FP vregs |
| fld | f0,a | # Load scalar a |
| vld | v0,x5 | # Load vector X |
| vmul | v1,v0,f0 | # Vector-scalar mult |
| vld | v2,x6 | # Load vector Y |
| vadd | v3,v1,v2 | # Vector-vector add |
| vst | v3,x6 | # Store the sum |
| vdisable |  | # Disable vector regs |

## DAXPY (Y = a * X + Y, 32 elements) in RV64G and RV64V

```
for (i=0; i<32; i++)
    Y[i] = a * X[i] + Y[i];
```

double a, X[], Y[]; // 8 bytes per element

The starting addresses of X and Y are in x5 and x6, respectively.

With four 64-bit vector registers:

| Instruction | Operands | Comment |
| :--- | :--- | :--- |
| vsetdcfg | `4*FP64` | # Enable 4 DP FP vregs |
| fld | f0,a | # Load scalar a |
| vld | v0,x5 | # Load vector X |
| vmul | v1,v0,f0 | # Vector-scalar mult |
| vld | v2,x6 | # Load vector Y |
| vadd | v3,v1,v2 | # Vector-vector add |
| vst | v3,x6 | # Store the sum |
| vdisable |  | # Disable vector regs |

With one 32-bit and three 64-bit vector registers:

| Instruction | Operands | Comment |
| :--- | :--- | :--- |
| vsetdcfg | `1*FP32,3*FP64` | # 1 32b, 3 64b vregs |
| flw | f0,a | # Load scalar a |
| vld | v0,x5 | # Load vector X |
| vmul | v1,v0,f0 | # Vector-scalar mult |
| vld | v2,x6 | # Load vector Y |
| vadd | v3,v1,v2 | # Vector-vector add |
| vst | v3,x6 | # Store the sum |
| vdisable |  | # Disable vector regs |

## DAXPY (Y = a * X + Y, 32 elements) in RV64G and RV64V



## Vector Instruction Set Advantages

- Compact
- one short instruction encodes $\mathbf{N}$ operations
- Expressive and predictable, tells hardware that these $\mathbf{N}$ operations:
- are independent
- use the same functional unit
- access disjoint registers
- access registers in same pattern as previous instructions
- access a contiguous block of memory (unit-stride load/store)
- access memory in a known pattern (strided load/store)
- Scalable
- can run same code on more parallel pipelines (lanes)


## Vector Length Register

- Loop count not known at compile time?

$$
\begin{aligned}
& \text { for }(i=0 ; i<n ; i=i+1) \\
& \quad Y[i]=a * X[i]+Y[i] ;
\end{aligned}
$$

- Use Vector Length (VL) and Max VL (MVL) Registers
- Use strip mining for vectors over the maximum length (serialized version before vectorization by compiler)
- Break loops into pieces that fit in registers

```
low = 0;
VL = (n % MVL);                          /*find odd-size piece using modulo op % */
for (j = 0; j <= (n/MVL); j=j+1) {       /*outer loop*/
    for (i = low; i < (low+VL); i=i+1)   /*runs for length VL*/
        Y[i] = a * X[i] + Y[i];          /*main operation*/
    low = low + VL;                      /*start of next vector*/
    VL = MVL;                            /*reset the length to maximum vector length*/
}
```

## Vector Stripmining in RV64V

$$
\begin{aligned}
& \text { for }(i=0 ; i<n ; i=i+1) \\
& \quad Y[i]=a * X[i]+Y[i] ;
\end{aligned}
$$
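
One way to picture strip mining with a set-vector-length instruction is the scalar C emulation below. It is a sketch under assumptions: `MVL` is fixed at 32 for the example, the helper `setvl` only mimics the "smaller of mvl and the source register" behavior listed in the instruction table, and `daxpy_stripmined` is a hypothetical name. Unlike the modulo-based loop, this variant processes full strips first and leaves the odd-size piece for last.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the maximum vector length (mvl). */
#define MVL 32

/* Mimics the setvl description in the instruction table:
 * returns the smaller of mvl and the requested length. */
size_t setvl(size_t remaining)
{
    return remaining < MVL ? remaining : MVL;
}

/* Strip-mined DAXPY: each outer iteration stands for one pass of
 * vector instructions operating on vl elements. */
void daxpy_stripmined(size_t n, double a, const double *X, double *Y)
{
    size_t done = 0;
    assert(X != NULL && Y != NULL);
    while (done < n) {
        size_t vl = setvl(n - done);      /* elements in this strip */
        for (size_t i = 0; i < vl; i++)   /* one "vector" operation */
            Y[done + i] = a * X[done + i] + Y[done + i];
        done += vl;
    }
}
```

For `n = 70` the strips have lengths 32, 32, and 6, so no separate cleanup loop is needed.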

## Using Predicate Register for Vector Mask

$$
\begin{aligned}
& \text { for }(i=0 ; i<64 ; \quad i=i+1) \\
& \quad \text { if }(X[i]!=0) \\
& \quad X[i]=X[i]-Y[i] ;
\end{aligned}
$$

Use predicate register to "disable" elements (1 bit per element):

| Instruction | Operands | Comment |
| :---: | :---: | :---: |
| vsetdcfg | `2*FP64` | # Enable 2 64b FP vector regs |
| vsetpcfgi | 1 | # Enable 1 predicate register |
| vld | v0,x5 | # Load vector X into v0 |
| vld | v1,x6 | # Load vector Y into v1 |
| fmv.d.x | f0,x0 | # Put (FP) zero into f0 |
| vpne | p0,v0,f0 | # Set p0(i) to 1 if v0(i) != f0 |
| vsub | v0,v0,v1 | # Subtract under vector mask |
| vst | v0,x5 | # Store the result in X |
| vdisable |  | # Disable vector registers |
| vpdisable |  | # Disable predicate registers |

- GFLOPS rate decreases!
- Vector operation becomes bubble ("NOP") at elements where mask bit is clear
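
The effect of the mask can be emulated in scalar C as follows. This is an illustrative sketch, not RV64V semantics in full: `masked_subtract`, `VLEN`, and the `bool` array standing in for predicate register p0 are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VLEN 64  /* number of elements, as in the loop above */

/* Emulates vpne followed by a masked vsub on VLEN elements.
 * Returns the number of elements whose mask bit was set. */
size_t masked_subtract(double *X, const double *Y)
{
    bool p0[VLEN];            /* stand-in for predicate register p0 */
    size_t active = 0;

    assert(X != NULL && Y != NULL);
    for (size_t i = 0; i < VLEN; i++)      /* vpne p0,v0,f0 */
        p0[i] = (X[i] != 0.0);

    for (size_t i = 0; i < VLEN; i++) {    /* vsub under mask */
        if (p0[i]) {
            X[i] = X[i] - Y[i];
            active++;
        }
        /* mask bit clear: the slot is spent but X[i] is unchanged */
    }
    return active;
}
```

The returned count divided by `VLEN` gives the fraction of element slots doing useful work, which is why the achieved GFLOPS rate drops when many mask bits are clear.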

## Stride

## DGEMM (Double-Precision Matrix Multiplication)

```
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
```



- Must vectorize multiplication of rows of B with columns of D
- Row-major layout: adjacent elements of a row of B are 1 double (8 bytes) apart, while adjacent elements of a column of D are 100 doubles (800 bytes) apart

- Use non-unit stride
- vlds and vsts: strided load and store
- Bank conflict (stall) occurs when the same bank is hit faster than bank busy time:
- \#banks / GCD(stride, \#banks) < bank busy time
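
A strided stream over b banks returns to the bank it started in every b / gcd(stride, b) accesses, so the stall condition can be sketched in C as below. The function names are hypothetical, and the model assumes one access is issued per cycle and that the bank number is simply the element address modulo the number of banks.

```c
#include <assert.h>
#include <stdbool.h>

/* Greatest common divisor (Euclid). */
unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of accesses (one per cycle assumed) before a strided stream
 * returns to the bank it started in. */
unsigned accesses_between_revisits(unsigned stride, unsigned banks)
{
    assert(banks > 0 && stride > 0);
    return banks / gcd(stride, banks);
}

/* True when the stream comes back to a bank before it is free again. */
bool has_bank_conflict(unsigned stride, unsigned banks, unsigned busy_cycles)
{
    return accesses_between_revisits(stride, banks) < busy_cycles;
}
```

With 8 banks and a busy time of 6 cycles, a stride of 1 revisits a bank only every 8 accesses and does not stall, whereas a stride of 32 hits the same bank on every access.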


## Scatter-Gather

- Sparse matrix:
- Non-zero values are compacted into a smaller value array (A[])
- Indirect array indexing, i.e., use an array to store the indices into the value array (K[])

$$
\begin{aligned}
& \text { for }(i=0 ; i<n ; i=i+1) \\
& \quad A[K[i]]=A[K[i]]+C[M[i]] ;
\end{aligned}
$$
$$
\left(\begin{array}{cccccccc}
1.0 & 0 & 5.0 & 0 & 0 & 0 & 0 & 0 \\
0 & 3.0 & 0 & 0 & 0 & 0 & 11.0 & 0 \\
0 & 0 & 0 & 0 & 9.0 & 0 & 0 & 0 \\
0 & 0 & 6.0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 7.0 & 0 & 0 & 0 & 0 \\
2.0 & 0 & 0 & 0 & 0 & 10.0 & 0 & 0 \\
0 & 0 & 0 & 8.0 & 0 & 0 & 0 & 0 \\
0 & 4.0 & 0 & 0 & 0 & 0 & 0 & 12.0
\end{array}\right)
$$

- Use index vector:

| Instruction | Operands | Comment |
| :--- | :--- | :--- |
| vsetdcfg | `4*FP64` | # 4 64b FP vector registers |
| vld | v0,x7 | # Load K[] |
| vldx | v1,(x5,v0) | # Load A[K[]] |
| vld | v2,x28 | # Load M[] |
| vldx | v3,(x6,v2) | # Load C[M[]] |
| vadd | v1,v1,v3 | # Add them |
| vstx | v1,(x5,v0) | # Store A[K[]] |
| vdisable |  | # Disable vector registers |
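
The indexed sequence can be mirrored step by step in scalar C. The sketch below is only an emulation for one strip of `vl` elements; `sparse_add` and `MAX_VL` are hypothetical names, and the local arrays stand in for vector registers v1 and v3.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar emulation of the indexed sequence above for one strip of
 * vl elements: gather A[K[i]] and C[M[i]], add, scatter back. */
void sparse_add(double *A, const double *C,
                const size_t *K, const size_t *M, size_t vl)
{
    enum { MAX_VL = 64 };          /* hypothetical register length */
    double v1[MAX_VL], v3[MAX_VL];
    assert(vl <= MAX_VL);

    for (size_t i = 0; i < vl; i++) v1[i] = A[K[i]];   /* vldx: gather  */
    for (size_t i = 0; i < vl; i++) v3[i] = C[M[i]];   /* vldx: gather  */
    for (size_t i = 0; i < vl; i++) v1[i] += v3[i];    /* vadd          */
    for (size_t i = 0; i < vl; i++) A[K[i]] = v1[i];   /* vstx: scatter */
}
```

Note that this matches the original scalar loop only when K contains no repeated index within a strip: with a repeated index the scalar loop accumulates both updates, while the gather-add-scatter form keeps only the last store.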

## Memory Operations (vld and vst)

- Load/store operations move groups of data between registers and memory
- Increased mem/instr ratio (intensity)
- Three types of addressing
- Unit stride
» Contiguous block of information in memory
» Fastest: always possible to optimize this
- Non-unit (constant) stride
» Harder to optimize memory system for all possible strides
» Prime number of data banks makes it easier to support different strides at full bandwidth
- Indexed (gather-scatter)
» Vector equivalent of register indirect
» Good for sparse arrays of data
» Increases number of programs that vectorize


## Conclusion

- Vector is an alternative model for exploiting ILP
- If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than Out-of-order machines
- Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations


## History: In 70s-80s, Supercomputer $\equiv$ Vector Machine

- Definition of a supercomputer:
- Fastest machine in world at given task
- A device to turn a compute-bound problem into an I/O bound problem
- Any machine costing \$30M+
- Any machine designed by Seymour Cray (originally)
- CDC6600 (Cray, 1964) regarded as first supercomputer
- A vector machine
- www.cray.com: The Supercomputer Company
- Today's supercomputer
- https://www.top500.org/

The Father of Supercomputing


## Seymour Cray

Electrical engineer

Seymour Roger Cray was an American electrical engineer and supercomputer architect who designed a series of computers that were the fastest in the world for decades, and founded Cray Research which built many of these machines. Wikipedia

Born: September 28, 1925, Chippewa Falls, WI
Died: October 5, 1996, Colorado Springs, CO
Awards: Eckert-Mauchly Award
Parents: Seymour R. Cray, Lillian Cray
Education: University of Minnesota, Chippewa Falls High School
Fields: Applied mathematics, Computer Science, Electrical engineering
https://en.wikipedia.org/wiki/Seymour_Cray

## Supercomputer Applications

- Typical application areas
- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car crash simulation)
- Bioinformatics
- Cryptography
- All involve huge computations on large data sets


## Vector Supercomputers

- Epitome: Cray-1, 1976
- Scalar Unit
- Load/Store Architecture
- Vector Extension
- Vector Registers
- Vector Instructions
- Implementation
- Hardwired Control
- Highly Pipelined Functional Units
- Interleaved Memory System
- No Data Caches
- No Virtual Memory



## Vector Arithmetic Execution

- Use deep pipeline (=> fast clock) to execute element operations
- Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

Six stage multiply pipeline


V3 <- v1 * v2

## Vector Execution: Element Group




## Vector Instruction Execution with Pipelined Functional Units



## Vector Unit Structure (4 Lanes)



## Vector Instruction Parallelism

- Can overlap execution of multiple vector instructions
- example machine has 32 elements per vector register and 8 lanes


Complete 24 operations/cycle while issuing 1 short instruction/cycle

## Vector Chaining

## Vector version of register bypassing

- introduced with Cray-1



## Vector Chaining Advantage

- Without chaining, must wait for last element of result to be written before starting dependent instruction

- With chaining, can start dependent instruction as soon as first result appears
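
The benefit can be estimated with a simple timing model, sketched here in C. The model is an assumption made for illustration: one lane, one element result per cycle once a functional unit is running, and a fixed start-up latency per unit; the function names are hypothetical.

```c
#include <assert.h>

/* Cycles for two dependent vector instructions without chaining:
 * the second starts only after the first has produced all n results. */
unsigned unchained_cycles(unsigned n, unsigned latency1, unsigned latency2)
{
    return (latency1 + n) + (latency2 + n);
}

/* With chaining, the second instruction starts as soon as the first
 * result appears, so the n element times overlap. */
unsigned chained_cycles(unsigned n, unsigned latency1, unsigned latency2)
{
    return latency1 + latency2 + n;
}
```

For 64-element vectors with start-up latencies of 7 and 6 cycles, the model gives 141 cycles unchained and 77 cycles chained, so the saving approaches one full vector length.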



## Class Lecture Ends Here!

## Automatic Code Vectorization

$$
\begin{aligned}
& \text { for }(i=0 ; i<N ; i++) \\
& \quad C[i]=A[i]+B[i] ;
\end{aligned}
$$

[Figure: scalar sequential code (Iter. 1, Iter. 2, ...) compared with vectorized code, where each operation becomes one vector instruction]

Vectorization is a massive compile-time reordering of operation sequencing $\Rightarrow$ requires extensive loop dependence analysis

## Masked Vector Instructions

## Simple Implementation

- execute all $N$ operations, turn off result writeback according to mask



## Density-Time Implementation

- scan mask vector and only execute elements with non-zero masks



## Compress/Expand Operations

- Compress packs non-masked elements from one vector register contiguously at start of destination vector register
- population count of mask vector gives packed vector length
- Expand performs inverse operation


[Figure: compress and expand examples]

Used for density-time conditionals and also for general selection operations
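
A scalar C emulation of the two operations might look like the sketch below. It assumes that "non-masked" means elements whose mask bit is set (enabled elements); the names `vcompress` and `vexpand` are hypothetical and not taken from any ISA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Compress: pack the elements of src whose mask bit is set into the
 * start of dst. Returns the packed length (population count of mask). */
size_t vcompress(double *dst, const double *src, const bool *mask, size_t vl)
{
    size_t packed = 0;
    assert(dst != NULL && src != NULL && mask != NULL);
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[packed++] = src[i];
    return packed;
}

/* Expand: inverse operation. Consecutive elements of src are written
 * to the positions of dst whose mask bit is set; others are untouched. */
void vexpand(double *dst, const double *src, const bool *mask, size_t vl)
{
    size_t next = 0;
    assert(dst != NULL && src != NULL && mask != NULL);
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[i] = src[next++];
}
```

A density-time conditional can then compress the active elements, run the computation on the shorter packed vector, and expand the results back into place.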

## Interleaved Memory Layout

- Great for unit stride:
- Contiguous elements in different DRAMs
- Startup time for vector operation is latency of single read
- What about non-unit stride?
- Above good for strides that are relatively prime to 8
- Bad for: 2, 4

|  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Addr Mod 8 $=0$ | Addr <br> Mod 8 $=1$ | Addr Mod 8 $=2$ | Addr Mod 8 $=3$ | Addr Mod 8 $=4$ | Addr Mod 8 $=5$ | Addr <br> Mod 8 $=6$ | Addr <br> Mod 8 $=7$ |

## Avoiding Bank Conflicts

- Lots of banks
$$
\begin{aligned}
& \text { int } x[256][512] ; \\
& \text { for }(j=0 ; j<512 ; j=j+1) \\
& \quad \text { for }(i=0 ; i<256 ; i=i+1) \\
& \quad \quad x[i][j]=2 * x[i][j] ;
\end{aligned}
$$

- Even with 128 banks, since 512 is a multiple of 128, word accesses conflict
- SW: loop interchange, or declaring the array with a dimension that is not a power of 2 ("array padding")
- HW: Prime number of banks
- bank number = address mod number of banks
- address within bank = address / number of words in bank
- modulo & divide per memory access with a prime number of banks?
- address within bank = address mod number of words in bank
- bank number? easy if $\mathbf{2}^{\mathrm{N}}$ words per bank
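
The two software remedies can be illustrated in C. This is a sketch under assumptions: the function names are hypothetical, and `bank_of` assumes the simplest mapping, bank number = word address mod number of banks.

```c
#include <assert.h>

#define ROWS 256
#define COLS 512

/* Original order from the slide: the inner loop walks down a column,
 * so consecutive accesses are COLS words apart. */
void scale_column_order(int x[ROWS][COLS])
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After loop interchange: the inner loop walks along a row,
 * so consecutive accesses are one word apart (unit stride). */
void scale_row_order(int x[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}

/* Bank touched by word address addr with a simple modulo mapping. */
int bank_of(long addr, int banks)
{
    assert(banks > 0 && addr >= 0);
    return (int)(addr % banks);
}
```

Both loop orders compute the same result; only the access pattern differs. For array padding, `bank_of` shows the effect directly: with a row length of 512 words and 128 banks, every element of a column maps to the same bank, while a padded row length of 513 words moves each successive row to the next bank.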


## Finding Bank Number and Address within a bank

- Problem: Determine the number of banks, $\mathrm{N}_{\mathrm{b}}$ and the number of words in each bank, $N_{w}$, such that:
- given address $x$, it is easy to find the bank where $x$ will be found, $B(x)$, and the address of $x$ within the bank, $A(x)$.
- for any address $x, B(x)$ and $A(x)$ are unique
- the number of bank conflicts is minimized
- Solution: Use the Chinese remainder theorem to determine $B(x)$ and $A(x)$:

$$
B(x)=x \operatorname{MOD} N_{b}, \qquad A(x)=x \operatorname{MOD} N_{w}
$$

where $N_{b}$ and $N_{w}$ are co-prime (no common factors)

- The Chinese remainder theorem shows that the pair $(B(x), A(x))$ is unique for every address $x$ with $0 \le x < N_{b} N_{w}$.
- The condition allows $N_{w}$ to be a power of two (typical) if $N_{b}$ is a prime of the form $2^{m}-1$.
- Simple (fast) circuit to compute ( $\operatorname{MOD} N_{b}$ ) when $N_{b}=2^{m}-1$ :
- Since $2^{k}=2^{k-m}\left(2^{m}-1\right)+2^{k-m} \Rightarrow 2^{k}$ MOD $N_{b}=2^{k-m}$ MOD $N_{b}=\ldots=2^{j}$ with $j<m$
- And, remember that: (A+B) MOD C = [(A MOD C)+(B MOD C)] MOD C
- For every power of 2, compute the single-bit MOD (in advance)
- $B(x)=$ sum of these values MOD $N_{b}$ (low-complexity circuit, adder with $\sim m$ bits)
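
The mapping and the division-free modulo can be sketched in C as follows. The function names are hypothetical, and `mod_mersenne` is one possible software rendering of the adder-based circuit idea: because $2^m$ MOD $(2^m-1) = 1$, the m-bit digits of x can be summed and the carries folded back in.

```c
#include <assert.h>
#include <stdint.h>

/* Bank number and address within the bank, as on the slide. */
uint32_t bank_number(uint32_t x, uint32_t nb)  { return x % nb; }
uint32_t addr_in_bank(uint32_t x, uint32_t nw) { return x % nw; }

/* x MOD (2^m - 1) without division: because 2^m MOD (2^m - 1) = 1,
 * the m-bit digits of x can simply be added, folding carries back in. */
uint32_t mod_mersenne(uint32_t x, unsigned m)
{
    assert(m >= 2 && m < 32);
    uint32_t nb = (UINT32_C(1) << m) - 1;
    while (x > nb)
        x = (x & nb) + (x >> m);   /* sum of m-bit digits */
    return x == nb ? 0 : x;        /* nb itself is congruent to 0 */
}
```

As a small check of uniqueness, with $N_b = 7$ banks and $N_w = 8$ words per bank, the 56 addresses 0 to 55 map to 56 distinct (bank, offset) pairs, and `mod_mersenne(x, 3)` agrees with `x % 7`.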

