## Lecture 08: RISC-V Pipeline Implementation

**CSCE 513 Computer Architecture** 

Department of Computer Science and Engineering Yonghong Yan <u>yanyh@cse.sc.edu</u> <u>https://passlab.github.io/CSCE513</u>

### Acknowledgement

 Slides adapted from Computer Science 152: Computer Architecture and Engineering, Spring 2016 by Dr. George Michelogiannakis from UC Berkeley

## Review

- CPU performance factors
  - Instruction count
    - Determined by ISA and compiler
  - CPI and cycle time
    - Determined by CPU hardware

$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

### Three groups of instructions

- Memory reference: lw, sw
- Arithmetic/logical: add, sub, and, or, slt
- Control transfer: jal, jalr, b\*
- CPI
  - Single-cycle: CPI = 1, and normally a longer cycle
  - 5-stage unpipelined: CPI = 5
  - 5-stage pipelined: CPI = 1
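CPU time is the product of instruction count, CPI, and cycle time. The sketch below plugs the three CPI values from the review into that product. The instruction count and the cycle times are hypothetical numbers chosen only for illustration; they are not from the lecture.

```python
def cpu_time(instructions, cpi, cycle_time_ns):
    """CPU time in ns = instructions x cycles/instruction x time/cycle."""
    return instructions * cpi * cycle_time_ns

# Hypothetical numbers (assumptions): 1000 instructions, a 5 ns clock for
# the single-cycle design, and a 1 ns clock for both 5-stage designs.
N = 1000
single_cycle = cpu_time(N, cpi=1, cycle_time_ns=5.0)
unpipelined = cpu_time(N, cpi=5, cycle_time_ns=1.0)
pipelined = cpu_time(N, cpi=1, cycle_time_ns=1.0)

print(single_cycle, unpipelined, pipelined)
```

Under these assumed numbers the single-cycle and unpipelined designs take the same time, while the pipelined design combines the short cycle with CPI = 1.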

## **Review: Unpipelined Datapath for RISC-V**



## **Review: Hardwired Control Table**

| Opcode                      | ImmSel               | Op2Sel | FuncSel | MemWr | RFWen | WBSel | WASel | PCSel |
|-----------------------------|----------------------|--------|---------|-------|-------|-------|-------|-------|
| ALU                         | *                    | Reg    | Func    | no    | yes   | ALU   | rd    | pc+4  |
| ALUi                        | IType <sub>12</sub>  | Imm    | Op      | no    | yes   | ALU   | rd    | pc+4  |
| LW                          | IType <sub>12</sub>  | Imm    | +       | no    | yes   | Mem   | rd    | pc+4  |
| SW                          | SType <sub>12</sub>  | Imm    | +       | yes   | no    | *     | *     | pc+4  |
| <b>BEQ</b> <sub>true</sub>  | SBType <sub>12</sub> | *      | *       | no    | no    | *     | *     | br    |
| <b>BEQ</b> <sub>false</sub> | SBType <sub>12</sub> | *      | *       | no    | no    | *     | *     | pc+4  |
| J                           | *                    | *      | *       | no    | no    | *     | *     | jabs  |
| JAL                         | *                    | *      | *       | no    | yes   | PC    | X1    | jabs  |
| JALR                        | *                    | *      | *       | no    | yes   | PC    | rd    | rind  |

- Op2Sel = Reg / Imm
- WASel = rd / X1
- WBSel = ALU / Mem / PC
- PCSel = pc+4 / br / rind / jabs

## **An Ideal Pipeline**



- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- The scheduling of an object entering the pipeline is not affected by the objects in other stages

These conditions generally hold for industrial assembly lines. In a laundry pipeline, for example, two loads do not depend on each other. But instructions depend on each other!

## **Technology Assumptions**

- A small amount of very fast memory (caches) backed up by a large, slower memory
- Fast ALU (at least for integers)
- Multiported Register files (slower)

Thus, the following timing assumption is reasonable:

$$t_{IF} \approx t_{ID/RF} \approx t_{EX} \approx t_{MEM} \approx t_{WB}$$

A 5-stage pipeline will be the focus of our detailed design. Some commercial designs have over 30 pipeline stages to do an integer add!

### **5-Stage Pipelined Execution: Resource Usage**

The Whole Pipeline Resources are Used by 5 Instructions in Every Cycle!





- An instruction register (IR) in each stage holds the instruction currently in that stage

### **Connect Controls from Instruction Register**



### **Compared With Control Logic in Unpipelined**

- Unpipelined:
  - Single control logic uses the instruction from IF
- Pipelined:
  - Distributed control logic that uses the instruction held in each stage's IR





### Instructions Interact With Each Other in Pipeline: Dealing with Hazards

- An instruction may need a resource being used by another instruction → structural hazard
  - Solution #1: Stall the newer instruction until the older instruction finishes
  - Solution #2: Add more hardware to the design
    - E.g., separate memory into I-memory and D-memory
  - Our 5-stage pipeline has no structural hazards by design
- An instruction depends on something produced by an earlier one
  - Dependence may be for a data value or for using the same register (not the value) → data hazard
    - Solutions for RAW hazards: #1, interlocking (bubble delay), and #2, forwarding
    - WAR and WAW hazards: not possible in this 5-stage pipeline
  - Dependence may be for the next instruction's address → control hazard (branches, exceptions)
    - Solutions: delay, prediction, etc.

### **Read-After-Write (RAW)** Data Hazards



 $\begin{array}{l} \dots \\ x1 \leftarrow x0 + 10 \\ x4 \leftarrow x1 + 17 \\ \dots \end{array}$ 

When the second instruction reads x1, the GPRs still contain a stale value, because a value can pass between two instructions only through the GPRs (register file).

### To Resolve Data Hazards: #1, Interlocking, i.e. Stall Pipeline by Inserting Bubbles





 $\rightarrow$  pipeline bubble



## **Interlock Control Logic**



Compare the *source registers* of the instruction in the decode stage with the *destination register* of the *uncommitted* instructions.

### **Interlock Control Logic**

ignoring jumps & branches

- we: write enable, 1-bit on/off
- ws: write select, 5-bit register number
- re: read enable, 1-bit on/off
- rs: read select, 5-bit register number

Should we always stall if an rs field matches some rd?

- Not every instruction writes a register => we
- Not every instruction reads a register => re

### **Source & Destination Registers**

| Instruction class | Fields (most significant first)       |
|-------------------|---------------------------------------|
| ALU               | func7, rs2, rs1, func3, rd, opcode    |
| ALUI/LW/JALR      | immediate12, rs1, func3, rd, opcode   |
| SW/Bcond          | imm, rs2, rs1, func3, imm, opcode     |
| JAL               | Jump Offset[19:0], rd, opcode         |

| Instruction | Operation                                           | source(s) | destination |
|-------------|-----------------------------------------------------|-----------|-------------|
| ALU         | rd <= rs1 func10 rs2                                | rs1, rs2  | rd          |
| ALUI        | rd <= rs1 op imm                                    | rs1       | rd          |
| LW          | rd <= M [rs1 + imm]                                 | rs1       | rd          |
| SW          | M [rs1 + imm] <= rs2                                | rs1, rs2  | -           |
| Bcond       | rs1,rs2; *true:* PC <= PC + imm; *false:* PC <= PC + 4 | rs1, rs2  | -           |
| JAL         | x1 <= PC, PC <= PC + imm                            | -         | rd          |
| JALR        | rd <= PC, PC <= rs1 + imm                           | rs1       | rd          |

**C**<sub>dest</sub>

## **Deriving the Stall Signal**

**C**<sub>re</sub>

- re1 = *Case* opcode
  - ALU, ALUi, LW, SW, Bcond, JALR => on
  - JAL => off
- re2 = *Case* opcode
  - ALU, SW, Bcond => on
  - ... => off

Note: the WB term is not needed for interlock control if the register file is written before it is read within a cycle. In that case only hazards against the instructions in EXE and MEM require a stall; when the older instruction of a RAW-dependent pair has reached WB, the dependence is handled through the register file.

$$
\begin{aligned}
C_{stall}:\quad stall ={} & \big((rs1_D == ws_{EX}) \,\&\&\, we_{EX} + (rs1_D == ws_{MEM}) \,\&\&\, we_{MEM} + (rs1_D == ws_{WB}) \,\&\&\, we_{WB}\big) \,\&\&\, re1_D \\
{}+{} & \big((rs2_D == ws_{EX}) \,\&\&\, we_{EX} + (rs2_D == ws_{MEM}) \,\&\&\, we_{MEM} + (rs2_D == ws_{WB}) \,\&\&\, we_{WB}\big) \,\&\&\, re2_D
\end{aligned}
$$
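The stall condition compares each source register that the decode-stage instruction actually reads against the destination of every older instruction that will write a register. A minimal sketch of that check follows; the `StageInfo` container and the function signature are hypothetical names introduced here for illustration, not part of the lecture's datapath.

```python
from dataclasses import dataclass

@dataclass
class StageInfo:
    """Destination info for the instruction in one stage (hypothetical)."""
    we: bool  # write enable: does this instruction write a register?
    ws: int   # write select: destination register number

def stall(rs1, re1, rs2, re2, ex, mem, wb):
    """True if the decode-stage instruction must stall.

    rs1/rs2 are its source register numbers, re1/re2 its read enables;
    ex, mem, wb describe the older instructions in those stages.
    """
    older = (ex, mem, wb)
    hazard1 = re1 and any(s.we and s.ws == rs1 for s in older)
    hazard2 = re2 and any(s.we and s.ws == rs2 for s in older)
    return bool(hazard1 or hazard2)

idle = StageInfo(we=False, ws=0)
# x1 <- x0 + 10 is in EX while x4 <- x1 + 17 is in decode: RAW on x1.
raw_on_x1 = stall(rs1=1, re1=True, rs2=0, re2=False,
                  ex=StageInfo(we=True, ws=1), mem=idle, wb=idle)
# Same register number, but the older instruction does not write (we off).
no_write = stall(rs1=1, re1=True, rs2=0, re2=False,
                 ex=StageInfo(we=False, ws=1), mem=idle, wb=idle)
```

The second call shows why the `we` and `re` qualifiers matter: a matching register number alone is not a hazard.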

### To Resolve Data Hazards: #2, Forwarding (Bypassing)



Each stall introduces a bubble in the pipeline => CPI > 1

A new datapath, i.e., *a bypass*, can get the data from the output of the ALU to its input



#### Review: Hardware Support for Forwarding, and Detecting RAW Hazards with Previous and 2<sup>nd</sup> Previous Instructions



## Adding a Bypass (To Bypass Register Files)



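| Consumer instruction                | Helped by the bypass?   |
|-------------------------------------|-------------------------|
| (I<sub>2</sub>) x4 <= x1 + 17       | Yes                     |
| x4 <= x1 + 17                       | No, Load → EXE-Use      |
| x4 <= x1 + 17                       | No                      |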

## The Bypass Signal: Deriving it from the Stall Signal

$$
\begin{aligned}
stall ={} & \big((rs1_D == ws_E) \,\&\&\, we_E + (rs1_D == ws_M) \,\&\&\, we_M + (rs1_D == ws_W) \,\&\&\, we_W\big) \,\&\&\, re1_D \\
{}+{} & \big((rs2_D == ws_E) \,\&\&\, we_E + (rs2_D == ws_M) \,\&\&\, we_M + (rs2_D == ws_W) \,\&\&\, we_W\big) \,\&\&\, re2_D
\end{aligned}
$$

$$ASrc = (rs1_D == ws_E) \,\&\&\, we_E \,\&\&\, re1_D$$

Is this correct?

No, because only ALU and ALUi instructions can benefit from this bypass.

Split we<sub>E</sub> into two components: we-bypass, we-stall

## **Bypass and Stall Signals**

Split we<sub>E</sub> into two components: we-bypass, we-stall

- we-bypass<sub>E</sub> = *Case* opcode<sub>E</sub>
  - ALU, ALUi => on
  - ... => off
- we-stall<sub>E</sub> = *Case* opcode<sub>E</sub>
  - LW, JAL, JALR => on
  - ... => off

$$ASrc = (rs1_D == ws_E) \,\&\&\, \text{we-bypass}_E \,\&\&\, re1_D$$

$$
\begin{aligned}
stall ={} & \big((rs1_D == ws_E) \,\&\&\, \text{we-stall}_E + (rs1_D == ws_M) \,\&\&\, we_M + (rs1_D == ws_W) \,\&\&\, we_W\big) \,\&\&\, re1_D \\
{}+{} & \big((rs2_D == ws_E) \,\&\&\, we_E + (rs2_D == ws_M) \,\&\&\, we_M + (rs2_D == ws_W) \,\&\&\, we_W\big) \,\&\&\, re2_D
\end{aligned}
$$
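The split of we<sub>E</sub> into we-bypass and we-stall can be sketched as a small function that returns both signals. This is an illustrative model, assuming opcodes are represented as strings and that we<sub>E</sub> is on exactly when one of the two components is on; the function name and argument order are hypothetical. As in the formula, only rs1 has a bypass, so an EX-stage match on rs2 still stalls.

```python
BYPASS_OPS = {"ALU", "ALUi"}        # we-bypass_E is on for these opcodes
STALL_OPS = {"LW", "JAL", "JALR"}   # we-stall_E is on for these opcodes

def bypass_and_stall(rs1_d, re1_d, rs2_d, re2_d,
                     opcode_e, ws_e, we_m, ws_m, we_w, ws_w):
    """Return (ASrc, stall) for the instruction in decode."""
    we_bypass_e = opcode_e in BYPASS_OPS
    we_stall_e = opcode_e in STALL_OPS
    we_e = we_bypass_e or we_stall_e  # assumption: the split covers we_E

    asrc = (rs1_d == ws_e) and we_bypass_e and re1_d
    stall1 = ((rs1_d == ws_e and we_stall_e)
              or (rs1_d == ws_m and we_m)
              or (rs1_d == ws_w and we_w)) and re1_d
    stall2 = ((rs2_d == ws_e and we_e)
              or (rs2_d == ws_m and we_m)
              or (rs2_d == ws_w and we_w)) and re2_d
    return bool(asrc), bool(stall1 or stall2)

# ALU result for x1 in EX, consumer reads x1 as rs1: bypass, no stall.
alu_case = bypass_and_stall(1, True, 0, False, "ALU", 1, False, 0, False, 0)
# Load into x1 in EX, consumer reads x1 as rs1: no bypass, must stall.
load_case = bypass_and_stall(1, True, 0, False, "LW", 1, False, 0, False, 0)
```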

## **Fully Bypassed Datapath**



## **Control Hazards: Branches and Jumps**

• JAL: unconditional jump to PC+immediate

| Bits  | Field      | Width | Meaning      |
|-------|------------|-------|--------------|
| 31    | imm[20]    | 1     | offset[20:1] |
| 30:21 | imm[10:1]  | 10    | offset[20:1] |
| 20    | imm[11]    | 1     | offset[20:1] |
| 19:12 | imm[19:12] | 8     | offset[20:1] |
| 11:7  | rd         | 5     | dest         |
| 6:0   | opcode     | 7     | JAL          |

• JALR: indirect jump to rs1+immediate

| Bits  | Field     | Width | Meaning      |
|-------|-----------|-------|--------------|
| 31:20 | imm[11:0] | 12    | offset[11:0] |
| 19:15 | rs1       | 5     | base         |
| 14:12 | funct3    | 3     | 0            |
| 11:7  | rd        | 5     | dest         |
| 6:0   | opcode    | 7     | JALR         |

• Branch: if (rs1 conds rs2), branch to PC+immediate

| Bits  | Field     | Width | Meaning                 |
|-------|-----------|-------|-------------------------|
| 31    | imm[12]   | 1     | offset[12,10:5]         |
| 30:25 | imm[10:5] | 6     | offset[12,10:5]         |
| 24:20 | rs2       | 5     | src2                    |
| 19:15 | rs1       | 5     | src1                    |
| 14:12 | funct3    | 3     | BEQ/BNE, BLT[U], BGE[U] |
| 11:8  | imm[4:1]  | 4     | offset[11,4:1]          |
| 7     | imm[11]   | 1     | offset[11,4:1]          |
| 6:0   | opcode    | 7     | BRANCH                  |
### Info for Control Transfer

Two pieces of info:

1. Taken or not taken?
2. Target address?



- JAL: unconditional jump to PC+immediate
- JALR: indirect jump to rs1+immediate
- Branch: if (rs1 conds rs2), branch to PC+immediate

| Instruction       | Taken known?       | Target known?      |
|-------------------|--------------------|--------------------|
| JAL               | After Inst. Decode | After Inst. Decode |
| JALR              | After Inst. Decode | After Reg. Fetch   |
| `B<cond>`         | After Execute      | After Inst. Decode |

### **Speculate Next Address is PC+4**



- I<sub>1</sub> 096 ADD
- I<sub>2</sub> 100 J 304
- I<sub>3</sub> 104 ADD (kill)

#### A jump instruction kills (not stalls) the following instruction *How?*

## **Pipelining Jumps**



I<sub>4</sub> 304 ADD

### **Jump Pipeline Diagrams**



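*Resource Usage*

| Stage | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 | ... |
|-------|----|----|----|----|----|----|----|----|-----|
| IF | I<sub>1</sub> | I<sub>2</sub> | I<sub>3</sub> | I<sub>4</sub> | I<sub>5</sub> | | | | |
| ID | | I<sub>1</sub> | I<sub>2</sub> | - | I<sub>4</sub> | I<sub>5</sub> | | | |
| EX | | | I<sub>1</sub> | I<sub>2</sub> | - | I<sub>4</sub> | I<sub>5</sub> | | |
| ME | | | | I<sub>1</sub> | I<sub>2</sub> | - | I<sub>4</sub> | I<sub>5</sub> | |
| WB | | | | | I<sub>1</sub> | I<sub>2</sub> | - | I<sub>4</sub> | I<sub>5</sub> |

$- \Rightarrow$ pipeline bubble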

# **Pipelining Conditional Branches**



- **I**<sub>1</sub> 096 ADD
- I<sub>2</sub> 100 BEQ x1,x2 +200
- I<sub>3</sub> 104 ADD
- I<sub>4</sub> 304 ADD

Branch condition is not known until the execute stage

# **Pipelining Conditional Branches**



- **I**<sub>1</sub> 096 ADD
- I<sub>2</sub> 100 BEQ x1,x2 +200
- I<sub>3</sub> 104 ADD
- I<sub>4</sub> 304 ADD

If the branch is taken:

- Kill the two following instructions
- The instruction at the decode stage is not valid ⇒ stall signal is not valid

### **Branch Pipeline Diagrams**

(resolved in execute stage)



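*Resource Usage*

| Stage | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 | ... |
|-------|----|----|----|----|----|----|----|----|-----|
| IF | I<sub>1</sub> | I<sub>2</sub> | I<sub>3</sub> | I<sub>4</sub> | I<sub>5</sub> | | | | |
| ID | | I<sub>1</sub> | I<sub>2</sub> | I<sub>3</sub> | - | I<sub>5</sub> | | | |
| EX | | | I<sub>1</sub> | I<sub>2</sub> | - | - | I<sub>5</sub> | | |
| ME | | | | I<sub>1</sub> | I<sub>2</sub> | - | - | I<sub>5</sub> | |
| WB | | | | | I<sub>1</sub> | I<sub>2</sub> | - | - | I<sub>5</sub> |

$- \Rightarrow$ pipeline bubble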

### Use Simpler Branches: E.g. Only Compare One Register Against Zero in ID Stage



*Resource Usage*

| Stage | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 | ... |
|-------|----|----|----|----|----|----|----|----|-----|
| IF | I<sub>1</sub> | I<sub>2</sub> | I<sub>3</sub> | I<sub>4</sub> | I<sub>5</sub> | | | | |
| ID | | I<sub>1</sub> | I<sub>2</sub> | - | I<sub>4</sub> | I<sub>5</sub> | | | |
| EX | | | I<sub>1</sub> | I<sub>2</sub> | - | I<sub>4</sub> | I<sub>5</sub> | | |
| ME | | | | I<sub>1</sub> | I<sub>2</sub> | - | I<sub>4</sub> | I<sub>5</sub> | |
| WB | | | | | I<sub>1</sub> | I<sub>2</sub> | - | I<sub>4</sub> | I<sub>5</sub> |

$- \Rightarrow$ pipeline bubble

## **Pipelined MIPS Datapath**



# **Control Hazard Delay Summary**

- JAL: unconditional jump to PC+immediate
  - 1 cycle delay of pipeline
- JALR: indirect jump to rs1+immediate
  - 1 cycle delay
- Branch: if (rs1 conds rs2), branch to PC+immediate
  - 2 cycles delay
  - 1 cycle delay for simpler branch (BEQZ) with pipeline improvement
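The delays above translate into CPI once the frequency of each control transfer is known. The sketch below works through that arithmetic; the instruction-mix fractions are hypothetical values picked for illustration, and the model assumes only jumps and taken branches pay a penalty, since the pipeline speculates that the next address is PC+4.

```python
def effective_cpi(base_cpi, penalties):
    """base_cpi plus the sum of (instruction fraction x penalty cycles)."""
    return base_cpi + sum(frac * cycles for frac, cycles in penalties)

# Hypothetical mix (assumption): 5% jumps, 10% taken branches.
jump_frac = 0.05
taken_branch_frac = 0.10

# Branches resolved in the execute stage: 2-cycle penalty when taken.
cpi_branch_in_ex = effective_cpi(1.0, [(jump_frac, 1), (taken_branch_frac, 2)])
# Simpler branches resolved in the decode stage: 1-cycle penalty.
cpi_branch_in_id = effective_cpi(1.0, [(jump_frac, 1), (taken_branch_frac, 1)])

print(round(cpi_branch_in_ex, 2), round(cpi_branch_in_id, 2))
```

With this assumed mix, resolving branches one stage earlier lowers the CPI from about 1.25 to about 1.15.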

# **Reducing Control Flow Penalty**

- Software solutions
  - Eliminate branches: loop unrolling
    - Increases the run length
  - Reduce resolution time: instruction scheduling
    - Compute the branch condition as early as possible (of limited value because branches are often on the critical path through the code)
- Hardware solutions
  - Find something else to do: delay slots
    - Replaces pipeline bubbles with useful work (requires software cooperation)
  - Speculate: branch prediction
    - Speculative execution of instructions beyond the branch

# Additional Materials – Branch Prediction

# **Branch Prediction**

- Motivation
  - Branch penalties limit performance of deeply pipelined processors
  - Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
- Required hardware support
  - Prediction structures
    - Branch history tables, branch target buffers, etc.
  - Mispredict recovery mechanisms
    - Keep result computation separate from commit
    - Kill instructions following branch in pipeline
    - Restore state to that following branch

# **Static Branch Prediction**

Overall probability a branch is taken is ~60-70% but:



ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110

bne0 (preferred taken) beq0 (not taken)

#### Dynamic Branch Prediction: Learning Based on Past Behavior

- Temporal correlation (time)
  - If I tell you that a certain branch was taken last time, does this help?
  - The way a branch resolves may be a good predictor of the way it will resolve at the next execution
- Spatial correlation (space)
  - Several branches may resolve in a highly correlated manner
  - For instance, a preferred path of execution

# **Dynamic Branch Prediction**

- 1-bit prediction scheme
  - The low-order bits of the branch address index a table of one-bit flags, each recording whether the branch was taken or not taken last time
  - Simple
- 2-bit prediction
  - The prediction changes only after two consecutive misses

# **Branch Prediction Bits**

- Assume 2 BP bits per instruction
- Change the prediction after two consecutive mistakes!



BP state:

(predict take / ¬take) × (last prediction right / wrong)
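One common way to realize two BP bits is a 2-bit saturating counter, sketched below. Note that this is a swapped-in, widely used variant and an assumption here: its state encoding (strongly/weakly taken or not taken) may differ in its transitions from the right/wrong encoding above, and it needs two consecutive mistakes to flip the prediction only when starting from a strong state. The class and helper names are hypothetical.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self, state=3):
        self.state = state  # start strongly taken (arbitrary choice)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def count_mispredictions(predictor, outcomes):
    """Run the predictor over a sequence of branch outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses

# A loop branch: taken three times, then falls through; the loop runs twice.
loop_outcomes = [True, True, True, False] * 2
loop_misses = count_mispredictions(TwoBitPredictor(), loop_outcomes)
```

In this trace the single not-taken outcome at each loop exit costs one misprediction but does not flip the prediction for the next loop entry.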

## **Branch History Table**



4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
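A branch history table of this shape can be sketched as an array of 2-bit counters indexed by low-order PC bits. The sketch assumes 4-byte instructions (so the two lowest PC bits are dropped), saturating counters, and a weakly-not-taken initial state; none of these details are specified in the slide, and the class name is hypothetical.

```python
class BranchHistoryTable:
    """Untagged table of 2-bit saturating counters indexed by PC bits."""

    def __init__(self, entries=4096):
        self.entries = entries
        self.counters = [1] * entries  # assumption: start weakly not taken

    def index(self, pc):
        # Drop the 2 byte-offset bits, keep the low-order index bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)

bht = BranchHistoryTable()
storage_bits = bht.entries * 2   # 4K entries x 2 bits = 8192 bits (1 KiB)
bht.update(0x100, True)
bht.update(0x100, True)
# 0x4100 maps to the same entry as 0x100: there are no tags, so two
# branches can share (and disturb) one counter.
aliased = bht.index(0x4100) == bht.index(0x100)
```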

#### **Exploiting Spatial Correlation**

Yeh and Patt, 1992

If first condition false, second condition also false

History register, H, records the direction of the last N branches executed by the processor

## **Two-Level Branch Predictor**

Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct)
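The idea of letting recent branch outcomes select among several sets of BHT bits can be sketched as follows. This is a generic global-history model under stated assumptions (a 2-bit history register, 2-bit saturating counters, 1024 entries per set), not a description of the Pentium Pro's actual design; all names are hypothetical.

```python
class TwoLevelPredictor:
    """Global history register H selects one of 2**N counter tables."""

    def __init__(self, entries=1024, history_bits=2):
        self.entries = entries
        self.history_bits = history_bits
        self.history = 0  # direction of the last N branches, newest in bit 0
        self.tables = [[1] * entries for _ in range(1 << history_bits)]

    def _slot(self, pc):
        return self.tables[self.history], (pc >> 2) % self.entries

    def predict(self, pc):
        table, i = self._slot(pc)
        return table[i] >= 2

    def update(self, pc, taken):
        table, i = self._slot(pc)
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

def count_misses(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses

# A branch that strictly alternates taken / not taken.
alternating = [True, False] * 10
two_level_misses = count_misses(TwoLevelPredictor(), 0x100, alternating)
```

After a short warm-up the history value identifies which half of the alternating pattern comes next, so each history-selected counter sees a consistent outcome.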



# **Speculating Both Directions**

- An alternative to branch prediction is to execute both directions of a branch speculatively
  - resource requirement is proportional to the number of concurrent speculative executions
  - only half the resources engage in useful work when both directions of a branch are executed speculatively
  - branch prediction takes less resources than speculative execution of both paths
- With accurate branch prediction, it is more cost effective to dedicate all resources to the predicted direction!
  - What would you choose with 80% accuracy?

# Are We Missing Something?

• Knowing whether a branch is taken or not is great, but what else do we need to know about it?

#### **Branch target address**

# **Branch Target Buffer**



BP bits are stored with the predicted target address.

- IF stage: *if (BP = taken) then nPC = target else nPC = PC+4*
- Later: *check the prediction; if wrong, kill the instruction and update the BTB and BP bits (BPb), else update BPb*

# **Address Collisions (MisPrediction)**



Is this a common occurrence?

#### **BTB is only for Control Instructions**

- Is even branch prediction fast enough to avoid bubbles?
- When do we index the BTB?
  - i.e., which stage must the branch be in when we index the BTB, in order to avoid bubbles?
- BTB contains useful information for branch and jump instructions only
   => Do not update it for other instructions
- For all other instructions the next PC is PC+4 !
- *How to achieve this effect without decoding the instruction?*

# **Branch Target Buffer (BTB)**



- Keep both the branch PC and target PC in the BTB
- PC+4 is fetched if match fails
- Only *taken* branches and jumps held in BTB
- Next PC determined *before* branch fetched and decoded
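The four properties above can be sketched as a small direct-mapped table that stores the branch PC as a tag next to the target. The table size, the indexing by low-order PC bits, and the policy of dropping an entry when its branch resolves not taken are assumptions made for this illustration; the class name is hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry holds (branch_pc, target_pc) or None."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def next_pc(self, pc):
        """Next fetch address, computed from the PC alone (no decode)."""
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc:  # tag match on branch PC
            return entry[1]
        return pc + 4                             # match fails: fetch PC+4

    def update(self, pc, taken, target):
        """Called once a branch or jump resolves; holds only taken ones."""
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None

btb = BranchTargetBuffer()
before = btb.next_pc(100)      # unknown branch: falls back to PC+4
btb.update(100, True, 304)     # branch at 100 resolved taken to 304
after = btb.next_pc(100)
other = btb.next_pc(356)       # same index as 100, but the tag differs
```

Storing the branch PC as a tag is what keeps a non-branch instruction that happens to share an index from being redirected.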

# **Combining BTB and BHT**

- BTB entries are considerably more expensive than BHT, but can redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR)
- BHT can hold many more entries and is more accurate



BTB/BHT only updated after branch resolves in E stage

# **Uses of Jump Register (JR)**

- Switch statements (jump to address of matching case)
  BTB works well if same case used repeatedly
- Dynamic function call (jump to run-time function address)
  BTB works well if same function usually called, (e.g., in C++ programming, when objects have same type in virtual function call)
- Subroutine returns (jump to return address)
  BTB works well if usually return to the same place
  ⇒ Often one function called from many distinct call sites!

How well does BTB work for each of these cases?

### **Subroutine Return Stack**

Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs.
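A return stack predicts the target of a subroutine return by pushing the return address at each call and popping it at each return. The sketch below assumes 4-byte instructions, a fixed depth of 8, and that the oldest entry is discarded on overflow; these parameters and the names are hypothetical.

```python
class ReturnAddressStack:
    """Small stack of predicted return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: discard the oldest entry
        self.stack.append(call_pc + 4)   # return to the instruction after the call

    def predict_return(self):
        """Predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x100)                 # outer call
ras.on_call(0x200)                 # nested call
inner = ras.predict_return()       # return from the nested call
outer = ras.predict_return()       # return from the outer call
empty = ras.predict_return()
```

Because the prediction follows call nesting instead of the address of the return instruction, one function called from many distinct call sites is still predicted correctly, as long as the nesting depth stays within the stack size.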



