### Chapter 2: Instructions: Language of the Computer 2.13 - 2.14: C sort example and array vs pointer 2.16 – 2.17: X86 and more info for RISC-V 2.20: Conclusion

ITSC 3181 Introduction to Computer Architecture https://passlab.github.io/ITSC3181

Department of Computer Science

Yonghong Yan

yyan7@uncc.edu

https://passlab.github.io/yanyh/

### **Chapter 2: Instructions: Language of the Computer**

- Lecture
  - 2.1 Introduction
  - 2.2 Operations of the Computer Hardware
  - 2.3 Operands of the Computer Hardware

### • Lecture

- 2.4 Signed and Unsigned Numbers
- 2.5 Representing Instructions in the Computer
- Lecture
  - 2.6 Logical Operations
  - 2.7 Instructions for Making Decisions

- Lecture
  - 2.8 Supporting Procedures in Computer Hardware
  - 2.9 Communicating with People
  - 2.10 RISC-V Addressing for Wide Immediate and Addresses

### Lecture

- 2.11 Parallelism and Instructions: Synchronization
- 2.12 Translating and Starting a Program
  - We covered before along with C Basics
- 2.13 A C Sort Example to Put It All Together
- 2.14 Arrays versus Pointers
  - We covered most before along with C Basics
- 2.15 Advanced Material: Compiling C and Interpreting Java
- 2.16 Real Stuff: MIPS Instructions
- 2.17 Real Stuff: x86 Instructions
- 2.18 Real Stuff: The rest of RISC-V
- 2.19 Fallacies and Pitfalls
- 2.20 Concluding Remarks
- 2.21 Historical Perspective and Further Reading

3

### The Sort Procedure in C (Textbook Page 133)

- Illustrates use of assembly instructions for a C bubble sort function
- no more swaps required Non-leaf (calls swap) void sort (long long int v[], size\_t n) { size\_t i, j; for (i = 0; i < n; i += 1) { (j = i - 1;)for >= 0 & v[j] > v[j + 1];-= 1) { swap(v,j); } – v in x10, n in x11, i in x19, j in x20



## C Sort Example



```
Swap procedure (leaf)
void swap(long long int v[]
long long int jong int k)
{
long long int temp;
temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;
}
- vin x10, k in x11, temp in x5
```

| <pre>void sort (long long int v[], size_t n</pre>              | ) |
|----------------------------------------------------------------|---|
| <sup>1</sup> <u>size_t</u> i, j;                               |   |
| <pre>for (i = 0; i &lt; n; i += 1) {     for (j = i - 1;</pre> |   |
| $j \ge 0 \& v[j] > v[j + 1];$                                  |   |
| $j = 1$ {<br>swap(v, j);                                       |   |
| Swap(v,j),                                                     |   |
| <b>3</b>                                                       |   |





### **The Inner Loop**

- Skeleton of inner loop:
  - for (j = i 1; j >= 0 && v[j] > v[j + 1]; j = 1) { swap (v, j); }

```
addi x20,x19,-1 // j = i -1
for2tst:
   blt x20,x0,exit2 // go to exit2 if x20 < 0 (j < 0)
   slli x5,x20,3 // reg x5 = j * 8
   add x5,x10,x5 // reg x5 = v + (j * 8)
   1d x_{6,0}(x_{5}) // reg x_{6} = v[j]
   1d x7,8(x5) // reg x7 = v[j + 1]
   ble x6,x7,exit2 // go to exit2 if x6 \leq x7
   mv x10, x21 // first swap parameter is v
      x11, x20 // second swap parameter is j
   mν
   jal x1, swap // call swap
   addi x20,x20,-1 // j -= 1
   i
        for2tst // branch to test of inner loop
exit2:
```

### **Preserving Registers**

### Preserve saved registers:

addi sp,sp,-40 // make room on stack for 5 regs
sd x1,32(sp) // save x1 on stack
sd x22,24(sp) // save x22 on stack
sd x21,16(sp) // save x21 on stack
sd x20,8(sp) // save x20 on stack
sd x19,0(sp) // save x19 on stack

### • Restore saved registers:

exit1:

| ld   | x19,0(sp)  | // | restore | x19 from stack |
|------|------------|----|---------|----------------|
| ٦d   | x20,8(sp)  | // | restore | x20 from stack |
| ٦d   | x21,16(sp) | // | restore | x21 from stack |
| ٦d   | x22,24(sp) | // | restore | x22 from stack |
| ٦d   | x1,32(sp)  | // | restore | x1 from stack  |
| addi | sp,sp, 40  | // | restore | stack pointer  |
| jalr | x0,0(x1)   |    |         |                |

### **The Full Version**

Check the textbook

| Saving registers |                                                                                     |                                                                                                                                            |  |
|------------------|-------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--|
| sort:            | addi sp. sp40<br>sd x1. 32(sp)<br>sd x22. 24(sp)<br>sd x21. 16(sp)<br>sd x21. 6(sp) | <pre># make room on stack for 5 registers # save return address on stack # save x22 on stack # save x21 on stack # save x21 on stack</pre> |  |
|                  | sd x20. 8(sp)<br>sd x19. 0(sp)                                                      | # save x20 on stack<br># save x19 on stack                                                                                                 |  |

|                             | Proc                                                                                                                                       | edure body                                                                                       |
|-----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| Move parameters             | mv x21. x10<br>mv x22. x11                                                                                                                 | <pre># copy parameter x10 into x21 # copy parameter x11 into x22</pre>                           |
| Outer loop                  | li x19, 0<br>forltst:bge x19, x22, exit1                                                                                                   | <pre># i = 0 # go to exitl if i &gt;= n</pre>                                                    |
| Inner loop                  | addi x20, x19, -1<br>for2tst:blt x20, x0, exit2<br>slli x5, x20, 3<br>add x5, x21, x5<br>ld x6, 0(x5)<br>ld x7, 8(x5)<br>ble x6, x7, exit2 | <pre># go to exit2 if j &lt; 0 # x5 = j * 8 # x5 = v + (j * 8) # x6 = v[j] # x7 = v[j + 1]</pre> |
| Pass parameters<br>and call | mv x10. x21<br>mv x11. x20<br>jal x1. swap                                                                                                 | <pre># first swap parameter is v # second swap parameter is j # call swap</pre>                  |
| Inner loop                  | addi x20. x20. −1<br>j for2tst                                                                                                             | j for2tst<br>∦ go to for2tst                                                                     |
| Outer loop                  | exit2: addi x19. x19. 1<br>j forltst                                                                                                       | # i += 1<br># go to forltst                                                                      |

| Restoring registers |                 |                                                |  |
|---------------------|-----------------|------------------------------------------------|--|
| exit1:              | 1d x19, 0(sp)   | ∦ restore x19 from stack                       |  |
|                     | 1d x20, 8(sp)   | <pre># restore x20 from stack</pre>            |  |
|                     | 1d x21, 16(sp)  | # restore x21 from stack                       |  |
|                     | 1d x22, 24(sp)  | # restore x22 from stack                       |  |
|                     | ld x1, 32(sp)   | <pre># restore return address from stack</pre> |  |
|                     | addi sp. sp. 40 | # restore stack pointer                        |  |

| Procedure return |                |                             |  |  |
|------------------|----------------|-----------------------------|--|--|
|                  | jalr x0, 0(x1) | # return to calling routine |  |  |

### **Arrays vs. Pointers**

- Array indexing involves
  - Multiplying index by element size
  - Adding to array base address
- Pointers correspond directly to memory addresses
  - Can avoid indexing complexity

### **Example: Clearing an Array**

| <pre>clear1(int array[], int size) {     int i;     for (i = 0; i &lt; size; i += 1)         array[i] = 0; }</pre> | <pre>clear2(int *array, int size) {     int *p;     for (p = &amp;array[0]; p &lt; &amp;array[size];         p = p + 1)         *p = 0; }</pre> |
|--------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
| <pre>li x5,0 // i = 0 loop1:     slli x6,x5,2 // x6 = i * 4     add x7,x10,x6 // x7 = address</pre>                | <pre>mv x5,x10  // p = address</pre>                                                                                                            |

### **Comparison of Array vs. Ptr**

- Multiply "strength reduced" to shift
- Array version requires shift to be inside loop
  - Part of index calculation for incremented i
  - c.f. incrementing pointer
- Compiler can achieve same effect as manual use of pointers
  - Induction variable elimination
  - Better to make program clearer and safer

### The Intel x86 ISA

- Evolution with backward compatibility
  - 8080 (1974): 8-bit microprocessor
    - Accumulator, plus 3 index-register pairs
  - 8086 (1978): 16-bit extension to 8080
    - Complex instruction set (CISC)
  - 8087 (1980): floating-point coprocessor
    - Adds FP instructions and register stack
  - 80286 (1982): 24-bit addresses, MMU
    - Segmented memory mapping and protection
  - 80386 (1985): 32-bit extension (now IA-32)
    - Additional addressing modes and operations
    - Paged memory mapping as well as segments

## The Intel x86 ISA

- Further evolution...
  - i486 (1989): pipelined, on-chip caches and FPU
    - Compatible competitors: AMD, Cyrix, ...
  - Pentium (1993): superscalar, 64-bit datapath
    - Later versions added MMX (Multi-Media eXtension) instructions
    - The infamous FDIV bug
  - Pentium Pro (1995), Pentium II (1997)
    - New microarchitecture (see Colwell, The Pentium Chronicles)
  - Pentium III (1999)
    - Added SSE (Streaming SIMD Extensions) and associated registers
  - Pentium 4 (2001)
    - New microarchitecture
    - Added SSE2 instructions

## The Intel x86 ISA

- And further...
  - AMD64 (2003): extended architecture to 64 bits
  - EM64T Extended Memory 64 Technology (2004)
    - AMD64 adopted by Intel (with refinements)
    - Added SSE3 instructions
  - Intel Core (2006)
    - Added SSE4 instructions, virtual machine support
  - AMD64 (announced 2007): SSE5 instructions
    - Intel declined to follow, instead...
  - Advanced Vector Extension (announced 2008)
    - Longer SSE registers, more instructions
- If Intel didn't extend with compatibility, its competitors would!
  - Technical elegance ≠ market success

### **Basic x86 Registers**



### **Basic x86 Addressing Modes**

Two operands per instruction

| Source/dest operand | Second source operand |
|---------------------|-----------------------|
| Register            | Register              |
| Register            | Immediate             |
| Register            | Memory                |
| Memory              | Register              |
| Memory              | Immediate             |

- Memory addressing modes
  - Address in register
  - Address = R<sub>base</sub> + displacement
  - Address =  $R_{base}$  +  $2^{scale} \times R_{index}$  (scale = 0, 1, 2, or 3)
  - Address = R<sub>base</sub> + 2<sup>scale</sup> × R<sub>index</sub> + displacement

## x86 Instruction Encoding

#### a. JE EIP + displacement

| 4  | 4              | 8            |
|----|----------------|--------------|
| JE | Condi-<br>tion | Displacement |

#### b. CALL

| 8    | 32     |
|------|--------|
| CALL | Offset |

#### c. MOV EBX, [EDI + 45]

| 6   | 1 | 1 | 8               | 8            |
|-----|---|---|-----------------|--------------|
| MOV | d | w | r/m<br>Postbyte | Displacement |

#### d. PUSH ESI

| 5    | 3   |
|------|-----|
| PUSH | Reg |

#### e. ADD EAX, #6765

| 4   | 3   | 1 | 32        |
|-----|-----|---|-----------|
| ADD | Reg | w | Immediate |

#### f. TEST EDX, #42

| <br>7 | 1 | 8        | 32        |
|-------|---|----------|-----------|
| TEST  | w | Postbyte | Immediate |

- Variable length encoding
  - Postfix bytes specify addressing mode
  - Prefix bytes modify operation
    - Operand length, repetition, locking, ...

## **Implementing IA-32**

- Complex instruction set makes implementation difficult
  - Hardware translates instructions to simpler microoperations
    - Simple instructions: 1–1
    - Complex instructions: 1–many
  - Microengine similar to RISC
  - Market share makes this economically viable
- Comparable performance to RISC
  - Compilers avoid complex instructions

## **More Materials for RISC-V Instruction**

- Slides for RISC-V intro and specification:
  - <u>https://passlab.github.io/ITSC3181/notes/lectureXX\_RISCV\_ISA.</u>
     <u>pdf</u>
- RISC-V instruction reference cards:
  - https://passlab.github.io/ITSC3181/resources/RISCVGreenCardv 8-20151013.pdf
- Information for learning assembly programming
  - <u>https://passlab.github.io/ITSC3181/resources/RISC-</u>
     <u>VAssemblyProgramming.html</u>
- Resources from the official website including the standard
  - https://riscv.org/

## **Concluding Remarks**

- Instruction Set Architecture are Hardware and Software Interface
- Three major classes of instructions
  - Arithmetic and logic instructions
  - Load/Store instructions
  - Control transfer (branch and jump/link)
  - Other helpful instruction, e.g. load immediate, etc.
- High-level language constructs to instruction sequence
  - Arithmetic and logic expression => Arithmetic and logic instructions
  - Array reference => address calculation and load/store
  - If-else/switch-case, for/while-loop => branch and jump
  - Function call => jump/link, store and restore registers
- Design principles
  - 1. Simplicity favors regularity
  - 2. Smaller is faster
  - 3. Good design demands good compromises
  - 4. Make the common case fast

### **Contents Not Covered.**

- struct person { int age; int height;} sam;
- Class:

# **Effect of Compiler Optimization**





### **Effect of Language and Algorithm**



### **Lessons Learnt**

- Instruction count and CPI are not good performance indicators in isolation
- Compiler optimizations are sensitive to the algorithm
- Java/JIT compiled code is significantly faster than JVM interpreted
  - Comparable to optimized C in some cases
- Nothing can fix a dumb algorithm!

### **MIPS Instructions**

- MIPS: commercial predecessor to RISC-V
- Similar basic set of instructions
  - 32-bit instructions
  - 32 general purpose registers, register 0 is always 0
  - 32 floating-point registers
  - Memory accessed only by load/store instructions
    - Consistent use of addressing modes for all data sizes
- Different conditional branches
  - For <, <=, >, >=
  - RISC-V: blt, bge, bltu, bgeu
  - MIPS: slt, sltu (set less than, result is 0 or 1)
    - Then use beq, bne to complete the branch

### **Instruction Encoding**

| Register-re | gister       |                     |                 |            |    |             |                  |            |           |   |
|-------------|--------------|---------------------|-----------------|------------|----|-------------|------------------|------------|-----------|---|
|             | 31           | 25 24               | 20 1            | 19         | 15 | 14 12 11    | 7                | 6          |           | 0 |
| RISC-V      | funct7(7)    | rs2(5)              |                 | rs1(5)     |    | funct3(3)   | rd(5)            |            | opcode(7) |   |
|             | 31 26        | 6 25 2 <sup>-</sup> | 1 20            | 16         | 15 | 11          | 10               | 6          | 5         | 0 |
| MIPS        | Op(6)        | Rs1(5)              |                 | Rs2(5)     |    | Rd(5)       | Const(5)         |            | Opx(6)    |   |
|             |              |                     |                 |            |    |             |                  |            |           |   |
| Load        |              |                     |                 |            |    |             |                  |            |           |   |
|             | 31           |                     | 20 1            |            | 15 | 14 12 11    |                  | 6          |           | 0 |
| RISC-V      |              | diate(12)           |                 | rs1(5)     |    | funct3(3)   | rd(5)            |            | opcode(7) |   |
|             |              |                     | 1 20            |            | 15 |             |                  |            |           | 0 |
| MIPS        | Op(6)        | Rs1(5)              |                 | Rs2(5)     |    | Const(16)   |                  |            |           |   |
|             |              |                     |                 |            |    |             |                  |            |           |   |
| Store       | <b>A</b> /   | <u></u>             | ~~              | 10         |    |             | _                | _          |           | - |
|             | 31           | 25 24               | 20 1            |            | 15 | 14 12 11    |                  | 6          |           | 0 |
| RISC-V      | immediate(7) | rs2(5)              | 1 00            | rs1(5)     | 45 | funct3(3) i | mmediate(5)      |            | opcode(7) |   |
|             |              |                     | 1 20            |            | 15 |             | O a re at/ 10    | • • •      |           | 0 |
| MIPS        | Op(6)        | Rs1(5)              |                 | Rs2(5)     |    |             | Const(16         | )          |           |   |
| Branch      |              |                     |                 |            |    |             |                  |            |           |   |
| Branch      | 31           | 25 24               | 20 <sup>2</sup> | 10         | 15 | 14 12 11    | 7                | 6          |           | 0 |
| RISC-V      | immediate(7) | rs2(5)              | 20              | rs1(5)     | 13 |             | /<br>mmediate(5) | 0          | opcode(7) | 0 |
| 1100-0      |              |                     | 1 20            |            | 15 |             |                  |            |           | 0 |
| MIPS        | Op(6)        | Rs1(5)              |                 | Dpx/Rs2(5) | 13 |             | Const(16         | ;)         |           |   |
|             |              |                     |                 | PN/102(0)  |    |             |                  | <i>'</i> ) |           |   |

### **Other RISC-V Instructions**

- Base integer instructions (RV64I)
  - Those previously described, plus
  - auipc rd, immed // rd = (imm<<12) + pc</p>
    - follow by jalr (adds 12-bit immed) for long jump
  - slt, sltu, slti, sltui: set less than (like MIPS)
  - addw, subw, addiw: 32-bit add/sub
  - sllw, srlw, srlw, slliw, srliw, sraiw: 32-bit shift
- 32-bit variant: RV32I
  - registers are 32-bits wide, 32-bit operations

### **Instruction Set Extensions**

- M: integer multiply, divide, remainder
- A: atomic memory operations
- F: single-precision floating point
- D: double-precision floating point
- C: compressed instructions
  - 16-bit encoding for frequently used instructions

## Fallacies

- Powerful instruction  $\Rightarrow$  higher performance
  - Fewer instructions required
  - But complex instructions are hard to implement
    - May slow down all instructions, including simple ones
  - Compilers are good at making fast code from simple instructions
- Use assembly code for high performance
  - But modern compilers are better at dealing with modern processors
  - More lines of code  $\Rightarrow$  more errors and less productivity

### **Fallacies**

- Backward compatibility ⇒ instruction set doesn't change
  - But they do accrete more instructions



## Pitfalls

- Sequential words are not at sequential addresses
  - Increment by 4, not by 1!
- Keeping a pointer to an automatic variable after procedure returns
  - e.g., passing pointer back via an argument
  - Pointer becomes invalid when stack popped