# Lecture 02: Technology Trends and Quantitative Design and Analysis for Performance 

## CSCE 513 Computer Architecture

Department of Computer Science and Engineering
Yonghong Yan
yanyh@cse.sc.edu
http://cse.sc.edu/~yanyh

## Contents

- Computer components
- Computer architectures and great ideas in computer architectures
- Trends and Performance


## Computer Architecture

## Genuine Computer Architecture: Designing the Organization and Hardware to Meet Goals and Functional Requirements

- Covers three aspects of computer design
  - Instruction set architecture
    - Software and hardware interfaces
  - Organization or microarchitecture
    - CPU, memory, cache architecture
  - Hardware
    - Computer systems, e.g. I/O devices


## Levels of Program Code

- High-level language
  - Level of abstraction closer to problem domain
  - Provides for productivity and portability
- Assembly language
  - Textual representation of instructions
- Hardware representation
  - Binary digits (bits)
  - Encoded instructions and data



## Below Your Program

- Application software
  - Written in high-level language (HLL)
- System software
  - Compiler: translates HLL code to machine code
  - Operating system: service code
    - Handling input/output
    - Managing memory and storage
    - Scheduling tasks and sharing resources
- Hardware
  - Processor, memory, I/O controllers



## Understanding Performance

- Algorithm
  - Determines number of operations executed
- Programming language, compiler, architecture
  - Determine number of machine instructions executed per operation
- Processor and memory system
  - Determine how fast instructions are executed
- I/O system (including OS)
  - Determines how fast I/O operations are executed
- Architecture vs. technology


## Trends in Technology

- Integrated circuit technology (Moore's Law)
  - Transistor density: 35\%/year
  - Die size: 10-20\%/year
  - Integration overall: 40-55\%/year
- DRAM capacity: 25-40\%/year (slowing)
  - 8 Gb (2014), 16 Gb (2019), possibly no 32 Gb
- Flash capacity: 50-60\%/year
  - 8-10X cheaper/bit than DRAM
- Magnetic disk capacity: recently slowed to 5\%/year
  - Density increases may no longer be possible, maybe increase from 7 to 9 platters
  - 8-10X cheaper/bit than Flash
  - 200-300X cheaper/bit than DRAM


## Bandwidth and Latency

- Bandwidth or throughput
  - Total work done in a given time
  - 10,000-25,000X improvement for processors
  - 300-1200X improvement for memory and disks
- Latency or response time
  - Time between start and completion of an event
  - 30-80X improvement for processors
  - 6-8X improvement for memory and disks


## Measuring Performance

- Typical performance metrics:
  - Response time
  - Throughput
- Speedup of $X$ relative to $Y$: $\text{Execution time}_{Y} / \text{Execution time}_{X}$
  - Example: a program takes 10 s on X and 15 s on Y
  - Speedup: $15 \mathrm{~s} / 10 \mathrm{~s}=1.5$, so X is 1.5 times as fast as Y
- Execution time
  - Wall clock time: includes all system overheads (I/O, swapping, etc.)
  - CPU time: only computation time
- Benchmarks
  - Kernels (e.g. matrix multiply)
  - Toy programs (e.g. sorting)
  - Synthetic benchmarks (e.g. Dhrystone)
  - Benchmark suites (e.g. SPEC06fp, TPC-C)


## Measuring Execution Time 1/2

- Elapsed time
  - Total response time, including all aspects
    - Processing, I/O, OS overhead, idle time
  - Determines system performance

```
elapsed = read_timer();              /* timestamp before the call */
REAL result = sum(N, X, a);          /* the computation being timed */
elapsed = (read_timer() - elapsed);  /* elapsed time of the call */
```

https://passlab.github.io/CSCE513/exercises/sum/sum_full.c

- CPU time


## Measuring Execution Time 2/2

- Elapsed time
- CPU time
  - Time spent processing a given job
    - Discounts I/O time, other jobs' shares
  - Comprises user CPU time and system CPU time
  - Different programs are affected differently by CPU and system performance
- `time` command in Linux

`yanyh@vm:~$ time ./matmul 512`

Matrix Multiplication: $A[M][K] \times B[K][N]=C[M][N]$, $M=K=N=512$

| Performance: | Runtime (ms) | MFLOPS |
| :---: | :---: | :---: |
| matmul_base: | 628.999949 | 426.765466 |
| matmul_openmp: | 776.000023 | 345.921969 |

```
real 0m1.419s
user 0m1.408s
sys 0m0.008s
```
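The distinction between elapsed time and CPU time can be sketched in standard C. This is a minimal sketch under stated assumptions, not the course's `read_timer()` from `sum_full.c`: the helper names `wall_seconds`, `cpu_seconds`, `busy_sum`, and `time_workload` are hypothetical, and the code assumes a C11 library that provides `timespec_get`.

```c
#include <time.h>

/* Wall-clock seconds from the C11 calendar clock (hypothetical helper). */
double wall_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Processor time used by this process so far, in seconds. */
double cpu_seconds(void) {
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Hypothetical workload: sum of 0..n-1 accumulated in floating point. */
double busy_sum(long n) {
    volatile double s = 0.0;
    for (long i = 0; i < n; i++) {
        s += (double)i;
    }
    return s;
}

/* Runs the workload once; reports elapsed and CPU time via out-parameters. */
double time_workload(long n, double *elapsed, double *cpu) {
    double w0 = wall_seconds();
    double c0 = cpu_seconds();
    double result = busy_sum(n);
    *elapsed = wall_seconds() - w0;
    *cpu = cpu_seconds() - c0;
    return result;
}
```

For a compute-bound loop like this one the two numbers should be close; a program that waits on I/O would typically show elapsed time well above CPU time.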


## CPU Clocking

- Operation of digital hardware governed by a constant-rate clock
- Clock period: duration of a clock cycle
  - e.g., $250 \mathrm{ps}=0.25 \mathrm{~ns}=250 \times 10^{-12} \mathrm{~s}$
- Clock frequency (rate): cycles per second
  - e.g., $4.0 \mathrm{GHz}=4000 \mathrm{MHz}=4.0 \times 10^{9} \mathrm{~Hz}$
  - Clock period: $1 /\left(4.0 \times 10^{9}\right) \mathrm{s}=0.25 \mathrm{~ns}$


## No Excuse About the Unit

- Units should be as familiar as thousand, million, and billion are for dollars

| Value | Symbol | Name |
| :---: | :---: | :---: |
| $10^{-3} \mathrm{~s}$ | ms | millisecond |
| $10^{-6} \mathrm{~s}$ | $\mu \mathrm{s}$ | microsecond |
| $10^{-9} \mathrm{~s}$ | ns | nanosecond |
| $10^{-12} \mathrm{~s}$ | ps | picosecond |
| $10^{3} \mathrm{~Hz}$ | kHz | kilohertz |
| $10^{6} \mathrm{~Hz}$ | MHz | megahertz |
| $10^{9} \mathrm{~Hz}$ | GHz | gigahertz |

| Decimal value | Metric unit | Binary value | IEC unit |
| :---: | :---: | :---: | :---: |
| $1000$ | kB (kilobyte) | $1024$ | KiB (kibibyte) |
| $1000^{2}$ | MB (megabyte) | $1024^{2}$ | MiB (mebibyte) |
| $1000^{3}$ | GB (gigabyte) | $1024^{3}$ | GiB (gibibyte) |
| $1000^{4}$ | TB (terabyte) | $1024^{4}$ | TiB (tebibyte) |
| $1000^{5}$ | PB (petabyte) | $1024^{5}$ | PiB (pebibyte) |

## CPU Time

$$
\text{CPU Time}=\text{CPU Clock Cycles} \times \text{Clock Cycle Time}=\frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}
$$

- Performance improved by
  - Reducing number of clock cycles
  - Increasing clock rate (frequency)
- Hardware designer must often trade off clock rate against cycle count



## CPU Time Example

- Computer A: 2 GHz clock, 10 s CPU time
- Designing Computer B
  - Aim for 6 s CPU time
  - Can do faster clock, but causes $1.2 \times$ the clock cycles of A
- How fast must Computer B's clock be?

$$
\begin{aligned}
\text{Clock Cycles}_{\mathrm{A}} & =\text{CPU Time}_{\mathrm{A}} \times \text{Clock Rate}_{\mathrm{A}}=10 \mathrm{~s} \times 2 \mathrm{GHz}=20 \times 10^{9} \\
\text{Clock Rate}_{\mathrm{B}} & =\frac{1.2 \times 20 \times 10^{9}}{6 \mathrm{~s}}=\frac{24 \times 10^{9}}{6 \mathrm{~s}}=4 \mathrm{GHz}
\end{aligned}
$$
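The arithmetic of this example can be checked with a few lines of C. This is a sketch for illustration; the function names `cpu_time`, `cycles_from_time`, and `required_clock_rate` are hypothetical and simply restate the CPU time relation.

```c
/* CPU time in seconds = clock cycles / clock rate (Hz). */
double cpu_time(double cycles, double clock_rate_hz) {
    return cycles / clock_rate_hz;
}

/* Clock cycles = CPU time (s) * clock rate (Hz). */
double cycles_from_time(double cpu_time_s, double clock_rate_hz) {
    return cpu_time_s * clock_rate_hz;
}

/* Clock rate (Hz) needed to finish `cycles` within `target_time_s`. */
double required_clock_rate(double cycles, double target_time_s) {
    return cycles / target_time_s;
}
```

Calling `cycles_from_time(10.0, 2e9)` gives the cycle count of Computer A, and `required_clock_rate(1.2 * cycles_a, 6.0)` then gives the clock rate Computer B needs, about 4 GHz.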

## Instruction Count and CPI

$$
\begin{aligned}
\text{Clock Cycles} & =\text{Instruction Count} \times \text{Cycles per Instruction} \\
\text{CPU Time} & =\text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time}=\frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}
\end{aligned}
$$

- Instruction Count for a program
  - Determined by program, ISA and compiler
- Average cycles per instruction
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix

Pipeline stage occupied by each instruction in each clock cycle:

| Instr. No. | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | IF | ID | EX | MEM | WB |  |  |
| 2 |  | IF | ID | EX | MEM | WB |  |
| 3 |  |  | IF | ID | EX | MEM | WB |
| 4 |  |  |  | IF | ID | EX | MEM |
| 5 |  |  |  |  | IF | ID | EX |

## CPI Example

- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time $=500 \mathrm{ps}, \mathrm{CPI}=1.2$
- Same ISA
- Which is faster, and by how much?

Let $I$ be the instruction count, which is the same on both machines because they share the ISA.

$$
\begin{aligned}
\text{CPU Time}_{\mathrm{A}} & =\text{Instruction Count} \times \mathrm{CPI}_{\mathrm{A}} \times \text{Cycle Time}_{\mathrm{A}} \\
& =I \times 2.0 \times 250 \mathrm{ps}=I \times 500 \mathrm{ps} \quad \longleftarrow \text{A is faster...} \\
\text{CPU Time}_{\mathrm{B}} & =\text{Instruction Count} \times \mathrm{CPI}_{\mathrm{B}} \times \text{Cycle Time}_{\mathrm{B}} \\
& =I \times 1.2 \times 500 \mathrm{ps}=I \times 600 \mathrm{ps} \\
\frac{\text{CPU Time}_{\mathrm{B}}}{\text{CPU Time}_{\mathrm{A}}} & =\frac{I \times 600 \mathrm{ps}}{I \times 500 \mathrm{ps}}=1.2
\end{aligned}
$$
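The comparison above can be expressed as a small C sketch. The function names `cpu_time_from_cpi` and `speedup_a_over_b` are hypothetical; the instruction count cancels in the ratio, so any positive value can be passed for it.

```c
/* CPU time (s) = instruction count * CPI * clock cycle time (s). */
double cpu_time_from_cpi(double instruction_count, double cpi,
                         double cycle_time_s) {
    return instruction_count * cpi * cycle_time_s;
}

/* How many times faster machine A is than machine B on the same program. */
double speedup_a_over_b(double instruction_count,
                        double cpi_a, double cycle_a_s,
                        double cpi_b, double cycle_b_s) {
    double time_a = cpu_time_from_cpi(instruction_count, cpi_a, cycle_a_s);
    double time_b = cpu_time_from_cpi(instruction_count, cpi_b, cycle_b_s);
    return time_b / time_a;
}
```

With the values of this example, `speedup_a_over_b(1e6, 2.0, 250e-12, 1.2, 500e-12)` evaluates to approximately 1.2.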

## CPI in More Detail

- If different instruction classes take different numbers of cycles

$$
\text { Clock Cycles }=\sum_{\mathrm{i}=1}^{\mathrm{n}}\left(\mathrm{CPI}_{\mathrm{i}} \times \text { Instruction Count }_{\mathrm{i}}\right)
$$

- Weighted average CPI

$$
\mathrm{CPI}=\frac{\text { Clock Cycles }}{\text { Instruction Count }}=\sum_{\mathrm{i}=1}^{\mathrm{n}}\left(\mathrm{CPI}_{\mathrm{i}} \times \underbrace{\frac{\text { Instruction Count }_{\mathrm{i}}}{\text { Instruction Count }}}_{\text {Relative frequency }}\right)
$$

## CPI Example

- Alternative compiled code sequences using instructions in classes $A, B, C$

| Class | A | B | C |
| :--- | :---: | :---: | :---: |
| CPI for class | 1 | 2 | 3 |
| IC in sequence \#1 | 2 | 1 | 2 |
| IC in sequence \#2 | 4 | 1 | 1 |

- Sequence \#1: IC = 5
  - Clock Cycles $=2 \times 1+1 \times 2+2 \times 3=10$
  - Avg. $\mathrm{CPI}=10 / 5=2.0$
- Sequence \#2: IC = 6
  - Clock Cycles $=4 \times 1+1 \times 2+1 \times 3=9$
  - Avg. $\mathrm{CPI}=9 / 6=1.5$
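The weighted-average CPI calculation for the two sequences can be sketched in C as follows. The function names `total_cycles` and `average_cpi` are hypothetical; the arrays hold one entry per instruction class.

```c
#include <stddef.h>

/* Total clock cycles = sum over classes of CPI_i * IC_i. */
double total_cycles(const double *cpi, const long *ic, size_t n) {
    double cycles = 0.0;
    for (size_t i = 0; i < n; i++) {
        cycles += cpi[i] * (double)ic[i];
    }
    return cycles;
}

/* Average CPI = total clock cycles / total instruction count. */
double average_cpi(const double *cpi, const long *ic, size_t n) {
    long total_ic = 0;
    for (size_t i = 0; i < n; i++) {
        total_ic += ic[i];
    }
    return total_cycles(cpi, ic, n) / (double)total_ic;
}
```

With class CPIs `{1.0, 2.0, 3.0}`, the instruction counts `{2, 1, 2}` give 10 cycles and an average CPI of 2.0, while `{4, 1, 1}` give 9 cycles and an average CPI of 1.5. Sequence \#2 executes more instructions but takes fewer cycles.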


## Impacts by Components

$$
\text { CPU Time }=\frac{\text { Instructions }}{\text { Program }} \times \frac{\text { Clock cycles }}{\text { Instruction }} \times \frac{\text { Seconds }}{\text { Clock cycle }}
$$

|  | Inst Count | CPI | Clock Rate |
| :--- | :---: | :---: | :---: |
| Program | X |  |  |
| Compiler | X | (X) |  |
| Inst. Set | X | X |  |
| Architecture | X |  | X |
| Technology |  |  | X |

## Processor Performance Equation Summary

CPU time $=$ CPU clock cycles for a program $\times$ Clock cycle time

$$
\begin{aligned}
& \text { CPU time }=\frac{\text { CPU clock cycles for a program }}{\text { Clock rate }} \\
& \mathrm{CPI}=\frac{\text { CPU clock cycles for a program }}{\text { Instruction count }}
\end{aligned}
$$

CPU time $=$ Instruction count $\times$ Cycles per instruction $\times$ Clock cycle time

$$
\begin{gathered}
\frac{\text { Instructions }}{\text { Program }} \times \frac{\text { Clock cycles }}{\text { Instruction }} \times \frac{\text { Seconds }}{\text { Clock cycle }}=\frac{\text { Seconds }}{\text { Program }}=\text { CPU time } \\
\text { CPU clock cycles }=\sum_{i=1}^{n} \mathrm{IC}_{i} \times \mathrm{CPI}_{i} \\
\text { CPU time }=\left(\sum_{i=1}^{n} \mathrm{IC}_{i} \times \mathrm{CPI}_{i}\right) \times \text { Clock cycle time }
\end{gathered}
$$

## Principles of Computer Design

- Take Advantage of Parallelism
- e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
- Principle of Locality
- Reuse of data and instructions
- Focus on the Common Case
- Amdahl's Law

$$
\begin{aligned}
\text{Execution time}_{\text{new}} & =\text{Execution time}_{\text{old}} \times\left(\left(1-\text{Fraction}_{\text{enhanced}}\right)+\frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}\right) \\
\text{Speedup}_{\text{overall}} & =\frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}}=\frac{1}{\left(1-\text{Fraction}_{\text{enhanced}}\right)+\frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
\end{aligned}
$$

## Amdahl's Law

$$
\text{ExTime}_{\text{new}}=\text{ExTime}_{\text{old}} \times\left[\left(1-\text{Fraction}_{\text{enhanced}}\right)+\frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}\right]
$$

$$
\text{Speedup}_{\text{overall}}=\frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}}=\frac{1}{\left(1-\text{Fraction}_{\text{enhanced}}\right)+\frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
$$

Best you could ever hope to do:

$$
\text { Speedup }_{\text {maximum }}=\frac{1}{\left(1-\text { Fraction }_{\text {enhanced }}\right)}
$$



## Using Amdahl's Law

Overall speedup if we make $90 \%$ of a program run 10 times faster.

$$
\begin{aligned}
& \mathrm{F}=0.9 \quad \mathrm{~S}=10 \\
& \text { Overall Speedup }=\frac{1}{(1-0.9)+\frac{0.9}{10}}=\frac{1}{0.1+0.09}=5.26
\end{aligned}
$$

Overall speedup if we make $80 \%$ of a program run $20 \%$ faster.

$$
F=0.8 \quad S=1.2
$$
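Both cases can be evaluated with a short C sketch of the Amdahl's Law formula. The function names `amdahl_speedup` and `amdahl_limit` are hypothetical. For the second case, $F=0.8$ and $S=1.2$ give $1 /(0.2+0.8 / 1.2) \approx 1.15$.

```c
/* Overall speedup when a fraction f of execution time is sped up by s. */
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper bound on overall speedup as s grows without limit: 1 / (1 - f). */
double amdahl_limit(double f) {
    return 1.0 / (1.0 - f);
}
```

`amdahl_speedup(0.9, 10.0)` reproduces the 5.26 of the first case, and `amdahl_limit(0.9)` shows that no enhancement of that 90% can push the overall speedup past about 10.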

## Amdahl's Law for Parallelism

- The enhanced fraction $F$ is achieved through parallelism, assuming perfect parallelism with linear speedup
  - The speedup for $F$ is $N$ for $N$ processors
- Overall speedup, with $T_{s}$ the sequential and $T_{p}$ the parallel execution time

$$
S(N)=\frac{T_{s}}{T_{p}}=\frac{T_{s}}{(1-F) \times T_{s}+\frac{F \times T_{s}}{N}}=\frac{1}{1-F+\frac{F}{N}}
$$

- Speedup upper bound (when $N \rightarrow \infty$): $S(N) \leq \frac{1}{1-F}$
  - $1-F$: the sequential portion of a program
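As a worked illustration of the bound (the values $F=0.9$ and $N=10, 100$ are assumed for this example and are not taken from the slides):

$$
\begin{aligned}
S(10) & =\frac{1}{0.1+\frac{0.9}{10}}=\frac{1}{0.19} \approx 5.26 \\
S(100) & =\frac{1}{0.1+\frac{0.9}{100}}=\frac{1}{0.109} \approx 9.17 \\
S(N) & \leq \frac{1}{1-0.9}=10
\end{aligned}
$$

Going from 10 to 100 processors less than doubles the speedup, because the 10% sequential portion dominates the remaining execution time.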





## Exercise \#1: Amdahl's Law

Suppose that we want to enhance the processor used for Web serving. The new processor is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original processor is busy with computation $40 \%$ of the time and is waiting for I/O $60 \%$ of the time, what is the overall speedup gained by incorporating the enhancement?

## Exercise \#1: Amdahl's Law Solution

$\text{Fraction}_{\text{enhanced}}=0.4$; $\text{Speedup}_{\text{enhanced}}=10$

$$
\text { Speedup }_{\text {overall }}=\frac{1}{0.6+\frac{0.4}{10}}=\frac{1}{0.64} \approx 1.56
$$

## General Amdahl's Law

- $F_{0}=30 \%$ with no speedup, $F_{1}=40 \%$ sped up by 4, $F_{2}=30 \%$ sped up by 3: what is the overall speedup?
  - $\text{Speedup}_{\text{overall}}=1 /(0.3+0.4 / 4+0.3 / 3)=1 / 0.5=2$
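The multi-fraction form generalizes to any number of program portions. A minimal C sketch, assuming the fractions sum to 1 and using the hypothetical function name `general_amdahl`:

```c
#include <stddef.h>

/*
 * Overall speedup when fraction f[i] of the original execution time
 * is sped up by factor s[i]. The fractions are assumed to sum to 1;
 * s[i] = 1 marks a portion that is not enhanced.
 */
double general_amdahl(const double *f, const double *s, size_t n) {
    double new_time = 0.0; /* new execution time relative to the old one */
    for (size_t i = 0; i < n; i++) {
        new_time += f[i] / s[i];
    }
    return 1.0 / new_time;
}
```

Passing fractions `{0.3, 0.4, 0.3}` with speedups `{1.0, 4.0, 3.0}` yields an overall speedup of about 2, matching the hand calculation.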


## Exercise \#2: CPU time and Speedup

Suppose we have made the following measurements:

- Frequency of FP operations $=25 \%$
- Average CPI of FP operations $=4.0$
- Average CPI of other instructions $=1.33$
- Frequency of FPSQR $=2 \%$
- CPI of FPSQR $=20$

Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the processor performance equation.

## Exercise \#2: Solution, Textbook Page 54

First, observe that only the CPI changes; the clock rate and instruction count remain identical. We start by finding the original CPI with neither enhancement:

$$
\begin{aligned}
\mathrm{CPI}_{\text {original }} & =\sum_{i=1}^{n} \mathrm{CPI}_{i} \times\left(\frac{\mathrm{IC}_{i}}{\text { Instruction count }}\right) \\
& =(4 \times 25 \%)+(1.33 \times 75 \%)=2.0
\end{aligned}
$$

We can compute the CPI for the enhanced FPSQR by subtracting the cycles saved from the original CPI:

$$
\begin{aligned}
\mathrm{CPI}_{\text {with new FPSQR}} & =\mathrm{CPI}_{\text {original }}-2 \% \times\left(\mathrm{CPI}_{\text {old FPSQR}}-\mathrm{CPI}_{\text {of new FPSQR only}}\right) \\
& =2.0-2 \% \times(20-2)=1.64
\end{aligned}
$$

We can compute the CPI for the enhancement of all FP instructions the same way or by summing the FP and non-FP CPIs. Using the latter gives us:

$$
\mathrm{CPI}_{\text {new FP }}=(75 \% \times 1.33)+(25 \% \times 2.5)=1.625
$$

Since the CPI of the overall FP enhancement is slightly lower, its performance will be marginally better. Specifically, the speedup for the overall FP enhancement is

$$
\begin{aligned}
\text { Speedup }_{\text {new FP }} & =\frac{\text {CPU time}_{\text {original }}}{\text {CPU time}_{\text {new FP }}}=\frac{\mathrm{IC} \times \text { Clock cycle } \times \mathrm{CPI}_{\text {original }}}{\mathrm{IC} \times \text { Clock cycle } \times \mathrm{CPI}_{\text {new FP }}} \\
& =\frac{\mathrm{CPI}_{\text {original }}}{\mathrm{CPI}_{\text {new FP }}}=\frac{2.00}{1.625}=1.23
\end{aligned}
$$
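For comparison, the same ratio can be applied to the FPSQR alternative, using the CPI of 1.64 computed earlier in this solution:

$$
\text{Speedup}_{\text{new FPSQR}}=\frac{\mathrm{CPI}_{\text{original}}}{\mathrm{CPI}_{\text{with new FPSQR}}}=\frac{2.00}{1.64} \approx 1.22
$$

This is slightly below the 1.23 obtained for the overall FP enhancement, consistent with the CPI comparison.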

## Power and Energy

- Problem:
  - Get power in and distribute it around
  - Get power out: dissipate heat
- Three primary concerns:
  - Maximum power requirement for a processor
  - Thermal Design Power (TDP)
    - Characterizes sustained power consumption
    - Used as target for power supply and cooling system
    - Lower than peak power, higher than average power consumption
  - Energy and energy efficiency
- Clock rate can be reduced dynamically to limit power consumption


## Energy and Energy Efficiency

- Power: energy per unit time
  - 1 watt = 1 joule per second
- Energy per task is often a better measurement
  - Processor A has 20\% higher average power consumption than processor B, but A executes the task in only $70 \%$ of the time needed by B
  - So the energy consumption of A is $1.2 \times 0.7=0.84$ times that of B
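Written out, with $P$ denoting average power and $T$ execution time (symbols introduced here for illustration), the comparison follows from energy being power multiplied by time:

$$
\frac{E_{\mathrm{A}}}{E_{\mathrm{B}}}=\frac{P_{\mathrm{A}} \times T_{\mathrm{A}}}{P_{\mathrm{B}} \times T_{\mathrm{B}}}=\frac{1.2 \, P_{\mathrm{B}} \times 0.7 \, T_{\mathrm{B}}}{P_{\mathrm{B}} \times T_{\mathrm{B}}}=0.84
$$

Despite drawing more power, A uses 16% less energy for the task because it finishes sooner.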


## Dynamic Energy and Power

- Dynamic energy
  - Consumed when a transistor switches from 0 -> 1 or 1 -> 0

$$
\text{Energy}_{\text{dynamic}} \propto \frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^{2}
$$


## An Example from Textbook page \#25

Some microprocessors today are designed to have adjustable voltage, so a $15 \%$ reduction in voltage may result in a $15 \%$ reduction in frequency. What would be the impact on dynamic energy and on dynamic power?

Because the capacitance is unchanged, the answer for energy is the ratio of the squared voltages:

$$
\frac{\text { Energy }_{\text {new }}}{\text { Energy }_{\text {old }}}=\frac{(\text { Voltage } \times 0.85)^{2}}{\text { Voltage }^{2}}=0.85^{2}=0.72
$$

thereby reducing energy to about $72 \%$ of the original. For power, we also multiply by the ratio of the frequencies:

$$
\frac{\text { Power }_{\text {new }}}{\text { Power }_{\text {old }}}=0.72 \times \frac{(\text { Frequency switched } \times 0.85)}{\text { Frequency switched }}=0.61
$$

shrinking power to about $61 \%$ of the original.

## An Example from Textbook

- Suppose a new CPU has
- $85 \%$ of capacitive load of old CPU
- $15 \%$ voltage and $15 \%$ frequency reduction

$$
\frac{P_{\text {new }}}{P_{\text {old }}}=\frac{C_{\text {old }} \times 0.85 \times\left(V_{\text {old }} \times 0.85\right)^{2} \times F_{\text {old }} \times 0.85}{C_{\text {old }} \times V_{\text {old }}^{2} \times F_{\text {old }}}=0.85^{4}=0.52
$$
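The scaling arguments in both textbook examples reduce to multiplying relative factors. A minimal C sketch, assuming dynamic energy scales with capacitive load times voltage squared and dynamic power additionally scales with switching frequency; the function names `dynamic_energy_ratio` and `dynamic_power_ratio` are hypothetical.

```c
/*
 * New-to-old dynamic energy ratio, given the relative capacitive load
 * (c_scale) and relative voltage (v_scale): proportional to C * V^2.
 */
double dynamic_energy_ratio(double c_scale, double v_scale) {
    return c_scale * v_scale * v_scale;
}

/*
 * New-to-old dynamic power ratio: the energy ratio multiplied by the
 * relative switching frequency (f_scale).
 */
double dynamic_power_ratio(double c_scale, double v_scale, double f_scale) {
    return dynamic_energy_ratio(c_scale, v_scale) * f_scale;
}
```

`dynamic_energy_ratio(1.0, 0.85)` gives about 0.72 and `dynamic_power_ratio(1.0, 0.85, 0.85)` about 0.61, as in the first example; `dynamic_power_ratio(0.85, 0.85, 0.85)` gives about 0.52, as in the second.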

## Power Trends



$$
\text{Power}=\text{Capacitive load} \times \text{Voltage}^{2} \times \text{Frequency}
$$

- Figure annotations: power $\times 30$, supply voltage $5 \mathrm{~V} \rightarrow 1 \mathrm{~V}$

## Reducing Power

- Techniques for reducing power:
  - Do nothing well
  - Dynamic Voltage-Frequency Scaling
  - Low power state for DRAM, disks
  - Overclocking, turning off cores


## Static Power

- Power includes both dynamic power and static power
- Static power consumption
  - 25-50\% of total power
  - Scales with number of transistors
  - To reduce: power gating (turn off power of inactive modules)

$$
\text{Power}_{\text{static}} \propto \text{Current}_{\text{static}} \times \text{Voltage}
$$


