Main Study Topics
Textbooks
Read at least 15 pages per week on average, following the class schedule.
- Required textbook (CAQA): Computer Architecture: A Quantitative Approach, 6th Edition, John L. Hennessy and David A. Patterson. Amazon, Publisher website.
- Reference and introductory book for computer architecture (COD): Computer Organization and Design, RISC-V Edition: The Hardware/Software Interface, David A. Patterson and John L. Hennessy. Amazon, Publisher website.
Lecture 01, Introduction
Lecture 02, Technology Trends and Quantitative Design and Analysis for Performance
- What are the factors that determine the dynamic energy consumption and dynamic power of computers?
- The full formula for CPU time, with as many factors as possible (see the formula sheet after this list).
- Among program, compiler, ISA, computer architecture, and technology, which factors impact instruction count, CPI, and clock rate?
- What is the principle or law that guides focusing on the common case when optimizing applications?
- Exercise for power and energy consumption in the textbook (also in the lecture notes).
- Exercise for CPI and quantitative analysis of CPU performance in the textbook (also in the lecture notes).
- Exercise for Amdahl's law speedup in the textbook (also in the lecture notes).
- PAPI programming (check the resource page for details; a sketch follows this list).
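For reference, the standard CAQA forms of the formulas asked about above (the notation in the lecture notes may differ slightly):

$$\text{Energy}_{\text{dynamic}} \propto \text{Capacitive load} \times \text{Voltage}^2$$
$$\text{Power}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$$
$$\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}$$
$$\text{Speedup}_{\text{overall}} = \frac{1}{(1-\text{Fraction}_{\text{enhanced}}) + \text{Fraction}_{\text{enhanced}} / \text{Speedup}_{\text{enhanced}}}$$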
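A minimal PAPI sketch, assuming PAPI is installed and using only its standard low-level calls; the measured loop is an illustrative stand-in, not the course assignment:

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    int evset = PAPI_NULL;
    long long counts[2];

    /* Initialize the library and build an event set with two preset events. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    if (PAPI_create_eventset(&evset) != PAPI_OK) exit(1);
    if (PAPI_add_event(evset, PAPI_TOT_INS) != PAPI_OK) exit(1); /* instructions */
    if (PAPI_add_event(evset, PAPI_TOT_CYC) != PAPI_OK) exit(1); /* cycles */

    PAPI_start(evset);
    volatile double s = 0.0;                /* illustrative work to measure */
    for (int i = 0; i < 1000000; i++) s += i * 0.5;
    PAPI_stop(evset, counts);

    /* CPI = total cycles / total instructions */
    printf("instructions=%lld cycles=%lld CPI=%.2f\n",
           counts[0], counts[1], (double)counts[1] / counts[0]);
    return 0;
}
```

Compile with something like `gcc papi_cpi.c -lpapi` (file name assumed).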
Lecture 03-04, Instruction Set Principles
- What are the two commonly used classes of ISA today? (Load-store, as in RISC, e.g. ARM, PowerPC, SPARC, MIPS, RISC-V; and register-memory, as in x86.)
- A new architecture is expected to support at least three addressing modes; what are the three modes?
- What are the three choices for instruction encoding? Selecting an encoding is a tradeoff between performance and code size; in general, what is the impact of each encoding choice on performance and code size?
- Highlight the major differences between RISC and CISC.
- The iron-clad summary of ISA design in the Lecture 03 notes, e.g. how many registers would a RISC CPU normally have? What are the three most-used addressing modes in a RISC ISA? What are the three groups of instructions that a RISC CPU should provide?
- Compile and inspect the assembly generated by the gcc compiler and the RISC-V assembly code of sum.c (see the sketch after this list).
- PAPI usage in Assignment 1: explore and use other PAPI counters to diagnose different metrics of a program's execution, e.g. cache behavior.
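A hedged stand-in for the inspection exercise above; the course's actual sum.c may differ:

```c
/* sum.c (illustrative stand-in): sum the elements of an array. */
long sum(const long *a, long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += a[i];              /* one load + one add per iteration */
    return s;
}
/*
 * Inspect the generated assembly with:
 *   gcc -O2 -S sum.c -o sum.s            (native ISA)
 *   riscv64-linux-gnu-gcc -O2 -S sum.c   (RISC-V cross-compiler, if installed)
 * In the RISC-V output, look for the load-store pattern: ld for the array
 * element, add for the accumulator, addi for the index, and a conditional
 * branch (bne/blt) for the loop.
 */
```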
Lecture 05-06 and 07-09, Pipeline and Implementation
- What are the five stages of a typical RISC CPU pipeline?
- Which stages are needed by each of the three categories of RISC instructions (ALU, Branch, and Load/Store)?
- Understand the four important things that make pipelining real: (1) separate I-cache and D-cache; (2) register file read and write in the same cycle; (3) PC adder / branch target calculation in the ID stage; (4) pipeline registers for storing results between pipeline stages.
- What are pipeline hazards, and why do they hurt pipeline performance?
- What is a branch delay slot, and how does the compiler make use of it, for what purpose?
- Illustrate structural hazards, data hazards, and branch hazards using examples, and give the textbook's solution for how each of these hazards is handled to reduce the stalls we would otherwise have to introduce.
- Forwarding, also known as bypassing, addresses most RAW data hazards so that no stall is needed in the pipeline. However, there is one case where we cannot completely eliminate stalls. What is that case? Explain it in detail. (Answer: the load-use case, e.g. ld R1, 30(R3); add R2, R1, R4. Note that forwarding does eliminate stalls for a load followed by a store of the loaded value.)
- Branch pipeline penalty: how many cycles of branch penalty for each of the three control-transfer instructions in RISC-V (unconditional jump to PC+immediate, unconditional jump to rs1+immediate, and conditional branch)? (A worked CPI example follows this list.)
- Two additions to the pipeline for handling the branch penalty are the BHT and the BTB; explain the purpose of each.
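A small worked example for the branch-penalty question above (the frequency and penalty are assumed numbers, not the course's):

$$\text{CPI} = \text{CPI}_{\text{base}} + \text{Branch frequency} \times \text{Branch penalty}$$

For instance, with base CPI 1, 20% control-transfer instructions, and an average penalty of 1 cycle, CPI = 1 + 0.2 × 1 = 1.2, a 20% slowdown from branches alone, which is why the BHT and BTB additions matter.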
Lecture 10-12, Memory Technology, Cache Organization and Performance, and Cache Optimizations
- The differences (latency, size, and cost) between DRAM and SRAM, and how they are used in a computer system (DRAM is used for main memory; SRAM is used mainly for caches).
- Understand the CPU-memory gap and what the memory wall is.
- Explain what the memory hierarchy is, why it works, and how it works.
- Understand the principle of locality and the two types of locality.
- Given a piece of code, identify references to variables and arrays that exhibit each of the two types of locality.
- Explain how caching exploits both types of locality by preloading and keeping data in the faster SRAM cache.
- Explain the four cache organization questions
- Understand AMAT and be able to calculate AMAT from examples, including the multi-level situation (see the worked example after this list).
- Understand the difference between miss rate and misses/instruction, and be able to calculate one from the other.
- Understand and be able to use the formula for CPU time that takes memory stall cycles into account.
- Understand the three types of cache misses (the three C's).
- Know how to use the AMAT guideline to optimize cache performance
- Explain why 64 bytes is a good cache block size for most CPU designs.
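The AMAT and memory-stall formulas referenced above, with a worked two-level example (the numbers are assumed for illustration):

$$\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$$
$$\text{AMAT} = HT_{L1} + MR_{L1} \times (HT_{L2} + MR_{L2} \times \text{Miss penalty}_{L2})$$
$$\text{CPU time} = \text{IC} \times \left(\text{CPI}_{\text{exec}} + \frac{\text{Memory stall cycles}}{\text{Instruction}}\right) \times \text{Clock cycle time}$$

For example, with a 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% L2 local miss rate, and a 100-cycle memory access: AMAT = 1 + 0.05 × (10 + 0.2 × 100) = 2.5 cycles.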
Final Topics
Lecture 14, Cache Optimization, part 2
- CPI calculations that account for memory (via AMAT) are very important.
- Understand the goal (hit time, miss rate, or miss penalty) of each cache optimization technique and the tradeoffs of the optimizations.
- Know the basic, golden cache configurations, e.g. 16-128 KB L1 cache, multiple MBytes for L2/L3 caches, 64-byte cache blocks, 2-4 way set associativity.
- Know and understand why L1 has stayed fairly small while L2/L3 are normally big (L1 is designed to reduce hit time; L2/L3 are designed mainly to reduce miss rate).
- Compiler and software techniques, e.g. loop interchange to turn column-wise accesses into row-wise accesses for C multi-dimensional arrays, and loop blocking (tiling) to use the cache efficiently (see the sketch after this list).
- Be able to explain the cache miss ratio for each of the six matrix multiplication implementations and how blocking helps caching.
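A minimal sketch of the two compiler techniques named above; N and B are assumed illustrative sizes:

```c
#define N 512
#define B 64    /* block (tile) size, chosen so a tile fits in cache */

double x[N][N], y[N][N], z[N][N];

/* Loop interchange: C arrays are row-major, so making j the inner loop
 * turns column-wise (stride-N) accesses into row-wise (stride-1) ones. */
void interchange(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}

/* Loop blocking (tiling) for matrix multiply x += y*z: work on BxB tiles
 * so the working set stays in cache instead of streaming whole rows. */
void blocked_mm(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```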
Lecture 16, ILP - Dynamic Scheduling (Out-of-Order Execution)
- What is dynamic scheduling? Rearranging the order of instructions in hardware to reduce stalls while maintaining data flow. It minimizes the impact of RAW hazards and eliminates or minimizes WAW and WAR hazards by register renaming, etc.
- What are the advantages and disadvantages of dynamic scheduling in terms of its impact on hardware and software complexity?
- The two major dynamic scheduling algorithms/implementations: scoreboarding and Tomasulo's algorithm.
- What are the three major additions to a regular pipeline architecture needed to implement Tomasulo's algorithm? (Load/store buffers, reservation stations, and the common data bus.)
- Identify the main purposes of the reservation stations and load/store buffers (maintaining RAW data dependences between instructions, and implicit register renaming for handling WAR/WAW hazards).
- Know how implicit register renaming is achieved using the reservation stations and load/store buffers (see the sketch after this list).
- Given the Tomasulo algorithm diagram, be able to explain the instruction execution sequence and register renaming.
- What is special about the common data bus compared with a regular data bus? (On the common data bus, a message contains the source instruction that produced the data together with the data itself, whereas on a regular data bus a message contains a memory address (the destination) and the data.) Explain how the common data bus is used in Tomasulo's algorithm and why a regular data bus would not help.
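A compilable toy sketch of implicit register renaming (my own simplification: tags and a CDB only; no functional units, timing, or load/store buffers):

```c
#include <stdio.h>

#define NREGS 8
#define NRS   4

/* Register status: 0 = value ready in reg_val, else the tag (1..NRS) of
 * the reservation station that will produce the register's value. */
int reg_tag[NREGS];
int reg_val[NREGS];

struct rs { int busy, Vj, Vk, Qj, Qk; } rs[NRS];

/* Issue "rd <- r1 op r2": capture operands as values if ready, else as
 * tags, then rename rd so later readers wait on this station, not rd. */
int issue(int rd, int r1, int r2) {
    for (int t = 1; t <= NRS; t++) {
        struct rs *s = &rs[t - 1];
        if (s->busy) continue;
        s->busy = 1;
        s->Qj = reg_tag[r1]; if (!s->Qj) s->Vj = reg_val[r1];
        s->Qk = reg_tag[r2]; if (!s->Qk) s->Vk = reg_val[r2];
        reg_tag[rd] = t;        /* implicit rename: WAR/WAW on rd vanish */
        return t;
    }
    return 0;                   /* structural hazard: no free station */
}

/* Common data bus: broadcast (tag, value); waiting stations and the
 * register file match on the producer's tag, not on an address. */
void cdb(int tag, int value) {
    for (int i = 0; i < NRS; i++) {
        if (rs[i].Qj == tag) { rs[i].Vj = value; rs[i].Qj = 0; }
        if (rs[i].Qk == tag) { rs[i].Vk = value; rs[i].Qk = 0; }
    }
    for (int r = 0; r < NREGS; r++)
        if (reg_tag[r] == tag) { reg_val[r] = value; reg_tag[r] = 0; }
    rs[tag - 1].busy = 0;
}

int main(void) {
    int t1 = issue(1, 2, 3);    /* I1: R1 <- R2 op R3                 */
    int t2 = issue(4, 1, 1);    /* I2: R4 <- R1 op R1, waits on t1    */
    issue(1, 5, 6);             /* I3: R1 <- R5 op R6, WAW with I1    */
    cdb(t1, 42);                /* I1 completes: wakes I2, but R1 now */
    cdb(t2, 84);                /* belongs to I3, so no WAW clobber   */
    printf("R4=%d, R1 waits on tag %d\n", reg_val[4], reg_tag[1]);
    return 0;
}
```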
Lecture 17, ILP - Hardware Speculation and VLIW (Static Superscalar)
- What is hardware speculation? It is a technique for dealing with control-transfer instructions and for reducing their penalty. It uses branch prediction and mis-prediction recovery techniques.
- Which buffers does hardware branch prediction require? (The branch history table and the branch target buffer.)
- Explain how mis-prediction recovery is achieved using the re-order buffer. The re-order buffer maintains the order in which instructions were issued and graduates (commits) instructions in that order. When an instruction was executed because of a mis-prediction, the hardware can check the re-order buffer and avoid committing instructions that should not have been executed.
- What modification do we need to make to Tomasulo's algorithm in order to do hardware speculation? We need to add a re-order buffer, which subsumes the store buffer.
- Explain the commit stage of pipeline execution when there is hardware speculation.
- Schedule instructions on hardware that has speculation.
- Dynamic scheduling + hardware speculation achieves in-order issue, out-of-order execution, out-of-order completion, and in-order commit in the execution pipeline.
- What are the purpose and approach of static or dynamic superscalar? To increase instruction issue per cycle: it tries to issue multiple instructions per cycle.
- Understand VLIW and be able to use an example to explain its benefits.
- Loop unrolling example: unroll loops and do instruction scheduling for VLIW given a latency table and a VLIW format; calculate the cycles needed (see the unrolling sketch after this list).
- Know the VLIW approach in the Intel/HP IA-64 architecture and the pros and cons of VLIW. Explain why IA-64 was not successful.
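A minimal C view of the loop-unrolling exercise above (unroll factor 4 assumed; a VLIW compiler would then schedule the four independent multiply-adds into long-instruction slots):

```c
/* Original loop: one multiply-add per iteration, little ILP exposed. */
void axpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Unrolled by 4 (assuming n % 4 == 0): the four bodies are independent,
 * so a VLIW/static-superscalar compiler can pack them into wide
 * instructions to hide latency; loop overhead also drops by 4x. */
void axpy_unrolled(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i += 4) {
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
}
```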
Lecture 18, ILP - (Hardware Dynamic) Superscalar and ARM/Intel CPU Examples
- A typical hardware superscalar machine would allow issuing the same number of instructions per cycle as it has functional units.
- Instruction scheduling combining dynamic scheduling, hardware speculation and hardware superscalar
- Number of pipeline stages in ARM CPU and Intel Core i7 CPU
- Given the hardware pipeline stage diagram of the Intel Core i7 CPU, understand and explain how out-of-order execution, speculation, and superscalar issue are achieved.
- Know the purpose of the Intel micro-op decode stage (hardware decodes CISC instructions into RISC-style micro-ops).
Lecture 19, ILP - Exploiting Thread-Level Parallelism in Uniprocessor
- Explain the differences between superscalar, fine-grained multithreading, coarse-grained multithreading, simultaneous multithreading (SMT), and multiprocessing.
- Give examples of CPUs that implement SMT.
- What is hyper-threading in Intel CPUs?
Lecture 20, DLP - Introduction and Vector
- What is Flynn's classification of parallel architectures? Explain each class.
- What is SIMD and what are the differences between SIMD and VLIW?
- A vector architecture requires: vector registers, vector ALU and load/store instructions, and special-purpose vector registers/instructions such as the vector length register, the mask register, etc.
- Given a simple example such as AXPY, be able to write the vector code and compare its performance and code size with scalar CPU code.
- What is the use of the vector length register, and how is it used for strip mining when vectorizing a loop (see the sketch after this list)?
- What are the purpose and usage of the vector mask (predicate) register? Use an example.
- Explain, using examples, the usage of strided load and store instructions for vectors.
- Know that vector architectures were the main architectures of early supercomputers. Know who Seymour Cray, the father of the supercomputer, was.
- Know vector lanes and vector chaining in vector implementation and execution.
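A C model of strip mining with the vector length register (an MVL of 64 is an assumed value; the inner j loop stands in for one vector instruction executing under the current VL):

```c
#define MVL 64  /* assumed maximum vector length of the machine */

/* AXPY with strip mining: each outer iteration sets the vector length
 * (vl) and issues one "vector instruction" of vl elements, so a loop of
 * arbitrary n maps onto fixed-size vector registers. */
void vaxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; ) {
        int vl = (n - i < MVL) ? (n - i) : MVL; /* VL for this strip */
        for (int j = 0; j < vl; j++)            /* models one vector op */
            y[i + j] = a * x[i + j] + y[i + j];
        i += vl;
    }
}
```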
Lecture 20, DLP - SIMD Extensions
- Why is image processing a better fit for SIMD architectures? (1) Image pixels can be short, e.g. 1 byte; (2) the processing of each pixel is the same, so there is a large amount of inherent data parallelism.
- How do we extend a general-purpose CPU to have SIMD capability, using the adder as an example? (1) Allow scalar registers to hold multiple elements; (2) extend a 64-bit adder to do four 16-bit additions.
- Examples of SIMD-extended architectures: (1) Intel SSE and AVX; (2) the ARM Scalable Vector Extension (SVE).
- Roofline performance model: arithmetic intensity calculation, and whether a problem is compute bound or memory bound (a worked example follows).
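A worked arithmetic-intensity example for the roofline item above (double-precision AXPY; the machine numbers are assumed):

AXPY does 2 flops per element (one multiply, one add) while moving 24 bytes (load x[i], load y[i], store y[i] at 8 bytes each), so

$$\text{Arithmetic intensity} = \frac{2 \text{ flops}}{24 \text{ bytes}} = \frac{1}{12} \text{ flops/byte}$$
$$\text{Attainable GFLOP/s} = \min(\text{Peak GFLOP/s},\ \text{AI} \times \text{Peak GB/s})$$

On an assumed machine with 100 GFLOP/s peak and 50 GB/s memory bandwidth, AI × BW = 50/12 ≈ 4.2 GFLOP/s, far below 100, so AXPY is memory bound.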
Lecture 21-22, DLP - GPU and Loop-Level Parallelism
- What is a GPU today?
- About how many cores do the latest NVIDIA GPUs have: on the order of 10, 100, 1,000, or 10,000?
- Understand thread hierarchy and topology of GPU threads.
- Know how to write a simple GPU program that has loop-level parallelism.
- Understand GPU offloading execution model and workflow of GPU execution and data movement
- Understand the GPU's SIMT execution model. What is a warp?
- Understand the thread divergence problem and how GPUs handle thread divergence.
- What is special about GPU memory and what kinds of memories are available for programmers to use?
- What is GPU shared memory, and how is it used in programming?
- Given a loop, determine whether the loop has loop-carried dependences and whether data-level parallelism can be exploited (see the example after this list).
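Two small loops for the dependence question above (illustrative, not from the lecture):

```c
/* No loop-carried dependence: every iteration is independent, so the
 * loop can be vectorized or run as one GPU thread per element. */
void independent(int n, const int *a, const int *b, int *c) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Loop-carried dependence: iteration i reads a[i-1], which iteration
 * i-1 writes, so the iterations cannot simply run in parallel. */
void carried(int n, int *a, const int *b) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}
```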
Lecture 23-25, TLP - Thread-Level Parallelism, Cache Coherence, Snoopy Protocol, False Sharing and Synchronization
- Are multicore architectures SIMD, MISD, or MIMD?
- Why have we been unable to keep increasing clock frequency as before since ~2005, so that we resort to multicore architectures? The limits of ILP and the power wall.
- What are the two kinds of shared-memory machines, and what is the difference in terms of memory access?
- Know and explain cache-coherence problem in shared memory multi-core or multi-processor systems
- What are the two cache-coherence protocols?
- Know the snoopy cache-coherence protocol and how it works with the write-back and write-through policies.
- Explain what false sharing is using a program as an example, and provide a solution that removes false sharing using padding (see the sketch after this list).
- Know directory-based cache-coherence protocol and how it works.
- Explain how a data race can happen, using a program example, and what the programming solution for handling data races is.
- How is mutual exclusion achieved using locks?
- The three components of a synchronization event.
- Why does the strawman lock implementation not work, and what basic primitive is needed to have a working lock implementation?
- Know the compare-and-swap (CAS) primitive and LL/SC (load-linked/store-conditional).
- How do we use CAS or LL/SC to implement a lock (see the lock sketch after this list)?
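A sketch of the padding fix for false sharing (64-byte cache blocks are assumed; the padded layout gives each thread's counter its own block):

```c
/* Two threads increment different counters, but if both counters sit in
 * the same 64-byte cache block, every write invalidates the other
 * core's copy of the block: false sharing. */
struct bad { long a, b; };          /* a and b likely share one block */

/* Padding to the block size removes the coherence ping-pong. */
struct good {
    long a;
    char pad[64 - sizeof(long)];    /* assumed 64-byte cache blocks */
    long b;
};
```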
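A minimal CAS spinlock sketch using C11 atomics (a test-and-set-style lock; real locks add backoff and fairness):

```c
#include <stdatomic.h>

/* Acquire: atomically swing the lock 0 -> 1. The CAS is the hardware
 * primitive the strawman lock lacks: its read-test-write is indivisible. */
void lock_acquire(atomic_int *lock) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(lock, &expected, 1))
        expected = 0;  /* CAS stored the observed value here; retry from 0 */
}

void lock_release(atomic_int *lock) {
    atomic_store(lock, 0);          /* a 0 lets the next CAS succeed */
}
```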
Lecture 26-27, Domain-Specific Architectures
- Explain the evolution of architecture with regard to Moore's law; many people think the next trend is domain-specific architectures.
- What are the three major kernels widely used by deep neural network applications? MV, MM, and stencil (illustrated after this list).
- What are the five guidelines for designing domain-specific architectures?
- Understand the Google TPU architecture and the systolic array.
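Illustrative C versions of the three kernel patterns above (shapes and sizes are assumed):

```c
#define N 64

/* Matrix-vector (MV): y = A x */
void mv(float A[N][N], const float *x, float *y) {
    for (int i = 0; i < N; i++) {
        y[i] = 0.0f;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
    }
}

/* Matrix-matrix (MM): C = A B -- the pattern a TPU's systolic array runs */
void mm(float A[N][N], float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0f;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* Stencil: each output is a weighted sum of a small neighborhood of the
 * input (the core of convolutional layers). */
void stencil3(int n, const float *in, float *out, const float w[3]) {
    for (int i = 1; i < n - 1; i++)
        out[i] = w[0]*in[i-1] + w[1]*in[i] + w[2]*in[i+1];
}
```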