Introduction to advanced computer architecture, including classes of computers,
Instruction set architecture, Trends, Technology, Power and energy
Cost
Principles of computer design
Added: Apr 16, 2024
Slides: 60 pages
Slide Content
Computer Architecture – An Introduction CS4342 Advanced Computer Architecture Dilum Bandara [email protected] Slides adapted from “Computer Architecture, A Quantitative Approach” by John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
Outline Walls Classes of computers Instruction set architecture Trends Technology Power & energy Cost Principles of computer design 2
Single Processor Performance [graph: uniprocessor performance over time; rapid growth through the RISC era, then the move to multi-processor]
Why Such Rapid Change? Performance improvements Improvements in semiconductor technology Clock speed, feature size Improvements in computer architectures High-level language compilers, UNIX Led to RISC architectures Lower costs Simpler development Higher volumes Lower margins Function Rise of networking & interconnection technology
Today’s Status Moore’s Law – Number of transistors on a chip tends to double about every 2 years Transistor count still rising Clock speed flattening sharply Source: www.extremetech.com/wp-content/uploads/2012/02/CPU-Scaling.jpg
Clock Speed vs. Power Intel 80386 consumed ~2 W 3.3 GHz Intel Core i7 consumes 130 W Heat must be dissipated from a 1.5 × 1.5 cm chip Limits what can be cooled by air
Conventional Wisdom in Question Conventional Wisdom – Power is free, Transistors are expensive Today – Power is expensive, Transistors are free Power wall Can put more on chip than can afford to turn on Conventional Wisdom – Increase Instruction Level Parallelism (ILP) via compilers, innovation Out-of-order, speculation, VLIW Today – Law of diminishing returns on more hardware for ILP ILP wall 7
Conventional Wisdom in Question (Cont.) Conventional Wisdom – Multiplies are slow, Memory access is fast Today – Memory is slow, multiplies are fast Memory wall 200 clock cycles to DRAM memory, 4 clocks to multiply Conventional Wisdom – Uniprocessor performance 2× / 1.5 years Today – Power Wall + ILP Wall + Memory Wall = Brick Wall Multi-cores Simpler processors are more power efficient 8
Current Trends in Architecture Can’t continue to leverage ILP Uniprocessor performance improvement ended in 2003 New models for performance Data-level parallelism (DLP) Thread-level parallelism (TLP) Request-level parallelism (RLP) These require explicit restructuring of applications 9
Parallelism 10
Parallelism (Cont.) Classes of parallelism in applications Data-Level Parallelism (DLP) Task-Level Parallelism (TLP) Classes of architectural parallelism Instruction-Level Parallelism (ILP) Exploits DLP in pipelining & speculative execution Vector architectures/Graphic Processor Units (GPUs) Exploit DLP by applying same instruction on many data items Thread-Level Parallelism Exploit DLP & TLP in cooperative processing by threads Request-Level Parallelism Parallel execution of tasks that are independent 11
Flynn’s Taxonomy Single instruction stream, single data stream (SISD) Normal sequential programs Uniprocessor Single instruction stream, multiple data streams (SIMD) Data parallelism Vector architectures Multimedia extensions (Intel MMX) Graphics Processor Units (GPUs) Multiple instruction streams, single data stream (MISD) No commercial implementation Fault-tolerant schemes Multiple instruction streams, multiple data streams (MIMD) Most parallel programs Multi-core
Classes of Computers & Performance Metrics 13 Want to achieve these performance metrics? Then you need to understand & design based on principles of computer architecture
Classes of Computers Personal Mobile Device (PMD) Smart phones & tablets Emphasis is on energy efficiency, cost, responsiveness, & multimedia performance Desktop Computing Desktops, netbooks, & laptops Emphasis is on price-performance, energy, & graphic performance Servers Emphasis is on availability, scalability, throughput, & energy 14
Classes of Computers (Cont.) Clusters / Warehouse Scale Computers Used for “Software as a Service (SaaS)” Emphasis on availability, price-performance, throughput, & energy Sub-class – Supercomputers Emphasis – floating-point performance & fast internal networks Embedded Computers Emphasis on price, power, size, application-specific performance 15
Terminology 16
Blocks of a Microprocessor [diagram: program execution section (program memory, instruction register, stack, program counter, instruction decoder, timing/control & register selection) and register processing section (accumulator, ALU, RAM & data registers, I/O, flag & special function registers), linked by an internal data bus; clock, reset & interrupt inputs. Source: Makis Malliris & Sabir Ghauri, UWE]
Uniprocessor – Internal Structure [diagram: registers A–E, ALU, control unit, IR, FLAG, PC (+1), connected via address, data & control buses]
Instruction Execution Sequence 1. Fetch next instruction from memory to IR 2. Change PC to point to next instruction 3. Determine type of instruction just fetched 4. If instruction needs data from memory, determine where it is 5. Fetch data if needed into register 6. Execute instruction 7. Go to step 1 & continue with next instruction
Sample Program Program memory: 100: Load A,10 101: Load B,15 102: ADD A,B 103: STORE A,[20] Data memory (locations 18–21): initially 00
Execution trace [register/bus diagrams]: Before execution / 1st fetch cycle – PC = 100, IR empty After 1st fetch cycle – IR = Load A,10; PC = 101 After 1st instruction cycle – A = 10; PC = 101 After 3rd fetch cycle – IR = ADD A,B; A = 10, B = 15; PC = 103 After 3rd instruction cycle – A = 25, B = 15; PC = 103
Architectural Differences Length of microprocessors’ data word 4, 8, 16, 32, & 64 bit Speed of instruction execution Clock rate & processor speed Size of direct addressable memory CPU architecture Instruction set Number & types of registers Support circuits Compatibility with existing software & hardware development systems 30
Instruction Set Architecture (ISA) [diagram: the instruction set as the interface between software and hardware]
Properties of a Good ISA Abstraction Lasts through many generations (portability) Used in many different ways (generality) Provides convenient functionality to higher levels Permits an efficient implementation at lower levels 32
Course Focus Understanding design techniques, machine structures, technology factors, & evaluation methods that will determine the forms of computers in the 21st century [diagram: computer architecture = instruction set design, organization, hardware; shaped by technology, programming languages, operating systems, history, applications; spans interface design (ISA), measurement & evaluation, parallelism]
Trends in Technology Integrated circuit technology Transistor density – +35%/year Die size – +10-20%/year Integration overall – +40-55%/year DRAM capacity – +25-40%/year (slowing) Flash capacity – +50-60%/year 15-20× cheaper/bit than DRAM Magnetic disk technology – +40%/year (slowing) 15-25× cheaper/bit than Flash 300-500× cheaper/bit than DRAM
Measuring Performance Typical performance metrics Response time Throughput Execution time Wall clock time – includes all system overheads CPU time – only computation time Speedup of X relative to Y Speedup = Execution time of Y / Execution time of X Benchmarks Kernels (e.g., matrix multiply) Toy programs (e.g., sorting) Synthetic benchmarks (e.g., Dhrystone) Benchmark suites (e.g., SPEC06fp, TPC-C, PCMark)
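The speedup definition above can be written as a tiny helper; the function name `speedup` is hypothetical, used only for illustration:

```python
# Speedup of machine X relative to machine Y: the ratio of Y's
# execution time to X's. A value > 1 means X is faster.
def speedup(exec_time_y, exec_time_x):
    return exec_time_y / exec_time_x

# If Y takes 30 s and X takes 12 s, X is 2.5x faster than Y.
print(speedup(30, 12))  # 2.5
```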
Bandwidth & Latency Bandwidth or throughput Total work done in a given time 10,000-25,000X improvement for processors 300-1200X improvement for memory & disks Latency or response time Time between start & completion of an event 30-80X improvement for processors 6-8X improvement for memory & disks While bandwidth is increasing, latency isn’t improving at anywhere near the same rate
Transistors & Wires Feature size Minimum size of a transistor or wire in the x or y dimension 10 microns in 1971 to 0.014 microns in 2014 Transistor performance used to scale with feature size Wires Smaller feature size means shorter wires & higher density But resistance & capacitance per unit length grow Wire delay doesn’t reduce with feature size! While transistors are getting smaller, wire latency isn’t reducing
Power & Energy Problem – Getting power in & out Thermal Design Power (TDP) Characterizes sustained power consumption Used as target for power supply & cooling system Lower than peak power, higher than average power Intel i7-4770K 4 Cores @ 3.5 GHz TDP 84W & Peak ~140W Clock rate can be reduced dynamically to limit power consumption Intel i7, AMD Ryzen Energy per task is often a better measurement Tied to the task & its execution time
Techniques for Reducing Power Do nothing well Dynamic Voltage-Frequency Scaling (DVFS) e.g., AMD Opteron Low power state for DRAM, disks Sleep mode Overclocking, turning off cores Intel i7, AMD Ryzen 40 Source: AMD
Dynamic Energy & Power Dynamic energy Transistor switch from 0 → 1 or 1 → 0 ½ × Capacitive load × Voltage² Dynamic power ½ × Capacitive load × Voltage² × Frequency switched Reducing voltage reduces energy (quadratically) Reducing clock rate reduces power, not energy
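The dynamic energy and power relations above can be sketched as two small functions; the names and the example capacitance/frequency values are hypothetical, chosen only to show the scaling:

```python
# Energy per 0->1 or 1->0 transition: 1/2 * C * V^2
def dynamic_energy(c_load, voltage):
    return 0.5 * c_load * voltage ** 2

# Dynamic power: 1/2 * C * V^2 * f (energy per switch times switch rate)
def dynamic_power(c_load, voltage, freq):
    return dynamic_energy(c_load, voltage) * freq

# Halving the clock halves power but leaves energy per switch unchanged;
# lowering the voltage reduces both, quadratically.
base = dynamic_power(1e-9, 1.0, 3e9)
print(dynamic_power(1e-9, 1.0, 1.5e9) / base)  # 0.5
print(round(dynamic_energy(1e-9, 0.8) / dynamic_energy(1e-9, 1.0), 2))  # 0.64
```

This is why reducing clock rate alone saves power but not energy per task: the task simply takes proportionally longer at the same energy per switch.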
Static Power Static power consumption Current static × Voltage Scales with number of transistors Merely withholding the clock signal is insufficient Power gating is needed
Exercise Which processor has better performance-power gain? Core i7-4770K 4 core, 3.9 GHz TDP – 84W, average consumption 95.5W Apple A8 2 core, 1.5 GHz (iPad Mini) 2W 43
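One crude way to attack the exercise above is performance per watt, using cores × clock (GHz) as a rough throughput proxy. This is only a back-of-the-envelope sketch; real performance depends heavily on the workload, and the proxy metric here is an assumption, not a benchmark:

```python
# Rough "performance per watt" proxy: (cores x GHz) / watts.
def perf_per_watt(cores, clock_ghz, watts):
    return cores * clock_ghz / watts

i7 = perf_per_watt(4, 3.9, 95.5)  # i7-4770K at its average consumption
a8 = perf_per_watt(2, 1.5, 2.0)   # Apple A8

print(round(a8 / i7))  # 9
```

On this crude metric the A8 comes out roughly 9× ahead, which is the point of the exercise: the mobile part is far more energy efficient even though the i7 is much faster in absolute terms.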
Trends in Cost Cost driven down by learning curve Yield Microprocessors – price depends on volume 10% less for each doubling of volume DRAM – price closely tracks cost 44
Principles of Computer Design Take Advantage of Parallelism Principle of Locality Focus on the Common Case Amdahl’s Law Processor Performance Equation 45
1. Taking Advantage of Parallelism Increasing throughput via multiple processors or multiple disks Examples Multiple processors RAID Memory banks Pipelining Multiple functional units – superscalar 46
Pipelining Overlap instruction execution to reduce total time to complete an instruction sequence Not every instruction depends on its immediate predecessor, so instructions can execute completely or partially in parallel when possible Classic 5-stage pipeline Instruction Fetch Register Read Execute (ALU) Data Memory Access Register Write (Reg)
Pipelined Instruction Execution [diagram: instructions in program order overlapped across clock cycles 1–7; each instruction flows through the Ifetch, Reg, ALU, DMem, Reg stages, with a new instruction entering the pipeline every cycle]
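The cycle counts behind the diagram above can be computed directly: on a k-stage pipeline the first instruction takes k cycles, then one instruction completes per cycle (ignoring hazards and stalls). A minimal sketch:

```python
# Ideal k-stage pipeline: fill the pipeline (k cycles), then retire
# one instruction per cycle for the remaining n-1 instructions.
def pipelined_cycles(n_instructions, n_stages):
    return n_stages + (n_instructions - 1)

# Without pipelining, every instruction takes all k stages serially.
def unpipelined_cycles(n_instructions, n_stages):
    return n_instructions * n_stages

# Four instructions on the classic 5-stage pipeline: 8 cycles vs 20.
print(pipelined_cycles(4, 5), unpipelined_cycles(4, 5))  # 8 20
```

For large n the speedup approaches the number of stages, which is the ideal case that the hazards on the next slide erode.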
Limits to Pipelining Hazards prevent next instruction from executing during its designated clock cycle Structural hazards Attempt to use same hardware to do 2 different things at once Data hazards Instruction depends on result of prior instruction still in pipeline Control hazards Caused by delay between fetching of instructions & decisions about changes in control flow (branches & jumps) 50
2. Principle of Locality Program access a relatively small portion of address space at any instant of time Types of locality Spatial Locality If an item is referenced, items whose addresses are close by tend to be referenced soon e.g., straight-line code, array access Temporal Locality If an item is referenced, it will tend to be referenced again soon e.g., loops, reuse 51
Locality – Example sum = 0; for (i = 0; i < n; i++) sum += a[i]; return sum; Data Access array elements a[0], a[1], a[2], … in succession – Spatial locality Reference sum each iteration – Temporal locality Instructions Reference instructions in sequence – Spatial locality Cycle through loop repeatedly – Temporal locality
3. Focus on Common Case Common sense guides computer design It’s engineering! Favor frequent case over infrequent case e.g., instruction fetch & decode unit used more frequently than multiplier, so optimize it 1 st e.g., in databases storage dependability dominates system dependability, so optimize it 1 st Frequent case is often simpler & can be done faster than infrequent case e.g., overflow is rare when adding numbers, so improve performance by optimizing common case of no overflow May slow down overflow, but overall performance improved by optimizing for normal case 53
4. Amdahl’s Law Speedup overall = 1 / ((1 − Fraction enhanced) + Fraction enhanced / Speedup enhanced) As Speedup enhanced → ∞, this approaches 1 / (1 − Fraction enhanced) – the best you could ever hope to do
Amdahl’s Law – Example Floating-point instructions improved to run 2×, but only 10% of actual instructions are FP: ExTime new = ExTime old × (0.9 + 0.1/2) = 0.95 × ExTime old Speedup overall = 1 / 0.95 = 1.053
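The example above can be checked with Amdahl's Law written as a function; a minimal sketch:

```python
# Amdahl's Law: overall speedup when a fraction f of execution time
# is accelerated by a factor s. The un-enhanced fraction (1 - f)
# limits the total speedup no matter how large s gets.
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# The FP example: 10% of instructions, made 2x faster.
print(round(amdahl_speedup(0.10, 2.0), 3))  # 1.053
```

Note that even an infinite FP speedup would only give 1 / 0.9 ≈ 1.11 here, which is the "best you could ever hope to do" bound.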
5. Processor Performance Equation CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle) = Instruction count × CPI × Cycle time
5. Processor Performance Equation (Cont.) What affects each factor: Program – Instruction count Compiler – Instruction count, (CPI) Instruction set – Instruction count, CPI Organization – CPI, Clock rate Technology – Clock rate
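The performance equation above can be sketched as a function; the name and example numbers are hypothetical, chosen only to show how the three factors combine:

```python
# CPU time = instruction count x CPI x cycle time,
# where cycle time is the reciprocal of the clock rate.
def cpu_time(inst_count, cpi, clock_rate_hz):
    return inst_count * cpi / clock_rate_hz

# e.g., 1 billion instructions at CPI 2 on a 2 GHz clock: 1 second.
print(cpu_time(1e9, 2.0, 2e9))  # 1.0
```

The table above then reads directly off the arguments: compilers and the ISA mostly move `inst_count` and `cpi`, while organization and technology move `cpi` and `clock_rate_hz`.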
Fallacies & Pitfalls Fallacies – commonly held misconceptions When discussing a fallacy, we try to give a counterexample Pitfalls – easily made mistakes Often generalizations of principles true in a limited context Fallacies & pitfalls are presented to help you avoid these errors
Fallacies & Pitfalls (Cont.) Fallacy – Benchmarks remain valid indefinitely Once a benchmark becomes popular, there is tremendous pressure to improve performance by Targeted optimizations or Aggressive interpretation of the rules for running the benchmark A.k.a. “benchmarksmanship” Of the 70 benchmarks from the first 5 SPEC releases, 70% were dropped from the next release because they were no longer useful
Fallacies & Pitfalls (Cont.) Pitfall – A single point of failure A system is only as reliable as its weakest link Rule of thumb for fault-tolerant systems – make sure every component is redundant so that no single component failure can bring down the whole system e.g., power supply vs. fan