Advanced Computer Architecture (Computer Architecture: A Quantitative Approach), Chapter 1: Fundamentals of Quantitative Design and Analysis

mahdieh79, Oct 17, 2024

About This Presentation

Advanced Computer Architecture
Computer Architecture: A Quantitative Approach
Chapter 1: Fundamentals of Quantitative Design and Analysis


Slide Content

Advanced Computer Architecture, Session 1. In the name of the Lord of boundless kindness. Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Chapter 1: Fundamentals of Quantitative Design and Analysis. Computer Architecture: A Quantitative Approach, Sixth Edition

Computer Technology Performance improvements: Improvements in semiconductor technology Feature size, clock speed Improvements in computer architectures Enabled by HLL compilers, UNIX Led to RISC architectures Together have enabled: Lightweight computers Productivity-based managed/interpreted programming languages Copyright © 2019, Elsevier Inc. All rights reserved. Introduction

Single Processor Performance Copyright © 2019, Elsevier Inc. All rights reserved. Introduction

Copyright © 2019, Elsevier Inc. All rights reserved. Current Trends in Architecture Cannot continue to leverage Instruction-Level parallelism (ILP) Single processor performance improvement ended in 2003 New models for performance: Data-level parallelism (DLP) Thread-level parallelism (TLP) Request-level parallelism (RLP) These require explicit restructuring of the application Introduction

Copyright © 2019, Elsevier Inc. All rights reserved. Classes of Computers Personal Mobile Device (PMD) e.g. smart phones, tablet computers Emphasis on energy efficiency and real-time Desktop Computing Emphasis on price-performance Servers Emphasis on availability, scalability, throughput Clusters / Warehouse Scale Computers Used for “Software as a Service (SaaS)” Emphasis on availability and price-performance Sub-class: Supercomputers, emphasis: floating-point performance and fast internal networks Internet of Things/Embedded Computers Emphasis: price Classes of Computers

Copyright © 2019, Elsevier Inc. All rights reserved. Parallelism Classes of parallelism in applications: Data-Level Parallelism (DLP) Task-Level Parallelism (TLP) Classes of architectural parallelism: Instruction-Level Parallelism (ILP) Vector architectures/Graphic Processor Units (GPUs) Thread-Level Parallelism Request-Level Parallelism Classes of Computers

Copyright © 2019, Elsevier Inc. All rights reserved. Flynn’s Taxonomy Single instruction stream, single data stream (SISD) Single instruction stream, multiple data streams (SIMD) Vector architectures Multimedia extensions Graphics processor units Multiple instruction streams, single data stream (MISD) No commercial implementation Multiple instruction streams, multiple data streams (MIMD) Tightly-coupled MIMD Loosely-coupled MIMD Classes of Computers

1. Single Instruction, Single Data (SISD): This category is the uniprocessor. The programmer thinks of it as the standard sequential computer, but it can exploit ILP.

2. Single Instruction, Multiple Data (SIMD): The same instruction is executed by multiple processors using different data streams. SIMD computers exploit data-level parallelism by applying the same operations to multiple items of data in parallel. Each processor has its own data memory, but there is a single instruction memory and control processor, which fetches and dispatches instructions. Examples: vector architectures, multimedia extensions to standard instruction sets, and GPUs.

3. Multiple Instruction, Single Data (MISD): No commercial multiprocessor of this type has been built to date, but it rounds out this simple classification.

4. Multiple Instruction, Multiple Data (MIMD): Each processor fetches its own instructions and operates on its own data, and it targets task-level parallelism (TLP). MIMD can also exploit DLP, though at a higher cost than SIMD. Tightly coupled MIMD architectures target TLP; loosely coupled MIMD architectures (clusters, warehouse-scale computers) target RLP.

Copyright © 2019, Elsevier Inc. All rights reserved. Defining Computer Architecture “Old” view of computer architecture: Instruction Set Architecture (ISA) design i.e. decisions regarding: registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding “Real” computer architecture: Specific requirements of the target machine Design to maximize performance within constraints: cost, power, and availability Includes ISA, microarchitecture, hardware Defining Computer Architecture

Instruction Set Architecture Class of ISA General-purpose registers Register-memory vs load-store RISC-V registers: 32 g.p., 32 f.p. Copyright © 2019, Elsevier Inc. All rights reserved. Defining Computer Architecture

Register   Name      Use                 Saver
x0         zero      constant 0          n/a
x1         ra        return addr         caller
x2         sp        stack ptr           callee
x3         gp        global ptr          -
x4         tp        thread ptr          -
x5-x7      t0-t2     temporaries         caller
x8         s0/fp     saved / frame ptr   callee
x9         s1        saved               callee
x10-x17    a0-a7     arguments           caller
x18-x27    s2-s11    saved               callee
x28-x31    t3-t6     temporaries         caller
f0-f7      ft0-ft7   FP temporaries      caller
f8-f9      fs0-fs1   FP saved            callee
f10-f17    fa0-fa7   FP arguments        caller
f18-f27    fs2-fs11  FP saved            callee
f28-f31    ft8-ft11  FP temporaries      caller

Instruction Set Architecture Memory addressing RISC-V: byte addressed, aligned accesses faster An access to an object of size s bytes at byte address A is aligned if A mod s = 0. Addressing modes RISC-V: register, immediate, displacement (base + offset) Other examples: autoincrement, indexed, PC-relative Types and size of operands RISC-V: 8-bit, 32-bit, 64-bit IEEE 754 floating point in 32-bit (single precision) and 64-bit (double precision). The 80x86 also supports 80-bit floating point (extended double precision). Copyright © 2019, Elsevier Inc. All rights reserved. Defining Computer Architecture
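The alignment rule above (an access of size s at address A is aligned when A mod s = 0) can be sketched in a few lines of Python; the addresses and sizes below are arbitrary illustrations, not values from the text:

```python
# Alignment rule from the slide: an access to an object of size s bytes
# at byte address A is aligned if A mod s == 0.

def is_aligned(address: int, size: int) -> bool:
    """Return True if an access of `size` bytes at `address` is aligned."""
    return address % size == 0

# Arbitrary illustrative accesses (not from the text).
for addr, size in [(0x1000, 8), (0x1004, 8), (0x1004, 4), (0x1003, 2)]:
    print(f"addr={addr:#06x} size={size}: aligned={is_aligned(addr, size)}")
```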

Floating point instructions for RISC-V. Copyright © 2019, Elsevier Inc. All rights reserved.

IEEE 754 Format Copyright © 2019, Elsevier Inc. All rights reserved.

Instruction Set Architecture Operations RISC-V: data transfer, arithmetic, logical, control, floating point See Fig. 1.5 in text Control flow instructions Use content of registers (RISC-V) vs. status bits (x86, ARMv7, ARMv8) Return address in register (RISC-V, ARMv7, ARMv8) vs. on stack (x86) Encoding Fixed (RISC-V, ARMv7/v8 except compact instruction set) vs. variable length (x86) Copyright © 2019, Elsevier Inc. All rights reserved. Defining Computer Architecture

Encoding Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved.

Advanced Computer Architecture, Session 2. In the name of the Lord of boundless kindness. Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Chapter 1: Fundamentals of Quantitative Design and Analysis (Cont.). Computer Architecture: A Quantitative Approach, Sixth Edition

Genuine Computer Architecture The implementation of a computer has two components: organization and hardware. Copyright © 2019, Elsevier Inc. All rights reserved.

…Genuine Computer Architecture Organization: the high-level aspects of a computer's design, including the memory system, the memory interconnect, and the design of the internal processor or CPU (central processing unit, where arithmetic, logic, branching, and data transfer are implemented). The term microarchitecture is also used instead of organization. Copyright © 2019, Elsevier Inc. All rights reserved.

…Genuine Computer Architecture Copyright © 2019, Elsevier Inc. All rights reserved. Two processors with the same instruction set architectures but different organizations are the AMD Opteron and the Intel Core i7. Both processors implement the 80x86 instruction set, but they have very different pipeline and cache organizations.

…Genuine Computer Architecture Hardware refers to the specifics of a computer: the detailed logic design and the packaging technology of the computer. Often a line of computers contains computers with identical instruction set architectures and very similar organizations that differ in the detailed hardware implementation. Copyright © 2019, Elsevier Inc. All rights reserved.

…Genuine Computer Architecture The Intel Core i7 and the Intel Xeon E7 are nearly identical but have different clock rates and different memory systems, making the Xeon E7 more effective for server computers. Copyright © 2019, Elsevier Inc. All rights reserved.

Computer architects must design a computer to meet functional requirements as well as price, power, performance, and availability goals. Architects also must determine what the functional requirements are, which can be a major task. The requirements may be specific features inspired by the market. Application software typically drives the choice of certain functional requirements by determining how the computer will be used. Copyright © 2019, Elsevier Inc. All rights reserved. …Genuine Computer Architecture

Copyright © 2019, Elsevier Inc. All rights reserved. Summary of some of the most important functional requirements an architect faces

Copyright © 2019, Elsevier Inc. All rights reserved. Trends in Technology Integrated circuit technology (Moore’s Law) Transistor density: 35%/year Die size: 10-20%/year Integration overall: 40-55%/year DRAM capacity: 25-40%/year (slowing) 8 Gb (2014), 16 Gb (2019), possibly no 32 Gb Flash capacity: 50-60%/year 8-10X cheaper/bit than DRAM Magnetic disk capacity: recently slowed to 5%/year Density increases may no longer be possible, maybe increase from 7 to 9 platters 8-10X cheaper/bit than Flash 200-300X cheaper/bit than DRAM Network technology Network performance depends both on the performance of switches and on the performance of the transmission system. Trends in Technology Designers often design for the next technology. Cost has decreased at about the rate at which density increases.
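As a quick sanity check on what compound annual rates like these imply, here is a minimal Python sketch; the rates are taken from the ranges quoted above (midpoints where a range is given), and the 10-year horizon is an arbitrary illustration:

```python
# Compound annual growth: value after n years = (1 + rate) ** n.
# Rates are drawn from the ranges quoted on the slide; the 10-year
# horizon is an arbitrary illustration.

rates = {
    "Transistor density (35%/yr)": 0.35,
    "DRAM capacity (~30%/yr)": 0.30,
    "Flash capacity (~55%/yr)": 0.55,
    "Magnetic disk capacity (5%/yr)": 0.05,
}

years = 10
for name, rate in rates.items():
    print(f"{name}: x{(1 + rate) ** years:.1f} over {years} years")
```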

Copyright © 2019, Elsevier Inc. All rights reserved. Bandwidth and Latency Bandwidth or throughput Total work done in a given time 32,000-40,000X improvement for processors 300-1200X improvement for memory and disks Latency or response time Time between start and completion of an event 50-90X improvement for processors 6-8X improvement for memory and disks Trends in Technology

Bandwidth and Latency… Performance is the primary differentiator for microprocessors and networks: the greatest gains, 32,000-40,000X in bandwidth and 50-90X in latency. Capacity is generally more important than performance for memory and disks, and capacity has improved most: bandwidth advances of 400-2400X versus latency gains of only 8-9X. Copyright © 2019, Elsevier Inc. All rights reserved.

Performance milestones over 25–40 years for microprocessors Copyright © 2019, Elsevier Inc. All rights reserved.

Performance milestones over 25–40 years for memory Copyright © 2019, Elsevier Inc. All rights reserved.

Performance milestones over 25–40 years for networks, Copyright © 2019, Elsevier Inc. All rights reserved.

Performance milestones over 25–40 years for disks Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Bandwidth and Latency Log-log plot of bandwidth and latency milestones relative to the first milestone. Latency improved 8-91X, while bandwidth improved about 400-32,000X. Except for networking, there were modest improvements in latency and bandwidth in the other three technologies in the six years (2011-2017): 0%-23% in latency and 23%-70% in bandwidth. Trends in Technology

Copyright © 2019, Elsevier Inc. All rights reserved.

Advanced Computer Architecture, Session 3. In the name of the Lord of boundless kindness. Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Chapter 1: Fundamentals of Quantitative Design and Analysis (Cont.). Computer Architecture: A Quantitative Approach, Sixth Edition

Copyright © 2019, Elsevier Inc. All rights reserved. Transistors and Wires Feature size Minimum size of transistor or wire in x or y dimension 10 microns in 1971 to 0.011 microns in 2017 Transistor performance scales linearly Wire delay does not improve with feature size! Integration density scales quadratically Trends in Technology Larger and larger fractions of the clock cycle have been consumed by the propagation delay of signals on wires, but power now plays an even greater role than wire delay.
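The quadratic-density claim above can be illustrated with one line of arithmetic: shrinking the feature size by a linear factor k increases transistor density by roughly k squared. A small sketch using the 1971 and 2017 feature sizes quoted on the slide:

```python
# Integration density scales roughly quadratically with the linear
# shrink in feature size (transistors per unit area ~ 1 / feature_size**2).

feature_1971 = 10.0    # microns (from the slide)
feature_2017 = 0.011   # microns (from the slide)

linear_shrink = feature_1971 / feature_2017
density_gain = linear_shrink ** 2

print(f"Linear shrink: ~{linear_shrink:,.0f}x")
print(f"Density gain:  ~{density_gain:,.0f}x (quadratic in feature size)")
```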

Transistors and Wires Copyright © 2012, Elsevier Inc. All rights reserved.

Power and Energy Copyright © 2012, Elsevier Inc. All rights reserved.

Power and Energy concerns: What is the maximum power a processor ever requires? (Voltage indexing methods allow the processor to slow down and regulate voltage within a wider margin.) What is the sustained power consumption (thermal design power, TDP)? It determines the cooling requirement. Which metric is the right one for comparing processors: energy or power? Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Power and Energy Problem: Get power in, get power out Thermal Design Power (TDP) Characterizes sustained power consumption Used as target for power supply and cooling system Lower than peak power (which can be 1.5X higher), higher than average power consumption Clock rate can be reduced dynamically to limit power consumption Energy per task is often a better measurement Trends in Power and Energy

Power and Energy Power: energy per unit time; 1 watt = 1 joule per second. Energy = Power × Time (E = P × T). Which metric is the right one for comparing processors: energy or power? In general, energy is the better metric because it is tied to a specific task and the time required for that task. Copyright © 2012, Elsevier Inc. All rights reserved.

Power and Energy If we want to know which of two processors is more efficient for a given task, we should compare energy consumption (not power) for executing the task. Copyright © 2012, Elsevier Inc. All rights reserved.

Power and Energy When is power consumption a useful measure? As a constraint: for example, a chip might be limited to 100 watts. Copyright © 2012, Elsevier Inc. All rights reserved.

Power and Energy Static power Dynamic power Copyright © 2012, Elsevier Inc. All rights reserved.

Dynamic Energy and Power Copyright © 2012, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Dynamic Energy and Power Dynamic energy (transistor switch from 0 -> 1 or 1 -> 0): Energy = 1/2 × Capacitive load × Voltage² Dynamic power: Power = 1/2 × Capacitive load × Voltage² × Frequency switched Reducing clock rate reduces power, not energy Trends in Power and Energy
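A minimal sketch of these two relations, including the standard DVFS observation that scaling voltage and frequency down by 15% each cuts dynamic power to about 0.85³ ≈ 0.61 of the original; the capacitance, voltage, and frequency values are made-up illustrations:

```python
# Dynamic energy per 0->1 or 1->0 transition: E = 1/2 * C * V**2
# Dynamic power:                              P = 1/2 * C * V**2 * f
# Reducing only the clock rate reduces power, not the energy per task.

def dynamic_energy(c_load: float, voltage: float) -> float:
    return 0.5 * c_load * voltage ** 2

def dynamic_power(c_load: float, voltage: float, freq: float) -> float:
    return dynamic_energy(c_load, voltage) * freq

# Made-up illustrative values (not from the text).
C, V, F = 1e-9, 1.0, 3.0e9
p_base = dynamic_power(C, V, F)

# Scale voltage and frequency down by 15% each (DVFS-style).
p_scaled = dynamic_power(C, 0.85 * V, 0.85 * F)
print(f"Power ratio after 15% V and f reduction: {p_scaled / p_base:.2f}")  # ~0.61 = 0.85**3
```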

Dynamic Energy and Power Copyright © 2012, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Power Intel 80386 consumed ~ 2 W 3.3 GHz Intel Core i7 consumes 130 W Heat must be dissipated from 1.5 x 1.5 cm chip This is the limit of what can be cooled by air Trends in Power and Energy

Power Copyright © 2012, Elsevier Inc. All rights reserved.

Reducing Power Copyright © 2012, Elsevier Inc. All rights reserved.

Copyright © 2012, Elsevier Inc. All rights reserved. Reducing Power Techniques for reducing power: Do nothing well (clock gating): Most microprocessors today turn off the clock of inactive modules to save energy and dynamic power Dynamic Voltage-Frequency Scaling (DVFS): Personal mobile devices, laptops, and even servers have periods of low activity where there is no need to operate at the highest clock frequency and voltages Low power state for DRAM, disks: Given that PMDs and laptops are often idle, memory and storage offer low power modes to save energy Overclocking, turning off cores: the 3.3 GHz Core i7 can run in short bursts at 3.6 GHz; for single-threaded code, these microprocessors can turn off all cores but one and run it at an even higher clock rate Trends in Power and Energy

Copyright © 2019, Elsevier Inc. All rights reserved. Reducing Power Techniques for reducing power: Do nothing well Dynamic Voltage-Frequency Scaling Low power state for DRAM, disks Overclocking, turning off cores Trends in Power and Energy

Static Power Copyright © 2012, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Static Power Static power consumption 25-50% of total power Static power = Static current × Voltage Scales with number of transistors To reduce: power gating Trends in Power and Energy

Static Power Large SRAM caches need power to maintain the storage values. (The S in SRAM is for static.) The only hope to stop leakage is to turn off power to subsets of the chip. Copyright © 2019, Elsevier Inc. All rights reserved.

Race-to-halt: because the processor is just a portion of the whole energy cost of a system, it can make sense to use a faster, less energy-efficient processor to allow the rest of the system to go into a sleep mode. This strategy is known as race-to-halt. Copyright © 2019, Elsevier Inc. All rights reserved.

Domain specific processors A computer will consist of standard processors to run conventional large programs such as operating systems, plus domain specific processors that do only a narrow range of tasks, but do them extremely well. Such computers will be much more heterogeneous than the homogeneous multicore chips of the past. Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved.

Advanced Computer Architecture, Session 4. In the name of the Lord of boundless kindness. Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Chapter 1: Fundamentals of Quantitative Design and Analysis (Cont.). Computer Architecture: A Quantitative Approach, Sixth Edition

Trends in Cost Although costs tend to be less important in some computer designs (specifically supercomputers), cost-sensitive designs are of growing significance. Learning curve: manufacturing costs decrease over time, measured by change in yield. Example: the price per megabyte of DRAM has dropped over the long term, and the price and cost of DRAM track closely. Microprocessor prices also drop over time, but because they are less standardized than DRAMs, the relationship between price and cost is more complex. Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Trends in Cost Cost driven down by learning curve Yield DRAM: price closely tracks cost Microprocessors: price depends on volume 10% less for each doubling of volume Trends in Cost

Trends in Cost key factor in determining cost: Copyright © 2019, Elsevier Inc. All rights reserved.

Cost of an Integrated Circuit Standard parts (disks, Flash memory, DRAMs, and so on) are becoming a significant portion of any system's cost. With PMDs' increasing reliance on whole systems on a chip (SOCs), the cost of the integrated circuits is much of the cost of the PMD. Copyright © 2019, Elsevier Inc. All rights reserved.

Trends in Cost Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Integrated Circuit Cost Integrated circuit Bose-Einstein formula: Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N Defects per unit area = 0.016-0.057 defects per square cm (2010) N = process-complexity factor = 11.5-15.5 (40 nm, 2010) For 28 nm processes in 2017, N is 7.5-9.5. For a 16 nm process, N ranges from 10 to 14 Trends in Cost

Integrated Circuit Cost Copyright © 2019, Elsevier Inc. All rights reserved.

Integrated Circuit Cost Copyright © 2019, Elsevier Inc. All rights reserved.
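A hedged sketch of the die-cost arithmetic behind these slides, combining the Bose-Einstein yield formula quoted above with the usual dies-per-wafer approximation from the text; all numeric inputs (wafer cost, wafer diameter, die area, defect density, N) are illustrative assumptions, not figures from the text:

```python
import math

def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """Standard approximation: wafer area / die area, minus an edge-loss term."""
    radius = wafer_diameter_cm / 2
    return (math.pi * radius ** 2) / die_area_cm2 \
        - (math.pi * wafer_diameter_cm) / math.sqrt(2 * die_area_cm2)

def die_yield(wafer_yield: float, defects_per_cm2: float,
              die_area_cm2: float, n: float) -> float:
    """Bose-Einstein formula: wafer_yield / (1 + defects * die_area)**N."""
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

def cost_per_good_die(wafer_cost: float, wafer_diameter_cm: float,
                      die_area_cm2: float, defects_per_cm2: float,
                      n: float, wafer_yield: float = 1.0) -> float:
    good_dies = dies_per_wafer(wafer_diameter_cm, die_area_cm2) \
        * die_yield(wafer_yield, defects_per_cm2, die_area_cm2, n)
    return wafer_cost / good_dies

# Illustrative assumptions only.
cost = cost_per_good_die(wafer_cost=7000, wafer_diameter_cm=30,
                         die_area_cm2=2.0, defects_per_cm2=0.04, n=12)
print(f"Cost per good die: ${cost:.2f}")
```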

Integrated Circuit Cost: redundancy as a way to raise yield. Given the tremendous price pressures on commodity products such as DRAM and SRAM, designers have included redundancy as a way to raise yield. DRAMs have regularly included some redundant memory cells so that a certain number of flaws can be accommodated. Designers have used similar techniques in both standard SRAMs and in large SRAM arrays used for caches within microprocessors. GPUs have 4 redundant processors out of 84 for the same reason. Obviously, the presence of redundant entries can be used to boost the yield significantly. Copyright © 2019, Elsevier Inc. All rights reserved.

Cost Versus Price The margin between the cost to manufacture a product and the price the product sells for has been shrinking. Those margins pay for a company's research and development (R&D), marketing, sales, manufacturing equipment maintenance, building rental, cost of financing, pretax profits, and taxes. Copyright © 2019, Elsevier Inc. All rights reserved.

Cost of Manufacturing Versus Cost of Operation Previously, cost meant the cost to build a computer and price meant the price to purchase a computer. With the advent of WSCs, we distinguish capital expenses (CAPEX), e.g., buying tens of thousands of servers, from operational expenses (OPEX), the cost to operate the computers. Copyright © 2019, Elsevier Inc. All rights reserved.

(CAPEX) & (OPEX) Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved.

Dependability Copyright © 2019, Elsevier Inc. All rights reserved. Previously, ICs were one of the most reliable components of a computer: although their pins could be vulnerable and faults could occur over communication channels, the failure rate inside the chip was very low. Now, with feature sizes of 16 nm and smaller, transient faults and permanent faults are becoming more commonplace.

Dependability Service level agreements (SLAs): an SLA can be used to decide whether the system was up or down. Copyright © 2019, Elsevier Inc. All rights reserved.

Dependability Systems alternate between two states: Service accomplishment: where the service is delivered as specified. Service interruption: where the delivered service is different from the SLA. Transitions between these two states are caused by failures (from state 1 to state 2) and restorations (from state 2 to state 1). Copyright © 2019, Elsevier Inc. All rights reserved.

Dependability Quantifying these transitions leads to the two main measures of dependability: Module reliability: a measure of continuous service accomplishment (the time to failure) from a reference initial instant. Module availability: a measure of service accomplishment with respect to the alternation between the two states of accomplishment and interruption. Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Dependability Module reliability Mean time to failure (MTTF) FIT (failures in time): the failure rate, 1/MTTF, generally reported as failures per billion hours of operation (FIT = 10^9 / MTTF in hours) Mean time to repair (MTTR) Mean time between failures (MTBF) = MTTF + MTTR Module availability = MTTF / MTBF = MTTF / (MTTF + MTTR) Dependability

Dependability Assume a disk subsystem with the following components and MTTF: 10 disks, each rated at 1,000,000-hour MTTF 1 ATA controller, 500,000-hour MTTF 1 power supply, 200,000-hour MTTF 1 fan, 200,000-hour MTTF 1 ATA cable, 1,000,000-hour MTTF Copyright © 2019, Elsevier Inc. All rights reserved.
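A minimal sketch of the usual calculation for this example: assuming independent, exponentially distributed failures, component failure rates (1/MTTF) simply add, and the system MTTF is the reciprocal of the sum:

```python
# Component counts and MTTFs (hours) from the slide above.
components = {
    "disk":           (10, 1_000_000),
    "ATA controller": (1,  500_000),
    "power supply":   (1,  200_000),
    "fan":            (1,  200_000),
    "ATA cable":      (1,  1_000_000),
}

# Assuming independent, exponentially distributed failures, rates add.
failure_rate = sum(count / mttf for count, mttf in components.values())
system_mttf = 1 / failure_rate

print(f"System failure rate: {failure_rate * 1e9:,.0f} FIT")  # failures per 10^9 hours
print(f"System MTTF: {system_mttf:,.0f} hours (~{system_mttf / 8760:.1f} years)")
```

With these numbers the failure rate is 23 per million hours (23,000 FIT), giving a system MTTF of roughly 43,500 hours, about 5 years.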

Dependability Redundancy: the primary way to cope with failure, either in time (repeat the operation to see if it is still erroneous) or in resources (have other components take over from the one that failed). Copyright © 2019, Elsevier Inc. All rights reserved.

Dependability Redundancy example Assume that one power supply is sufficient to run the disk subsystem and that we are adding one redundant power supply. With 2 power supplies and independent failures: the mean time until one of the two supplies fails is MTTF_supply / 2. MTTF_pair is the mean time until one power supply fails divided by the chance that the other will fail before the first one is replaced; the probability of a second failure is MTTR over the mean time until the other power supply fails, so MTTF_pair = MTTF_supply² / (2 × MTTR). Assuming 24 hours to notice that a power supply has failed and to replace it, the redundant pair is about 4150 times more reliable than a single power supply. Copyright © 2019, Elsevier Inc. All rights reserved.
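A small sketch of that arithmetic (MTTF_pair = MTTF_supply² / (2 × MTTR)), using the 200,000-hour supply MTTF from the previous example and the 24-hour repair time assumed above:

```python
mttf_supply = 200_000  # hours (power supply MTTF from the example)
mttr = 24              # hours to notice the failure and replace the supply

# Mean time until the first of the two supplies fails.
mttf_first = mttf_supply / 2

# Probability that the second supply fails before the first is replaced.
p_second_before_repair = mttr / mttf_supply

mttf_pair = mttf_first / p_second_before_repair  # = mttf_supply**2 / (2 * mttr)

print(f"MTTF of redundant pair: {mttf_pair:,.0f} hours")
print(f"Improvement over a single supply: ~{mttf_pair / mttf_supply:,.0f}x")  # ~4,150 with the text's rounding
```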

Copyright © 2019, Elsevier Inc. All rights reserved. Measuring Performance Typical performance metrics: Response time: execution time Throughput Speedup of X relative to Y = Execution time Y / Execution time X Execution time: the time between the start and the completion of an event Wall clock time: includes all system overheads (storage accesses, memory accesses, input/output activities, operating system, …) CPU time: only computation time Measuring Performance
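A trivial sketch of the speedup definition above; the two execution times are made-up values, not measurements from the text:

```python
def speedup_of_x_over_y(exec_time_y: float, exec_time_x: float) -> float:
    """Speedup of X relative to Y = Execution time Y / Execution time X."""
    return exec_time_y / exec_time_x

# Made-up example timings in seconds.
print(speedup_of_x_over_y(exec_time_y=12.0, exec_time_x=8.0))  # 1.5 -> X is 1.5x faster
```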

Benchmarks Kernels (e.g. matrix multiply) Toy programs (e.g. sorting) Synthetic benchmarks (e.g. Dhrystone) Benchmark suites (e.g. SPEC06fp, TPC-C) Standard test suites CPU tests Mathematical operations, compression, encryption, physics. 2D graphics tests Vectors, bitmaps, fonts, text, and GUI elements. 3D graphics tests DirectX 9 to DirectX 12 in 4K resolution. DirectCompute & OpenCL Disk tests Reading, writing & seeking within disk files + IOPS Memory tests Memory access speeds and latency Copyright © 2019, Elsevier Inc. All rights reserved.

Benchmarks Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved. Principles of Computer Design Take Advantage of Parallelism e.g. multiple processors, disks, memory banks, pipelining, multiple functional units (ILP, DLP, TLP, RLP) Principle of Locality Reuse of data and instructions; a program spends 90% of its execution time in only 10% of the code. Focus on the Common Case: energy, resource allocation, and performance. The instruction fetch and decode unit of a processor may be used much more frequently than a multiplier, so optimize it first. Amdahl's Law Principles

Amdahl’s Law Copyright © 2019, Elsevier Inc. All rights reserved.

Amdahl’s Law Copyright © 2019, Elsevier Inc. All rights reserved.
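The two slides above present Amdahl's Law as figures. As a minimal sketch of the formula from the text, Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced), with an arbitrary illustrative input:

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of execution time
    is sped up by `speedup_enhanced` (Amdahl's Law)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Arbitrary illustration: 40% of execution time sped up by 10x.
print(f"{amdahl_speedup(0.4, 10):.2f}x overall")  # ~1.56x, far less than 10x
```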

Copyright © 2019, Elsevier Inc. All rights reserved. Principles of Computer Design The Processor Performance Equation Principles

Copyright © 2019, Elsevier Inc. All rights reserved. Principles of Computer Design Principles Different instruction types have different CPIs

Principles of Computer Design Example: Suppose we made the following measurements: Frequency of FP operations = 25% Average CPI of FP operations = 4.0 Average CPI of other instructions = 1.33 Frequency of FSQRT = 2% CPI of FSQRT = 20 Compare these two designs: (1) decrease the CPI of FSQRT to 2; (2) decrease the average CPI of all FP operations to 2.5. Copyright © 2019, Elsevier Inc. All rights reserved.

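A hedged sketch working through the example above with the processor performance equation (CPU time = Instruction count × CPI × Clock cycle time; with the instruction count and clock fixed, comparing the designs reduces to comparing average CPI):

```python
# Measurements from the example above.
freq_fp, cpi_fp = 0.25, 4.0
freq_other, cpi_other = 0.75, 1.33
freq_fsqrt, cpi_fsqrt = 0.02, 20.0

cpi_original = freq_fp * cpi_fp + freq_other * cpi_other       # ~2.00

# Design 1: decrease the CPI of FSQRT from 20 to 2.
cpi_design1 = cpi_original - freq_fsqrt * (cpi_fsqrt - 2.0)     # ~1.64

# Design 2: decrease the average CPI of all FP operations to 2.5.
cpi_design2 = freq_fp * 2.5 + freq_other * cpi_other            # ~1.62

print(f"Original CPI: {cpi_original:.3f}")
print(f"Design 1 (FSQRT CPI = 2): CPI {cpi_design1:.3f}, speedup {cpi_original / cpi_design1:.2f}x")
print(f"Design 2 (FP CPI = 2.5):  CPI {cpi_design2:.3f}, speedup {cpi_original / cpi_design2:.2f}x")
# Design 2 is slightly better (~1.23x vs ~1.22x overall speedup).
```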

Fallacies and Pitfalls All exponential laws must come to an end Dennard scaling (constant power density) Stopped by threshold voltage Disk capacity 30-100% per year to 5% per year Moore’s Law Most visible with DRAM capacity ITRS disbanded Only four foundries left producing state-of-the-art logic chips 11 nm, 3 nm might be the limit Copyright © 2019, Elsevier Inc. All rights reserved.

Fallacies and Pitfalls Microprocessors are a silver bullet Performance is now a programmer’s burden Falling prey to Amdahl’s Law A single point of failure Hardware enhancements that increase performance also improve energy efficiency, or are at worst energy neutral Benchmarks remain valid indefinitely Compiler optimizations target benchmarks Copyright © 2019, Elsevier Inc. All rights reserved.

Fallacies and Pitfalls The rated mean time to failure of disks is 1,200,000 hours or almost 140 years, so disks practically never fail MTTF value from manufacturers assume regular replacement Peak performance tracks observed performance Fault detection can lower availability Not all operations are needed for correct execution Copyright © 2019, Elsevier Inc. All rights reserved.

Copyright © 2019, Elsevier Inc. All rights reserved.