EC6009-Advanced Computer Architecture UNIT I FUNDAMENTALS OF COMPUTER DESIGN 9 Review of Fundamentals of CPU, Memory and IO – Trends in technology, power, energy and cost, Dependability - Performance Evaluation UNIT II INSTRUCTION LEVEL PARALLELISM 9 ILP concepts – Pipelining overview - Compiler Techniques for Exposing ILP – Dynamic Branch Prediction – Dynamic Scheduling – Multiple instruction Issue – Hardware Based Speculation – Static scheduling - Multi-threading - Limitations of ILP – Case Studies . UNIT III DATA-LEVEL PARALLELISM 9 Vector architecture – SIMD extensions – Graphics Processing units – Loop level parallelism. UNIT IV THREAD LEVEL PARALLELISM 9 Symmetric and Distributed Shared Memory Architectures – Performance Issues –Synchronization – Models of Memory Consistency – Case studies: Intel i7 Processor, SMT & CMP Processors UNIT V MEMORY AND I/O 9 Cache Performance – Reducing Cache Miss Penalty and Miss Rate – Reducing Hit Time – Main Memory and Performance – Memory Technology. Types of Storage Devices – Buses – RAID – Reliability, Availability and Dependability – I/O Performance Measures. 26-Jun-19 VII-ECE-B
EC6009-Advanced Computer Architecture
On completion of the course, the students will be able to:
CO1  Explain the performance of different architectures with respect to various parameters (K2)
CO2  Describe the performance of different ILP techniques (K2)
CO3  Discuss the performance of different architectures and exploiting DLP (K2)
CO4  Illustrate the concepts of Transport level protocol. (K2)
CO5  Distinguish cache and memory related issues in multiprocessor. (K2)
CO/PO  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
CO1   3 2 1
CO2   3 2 1
CO3   3 2 1
CO4   3 2 1
CO5   3 2 2
C404  3 2 1 - - - - - - - -- - - -
PO 1 - Engineering knowledge
PO 2 - Problem analysis
PO 3 - Design / development of solutions
PO 4 - Conduct investigations of complex problems
PO 5 - Modern tool usage
PO 6 - Engineer and society
PO 7 - Environment and sustainability
PO 8 - Ethics
PO 9 - Individual and team-work
PO 10 - Communication
PO 11 - Project management and finance
PO 12 - Life-long learning
Outline
1.1 Introduction
1.2 Classes of Computers
1.3 Defining Computer Architecture
1.4 Trends in Technology
1.5 Trends in Power in Integrated Circuits
1.6 Trends in Cost
1.7 Dependability
1.8 Measuring, Reporting, and Summarizing Performance
1.9 Quantitative Principles of Computer Design
1.10 Putting It All Together: Performance and Price-Performance
Introduction: Computer Technology
Performance improvements come from:
  Improvements in semiconductor technology: feature size, clock speed.
  Improvements in computer architectures: enabled by HLL compilers and UNIX, these led to RISC (reduced instruction set computer) architectures.
Together these have enabled:
  Lightweight computers.
  Productivity-based managed/interpreted programming languages.
Introduction: Single Processor Performance
[Figure: single-processor performance over time, annotated "RISC" and "Move to multi-processor"]
Introduction: Current Trends in Architecture
Designers cannot continue to leverage instruction-level parallelism (ILP); single-processor performance improvement ended in 2003.
New models for performance:
  Data-level parallelism (DLP)
  Thread-level parallelism (TLP)
  Request-level parallelism (RLP)
These require explicit restructuring of the application.
Organization, Hardware, and Architecture
Organization: the high-level aspects of a computer's design, such as the memory system, the memory interconnect, and the design of the internal processor or CPU (arithmetic, logic, branching, and data transfer). For example, the AMD Opteron 64 and the Intel P4 have the same ISA but different internal pipeline and cache organizations.
Hardware: the detailed logic design and the packaging technology. For example, the P4 and the Mobile P4 have the same ISA and organization but different clock frequencies and memory systems.
Architecture: covers all three aspects of computer design: instruction set architecture, organization, and hardware.
The designer must meet functional requirements as well as price, power, performance, and availability goals.
Instruction Set Architecture: Critical Interface
Properties of a good abstraction:
  Lasts through many generations (portability)
  Used in many different ways (generality)
  Provides convenient functionality to higher levels
  Permits an efficient implementation at lower levels
[Figure: the instruction set as the interface between software and hardware]
Instruction Set Architecture (ISA)
The ISA is the actual programmer-visible instruction set. Its main dimensions are:
  Class of ISA: general-purpose register architectures (register-memory or load-store) and stack architectures.
  Memory addressing: for example, a program running on a 32-bit processor can address up to 4 GB (2^32 bytes) of address space.
  Addressing modes: direct, indirect, and others.
  Types and sizes of operands: the common types supported by an ISA include signed and unsigned integers and single- and double-precision floating-point numbers.
  Data processing and control flow instructions.
Classes of Computers
Personal Mobile Device (PMD): e.g. smart phones, tablet computers. Emphasis on energy efficiency and real-time performance.
Desktop Computing (workstations): emphasis on price-performance.
Servers (mainframes): emphasis on availability, scalability, and throughput.
Clusters / Warehouse-Scale Computers: used for "Software as a Service (SaaS)". Emphasis on availability and price-performance. Sub-class: supercomputers, with emphasis on floating-point performance and fast internal networks.
Embedded Computers: emphasis on price.
Trends in Technology
A successful new ISA may last decades, for example the IBM mainframe ISA.
Four critical technologies:
  Integrated circuit logic technology: transistor density increases by about 35% per year, quadrupling in somewhat over four years.
  Semiconductor DRAM (dynamic random-access memory): capacity increases by about 40% per year, doubling roughly every two years.
  Magnetic disk technology: a roller coaster of improvement rates; disks are 50 to 100 times cheaper per bit than DRAM.
  Network technology: network performance depends on the performance of both the switches and the transmission system.
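The growth rates quoted above can be checked with compound-growth arithmetic. The following is a minimal sketch, assuming a constant annual rate; the helper name `years_to_multiply` is introduced here for illustration only.

```python
import math


def years_to_multiply(annual_growth: float, factor: float) -> float:
    """Years needed to grow by `factor` at a constant compound annual rate."""
    return math.log(factor) / math.log(1.0 + annual_growth)


# Transistor density: about 35% per year, time to quadruple.
logic_quadruple_years = years_to_multiply(0.35, 4.0)

# DRAM capacity: about 40% per year, time to double.
dram_double_years = years_to_multiply(0.40, 2.0)

print(f"Logic density quadruples in about {logic_quadruple_years:.1f} years")
print(f"DRAM capacity doubles in about {dram_double_years:.1f} years")
```

Under these assumptions the results agree with "somewhat over four years" for quadrupling and "roughly every two years" for doubling.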
Scaling of Transistor Performance and Wires
Feature size: the minimum size of a transistor or a wire in either the x or y dimension. It shrank from 10 microns in 1971 to 0.09 microns (90 nm) in 2006.
Transistor density increases quadratically with a linear decrease in feature size.
Transistor performance improves linearly with decreasing feature size.
This improvement in transistor density let CPUs move quickly from 4-bit to 8-bit, to 16-bit, to 32-bit microprocessors.
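As a rough illustration of quadratic density scaling, the sketch below applies the idealized rule (density proportional to 1 / feature size squared) to the two feature sizes quoted above. It ignores every other effect on real chips, so the result is an upper-level estimate only.

```python
def density_gain(old_feature_um: float, new_feature_um: float) -> float:
    """Idealized density ratio: density scales as 1 / (feature size)^2."""
    return (old_feature_um / new_feature_um) ** 2


# 10 microns (1971) down to 0.09 microns (2006).
linear_shrink = 10.0 / 0.09
gain = density_gain(10.0, 0.09)

print(f"Linear shrink: about {linear_shrink:.0f}x")
print(f"Idealized density gain: about {gain:.0f}x")
```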
Performance Trends: Bandwidth over Latency
Bandwidth or throughput: the total amount of work done in a given time, such as megabytes per second for a disk transfer.
Latency or response time: the time between the start and the completion of an event, such as milliseconds for a disk access.
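Power
Power also poses challenges as devices are scaled.
Dynamic power (watts, W) in a CMOS chip: traditionally, the dominant energy consumption has been in switching transistors.
$$\text{Power}_{\text{dynamic}} = \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$$
For mobile devices: battery life matters more than power, so energy is the proper metric, measured in joules (J).
$$\text{Energy}_{\text{dynamic}} = \text{Capacitive load} \times \text{Voltage}^2$$
In modern VLSI, the total power is the sum
$$\text{Power}_{\text{total}} = \text{Power}_{\text{dynamic}} + \text{Power}_{\text{static}} + \text{Power}_{\text{leakage}}$$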
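The effect of voltage and frequency on dynamic power can be sketched numerically. The example assumes the standard CMOS relations Power_dynamic = 1/2 × capacitive load × voltage² × frequency switched and Energy_dynamic = capacitive load × voltage². The baseline values (1 nF, 1 V, 2 GHz) and the 15% reduction are hypothetical, chosen only to make the arithmetic visible.

```python
def dynamic_power(capacitive_load_f: float, voltage_v: float, frequency_hz: float) -> float:
    """Dynamic power in watts: 1/2 * C * V^2 * f."""
    return 0.5 * capacitive_load_f * voltage_v ** 2 * frequency_hz


def dynamic_energy(capacitive_load_f: float, voltage_v: float) -> float:
    """Dynamic energy in joules for one 0 -> 1 -> 0 transition: C * V^2."""
    return capacitive_load_f * voltage_v ** 2


# Hypothetical baseline: 1 nF switched capacitance, 1.0 V, 2 GHz.
load, volts, freq = 1e-9, 1.0, 2e9

# Reduce both voltage and frequency by 15%.
power_ratio = dynamic_power(load, 0.85 * volts, 0.85 * freq) / dynamic_power(load, volts, freq)
energy_ratio = dynamic_energy(load, 0.85 * volts) / dynamic_energy(load, volts)

print(f"Dynamic power falls to {power_ratio:.3f} of the original")
print(f"Dynamic energy falls to {energy_ratio:.4f} of the original")
```

Because voltage enters squared, lowering it reduces both power and energy, while lowering frequency alone reduces power but not the energy per transition.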
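Power
Static power: an important issue because leakage current flows even when a transistor is off.
$$\text{Power}_{\text{static}} = \text{Current}_{\text{static}} \times \text{Voltage}$$
Thus static power rises as the number of transistors increases, and it also rises as feature size shrinks (the reason is a VLSI topic).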
Silicon Wafer and Dies
Exponential cost decrease, with the technology basically the same: a wafer is tested and chopped into dies, which are then packaged.
[Figure: AMD K8 wafer and die, with the dies along the edge marked. Source: http://www.amd.com]
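Cost of an Integrated Circuit (IC)
$$\text{Cost of IC} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging and final test}}{\text{Final test yield}}$$
The cost of the die is the greater portion of the cost that varies between machines, and it is sensitive to die size:
$$\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}$$
$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}} - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}$$
The second term accounts for the number of dies along the edge.
$$\text{Die yield} = \text{Wafer yield} \times \left(1 + \frac{\text{Defects per unit area} \times \text{Die area}}{\alpha}\right)^{-\alpha}$$
Today's technology: $\alpha = 4.0$, defect density 0.4 to 0.8 per cm².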
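A worked example helps show how sensitive die cost is to die size. The sketch below assumes the usual textbook cost model (dies per wafer = wafer area / die area minus an edge-loss term; die yield = wafer yield × (1 + defect density × die area / α)^(−α)) with α = 4.0 and a defect density of 0.4 per cm² as quoted above. The 30 cm wafer, the 2.25 cm² die, the 100% wafer yield, and the $5000 wafer cost are hypothetical inputs.

```python
import math


def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """Whole-wafer die count minus the dies lost along the circular edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2.0) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2.0 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(defects_per_cm2: float, die_area_cm2: float,
              alpha: float = 4.0, wafer_yield: float = 1.0) -> float:
    """Fraction of good dies under the assumed yield model."""
    return wafer_yield * (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def cost_per_die(wafer_cost: float, dies: float, yield_fraction: float) -> float:
    """Wafer cost spread over the good dies only."""
    return wafer_cost / (dies * yield_fraction)


# Hypothetical inputs: 30 cm wafer, 1.5 cm x 1.5 cm die, $5000 wafer.
dies = dies_per_wafer(30.0, 2.25)
good_fraction = die_yield(0.4, 2.25)
die_cost = cost_per_die(5000.0, dies, good_fraction)

print(f"Dies per wafer: {dies:.1f}")
print(f"Die yield: {good_fraction:.3f}")
print(f"Cost per good die: ${die_cost:.2f}")
```

Doubling the die area in this model lowers both the die count and the yield, so the cost per good die grows faster than linearly with area.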
Response Time, Throughput, and Performance
Response time: the time between the start and the completion of an event, also referred to as execution time. This is what an individual computer user cares about.
Throughput: the total amount of work done in a given time. This is what the administrator of a large data processing center may care about.
In comparing design alternatives, the phrase "X is faster than Y" means that the response time or execution time is lower on X than on Y. In particular, "X is n times faster than Y" or "the throughput of X is n times higher than Y" means
$$n = \frac{\text{Execution time}_Y}{\text{Execution time}_X}$$
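Performance Measuring
Execution time is the reciprocal of performance, so
$$n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{1/\text{Performance}_Y}{1/\text{Performance}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$$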
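The "n times faster" relation can be checked with a small sketch. It assumes n is defined as the execution time on Y divided by the execution time on X, and that performance is the reciprocal of execution time; the 10 s and 15 s timings are hypothetical.

```python
def times_faster(exec_time_x_s: float, exec_time_y_s: float) -> float:
    """How many times faster X is than Y: time on Y divided by time on X."""
    return exec_time_y_s / exec_time_x_s


def performance(exec_time_s: float) -> float:
    """Performance as the reciprocal of execution time."""
    return 1.0 / exec_time_s


# Hypothetical timings: X takes 10 s, Y takes 15 s for the same program.
n_from_times = times_faster(10.0, 15.0)
n_from_performance = performance(10.0) / performance(15.0)

print(f"X is {n_from_times:.2f} times faster than Y")
```

Both routes give the same n, which is why the execution-time ratio and the performance ratio can be used interchangeably.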
Reliable Measure: User CPU Time
Response time may include disk accesses, memory accesses, input/output activities, CPU events, and operating system overhead: everything.
To get an accurate measure of processor performance, we use CPU time instead of response time. CPU time is the time the CPU spends computing for a program; it does not include time spent waiting for I/O or running other programs.
CPU time can be further divided into user CPU time (spent in the program) and system CPU time (spent in the OS).
Running the UNIX time command might report 90.7s 12.9s 2:39 65% (user CPU time, system CPU time, total elapsed response time, and the percentage of elapsed time that is CPU time).
In our performance measures we use user CPU time, because it is independent of the OS and other factors.
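The percentage in the time output above can be reproduced from the other three numbers. This sketch assumes the percentage is (user CPU + system CPU) / elapsed time; the function name `cpu_share` is illustrative.

```python
def cpu_share(user_cpu_s: float, system_cpu_s: float, elapsed_s: float) -> float:
    """Fraction of elapsed (wall-clock) time that was CPU time."""
    return (user_cpu_s + system_cpu_s) / elapsed_s


# Values from the example: 90.7 s user, 12.9 s system, 2:39 elapsed.
elapsed_s = 2 * 60 + 39
share = cpu_share(90.7, 12.9, elapsed_s)

print(f"CPU time is {share:.0%} of the elapsed time")
```

The remaining share of the elapsed time was spent waiting for I/O or running other programs.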
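CPU Performance
Essentially all computers are constructed using a clock running at a constant rate. The discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock cycles.
Clock rate: today measured in GHz.
Clock cycle time: clock cycle time = 1 / clock rate. For example, a 1 GHz clock rate gives a 1 ns cycle time.
Thus, the CPU time for a program can be expressed in two ways:
$$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$
or
$$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$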
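The two equivalent ways of computing CPU time (clock cycles × clock cycle time, or clock cycles / clock rate) can be sketched as follows. The cycle count of 2 billion is a hypothetical value; the 1 GHz clock matches the example above.

```python
def clock_cycle_time_s(clock_rate_hz: float) -> float:
    """Clock cycle time is the reciprocal of the clock rate."""
    return 1.0 / clock_rate_hz


def cpu_time_from_cycle_time(clock_cycles: float, cycle_time_s: float) -> float:
    """CPU time = clock cycles x clock cycle time."""
    return clock_cycles * cycle_time_s


def cpu_time_from_clock_rate(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time = clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


clock_rate_hz = 1e9            # 1 GHz
clock_cycles = 2_000_000_000   # hypothetical program length in cycles

cycle_time = clock_cycle_time_s(clock_rate_hz)
time_a = cpu_time_from_cycle_time(clock_cycles, cycle_time)
time_b = cpu_time_from_clock_rate(clock_cycles, clock_rate_hz)

print(f"Cycle time: {cycle_time * 1e9:.1f} ns")
print(f"CPU time: {time_a:.2f} s (both forms agree: {abs(time_a - time_b) < 1e-9})")
```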
CPU Performance
We can also count the number of instructions executed: the instruction path length or instruction count (IC). If we know the number of clock cycles and the IC, we can calculate the average number of clock cycles per instruction (CPI):
$$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{IC}}$$
Thus, clock cycles can be written as IC × CPI, which lets us use CPI in the execution time formula:
$$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time}$$
CPI as a figure of merit provides insight into different styles of instruction sets and implementations.
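The relation between cycles, instruction count, and CPI can be sketched numerically, assuming CPI = clock cycles / IC and CPU time = IC × CPI × clock cycle time. All three input values below are hypothetical.

```python
def cycles_per_instruction(clock_cycles: float, instruction_count: float) -> float:
    """Average CPI = total clock cycles / instruction count."""
    return clock_cycles / instruction_count


def cpu_time_s(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time = IC x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical program: 1 billion instructions, 1.5 billion cycles, 2 GHz clock.
instruction_count = 1.0e9
clock_cycles = 1.5e9
cycle_time = 1.0 / 2.0e9

cpi = cycles_per_instruction(clock_cycles, instruction_count)
exec_time = cpu_time_s(instruction_count, cpi, cycle_time)

print(f"CPI = {cpi:.2f}, CPU time = {exec_time:.2f} s")
```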
CPU Performance
Putting the pieces together, CPU time is
$$\text{CPU time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}}$$
An α% improvement in any one of the three pieces leads to an α% improvement in CPU time.
Unfortunately, it is difficult to change one parameter in complete isolation from the others, because the underlying technologies are interdependent:
  Clock cycle time: hardware technology and organization.
  CPI: organization and instruction set architecture.
  Instruction count: instruction set architecture and compiler technology.
Processor performance depends on three characteristics: instruction count, clock cycles per instruction, and clock cycle time (or rate). Computer architecture focuses on the CPI and IC parameters.
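CPU Performance
The total number of processor clock cycles can be calculated as
$$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i$$
CPU time can then be expressed as
$$\text{CPU time} = \left( \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i \right) \times \text{Clock cycle time}$$
and the overall CPI as
$$\text{CPI} = \frac{\sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i}{\text{IC}} = \sum_{i=1}^{n} \frac{\text{IC}_i}{\text{IC}} \times \text{CPI}_i$$
IC_i: the number of times instruction i is executed in a program.
CPI_i: the average number of clocks per instruction for instruction i.
IC_i / IC is the fraction of occurrences of that instruction in a program, which is useful in designing the processor.
Hint: CPI_i should be measured rather than looked up, because it must include pipeline effects, cache misses, and any other memory system inefficiencies.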
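The per-class CPI calculation can be sketched with a small instruction mix. The example assumes total cycles = Σ IC_i × CPI_i and overall CPI = Σ (IC_i / IC) × CPI_i. The three instruction classes, their counts, their CPI values, and the 1 GHz clock are all hypothetical, not measurements.

```python
# Hypothetical instruction mix: class -> (instruction count IC_i, CPI_i).
instruction_mix = {
    "alu":        (500_000_000, 1.0),
    "load_store": (300_000_000, 2.0),
    "branch":     (200_000_000, 3.0),
}

# Total instruction count and total clock cycles (sum of IC_i * CPI_i).
total_ic = sum(count for count, _ in instruction_mix.values())
total_cycles = sum(count * cpi for count, cpi in instruction_mix.values())

# Overall CPI computed two ways: cycles / IC, and frequency-weighted sum.
overall_cpi = total_cycles / total_ic
weighted_cpi = sum((count / total_ic) * cpi for count, cpi in instruction_mix.values())

# CPU time at an assumed 1 GHz clock (1 ns cycle time).
cycle_time_s = 1.0 / 1.0e9
cpu_time_s = total_cycles * cycle_time_s

print(f"Overall CPI: {overall_cpi:.2f}")
print(f"CPU time: {cpu_time_s:.2f} s")
```

The frequency-weighted form makes it easy to see which instruction class dominates: here branches are only 20% of the instructions but contribute as many cycles as the load/store class.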