Memory Design and Some Other Concepts of Computer Architecture


About This Presentation

This presentation covers memory design along with the major topics of Computer Architecture, organized chapter by chapter, such as Computer Abstractions and Technology (Moore's Law, levels of program code, and the components of a computer).


Slide Content

COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface
RISC-V Edition
Chapter 1
Computer Abstractions and Technology

Chapter 1 —Computer Abstractions and Technology —2
The Computer Revolution
nProgress in computer technology
nUnderpinned by Moore’s Law
nMakes novel applications feasible
nComputers in automobiles
nCell phones
nHuman genome project
nWorld Wide Web
nSearch Engines
nComputers are pervasive
§1.1 Introduction

Single Processor Performance
Copyright © 2019, Elsevier Inc. All rights reserved.
Introduction

Copyright © 2019, Elsevier Inc. All rights reserved.
Current Trends in Architecture
nCannot continue to leverage Instruction-Level parallelism (ILP)
nSingle processor performance improvement ended in 2003
nNew models for performance:
nData-level parallelism (DLP)
nThread-level parallelism (TLP)
nRequest-level parallelism (RLP)
nThese require explicit restructuring of the application
Introduction

Chapter 1 —Computer Abstractions and Technology —5
Classes of Computers
nPersonal computers
nGeneral purpose, variety of software
nSubject to cost/performance tradeoff
nServer computers
nNetwork based
nHigh capacity, performance, reliability
nRange from small servers to building sized

Classes of Computers
nSupercomputers
nHigh-end scientific and engineering calculations
nHighest capability but represent a small fraction of the overall
computer market
nTera Bytes of Memory !!!
nEmbedded computers
nHidden as components of systems
nStringent power/performance/cost constraints
Chapter 1 —Computer Abstractions and Technology —6

Chapter 1 —Computer Abstractions and Technology —7
The PostPC Era - HW

Chapter 1 —Computer Abstractions and Technology —8
The PostPC Era - OS

The PostPC Era
Chapter 1 —Computer Abstractions and Technology —9
nPersonal Mobile Device (PMD)
nBattery operated
nConnects to the Internet
nHundreds of dollars
nSmart phones, tablets, electronic glasses
nCloud computing
nWarehouse Scale Computers (WSC)
nSoftware as a Service (SaaS)
nPortion of software run on a PMD and a portion run in
the Cloud
nAmazon and Google

Copyright © 2019, Elsevier Inc. All rights reserved.
Parallelism
nClasses of parallelism in applications:
nData-Level Parallelism (DLP)
nTask-Level Parallelism (TLP)
nClasses of architectural parallelism:
nInstruction-Level Parallelism (ILP)
nVector architectures/Graphic Processor Units (GPUs)
nThread-Level Parallelism
nRequest-Level Parallelism
Classes of Computers

Copyright © 2019, Elsevier Inc. All rights reserved.
Flynn’s Taxonomy
nSingle instruction stream, single data stream (SISD)
nSingle instruction stream, multiple data streams (SIMD)
nVector architectures
nMultimedia extensions
nGraphics processor units
nMultiple instruction streams, single data stream (MISD)
nNo commercial implementation
nMultiple instruction streams, multiple data streams (MIMD)
- Tightly-coupled MIMD → exploit thread-level parallelism
- Loosely-coupled MIMD → exploit request-level parallelism → little need for communication between tasks
Classes of Computers

Copyright © 2019, Elsevier Inc. All rights reserved.
Defining Computer Architecture
n“Old” view of computer architecture:
nInstruction Set Architecture (ISA) design
ni.e. decisions regarding:
nregisters, memory addressing, addressing modes, instruction operands,
available operations, control flow instructions, instruction encoding
n“Real” computer architecture:
nSpecific requirements of the target machine
nDesign to maximize performance within constraints: cost,
power, and availability
nIncludes ISA, microarchitecture, hardware
Defining Computer Architecture

Eight Great Ideas
- Design for Moore's Law → close to end or already dead!
nUse abstraction to simplify design
nMake the common case fast
nPerformance via parallelism
nPerformance via pipelining
nPerformance via prediction
nHierarchy of memories
nDependability via redundancy
Chapter 1 —Computer Abstractions and Technology —13
§1.2 Eight Great Ideas in Computer Architecture

Moore’s Law (Transistor Count)
Copyright © 2019, Elsevier Inc. All rights reserved.
Introduction

Chapter 1 —Computer Abstractions and Technology —15
Below Your Program
nApplication software
nWritten in high-level language
nSystem software
nCompiler: translates HLL code to
machine code
nOperating System: service code
nHandling input/output
nManaging memory and storage
nScheduling tasks & sharing resources
nHardware
nProcessor, memory, I/O controllers
§1.3 Below Your Program

Chapter 1 —Computer Abstractions and Technology —16
Levels of Program Code
nHigh-level language
nLevel of abstraction closer to problem
domain
nProvides for productivity and
portability
nAssembly language
nTextual representation of instructions
nHardware representation
nBinary digits (bits)
nEncoded instructions and data

Chapter 1 —Computer Abstractions and Technology —17
Components of a Computer
nSame components for
all kinds of computer
nDesktop, server,
embedded
nInput/output includes
nUser-interface devices
nDisplay, keyboard, mouse
nStorage devices
nHard disk, CD/DVD, flash
nNetwork adapters
nFor communicating with
other computers
§1.4 Under the Covers
The BIG Picture

Chapter 1 —Computer Abstractions and Technology —18
Opening the Box

Opening the Box: Server

Opening the Box: MacBook Air

Opening the Box: iPhone 4s

Chapter 1 —Computer Abstractions and Technology —23
Inside the Processor (CPU)
nDatapath: performs operations on data
nControl: sequences datapath, memory, ...
nCache memory
nSmall fast SRAM memory for immediate access to data

Chapter 1 —Computer Abstractions and Technology —24
Inside the Processor
nApple A5

Chapter 1 —Computer Abstractions and Technology —25
Inside the Processor
nApple A14 Bionic

Chapter 1 —Computer Abstractions and Technology —26
Inside the Processor
nApple A5

Chapter 1 —Computer Abstractions and Technology —27

Chapter 1 —Computer Abstractions and Technology —28

Chapter 1 —Computer Abstractions and Technology —29

A15 Vs. A14 Vs. A13
Chapter 1 —Computer Abstractions and Technology —30

Process Technology
Chapter 1 —Computer Abstractions and Technology —31

Chapter 1 —Computer Abstractions and Technology —32
Abstractions
nAbstraction helps us deal with complexity
nHide lower-level detail
nInstruction set architecture (ISA)
nThe hardware/software interface
nApplication binary interface
nThe ISA plus system software interface
nImplementation
nThe details underlying the interface
The BIG Picture

Abstractions

Chapter 1 —Computer Abstractions and Technology —34
A Safe Place for Data
nVolatile main memory
nLoses instructions and data when power off
nNon-volatile secondary memory
nMagnetic disk
nFlash memory
nOptical disk (CDROM, DVD)
nReRAM, STTRAM, PCM, …

Copyright © 2019, Elsevier Inc. All rights reserved.
Trends in Technology
Four implementation technologies are crucial for modern implementations:
nIntegrated circuit technology (Moore’s Law)
nTransistor density: 35%/year
nDie size: 10-20%/year
nIntegration overall: 40-55%/year (or doubling every 2 years)
nDRAM capacity: 25-40%/year (slowing)
n8 Gb (2014), 16 Gb (2019), possibly no 32 Gb
nFlash capacity: 50-60%/year (doubling every two years)
nStandard storage in PMDs
n8-10X cheaper/bit than DRAM
nMagnetic disk capacity: recently slowed to 5%/year
nDensity increases may no longer be possible, maybe increase from 7 to 9 platters
n8-10X cheaper/bit than Flash
n200-300X cheaper/bit than DRAM
Trends in Technology

Copyright © 2019, Elsevier Inc. All rights reserved.
Bandwidth and Latency Trends
nBandwidth or throughput
nTotal work done in a given time
n32,000-40,000X improvement for processors
n300-1200X improvement for memory and disks
nLatency or response time
nTime between start and completion of an event
n50-90X improvement for processors
n6-8X improvement for memory and disks
Trends in Technology

Copyright © 2019, Elsevier Inc. All rights reserved.
Bandwidth and Latency
Log-log plot of bandwidth and latency milestones
Trends in Technology
Increase in BW over time has
changed for microprocessors,
DRAM, and disk which are
affected by Moore’s law
For network, the continued
improvement is because of
advances in optics

Copyright © 2019, Elsevier Inc. All rights reserved.
Transistors and Wires
nFeature size
nMinimum size of transistor or wire in x or y dimension
n10 microns in 1971 to 0.016 micron (16nm) in 2017
n7nm and 5 nm in 2022
nTransistor performance scales linearly
nWire delay does not improve with feature size!
nIntegration density scales quadratically
Trends in Technology

Copyright © 2019, Elsevier Inc. All rights reserved.
Power and Energy Trends
nProblem: Get power in, get power out
nThermal Design Power (TDP)
nCharacterizes sustained power consumption
nUsed as target for power supply and cooling system
nLower than peak power (1.5X higher), higher than average power
consumption
nClock rate can be reduced dynamically to limit power consumption
nEnergy per task is often a better measurement
Trends in Power and Energy

Copyright © 2019, Elsevier Inc. All rights reserved.
Dynamic Energy and Power
nDynamic energy
nTransistor switch from 0 -> 1 or 1 -> 0
n½ x Capacitive load x Voltage2
nDynamic power (or switching power)
n½ x Capacitive load x Voltage2 x Frequency switched
nReducing clock rate reduces power, not energy
nStatic power (or leakage power)
nStatic current x voltage
Trends in Power and Energy

Chapter 1 —Computer Abstractions and Technology —41
Reducing Power - Example
- Suppose a new CPU has
  - 85% of the capacitive load of the old CPU
  - 15% voltage and 15% frequency reduction

P_new / P_old = [(0.85 × C_old) × (0.85 × V_old)² × (0.85 × F_old)] / [C_old × V_old² × F_old] = 0.85⁴ ≈ 0.52

- The power wall
  - We can't reduce voltage further
  - We can't remove more heat
  - How else can we improve performance?
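As a quick check of the arithmetic above, here is a minimal Python sketch of the dynamic-power ratio (the 0.85 scaling factors are the values given in the example; the baseline values are arbitrary units):

# Dynamic power = 1/2 x capacitive load x voltage^2 x frequency
def dynamic_power(capacitance, voltage, frequency):
    return 0.5 * capacitance * voltage ** 2 * frequency

c_old, v_old, f_old = 1.0, 1.0, 1.0   # arbitrary baseline units
p_old = dynamic_power(c_old, v_old, f_old)
p_new = dynamic_power(0.85 * c_old, 0.85 * v_old, 0.85 * f_old)
print(p_new / p_old)                  # 0.85**4 ≈ 0.52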

Copyright © 2019, Elsevier Inc. All rights reserved.
Power Trends
nIntel 80386 consumed ~
2 W
n3.3 GHz Intel Core i7
consumes 130 W
nHeat must be dissipated
from 1.5 x 1.5 cm chip
nThis is the limit of what
can be cooled by air
Trends in Power and Energy

Copyright © 2019, Elsevier Inc. All rights reserved.
Reducing Power
nTechniques for reducing power:
nDo nothing well
nDynamic Voltage-Frequency Scaling
nLow power state for DRAM, disks (PMDs, laptops, …)
nOverclocking, turning off cores
Trends in Power and Energy

Copyright © 2019, Elsevier Inc. All rights reserved.
Static (Leakage) Power
nStatic power consumption
n25-50% of total power
nCurrentstatic x Voltage
nScales with number of transistors
nTo reduce: power gating (inactivate modules)
Trends in Power and Energy

Copyright © 2019, Elsevier Inc. All rights reserved.
Trends in Cost
nCost driven down by learning curve
nYield
nDRAM: price closely tracks cost
nMicroprocessors: price depends on volume
n10% less for each doubling of volume
Trends in Cost

Chapter 1 —Computer Abstractions and Technology —48
Understanding Performance
Key factors affecting performance:
nAlgorithm
nDetermines number of operations executed
nProgramming language, compiler, architecture
nDetermine number of machine instructions executed per operation
nProcessor and memory system
nDetermine how fast instructions are executed
nI/O system (including OS)
nDetermines how fast I/O operations are executed

Chapter 1 —Computer Abstractions and Technology —49
Defining Performance
- Which airplane has the best performance?
[Bar charts compare the Douglas DC-8-50, BAC/Sud Concorde, Boeing 747, and Boeing 777 on four measures: passenger capacity, cruising range (miles), cruising speed (mph), and passengers × mph.]
§1.6 Performance

Chapter 1 —Computer Abstractions and Technology —50
Response Time and Throughput
nResponse time
nHow long it takes to do a task
nThroughput
nTotal work done per unit time
ne.g., tasks/transactions/… per hour
nHow are response time and throughput affected by
nReplacing the processor with a faster version?
nAdding more processors?
nWe’ll focus on response time for now…

Chapter 1 —Computer Abstractions and Technology —51
Relative Performance
- Define Performance = 1 / Execution Time
- "X is n times faster than Y":
  Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n
- Example: time taken to run a program
  - 10 s on A, 15 s on B
  - Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
  - So A is 1.5 times faster than B

Chapter 1 —Computer Abstractions and Technology —52
Measuring Execution Time
nElapsed time
nTotal response time, including all aspects
nProcessing, I/O, OS overhead, idle time
nDetermines system performance
nCPU time
nTime spent processing a given job
nDiscounts I/O time, other jobs’ shares
nComprises user CPU time and system CPU time
nDifferent programs are affected differently by CPU and system
performance

Chapter 1 —Computer Abstractions and Technology —53
CPU Clocking
- Operation of digital hardware is governed by a constant-rate clock
[Figure: clock waveform showing, within each cycle, data transfer and computation followed by a state update; the clock period is one full cycle.]
- Clock period: duration of a clock cycle
  - e.g., 250 ps = 0.25 ns = 250×10⁻¹² s
- Clock frequency (rate): cycles per second
  - e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz

Clock Cycle Example
T = 1 / f  (T is the clock period, the duration of one clock cycle.)
f = 1 / T  (f is the clock frequency or rate, i.e., the number of clock cycles per second.)
MHz = 10⁶ Hz
GHz = 10⁹ Hz
When we say a computer system runs at a 1 GHz frequency, what is the value of T (the duration of a clock cycle) on that system?
f = 1 / T  →  10⁹ Hz = 1 / T  →  T = 10⁻⁹ s  →  T = 1 ns
Chapter 1 —Computer Abstractions and Technology —54

Chapter 1 —Computer Abstractions and Technology —55
CPU Time
CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate
- Performance improved by
  - Reducing number of clock cycles
  - Increasing clock rate
- Hardware designer must often trade off clock rate against cycle count

Chapter 1 —Computer Abstractions and Technology —56
CPU Time Example
- Computer A: 2 GHz clock, 10 s CPU time
- Designing Computer B
  - Aim for 6 s CPU time
  - Can do a faster clock, but it causes 1.2 × the clock cycles
- How fast must Computer B's clock be?

Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6 s
Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20 × 10⁹
Clock Rate_B = (1.2 × 20 × 10⁹) / 6 s = (24 × 10⁹) / 6 s = 4 GHz
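The same relations (Clock Cycles = CPU Time × Clock Rate, Clock Rate = Clock Cycles / CPU Time) can be checked with a short Python sketch of this example:

# Computer A: 2 GHz clock, 10 s CPU time; B must run in 6 s with 1.2x the cycles.
cycles_a = 10.0 * 2e9          # CPU Time_A x Clock Rate_A = 20e9 cycles
cycles_b = 1.2 * cycles_a      # B needs 20% more cycles
rate_b = cycles_b / 6.0        # Clock Rate_B = Clock Cycles_B / CPU Time_B
print(rate_b / 1e9, "GHz")     # 4.0 GHz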

Chapter 1 —Computer Abstractions and Technology —57
Instruction Count and CPI
Clock Cycles = Instruction Count × Cycles per Instruction
CPU Time = Instruction Count × CPI × Clock Cycle Time
         = Instruction Count × CPI / Clock Rate
- Instruction count for a program
  - Determined by program, ISA and compiler
- Average cycles per instruction (CPI)
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix

Chapter 1 —Computer Abstractions and Technology —58
CPI Example
- Computer A: Cycle Time = 250 ps, CPI = 2.0
- Computer B: Cycle Time = 500 ps, CPI = 1.2
- Same ISA
- Which is faster, and by how much?

CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250 ps = 500 ps × I   ← A is faster…
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500 ps = 600 ps × I
CPU Time_B / CPU Time_A = (600 ps × I) / (500 ps × I) = 1.2                             ← …by this much
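A minimal Python sketch of this comparison (the instruction count I cancels out of the ratio, so any value works):

I = 1_000_000                   # same instruction count on both machines
time_a = I * 2.0 * 250e-12      # IC x CPI_A x Cycle Time_A
time_b = I * 1.2 * 500e-12      # IC x CPI_B x Cycle Time_B
print(time_b / time_a)          # 1.2 -> A is 1.2 times faster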

Chapter 1 —Computer Abstractions and Technology —59
CPI in More Detail
- If different instruction classes take different numbers of cycles:

  Clock Cycles = Σ (CPI_i × Instruction Count_i)  for i = 1 … n

- Weighted average CPI:

  CPI = Clock Cycles / Instruction Count = Σ (CPI_i × Instruction Count_i / Instruction Count)  for i = 1 … n

  where Instruction Count_i / Instruction Count is the relative frequency of class i

Chapter 1 —Computer Abstractions and Technology —60
CPI Example
- Alternative compiled code sequences using instructions in classes A, B, C

  Class             A   B   C
  CPI for class     1   2   3
  IC in sequence 1  2   1   2
  IC in sequence 2  4   1   1

- Sequence 1: IC = 5
  - Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  - Avg. CPI = 10/5 = 2.0
- Sequence 2: IC = 6
  - Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  - Avg. CPI = 9/6 = 1.5
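The weighted-average CPI calculation above can be sketched directly in Python:

cpi = {"A": 1, "B": 2, "C": 3}            # CPI for each instruction class
seq1 = {"A": 2, "B": 1, "C": 2}           # instruction counts, sequence 1
seq2 = {"A": 4, "B": 1, "C": 1}           # instruction counts, sequence 2

def clock_cycles_and_avg_cpi(mix):
    cycles = sum(cpi[c] * n for c, n in mix.items())
    return cycles, cycles / sum(mix.values())

print(clock_cycles_and_avg_cpi(seq1))     # (10, 2.0)
print(clock_cycles_and_avg_cpi(seq2))     # (9, 1.5)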

Chapter 1 —Computer Abstractions and Technology —61
Performance Summary
The BIG Picture

CPU Time = Instructions/Program × Clock cycles/Instruction × Seconds/Clock cycle

- Performance depends on
  - Algorithm: affects IC, possibly CPI
  - Programming language: affects IC, CPI
  - Compiler: affects IC, CPI
  - Instruction set architecture: affects IC, CPI, Tc

Chapter 1 —Computer Abstractions and Technology —62
SPEC CPU Benchmark
nPrograms used to measure performance
nSupposedly typical of actual workload
nStandard Performance Evaluation Corp (SPEC)
nDevelops benchmarks for CPU, I/O, Web, …
nSPEC CPU2006
nElapsed time to execute a selection of programs
nNegligible I/O, so focuses on CPU performance
nNormalize relative to reference machine
nSummarize as geometric mean of performance ratios
nCINT2006 (integer) and CFP2006 (floating-point)
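A small sketch of the summary step described above, the geometric mean of the per-benchmark performance ratios against the reference machine. The ratio values here are made up purely for illustration, not taken from any SPEC report:

import math

spec_ratios = [12.3, 9.8, 15.1, 11.0]   # hypothetical SPECratios (reference time / run time)
geo_mean = math.prod(spec_ratios) ** (1 / len(spec_ratios))
print(geo_mean)                         # single-number summary of the suite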

Chapter 1 —Computer Abstractions and Technology —63
CINT2006 for Intel Core i7 920

Chapter 1 —Computer Abstractions and Technology —64
SPECpower_ssj2008 for Xeon X5650

Chapter 1 —Computer Abstractions and Technology —65
Pitfall: Amdahl's Law
- Improving an aspect of a computer and expecting a proportional improvement in overall performance
§1.10 Fallacies and Pitfalls

T_improved = T_affected / improvement factor + T_unaffected

- Example: multiply accounts for 80 s out of 100 s
  - How much improvement in multiply performance to get 5× overall?
    20 = 80/n + 20  →  Can't be done!
- Corollary: make the common case fast

Amdahl's Law

Speedup = Execution time before enhancement / Execution time after enhancement

Speedup = 1 / ((1 − f) + f / improvement factor)

where f is the fraction of execution time affected by the enhancement
Chapter 1 —Computer Abstractions and Technology —66

Chapter 1 —Computer Abstractions and Technology —67
Amdahl’s Law Example
nSuppose that we want to enhance the processor used for Web
serving. The new processor is 10 times faster on computation
in the Web serving application than the original processor.
Assuming that the original processor is busy with computation
40% of the time and is writing for I/O 60% of the time, what is
the overall speedup gained by incorporating the enhancement.
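A worked sketch of this example using the Amdahl's Law form above (speedup = 1 / ((1 − f) + f / s)):

def overall_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Computation is 40% of the time and becomes 10x faster; the 60% spent on I/O is unchanged.
print(overall_speedup(0.4, 10))    # ≈ 1.56 overall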

Chapter 1 —Computer Abstractions and Technology —68
Amdahl’s Law Example
nA common transformation required in graphics engines is square root.
Implementation of floating-point (FP) square root varies significantly in
performance, especially among processors designed for graphics.
Suppose FP square root (FPSQR) is responsible for 20% of the execution
time of a critical graphics benchmark.
1.One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of
10.
2.The other alternative is just to try to make all FP instructions in graphics processor run faster
by a factor of 1.6;
- FP instructions are responsible for a total of 50% of the execution time for the application. Compare the speedup of these two design alternatives.
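The two alternatives can be compared with the same Amdahl's Law helper; a short sketch of this example:

def overall_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

fpsqr_only = overall_speedup(0.20, 10)    # speed up FPSQR (20% of time) by 10x
all_fp = overall_speedup(0.50, 1.6)       # speed up all FP (50% of time) by 1.6x
print(fpsqr_only, all_fp)                 # ≈ 1.22 vs ≈ 1.23: the all-FP option wins slightly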

Chapter 1 —Computer Abstractions and Technology —69
Pitfall: MIPS as a Performance Metric
- MIPS: Millions of Instructions Per Second
- Doesn't account for
  - Differences in ISAs between computers
  - Differences in complexity between instructions

MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)

- CPI varies between programs on a given CPU
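A short sketch of why MIPS can mislead; the numbers below are illustrative, not from the slides. Two programs on the same 4 GHz machine get different MIPS ratings simply because their instruction mixes give different CPIs:

clock_rate = 4e9

def mips(cpi):
    return clock_rate / (cpi * 1e6)

print(mips(1.0))   # 4000 MIPS for a program with CPI 1.0
print(mips(2.0))   # 2000 MIPS for a program with CPI 2.0 on the very same CPU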

Example
nConsider the following performance measurements for a
program:
nWhich computer has the higher MIPS rating?
nWhich computer is faster?
Chapter 1 —Computer Abstractions and Technology —70

Example
Chapter 1 —Computer Abstractions and Technology —71

Chapter 1 —Computer Abstractions and Technology —72
Concluding Remarks
nCost/performance is improving
nDue to underlying technology development
nHierarchical layers of abstraction
nIn both hardware and software
nInstruction set architecture
nThe hardware/software interface
nExecution time: the best performance measure
nPower is a limiting factor
nUse parallelism to improve performance
§1.11 Concluding Remarks

COMPUTERORGANIZATIONANDDESIGN
The Hardware/Software Interface
RISC-V
Edition
Pipeline Processor
Introduction

Chapter 4 —The Processor —2
Introduction
nCPU performance factors
nInstruction count
nDetermined by ISA and compiler
nCPI and Cycle time
nDetermined by CPU hardware
nWe will examine two RISC-V implementations
nA simplified version
nA more realistic pipelined version
nSimple subset, shows most aspects
nMemory reference: ld, sd
nArithmetic/logical: add, sub, and, or
nControl transfer: beq
§4.1 Introduction

Chapter 4 —The Processor —3
Instruction Execution
- PC → instruction memory, fetch instruction
- Register numbers → register file, read registers
- Depending on instruction class
  - Use ALU to calculate
    - Arithmetic result
    - Memory address for load/store
    - Branch comparison
  - Access data memory for load/store
- PC ← target address or PC + 4

Chapter 4 —The Processor —4
CPU Overview

Chapter 4 —The Processor —5
Multiplexers
nCan’t just join
wires together
nUse multiplexers

Chapter 4 —The Processor —6
Control

Chapter 4 —The Processor —7
Logic Design Basics
§4.2 Logic Design Conventions
nInformation encoded in binary
nLow voltage = 0, High voltage = 1
nOne wire per bit
nMulti-bit data encoded on multi-wire buses
nCombinational element
nOperate on data
nOutput is a function of input
nState (sequential) elements
nStore information

Chapter 4 —The Processor —8
Combinational Elements
- AND gate: Y = A & B
- Multiplexer: Y = S ? I1 : I0
- Adder: Y = A + B
- Arithmetic/Logic Unit: Y = F(A, B)

Chapter 4 —The Processor —9
Sequential Elements
nRegister: stores data in a circuit
nUses a clock signal to determine when to
update the stored value
nEdge-triggered: update when Clk changes
from 0 to 1
[Figure: D flip-flop, with data input D, clock input Clk, and output Q, alongside its timing diagram.]

Chapter 4 —The Processor —10
Sequential Elements
nRegister with write control
nOnly updates on clock edge when write
control input is 1
nUsed when stored value is required later
[Figure: register with a write-control input; D is copied to Q on the clock edge only when Write = 1.]

Chapter 4 —The Processor —11
Clocking Methodology
nCombinational logic transforms data during
clock cycles
nBetween clock edges
nInput from state elements, output to state
element
nLongest delay determines clock period

Chapter 4 —The Processor —12
Building a Datapath
nDatapath
nElements that process data and addresses
in the CPU
nRegisters, ALUs, mux’s, memories, …
nWe will build a RISC-V datapath
incrementally
nRefining the overview design
§4.3 Building a Datapath

Chapter 4 —The Processor —13
Instruction Fetch
64-bit
register
Increment by
4 for next
instruction

Chapter 4 —The Processor —14
R-Format Instructions
nRead two register operands
nPerform arithmetic/logical operation
nWrite register result

Chapter 4 —The Processor —15
Load/Store Instructions
nRead register operands
nCalculate address using 12-bit offset
nUse ALU, but sign-extend offset
nLoad: Read memory and update register
nStore: Write register value to memory

Chapter 4 —The Processor —16
Branch Instructions
nRead register operands
nCompare operands
nUse ALU, subtract and check Zero output
nCalculate target address
nSign-extend displacement
nShift left 1 place (halfword displacement)
nAdd to PC value

Chapter 4 —The Processor —17
Branch Instructions
Just
re-routes
wires
Sign-bit wire
replicated

Chapter 4 —The Processor —18
Composing the Elements
nFirst-cut data path does an instruction in
one clock cycle
nEach datapath element can only do one
function at a time
nHence, we need separate instruction and data
memories
nUse multiplexers where alternate data
sources are used for different instructions

Chapter 4 —The Processor —19
R-Type/Load/Store Datapath

Chapter 4 —The Processor —20
Full Datapath

Chapter 4 —The Processor —21
ALU Control
nALU used for
nLoad/Store: F = add
nBranch: F = subtract
nR-type: F depends on opcode
§4.4 A Simple Implementation Scheme
ALU control  Function
0000         AND
0001         OR
0010         add
0110         subtract

Chapter 4 —The Processor —22
ALU Control
nAssume 2-bit ALUOp derived from opcode
nCombinational logic derives ALU control
opcode   ALUOp  Operation        Opcode field  ALU function  ALU control
ld       00     load register    XXXXXXXXXXX   add           0010
sd       00     store register   XXXXXXXXXXX   add           0010
beq      01     branch on equal  XXXXXXXXXXX   subtract      0110
R-type   10     add              100000        add           0010
                subtract         100010        subtract      0110
                AND              100100        AND           0000
                OR               100101        OR            0001

Chapter 4 —The Processor —23
The Main Control Unit
nControl signals derived from instruction

Chapter 4 —The Processor —24
Datapath With Control

Chapter 4 —The Processor —25
R-Type Instruction

Chapter 4 —The Processor —26
Load Instruction

Chapter 4 —The Processor —27
BEQ Instruction

Chapter 4 —The Processor —28
Performance Issues
nLongest delay determines clock period
nCritical path: load instruction
- Instruction memory → register file → ALU → data memory → register file
nNot feasible to vary period for different
instructions
nViolates design principle
nMaking the common case fast
nWe will improve performance by pipelining

Chapter 4 —The Processor —29
Pipelining Analogy
nPipelined laundry: overlapping execution
nParallelism improves performance
§4.5 An Overview of Pipelining
- Four loads:
  - Speedup = 8/3.5 = 2.3
- Non-stop:
  - Speedup = 2n / (0.5n + 1.5) ≈ 4 = number of stages

Chapter 4 —The Processor —30
RISC-V Pipeline
nFive stages, one step per stage
1.IF: Instruction fetch from memory
2.ID: Instruction decode & register read
3.EX: Execute operation or calculate address
4.MEM: Access memory operand
5.WB: Write result back to register

Chapter 4 —The Processor —31
Pipeline Performance
nAssume time for stages is
n100ps for register read or write
n200ps for other stages
nCompare pipelined datapath with single-cycle
datapath
Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
ld        200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
sd        200 ps       100 ps         200 ps  200 ps                         700 ps
R-format  200 ps       100 ps         200 ps                 100 ps          600 ps
beq       200 ps       100 ps         200 ps                                 500 ps
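Using the stage times above, a small sketch of the throughput difference: the pipelined clock is set by the slowest stage (200 ps), while the single-cycle clock is set by the slowest instruction (800 ps). The instruction count is an illustrative choice:

n = 1_000_000                               # instructions
stages = 5
single_cycle_time = n * 800                 # ps: every instruction takes the full 800 ps
pipelined_time = (n + stages - 1) * 200     # ps: one instruction completes per 200 ps cycle once the pipe is full
print(single_cycle_time / pipelined_time)   # ≈ 4x for large n (ignoring hazards)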

Chapter 4 —The Processor —32
Pipeline Performance
Single-cycle (Tc= 800ps)
Pipelined (Tc= 200ps)

Chapter 4 —The Processor —33
Pipeline Speedup
- If all stages are balanced
  - i.e., all take the same time
  - Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of stages
- If not balanced, speedup is less
- Speedup is due to increased throughput
  - Latency (time for each instruction) does not decrease

Chapter 4 —The Processor —34
Pipelining and ISA Design
nRISC-V ISA designed for pipelining
nAll instructions are 32-bits
nEasier to fetch and decode in one cycle
nc.f. x86: 1-to 17-byte instructions
nFew and regular instruction formats
nCan decode and read registers in one step
- Load/store addressing
  - Can calculate address in 3rd stage, access memory in 4th stage

Chapter 4 —The Processor —35
Hazards
nSituations that prevent starting the next
instruction in the next cycle
nStructure hazards
nA required resource is busy
nData hazard
nNeed to wait for previous instruction to
complete its data read/write
nControl hazard
nDeciding on control action depends on
previous instruction

Chapter 4 —The Processor —36
Structure Hazards
nConflict for use of a resource
nIn RISC-V pipeline with a single memory
nLoad/store requires data access
- Instruction fetch would have to stall for that cycle
nWould cause a pipeline “bubble”
nHence, pipelined datapaths require
separate instruction/data memories
nOr separate instruction/data caches

Chapter 4 —The Processor —37
Data Hazards
nAn instruction depends on completion of
data access by a previous instruction
add x19, x0, x1
sub x2, x19, x3

Chapter 4 —The Processor —38
Forwarding (aka Bypassing)
nUse result when it is computed
nDon’t wait for it to be stored in a register
nRequires extra connections in the datapath

Chapter 4 —The Processor —39
Load-Use Data Hazard
nCan’t always avoid stalls by forwarding
nIf value not computed when needed
nCan’t forward backward in time!

Chapter 4 —The Processor —40
Code Scheduling to Avoid Stalls
- Reorder code to avoid use of a load result in the next instruction
- C code for a = b + e; c = b + f;

Original order (13 cycles, two load-use stalls):
  ld  x1, 0(x0)
  ld  x2, 8(x0)
  (stall)
  add x3, x1, x2
  sd  x3, 24(x0)
  ld  x4, 16(x0)
  (stall)
  add x5, x1, x4
  sd  x5, 32(x0)

Reordered (11 cycles, no stalls):
  ld  x1, 0(x0)
  ld  x2, 8(x0)
  ld  x4, 16(x0)
  add x3, x1, x2
  sd  x3, 24(x0)
  add x5, x1, x4
  sd  x5, 32(x0)

Chapter 4 —The Processor —41
Control Hazards
nBranch determines flow of control
nFetching next instruction depends on branch
outcome
nPipeline can’t always fetch correct instruction
nStill working on ID stage of branch
nIn RISC-V pipeline
nNeed to compare registers and compute
target early in the pipeline
nAdd hardware to do it in ID stage

Control Hazards
Chapter 4 —The Processor —42

Chapter 4 —The Processor —43
Stall on Branch
nWait until branch outcome determined
before fetching next instruction

Chapter 4 —The Processor —44
Branch Prediction
nLonger pipelines can’t readily determine
branch outcome early
nStall penalty becomes unacceptable
nPredict outcome of branch
nOnly stall if prediction is wrong
nIn RISC-V pipeline
nCan predict branches not taken
nFetch instruction after branch, with no delay

Chapter 4 —The Processor —45
More-Realistic Branch Prediction
nStatic branch prediction
nBased on typical branch behavior
nExample: loop and if-statement branches
nPredict backward branches taken
nPredict forward branches not taken
nDynamic branch prediction
nHardware measures actual branch behavior
ne.g., record recent history of each branch
nAssume future behavior will continue the trend
nWhen wrong, stall while re-fetching, and update history

Chapter 4 —The Processor —46
Pipeline Summary
nPipelining improves performance by
increasing instruction throughput
nExecutes multiple instructions in parallel
nEach instruction has the same latency
nSubject to hazards
nStructure, data, control
nInstruction set design affects complexity of
pipeline implementation
The BIG Picture

Chapter 4 —The Processor —47
RISC-V Pipelined Datapath
§4.6 Pipelined Datapath and Control
WB
MEM
Right-to-left
flow leads to
hazards

Chapter 4 —The Processor —48
Pipeline registers
nNeed registers between stages
nTo hold information produced in previous cycle

Chapter 4 —The Processor —49
Pipeline Operation
nCycle-by-cycle flow of instructions through
the pipelined datapath
n“Single-clock-cycle” pipeline diagram
nShows pipeline usage in a single cycle
nHighlight resources used
nc.f. “multi-clock-cycle” diagram
nGraph of operation over time
nWe’ll look at “single-clock-cycle” diagrams
for load & store

Chapter 4 —The Processor —50
IF for Load, Store, …

Chapter 4 —The Processor —51
ID for Load, Store, …

Chapter 4 —The Processor —52
EX for Load

Chapter 4 —The Processor —53
MEM for Load

Chapter 4 —The Processor —54
WB for Load
Wrong
register
number

Chapter 4 —The Processor —55
Corrected Datapath for Load
The write register number comes from the MEM/WB pipeline register. The register number is passed from the ID stage along the pipeline, which adds another 5 bits to the pipeline registers (the last three registers).

Chapter 4 —The Processor —56
EX for Store

Chapter 4 —The Processor —57
MEM for Store

Chapter 4 —The Processor —58
WB for Store

Chapter 4 —The Processor —59
Multi-Cycle Pipeline Diagram
nForm showing resource usage

Chapter 4 —The Processor —60
Multi-Cycle Pipeline Diagram
nTraditional form

Chapter 4 —The Processor —61
Single-Cycle Pipeline Diagram
nState of pipeline in a given cycle

Chapter 4 —The Processor —62
Pipelined Control (Simplified)

Chapter 4 —The Processor —63
Pipelined Control
nControl signals derived from instruction
nAs in single-cycle implementation

Chapter 4 —The Processor —64
Pipelined Control

Chapter 4 —The Processor —65
Data Hazards in ALU Instructions
nConsider this sequence:
sub x2, x1,x3
and x12,x2,x5
or x13,x6,x2
add x14,x2,x2
sd x15,100(x2)
nWe can resolve hazards with forwarding
nHow do we detect when to forward?
§4.7 Data Hazards: Forwarding vs. Stalling

Chapter 4 —The Processor —66
Dependencies & Forwarding

Pipelined Registers
Chapter 4 —The Processor —67
IF/ID ID/EX EX/MEM MEM/WB
IF ID EX MEM WB

Detecting the Need to Forward
add x1, x2, x3
sub x4, x1, x6
--------
add x1, x2, x3
sub x4, x6, x1
ld x1, 0(x3)
sub x4, x1, x6
-------
ld x1, 0(x3)
sub x4, x6, x1
Chapter 4 —The Processor —68
Fwd from
EX/MEM
pipeline reg
Fwd from
MEM/WB
pipeline reg

Chapter 4 —The Processor —69
Detecting the Need to Forward
nPass register numbers along pipeline
ne.g., ID/EX.RegisterRs1 = register number for Rs1
sitting in ID/EX pipeline register
nALU operand register numbers in EX stage
are given by
nID/EX.RegisterRs1, ID/EX.RegisterRs2
nData hazards when
1a.EX/MEM.RegisterRd = ID/EX.RegisterRs1
1b.EX/MEM.RegisterRd = ID/EX.RegisterRs2
2a.MEM/WB.RegisterRd = ID/EX.RegisterRs1
2b.MEM/WB.RegisterRd = ID/EX.RegisterRs2
Fwd from
EX/MEM
pipeline reg
Fwd from
MEM/WB
pipeline reg

Chapter 4 —The Processor —70
Detecting the Need to Forward
nBut only if forwarding instruction will write
to a register!
nEX/MEM.RegWrite, MEM/WB.RegWrite
nAnd only if Rd for that instruction is not x0
nEX/MEM.RegisterRd ≠ 0,
MEM/WB.RegisterRd ≠ 0

Forwarding Paths
Chapter 4 —The Processor —71
rs1
rs2
rd

Chapter 4 —The Processor —72
Forwarding Conditions
Mux control    Source   Explanation
ForwardA = 00  ID/EX    The first ALU operand comes from the register file.
ForwardA = 10  EX/MEM   The first ALU operand is forwarded from the prior ALU result.
ForwardA = 01  MEM/WB   The first ALU operand is forwarded from data memory or an earlier ALU result.
ForwardB = 00  ID/EX    The second ALU operand comes from the register file.
ForwardB = 10  EX/MEM   The second ALU operand is forwarded from the prior ALU result.
ForwardB = 01  MEM/WB   The second ALU operand is forwarded from data memory or an earlier ALU result.
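A minimal Python model of the ForwardA logic implied by this table and the conditions on the previous slides (ForwardB is identical but uses Rs2). The function and argument names are mine, chosen to mirror the ID/EX, EX/MEM, and MEM/WB field names used here; this is a sketch of the conditions, not a hardware description:

def forward_a(id_ex_rs1, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    # EX hazard has priority (most recent result): forward from EX/MEM.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs1:
        return 0b10
    # MEM hazard: forward from MEM/WB.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs1:
        return 0b01
    return 0b00                      # no hazard: read the register file

# add x1,x2,x3 followed by sub x4,x1,x6: EX/MEM.Rd = 1 matches Rs1 = 1 -> 0b10
print(bin(forward_a(1, 1, True, 0, False)))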

Chapter 4 —The Processor —73
Double Data Hazard
nConsider the sequence:
add x1,x1,x2
add x1,x1,x3
add x1,x1,x4
nBoth hazards occur
nWant to use the most recent
nRevise MEM hazard condition
nOnly fwd if EX hazard condition isn’t true

Double Data Hazard
Chapter 4 —The Processor —74

Chapter 4 —The Processor —75
Datapath with Forwarding

Chapter 4 —The Processor —76
Load-Use Hazard Detection
nCheck when using instruction is decoded
in ID stage
nALU operand register numbers in ID stage
are given by
nIF/ID.RegisterRs1, IF/ID.RegisterRs2
- Load-use hazard when
  - ID/EX.MemRead and
    ((ID/EX.RegisterRd = IF/ID.RegisterRs1) or
     (ID/EX.RegisterRd = IF/ID.RegisterRs2))
- If detected, stall and insert bubble
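A sketch of this hazard check in Python (the names mirror the pipeline-register fields above; purely illustrative):

def load_use_hazard(id_ex_memread, id_ex_rd, if_id_rs1, if_id_rs2):
    # Stall if the instruction in EX is a load whose destination matches
    # either source register of the instruction being decoded in ID.
    return id_ex_memread and id_ex_rd in (if_id_rs1, if_id_rs2)

# ld x1, 0(x3) in EX, sub x4, x1, x6 in ID -> stall and insert a bubble
print(load_use_hazard(True, 1, 1, 6))   # True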

Chapter 4 —The Processor —77
How to Stall the Pipeline
nForce control values in ID/EX register
to 0
nEX, MEM and WB do nop(no-operation)
nPrevent update of PC and IF/ID register
nUsing instruction is decoded again
nFollowing instruction is fetched again
n1-cycle stall allows MEM to read data for ld
nCan subsequently forward to EX stage

Chapter 4 —The Processor —78
Load-Use Data Hazard
Stall inserted
here

Chapter 4 —The Processor —79
Datapath with Hazard Detection

Chapter 4 —The Processor —80
Stalls and Performance
nStalls reduce performance
nBut are required to get correct results
nCompiler can arrange code to avoid
hazards and stalls
nRequires knowledge of the pipeline structure
The BIG Picture

Chapter 4 —The Processor —81
Branch Hazards
nIf branch outcome determined in MEM
§4.8 Control Hazards
PC
Flush these
instructions
(Set control
values to 0)

Chapter 4 —The Processor —82
Reducing Branch Delay
nMove hardware to determine outcome to ID
stage
nTarget address adder
nRegister comparator
nExample: branch taken
36: sub x10, x4, x8
40: beq x1, x3, 16 // PC-relative branch
// to 40+16*2=72
44: and x12, x2, x5
48: or  x13, x2, x6
52: add x14, x4, x2
56: sub x15, x6, x7
...
72: ld x4, 50(x7)

Chapter 4 —The Processor —83
Example: Branch Taken

Chapter 4 —The Processor —84
Example: Branch Taken

Chapter 4 —The Processor —85
Dynamic Branch Prediction
nIn deeper and superscalar pipelines, branch
penalty is more significant
nUse dynamic prediction
nBranch prediction buffer (aka branch history table)
nIndexed by recent branch instruction addresses
nStores outcome (taken/not taken)
nTo execute a branch
nCheck table, expect the same outcome
nStart fetching from fall-through or target
nIf wrong, flush pipeline and flip prediction

Branch Prediction Buffer
Chapter 4 —The Processor —86

1-Bit Predictor
Chapter 4 —The Processor —87

Chapter 4 —The Processor —88
1-Bit Predictor: Shortcoming
nInner loop branches mispredicted twice!
outer: …

inner: …

beq …, …, inner

beq …, …, outer
nMispredict as taken on last iteration of
inner loop
nThen mispredict as not taken on first
iteration of inner loop next time around

Chapter 4 —The Processor —89
2-Bit Predictor
- Only change prediction on two successive mispredictions
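A minimal sketch of a 2-bit saturating-counter predictor with this behavior (the state encoding and starting state are my own choices):

class TwoBitPredictor:
    # States 0-1 predict not taken, 2-3 predict taken; the prediction only
    # flips after two successive mispredictions.
    def __init__(self):
        self.state = 0

    def predict(self):
        return self.state >= 2          # True = predict taken

    def update(self, taken):
        self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:
    print(p.predict(), outcome)         # prediction vs. actual outcome
    p.update(outcome)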

Chapter 4 —The Processor —90
Calculating the Branch Target
nEven with predictor, still need to calculate
the target address
n1-cycle penalty for a taken branch
nBranch target buffer
nCache of target addresses
nIndexed by PC when instruction fetched
nIf hit and instruction is branch predicted taken, can
fetch target immediately

Branch Target Buffer
Chapter 4 —The Processor —91

Chapter 4 —The Processor —92
Exceptions and Interrupts
n“Unexpected” events requiring change
in flow of control
nDifferent ISAs use the terms differently
nException
nArises within the CPU
ne.g., undefined opcode, syscall, …
nInterrupt
nFrom an external I/O controller
nDealing with them without sacrificing
performance is hard
§4.9 Exceptions

Chapter 4 —The Processor —93
Handling Exceptions
nSave PC of offending (or interrupted) instruction
nIn RISC-V: Supervisor Exception Program Counter
(SEPC)
nSave indication of the problem
nIn RISC-V: Supervisor Exception Cause Register
(SCAUSE)
n64 bits, but most bits unused
nException code field: 2 for undefined opcode, 12 for hardware
malfunction, …
nJump to handler
nAssume at 0000 0000 1C09 0000hex

Chapter 4 —The Processor —94
An Alternate Mechanism
nVectored Interrupts
nHandler address determined by the cause
nException vector address to be added to a
vector table base register:
- Undefined opcode: 00 0100 0000 (binary)
- Hardware malfunction: 01 1000 0000 (binary)
- …: …
nInstructions either
nDeal with the interrupt, or
nJump to real handler

Chapter 4 —The Processor —95
Handler Actions
nRead cause, and transfer to relevant
handler
nDetermine action required
nIf restartable
nTake corrective action
nuse SEPC to return to program
nOtherwise
nTerminate program
nReport error using SEPC, SCAUSE, …

Chapter 4 —The Processor —96
Exceptions in a Pipeline
nAnother form of control hazard
nConsider malfunction on add in EX stage
add x1, x2, x1
nPrevent x1 from being clobbered
nComplete previous instructions
nFlush addand subsequent instructions
nSet SEPC and SCAUSE register values
nTransfer control to handler
nSimilar to mispredicted branch
nUse much of the same hardware

Chapter 4 —The Processor —97
Pipeline with Exceptions

Chapter 4 —The Processor —98
Exception Properties
nRestartable exceptions
nPipeline can flush the instruction
nHandler executes, then returns to the
instruction
nRefetched and executed from scratch
nPC saved in SEPC register
nIdentifies causing instruction

Chapter 4 —The Processor —99
Exception Example
- Exception on add in
  40 sub x11, x2, x4
  44 and x12, x2, x5
  48 or  x13, x2, x6
  4c add x1, x2, x1
  50 sub x15, x6, x7
  54 ld  x16, 100(x7)

- Handler
  1C090000 sd x26, 1000(x10)
  1C090004 sd x27, 1008(x10)

Chapter 4 —The Processor —100
Exception Example

Chapter 4 —The Processor —101
Exception Example

Chapter 4 —The Processor —102
Multiple Exceptions
nPipelining overlaps multiple instructions
nCould have multiple exceptions at once
nSimple approach: deal with exception from
earliest instruction
nFlush subsequent instructions
n“Precise” exceptions
nIn complex pipelines
nMultiple instructions issued per cycle
nOut-of-order completion
nMaintaining precise exceptions is difficult!

Chapter 4 —The Processor —103
Imprecise Exceptions
nJust stop pipeline and save state
nIncluding exception cause(s)
nLet the handler work out
nWhich instruction(s) had exceptions
nWhich to complete or flush
nMay require “manual” completion
nSimplifies hardware, but more complex handler
software
nNot feasible for complex multiple-issue
out-of-order pipelines

Chapter 4 —The Processor —104
Instruction-Level Parallelism (ILP)
nPipelining: executing multiple instructions in
parallel
- To increase ILP
  - Deeper pipeline
    - Less work per stage ⇒ shorter clock cycle
  - Multiple issue
    - Replicate pipeline stages ⇒ multiple pipelines
    - Start multiple instructions per clock cycle
    - CPI < 1, so use Instructions Per Cycle (IPC)
    - E.g., 4 GHz 4-way multiple-issue
      - 16 BIPS, peak CPI = 0.25, peak IPC = 4
    - But dependencies reduce this in practice
§4.10
Parallelism via Instructions

Chapter 4 —The Processor —105
Multiple Issue
nStatic multiple issue
nCompiler groups instructions to be issued together
nPackages them into “issue slots”
nCompiler detects and avoids hazards
nDynamic multiple issue
nCPU examines instruction stream and chooses
instructions to issue each cycle
nCompiler can help by reordering instructions
nCPU resolves hazards using advanced techniques at
runtime

Chapter 4 —The Processor —106
Speculation
n“Guess” what to do with an instruction
nStart operation as soon as possible
nCheck whether guess was right
nIf so, complete the operation
nIf not, roll-back and do the right thing
nCommon to static and dynamic multiple issue
nExamples
nSpeculate on branch outcome
nRoll back if path taken is different
nSpeculate on load
nRoll back if location is updated

Chapter 4 —The Processor —107
Compiler/Hardware Speculation
nCompiler can reorder instructions
ne.g., move load before branch
nCan include “fix-up” instructions to recover
from incorrect guess
nHardware can look ahead for instructions
to execute
nBuffer results until it determines they are
actually needed
nFlush buffers on incorrect speculation

Chapter 4 —The Processor —108
Speculation and Exceptions
nWhat if exception occurs on a
speculatively executed instruction?
ne.g., speculative load before null-pointer
check
nStatic speculation
nCan add ISA support for deferring exceptions
nDynamic speculation
nCan buffer exceptions until instruction
completion (which may not occur)

Chapter 4 —The Processor —109
Static Multiple Issue
nCompiler groups instructions into “issue
packets”
nGroup of instructions that can be issued on a
single cycle
nDetermined by pipeline resources required
nThink of an issue packet as a very long
instruction
nSpecifies multiple concurrent operations
- ⇒ Very Long Instruction Word (VLIW)

Chapter 4 —The Processor —110
Scheduling Static Multiple Issue
nCompiler must remove some/all hazards
nReorder instructions into issue packets
nNo dependencies with a packet
nPossibly some dependencies between
packets
nVaries between ISAs; compiler must know!
nPad with nop if necessary

Chapter 4 —The Processor —111
RISC-V with Static Dual Issue
nTwo-issue packets
nOne ALU/branch instruction
nOne load/store instruction
n64-bit aligned
nALU/branch, then load/store
nPad an unused instruction with nop
Address  Instruction type  Pipeline stages
n        ALU/branch        IF  ID  EX  MEM  WB
n + 4    Load/store        IF  ID  EX  MEM  WB
n + 8    ALU/branch            IF  ID  EX   MEM  WB
n + 12   Load/store            IF  ID  EX   MEM  WB
n + 16   ALU/branch                IF  ID   EX   MEM  WB
n + 20   Load/store                IF  ID   EX   MEM  WB

Chapter 4 —The Processor —112
RISC-V with Static Dual Issue

Chapter 4 —The Processor —113
Hazards in the Dual-Issue RISC-V
nMore instructions executing in parallel
nEX data hazard
nForwarding avoided stalls with single-issue
nNow can’t use ALU result in load/store in same packet
nadd x10, x0, x1
ld x2, 0(x10)
nSplit into two packets, effectively a stall
nLoad-use hazard
nStill one cycle use latency, but now two instructions
nMore aggressive scheduling required

Chapter 4 —The Processor —114
Scheduling Example
- Schedule this for dual-issue RISC-V

Loop: ld   x31, 0(x20)     // x31 = array element
      add  x31, x31, x21   // add scalar in x21
      sd   x31, 0(x20)     // store result
      addi x20, x20, -8    // decrement pointer
      blt  x22, x20, Loop  // branch if x22 < x20

       ALU/branch           Load/store        Cycle
Loop:  nop                  ld x31, 0(x20)    1
       addi x20, x20, -8    nop               2
       add  x31, x31, x21   nop               3
       blt  x22, x20, Loop  sd x31, 8(x20)    4

- IPC = 5/4 = 1.25 (c.f. peak IPC = 2)

Chapter 4 —The Processor —115
Loop Unrolling
nReplicate loop body to expose more
parallelism
nReduces loop-control overhead
nUse different registers per replication
nCalled “register renaming”
nAvoid loop-carried “anti-dependencies”
nStore followed by a load of the same register
nAka “name dependence”
nReuse of a register name

Chapter 4 —The Processor —116
Loop Unrolling Example
nIPC = 14/8 = 1.75
nCloser to 2, but at cost of registers and code size
ALU/branchLoad/storecycle
Loop:addix20,x20,-32ldx28, 0(x20)1
nopldx29, 24(x20)2
add x28,x28,x21ldx30, 16(x20)3
add x29,x29,x21ldx31, 8(x20)4
add x30,x30,x21sdx28, 32(x20)5
add x31,x31,x21sdx29, 24(x20)6
nopsdx30, 16(x20)7
bltx22,x20,Loopsdx31, 8(x20)8

Chapter 4 —The Processor —117
Dynamic Multiple Issue
n“Superscalar” processors
nCPU decides whether to issue 0, 1, 2, …
each cycle
nAvoiding structural and data hazards
nAvoids the need for compiler scheduling
nThough it may still help
nCode semantics ensured by the CPU

Chapter 4 —The Processor —118
Dynamic Pipeline Scheduling
nAllow the CPU to execute instructions out
of order to avoid stalls
nBut commit result to registers in order
nExample
ld x31,20(x21)
add x1,x31,x2
sub x23,x23,x3
andi x5,x23,20
- Can start sub while add is waiting for ld

Chapter 4 —The Processor —119
Dynamically Scheduled CPU
[Figure: dynamically scheduled CPU. Reservation stations hold pending operands and preserve dependencies; the reorder buffer holds results for in-order register writes and can supply operands to issued instructions; results are also sent to any waiting reservation stations.]

Chapter 4 —The Processor —120
Register Renaming
nReservation stations and reorder buffer
effectively provide register renaming
nOn instruction issue to reservation station
nIf operand is available in register file or
reorder buffer
nCopied to reservation station
nNo longer required in the register; can be
overwritten
nIf operand is not yet available
nIt will be provided to the reservation station by a
function unit
nRegister update may not be required

Chapter 4 —The Processor —121
Speculation
nPredict branch and continue issuing
nDon’t commit until branch outcome
determined
nLoad speculation
nAvoid load and cache miss delay
nPredict the effective address
nPredict loaded value
nLoad before completing outstanding stores
nBypass stored values to load unit
nDon’t commit load until speculation cleared

Chapter 4 —The Processor —122
Why Do Dynamic Scheduling?
nWhy not just let the compiler schedule
code?
- Not all stalls are predictable
ne.g., cache misses
nCan’t always schedule around branches
nBranch outcome is dynamically determined
nDifferent implementations of an ISA have
different latencies and hazards

Chapter 4 —The Processor —123
Does Multiple Issue Work?
nYes, but not as much as we’d like
nPrograms have real dependencies that limit ILP
nSome dependencies are hard to eliminate
ne.g., pointer aliasing
nSome parallelism is hard to expose
nLimited window size during instruction issue
nMemory delays and limited bandwidth
nHard to keep pipelines full
nSpeculation can help if done well
The BIG Picture

Chapter 4 —The Processor —124
Power Efficiency
nComplexity of dynamic scheduling and
speculations requires power
nMultiple simpler cores may be better
Microprocessor  Year  Clock Rate  Pipeline Stages  Issue width  Out-of-order/Speculation  Cores  Power
i486            1989  25 MHz      5                1            No                        1      5 W
Pentium         1993  66 MHz      5                2            No                        1      10 W
Pentium Pro     1997  200 MHz     10               3            Yes                       1      29 W
P4 Willamette   2001  2000 MHz    22               3            Yes                       1      75 W
P4 Prescott     2004  3600 MHz    31               3            Yes                       1      103 W
Core            2006  2930 MHz    14               4            Yes                       2      75 W
UltraSparc III  2003  1950 MHz    14               4            No                        1      90 W
UltraSparc T1   2005  1200 MHz    6                1            No                        8      70 W

Cortex A53 and Intel i7
Processor                      ARM A53                          Intel Core i7 920
Market                         Personal Mobile Device           Server, cloud
Thermal design power           100 milliwatts (1 core @ 1 GHz)  130 watts
Clock rate                     1.5 GHz                          2.66 GHz
Cores/chip                     4 (configurable)                 4
Floating point?                Yes                              Yes
Multiple issue?                Dynamic                          Dynamic
Peak instructions/clock cycle  2                                4
Pipeline stages                8                                14
Pipeline schedule              Static in-order                  Dynamic out-of-order with speculation
Branch prediction              Hybrid                           2-level
1st level caches/core          16-64 KiB I, 16-64 KiB D         32 KiB I, 32 KiB D
2nd level caches/core          128-2048 KiB                     256 KiB (per core)
3rd level caches (shared)      (platform dependent)             2-8 MB
Chapter 4 —The Processor —125
§4.11 Real Stuff: The ARM Cortex
-A53 and Intel Core i7 Pipelines

Chapter 4 —The Processor —126
Fallacies
nPipelining is easy (!)
nThe basic idea is easy
nThe devil is in the details
ne.g., detecting data hazards
nPipelining is independent of technology
nSo why haven’t we always done pipelining?
nMore transistors make more advanced techniques
feasible
nPipeline-related ISA design needs to take account of
technology trends
ne.g., predicated instructions
§4.14 Fallacies and Pitfalls

Chapter 4 —The Processor —127
Pitfalls
nPoor ISA design can make pipelining
harder
ne.g., complex instruction sets (VAX, IA-32)
nSignificant overhead to make pipelining work
nIA-32 micro-op approach
ne.g., complex addressing modes
nRegister update side effects, memory indirection
ne.g., delayed branches
nAdvanced pipelines have long delay slots

Chapter 4 —The Processor —128
Concluding Remarks
nISA influences design of datapath and control
nDatapath and control influence design of ISA
nPipelining improves instruction throughput
using parallelism
nMore instructions completed per second
nLatency for each instruction not reduced
nHazards: structural, data, control
nMultiple issue and dynamic scheduling (ILP)
nDependencies limit achievable parallelism
nComplexity leads to the power wall
§4.14 Concluding Remarks

Chapter 5
Large and Fast:
Exploiting Memory
Hierarchy

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —2
Memory Technology
- Internal
  - Static RAM (SRAM) → volatile
    - 0.5 ns – 2.5 ns, $500 – $1000 per GB (in 2012)
  - Dynamic RAM (DRAM) → volatile
    - 50 ns – 70 ns, $10 – $20 per GB (in 2012)
- External (secondary)
  - Flash memory → nonvolatile → 100-1000x faster than disk
    - 5,000 – 50,000 ns, $0.75 – $1 per GB (in 2012)
  - Magnetic disk memory (HDD) → nonvolatile
    - 5 ms – 20 ms, $0.05 – $0.10 per GB (in 2012)
§5.1 Introduction

Memory Technologies
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —3

Memory Hierarchy
- What is a memory hierarchy?
  - A structure that uses multiple levels of memories
  - As the distance from the processor increases, the size of the memories and the access time both increase
  - Faster memories are more expensive per bit than the slower ones and thus are smaller
- Ideal memory:
  - As large as the largest level
  - As fast as the smallest level

Memory Hierarchy
nBasic structure of memory hierarchy
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —5

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —6
Principle of Locality
- States that programs access a small proportion of their address space at any time
- Temporal locality (locality in time)
  - Items accessed recently are likely to be accessed again soon
  - e.g., instructions in a loop
- Spatial locality (locality in space)
  - Items near those accessed recently are likely to be accessed soon
  - e.g., sequential instruction access, array data

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —7
Taking Advantage of Locality
nImplementing memory of a computer as a memory
hierarchy
nStore everything on disk
nCopy recently accessed (and nearby) items from disk to
smaller DRAM memory
nMain memory
nCopy more recently accessed (and nearby) items from
DRAM to smaller SRAM memory
nCache memory is attached to CPU

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —8
Memory Hierarchy Levels
nBlock (aka line): unit of copying
nMay be multiple words
nIf accessed data is present in
upper level
nHit: access satisfied by upper level
nHit ratio: hits/accesses
nIf accessed data is absent
nMiss: block copied from lower level
nTime taken: miss penalty
nMiss ratio: misses/accesses
= 1 –hit ratio
nThen accessed data supplied from
upper level

Memory Hierarchy Levels
- Hit time:
  - The time required to access a level of the memory hierarchy, including the time needed to determine whether the access is a hit or not
- Hit rate:
  - The fraction of memory accesses found in a level of the memory hierarchy
- Miss rate:
  - The fraction of memory accesses not found in a level of the memory hierarchy

Memory Hierarchy Levels
- Miss penalty:
  - The time required to fetch a block from a lower level of the memory hierarchy into a higher level, including
    - the time to access the block,
    - transmit it from one level to the other,
    - insert it into the level where the miss happened,
    - and pass the block back to the processor
- Hit time is much smaller than the time to access the next level of memory

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —11
Cache Memory
nCache memory
nThe level of the memory hierarchy closest to the CPU
nGiven accesses X1, …, Xn–1, Xn
Following is a simple cache structure:
§5.2 The Basics of Caches
nHow do we know if
the data is present
in cache?
nWhere do we look?

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —12
Direct Mapped Cache
nLocation determined by address
nDirect mapped: only one choice
n(Block address) modulo (#Blocks in cache)
n# of Blocks is
a power of 2
nUse low-order
address bits
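As a side note (not from the slides), the mapping rule above can be written directly in code; the minimal Python sketch below, with illustrative names, shows that the modulo reduces to the low-order address bits when the block count is a power of two.

```python
def direct_mapped_index(block_address: int, num_blocks: int) -> int:
    """Cache index a memory block maps to in a direct-mapped cache.

    Because num_blocks is a power of 2, the modulo is just the
    low-order log2(num_blocks) bits of the block address.
    """
    assert num_blocks & (num_blocks - 1) == 0, "num_blocks must be a power of 2"
    return block_address % num_blocks   # same as block_address & (num_blocks - 1)

# Example: memory block 22 in an 8-block cache maps to index 22 % 8 = 6 (binary 110).
print(direct_mapped_index(22, 8))       # 6
```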

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —13
Tags and Valid Bits
nHow do we know which particular block is
stored in a cache location?
nStore block address as well as the data
nActually, only need the high-order bits
nWhich is called the tag
nWhat if there is no data in a location?
nValid bit: 1 = present, 0 = not present
nInitially 0

Cache Example
nBelow is a sequence of nine memory references to an empty eight-block cache. Show the contents of the cache after each access.
22, 26, 22, 26, 16, 3, 16, 18, 16
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —14

Cache Example
nBelow is a sequence of nine memory references to an empty eight-block cache. Show the contents of the cache after each access.
22, 26, 22, 26, 16, 3, 16, 18, 16
Solution:
Since there are eight blocks in the cache, we need the
low-order 3 bits of the address
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —15

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —16
Cache Example
n8-blocks, 1 word/block, direct mapped
nInitial state
Index | V | Tag | Data
000   | N |     |
001   | N |     |
010   | N |     |
011   | N |     |
100   | N |     |
101   | N |     |
110   | N |     |
111   | N |     |

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —17
Cache Example
Index | V | Tag | Data
000   | N |     |
001   | N |     |
010   | N |     |
011   | N |     |
100   | N |     |
101   | N |     |
110   | Y | 10  | Mem[10110]
111   | N |     |

Word addr | Binary addr | Hit/miss | Cache block
22        | 10 110      | Miss     | 110

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —18
Cache Example
Index | V | Tag | Data
000   | N |     |
001   | N |     |
010   | Y | 11  | Mem[11010]
011   | N |     |
100   | N |     |
101   | N |     |
110   | Y | 10  | Mem[10110]
111   | N |     |

Word addr | Binary addr | Hit/miss | Cache block
26        | 11 010      | Miss     | 010

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —19
Cache Example
Index | V | Tag | Data
000   | N |     |
001   | N |     |
010   | Y | 11  | Mem[11010]
011   | N |     |
100   | N |     |
101   | N |     |
110   | Y | 10  | Mem[10110]
111   | N |     |

Word addr | Binary addr | Hit/miss | Cache block
22        | 10 110      | Hit      | 110
26        | 11 010      | Hit      | 010

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —20
Cache Example
Index | V | Tag | Data
000   | Y | 10  | Mem[10000]
001   | N |     |
010   | Y | 11  | Mem[11010]
011   | Y | 00  | Mem[00011]
100   | N |     |
101   | N |     |
110   | Y | 10  | Mem[10110]
111   | N |     |

Word addr | Binary addr | Hit/miss | Cache block
16        | 10 000      | Miss     | 000
3         | 00 011      | Miss     | 011
16        | 10 000      | Hit      | 000

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —21
Cache Example
Index | V | Tag | Data
000   | Y | 10  | Mem[10000]
001   | N |     |
010   | Y | 10  | Mem[10010]
011   | Y | 00  | Mem[00011]
100   | N |     |
101   | N |     |
110   | Y | 10  | Mem[10110]
111   | N |     |

Word addr | Binary addr | Hit/miss | Cache block
18        | 10 010      | Miss     | 010
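The tables above can be reproduced with a small simulation. Below is a minimal Python sketch (not part of the original slides; names are illustrative) that replays the reference stream on an 8-block, 1-word-per-block direct-mapped cache.

```python
# Replay the example reference stream on an 8-block direct-mapped cache
# and print hit/miss for every access.
NUM_BLOCKS = 8
cache = [None] * NUM_BLOCKS          # each entry holds a tag, or None if invalid

for addr in [22, 26, 22, 26, 16, 3, 16, 18, 16]:
    index = addr % NUM_BLOCKS        # low-order 3 bits of the address
    tag = addr // NUM_BLOCKS         # remaining high-order bits
    if cache[index] == tag:
        outcome = "hit"
    else:
        outcome = "miss"
        cache[index] = tag           # fetch the block on a miss
    print(f"addr {addr:2d} ({addr:05b}) -> index {index:03b}: {outcome}")
```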

Address Subdivision
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —22
64 bits
? bits | ? bits | ? bits | ? bits
Given a 64-bit memory address, how many bits go to:
1. Tag
2. Blocks in the cache (index)
3. Words in a block
4. Bytes per word

Address Subdivision
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —23

Address Subdivision
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —25
64-bit address: Tag (64 - (n + m + 2) bits) | Index (n bits) | Word offset (m bits) | Byte offset (2 bits)

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —26
Address Subdivision

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —27
Example: Larger Block Size
n32-bit Memory address, Cache with 64
blocks, 16 bytes/block
nTo what block number does byte address
1200 map?

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —28
Example: Larger Block Size
n64 blocks, 16 bytes/block
nTo what block number does byte address 1200 map?
nBlock address = ⌊byte address / bytes per block⌋
nBlock address = ⌊1200 / 16⌋ = 75
nBlock number = 75 modulo 64 = 11
Address fields: Tag (22 bits) = bits 31–10 | Index (6 bits) = bits 9–4 | Offset (4 bits) = bits 3–0
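As a quick cross-check of the arithmetic above, a minimal Python sketch (illustrative only):

```python
# Which cache block does byte address 1200 map to in a direct-mapped
# cache with 64 blocks of 16 bytes each?
BYTES_PER_BLOCK = 16
NUM_BLOCKS = 64

byte_address = 1200
block_address = byte_address // BYTES_PER_BLOCK   # floor(1200 / 16) = 75
block_number = block_address % NUM_BLOCKS         # 75 mod 64 = 11
print(block_address, block_number)                # 75 11
```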

Example: Larger Block Size
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —29
Block number in cache (index) | Byte number in block (offset)
Address fields: Tag (22 bits) = bits 31–10 | Index (6 bits) = bits 9–4 | Offset (4 bits) = bits 3–0

Example: Direct-Mapped Cache
nConsider a machine with a byte-addressable main memory of 2^16 bytes and a block size of 8 bytes. Assume that a direct-mapped cache consisting of 32 lines is used with this machine.
a)How is a 16-bit memory address divided into tag, line
number, and byte number?
b)Into what line would bytes with each of the following
addresses be stored?
n0001 0001 0001 1011
n1100 0011 0011 0100
n1101 0000 0001 1101
n1010 1010 1010 1010
c)Why is the tag also stored in the cache?
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —30

Solution
a) 8 leftmost bits = tag, 5 middle bits = line (block) number, 3 rightmost bits = byte offset
b) 0001 0001 0001 1011 -> Line 3
1100 0011 0011 0100 -> Line 6
1101 0000 0001 1101 -> Line 3
1010 1010 1010 1010 -> Line 21
c) Because two items with two different memory addresses can be stored in the same place in the cache. The tag is used to distinguish between them.
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —31
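A minimal Python sketch of the same address split (illustrative only; the bit widths follow from the parameters stated in the problem):

```python
# Split a 16-bit address into tag, line number, and byte offset
# for the 32-line, 8-bytes-per-line direct-mapped cache above.
def split_address(addr: int):
    offset = addr & 0b111          # low 3 bits: byte within the 8-byte block
    line = (addr >> 3) & 0b11111   # next 5 bits: one of 32 cache lines
    tag = addr >> 8                # remaining 8 bits
    return tag, line, offset

for bits in ["0001000100011011", "1100001100110100",
             "1101000000011101", "1010101010101010"]:
    tag, line, offset = split_address(int(bits, 2))
    print(f"{bits} -> line {line}")   # lines 3, 6, 3, 21
```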

Example: Direct-Mapped Cache
How many total bits are required for a direct-mapped cache with 16 KiB of data and four-word blocks, assuming a 32-bit address?
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —32

Solution
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —33

Solution
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —34

Solution
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —35
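The solution slides above are figures; as a cross-check, here is a minimal sketch of the standard calculation, assuming 16 KiB of data, four-word (16-byte) blocks, 32-bit byte addresses, and one valid bit per block. It yields 10 index bits, an 18-bit tag, and about 147 Kibits (roughly 18.4 KiB) in total.

```python
# Total bits for a direct-mapped cache: data + tag + valid bit per block.
ADDR_BITS = 32
DATA_BITS_PER_WORD = 32
WORDS_PER_BLOCK = 4

num_blocks = (16 * 1024) // (WORDS_PER_BLOCK * 4)            # 1024 blocks
index_bits = num_blocks.bit_length() - 1                     # 10
offset_bits = 2 + 2                                           # 2 word-offset + 2 byte-offset bits
tag_bits = ADDR_BITS - index_bits - offset_bits               # 18
bits_per_block = WORDS_PER_BLOCK * DATA_BITS_PER_WORD + tag_bits + 1   # 128 + 18 + 1 = 147
total_bits = num_blocks * bits_per_block                       # 150,528 bits
print(index_bits, tag_bits, bits_per_block, total_bits)
```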

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —36
Block Size Considerations
nLarger blocks should reduce miss rate
nExploit spatial locality
nBut in a fixed-sized cache
nLarger blocks ⇒ fewer blocks
nMore competition for mapping
nIncreased miss rate
nLarger blocks ⇒ pollution: a block may be bumped out of the cache before many of its words are accessed
nA more serious problem is the larger miss penalty
nIt can override the benefit of the reduced miss rate

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —37
Cache Misses
nOn cache hit, CPU proceeds normally
nOn cache miss
nStall the CPU pipeline
nFetch block from next level of hierarchy
nInstruction cache miss
nRestart instruction fetch
nData cache miss
nComplete data access
nTypes of operations performed on data:
nRead operation → read miss or read hit
nWrite operation → write miss or write hit

Instruction Cache Misses
nSteps to be taken on an instruction cache miss:
nSend original PC value to memory
nInstruct memory to perform a read and wait for
memory to complete its access.
nWrite the cache entry (data portion, tag bits, turning
valid bit to 1)
nRestart instruction execution at the first step (re-fetch the instruction); this time it will be a cache hit
nControl of the cache on a data access is identical: simply stall the pipeline until memory responds with the data
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —38

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —39
Handling Writes: Write-Through
nOn data-write hit, could just update the block in
cache
nBut then cache and memory would be inconsistent
nThe simplest way is the write-through scheme:
nUpdate both the cache and memory at the same time
nThis design handles writes very simply
nProblem:
nMakes writes take longer
ne.g., if base CPI = 1, 10% of instructions are stores,
write to memory takes 100 cycles
nEffective CPI = 1 + 0.1×100 = 11
nReducing performance by more than a factor of 10

Write-Through solution
nSolution to this problem is write buffer
nIt holds (stores) data waiting to be written to
memory
nCPU continues immediately
nOnly stalls on write if write buffer is already
full
nWrite buffer is a queue
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —40

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —41
Write-Back
nAlternative: On data-write hit, just update
the block in cache
nKeep track of whether each block is dirty
nWhen a dirty block is replaced
nWrite it back to memory
nCan use a write buffer to allow replacing block
to be read first

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —42
Write Allocation
nWhat should happen on a data-write miss?
nData-write miss: when the processor wants to
write but the address (block) doesn’t exist in
cache.
nAlternatives for the write-through policy
nAllocate on miss: fetch the block
nWrite around: don’t fetch the block
nSince programs often write a whole block before
reading it (e.g., initialization)
nFor the write-back policy
nUsually fetch the block from memory

Memory access
nRead
nHit
nMiss (bring data from main memory to the
cache and read)
nWrite
nHit
nWrite-through
nUse write buffer to improve performance
nWrite-back
nMiss (bring data from main memory to the
cache and write)
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —43

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —44
Example: Intrinsity FastMATH

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —45
Example: Intrinsity FastMATH
nEmbedded MIPS processor
n12-stage pipeline
nInstruction and data access on each cycle
nSplit cache: separate I-cache and D-cache
nEach 16KB: 256 blocks × 16 words/block
nD-cache: write-through or write-back
nSPEC2000 miss rates
nI-cache miss rate: 0.4%
nD-cache miss rate: 11.4%

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —46
Measuring Cache Performance
nCPU time can be divided into:
nProgram execution clock cycles
nIncludes cache hit time
nMemory stall cycles
nMainly from cache misses
nWith simplifying assumptions:
§5.3 Measuring and Improving Cache Performance

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —47
Measuring Cache Performance
§5.3 Measuring and Improving Cache Performance
By combining read and write stall cycles:
Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                    = (Instructions / Program) × (Misses / Instruction) × Miss penalty

Example –Cache Performance
nAssume the miss rate of an I-cache is 2% and the miss rate of a D-cache is 4%. If a processor has a CPI of 2 without any memory stalls, and the miss penalty is 100 clock cycles for all misses, determine how much faster a processor would run with a perfect cache that never missed. Assume that the frequency of all loads and stores is 36% (related to data misses).
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —48

Solution
Instruction miss cycles per instruction = 2% × 100 = 2
Data miss cycles per instruction = 36% × 4% × 100 = 1.44
Total memory stall cycles per instruction = 2 + 1.44 = 3.44
Actual CPI = base CPI + 3.44 = 2 + 3.44 = 5.44
CPI_stall / CPI_perfect = 5.44 / 2 = 2.72
The CPU with a perfect cache (no misses) would be 2.72x faster.
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —49
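The same numbers can be checked with a few lines of Python (a minimal sketch, not from the slides):

```python
# Speedup from a perfect cache for the example above (I-cache miss rate 2%,
# D-cache miss rate 4%, base CPI 2, miss penalty 100 cycles, loads/stores 36%).
base_cpi, miss_penalty = 2.0, 100
i_miss_rate, d_miss_rate, ls_frequency = 0.02, 0.04, 0.36

stalls_per_instr = i_miss_rate * miss_penalty + ls_frequency * d_miss_rate * miss_penalty
cpi_with_stalls = base_cpi + stalls_per_instr          # 2 + 3.44 = 5.44
print(cpi_with_stalls, cpi_with_stalls / base_cpi)     # 5.44 2.72
```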

Example –Cache Performance
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —50

Performance Summary
nTo take advantage of spatial locality:
nIncrease cache block size
nA larger block size decreases the miss rate
nIt can also increase the miss penalty; if the miss penalty grows linearly with block size, it can lower performance
nSo, to avoid a performance loss, main memory bandwidth is increased to transfer cache blocks more efficiently → reducing the miss penalty
nCommon methods to increase memory bandwidth:
nWider memory (wider memory bus to increase memory BW)
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —51

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —52
Main Memory Supporting Caches
nUse DRAMs for main memory
nFixed width (e.g., 1 word)
nConnected by fixed-width clocked bus
nBus clock is typically slower than CPU clock
nExample cache block read
n1 bus cycle for address transfer
n15 bus cycles per DRAM access
n1 bus cycle per data transfer
nFor 4-word block, 1-word-wide DRAM
nMiss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
nBandwidth = 16 bytes / 65 cycles = 0.25 B/cycle

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —53
Increasing Memory Bandwidth
n4-word wide memory
nMiss penalty = 1 + 15 + 1 = 17 bus cycles
nBandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
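A minimal sketch (illustrative only) that reproduces both miss-penalty calculations, assuming 1 bus cycle for the address, 15 bus cycles per DRAM access, and 1 bus cycle per data transfer:

```python
# Miss penalty and bandwidth for a 4-word cache block fetched from a
# 1-word-wide vs a 4-word-wide memory.
def miss_penalty(words_per_block, words_per_access):
    accesses = words_per_block // words_per_access
    return 1 + accesses * 15 + accesses * 1   # address + DRAM accesses + transfers

for width in (1, 4):
    cycles = miss_penalty(4, width)
    print(width, cycles, 16 / cycles)         # (1, 65, ~0.25 B/cycle) and (4, 17, ~0.94 B/cycle)
```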

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —54
Average Access Time
nThe time to access data for both hits and misses affects performance
nSo designers use AMAT as a metric to examine alternative cache designs
nAverage memory access time (AMAT)
nAMAT = Hit time + Miss rate × Miss penalty
nHow to improve AMAT?
nImproving the miss rate → better cache structure
nImproving the miss penalty → ?
nImproving the hit time → better cache structure & better memory technology

Example -AMAT
Find the AMAT for a processor with a 1 ns clock cycle time, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls.
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —55

Solution
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —56
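The solution slide is a figure; applying the AMAT formula from the previous slide gives AMAT = 1 + 0.05 × 20 = 2 clock cycles, i.e., 2 ns. A minimal sketch of that calculation:

```python
# AMAT = Hit time + Miss rate × Miss penalty, for the parameters above.
clock_ns = 1.0
hit_time_cycles = 1
miss_rate = 0.05
miss_penalty_cycles = 20

amat_cycles = hit_time_cycles + miss_rate * miss_penalty_cycles   # 2 clock cycles
print(amat_cycles, amat_cycles * clock_ns, "ns")                  # 2.0  2.0 ns
```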

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —57
Performance Summary
nWhen CPU performance increases
nThe miss penalty becomes more significant
nDecreasing the base CPI
nA greater proportion of time is spent on memory stalls
nIncreasing the clock rate (frequency)
nMemory stalls account for more CPU cycles
nCan’t neglect cache behavior when
evaluating system performance

Reducing Cache Misses
nHow?
nBy more flexible placement of blocks
nCache mapping schemes:
nDirect mapped: a block can be placed in
exactly one location
nFully associative
nSet-associative:
nThe middle range of designs between direct
mapped and fully associative is called set
associative.
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —58

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —59
Associative Caches
nFully associative
nAllow a block to be placed in any location in cache
nRequires all entries to be searched at once
nComparator per entry (expensive)
nn-way set associative
nEach set contains n entries; a block has n choices for placement
nThe block address determines the set:
n(Block addr) modulo (#Sets in cache)
nSearch all entries in a given set at once
nn comparators (less expensive)

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —60
Associative Cache Example
The location of a memory block whose address is 12 in a cache with eight blocks varies for direct-mapped, set-associative, and fully associative placement:

Direct-mapped Vs Set-associative
nRemember that in a direct-mapped cache,
the position of a memory block is given by
n(Block addr) modulo (#blocks in the cache)
nIn a set-associative cache, the set
containing a memory block is given by
n(Block addr) modulo (#sets in the cache)
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —61
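A minimal sketch (not from the slides) that applies these two mapping rules to the block-12 example above:

```python
# Where does memory block 12 go in an 8-block cache under different schemes?
BLOCK_ADDR, NUM_BLOCKS = 12, 8

print("direct mapped: block", BLOCK_ADDR % NUM_BLOCKS)                # block 4
for ways in (2, 4):
    num_sets = NUM_BLOCKS // ways
    print(f"{ways}-way set associative: set", BLOCK_ADDR % num_sets)  # set 0 in both cases
print("fully associative: any of the", NUM_BLOCKS, "blocks")
```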

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —62
Spectrum of Associativity
nFor a cache with 8 entries

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —63
Associativity Example
nCompare 4-block caches
nDirect mapped, 2-way set associative,
fully associative
nBlock access sequence: 0, 8, 0, 6, 8
nDirect mapped
Block addr | Cache index | Hit/miss | Cache content after access
0          | 0           | miss     | [0]=Mem[0]
8          | 0           | miss     | [0]=Mem[8]
0          | 0           | miss     | [0]=Mem[0]
6          | 2           | miss     | [0]=Mem[0], [2]=Mem[6]
8          | 0           | miss     | [0]=Mem[8], [2]=Mem[6]

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —64
Associativity Example
n2-way set associative
Block addr | Cache index | Hit/miss | Cache content after access
0          | 0           | miss     | Set 0: Mem[0]
8          | 0           | miss     | Set 0: Mem[0], Mem[8]
0          | 0           | hit      | Set 0: Mem[0], Mem[8]
6          | 0           | miss     | Set 0: Mem[0], Mem[6]
8          | 0           | miss     | Set 0: Mem[8], Mem[6]

nFully associative
Block addr | Hit/miss | Cache content after access
0          | miss     | Mem[0]
8          | miss     | Mem[0], Mem[8]
0          | hit      | Mem[0], Mem[8]
6          | miss     | Mem[0], Mem[8], Mem[6]
8          | hit      | Mem[0], Mem[8], Mem[6]
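The three tables above can be reproduced with a small LRU-based simulation. The sketch below is illustrative only (it is not the hardware implementation) and assumes a 4-block cache:

```python
from collections import OrderedDict

# Replay block addresses 0, 8, 0, 6, 8 on a 4-block cache organized as
# direct-mapped, 2-way set associative, and fully associative (LRU).
def simulate(sequence, num_blocks, ways):
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]   # per set: block -> None, in LRU order
    results = []
    for block in sequence:
        s = sets[block % num_sets]
        if block in s:
            results.append("hit")
            s.move_to_end(block)                      # mark as most recently used
        else:
            results.append("miss")
            if len(s) == ways:
                s.popitem(last=False)                 # evict least recently used entry
            s[block] = None
    return results

seq = [0, 8, 0, 6, 8]
print("direct mapped:    ", simulate(seq, 4, 1))      # 5 misses
print("2-way set assoc.: ", simulate(seq, 4, 2))      # 4 misses
print("fully associative:", simulate(seq, 4, 4))      # 3 misses
```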

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —65
How Much Associativity
nIncreased associativity decreases miss
rate
nBut with diminishing returns
nSimulation result of a system with 64KB
D-cache, 16-word blocks, SPEC2000
associativity : data cache miss rate
n1-way: 10.3%
n2-way: 8.6%
n4-way: 8.3%
n8-way: 8.1%

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —66
Set Associative Cache Organization

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —67
Replacement Policy (choosing which block to
replace)
nDirect mapped: no choice
nSet associative
nPrefer non-valid entry, if there is one
nOtherwise, choose among entries in the set
nLeast-recently used (LRU)
nChoose the one unused for the longest time
nSimple for 2-way, manageable for 4-way, too hard beyond that
nRandom
nGives approximately the same performance as LRU for high associativity

Example –Cache
Consider a memory system that uses a 32-bit address to address at the byte level, plus a cache that uses a 64-byte line size.
a) Assume a direct-mapped cache with a tag field in the address of 20 bits. Show the address format and determine the following parameters: number of addressable units, number of lines (blocks) in the cache, size of tag.
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —68

Solution
Address format: Tag = 20 bits; index = 6 bits; byte offset = 6 bits
Number of addressable units = 2^32 bytes
Number of lines (blocks) in cache = 2^6 = 64
Size of tag = 20 bits
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —69

Example -Cache
b) Assume a fully associative cache. Show the address format and determine the following parameters: number of addressable units, number of sets in the cache, size of tag.
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —70

Solution
Address format: Tag = 26 bits; byte offset = 6 bits
Number of addressable units = 2^32 bytes
Number of sets in cache = 1
Size of tag = 26 bits
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —71

Example -Cache
c) Assume a four-way set-associative cache with a tag field in the address of 9 bits. Show the address format and determine the following parameters: number of addressable units, number of lines in a set, number of sets in the cache, number of lines in the cache, size of tag.
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —72

Solution
Address format: Tag = 9 bits; Set = 17 bits; Byte offset = 6 bits
Number of addressable units = 2^32 bytes
Number of lines (blocks) in a set = k = 4
Number of sets in cache = 2^17
Total number of lines (blocks) in cache = 2^17 * 2^2 = 2^19
Size of tag = 9 bits
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —73
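A minimal sketch (illustrative only) that derives the field widths for all three parts of this example from the stated parameters:

```python
import math

# Common assumptions from the example: 32-bit byte addresses, 64-byte lines
# -> 6 byte-offset bits in every case.
ADDR_BITS = 32
OFFSET_BITS = int(math.log2(64))                   # 6

# a) direct mapped with a 20-bit tag
index_bits_a = ADDR_BITS - 20 - OFFSET_BITS        # 6 -> 2**6 = 64 lines
# b) fully associative: no index field, everything above the offset is tag
tag_bits_b = ADDR_BITS - OFFSET_BITS               # 26 -> one set holding all lines
# c) four-way set associative with a 9-bit tag
set_bits_c = ADDR_BITS - 9 - OFFSET_BITS           # 17 -> 2**17 sets
lines_c = 2**set_bits_c * 4                        # 2**19 lines in total
print(index_bits_a, tag_bits_b, set_bits_c, lines_c)
```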

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —74
Multilevel Caches
nA multilevel cache is a memory hierarchy with multiple levels of caches, rather than just a cache and main memory
nWe can reduce the miss penalty using multilevel caches
nL-1 cache (primary), faster, smaller
nL-2 cache
nL-3 cache

Multilevel Caches
nPrimary cache (L1) attached to CPU
nSmall, but fast
nLevel-2 cache services misses from
primary cache
nLarger, slower, but still faster than main
memory
nIf desired data is present in L2 cache, miss
penalty for L1 is access time of L2 cache
nMain memory services L2 cache misses
nSome high-end systems include L3 cache
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —75

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —76
Multilevel Cache Example
nGiven
nCPU base CPI = 1, clock rate = 4GHz
nMiss rate/instruction (L1 miss rate) = 2%
nMain memory access time = 100ns
nEffective CPI = ?
nWith just primary cache (L1)
nMiss penalty = 100ns/0.25ns = 400 cycles
nEffective CPI = base CPI + Memory stall
cycles/instruction = 1 + 0.02 ×400 = 9 cycles

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —77
Example (cont.)
nNow add L-2 cache:
nGiven:
nAccess time (L2 hit time) = 5ns
nGlobal miss rate to main memory = 0.5%
nPrimary miss penalty with an L-2 miss = 100ns/0.25ns = 400 cycles
nPrimary miss penalty with an L-2 hit
nMiss penalty = 5ns/0.25ns = 20 cycles
nTotal CPI = Base CPI + primary stalls/instruction + secondary stalls/instruction
nTotal CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
nPerformance ratio = 9/3.4 = 2.6
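A minimal sketch (not from the slides) that reproduces the CPI figures above:

```python
# Effective CPI with L1 only and with an added L2 cache.
base_cpi, clock_ns = 1.0, 0.25            # 4 GHz clock
l1_miss_rate, global_miss_rate = 0.02, 0.005
mem_access_ns, l2_access_ns = 100, 5

mem_penalty = mem_access_ns / clock_ns    # 400 cycles
l2_penalty = l2_access_ns / clock_ns      # 20 cycles

cpi_l1_only = base_cpi + l1_miss_rate * mem_penalty                                  # 9.0
cpi_with_l2 = base_cpi + l1_miss_rate * l2_penalty + global_miss_rate * mem_penalty  # 3.4
print(cpi_l1_only, cpi_with_l2, round(cpi_l1_only / cpi_with_l2, 1))                 # 9.0 3.4 2.6
```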

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —78
Multilevel Cache Considerations
nPrimary cache (L-1)
nFocus on minimizing hit time (access time) to
yield a shorter clock cycle
nL-2 cache
nFocus on low miss rate to avoid main memory
access
nHit time has less overall impact
nResults
nL-1 cache usually smaller than a single cache
nL-1 block size smaller than L-2 block size

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —79
Interactions with Advanced CPUs
nOut-of-order CPUs can execute
instructions during cache miss
nPending store stays in load/store unit
nDependent instructions wait in reservation
stations
nIndependent instructions continue
nEffect of miss depends on program data
flow
nMuch harder to analyze
nUse system simulation

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —80
Interactions with Software
nMisses depend on memory access
patterns
nSoftware optimization
nAlgorithm behavior
nDesign an efficient algorithm
nCompiler optimization for memory
access

Summary
nWe focused on:
nCache performance
nMiss rate
nMiss penalty
nUsing associativity to improve miss rates
nUsing multilevel cache hierarchy to improve
miss penalty
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —81

Virtual Memory
nVM is a technique that uses main memory
as a “cache” for secondary storage.
nSecondary storage can be addressed as though it were part of main memory
nVM is managed jointly by CPU hardware and OS
nTwo motivations for VM:
nAllow efficient sharing of memory among multiple
programs
nAllow a single program to expand its address
space beyond the limits of MM.
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —82

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —83
Virtual Memory
nPrograms share main memory
nEach program gets a private virtual address space holding its frequently used code and data
nThis space is protected from other programs
nVM implements the translation of a virtual address to a physical address
nPhysical address is an address in MM
nVirtual address corresponds to a location in
virtual space (RAM and Disk together)
§5.4 Virtual Memory

Virtual Memory
nCPU and OS translate virtual addresses to physical addresses
nOS fills page table
nCPU (hardware) does translation
nVM “block” is called a page
nVM translation “miss” is called a page fault
nThe process of translating a virtual address to a physical address is address translation (address mapping)
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —84

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —85
Address Translation
nFixed-size pages (e.g., 4K)
nNumber of bits in page offset field determines
page size

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —86
Page Fault Penalty
nOn a page fault, the page must be fetched from disk (secondary storage)
nTakes millions of clock cycles (enormous miss penalty) → handled by OS code
nWrite-through will not work for virtual memory
nSince writes take too long, instead VM systems use
write-back
nVM tries to minimize page fault rate
nFully associative placement of pages in memory
nSmart replacement algorithms

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —87
Page Tables
nStores placement information
nArray of page table entries, indexed by virtual
page number
nPage table register in CPU points to page table
in physical memory
nIf page is present in memory
nPTE stores the physical page number
nPlus other status bits (referenced, dirty, …)

Page Tables
nIf page is not present
nPTE can refer to location in swap space on disk
§Swap space:
Space on the disk reserved for the full virtual
memory space of a process.
nEach program has its own PT
nMaps the Virtual address space of that program to
MM
nPT has an entry for every virtual page.
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —88
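As an illustration only (this is software pseudocode, not the hardware mechanism), a minimal sketch of translation through a flat page table with 4 KiB pages; the page-table contents are made up for the example:

```python
# Translate a virtual address to a physical address via a page table.
# A PTE here is a physical page number, or missing (page not present -> page fault).
PAGE_SIZE = 4096
OFFSET_BITS = 12

page_table = {0: 7, 1: 3, 5: 42}        # virtual page number -> physical page number

def translate(virtual_addr: int) -> int:
    vpn = virtual_addr >> OFFSET_BITS    # virtual page number
    offset = virtual_addr & (PAGE_SIZE - 1)
    ppn = page_table.get(vpn)
    if ppn is None:
        raise RuntimeError("page fault: OS must fetch the page from disk")
    return (ppn << OFFSET_BITS) | offset

print(hex(translate(0x1234)))            # VPN 1 -> PPN 3, offset 0x234 -> 0x3234
```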

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —89
Translation Using a Page Table

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —90
Mapping Pages to Storage
The pages in main memory and on disk are the same size.
The PT maps each page in VM to either a page in MM or a page on disk.

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —91
Replacement and Writes
nTo reduce page fault rate, OS uses least-recently used (LRU) replacement
nReference bit (aka use bit) in PTE set to 1 on access to page
nPeriodically cleared to 0 by OS
nA page with reference bit = 0 has not been used recently
nWrites to disk take millions of cycles
nWrite-through scheme is impractical
nSo VM uses write-back scheme
nAlso, VM tracks whether a page is dirty or not to avoid writing unchanged pages back to disk

Page Table Example
nWith a 32-bit virtual address, 4 KiB pages, and 4 bytes per page table entry, we can compute the total page table size:
#PTEs = 2^32 / 2^12 = 2^20
Size of PT = 2^20 × 4 bytes = 4 MiB
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —92
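A minimal sketch of the same page-table-size calculation (illustrative only):

```python
# Size of a flat page table: one PTE per virtual page.
VA_BITS, PAGE_SIZE, PTE_BYTES = 32, 4096, 4

num_ptes = 2**VA_BITS // PAGE_SIZE     # 2**20 entries
pt_bytes = num_ptes * PTE_BYTES        # 4 MiB
print(num_ptes, pt_bytes // (1024 * 1024), "MiB")
```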

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —93
Fast Translation Using a TLB
nAddress translation appears to require extra
memory references
nOne to access the PTE
nThen the actual memory access (getting data)
nThe key to improve access performance is:
nRely on locality of reference to the PT
nHow does it work?
nWhen a page is referenced, it will probably be needed again in the near future
nSo modern computers use a special cache (within the CPU) that keeps track of recently used translations
nCalled the Translation Lookaside Buffer (TLB)

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —94
Fast Translation Using a TLB
The TLB acts as a cache of the page table, holding only entries that map to physical pages.

Fast Translation Using a TLB
nOn every reference:
nLook up the Virtual page number in TLB
nOn a TLB Hit:
nThe physical page number is used to form the address
nThe corresponding reference bit is set to 1.
nIf CPU performing a write, dirty bit is also set to 1.
nOn a TLB Miss:
nNeed to indicate if that miss comes from TLB
nOr it comes from main memory (page fault)
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —95
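As an illustration only (not the hardware), a minimal sketch of a TLB lookup that falls back to the page table on a TLB miss and signals a page fault when the page is not in memory; the mappings are made up:

```python
# TLB hit: translation available immediately.  TLB miss: load the PTE from
# the page table into the TLB.  Not in the page table: page fault.
PAGE_SIZE, OFFSET_BITS = 4096, 12
page_table = {0: 7, 1: 3, 5: 42}     # full translation (VPN -> PPN)
tlb = {}                             # small cache of recent translations

def access(virtual_addr: int) -> int:
    vpn = virtual_addr >> OFFSET_BITS
    if vpn in tlb:                   # TLB hit
        ppn = tlb[vpn]
    elif vpn in page_table:          # TLB miss: fill the TLB and continue
        ppn = tlb[vpn] = page_table[vpn]
    else:                            # page fault: OS brings the page from disk
        raise RuntimeError("page fault")
    return (ppn << OFFSET_BITS) | (virtual_addr & (PAGE_SIZE - 1))

print(hex(access(0x5ABC)), hex(access(0x5DEF)))   # TLB miss, then hit, on VPN 5
```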

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —96
TLB Misses
nIf page is in memory (TLB miss)
nCPU can handle this miss by:
nLoading the translation (PTE) from the PT into the
TLB and try accessing again
nIf page is not in memory (page fault)
nOS handles fetching the page from disk and
updating the page table
nThen restart the faulting instruction
nTLB misses are more frequent than page faults
nBecause the TLB has many fewer entries than the number of pages in MM

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —97
TLB Miss Handler
nTLB miss indicates
nPage present, but PTE not in TLB
nPage not present in MM
nHandler copies PTE from memory to TLB
nThen restarts instruction
nIf page not present, page fault will occur

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —98
Page Fault Handler
nUse faulting virtual address to find PTE
nLocate page on disk
nChoose page to replace
nIf dirty, write to disk first
nRead page into memory and update page
table
nMake process runnable again
nRestart from faulting instruction

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —99
TLB and Cache Interaction
nIf cache tag uses
physical address
nNeed to translate
before cache lookup
nTLB has fully
associative structure

Address Translation steps
Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —100

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —101
Memory Protection
nDifferent tasks can share parts of their
virtual address spaces
nBut need to protect against errant access
nRequires OS assistance
nHardware support for OS protection
nPrivileged supervisor mode (aka kernel mode)
nPrivileged instructions
nPage tables and other state information only
accessible in supervisor mode
nSystem call exception (e.g., syscall in MIPS)

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —102
The Memory Hierarchy
nCommon principles apply at all levels of
the memory hierarchy
nBased on notions of caching
nAt each level in the hierarchy
nBlock placement
nFinding a block
nReplacement on a miss
nWrite policy
§5.5 A Common Framework for Memory Hierarchies
The BIG Picture

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —103
Block Placement
nDetermined by associativity
nDirect mapped (1-way associative)
nOne choice for placement
nn-way set associative
nn choices within a set
nFully associative
nAny location
nHigher associativity reduces miss rate
nIncreases complexity, cost, and access time

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —104
Finding a Block
nHardware caches (TLB and Cache)
nReduce comparisons to reduce cost (set-associative placement)
nTLBs and caches use set-associative placement
nVirtual memory systems
nUse fully associative placement to reduce the miss rate
nThe full page table can be indexed directly, with no extra hardware and no searching
Associativity          | Location method                           | Tag comparisons
Direct mapped          | Index                                     | 1
n-way set associative  | Set index, then search entries within set | n
Fully associative      | Search all entries                        | #entries

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —105
Replacement
nChoice of entry to replace on a miss
nLeast recently used (LRU)
nComplex and costly hardware for high associativity
nRandom
nClose to LRU, easier to implement
nVirtual memory
nLRU approximation with hardware support

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —106
Write Policy
nWrite-through
nUpdate both upper and lower levels
nSimplifies replacement, but may require write buffer
nWrite-back
nUpdate upper level only
nUpdate lower level when block is replaced
nNeed to keep more state
nVirtual memory
nOnly write-back is feasible, given disk write latency

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —107
Sources of Misses
nCompulsory misses (aka cold start misses)
nFirst access to a block
nCapacity misses
nDue to finite cache size
nA replaced block is later accessed again
nConflict misses (aka collision misses)
nIn a non-fully associative cache
nDue to competition for entries in a set
nWould not occur in a fully associative cache of
the same total size

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —108
Cache Design Trade-offs
Design change          | Effect on miss rate        | Negative performance effect
Increase cache size    | Decrease capacity misses   | May increase access time
Increase associativity | Decrease conflict misses   | May increase access time
Increase block size    | Decrease compulsory misses | Increases miss penalty; for very large block size, may increase miss rate due to pollution

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —109
Cache Control
nExample cache characteristics
nDirect-mapped, write-back, write allocate
nBlock size: 4 words (16 bytes)
nCache size: 16 KB (1024 blocks)
n32-bit byte addresses
nValid bit and dirty bit per block
§5.7 Using a Finite State Machine to Control a Simple Cache
Address fields: Tag (18 bits) = bits 31–14 | Index (10 bits) = bits 13–4 | Offset (4 bits) = bits 3–0

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —110
Multilevel On-Chip Caches
§5.10 Real Stuff: The AMD Opteron X4 and Intel Nehalem
Per core: 32KB L1 I-cache, 32KB L1 D-cache, 512KB L2 cache
Intel Nehalem 4-core processor

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —111
2-Level TLB Organization
Intel Nehalem vs AMD Opteron X4:
nVirtual address: 48 bits (both)
nPhysical address: Nehalem 44 bits; Opteron X4 48 bits
nPage size: 4KB, 2/4MB (both)
nL1 TLB (per core):
Nehalem: L1 I-TLB 128 entries for small pages, 7 per thread (2×) for large pages; L1 D-TLB 64 entries for small pages, 32 for large pages; both 4-way, LRU replacement
Opteron X4: L1 I-TLB 48 entries; L1 D-TLB 48 entries; both fully associative, LRU replacement
nL2 TLB (per core):
Nehalem: single L2 TLB, 512 entries, 4-way, LRU replacement
Opteron X4: L2 I-TLB 512 entries; L2 D-TLB 512 entries; both 4-way, round-robin LRU
nTLB misses: handled in hardware (both)

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —112
3-Level Cache Organization
Intel Nehalem vs AMD Opteron X4:
nL1 caches (per core):
Nehalem: L1 I-cache 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a; L1 D-cache 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
Opteron X4: L1 I-cache 32KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles; L1 D-cache 32KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles
nL2 unified cache (per core):
Nehalem: 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
Opteron X4: 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time n/a
nL3 unified cache (shared):
Nehalem: 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a
Opteron X4: 2MB, 64-byte blocks, 32-way, replace block shared by fewest cores, write-back/allocate, hit time 32 cycles
n/a: data not available

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —113
Miss Penalty Reduction
nReturn requested word first
nThen back-fill rest of block
nNon-blocking miss processing
nHit under miss: allow hits to proceed
nMiss under miss: allow multiple outstanding misses
nHardware prefetch: instructions and data
nOpteron X4: bank interleaved L1 D-cache
nTwo concurrent accesses per cycle

Chapter 5 —Large and Fast: Exploiting Memory Hierarchy —114
Concluding Remarks
nFast memories are small, large memories are
slow
nWe really want fast, large memories :(
nCaching gives this illusion :)
nPrinciple of locality
nPrograms use a small part of their memory space
frequently
nMemory hierarchy
nL1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk
nMemory system design is critical for
multiprocessors
§5.12 Concluding Remarks