Speculative aspects of high-speed processor design


Speculative aspects of high-speed processor design
Kei Hiraki
Department of Creative Informatics
Graduate School of Information Science and Technology
The University of Tokyo

Our goal
• Highest total system speed
– Ex. TOP500 speed
– Application speed of supercomputers
• Highest processor chip performance
– Ex. SPEC CPU rate
– NAS parallel benchmarks
• Highest single core performance
– SPEC CPU int
– SPEC CPU fp
– Dhrystone
Single core performance is the starting point

[Chart: TOP500 performance growth --- x1000 in 11 years (x1.87 per year), heading toward 1 EFlops]

Single Core Performance
• Base for all performance improvement
• Various speed-up methods
– Faster clock frequency
• New devices --- Si, GaAs, HEMT, JJ devices, optical devices
• Smaller devices --- semiconductor device width
– Intel 4004: 10,000 nm
– Intel Core i7 3xxxx: 22 nm
Clock speed is now saturating
– Power consumption
– Device density

Device Technology for Memories
[Chart: DRAM device width (10 μm down to 0.01 μm) and chip capacity (16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M, 1G, 4G, 16G; axis 1K through 1T) versus year, 1970-2020, with the transition from normal devices to quantum devices as widths shrink]

Prediction by ITRS

Year                               2010    2012    2014    2016    2018    2020
Metal1 ½ pitch (nm)                45      32      24      18.9    15      11.9
Vt (V)*                            0.289   0.291   0.221   0.202   0.207   0.219
Device type                        EPbulk  EPbulk  UTB FD  MG      MG      MG
Vdd (V)                            0.97    0.9     0.84    0.78    0.73    0.68
Power density (W/mm2)              0.5     0.6     0.7     0.8     0.9     1
Max pin count                      4900    5300    5900    6500    7200    7900
On-chip performance (GHz)          5.88    6.82    7.91    9.18    10.65   12.36
Chip-to-board performance (Gb/s)   10      14      17      30      40      50

ITRS 2009 Process, Integration, Design and System / Assembly and Packaging
* Isd,leak = 100 uA/um

Power wall --- limitation of clock speed
• 100x faster clock from 1993 to 2003
– Progress of CMOS technology
• No improvement from 2003 to now
– 150 W/chip power limit
– More transistors per unit area
– A faster clock requires a higher threshold voltage
• High-speed process: 1.2 V
• Low-power process: 0.8 V (40 nm CMOS)

Clock speed limit
[Chart: clock frequency (10-10,000 MHz, log scale) of the Top 10 machines of the TOP500, 1993-2009]

Historical view of processor performance
• Performance measurements of historical and latest processors (100 systems)
– Intel: 386, 486, Pentium, Pentium Pro, ..., Core i7, ATOM, Itanium II
– AMD: Opteron
– SPARC: Weitek 8701, microSPARC II, UltraSPARC I, II, III
– Alpha: 21064, 21164, 21264 (EV67)
– MIPS: R4000, R5000, R12000
– Power: POWER5, PowerPC 750, 970
– ARM: Tegra, iMX515
– HP: HP-PA 7100
– Motorola: MC68328, MC68VZ328
Note: repair and maintenance are the biggest problems


Old and New systems to be measured

Integer performance/core: Dhrystone MIPS
[Chart: Dhrystone MIPS per core versus year, for high-performance cores and embedded cores]

Observation 1
• Performance increased rapidly until 2003
– Faster clock frequency (up to 4 GHz)
– Pipelined architecture design
– Cache memory
– Superscalar architecture
• Performance still increases after 2003
– Constant clock frequency
– Wider superscalar architecture
– Deep branch prediction
– Prefetching
– and more

Performance / clock
[Chart: CINT2000 ratio / MHz (0-1.4) versus year, 1990-2015, for MIPS R4000, HyperSPARC, SuperSPARC-II, MMX P5, Pentium Pro, UltraSPARC-II, TM5800, K8, Pentium M, POWER5, Pentium D, Cell BE, Core Duo, Core 2 Duo, Atom, Cortex A8, Tegra 250, Nehalem, Lynnfield, AMD C-50, POWER7, SandyBridge, SandyBridge-E, and AMD FX]

High-speed features of a processor 1
1. Pipeline design
– Sequential execution: ~10 cycles/instruction
• Old computers, Intel 8080
– Basic pipelined execution: 2-4 cycles/instruction
• CDC 6600, Manchester MU5, MIPS, and many more
– Out-of-order execution: ~1 cycle/instruction
• IBM 360/91, Alpha 21264, most of today's processors
– Superscalar execution: ~0.5 cycle/instruction
• Intel i960CA, Alpha 21064, most of today's processors
– VLIW: ~0.3 cycle/instruction
• Multiflow, Intel Itanium
• Out-of-order and superscalar execution should be used with branch prediction

High-speed features of a processor 2
2. Branch prediction
– Branch Target Buffer: short, local history
• Manchester MU5, processors before 1995
– Two-level branch prediction: history and pattern tables
• Intel Pentium III, and many more processors
– Gshare and Gselect: use of global history
• AMD, Pentium M, Core 2
– Perceptron predictor: machine learning
• AMD
– ITTAGE: cascaded history tables
Practical use of speculative execution

High-speed features of a processor 3
3. Prefetch (hardware prefetch)
– Memory address prediction for future accesses
– Access throttling for optimal memory utilization
– Sequential prefetcher: next block
– Stride prefetcher: finds the stride from history
– Global History Buffer: use of global history
– Address Map Pattern Matching: the current state of the art
Effective speculative execution
Practical use of global history

Other high-speed features
4. Cache memory, hierarchical cache memory
5. Cache replacement algorithm
– Non-LRU algorithms to eliminate dead blocks
6. DRAM memory access scheduling
7. Network on Chip (NoC) scheduling
8. Accelerators (floating-point accelerator)
Power and hardware budget

MIPS / core of a processor core
[Chart: MIPS per core of the measured processors]

Dataflow execution and speculative execution
• Dataflow execution
– Controlled by availability of data/control
– Ideal for parallel execution
– Difficulty in latency reduction
– I am a dataflow guru (developed the still-largest dataflow machine)
• Speculative execution
– Independent of data dependencies
– Accuracy of prediction is the key
– Today's processor speed-ups are mainly based on speculation

Methods for speculation (prediction)
• Local history
– Past behavior of the target instruction
• Outcome of the branch instruction
• Accessed addresses of the load/store instruction
• Prediction based on patterns in the local history
• Global history
– Past behavior of instructions other than the target
• Other branch instructions
• Accessed addresses of other load/store instructions
• Enables accurate prediction from the first iteration of a loop

Ideal form of Dynamic Scheduling
[Diagram: the dynamic instruction sequence --- completed instructions, then the Instruction Window (executing instructions and instructions that are waiting), then instructions before execution]
When a new instruction enters the Instruction Window:
• Registers and memory are read if the value is not reserved by an instruction within the window
• Operands produced by instructions within the window are passed directly from those instructions
Inside the Instruction Window:
• Execution order is decided by data dependency (execution starts when all input operands are ready)
• If the CPU has infinite resources (ALUs, memory, etc.), dynamic scheduling gives the shortest execution time

Utilization of ILP (Instruction Level Parallelism)
Conditional branches inside the Instruction Window:
• if-then-else
• Loop constructs
• Nested conditional branches

Example: Branch Prediction
• Speculative execution of instructions after a conditional branch
– Static branch prediction
• The compiler decides the major direction of the branch instruction
– Alpha 21064: based on the direction of the branch (forward ⇒ not taken, backward ⇒ taken)
– SPARC etc.: the conditional branch instruction carries a hint bit
[Pipeline diagram: F-D-R-E-X-W stages; the instruction after the branch is fetched speculatively, and when the actual branch outcome shows the prediction failed, the speculative instruction is squashed and the other path is fetched]
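A minimal C sketch of the "backward taken, forward not taken" rule used by static predictors such as the Alpha 21064's (real hardware works on instruction displacements; this only illustrates the heuristic):

```c
#include <stdbool.h>
#include <stdint.h>

/* Backward branches usually close loops and are usually taken, so
 * predict taken iff the branch displacement is negative. */
static bool predict_taken_static(int32_t branch_displacement)
{
    return branch_displacement < 0;
}
```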

Limitation of static branch prediction
• Speculative execution of conditional branches
– Large penalty when speculation fails
• Cancellation of speculative execution
• Keeping memory consistency under speculative execution
– Problem: high misprediction rate
• Loop-exiting branch in loop constructs
– 1 failure per loop execution ⇒ 1/n misprediction rate
• Conditional branches in the loop body
– Difficult to predict statically ⇒ profile-based branch prediction
• About 80% of predictions succeed

History of dynamic branch prediction
• Branch prediction table and its extensions
– Manchester University, I-Unit of the MU5 computer
• Taylor, L. A., "Instruction Accessing in High Speed Computers," MS thesis, University of Manchester, 1969.
– Branch Target Buffer (Manchester Univ.)
• Ibbett, R. N., "The MU5 instruction pipeline," The Computer Journal, Vol. 15, No. 1, pp. 42-50, 1972.
– 2-level branch prediction (used in the Pentium III)
• Yeh, T.-Y. and Patt, Y. N., "Two-Level Adaptive Branch Prediction," Proc. 24th Int. Symposium and Workshop on Microarchitecture, pp. 51-61, 1991.
– Gshare (used in the DEC Alpha)
• McFarling, S., "Combining Branch Predictors," WRL Technical Note TN-36, June 1993.

Branch prediction table (BPT or BTB)
(Branch Prediction Table, Branch Target Buffer)
• BPT entry: address of the branch, past branch direction (counter)
• 1-bit prediction ⇒ high (2x) misprediction rate
• 2-bit prediction ⇒ saturating counter (bimodal prediction)
[Table: instruction address → valid bit, history counter]
[State diagram of the 2-bit saturating counter: two "predict taken" states and two "predict not taken" states; each taken outcome moves one step toward "predict taken" and each not-taken outcome one step toward "predict not taken," so a single anomalous outcome cannot flip a saturated prediction]
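A minimal sketch of a bimodal predictor built from 2-bit saturating counters; the table size and index hash are illustrative assumptions, not a specific machine's design:

```c
#include <stdbool.h>
#include <stdint.h>

#define BPT_ENTRIES 4096                /* illustrative table size */

/* 2-bit saturating counters: 0,1 predict not taken; 2,3 predict taken. */
static uint8_t bpt[BPT_ENTRIES];

static bool bpt_predict(uint64_t pc)
{
    return bpt[(pc >> 2) % BPT_ENTRIES] >= 2;
}

static void bpt_update(uint64_t pc, bool taken)
{
    uint8_t *ctr = &bpt[(pc >> 2) % BPT_ENTRIES];
    if (taken  && *ctr < 3) (*ctr)++;   /* saturate at "strongly taken" */
    if (!taken && *ctr > 0) (*ctr)--;   /* saturate at "strongly not taken" */
}
```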

Branch Target Buffer
• Branch Target Buffer (BTB)
– A table that returns the predicted branch target address
⇒ Zero branch prediction penalty
[Diagram: the lower bits of the branch address index the table; each entry holds a valid bit, a history counter, a tag compared for equality against the branch address, and the branch target]
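A direct-mapped BTB lookup, sketched under the same illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 1024                /* illustrative */

struct btb_entry {
    bool     valid;
    uint64_t tag;       /* branch PC stored for the equality check */
    uint64_t target;    /* predicted branch target */
};

static struct btb_entry btb[BTB_ENTRIES];

/* On a hit, the fetch unit can redirect to *target immediately,
 * which is what gives the zero-penalty prediction on the slide. */
static bool btb_lookup(uint64_t pc, uint64_t *target)
{
    const struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->tag == pc) {
        *target = e->target;
        return true;
    }
    return false;
}
```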

Further improvement of dynamic branch prediction
• 2-level branch predictor (bimodal base)
– Based on patterns in a local history table
– About 90% of predictions succeed
• Small loops
• Nested loops
– 90% success ⇒ about 0.35 clock cycles of penalty per branch

2-level branch prediction
• Prediction by branch history and branch patterns
Example: PAs (Yeh and Patt, 1992), Intel Pentium III
[Diagram: i bits of the branch address select an entry of the Branch History Table (a private, per-branch table); its j-bit history, together with k bits of the address, indexes the Branch Pattern Table (a shared table of 2-bit bimodal counters), which gives the prediction]
Prediction miss rate: 7% (SPECint92)
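A minimal sketch of a PAs-style two-level predictor. Table sizes and the pattern-table index are illustrative; the real PAs scheme also mixes address bits into the pattern-table index:

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 1024    /* per-branch local histories (illustrative) */
#define HIST_BITS   8       /* j bits of local history (illustrative) */

static uint16_t bht[BHT_ENTRIES];        /* local history registers */
static uint8_t  pht[1 << HIST_BITS];     /* pattern table of 2-bit counters */

static bool twolevel_predict(uint64_t pc)
{
    uint16_t hist = bht[(pc >> 2) % BHT_ENTRIES] & ((1u << HIST_BITS) - 1);
    return pht[hist] >= 2;
}

static void twolevel_update(uint64_t pc, bool taken)
{
    uint16_t *h   = &bht[(pc >> 2) % BHT_ENTRIES];
    uint16_t hist = *h & ((1u << HIST_BITS) - 1);
    uint8_t  *ctr = &pht[hist];
    if (taken  && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
    *h = (uint16_t)((*h << 1) | taken);  /* shift outcome into local history */
}
```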

Function of 2-level branch prediction
• Detects frequent branch patterns
• Effective for short loops (loop count < local history length)
Example: double loop (N=4)
Branch direction:  T T T N T T T N
2bc prediction:    T T T T T T T T ⇒ miss rate 25%
PAs prediction:    T T T N T T T N ⇒ miss rate 0%

Gshare
• Index into the PHT = branch address XOR global history
Example: UltraSPARC-3, (j:k) = (12:14)
[Diagram: 12 bits of global branch history are XORed with the branch address to form a 14-bit index into the shared global table of 2-bit counters, which gives the prediction]
Miss rate is about 6%
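A gshare sketch in C; the 14-bit index matches the slide, the rest is illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define PHT_BITS 14                     /* k = 14 index bits */

static uint8_t  gshare_pht[1 << PHT_BITS];  /* 2-bit counters */
static uint32_t ghist;                      /* global branch history */

static bool gshare_predict(uint64_t pc)
{
    uint32_t idx = ((uint32_t)(pc >> 2) ^ ghist) & ((1u << PHT_BITS) - 1);
    return gshare_pht[idx] >= 2;
}

static void gshare_update(uint64_t pc, bool taken)
{
    uint32_t idx = ((uint32_t)(pc >> 2) ^ ghist) & ((1u << PHT_BITS) - 1);
    uint8_t *ctr = &gshare_pht[idx];
    if (taken  && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
    ghist = (ghist << 1) | taken;       /* append outcome to global history */
}
```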

Use of global history

for (i = 0; i < N; i++)
    { loop body 1 }
...
for (i = 0; i < N; i++)
    { loop body 2 }
...
for (i = 0; i < N; i++)
    { loop body 2 }

The exit branches of these loops behave identically, so global history learned on the earlier loops lets the predictor be correct on the later ones from their first iteration.

Hybrid branch prediction
• Combination of local-history prediction and global-history prediction
• Reliability counter for each predictor
• DEC Alpha 21264
• Advanced branch predictors
– Perceptron hybrid predictor --- a kind of neural network
– TAGE, GEHL --- advanced (more complex) hybrids
– FTL --- path-based predictor (our predictor)
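A minimal sketch of the selection mechanism (the "reliability counter"). The Alpha 21264's actual chooser is indexed by global history and its component predictors are trained separately; this only illustrates the idea:

```c
#include <stdbool.h>
#include <stdint.h>

#define CHOOSER_ENTRIES 4096            /* illustrative */

/* 2-bit chooser counters: >= 2 means "trust the global predictor". */
static uint8_t chooser[CHOOSER_ENTRIES];

static bool hybrid_predict(uint64_t pc, bool local_pred, bool global_pred)
{
    return chooser[(pc >> 2) % CHOOSER_ENTRIES] >= 2 ? global_pred
                                                     : local_pred;
}

/* Train the chooser only when the components disagree, moving it
 * toward whichever component was right. */
static void hybrid_update(uint64_t pc, bool local_pred, bool global_pred,
                          bool taken)
{
    uint8_t *c = &chooser[(pc >> 2) % CHOOSER_ENTRIES];
    if (local_pred != global_pred) {
        if (global_pred == taken && *c < 3) (*c)++;
        if (local_pred  == taken && *c > 0) (*c)--;
    }
}
```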

Computer Architecture Competition
Our history of competition:
2012  Memory access scheduling      Winner
2011  Branch prediction             2nd place
2010  Cache replacement algorithm   2nd place
2009  Prefetching                   Winner
2008  Branch prediction             2nd place

Rest of this talk
1. AMPM prefetcher --- the best prefetcher today
2. DRAM memory access scheduling --- one of the best memory access schedulers

High Performance Memory Access Scheduling Using Compute-Phase Prediction and Writeback-Refresh Overlap
Yasuo Ishii, Kouhei Hosokawa, Mary Inaba, Kei Hiraki

Design Goal: High Performance Scheduler
Three evaluation metrics:
• Execution time (performance)
• Energy-delay product
• Performance-fairness product
We found several trade-offs among these metrics: the best execution-time (performance) configuration does not show the best energy-delay product.

DRAM: the standard memory device for computers
• High density
• Low cost
• Recent DDR3 memory has strong constraints on row access
– Row-buffer access timing constraints due to power consumption
⇒ DRAM memory scheduling

Structure of DDR DRAM (1 channel)
[Diagram: a memory controller drives ranks; each rank contains banks; each bank holds rows and a row buffer]

Overhead of switching row-buffer contents
(1) Latency
– Row-hit access: (row access)
– Row-conflict access: (row close) → (row open) → (row access) --- about 3x the latency
(2) Power consumption
– Re-write to the DRAM cells (read-modify-write)
DRAM scheduling for a single-core (single-thread) processor: improving the row-hit ratio is what matters for a single thread.

Thread-priority control
• Thread-priority control is beneficial for multi-core chips
– Network Fair Queuing [Nesbit+ 2006], ATLAS [Kim+ 2010], Thread Cluster Memory Scheduling [Kim+ 2010]
• Typically, policies are updated periodically (each epoch contains millions of cycles in TCM)
[Diagram: Core 0 and Core 1 send priority and non-priority requests to memory (DRAM); when a core switches between memory-intensive and compute-intensive behavior mid-epoch, its priority status is not yet changed, so the wrong core keeps high priority]

Example: memory traffic of blackscholes
• One application contains both memory-intensive phases and compute-intensive phases
[Chart: misses per kilo-instructions (MPKI, 0-100) over time for blackscholes]

Phase-prediction result of TCM
• We think this inaccurate classification is caused by the conventional periodically-updated prediction strategy
[Chart: the blackscholes MPKI trace (0-100) overlaid with TCM's compute-phase / memory-phase classification]

Contribution 1: Compute-Phase Prediction
• "Distance-based phase prediction" realizes a fine-grain thread-priority control scheme
• Distance = number of committed instructions between two memory requests
– When the distance of a request exceeds Θinterval ⇒ compute phase
– When non-distant requests continue Θdistant times ⇒ memory phase
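A sketch of the distance-based classifier; the threshold values and field names are illustrative assumptions (the slide gives only the mechanism):

```c
#include <stdbool.h>
#include <stdint.h>

#define THETA_INTERVAL 1000  /* commit distance marking a "distant" request */
#define THETA_DISTANT  8     /* consecutive non-distant requests => memory phase */

struct phase_predictor {
    uint64_t last_commit_count;  /* committed insts at the previous request */
    unsigned non_distant_streak;
    bool     compute_phase;
};

/* Called on every memory request a core issues. */
static void on_memory_request(struct phase_predictor *p,
                              uint64_t committed_insts)
{
    uint64_t distance = committed_insts - p->last_commit_count;
    p->last_commit_count = committed_insts;

    if (distance > THETA_INTERVAL) {     /* distant request */
        p->non_distant_streak = 0;
        p->compute_phase = true;         /* reclassify immediately */
    } else if (++p->non_distant_streak >= THETA_DISTANT) {
        p->compute_phase = false;        /* memory-intensive phase */
    }
}
```

Unlike an epoch-based scheme, the classification flips as soon as the request pattern changes, which is the fine-grain control the slide claims.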

Phase prediction of Compute-Phase Prediction
• The prediction result nearly matches the optimal classification
• Improves fairness and system throughput
[Chart: the blackscholes MPKI trace (0-100) with the phases predicted by Compute-Phase Prediction]

Outline
Proposals:
• Compute-Phase Prediction --- thread-priority control technique for multi-core processors
• Writeback-Refresh Overlap --- mitigates the refresh penalty on multi-rank memory systems
Optimizations:
• MLP-aware priority control
• Memory bus reservation
• Activate throttling

DRAM refreshing penalty
• DRAM refreshing increases the stall time of read requests
– Stalled read requests increase the execution time
• Shifting the refresh timing cannot reduce the stall time
– It only increases the threat of stall for read requests
[Timing diagram: refreshes issued every tREFI occupy a rank for tRFC; reads to the refreshing rank stall on the memory bus]

Contribution 2: Writeback-Refresh Overlap
• Typically, modern controllers separate read phases from write phases to reduce bus-turnaround penalties
• Overlap the refresh command with the write phase
– Avoids increasing the stall time of read requests
[Timing diagram: while one rank drains writes, the other rank refreshes, so read requests no longer stall behind the refresh]

Optimization 1: MLP-Aware Priority Control
• Prioritizes low-MLP requests to reduce stall time
– This priority is higher than the priority control of compute-phase prediction
• Minimalist [Kaseridis+ 2011] also uses MLP-aware scheduling
[Diagram: Core 0 stalls on a single outstanding load while Core 1 has several loads in flight; the request queue gives extra priority to the low-MLP load]
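One way to read the combined priority rules as a comparator; the exact ordering and tie-breaks below are our illustrative reconstruction, not the paper's verbatim policy:

```c
#include <stdbool.h>

/* Illustrative request descriptor; the field names are assumptions. */
struct request {
    bool          low_mlp;        /* issuing thread has few outstanding misses */
    bool          compute_phase;  /* issuing thread is in a compute phase */
    bool          row_hit;        /* hits the currently open row */
    unsigned long age;            /* arrival order; larger = older */
};

/* True when a should be scheduled before b: low-MLP requests first
 * (the slide says this outranks compute-phase priority), then
 * compute-phase threads, then row hits, then oldest-first (FCFS). */
static bool schedule_before(const struct request *a, const struct request *b)
{
    if (a->low_mlp       != b->low_mlp)       return a->low_mlp;
    if (a->compute_phase != b->compute_phase) return a->compute_phase;
    if (a->row_hit       != b->row_hit)       return a->row_hit;
    return a->age > b->age;
}
```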

Optimization 2: Memory Bus Reservation
• Reserves HW resources to reduce the latency of critical read requests
– The data bus for reads and writes (considering tRTR/tWTR penalties)
• This method improves system throughput and fairness
[Timing diagram: an ACT followed by RD on Rank-0 (tRAS, burst length BL) contends on the shared memory bus with RDs from Rank-1, adding penalty unless the bus slot is reserved]

Optimization 3: Activate Throttling
• Controls precharge/activation based on tFAW tracking
– A too-early precharge command does not reduce the latency of the following activate command, which must wait out the tFAW window anyway
• Activate throttling increases the chance of row-hit access
[Timing diagram: four ACTs within a tFAW window; issuing PRE as early as tRP allows merely turns a potential row hit into a row conflict without making the next ACT any earlier]

Implementation: Optimized Memory Controller
• The optimized controller does not require a large HW cost
• We mainly extend the thread-priority control and the controller state for the new scheduling techniques
[Diagram: processor core → read/write/refresh queues → MUX → DDR3 devices, governed by the controller state and refresh timer; thread-priority control adds a priority bit to each request, and the controller state is extended by 2 bits]

Implementation: Hardware Cost
• Per-channel resources (341.25 B)
– Compute-Phase Prediction (258 B)
– Writeback-Refresh Overlap (2 bits)
– Other features (83 B)
• Per-request resources (3 bits)
– Priority bit, row-hit bit, timeout-flag bit
• Overall hardware cost: 2649 B

Evaluation Results
[Chart: total execution time, normalized (0-1), for the FCFS, Close, and Proposed schedulers on 1-channel and 4-channel configurations across workload mixes: MT-canneal, bl-bl-fr-fr, c1-c1, c1-c2, c1-c1-c2-c2, c2, fa-fa-fe-fe, fl-fl-sw-sw-c2-c2-fe-fe, fl-fl-sw-sw-c2-c2-fe-fe-bl-bl-fr-fr-c1-c1-st-st, fl-sw-c2-c2, st-st-st-st, and overall]

Optimization Breakdown
The 11.2% performance improvement over FCFS consists of:
• Close-page policy: 4.2%
• Baseline optimization: 4.9%
– Timeout detection, write-queue spill prevention, auto-precharge, max-activate-number restriction
• Proposal optimization: 1.9%
– Proposals: compute-phase prediction, writeback-refresh overlap
– Optimizations: MLP-aware priority control, memory bus reservation, activate throttling
The baseline optimizations (close page + baseline) together accomplish a 9.1% improvement.

Summary of memory access scheduling
High-performance memory access scheduling:
• Proposals
– A novel thread-priority control method: compute-phase prediction
– A cost-effective refreshing method: writeback-refresh overlap
• Optimization strategies
– MLP-aware priority control, memory bus reservation, activate throttling, aggressive precharge, forced refresh, timeout handling
• The optimized scheduler reduces execution time by 11.2%
• There are several trade-offs between performance and EDP
Aggregating the various optimization strategies is what matters most for DRAM system efficiency.

Access Map Pattern Matching Prefetch: Optimization Friendly Method
Yasuo Ishii, Mary Inaba, and Kei Hiraki

Background
• The speed gap between processor and memory has kept increasing
• To hide the long memory latency, many techniques have been proposed
• The importance of HW data prefetch has increased
• Many HW prefetchers have been proposed

Conventional Methods
Prefetchers use:
1. Instruction address
2. Memory access order
3. Memory address
Optimizations scramble this information:
• Out-of-order memory access
• Loop unrolling

Limitation of Stride Prefetch [Chen+95]

for (int i = 0; i < N; i++) {
    load A[2*i];   // (A)
}

[Diagram: the reference prediction table entry for load (A) --- tag, address 0xAB04, stride 2, state "steady" --- shown against the memory address space at cache-line granularity (0xAAFF, 0xAB00, ..., 0xABFF); accesses 1-4 arrive out of order]
Out-of-order memory access ⇒ the prefetcher cannot detect strides.
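A sketch of the classic stride prefetcher (reference prediction table, RPT); the table size and the simplified three-state training are illustrative:

```c
#include <stdint.h>

#define RPT_ENTRIES 256                 /* illustrative */

enum rpt_state { RPT_INIT, RPT_TRANSIENT, RPT_STEADY };

/* One RPT entry per load instruction, tagged by its PC. */
struct rpt_entry {
    uint64_t       tag;        /* PC of the load */
    uint64_t       last_addr;  /* previous address it accessed */
    int64_t        stride;
    enum rpt_state state;
};

static struct rpt_entry rpt[RPT_ENTRIES];

/* Train on (load PC, accessed address); prefetch once the stride has
 * been confirmed twice. Out-of-order arrival of the training stream
 * produces wrong strides and keeps the entry out of RPT_STEADY,
 * which is exactly the weakness shown on the slide. */
static void rpt_access(uint64_t pc, uint64_t addr,
                       void (*prefetch)(uint64_t))
{
    struct rpt_entry *e = &rpt[(pc >> 2) % RPT_ENTRIES];
    if (e->tag != pc) {                 /* new load: reallocate entry */
        *e = (struct rpt_entry){ .tag = pc, .last_addr = addr };
        return;
    }
    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride != 0 && stride == e->stride)
        e->state = (e->state == RPT_INIT) ? RPT_TRANSIENT : RPT_STEADY;
    else
        e->state = RPT_INIT;
    e->stride    = stride;
    e->last_addr = addr;
    if (e->state == RPT_STEADY)
        prefetch(addr + (uint64_t)stride);
}
```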

Weakness of Conventional Methods
• Out-of-order memory access
– Scrambles the memory access order
– The prefetcher cannot detect address correlations
• Loop unrolling
– Requires additional table entries
– Each entry is trained slowly
⇒ An optimization-friendly prefetcher is required

Access Map Pattern Matching
• Pattern matching
– Order-free prefetching
– Optimization-friendly prefetch
• Access map
– Map-based history
– 2-bit state map
– Each state is attached to a cache block

State Diagram for Each Cache Block
• Init: initialized state
• Access: already accessed
• Prefetch: prefetch request issued
• Success: the prefetched data was accessed
[State diagram: Init → Access on a demand access; Init → Prefetch when a prefetch request is issued; Prefetch → Success when the prefetched data is accessed; Access and Success absorb further accesses]
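The same per-block state machine written out as a minimal C sketch (any feedback bookkeeping on Success is omitted):

```c
/* 2-bit state per cache block in the access map. */
enum map_state { MAP_INIT, MAP_ACCESS, MAP_PREFETCH, MAP_SUCCESS };

/* Demand access: Init -> Access; Prefetch -> Success (the prefetched
 * data was actually used); Access and Success are unchanged. */
static enum map_state on_demand_access(enum map_state s)
{
    switch (s) {
    case MAP_INIT:     return MAP_ACCESS;
    case MAP_PREFETCH: return MAP_SUCCESS;
    default:           return s;
    }
}

/* Prefetch issued: only an untouched block changes state. */
static enum map_state on_prefetch_issue(enum map_state s)
{
    return (s == MAP_INIT) ? MAP_PREFETCH : s;
}
```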

Memory Access Pattern Map
• Corresponds to the memory address space
• Cache-line granularity
[Diagram: a zone of the memory address space, one 2-bit state per cache line, feeding the pattern-match logic]

Pattern Matching Logic
[Diagram: the memory access pattern map of the zone around the requested address is read out; the access-map shifter aligns it at the current address; the pattern detector tests all candidate offsets in parallel; a priority encoder and adder turn the matching offsets into prefetch requests (e.g., Addr+2), which feed back into the map through a pipeline register]
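A sequential C sketch of the matching rule (the hardware evaluates all offsets in parallel); the zone size, offset range, and exact pattern test are illustrative, and enum map_state is repeated from the sketch above so the block is self-contained:

```c
#include <stdbool.h>

enum map_state { MAP_INIT, MAP_ACCESS, MAP_PREFETCH, MAP_SUCCESS };

#define ZONE_BLOCKS 256                 /* illustrative zone size */
#define MAX_OFFSET  16                  /* candidate strides to test */

static bool accessed(const enum map_state *map, int idx)
{
    return idx >= 0 && idx < ZONE_BLOCKS &&
           (map[idx] == MAP_ACCESS || map[idx] == MAP_SUCCESS);
}

/* On an access to block `pos`: if pos-d and pos-2d were both accessed,
 * a stride-d pattern runs through pos, so prefetch pos+d. Matching on
 * the map rather than on access order is what makes AMPM insensitive
 * to out-of-order accesses and loop unrolling. */
static void ampm_match(const enum map_state *map, int pos,
                       void (*prefetch_block)(int))
{
    for (int d = 1; d <= MAX_OFFSET; d++) {
        if (accessed(map, pos - d) && accessed(map, pos - 2 * d) &&
            pos + d < ZONE_BLOCKS && map[pos + d] == MAP_INIT)
            prefetch_block(pos + d);
    }
}
```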

Parallel Pattern Matching
• Detects patterns from the memory access map
• Detects address correlations in parallel
• Searches candidates effectively
[Diagram: all candidate offsets of the access map are tested simultaneously]

AMPM Prefetch
• The memory address space is divided into zones
• Detects hot zones
• Memory access map table with LRU replacement
[Diagram: accesses select hot zones of the address space; their access maps are held in the memory access map table and fed to the pattern-match logic, which issues prefetch requests]

Features of AMPM Prefetcher
• Pattern-matching-based prefetching
– Map-based history
– Optimization-friendly prefetching
• Parallel pattern matching
– Searches candidates effectively
• Complexity-effective implementation

Methodology
• Simulation environment
– DPC framework
– Skips the first 4000M instructions and evaluates the following 100M instructions
• Benchmark
– SPEC CPU2006 benchmark suite
– Compile options: "-O3 -fomit-frame-pointer -funroll-all-loops"

IPC Measurement
• Improves performance by 53% (arithmetic mean)
• Improves performance in all benchmarks
[Chart: instructions per cycle (0-3.5), NOPREF vs. PREFETCH, for the SPEC CPU2006 benchmarks (400.perlbench through 483.xalancbmk) and their arithmetic mean]

L2 Cache Miss Count
• Reduces L2 cache misses by 76%
[Chart: L2 miss count per 100M instructions (0-5,000,000), with and without prefetch, for the SPEC CPU2006 benchmarks and their arithmetic mean]

Summary of prefetching
Access Map Pattern Matching Prefetch:
• Order-free prefetch --- optimization-friendly prefetching
• Parallel pattern matching --- complexity-effective implementation
The optimized AMPM realizes good performance:
• Improves IPC by 53%
• Reduces L2 cache misses by 76%

Q & A
[Chart: a taxonomy of prefetchers along axes such as sequence-based (order-sensitive) vs. spatial and software vs. adaptive/hybrid, including Sequential (Smith+ 1978), Buffer Block (Gindele 1977), Adaptive Seq. (Dahlgren+ 1993), Stride Prefetch (Fu+ 1992), Software Support (Mowry+ 1992), HW/SW Integrate (Gornish+ 1994), RPT (Chen+ 1995), Markov Prefetch (Joseph+ 1997), Hybrid (Hsu+ 1998), Locality Detect (Johnson+ 1998), Tag Correlation (Hu+ 2003), GHB (Nesbit+ 2004), AC/DC (Nesbit+ 2004), Spatial Pat. (Chen+ 2004), SMS (Somogyi 2006), Adaptive Stream (Hur+ 2006), FDP (Srinath+ 2007), Feedback-based (Honjo 2009), and AMPM Prefetch (Ishii+ 2009); commercial processors: SuperSPARC, R10000, PA7200, Power4, Pentium 4]

Summary
1. Speculation is the most important tool for speeding up a single-core processor
2. Our target in the next 10 years is more than 20 instructions/cycle
3. The next targets would be:
– Prediction of NoC data injection
– Prefetching for gather/scatter operations
– Practical value prediction

Questions