Maximizing CPU Efficiency: A Comprehensive Exploration of Pipelining in Computer Architecture.pdf

haseebali10701 21 views 79 slides May 29, 2024

About This Presentation

Pipelining in computer architecture is a technique used to increase CPU throughput by allowing multiple instructions to be processed simultaneously. Instead of completing one instruction at a time, the CPU divides the execution of instructions into several stages and overlaps the execution of different instructions.


Slide Content

Asst. Prof. Dr. Sumek Wisayataksin
Computer Architectures
Chapter 5 : Pipelining Processor

Increasing Process Speed
▪Cooking Process
Process 1 : Material Preparing (Meat/Veggie Chop) - 5 mins
Process 2 : Cooking - 10 mins
Process 3 : Garnishing Food - 5 mins
Process 4 : Serving - 3 mins
It takes 23 mins to finish 1 dish !!!

Parallel Processing
With 3 chefs (Chef 1, Chef 2, Chef 3) each preparing a separate dish in parallel: 3 times speed up.
Average: 23/3 = 7.67 min/dish

Parallel Processing Limitation
▪The processors (chefs) must be experts in every process from 1-4.
▪You need many expert chefs to speed up the process.
▪Every processor (chef) must have the same skills.

Pipelining Process
[Timeline: Dish 1 through Dish 4 flow through the 4 processes, each dish starting 10 mins after the previous one.]
With 4 processes working as a pipeline, a new dish is finished every 10 mins (the time of the slowest process).
Average: 10 mins/dish

Balancing Pipeline stages
▪Since the bottleneck is Process 2 (10 mins), we can break it into 2 subprocesses of 5 mins each.
[Timeline: Dish 1 through Dish 5 flow through the 5 processes, each dish starting 5 mins after the previous one.]
With 5 balanced processes, a new dish is finished every 5 mins.
Average: 5 mins/dish
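The stage-balancing arithmetic above can be sketched in a few lines of code (an illustrative Python helper of ours, not part of the slides):

```python
# Steady-state pipeline throughput is set by the slowest stage; the
# first item still pays the full latency (sum of all stage times).
def pipeline_summary(stage_times):
    bottleneck = max(stage_times)  # one dish finishes every `bottleneck` mins
    latency = sum(stage_times)     # time until the first dish is done
    return bottleneck, latency

# Original 4 processes: prep (5), cook (10), garnish (5), serve (3)
print(pipeline_summary([5, 10, 5, 3]))    # bottleneck: 10 mins/dish
# After splitting cooking into two 5-min subprocesses (5 stages)
print(pipeline_summary([5, 5, 5, 5, 3]))  # bottleneck: 5 mins/dish
```

Note how splitting the bottleneck stage halves the steady-state time per dish while leaving the first dish's total latency unchanged.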

What is pipelining
▪Pipelining is a set of data processing elements connected in series, where the output of one element is the input of the next one.
▪The elements of a pipeline are often executed in parallel or in time-sliced fashion.
▪Some amount of buffer storage is often inserted between elements.

Pipelining and Maintaining Correctness
•We can break the execution of instructions into stages and overlap the execution of different instructions.
•We need to ensure that the results produced by our new pipelined processor are no different from those of the unpipelined one.
•What sort of situations could cause problems once we pipeline our processor?

Basic Processor Architecture
[Block diagram: the Program Counter (PC), Instruction Register (IR), data registers REG 0 through REG N, ALU, control unit (CTRL), and FLAG register (Z, C, N, V), connected by the address bus, data bus, and control bus.]

5-stage Pipelined Processor
[Datapath diagram. Stage 1 Fetch: the PC addresses the program memory (Addr / Read Inst) and is incremented by +4. Stage 2 Decode: the CTRL Box decodes the instruction and drives control signals; the register file (R0-R14) is read through ports Rn/Rm/Rs, producing values Vn/Vm/Vs, and the immediate #imm is extracted. Stage 3 Execute: the Shift unit and ALU compute results and update the Flag register; a +offset adder forms branch targets. Stage 4 Memory: the data memory is accessed (Addr, Read Data, Write Data). Stage 5 Writeback: the result Vd is written back into the register file at index Vidx.]

Stage 1 : Fetch (Read Instructions)
•The next instruction is fetched from the memory address that is currently stored in the program counter (PC) and stored into the instruction register.
•At the end of the fetch operation, the PC points to the next instruction that will be read in the next cycle.
•The next instruction might also come from a branch target address.

Stage 1 : Fetch
1. The CPU sends the contents of the PC to the address bus and sends a read command on the control bus.
2. In response to the read command (with address equal to PC), the program memory returns the data stored at the memory location indicated by the PC on the data bus.
3. The CPU copies the data from the data bus into its Instruction Register (IR).
4. The PC is incremented so that it points to the next instruction. This step prepares the CPU for the next cycle.
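The four fetch steps can be modeled in a few lines of Python (a minimal sketch of ours; the dictionary-based CPU state and memory are illustrative only):

```python
# Model the fetch cycle: PC -> address bus -> program memory -> IR, then PC+4.
def fetch(cpu, program_memory):
    addr = cpu["PC"]              # 1. send PC to the address bus (read command)
    word = program_memory[addr]   # 2. memory returns the instruction word
    cpu["IR"] = word              # 3. copy it into the Instruction Register
    cpu["PC"] = addr + 4          # 4. increment PC for the next cycle
    return word

cpu = {"PC": 0x00, "IR": None}
mem = {0x00: "ADD R2, R0, R1", 0x04: "ORR R3, R0, #3"}
print(fetch(cpu, mem))  # fetches the instruction at 0x00; PC becomes 0x04
```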

Stage 2 : Decode (Inst Interpretation)
▪During this stage, the encoded instruction present in the instruction register is interpreted by the decoder.
▪Register values and immediate constants are also read in this stage.

Stage 3 : Execute (Calculation)
▪The CPU sends the decoded instruction as a set of control signals to the corresponding computer components.
▪If the instruction involves arithmetic or logic, the ALU is utilized to compute the data.
▪This is the only stage of the instruction cycle that computes data. Flags are updated in this stage.
▪The branch address (for a jump) is also processed here.

Stage 4 : Memory (Access RAM)
▪This stage accesses the data memory (LDR/STR).
▪This process must come after the Execute stage because the address can be pre-indexed.
▪The address from the ALU calculation is sent to the address bus.
▪At the end of the stage, the data memory returns/stores 32-bit data to/from the data bus.

Stage 5 : Writeback (Write Output)
▪In this stage, the CPU writes the data registers (R0-R12).
▪Results from the Execute and Memory stages are stored in the register selected by the instruction (the destination register).
[Diagram: the writeback result enters the register file (R0-R14) through the Vd write port, with Vidx selecting the destination register.]

Data Processing Execution Process
PC   : Instruction
0x00 : ADD R2, R0, R1        // R2 = R0+R1
0x04 : ORR R3, R0, #3        // R3 = R0 | 0x03
0x08 : SUB R4, R1, R7 LSR #4 // R4 = R1 - (R7>>4)
0x0C : EOR R5, R0, R6 LSL R7 // R5 = R0 ^ (R6<<R7)

ADD R2, R0, R1
[Datapath diagram]
Stage 1 Fetch: PC=0x00 addresses the program memory; ADD R2, R0, R1 is loaded into the Instruction Register; PC becomes 0x04.
Stage 2 Decode: Rn=0, Rm=1; the register file returns Vn=R0, Vm=R1.
Stage 3 Execute: the ALU performs ADD on R0 and R1, producing R0+R1.
Stage 4 Memory: no memory access; R0+R1 passes through.
Stage 5 Writeback: R2 = R0+R1.

ORR R3, R0, #3
[Datapath diagram]
Stage 1 Fetch: PC=0x04 addresses the program memory; ORR R3, R0, #3 is loaded into the Instruction Register; PC becomes 0x08.
Stage 2 Decode: Rn=0, imm=3; the register file returns Vn=R0.
Stage 3 Execute: the ALU performs ORR on R0 and imm=3, producing R0|3.
Stage 4 Memory: no memory access; R0|3 passes through.
Stage 5 Writeback: R3 = R0|3.

SUB R4, R1, R7 LSR #4
[Datapath diagram]
Stage 1 Fetch: PC=0x08 addresses the program memory; SUB R4, R1, R7 LSR #4 is loaded into the Instruction Register; PC becomes 0x0C.
Stage 2 Decode: Rn=1, Rm=7, imm=4; the register file returns Vn=R1, Vm=R7.
Stage 3 Execute: the shifter computes R7>>4; the ALU performs SUB, producing R1-(R7>>4).
Stage 4 Memory: no memory access; R1-(R7>>4) passes through.
Stage 5 Writeback: R4 = R1-(R7>>4).

EOR R5, R0, R6 LSL R7
[Datapath diagram]
Stage 1 Fetch: PC=0x0C addresses the program memory; EOR R5, R0, R6 LSL R7 is loaded into the Instruction Register; PC becomes 0x10.
Stage 2 Decode: Rn=0, Rm=6, Rs=7; the register file returns Vn=R0, Vm=R6, Vs=R7.
Stage 3 Execute: the shifter computes R6<<R7; the ALU performs EOR, producing R0^(R6<<R7).
Stage 4 Memory: no memory access; R0^(R6<<R7) passes through.
Stage 5 Writeback: R5 = R0^(R6<<R7).

Memory Access Execution Process
PC   : Instruction
0x10 : LDR R2, [R0,#4]  // R2 = mem[R0+4]
0x14 : LDR R4, [R0], #4 // R4 = mem[R0]; R0 = R0+4
0x18 : STR R3, [R0,R2]! // mem[R0+R2] = R3; R0 = R0+R2
0x1C : STR R5, [R1], R9 // mem[R1] = R5; R1 = R1+R9

LDR R2, [R0,#4]
[Datapath diagram]
Stage 1 Fetch: PC=0x10 addresses the program memory; LDR R2, [R0,#4] is loaded into the Instruction Register; PC becomes 0x14.
Stage 2 Decode: Rn=0, imm=4; the register file returns Vn=R0.
Stage 3 Execute: the ALU performs ADD, computing the address R0+4.
Stage 4 Memory: the data memory is read at address R0+4, returning Mem[R0+4].
Stage 5 Writeback: R2 = Mem[R0+4].

LDR R4, [R0], #4
[Datapath diagram]
Stage 1 Fetch: PC=0x14 addresses the program memory; LDR R4, [R0],#4 is loaded into the Instruction Register; PC becomes 0x18.
Stage 2 Decode: Rn=0, imm=4; the register file returns Vn=R0.
Stage 3 Execute: the ALU is bypassed, so the memory address is R0 itself; R0+4 is computed for the base update.
Stage 4 Memory: the data memory is read at address R0, returning Mem[R0].
Stage 5 Writeback: R4 = Mem[R0] and R0 = R0+4.

STR R3, [R0,R2]!
[Datapath diagram]
Stage 1 Fetch: PC=0x18 addresses the program memory; STR R3, [R0,R2]! is loaded into the Instruction Register; PC becomes 0x1C.
Stage 2 Decode: Rn=0, Rm=2, Rs=3; the register file returns Vn=R0, Vm=R2, Vs=R3.
Stage 3 Execute: the ALU performs ADD, computing the address R0+R2; the store value R3 is carried along.
Stage 4 Memory: R3 is written to the data memory at address R0+R2.
Stage 5 Writeback: R0 = R0+R2 (base register update).

STR R5, [R1], R9
[Datapath diagram]
Stage 1 Fetch: PC=0x1C addresses the program memory; STR R5, [R1], R9 is loaded into the Instruction Register; PC becomes 0x20.
Stage 2 Decode: Rn=1, Rm=9, Rs=5; the register file returns Vn=R1, Vm=R9, Vs=R5.
Stage 3 Execute: the ALU is bypassed, so the memory address is R1 itself; R1+R9 is computed for the base update.
Stage 4 Memory: R5 is written to the data memory at address R1.
Stage 5 Writeback: R1 = R1+R9.

The pipeline flow
        FE          DE          EX          ME          WB
CLK 1   INST 0x00
CLK 2   INST 0x04   INST 0x00
CLK 3   INST 0x08   INST 0x04   INST 0x00
CLK 4   INST 0x0C   INST 0x08   INST 0x04   INST 0x00
CLK 5   INST 0x10   INST 0x0C   INST 0x08   INST 0x04   INST 0x00
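The fill pattern in this table can be generated mechanically (a small Python sketch of ours, assuming one instruction enters the pipeline per cycle with no stalls):

```python
# Which instruction (by address) occupies each stage on a given clock cycle.
STAGES = ["FE", "DE", "EX", "ME", "WB"]

def stage_contents(clk, addrs):
    row = {}
    for s, stage in enumerate(STAGES):
        i = clk - 1 - s          # index of the instruction now in stage s
        if 0 <= i < len(addrs):
            row[stage] = addrs[i]
    return row

addrs = [0x00, 0x04, 0x08, 0x0C, 0x10]
print(stage_contents(5, addrs))  # on CLK 5 the pipeline is completely full
```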

The pipeline in Action
[Pipeline diagram over clock cycles 1-9: each instruction passes through FETCH, DEC, EXE, MEM, WB, entering one cycle after its predecessor.]
On clock cycle 5, the pipeline contains all the instructions listed, each in a different pipeline stage:
1. LDR R1, [R2]
2. ADD R2, R2, R3
3. STR R4, [R5], #4
4. SUB R0, R0, #1
5. LDR R5, [R3]

Branch Execution Process
PC   : Instruction
0x20 : CMP R3, #0 // R3 - 0, update flags
0x24 : BEQ JUMP1
0x28 : ADD R7, R7, #1
0x2C : ....
0x30 : ....
0x34 : ....
0x38 : JUMP1: SUB R7, R7, #1

BEQ JUMP1 (Condition Met)
[Datapath diagram]
Stage 1 Fetch: PC=0x24 addresses the program memory; BEQ JUMP1 is loaded into the Instruction Register; PC becomes 0x28.
Stage 2 Decode: the branch offset is extracted; the target is formed as (Offset*4)+4 added to the PC.
Stage 3 Execute: the target address 0x38 is computed and the condition is checked against the flags.
Stage 4 Memory: the condition is MET, so the PC is loaded with 0x38; the next fetch reads SUB R7, R7, #1 at 0x38, and PC advances to 0x3C.

BEQ JUMP1 (Condition Not Met)
[Datapath diagram]
Stages 1-3 proceed as before: BEQ JUMP1 is fetched at 0x24 and the target address 0x38 is computed.
Stage 4 Memory: the condition is NOT MET, so the PC keeps its sequential value 0x28; the next fetch reads ADD R7, R7, #1 at 0x28, and PC advances to 0x2C.

Data Hazard
[Pipeline diagram, clock cycles 1-6]
1. LDR R1, [R2]       FETCH DEC EXE MEM WB
2. ADD R1, R1, R3           FETCH DEC EXE MEM WB
R1 is only updated in the WB stage of the LDR, but R1 = R1+R3 is executed before the LDR result is written.
The result of ADD R1, R1, R3 is therefore wrong. This is called a "Data Hazard".

Data Dependencies - True Data Dependencies
[Pipeline diagrams, clock cycles 1-8]
1. LDR R1, [R2]       FETCH DEC EXE MEM WB
2. ADD R1, R1, R3           FETCH STALL STALL DEC EXE MEM WB   (waits until R1 is updated)
3. ADD R2, R2, R3                 FETCH STALL STALL DEC EXE MEM WB
Stalling the dependent ADD until the LDR has written R1 restores correctness; the following instruction is delayed behind it.

Data Dependencies
True Data Dependencies
True data dependence (green arrow), also known as a Read-After-Write (RAW) dependence.
If we imagine two instructions, i followed by j: if j uses the result of i, we say that j is data dependent on i. This is an example of a true data dependence (a read after a write).
This type of data dependency causes pipeline hazards.
SUB R1, R1, #2   (Write R1)
ADD R1, R1, R3   (Read R1)
ADD R2, R2, R3
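The RAW definition above (j reads what i writes) can be expressed as a one-line check (a sketch of ours, using simplified (dest, sources) tuples rather than real ARM decoding):

```python
# True if j is data dependent on i: j reads a register that i writes (RAW).
def raw_dependent(i, j):
    dest_i, _ = i
    _, srcs_j = j
    return dest_i in srcs_j

sub1 = ("R1", ["R1"])        # SUB R1, R1, #2  (writes R1)
add1 = ("R1", ["R1", "R3"])  # ADD R1, R1, R3  (reads R1)
add2 = ("R2", ["R2", "R3"])  # ADD R2, R2, R3

print(raw_dependent(sub1, add1))  # True: the ADD reads the R1 the SUB writes
print(raw_dependent(sub1, add2))  # False: this ADD reads nothing SUB writes
```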

Data dependencies
Name dependencies may also exist when two instructions refer to the same register. Unlike true data dependencies, they do not by themselves cause a pipeline stall.
▪Output dependence (red arrow)
▪Anti-dependence (green arrow)
SUB R1, R1, #2   (Write R1)
ADD R1, R1, R3   (Write R1, Read R3)
LDR R3, [R2]     (Write R3)
SUB R2, R3, #1

Data dependencies
▪Output dependence (red arrow), also known as a Write-After-Write (WAW) dependence.
▪We need to ensure we don't reorder writes to the same register; otherwise subsequent instructions could receive the wrong data value.
▪This type doesn't cause a data hazard if the execution order doesn't change.
SUB R1, R1, #2   (Write R1)
ADD R1, R1, R3   (Write R1)

Data dependencies
▪Anti-dependence (green arrow), also known as a Write-After-Read (WAR) dependence.
▪We need to be careful not to overwrite a register whose current value is still required by an earlier instruction.
▪This type doesn't cause a data hazard if the execution order doesn't change.
ADD R1, R1, R3   (Read R3)
LDR R3, [R2]     (Write R3)
SUB R2, R3, #1   (Read R3)

Data Hazards
▪Data hazards occur when an instruction depends on the result of a previous instruction and that result has not yet been computed. Whenever two different instructions use the same storage location, that location must appear as if the instructions executed in sequential order.
[Pipeline diagram: 1. LDR R1, [R2] proceeds FETCH DEC EXE MEM WB; 2. ADD R1, R1, R3 stalls for two cycles until R1 is updated; 3. ADD R2, R2, R3 is delayed behind it.]

An ideal Pipeline
In the ideal case, when our pipeline never stalls, our CPI will equal 1 (and IPC = 1).
If we need to stall the pipeline, our CPI will increase. For example, if we must stall for 1 cycle for 20% of instructions, and 3 cycles for 5% of instructions, our new CPI would be:
Pipeline CPI = ideal pipeline CPI + pipeline stalls per instruction
             = 1 + 1*0.20 + 3*0.05 = 1.35
Note: to make the best use of pipelining, we should avoid stalling as much as possible (without increasing our clock period!). Remember:
Time = instructions executed × Clocks Per Instruction (CPI) × clock period
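The CPI calculation above can be reproduced directly in code (a trivial Python sketch; the helper name is ours):

```python
# Pipeline CPI = ideal CPI + average stall cycles per instruction.
def pipeline_cpi(ideal_cpi, stalls):
    """stalls: list of (stall_cycles, fraction_of_instructions) pairs."""
    return ideal_cpi + sum(cycles * frac for cycles, frac in stalls)

# 1 stall cycle for 20% of instructions, 3 stall cycles for 5%:
cpi = pipeline_cpi(1.0, [(1, 0.20), (3, 0.05)])
print(round(cpi, 2))  # 1.35
```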

Hazard Elimination Methods
1. Pipeline bubbling / Stall
As instructions are fetched, control logic determines whether a hazard could or will occur. If so, the control logic inserts a STALL (bubble) into the pipeline. Thus, before the next instruction (which would cause the hazard) executes, the prior one has had sufficient time to finish, preventing the hazard.
[Pipeline diagram: ADD R1, R1, R3 stalls for two cycles after LDR R1, [R2] until R1 is updated; ADD R2, R2, R3 follows behind.]

Hazard Elimination Methods
2. Out-of-Order Execution
We can avoid data hazards by reordering instructions. However, the reordered program must not break the data dependencies:
▪True data dependencies - Read After Write (RAW)
▪Output dependencies - Write After Write (WAW)
▪Anti-dependencies - Write After Read (WAR)
Note - Read After Read (RAR) reordering does not affect data dependencies.
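The reordering rule can be written as a small predicate (our sketch, with instructions as simplified (dest, sources) tuples): swapping two instructions is safe only when none of RAW, WAW, or WAR exists between the pair, while RAR does not matter.

```python
# Safe to swap i and j only if no RAW, WAW, or WAR dependence links them.
def can_reorder(i, j):
    dest_i, srcs_i = i
    dest_j, srcs_j = j
    raw = dest_i in srcs_j    # j reads what i writes
    waw = dest_i == dest_j    # both write the same register
    war = dest_j in srcs_i    # j overwrites what i still reads
    return not (raw or waw or war)

print(can_reorder(("R1", ["R2"]), ("R3", ["R2"])))  # True: only RAR on R2
print(can_reorder(("R1", ["R2"]), ("R3", ["R1"])))  # False: RAW on R1
```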

Hazard Elimination Methods
3. Data forwarding
A technique to resolve data hazards by passing a value directly from a later stage to an earlier stage, in order to avoid stalling.
[Diagram: FE DE EX ME WB, with a data-forwarding path from ME back to DE.]

No Data forwarding (Memory)
[Pipeline diagram, clock cycles 1-8]
1. LDR R1, [R2]       FETCH DEC EXE MEM WB
2. ADD R1, R1, R3           FETCH STALL STALL DEC EXE MEM WB
Without forwarding, the ADD must wait until R1 is updated in the LDR's WB stage.

Data forwarding (Memory)
[Pipeline diagram, clock cycles 1-8, with a forwarding path from ME to DE]
1. LDR R1, [R2]       FETCH DEC EXE MEM WB
2. ADD R1, R1, R3           FETCH STALL DEC EXE MEM WB
With forwarding from ME to DE, the ADD stalls only one cycle, waiting for the MEM read.

No Data forwarding (Execute)
[Pipeline diagram, clock cycles 1-8]
1. MVN R1, R1         FETCH DEC EXE MEM WB
2. ADD R1, R1, #1           FETCH STALL STALL DEC EXE MEM WB
Without forwarding, the ADD must wait until R1 is updated in the MVN's WB stage.

Data forwarding (Execute)
[Pipeline diagram, clock cycles 1-8, with forwarding]
1. MVN R1, R1         FETCH DEC EXE MEM WB
2. ADD R1, R1, #1           FETCH DEC EXE MEM WB
With the MVN's result forwarded to the following instruction, no stall is needed.

Data forwarding Disadvantage
▪Data forwarding reduces pipeline stalls. However, the forwarding path increases the logic delay significantly.
▪A badly placed forwarding path will reduce the overall clock speed due to the longer path.
▪The processor designer must compromise between data forwarding and pipeline hazards to optimize processor performance.
[Diagram: the forwarding path between the DE and EX stages adds extra delay to the stage logic.]

Exercise 5.1
▪Draw Pipeline stage diagram of these instructions with pipeline
stall to avoid data hazard.
MOV R0, #0x30
MOV R1, #0x54
MOV R2, #0xAA
ADD R3, R1, R2
SUB R4, R0, #1

Exercise 5.1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
MOV R0, #0x30
MOV R1, #0x54
MOV R2, #0xAA
ADD R3, R1, R2
SUB R4, R0, #1

Exercise 5.2
▪Draw Pipeline stage diagram of these instructions with pipeline
stall to avoid data hazard.
MOV R0, #0x41000000
MOV R1, #0x12
MOV R2, #0x45
STRB R1, [R0], #1
STRB R2, [R0], #1
ADD R3, R3, R1
SUB R4, R4, R2

Exercise 5.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
MOV R0, #0x41000000
MOV R1, #0x12
MOV R2, #0x45
STRB R1, [R0], #1
STRB R2, [R0], #1
ADD R3, R3, R1
SUB R4, R4, R2

Exercise 5.3
▪Draw Pipeline stage diagram of these instructions with pipeline
stall after reordering instructions to avoid data hazard.
MOV R0, #0x41000000
MOV R1, #0x12
MOV R2, #0x45
STRB R1, [R0], #1
STRB R2, [R0], #1
ADD R3, R3, R1
SUB R4, R4, R2

Exercise 5.3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Control Hazard (Branching Hazard)
▪A control hazard occurs whenever the pipeline must decide whether or not to jump before the branch outcome is known.
CMP R0, #0
BEQ ZERO
ONE:  ADD R1, R1, #1
ZERO: ADD R2, R2, #1
[Flow chart: if R0==0, jump to ZERO; otherwise jump to ONE.]

Control Hazard (Branching Hazard)
[Pipeline diagrams, clock cycles 1-8]
CASE R0==0:
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                     FETCH DEC EXE MEM WB
3. ZERO: ADD R2, R2, #1               FETCH STALL STALL DEC EXE MEM WB
CASE R0!=0:
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                     FETCH DEC EXE MEM WB
3. ONE: ADD R1, R1, #1                FETCH STALL STALL DEC EXE MEM WB
Either way, the instruction after the branch cannot proceed with certainty until the branch is resolved.

Control Hazard
What happens immediately after we fetch a conditional branch instruction?
1. We must determine if the branch is taken or not.
2. If the branch is taken, we must compute and branch to the target address.
[Flow chart: if R0==0, jump to ZERO; otherwise jump to ONE.]

How Would We Like to Handle Branches?
-In a pipelined processor, we would like to calculate the next value of the program counter (PC) in parallel with reading our instruction memory.
-In practice, this is non-trivial to support, as branches may not be decoded and evaluated until later in the pipeline.
FE DE EX ME WB

How Would We Like to Handle Branches?
[Datapath diagram, stages 1-4: ideally the next PC would be selected during Fetch, but the branch target (+offset) and the condition flags are only produced in the Decode and Execute stages.]

How to speed up the branch instruction
•Option 1: Assume a branch is not taken.
•Option 2: Evaluate the branch earlier in the pipeline.
•Option 3: Branch prediction.

Assume the Branch Is Not Taken
[Pipeline diagrams, clock cycles 1-8]
Branch not taken (assumption correct):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                     FETCH DEC EXE MEM WB
3. ONE: ADD R1, R1, #1                FETCH DEC EXE MEM WB   (fetched before the branch decision)
Branch taken (assumption wrong):
3. ZERO: ADD R2, R2, #1               (the wrongly fetched instructions are FLUSHED from the pipeline before fetch restarts at the target)

Assume a Branch Is Not Taken
If we evaluate the branch in the execute stage, we would lose two cycles every time we encounter a taken branch.
If we assume 20% of instructions are branches, 60% of branches are taken, and an otherwise perfect CPI of 1:
CPI contribution from taken branches = 0.2 * 2 * 0.6 = 0.24
Our new CPI = 1.0 + branch stalls = 1.24
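The same branch-penalty arithmetic in code (a trivial sketch; the variable names are ours):

```python
# 20% of instructions are branches, 60% of those are taken,
# and each taken branch costs 2 lost cycles.
branch_frac, taken_frac, penalty = 0.20, 0.60, 2
branch_stalls = branch_frac * taken_frac * penalty
cpi = 1.0 + branch_stalls
print(round(branch_stalls, 2), round(cpi, 2))  # 0.24 1.24
```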

Why don’t we assume branch is taken ?
MOV R3, #0
LOOP ADD R3, R3 ,#1
ADD R1, R1, #2
SUB R2, R2, #4
CMP R3, #10
BNE LOOP
CMP R3, #10
BNE LOOP
LOOP SUB R1, R1, R0
FETCH
FETCH EXE
DEC EXE MEM
MEM
WB
WB
FETCHSTALLSTALL
DEC
EXE MEMDEC
We don’t know
where to jump to
until offset address
is calculated.
- From the left code, it seems that
branch is taken more than not taken
(9:1)
- Why don’t we assume branch is
taken instead ?

Evaluate the branch earlier in the pipeline
▪Using the same idea as data forwarding, the branch is evaluated early, in the DEC stage, and the Flag is bypassed to it.
[Diagram: FE DE EX ME WB, with the flag check and branch address calculation moved to DE.]

Evaluate the branch earlier in the pipeline
[Pipeline diagram, with the flag check and branch address calculated in DE]
CMP R3, #10           FETCH DEC EXE MEM WB
BNE LOOP                    FETCH DEC EXE MEM WB
LOOP SUB R1, R1, R0               FETCH STALL DEC EXE MEM WB
Evaluating the branch in DEC reduces the penalty of a taken branch to a single stall cycle.

Evaluate the branch earlier in the pipeline
-This technique creates a longer delay in the circuit. It affects the clock speed of the whole system and is generally considered not to be worthwhile.
[Diagram: moving the flag check and branch-address calculation into FE-DE-EX lengthens the critical path.]

Dynamic Branch Prediction
▪Predict the branch direction before the branch is executed.
▪Base the prediction on branch history.
▪Start fetching instructions in the predicted direction.

Branch Prediction Buffer (BPB)
▪A simple dynamic branch prediction algorithm is to predict a branch taken only if it was taken the last time we executed it. We can use a simple table of 1-bit entries to store our predictions.
0x00000064 : BNE LOOP
[BPB lookup: 0x64 = 0b01100100; low-order PC bits index the table entry, which holds the 1-bit prediction (0 = Not Taken, 1 = Taken; here 1) and the jump address 0x0000004C.]
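The 1-bit scheme can be sketched as a small Python class (a sketch under assumptions: the table size and the low-order-PC-bit indexing are ours; the slides do not specify them):

```python
# 1-bit branch prediction buffer: each entry stores only the most recent
# outcome (0 = not taken, 1 = taken) plus the cached jump address.
class OneBitBPB:
    def __init__(self, entries=16):
        self.entries = entries
        self.pred = [0] * entries        # initial prediction: not taken
        self.target = [None] * entries   # jump address cache

    def _index(self, pc):
        return (pc >> 2) % self.entries  # word-aligned PC: drop low 2 bits

    def predict(self, pc):
        i = self._index(pc)
        return self.pred[i], self.target[i]

    def update(self, pc, taken, target):
        i = self._index(pc)
        self.pred[i] = 1 if taken else 0  # remember only the latest outcome
        self.target[i] = target

bpb = OneBitBPB()
bpb.update(0x64, taken=True, target=0x4C)  # BNE at 0x64 was taken to 0x4C
print(bpb.predict(0x64))                   # now predicts taken, target 0x4C
```

Because only the latest outcome is kept, a loop branch mispredicts twice per loop visit (on loop exit and on the next entry), which motivates the hit-rate analysis in Exercise 5.7.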

How Would We Like to Handle Branches?
[Datapath diagram, stages 1-4: the BPB sits beside the Fetch stage; it supplies the stored jump address and the taken/not-taken decision, so the next PC can be chosen during Fetch.]

1-Bit Branch Prediction (Not Taken)
0x00000040 : CMP R0, #0
0x00000044 : BEQ ZERO
0x00000048 : ONE  ADD R1, R1, #1
0x0000004C : ZERO ADD R2, R2, #1
[BPB lookup: 0x44 = 0b01000100 indexes entry 0001; the prediction bit is 0 (Not Taken), with jump address 0x0000004C stored.]

1-Bit Branch Prediction
[Pipeline diagrams, clock cycles 1-8]
Predicted not taken, branch not taken (prediction correct):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                     FETCH DEC EXE MEM WB
3. ONE: ADD R1, R1, #1                FETCH DEC EXE MEM WB   (PC+4 fetched before the prediction)
Predicted not taken, branch taken (prediction wrong):
3. ZERO: ADD R2, R2, #1               (the PC+4 fetch is discarded; fetch restarts at B_addr, losing cycles)

1-Bit Branch Prediction (Taken)
0x00000040 : CMP R0, #0
0x00000044 : BEQ ZERO
0x00000048 : ONE  ADD R1, R1, #1
0x0000004C : ZERO ADD R2, R2, #1
[BPB lookup: 0x44 = 0b01000100 indexes entry 0001; the prediction bit is 1 (Taken), so the next fetch goes to the stored jump address 0x0000004C.]

1-Bit Branch Prediction
[Pipeline diagrams, clock cycles 1-8]
Predicted taken, branch taken (prediction correct):
1. CMP R0, #0             FETCH DEC EXE MEM WB
2. BEQ ZERO                     FETCH DEC EXE MEM WB
3. ZERO: ADD R2, R2, #1               FETCH DEC EXE MEM WB   (0x4C fetched after the prediction)
Predicted taken, branch not taken (prediction wrong):
3. ONE: ADD R1, R1, #1                (the 0x4C fetch is discarded; fetch restarts at PC+4, losing cycles)

Exercise 5.4
▪Draw Pipeline stage diagram of these instructions with pipeline
stall using Branch Not-Taken method to avoid branch hazard.
MOV R0, #0x41000000
MOV R1, #0x0
MOV R2, #0x82
B1 ADD R1, R1, #1
ADD R3, R3, R0
SUB R4, R4, R2
CMP R1, #0x2
BNE B1
NOP

Exercise 5.5
▪Draw Pipeline stage diagram of these instructions with pipeline
stall using Branch Prediction to avoid branch hazard.
MOV R0, #0x41000000
MOV R1, #0x0
MOV R2, #0x82
B1 ADD R1, R1, #1
ADD R3, R3, R0
SUB R4, R4, R2
CMP R1, #0x2
BNE B1
NOP

Exercise 5.6
▪Calculate the stall cycles caused by B1 branch in Exercise 5.4
MOV R0, #0x41000000
MOV R1, #0x0
MOV R2, #0x82
B1 ADD R1, R1, #1
ADD R3, R3, R0
SUB R4, R4, R2
CMP R1, #0x10
BNE B1
NOP

Exercise 5.7
▪Calculate prediction success rate (Hit rate) of B1 branch in these
program, if the initial prediction is Not-Taken and calculate the stall
cycles caused by B1 branch in Exercise 5.5
MOV R0, #0x41000000
MOV R1, #0x0
MOV R2, #0x82
B1 ADD R1, R1, #1
ADD R3, R3, R0
SUB R4, R4, R2
CMP R1, #0x10
BNE B1
NOP