Performance Enhancement with Pipelining

AneeshR 4,399 views 72 slides May 11, 2014

Slide Content

Aneesh Raveendran
Centre for Development of Advanced Computing, India

•What is pipelining?
•Pipeline Taxonomies
•Instruction Pipelines
•MIPS Instruction Pipeline
•Pipeline Hazards
•MIPS Pipelined Datapath
•Load Word Instruction Example
•Pipeline Datapath Example
•Pipeline Control
•Pipeline Instruction Example

•Pipeline Hazards
•Control Hazards
•Data Hazards
•Detecting Data Hazards
•Resolving Data Hazards
•Forwarding Example
•Stalling Example
•Branch Hazards
•Branching Example
•Key terms

•There are two main ways to increase the performance of
a processor through high-level system architecture
•Increasing the memory access speed
•Increasing the number of supported concurrent operations
•Pipelining !
•Parallelism ?
•Pipelining is the process by which instructions are
parallelized over several overlapping stages of
execution, in order to maximize datapath efficiency

•Pipelining is analogous to many everyday scenarios
•Car manufacturing process
•Batch laundry jobs
•Basically, any assembly-line operation applies
•Two important concepts:
•New inputs are accepted at one end before previously
accepted inputs appear as outputs at the other end;
•The number of operations performed per second is increased,
even though the elapsed time needed to perform any one
operation remains the same

Looking at the textbook’s example,
we have a 4-stage pipeline of
laundry tasks:
1.Place one dirty load of clothes
into washer
2.Place the washed clothes into a
dryer
3.Place a dry load on a table and
fold
4.Put the clothes away
[Figure: sequential (top) vs. pipelined (bottom) execution of the four laundry loads]
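The throughput gain of the laundry example can be sketched numerically. A minimal sketch (plain Python, not from the slides) comparing sequential and pipelined completion times for four loads through a four-stage pipeline, assuming every stage takes one time unit:

```python
def sequential_time(loads, stages, stage_time=1):
    # Each load passes through every stage before the next load starts.
    return loads * stages * stage_time

def pipelined_time(loads, stages, stage_time=1):
    # The first load fills the pipeline; each later load finishes
    # one stage_time after the previous one.
    return (stages + loads - 1) * stage_time

loads, stages = 4, 4  # four laundry loads; wash, dry, fold, put away
print(sequential_time(loads, stages))  # 16 time units
print(pipelined_time(loads, stages))   # 7 time units
```

Note that the latency of any single load is still four time units; only the throughput (loads finished per unit time) improves, exactly as the second concept above states.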

•There are two types of pipelines used in computer systems
•Arithmetic pipelines
•Used to pipeline data intensive functionalities
•Instruction pipelines
•Used to pipeline the basic instruction fetch and execute sequence
•Other classifications include
•Linear vs. nonlinear pipelines
•Presence (or lack) of feedforward and feedback paths between stages
•Static vs. dynamic pipelines
•Dynamic pipelines are multifunctional, taking on a different form
depending on the function being executed
•Scalar vs. vector pipelines
•Vector pipelines specifically target computations using vector data

•Let us now introduce the pipeline we’re working with
•It’s a 5-stage, linear, static, scalar instruction pipeline,
consisting of the following steps:
•Fetch instruction from Memory (IF)
•Read registers while decoding the instruction (ID)
•Execute the operation or calculate an address (EX)
•Access an operand in data memory (MEM)
•Write the result into a register (WB)
•Again, theoretically, pipeline speedup = number of
stages in pipeline

Stage latencies: instruction fetch (2 ns), register read/write (1 ns), ALU operation (2 ns), data access (2 ns)

[Figure: timing diagrams comparing the single-cycle implementation (one long cycle per instruction, with wasted time on shorter instructions), the multiple-cycle implementation (Ifetch, Reg, Exec, Mem, Wr steps per instruction), and the pipelined implementation (Load, Store and R-type instructions overlapping across cycles 1 through 10)]

•Suppose
•100 instructions are executed
•The single cycle machine has a cycle time of 45 ns
•The multicycle and pipeline machines have cycle times of 10 ns
•The multicycle machine has a CPI of 4.6
•Single Cycle Machine
•45 ns/cycle x 1 CPI x 100 inst = 4500 ns
•Multicycle Machine
•10 ns/cycle x 4.6 CPI x 100 inst = 4600 ns
•Ideal pipelined machine
•10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
•Ideal pipelined vs. single cycle speedup
•4500 ns / 1040 ns = 4.33
•What has not yet been considered?
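The arithmetic above can be reproduced directly. A small sketch (Python; all numbers come from the slide, and the `exec_time_ns` helper is just an illustrative name):

```python
def exec_time_ns(cycle_ns, cpi, n_inst, drain_cycles=0):
    # Total execution time = cycle time x (CPI x instruction count + pipeline drain)
    return cycle_ns * (cpi * n_inst + drain_cycles)

single = exec_time_ns(45, 1.0, 100)                   # single-cycle machine
multi = exec_time_ns(10, 4.6, 100)                    # multicycle machine
pipe = exec_time_ns(10, 1.0, 100, drain_cycles=4)     # ideal pipeline
print(round(single), round(multi), round(pipe))       # 4500 4600 1040
print(round(single / pipe, 2))                        # 4.33 speedup
```

The "what has not yet been considered" answer is, of course, hazards: the ideal 1-CPI pipeline assumes no stalls.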

•What makes it easy
•all instructions are the same length
•just a few instruction formats
•memory operands appear only in loads and stores
•What makes it hard?
•structural hazards: suppose we had only one memory
•control hazards: need to worry about branch instructions
•data hazards: an instruction depends on a previous instruction
•We’ll build a simple pipeline and look at these issues

•structural hazards: attempt to use the same resource two
different ways at the same time
•E.g., two instructions try to read the same memory at the same time
•data hazards: attempt to use item before it is ready
•instruction depends on result of prior instruction still in the pipeline
add r1, r2, r3
sub r4, r2, r1
•control hazards: attempt to make a decision before a condition is
evaluated
•branch instructions
beq r1, r2, loop
add r1, r2, r3
•Can always resolve hazards by waiting
•pipeline control must detect the hazard
•take action (or delay action) to resolve hazards

What do we need to split the datapath into stages?

Pipeline registers (buffers) are similar to multicycle processor design

Instruction fetch stage

Instruction decode and register file read stage

Execute or address calculation stage

Memory access stage

Write back stage

Write register number comes from the MEM/WB pipeline register along with the data

Multiple-clock cycle (vs. single-clock cycle) pipelined diagrams

Single-cycle pipeline diagram with one instruction on the pipeline

Single-cycle pipeline diagram with two instructions on the pipeline

•What control signals are required?
•First, notice that the pipeline registers are written every clock
cycle, hence do not require explicit control signals; the remaining stages do:
•Instruction fetch and PC increment
•Again, asserted at every clock cycle
•Instruction decode and register file read
•Again, asserted at every clock cycle
•Execution and address calculation
•Need to select the result register, the ALU operation, and either Read
data 2 or the sign-extended immediate for the ALU
•Memory access
•Need to read from memory, write to memory or complete branch
•Write back
•Need to send back either ALU result or memory value to the register file

The control lines are grouped by stage: Execution/Address Calculation stage control lines, Memory access stage control lines, and Write-back stage control lines.

Instruction | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Branch | MemRead | MemWrite | RegWrite | MemtoReg
R-format    |   1    |   1    |   0    |   0    |   0    |    0    |    0     |    1     |    0
lw          |   0    |   0    |   0    |   1    |   0    |    1    |    0     |    1     |    1
sw          |   X    |   0    |   0    |   1    |   0    |    0    |    1     |    0     |    X
beq         |   X    |   0    |   1    |   0    |   1    |    0    |    0     |    0     |    X

(RegDst, ALUOp1, ALUOp0 and ALUSrc are EX-stage lines; Branch, MemRead and MemWrite are MEM-stage lines; RegWrite and MemtoReg are WB-stage lines.)
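The control settings can be encoded as a simple lookup keyed by instruction class. A sketch (Python; the `control_signals` helper is an illustrative name, and "X" marks don't-care values exactly as in the table):

```python
FIELDS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc", "Branch",
          "MemRead", "MemWrite", "RegWrite", "MemtoReg")

# Control line values per instruction class, copied from the table.
CONTROL = {
    "R-format": (1, 1, 0, 0, 0, 0, 0, 1, 0),
    "lw":       (0, 0, 0, 1, 0, 1, 0, 1, 1),
    "sw":       ("X", 0, 0, 1, 0, 0, 1, 0, "X"),
    "beq":      ("X", 0, 1, 0, 1, 0, 0, 0, "X"),
}

def control_signals(inst_class):
    # Return the control line settings for one instruction class as a dict.
    return dict(zip(FIELDS, CONTROL[inst_class]))

print(control_signals("lw"))
```

In the pipelined datapath, these nine bits are generated in ID and then carried along in the ID/EX, EX/MEM and MEM/WB pipeline registers, with each stage peeling off the lines it needs.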

•Structural hazard
•Occurs when a combination of instructions is not supported by the datapath
•For example, a unified memory unit would need to be accessed in stages 1 (IF)
and 4 (MEM), which would cause a contention
•The pipeline cannot operate correctly in the presence of structural hazards
•Control hazard
•Occurs when a decision is made based on the results of one instruction
while others are still executing
•For example, a branch instruction is either taken or not
•Solutions that exist are stalling and predicting
•Data hazard
•Occurs when an instruction depends on the results of an instruction resident on
the pipeline
•For example, adding two register contents and storing their result into a third
register, then using that register’s contents for another operation
•Solutions that exist are based on forwarding

•Three major solutions
•Stall
•Predict
•Delayed branch slot
•Stalling involves always waiting for the PC to be updated with the
correct address before moving on
•A pipeline stall (or bubble) allows us to perform this wait
•Quite costly, as we have to stall even if the branch fails

•Predicting involves guessing whether the branch is taken or not,
and acting on that guess
•If correct, then proceed with normal pipeline execution
•If incorrect, then stall pipeline execution

•Delayed branch involves executing the next sequential instruction
with the branch taking place after that delayed branch slot
•The assembler automatically adjusts the instructions to make this
transparent to the programmer
•The instruction placed in the slot has to be safe, i.e. it must not affect the branch
•Longer pipelines require the use of more branch delay slots
•Actual MIPS architecture solution

•Forwarding involves providing the inputs to a stage of one
instruction before the completion of another instruction
•Valid if destination stage is later in time than the source stage
•Left diagram shows typical forwarding scenario (add then sub)
•Right diagram shows that we still need a stall in the case of a load-
use data hazard (load then R-type)

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $14, 100($2)

•We could insert “no operation” (nop) instructions to delay the
pipeline execution until the correct result is in the register file
sub $2, $1, $3
nop
nop
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $14, 100($2)
•Too slow as it adds extra useless clock cycles
•In practice, we try to schedule useful, independent instructions between data-
dependent ones, but dependences occur too often for this to be efficient

•Let us try to formalize detecting a data hazard
1.EX/MEM.RegisterRd = ID/EX.RegisterRs
2.EX/MEM.RegisterRd = ID/EX.RegisterRt
3.MEM/WB.RegisterRd = ID/EX.RegisterRs
4.MEM/WB.RegisterRd = ID/EX.RegisterRt
sub $2, $1, $3
and $12, $2, $5     Data hazard of type #1
or  $13, $6, $2     Data hazard of type #4
add $14, $2, $2     No data hazard – register file
sw  $14, 100($2)    No data hazard – correct operation
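The four conditions can be checked mechanically against the pipeline-register contents. A sketch (Python; the `hazard_types` helper is an illustrative name, not part of the slides) that reproduces the classification above:

```python
def hazard_types(ex_mem_rd, mem_wb_rd, id_ex_rs, id_ex_rt):
    # Return which of the four hazard conditions hold for the instruction
    # currently in the ID/EX register.
    hits = []
    if ex_mem_rd == id_ex_rs:
        hits.append(1)  # EX/MEM.RegisterRd = ID/EX.RegisterRs
    if ex_mem_rd == id_ex_rt:
        hits.append(2)  # EX/MEM.RegisterRd = ID/EX.RegisterRt
    if mem_wb_rd == id_ex_rs:
        hits.append(3)  # MEM/WB.RegisterRd = ID/EX.RegisterRs
    if mem_wb_rd == id_ex_rt:
        hits.append(4)  # MEM/WB.RegisterRd = ID/EX.RegisterRt
    return hits

# "and $12, $2, $5" in EX while "sub $2, $1, $3" is in MEM: type 1
print(hazard_types(ex_mem_rd=2, mem_wb_rd=None, id_ex_rs=2, id_ex_rt=5))  # [1]
# "or $13, $6, $2" in EX while the sub has reached WB: type 4
print(hazard_types(ex_mem_rd=12, mem_wb_rd=2, id_ex_rs=6, id_ex_rt=2))    # [4]
```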

•Two modifications are in order
•Firstly, we don’t have to forward all the time!
•Some instructions don’t write registers (e.g. beq)
•Use RegWrite signal in WB control block to determine condition
•Secondly, the $0 register must always return 0
•We can’t prevent the programmer from using it as a destination register
•Use RegisterRd to determine whether $0 is the destination
1. If (EX/MEM.RegWrite & (EX/MEM.RegisterRd ≠ 0) & (EX/MEM.RegisterRd=ID/EX.RegisterRs)) ForwardA= 10
2. If (EX/MEM.RegWrite & (EX/MEM.RegisterRd ≠ 0) & (EX/MEM.RegisterRd=ID/EX.RegisterRt)) ForwardB= 10
3. If (MEM/WB.RegWrite & (MEM/WB.RegisterRd ≠ 0) & (MEM/WB.RegisterRd=ID/EX.RegisterRs)) ForwardA= 01
4. If (MEM/WB.RegWrite & (MEM/WB.RegisterRd ≠ 0) & (MEM/WB.RegisterRd=ID/EX.RegisterRt)) ForwardB= 01
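The four refined conditions can be combined into a single forwarding-unit model. A sketch (Python; an illustrative software model of the mux control, not the hardware itself). One subtlety not spelled out in the conditions above: when both the EX/MEM and MEM/WB conditions match, the more recent EX/MEM result must win; that priority is modeled here with an elif:

```python
def forward_select(ex_mem_regwrite, ex_mem_rd,
                   mem_wb_regwrite, mem_wb_rd,
                   id_ex_rs, id_ex_rt):
    # Return the (ForwardA, ForwardB) mux settings as 2-bit strings.
    forward_a = forward_b = "00"  # default: operands come from the register file
    # EX hazard: forward the ALU result sitting in EX/MEM (conditions 1 and 2)
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        forward_a = "10"
    # MEM hazard: forward from MEM/WB only if the EX hazard did not fire (3 and 4)
    elif mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        forward_a = "01"
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rt:
        forward_b = "10"
    elif mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rt:
        forward_b = "01"
    return forward_a, forward_b

# "sub $2, $1, $3" in EX/MEM, "and $12, $2, $5" in EX: forward $2 to operand A
print(forward_select(True, 2, False, 0, id_ex_rs=2, id_ex_rt=5))  # ('10', '00')
```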
•Let us examine the hardware changes to our datapath

•Remember that there is no hazard in the WB stage, because the register
file can be written and read in the same clock cycle

Mux control   | Source | Description
ForwardA = 00 | ID/EX  | First ALU operand comes from the register file
ForwardA = 10 | EX/MEM | First ALU operand forwarded from the prior ALU result
ForwardA = 01 | MEM/WB | First ALU operand forwarded from data memory or an earlier ALU result
ForwardB = 00 | ID/EX  | Second ALU operand comes from the register file
ForwardB = 10 | EX/MEM | Second ALU operand forwarded from the prior ALU result
ForwardB = 01 | MEM/WB | Second ALU operand forwarded from data memory or an earlier ALU result

(Note the mux codes match the forwarding conditions: an EX/MEM forward selects 10 and a MEM/WB forward selects 01.)

lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

•Let us try to formalize detecting a load-use data hazard that requires a stall
• If (ID/EX.MemRead & ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
•If the condition is true, we stall the pipeline!
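The load-use condition translates directly into code. A sketch (Python; `must_stall` is an illustrative name):

```python
def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    # Stall when a load in EX will produce a register that the
    # instruction currently in ID needs as a source.
    return id_ex_memread and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)

# "lw $2, 20($1)" in ID/EX, "and $4, $2, $5" in IF/ID: the and must wait one cycle
print(must_stall(True, 2, if_id_rs=2, if_id_rt=5))   # True
# A non-load instruction ahead (MemRead deasserted): no stall needed
print(must_stall(False, 9, if_id_rs=6, if_id_rt=7))  # False
```

After the one-cycle stall, forwarding from MEM/WB can supply the loaded value, which is why a load-use hazard costs exactly one bubble rather than failing outright.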

•Other instructions are on the pipeline when we find out whether
we take the branch or not!

•Two solutions
•Assume branch is not taken
•Dynamic branch prediction
•We’ve already discussed the first solution
•Note that three instructions (those in the IF, ID and EX stages) have to be
flushed when the branch is taken
•This is done similarly to a data hazard stall (control values set to 0s)
•We can increase branch performance by moving the branch
decision to the ID stage (rather than the MEM stage)
•The branch target address is calculated by moving an adder into the ID stage
•The branch decision is made by directly comparing Rs and Rt
•Flushing the IF-stage instruction turns it into a nop

•Store, in a branch prediction buffer, the history of each branch
instruction
•A 1-bit scheme changes its prediction after a single wrong prediction
•A 2-bit scheme requires two consecutive wrong predictions before changing
its prediction
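The 2-bit scheme is a saturating counter per branch. A sketch (Python; an illustrative model, with the initial "strongly taken" state chosen for the demo, not prescribed by the slides):

```python
class TwoBitPredictor:
    # States 0-1 predict not taken, states 2-3 predict taken.
    def __init__(self, state=3):  # start strongly taken (demo assumption)
        self.state = state

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        # Saturating counter: move one step toward the actual outcome.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
p.update(False)        # one misprediction...
print(p.predict())     # True: still predicts taken
p.update(False)        # a second consecutive misprediction...
print(p.predict())     # False: the prediction finally flips
```

This is exactly why a loop-closing branch, which is taken on every iteration but the last, mispredicts only once per loop execution under the 2-bit scheme instead of twice under the 1-bit scheme.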

•Pipelining vs. Parallelism
•Pipeline Stages
•Pipeline Taxonomies
•MIPS Instruction Pipeline
•Structural Hazards
•Control Hazards
•Data Hazards
•Pipeline Registers and Operation
•Pipeline Control
•Pipeline Throughput
•Pipeline Efficiency

•Control Hazard Stalling
•Control Hazard Predicting
•Control Hazard Delayed Branch
•Data Hazard Forwarding
•Data Hazard Detection
•Forwarding Unit
•Data Hazard Stalling
•Branch Prediction Buffer

aneeshr2020@gmail.com