2
W + D + R
T
For N clothes, time T
1= N.T
SupposewehavebuiltamachineMthatcan
wash(W),dry(D),andiron(R)clothes,one
clothatatime.
TotaltimerequiredisT.
A Real-life Example
Wants to speed up the process by factor 3?
Approach 1:
W + D + R
T
W + D + R
T
W + D + R
T
Number of machine: 3
Cost : 3x
Asanalternative,wesplitthemachineintothreesmaller
machinesM
w(onlywash),M
D(onlyDry),andM
R(onlyIron)
,which
canperformthespecifictaskonly.
TimerequiredbyeachofthesmallermachinesisT/3(say)
3
W D R
T/3 T/3 T/3
For N clothes, time T3 = (2+N).T/3
Wants to speed up the process by factor 3? Approach2
Suppose we have N (large number, say 1000) cloths to wash
Time required to complete:
Approach 1: T1 = N.T
Approach2: T3 = (2+N).T/3.
Which is approximately T1/3, if N is large, speed up by 3
Cost not increases by 3x (only marginal increment of the cost)
W + D + R
T
For N clothes, time T
1= N.T
Approach 1
Approach 2
4
How does the pipeline work?
Space –time diagram / Reservation Table
5
Time required to fill the pipe
Output is available from here
What is Pipelining?
6
Amechanismforoverlappedexecutionofseveralinputsetsbypartitioning
somecomputationintoasetofk-subcomputations(orstages).
Verynominalincreaseinthecostofimplementation.
Verysignificantspeed-up(ideallyk).
Extending the concept to processor pipeline
7
Example -1
•Suppose we want to perform the following task
A
i * B
i + C
i wherei = 1,2,…..7
8
Non pipeline
2 stage Pipeline
Table1: Content of registers in pipeline version
1 clock cycle delay
9
Model of a Synchronous k-stage pipeline
clock
S
1
S
2 S
k
Stage 1 Stage 2 Stage k
Input
Output
Latch/Register (buffer) to store intermediate results
Total task is :: S
1+S
2+ … S
k
Minimum clock period??
10
Task
(1+2)
Clock (f
clk)
t = 10nS
task1
Clock (f
clk)
t = 4nStask2
t = 6nS
task1
Clock (f
clk)
t = 5nStask2
t = 5nS
11
Clock Period
??????=clock period of the pipeline system
�
??????= time delay of the circuit in stage �
??????t
1 t
2 t
k
??????
�= delay of a latch
Maximum stage delay ??????
??????= ??????????????????{�
??????}
Thus??????=??????
??????+ ??????
�
Pipeline frequency ??????=
1
??????
If one result is expected to come out of pipeline every clock cycle, fwill be the
maximum throughput of the pipeline.
clock
S
1
S
2 S
k
Stage 1 Stage 2 Stage k
Input Output
12
Speedup Calculation
The total time to process N data sets by a pipeline system (k stages) is given by
�
??????=??????−1+??????.??????
??????−1.??????times required to fill the pipeline;
1 result every ??????time after that is ??????.??????
For an equivalent non-pipelined processor (i.e., one stage) the total time to process N data
sets is given by
�
1=N.k.τ(Ignoring the latch delay)
we assume that the time it takes to process a task is the same in the pipeline and non-pipeline
processor.
Speed-up of the k-stage pipeline over equivalent non-pipelined processor.
�
??????=
�1
�
??????
=
??????.??????.??????
[(??????−1)+??????].??????
=
??????.??????
�+(??????−1)
If ??????→∞, �
??????→??????(Maximum speed up possible ≡ no of stages in the pipeline(k))
Problem-1
13
14
Pipeline Efficiency
How close is the performance to its ideal
value
??????
??????=
�
??????
??????
=
??????.??????
??????+??????−1
×
1
??????
=
??????.??????
�+(??????−1)
Pipeline Throughput
Number of operations computed per unit time
??????
??????=
??????
�
??????
=
??????
??????+??????−1.??????
Latency
It is the length of time required to perform a single task or Time required to get the first
result.
15
How does the pipeline work?
Latency : 3 x T/3 = T
Critical path delay (t
critical) = The path through the logic which determines
the ultimate speed of the structure.
Maximum clock frequency (f
clk) =
t
latch= delay due to latch.
Latency = Length of time required to perform a single task =10nS.
(t
latch= ignored).
throughput = Numbers of tasks executed per unit time. i.e.
= 1 output/10nS.
Sampling rate = 1/10nS = 100 Msamples/sec1
()tt
criticallatch
task
Clock (f
clk)
t = 10nS
02 1345
10nS
Non-Pipeline
16
•T
clk= max {t1, t2, t3}, i.e. T
clk= 4nS.
(t
latch= ignored).
•f
clk= 1/T
clk= 250 MHz.
•Speedup =
•throughput = 1 output/4nS
•latency = Nos. of stages ×T
clk
= 3 ×4nS = 12nS.
•Sampling Rate = 250 Msamples/sec 1/4
1/10
outputnS
outputnS
P1
P2
P3
clock
t1= 3nS
t2=4nS
t3=3nS
Utilization
¾ or 75%
1 or 100%
¾ or 75%
02 1345
4nS
Pipeline
P3
P2
P1
0 4 8 1216
T1
T1
T1
T2
T2
T2
T3
T3
T3
Space
Time (ns)
20
17
18
Problem-2
Suppose that the time delays of the four segments of a pipeline systems are t
1= 60ns, t
2= 70ns, t
3= 100ns
and t
4= 80ns, and interface registers have a delay of t
r= 10ns. Calculate the speed-up of the pipelined system
over the equivalent non pipelined system.
19
Instruction Pipeline
20
Four segment CPU pipeline
clock
FI DA EX
Input
Output
Latch/Register
FO
Problem 3
21
Location/Address
0
1
021
022
023
Write a program using the instructions mentioned below, which can perform 5-6?
5
6
result
PC