Introduction_pipeline24.ppt

About This Presentation

A Computer Organization and Architecture (COA) presentation about pipelining.


Slide Content

How do we increase the speed of computers?
Use parallelism in a single processor:
• Overlapping operations of various units of the computer: processor and I/O work simultaneously
• Overlapping operations of successive instructions: pipelining
• Superscalar architecture: multiple instructions in one clock cycle
Parallel computers:
Use a number of interconnected processors working cooperatively to solve the problem
Flynn's classification: SISD, SIMD (vector and array processors), MISD, MIMD

Performance
• Execution of a program: turnaround time
• CPU time needed to execute a program = CPU clock cycles for the program × clock cycle time
• Cycles per instruction (CPI)
• Time = Ic × CPI × τ (Ic = instruction count, τ = clock cycle time)
• CPU time T = Ic × CPI / f
• MIPS = Ic / (T × 10^6)
• CPU time T = Ic × 10^-6 / MIPS
• FLOPS: floating point operations per second
• MFLOPS = floating point operations / (execution time × 10^6)
• Prefixes: mega (10^6), tera (10^12) and peta (10^15)
• Throughput rate: programs per second or transactions per unit time; used for servers
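
As a quick check, the relations above can be evaluated directly; a minimal sketch in Python (the numbers are illustrative, not from the slides):

```python
# CPU time, MIPS and MFLOPS from the definitions above.
Ic = 2_000_000      # instruction count (illustrative)
CPI = 1.5           # average cycles per instruction (illustrative)
f = 100e6           # clock frequency in Hz; cycle time tau = 1/f

T = Ic * CPI / f                # CPU time: T = Ic * CPI / f
mips = Ic / (T * 1e6)           # MIPS = Ic / (T * 10^6)
print(T, mips)                  # 0.03 s, ~66.7 MIPS

flops = 500_000                 # floating point operations executed
mflops = flops / (T * 1e6)      # MFLOPS = FP ops / (T * 10^6)
print(mflops)                   # ~16.7 MFLOPS
```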

Performance
To improve the performance of a CPU we have two options:
1) Improve the hardware by introducing faster circuits.
2) Arrange the hardware such that more than one operation can be performed at the same time.
Since there is a limit on the speed of hardware and the cost of faster circuits is quite high, we have to adopt the 2nd option.

Pipelining
• Overlap operations of successive instructions to increase throughput
• Fetching the next instruction while the current instruction is executing is known as pipelining
e.g. 1000 answer sheets with four questions each, checked by one teacher (5 minutes per question, 20 minutes per sheet), take 20 × 1000 = 20000 minutes.
Suppose there are four teachers: the first teacher checks Q1 and passes the sheet to teacher 2, who starts correcting Q2, while the first teacher takes the second sheet and corrects its Q1, and so on.
Total time = 20 + (999 × 5) = 5015 minutes
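
The arithmetic generalizes to k stages of t minutes each over n items; a minimal sketch:

```python
# Answer-sheet pipelining analogy: k teachers (stages), t minutes per
# question (stage time), n sheets (instructions).
def sequential_time(n, k, t):
    return n * k * t            # one teacher does everything

def pipelined_time(n, k, t):
    # The first sheet takes k*t to pass all teachers; after that one
    # finished sheet emerges every t minutes.
    return k * t + (n - 1) * t

print(sequential_time(1000, 4, 5))  # 20000 minutes
print(pipelined_time(1000, 4, 5))   # 5015 minutes
```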

Pipeline
• Pipelining is an effective method of increasing the execution speed of processors provided the following ideal conditions hold:
• It is possible to break up an instruction into a number of independent parts, each part taking equal time to execute
• There is so-called locality in instruction execution, i.e. instructions are executed in sequence one after the other, in the order in which they are written. If the instructions are not executed in sequence but jump around due to many branch instructions, then pipelining is not effective
• Successive instructions are also independent of one another
• Sufficient resources are available in the processor, so that if a resource is required by successive instructions in the pipeline it is readily available

Pipeline
• Performance
• Let the total number of instructions executed = m
• CPI = n (one cycle per stage of an n-stage pipeline)
• Time taken without pipelining = mn cycles
• Time taken with pipelining = n + (m - 1) cycles
• Speedup = mn / (n + (m - 1)) = n / (n/m + (m-1)/m)
• If m >> n then speedup ≈ n / ((n/m) + 1) ≈ n

Pipeline
• The first non-ideal condition is the extra buffer registers between pipeline stages
• Thus instead of 1 clock cycle per instruction, each stage takes (1 + e) clock cycles
• Speedup = mn / ((n + m - 1)(1 + e)) ≈ n / (1 + e)
• If n = 5 cycles and e = 0.1, speedup = 5/1.1 ≈ 4.5
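
Both speedup formulas are easy to check numerically; a minimal sketch:

```python
# Pipeline speedup with m instructions, n stages and per-stage latch
# overhead e (e = 0 gives the ideal case).
def speedup(m, n, e=0.0):
    time_without = m * n
    time_with = (n + m - 1) * (1 + e)
    return time_without / time_with

print(speedup(10**6, 5))       # ~5.0, approaches n for m >> n
print(speedup(10**6, 5, 0.1))  # ~4.55, i.e. n / (1 + e)
```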

Pipelining
• Pipelining, however, has several limitations.
• The speed of a pipeline is eventually limited by the slowest stage.
• For this reason, conventional processors rely on very deep pipelines (20-stage pipelines in state-of-the-art Pentium processors).
• However, in typical program traces, every 5th-6th instruction is a conditional jump! This requires very accurate branch prediction.
• The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions will have to be flushed.

A non-pipelined single-cycle processor operating at 100 MHz is converted into a synchronous pipelined processor with five stages requiring 2.5 ns, 1.5 ns, 2 ns, 1.5 ns and 2.5 ns, respectively. The delay of the latches is 0.5 ns. The speedup of the pipelined processor for a large number of instructions is:

For the non-pipelined system, time required = 2.5 + 1.5 + 2.0 + 1.5 + 2.5 = 10 ns
For the pipelined system, cycle time = max(stage delay) + max(latch delay) = 2.5 + 0.5 = 3.0 ns
Speedup = time in non-pipelined system / time in pipelined system = 10/3 ≈ 3.33

Consider a 3-stage pipelined processor having delays of 10 ns (nanoseconds), 20 ns, and 14 ns for the first, second, and third stages, respectively. Assume that there is no other delay and the processor does not suffer from any pipeline hazards. Also assume that one instruction is fetched every cycle.
The total execution time for executing 100 instructions on this processor is _____________
Answer: 2040 ns

Stage delays = 10 ns, 20 ns, 14 ns; total instructions n = 100
The pipeline cycle time is tp = max(10, 20, 14) = 20 ns
Number of stages k = 3
Total execution time = (k + (n - 1)) × tp = (3 + 100 - 1) × 20 ns = 2040 ns
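
Both worked examples follow the same pattern; a minimal sketch:

```python
# Total time and speedup for a k-stage pipeline whose cycle time is
# the slowest stage plus any latch overhead.
def cycle_time(stage_delays, latch_delay=0.0):
    return max(stage_delays) + latch_delay

def pipelined_total(n_instr, stage_delays, latch_delay=0.0):
    k = len(stage_delays)
    return (k + n_instr - 1) * cycle_time(stage_delays, latch_delay)

print(pipelined_total(100, [10, 20, 14]))        # 2040 ns
stages = [2.5, 1.5, 2.0, 1.5, 2.5]
print(sum(stages) / cycle_time(stages, 0.5))     # 10/3 = 3.33
```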

Hazards
• Delays in pipelined execution of instructions due to non-ideal conditions are called pipeline hazards in the literature
• Delay due to resource constraints is known as a structural hazard or resource dependency
• Delay due to dependency between instructions is known as a data hazard
• Delays due to branch instructions are known as control hazards or branch hazards

Structural hazard
• Delay due to resource constraints
• Suppose instruction i requires a memory read/write and instruction i+3 requires an instruction fetch from memory
• Only one of these instructions can be carried out at a time and the other has to wait. Forced waiting of an instruction in pipeline processing is called a pipeline stall
• Solution: use two different memories (caches) for instructions and data to avoid such stalls
• Pipeline execution may also be stalled if one of the steps takes longer than one clock cycle. Normally floating point division takes more than one clock cycle.

Structural hazards
Instruction i+1 is a floating point operation which takes 3 clock cycles

Loss of speedup due to resource non-availability
• Assume a fraction f of the total number of instructions are floating point instructions which take n + k clock cycles. In total m instructions are executed
• Time with no pipelining = m(1-f)n + mf(n+k) clock cycles
• Time with pipelining = n + (m-1){(1-f) + kf} clock cycles
• Speedup = n / (1 - f + kf), assuming m >> n and n >> kf
• Let f = 10%, k = 2: speedup = n/1.1 = 0.909n

• The solution is to speed up floating point execution with extra hardware so that it takes one clock cycle
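
A minimal sketch of the calculation above; note the exact ratio is slightly higher than the slide's approximation, which additionally assumes n >> kf:

```python
# Speedup when a fraction f of m instructions are floating point ops
# taking n + k cycles in an n-stage pipeline.
def speedup_with_fp(m, n, f, k):
    no_pipeline = m * (1 - f) * n + m * f * (n + k)
    with_pipeline = n + (m - 1) * ((1 - f) + k * f)
    return no_pipeline / with_pipeline

# f = 10%, k = 2, n = 5: exact value (n + fk)/(1 - f + kf) ≈ 4.73;
# the slide's approximation n/(1 - f + kf) = 0.909n gives ≈ 4.55.
print(speedup_with_fp(10**6, 5, 0.10, 2))
```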

Data dependency
• Delay due to data dependency
• Pipeline execution is delayed due to the fact that successive instructions are not always independent of one another. The result produced by an instruction may be needed by succeeding instructions, and the result may not be ready when needed.

Pipeline
Sub R7,R2,R6 completes execution before the previous instruction. This is called out-of-order execution.

Pipeline locking

Data hazards

Data dependency
Solution
Hardware method: register forwarding. The result is held in a buffer which is available to the ALU. Hardware is required to detect the dependency, and the result is fed back to the ALU

Data dependency

Pipeline
Software method: the sequence of instructions is reordered without changing the meaning of the program

Consider an instruction pipeline with four stages (S1, S2, S3 and S4), each with combinational circuit only. Pipeline registers are required between each stage and at the end of the last stage. The delays for the stages and for the pipeline registers are as given in the figure:
S1 = 5 ns, S2 = 6 ns, S3 = 11 ns, S4 = 8 ns (pipeline register delay: 1 ns)
What is the approximate speedup of the pipeline in steady state under ideal conditions when compared to the corresponding non-pipelined implementation?

Without the pipeline the total is 5 + 6 + 11 + 8 = 30 ns.
With the pipeline, each cycle is governed by the slowest stage, 11 ns, plus 1 ns of register overhead, giving a 12 ns cycle. In steady state an output is produced after every pipeline cycle, i.e. every 12 ns.
Speedup = 30/12 = 2.5

• Efficiency = given speedup / maximum speedup = S / Smax
• We know that Smax = n, so Efficiency = S / n
• Throughput = number of instructions / total time to complete the instructions
• So Throughput = m / ((n + m - 1) × Tp)
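
The definitions above in a minimal sketch (reusing the 12 ns cycle from the previous example for illustration):

```python
# Speedup, efficiency and throughput of an n-stage pipeline running
# m instructions with cycle time tp (in seconds).
def pipeline_metrics(m, n, tp):
    speedup = (m * n) / (n + m - 1)
    efficiency = speedup / n               # Smax = n
    throughput = m / ((n + m - 1) * tp)    # instructions per second
    return speedup, efficiency, throughput

print(pipeline_metrics(1000, 4, 12e-9))
```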

Delay due to branch instructions
A branch instruction is recognized at the decode stage, but whether the branch is taken or not is known only after the EX stage. Assume that the target address is assigned at the end of stage 4.
If the branch is not taken there is a delay of 2 cycles; if taken, a delay of three cycles

Branch Hazards
• Maximum ideal speedup = 5
• Percentage of unconditional branches = 5%
• Percentage of conditional branches = 15%
• Assume 80% of conditional branches are taken

• Average delay because of unconditional branches = 3 × 0.05 = 0.15
• Average delay due to conditional branches = 3 × (0.15 × 0.8) + 2 × (0.15 × 0.2) = 0.42
• Speedup with branches = 5 / (1 + 0.15 + 0.42) = 3.18
• % loss of speedup due to branches = 36.4%
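
The same numbers in a minimal sketch:

```python
# Average branch delay and resulting speedup for the figures above.
n = 5                              # maximum ideal speedup (stages)
f_uncond, f_cond, taken = 0.05, 0.15, 0.80
delay_uncond = 3 * f_uncond                                 # 0.15
delay_cond = 3 * f_cond * taken + 2 * f_cond * (1 - taken)  # 0.42
speedup = n / (1 + delay_uncond + delay_cond)
print(speedup)                     # ~3.18
print(100 * (1 - speedup / n))     # ~36% loss (36.4% with rounding)
```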

Branch hazards
Hardware solutions
• Branch prediction buffer (BPB)
• Branch target buffer (BTB)
The BPB uses a small fast memory called the branch prediction buffer. The lower-order bits of the branch instruction's address are used as the address into the branch prediction buffer. An entry consists of the address where the branch goes when taken, and two bits counting the number of times the branch has been taken successfully in the immediately preceding attempts

• At the decode stage the branch is recognized; initially the prediction bits are 00
• Every time the branch is taken the prediction bits are incremented by 1, and if not taken they are decremented by 1. If the prediction bits are 11 and the branch is taken they remain 11; if not taken they are decremented to 10
• The prediction bits are examined: if they are 11 or 10, control jumps to the branch address present in the prediction buffer; otherwise the next sequential instruction is fetched
• The delay is reduced to one cycle if the branch is taken and zero if it is not taken, so two clock cycles are saved
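
This scheme is the classic 2-bit saturating counter; a minimal sketch (the branch history shown is hypothetical):

```python
# 2-bit saturating counter as described above: states 0..3 (00..11);
# predict taken when the counter is 2 or 3 (10 or 11).
class TwoBitPredictor:
    def __init__(self):
        self.counter = 0          # the slides start at 00

    def predict(self):
        return self.counter >= 2  # 10 or 11 -> predict taken

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)  # saturate at 11
        else:
            self.counter = max(0, self.counter - 1)  # saturate at 00

p = TwoBitPredictor()
for outcome in [True, True, False, True]:  # hypothetical branch history
    print(p.predict(), outcome)
    p.update(outcome)
```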

• BTB: branch target buffer
Used at instruction fetch time
If the prediction is true there is no delay

• Software method
Rearrange instructions such that the statement following the branch statement (called the delay slot) is always executed once it is fetched
The delay slot may be filled by the target instruction of the branch. If the probability of the branch being taken is high then this is very effective. When the branch is not taken, the compiler should undo the statement executed in the delay slot

A processor X1 operating at 2 GHz has a standard 5-stage RISC instruction pipeline having a base CPI (cycles per instruction) of one without any pipeline hazards. For a given program P that has 30% branch instructions, control hazards incur a 2-cycle stall for every branch. A new version of the processor, X2, operating at the same clock frequency, has an additional branch predictor unit (BPU) that completely eliminates stalls for correctly predicted branches. There are neither any savings nor any additional stalls for wrong predictions. There are no structural hazards and no data hazards for X1 and X2. If the BPU has a prediction accuracy of 80%, the speedup (rounded off to two decimal places) obtained by X2 over X1 in executing P is:

CPI of the original pipeline X1 = 0.7 × 1 + 0.3 × (1 + 2) = 1.6 cycles
Given that 80% of the time stalls can be avoided in X2,
CPI of the pipeline X2 = 0.7 × 1 + 0.3 × (0.8 × 1 + 0.2 × (1 + 2)) = 1.12
Speedup = CPI(X1) / CPI(X2) = 1.6/1.12 = 1.43
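
The CPI calculation in a minimal sketch:

```python
# Effective CPI with branch stalls and an optional predictor.
def cpi(branch_frac, stall_cycles, accuracy=0.0):
    branch_cpi = accuracy * 1 + (1 - accuracy) * (1 + stall_cycles)
    return (1 - branch_frac) * 1 + branch_frac * branch_cpi

cpi_x1 = cpi(0.3, 2)              # 1.6 (no predictor)
cpi_x2 = cpi(0.3, 2, 0.8)         # 1.12 (80% accurate BPU)
print(round(cpi_x1 / cpi_x2, 2))  # 1.43
```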

The instruction pipeline of a RISC processor has the following stages: Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF), Perform Operation (PO) and Writeback (WB). The IF, ID, OF and WB stages take 1 clock cycle each for every instruction. Consider a sequence of 100 instructions. In the PO stage, 40 instructions take 3 clock cycles each, 35 instructions take 2 clock cycles each, and the remaining 25 instructions take 1 clock cycle each. Assume that there are no data hazards and no control hazards.
The number of clock cycles required for completion of execution of the sequence of instructions is ______ .

Therefore, the number of clock cycles required = cycles in the ideal case + extra cycles required (here, in the PO stage)
= (n + k - 1) + extra cycles
= (100 + 5 - 1) + 40 × (3-1) + 35 × (2-1) + 25 × (1-1)
= 104 + 80 + 35 + 0
= 219 cycles
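
The same count in a minimal sketch:

```python
# Ideal fill-and-drain cycles plus extra cycles spent in the
# multi-cycle PO stage.
k, n = 5, 100                      # stages, instructions
base = n + k - 1                   # 104 ideal cycles
extra = sum(count * (cycles - 1)
            for count, cycles in [(40, 3), (35, 2), (25, 1)])
print(base + extra)                # 219
```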

Pipelining and Superpipelined Execution
• One of the assumptions in pipelining is that each stage takes the same time, namely one clock cycle
• But some stages require less than one clock cycle, e.g. the DE and SR stages
• The clock cycle is divided into phases, and an interval appropriate for each step in the instruction cycle is allocated
• This method of pipeline execution is known as superpipelining
• In the ideal case one instruction will take half a clock cycle

Superscalar
Another approach to improving the speed of a processor is issuing several instructions simultaneously in each cycle; this is known as superscalar processing
• The question then becomes one of selecting these instructions
• Data dependency between successive instructions during pipeline execution can become quite high in a superscalar processor
• Hardware is required to allow execution of instructions in arbitrary order while ensuring the correctness of the program. Another approach is to transform the source code with a compiler in such a way that the resulting code has reduced data dependency and resource conflicts

The superscalar approach depends on the ability to execute multiple instructions in parallel.
The term instruction-level parallelism refers to the degree to which, on average, the instructions of a program can be executed in parallel.
A combination of compiler-based optimization and hardware techniques can be used to maximize instruction-level parallelism.
In a superscalar organization there are multiple functional units, each of which is implemented as a pipeline

Assume the processor has 2 integer execution units and one floating point unit
An incorrect result appears in R3 because R2 is updated before it is read; here I3 is executed before I2. This is known as an anti-dependency (WAR). It occurs when a later instruction writes a value that a previous instruction still needs to read

Superscalar Execution
• Scheduling of instructions is determined by a number of factors:
– True data dependency: the result of one operation is an input to the next.
– Resource dependency: two operations require the same resource.
– Branch dependency: scheduling instructions across conditional branch statements cannot be done deterministically a priori.
• Anti-dependency: a register written by a later instruction is still needed by a previous instruction. Also known as a write-after-read (WAR) hazard. The solution is register renaming
• When the destinations of two instructions are the same there is an output dependency, also known as a write-after-write (WAW) dependency
– The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
– The complexity of this hardware is an important constraint on superscalar processors.

Register renaming
After completing the sequence of instructions, the renamed registers overwrite their parents (R2 is the parent of R2N). The assumption is two floating point units and 2 integer units

Rescheduling the issue of instructions
The hardware of a superscalar processor provides a buffer with a capacity of several instructions. From this buffer, called an instruction window, it picks up 2 instructions in such a way that the completion time is minimized:
First time: i1 and i3 out of i1, i2, i3, i4
Second time: i2 and i5 out of i2, i4, i5, i6
Third time: i4 and i8 out of i4, i6, i7, i8, because i6 and i7 require an integer unit which is not free

Superscalar Execution: Issue Mechanisms
• In the simpler model, instructions can be issued only in the order in which they are encountered. That is, if the second instruction cannot be issued because it has a data dependency with the first, only one instruction is issued in the cycle. This is called in-order issue.
• In a more aggressive model, instructions can be issued out of order. In this case, if the second instruction has data dependencies with the first, but the third instruction does not, the first and third instructions can be co-scheduled. This is also called dynamic issue.
• Performance of in-order issue is generally limited.

Superscalar Execution
• The major problem in designing superscalar processors is that, besides the need to duplicate the instruction register, decoder and arithmetic units, it is difficult to schedule instructions dynamically to reduce pipeline delays
• Hardware looks only at a small window of instructions; scheduling them to use all available processing units is suboptimal
• Compilers can take a global view of the program and rearrange code to better utilize the resources and reduce pipeline delays

In general terms, we can group superscalar instruction issue policies into the following categories:
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order completion

Very Long Instruction Word (VLIW) Processors
• The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.
• To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
• These instructions are packed and dispatched together, hence the name very long instruction word. A word may be 128 or 256 bits long. The processor must have enough resources to execute all operations specified in an instruction word simultaneously.
• This concept was used with some commercial success in the Multiflow Trace machine (circa 1984).
• Variants of this concept are employed in the Intel IA64 processors.
• Problems: lack of sufficient instruction-level parallelism in programs, difficulties in building hardware, and inefficient use of bits in a very long instruction word

Very Long Instruction Word (VLIW) Processors: Considerations
• Issue hardware is simpler.
• The compiler has a bigger context from which to select co-scheduled instructions.
• Compilers, however, do not have runtime information such as cache misses or the branch history buffer; scheduling is, therefore, inherently conservative.
• Branch and memory prediction is more difficult.
• VLIW performance is highly dependent on the compiler. A number of techniques such as loop unrolling, speculative execution (execute instructions ahead of time, before their actual appearance, and store the results to keep the execution engine busy, using branch prediction and data flow analysis) and branch prediction are critical.
• Typical VLIW processors are limited to 4-way to 8-way parallelism.

Loop unrolling
• Improves instruction parallelism
• do i = 2, n-1
    a[i] = a[i] + a[i-1] * a[i+1]
• Loop unrolled by a factor of 2:
• do i = 2, n-2, 2
    a[i] = a[i] + a[i-1] * a[i+1]
    a[i+1] = a[i+1] + a[i] * a[i+2]
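
A Python rendering of the slide's Fortran-style loops, as a minimal sketch; it assumes n is even so both versions touch exactly the same indices:

```python
# Original loop: do i = 2, n-1 (inclusive bounds, as in Fortran).
def original(a, n):
    for i in range(2, n):
        a[i] = a[i] + a[i - 1] * a[i + 1]

# Unrolled by 2: do i = 2, n-2, 2 places two loop bodies per
# iteration, halving loop overhead and exposing more parallelism.
def unrolled(a, n):
    for i in range(2, n - 1, 2):
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]

a, b = list(range(12)), list(range(12))
original(a, 10)
unrolled(b, 10)
print(a == b)  # True: same result, fewer loop iterations
```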

Multithreaded processors
• A thread is a short sequence of instructions schedulable as a unit by the processor. A process is a long sequence of instructions which is scheduled by the OS
• A loop can be unrolled and a thread created for each value of i,
e.g. for i = 1 to 10
  a[i] = b[i] + c[i]
• Three types of multithreaded processors:
Blocked multithreading
Interleaved multithreading
Simultaneous multithreading

Block Multithreading
• The program is broken into many independent threads, each with an independent stack
• Let there be four threads A, B, C and D
• Let A be scheduled first; when a long delay occurs, the processor switches to the next thread

Interleaved
•Every instruction is issued from a different
thread

Simultaneous multithreading (SMT)
• Two or more instructions are issued simultaneously in each cycle from different threads
Pentium 4: more recent models of the Pentium 4 use a multithreading technique that the Intel literature refers to as hyperthreading. In essence, the Pentium 4 approach is to use SMT with support for two threads. Thus, the single multithreaded processor is logically two processors.

Motivation for multicore
Transistor trend: the count doubles every 18 months. More transistors fit in the same area, so more functional units are possible (hence superscalar designs) and a larger amount of cache is possible

1. Power dissipation is proportional to frequency
2. Pipelines have more stages, and the time available in each stage is one clock cycle; as frequency increases, the cycle time shrinks

A shift from 90 nm to 10 nm means that 9 such processors can be kept in the same space
System on chip (SoC)

Multicore
•A multicore computer, also known as a chip multiprocessor,
combines two or more processors (called cores) on a single
piece of silicon (called a die).
•Typically, each core consists of all of the components of an
independent processor, such as registers, ALU, pipeline
hardware, and control unit, plus L1 instruction and data
caches.
•In addition to the multiple cores, contemporary multicore
chips also include L2 cache and, in some cases, L3 cache.

At a top level of description, the main variables in
a multicore organization are as follows:
• The number of core processors on the chip
• The number of levels of cache memory
• The amount of cache memory that is shared

[Figure slides: multicore cache organizations — AMD Opteron, ARM11 MPCore; the Intel Core Duo has one such organization, and the Intel Core i7 is an example of another]

The Core i7 uses SMT cores. SMT has the effect of scaling up the number of hardware-level threads that the multicore system supports.
Thus, a multicore system with four cores and SMT that supports four simultaneous threads in each core appears to the application level the same as a multicore system with 16 cores.
As software is developed to more fully exploit parallel resources, an SMT approach appears more attractive than a superscalar approach.