Parallel Computing Platforms

Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

To accompany the text "Introduction to Parallel Computing",
Addison Wesley, 2003.

Topic Overview

Implicit Parallelism: Trends in Microprocessor Architectures
Limitations of Memory System Performance
Dichotomy of Parallel Computing Platforms
Communication Model of Parallel Platforms
Physical Organization of Parallel Platforms
Communication Costs in Parallel Machines
Messaging Cost Models and Routing Mechanisms
Mapping Techniques
Case Studies

– Typeset by FoilTEX – 1

Scope of Parallelism

Conventional architectures coarsely comprise a processor, a memory system, and the datapath.
Each of these components presents significant performance bottlenecks.
Parallelism addresses each of these components in significant ways.
Different applications utilize different aspects of parallelism – e.g., data-intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance.
It is important to understand each of these performance bottlenecks.

Implicit Parallelism: Trends in Microprocessor Architectures

Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude).
Higher levels of device integration have made available a large number of transistors.
The question of how best to utilize these resources is an important one.
Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle.
The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.

Pipelining and Superscalar Execution

Pipelining overlaps various stages of instruction execution to achieve performance.
At a high level of abstraction, an instruction can be executed while the next one is being decoded and the one after that is being fetched.
This is akin to an assembly line for the manufacture of cars.

Pipelining and Superscalar Execution

Pipelining, however, has several limitations.
The speed of a pipeline is eventually limited by the slowest stage.
For this reason, conventional processors rely on very deep pipelines (20-stage pipelines in state-of-the-art Pentium processors).
However, in typical program traces, every 5th to 6th instruction is a conditional jump! This requires very accurate branch prediction.
The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions will have to be flushed.

Pipelining and Superscalar Execution

One simple way of alleviating these bottlenecks is to use multiple pipelines.
The question then becomes one of selecting these instructions.

Superscalar Execution: An Example

Example of a two-way superscalar execution of instructions.

(a) Three different code fragments for adding a list of four numbers:

(i)
1. load R1, @1000
2. load R2, @1008
3. add R1, @1004
4. add R2, @100C
5. add R1, R2
6. store R1, @2000

(ii)
1. load R1, @1000
2. add R1, @1004
3. add R1, @1008
4. add R1, @100C
5. store R1, @2000

(iii)
1. load R1, @1000
2. add R1, @1004
3. load R2, @1008
4. add R2, @100C
5. add R1, R2
6. store R1, @2000

(b) Execution schedule for code fragment (i) above, shown as pipeline stages per instruction (IF: Instruction Fetch, ID: Instruction Decode, OF: Operand Fetch, E: Instruction Execute, WB: Write-back, NA: No Action), together with the adder utilization per clock cycle.
(c) Hardware utilization trace for the schedule in (b): full and empty issue slots per cycle, where entirely empty cycles constitute vertical waste and partially empty cycles constitute horizontal waste.

Superscalar Execution: An Example

In the above example, there is some wastage of resources due to data dependencies.
The example also illustrates that different instruction mixes with identical semantics can take significantly different execution times.

Superscalar Execution

Scheduling of instructions is determined by a number of factors:
True Data Dependency: The result of one operation is an input to the next.
Resource Dependency: Two operations require the same resource.
Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
The complexity of this hardware is an important constraint on superscalar processors.

Superscalar Execution: Issue Mechanisms

In the simpler model, instructions can be issued only in the order in which they are encountered. That is, if the second instruction cannot be issued because it has a data dependency with the first, only one instruction is issued in the cycle. This is called in-order issue.
In a more aggressive model, instructions can be issued out of order. In this case, if the second instruction has data dependencies with the first, but the third instruction does not, the first and third instructions can be co-scheduled. This is also called dynamic issue.
Performance of in-order issue is generally limited.

Superscalar Execution: Efficiency Considerations

Not all functional units can be kept busy at all times.
If during a cycle no functional units are utilized, this is referred to as vertical waste.
If during a cycle only some of the functional units are utilized, this is referred to as horizontal waste.
Due to limited parallelism in typical instruction traces, dependencies, or the inability of the scheduler to extract parallelism, the performance of superscalar processors is eventually limited.
Conventional microprocessors typically support four-way superscalar execution.

Very Long Instruction Word (VLIW) Processors

The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.
To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
These instructions are packed and dispatched together, and thus the name very long instruction word.
This concept was used with some commercial success in the Multiflow Trace machine (circa 1984).
Variants of this concept are employed in the Intel IA64 processors.

Very Long Instruction Word (VLIW) Processors: Considerations

Issue hardware is simpler.
The compiler has a bigger context from which to select co-scheduled instructions.
Compilers, however, do not have runtime information such as cache misses. Scheduling is, therefore, inherently conservative.
Branch and memory prediction is more difficult.
VLIW performance is highly dependent on the compiler. A number of techniques such as loop unrolling, speculative execution, and branch prediction are critical.
Typical VLIW processors are limited to 4-way to 8-way parallelism.

Limitations of Memory System Performance

The memory system, and not processor speed, is often the bottleneck for many applications.
Memory system performance is largely captured by two parameters, latency and bandwidth.
Latency is the time from the issue of a memory request to the time the data is available at the processor.
Bandwidth is the rate at which data can be pumped to the processor by the memory system.

Memory System Performance: Bandwidth and Latency

It is very important to understand the difference between latency and bandwidth.
Consider the example of a fire hose. If the water comes out of the hose two seconds after the hydrant is turned on, the latency of the system is two seconds.
Once the water starts flowing, if the hydrant delivers water at the rate of 5 gallons/second, the bandwidth of the system is 5 gallons/second.
If you want immediate response from the hydrant, it is important to reduce latency.
If you want to fight big fires, you want high bandwidth.

Memory Latency: An Example

Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The following observations follow:
The peak processor rating is 4 GFLOPS.
Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data.

Memory Latency: An Example

On the above architecture, consider the problem of computing a dot-product of two vectors.
A dot-product computation performs one multiply-add on a single pair of vector elements, i.e., each floating point operation requires one data fetch.
It follows that the peak speed of this computation is limited to one floating point operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak processor rating!

Improving Effective Memory Latency Using Caches

Caches are small and fast memory elements between the processor and DRAM.
This memory acts as low-latency, high-bandwidth storage.
If a piece of data is repeatedly used, the effective latency of this memory system can be reduced by the cache.
The fraction of data references satisfied by the cache is called the cache hit ratio of the computation on the system.
The cache hit ratio achieved by a code on a memory system often determines its performance.

Impact of Caches: Example

Consider the architecture from the previous example. In this case, we introduce a cache of size 32 KB with a latency of 1 ns or one cycle. We use this setup to multiply two matrices A and B of dimensions 32 × 32. We have carefully chosen these numbers so that the cache is large enough to store matrices A and B, as well as the result matrix C.

Impact of Caches: Example (continued)

The following observations can be made about the problem:
Fetching the two matrices into the cache corresponds to fetching 2K words, which takes approximately 200 µs.
Multiplying two n × n matrices takes 2n³ operations. For our problem, this corresponds to 64K operations, which can be performed in 16K cycles (or 16 µs) at four instructions per cycle.
The total time for the computation is therefore approximately the sum of the time for load/store operations and the time for the computation itself, i.e., 200 + 16 µs.
This corresponds to a peak computation rate of 64K/216 µs, or 303 MFLOPS.

Impact of Caches

Repeated references to the same data item correspond to temporal locality.
In our example, we had O(n²) data accesses and O(n³) computation. This asymptotic difference makes the above example particularly desirable for caches.
Data reuse is critical for cache performance.

Impact of Memory Bandwidth

Memory bandwidth is determined by the bandwidth of the memory bus as well as the memory units.
Memory bandwidth can be improved by increasing the size of memory blocks.
The underlying system takes l time units (where l is the latency of the system) to deliver b units of data (where b is the block size).

Impact of Memory Bandwidth: Example

Consider the same setup as before, except in this case, the block size is 4 words instead of 1 word. We repeat the dot-product computation in this scenario:
Assuming that the vectors are laid out linearly in memory, eight FLOPs (four multiply-adds) can be performed in 200 cycles.
This is because a single memory access fetches four consecutive words in the vector.
Therefore, two accesses can fetch four elements of each of the vectors. This corresponds to a FLOP every 25 ns, for a peak speed of 40 MFLOPS.

Impact of Memory Bandwidth

It is important to note that increasing block size does not change the latency of the system.
Physically, the scenario illustrated here can be viewed as a wide data bus (4 words or 128 bits) connected to multiple memory banks.
In practice, such wide buses are expensive to construct.
In a more practical system, consecutive words are sent on the memory bus on subsequent bus cycles after the first word is retrieved.

Impact of Memory Bandwidth

The above examples clearly illustrate how increased bandwidth results in higher peak computation rates.
The data layouts were assumed to be such that consecutive data words in memory were used by successive instructions (spatial locality of reference).
If we take a data-layout centric view, computations must be reordered to enhance spatial locality of reference.

Impact of Memory Bandwidth: Example

Consider the following code fragment:

for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
}

The code fragment sums columns of the matrix b into a vector column_sum.

Impact of Memory Bandwidth: Example

The vector column_sum is small and easily fits into the cache.
The matrix b is accessed in column order.
The strided access results in very poor performance.

Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum (column-major data access); (b) computing each element of the result as a dot product of a row of the matrix with the vector (row-major data access).

Impact of Memory Bandwidth: Example

We can fix the above code as follows:

for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];

In this case, the matrix is traversed in row order and performance can be expected to be significantly better.

Memory System Performance: Summary

The series of examples presented in this section illustrate the following concepts:
Exploiting spatial and temporal locality in applications is critical for amortizing memory latency and increasing effective memory bandwidth.
The ratio of the number of operations to the number of memory accesses is a good indicator of anticipated tolerance to memory bandwidth.
Memory layouts and organizing computation appropriately can make a significant impact on the spatial and temporal locality.

Alternate Approaches for Hiding Memory Latency

Consider the problem of browsing the web on a very slow network connection. We deal with the problem in one of three possible ways:
we anticipate which pages we are going to browse ahead of time and issue requests for them in advance;
we open multiple browsers and access different pages in each browser, thus while we are waiting for one page to load, we could be reading others; or
we access a whole bunch of pages in one go – amortizing the latency across various accesses.
The first approach is called prefetching, the second multithreading, and the third one corresponds to spatial locality in accessing memory words.

Multithreading for Latency Hiding

A thread is a single stream of control in the flow of a program.
We illustrate threads with a simple example:

for (i = 0; i < n; i++)
    c[i] = dot_product(get_row(a, i), b);

Each dot-product is independent of the others, and therefore represents a concurrent unit of execution. We can safely rewrite the above code segment as:

for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);

Multithreading for Latency Hiding: Example

In the code, the first instance of this function accesses a pair of vector elements and waits for them.
In the meantime, the second instance of this function can access two other vector elements in the next cycle, and so on.
After l units of time, where l is the latency of the memory system, the first function instance gets the requested data from memory and can perform the required computation.
In the next cycle, the data items for the next function instance arrive, and so on. In this way, in every clock cycle, we can perform a computation.

Multithreading for Latency Hiding

The execution schedule in the previous example is predicated upon two assumptions: the memory system is capable of servicing multiple outstanding requests, and the processor is capable of switching threads at every cycle.
It also requires the program to have an explicit specification of concurrency in the form of threads.
Machines such as the HEP and Tera rely on multithreaded processors that can switch the context of execution in every cycle. Consequently, they are able to hide latency effectively.

Prefetching for Latency Hiding

Misses on loads cause programs to stall.
Why not advance the loads so that by the time the data is actually needed, it is already there!
The only drawback is that you might need more space to store advanced loads.
However, if the advanced loads are overwritten, we are no worse off than before!

Tradeoffs of Multithreading and Prefetching

Multithreading and prefetching are critically impacted by the memory bandwidth. Consider the following example:
Consider a computation running on a machine with a 1 GHz clock, 4-word cache line, single cycle access to the cache, and 100 ns latency to DRAM. The computation has a cache hit ratio at 1 KB of 25% and at 32 KB of 90%. Consider two cases: first, a single threaded execution in which the entire cache is available to the serial context, and second, a multithreaded execution with 32 threads where each thread has a cache residency of 1 KB.
If the computation makes one data request in every cycle of 1 ns, you may notice that the first scenario requires 400 MB/s of memory bandwidth and the second, 3 GB/s.

Tradeoffs of Multithreading and Prefetching

Bandwidth requirements of a multithreaded system may increase very significantly because of the smaller cache residency of each thread.
Multithreaded systems become bandwidth bound instead of latency bound.
Multithreading and prefetching only address the latency problem and may often exacerbate the bandwidth problem.
Multithreading and prefetching also require significantly more hardware resources in the form of storage.

Explicitly Parallel Platforms

Dichotomy of Parallel Computing Platforms

An explicitly parallel program must specify concurrency and interaction between concurrent subtasks.
The former is sometimes also referred to as the control structure and the latter as the communication model.

Control Structure of Parallel Programs

Parallelism can be expressed at various levels of granularity – from instruction level to processes.
Between these extremes exists a range of models, along with corresponding architectural support.

Control Structure of Parallel Programs

Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.
If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).
If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).

SIMD and MIMD Processors

A typical SIMD architecture (a) and a typical MIMD architecture (b). In (a), a single global control unit drives all processing elements (PEs) through an interconnection network; in (b), each processing element has its own control unit.

SIMD Processors

Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of machines.
Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc.
SIMD relies on the regular structure of computations (such as those in image processing).
It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines if a processor should participate in a computation or not.

Conditional Execution in SIMD Processors

if (B == 0)
    C = A;
else
    C = A/B;

Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps. In step 1, only processors for which B == 0 are active and execute C = A; in step 2, the activity mask is inverted and the remaining processors execute C = A/B.

MIMD Processors

In contrast to SIMD processors, MIMD processors can execute different programs on different processors.
A variant of this, called single program multiple data streams (SPMD), executes the same program on different processors.
It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support.
Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.

SIMD-MIMD Comparison

SIMD computers require less hardware than MIMD computers (single control unit).
However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles.
Not all applications are naturally suited to SIMD processors.
In contrast, platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time.

Communication Model of Parallel Platforms

There are two primary forms of data exchange between parallel tasks – accessing a shared data space and exchanging messages.
Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.
Platforms that support messaging are also called message passing platforms or multicomputers.

Shared-Address-Space Platforms

Part (or all) of the memory is accessible to all processors.
Processors interact by modifying data objects stored in this shared-address-space.
If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; else, it is a non-uniform memory access (NUMA) machine.

NUMA and UMA Shared-Address-Space Platforms

Typical shared-address-space architectures: (a) Uniform-memory-access shared-address-space computer; (b) Uniform-memory-access shared-address-space computer with caches and memories; (c) Non-uniform-memory-access shared-address-space computer with local memory only.

NUMA and UMA Shared-Address-Space Platforms

The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. NUMA machines require locality from underlying algorithms for performance.
Programming these platforms is easier since reads and writes are implicitly visible to other processors.
However, read-write accesses to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming).
Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem.
A weaker model of these machines provides an address map, but not coordinated access. These models are called non cache coherent shared address space machines.

Shared-Address-Space vs. Shared Memory Machines

It is important to note the difference between the terms shared address space and shared memory.
We refer to the former as a programming abstraction and to the latter as a physical machine attribute.
It is possible to provide a shared address space using a physically distributed memory.

Message-Passing Platforms

These platforms comprise a set of processors, each with its own (exclusive) memory.
Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers.
These platforms are programmed using (variants of) send and receive primitives.
Libraries such as MPI and PVM provide such primitives.

Message Passing vs. Shared Address Space Platforms

Message passing requires little hardware support, other than a network.
Shared address space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).

Physical Organization of Parallel Platforms

We begin this discussion with an ideal parallel machine called the Parallel Random Access Machine, or PRAM.

Architecture of an Ideal Parallel Computer

A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel Random Access Machine, or PRAM.
PRAMs consist of p processors and a global memory of unbounded size that is uniformly accessible to all processors.
Processors share a common clock but may execute different instructions in each cycle.

Architecture of an Ideal Parallel Computer

Depending on how simultaneous memory accesses are handled, PRAMs can be divided into four subclasses:
Exclusive-read, exclusive-write (EREW) PRAM.
Concurrent-read, exclusive-write (CREW) PRAM.
Exclusive-read, concurrent-write (ERCW) PRAM.
Concurrent-read, concurrent-write (CRCW) PRAM.

Architecture of an Ideal Parallel Computer

What does concurrent write mean, anyway?
Common: write only if all values are identical.
Arbitrary: write the data from a randomly selected processor.
Priority: follow a predetermined priority order.
Sum: write the sum of all data items.

Physical Complexity of an Ideal Parallel Computer

Processors and memories are connected via switches.
Since these switches must operate in O(1) time at the level of words, for a system of p processors and m words, the switch complexity is O(mp).
Clearly, for meaningful values of p and m, a true PRAM is not realizable.

Interconnection Networks for Parallel Computers

Interconnection networks carry data between processors and to memory.
Interconnects are made of switches and links (wires, fiber).
Interconnects are classified as static or dynamic.
Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks.
Dynamic networks are built using switches and communication links. Dynamic networks are also referred to as indirect networks.

Static and Dynamic Interconnection Networks

Classification of interconnection networks: (a) a static network, in which processing nodes connect directly through their network interfaces/switches; and (b) a dynamic (indirect) network, in which separate switching elements route traffic between processing nodes.

Interconnection Networks

Switches map a fixed number of inputs to outputs.
The total number of ports on a switch is the degree of the switch.
The cost of a switch grows as the square of the degree of the switch, the peripheral hardware linearly as the degree, and the packaging costs linearly as the number of pins.

Interconnection Networks: Network Interfaces

Processors talk to the network via a network interface.
The network interface may hang off the I/O bus or the memory bus.
In a physical sense, this distinguishes a cluster from a tightly coupled multicomputer.
The relative speeds of the I/O and memory buses impact the performance of the network.

Network Topologies

A variety of network topologies have been proposed and implemented.
These topologies trade off performance for cost.
Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.

Network Topologies: Buses

Some of the simplest and earliest parallel machines used buses.
All processors access a common bus for exchanging data.
The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium.
However, the bandwidth of the shared bus is a major bottleneck.
Typical bus-based machines are limited to dozens of nodes.
Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.

Network Topologies: Buses

Bus-based interconnects (a) with no local caches; (b) with local memory/caches.
Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.

Network Topologies: Crossbars

A crossbar network uses a p × m grid of switches to connect p inputs to m outputs in a non-blocking manner.

A completely non-blocking crossbar network connecting p processing elements to b memory banks through a grid of switching elements.

Network Topologies: Crossbars

The cost of a crossbar of p processors grows as O(p²).
This is generally difficult to scale for large values of p.
Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP 500.

Network Topologies: Multistage Networks

Crossbars have excellent performance scalability but poor cost scalability.
Buses have excellent cost scalability, but poor performance scalability.
Multistage interconnects strike a compromise between these extremes.

Network Topologies: Multistage Networks

The schematic of a typical multistage interconnection network: p processors (0 through p − 1) connect through switch stages 1 through n to b memory banks (0 through b − 1).

Network Topologies: Multistage Omega Network

One of the most commonly used multistage interconnects is the Omega network.
This network consists of log p stages, where p is the number of inputs/outputs.
At each stage, input i is connected to output j if:

    j = 2i,           for 0 ≤ i ≤ p/2 − 1
    j = 2i + 1 − p,   for p/2 ≤ i ≤ p − 1

Network Topologies: Multistage Omega Network

Each stage of the Omega network implements a perfect shuffle as follows:

A perfect shuffle interconnection for eight inputs and outputs: input i (in binary) is wired to output left_rotate(i), e.g., 000 → 000, 001 → 010, 100 → 001, 110 → 101, and 111 → 111.

Network Topologies: Multistage Omega Network

The perfect shuffle patterns are connected using 2 × 2 switches.
The switches operate in two modes – crossover or passthrough.

Two switching configurations of the 2 × 2 switch: (a) Pass-through; (b) Cross-over.

Network Topologies: Multistage Omega Network

A complete Omega network with the perfect shuffle interconnects and switches can now be illustrated:

A complete omega network connecting eight inputs and eight outputs.

An omega network has (p/2) log p switching nodes, and the cost of such a network grows as Θ(p log p).

Network Topologies: Multistage Omega Network – Routing

Let s be the binary representation of the source and d be that of the destination processor.
The data traverses the link to the first switching node. If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; else, it switches to crossover.
This process is repeated for each of the log p switching stages.
Note that this is not a non-blocking switch.

Network Topologies: Multistage Omega Network – Routing

An example of blocking in the omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.

Network Topologies: Completely Connected Network

Each processor is connected to every other processor.
The number of links in the network scales as O(p²).
While the performance scales very well, the hardware complexity is not realizable for large values of p.
In this sense, these networks are static counterparts of crossbars.

Network Topologies: Completely Connected and Star Connected Networks

(a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.

Network Topologies: Star Connected Network

Every node is connected only to a common node at the center.
The distance between any pair of nodes is O(1). However, the central node becomes a bottleneck.
In this sense, star connected networks are static counterparts of buses.

Network Topologies: Linear Arrays, Meshes, and k-d Meshes

In a linear array, each node has two neighbors, one to its left and one to its right. If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.
A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west.
A further generalization to d dimensions has nodes with 2d neighbors.
A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.

Network Topologies: Linear Arrays

Linear arrays: (a) with no wraparound links; (b) with wraparound link.

Network Topologies: Two- and Three-Dimensional Meshes

Two- and three-dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.

Network Topologies: Hypercubes and their Construction

Construction of hypercubes from hypercubes of lower dimension (0-D through 4-D hypercubes, with nodes labeled in binary).

Network Topologies: Properties of Hypercubes

The distance between any two nodes is at most log p.
Each node has log p neighbors.
The distance between two nodes is given by the number of bit positions at which the two nodes differ.
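The third property gives a one-line distance computation (a Python sketch; the function name is illustrative):

```python
def hypercube_distance(u: int, v: int) -> int:
    """Distance between nodes u and v in a hypercube: the number of
    bit positions at which their labels differ (Hamming distance)."""
    return bin(u ^ v).count("1")

# Nodes 010 and 111 differ in two bit positions:
print(hypercube_distance(0b010, 0b111))  # 2
```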

Network Topologies: Tree-Based Networks

Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network (processing nodes at the leaves, switching nodes internal).

Network Topologies: Tree Properties

The distance between any two nodes is no more than 2 log p.
Links higher up the tree potentially carry more traffic than those at the lower levels.
For this reason, a variant called a fat tree fattens the links as we go up the tree.
Trees can be laid out in 2D with no wire crossings. This is an attractive property of trees.

Network Topologies: Fat Trees

A fat tree network of 16 processing nodes.

Evaluating Static Interconnection Networks

Diameter: The distance between the farthest two nodes in the network. The diameter of a linear array is p − 1, that of a mesh is 2(√p − 1), that of a tree and of a hypercube is log p, and that of a completely connected network is O(1).
Bisection Width: The minimum number of wires you must cut to divide the network into two equal parts. The bisection width of a linear array and of a tree is 1, that of a mesh is √p, that of a hypercube is p/2, and that of a completely connected network is p²/4.
Cost: The number of links or switches (whichever is asymptotically higher) is a meaningful measure of the cost. However, a number of other factors, such as the ability to lay out the network, the length of wires, etc., also factor into the cost.
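These formulas can be checked numerically (a Python sketch; the function names are illustrative, and each function assumes the stated constraint on p):

```python
import math

def linear_array_metrics(p):
    # diameter p - 1, bisection width 1
    return {"diameter": p - 1, "bisection_width": 1}

def mesh_metrics(p):
    # assumes p is a perfect square (a sqrt(p) x sqrt(p) mesh,
    # no wraparound): diameter 2(sqrt(p) - 1), bisection sqrt(p)
    s = math.isqrt(p)
    assert s * s == p, "p must be a perfect square"
    return {"diameter": 2 * (s - 1), "bisection_width": s}

def hypercube_metrics(p):
    # assumes p is a power of two: diameter log p, bisection p/2
    d = p.bit_length() - 1
    assert 1 << d == p, "p must be a power of two"
    return {"diameter": d, "bisection_width": p // 2}

print(hypercube_metrics(16))  # {'diameter': 4, 'bisection_width': 8}
```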

Evaluating Static Interconnection Networks

Network                    Diameter          Bisection    Arc            Cost
                                             width        connectivity   (no. of links)
Completely-connected       1                 p²/4         p − 1          p(p − 1)/2
Star                       2                 1            1              p − 1
Complete binary tree       2 log((p+1)/2)    1            1              p − 1
Linear array               p − 1             1            1              p − 1
2-D mesh, no wraparound    2(√p − 1)         √p           2              2(p − √p)
2-D wraparound mesh        2⌊√p/2⌋           2√p          4              2p
Hypercube                  log p             p/2          log p          (p log p)/2
Wraparound k-ary d-cube    d⌊k/2⌋            2k^(d−1)     2d             dp

Evaluating Dynamic Interconnection Networks

Network          Diameter    Bisection    Arc            Cost
                             width        connectivity   (no. of links)
Crossbar         1           p            1              p²
Omega network    log p       p/2          2              p/2
Dynamic tree     2 log p     1            2              p − 1

Cache Coherence in Multiprocessor Systems

Interconnects provide basic mechanisms for data transfer.
In the case of shared-address-space machines, additional hardware is required to coordinate access to data that might have multiple copies in the network.
The underlying technique must provide some guarantees on the semantics.
This guarantee is generally one of serializability, i.e., there exists some serial order of instruction execution that corresponds to the parallel schedule.

Cache Coherence in Multiprocessor Systems

When the value of a variable is changed, all its copies must either be invalidated or updated.

Cache coherence in multiprocessor systems: (a) invalidate protocol; (b) update protocol for shared variables.

Cache Coherence: Update and Invalidate Protocols

If a processor just reads a value once and does not need it again, an update protocol may generate significant overhead.
If two processors make interleaved test-and-update accesses to a variable, an update protocol is better.
Both protocols suffer from false-sharing overheads (two words that are not shared but lie on the same cache line).
Most current machines use invalidate protocols.

Maintaining Coherence Using Invalidate Protocols

Each copy of a data item is associated with a state.
One example of such a set of states is shared, invalid, or dirty.
In shared state, there are multiple valid copies of the data item (and therefore, an invalidate would have to be generated on an update).
In dirty state, only one copy exists and therefore, no invalidates need to be generated.
In invalid state, the data copy is invalid; therefore, a read generates a data request (and associated state changes).
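The state rules above can be sketched as a toy simulation (an illustrative simplification in Python, not a full protocol implementation; all names are made up):

```python
# A minimal sketch of the three-state (shared / dirty / invalid)
# invalidate protocol described above: reads on an invalid copy fetch
# data and move to shared; a write moves the writer to dirty and
# invalidates all other copies.

class CacheLine:
    def __init__(self):
        self.state = "invalid"

def read(line):
    if line.state == "invalid":
        line.state = "shared"   # the fetch generates a data request
    # reads in shared or dirty state are local: no state change

def write(line, other_copies):
    for other in other_copies:
        other.state = "invalid"  # invalidate every remote copy
    line.state = "dirty"         # a single valid copy remains

a, b = CacheLine(), CacheLine()
read(a); read(b)          # both copies become shared
write(a, [b])             # a becomes dirty, b is invalidated
print(a.state, b.state)   # dirty invalid
```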

Maintaining Coherence Using Invalidate Protocols

State diagram of a simple three-state coherence protocol (states: shared, invalid, dirty; transitions on local read/write and on coherence actions C_read, C_write, flush).

Maintaining Coherence Using Invalidate Protocols

Example of parallel program execution with the simple three-state coherence protocol (instructions at two processors over time, and the resulting states of variables x and y at each processor and in global memory).

Snoopy Cache Systems

How are invalidates sent to the right processors?
In snoopy caches, there is a broadcast medium that listens to all invalidates and read requests and performs appropriate coherence operations locally.

A simple snoopy-bus-based cache coherence system.

Performance of Snoopy Caches

Once copies of data are tagged dirty, all subsequent operations can be performed locally on the cache without generating external traffic.
If a data item is read by a number of processors, it transitions to the shared state in the cache and all subsequent read operations become local.
If processors read and update data at the same time, they generate coherence requests on the bus, which is ultimately bandwidth limited.

Directory Based Systems

In snoopy caches, each coherence operation is sent to all processors. This is an inherent limitation.
Why not send coherence requests to only those processors that need to be notified?
This is done using a directory, which maintains a presence vector for each data item (cache line) along with its global state.
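A directory entry can be sketched as follows (a minimal Python sketch; it assumes the presence vector is kept as a set of sharer IDs rather than a literal bit vector, and all names are illustrative):

```python
class DirectoryEntry:
    """Per-cache-line directory state: a global state plus the set of
    processors currently holding a copy (the presence vector)."""
    def __init__(self):
        self.state = "uncached"
        self.presence = set()

    def handle_read(self, proc):
        self.presence.add(proc)
        self.state = "shared"

    def handle_write(self, proc):
        # Invalidates go only to the current sharers, not to all
        # processors as in a snoopy scheme.
        targets = self.presence - {proc}
        self.presence = {proc}
        self.state = "dirty"
        return targets               # processors that must invalidate

entry = DirectoryEntry()
entry.handle_read(0); entry.handle_read(2)
print(entry.handle_write(2))  # {0}: only processor 0 is invalidated
```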

Directory Based Systems

Architecture of typical directory based systems: (a) a centralized directory; and (b) a distributed directory.

Performance of Directory Based Schemes

The need for a broadcast medium is replaced by the directory.
The additional bits to store the directory may add significant overhead.
The underlying network must be able to carry all the coherence requests.
The directory is a point of contention; therefore, distributed directory schemes must be used.

Communication Costs in Parallel Machines

Along with idling and contention, communication is a major overhead in parallel programs.
The cost of communication is dependent on a variety of features including the programming model semantics, the network topology, data handling and routing, and associated software protocols.

Message Passing Costs in Parallel Computers

The total time to transfer a message over a network comprises the following:
Startup time (t_s): Time spent at sending and receiving nodes (executing the routing algorithm, programming routers, etc.).
Per-hop time (t_h): This time is a function of the number of hops and includes factors such as switch latencies, network delays, etc.
Per-word transfer time (t_w): This time includes all overheads that are determined by the length of the message. This includes bandwidth of links, error checking and correction, etc.

Store-and-Forward Routing

A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop.
The total communication cost for a message of size m words to traverse l communication links is

    t_comm = t_s + (m t_w + t_h) l.    (1)

In most platforms, t_h is small and the above expression can be approximated by

    t_comm = t_s + m l t_w.
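Equation (1) and its approximation can be evaluated directly (a Python sketch; the parameter values are made up for illustration):

```python
def saf_cost(t_s, t_h, t_w, m, l):
    """Store-and-forward cost, equation (1): t_s + (m*t_w + t_h) * l."""
    return t_s + (m * t_w + t_h) * l

def saf_cost_approx(t_s, t_w, m, l):
    """Approximation for small t_h: t_s + m*l*t_w."""
    return t_s + m * l * t_w

# Illustrative parameters: 100 us startup, 1 us per hop,
# 0.5 us per word, a 1000-word message over 4 links.
print(saf_cost(100.0, 1.0, 0.5, 1000, 4))    # 2104.0
print(saf_cost_approx(100.0, 0.5, 1000, 4))  # 2100.0
```

The two values differ only by the t_h * l term, which is why dropping t_h is a reasonable approximation when m is large.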

Routing Techniques

Passing a message from node P0 to P3: (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing by breaking the message into two and four parts, respectively. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.

Packet Routing

Store-and-forward makes poor use of communication resources.
Packet routing breaks messages into packets and pipelines them through the network.
Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information.
The total communication time for packet routing is approximated by:

    t_comm = t_s + t_h l + t_w m.

The factor t_w accounts for overheads in packet headers.

Cut-Through Routing

Takes the concept of packet routing to an extreme by further dividing messages into basic units called flits.
Since flits are typically small, the header information must be minimized.
This is done by forcing all flits to take the same path, in sequence.
A tracer message first programs all intermediate routers. All flits then take the same route.
Error checks are performed on the entire message, as opposed to individual flits.
No sequence numbers are needed.

Cut-Through Routing

The total communication time for cut-through routing is approximated by:

    t_comm = t_s + t_h l + t_w m.

This is identical to packet routing; however, t_w is typically much smaller.
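The benefit of pipelining flits shows up when the two cost models are compared side by side (a Python sketch; the parameter values are made up for illustration):

```python
def store_and_forward(t_s, t_h, t_w, m, l):
    # The whole message is retransmitted at every hop.
    return t_s + (m * t_w + t_h) * l

def cut_through(t_s, t_h, t_w, m, l):
    # Pipelined flits: the per-word cost is paid once,
    # the per-hop cost once per link.
    return t_s + t_h * l + t_w * m

# Illustrative parameters: 100 us startup, 1 us per hop,
# 0.5 us per word, a 1000-word message over 4 links.
t_s, t_h, t_w, m, l = 100.0, 1.0, 0.5, 1000, 4
print(store_and_forward(t_s, t_h, t_w, m, l))  # 2104.0
print(cut_through(t_s, t_h, t_w, m, l))        # 604.0
```

For a single link (l = 1) the two models coincide; the gap grows with the number of hops.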

Simplified Cost Model for Communicating Messages

The cost of communicating a message between two nodes l hops away using cut-through routing is given by

    t_comm = t_s + l t_h + t_w m.

In this expression, t_h is typically smaller than t_s and t_w. For this reason, the second term on the right-hand side contributes little, particularly when m is large.
Furthermore, it is often not possible to control routing and placement of tasks.
For these reasons, we can approximate the cost of message transfer by

    t_comm = t_s + t_w m.

Simplified Cost Model for Communicating Messages

It is important to note that the original expression for communication time is valid only for uncongested networks.
If a link carries multiple messages, the corresponding t_w term must be scaled up by the number of messages.
Different communication patterns congest different networks to varying extents.
It is important to understand this and account for it in the communication time accordingly.

Cost Models for Shared Address Space Machines

While the basic messaging cost applies to these machines as well, a number of other factors make accurate cost modeling more difficult.
Memory layout is typically determined by the system.
Finite cache sizes can result in cache thrashing.
Overheads associated with invalidate and update operations are difficult to quantify.
Spatial locality is difficult to model.
Prefetching can play a role in reducing the overhead associated with data access.
False sharing and contention are difficult to model.

Routing Mechanisms for Interconnection Networks

How does one compute the route that a message takes from source to destination?
Routing must prevent deadlocks; for this reason, we use dimension-ordered or e-cube routing.
Routing must avoid hot spots; for this reason, two-step routing is often used. In this case, a message from source s to destination d is first sent to a randomly chosen intermediate processor i and then forwarded to destination d.

Routing Mechanisms for Interconnection Networks

Routing a message from node Ps (010) to node Pd (111) in a three-dimensional hypercube using e-cube routing: step 1 (010 → 110), step 2 (110 → 111).
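The dimension-by-dimension correction can be sketched as follows (a Python sketch; here dimensions are corrected from the most significant down, matching the 010 → 110 → 111 example, though correcting the least significant dimension first is equally valid):

```python
def e_cube_route(s: int, d: int):
    """E-cube (dimension-ordered) routing in a hypercube: flip the
    differing address bits one dimension at a time, in a fixed order,
    and record every intermediate node on the path."""
    path = [s]
    diff = s ^ d                     # bits where source and dest differ
    bit = diff.bit_length() - 1      # start from the highest dimension
    while diff:
        if (diff >> bit) & 1:
            s ^= 1 << bit            # traverse this dimension's link
            diff ^= 1 << bit
            path.append(s)
        bit -= 1
    return path

print([format(n, "03b") for n in e_cube_route(0b010, 0b111)])
# ['010', '110', '111']
```

The fixed dimension order is what prevents deadlock: all messages acquire channels in the same global order.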

Mapping Techniques for Graphs

Often, we need to embed a known communication pattern into a given interconnection topology.
We may have an algorithm designed for one network, which we are porting to another topology.
For these reasons, it is useful to understand mapping between graphs.

Mapping Techniques for Graphs: Metrics

When mapping a graph G(V, E) into G'(V', E'), the following metrics are important:
The maximum number of edges mapped onto any edge in E' is called the congestion of the mapping.
The maximum number of links in E' that any edge in E is mapped onto is called the dilation of the mapping.
The ratio of the number of nodes in the set V' to that in set V is called the expansion of the mapping.

Embedding a Linear Array into a Hypercube

A linear array (or a ring) composed of 2^d nodes (labeled 0 through 2^d − 1) can be embedded into a d-dimensional hypercube by mapping node i of the linear array onto node G(i, d) of the hypercube. The function G(i, x) is defined as follows:

    G(0, 1) = 0
    G(1, 1) = 1
    G(i, x + 1) = G(i, x),                       if i < 2^x
                  2^x + G(2^(x+1) − 1 − i, x),   if i ≥ 2^x

The function G is called the binary reflected Gray code (RGC). Since adjoining entries (G(i, d) and G(i + 1, d)) differ from each other at only one bit position, corresponding processors are mapped to neighbors in a hypercube. Therefore, the congestion, dilation, and expansion of the mapping are all 1.
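The recurrence for G can be transcribed directly (a Python sketch; the function name gray is illustrative):

```python
def gray(i: int, x: int) -> int:
    """Binary reflected Gray code G(i, x) as defined above:
    the i-th entry of the x-bit RGC."""
    if x == 1:
        return i                       # G(0,1) = 0, G(1,1) = 1
    half = 1 << (x - 1)                # 2^(x-1)
    if i < half:
        return gray(i, x - 1)
    return half + gray(2 * half - 1 - i, x - 1)

# The 3-bit RGC: consecutive entries differ in exactly one bit.
codes = [gray(i, 3) for i in range(8)]
print([format(c, "03b") for c in codes])
# ['000', '001', '011', '010', '110', '111', '101', '100']
```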

Embedding a Linear Array into a Hypercube: Example

(a) A three-bit reflected Gray code ring; and (b) its embedding into a three-dimensional hypercube.

Embedding a Mesh into a Hypercube

A 2^r × 2^s wraparound mesh can be mapped to a 2^(r+s)-node hypercube by mapping node (i, j) of the mesh onto node G(i, r − 1) || G(j, s − 1) of the hypercube (where || denotes concatenation of the two Gray codes).
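The concatenation mapping can be sketched as follows (a Python sketch; it uses the standard identity gray(i) = i ^ (i >> 1), which produces the same sequence as the recursive RGC definition, and the function names are illustrative):

```python
def gray(i: int) -> int:
    """Binary reflected Gray code of i (equals i ^ (i >> 1))."""
    return i ^ (i >> 1)

def mesh_to_hypercube(i: int, j: int, r: int, s: int) -> int:
    """Map node (i, j) of a 2^r x 2^s wraparound mesh to a hypercube
    node by concatenating the Gray codes of the row and column
    indices: the row code occupies the high-order bits, the column
    code the low-order s bits."""
    return (gray(i) << s) | gray(j)

# Neighboring mesh nodes map to hypercube neighbors: their labels
# differ in exactly one bit.
a = mesh_to_hypercube(1, 2, 2, 2)
b = mesh_to_hypercube(1, 3, 2, 2)
print(format(a, "04b"), format(b, "04b"))  # 0111 0110
assert bin(a ^ b).count("1") == 1
```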

Embedding a Mesh into a Hypercube

(a) A 4 × 4 mesh illustrating the mapping of mesh nodes to the nodes in a four-dimensional hypercube (processors in a row have identical two least significant bits; processors in a column have identical two most significant bits); and (b) a 2 × 4 mesh embedded into a three-dimensional hypercube.
Once again, the congestion, dilation, and expansion of the mapping are all 1.

Embedding a Mesh into a Linear Array

Since a mesh has more edges than a linear array, we will not have an optimal congestion/dilation mapping.
We first examine the mapping of a linear array into a mesh and then invert this mapping.
This gives us an optimal mapping (in terms of congestion).

Embedding a Mesh into a Linear Array: Example

(a) Embedding a 16-node linear array into a 2-D mesh; and (b) the inverse of the mapping. Solid lines correspond to links in the linear array and normal lines to links in the mesh.

Embedding a Hypercube into a 2-D Mesh

Each √p-node subcube of the hypercube is mapped to a √p-node row of the mesh.
This is done by inverting the linear-array-to-hypercube mapping.
This can be shown to be an optimal mapping.

Embedding a Hypercube into a 2-D Mesh: Example

Embedding a hypercube into a 2-D mesh: (a) p = 16; (b) p = 32.

Case Studies: The IBM Blue-Gene Architecture

The hierarchical architecture of Blue Gene: (a) CPU (1 GF); (b) chip (32 GF); (c) board (2 TF); (d) tower (16 TF); (e) Blue Gene (1 PF).

Case Studies: The Cray T3E Architecture

Interconnection network of the Cray T3E: (a) node architecture (processor, control, memory, router); (b) network topology.

Case Studies: The SGI Origin 3000 Architecture

Architecture of the SGI Origin 3000 family of servers, shown in 32-, 128-, and 512-processor configurations built from C-Bricks (processors, memory/directory, crossbar), R-Bricks (routers), I/P/D/X Bricks, and a metarouter.

Case Studies: The Sun HPC Server Architecture

Architecture of the Sun Enterprise family of servers: the Sun Ultra 6000 (6–30 processors, four address buses) and the Starfire Ultra 10000 (up to 64 processors, 16 × 16 non-blocking crossbar, 32-byte data bus).