Parallel Computing Platforms

Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

To accompany the text "Introduction to Parallel Computing",
Addison Wesley, 2003.

Topic Overview

Implicit Parallelism: Trends in Microprocessor Architectures
Limitations of Memory System Performance
Dichotomy of Parallel Computing Platforms
Communication Model of Parallel Platforms
Physical Organization of Parallel Platforms
Communication Costs in Parallel Machines
Messaging Cost Models and Routing Mechanisms
Mapping Techniques
Case Studies

– Typeset by FoilTEX – 1

Scope of Parallelism

Conventional architectures coarsely comprise a processor, a memory system, and the datapath.
Each of these components presents significant performance bottlenecks.
Parallelism addresses each of these components in significant ways.
Different applications utilize different aspects of parallelism – e.g., data-intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance.
It is important to understand each of these performance bottlenecks.

Implicit Parallelism: Trends in Microprocessor Architectures

Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude).
Higher levels of device integration have made available a large number of transistors.
The question of how best to utilize these resources is an important one.
Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle.
The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.

Pipelining and Superscalar Execution

Pipelining overlaps various stages of instruction execution to achieve performance.
At a high level of abstraction, an instruction can be executed while the next one is being decoded and the one after that is being fetched.
This is akin to an assembly line for the manufacture of cars.

Pipelining and Superscalar Execution

Pipelining, however, has several limitations.
The speed of a pipeline is eventually limited by the slowest stage.
For this reason, conventional processors rely on very deep pipelines (20-stage pipelines in state-of-the-art Pentium processors).
However, in typical program traces, every 5th to 6th instruction is a conditional jump! This requires very accurate branch prediction.
The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions will have to be flushed.

Pipelining and Superscalar Execution

One simple way of alleviating these bottlenecks is to use multiple pipelines.
The question then becomes one of selecting these instructions.

Superscalar Execution: An Example

Example of a two-way superscalar execution of instructions.

(a) Three different code fragments for adding a list of four numbers:

(i)
1. load R1, @1000
2. load R2, @1008
3. add R1, @1004
4. add R2, @100C
5. add R1, R2
6. store R1, @2000

(ii)
1. load R1, @1000
2. add R1, @1004
3. add R1, @1008
4. add R1, @100C
5. store R1, @2000

(iii)
1. load R1, @1000
2. add R1, @1004
3. load R2, @1008
4. add R2, @100C
5. add R1, R2
6. store R1, @2000

(b) Execution schedule for code fragment (i) above, shown as pipeline stages per instruction (IF: Instruction Fetch, ID: Instruction Decode, OF: Operand Fetch, E: Instruction Execute, WB: Write-back, NA: No Action), together with the adder utilization per clock cycle.
(c) Hardware utilization trace for the schedule in (b): full and empty issue slots per cycle, where entirely empty cycles constitute vertical waste and partially empty cycles constitute horizontal waste.

Superscalar Execution: An Example

In the above example, there is some wastage of resources due to data dependencies.
The example also illustrates that different instruction mixes with identical semantics can take significantly different execution times.

Superscalar Execution

Scheduling of instructions is determined by a number of factors:
True Data Dependency: The result of one operation is an input to the next.
Resource Dependency: Two operations require the same resource.
Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
The complexity of this hardware is an important constraint on superscalar processors.

Superscalar Execution: Issue Mechanisms

In the simpler model, instructions can be issued only in the order in which they are encountered. That is, if the second instruction cannot be issued because it has a data dependency with the first, only one instruction is issued in the cycle. This is called in-order issue.
In a more aggressive model, instructions can be issued out of order. In this case, if the second instruction has data dependencies with the first, but the third instruction does not, the first and third instructions can be co-scheduled. This is also called dynamic issue.
Performance of in-order issue is generally limited.

Superscalar Execution: Efficiency Considerations

Not all functional units can be kept busy at all times.
If during a cycle no functional units are utilized, this is referred to as vertical waste.
If during a cycle only some of the functional units are utilized, this is referred to as horizontal waste.
Due to limited parallelism in typical instruction traces, dependencies, or the inability of the scheduler to extract parallelism, the performance of superscalar processors is eventually limited.
Conventional microprocessors typically support four-way superscalar execution.

Very Long Instruction Word (VLIW) Processors

The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.
To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
These instructions are packed and dispatched together, and thus the name very long instruction word.
This concept was used with some commercial success in the Multiflow Trace machine (circa 1984).
Variants of this concept are employed in the Intel IA64 processors.

Very Long Instruction Word (VLIW) Processors: Considerations

Issue hardware is simpler.
The compiler has a bigger context from which to select co-scheduled instructions.
Compilers, however, do not have runtime information such as cache misses. Scheduling is, therefore, inherently conservative.
Branch and memory prediction is more difficult.
VLIW performance is highly dependent on the compiler. A number of techniques such as loop unrolling, speculative execution, and branch prediction are critical.
Typical VLIW processors are limited to 4-way to 8-way parallelism.

Limitations of Memory System Performance

The memory system, and not processor speed, is often the bottleneck for many applications.
Memory system performance is largely captured by two parameters, latency and bandwidth.
Latency is the time from the issue of a memory request to the time the data is available at the processor.
Bandwidth is the rate at which data can be pumped to the processor by the memory system.

Memory System Performance: Bandwidth and Latency

It is very important to understand the difference between latency and bandwidth.
Consider the example of a fire hose. If the water comes out of the hose two seconds after the hydrant is turned on, the latency of the system is two seconds.
Once the water starts flowing, if the hydrant delivers water at the rate of 5 gallons/second, the bandwidth of the system is 5 gallons/second.
If you want immediate response from the hydrant, it is important to reduce latency.
If you want to fight big fires, you want high bandwidth.

Memory Latency: An Example

Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The following observations follow:
The peak processor rating is 4 GFLOPS.
Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data.

Memory Latency: An Example

On the above architecture, consider the problem of computing a dot-product of two vectors.
A dot-product computation performs one multiply-add on a single pair of vector elements, i.e., each floating point operation requires one data fetch.
It follows that the peak speed of this computation is limited to one floating point operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak processor rating!

Improving Effective Memory Latency Using Caches

Caches are small and fast memory elements between the processor and DRAM.
This memory acts as low-latency, high-bandwidth storage.
If a piece of data is repeatedly used, the effective latency of this memory system can be reduced by the cache.
The fraction of data references satisfied by the cache is called the cache hit ratio of the computation on the system.
The cache hit ratio achieved by a code on a memory system often determines its performance.

Impact of Caches: Example

Consider the architecture from the previous example. In this case, we introduce a cache of size 32 KB with a latency of 1 ns or one cycle. We use this setup to multiply two matrices A and B of dimensions 32 × 32. We have carefully chosen these numbers so that the cache is large enough to store matrices A and B, as well as the result matrix C.

Impact of Caches: Example (continued)

The following observations can be made about the problem:
Fetching the two matrices into the cache corresponds to fetching 2K words, which takes approximately 200 µs.
Multiplying two n × n matrices takes 2n³ operations. For our problem, this corresponds to 64K operations, which can be performed in 16K cycles (or 16 µs) at four instructions per cycle.
The total time for the computation is therefore approximately the sum of the time for load/store operations and the time for the computation itself, i.e., 200 + 16 µs.
This corresponds to a peak computation rate of 64K/216 µs, or 303 MFLOPS.

Impact of Caches

Repeated references to the same data item correspond to temporal locality.
In our example, we had O(n²) data accesses and O(n³) computation. This asymptotic difference makes the above example particularly desirable for caches.
Data reuse is critical for cache performance.

Impact of Memory Bandwidth

Memory bandwidth is determined by the bandwidth of the memory bus as well as the memory units.
Memory bandwidth can be improved by increasing the size of memory blocks.
The underlying system takes l time units (where l is the latency of the system) to deliver b units of data (where b is the block size).

Impact of Memory Bandwidth: Example

Consider the same setup as before, except in this case, the block size is 4 words instead of 1 word. We repeat the dot-product computation in this scenario:
Assuming that the vectors are laid out linearly in memory, eight FLOPs (four multiply-adds) can be performed in 200 cycles.
This is because a single memory access fetches four consecutive words in the vector.
Therefore, two accesses can fetch four elements of each of the vectors. This corresponds to a FLOP every 25 ns, for a peak speed of 40 MFLOPS.

Impact of Memory Bandwidth

It is important to note that increasing block size does not change the latency of the system.
Physically, the scenario illustrated here can be viewed as a wide data bus (4 words or 128 bits) connected to multiple memory banks.
In practice, such wide buses are expensive to construct.
In a more practical system, consecutive words are sent on the memory bus on subsequent bus cycles after the first word is retrieved.

Impact of Memory Bandwidth

The above examples clearly illustrate how increased bandwidth results in higher peak computation rates.
The data layouts were assumed to be such that consecutive data words in memory were used by successive instructions (spatial locality of reference).
If we take a data-layout centric view, computations must be reordered to enhance spatial locality of reference.

Impact of Memory Bandwidth: Example

Consider the following code fragment:

for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
}

The code fragment sums columns of the matrix b into a vector column_sum.

Impact of Memory Bandwidth: Example

The vector column_sum is small and easily fits into the cache.
The matrix b is accessed in column order.
The strided access results in very poor performance.

Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum (column-major data access); (b) computing each element of the result as a dot product of a row of the matrix with the vector (row-major data access).

Impact of Memory Bandwidth: Example

We can fix the above code as follows:

for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];

In this case, the matrix is traversed in row order and performance can be expected to be significantly better.

Memory System Performance: Summary

The series of examples presented in this section illustrate the following concepts:
Exploiting spatial and temporal locality in applications is critical for amortizing memory latency and increasing effective memory bandwidth.
The ratio of the number of operations to the number of memory accesses is a good indicator of anticipated tolerance to memory bandwidth.
Memory layouts and organizing computation appropriately can make a significant impact on the spatial and temporal locality.

Alternate Approaches for Hiding Memory Latency

Consider the problem of browsing the web on a very slow network connection. We deal with the problem in one of three possible ways:
we anticipate which pages we are going to browse ahead of time and issue requests for them in advance;
we open multiple browsers and access different pages in each browser, thus while we are waiting for one page to load, we could be reading others; or
we access a whole bunch of pages in one go – amortizing the latency across various accesses.
The first approach is called prefetching, the second multithreading, and the third one corresponds to spatial locality in accessing memory words.

Multithreading for Latency Hiding

A thread is a single stream of control in the flow of a program.
We illustrate threads with a simple example:

for (i = 0; i < n; i++)
    c[i] = dot_product(get_row(a, i), b);

Each dot-product is independent of the others, and therefore represents a concurrent unit of execution. We can safely rewrite the above code segment as:

for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);

Multithreading for Latency Hiding: Example

In the code, the first instance of this function accesses a pair of vector elements and waits for them.
In the meantime, the second instance of this function can access two other vector elements in the next cycle, and so on.
After l units of time, where l is the latency of the memory system, the first function instance gets the requested data from memory and can perform the required computation.
In the next cycle, the data items for the next function instance arrive, and so on. In this way, in every clock cycle, we can perform a computation.

Multithreading for Latency Hiding

The execution schedule in the previous example is predicated upon two assumptions: the memory system is capable of servicing multiple outstanding requests, and the processor is capable of switching threads at every cycle.
It also requires the program to have an explicit specification of concurrency in the form of threads.
Machines such as the HEP and Tera rely on multithreaded processors that can switch the context of execution in every cycle. Consequently, they are able to hide latency effectively.

Prefetching for Latency Hiding

Misses on loads cause programs to stall.
Why not advance the loads so that by the time the data is actually needed, it is already there!
The only drawback is that you might need more space to store advanced loads.
However, if the advanced loads are overwritten, we are no worse off than before!

Tradeoffs of Multithreading and Prefetching

Multithreading and prefetching are critically impacted by the memory bandwidth. Consider the following example:
Consider a computation running on a machine with a 1 GHz clock, 4-word cache line, single cycle access to the cache, and 100 ns latency to DRAM. The computation has a cache hit ratio at 1 KB of 25% and at 32 KB of 90%. Consider two cases: first, a single threaded execution in which the entire cache is available to the serial context, and second, a multithreaded execution with 32 threads where each thread has a cache residency of 1 KB.
If the computation makes one data request in every cycle of 1 ns, you may notice that the first scenario requires 400 MB/s of memory bandwidth and the second, 3 GB/s.

Tradeoffs of Multithreading and Prefetching

Bandwidth requirements of a multithreaded system may increase very significantly because of the smaller cache residency of each thread.
Multithreaded systems become bandwidth bound instead of latency bound.
Multithreading and prefetching only address the latency problem and may often exacerbate the bandwidth problem.
Multithreading and prefetching also require significantly more hardware resources in the form of storage.

Explicitly Parallel Platforms

Dichotomy of Parallel Computing Platforms

An explicitly parallel program must specify concurrency and interaction between concurrent subtasks.
The former is sometimes also referred to as the control structure and the latter as the communication model.

Control Structure of Parallel Programs

Parallelism can be expressed at various levels of granularity – from instruction level to processes.
Between these extremes exists a range of models, along with corresponding architectural support.

Control Structure of Parallel Programs

Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.
If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).
If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).

SIMD and MIMD Processors

A typical SIMD architecture (a) and a typical MIMD architecture (b). In (a), a single global control unit drives all processing elements (PEs) through an interconnection network; in (b), each processing element has its own control unit.

SIMD Processors

Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of machines.
Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc.
SIMD relies on the regular structure of computations (such as those in image processing).
It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines if a processor should participate in a computation or not.

Conditional Execution in SIMD Processors

if (B == 0)
    C = A;
else
    C = A/B;

Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps. In step 1, only processors for which B == 0 are active and execute C = A; in step 2, the activity mask is inverted and the remaining processors execute C = A/B.

MIMD Processors

In contrast to SIMD processors, MIMD processors can execute different programs on different processors.
A variant of this, called single program multiple data streams (SPMD), executes the same program on different processors.
It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support.
Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.

SIMD-MIMD Comparison

SIMD computers require less hardware than MIMD computers (single control unit).
However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles.
Not all applications are naturally suited to SIMD processors.
In contrast, platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time.

Communication Model of Parallel Platforms

There are two primary forms of data exchange between parallel tasks – accessing a shared data space and exchanging messages.
Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.
Platforms that support messaging are also called message passing platforms or multicomputers.

Shared-Address-Space Platforms

Part (or all) of the memory is accessible to all processors.
Processors interact by modifying data objects stored in this shared-address-space.
If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; else, it is a non-uniform memory access (NUMA) machine.

NUMA and UMA Shared-Address-Space Platforms

Typical shared-address-space architectures: (a) Uniform-memory-access shared-address-space computer; (b) Uniform-memory-access shared-address-space computer with caches and memories; (c) Non-uniform-memory-access shared-address-space computer with local memory only.

NUMA and UMA Shared-Address-Space Platforms

The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. NUMA machines require locality from underlying algorithms for performance.
Programming these platforms is easier since reads and writes are implicitly visible to other processors.
However, read-write accesses to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming).
Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem.
A weaker model of these machines provides an address map, but not coordinated access. These models are called non cache coherent shared address space machines.

Shared-Address-Space vs. Shared Memory Machines

It is important to note the difference between the terms shared address space and shared memory.
We refer to the former as a programming abstraction and to the latter as a physical machine attribute.
It is possible to provide a shared address space using a physically distributed memory.

Message-Passing Platforms

These platforms comprise a set of processors, each with its own (exclusive) memory.
Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers.
These platforms are programmed using (variants of) send and receive primitives.
Libraries such as MPI and PVM provide such primitives.

Message Passing vs. Shared Address Space Platforms

Message passing requires little hardware support, other than a network.
Shared address space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).

Physical Organization of Parallel Platforms

We begin this discussion with an ideal parallel machine called the Parallel Random Access Machine, or PRAM.

Architecture of an Ideal Parallel Computer

A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel Random Access Machine, or PRAM.
PRAMs consist of p processors and a global memory of unbounded size that is uniformly accessible to all processors.
Processors share a common clock but may execute different instructions in each cycle.

Architecture of an Ideal Parallel Computer

Depending on how simultaneous memory accesses are handled, PRAMs can be divided into four subclasses:
Exclusive-read, exclusive-write (EREW) PRAM.
Concurrent-read, exclusive-write (CREW) PRAM.
Exclusive-read, concurrent-write (ERCW) PRAM.
Concurrent-read, concurrent-write (CRCW) PRAM.

Architecture of an Ideal Parallel Computer

What does concurrent write mean, anyway?
Common: write only if all values are identical.
Arbitrary: write the data from a randomly selected processor.
Priority: follow a predetermined priority order.
Sum: write the sum of all data items.

Physical Complexity of an Ideal Parallel Computer

Processors and memories are connected via switches.
Since these switches must operate in O(1) time at the level of words, for a system of p processors and m words, the switch complexity is O(mp).
Clearly, for meaningful values of p and m, a true PRAM is not realizable.

Interconnection Networks for Parallel Computers

Interconnection networks carry data between processors and to memory.
Interconnects are made of switches and links (wires, fiber).
Interconnects are classified as static or dynamic.
Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks.
Dynamic networks are built using switches and communication links. Dynamic networks are also referred to as indirect networks.

Static and Dynamic Interconnection Networks

Classification of interconnection networks: (a) a static network, in which processing nodes connect directly through their network interfaces/switches; and (b) a dynamic (indirect) network, in which separate switching elements route traffic between processing nodes.

Interconnection Networks

Switches map a fixed number of inputs to outputs.
The total number of ports on a switch is the degree of the switch.
The cost of a switch grows as the square of the degree of the switch, the peripheral hardware linearly as the degree, and the packaging costs linearly as the number of pins.

Interconnection Networks: Network Interfaces

Processors talk to the network via a network interface.
The network interface may hang off the I/O bus or the memory bus.
In a physical sense, this distinguishes a cluster from a tightly coupled multicomputer.
The relative speeds of the I/O and memory buses impact the performance of the network.

Network Topologies

A variety of network topologies have been proposed and implemented.
These topologies trade off performance for cost.
Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.

Network Topologies: Buses

Some of the simplest and earliest parallel machines used buses.
All processors access a common bus for exchanging data.
The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium.
However, the bandwidth of the shared bus is a major bottleneck.
Typical bus-based machines are limited to dozens of nodes.
Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.

Network Topologies: Buses

Bus-based interconnects (a) with no local caches; (b) with local memory/caches.
Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.

Network Topologies: Crossbars

A crossbar network uses a p × m grid of switches to connect p inputs to m outputs in a non-blocking manner.

A completely non-blocking crossbar network connecting p processing elements to b memory banks through a grid of switching elements.

Network Topologies: Crossbars

The cost of a crossbar of p processors grows as O(p²).
This is generally difficult to scale for large values of p.
Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP 500.

Network Topologies: Multistage Networks

Crossbars have excellent performance scalability but poor cost scalability.
Buses have excellent cost scalability, but poor performance scalability.
Multistage interconnects strike a compromise between these extremes.

Network Topologies: Multistage Networks

The schematic of a typical multistage interconnection network: p processors (0 through p − 1) connect through switch stages 1 through n to b memory banks (0 through b − 1).

Network Topologies: Multistage Omega Network

One of the most commonly used multistage interconnects is the Omega network.
This network consists of log p stages, where p is the number of inputs/outputs.
At each stage, input i is connected to output j if:

    j = 2i,           for 0 ≤ i ≤ p/2 − 1
    j = 2i + 1 − p,   for p/2 ≤ i ≤ p − 1

Network Topologies: Multistage Omega Network

Each stage of the Omega network implements a perfect shuffle as follows:

A perfect shuffle interconnection for eight inputs and outputs: input i (in binary) is wired to output left_rotate(i), e.g., 000 → 000, 001 → 010, 100 → 001, 110 → 101, and 111 → 111.

Network Topologies: Multistage Omega Network

The perfect shuffle patterns are connected using 2 × 2 switches.
The switches operate in two modes – crossover or passthrough.

Two switching configurations of the 2 × 2 switch: (a) Pass-through; (b) Cross-over.

Network Topologies: Multistage Omega Network

A complete Omega network with the perfect shuffle interconnects and switches can now be illustrated:

A complete omega network connecting eight inputs and eight outputs.

An omega network has (p/2) log p switching nodes, and the cost of such a network grows as Θ(p log p).

Network Topologies: Multistage Omega Network – Routing

Let s be the binary representation of the source and d be that of the destination processor.
The data traverses the link to the first switching node. If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; else, it switches to crossover.
This process is repeated for each of the log p switching stages.
Note that this is not a non-blocking switch.

Network Topologies: Multistage Omega Network – Routing

An example of blocking in the omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.

Network Topologies: Completely Connected Network

Each processor is connected to every other processor.
The number of links in the network scales as O(p²).
While the performance scales very well, the hardware complexity is not realizable for large values of p.
In this sense, these networks are static counterparts of crossbars.

Network Topologies: Completely Connected and Star Connected Networks

(a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.

Network Topologies: Star Connected Network

Every node is connected only to a common node at the center.
The distance between any pair of nodes is O(1). However, the central node becomes a bottleneck.
In this sense, star connected networks are static counterparts of buses.

Network Topologies: Linear Arrays, Meshes, and k-d Meshes

In a linear array, each node has two neighbors, one to its left and one to its right. If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.
A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west.
A further generalization to d dimensions has nodes with 2d neighbors.
A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.

Network Topologies: Linear Arrays

Linear arrays: (a) with no wraparound links; (b) with wraparound link.

Network Topologies: Two- and Three-Dimensional Meshes

Two- and three-dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.

Network Topologies: Hypercubes and their Construction

Construction of hypercubes from hypercubes of lower dimension (0-D through 4-D hypercubes, with nodes labeled in binary).

Network Topologies: Properties of Hypercubes

The distance between any two nodes is at most log p.
Each node has log p neighbors.
The distance between two nodes is given by the number of bit positions at which the two nodes differ.
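The third property gives a one-line distance computation (a Python sketch; the function name is illustrative):

```python
def hypercube_distance(u: int, v: int) -> int:
    """Distance between nodes u and v in a hypercube: the number of
    bit positions at which their labels differ (Hamming distance)."""
    return bin(u ^ v).count("1")

# Nodes 010 and 111 differ in two bit positions:
print(hypercube_distance(0b010, 0b111))  # 2
```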

Network Topologies: Tree-Based Networks

Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network (processing nodes at the leaves, switching nodes internal).

Network Topologies: Tree Properties

The distance between any two nodes is no more than 2 log p.
Links higher up the tree potentially carry more traffic than those at the lower levels.
For this reason, a variant called a fat tree fattens the links as we go up the tree.
Trees can be laid out in 2D with no wire crossings. This is an attractive property of trees.

Network Topologies: Fat Trees

A fat tree network of 16 processing nodes.

Evaluating Static Interconnection Networks

Diameter: The distance between the farthest two nodes in the network. The diameter of a linear array is p − 1, that of a mesh is 2(√p − 1), that of a tree and of a hypercube is log p, and that of a completely connected network is O(1).
Bisection Width: The minimum number of wires you must cut to divide the network into two equal parts. The bisection width of a linear array and of a tree is 1, that of a mesh is √p, that of a hypercube is p/2, and that of a completely connected network is p²/4.
Cost: The number of links or switches (whichever is asymptotically higher) is a meaningful measure of the cost. However, a number of other factors, such as the ability to lay out the network, the length of wires, etc., also factor into the cost.
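These formulas can be checked numerically (a Python sketch; the function names are illustrative, and each function assumes the stated constraint on p):

```python
import math

def linear_array_metrics(p):
    # diameter p - 1, bisection width 1
    return {"diameter": p - 1, "bisection_width": 1}

def mesh_metrics(p):
    # assumes p is a perfect square (a sqrt(p) x sqrt(p) mesh,
    # no wraparound): diameter 2(sqrt(p) - 1), bisection sqrt(p)
    s = math.isqrt(p)
    assert s * s == p, "p must be a perfect square"
    return {"diameter": 2 * (s - 1), "bisection_width": s}

def hypercube_metrics(p):
    # assumes p is a power of two: diameter log p, bisection p/2
    d = p.bit_length() - 1
    assert 1 << d == p, "p must be a power of two"
    return {"diameter": d, "bisection_width": p // 2}

print(hypercube_metrics(16))  # {'diameter': 4, 'bisection_width': 8}
```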

Evaluating Static Interconnection Networks

Network                    Diameter          Bisection    Arc            Cost
                                             width        connectivity   (no. of links)
Completely-connected       1                 p²/4         p − 1          p(p − 1)/2
Star                       2                 1            1              p − 1
Complete binary tree       2 log((p+1)/2)    1            1              p − 1
Linear array               p − 1             1            1              p − 1
2-D mesh, no wraparound    2(√p − 1)         √p           2              2(p − √p)
2-D wraparound mesh        2⌊√p/2⌋           2√p          4              2p
Hypercube                  log p             p/2          log p          (p log p)/2
Wraparound k-ary d-cube    d⌊k/2⌋            2k^(d−1)     2d             dp

Evaluating Dynamic Interconnection Networks

Network          Diameter    Bisection    Arc            Cost
                             width        connectivity   (no. of links)
Crossbar         1           p            1              p²
Omega network    log p       p/2          2              p/2
Dynamic tree     2 log p     1            2              p − 1

Cache Coherence in Multiprocessor Systems

Interconnects provide basic mechanisms for data transfer.
In the case of shared-address-space machines, additional hardware is required to coordinate access to data that might have multiple copies in the network.
The underlying technique must provide some guarantees on the semantics.
This guarantee is generally one of serializability, i.e., there exists some serial order of instruction execution that corresponds to the parallel schedule.

Cache Coherence in Multiprocessor Systems

When the value of a variable is changed, all its copies must either be invalidated or updated.

Cache coherence in multiprocessor systems: (a) invalidate protocol; (b) update protocol for shared variables.

Cache Coherence: Update and Invalidate Protocols

If a processor just reads a value once and does not need it again, an update protocol may generate significant overhead.
If two processors make interleaved test-and-update accesses to a variable, an update protocol is better.
Both protocols suffer from false-sharing overheads (two words that are not shared but lie on the same cache line).
Most current machines use invalidate protocols.

Maintaining Coherence Using Invalidate Protocols

Each copy of a data item is associated with a state.
One example of such a set of states is shared, invalid, or dirty.
In shared state, there are multiple valid copies of the data item (and therefore, an invalidate would have to be generated on an update).
In dirty state, only one copy exists and therefore, no invalidates need to be generated.
In invalid state, the data copy is invalid; therefore, a read generates a data request (and associated state changes).
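The state rules above can be sketched as a toy simulation (an illustrative simplification in Python, not a full protocol implementation; all names are made up):

```python
# A minimal sketch of the three-state (shared / dirty / invalid)
# invalidate protocol described above: reads on an invalid copy fetch
# data and move to shared; a write moves the writer to dirty and
# invalidates all other copies.

class CacheLine:
    def __init__(self):
        self.state = "invalid"

def read(line):
    if line.state == "invalid":
        line.state = "shared"   # the fetch generates a data request
    # reads in shared or dirty state are local: no state change

def write(line, other_copies):
    for other in other_copies:
        other.state = "invalid"  # invalidate every remote copy
    line.state = "dirty"         # a single valid copy remains

a, b = CacheLine(), CacheLine()
read(a); read(b)          # both copies become shared
write(a, [b])             # a becomes dirty, b is invalidated
print(a.state, b.state)   # dirty invalid
```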

Maintaining Coherence Using Invalidate Protocols

State diagram of a simple three-state coherence protocol (states: shared, invalid, dirty; transitions on local read/write and on coherence actions C_read, C_write, flush).

Maintaining Coherence Using Invalidate Protocols

Example of parallel program execution with the simple three-state coherence protocol (instructions at two processors over time, and the resulting states of variables x and y at each processor and in global memory).

Snoopy Cache Systems

How are invalidates sent to the right processors?
In snoopy caches, there is a broadcast medium that listens to all invalidates and read requests and performs appropriate coherence operations locally.

A simple snoopy-bus-based cache coherence system.

Performance of Snoopy Caches

Once copies of data are tagged dirty, all subsequent operations can be performed locally on the cache without generating external traffic.
If a data item is read by a number of processors, it transitions to the shared state in the cache and all subsequent read operations become local.
If processors read and update data at the same time, they generate coherence requests on the bus, which is ultimately bandwidth limited.

Directory Based Systems

In snoopy caches, each coherence operation is sent to all processors. This is an inherent limitation.
Why not send coherence requests to only those processors that need to be notified?
This is done using a directory, which maintains a presence vector for each data item (cache line) along with its global state.
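A directory entry can be sketched as follows (a minimal Python sketch; it assumes the presence vector is kept as a set of sharer IDs rather than a literal bit vector, and all names are illustrative):

```python
class DirectoryEntry:
    """Per-cache-line directory state: a global state plus the set of
    processors currently holding a copy (the presence vector)."""
    def __init__(self):
        self.state = "uncached"
        self.presence = set()

    def handle_read(self, proc):
        self.presence.add(proc)
        self.state = "shared"

    def handle_write(self, proc):
        # Invalidates go only to the current sharers, not to all
        # processors as in a snoopy scheme.
        targets = self.presence - {proc}
        self.presence = {proc}
        self.state = "dirty"
        return targets               # processors that must invalidate

entry = DirectoryEntry()
entry.handle_read(0); entry.handle_read(2)
print(entry.handle_write(2))  # {0}: only processor 0 is invalidated
```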

Directory Based Systems

Architecture of typical directory based systems: (a) a centralized directory; and (b) a distributed directory.

Performance of Directory Based Schemes

The need for a broadcast medium is replaced by the directory.
The additional bits to store the directory may add significant overhead.
The underlying network must be able to carry all the coherence requests.
The directory is a point of contention; therefore, distributed directory schemes must be used.

Communication Costs in Parallel Machines

Along with idling and contention, communication is a major overhead in parallel programs.
The cost of communication is dependent on a variety of features including the programming model semantics, the network topology, data handling and routing, and associated software protocols.

Message Passing Costs in Parallel Computers

The total time to transfer a message over a network comprises the following:
Startup time (t_s): Time spent at sending and receiving nodes (executing the routing algorithm, programming routers, etc.).
Per-hop time (t_h): This time is a function of the number of hops and includes factors such as switch latencies, network delays, etc.
Per-word transfer time (t_w): This time includes all overheads that are determined by the length of the message. This includes bandwidth of links, error checking and correction, etc.

Store-and-Forward Routing

A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop.
The total communication cost for a message of size m words to traverse l communication links is

    t_comm = t_s + (m t_w + t_h) l.    (1)

In most platforms, t_h is small and the above expression can be approximated by

    t_comm = t_s + m l t_w.
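Equation (1) and its approximation can be evaluated directly (a Python sketch; the parameter values are made up for illustration):

```python
def saf_cost(t_s, t_h, t_w, m, l):
    """Store-and-forward cost, equation (1): t_s + (m*t_w + t_h) * l."""
    return t_s + (m * t_w + t_h) * l

def saf_cost_approx(t_s, t_w, m, l):
    """Approximation for small t_h: t_s + m*l*t_w."""
    return t_s + m * l * t_w

# Illustrative parameters: 100 us startup, 1 us per hop,
# 0.5 us per word, a 1000-word message over 4 links.
print(saf_cost(100.0, 1.0, 0.5, 1000, 4))    # 2104.0
print(saf_cost_approx(100.0, 0.5, 1000, 4))  # 2100.0
```

The two values differ only by the t_h * l term, which is why dropping t_h is a reasonable approximation when m is large.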

Routing Techniques

Passing a message from node P0 to P3: (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing by breaking the message into two and four parts, respectively. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.

Packet Routing

Store-and-forward makes poor use of communication resources.
Packet routing breaks messages into packets and pipelines them through the network.
Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information.
The total communication time for packet routing is approximated by:

    t_comm = t_s + t_h l + t_w m.

The factor t_w accounts for overheads in packet headers.

Cut-Through Routing

Takes the concept of packet routing to an extreme by further dividing messages into basic units called flits.
Since flits are typically small, the header information must be minimized.
This is done by forcing all flits to take the same path, in sequence.
A tracer message first programs all intermediate routers. All flits then take the same route.
Error checks are performed on the entire message, as opposed to individual flits.
No sequence numbers are needed.

Cut-Through Routing

The total communication time for cut-through routing is approximated by:

    t_comm = t_s + t_h l + t_w m.

This is identical to packet routing; however, t_w is typically much smaller.
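The benefit of pipelining flits shows up when the two cost models are compared side by side (a Python sketch; the parameter values are made up for illustration):

```python
def store_and_forward(t_s, t_h, t_w, m, l):
    # The whole message is retransmitted at every hop.
    return t_s + (m * t_w + t_h) * l

def cut_through(t_s, t_h, t_w, m, l):
    # Pipelined flits: the per-word cost is paid once,
    # the per-hop cost once per link.
    return t_s + t_h * l + t_w * m

# Illustrative parameters: 100 us startup, 1 us per hop,
# 0.5 us per word, a 1000-word message over 4 links.
t_s, t_h, t_w, m, l = 100.0, 1.0, 0.5, 1000, 4
print(store_and_forward(t_s, t_h, t_w, m, l))  # 2104.0
print(cut_through(t_s, t_h, t_w, m, l))        # 604.0
```

For a single link (l = 1) the two models coincide; the gap grows with the number of hops.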

Simplified Cost Model for Communicating Messages

The cost of communicating a message between two nodes l hops away using cut-through routing is given by

    t_comm = t_s + l t_h + t_w m.

In this expression, t_h is typically smaller than t_s and t_w. For this reason, the second term on the right-hand side contributes little, particularly when m is large.
Furthermore, it is often not possible to control routing and placement of tasks.
For these reasons, we can approximate the cost of message transfer by

    t_comm = t_s + t_w m.

Simplified Cost Model for Communicating Messages

It is important to note that the original expression for communication time is valid only for uncongested networks.
If a link carries multiple messages, the corresponding t_w term must be scaled up by the number of messages.
Different communication patterns congest different networks to varying extents.
It is important to understand this and account for it in the communication time accordingly.

Cost Models for Shared Address Space Machines

While the basic messaging cost applies to these machines as well, a number of other factors make accurate cost modeling more difficult.
Memory layout is typically determined by the system.
Finite cache sizes can result in cache thrashing.
Overheads associated with invalidate and update operations are difficult to quantify.
Spatial locality is difficult to model.
Prefetching can play a role in reducing the overhead associated with data access.
False sharing and contention are difficult to model.

Routing Mechanisms for Interconnection Networks

How does one compute the route that a message takes from source to destination?
Routing must prevent deadlocks; for this reason, we use dimension-ordered or e-cube routing.
Routing must avoid hot spots; for this reason, two-step routing is often used. In this case, a message from source s to destination d is first sent to a randomly chosen intermediate processor i and then forwarded to destination d.

Routing Mechanisms for Interconnection Networks

Routing a message from node Ps (010) to node Pd (111) in a three-dimensional hypercube using e-cube routing: step 1 (010 → 110), step 2 (110 → 111).
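The dimension-by-dimension correction can be sketched as follows (a Python sketch; here dimensions are corrected from the most significant down, matching the 010 → 110 → 111 example, though correcting the least significant dimension first is equally valid):

```python
def e_cube_route(s: int, d: int):
    """E-cube (dimension-ordered) routing in a hypercube: flip the
    differing address bits one dimension at a time, in a fixed order,
    and record every intermediate node on the path."""
    path = [s]
    diff = s ^ d                     # bits where source and dest differ
    bit = diff.bit_length() - 1      # start from the highest dimension
    while diff:
        if (diff >> bit) & 1:
            s ^= 1 << bit            # traverse this dimension's link
            diff ^= 1 << bit
            path.append(s)
        bit -= 1
    return path

print([format(n, "03b") for n in e_cube_route(0b010, 0b111)])
# ['010', '110', '111']
```

The fixed dimension order is what prevents deadlock: all messages acquire channels in the same global order.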

Mapping Techniques for Graphs

Often, we need to embed a known communication pattern into a given interconnection topology.
We may have an algorithm designed for one network, which we are porting to another topology.
For these reasons, it is useful to understand mapping between graphs.

Mapping Techniques for Graphs: Metrics

When mapping a graph G(V, E) into G'(V', E'), the following metrics are important:
The maximum number of edges mapped onto any edge in E' is called the congestion of the mapping.
The maximum number of links in E' that any edge in E is mapped onto is called the dilation of the mapping.
The ratio of the number of nodes in the set V' to that in set V is called the expansion of the mapping.

Embedding a Linear Array into a Hypercube

A linear array (or a ring) composed of 2^d nodes (labeled 0 through 2^d − 1) can be embedded into a d-dimensional hypercube by mapping node i of the linear array onto node G(i, d) of the hypercube. The function G(i, x) is defined as follows:

    G(0, 1) = 0
    G(1, 1) = 1
    G(i, x + 1) = G(i, x),                       if i < 2^x
                  2^x + G(2^(x+1) − 1 − i, x),   if i ≥ 2^x

The function G is called the binary reflected Gray code (RGC). Since adjoining entries (G(i, d) and G(i + 1, d)) differ from each other at only one bit position, corresponding processors are mapped to neighbors in a hypercube. Therefore, the congestion, dilation, and expansion of the mapping are all 1.
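The recurrence for G can be transcribed directly (a Python sketch; the function name gray is illustrative):

```python
def gray(i: int, x: int) -> int:
    """Binary reflected Gray code G(i, x) as defined above:
    the i-th entry of the x-bit RGC."""
    if x == 1:
        return i                       # G(0,1) = 0, G(1,1) = 1
    half = 1 << (x - 1)                # 2^(x-1)
    if i < half:
        return gray(i, x - 1)
    return half + gray(2 * half - 1 - i, x - 1)

# The 3-bit RGC: consecutive entries differ in exactly one bit.
codes = [gray(i, 3) for i in range(8)]
print([format(c, "03b") for c in codes])
# ['000', '001', '011', '010', '110', '111', '101', '100']
```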

Embedding a Linear Array into a Hypercube: Example

(a) A three-bit reflected Gray code ring; and (b) its embedding into a three-dimensional hypercube.

Embedding a Mesh into a Hypercube

A 2^r × 2^s wraparound mesh can be mapped to a 2^(r+s)-node hypercube by mapping node (i, j) of the mesh onto node G(i, r − 1) || G(j, s − 1) of the hypercube (where || denotes concatenation of the two Gray codes).
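The concatenation mapping can be sketched as follows (a Python sketch; it uses the standard identity gray(i) = i ^ (i >> 1), which produces the same sequence as the recursive RGC definition, and the function names are illustrative):

```python
def gray(i: int) -> int:
    """Binary reflected Gray code of i (equals i ^ (i >> 1))."""
    return i ^ (i >> 1)

def mesh_to_hypercube(i: int, j: int, r: int, s: int) -> int:
    """Map node (i, j) of a 2^r x 2^s wraparound mesh to a hypercube
    node by concatenating the Gray codes of the row and column
    indices: the row code occupies the high-order bits, the column
    code the low-order s bits."""
    return (gray(i) << s) | gray(j)

# Neighboring mesh nodes map to hypercube neighbors: their labels
# differ in exactly one bit.
a = mesh_to_hypercube(1, 2, 2, 2)
b = mesh_to_hypercube(1, 3, 2, 2)
print(format(a, "04b"), format(b, "04b"))  # 0111 0110
assert bin(a ^ b).count("1") == 1
```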

Embedding a Mesh into a Hypercube

(a) A 4 × 4 mesh illustrating the mapping of mesh nodes to the nodes in a four-dimensional hypercube (processors in a row have identical two least significant bits; processors in a column have identical two most significant bits); and (b) a 2 × 4 mesh embedded into a three-dimensional hypercube.
Once again, the congestion, dilation, and expansion of the mapping are all 1.

Embedding a Mesh into a Linear Array

Since a mesh has more edges than a linear array, we will not have an optimal congestion/dilation mapping.
We first examine the mapping of a linear array into a mesh and then invert this mapping.
This gives us an optimal mapping (in terms of congestion).

Embedding a Mesh into a Linear Array: Example

(a) Embedding a 16-node linear array into a 2-D mesh; and (b) the inverse of the mapping. Solid lines correspond to links in the linear array and normal lines to links in the mesh.

Embedding a Hypercube into a 2-D Mesh

Each √p-node subcube of the hypercube is mapped to a √p-node row of the mesh.
This is done by inverting the linear-array-to-hypercube mapping.
This can be shown to be an optimal mapping.

Embedding a Hypercube into a 2-D Mesh: Example

Embedding a hypercube into a 2-D mesh: (a) p = 16; (b) p = 32.

Case Studies: The IBM Blue-Gene Architecture

The hierarchical architecture of Blue Gene: (a) CPU (1 GF); (b) chip (32 GF); (c) board (2 TF); (d) tower (16 TF); (e) Blue Gene (1 PF).

Case Studies: The Cray T3E Architecture

Interconnection network of the Cray T3E: (a) node architecture (processor, control, memory, router); (b) network topology.

Case Studies: The SGI Origin 3000 Architecture

Architecture of the SGI Origin 3000 family of servers, shown in 32-, 128-, and 512-processor configurations built from C-Bricks (processors, memory/directory, crossbar), R-Bricks (routers), I/P/D/X Bricks, and a metarouter.

Case Studies: The Sun HPC Server Architecture

Architecture of the Sun Enterprise family of servers: the Sun Ultra 6000 (6–30 processors, four address buses) and the Starfire Ultra 10000 (up to 64 processors, 16 × 16 non-blocking crossbar, 32-byte data bus).