chap7_slidesforparallelcomputingananthgrama

doomzday27 2 views 67 slides Mar 10, 2025
Slide 1
Slide 1 of 67
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51
Slide 52
52
Slide 53
53
Slide 54
54
Slide 55
55
Slide 56
56
Slide 57
57
Slide 58
58
Slide 59
59
Slide 60
60
Slide 61
61
Slide 62
62
Slide 63
63
Slide 64
64
Slide 65
65
Slide 66
66
Slide 67
67

About This Presentation

Parallel


Slide Content

ProgrammingSharedAddressSpace
Platforms
AnanthGrama,AnshulGupta,GeorgeKarypis,andVipinKumar
Toaccompanythetext“IntroductiontoParallelComputing”,
AddisonWesley,2003.

TopicOverview
ThreadBasics
ThePOSIXThreadAPI
SynchronizationPrimitivesinPthreads
ControllingThreadandSynchronizationAttributes
CompositeSynchronizationConstructs
OpenMP:aStandardforDirectiveBasedParallelProgramming
–TypesetbyFoilTEX–1

OverviewofProgrammingModels
Programmingmodelsprovidesupportforexpressingconcurrency
andsynchronization.
Processbasedmodelsassumethatalldataassociatedwitha
processisprivate,bydefault,unlessotherwisespecied.
Lightweightprocessesandthreadsassumethatallmemoryis
global.
Directivebasedprogrammingmodelsextendthethreaded
modelbyfacilitatingcreationandsynchronizationofthreads.
–TypesetbyFoilTEX–2

OverviewofProgrammingModels
Athreadisasinglestreamofcontrolintheowofaprogram.
Aprogramlike:
for(row=0;row<n;row++)
for(column=0;column<n;column++)
c[row][column]=
dot_product(get_row(a,row),
get_col(b,col)); canbetransformedto: for(row=0;row<n;row++)
for(column=0;column<n;column++)
c[row][column]=
create_thread(
dot_product(get_row(a,row),
get_col(b,col))); Inthiscase,onemaythinkofthethreadasaninstanceofa
functionthatreturnsbeforethefunctionhasnishedexecuting.
–TypesetbyFoilTEX–3

ThreadBasics
Allmemoryinthelogicalmachinemodelofathreadisglobally
accessibletoeverythread.
Thestackcorrespondingtothefunctioncallisgenerally
treatedasbeinglocaltothethreadforlivenessreasons.
Thisimpliesalogicalmachinemodelwithbothglobalmemory
(default)andlocalmemory(stacks).
Itisimportanttonotethatsuchaatmodelmayresultin
verypoorperformancesincememoryisphysicallydistributed
intypicalmachines.
–TypesetbyFoilTEX–4

ThreadBasics
P P P
Shared Address Space
P P P
M M M
Shared Address Space
Thelogicalmachinemodelofathread-basedprogramming
paradigm.
–TypesetbyFoilTEX–5

ThreadBasics
Threadsprovidesoftwareportability.
Inherentsupportforlatencyhiding.
Schedulingandloadbalancing.
Easeofprogrammingandwidespreaduse.
–TypesetbyFoilTEX–6

ThePOSIXThreadAPI
CommonlyreferredtoasPthreads,POSIXhasemergedasthe
standardthreadsAPI,supportedbymostvendors.
Theconceptsdiscussedherearelargelyindependentofthe
APIandcanbeusedforprogrammingwithotherthreadAPIs
(NTthreads,Solaristhreads,Javathreads,etc.)aswell.
–TypesetbyFoilTEX–7

ThreadBasics:CreationandTermination
Pthreadsprovidestwobasicfunctionsforspecifyingconcurrency
inaprogram:
#include<pthread.h>
intpthread_create(
pthread_t*thread_handle,
constpthread_attr_t*attribute,
void*(*thread_function)(void*),
void*arg);
intpthread_join(
pthread_tthread,
void**ptr);
Thefunctionpthread_createinvokesfunctionthread_function
asathread.
–TypesetbyFoilTEX–8

ThreadBasics:CreationandTermination(Example)
#include<pthread.h>
#include<stdlib.h>
#defineMAX_THREADS512
void*compute_pi(void*);
....
main(){
...
pthread_tp_threads[MAX_THREADS];
pthread_attr_tattr;
pthread_attr_init(&attr);
for(i=0;i<num_threads;i++){
hits[i]=i;
pthread_create(&p_threads[i],&attr,compute_pi,
(void*)&hits[i]);
}
for(i=0;i<num_threads;i++){
pthread_join(p_threads[i],NULL);
total_hits+=hits[i];
}
...
} –TypesetbyFoilTEX–9

ThreadBasics:CreationandTermination(Example)
void*compute_pi(void*s){
intseed,i,*hit_pointer;
doublerand_no_x,rand_no_y;
intlocal_hits;
hit_pointer=(int*)s;
seed=*hit_pointer;
local_hits=0;
for(i=0;i<sample_points_per_thread;i++){
rand_no_x=(double)(rand_r(&seed))/(double)((2<<14)-1);
rand_no_y=(double)(rand_r(&seed))/(double)((2<<14)-1);
if(((rand_no_x-0.5)*(rand_no_x-0.5)+
(rand_no_y-0.5)*(rand_no_y-0.5))<0.25)
local_hits++;
seed*=i;
}
*hit_pointer=local_hits;
pthread_exit(0);
} –TypesetbyFoilTEX–10

ProgrammingandPerformanceNotes
Notetheuseofthefunctionrand
r(insteadofsuperior
randomnumbergeneratorssuchasdrand48).
Executingthisona4-processorSGIOrigin,weobservea3.91
foldspeedupat32threads.Thiscorrespondstoaparallel
efciencyof0.98!
Wecanalsomodifytheprogramslightlytoobservetheeffect
offalse-sharing.
Theprogramcanalsobeusedtoassessthesecondarycache
linesize.
–TypesetbyFoilTEX–11

ProgrammingandPerformanceNotes
0123456
0
1
2
3
4
5
6
7
8
9
"optimal"
"local"
"spaced_1"
"spaced_16" "spaced_32"
logarithm of number of threads
Time Executiontimeofthecompute
piprogram.
–TypesetbyFoilTEX–12

SynchronizationPrimitivesinPthreads
Whenmultiplethreadsattempttomanipulatethesamedata
item,theresultscanoftenbeincoherentifpropercareisnot
takentosynchronizethem.
Consider:
/*eachthreadtriestoupdatevariablebest_costasfollows*/
if(my_cost<best_cost)
best_cost=my_cost;
Assumethattherearetwothreads,theinitialvalueof
best
costis100,andthevaluesofmy
costare50and75at
threadst1andt2.
Dependingonthescheduleofthethreads,thevalueof
best
costcouldbe50or75!
Thevalue75doesnotcorrespondtoanyserializationofthe
threads.
–TypesetbyFoilTEX–13

MutualExclusion
Thecodeinthepreviousexamplecorrespondstoacritical
segment;i.e.,asegmentthatmustbeexecutedbyonlyone
threadatanytime.
CriticalsegmentsinPthreadsareimplementedusingmutex
locks.
Mutex-lockshavetwostates:lockedandunlocked.Atany
pointoftime,onlyonethreadcanlockamutexlock.Alockis
anatomicoperation.
Athreadenteringacriticalsegmentrsttriestogetalock.It
goesaheadwhenthelockisgranted.
–TypesetbyFoilTEX–14

MutualExclusion
ThePthreadsAPIprovidesthefollowingfunctionsforhandling
mutex-locks:
intpthread_mutex_lock(
pthread_mutex_t*mutex_lock);
intpthread_mutex_unlock(
pthread_mutex_t*mutex_lock);
intpthread_mutex_init(
pthread_mutex_t*mutex_lock,
constpthread_mutexattr_t*lock_attr); –TypesetbyFoilTEX–15

MutualExclusion
Wecannowwriteourpreviouslyincorrectcodesegmentas:
pthread_mutex_tminimum_value_lock;
...
main(){
....
pthread_mutex_init(&minimum_value_lock,NULL);
....
}
void*find_min(void*list_ptr){
....
pthread_mutex_lock(&minimum_value_lock);
if(my_min<minimum_value)
minimum_value=my_min;
/*andunlockthemutex*/
pthread_mutex_unlock(&minimum_value_lock);
} –TypesetbyFoilTEX–16

Producer-ConsumerUsingLocks
Theproducer-consumerscenarioimposesthefollowing
constraints:
Theproducerthreadmustnotoverwritethesharedbufferwhen
theprevioustaskhasnotbeenpickedupbyaconsumer
thread.
Theconsumerthreadsmustnotpickuptasksuntilthereis
somethingpresentintheshareddatastructure.
Individualconsumerthreadsshouldpickuptasksoneatatime.
–TypesetbyFoilTEX–17

Producer-ConsumerUsingLocks
pthread_mutex_ttask_queue_lock;
inttask_available;
...
main(){
....
task_available=0;
pthread_mutex_init(&task_queue_lock,NULL);
....
}
void*producer(void*producer_thread_data){
....
while(!done()){
inserted=0;
create_task(&my_task);
while(inserted==0){
pthread_mutex_lock(&task_queue_lock);
if(task_available==0){
insert_into_queue(my_task);
task_available=1;
inserted=1;
}
pthread_mutex_unlock(&task_queue_lock);
}
}
} –TypesetbyFoilTEX–18

Producer-ConsumerUsingLocks
void*consumer(void*consumer_thread_data){
intextracted;
structtaskmy_task;
/*localdatastructuredeclarations*/
while(!done()){
extracted=0;
while(extracted==0){
pthread_mutex_lock(&task_queue_lock);
if(task_available==1){
extract_from_queue(&my_task);
task_available=0;
extracted=1;
}
pthread_mutex_unlock(&task_queue_lock);
}
process_task(my_task);
}
} –TypesetbyFoilTEX–19

TypesofMutexes
Pthreadssupportsthreetypesofmutexes–normal,recursive,
anderror-check.
Anormalmutexdeadlocksifathreadthatalreadyhasalock
triesasecondlockonit.
Arecursivemutexallowsasinglethreadtolockamutexas
manytimesasitwants.Itsimplyincrementsacountonthe
numberoflocks.Alockisrelinquishedbyathreadwhenthe
countbecomeszero.
Anerrorcheckmutexreportsanerrorwhenathreadwitha
locktriestolockitagain(asopposedtodeadlockingintherst
case,orgrantingthelock,asinthesecondcase).
Thetypeofthemutexcanbesetintheattributesobjectbefore
itispassedattimeofinitialization.
–TypesetbyFoilTEX–20

OverheadsofLocking
Locksrepresentserializationpointssincecriticalsectionsmust
beexecutedbythreadsoneaftertheother.
Encapsulatinglargesegmentsoftheprogramwithinlockscan
leadtosignicantperformancedegradation.
Itisoftenpossibletoreducetheidlingoverheadassociated
withlocksusinganalternatefunction,pthread
mutex
trylock.
intpthread_mutex_trylock(
pthread_mutex_t*mutex_lock);
pthread
mutex
trylockistypicallymuchfasterthanpthread
mutex
lock
ontypicalsystemssinceitdoesnothavetodealwithqueues
associatedwithlocksformultiplethreadswaitingonthelock.
–TypesetbyFoilTEX–21

AlleviatingLockingOverhead(Example)
/*Findingkmatchesinalist*/
void*find_entries(void*start_pointer){
/*Thisisthethreadfunction*/
structdatabase_record*next_record;
intcount;
current_pointer=start_pointer;
do{
next_record=find_next_entry(current_pointer);
count=output_record(next_record);
}while(count<requested_number_of_records);
}
intoutput_record(structdatabase_record*record_ptr){
intcount;
pthread_mutex_lock(&output_count_lock);
output_count++;
count=output_count;
pthread_mutex_unlock(&output_count_lock);
if(count<=requested_number_of_records)
print_record(record_ptr);
return(count);
} –TypesetbyFoilTEX–22

AlleviatingLockingOverhead(Example)
/*rewrittenoutput_recordfunction*/
intoutput_record(structdatabase_record*record_ptr){
intcount;
intlock_status;
lock_status=pthread_mutex_trylock(&output_count_lock);
if(lock_status==EBUSY){
insert_into_local_list(record_ptr);
return(0);
}
else{
count=output_count;
output_count+=number_on_local_list+1;
pthread_mutex_unlock(&output_count_lock);
print_records(record_ptr,local_list,
requested_number_of_records-count);
return(count+number_on_local_list+1);
}
} –TypesetbyFoilTEX–23

ConditionVariablesforSynchronization
Aconditionvariableallowsathreadtoblockitselfuntil
specieddatareachesapredenedstate.
Aconditionvariableisassociatedwiththispredicate.When
thepredicatebecomestrue,theconditionvariableisusedto
signaloneormorethreadswaitingonthecondition.
Asingleconditionvariablemaybeassociatedwithmorethan
onepredicate.
Aconditionvariablealwayshasamutexassociatedwithit.A
threadlocksthismutexandteststhepredicatedenedonthe
sharedvariable.
Ifthepredicateisnottrue,thethreadwaitsonthecondition
variableassociatedwiththepredicateusingthefunction
pthread
cond
wait.
–TypesetbyFoilTEX–24

ConditionVariablesforSynchronization
Pthreadsprovidesthefollowingfunctionsforconditionvariables: intpthread_cond_wait(pthread_cond_t*cond,
pthread_mutex_t*mutex);
intpthread_cond_signal(pthread_cond_t*cond);
intpthread_cond_broadcast(pthread_cond_t*cond);
intpthread_cond_init(pthread_cond_t*cond,
constpthread_condattr_t*attr);
intpthread_cond_destroy(pthread_cond_t*cond); –TypesetbyFoilTEX–25

Producer-ConsumerUsingConditionVariables
pthread_cond_tcond_queue_empty,cond_queue_full;
pthread_mutex_ttask_queue_cond_lock;
inttask_available;
/*otherdatastructureshere*/
main(){
/*declarationsandinitializations*/
task_available=0;
pthread_init();
pthread_cond_init(&cond_queue_empty,NULL);
pthread_cond_init(&cond_queue_full,NULL);
pthread_mutex_init(&task_queue_cond_lock,NULL);
/*createandjoinproducerandconsumerthreads*/
} –TypesetbyFoilTEX–26

Producer-ConsumerUsingConditionVariables
void*producer(void*producer_thread_data){
intinserted;
while(!done()){
create_task();
pthread_mutex_lock(&task_queue_cond_lock);
while(task_available==1)
pthread_cond_wait(&cond_queue_empty,
&task_queue_cond_lock);
insert_into_queue();
task_available=1;
pthread_cond_signal(&cond_queue_full);
pthread_mutex_unlock(&task_queue_cond_lock);
}
} –TypesetbyFoilTEX–27

Producer-ConsumerUsingConditionVariables
void*consumer(void*consumer_thread_data){
while(!done()){
pthread_mutex_lock(&task_queue_cond_lock);
while(task_available==0)
pthread_cond_wait(&cond_queue_full,
&task_queue_cond_lock);
my_task=extract_from_queue();
task_available=0;
pthread_cond_signal(&cond_queue_empty);
pthread_mutex_unlock(&task_queue_cond_lock);
process_task(my_task);
}
} –TypesetbyFoilTEX–28

ControllingThreadandSynchronizationAttributes
ThePthreadsAPIallowsaprogrammertochangethedefault
attributesofentitiesusingattributesobjects.
Anattributesobjectisadata-structurethatdescribesentity
(thread,mutex,conditionvariable)properties.
Oncethesepropertiesareset,theattributesobjectcanbe
passedtothemethodinitializingtheentity.
Enhancesmodularity,readability,andeaseofmodication.
–TypesetbyFoilTEX–29

AttributesObjectsforThreads
Usepthread
attr
inittocreateanattributesobject.
Individualpropertiesassociatedwiththeattributesobjectcan
bechangedusingthefollowingfunctions:
pthreadattr
setdetachstate,
pthread
attr
setguardsize
np,
pthread
attr
setstacksize,
pthread
attr
setinheritsched,
pthread
attr
setschedpolicy,and
pthread
attr
setschedparam.
–TypesetbyFoilTEX–30

AttributesObjectsforMutexes
Initializetheattrributesobjectusingfunction:
pthread
mutexattr
init.
Thefunctionpthread
mutexattr
settype
npcanbeused
forsettingthetypeofmutexspeciedbythemutexattributes
object.
pthread_mutexattr_settype_np(
pthread_mutexattr_t*attr,
inttype); Here,typespeciesthetypeofthemutexandcantakeone
of:
–PTHREAD
MUTEX
NORMAL
NP
–PTHREAD
MUTEX
RECURSIVE
NP
–PTHREAD
MUTEX
ERRORCHECK
NP
–TypesetbyFoilTEX–31

CompositeSynchronizationConstructs
Bydesign,Pthreadsprovidesupportforabasicsetof
operations.
Higherlevelconstructscanbebuiltusingbasicsynchronization
constructs.
Wediscusstwosuchconstructs–read-writelocksandbarriers.
–TypesetbyFoilTEX–32

Read-WriteLocks
Inmanyapplications,adatastructureisreadfrequentlybut
writteninfrequently.Forsuchapplications,weshoulduseread-
writelocks.
Areadlockisgrantedwhenthereareotherthreadsthatmay
alreadyhavereadlocks.
Ifthereisawritelockonthedata(oriftherearequeuedwrite
locks),thethreadperformsaconditionwait.
Iftherearemultiplethreadsrequestingawritelock,theymust
performaconditionwait.
Withthisdescription,wecandesignfunctionsforreadlocks
mylib
rwlock
rlock,writelocksmylib
rwlock
wlock,and
unlockingmylib
rwlock
unlock.
–TypesetbyFoilTEX–33

Read-WriteLocks
Thelockdatatypemylib
rwlock
tholdsthefollowing:
–acountofthenumberofreaders,
–thewriter(a0/1integerspecifyingwhetherawriteris
present),
–aconditionvariablereaders
proceedthatissignaledwhen
readerscanproceed,
–aconditionvariablewriter
proceedthatissignaledwhen
oneofthewriterscanproceed,
–acountpending
writersofpendingwriters,and
–amutexread
write
lockassociatedwiththeshareddata
structure.
–TypesetbyFoilTEX–34

Read-WriteLocks
typedefstruct{
intreaders;
intwriter;
pthread_cond_treaders_proceed;
pthread_cond_twriter_proceed;
intpending_writers;
pthread_mutex_tread_write_lock;
}mylib_rwlock_t;
voidmylib_rwlock_init(mylib_rwlock_t*l){
l->readers=l->writer=l->pending_writers=0;
pthread_mutex_init(&(l->read_write_lock),NULL);
pthread_cond_init(&(l->readers_proceed),NULL);
pthread_cond_init(&(l->writer_proceed),NULL);
} –TypesetbyFoilTEX–35

Read-WriteLocks
voidmylib_rwlock_rlock(mylib_rwlock_t*l){
/*ifthereisawritelockorpendingwriters,performcondition
wait..elseincrementcountofreadersandgrantreadlock*/
pthread_mutex_lock(&(l->read_write_lock));
while((l->pending_writers>0)||(l->writer>0))
pthread_cond_wait(&(l->readers_proceed),
&(l->read_write_lock));
l->readers++;
pthread_mutex_unlock(&(l->read_write_lock));
} –TypesetbyFoilTEX–36

Read-WriteLocks
voidmylib_rwlock_wlock(mylib_rwlock_t*l){
/*iftherearereadersorwriters,incrementpendingwriters
countandwait.Onbeingwoken,decrementpendingwriters
countandincrementwritercount*/
pthread_mutex_lock(&(l->read_write_lock));
while((l->writer>0)||(l->readers>0)){
l->pending_writers++;
pthread_cond_wait(&(l->writer_proceed),
&(l->read_write_lock));
}
l->pending_writers--;
l->writer++
pthread_mutex_unlock(&(l->read_write_lock));
} –TypesetbyFoilTEX–37

Read-WriteLocks
voidmylib_rwlock_unlock(mylib_rwlock_t*l){
/*ifthereisawritelockthenunlock,elseifthereare
readlocks,decrementcountofreadlocks.Ifthecount
is0andthereisapendingwriter,letitthrough,else
iftherearependingreaders,letthemallgothrough*/
pthread_mutex_lock(&(l->read_write_lock));
if(l->writer>0)
l->writer=0;
elseif(l->readers>0)
l->readers--;
pthread_mutex_unlock(&(l->read_write_lock));
if((l->readers==0)&&(l->pending_writers>0))
pthread_cond_signal(&(l->writer_proceed));
elseif(l->readers>0)
pthread_cond_broadcast(&(l->readers_proceed));
} –TypesetbyFoilTEX–38

Barriers
AsinMPI,abarrierholdsathreaduntilallthreadsparticipating
inthebarrierhavereachedit.
Barrierscanbeimplementedusingacounter,amutexanda
conditionvariable.
Asingleintegerisusedtokeeptrackofthenumberofthreads
thathavereachedthebarrier.
Ifthecountislessthanthetotalnumberofthreads,thethreads
executeaconditionwait.
Thelastthreadentering(andsettingthecounttothenumber
ofthreads)wakesupallthethreadsusingacondition
broadcast.
–TypesetbyFoilTEX–39

Barriers
typedefstruct{
pthread_mutex_tcount_lock;
pthread_cond_tok_to_proceed;
intcount;
}mylib_barrier_t;
voidmylib_init_barrier(mylib_barrier_t*b){
b->count=0;
pthread_mutex_init(&(b->count_lock),NULL);
pthread_cond_init(&(b->ok_to_proceed),NULL);
} –TypesetbyFoilTEX–40

Barriers
voidmylib_barrier(mylib_barrier_t*b,intnum_threads){
pthread_mutex_lock(&(b->count_lock));
b->count++;
if(b->count==num_threads){
b->count=0;
pthread_cond_broadcast(&(b->ok_to_proceed));
}
else
while(pthread_cond_wait(&(b->ok_to_proceed),
&(b->count_lock))!=0);
pthread_mutex_unlock(&(b->count_lock));
} –TypesetbyFoilTEX–41

Barriers
Thebarrierdescribedaboveiscalledalinearbarrier.
Thetriviallowerboundonexecutiontimeofthisfunctionis
thereforeO(n)fornthreads.
Thisimplementationofabarriercanbespeededupusing
multiplebarriervariablesorganizedinatree.
Weusen=2conditionvariable-mutexpairsforimplementinga
barrierfornthreads.
Atthelowestlevel,threadsarepairedupandeachpairof
threadssharesasingleconditionvariable-mutexpair.
Onceboththreadsarrive,oneofthetwomoveson,theother
onewaits.
Thisprocessrepeatsupthetree.
ThisisalsocalledalogbarrieranditsruntimegrowsasO(logp).
–TypesetbyFoilTEX–42

Barrier
05
101520253035404550
0
20
40
60
80
100
120
140
Time (seconds)
Number of threads
Log Barrier (1000, 32 procs)
Linear Barrier (1000, 32 procs)
Executiontimeof1000sequentialandlogarithmicbarriersasa
functionofnumberofthreadsona32processorSGIOrigin2000.
–TypesetbyFoilTEX–43

TipsforDesigningAsynchronousPrograms
Neverrelyonschedulingassumptionswhenexchangingdata.
Neverrelyonlivenessofdataresultingfromassumptionson
scheduling.
Donotrelyonschedulingasameansofsynchronization.
Wherepossible,deneandusegroupsynchronizationsand
datareplication.
–TypesetbyFoilTEX–44

OpenMP:aStandardforDirectiveBasedParallel
Programming OpenMPisadirective-basedAPIthatcanbeusedwith
FORTRAN,C,andC++forprogrammingsharedaddressspace
machines.
OpenMPdirectivesprovidesupportforconcurrency,synchronization,
anddatahandlingwhileobviatingtheneedforexplicitly
settingupmutexes,conditionvariables,datascope,and
initialization.
–TypesetbyFoilTEX–45

OpenMPProgrammingModel
OpenMPdirectivesinCandC++arebasedonthe#pragma
compilerdirectives.
Adirectiveconsistsofadirectivenamefollowedbyclauses.
#pragmaompdirective[clauselist]
OpenMPprogramsexecuteseriallyuntiltheyencounterthe
paralleldirective,whichcreatesagroupofthreads.
#pragmaompparallel[clauselist]
/*structuredblock*/
Themainthreadthatencounterstheparalleldirective
becomesthemasterofthisgroupofthreadsandisassigned
thethreadid0withinthegroup.
–TypesetbyFoilTEX–46

OpenMPProgrammingModel
Theclauselistisusedtospecifyconditionalparallelization,
numberofthreads,anddatahandling.
ConditionalParallelization:Theclauseif(scalarexpression)
determineswhethertheparallelconstructresultsincreationof
threads.
DegreeofConcurrency:Theclausenum
threads(integer
expression)speciesthenumberofthreadsthatare
created.
DataHandling:Theclauseprivate(variablelist)
indicatesvariableslocaltoeachthread.Theclause
firstprivate(variablelist)issimilartotheprivate,
exceptvaluesofvariablesareinitializedtocorresponding
valuesbeforetheparalleldirective.Theclauseshared
(variablelist)indicatesthatvariablesaresharedacross
allthethreads.
–TypesetbyFoilTEX–47

OpenMPProgrammingModel
pthread_create (......., internal_thread_fn_name, ...);
// serial segment
for (i = 0; i < 8; i++)
for (i = 0; i < 8; i++)
pthread_join (.......);
// rest of serial segment
}
void *internal_thread_fn_name (void *packaged_argument) [
int a;
// parallel segment
}
main() {
int a, b;
Code
inserted by
the OpenMP
compiler
Sample OpenMP program
Corresponding Pthreads translation
{
// parallel segment
}
// serial segment
#pragma omp parallel num_threads (8) private (a) shared (b)
// rest of serial segment
}
main() {
int a, b;
AsampleOpenMPprogramalongwithitsPthreadstranslation
thatmightbeperformedbyanOpenMPcompiler.
–TypesetbyFoilTEX–48

OpenMPProgrammingModel
#pragmaompparallelif(is_parallel==1)num_threads(8)\
private(a)shared(b)firstprivate(c)
{
/*structuredblock*/
} Ifthevalueofthevariableis
parallelequalsone,eight
threadsarecreated.
Eachofthesethreadsgetsprivatecopiesofvariablesaandc,
andsharesasinglevalueofvariableb.
Thevalueofeachcopyofcisinitializedtothevalueofcbefore
theparalleldirective.
Thedefaultstateofavariableisspeciedbytheclause
default(shared)ordefault(none).
–TypesetbyFoilTEX–49

ReductionClauseinOpenMP
Thereductionclausespecieshowmultiplelocalcopiesofa
variableatdifferentthreadsarecombinedintoasinglecopy
atthemasterwhenthreadsexit.
Theusageofthereductionclauseisreduction(operator:
variablelist).
Thevariablesinthelistareimplicitlyspeciedasbeingprivate
tothreads.
Theoperatorcanbeoneof+,*,-,&,|,ˆ,&&,and||.
#pragmaompparallelreduction(+:sum)num_threads(8)
{
/*computelocalsumshere*/
}
/*sumherecontainssumofalllocalinstancesofsums*/
–TypesetbyFoilTEX–50

OpenMPProgramming:Example
/*******************************************************
AnOpenMPversionofathreadedprogramtocomputePI.
*******************************************************/
#pragmaompparalleldefault(private)shared(npoints)\
reduction(+:sum)num_threads(8)
{
num_threads=omp_get_num_threads();
sample_points_per_thread=npoints/num_threads;
sum=0;
for(i=0;i<sample_points_per_thread;i++){
rand_no_x=(double)(rand_r(&seed))/(double)((2<<14)-1);
rand_no_y=(double)(rand_r(&seed))/(double)((2<<14)-1);
if(((rand_no_x-0.5)*(rand_no_x-0.5)+
(rand_no_y-0.5)*(rand_no_y-0.5))<0.25)
sum++;
}
} –TypesetbyFoilTEX–51

SpecifyingConcurrentTasksinOpenMP
Theparalleldirectivecanbeusedinconjunctionwithother
directivestospecifyconcurrencyacrossiterationsandtasks.
OpenMPprovidestwodirectives–forandsections-to
specifyconcurrentiterationsandtasks.
Thefordirectiveisusedtosplitparalleliterationspacesacross
threads.Thegeneralformofafordirectiveisasfollows:
#pragmaompfor[clauselist]
/*forloop*/
Theclausesthatcanbeusedinthiscontextare:private,
firstprivate,lastprivate,reduction,schedule,nowait,
andordered.
–TypesetbyFoilTEX–52

SpecifyingConcurrentTasksinOpenMP:Example
#pragmaompparalleldefault(private)shared(npoints)\
reduction(+:sum)num_threads(8)
{
sum=0;
#pragmaompfor
for(i=0;i<npoints;i++){
rand_no_x=(double)(rand_r(&seed))/(double)((2<<14)-1);
rand_no_y=(double)(rand_r(&seed))/(double)((2<<14)-1);
if(((rand_no_x-0.5)*(rand_no_x-0.5)+
(rand_no_y-0.5)*(rand_no_y-0.5))<0.25)
sum++;
}
}
–TypesetbyFoilTEX–53

AssigningIterationstoThreads
Thescheduleclauseofthefordirectivedealswiththe
assignmentofiterationstothreads.
Thegeneralformofthescheduledirectiveis
schedule(scheduling
class[,parameter]).
OpenMPsupportsfourschedulingclasses:static,dynamic,
guided,andruntime.
–TypesetbyFoilTEX–54

AssigningIterationstoThreads:Example
/*staticschedulingofmatrixmultiplicationloops*/
#pragmaompparalleldefault(private)shared(a,b,c,dim)\
num_threads(4)
#pragmaompforschedule(static)
for(i=0;i<dim;i++){
for(j=0;j<dim;j++){
c(i,j)=0;
for(k=0;k<dim;k++){
c(i,j)+=a(i,k)*b(k,j);
}
}
} –TypesetbyFoilTEX–55

AssigningIterationstoThreads:Example 32
CC
16 cols
A
B
C
32
A A
(c) (b)
128
B
(a)
128 128
32
32
B
Threedifferentschedulesusingthestaticschedulingclassof
OpenMP.
–TypesetbyFoilTEX–56

ParallelForLoops
Often,itisdesirabletohaveasequenceoffor-directives
withinaparallelconstructthatdonotexecuteanimplicit
barrierattheendofeachfordirective.
OpenMPprovidesaclause–nowait,whichcanbeusedwith
afordirective.
–TypesetbyFoilTEX–57

ParallelForLoops:Example
#pragmaompparallel
{
#pragmaompfornowait
for(i=0;i<nmax;i++)
if(isEqual(name,current_list[i])
processCurrentName(name);
#pragmaompfor
for(i=0;i<mmax;i++)
if(isEqual(name,past_list[i])
processPastName(name);
}
–TypesetbyFoilTEX–58

ThesectionsDirective
OpenMPsupportsnon-iterativeparalleltaskassignmentusing
thesectionsdirective.
Thegeneralformofthesectionsdirectiveisasfollows:
#pragmaompsections[clauselist]
{
[#pragmaompsection
/*structuredblock*/
]
[#pragmaompsection
/*structuredblock*/
]
...
}
–TypesetbyFoilTEX–59

ThesectionsDirective:Example
#pragmaompparallel
{
#pragmaompsections
{
#pragmaompsection
{
taskA();
}
#pragmaompsection
{
taskB();
}
#pragmaompsection
{
taskC();
}
}
}
–TypesetbyFoilTEX–60

NestingparallelDirectives
NestedparallelismcanbeenabledusingtheOMP
NESTED
environmentvariable.
IftheOMP
NESTEDenvironmentvariableissettoTRUE,nested
parallelismisenabled.
Inthiscase,eachparalleldirectivecreatesanewteamof
threads.
–TypesetbyFoilTEX–61

SynchronizationConstructsinOpenMP
OpenMPprovidesavarietyofsynchronizationconstructs:
#pragmaompbarrier
#pragmaompsingle[clauselist]
structuredblock
#pragmaompmaster
structuredblock
#pragmaompcritical[(name)]
structuredblock
#pragmaompordered
structuredblock –TypesetbyFoilTEX–62

OpenMPLibraryFunctions
Inadditiontodirectives,OpenMPalsosupportsanumberof
functionsthatallowaprogrammertocontroltheexecutionof
threadedprograms.
/*threadandprocessorcount*/
voidomp_set_num_threads(intnum_threads);
intomp_get_num_threads();
intomp_get_max_threads();
intomp_get_thread_num();
intomp_get_num_procs();
intomp_in_parallel(); –TypesetbyFoilTEX–63

OpenMPLibraryFunctions
/*controllingandmonitoringthreadcreation*/
voidomp_set_dynamic(intdynamic_threads);
intomp_get_dynamic();
voidomp_set_nested(intnested);
intomp_get_nested();
/*mutualexclusion*/
voidomp_init_lock(omp_lock_t*lock);
voidomp_destroy_lock(omp_lock_t*lock);
voidomp_set_lock(omp_lock_t*lock);
voidomp_unset_lock(omp_lock_t*lock);
intomp_test_lock(omp_lock_t*lock); Inaddition,alllockroutinesalsohaveanestedlockcounterpart
forrecursivemutexes.
–TypesetbyFoilTEX–64

EnvironmentVariablesinOpenMP
OMP
NUM
THREADS:Thisenvironmentvariablespeciesthe
defaultnumberofthreadscreateduponenteringaparallel
region.
OMP
SET
DYNAMIC:Determinesifthenumberofthreadscanbe
dynamicallychanged.
OMP
NESTED:Turnsonnestedparallelism.
OMP
SCHEDULE:Schedulingoffor-loopsiftheclausespecies
runtime.
–TypesetbyFoilTEX–65

ExplicitThreadsversusDirectiveBasedProgramming
Directiveslayeredontopofthreadsfacilitateavarietyof
thread-relatedtasks.
Aprogrammerisridofthetasksofinitializingattributesobjects,
settingupargumentstothreads,partitioningiterationspaces,
etc.
Therearesomedrawbackstousingdirectivesaswell.
Anartifactofexplicitthreadingisthatdataexchangeismore
apparent.Thishelpsinalleviatingsomeoftheoverheadsfrom
datamovement,falsesharing,andcontention.
ExplicitthreadingalsoprovidesaricherAPIintheformof
conditionwaits,locksofdifferenttypes,andincreasedexibility
forbuildingcompositesynchronizationoperations.
Finally,sinceexplicitthreadingisusedmorewidelythan
OpenMP,toolsandsupportforPthreadsprogramsareeasier
tond.
–TypesetbyFoilTEX–66
Tags