Unit 2
Parallel Algorithm Design
By: Ritu Agrawal
Assistant Professor, CSE Department, PIET, PU






Algorithm
Sequential Algorithm
Parallel Algorithm
Principles of Parallel Algorithms
Preliminaries

Example: Multiplying a Dense Matrix with a Vector
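As a quick illustration (not from the slides), here is a minimal Python sketch of the row-per-task decomposition of dense matrix-vector multiplication: task i computes y[i] as the dot product of row i of A with b. The helper names (row_task, dense_matvec) and the use of a process pool are illustrative assumptions.

from multiprocessing import Pool

def row_task(args):
    row, b = args
    # One task: the dot product of a single row of A with the vector b.
    return sum(a_ij * b_j for a_ij, b_j in zip(row, b))

def dense_matvec(A, b, processes=4):
    # Each row of A defines one independent task; the pool maps these
    # tasks onto worker processes.
    with Pool(processes) as pool:
        return pool.map(row_task, [(row, b) for row in A])

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    b = [5, 6]
    print(dense_matvec(A, b))   # [17, 39]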

Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000

Consider the execution of the query:

MODEL = "Civic" AND YEAR = 2001 AND
(COLOR = "Green" OR COLOR = "White")

Task-dependency graph
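A hedged Python sketch of one possible task decomposition for this query: each predicate is an independent task producing an intermediate set of record IDs, and combining tasks intersect or union the results (this mirrors the task-dependency graph). Only a few rows of the table are reproduced, and all variable names are illustrative.

records = [
    (4523, "Civic",  2002, "Blue"),
    (6734, "Civic",  2001, "White"),
    (4395, "Civic",  2001, "Red"),
    # ... remaining rows of the table
]

civic  = {r[0] for r in records if r[1] == "Civic"}        # task 1
yr2001 = {r[0] for r in records if r[2] == 2001}           # task 2
green  = {r[0] for r in records if r[3] == "Green"}        # task 3
white  = {r[0] for r in records if r[3] == "White"}        # task 4

civic_2001 = civic & yr2001           # depends on tasks 1 and 2
green_or_white = green | white        # depends on tasks 3 and 4
result = civic_2001 & green_or_white  # final combining task
print(result)                         # {6734}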

Example: Database Query Processing

An alternate decomposition of the given problem into subtasks, along with their
data dependencies.

Fine-grained decomposition of the task

Coarse-grained decomposition of the task

Abstraction of task graphs

Limits on Parallel Performance



Types of granularity, utilizing the resulting concurrency
Facing the inherent bounds on performance





To capture interactions among tasks:
Node = task
Edge (undirected or directed) = interaction or data exchange

Task-dependency graph vs. task-interaction graph

Task Interaction Graph

Example: Sparse matrix vector multiplication

Tasks: each task computes one entry of y[].
Assign the i-th row of A to Task i; also assign b[i] to Task i.



We need to compute the products A[i, j] × b[j] only for those values of j
for which A[i, j] ≠ 0.
For example, y[0] = A[0, 0].b[0] + A[0, 1].b[1] + A[0, 4].b[4] + A[0, 8].b[8].
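A minimal Python sketch of one such task, assuming each sparse row is stored as a dictionary of its nonzero entries (a storage choice made here purely for illustration; the slides do not prescribe a format).

def sparse_row_task(sparse_row, b):
    # One task: y[i] = sum over the nonzero columns j of A[i, j] * b[j]
    return sum(a_ij * b[j] for j, a_ij in sparse_row.items())

# Row 0 with nonzeros in columns 0, 1, 4, 8 (as in the example above);
# the numeric values are made up.
A0 = {0: 2.0, 1: 1.0, 4: 3.0, 8: 0.5}
b = [1.0] * 12
print(sparse_row_task(A0, b))   # 2.0 + 1.0 + 3.0 + 0.5 = 6.5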
Example: Sparse matrix vector multiplication






The tasks into which a problem is decomposed run on physical processors.
Process: an abstract entity that uses the code and data corresponding to a
task to produce the output of that task within a finite amount of time after
the task is activated by the parallel program.
Process = task + task data + task code required to produce the task's output.
Processor: a hardware unit that physically performs the computations.
Mapping: the assignment of tasks to processes for execution.
Processes





We refer to the mapping as being from tasks to processes, as
opposed to processors.
This is because typical programming APIs do not allow easy
binding of tasks to physical processors.
Rather, we aggregate tasks into processes and rely on the system
to map these processes to physical processors.
We use the term process not in the UNIX sense, but simply to mean a
collection of tasks and associated data.
Why use processes rather than processors?

Basis for Choosing Mapping
Task-dependency graph: ensures maximum concurrency.
Task-interaction graph: ensures minimum communication.

1. Mapping independent tasks to different processes.
2. Assigning tasks on the critical path to processes as soon as they become
available.
3. Minimizing interaction between processes by mapping tasks with dense
interactions to the same process.

Criteria of Mapping

Mapping Database Query to Processes
4 processes can be used in total, since the maximum concurrency is 4.
• Assign all tasks within a level of the task-dependency graph to different
processes.

1. Recursive decomposition
2. Data decomposition
3. Exploratory decomposition
4. Speculative decomposition

Decomposition Techniques

• Ideal for problems to be solved by the divide-and-conquer method.
Steps:
1. Decompose a problem into a set of independent sub-problems.
2. Recursively decompose each sub-problem.
3. Stop decomposition when the minimum desired granularity is reached.

Recursive Decomposition Technique

Quick Sort Example
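A small Python sketch of how recursive decomposition applies to quicksort: each partitioning step creates two independent sub-problems that could be assigned to separate tasks. It is shown serially here for clarity; the pivot choice and names are illustrative.

def quicksort(A):
    if len(A) <= 1:
        return A
    pivot = A[0]
    left = [x for x in A[1:] if x < pivot]    # one independent subtask
    right = [x for x in A[1:] if x >= pivot]  # another independent subtask
    return quicksort(left) + [pivot] + quicksort(right)

print(quicksort([5, 2, 9, 1, 7]))   # [1, 2, 5, 7, 9]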

Find the minimum in an array of numbers A of length n:

procedure SERIAL_MIN(A, n)
begin
    min := A[0];
    for i := 1 to n - 1 do
        if (A[i] < min) then min := A[i];
    endfor;
    return min;
end SERIAL_MIN
Recursive Decomposition for Finding Min

procedure RECURSIVE_MIN(A, n)
begin
    if (n = 1) then
        min := A[0];
    else
        lmin := RECURSIVE_MIN(A, n/2);
        rmin := RECURSIVE_MIN(&(A[n/2]), n - n/2);
        if (lmin < rmin) then
            min := lmin;
        else
            min := rmin;
        endelse;
    endelse;
    return min;
end RECURSIVE_MIN

Finding Min by using Recursive Procedure
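For reference, a hedged, runnable Python rendering of the same idea; the thread pool used to run one half as a separate task, and the cutoff parameter that stops decomposition at a desired granularity, are illustrative additions and not part of the original pseudocode.

from concurrent.futures import ThreadPoolExecutor

def recursive_min(A, executor=None, cutoff=4):
    n = len(A)
    if n <= cutoff:                   # stop at the desired granularity
        return min(A)
    mid = n // 2
    if executor is not None:
        left = executor.submit(recursive_min, A[:mid])  # independent subtask
        rmin = recursive_min(A[mid:])                   # handled by this task
        lmin = left.result()
    else:
        lmin = recursive_min(A[:mid])
        rmin = recursive_min(A[mid:])
    return lmin if lmin < rmin else rmin

if __name__ == "__main__":
    data = [7, 3, 9, 1, 4, 8, 2, 6]
    with ThreadPoolExecutor() as ex:
        print(recursive_min(data, ex))   # 1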







Ideal for problems that operate on large data structures.
Steps:
1. The data on which the computations are performed are partitioned.
2. The data partitioning is used to induce a partitioning of the computations
into tasks.
Data partitioning:
Partition output data
Partition input data
Partition input + output data

Data Decomposition Technique

• If each element of the output can be computed independently of the others
as a function of the input.
• Partitioning the computations into tasks is natural: each task is assigned
the work of computing a portion of the output.

Data Decomposition Based on Partitioning Output Data

Matrix-matrix multiplication: C = A × B
• Partition matrix C into 2 × 2 submatrices.
• The computation of C can then be partitioned into four tasks.
Matrix Multiplication Example
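A minimal Python sketch of this output-data decomposition on a small 4 × 4 example; each of the four blocks of C could be computed by an independent task. The helper names and matrix contents are illustrative.

def matmul_block(A, B, row_range, col_range):
    # One task: compute the block of C given by row_range x col_range.
    n = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in col_range]
            for i in row_range]

n = 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # identity

half = n // 2
blocks = {}
for p, rows in enumerate((range(0, half), range(half, n))):
    for q, cols in enumerate((range(0, half), range(half, n))):
        blocks[(p, q)] = matmul_block(A, B, rows, cols)  # 4 independent tasks

print(blocks[(0, 0)])  # top-left 2x2 block of C (= top-left of A, since B = I)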



A partitioning of output data does not result in a unique decomposition into
tasks. For example, for the same problem as above, with an identical output
data distribution, we can derive two other decompositions.




Consider the problem of computing the frequency of a set of
itemsets in a transaction database.
In this problem there is a set T containing n transactions and a set I
containing m itemsets.
Each transaction and itemset contains a small number of items, out
of a possible set of items.
Problem: itemset counting in a transaction database

Data Decomposition Based on Partitioning Input Data




Ideal if the output is a single unknown value, or if the individual elements
of the output cannot be efficiently determined in isolation.
Example: finding the minimum, maximum, or sum of a set of numbers.
Example: sorting a set.
Partition the input data and associate a task with each partition of the
input data.

Data Decomposition based on partitioning input/output data





Often input and output data decomposition can be combined for a higher
degree of concurrency.
Decomposition based on partitioning input or output data generally follows
the owner-computes rule: each partition performs all the computations
involving the data that it owns.
Input data decomposition: a task performs all the computations that can be
done using its input data.
Output data decomposition: a task computes all the results in the partition
assigned to it.

For the itemset counting example, the transaction set (input) and itemset counts (output) can both
be decomposed as follows:
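A hedged Python sketch of one such combined decomposition, with a tiny made-up transaction set and itemset list; the 2 × 2 split of input transactions and output itemsets into four tasks, and all names, are illustrative.

from collections import Counter

transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "d"}]
itemsets = [frozenset({"a", "b"}), frozenset({"a", "c"}),
            frozenset({"b", "d"}), frozenset({"d"})]

def count_task(txn_part, itemset_part):
    # One task: count its own itemsets over its own share of the transactions.
    return Counter({s: sum(1 for t in txn_part if s <= t) for s in itemset_part})

# 2-way split of input (transactions) x 2-way split of output (itemsets)
# gives 4 independent tasks whose partial counts are summed at the end.
partials = [count_task(tp, ip)
            for tp in (transactions[:2], transactions[2:])
            for ip in (itemsets[:2], itemsets[2:])]

totals = sum(partials, Counter())
print(totals[frozenset({"a", "b"})])   # 2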

Data Decomposition Based on Partitioning Intermediate Data


Applicable to problems that can be solved by multi-stage computations, where
the output of one stage is the input to the subsequent stage.
Partitioning can be based on the input or output of an intermediate stage.

Example: dense matrix multiplication
• The original output data decomposition yields a maximum degree of
concurrency of 4; partitioning the intermediate partial products can increase
this degree of concurrency.

Exploratory Decomposition



In many cases, the decomposition of the problem goes
hand-in-hand with its execution.
Exploratory decomposition is used to decompose
problems whose underlying computations correspond to a
search of a space for solutions.
In exploratory decomposition, the search space is partitioned into smaller
parts, and each of these parts is searched concurrently until the desired
solutions are found.



Problems in this class include a variety of discrete optimization problems,
theorem proving, game playing, etc.
A simple application of exploratory decomposition is the solution of a
15-puzzle (a tile puzzle). We show a sequence of three moves that transform a
given initial state (a) into the desired final state (d).

Speculative Decomposition




In some applications, dependencies between tasks are not known a priori.
For such applications, it is impossible to identify independent tasks.
There are generally two approaches to dealing with such applications:
conservative approaches, which identify independent tasks only when they are
guaranteed not to have dependencies, and optimistic approaches, which
schedule tasks even when they may potentially be erroneous.
Conservative approaches may yield little concurrency, and optimistic
approaches may require a roll-back mechanism in the case of an error.

Example




A classic example of speculative decomposition is in discrete
event simulation.
The central data structure in a discrete event simulation is a time-
ordered event list.
Events are extracted precisely in time order, processed, and if
required, resulting events are inserted back into the event list.
Consider your day today as a discrete event system - you get up,
get ready, drive to work, work, eat lunch, work some more, drive
back, eat dinner, and sleep.




Each of these events may be processed independently; however, while driving
to work, you might meet with an unfortunate accident and not get to work at
all. Therefore, an optimistic scheduling of the other events will have to be
rolled back.
Another example: the simulation of a network of nodes (for instance, an
assembly line or a computer network through which packets pass). The task is
to simulate the behavior of this network for various inputs and node delay
parameters (note that networks may become unstable for certain values of
service rates, queue sizes, etc.).

Example: A simple network for discrete event simulation

Characteristics of Tasks




Identify the concurrency that is available in a problem and decompose it into
tasks that can be executed in parallel.
The nature of the tasks and the interactions among them have a bearing on the
mapping.
The characteristics of these tasks critically impact the choice and
performance of parallel algorithms.
Relevant task characteristics include:
1. Task generation
2. Task sizes
3. Size of data associated with tasks



Task Generation: The tasks that constitute a parallel algorithm may be
generated either statically or dynamically.
Static task generation refers to the scenario where all the tasks are known
before the algorithm starts execution.
Task Sizes: the size of a task is the relative amount of time required to
complete it.
The complexity of mapping schemes often depends on whether or not
the tasks are uniform; i.e., whether or not they require roughly the same
amount of time. If the amount of time required by the tasks varies
significantly, then they are said to be non-uniform.



If the size of all the tasks is known, then this information can often
be used in mapping of tasks to processes.

Size of Data Associated with Tasks:
It is important because the data associated with a task must be
available to the process performing that task, and the size and the
location of these data may determine the process that can perform
the task without incurring excessive data-movement overheads.

Characteristics of Inter-Task Interactions



Tasks need to interact with each other to share data, work, or synchronization
information.
Different parallel algorithms require different types of interactions among
concurrent tasks.
The nature of these interactions makes them more suitable for certain
programming paradigms and mapping schemes.
1) Static versus dynamic –
Static: interactions are known prior to execution.
2) Regular versus irregular –
Regular: interaction pattern can be exploited for efficient implementation.
3) Read-only versus read-write
4) One-way versus two-way

Static vs. Dynamic Interactions








Static interactions:
Tasks and associated interactions are predetermined; the task-interaction
graph and the times at which interactions occur are known.
Example: matrix multiplication. Easy to program.
Dynamic interactions:
The timing of interactions, or the sets of tasks to interact with, cannot be
determined prior to execution.
Difficult to program using message passing; shared-address-space programming
may be simpler.

Regular vs. Irregular Interactions






Regular interactions:
The interaction has a spatial structure that can be exploited for efficient
implementation, e.g., a ring or a mesh.
Irregular interactions:
The interactions have no well-defined structure.
Example: sparse matrix-vector multiplication.

Example: image dithering

Read-Only versus Read-Write


Read-only interactions: tasks require only read access to the data shared
among many concurrent tasks.

Read-write interactions: multiple tasks need read and write access to some
shared data.

One-way versus Two-way




One-way interactions: only one of a pair of communicating tasks initiates the
interaction and completes it without interrupting the other one.
Two-way interactions: the data or work needed by a task or a subset of tasks
is explicitly supplied by another task or subset of tasks; these usually
involve predefined producer and consumer tasks.
All read-only interactions can be formulated as one-way interactions.
Read-write interactions can be either one-way or two-way.

Mapping Techniques for Load Balancing




Problem → decomposed into a number of tasks → these tasks are mapped onto
processes.
Objective: minimize the execution time.
In order to achieve a small execution time, the overheads of executing the
tasks in parallel must be minimized.
For a given decomposition, there are two key sources of overhead:
1. time spent in inter-process interaction, and
2. time that some processes may spend being idle.
A good mapping must ensure that the computations and interactions among
processes at each stage of the execution of the parallel algorithm are well
balanced.

Dense Matrix example

Schemes for Static Mapping




Mapping must simultaneously minimize idling and maintain good load balance.
Static mapping distributes the tasks among processes prior to the execution
of the algorithm.
For this to work, we must have a good estimate of the size of each task.
Types of techniques:
1. Mappings based on data partitioning
2. Mappings based on task graph partitioning
3. Hybrid mappings

Mapping Based on Data Partitioning




By the owner-computes rule, mapping the relevant data onto processes is
equivalent to mapping tasks onto processes.
In mappings based on data partitioning, the two most common ways of
representing data in algorithms are arrays (or matrices) and graphs.
Arrays or matrices (regular data):
– Block distributions
– Cyclic and block-cyclic distributions
Irregular data:
– Graph partitioning

Block Distribution




In these distributions, an n-dimensional array is distributed among the
processes such that each process receives a contiguous block of array entries
along a specified subset of array dimensions.

Consider a 1-D block distribution of the dense matrix example:
Number of processes = 8
Hint: distribute rows or columns of the matrix to different processes, as in
the sketch below.

1-D Dense matrix multiplication
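A minimal Python sketch of a 1-D block distribution: n rows are split into p contiguous blocks, one block per process. The helper name and the ceiling-division choice of block size are assumptions made for illustration.

def block_owner(i, n, p):
    # Process that owns row i when rows are distributed in contiguous blocks.
    block = -(-n // p)              # ceil(n / p) rows per block
    return i // block

n, p = 16, 8
for i in range(n):
    print(i, "-> process", block_owner(i, n, p))
# rows 0-1 go to process 0, rows 2-3 to process 1, ..., rows 14-15 to process 7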

Multi-D Block Distribution
• Distribute blocks of the matrix to different processes.
Examples of two-dimensional distributions of an array: (a) on a 4 x 4 process
grid, and (b) on a 2 x 8 process grid.





For multiplying two dense matrices A and B, we can partition the
output matrix C using a block decomposition.
For load balance, we give each task the same number of elements of
C. (Note that each element of C corresponds to a single dot product.)
The choice of precise decomposition (1-D or 2-D) is determined by
the associated communication overhead.
In general, a higher-dimensional decomposition allows the use of a larger
number of processes.







Example: n × n dense matrix multiplication C = A × B using p processes.
Decomposition based on output data.
Each entry of C requires the same amount of computation.
Either a 1-D or a 2-D block distribution can be used.
A multi-dimensional distribution allows a higher degree of concurrency.
A multi-dimensional distribution can also help to reduce interactions (see
the 2-D owner sketch below).
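A hedged Python sketch of a 2-D block distribution over a square process grid, assuming p is a perfect square whose root divides n; the helper name and these assumptions are illustrative.

import math

def block_2d_owner(i, j, n, p):
    q = int(math.isqrt(p))          # process grid is q x q
    blk = n // q                    # block edge length (assumes q divides n)
    return (i // blk) * q + (j // blk)

n, p = 8, 4
print(block_2d_owner(0, 0, n, p), block_2d_owner(0, 7, n, p),
      block_2d_owner(7, 0, n, p), block_2d_owner(7, 7, n, p))
# 0 1 2 3 : the four corners of C land on the four different processes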

Block Distribution

Block Cyclic Distributions



If the amount of work differs for different entries of a matrix, a block
distribution can lead to load imbalances.
Example: Doolittle's method of LU factorization of a dense matrix.
The amount of computation increases from the top left to the bottom right of
the matrix.

Work used to compute Entries of L and U

•Block distribution of LU factorization tasks leads to load imbalance.

Block-Cyclic Distribution


A variation of the block distribution that can be used to alleviate load
imbalance.
Steps (see the sketch below):
1. Partition the array into many more blocks than the number of available
processes.
2. Assign blocks to processes in a round-robin manner so that each process
gets several non-adjacent blocks.
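A hedged Python sketch of the resulting 1-D block-cyclic ownership function; the block size, process count, and helper name are illustrative choices.

def block_cyclic_owner(i, block_size, p):
    # Process that owns element i under a 1-D block-cyclic distribution.
    return (i // block_size) % p

n, block_size, p = 16, 2, 4
print([block_cyclic_owner(i, block_size, p) for i in range(n)])
# [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]
# With block_size = 1 this degenerates to the cyclic distribution below.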

Cyclic Distribution


A cyclic distribution is an extreme case of a block-cyclic
distribution and can result in an almost perfect load
balance due to the extreme fine-grained underlying
decomposition.
Cyclic distribution → block size = 1
→ achieves fine-grained granularity.




Partitioning a given task-dependency graph across
processes.

Determining an optimal mapping for a general task-
dependency graph is an NP-complete problem.

Excellent heuristics exist for structured graphs.

Task Graph Partitioning




In the case of sparse matrices, block decompositions are more complex.

Consider the problem of multiplying a sparse matrix with a vector.

The graph of the matrix is a useful indicator of the work (number of nodes)
and communication (the degree of each node).

In this case, we would like to partition the graph so as to assign an equal
number of nodes to each process while minimizing the number of edges that
cross partition boundaries.

Task Graph Partitioning
Sparse-matrix vector multiplication

Dynamic Mapping Techniques



Dynamic mapping is sometimes also referred to as dynamic load
balancing, since load balancing is the primary motivation for
dynamic mapping.

Dynamic mapping is necessary in situations where a static mapping may result
in a highly imbalanced distribution of work among processes.

Dynamic mapping schemes can be:
1. centralized
2. distributed

Centralized Schemes





All executable tasks are maintained in a common central data structure or
they are maintained by a special process or a subset of processes.
Processes are designated as masters or slaves.
When a process runs out of work, it requests more work from the master.
When the number of processes increases, the master may become the
bottleneck.
To alleviate this, a process may pick up a number of tasks (a chunk) at one
time. This is called chunk scheduling.
Selecting large chunk sizes may lead to significant load imbalances as well.
A number of schemes have been used to gradually decrease chunk size as
the computation progresses.
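A minimal Python illustration: multiprocessing.Pool keeps a central queue of tasks and hands them to workers that request more work when they finish, and its chunksize argument controls how many tasks a worker picks up at a time, which is one concrete stdlib realization of chunk scheduling. The workload function is a made-up stand-in.

from multiprocessing import Pool

def work(task_id):
    # Stand-in for a task whose running time may vary from task to task.
    return sum(i * i for i in range(1000 * (task_id % 5 + 1)))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # Workers pull 8 tasks per request from the central task queue.
        results = pool.map(work, range(100), chunksize=8)
    print(len(results))   # 100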

Distributed Schemes






Each process can send or receive work from other processes.

This alleviates the bottleneck in centralized schemes.

There are four critical questions:
How are sending and receiving processes paired together,
Who initiates work transfer,
How much work is transferred, and
When is a transfer triggered?

Methods for Containing Interaction Overheads




Maximize data locality: Where possible, reuse intermediate data.
Restructure computation so that data can be reused in smaller time
windows.
Minimize volume of data exchange: There is a cost associated with
each word that is communicated. For this reason, we must minimize
the volume of data communicated.
Minimize frequency of interactions: There is a startup cost associated
with each interaction. Therefore, try to merge multiple interactions to
one, where possible.
Minimize contention and hot-spots: use decentralized techniques, replicate
data where necessary.

Overlapping computations with interactions: Use non-blocking
communications, multithreading, and prefetching to hide latencies.

Replicating data or computations.

Using group communications instead of point-to-point primitives.

Overlap interactions with other interactions.







An algorithm model is a way of structuring a parallel algorithm by selecting
a decomposition and mapping technique and applying the appropriate strategy
to minimize interactions.
Data-Parallel Model: tasks are statically (or semi-statically) mapped to
processes, and each task performs similar operations on different data.
Usually based on data decomposition followed by static mapping.
Uniform partitioning of data followed by static mapping guarantees load
balance.
Example algorithm: dense matrix multiplication.
Task Graph Model: starting from a task-dependency graph, the
interrelationships among the tasks are utilized to promote locality or to
reduce interaction costs.
Typically used for problems in which the amount of data associated with the
tasks is large relative to the amount of computation associated with them.
Parallel Algorithm Models






Static mapping is usually used to optimize data movement costs.
Example algorithms: parallel quicksort, sparse matrix factorization.
Master-Slave Model: one or more master processes generate work and allocate
it to worker processes. This allocation may be static or dynamic.
Pipeline / Producer-Consumer Model: a stream of data is passed through a
succession of processes, each of which performs some task on it.
Hybrid Models: a hybrid model may be composed either of multiple models
applied hierarchically, or of multiple models applied sequentially to
different phases of a parallel algorithm.