Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
Consider the execution of the query:
MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
Task-dependency graph
An alternate decomposition of the given problem into subtasks, along with their
data dependencies.
Fine-grained decomposition of a task
Coarse-grained decomposition of a task
Abstraction of task graphs
Limits on Parallel Performance
Types of granularity, and utilizing the resulting concurrency
Facing inherent bounds
To capture interactions among tasks:
Node = task
Edge (undirected/directed) = interaction or data exchange
Task-dependency graph vs. task-interaction graph
Task Interaction Graph
Example: Sparse matrix-vector multiplication
Tasks: each task computes an entry of y[]
• Assign the i-th row of A to Task i. Also assign b[i] to Task i.
• We need to compute the products A[i, j] × b[j] only for those values of j for which A[i, j] ≠ 0.
• For example, y[0] = A[0, 0]·b[0] + A[0, 1]·b[1] + A[0, 4]·b[4] + A[0, 8]·b[8].
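As a rough illustration of this decomposition (not part of the original slides; the sparse row representation and function names are our own), the Python sketch below lets each task compute one entry y[i] from the nonzeros of row i:

# Hedged sketch: each "task" computes one entry y[i] from row i of a sparse
# matrix A and the vector b, touching only the nonzero entries A[i, j] != 0.
# Real codes would use CSR or a library such as scipy.sparse; a plain
# list of (column, value) pairs per row is used here only for illustration.

def matvec_task(row_i, b):
    """Task i: compute y[i] = sum of A[i, j] * b[j] over the nonzeros of row i."""
    return sum(a_ij * b[j] for j, a_ij in row_i)

def sparse_matvec(sparse_rows, b):
    # Each iteration of this loop is an independent task; in a parallel
    # setting the rows (and the corresponding y[i]) go to different tasks.
    return [matvec_task(row, b) for row in sparse_rows]

if __name__ == "__main__":
    # Row 0 has nonzeros at columns 0, 1, 4, 8, matching the example
    # y[0] = A[0,0]*b[0] + A[0,1]*b[1] + A[0,4]*b[4] + A[0,8]*b[8].
    rows = [
        [(0, 1.0), (1, 2.0), (4, 3.0), (8, 4.0)],   # row 0
        [(1, 5.0), (2, 1.0)],                        # row 1
    ]
    b = [1.0] * 9
    print(sparse_matvec(rows, b))   # -> [10.0, 6.0]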
• Process: the tasks into which a problem is decomposed run on physical processors, but we describe them in terms of processes. A process is an abstract entity that uses the code and data corresponding to a task to produce the output of that task within a finite amount of time after the task is activated by the parallel program.
• Process = task + task data + task code required to produce the task's output.
• Processor: a hardware unit that physically performs the computations.
• Mapping: the mechanism by which tasks are assigned to processes for execution.
Processes
• We refer to the mapping as being from tasks to processes, as opposed to processors.
• This is because typical programming APIs do not allow easy binding of tasks to physical processors.
• Rather, we aggregate tasks into processes and rely on the system to map these processes to physical processors.
• We use the term process not in the UNIX sense, but simply as a collection of tasks and associated data.
Why use processes rather than processors?
Basis for Choosing Mapping
• Task-dependency graph: used to ensure maximum concurrency.
• Task-interaction graph: used to ensure minimum communication.
1. Map independent tasks to different processes.
2. Assign tasks on the critical path to processes as soon as they become available.
3. Minimize interaction between processes by mapping tasks with dense interactions to the same process.
Criteria of Mapping
Mapping Database Query to Processes
• 4 processes can be used in total, since the maximum concurrency is 4.
• Assign all tasks within a level to different processes.
• Ideal for problems that can be solved by the divide-and-conquer method.
• Steps:
1. Decompose a problem into a set of independent sub-problems.
2. Recursively decompose each sub-problem.
3. Stop decomposition when the minimum desired granularity is reached.
Recursive Decomposition Technique
Quick Sort Example
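The quicksort figure from the original slides is not reproduced here; as a hedged stand-in, the sketch below shows how each partitioning step yields two independent sub-problems that could become separate tasks (the function name and test data are our own):

# Minimal sketch of recursive decomposition via quicksort.
# After partitioning around the pivot, the two sublists are independent
# sub-problems and could be sorted by concurrent tasks; recursion stops
# (here: lists of length <= 1) once the desired granularity is reached.

def quicksort(a):
    if len(a) <= 1:           # minimum granularity: nothing left to decompose
        return a
    pivot = a[0]
    left  = [x for x in a[1:] if x < pivot]    # independent sub-problem 1
    right = [x for x in a[1:] if x >= pivot]   # independent sub-problem 2
    # In a parallel version, these two recursive calls would be separate tasks.
    return quicksort(left) + [pivot] + quicksort(right)

print(quicksort([3, 7, 1, 9, 4, 2]))   # -> [1, 2, 3, 4, 7, 9]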
Find the minimum in an array of numbers A of length n:
1. procedure SERIAL_MIN(A, n)
2. begin
3.   min := A[0];
4.   for i := 1 to n − 1 do
5.     if (A[i] < min) min := A[i];
6.   endfor;
7.   return min;
8. end SERIAL_MIN
Recursive Decomposition for Finding Min
1. procedure RECURSIVE_MIN(A, n)
2. begin
3.   if (n = 1) then
4.     min := A[0];
5.   else
6.     lmin := RECURSIVE_MIN(A, n/2);
7.     rmin := RECURSIVE_MIN(&(A[n/2]), n − n/2);
8.     if (lmin < rmin) then
9.       min := lmin;
10.    else
11.      min := rmin;
12.    endelse;
13.  endelse;
14.  return min;
15. end RECURSIVE_MIN
Finding Min by using Recursive Procedure
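A runnable Python transcription of RECURSIVE_MIN (a sketch following the pseudocode above; the function name recursive_min and the index-range form are our own):

# Split the array in half, find the minimum of each half (two independent
# tasks), and combine the results.

def recursive_min(a, lo, hi):
    """Minimum of a[lo:hi]; the two halves could be evaluated concurrently."""
    if hi - lo == 1:
        return a[lo]
    mid = (lo + hi) // 2
    lmin = recursive_min(a, lo, mid)    # left-half task
    rmin = recursive_min(a, mid, hi)    # right-half task
    return lmin if lmin < rmin else rmin

a = [4, 9, 1, 7, 8, 11, 2, 12]
print(recursive_min(a, 0, len(a)))   # -> 1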
• Ideal for problems that operate on large data structures.
• Steps:
1. The data on which the computations are performed are partitioned.
2. The data partitioning is used to induce a partitioning of the computations into tasks.
• Data partitioning can be done in several ways:
– Partition output data
– Partition input data
– Partition input + output data
Data Decomposition Technique:
• If each element of the output can be computed independently of the others as a function of the input, partitioning the computations into tasks is natural.
• Each task is assigned the work of computing a portion of the output.
Data Decomposition Based on Partitioning Output Data
Matrix-matrix multiplication: C = A × B
• Partition matrix C into 2 × 2 submatrices.
• The computation of C can then be partitioned into four tasks.
Matrix Multiplication Example
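Spelling out the four tasks (a worked detail added here, assuming A, B, and C are each split into the same 2 × 2 block structure):

Task 1: C[1,1] = A[1,1]·B[1,1] + A[1,2]·B[2,1]
Task 2: C[1,2] = A[1,1]·B[1,2] + A[1,2]·B[2,2]
Task 3: C[2,1] = A[2,1]·B[1,1] + A[2,2]·B[2,1]
Task 4: C[2,2] = A[2,1]·B[1,2] + A[2,2]·B[2,2]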
• A partitioning of the output data does not result in a unique decomposition into tasks.
• For example, for the same matrix-multiplication problem, with an identical output data distribution, we can derive two other decompositions:
• Consider the problem of computing the frequency of a set of itemsets in a transaction database.
• In this problem there is a set T containing n transactions and a set I containing m itemsets.
• Each transaction and each itemset contain a small number of items, out of a possible set of items.
Problem: counting itemsets in a transaction database
Data Decomposition Based on Partitioning Input Data
• Ideal if the output is a single unknown value, or if the individual elements of the output cannot be efficiently determined in isolation.
• Example: finding the minimum, maximum, or sum of a set of numbers.
• Example: sorting a set.
• Partition the input data and associate a task with each partition of the input data.
Data Decomposition Based on Partitioning Input/Output Data
• Often input and output data decompositions can be combined for a higher degree of concurrency.
• Decomposition based on data partitioning follows the owner-computes rule: each partition performs all the computations involving the data that it owns.
• Input data decomposition: a task performs all the computations that can be done using its input data.
• Output data decomposition: a task computes all the results in the partition assigned to it.
For the itemset counting example, the transaction set (input) and itemset counts (output) can both
be decomposed as follows:
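A hedged sketch of one such combined decomposition (all names and the tiny data set are illustrative, not from the slides): the transactions (input) are split into blocks, the itemsets whose counts form the output are also split into blocks, and each task counts one itemset block over one transaction block.

# Input + output decomposition for itemset counting. Task (i, j) counts
# itemset block j within transaction block i; the partial counts for each
# itemset block are then summed across transaction blocks.

def count_task(transactions_block, itemsets_block):
    """One task: partial counts of each itemset over one block of transactions."""
    return [sum(itemset <= t for t in transactions_block) for itemset in itemsets_block]

def itemset_frequencies(transactions, itemsets, t_blocks=2, i_blocks=2):
    t_parts = [transactions[k::t_blocks] for k in range(t_blocks)]   # input partition
    i_parts = [itemsets[k::i_blocks] for k in range(i_blocks)]       # output partition
    counts = []
    for i_part in i_parts:                        # owner of one block of counts
        partials = [count_task(t_part, i_part)    # independent tasks over input blocks
                    for t_part in t_parts]
        counts.extend(sum(c) for c in zip(*partials))
    return counts

T = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "d"}]
I = [{"a"}, {"b"}, {"a", "b"}, {"d"}]
print(itemset_frequencies(T, I))   # counts, ordered block-by-block over the itemsets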
Data Decomposition Based on Partitioning Intermediate Data
• Applicable for problems that can be solved by multi-stage computations, such that the output of one stage is the input to the subsequent stage.
• Partitioning can be based on the input or the output of an intermediate stage.
• Example: dense matrix multiplication.
• The original output data decomposition yields a maximum degree of concurrency of 4, while partitioning the intermediate results can yield a higher degree of concurrency.
Exploratory Decomposition
• In many cases, the decomposition of the problem goes hand-in-hand with its execution.
• Exploratory decomposition is used to decompose problems whose underlying computations correspond to a search of a space of solutions.
• In exploratory decomposition, the search space is partitioned into smaller parts, and each of these parts is searched concurrently until the desired solutions are found.
• Problems in this class include a variety of discrete optimization problems, theorem proving, game playing, etc.
• A simple application of exploratory decomposition is the solution of the 15-puzzle (a tile puzzle). We show a sequence of three moves that transform a given initial state (a) into the desired final state (d).
Speculative Decomposition
• In some applications, dependencies between tasks are not known a priori.
• For such applications, it is impossible to identify independent tasks in advance.
• There are generally two approaches to dealing with such applications: conservative approaches, which identify independent tasks only when they are guaranteed not to have dependencies, and optimistic approaches, which schedule tasks even when they may potentially be erroneous.
• Conservative approaches may yield little concurrency, and optimistic approaches may require a roll-back mechanism in the case of an error.
Example
• A classic example of speculative decomposition is discrete event simulation.
• The central data structure in a discrete event simulation is a time-ordered event list.
• Events are extracted precisely in time order, processed, and, if required, resulting events are inserted back into the event list.
• Consider your day today as a discrete event system: you get up, get ready, drive to work, work, eat lunch, work some more, drive back, eat dinner, and sleep.
• Each of these events may be processed independently; however, while driving to work you might meet with an unfortunate accident and not get to work at all.
• Therefore, an optimistic scheduling of the other events will have to be rolled back.
• Another example: the simulation of a network of nodes (for instance, an assembly line or a computer network through which packets pass). The task is to simulate the behavior of this network for various inputs and node delay parameters (note that networks may become unstable for certain values of service rates, queue sizes, etc.).
Example: A simple network for discrete event simulation
Characteristics of Tasks
• Identify the concurrency that is available in a problem and decompose it into tasks that can be executed in parallel.
• The nature of the tasks and the interactions among them has a bearing on the mapping.
• The characteristics of these tasks critically impact the choice and performance of parallel algorithms.
• Relevant task characteristics include:
1. Task generation
2. Task sizes
3. Size of data associated with tasks
• Task generation: the tasks that constitute a parallel algorithm may be generated either statically or dynamically.
• Static task generation refers to the case where all the tasks are known before the algorithm starts execution.
• Task sizes: the size of a task is the relative amount of time required to complete it.
• The complexity of mapping schemes often depends on whether or not the tasks are uniform, i.e., whether or not they require roughly the same amount of time. If the amount of time required by the tasks varies significantly, they are said to be non-uniform.
• If the size of all the tasks is known, this information can often be used in mapping tasks to processes.
• Size of data associated with tasks: this is important because the data associated with a task must be available to the process performing that task, and the size and location of these data may determine the process that can perform the task without incurring excessive data-movement overheads.
Characteristics of Inter-Task Interactions
• Tasks need to interact with each other to share data, work, or synchronization information.
• Different parallel algorithms require different types of interactions among concurrent tasks.
• The nature of these interactions makes them more suitable for certain programming paradigms and mapping schemes.
1) Static versus dynamic –
Static: interactions are known prior to execution.
2) Regular versus irregular –
Regular: interaction pattern can be exploited for efficient implementation.
3) Read-only versus read-write
4) One-way versus two-way
Static vs. Dynamic Interactions
• Static interactions: the tasks and their associated interactions are predetermined; the task-interaction graph and the times at which interactions occur are known.
– Example: matrix multiplication. Easy to program.
• Dynamic interactions: the timing of interactions, or the set of tasks to interact with, cannot be determined prior to execution.
– Difficult to program using message passing; shared-address-space programming may be simpler.
Regular vs. Irregular Interactions
• Regular interactions: the interaction pattern has a spatial structure that can be exploited for efficient implementation, e.g., a ring or a mesh.
– Example: image dithering.
• Irregular interactions: the interactions have no well-defined structure.
– Example: sparse matrix-vector multiplication.
Read-Only versus Read-Write
• Read-only interactions: tasks require only read access to the data shared among many concurrent tasks.
• Read-write interactions: multiple tasks need both read and write access to some shared data.
One-way versus Two-way
• One-way interactions: only one of a pair of communicating tasks initiates the interaction and completes it without interrupting the other.
• Two-way interactions: the data or work needed by a task or a subset of tasks is explicitly supplied by another task or subset of tasks; these usually involve predefined producer and consumer tasks.
• All read-only interactions can be formulated as one-way interactions.
• Read-write interactions can be either one-way or two-way.
Mapping Techniques for Load Balancing
• A problem is decomposed into a number of tasks, and these tasks are mapped onto processes.
• Objective: minimize the execution time.
• In order to achieve a small execution time, the overheads of executing the tasks in parallel must be minimized.
• For a given decomposition, there are two key sources of overhead:
1. time spent in inter-process interaction, and
2. time that some processes may spend being idle.
• A good mapping must ensure that the computations and interactions among processes at each stage of the execution of the parallel algorithm are well balanced.
Dense Matrix example
Schemes for Static Mapping
• The mapping must simultaneously minimize idling and balance the load.
• Static mapping distributes the tasks among processes prior to the execution of the algorithm.
• For this to work, we must have a good estimate of the size of each task.
• Types of techniques:
1. Mapping based on data partitioning
2. Task graph partitioning
3. Hybrid strategies
Mapping Based on Data Partitioning
• By the owner-computes rule, mapping the relevant data onto processes is equivalent to mapping tasks onto processes.
• For mappings based on data partitioning, two of the most common ways of representing data in algorithms are arrays (or matrices) and graphs.
• Arrays or matrices (regular data):
– Block distributions
– Cyclic and block-cyclic distributions
• Irregular data:
– Graph partitioning
Block Distribution
• In these distributions, an n-dimensional array is distributed among the processes such that each process receives a contiguous block of array entries along a specified subset of array dimensions.
• Consider a 1-D distribution of the dense matrix example:
– Number of processes = 8
– Hint: distribute rows or columns of the matrix to different processes.
1-D Dense matrix multiplication
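As a small illustrative sketch (our own; block_range is a hypothetical helper), the block boundaries for a 1-D row-wise block distribution can be computed as follows:

# 1-D block distribution: process k owns a contiguous block of rows.
# With n rows and p processes, the first (n mod p) processes get
# ceil(n/p) rows and the rest get floor(n/p) rows.

def block_range(n, p, k):
    """Half-open row range [lo, hi) owned by process k."""
    base, extra = divmod(n, p)
    lo = k * base + min(k, extra)
    hi = lo + base + (1 if k < extra else 0)
    return lo, hi

n, p = 16, 8            # e.g. a 16-row matrix on 8 processes -> 2 rows each
for k in range(p):
    print(k, block_range(n, p, k))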
Multi-D Block Distribution
• Distribute blocks of the matrix to different processes.
Examples of two-dimensional distributions of an array: (a) on a 4 × 4 process grid, and (b) on a 2 × 8 process grid.
• For multiplying two dense matrices A and B, we can partition the output matrix C using a block decomposition.
• For load balance, we give each task the same number of elements of C. (Note that each element of C corresponds to a single dot product.)
• The choice of the precise decomposition (1-D or 2-D) is determined by the associated communication overhead.
• In general, a higher-dimensional decomposition allows the use of a larger number of processes.
• Example: n × n dense matrix multiplication C = A × B using p processes.
• Decomposition based on output data.
• Each entry of C requires the same amount of computation.
• Either a 1-D or a 2-D block distribution can be used.
• A multi-dimensional distribution allows a higher degree of concurrency.
• A multi-dimensional distribution can also help to reduce interactions.
Block Distribution
Block Cyclic Distributions
• If the amount of work differs for different entries of a matrix, a block distribution can lead to load imbalances.
• Example: Doolittle's method of LU factorization of a dense matrix.
• The amount of computation increases from the top-left to the bottom-right of the matrix.
Work used to compute the entries of L and U
• A block distribution of the LU factorization tasks leads to load imbalance.
Block-Cyclic Distribution
• A variation of the block distribution that can be used to alleviate the load imbalance.
• Steps:
1. Partition the array into many more blocks than the number of available processes.
2. Assign blocks to processes in a round-robin manner so that each process gets several non-adjacent blocks.
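A minimal sketch of the round-robin assignment (the helper block_cyclic_owner is hypothetical), shown here for the rows of a matrix:

# Block-cyclic distribution: split the rows into many more blocks than
# processes, then deal the blocks out round-robin. Process k therefore
# owns blocks k, k + p, k + 2p, ..., which are non-adjacent.

def block_cyclic_owner(row, block_size, p):
    """Process that owns a given row under a block-cyclic distribution."""
    return (row // block_size) % p

n, p, block_size = 16, 4, 2
for k in range(p):
    rows = [r for r in range(n) if block_cyclic_owner(r, block_size, p) == k]
    print(k, rows)
# With block_size = 1 this degenerates to a purely cyclic distribution.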
Cyclic Distribution
• A cyclic distribution is an extreme case of the block-cyclic distribution, with block size = 1.
• It achieves a fine-grained granularity and can result in an almost perfect load balance due to the extremely fine-grained underlying decomposition.
• Partition a given task-dependency graph across processes.
• Determining an optimal mapping for a general task-dependency graph is an NP-complete problem.
• Excellent heuristics exist for structured graphs.
Task Graph Partitioning
• In the case of sparse matrices, block decompositions are more complex.
• Consider the problem of multiplying a sparse matrix with a vector.
• The graph of the matrix is a useful indicator of the work (number of nodes) and the communication (the degree of each node).
Dynamic Mapping Techniques
• Dynamic mapping is sometimes also referred to as dynamic load balancing, since load balancing is the primary motivation for dynamic mapping.
• Dynamic mapping is necessary in situations where a static mapping may result in a highly imbalanced distribution of work among processes.
• Dynamic mapping schemes can be:
1. centralized
2. distributed
Centralized Schemes
• All executable tasks are maintained in a common central data structure, or they are maintained by a special process or a subset of processes.
• Processes are designated as masters or slaves.
• When a process runs out of work, it requests the master for more work.
• When the number of processes increases, the master may become a bottleneck.
• To alleviate this, a process may pick up a number of tasks (a chunk) at one time. This is called chunk scheduling.
• Selecting large chunk sizes may lead to significant load imbalances as well.
• A number of schemes have been used to gradually decrease the chunk size as the computation progresses.
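As a rough illustration of chunk scheduling (a sketch with hypothetical names, not a full master-slave implementation), a shared counter plays the role of the master's central task pool and each worker requests a new chunk of task indices whenever it runs out of work:

# Centralized (master-style) chunk scheduling sketch using threads.
# Handing out chunks rather than single tasks reduces how often workers
# must go back to the central pool for more work.

import threading

def run_chunked(num_tasks, num_workers, chunk, work):
    next_task = [0]
    lock = threading.Lock()

    def worker():
        while True:
            with lock:                       # ask the central pool for a chunk
                lo = next_task[0]
                next_task[0] = min(lo + chunk, num_tasks)
                hi = next_task[0]
            if lo >= num_tasks:
                return
            for i in range(lo, hi):          # process the chunk locally
                work(i)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads: t.start()
    for t in threads: t.join()

results = []
run_chunked(num_tasks=20, num_workers=4, chunk=3, work=lambda i: results.append(i * i))
print(sorted(results))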
Distributed Schemes
• Each process can send work to, or receive work from, other processes.
• This alleviates the bottleneck of centralized schemes.
• There are four critical questions:
1. How are the sending and receiving processes paired together?
2. Who initiates the work transfer?
3. How much work is transferred?
4. When is a transfer triggered?
Methods for Containing Interaction Overheads
• Maximize data locality: where possible, reuse intermediate data; restructure the computation so that data can be reused in smaller time windows.
• Minimize the volume of data exchange: there is a cost associated with each word that is communicated, so the volume of data communicated must be minimized.
• Minimize the frequency of interactions: there is a startup cost associated with each interaction, so try to merge multiple interactions into one where possible.
• Minimize contention and hot spots: use decentralized techniques and replicate data where necessary.
• Overlap computations with interactions: use non-blocking communications, multithreading, and prefetching to hide latencies.
• Replicate data or computations.
• Use group communications instead of point-to-point primitives.
• Overlap interactions with other interactions.
• An algorithm model is a way of structuring a parallel algorithm by selecting a decomposition and mapping technique and applying the appropriate strategy to minimize interactions.
• Data-parallel model: tasks are statically (or semi-statically) mapped to processes, and each task performs similar operations on different data.
– Usually based on data decomposition followed by static mapping.
– Uniform partitioning of data followed by static mapping guarantees load balance.
– Example algorithm: dense matrix multiplication.
• Task graph model: starting from a task-dependency graph, the interrelationships among the tasks are utilized to promote locality or to reduce interaction costs.
– Typically used when the amount of data associated with the tasks is large relative to the amount of computation associated with them.
Parallel Algorithm Models
• Static mapping is usually used to optimize the cost of data movement.
• Example algorithms: parallel quicksort, sparse matrix factorization.
• Master-slave model: one or more master processes generate work and allocate it to worker processes. This allocation may be static or dynamic.
• Pipeline / producer-consumer model: a stream of data is passed through a succession of processes, each of which performs some task on it.
• Hybrid models: a hybrid model may be composed either of multiple models applied hierarchically, or of multiple models applied sequentially to different phases of a parallel algorithm.