Differentiate between parallel IR and distributed IR.ppt
MARasheed3
Mar 05, 2025
Parallel and Distributed IR
Eric Brown
Parallel Computing
SISD: single instruction stream, single data stream.
SIMD: single instruction stream, multiple data stream.
MISD: multiple instruction stream, single data stream.
MIMD: multiple instruction stream, multiple data stream.
Performance Measures
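Speedup:
$$S = \frac{\text{running time of best available sequential algorithm}}{\text{running time of parallel algorithm}}$$
Amdahl's law, where f is the fraction of the problem that must be computed sequentially and N is the number of processors:
$$S \le \frac{1}{f + (1 - f)/N} \le \frac{1}{f}$$
Efficiency:
$$\phi = \frac{S}{N}$$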
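As a rough numeric illustration of these measures, the sketch below evaluates the bound S <= 1/(f + (1-f)/N) and the ratio S/N for a few processor counts. It assumes f is the sequential fraction of the work and N the number of processors; the function names are made up for this example.

```python
def amdahl_speedup_bound(f: float, n: int) -> float:
    """Upper bound on speedup for sequential fraction f on n processors."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(speedup: float, n: int) -> float:
    """Speedup per processor, S / N."""
    return speedup / n


if __name__ == "__main__":
    f = 0.1  # assumed: 10% of the work cannot be parallelized
    for n in (1, 4, 16, 64):
        s = amdahl_speedup_bound(f, n)
        print(f"N={n:3d}  S<={s:.2f}  S/N<={efficiency(s, n):.2f}")
```

However many processors are added, the bound never exceeds 1/f, which is why the per-processor ratio falls as N grows.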
Parallel IR
Introduction:
Develop new retrieval strategies that directly
lend themselves to parallel implementation.
Adapt existing, well-studied information retrieval
algorithms to parallel processing.
MIMD Architecture
Inverted Files
Logical Document Partitioning
Uses essentially the same underlying inverted file index
as the original sequential algorithm.
Physical Document Partitioning
Each subcollection has its own inverted file, and the
search processes share nothing during query evaluation.
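A minimal sketch of physical document partitioning, under simplifying assumptions: documents are dealt round-robin into subcollections, each subcollection gets its own small inverted file, and a query is a Boolean AND of whitespace-separated terms. All function names are hypothetical, and the serial loop in `search_all` stands in for what would be independent search processes.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def partition_documents(docs, n_parts):
    """Deal documents round-robin into n_parts subcollections."""
    parts = [{} for _ in range(n_parts)]
    for i, doc_id in enumerate(sorted(docs)):
        parts[i % n_parts][doc_id] = docs[doc_id]
    return parts


def search_partition(index, query):
    """Boolean AND of the query terms against one subcollection's index."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return result


def search_all(indexes, query):
    """Search every subcollection independently, then merge the hits."""
    hits = set()
    for index in indexes:  # each iteration could run on its own processor
        hits |= search_partition(index, query)
    return sorted(hits)


if __name__ == "__main__":
    docs = {
        1: "parallel information retrieval",
        2: "distributed information retrieval",
        3: "parallel computing",
        4: "inverted file index",
    }
    indexes = [build_inverted_file(p) for p in partition_documents(docs, 2)]
    print(search_all(indexes, "information retrieval"))
```

Because no index is shared, the only coordination needed is the final merge of per-subcollection results.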
MIMD Architecture
Logical document partitioning requires less
communication than physical document partitioning with
similar parallelization, and so is likely to provide
better overall performance.
Physical document partitioning, on the other hand,
offers more flexibility, and converting an existing
IR system into a parallel IR system is simpler using
physical document partitioning.
MIMD Architectures
Term partitioning
When term partitioning is used with an inverted file-based
system, an inverted file is created for the document
collection and the inverted lists are spread across the processors.
Assuming each processor has its own I/O channel
and disks: when the term distributions in the documents
and the queries are skewed, document partitioning
performs better. When terms are uniformly
distributed in user queries, term partitioning performs
better.
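To make the idea of spreading inverted lists across processors concrete, here is a small sketch. It assumes terms are assigned to processors by a stable hash (CRC32), which is only one possible assignment rule and is not taken from the slides; the function names are hypothetical. The per-processor work count hints at why skewed queries can overload a few processors.

```python
import zlib
from collections import Counter


def processor_for(term, n_procs):
    """Assumed assignment rule: stable hash of the term modulo n_procs."""
    return zlib.crc32(term.encode("utf-8")) % n_procs


def assign_terms(inverted_file, n_procs):
    """Give each processor the inverted lists of the terms it owns."""
    shards = [{} for _ in range(n_procs)]
    for term, postings in inverted_file.items():
        shards[processor_for(term, n_procs)][term] = postings
    return shards


def evaluate(shards, query_terms):
    """Fetch each term's list from its owning processor, then intersect."""
    n_procs = len(shards)
    lists = [set(shards[processor_for(t, n_procs)].get(t, [])) for t in query_terms]
    work = Counter(processor_for(t, n_procs) for t in query_terms)
    hits = sorted(set.intersection(*lists)) if lists else []
    return hits, dict(work)


if __name__ == "__main__":
    inverted_file = {"parallel": [1, 3], "retrieval": [1, 2], "index": [4]}
    shards = assign_terms(inverted_file, 2)
    hits, work = evaluate(shards, ["parallel", "retrieval"])
    print(hits, work)  # work maps processor number -> lists fetched
```

A query only touches the processors that own its terms, so load balance depends directly on how query terms are distributed.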
SIMD Architecture
Signature Files
Inverted Files
Distributed IR
Introduction
A distributed computing system can be viewed
as an MIMD parallel processor with relatively
slow inter-processor communication channels and
the freedom to employ a heterogeneous
collection of processors in the system.
Distributed IR
Introduction
The distributed model is very similar to the MIMD
parallel processing model.
The main difference is that subtasks run on
different computers and the communication
between the subtasks is performed using a network
protocol such as TCP/IP.
Collection Partitioning
The procedure used to add documents to
search servers in a distributed IR system
depends on a number of factors.
First, consider whether or not the system is centrally
administered.
Collection Partitioning
When the distributed system is centrally
administered, more options are available.
The first option is simple replication of the collection
across all of the search servers.
The second option is random distribution of the
documents.
The final option is explicit semantic partitioning of the
documents.
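The three options can be sketched as follows. This is only an illustration: the function names are hypothetical, and `topic_of` is an assumed caller-supplied classifier standing in for whatever semantic criterion an administrator would choose.

```python
import random


def replicate(docs, n_servers):
    """Option 1: every search server holds a full copy of the collection."""
    return [dict(docs) for _ in range(n_servers)]


def random_distribution(docs, n_servers, seed=0):
    """Option 2: each document goes to one randomly chosen server."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    servers = [{} for _ in range(n_servers)]
    for doc_id, text in docs.items():
        servers[rng.randrange(n_servers)][doc_id] = text
    return servers


def semantic_partition(docs, topic_of):
    """Option 3: group documents by topic, one server per topic."""
    servers = {}
    for doc_id, text in docs.items():
        servers.setdefault(topic_of(text), {})[doc_id] = text
    return servers


if __name__ == "__main__":
    docs = {1: "tennis match", 2: "parallel retrieval", 3: "football match"}
    by_topic = semantic_partition(
        docs, lambda text: "sports" if "match" in text else "tech"
    )
    print(sorted(by_topic))
```

Replication lets any server answer any query, while the other two options split the collection so that each server searches only part of it.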
Source Selection
Source selection is the process of determining which of the
distributed document collections are most likely to contain
relevant documents for the current query, and therefore
should receive the query for processing.
The basic technique is to treat each collection as if it were a
single large document, index the collections, and evaluate
the query against the collections to produce a ranked listing
of collections.
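The basic technique can be sketched as below. The slides do not name a scoring formula, so this example assumes a simple one: each collection is flattened into a single bag of terms, and a query term contributes its relative frequency in that collection weighted by how few collections contain it. The function names and the weighting are illustrative assumptions.

```python
import math
from collections import Counter


def collection_profile(docs):
    """Treat the whole collection as one large document: total term counts."""
    profile = Counter()
    for text in docs:
        profile.update(text.lower().split())
    return profile


def rank_collections(collections, query):
    """Return (name, score) pairs, best collection first.

    collections maps a collection name to a list of document strings.
    """
    profiles = {name: collection_profile(docs) for name, docs in collections.items()}
    n = len(profiles)
    terms = query.lower().split()
    scores = {}
    for name, profile in profiles.items():
        total = sum(profile.values()) or 1
        score = 0.0
        for term in terms:
            # number of collections that contain the term at all
            cf = sum(1 for p in profiles.values() if term in p)
            if cf == 0:
                continue
            weight = math.log(1 + n / cf)  # assumed rarity weighting
            score += (profile[term] / total) * weight
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    collections = {
        "sports": ["football match results", "tennis match"],
        "tech": ["parallel retrieval systems", "distributed retrieval"],
    }
    for name, score in rank_collections(collections, "retrieval"):
        print(name, round(score, 3))
```

The query would then be sent only to the top-ranked collections, rather than to every server.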
Query Processing
Query processing in a distributed IR system proceeds
as follows:
Select the collections to search.
Distribute the query to the selected collections.
Evaluate the query at the distributed collections in parallel.
Combine the results from the distributed collections into the final result.
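The four steps above can be sketched end to end. This is a toy under stated assumptions: threads stand in for remote search servers, selection keeps any collection containing a query term, and scores (number of matching query terms) are assumed to be comparable across collections so that results can be merged by a plain sort. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def select_collections(collections, query):
    """Step 1: keep collections containing at least one query term (toy rule)."""
    terms = set(query.lower().split())
    return [
        name
        for name, docs in collections.items()
        if any(terms & set(text.lower().split()) for text in docs.values())
    ]


def evaluate(docs, query):
    """Step 3: score each document by its number of matching query terms."""
    terms = set(query.lower().split())
    hits = []
    for doc_id, text in docs.items():
        score = len(terms & set(text.lower().split()))
        if score:
            hits.append((score, doc_id))
    return hits


def process_query(collections, query, top_k=10):
    selected = select_collections(collections, query)
    # Steps 2 and 3: send the query to each selected collection and
    # evaluate in parallel (threads stand in for remote servers).
    with ThreadPoolExecutor(max_workers=max(1, len(selected))) as pool:
        partial = list(pool.map(lambda name: evaluate(collections[name], query), selected))
    # Step 4: combine the partial results into one ranked list.
    merged = [hit for hits in partial for hit in hits]
    merged.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in merged[:top_k]]


if __name__ == "__main__":
    collections = {
        "a": {"a1": "parallel retrieval", "a2": "cooking"},
        "b": {"b1": "distributed parallel retrieval systems"},
        "c": {"c1": "gardening"},
    }
    print(process_query(collections, "parallel retrieval"))
```

In a real distributed system the scores returned by different servers are often not directly comparable, so the merge step usually needs more care than the sort used here.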
Web Issues
The parallel and distributed techniques
described above can be applied directly, as
if the Web were any other large document
collection. This is the approach currently
taken by most of the popular Web search
services.
Trends and Research Issues
The trend in parallel hardware is the development of
general MIMD machines.
Many challenges remain in the area of parallel and
distributed text retrieval.
The first challenge is measuring retrieval effectiveness
on large text collections.
The second significant challenge is interoperability, or
building distributed IR systems from heterogeneous
components.