Differentiate between parallel IR and distributed IR.pdf
MARasheed3
4 views
138 slides
Mar 05, 2025
Slide 1 of 138
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
About This Presentation
Differentiate between parallel IR and distributed IR
Size: 744.5 KB
Language: en
Added: Mar 05, 2025
Slides: 138 pages
Slide Content
ModernInformationRetrieval
Chapter10
ParallelandDistributedIR
withEricBrown
Introduction
ATaxonomyofDistributedIRSystems
DataPartitioning
ParallelIR
Cluster-basedIR
DistributedIR
FederatedSearch
RetrievalinPeer-to-PeerNetworks
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 1
Introduction
Thevolumeofonlinecontenttodayisstaggeringandit
hasbeengrowingatanexponentialrate
Onataslightlysmallerscale,thelargestcorporate
intranetsnowcontainseveralmillionWebpages
Asdocumentcollectionsgrowlarger,theybecome
moreexpensivetomanage
Inthisscenario,itisnecessarytoconsideralternative
IRarchitecturesandalgorithms
Theapplicationofparallelismanddistributedcomputing
cangreatlyenhancetheabilitytoscaleIRalgorithms
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 2
ATaxonomyof
DistributedIRSystems
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 3
ATaxonomy
TaxonomyofdistributedandparallelIRsystems
One Processor
Multiple Processors
Same Software
Same Software
Various Software
Internal
Standard
Parallel Search
Parallel Search
Communication
Search
SIMD
MIMD
Local Area
n/a
Cluster-based
Local
Communication
Search
Federated Search
Broadband
n/a
Distributed Search
Federated Search
Communication
(P2P)
(P2P)
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 4
DataPartitioning
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 5
DataPartitioning
IRtasksaretypicallycharacterizedbyasmallamount
ofprocessingappliedtoalargeamountofdata
Howtopartitionthedocumentcollectionandthe
index?
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 6
DataPartitioning
Figurebelowpresentsahighlevelviewofthedata
processedbytypicalsearchalgorithms
D
o
c
u
m
e
n
t
s
Indexing Items
k1k2...ki...kt
d1
w1,1w2,1...wi,1...wt,1
d2
w1,2w2,2...wi,2...wt,2
...
... ... ... ... ... ...
dj
w1,jw2,j...wi,j...wt,j
...
... ... ... ... ... ...
dN
w1,Nw2,N...wi,N...wt,N
Eachrowrepresentsadocumentdjandeachcolumn
representsanindexingitemki
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 7
DataPartitioning
Documentpartitioningslicesthematrixhorizontally,
dividingthedocumentsamongthesubtasks
TheNdocumentsinthecollectionaredistributed
acrossthePprocessorsinthesystem
Duringqueryprocessing,eachparallelprocess
evaluatesthequeryonN/Pdocuments
Theresultsfromeachofthesub-collectionsare
combinedintoanalresultlist
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 8
DataPartitioning
Intermpartitioning,thematrixisslicedvertically
It divides the indexing items among thePprocessors
Inthisway,theevaluationprocedureforeachdocument
isspreadovermultipleprocessors
Otherpossiblepartitionstrategiesincludedivisionsby
languageorotherintrinsiccharacteristicsofthedata
Itmaybethecasethateachindependentsearchserver
isfocusedonaparticularsubjectarea
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 9
CollectionPartitioning
Whenthedistributedsystemiscentrally
administered,moreoptionsareavailable
Therstoptionisjustthereplicationofthecollection
acrossallsearchservers
Abrokerroutesqueriestothesearchserversand
balancestheloadontheservers:
Broker
Search
Engine
Search
Engine
Search
Engine
Search
Engine
Search
Engine
User
Query
Result
User
Query
Result
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 10
CollectionPartitioning
Thesecondoptionisrandomdistributionofthe
documents
Thisisappropriatewhenalargedocumentcollection
mustbedistributedforperformancereasons
However,thedocumentswillalwaysbeviewedand
searchedasiftheyarepartofasingle,logicalcollection
Thebrokerbroadcastseveryquerytoallsearchservers
andcombinestheresultsfortheuser
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 11
CollectionPartitioning
Thenaloptionisexplicitsemanticpartitioningofthe
documents
Herethedocumentsareeitheralreadyorganizedinto
semanticallymeaningfulcollections
Howtopartitionacollectionofdocumentstomake
eachcollectionwellseparatedfromtheothers?
Well separated means that each query maps to a distinct
collection containing the largest number of relevant docum ents
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 12
CollectionPartitioning
Toconstructsuchamapping,Puppinetalusedquery
logs
Theyrepresenteachdocumentwithallthequeriesthat
returnthatdocumentasananswer
Thisrepresentationenablestobuildclustersofqueries
andclustersofdocuments
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 13
CollectionSelection
Inmanycases,thecollectionsarepredeterminedand
cannotbechanged
Inthatcase,collectionselectionistheprocessof
determiningwhichofthedocumentcollectionsaremost
likelytocontainrelevantdocumentsforeachquery
Oneapproachistoalwaysassumethateverycollection
isequallylikelytocontainrelevantdocuments
Whencollectionsaresemanticallypartitioned,the
collectionscanberankedaccordingtotheirlikelihoodof
containingrelevantdocuments
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 14
CollectionSelection
Thebasictechniqueistotreateachcollectionasifit
wereasinglelargedocument
Then,wecanevaluatethequeryagainstthecollections
toproducearankedlistingofcollections
Letwc,ijrefertotheweightoftermkiincollectionCj:
wc,ij=fc,ij×IDFc,i
where
fc,ijis the total frequency of occurrence of term kiin all
documents of collectionCj
IDFc,iis the inverse collection frequency
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 15
CollectionSelection
Thatis,
IDFc,i=log
Nc
nc,i
where
Ncis the number of collections nc,iis the number of collections in which term kioccurs
Theseweightsarethenusedtoassemblequeryand
collectionvectors
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 16
CollectionSelection
Aproblemisthatmayhappenthattherearenorelevant
documentswithinacollectionthatreceivesahigh
relevancescore
MoffatandZobelavoidthisproblem
They propose to index each collection as a series of blocks,
where each block containsBdocuments Voorheesproposestousetrainingqueriestobuilda
contentmodelforthedistributedcollections
TheGlOSSsystemrankscollectionsbasedon:
The number of documents containing a query term, and The total weight of a term over all documents
TheCORIsystemrankscollectionsasiftheywere
documents,usingtheInqueryinferencenetwork
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 17
InvertedIndexPartitioning
Werstdiscussinvertedindexesthatemploydocument
partitioning,andthenwecovertermpartitioning
Inbothcasesweaddresstheindexingandthebasic
queryprocessingphase
Therearetwoapproachestodocumentpartitioningin
systemsthatuseinvertedindexes
Logical document partitioning Physical document partitioning
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 18
LogicalDocumentPartitioning
Inthiscase,thedatapartitioningisdonelogicallyusing
thesameinvertedindexasintheoriginalalgorithm
Theinvertedindexisextendedtogiveeachprocessor
directaccesstotheirportionoftheindex
EachtermdictionaryentryisextendedtoincludeP
pointersintothecorrespondinginvertedlist
Thej-thpointerindexestheblockofdocumententries
intheinvertedlistassociatedwiththesub-collectionin
thej-thprocessor
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 19
LogicalDocumentPartitioning
Extendeddictionaryentryfordocumentpartitioning
term i
P0
P1
P2
P3
Dictionary
Inverted List
Term i
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 20
LogicalDocumentPartitioning
Whenaqueryissubmittedtothesystem,thebroker
initiatesPparallelprocessestoevaluatethequery
Eachprocessexecutesthesamedocumentscoring
algorithmonitsdocumentsub-collection
Thesearchprocessesrecorddocumentscoresina
singlesharedarrayofdocumentscoreaccumulators
Then,thebrokersortsthearrayofdocumentscore
accumulatorsandproducesthenalranking
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 21
LogicalDocumentPartitioning
Atinvertedindexconstructiontime,theindexing
processcanexploittheparallelprocessors
First,theindexerpartitionsthedocumentsamongthe
processors
Next,itassignsdocumentidentierssuchthatall
identiersinpartitioniarelessthanthoseinpartition
i+1
Theindexerthenrunsaseparateindexingprocesson
eachprocessorinparallel
Afterallofthebatcheshavebeengenerated,amerge
stepisperformedtocreatethenalinvertedindex
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 22
PhysicalDocumentPartitioning
Inthissecondapproach,thedocumentsarephysically
partitionedintoseparatesub-collections
Eachsub-collectionhasitsowninvertedindexandthe
processorssharenothingduringqueryevaluation
Whenaqueryissubmittedtothesystem,thebroker
distributesthequerytoalloftheprocessors
Eachprocessorevaluatesthequeryonitsportionofthe
documentcollection,producingaintermediatehit-list
Thebrokerthencollectstheintermediatehit-listsfrom
allprocessorsandmergesthemintoanalhit-list
ThePintermediatehit-listscanbemergedefciently
usingabinaryheap-basedpriorityqueue
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 23
PhysicalDocumentPartitioning
Eachprocessmayrequireglobaltermstatisticsinorder
toproducegloballyconsistentdocumentscores
Therearetwobasicapproachestocollectinformation
onglobaltermstatistics
The rst approach is to compute global term statistics at ind exing
time and store these statistics with each of the sub-collect ions
The second approach is to process the queries in two phases
1. Term statistics from each of the processes are combined in to
global term statistics
2. The broker distributes the query and global term statisti cs to
the search processes
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 24
PhysicalDocumentPartitioning
Tobuildtheinvertedindexesforphysicallypartitioned
documents,eachprocessorcreatesitsownindex
Inthecaseofreplicatedcollections,indexingthe
documentsishandledinoneoftwoways
In the rst method, each search server separately indexes it s
replica of the documents
In the second method, each server is assigned a mutually
exclusive subset of documents to index and the index subsets are
replicated across the search servers
Amergeofthesubsetsisrequiredateachsearch
servertocreatethenalindexes
Ineithercase,documentupdatesanddeletionsmust
bebroadcasttoallserversinthesystem
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 25
Comparison
Logicaldocumentpartitioningrequiresless
communicationthanphysicaldocumentpartitioning
Thus, it is likely to provide better overall performance
Physicaldocumentpartitioning,ontheotherhand,
offersmoreexibility
E.g., document partitions may be searched individually
TheconversionofanexistingIRsystemintoaparallel
systemissimplerusingphysicaldocumentpartitioning
Foreitherdocumentpartitioningscheme,threads
provideaconvenientprogrammingparadigmfor
creatingthesearchprocesses
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 26
TermPartitioning
Intermpartitioning,theinvertedlistsarespreadacross
theprocessors
Eachqueryisdecomposedintoitemsandeachitemis
senttothecorrespondingprocessor
Theprocessorscreatehit-listswithpartialdocument
scoresandreturnthemtothebroker
Thebrokerthencombinesthehit-listsaccording
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 27
TermPartitioning
Thequeriescanbeprocessedconcurrently,aseach
processorcananswerdifferentpartialqueries
However,thequeryloadisnotnecessarilybalanced,
andthenpartoftheconcurrencygainsarelost
Hence,themajorgoalistopartitiontheindexsuchthat:
The number of contacted processors/servers is minimal; and Load is equally spread across all available processors/ser vers
Wecanusequerylogstosplittheindexvocabulary
amongtheprocessorstoachievethegoalabove
Acomplementarytechniqueistoprocessthequery
usingapipelineofprocessors
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 28
OverallComparison
Documentpartitioningaffordssimplerinvertedindex
constructionandmaintenancethantermpartitioning
AssumingeachprocessorhasitsownI/Ochanneland
disks,documentpartitioningperformsbetter
Whentermsareuniformlydistributedinuserqueries,
termpartitioningperformsbetter
Infact,Webberetalshowthattermpartitioningresults
inlowerutilizationofresources
Morespecically,itsignicantlyreducesthenumberof
diskaccessesandthevolumeofdataexchanged
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 29
OverallComparison
Themajordrawbackofdocumentpartitionedsystems:
Many not needed operations are carried out to query
sub-collections possibly containing few relevant documen ts Themaindisadvantageoftermpartitioning:
It have to build and maintain the entire global index, which l imits
its scalability Inaddition,termpartitioninghasalargervariance
regardinganswertimeandxingthisneedsmore
complicatedbalancingmechanisms
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 30
SufxArrays
Wecanapplydocumentpartitioningtosufxarraysina
straightforwardfashion
Asbefore,thedocumentcollectionisdividedamongthe
Pprocessorsandeachpartitionistreatedasan
independentcollection
Thesystemcanthenapplythesufxarrayconstruction
techniquestoeachofthepartitions
Duringsearch,thebrokerbroadcaststhequerytoallof
thesearchprocesses
Thentheintermediateresultsaremergedintoanal
hit-list
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 31
SufxArrays
Ifallofthedocumentswillbekeptinasinglecollection,
wecanstillexploittheparallelprocessorstoreduce
indexingtime
Inthesufxarrayconstructionalgorithmforlargetexts,
eachofthemergesofpartialindexesisindependent
ThereforealloftheO((n/M)
2
)mergesmayberunin
parallelonseparateprocessors
Afterallmergesarecomplete,thenalindexmerge
maybeperformed
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 32
SufxArrays
Intermpartitioningforasufxarray,eachprocessoris
responsibleforalexicographicalintervalofthearray
Duringqueryprocessing,thebrokerdistributesthe
querytotheprocessorsthatcontaintherelevant
portionsofthesufxarrayandmergestheresults
Notethatwhensearchingthesufxarray,allofthe
processorsrequireaccesstotheentiretext
However,onasingleparallelcomputerwithshared
memory,thetextmaybecachedinsharedmemory
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 33
SignatureFiles
Toimplementdocumentpartitioninginasystemthat
usessignatureles,thedocumentsaredividedamong
theprocessorsasbefore
Eachprocessorgeneratessignaturesforitsdocument
partition
Atquerytime,thebrokergeneratesasignatureforthe
queryanddistributesittoalloftheparallelprocessors
Eachprocessorevaluatesthequerysignaturelocallyas
ifitspartitionwasaseparatecollection
Thentheresultsaresenttothebroker,whichcombines
themintoanalhit-listfortheuser
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 34
SignatureFiles
Toapplytermpartitioning,wewouldhavetousea
bit-slicedsignatureleandpartitionthebitslicesacross
theprocessors
Theamountofsequentialworkrequiredseverelylimits
thespeedupSavailablewiththisorganization
Accordingly,thisorganizationisnotrecommended
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 35
ParallelIR
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 36
ParallelComputing
Processorscanbecombinedinavarietyofwaysto
formparallelarchitectures
Flynnhasdenedacommonlyusedtaxonomyof
parallelarchitecturesthatincludesfourclasses:
SISD:single instruction stream, single data stream; SIMD:single instruction stream, multiple data stream; MISD:multiple instruction stream, single data stream; MIMD:multiple instruction stream, multiple data stream
TheSISDclassincludesthetraditionalvonNeumann
computerrunningsequentialprograms,
E.g., uniprocessor personal computers
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 37
ParallelComputing
SIMDcomputersconsistofNprocessorsoperatingon
Ndatastreams,andareoftencomputerswith:
Many relatively simple processors running the same program A communication network between the processors A control unit that supervises the synchronous operation of the
processors Theprocessorsmayusesharedmemory,oreach
processormayhaveitsownlocalmemory
Sequentialprogramsrequiresignicantmodicationto
makeeffectiveuseofaSIMDarchitecture
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 38
ParallelComputing
MISDcomputersuseNprocessorsoperatingona
singledatastreaminsharedmemory
MISD architectures are relatively rare and systolic arrays are the
best known example MIMDisthemostgeneralandmostpopularclassof
parallelarchitectures
A MIMD computer containsNprocessors,Ninstruction streams,
andNdata streams
In this architecture, each processor has its own control uni t,
processing unit, and local memory
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 39
ParallelComputing
Theprocessorscanworkonseparate,unrelatedtasks,
ortheycancooperatetosolveasingletask
Tightlycoupled: MIMDsystemswithahighdegreeof
processorinteraction
Looselycoupled: systemswithalowdegreeof
processorinteraction
MIMDcanalsocharacterizedistributedcomputing
architectures
In distributed computing, multiple computers connected by a local
or wide area network cooperate to solve a single problem
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 40
PerformanceMeasures
Whenweemployparallelcomputing,weusuallywantto
knowwhatistheperformanceimprovement
Anumberofmetricsareavailabletomeasurethe
performanceofaparallelalgorithm
Onesuchmeasureisthespeedup,denedas:
S=
Runningtimeofbestavailablesequentialalgorithm
Runningtimeofparallelalgorithm
Ideally,whenrunningaparallelalgorithmonN
processors,wewouldobtainperfectspeedup,orS=N
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 41
PerformanceMeasures
Inpractice,perfectspeedupisunattainableeither
because:
the problem cannot be decomposed intoNequal independent
subtasks
the parallel architecture imposes control and communicati on
overheads, or
the problem contains an inherently sequential component
Amdahl'slaw:
S≤
1
f+(1−f)/N
≤
1
f
wherefisthefractionoftheproblemthatmustbe
computedsequentially
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 42
PerformanceMeasures
Anothermeasureofparallelperformanceisefciency,
givenby:
φ=
S
N
whereSisthespeedupandNisthenumberof
processors
Idealefciencyoccurswhenφ=1andnoprocessoris
everidleorperformsunnecessarywork
Aswithperfectspeedup,idealefciencyisunattainable
inpractice
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 43
PerformanceMeasures
Ultimately,theperformanceimprovementofaparallel
programoverasequentialprogramisviewedasthe
combinationof:
the reduction in real time required to complete the task the additional monetary cost associated with the parallel
hardware required to run the parallel program Thisgivesthebestoverallpictureofparallelprogram
performanceandcosteffectiveness
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 44
ParallelIR
WecanapproachthedevelopmentofparallelIR
algorithmsfromtwodifferentdirections
Onepossibilityistodevelopnewretrievalstrategies
thatdirectlylendthemselvestoparallelimplementation
For example, a text search procedure can be built on top of a
neural network Theotherpossibilityistoadaptexisting,wellstudiedIR
algorithmstoparallelprocessing
Thislaterapproachisconsiderednext
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 45
ParallelIR
Themodicationsrequiredtoadaptanexisting
algorithmdependonthetargetparallelplatform
We investigate techniques for applying a number of retrieva l
algorithms to the MIMD and SIMD architectures Parallelcomputingisthesimultaneousapplicationof
multipleprocessorstosolveasingleproblem
Theoveralltimerequiredtosolvetheproblemcanbe
reducedtothetimerequiredbythelongestrunningpart
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 46
ParallelIRonMIMDArchitectures
MIMDarchitecturesofferagreatdealofexibilityinhow
parallelismisdenedandexploitedtosolveaproblem
Thesimplestwayinwhicharetrievalsystemcanexploit
aMIMDcomputeristhroughtheuseofmultitasking
Eachoftheprocessorsintheparallelcomputerrunsa
separate,independentsearchservice
Thesubmissionofuserqueriestothesearchservices
ismanagedbyabroker
Thebrokeracceptssearchrequestsanddistributesthe
requestsamongtheavailablesearchservices
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 47
ParallelIRonMIMDArchitectures
ParallelmultitaskingonaMIMDmachine
Broker
Search
Engine
Search
Engine
Search
Engine
Search
Engine
Search
Engine
User
Query
Result
User
Query
Result
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 48
ParallelIRonMIMDArchitectures
Caremustbetakentoproperlybalancethehardware
resourcesonthesystem
Searchprocessesrunningonthedifferentprocessors
canperformI/Oandcompetefordiskaccess
Abottleneckatthediskwillbedisastrousfor
performanceandcouldeliminatethethroughputgains
Inadditiontoaddingmorediskstothecomputer,the
indexdatamustbedistributedoverthedisks
Atoneextreme,replicatingtheentireindexoneach
diskeliminatesdiskcontentionatthecostofincreased
storagerequirementsandupdatecomplexity
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 49
ParallelIRonMIMDArchitectures
Alternatively,heavilyaccesseddatacanbereplicated
andlessfrequentlyaccesseddatacanbedistributed
Yetanotherapproachistoinstalladiskarrayandlet
theoperatingsystemhandlepartitioningtheindex
Asinsequentialsystems,cachingisanotherimportant
techniquethatimprovesperformance
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 50
ParallelIRonMIMDArchitectures
Toimprovequeryresponsetime,thecomputation
requiredtoevaluateasinglequerymustbe:
partitioned into subtasks distributed among the multiple processors
Broker
Search
Process
User
Query
Result
Sub−query/
Results
Search
Process
Search
Process
Search
Process
Search
Process
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 51
ParallelIRonSIMDArchitectures
SIMDarchitectureslendthemselvestoamore
restricteddomainofproblemsthanMIMDarchitectures
PerhapsthebestknownSIMDarchitectureisthe
ThinkingMachinesConnectionMachine2(CM-2)
This computer was discontinued during the 1990's
TheCM-2wasusedtosupportbothsignatureleand
invertedindexbasedinformationretrievalalgorithms
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 52
InvertedIndexes
Invertedindexesaresomewhatawkwardtoimplement
onSIMDmachines
Nevertheless,Stanllhaveproposedtwoadaptationsof
invertedindexesfortheCM-2
Initssimplestform,aninvertedlistcontainsaposting
foreachdocumentinwhichagiventermappears
Apostingisatupleoftheform(ki,dj),wherekiisa
termidentieranddjisadocumentidentier
Dependingontheretrievalmodel,postingsmay
additionallycontainweightsorpositionalinformation
Ifpositionalinformationisstored,thenapostingis
createdforeachoccurrenceofkiindj
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 53
InvertedIndexes
TherstparallelinvertedindexfortheCM-2usedthe
twostandardstructures: apostingstableandanindex
The postings table contains the document identiers from th e
postings
The index (vocabulary) maps terms to their corresponding en tries
in the postings table
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 54
InvertedIndexes
Parallelinvertedindex
Documents
This little piggy went
to market.
This little piggy
stayed home.
This little piggy had
roast beef.
Postings
beef 2
had 2
home 1
little 0
little 1
little 2
market 0
piggy 0
piggy 1
piggy 2
roast 2
stayed 1
this 0
this 1
this 2
to 0
went 0
Index
First
Last
Term
Row
Pos
Row
Pos
beef
0
0
0
0
had
0
1
0
1
home
0
2
0
2
little
0
3
1
1
market
1
2
1
2
piggy
1
3
2
1
roast
2
2
2
2
stayed
2
3
2
3
this
3
0
3
2
to
3
3
3
3
went
4
0
4
0
Postings Table
2 2 1 0 1 2 0 0 1 2 2 1 0 1 2 0 0
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 55
InvertedIndexes
Atsearchtimethesedatastructuresareusedtorank
documentsasfollows
First,theretrievalsystemloadsthepostingstableonto
theback-endprocessors
Foreachqueryterm,anindexlookupreturnstherange
ofpostingstableentriesthatmustbeprocessed
Foreachrowofthisrange,theprocessorsthatcontain
entriesforthecurrenttermareactivated
Then,theassociateddocumentidentiersareusedto
updatethescoresofthecorrespondingdocuments
Documentscoresarebuiltupinaccumulators,which
areallocatedinaparallelarraysimilar
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 56
InvertedIndexes
Thecompletealgorithmforscoringaweightedtermis
score_term(P_float Doc_score[], P_posting Posting[],term_t term) {
int i, first_pos, last_pos;
P_int Doc_row, Doc_pos;
P_float Weight;
for (i = term.first_row; i <= term.last_row; i++) {
first_pos = (i == term.first_row ? term.first_pos : 0);
last_pos = (i == term.last_row ?
term.last_pos : N_PROCS - 1);
where (Position >= first_pos && Position <= last_pos) {
Doc_row = Posting[i].row;
Doc_pos = Posting[i].pos;
Weight = term.weight*Posting[i].weight;
[Doc_pos]Doc_score[Doc_row] += Weight;
}
}
}
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 57
InvertedIndexes
Itisexpensivetosendpostingweightstoaccumulators
ondifferentprocessors
Toaddressthisproblem,Stanllproposedthe
partitionedpostingsle
This structure stores the postings and accumulator for a giv en
document on the same processor
This proposal eliminates the communication required in the
previous algorithm
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 58
InvertedIndexes
TheFigurebelowshowshowthepostingscanbe
loadedintoatablefortwoprocessors
InthisFigure,documents0and
1wereassignedtoprocessor0
and document 2 was assigned
toprocessor1
home 1
beef 2
little 0
had 2
little 1
little 2
market 0
piggy 2
piggy 0
roast 2
piggy 1
this 2
stayed 1 this 0 this 1 to 0 went 0
(a)
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 59
InvertedIndexes
Noticethatthepostingsforthetermthisareskewed
andnolongerspanconsecutiverows
Tohandlethissituation,weap-
ply the second trick of the par-
titioned postings le: segment
the postings such that every
term in segmentiis lexico-
graphicallylessthanorequalto
everyterminsegmenti+1
home 1
beef 2
little 0
had 2
little 1
little 2
market 0
piggy 2
piggy 0
roast 2
piggy 1 stayed 1
this 2
this 0 this 1 to 0 went 0
(b)
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 60
InvertedIndexes
Thepostingstableandindexundergoafewmore
modicationsbeforereachingtheirnalform:
Index
First
Last
Term
Partition
Partition
Tag
beef
0
0
0
had
0
0
1
home
0
0
2
little
0
0
3
market
1
1
0
piggy
1
1
1
roast
1
1
2
stayed
2
2
0
this
2
2
1
to
3
3
0
went
3
3
1
Postings Table
2 1
0 0
3 0
1 0
3 1
3 0
0 0
1 0
1 0
2 0
1 1 0 1
1 0
1 0 1 1 0 0 1 0
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 61
InvertedIndexes
Themodiedtermscoringalgorithmisbelow
HereN_ROWSis the number of rows per partition
ppf_score_term (P_float Doc_score[], P_posting Posting[],term_t term) {
int i;
P_int Doc_row;
P_float Weight;
for (i = term.first_part*N_ROWS;
i < (term.last_part + 1)*N_ROWS; i++) {
where (Posting[i].tag == term.tag) {
Doc_row = Posting[i].row;
Weight = term.weight*Posting[i].weight;
Doc_score[Doc_row] += Weight;
}
}
}
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 62
Cluster-basedIR
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 63
Cluster-basedIR
Clustercomputingisanintermediatecasebetween
parallelanddistributedcomputing
Aclusterofserversisadistributedsystemthathas
manycomputers,allphysicallycloseandusually
connectedthroughafastlocalareanetwork
Aslocalnetworksbecomefaster,aclusterpresents
behaviorthatresemblesthatofaparallelmachine
Oneimportantprobleminclustercomputingisto
balancetheworkloadamongtheservers
Loadbalancers: specialnodesthatbalancetheload
amongdifferentmachines
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 64
Cluster-basedIR
Therearemanydifferenttypesofclusters
For instance, clusters targeted to high-availability have redundant
nodes Otherclustersareusedprimarilyforcomputational
purposes,asthefollowingtwocases:
Beowulf Clustersare clusters of homogeneous nodes that are
run on a dedicated network
Grid Computingallows the allocation of jobs to computers that
perform the work independently of the rest of the cluster
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 65
Cluster-basedIR
ThesamemeasuresthatwementionedinparallelIR
canbeappliedtocluster-basedcomputing
Theequivalenttoef×ciencyiscalledloadbalancing Forexample,wecanmeasurethefractionofthe
highestdeviationfromtheaverageloadℓ:
LB=
n
max
i=1
|loadi−ℓ|
ℓ
whereℓ=sum
n
j=1
loadj/n
NoticethatLBcanrangefrom:
LB= 0(perfect balance) to LB=n−1(complete imbalance)
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 66
Cluster-basedIR
Loadbalancecanbeachievedbythecombinationof
severaltechniques
Thesimplestoneistohaveaspecialbroker,aload
balancer,whichtakescareofthejob
Howeverinsomecasesthatisnotpossibleandspecic
loadbalancingalgorithmsareneeded
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 67
Cluster-basedIR
Toprogramacluster,therearemiddlewaresoftware
suchas:
MPI (Message Passing Interface) PVM (Parallel Virtual Machine)
Anotherpossibilityisthemap-reduceparallel
computingparadigmintroducedbyDeanetal
Itisavailableasopensourcesoftwareinthe
Hadooppackage
Currentresearchisfocusedonextendingthepowerof
themap-reduceparadigm
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 68
DistributedIR
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 69
DistributedComputing
Distributedsystemstypicallyconsistof:
a set of server processes, each running on a separate node, an d a designated broker process
Thebroker:
accepts and distributes the requests to the servers, collects intermediate results from the servers, and combines the intermediate results into a nal result for the client
Thecommunicationbetweenthesubtasksisperformed
usinganetworkprotocolsuchasTCP/IP
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 70
DistributedComputing
Distributedcomputingusesmultiplecomputers
connectedbyanetworktosolveasingleproblem
Adistributedcomputingsystemcanemploya
heterogeneouscollectionofprocessorsinthesystem
In fact, a single processing node in the distributed system c ould
be a parallel computer in its own right
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 71
DistributedComputing
Thecostofinter-processorcommunicationis
considerablyhigherinadistributedcomputingsystem
Assuch,distributedprogramsareusuallycoarse
grained
Granularity refers to the amount of computation relative to the
amount of communication performed by the program
Coarse grained programs perform large amounts of computation
relative to the communication cost Ofcourse,anapplicationmayusedifferentlevelsof
granularityatdifferenttimestosolveagivenproblem
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 72
DistributedComputing
Further,indistributedcomputingeachprocessorhasits
ownlocalmemory
Ontheotherhand,adistributedsystemisalsoin
practiceaparallelsystem
Inadistributedsystem,wehavefourelementsthatare
crucialforscalability:
Partitioningdeals with data scalability and, in a large IR system,
implies partitioning the document collection and the index
Communicationdeals with processing scalability, which in our
case is query processing
A system isdependableif its operation is free of failures Theexternal factorsare the external constraints on the system
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 73
GoalsandKeyIssues
Applicationsthatlendthemselveswelltoadistributed
implementationusuallyinvolve:
Computation and data that can be split into coarse grained
operations, and
Relatively little communication is required between the op erations
Parallelinformationretrievalbasedondocument
partitioningtsthisprolewell
Document partitioning can be used to divide the search task i nto
multiple, self contained subtasks
Each subtask involves extensive computation and data
processing with little communication among them
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 74
GoalsandKeyIssues
TheultimategoalofadistributedIRsystemistoanswer
querieswellandfastinalargedocumentcollection
Thatimpliesthreedifferentgoalsthatwedetailnext
Scalability: the IR system needs to cope with content growth
and change
Capacity: the system must also provide high capacity Quality: the system must not compromise quality of answers, as
it is easy to output bad answers quickly Thesemaingoalsaresharedbyallthemodulesofan
IRsystem
ThegoalsabovearecrucialforWebretrieval
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 75
GoalsandKeyIssues
MainmodulesofadistributedIRsystem,andkey
issuesforeachmodule
Module
Communication Dependability External
(synchronization) factors
Indexing
Reindexing
Partial indexing
Updating
Merging
Content growth
Content change
Global statistics
Querying
Replication
Caching
Rank aggregation
Personalization
Changing user needs
User base growth
DNS
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 76
Dependability
Aclassicwayofcopingwithfaultsisreplication Therearedifferentaspectstoreplicate: network
communication,functionality,anddata
Toreplicatenetworkcommunication,wereplicatethe
numberoflinks,makingsitesmulti-homed
Therearetwopossiblelevelsofreplicationfor
functionalityanddata:
In asingle site, if either functionality or data is not replicated,
then a single fault can render a service unavailable
Usingmultiple sitesincreases the likelihood that there is always
some server available to perform the request
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 77
Communication
Amajordrawbackthatarisesfromthedistributed
systemisthattheservershavetocommunicate
Networkcommunicationcanbeabottleneckas
bandwidthisoftenascarceresource
Asasimpleexample,supposewemodelafront-end
serverasaqueueingsystemG/G/c
Inthismodel,thecserverscorrespondtothethreads
thatserverequestson,forexample,aWebserver
Theresponseofeachthreadtoarequestdepends
uponthecommunicationofthisthreadwithotherparts
ofthesystem
Inthiscase,bandwidthandmessagelatencycontribute
totheresponsetime
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 78
Communication
Maximumcapacityofafront-endserverusinga
G/G/150model
0 2 4 6 8 01 21 41 61
01
001
0001
Maximum number of requests per second
)s m( e mit e civre s e garevA
051/ G/G
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 79
Indexing
Onewaytopartitiontheindexacrossthequery
processorsistoconsiderthetopicsofthedocuments
Routingthequeriesaccordingtotheirtopicinvolves
identifyingthetopicsofbothdocumentsandqueries
However,topicdistributionmighthaveanegativeeffect
ontheperformanceofthedistributedretrievalsystem
Changesinthetopicdistributioncanresultineither:
the resources not being exploited to their full extent, or allocation of fewer resources to popular topics
Apossiblesolutiontothischallengeistheautomatic
recongurationoftheindexpartition,considering
informationfromthequerylogsoftheIRsystem
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 80
Indexing
Partitioningtheindexaccordingtothelanguageof
queriesisalsoasuitableapproach
Achallengeinroutingqueriesusinglanguageisthe
presenceofmultilingualdocumentssuchasintheWeb
For example, documents describing technical content can ha ve a
number of English terms Inaddition,queriescanbemultilingual,involvingterms
indifferentlanguages
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 81
Indexing
Buildinganindexinadistributedfashionisa
challengingproblem
Sofar,veryfewpaperssuggestapproachestobuildan
invertedindexinadistributedfashion
Forexample,apossibleapproachistoorganizethe
serversinapipeline
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 82
Dependability
Thedistributedsearchsystemdependsuponthe
existenceofindexstructuresthatenablequeryresolution
For example, if enough index servers fail, then the service a s a
whole also fails Anotherissuewithdependabilityistheupdateofthe
index
In some systems it is crucial to have the latest results for qu eries
and content changes very often
In this case, it is important that the index data available at a given
moment reects all the changes in a timely fashion
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 83
Dependability
Ifaserverofthesystemfails,itisimpossibletorecover
thecontentofthatserverunlessitisreplicated
Ifthisisnotthecase,thenapossibleinefcientwayto
recoveristorebuildtheentireindex
Anotherpossibilitywouldbetomakethepartitions
partiallyoverlapping
Documentpartitionedsystemsaremorerobustwith
respecttoserversfailures
Supposethataserverfails
The system might still be able to answer queries possibly wit hout
losing too much effectiveness
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 84
Communication
Thedistributedmergeoperationsoftheindexing
processcanimpactthecommunicationamongservers
A practical approach for achieving this goal is a
map-reduce approach Indexesareusuallyrebuiltfromscratchaftereach
updateoftheunderlyingdocumentcollection
Thisupdateoperationusuallyrequireslockingthe
index,jeopardizingthewholesystemperformance
Termsthatrequirefrequentupdatesmightbespread
acrosstheservers,thusamplifyingthelockouteffect
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 85
ExternalFactors
IndistributedIRsystemsthereareseveralbottlenecks
todealwith
InadocumentpartitionedIRsystemisnecessaryto
computevaluesforsomeglobalparameterssuchas
the collection frequency, and the inverse document frequency of a term
Therearetwopossibleapproaches:
One can compute the nal global parameter by aggregating all
the local statistics available after the indexing phase
The problem of computing global statistics can be moved to th e
system's broker
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 86
ExternalFactors
Tocomputesuchstatistics,thebrokerusuallyresolves
queriesusingatwo-roundprotocol
In the rst round the broker requests local statistics from e ach
server
In the second, it requests results from each server, piggyba cking
global statistics onto the second message containing the qu ery Thequestionatthispointis:
Given asmartpartitioning strategy using local instead of global
statistics, what is the impact on the nal system effectiven ess?
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 87
ExternalFactors
Inarealworldsearchengine,infact,itisdifcultto
denewhatisacorrectanswerforaquery
Thus,itisdifculttounderstandwhetherusingonly
localstatisticsmakesadifference
Furthermore,notethatifwemakeuseofacollection
selectionstrategy,usingtheglobalstatisticsisnot
feasible
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 88
QueryProcessing
InadistributedIRsystem,itisimportanttodetermine
whichresourcestoallocatetoprocessagivenquery
Thepoolofavailableresourcescomprisescomponents
havingoneofthefollowingroles:
Coordinator, cache, or query processor
Acoordinatormakesdecisionsonhowtoroutethe
queriestodifferentpartsofthesystem
Thequeryprocessorsholdindexordocument
information
Cacheserverscanholdresultsforthemostfrequentor
popularqueries
They can reduce query latency and load on servers
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 89
QueryProcessing
Animportantassumptionisthatoneormoreservers
implementeachofthesecomponents
This assumption is particularly important for large-scale systems
Designingcomponentsinsuchawaythatwecanadd
morephysicalserverstoincreasetheoverallsystem
capacityisfundamentalforsuchlarge-scalesystems
In fact, separating parts of the system into component roles is
already an attempt to promote scalability as a single monoli thic
system cannot scale in an unrestricted way
Astheseserverscanbeindifferentphysicallocations,
wecallsitetoeachgroupofcollocatedservers
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 90
QueryProcessing
Instanceofadistributedqueryprocessingsystem
WAN
Client
1
2
3
Site A
Region X
Site B
Region Y
Site C
Region Z
Query processor : matches
documents to the received queries
Coordinator : receives queries and
routes them to appropriate sites
Cache : stores results from
previous queries
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 91
QueryProcessing
Weclassifyadistributedqueryprocessingsystem
accordingtofourattributes:
Number of components Connectivity Distinction of roles Interaction
Thenumberofcomponentsdeterminestheamountof
resourcesavailableforprocessingqueries
Thechoicesontheallocationofcomponentschangeas
differentchoicesleadtodifferentperformancevalues
Infact,minimizingtheamountofresourcesperqueryis
ingeneralanimportantgoal
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 92
QueryProcessing
Inpractice,partitioningthedatapotentiallyenhances
thequerythroughputperformance
In the case of document partitioning, we could select only a
subset of the machines in the search server that ideally cont ain
relevant results
However,theabilityofretrievingthelargestnumberof
relevantdocumentsis,aswealreadydiscussed,the
collectionselectionorqueryroutingproblem
Inthecaseoftermpartitioning,effectivecollection
selectionisnotahardproblem
The solution in this case consists in selecting the server th at
holds the information on the particular terms of the query
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 93
QueryLoadBalancing
Themajorissueforquerythroughput,infact,isan
unevendistributionoftheloadacrosstheservers
TheFigurebelowillustratestheaveragebusyloadfor
eachofthe8serversofadocumentpartitionedsystem
(left)andapipelinedtermpartitionedsystem(right)
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 94
QueryLoadBalancing
Inthetermpartitionedsystem,thereisanevidentlack
ofbalanceinthedistributionontheloadoftheservers
To overcome this issue, one could take into account estimate s of
the index access patterns to distribute the query load Inthecaseofpartitioningdocumentsrandomlyacross
servers,allserversreceiveallqueries
This in principle is a perfect load balance However, the amount of work per server is not necessarily the
same, and then a random partitioning does not guarantee an
even query load balance
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 95
QueryLoadBalancing
Foratermpartitionedsystem,Moffatetalshowthatit
ispossibletobalancetheload
Their approach exploit information on the frequencies of te rms
occurring in the queries and postings list replication
They abstract the problem of partitioning the vocabulary in a term
partitioned system as abin-packing problem
Each bin represents a partition and each term represents an
object to put in the bin
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 96
Dependability
Queryprocessorscannotfulllclientrequestswithout
theprocessingcapacityandthedatatheystore
Also,duetothelargeamountofdatatheyhandle,itis
challengingtodeterminegoodreplicationschemes
Havingallqueryprocessorsstoringthesamedata,the
systemachievesthebestavailabilitylevelpossible
Thisislikelytoimposeasignicantandunnecessary
overhead,alsoreducingthetotalstoragecapacity
Thus,anopenquestionishowtoreplicatedatawith
minimalstorageoverhead
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 97
Dependability
Multiplequeryprocessorsenableamoredependable
system,aswellasamorescalablesolution
Availabilityforcachescanalsoreferstofailuresof
queryprocessors
If a query processor is temporarily unavailable, we can serv e
cached results during the period of the outage
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 98
Dependability
Consistencyisalsooftenaveryimportantgoalfor
onlinesystems
Therearetechniquesfromdistributedalgorithmsto
implementfault-tolerantservices
Themainchallengeistoapplysuchtechniqueson
large-scalesystems
Itisalsopossibletousetechniquesthatenablestale
resultsthusimplementingweakerconsistency
constraints
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 99
Dependability
Asystemdesigncanconsideracachingsystemas
eitheranalternativeoracomplementtoreplication
Upon query processor failures, the system returns cached re sults
Animportantquestionishowtodesignsuchacache
systemtobeeffectiveincopingwithfailures
Ofcourse,agooddesignhasalsotoconsiderthe
primarygoalsofacachesystem,whichare:
reducing the average response time, balance the load on the query processing servers, and good bandwidth utilization
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 100
Communication
Distributedsystemsneedtoconsidertheoverheads
imposedbythecommunicationofitscomponents
Atermpartitionedsystemusingpipeliningroutes
partiallyresolvedqueriesamongservers
Iftheindexincludesthepositionofterms,the
communicationoverheadbetweenserversincreases
Insuchacase,thepositioninformationneedstobe
compressedefciently
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 101
Communication
Inthecaseofadocumentpartitionedsystem,query
processorssendthequeryresultstothecoordinator
The coordinator may become a bottleneck while merging the
results from a large number of query processors Insuchacase,itispossibletouseahierarchyof
coordinators
Furthermore,theresponsetimedependsonthe
responsetimeofitsslowestcomponent
Thisconstraintdependsonthediskcaching
mechanism,theamountofmemory,andthenumberof
servers
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 102
Communication
Whenmultipleprocessorsparticipateintheresolution
ofaquery,thecommunicationlatencycanbesignicant
Onewaytomitigatethisproblemistoadoptan
incrementalqueryprocessingapproach
In this approach, the faster query processors provide an ini tial set
of results
Other remote query processors provide additional results w ith a
higher latency and users continuously obtain new results However,morerelevantresultsmayappearlaterdueto
latencies
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 103
Communication
Sometimes,thequeryprocessinginvolvesadaptationof
thesearchresultsaccordingtotheinterestsoftheuser
Eachuserprolerepresentsastate,whichmustbethe
lateststateandbeconsistentacrossreplicas
Alternatively,asystemcanimplementpersonalization
asathinlayerontheclient-side
However,thislastapproachrestrictstheusertoalways
usingthesameterminal
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 104
Communication
Sinceuserbehaviorchangesovertime,oneshouldbe
abletoupdatethemodelaccordingly
Asimpleapproachistoscheduleupdatesofthemodel
atxedtimeintervals
However,ahigherupdatefrequencyimpliesahigher
networktrafcandalowerqueryprocessingcapacity
Ideally,thesystemcommunicationadaptstovariations
oftheunderlyingmodelwhenevertheyoccur
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 105
Communication
Inaddition,inalargeIRsystem,therearehundredsof
thousandstomillionsofqueriesperday
Loggingtheseactionsandusingthemeffectivelyis
challengingbecausethevolumeofdataisextremely
high
Infact,movingthisdatafromservertoserverisrarelya
possibilityduetobandwidthlimitations
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 106
ExternalFactors
ThedesignoflargeIRsystemsincludesusersin
differentways
Similarly,thedesignandanalysisofcachingpolicies
requireinformationonusers,orausermodel
Userbehavior,however,isanexternalfactor,which
cannotbecontrolledbytheIRsystem
For example, the topics the users search for have slowly chan ged
in the past
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 107
ExternalFactors
Achangeinuserbehaviorcanalsoaffectthe
performanceofcachingpolicies
Sometimes it is necessary to provide mechanisms that enable
automatic reconguration of the system Thechallengewouldthenbetodetermineonlinewhen
userschangetheirbehaviorsignicantly
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 108
WebIssues
Thedistributedtechniquescanbeuseddirectlyinthe
Web,asifitwereanyotherlargedocumentcollection
This is the approach currently taken by most of the popular We b
search services Alternatively,wecanspreadtheworkofcollecting,
organizing,andsearchingallofthedocuments
ThisistheapproachtakenbytheHarvestsystemand
newerdistributedWebsearcharchitectures
Harvest comprises a number of components for gathering,
summarizing, replicating, distributing, and searching do cuments
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 109
WebIssues
Queriesareprocessedbybrokers,whichcollectand
reneinformationfromgatherersandotherbrokers
Theinformationataparticularbrokeristypicallyrelated
toarestrictedsetoftopics
Thisallowsuserstodirecttheirqueriestothemost
appropriatebrokers
Acentralbrokerregistryhelpsusersndthebest
brokersfortheirqueries
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 110
FederatedSearch
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 111
FederatedSearch
Afederatedsearchsystemreliesonacollectionof
heterogeneousserverstoansweruserqueries
Thecriticalengineeringissuesarebasicallythree:
dening the search protocol for transmitting requests and r esults, designing a server that can efciently accept a request and
initiate a thread to service, and the request
designing a broker that can submit asynchronous search
requests to multiple servers and combine the results
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 112
FederatedSearch
Thealgorithmicissuesarealsothree:
how to distributedocuments across the distributedsearch s ervers how to select which servers should receive a particular quer y how to process the queries and combine the results from the
different servers Thesearchprotocolspecies:
the syntax and semantics of messages transmitted between
clients and servers
the sequence of messages required to establish a connection
and carry out a search operation
the underlying transport mechanism for sending messages
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 113
FederatedSearch
Ataminimum,theprotocolshouldallowaclientto:
obtain information about a search sever, e.g., a list of data bases
available for searching at the server
submit a search request for one or more databases using a well
dened query language
receive search results in a well dened format retrieve items identied in the search results
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 114
FederatedSearch
Forclosedsystems,acustomsearchprotocolmaybe
mostappropriate
Alternatively,astandardprotocolmaybeused,allowing
thesystemtointeroperatewithothersearchservers:
Z39.50 is the standard for client/server information retri eval STARTS, Stanford Proposal for Internet Meta-Searching
STARTSincludedfeaturesintendedtosolvetherelated
algorithmicissues,suchasmergingresultsfrom
heterogeneoussources
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 115
QueryProcessing
QueryprocessinginafederatedIRsystemproceedsas
follows:
1. select the collections to search
2. distribute the query to the selected collections
3. process the query at each of the distributed collections i n parallel
4. combine the partial results into a nal result
Step1maybeeliminatedifthequeryisalways
broadcasttoeverydocumentcollectioninthesystem
Theparticipatingserversevaluatesthequeryonthe
selectedcollectionsusingitsownlocalsearchalgorithm
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 116
Howtomergetheresults?
First,ifthequeryisBooleanandtheserversreturn
Booleanresultsets,allofthesetsaresimplyjoined
Ifthequeryinvolvesfree-textranking,anumberof
techniquesareavailable
Thesimplestapproachistocombinetherankedhit-lists
usingroundrobininterleaving
Animprovementonthisprocessistomergethehit-lists
basedontheirrelevancescores
Unless proper global term statistics are used to compute the
document scores, we may get incorrect results Ifthedistributeddocumentcollectionsaresemantically
partitioned,thenre-rankingmustbeperformed
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 117
Howtomergetheresults?
Callanproposesre-rankingdocumentsbyweighting
documentscoresbasedontheircollectionsimilarity
computedduringthesourceselectionstep
Theweightforacollectioniscomputedas
w=1+|C|
×
s−ˉs
ˉs
where
|C|is the number of collections searched sis the collection's score ˉsis the mean of the collection scores
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 118
Howtomergetheresults?
Themostaccuratetechniqueformergingranked
hit-listsistouseaccurateglobaltermstatistics
Thebrokercanincludethesestatisticsinthequery
whenitdistributesthequerytotheremotesearch
servers
Ifacollectionindexisunavailable,querydistribution
canproceedintworoundsofcommunication:
In the rst round, the broker distributes the query and gathe rs
collection statistics from each of the search servers
Then, these statistics are combined by the broker and distri buted
back to the search servers
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 119
Howtomergetheresults?
Thesearchprotocolcanrequirethatsearchservers
returnglobalandper-documentquerytermstatistics
Thebrokercanthenre-rankdocumentsusingthequery
termstatisticsandarankingalgorithmofitschoice
Theendresultisahit-listthatcontaindocumentsinthe
sameorderasifallofthedocumentshadbeenindexed
inasinglecollection
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 120
Retrievalin
Peer-to-PeerNetworks
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 121
RetrievalinPeer-to-PeerNetworks
Apeerornodeisanarbitrarycomputerwhich,when
connectedtotheInternet,joinsapeer-to-peer
network,conformingapeer-to-peer(P2P)system
IRalgorithmscantakeadvantageofresources
distributedacrossInternet,inparticularlesharing
Therstlesharingsystems,suchasNapster,
Gnutella,andFreenet,differedinhowthedataofthe
peerswasfound
Napster was the most efcient system using a central index
server, but was also the one most vulnerable to attacks
Gnutella, on the other hand, used a ooding query model that
was inefcient, but highly fault tolerant
Freenet used a more efcient heuristic, but did not guarante e that
an existing le would be found
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 122
RetrievalinPeer-to-PeerNetworks
Thesolutiontothisproblemwasadistributedhash
table(DHT)
DHTsareamiddlewarelayertoprovidethefollowing
characteristics:
Decentralization: the peers collectively form the system without
any central coordination
Scalability: the system functions efciently even with millions of
peers, as is the case of Internet
Fault tolerance: the system is as reliable as possible, even with
peers continuously joining, leaving, and failing
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 123
RetrievalinPeer-to-PeerNetworks
Toachievethesegoals,oneofthepeerscancoordinate
withonlyafewotherpeersinthenetwork
commonlyΘ(logn)for a system with currentlynpeers
Thislimitstheamountofworkneededwhenapeer
joinsorleavesthenetwork
Inaddition,DHTsmustdealwithproblemssuchasload
balancing,dataintegrity,andperformance
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 124
RetrievalinPeer-to-PeerNetworks
DHTsarebasedinanabstractnumericalkeyspace,
wherethekeysidentifyanyresource
Then,usingapartitioningscheme,theownershipofthe
keyspaceisdividedamongtheparticipatingpeers
Anoverlaynetworkthenconnectsthepeers,allowing
themtondtheownerofanygivenkeyinthekeyspace
TherstfourDHTswereintroducedmoreorlessatthe
sametime:
CAN (for content addressable network) Chord Pastry Tapestry
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 125
RetrievalinPeer-to-PeerNetworks
Thepartitioningschemeusuallyemployssomevariant
ofconsistenthashing
Consistent hashing denes a distance functionδamong keys
Then,apeeridentiedwiththeID(key)iwillownallthe
keysforwhichiistheclosestkeyunderδ
Consistenthashinghastheessentialpropertythat
removingoraddingapeerchangesonlythesetofkeys
ownedbythepeerswithadjacentIDs
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 126
RetrievalinPeer-to-PeerNetworks
Theoverlaynetworkisbasedonaroutingtablewhere
eachpeermaintainsthesetofneighbouringpeers
Themainpropertyisthateachpeerownsaparticular
keykorhasaneighborthatisclosertotheownerofk
Asimplegreedyalgorithmforwardsmessagestothe
neighborpeerwhoseIDisclosesttok
Thisalgorithmsguaranteesndingkintimebounded
bythediameteroftheoverlaynetwork,butitisnot
necessarilyoptimal
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 127
RetrievalinPeer-to-PeerNetworks
Tostoreadocumentwithgivenlename,weproducea
keykusingahashingfunctionoverthelename
Thenwesendamessage(k,document)totheoverall
systemthatwillberoutedthroughneighborpeers
Thiswillbeforwardedfrompeertopeeruntilitreaches
thesinglepeerresponsibleforthekeyk
Inthispeer,whichisspeciedbythepartitioningofthe
keyspace,thetuple(k,document)isnallystored
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 128
RetrievalinPeer-to-PeerNetworks
Toretrievethedocumentwereversetheprocess:
The peer ndskby hashing the le name and sending a query to
nd the data associated withkin the network Themessagewillagainberoutedthroughtheoverlay
networktothepeerresponsiblefork
Then,thepeerwillsendbackdirectlythedatastored
associatedwiththatkey
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 129
RetrievalinPeer-to-PeerNetworks
ThemainretrievaldrawbackisthatDHTsonlysupport
exact-matchsearch,ratherthankeywordsearch
P2P-IR,andinparticularfulltextretrieval,hasbeen
investigatedforvariousP2Pnetworkorganizations
Search techniques in unstructured networks are usually bas ed on
broadcast, thus suffering from high bandwidth consumption Henceapproachesbasedonrandomwalkshavebeen
proposedtoreducethetrafcinaP2Pnetwork
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 130
RetrievalinPeer-to-PeerNetworks
Peer-leveldocumentcollectiondescriptionscanbe
usedtoidentifynodesthatcanprocessthequery
Thesedescriptionsguidethepeer-selectionprocess
andthedocumentretrievalfromtheselectedpeers
Resourcesarerankedbytheirlikelihoodtoreturn
relevantdocumentsandtop-rankedresourcesare
selected
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 131
RetrievalinPeer-to-PeerNetworks
ThefederatedsearchsystemdescribedbyLuetal
usesahierarchicalP2Pnetworkorganization
Minervamaintainsaglobalindexwithpeerselection
statistics
Minerva∞isaP2P-IRsystemthatisbasedonanorder
preservingDHT
It relies on Term Index Networks (TINs) storing the global
inverted list of a term on several peers
The query is processed by a parallel top- kalgorithm involving
nodes within TINs and across TINs
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 132
RetrievalinPeer-to-PeerNetworks
Document-levelindexingapproachescanpotentially
deliverhigherretrievalquality
Ontheotherside,suchapproachestypicallydistribute
thecompleteindexinastructuredP2Pnetwork
Thus, it requires higher index maintenance costs
Thisapproachfacessignicantscalabilityproblems
causedbythehightrafccostsrequiredforintersecting
largepostinglists
Thus,anumberofsolutionshavebeensuggestedto
resolvethisissue
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 133
RetrievalinPeer-to-PeerNetworks
Chenetalreport73%trafcreductionbyapplyingan
optimalBloomlterforDHT-basedfulltextretrieval
However,Zhangetalshowsthatsingle-termindexingis
practicallynotscalableforWebsizes
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 134
RetrievalinPeer-to-PeerNetworks
Top-kqueryprocessinghasbeenemployedtosolvethe
problemofextensivebandwidthconsumption
Themainideaistoterminatetheprocessingofaquery
whenthetop-kresultsobtainedsofararecorrect
Earlyterminationisparticularlybenecialfordistribut ed
intersectionsofpostinglists
Top-kqueryprocessingalgorithmstailoredforP2P
networksinclude:
Distributed Pruning Protocol (DPP) Three-Phase Uniform Threshold (TPUT) algorithm A family of distributed threshold algorithms (DTA) with Blo om lter
optimizations
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 135
RetrievalinPeer-to-PeerNetworks
Micheletalproposedafamilyofapproximatetop-k
queryprocessingalgorithmscalledKLEE
With small penalties on the top- kresult quality, KLEE algorithms
signicantly reduce bandwidth consumption TheapproachbyThauLooetalsuggeststo
complementindex-basedqueryprocessingwith
broadcasting
The authors suggest using ooding mechanisms to answer
popular queries, and resort to indexing only for rare querie s
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 136
RetrievalinPeer-to-PeerNetworks
Ahybridindexpartitioningschemeforkeywordsearch
isproposedinShietal
Allpeersareclusteredingroupsandtheindexing
techniqueemploystermpartitioningwithinthegroups
Eachqueryhastobebroadcasttoallthegroupsbut
onlyseveralnodesdotheactualprocessing
Sincethedocumentcollectionsizewithinagroupcan
bebounded,thissolutionreduceslatencyand
Further,itefcientlydistributesthebandwidth
consumption
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 137
RetrievalinPeer-to-PeerNetworks
Nguyenetalsuggestanadaptiveschemeaimingat
balancingthecostsbetweenindexingandquery
processing
Foranindividualpeer,groupsoflocaldocumentsare
createdandrepresentedastermsets
Thus,suchagroup-levelindexingstrategyisa
generalizationofbothindexingtechniques:
peer-level (one group per peer) document-level (one document per group)
Theauthorsproposeaprobabilisticmodeltoestimate
thecostassociatedwithagivennumberofgroups
Parallel and Distributed IR, Modern Information Retrieval , Addison Wesley, 2010 p. 138