Information Retrieval Systems (IRS)

INFORMATION RETRIEVAL SYSTEMS
IV B.TECH - I SEMESTER (JNTUH-R15)
Ms. S. J. Sowjanya, Associate Professor, CSE
Mr. N. V. Krishna Rao, Associate Professor, CSE
Mr. C. Praveen Kumar, Assistant Professor, CSE
COMPUTER SCIENCE AND ENGINEERING
INSTITUTE OF AERONAUTICAL ENGINEERING
(Autonomous)
DUNDIGAL, HYDERABAD - 500 043

UNIT-I

What is IR?
•IR is a branch of applied computer science focusing on the representation, storage, organization, access, and distribution of information. (system-centered view)
•IR involves helping users find information that matches their information needs. (user-centered view)

IR Systems
•IR systems contain three components:
 –System
 –People
 –Documents (information items)
[Diagram: System - User - Documents]

Data and Information
•Data
 –String of symbols associated with objects, people, and events
 –Values of an attribute
•Data need not have meaning to everyone
•Data must be interpreted with associated attributes

Data and Information
•Information
 –The meaning of the data as interpreted by a person or a system
 –Data that changes the state of a person or system that perceives it
 –Data that reduces uncertainty
  •If data contain no uncertainty, there is no information in the data
  •Examples: "It snows in the winter." vs. "It does not snow this winter."

Information and Knowledge
•Knowledge
 –Structured information
  •through structuring, information becomes understandable
 –Processed information
  •through processing, information becomes meaningful and useful
 –Information shared and agreed upon within a community
[Diagram: Data → Information → Knowledge]

Information Retrieval
•Conceptually, information retrieval is used to cover all related problems in finding needed information
•Historically, information retrieval is about document retrieval, emphasizing the document as the basic unit
•Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.

Definition of IRS
•An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information.
 –Information may be text (including numeric and date data), images, video, and other multimedia objects.
•Information retrieval is the formal study of efficient and effective ways to extract the right bit of information from a collection.
 –The web is a special case, as we will discuss.

•An IRS consists of software programs that facilitate a user in finding the information the user needs.
 –The system may use standard computer hardware to support the search subfunction and to convert non-textual sources to a searchable medium.
•The success of an IRS is how well it can minimize the overhead for a user to find the needed information.
 –Overhead, from the user's perspective, is the time required to find the information needed, excluding the time spent actually reading the relevant data.
 –Thus, search composition, search execution, and reading non-relevant items are all aspects of IR overhead.

•Goal: to minimize the overhead of a user locating needed information
•Two major measures:
 1.Precision: the ability to retrieve top-ranked documents that are mostly relevant.
 2.Recall: the ability of the search to find all of the relevant items in the corpus.
•When a user decides to issue a search looking for information on a topic, the total database is logically divided into four segments.

•Precision = Number_Retrieved_Relevant / Number_Total_Retrieved and Recall = Number_Retrieved_Relevant / Number_Possible_Relevant, where:
 –Number_Possible_Relevant is the number of relevant items in the database.
 –Number_Total_Retrieved is the total number of items retrieved by the query.
 –Number_Retrieved_Relevant is the number of retrieved items that are relevant to the user's search need.
•Precision measures one aspect of the information retrieval overhead for a user associated with a particular search.
•If a search has a precision of 85%, then 15% of the user effort is overhead spent reviewing non-relevant items.
•Recall is a very useful concept, but its denominator is non-calculable in operational systems.

System evaluation
•Efficiency: time, space
•Effectiveness:
 –How capable is a system of retrieving relevant documents?
 –Is one system better than another?
•Metrics often used (together):
 –Precision = retrieved relevant docs / retrieved docs
 –Recall = retrieved relevant docs / relevant docs
[Diagram: Venn diagram of the relevant set and the retrieved set; their intersection is the retrieved relevant documents]
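
As a concrete illustration (not from the slides), a minimal Python sketch of these two metrics, assuming documents are identified by simple IDs:

def precision_recall(retrieved, relevant):
    """Compute precision and recall for a single query.

    retrieved: iterable of doc ids returned by the system
    relevant:  iterable of doc ids judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)              # retrieved relevant docs
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved docs are relevant; 5 docs are relevant overall.
print(precision_recall(["d1", "d2", "d3", "d9"], ["d1", "d2", "d3", "d4", "d5"]))
# -> (0.75, 0.6)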

General form of precision/recall
[Plot: precision (y-axis, 0 to 1.0) versus recall (x-axis, 0 to 1.0)]
-Precision changes with respect to recall (it is not a fixed point)
-Systems cannot be compared at a single precision/recall point
-Average precision (over 11 points of recall: 0.0, 0.1, ..., 1.0)

Some techniques to improve IR effectiveness
•Interaction with the user (relevance feedback)
 -Keywords only cover part of the contents
 -The user can help by indicating relevant/irrelevant documents
•The use of relevance feedback
 –To improve the query expression:
  Qnew = α·Qold + β·Rel_d − γ·Nrel_d
  where Rel_d = centroid of the relevant documents, Nrel_d = centroid of the non-relevant documents, and α, β, γ are tuning weights
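
Below is a hedged sketch of this Rocchio-style update, assuming documents and the query are dense term-weight vectors (NumPy arrays) and that alpha, beta, and gamma are tunable constants not fixed by the slide:

import numpy as np

def rocchio(q_old, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant documents and away from
    the centroid of non-relevant ones. All vectors share the same term space."""
    rel_centroid = np.mean(relevant_docs, axis=0) if len(relevant_docs) else 0.0
    nrel_centroid = np.mean(nonrelevant_docs, axis=0) if len(nonrelevant_docs) else 0.0
    q_new = alpha * q_old + beta * rel_centroid - gamma * nrel_centroid
    return np.clip(q_new, 0.0, None)   # negative weights are usually dropped

q = np.array([1.0, 0.0, 0.5])
rel = [np.array([0.9, 0.1, 0.7]), np.array([0.8, 0.0, 0.6])]
print(rocchio(q, rel, []))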

IR on the Web
•No stable document collection (spider, crawler)
•Invalid documents, duplication, etc.
•Huge number of documents (partial collection)
•Multimedia documents
•Great variation of document quality
•Multilingual problem

Final remarks on IR
•IR is related to many areas:
 –NLP, AI, databases, machine learning, user modeling, ...
 –libraries, the Web, multimedia search, ...
•Relatively weak theories
•Very strong tradition of experiments
•Many remaining (and exciting) problems
•Difficult area: intuitive methods do not necessarily improve effectiveness in practice

Why is IR difficult?
•Vocabulary mismatch
 –Synonymy: e.g. car vs. automobile
 –Polysemy: e.g. table
•Queries are ambiguous; they are partial specifications of the user's need
•Content representation may be inadequate and incomplete
•The user is the ultimate judge, but we don't know how the judge judges...
 –The notion of relevance is imprecise, and context- and user-dependent
•But how rewarding it is to gain a 10% improvement!

Outline
•What is the IR problem?
•How to organize an IR system? (Or: the main processes in IR)
•Indexing
•Retrieval

Possible approaches
1.String matching (linear search in documents)
 -Slow
 -Difficult to improve
2.Indexing (*)
 -Fast
 -Flexible to further improvement

Retrieval
•The problems underlying retrieval
 –Retrieval model
  •How is a document represented with the selected keywords?
  •How are document and query representations compared to calculate a score?
 –Implementation

Idea for TF x IDF
•TF: intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj
 –term frequency (the tf factor) provides one measure of how well that term describes the document contents
•IDF: inter-cluster similarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection
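
To make the idea concrete, here is a minimal sketch (not taken from the slides) of tf-idf weighting over a toy corpus; the exact smoothing of the idf factor varies between textbooks, and the plain log(N/df) form below is just one common choice:

import math
from collections import Counter

docs = [
    "information retrieval systems retrieve information",
    "the web is a special case of retrieval",
    "clustering groups similar documents",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
# document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)                       # raw term frequency
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(tokenized[0]))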

Vector Model
•Index terms are assigned positive and non-binary weights
•The index terms in the query are also weighted
 dj = (w1,j, w2,j, ..., wt,j)
 q = (w1,q, w2,q, ..., wt,q)
•Term weights are used to compute the degree of similarity between documents and the user query
•Then, retrieved documents are sorted in decreasing order of similarity

Vector Model
•Advantages
 –Its term-weighting scheme improves retrieval performance
 –Its partial matching strategy allows retrieval of documents that approximate the query conditions
 –Its cosine ranking formula sorts the documents according to their degree of similarity to the query
•Disadvantage
 –The assumption of mutual independence between index terms

Vector space model
•Vector space = all the keywords encountered
 <t1, t2, t3, ..., tn>
•Document
 D = <a1, a2, a3, ..., an>, where ai = weight of ti in D
•Query
 Q = <b1, b2, b3, ..., bn>, where bi = weight of ti in Q
•R(D, Q) = Sim(D, Q)
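
The slides do not spell Sim(D, Q) out here; the cosine of the angle between the two weight vectors is the usual choice, sketched below with plain Python lists standing in for the vectors:

import math

def cosine_sim(d, q):
    """Cosine similarity between a document vector d and a query vector q
    (same length, one weight per keyword of the vector space)."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine_sim([0.5, 0.0, 1.2], [1.0, 0.3, 0.7]))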

Probabilistic Model
•Introduced by Robertson and Sparck Jones, 1976
 –Binary independence retrieval (BIR) model
•Idea: given a user query q and the ideal answer set R of the relevant documents, the problem is to specify the properties of this set
 –Assumption (probabilistic principle): the probability of relevance depends on the query and document representations only; the ideal answer set R should maximize the overall probability of relevance
 –The probabilistic model tries to estimate the probability that the user will find document dj relevant via the ratio P(dj relevant to q) / P(dj non-relevant to q)

Probabilistic Model
•Definitions
 –All index term weights are binary, i.e., wi,j ∈ {0, 1}
 –Let R be the set of documents known to be relevant to query q
 –Let R' be the complement of R (the non-relevant documents)
 –Let P(R|dj) be the probability that document dj is relevant to query q, and P(R'|dj) the probability that dj is non-relevant to q

Probabilistic Model
•The similarity sim(dj, q) of the document dj to the query q is defined as the ratio
 sim(dj, q) = P(R | dj) / P(R' | dj)

Probabilistic Model
•P(ki | R) stands for the probability that the index term ki is present in a document randomly selected from the set R
•P(¬ki | R) stands for the probability that the index term ki is not present in a document randomly selected from the set R

UNIT-II

Relevance Feedback
•The query is represented by a vector Q, each document is represented by a vector Di, and a measure of relevance between the query and the document vector is computed as SC(Q, Di), where SC is the similarity coefficient.
•The basic assumption is that the user has issued a query Q and retrieved a set of documents.
•The user is then asked whether or not the documents are relevant.
•After the user responds, the set R contains the n1 relevant document vectors, and the set S contains the n2 non-relevant document vectors.

•The idea is that the relevant documents have terms matching those in the original query.
•The weights corresponding to these terms are increased by adding the relevant document vector. Terms in the query that are in the non-relevant documents have their weights decreased.
•Also, terms that are not in the original query (had an initial component value of zero) are now added to the original query.

•Only the top-ranked non-relevant document is used, instead of the sum of all non-relevant documents.
•An interesting case occurs when the original query retrieves only non-relevant documents.
•By increasing the weight, the term now rings true and yields some relevant documents.
•This is not applicable to automatic feedback, as the top n documents are assumed, by definition, to be relevant.

Why clustering?
•Let's look at the problem from a different angle
 –The issue here is dealing with high-dimensional data
•How do people deal with high-dimensional data?
 –Start by finding interesting patterns associated with the data
 –Clustering is one of the well-known techniques with successful applications on large domains for finding patterns
•But what is clustering?

Introduction
•The goal of clustering is to
 –group data points that are close (or similar) to each other
 –identify such groupings (or clusters) in an unsupervised manner
•Unsupervised: no information is provided to the algorithm on which data points belong to which clusters

Hierarchical clustering
•Modified from Dr. Seungchan Kim's slides
•Given the input set S, the goal is to produce a hierarchy (dendrogram) in which nodes represent subsets of S.
•Features of the tree obtained:
 –The root is the whole input set S.
 –The leaves are the individual elements of S.
 –The internal nodes are defined as the union of their children.
•Each level of the tree represents a partition of the input data into several (nested) clusters or groups.

Hierarchical clustering
[Figure: example dendrogram]

Hierarchical clustering
•There are two styles of hierarchical clustering algorithms to build a tree from the input set S:
 –Agglomerative (bottom-up):
  •Begin with singletons (sets with 1 element)
  •Merge them until S is achieved as the root
  •This is the most common approach (see the sketch below)
 –Divisive (top-down):
  •Recursively partition S until singleton sets are reached
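
As an illustrative sketch (my own example, not from the slides), a naive single-link agglomerative pass over one-dimensional points, merging the two closest clusters until the whole set S remains as the root; real IR systems would use distances between document vectors instead:

def single_link_agglomerative(points):
    """Naive O(n^3) single-link agglomerative clustering over 1-D points.
    Returns the merge history, i.e. the dendrogram built from leaves to root."""
    clusters = [[p] for p in points]              # start from singletons
    history = []
    while len(clusters) > 1:
        best = None
        # find the pair of clusters with the smallest single-link distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

for step in single_link_agglomerative([1.0, 1.2, 5.0, 5.1, 9.0]):
    print(step)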

Hierarchical clustering: forming clusters
•Forming clusters from dendrograms
[Figure: cutting a dendrogram at a chosen level yields the clusters]

Hierarchical clustering
•Advantages
 –Dendrograms are great for visualization
 –Provides hierarchical relations between clusters
 –Shown to be able to capture concentric clusters
•Disadvantages
 –Not easy to define levels for clusters
 –Experiments showed that other clustering techniques outperform hierarchical clustering

N-Grams
•N-grams are sequences of tokens.
•The N stands for how many terms are used:
 –Unigram: 1 term
 –Bigram: 2 terms
 –Trigram: 3 terms
•You can use different kinds of tokens:
 –Character-based n-grams
 –Word-based n-grams
 –POS-based n-grams
•N-grams give us some idea of the context around the token we are looking at.

Simple N-Grams
•Assume a language has V word types in its lexicon; how likely is word x to follow word y?
 –Simplest model of word probability: 1/V
 –Alternative 1: estimate the likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability)
  •popcorn is more likely to occur than unicorn
 –Alternative 2: condition the likelihood of x occurring on the context of previous words (bigrams, trigrams, ...)
  •mythical unicorn is more likely than mythical popcorn

Using N-Grams
•For N-gram models:
 –P(wn-1, wn) = P(wn | wn-1) P(wn-1)
 –By the chain rule we can decompose a joint probability, e.g. P(w1, w2, w3):
  P(w1, w2, ..., wn) = P(w1 | w2, w3, ..., wn) P(w2 | w3, ..., wn) ... P(wn-1 | wn) P(wn)
•For bigrams, then, the probability of a sequence is just the product of the conditional probabilities of its bigrams:
 P(the, mythical, unicorn) = P(unicorn | mythical) P(mythical | the) P(the | <start>)
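
A tiny sketch (my own toy example, not from the slides) of estimating bigram probabilities by maximum likelihood from a small corpus and scoring a word sequence as the product of its bigram probabilities:

from collections import Counter

corpus = ["<start> the mythical unicorn grazed <end>",
          "<start> the theatre sold popcorn <end>"]
tokens = [s.split() for s in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((a, b) for sent in tokens for a, b in zip(sent, sent[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate of P(word | prev); no smoothing in this sketch.
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sequence_prob(words):
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sequence_prob("<start> the mythical unicorn".split()))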

Applications
•Why do we want to predict a word, given some preceding words?
 –Rank the likelihood of sequences containing various alternative hypotheses, e.g. for automatic speech recognition or OCR
  •Theatre owners say popcorn/unicorn sales have doubled...
 –Assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation

Regression Analysis: Introduction
•Basic idea: use data to identify relationships among variables and use these relationships to make predictions.

Linear regression
•Linear dependence: constant rate of increase of one variable with respect to another (as opposed to, e.g., diminishing returns).
•Regression analysis describes the relationship between two (or more) variables.
•Examples:
 –Income and educational level
 –Demand for electricity and the weather
 –Home sales and interest rates
•Our focus:
 –Gain some understanding of the mechanics.
  •the regression line
  •regression error
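
A minimal sketch of fitting the regression line y = a + b*x by ordinary least squares (a generic illustration with made-up numbers, not data from the slides):

def least_squares(xs, ys):
    """Return intercept a and slope b that minimize the squared regression error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

print(least_squares([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.0]))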

UNIT-III

Outline
•Semantic Networks
•Parsing
•Cross-Language Information Retrieval
•Introduction
•Crossing the Language Barrier

What is Cross-Language Information Retrieval?
•Definition: select information in one language based on queries in another.
•Terminology
 –Cross-Language Information Retrieval (ACM SIGIR '96 Workshop on Cross-Linguistic Information Retrieval)
 –Translingual Information Retrieval (Defense Advanced Research Projects Agency - DARPA)

An Architecture of Cross-Language Information Retrieval
[Figure: architecture of a cross-language information retrieval system]

Building Blocks for CLIR
•Information Retrieval
•Artificial Intelligence
•Speech Recognition
•Information Science
•Computational Linguistics

Information Retrieval
•Filtering
•Relevance feedback
•Document representation
•Latent semantic indexing
•Generalized vector space model
•Collection fusion
•Passage retrieval

Information Retrieval
•Similarity thesaurus
•Local context analysis
•Automatic query expansion
•Fuzzy term matching
•Adapting retrieval methods to the collection
•Building cheap test collections
•Evaluation

Artificial Intelligence
•Machine translation
•Machine learning
•Template extraction and matching
•Building large knowledge bases
•Semantic networks

Speech Recognition
•Signal processing
•Pattern matching
•Phone lattice
•Background noise elimination
•Speech segmentation
•Modeling speech prosody
•Building test databases
•Evaluation

Major Problems of CLIR
•Queries and documents are in different languages.
 –translation
•Words in a query may be ambiguous.
 –disambiguation
•Queries are usually short.
 –expansion

Major Problems of CLIR
•Queries may have to be segmented.
 –segmentation
•A document may be in terms of various languages.
 –language identification

Enhancing Traditional Information Retrieval Systems
•Which part(s) should be modified for CLIR?
[Diagram: (1) documents feed the document representation, (3) queries feed the query representation, and (2)/(4) the two representations are compared]

Cross-Language Information Retrieval
•A taxonomy of CLIR approaches:
 –Query translation
  •Controlled vocabulary
  •Free text
   –Knowledge-based: ontology-based, dictionary-based, thesaurus-based
   –Corpus-based: term-aligned, sentence-aligned, document-aligned, or unaligned (parallel or comparable corpora)
   –Hybrid
 –Document translation
  •Text translation
  •Vector translation
 –No translation

Query Translation Based CLIR
[Diagram: an English query is converted by a translation device into a Chinese query, which a monolingual Chinese retrieval system runs to return retrieved Chinese documents]

Hybrid Methods
•Ballesteros & Croft, 1997
[Diagram: the original Spanish TREC queries are human-translated into English (BASE) queries; these are automatically translated back into Spanish with a bilingual dictionary, with query expansion applied before and/or after the dictionary translation, and the resulting Spanish queries are run in INQUERY]

Hybrid Methods
•Performance evaluation
 –Pre-translation expansion: MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1139, +38.5%)
 –Post-translation expansion: MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%)
 –Combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.1358, +65.0%)
 –Still 32% below a monolingual baseline

Hybrid Methods
•Davis, 1997 (TREC-5)
[Diagram: an English query is expanded to Spanish equivalents through a bilingual dictionary (filtered by part of speech); a parallel UN English/Spanish corpus and an IR engine are used to reduce the equivalent set, and the reduced Spanish query is run against the TREC Spanish corpus with a monolingual IR engine]

Document Translation
•Translate the documents, not the query
[Diagram: documents pass through MT before document representation; queries go directly to query representation; the two representations are then compared]
•(1) Efficiency problem
•(2) Retrieval effectiveness? (word order, stop words)
•(3) Cross-language mate finding using MT-LSI (Dumais et al., 1997)

Vector Translation
•Translate document vectors
[Diagram: documents → document representation → MT (vector translation) → comparison with the query representation]

CLIR system using query translation
[Figure: a CLIR system based on query translation]

Generating Mixed Ranked Lists of Documents
•Normalizing scales of relevance
 –using aligned documents
 –using ranks
 –interleaving according to given ratios
•Mapping documents into the same space
 –LSI
 –document translations

Types of Tools
•Character Set/Font Handling
•Word Segmentation
•Phrase/Compound Handling
•Terminology Extraction
•Parsers/Linguistic Processors
•Lexicon Acquisition
•MT Systems
•Summarization
•Mark-Up Tools
•Language Identification
•Stemming/Normalization
•Entity Recognition
•Part-of-Speech Taggers
•Indexing Tools
•Text Alignment
•Speech Recognition/OCR
•Visualization

Character Set/Font Handling
•Input and display support
 –Special input modules, e.g. for Asian languages
 –Out-of-the-box support much improved thanks to modern web browsers
•Character set/file format
 –Unicode/UTF-8
 –XML

Language Identification
•Different levels of multilingual data
 –In different sub-collections
 –Within sub-collections
 –Within items
•Different approaches
 –Tri-grams
 –Stopwords
 –Linguistic analysis
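
As an illustration of the tri-gram approach (a generic sketch, not from the slides), language profiles can be compared by the overlap of their most frequent character tri-grams; the tiny training strings here are only placeholders:

from collections import Counter

def trigram_profile(text, top=300):
    text = " " + " ".join(text.lower().split()) + " "
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {g for g, _ in grams.most_common(top)}

profiles = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog"),
    "de": trigram_profile("der schnelle braune fuchs springt ueber den faulen hund"),
}

def identify(text):
    probe = trigram_profile(text)
    # pick the language whose profile shares the most tri-grams with the probe
    return max(profiles, key=lambda lang: len(profiles[lang] & probe))

print(identify("the dog jumps"))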

Stemming/Normalization
•Reduction of words to their root form
•Important for languages with rich morphology
•Rule-based or dictionary-based
•Case normalization
•Handling of diacritics (French, ...)
•Vowel (re-)substitution (e.g. Semitic languages, ...)

Entity Recognition / Terminology Extraction
•Proper names, locations, ...
 –Critical, since often missing from dictionaries
 –Special problems in languages such as Chinese
•Domain-specific vocabulary, technical terms
 –Critical for effectiveness and accuracy

Phrase/Compound Handling
•Collocations ("Hong Kong")
 –Important for dictionary lookup
 –Improves retrieval accuracy
•Compounds ("Bankangestelltenlohn" - bank employee salary)
 –Big problem in German
 –Infinite number of compounds - a dictionary is no viable solution

Lexicon Acquisition / Text Alignment
•Goal: automatic construction of data structures such as dictionaries and thesauri
 –Work on parallel and comparable corpora
 –Terminology extraction
 –Similarity thesauri
•Prerequisite: training data, usually aligned
 –Document-, sentence-, or word-level alignment

Too Many Factors in CLIR System Evaluation
•translation
•automatic relevance feedback
•term expansion
•disambiguation
•result merging
•test collection
•need to tone it down to see what happened

TREC-6 Cross-Language Track
•In cooperation with the Swiss Federal Institute of Technology (ETH)
•Task summary: retrieval of English, French, and German documents, both in a monolingual and a cross-lingual mode
•Documents
 –SDA (1988-1990): French (250 MB), German (330 MB)
 –Neue Zurcher Zeitung (1994): German (200 MB)
 –AP (1988-1990): English (759 MB)
•13 participating groups

TREC-7 Cross-Language Track
•Task summary: retrieval of English, French, German and Italian documents
•Results to be returned as a single multilingual ranked list
•Addition of Italian SDA (1989-1990), 90 MB
•Addition of a subtask of 31,000 structured German social science documents (GIRT)
•9 participating groups

TREC-8 Cross-Language Track
•Tasks, documents and topic creation similar to TREC-7
•12 participating groups

CLIR in TREC-9
•Documents
 –Hong Kong Commercial Daily, Hong Kong Daily News, Takungpao: all from 1999 and about 260 MB in total
•25 new topics built in English; translations made to Chinese

Cross-Language Evaluation Forum
•A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)
•Extension of the CLIR track at TREC (1997-1999)

Main Goals
•Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
 –CLIR system evaluation, testing and tuning
 –Comparison and discussion of results

CLEF 2000 Task Description
•Four evaluation tracks in CLEF 2000
 –multilingual information retrieval
 –bilingual information retrieval
 –monolingual (non-English) information retrieval
 –domain-specific IR

CLEF 2000 Document Collection
•Multilingual comparable corpus
 –English: Los Angeles Times
 –French: Le Monde
 –German: Frankfurter Rundschau + Der Spiegel
 –Italian: La Stampa
•Similar in genre, content, and time period

3M in Digital Libraries/Museums
•Multi-media
 –Selecting suitable media to represent contents
•Multi-linguality
 –Decreasing the language barriers
•Multi-culture
 –Integrating multiple cultures

NPDM Project
•The Palace Museum, Taipei, is one of the famous museums in the world
•NSC supports a pioneer study of a digital museum project, NPDM, starting from 2000
 –Enamels from the Ming and Ch'ing Dynasties
 –Famous Album Leaves of the Sung Dynasty
 –Illustrations in Buddhist Scriptures with Relative Drawings

Design Issues
•Standardization
 –A standard metadata protocol is indispensable for the interchange of resources with other museums.
•Multimedia
 –A suitable presentation scheme is required.
•Internationalization
 –to share the valuable resources of NPDM with users of different languages
 –to utilize knowledge presented in a foreign language

Translingual Issue
•CLIR
 –to allow users to issue queries in one language to access documents in another language
 –here, the query language is English and the document language is Chinese
•Two common approaches
 –Query translation
 –Document translation

Resources in the NPDM Pilot
•An enamel, a calligraphy, a painting, or an illustration
•MICI-DC
 –Metadata Interchange for Chinese Information
 –Fields accessible to users
  •Short descriptions vs. full texts
  •Bilingual versions vs. Chinese only
 –Fields for maintenance only

Search Modes
•Free search
 –users describe their information need using natural language (Chinese or English)
•Specific topic search
 –users fill in specific fields denoting authors, titles, dates, and so on

Example
•Information need
 –Retrieve "Travelers Among Mountains and Streams, Fan K'uan" ("范寬谿山行旅圖")
•Possible queries
 –Author: Fan Kuan; Kuan, Fan
 –Time: Sung Dynasty
 –Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
 –Free search: landscape painting; travelers, huge mountain; Nature; scenery; Shensi province

[Diagram: CLIR in the NPDM pilot. An English query is handled by query translation and document translation components: English names are mapped to Chinese names through a specific bilingual dictionary and machine transliteration (name search), English titles are mapped to Chinese titles through a generic bilingual dictionary (title search), and query disambiguation produces the final Chinese query, which a Chinese IR system runs against the NPDM collection]

Specific Topic Search
•Proper names are important query terms
 –Creators such as "林逋" (Lin P'u), "李建中" (Li Chien-chung), "歐陽脩" (Ou-yang Hsiu), etc.
 –Emperors such as "康熙" (K'ang-hsi), "乾隆" (Ch'ien-lung), "徽宗" (Hui-tsung), etc.
 –Dynasties such as "宋" (Sung), "明" (Ming), "清" (Ch'ing), etc.

Name Transliteration
•The alphabets of Chinese and English are totally different
•Wade-Giles (WG) and Pinyin are two famous systems for romanizing Chinese in libraries
•Backward transliteration
 –Transliterate target-language terms back to source-language ones
 –Chen, Huang, and Tsai (COLING, 1998)
 –Lin and Chen (ROCLING, 2000)

Name Mapping Table
•Divide a name into a sequence of Chinese characters, and transform each character into phonemes
•Look up a phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name

Name Similarity
•Extract the named entity from the query
•Select the most similar named entity from the name mapping table
•Naming sequences/schemes
 –LastName FirstName1, e.g., Chu Hsi (朱熹)
 –FirstName1 LastName, e.g., Hsi Chu (朱熹)
 –LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
 –FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
 –Any order, e.g., Tao Ning Hsu (許道寧)
 –Any transliteration, e.g., Ju Shi (朱熹)

Title
•"Travelers among Mountains and Streams"
•"travelers", "mountains", and "streams" are basic components
•Users can express their information need through descriptions of a desired art object
•The system will measure the similarity between the art titles (descriptions) and a query

Free Search
•A query is composed of several concepts.
•Concepts are either transliterated or translated.
•The query translation is similar to a small-scale IR system
•Resources
 –Name-mapping table
 –Title-mapping table
 –Specific English-Chinese dictionary
 –Generic English-Chinese dictionary

Algorithm
•(1) For each resource, the Chinese translations whose scores are larger than a specific threshold are selected.
•(2) The Chinese translations identified from different resources are merged and sorted by their scores.
•(3) Consider the Chinese translation with the highest score in the sorted sequence.
 –If the intersection of the corresponding English description and the query is not empty, then select the translation and delete the common English terms between the query and the English description from the query.
 –Otherwise, skip the Chinese translation.

Algorithm
•(4) Repeat step (3) until the query is empty or all the Chinese translations in the sorted sequence have been considered.
•(5) If the query is not empty, the remaining words are looked up in the general dictionary. A Chinese query is composed of all the translated results. (A code sketch of these steps follows below.)
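
A hedged Python sketch of this greedy translation procedure; the resource format, scoring, and threshold below are stand-ins of my own, since the slides do not define them:

def translate_query(query_terms, resources, general_dict, threshold=0.5):
    """Greedy query translation following steps (1)-(5) above.
    resources maps an English description (a tuple of terms) to a (chinese, score) pair."""
    # (1) + (2): collect candidate translations above the threshold, highest score first
    candidates = [(score, set(desc), zh)
                  for desc, (zh, score) in resources.items() if score > threshold]
    candidates.sort(key=lambda c: c[0], reverse=True)
    remaining, chinese_query = set(query_terms), []
    # (3) + (4): repeatedly take the best candidate whose description still overlaps the query
    for score, desc, zh in candidates:
        if not remaining:
            break
        if desc & remaining:
            chinese_query.append(zh)
            remaining -= desc
    # (5): translate any leftover words with the general dictionary
    chinese_query += [general_dict[w] for w in remaining if w in general_dict]
    return chinese_query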

UNIT-IV

Inverted Index
Query Processing
Signature Files
Duplicate Document Detection

Inverted Index
•Inverted indexes were used in both early information retrieval and database management systems in the 1960s.
•Instead of scanning the entire collection, the text is preprocessed and all unique terms are identified.
•This list of unique terms is referred to as the index.
•For each term, a list of the documents that contain the term is also stored. This list is referred to as a posting list.
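
A minimal sketch (not the book's code) of building such an index in Python, storing <docid, tf> pairs in each posting list:

from collections import defaultdict, Counter

def build_inverted_index(docs):
    """docs: dict mapping doc id -> text. Returns term -> list of (doc id, tf)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term].append((doc_id, tf))
    return index

docs = {1: "inverted indexes speed up retrieval",
        2: "the index stores a posting list for each term"}
index = build_inverted_index(docs)
print(index["index"])    # posting list for the term "index"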


•The size of the index is another concern. An index can be nearly equal in size to the original text.
•This means that storage requirements are doubled due to the index.
•The size of posting lists in the inverted index can be approximated by the Zipfian distribution.
•Zipf proposed that the term frequency distribution in a natural language is such that, if all terms were ordered and assigned a rank, the frequency of occurrence of a term is inversely proportional to its rank.

•Using C/r, where r is the rank and C is the value of the constant, an estimate can be made of the number of occurrences of a given term.
•The constant C is domain-specific and equals the number of occurrences of the most frequent term.

Building an Inverted Index
•An inverted index consists of two components: a list of each distinct term, referred to as the index, and a set of lists referred to as posting lists.
•Thus, there is a posting list for each distinct term in the collection. The posting list is a set of tuples of the form <docid, tf>.

•Index file: contains the actual posting list for each distinct term in the collection. A term t that occurs in i different documents will have a posting list with i entries.
•Document file: contains information about each distinct document - document identifier, long document name, date published, etc.
•Weight file: contains the weight for each document. This is the denominator for the cosine coefficient, defined as the cosine of the angle between the query and document vectors.

Compressing an Inverted Index
•A key objective in the development of inverted index files is to develop algorithms that reduce I/O bandwidth and storage overhead.
•The size of the index file determines the storage overhead imposed.
•The two primary areas in which an inverted index might be compressed are the term dictionary and the posting lists.

Fixed Length Index Compression
•This scheme effectively reduces the domain of the identifiers, allowing them to be stored in a more concise format.
•For each value to be compressed, the minimum number of bytes required to store this value is computed.
•For term frequencies, there is no concept of using an offset between successive values, as each frequency is independent of the preceding value.

Variable Length Index Compression
•In this method, the frequency distribution of all of the offsets is obtained through an initial pass over the text.
•A compression scheme is developed based on the frequency distribution, and a second pass uses the new compression scheme.
•This (gamma) code represents an integer x with 2*floor(log2 x) + 1 bits. The first floor(log2 x) bits are the unary representation of floor(log2 x).
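
A small sketch of one common variant of the gamma code described above (my illustration, not the book's): a unary prefix encodes floor(log2 x), and the remaining floor(log2 x) bits carry the offset of x within that range:

import math

def gamma_encode(x):
    """Elias-gamma-style code for a positive integer x, using 2*floor(log2 x) + 1 bits."""
    assert x >= 1
    n = int(math.floor(math.log2(x)))
    unary = "1" * n + "0"                             # unary part encodes n
    offset = format(x - (1 << n), "b").zfill(n) if n else ""
    return unary + offset

for x in (1, 2, 5, 9):
    print(x, gamma_encode(x))      # 1 -> "0", 2 -> "100", 5 -> "11001", 9 -> "1110001"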

Varying Compression Based on Posting List Size
•The gamma scheme can be generalized as a coding paradigm based on a vector V of positive integers vi whose sum is at least N. To code an integer x > 1 relative to V, find k.
•Clearly, V can be changed to give different compression characteristics.
•Low values in V optimize compression for low numbers, while higher values in V provide more resilience for high numbers.

Query Processing
Inverted Index Modifications
•An inverted index can be segmented to allow a fast, quick search of a posting list to locate a particular document.
•A suggested improvement is to continue processing all the terms in the query, but only update the weights found in the d documents.
•Also, once the d documents are accessed, there is no need to update the score for every document; it is only necessary to update the score for those documents that have a non-zero score.

Partial Result Set Retrieval
Cutoff Based on Document Frequency
•The simplest measure of term quality is to rely on document frequency.
•Using between twenty-five and seventy-five percent of the query terms, after they were sorted by document frequency, resulted in almost no degradation in precision and recall for the TREC-4 document collection.

Vector Space Simplifications
•The first variation was to replace the document-length normalization that is based on weight with the square root of the number of terms in Di.
•The second variation was to simply remove the document-length normalization.
•The third measure drops the idf. This eliminates one entry in the index for each term.
•The fourth measure drops the tf but retains the idf. This eliminates the need to store the tf in each entry of the posting list.
•The fifth and final method simply counts matches between the query and the terms.

Signature Files
•The use of signature files lies between a sequential scan of the original text and the construction of an inverted index.
•A signature is an encoding of a document. The idea is to encode all documents as relatively small signatures.
•Construction of a signature is often done with different hashing functions.
•One or more hashing functions are applied to each word in the document.


•The hashing function is used to set a bit in the signature.
•To implement document retrieval, a signature is constructed for the query.
•A Boolean signature cannot store proximity information or information about the weight of a term as it appears in a document.
•Signatures are useful if they can fit into memory.
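
A toy sketch of the idea (my own illustration, not from the book): each document is encoded as a fixed-width bit signature by hashing its words, and a query signature that is not fully covered by a document signature rules that document out; covered documents are only candidate matches, since false positives remain possible:

SIG_BITS = 64

def signature(words, num_hashes=3):
    """Superimposed coding: set a few bits per word in a fixed-width signature."""
    sig = 0
    for w in words:
        for k in range(num_hashes):
            sig |= 1 << (hash((w, k)) % SIG_BITS)
    return sig

doc_sig = signature("signature files lie between scanning and inverted indexes".split())
query_sig = signature(["inverted", "indexes"])
# candidate match only if every query bit is also set in the document signature
print(query_sig & doc_sig == query_sig)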

Scanning to Remove False Positives
•Pattern matching algorithms are related to the use of scanning in information retrieval, since they strive to find a pattern in a string of text characters.
•Typically, pattern matching is defined as finding all positions in the input text that contain the start of a given pattern.
•If the pattern is of size p and the text is of size s, the naive nested-loop pattern match requires O(ps) comparisons.

Duplicate Document Detection
•A method to improve both the efficiency and effectiveness of an information retrieval system is to remove duplicates or near duplicates.
•Duplicates can be removed either at the time documents are added to an inverted index or upon retrieving the results of a query.
•The difficulty is that we do not simply wish to remove exact duplicates; we may well be interested in removing near duplicates as well.

Finding Similar Duplicates
•While it is not possible to define precisely at which point a document is no longer a duplicate of another, researchers have examined several metrics for determining the similarity of one document to another.
•The first is resemblance. The resemblance r of document Di and document Dj is defined as the intersection of features over the union of features from the two documents.

Shingles
•The first near-duplicate algorithm we discuss is the use of shingles.
•A shingle is simply a set of contiguous terms in a document. Shingling techniques, such as COPS, KOALA, and DSC, essentially all compare the number of matching shingles.
•This makes sense because super shingles tend to be somewhat large and will, in all likelihood, completely encompass a short document.
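
A small sketch (not the book's code) that computes the resemblance described above as the ratio of shared to total word-level shingles of two documents:

def shingles(text, k=3):
    """Set of k-term contiguous shingles from a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(doc_i, doc_j, k=3):
    si, sj = shingles(doc_i, k), shingles(doc_j, k)
    return len(si & sj) / len(si | sj) if si | sj else 0.0

a = "commercial vehicle sales in italy rose in february from a year earlier"
b = "commercial vehicle sales in italy rose sharply in february this year"
print(resemblance(a, b))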

Duplicate Detection via Similarity
•Another approach is to simply compute the similarity coefficient between two documents. If the document similarity exceeds a threshold, the documents can be deemed duplicates of each other.
•This requires all pairs of documents to be compared, i.e. each document is compared to every other document and a similarity weight is calculated.

I-Match
•I-Match uses a hashing scheme that uses only some terms in a document.
•The decision of which terms to use is key to the success of the algorithm.
•I-Match is a hash of the document that uses collection statistics.
•The overall runtime of the I-Match approach is O(d log d) in the worst case, where all documents are duplicates of each other.

UNIT-V

A Historical Progression
•Combining separate systems
•Queries are parsed and the structured portions are submitted as a query to the DBMS, while the text-search portions of the query are submitted to an information retrieval system.
•The results are combined and presented to the user.

•It does not take long to build this software, and since information retrieval systems and DBMS are readily available, this is often seen as an attractive solution.
•The key advantage of this approach is that the DBMS and the information retrieval system are commercial products that are continuously improved upon by vendors.

Data Integrity
•Data integrity is sacrificed because the DBMS transaction log and the information retrieval transaction log are not coordinated. If a failure occurs in the middle of an update transaction, the DBMS will end in a state where the entire transaction is either completed or entirely undone. It is not possible to complete half of an update.

Portability
•Portability is sacrificed because the query language is not standard. Presently, a standard information retrieval query language does not exist. However, some work is being done to develop standard information retrieval query languages.
•If one existed, it would require many years for widespread commercial acceptance to occur.

Performance
•Run-time performance suffers because of the lack of parallel processing and query optimization. Although most commercial DBMS have parallel implementations, most information retrieval systems do not.
•Query optimization exists in every relational DBMS. The optimizer's goal is to choose the appropriate access path to the data. A rule-based optimizer uses pre-defined rules, while a cost-based optimizer estimates the cost of using different access paths and chooses the cheapest one.

Extensions to SQL
•An information retrieval system typically hides the inverted index as simply an access structure that is used to obtain data.
•By storing the index as a relation, the authors pointed out that users could easily view the contents of the index and make changes if necessary.
•The authors mentioned extensions, such as RELEVANCE(*), that would compute the relevance of a document to a query using some pre-defined relevance function.

User-defined Operators
•User-defined operators allow users to modify SQL by adding their own functions to the DBMS engine.
•The data type of the argument is given as Rectangle; hence, this example uses both a user-defined function and a user-defined data type.
 Ex. 1: SELECT MAX(AREA(Rectangle)) FROM SHAPE
•The following query obtains all documents that contain the terms Term1, Term2, and Term3:
 Ex. 2: SELECT DocId FROM DOC WHERE SEARCH-TERM(Text, Term1, Term2, Term3)

Integrity
•For user-defined operators to be efficient, they must be linked into the same module as the entire DBMS, giving them access to the entire address space of the DBMS.
•Data that reside in memory, or in disk files that are currently open, can be accessed by the user-defined operator.
•It is possible that the user-defined operator could corrupt these data.

Portability
•The operator may appear to exist, but it may perform an entirely different function.
•Without user-defined operators, anyone with an RDBMS may write an application and expect it to run at any site that runs that RDBMS.
•With user-defined operators, this perspective changes, as the application is limited to only those sites with the user-defined operator.

Performance
•Query optimization, by default, does not know much about the specific user-defined operators.
•Optimization is often based on substantial information about the query. A query with an EQUAL operator can be expected to retrieve fewer rows than one with a LESS THAN operator.
•This knowledge assists the optimizer in choosing an access path.

Information Retrieval as a Relational Application
•The following query lists all the identifiers of documents that contain at least one term in QUERY:
 Ex. 5: SELECT DISTINCT(i.DocId) FROM INDEX i, QUERY q WHERE i.term = q.term
•A query to identify documents that contain any of the terms in the query, except those in the STOP_TERM relation, is given below:
 Ex. 6: SELECT DISTINCT(i.DocId) FROM INDEX i, QUERY q, STOP_TERM s WHERE i.term = q.term AND i.term <> s.term

Preprocessing
•A preprocessor that reads the input file and outputs separate flat files is used.
•Each term is read and checked against a list of SGML markers. The main algorithm for the preprocessor simply parses terms and then applies a hash function to hash them into a small hash table.
•If the term has not occurred for this document, a new entry is added to the hash table. Collisions are handled by a singly linked list associated with the hash table.

A Working Example
•The documents contain both structured and unstructured data and are given below.
<DOC>
<DOCNO> WSJ870323-0180 </DOCNO>
<HL> Italy's Commercial Vehicle Sales </HL>
<DD> 03/23/87 </DD>
<DATELINE> TURIN, Italy </DATELINE>
<TEXT>
Commercial-vehicle sales in Italy rose 11.4% in February from a year earlier, to 8,848 units, according to provisional figures from the Italian Association of Auto Makers.
</TEXT>
</DOC>

<DOC>
<DOCNO> WSJ870323-0161 </DOCNO>
<HL> Who's News: Du Pont Co. </HL>
<DD> 03/23/87 </DD>
<DATELINE> Du Pont Company, Wilmington, DE </DATELINE>
<TEXT>

Boolean Retrieval
•For large document collections, Boolean queries are less useful because the result set is unordered, and a query can result in thousands of matches.
•The user is then forced to tune the Boolean conditions and retry the query until the desired result is obtained.
•Relevance ranking avoids this problem by ranking documents based on a measure of relevance between the documents and the query.
•The user then looks at the top-ranked documents and determines whether or not they fill the information need.

Semi-Structured Search using a Relational Schema
•XML-QL, a query language developed at AT&T [Deutsch et al., 1999], was designed to meet the requirements for a full-featured XML query language set out by the W3C.
•The specification describing XPath as it is known today was released in 1999.

Static Relational Schema to Support XML-QL
•This was first proposed in [Florescu and Kossman, 1999] to provide support for XML query processing.
•Later, in the IIT Information Retrieval Laboratory (www.ir.iit.edu), it was shown that a full XML-QL query language could be built using this basic structure.
•This is done by translating semi-structured XML-QL to SQL. The use of a static schema accommodates data of any XML schema without the need for document-type definitions or XML Schemas.

•The hierarchy of XML documents is kept intact, such that any document indexed into the database can be reconstructed using only the information in the tables. The relations used are:
 TAG_NAME (TagId, tag)
 ATTRIBUTE (AttributeId, attribute)
 TAG_PATH (TagId, path)
 DOCUMENT (DocId, fileName)
 INDEX (Id, parent, path, type, tagId, attrId, pos, value)

Distributed Information Retrieval
•A theoretical model of distributed retrieval
•Web search

Introduction
[Figure: distributed document retrieval]

Centralized Information Retrieval System Model
•Formally, an information retrieval system is defined as a triple I = (D, R, δ), where D is a document collection, R is the set of queries, and δ: R → 2^D is a mapping assigning each query to a set of relevant documents.
•Many information retrieval systems rely upon a thesaurus, in which a user query is expanded to include synonyms of the keywords, so as to match synonyms in a document. Hence, a query that contains the term curtain will also include documents containing the term drapery.

•To include the thesaurus in the model, it was proposed in [Turski, 1971] that the triple be expanded to a quadruple:
 I = (T, D, R, δ)
 where T is a set of distinct terms and the relation p ⊆ T × T is such that p(t1, t2) implies that t1 is a synonym of t2.
•Using the synonym relation, it is possible to represent documents as a set of descriptors and a set of ascriptors. Consider a document D1; the set of descriptors d consists of all terms in D1 such that:
 –Each descriptor is unique
 –No descriptor is a synonym of another descriptor

•An ascriptor is defined as a term that is a synonym of a descriptor.
•Each ascriptor must be synonymous with only one descriptor. Hence, the descriptors represent a minimal description of the document.
•In addition to the thesaurus, a generalization relation over the sets of descriptors is defined as γ ⊆ d × d, where γ(t1, t2) implies that t1 is a more general term than t2. Hence, γ(animal, dog) is an example of a valid generalization.

Distributed Information Retrieval System Model
•The centralized information retrieval system can be partitioned into n local information retrieval systems S1, S2, ..., Sn [Mazur, 1984]. Each system Sj is of the form Sj = (Tj, Dj, Rj, δj), where Tj is the thesaurus, Dj is the document collection, Rj is the set of queries, and δj: Rj → 2^Dj maps the queries to documents.
•By taking the union of the local sites, it is possible to define the distributed information retrieval system.

•Consider a set of documents with the following descriptors:
 D1 = (Mary, Harold, Herbert)
 D2 = (Herbert, dog)
 D3 = (people, dog)
 D4 = (Mary, cheshire)
 D5 = (Mary, dog)
 D6 = (Herbert, black-cat, doberman)
 D7 = (Herbert, doberman)

•A query of the most general terms (people, animal) should return all documents 2 through 7 (document 1 contains no animals, and the query is effectively a Boolean AND). However, the hierarchy given above as S1 will only retrieve documents D3 and D5, and S2 will only retrieve documents D6 and D7. Hence, documents D2 and D4 are missing from the final result if the local result sets are simply concatenated.

Web Search
•Evaluation of Web search engines
 –To automatically evaluate Web search engines, a method using online taxonomies that were created as part of the Open Directory Project (ODP) is described in [Beitzel et al., 2003b]. Online directories were used as known relevant items for a query. If a query matches either the title of the stored item or the directory file name containing a known item, then it is considered a match.

High Precision Search
•Traditionally, precision and recall measures are the main evaluation metrics, while response time and space requirements are likely addressed as well. However, in the Web environment, response time is critical. Furthermore, recall estimation is very difficult, and precision is of limited concern since most users never access any links that appear beyond the first answer screen.
•The Web environment provides many new opportunities to revisit old issues, particularly in terms of performance and accuracy optimizations and evaluation measures of search accuracy. In that light, recently, an hourly analysis of a very large, topically categorized ...

PageRank
•PageRank extends the notion of hubs and authorities in the Web graph originally described in [Kleinberg, 1999]. PageRank is at the heart of the popular Web search engine Google. Essentially, the PageRank algorithm uses incoming and outgoing links to adjust the score of a Web page with respect to its popularity, independent of the user's query. Hence, if a traditional retrieval strategy might previously have ranked two documents equal, the PageRank algorithm will boost the similarity measure for a popular document.

•Here, popular is defined as having a number of other Web pages link to the document. This algorithm works well on Web pages, but has no bearing on documents that do not have any hyperlinks. The calculation of PageRank for page A over all pages D1 ... Dn linking to it is defined as follows:
 PageRank(A) = (1 - d) + d * [PageRank(D1)/C(D1) + ... + PageRank(Dn)/C(Dn)]
•where C(Di) is the number of links out from page Di and d is a damping factor.
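
A compact sketch of iterating this formula to convergence over a toy link graph (my illustration; the damping factor d = 0.85 is a commonly used default, not a value stated on the slide):

def pagerank(links, d=0.85, iterations=50):
    """links: dict page -> list of pages it links to. Returns page -> PageRank score."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            incoming = [q for q in links if page in links[q]]
            new_rank[page] = (1 - d) + d * sum(rank[q] / len(links[q]) for q in incoming)
        rank = new_rank
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))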

Fig: Simple PageRank calculation
[Figure: simple PageRank calculation over a small link graph]

Improving Effectiveness of Web Search Engines
•Compression of the inverted index is the same, partial relevance ranking is the same, etc.
•However, there were, and are, some efforts specifically focused on improving the performance of Web-based information retrieval systems.
•In terms of accuracy improvements, it is reasonable to believe that by sending a request to a variety of different search engines and merging the obtained results, one could improve the accuracy of a Web search engine.