About This Presentation
IRS
Size: 1.66 MB
Language: en
Added: Mar 24, 2024
Slides: 156 pages
Slide Content
INFORMATION RETRIEVAL SYSTEMS
IV B.TECH - I SEMESTER (JNTUH-R15)
Ms. S.J. Sowjanya, Associate Professor, CSE
Mr. N.V. Krishna Rao, Associate Professor, CSE
Mr. C. Praveen Kumar, Assistant Professor, CSE
COMPUTER SCIENCE AND ENGINEERING
INSTITUTE OF AERONAUTICAL ENGINEERING
(Autonomous)
DUNDIGAL, HYDERABAD - 500 043
IR Systems
•IR systems contain three components:
–System
–People
–Documents (information items)
(Diagram: interactions among System, User, and Documents)
Data and Information
•Data
–String of symbols associated with objects, people, and events
–Values of an attribute
•Data need not have meaning to everyone
•Data must be interpreted with associated attributes.
Data and Information
•Information
–The meaning of the data as interpreted by a person or a system.
–Data that changes the state of a person or system that perceives it.
–Data that reduces uncertainty.
•If the data contain no uncertainty, they carry no information.
•Examples: "It snows in the winter." (expected, so little information)
"It does not snow this winter." (unexpected, so more informative)
Information and Knowledge
•Knowledge
–Structured information
•through structuring, information becomes understandable
–Processed information
•through processing, information becomes meaningful and useful
–Information shared and agreed upon within a community
(Diagram: Data → Information → Knowledge)
General form of precision/recall
(Figure: precision (0 to 1.0) plotted against recall (0 to 1.0) as a downward-sloping curve)
–Precision changes w.r.t. recall (not a fixed point)
–Systems cannot be compared at a single precision/recall point
–Average precision (on 11 points of recall: 0.0, 0.1, …, 1.0)
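The 11-point average precision mentioned above can be sketched as follows (a minimal illustration, assuming a ranked list of document ids and a known relevant set; the function name is a hypothetical helper, not from the slides):

```python
# Sketch: 11-point interpolated average precision.
# `ranking` is a ranked list of doc ids; `relevant` is the set of relevant ids.
def eleven_point_average_precision(ranking, relevant):
    # Precision/recall after each retrieved document.
    points = []
    hits = 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))  # (recall, precision)
    # Interpolated precision at recall r = max precision at any recall >= r.
    levels = [r / 10 for r in range(11)]  # 0.0, 0.1, ..., 1.0
    interp = []
    for r in levels:
        candidates = [p for (rec, p) in points if rec >= r]
        interp.append(max(candidates) if candidates else 0.0)
    return sum(interp) / len(levels)

print(round(eleven_point_average_precision(
    ["d1", "d2", "d3", "d4"], {"d1", "d3"}), 3))  # 0.848
```

With relevant documents at ranks 1 and 3, interpolated precision is 1.0 up to recall 0.5 and 2/3 beyond, averaging to about 0.848.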
Some techniques to improve IR effectiveness
•Interaction with user (relevance feedback)
–Keywords only cover part of the contents
–User can help by indicating relevant/irrelevant documents
•The use of relevance feedback
–To improve the query expression:
Qnew = α·Qold + β·Rel_d − γ·Nrel_d
where Rel_d = centroid of relevant documents,
NRel_d = centroid of non-relevant documents
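The update above (commonly known as the Rocchio formula) can be sketched as follows; the coefficient values α, β, γ are illustrative defaults, not prescribed by the slides, and vectors are plain lists of term weights:

```python
# Sketch of the relevance-feedback query update:
# Qnew = alpha*Qold + beta*centroid(relevant) - gamma*centroid(non-relevant)
def rocchio(q_old, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    def centroid(docs):
        # Component-wise mean of the document vectors (zero vector if empty).
        if not docs:
            return [0.0] * len(q_old)
        return [sum(col) / len(docs) for col in zip(*docs)]
    rel_c = centroid(rel_docs)
    nrel_c = centroid(nonrel_docs)
    return [alpha * q + beta * r - gamma * s
            for q, r, s in zip(q_old, rel_c, nrel_c)]

# Query on term 1 only; user marks two docs about term 2 as relevant.
print([round(x, 2) for x in
       rocchio([1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]], [[1.0, 0.0]])])  # [0.85, 0.75]
```

The new query is pulled toward the relevant centroid and pushed away from the non-relevant one.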
IR on the Web
•No stable document collection (spider, crawler)
•Invalid documents, duplication, etc.
•Huge number of documents (partial collection)
•Multimedia documents
•Great variation of document quality
•Multilingual problems
Vector space model
•Vector space = all the keywords encountered
<t1, t2, t3, …, tn>
•Document
D = <a1, a2, a3, …, an>
ai = weight of ti in D
•Query
Q = <b1, b2, b3, …, bn>
bi = weight of ti in Q
•R(D,Q) = Sim(D,Q)
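Sim(D,Q) is commonly taken to be the cosine of the angle between the two weight vectors; a minimal sketch under that assumption (the example weights are illustrative):

```python
# Sketch: cosine similarity between a document and a query in the
# vector space model. d and q are weight vectors over the same terms t1..tn.
import math

def cosine(d, q):
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

D = [1.0, 2.0, 0.0]   # weights a1..a3 of t1..t3 in D
Q = [1.0, 0.0, 0.0]   # weights b1..b3 of t1..t3 in Q
print(round(cosine(D, Q), 3))  # 0.447
```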
Probabilistic Model
•Introduced by Robertson and Sparck Jones, 1976
–Binary independence retrieval (BIR) model
•Idea: Given a user query q and the ideal answer set R of the relevant documents, the problem is to specify the properties of this set
–Assumption (probabilistic principle): the probability of relevance depends on the query and document representations only; the ideal answer set R should maximize the overall probability of relevance
–The probabilistic model tries to estimate the probability that the user will find the document dj relevant with the ratio
P(dj relevant to q) / P(dj non-relevant to q)
Probabilistic Model
•Definitions
–All index term weights are binary, i.e., wi,j ∈ {0,1}
–Let R be the set of documents known to be relevant to query q
–Let R̄ be the complement of R
–Let P(R|dj) be the probability that the document dj is relevant to the query q, and P(R̄|dj) the probability that dj is non-relevant to query q
Probabilistic Model
•The similarity sim(dj,q) of the document dj to the query q is defined as the ratio
sim(dj, q) = P(R | dj) / P(R̄ | dj)
Probabilistic Model
–P(ki|R) stands for the probability that the index term ki is present in a document randomly selected from the set R
–P(k̄i|R) stands for the probability that the index term ki is not present in a document randomly selected from the set R
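One standard way to turn these probabilities into term weights is the Robertson/Sparck Jones log-odds weight, estimated from relevance counts with the usual 0.5 smoothing. A sketch (the counts, thresholds, and helper name are illustrative assumptions, not from the slides):

```python
# Sketch: Robertson/Sparck Jones term weight from the BIR probabilities.
import math

def rsj_weight(r, R, n, N):
    # r: relevant docs containing the term, R: number of relevant docs,
    # n: docs containing the term, N: collection size; 0.5 smoothing.
    p = (r + 0.5) / (R + 1)          # estimate of P(ki | R)
    q = (n - r + 0.5) / (N - R + 1)  # estimate of P(ki | non-relevant)
    return math.log(p * (1 - q) / (q * (1 - p)))

# A term in 8 of 10 known-relevant docs but only 20 of 1000 docs overall
# gets a strongly positive weight:
print(rsj_weight(8, 10, 20, 1000) > 0)  # True
```

Summing these weights over matching query terms ranks documents by (a monotonic function of) the relevance ratio above.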
Introduction
•The goal of clustering is to
–group data points that are close (or similar) to each other
–identify such groupings (or clusters) in an unsupervised manner
•Unsupervised: no information is provided to the algorithm on which data points belong to which clusters
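The grouping described above can be sketched with a minimal k-means loop (k-means is one common clustering algorithm, used here as an illustration; the points and k are made up):

```python
# Sketch: minimal k-means. Groups nearby points with no labels provided.
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # pick k initial centers
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]: two clusters of two points
```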
Using N-Grams
•For N-gram models
–P(wn-1, wn) = P(wn | wn-1) P(wn-1)
–By the chain rule we can decompose a joint probability, e.g. P(w1,w2,w3)
•P(w1,w2,...,wn) = P(w1|w2,w3,...,wn) P(w2|w3,...,wn) … P(wn-1|wn) P(wn)
•For bigrams then, the probability of a sequence is just the product of the conditional probabilities of its bigrams:
P(the, mythical, unicorn) = P(unicorn|mythical) P(mythical|the) P(the|<start>)
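The bigram product above can be checked on a toy corpus with maximum-likelihood estimates P(w|v) = count(v,w) / count(v) (the corpus is illustrative):

```python
# Sketch: bigram probability of a sequence from MLE counts.
from collections import Counter

corpus = ["<start>", "the", "mythical", "unicorn", "<end>"]
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(w, prev):
    # MLE bigram probability P(w | prev).
    return bigrams[(prev, w)] / unigrams[prev]

# P(the, mythical, unicorn)
#   = P(the|<start>) * P(mythical|the) * P(unicorn|mythical)
prob = p("the", "<start>") * p("mythical", "the") * p("unicorn", "mythical")
print(prob)  # 1.0 (every bigram occurs once after its unique predecessor)
```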
Outlines
•Semantic Networks
•Parsing
•Cross-Language Information Retrieval
•Introduction
•Crossing the Language Barrier
What is Cross-Language Information Retrieval?
•Definition: Select information in one language based on queries in another.
•Terminologies
–Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval)
–Translingual Information Retrieval (Defense Advanced Research Projects Agency - DARPA)
An Architecture of Cross-Language Information Retrieval
(Figure: architecture diagram; image not preserved)
Building Blocks for CLIR
(Diagram: CLIR draws on Information Retrieval, Artificial Intelligence, Speech Recognition, Information Science, and Computational Linguistics)
Major Problems of CLIR
•Queries and documents are in different languages.
–translation
•Words in a query may be ambiguous.
–disambiguation
•Queries are usually short.
–expansion
Major Problems of CLIR
•Queries may have to be segmented.
–segmentation
•A document may be in terms of various languages.
–language identification
Enhancing Traditional Information Retrieval Systems
•Which part(s) should be modified for CLIR?
(Diagram: (1) Documents feed (2) Document Representation; (3) Queries feed (4) Query Representation; both representations meet in a Comparison step)
Generating Mixed Ranked Lists of Documents
•Normalizing scales of relevance
–using aligned documents
–using ranks
–interleaving according to given ratios
•Mapping documents into the same space
–LSI
–document translations
Character Set/Font Handling
•Input and Display Support
–Special input modules for e.g. Asian languages
–Out-of-the-box support much improved thanks to modern web browsers
•Character Set/File Format
–Unicode/UTF-8
–XML
Too many factors in CLIR system evaluation
•translation
•automatic relevance feedback
•term expansion
•disambiguation
•result merging
•test collection
•need to isolate each factor to see what happened
TREC-6 Cross-Language Track
•In cooperation with the Swiss Federal Institute of Technology (ETH)
•Task Summary: retrieval of English, French, and German documents, both in a monolingual and a cross-lingual mode
•Documents
–SDA (1988-1990): French (250 MB), German (330 MB)
–Neue Zürcher Zeitung (1994): German (200 MB)
–AP (1988-1990): English (759 MB)
•13 participating groups
TREC-7 Cross-Language Track
•Task Summary: retrieval of English, French, German, and Italian documents
•Results to be returned as a single multilingual ranked list
•Addition of Italian SDA (1989-1990), 90 MB
•Addition of a subtask of 31,000 structured German social science documents (GIRT)
•9 participating groups
TREC-8 Cross-Language Track
•Tasks, documents and topic creation similar to TREC-7
•12 participating groups
CLIR in TREC-9
•Documents
–Hong Kong Commercial Daily, Hong Kong Daily News, Takungpao: all from 1999 and about 260 MB total
•25 new topics built in English; translations made to Chinese
(Diagram: an English-Chinese CLIR architecture. An English query undergoes query translation and query disambiguation to produce a Chinese query for a Chinese IR system over the collection; document translation runs in the opposite direction. English names are handled by a name search using a specific bilingual dictionary and machine transliteration into Chinese names; English titles are handled by a title search using a generic bilingual dictionary into Chinese titles; an NPDM component feeds the Chinese IR system.)
Title
•"Travelers among Mountains and Streams"
•"travelers", "mountains", and "streams" are basic components
•Users can express their information need through descriptions of a desired art work
•System will measure the similarity of art titles (descriptions) and a query
I-Match
•I-Match uses a hashing scheme that uses only some terms in a document.
•The decision of which terms to use is key to the success of the algorithm.
•I-Match is a hash of the document that uses collection statistics.
•The overall runtime of the I-Match approach is O(d log d) in the worst case, where all documents are duplicates of each other.
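A sketch of the flavor of the approach, assuming a document-frequency filter as the collection statistic (the thresholds, data, and helper name are illustrative assumptions, not the published I-Match parameters):

```python
# Sketch: an I-Match-style signature. Keep only terms whose collection
# statistics (here, mid-range document frequency) make them good
# discriminators, then hash the sorted survivors. Duplicate documents
# yield identical signatures.
import hashlib

def imatch_signature(doc_terms, doc_freq, n_docs, lo=0.05, hi=0.5):
    keep = sorted({t for t in doc_terms
                   if lo <= doc_freq.get(t, 0) / n_docs <= hi})
    return hashlib.sha1(" ".join(keep).encode("utf-8")).hexdigest()

df = {"the": 95, "retrieval": 20, "unicorn": 1}  # toy document frequencies
a = imatch_signature(["the", "retrieval", "unicorn"], df, 100)
b = imatch_signature(["retrieval", "the", "unicorn", "the"], df, 100)
print(a == b)  # True: same filtered term set, same signature
```

Too-common ("the") and too-rare ("unicorn") terms are dropped, so near-identical documents collapse to one hash that can be sorted or bucketed, giving the O(d log d) behavior noted above.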
UNIT-V
A Historical Progression
•Combining Separate Systems
–Queries are parsed and the structured portions are submitted as a query to the DBMS, while text search portions of the query are submitted to an information retrieval system.
–The results are combined and presented to the user.
•Example document markup (TREC WSJ collection):
<DOC>
<DOCNO> WSJ870323-0161 </DOCNO>
<HL> Who's News: Du Pont Co. </HL>
<DD> 03/23/87 </DD>
<DATELINE> Du Pont Company, Wilmington, DE </DATELINE>
<TEXT>
Commercial-vehicle sales in Italy rose 11.4% in February from a year earlier, to 8,848 units, according to provisional figures from the Italian Association of Auto Makers.
</TEXT>
</DOC>
Semi-Structured Search using a Relational Schema
•XML-QL, a query language developed at AT&T [Deutsch et al., 1999], was designed to meet the requirements of a full-featured XML query language set out by the W3C.
•The specification describing XPath as it is known today was released in 1999.
Static Relational Schema to support XML-QL
•This was first proposed in [Florescu and Kossman, 1999] to provide support for XML query processing.
•Later, in the IIT Information Retrieval Laboratory (www.ir.iit.edu), it was shown that a full XML-QL query language could be built using this basic structure.
•This is done by translating semi-structured XML-QL to SQL. The use of a static schema accommodates data of any XML schema without the need for document-type definitions or XML Schemas.
The hierarchy of XML documents is kept intact such that any document indexed into the database can be reconstructed using only the information in the tables. The relations used are:
TAG_NAME (TagId, tag)
ATTRIBUTE (AttributeId, attribute)
TAG_PATH (TagId, path)
DOCUMENT (DocId, fileName)
INDEX (Id, parent, path, type, tagId, attrId, pos, value)
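The relations can be sketched as a SQLite schema (column types are assumed for illustration; `INDEX` must be quoted because it is a reserved word in SQL):

```python
# Sketch: the static relational schema for XML documents, in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TAG_NAME  (TagId INTEGER, tag TEXT);
CREATE TABLE ATTRIBUTE (AttributeId INTEGER, attribute TEXT);
CREATE TABLE TAG_PATH  (TagId INTEGER, path TEXT);
CREATE TABLE DOCUMENT  (DocId INTEGER, fileName TEXT);
CREATE TABLE "INDEX"   (Id INTEGER, parent INTEGER, path TEXT, type TEXT,
                        tagId INTEGER, attrId INTEGER, pos INTEGER, value TEXT);
""")
conn.execute("INSERT INTO DOCUMENT VALUES (1, 'wsj870323.xml')")  # toy row
print(conn.execute("SELECT fileName FROM DOCUMENT").fetchone()[0])
```

Any XML document, regardless of its schema, is decomposed into rows of these five fixed tables, which is what lets XML-QL be translated to SQL without per-schema DDL.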
Distributed Information Retrieval System Model
•The centralized information retrieval system can be partitioned into n local information retrieval systems S1, S2, ..., Sn [Mazur, 1984]. Each system Sj is of the form Sj = (Tj, Dj, Rj, δj), where Tj is the thesaurus; Dj is the document collection; Rj the set of queries; and δj: Rj → 2^Dj maps the queries to documents.
•By taking the union of the local sites, it is possible to define the distributed information retrieval system.
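The union of local systems can be sketched as follows (toy indexes and a hypothetical `local_retrieve` helper; each local mapping plays the role of δj above):

```python
# Sketch: each local system S_j maps a query to a subset of its own
# collection D_j; the distributed answer is the union over all sites.
def local_retrieve(index, query_terms):
    # index: {doc_id: set of terms}; return docs containing any query term.
    return {d for d, terms in index.items() if terms & query_terms}

sites = [
    {"d1": {"mary", "dog"}, "d2": {"herbert"}},      # collection D_1
    {"d3": {"dog", "people"}, "d4": {"cheshire"}},   # collection D_2
]
query = {"dog"}
answer = set().union(*(local_retrieve(s, query) for s in sites))
print(sorted(answer))  # ['d1', 'd3']
```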
•Consider a set of documents with the following descriptors:
•D1 = (Mary, Harold, Herbert)
•D2 = (Herbert, dog)
•D3 = (people, dog)
•D4 = (Mary, cheshire)
•D5 = (Mary, dog)
•D6 = (Herbert, black-cat, doberman)
•D7 = (Herbert, doberman)
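These descriptors can be compared pairwise, for example with the Dice coefficient (using a set-overlap measure here is an illustrative choice, the kind of similarity a clustering of these documents could start from):

```python
# Sketch: Dice similarity between the descriptor sets D1..D7.
docs = {
    "D1": {"Mary", "Harold", "Herbert"},
    "D2": {"Herbert", "dog"},
    "D3": {"people", "dog"},
    "D4": {"Mary", "cheshire"},
    "D5": {"Mary", "dog"},
    "D6": {"Herbert", "black-cat", "doberman"},
    "D7": {"Herbert", "doberman"},
}

def dice(a, b):
    # Dice coefficient: 2|A ∩ B| / (|A| + |B|).
    return 2 * len(a & b) / (len(a) + len(b))

print(round(dice(docs["D6"], docs["D7"]), 2))  # 0.8: D6 and D7 share two terms
```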