
Modern Information Retrieval
Chapter 3: Modeling

Part I: Classic Models
Introduction to IR Models
Basic Concepts
The Boolean Model
Term Weighting
The Vector Model
Probabilistic Model
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 1

IR Models
Modeling in IR is a complex process aimed at producing a ranking function
Ranking function: a function that assigns scores to documents with regard to a given query
This process consists of two main tasks:
The conception of a logical framework for representing documents and queries
The definition of a ranking function that allows quantifying the similarities among documents and queries

Modeling and Ranking
IR systems usually adopt index terms to index and retrieve documents
Index term:
In a restricted sense: it is a keyword that has some meaning on its own; usually plays the role of a noun
In a more general form: it is any word that appears in a document
Retrieval based on index terms can be implemented efficiently
Also, index terms are simple to refer to in a query
Simplicity is important because it reduces the effort of query formulation

Introduction
Information retrieval process
[Figure: documents and the user's information need are both represented as index terms; the system matches query terms against document terms to produce a ranking]

Introduction
A ranking is an ordering of the documents that (hopefully) reflects their relevance to a user query
Thus, any IR system has to deal with the problem of predicting which documents the users will find relevant
This problem naturally embodies a degree of uncertainty, or vagueness

IR Models
An IR model is a quadruple [D, Q, F, R(qi, dj)] where
1. D is a set of logical views for the documents in the collection
2. Q is a set of logical views for the user queries
3. F is a framework for modeling documents and queries
4. R(qi, dj) is a ranking function
[Figure: documents d in D and queries q in Q are mapped into the framework F, where the ranking function R scores each query-document pair]

A Taxonomy of IR Models
[Figure: taxonomy of IR models]

Retrieval: Ad Hoc vs Filtering
Ad Hoc Retrieval:
[Figure: a document collection receiving successive user queries Q1, Q2, Q3, Q4]

Retrieval: Ad Hoc vs Filtering
Filtering:
[Figure: a stream of incoming documents routed to user 1 and user 2]

Basic Concepts
Each document is represented by a set of representative keywords or index terms
An index term is a word or group of consecutive words in a document
A pre-selected set of index terms can be used to summarize the document contents
However, it might be interesting to assume that all words are index terms (full text representation)

Basic Concepts
Let,
t be the number of index terms in the document collection
ki be a generic index term
Then,
The vocabulary V = {k1, ..., kt} is the set of all distinct index terms in the collection
[Figure: the vocabulary V = {k1, k2, k3, ..., kt} collected from the documents]

Basic Concepts
Documents and queries can be represented by patterns of term co-occurrences
[Figure: example patterns of term co-occurrence over the vocabulary V]
Each of these patterns of term co-occurrence is called a term conjunctive component
For each document dj (or query q) we associate a unique term conjunctive component c(dj) (or c(q))

The Term-Document Matrix
The occurrence of a term ki in a document dj establishes a relation between ki and dj
A term-document relation between ki and dj can be quantified by the frequency of the term in the document
In matrix form, this can be written as

        d1    d2
  k1 [ f1,1  f1,2 ]
  k2 [ f2,1  f2,2 ]
  k3 [ f3,1  f3,2 ]

where each fi,j element stands for the frequency of term ki in document dj

Basic Concepts
Logical view of a document: from full text to a set of index terms
[Figure: successive text operations reduce the full text of a document to its set of index terms]

The Boolean Model

The Boolean Model
Simple model based on set theory and boolean algebra
Queries specified as boolean expressions
quite intuitive and precise semantics
neat formalism
example of query: q = ka ∧ (kb ∨ ¬kc)
Term-document frequencies in the term-document matrix are all binary
wij ∈ {0,1}: weight associated with pair (ki, dj)
wiq ∈ {0,1}: weight associated with pair (ki, q)

The Boolean Model
A term conjunctive component that satisfies a query q is called a query conjunctive component c(q)
A query q rewritten as a disjunction of those components is called the disjunctive normal form qDNF
To illustrate, consider
query q = ka ∧ (kb ∨ ¬kc)
vocabulary V = {ka, kb, kc}
Then
qDNF = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
c(q): a conjunctive component for q

The Boolean Model
The three conjunctive components for the query q = ka ∧ (kb ∨ ¬kc)
[Figure: Venn diagram over Ka, Kb, Kc showing the conjunctive components, e.g., (1,1,1)]

The Boolean Model
This approach works even if the vocabulary of the collection includes terms not in the query
Consider that the vocabulary is given by V = {ka, kb, kc, kd}
Then, a document dj that contains only terms ka, kb, and kc is represented by c(dj) = (1,1,1,0)
The query [q = ka ∧ (kb ∨ ¬kc)] is represented in disjunctive normal form as
qDNF = (1,1,1,0) ∨ (1,1,1,1) ∨ (1,1,0,0) ∨ (1,1,0,1) ∨ (1,0,0,0) ∨ (1,0,0,1)

The Boolean Model
The similarity of the document dj to the query q is defined as

sim(dj,q) = 1  if ∃ c(q) | c(q) = c(dj)
            0  otherwise

The Boolean model predicts that each document is either relevant or non-relevant
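As a minimal sketch (not from the slides), the binary Boolean similarity can be expressed over a document's set of index terms; the predicate below hard-codes the example query q = ka ∧ (kb ∨ ¬kc):

```python
# Minimal sketch of the Boolean model: a document matches the query
# q = ka AND (kb OR NOT kc) iff its conjunctive component satisfies q.
def sim(doc_terms, query):
    """Boolean similarity: 1 if the document's term set satisfies the
    query predicate, 0 otherwise (a binary decision, no ranking)."""
    return 1 if query(doc_terms) else 0

# q = ka ∧ (kb ∨ ¬kc), expressed over a document's set of index terms
q = lambda d: "ka" in d and ("kb" in d or "kc" not in d)

print(sim({"ka", "kb", "kc"}, q))  # component (1,1,1) -> 1
print(sim({"ka", "kc"}, q))        # component (1,0,1) -> 0
print(sim({"ka"}, q))              # component (1,0,0) -> 1
```

The three documents whose components appear in qDNF get similarity 1; every other document gets 0.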

Drawbacks of the Boolean Model
Retrieval based on binary decision criteria with no notion of partial matching
No ranking of the documents is provided (absence of a grading scale)
Information need has to be translated into a Boolean expression, which most users find awkward
The Boolean queries formulated by the users are most often too simplistic
The model frequently returns either too few or too many documents in response to a user query

Term Weighting

Term Weighting
The terms of a document are not equally useful for describing the document contents
In fact, there are index terms which are simply vaguer than others
There are properties of an index term which are useful for evaluating the importance of the term in a document
For instance, a word which appears in all documents of a collection is completely useless for retrieval tasks

Term Weighting
To characterize term importance, we associate a weight wi,j > 0 with each term ki that occurs in the document dj
If ki does not appear in the document dj, then wi,j = 0
The weight wi,j quantifies the importance of the index term ki for describing the contents of document dj
These weights are useful to compute a rank for each document in the collection with regard to a given query

Term Weighting
Let,
ki be an index term and dj be a document
V = {k1, k2, ..., kt} be the set of all index terms
wi,j > 0 be the weight associated with (ki, dj)
Then we define ~dj = (w1,j, w2,j, ..., wt,j) as a weighted vector that contains the weight wi,j of each term ki ∈ V in the document dj
[Figure: the weights wi,j of the terms k1, k2, ..., kt of the vocabulary V compose the vector ~dj]

Term Weighting
The weights wi,j can be computed using the frequencies of occurrence of the terms within documents
Let fi,j be the frequency of occurrence of index term ki in the document dj
The total frequency of occurrence Fi of term ki in the collection is defined as

Fi = Σj=1..N fi,j

where N is the number of documents in the collection

Term Weighting
The document frequency ni of a term ki is the number of documents in which it occurs
Notice that ni ≤ Fi
For instance, in the document collection below, the values fi,j, Fi and ni associated with the term do are
f(do,d1) = 2
f(do,d2) = 0
f(do,d3) = 3
f(do,d4) = 3
F(do) = 8
n(do) = 3

d1: To do is to be. To be is to do.
d2: To be or not to be. I am what I am.
d3: I think therefore I am. Do be do be do.
d4: Do do do, da da da. Let it be, let it be.
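The counts above can be reproduced directly; the sketch below computes fi,j, Fi and ni for the term "do" over the four example documents (tokenization by whitespace plus punctuation stripping is an assumption, but it matches the slide's counts):

```python
# Term statistics for the example collection: per-document frequency
# f(i,j), collection frequency Fi, and document frequency ni of "do".
docs = {
    "d1": "To do is to be. To be is to do.",
    "d2": "To be or not to be. I am what I am.",
    "d3": "I think therefore I am. Do be do be do.",
    "d4": "Do do do, da da da. Let it be, let it be.",
}

def tokens(text):
    # Lowercase and strip punctuation so "Do," and "do" count as one term
    return [w.strip(".,").lower() for w in text.split()]

f = {d: tokens(t).count("do") for d, t in docs.items()}
F = sum(f.values())                      # total occurrences in the collection
n = sum(1 for c in f.values() if c > 0)  # number of documents containing "do"

print(f)     # {'d1': 2, 'd2': 0, 'd3': 3, 'd4': 3}
print(F, n)  # 8 3
```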

Term-term correlation matrix
For classic information retrieval models, the index term weights are assumed to be mutually independent
This means that wi,j tells us nothing about wi+1,j
This is clearly a simplification because occurrences of index terms in a document are not uncorrelated
For instance, the terms computer and network tend to appear together in a document about computer networks
In this document, the appearance of one of these terms attracts the appearance of the other
Thus, they are correlated and their weights should reflect this correlation

Term-term correlation matrix
To take into account term-term correlations, we can compute a correlation matrix
Let ~M = (mij) be a term-document matrix t × N where mij = wi,j
The matrix ~C = ~M ~M^t is a term-term correlation matrix
Each element cu,v ∈ C expresses a correlation between terms ku and kv, given by

cu,v = Σdj wu,j × wv,j

The higher the number of documents in which the terms ku and kv co-occur, the stronger is this correlation

Term-term correlation matrix
Term-term correlation matrix for a sample collection
M (rows k1, k2, k3; columns d1, d2):
  [ w1,1  w1,2 ]
  [ w2,1  w2,2 ]
  [ w3,1  w3,2 ]
M × M^T (rows and columns k1, k2, k3):
  [ w1,1w1,1 + w1,2w1,2   w1,1w2,1 + w1,2w2,2   w1,1w3,1 + w1,2w3,2 ]
  [ w2,1w1,1 + w2,2w1,2   w2,1w2,1 + w2,2w2,2   w2,1w3,1 + w2,2w3,2 ]
  [ w3,1w1,1 + w3,2w1,2   w3,1w2,1 + w3,2w2,2   w3,1w3,1 + w3,2w3,2 ]
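The product C = M M^T can be sketched in a few lines; the weight values below are made-up illustrative numbers, not from the slides:

```python
# Sketch: term-term correlation matrix C = M · Mᵀ for a tiny 3-term,
# 2-document weight matrix M (rows = terms, columns = documents).
# The weights are hypothetical, chosen only to illustrate the product.
M = [
    [1, 0],  # k1 weights in d1, d2
    [2, 1],  # k2
    [0, 3],  # k3
]

def matmul_transpose(M):
    """Return C with c[u][v] = Σ_j M[u][j] * M[v][j]."""
    t = len(M)
    return [[sum(M[u][j] * M[v][j] for j in range(len(M[u])))
             for v in range(t)] for u in range(t)]

C = matmul_transpose(M)
print(C)  # [[1, 2, 0], [2, 5, 3], [0, 3, 9]]
```

Each diagonal entry c[u][u] is the squared weight mass of term ku; off-diagonal entries are nonzero only when two terms co-occur in some document.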

TF-IDF Weights

TF-IDF Weights
TF-IDF term weighting scheme:
Term frequency (TF)
Inverse document frequency (IDF)
Foundations of the most popular term weighting scheme in IR

Term Frequency (TF) Weights
Luhn Assumption. The value of wi,j is proportional to the term frequency fi,j
That is, the more often a term occurs in the text of the document, the higher its weight
This is based on the observation that high frequency terms are important for describing documents
Which leads directly to the following tf weight formulation:
tfi,j = fi,j

Term Frequency (TF) Weights
A variant of tf weight used in the literature is

tfi,j = 1 + log fi,j   if fi,j > 0
        0              otherwise

where the log is taken in base 2
The log expression is the preferred form because it makes the tf weights directly comparable to idf weights, as we later discuss

Term Frequency (TF) Weights
Log tf weights tfi,j for the example collection

     Vocabulary   tfi,1   tfi,2   tfi,3   tfi,4
  1  to           3       2       -       -
  2  do           2       -       2.585   2.585
  3  is           2       -       -       -
  4  be           2       2       2       2
  5  or           -       1       -       -
  6  not          -       1       -       -
  7  I            -       2       2       -
  8  am           -       2       1       -
  9  what         -       1       -       -
  10 think        -       -       1       -
  11 therefore    -       -       1       -
  12 da           -       -       -       2.585
  13 let          -       -       -       2
  14 it           -       -       -       2
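The table's entries follow from the base-2 log tf formula; a minimal sketch:

```python
import math

# Log tf variant: tf = 1 + log2(f) when f > 0, else 0
def log_tf(f):
    return 1 + math.log2(f) if f > 0 else 0

print(log_tf(4))            # 3.0    ("to" occurs 4 times in d1)
print(round(log_tf(3), 3))  # 2.585  ("do" occurs 3 times in d3)
print(log_tf(0))            # 0      (term absent from the document)
```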

Inverse Document Frequency
We call document exhaustivity the number of index terms assigned to a document
The more index terms are assigned to a document, the higher is the probability of retrieval for that document
If too many terms are assigned to a document, it will be retrieved by queries for which it is not relevant
Optimal exhaustivity. We can circumvent this problem by optimizing the number of terms per document
Another approach is to weight the terms differently, by exploring the notion of term specificity

Inverse Document Frequency
Specificity is a property of the term semantics
A term is more or less specific depending on its meaning
To exemplify, the term beverage is less specific than the terms tea and beer
We could expect that the term beverage occurs in more documents than the terms tea and beer
Term specificity should be interpreted as a statistical rather than semantic property of the term
Statistical term specificity. The inverse of the number of documents in which the term occurs

Inverse Document Frequency
Terms are distributed in a text according to Zipf's Law
Thus, if we sort the vocabulary terms in decreasing order of document frequencies we have
n(r) ∼ r^(−α)
where n(r) refers to the rth largest document frequency and α is an empirical constant
That is, the document frequency of term ki is a power function of its rank:
n(r) = C r^(−α)
where C is a second empirical constant

Inverse Document Frequency
Setting α = 1 (simple approximation for English collections) and taking logs we have
log n(r) = log C − log r
For r = 1, we have C = n(1), i.e., the value of C is the largest document frequency
This value works as a normalization constant
An alternative is to do the normalization assuming C = N, where N is the number of docs in the collection
log r ∼ log N − log n(r)

Inverse Document Frequency
Let ki be the term with the rth largest document frequency, i.e., n(r) = ni. Then,

idfi = log (N / ni)

where idfi is called the inverse document frequency of term ki
Idf provides a foundation for modern term weighting schemes and is used for ranking in almost all IR systems

Inverse Document Frequency
Idf values for the example collection

     term        ni   idfi = log(N/ni)
  1  to          2    1
  2  do          3    0.415
  3  is          1    2
  4  be          4    0
  5  or          1    2
  6  not         1    2
  7  I           2    1
  8  am          2    1
  9  what        1    2
  10 think       1    2
  11 therefore   1    2
  12 da          1    2
  13 let         1    2
  14 it          1    2
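The idf column can be checked directly against the example collection; the sketch below builds the set of terms per document and applies idfi = log2(N/ni):

```python
import math

# idf_i = log2(N / n_i) for the 4-document example collection
docs = [
    "To do is to be. To be is to do.",
    "To be or not to be. I am what I am.",
    "I think therefore I am. Do be do be do.",
    "Do do do, da da da. Let it be, let it be.",
]
N = len(docs)
term_sets = [{w.strip(".,").lower() for w in d.split()} for d in docs]

def idf(term):
    n_i = sum(1 for s in term_sets if term.lower() in s)
    return math.log2(N / n_i)

print(round(idf("do"), 3))  # 0.415 (in 3 of 4 docs)
print(idf("to"))            # 1.0   (in 2 of 4 docs)
print(idf("be"))            # 0.0   (in every doc: useless for retrieval)
```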

TF-IDF weighting scheme
The best known term weighting schemes use weights that combine idf factors with term frequencies
Let wi,j be the term weight associated with the term ki and the document dj
Then, we define

wi,j = (1 + log fi,j) × log (N / ni)   if fi,j > 0
       0                               otherwise

which is referred to as a tf-idf weighting scheme

TF-IDF weighting scheme
Tf-idf weights of all terms present in our example document collection

     term        d1      d2   d3      d4
  1  to          3       2    -       -
  2  do          0.830   -    1.073   1.073
  3  is          4       -    -       -
  4  be          -       -    -       -
  5  or          -       2    -       -
  6  not         -       2    -       -
  7  I           -       2    2       -
  8  am          -       2    1       -
  9  what        -       2    -       -
  10 think       -       -    2       -
  11 therefore   -       -    2       -
  12 da          -       -    -       5.170
  13 let         -       -    -       4
  14 it          -       -    -       4
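A one-function sketch of the scheme reproduces the table's entries from (fi,j, ni, N):

```python
import math

# tf-idf sketch: w_ij = (1 + log2 f_ij) * log2(N / n_i), 0 when f_ij = 0
def tf_idf(f_ij, n_i, N):
    if f_ij == 0:
        return 0.0
    return (1 + math.log2(f_ij)) * math.log2(N / n_i)

# "do" in d1: f = 2, and "do" appears in n = 3 of N = 4 documents
print(round(tf_idf(2, 3, 4), 3))  # 0.83
# "da" in d4: f = 3, and "da" appears in 1 of 4 documents
print(round(tf_idf(3, 1, 4), 3))  # 5.17
```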

Variants of TF-IDF
Several variations of the above expression for tf-idf weights are described in the literature
For tf weights, five distinct variants are illustrated below

  tf weight
  binary                     {0,1}
  raw frequency              fi,j
  log normalization          1 + log fi,j
  double normalization 0.5   0.5 + 0.5 × fi,j / maxi fi,j
  double normalization K     K + (1−K) × fi,j / maxi fi,j
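The five tf variants can be written as small functions of the raw frequency f and the largest in-document frequency max_f; the K = 0.4 default below is an illustrative choice, not from the slides:

```python
import math

# The five tf variants from the table above
def binary(f, max_f):      return 1 if f > 0 else 0
def raw(f, max_f):         return f
def log_norm(f, max_f):    return 1 + math.log2(f) if f > 0 else 0
def double_half(f, max_f): return 0.5 + 0.5 * f / max_f
def double_K(f, max_f, K=0.4):  # K is a tunable constant in [0, 1]
    return K + (1 - K) * f / max_f

print(double_half(2, 4))  # 0.75
```

The double-normalization forms damp the effect of document length: a term's weight depends on its frequency relative to the most frequent term in the same document.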

Variants of TF-IDF
Five distinct variants of idf weight

  idf weight
  unary                           1
  inverse frequency               log (N / ni)
  inverse frequency smooth        log (1 + N / ni)
  inverse frequency max           log (1 + maxi ni / ni)
  probabilistic inverse frequency log ((N − ni) / ni)

Variants of TF-IDF
Recommended tf-idf weighting schemes

  weighting scheme   document term weight             query term weight
  1                  fi,j × log (N / ni)              (0.5 + 0.5 × fi,q / maxi fi,q) × log (N / ni)
  2                  1 + log fi,j                     log (1 + N / ni)
  3                  (1 + log fi,j) × log (N / ni)    (1 + log fi,q) × log (N / ni)

TF-IDF Properties
Consider the tf, idf, and tf-idf weights for the Wall Street Journal reference collection
To study their behavior, we would like to plot them together
While idf is computed over all the collection, tf is computed on a per document basis. Thus, we need a representation of tf based on all the collection, which is provided by the term collection frequency Fi
This reasoning leads to the following tf and idf term weights:

tfi = 1 + log Σj=1..N fi,j        idfi = log (N / ni)

TF-IDF Properties
Plotting tf and idf in logarithmic scale yields the figure below
[Figure: tf and idf weights plotted against term rank in log-log scale]
We observe that tf and idf weights present power-law behaviors that balance each other
The terms of intermediate idf values display maximum tf-idf weights and are most interesting for ranking

Document Length Normalization
Document sizes might vary widely
This is a problem because longer documents are more likely to be retrieved by a given query
To compensate for this undesired effect, we can divide the rank of each document by its length
This procedure consistently leads to better ranking, and it is called document length normalization

Document Length Normalization
Methods of document length normalization depend on the representation adopted for the documents:
Size in bytes: consider that each document is represented simply as a stream of bytes
Number of words: each document is represented as a single string, and the document length is the number of words in it
Vector norms: documents are represented as vectors of weighted terms

Document Length Normalization
Documents represented as vectors of weighted terms
Each term of a collection is associated with an orthonormal unit vector ~ki in a t-dimensional space
Each term ki of a document dj is associated with the term vector component wi,j × ~ki

Document Length Normalization
The document representation ~dj is a vector composed of all its term vector components
~dj = (w1,j, w2,j, ..., wt,j)
The document length is given by the norm of this vector, which is computed as follows

|~dj| = sqrt( Σi wi,j² )

Document Length Normalization
Three variants of document lengths for the example collection

                    d1      d2      d3      d4
  size in bytes     34      37      41      43
  number of words   10      11      10      12
  vector norm       5.068   4.899   3.762   7.738
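The vector-norm row can be verified from the tf-idf table; for d1 the nonzero weights are to = 3, do ≈ 0.830, is = 4 (be has idf 0):

```python
import math

# Vector-norm document length |d| = sqrt(Σ w_i²) using d1's tf-idf weights
w_d1 = [
    3.0,                                  # to
    (1 + math.log2(2)) * math.log2(4/3),  # do ≈ 0.830, computed exactly
    4.0,                                  # is
    0.0,                                  # be (idf = 0)
]
norm = math.sqrt(sum(w * w for w in w_d1))
print(round(norm, 3))  # 5.068
```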

The Vector Model

The Vector Model
Boolean matching and binary weights are too limiting
The vector model proposes a framework in which partial matching is possible
This is accomplished by assigning non-binary weights to index terms in queries and in documents
Term weights are used to compute a degree of similarity between a query and each document
The documents are ranked in decreasing order of their degree of similarity

The Vector Model
For the vector model:
The weight wi,j associated with a pair (ki, dj) is positive and non-binary
The index terms are assumed to be all mutually independent
They are represented as unit vectors of a t-dimensional space (t is the total number of index terms)
The representations of document dj and query q are t-dimensional vectors given by
~dj = (w1j, w2j, ..., wtj)
~q = (w1q, w2q, ..., wtq)

The Vector Model
Similarity between a document dj and a query q
[Figure: vectors ~dj and ~q in term space, separated by the angle θ]

sim(dj,q) = cos(θ) = (~dj • ~q) / (|~dj| × |~q|)
          = ( Σi=1..t wi,j × wi,q ) / ( sqrt(Σi=1..t wi,j²) × sqrt(Σi=1..t wi,q²) )

Since wi,j ≥ 0 and wi,q ≥ 0, we have 0 ≤ sim(dj,q) ≤ 1

The Vector Model
Weights in the Vector model are basically tf-idf weights
wi,q = (1 + log fi,q) × log (N / ni)
wi,j = (1 + log fi,j) × log (N / ni)
These equations should only be applied for values of term frequency greater than zero
If the term frequency is zero, the respective weight is also zero

The Vector Model
Document ranks computed by the Vector model for the query "to do" (see the tf-idf weight table)

  doc   rank computation                rank
  d1    (1×3 + 0.415×0.830) / 5.068     0.660
  d2    (1×2 + 0.415×0) / 4.899         0.408
  d3    (1×0 + 0.415×1.073) / 3.762     0.118
  d4    (1×0 + 0.415×1.073) / 7.738     0.058
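These ranks can be reproduced with a short sketch. As in the table, the dot product is divided by the document norm only; the query norm is the same for every document and does not change the ordering:

```python
# Vector-model ranking for the query "to do", normalizing by the
# document norm only (the query norm is a constant factor per query)
w_docs = {   # tf-idf weights of the query terms in each document
    "d1": {"to": 3.0, "do": 0.830},
    "d2": {"to": 2.0, "do": 0.0},
    "d3": {"to": 0.0, "do": 1.073},
    "d4": {"to": 0.0, "do": 1.073},
}
norms = {"d1": 5.068, "d2": 4.899, "d3": 3.762, "d4": 7.738}
w_q = {"to": 1.0, "do": 0.415}   # query weights: (1 + log2 1) * idf

def rank(d):
    dot = sum(w_q[t] * w_docs[d][t] for t in w_q)
    return dot / norms[d]

for d in sorted(w_docs, key=rank, reverse=True):
    print(d, round(rank(d), 3))
# d1 0.66, d2 0.408, d3 0.118, d4 0.058
```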

The Vector Model
Advantages:
term-weighting improves quality of the answer set
partial matching allows retrieval of docs that approximate the query conditions
cosine ranking formula sorts documents according to a degree of similarity to the query
document length normalization is naturally built into the ranking
Disadvantages:
It assumes independence of index terms

Probabilistic Model

Probabilistic Model
The probabilistic model captures the IR problem using a probabilistic framework
Given a user query, there is an ideal answer set for this query
Given a description of this ideal answer set, we could retrieve the relevant documents
Querying is seen as a specification of the properties of this ideal answer set
But, what are these properties?

Probabilistic Model
An initial set of documents is retrieved somehow
The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
The IR system uses this information to refine the description of the ideal answer set
By repeating this process, it is expected that the description of the ideal answer set will improve

Probabilistic Ranking Principle
The probabilistic model
Tries to estimate the probability that a document will be relevant to a user query
Assumes that this probability depends on the query and document representations only
The ideal answer set, referred to as R, should maximize the probability of relevance
But,
How to compute these probabilities?
What is the sample space?

The Ranking
Let,
R be the set of relevant documents to query q
R̄ be the set of non-relevant documents to query q
P(R | ~dj) be the probability that dj is relevant to the query q
P(R̄ | ~dj) be the probability that dj is non-relevant to q
The similarity sim(dj,q) can be defined as

sim(dj,q) = P(R | ~dj) / P(R̄ | ~dj)

The Ranking
Using Bayes' rule,

sim(dj,q) = [ P(~dj | R, q) × P(R, q) ] / [ P(~dj | R̄, q) × P(R̄, q) ]
          ∼ P(~dj | R, q) / P(~dj | R̄, q)

where
P(~dj | R, q): probability of randomly selecting the document dj from the set R
P(R, q): probability that a document randomly selected from the entire collection is relevant to query q
P(~dj | R̄, q) and P(R̄, q): analogous and complementary

The Ranking
Assuming that the weights wi,j are all binary and assuming independence among the index terms:

sim(dj,q) ∼ [ (Π ki|wi,j=1 P(ki | R, q)) × (Π ki|wi,j=0 P(k̄i | R, q)) ]
          / [ (Π ki|wi,j=1 P(ki | R̄, q)) × (Π ki|wi,j=0 P(k̄i | R̄, q)) ]

where
P(ki | R, q): probability that the term ki is present in a document randomly selected from the set R
P(k̄i | R, q): probability that ki is not present in a document randomly selected from the set R
probabilities with R̄: analogous to the ones just described

The Ranking
To simplify our notation, let us adopt the following conventions
piR = P(ki | R, q)
qiR = P(ki | R̄, q)
Since
P(ki | R, q) + P(k̄i | R, q) = 1
P(ki | R̄, q) + P(k̄i | R̄, q) = 1
we can write:

sim(dj,q) ∼ [ (Π ki|wi,j=1 piR) × (Π ki|wi,j=0 (1 − piR)) ]
          / [ (Π ki|wi,j=1 qiR) × (Π ki|wi,j=0 (1 − qiR)) ]

The Ranking
Taking logarithms, we write

sim(dj,q) ∼ log Π ki|wi,j=1 piR + log Π ki|wi,j=0 (1 − piR)
          − log Π ki|wi,j=1 qiR − log Π ki|wi,j=0 (1 − qiR)

The Ranking
Summing up terms that cancel each other, we obtain

sim(dj,q) ∼ log Π ki|wi,j=1 piR + log Π ki|wi,j=0 (1 − piR)
          − log Π ki|wi,j=1 (1 − piR) + log Π ki|wi,j=1 (1 − piR)
          − log Π ki|wi,j=1 qiR − log Π ki|wi,j=0 (1 − qiR)
          + log Π ki|wi,j=1 (1 − qiR) − log Π ki|wi,j=1 (1 − qiR)

The Ranking
Using logarithm operations, we obtain

sim(dj,q) ∼ log Π ki|wi,j=1 [ piR / (1 − piR) ] + log Π ki (1 − piR)
          + log Π ki|wi,j=1 [ (1 − qiR) / qiR ] − log Π ki (1 − qiR)

Notice that two of the factors in the formula above are a function of all index terms and do not depend on document dj. They are constants for a given query and can be disregarded for the purpose of ranking

The Ranking
Further, assuming that
∀ ki ∉ q, piR = qiR
and converting the log products into sums of logs, we finally obtain

sim(dj,q) ∼ Σ ki∈q∧ki∈dj [ log ( piR / (1 − piR) ) + log ( (1 − qiR) / qiR ) ]

which is a key expression for ranking computation in the probabilistic model

Term Incidence Contingency Table
Let,
N be the number of documents in the collection
ni be the number of documents that contain term ki
R be the total number of relevant documents to query q
ri be the number of relevant documents that contain term ki
Based on these variables, we can build the following contingency table

                                relevant   non-relevant        all docs
  docs that contain ki          ri         ni − ri             ni
  docs that do not contain ki   R − ri     N − ni − (R − ri)   N − ni
  all docs                      R          N − R               N

Ranking Formula
If information on the contingency table were available for a given query, we could write
piR = ri / R
qiR = (ni − ri) / (N − R)
Then, the equation for ranking computation in the probabilistic model could be rewritten as

sim(dj,q) ∼ Σ ki[q,dj] log [ ( ri / (R − ri) ) × ( (N − ni − R + ri) / (ni − ri) ) ]

where ki[q,dj] is a short notation for ki ∈ q ∧ ki ∈ dj

Ranking Formula
In the previous formula, we are still dependent on an estimation of the relevant docs for the query
For handling small values of ri, we add 0.5 to each of the terms in the formula above, which changes sim(dj,q) into

Σ ki[q,dj] log [ ( (ri + 0.5) / (R − ri + 0.5) ) × ( (N − ni − R + ri + 0.5) / (ni − ri + 0.5) ) ]

This formula is considered as the classic ranking equation for the probabilistic model and is known as the Robertson-Sparck Jones Equation
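The smoothed per-term weight can be sketched as a single function; the check below confirms that with no relevance information (R = ri = 0) it reduces to the idf-like form discussed on the next slide:

```python
import math

# Robertson-Sparck Jones term weight with 0.5 smoothing:
# log[ (ri+0.5)/(R-ri+0.5) * (N-ni-R+ri+0.5)/(ni-ri+0.5) ]
def rsj(N, n_i, R, r_i):
    return math.log2(((r_i + 0.5) / (R - r_i + 0.5)) *
                     ((N - n_i - R + r_i + 0.5) / (n_i - r_i + 0.5)))

# With R = ri = 0 this equals log2((N - ni + 0.5) / (ni + 0.5))
N, n_i = 4, 2
assert abs(rsj(N, n_i, 0, 0) -
           math.log2((N - n_i + 0.5) / (n_i + 0.5))) < 1e-12
print(rsj(N, n_i, 0, 0))  # 0.0, i.e. log2(2.5/2.5)
```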

Ranking Formula
The previous equation cannot be computed without estimates of ri and R
One possibility is to assume R = ri = 0, as a way to bootstrap the ranking equation, which leads to

sim(dj,q) ∼ Σ ki[q,dj] log [ (N − ni + 0.5) / (ni + 0.5) ]

This equation provides an idf-like ranking computation
In the absence of relevance information, this is the equation for ranking in the probabilistic model

Ranking Example

Document ranks computed by the previous probabilistic ranking equation for the query "to do"

  doc   rank computation                                        rank
  d1    log((4−2+0.5)/(2+0.5)) + log((4−3+0.5)/(3+0.5))        −1.222
  d2    log((4−2+0.5)/(2+0.5))                                  0
  d3    log((4−3+0.5)/(3+0.5))                                 −1.222
  d4    log((4−3+0.5)/(3+0.5))                                 −1.222

RankingExample
Therankingcomputationledtonegativeweights
becauseoftheterm“do”
Actually,theprobabilisticrankingequationproduces
negativetermswheneverni>N/2
Onepossibleartifacttocontaintheeffectofnegative
weightsistochangethepreviousequationto:
sim(dj,q)∼
X
ki[q,dj]
log
θ
N+0.5
ni+0.5

Bydoingso,atermthatoccursinalldocuments
(ni=N)producesaweightequaltozero
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 78

Ranking Example

Using this latest formulation, we redo the ranking computation for our example collection for the query "to do" and obtain

  doc   rank computation                                rank
  d1    log((4+0.5)/(2+0.5)) + log((4+0.5)/(3+0.5))    1.210
  d2    log((4+0.5)/(2+0.5))                            0.847
  d3    log((4+0.5)/(3+0.5))                            0.362
  d4    log((4+0.5)/(3+0.5))                            0.362
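Both variants of the weight can be checked with a few lines of Python (log base 2, as in the tables; the function names are ours):

```python
import math

def w_classic(N, n_i):
    """log((N - n_i + 0.5)/(n_i + 0.5)): negative whenever n_i > N/2."""
    return math.log2((N - n_i + 0.5) / (n_i + 0.5))

def w_nonneg(N, n_i):
    """log((N + 0.5)/(n_i + 0.5)): reaches zero only when n_i = N."""
    return math.log2((N + 0.5) / (n_i + 0.5))

# Query "to do" over the 4-document example: n_to = 2, n_do = 3
rank_d1_classic = w_classic(4, 2) + w_classic(4, 3)   # ~ -1.22
rank_d1_nonneg = w_nonneg(4, 2) + w_nonneg(4, 3)      # ~ 1.21
```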

Estimating r_i and R

Our examples above considered that r_i = R = 0

An alternative is to estimate r_i and R by performing an initial search:
  select the top 10-20 ranked documents
  inspect them to gather new estimates for r_i and R
  remove the 10-20 documents used from the collection
  rerun the query with the estimates obtained for r_i and R

Unfortunately, procedures such as these require human intervention to initially select the relevant documents

Improving the Initial Ranking

Consider the equation

sim(d_j, q) ∼ Σ_{k_i ∈ q ∧ k_i ∈ d_j} [ log( p_{iR} / (1 − p_{iR}) ) + log( (1 − q_{iR}) / q_{iR} ) ]

How do we obtain the probabilities p_{iR} and q_{iR}?

Estimates based on assumptions:
  p_{iR} = 0.5
  q_{iR} = n_i / N
  where n_i is the number of docs that contain k_i

Use this initial guess to retrieve an initial ranking

Improve upon this initial ranking

Improving the Initial Ranking

Substituting p_{iR} and q_{iR} into the previous equation, we obtain:

sim(d_j, q) ∼ Σ_{k_i ∈ q ∧ k_i ∈ d_j} log( (N − n_i) / n_i )

That is the equation used when no relevance information is provided, without the 0.5 correction factor

Given this initial guess, we can provide an initial probabilistic ranking

After that, we can attempt to improve this initial ranking as follows

Improving the Initial Ranking

We can attempt to improve this initial ranking as follows. Let
  D: set of docs initially retrieved
  D_i: subset of docs retrieved that contain k_i

Reevaluate estimates:

p_{iR} = D_i / D        q_{iR} = (n_i − D_i) / (N − D)

This process can then be repeated recursively

Improving the Initial Ranking

sim(d_j, q) ∼ Σ_{k_i ∈ q ∧ k_i ∈ d_j} log( (N − n_i) / n_i )

To avoid problems with D = 1 and D_i = 0:

p_{iR} = (D_i + 0.5) / (D + 1) ;  q_{iR} = (n_i − D_i + 0.5) / (N − D + 1)

Also,

p_{iR} = (D_i + n_i/N) / (D + 1) ;  q_{iR} = (n_i − D_i + n_i/N) / (N − D + 1)

Pluses and Minuses

Advantages:
  docs ranked in decreasing order of probability of relevance

Disadvantages:
  need to guess initial estimates for p_{iR}
  method does not take into account tf factors
  the lack of document length normalization

Comparison of Classic Models

The Boolean model does not provide for partial matches and is considered to be the weakest classic model

There is some controversy as to whether the probabilistic model outperforms the vector model

Croft suggested that the probabilistic model provides better retrieval performance

However, Salton et al showed that the vector model outperforms it with general collections

This also seems to be the dominant thought among researchers and practitioners of IR.

Modern Information Retrieval

Modeling

Part II: Alternative Set and Vector Models

Set-Based Model
Extended Boolean Model
Fuzzy Set Model
The Generalized Vector Model
Latent Semantic Indexing
Neural Network for IR

Alternative Set Theoretic Models

Set-Based Model
Extended Boolean Model
Fuzzy Set Model

Set-Based Model

Set-Based Model

This is a more recent approach (2005) that combines set theory with a vectorial ranking

The fundamental idea is to use mutual dependencies among index terms to improve results

Term dependencies are captured through termsets, which are sets of correlated terms

The approach, which leads to improved results with various collections, constitutes the first IR model that effectively took advantage of term dependence with general collections

Termsets

Termset is a concept used in place of the index terms

A termset S_i = {k_a, k_b, ..., k_n} is a subset of the terms in the collection

If all index terms in S_i occur in a document d_j then we say that the termset S_i occurs in d_j

There are 2^t termsets that might occur in the documents of a collection, where t is the vocabulary size

However, most combinations of terms have no semantic meaning. Thus, the actual number of termsets in a collection is far smaller than 2^t

Termsets

Let t be the number of terms of the collection

Then, the set V_S = {S_1, S_2, ..., S_{2^t}} is the vocabulary-set of the collection

To illustrate, consider the document collection below

  d1: To do is to be. To be is to do.
  d2: To be or not to be. I am what I am.
  d3: I think therefore I am. Do be do be do.
  d4: Do do do, da da da. Let it be, let it be.

Termsets

To simplify notation, let us define

  k_a = to    k_d = be    k_g = I      k_j = think      k_m = let
  k_b = do    k_e = or    k_h = am     k_k = therefore  k_n = it
  k_c = is    k_f = not   k_i = what   k_l = da

Further, let the letters a...n refer to the index terms k_a...k_n, respectively

  d1: a b c a d   a d c a b
  d2: a d e f a d   g h i g h
  d3: g j k g h   b d b d b
  d4: b b b l l l   m n d m n d

Termsets

Consider the query q as "to do be it", i.e. q = {a, b, d, n}

For this query, the vocabulary-set is as below

  Termset   Set of Terms   Documents
  S_a       {a}            {d1, d2}
  S_b       {b}            {d1, d3, d4}
  S_d       {d}            {d1, d2, d3, d4}
  S_n       {n}            {d4}
  S_ab      {a, b}         {d1}
  S_ad      {a, d}         {d1, d2}
  S_bd      {b, d}         {d1, d3, d4}
  S_bn      {b, n}         {d4}
  S_dn      {d, n}         {d4}
  S_abd     {a, b, d}      {d1}
  S_bdn     {b, d, n}      {d4}

Notice that there are 11 termsets that occur in our collection, out of the maximum of 15 termsets that can be formed with the terms in q

Termsets

At query processing time, only the termsets generated by the query need to be considered

A termset composed of n terms is called an n-termset

Let N_i be the number of documents in which S_i occurs

An n-termset S_i is said to be frequent if N_i is greater than or equal to a given threshold

This implies that an n-termset can be frequent only if all of its (n−1)-termsets are also frequent

Frequent termsets can be used to reduce the number of termsets to consider with long queries

Termsets

Let the threshold on the frequency of termsets be 2

To compute all frequent termsets for the query q = {a, b, d, n} we proceed as follows

1. Compute the frequent 1-termsets and their inverted lists:
   S_a = {d1, d2}
   S_b = {d1, d3, d4}
   S_d = {d1, d2, d3, d4}

2. Combine the inverted lists to compute frequent 2-termsets:
   S_ad = {d1, d2}
   S_bd = {d1, d3, d4}

3. Since there are no frequent 3-termsets, stop

  d1: a b c a d   a d c a b
  d2: a d e f a d   g h i g h
  d3: g j k g h   b d b d b
  d4: b b b l l l   m n d m n d
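The level-wise procedure above can be sketched as an apriori-style loop over inverted lists. A minimal illustration in Python (the names are ours, not part of the model's definition):

```python
from itertools import combinations

def frequent_termsets(inverted, query_terms, threshold):
    """Level-wise (apriori-style) computation of frequent termsets.

    inverted maps a term to the set of documents containing it; the
    inverted list of a termset is the intersection of its members' lists.
    """
    current = {frozenset([t]): set(inverted.get(t, ())) for t in query_terms}
    current = {s: d for s, d in current.items() if len(d) >= threshold}
    result = dict(current)
    size = 1
    while current:
        # combine frequent n-termsets into candidate (n+1)-termsets
        candidates = {}
        for s1, s2 in combinations(current, 2):
            cand = s1 | s2
            if len(cand) == size + 1:
                candidates[cand] = current[s1] & current[s2]
        current = {s: d for s, d in candidates.items() if len(d) >= threshold}
        result.update(current)
        size += 1
    return result

# Example collection, query q = {a, b, d, n}, threshold 2
inverted = {'a': {'d1', 'd2'}, 'b': {'d1', 'd3', 'd4'},
            'd': {'d1', 'd2', 'd3', 'd4'}, 'n': {'d4'}}
freq = frequent_termsets(inverted, ['a', 'b', 'd', 'n'], 2)
# yields the 5 frequent termsets: {a}, {b}, {d}, {a,d}, {b,d}
```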

Termsets

Notice that there are only 5 frequent termsets in our collection

Inverted lists for frequent n-termsets can be computed by starting with the inverted lists of frequent 1-termsets

Thus, the only index that is required is the standard inverted lists used by any IR system

This is reasonably fast for short queries up to 4-5 terms

Ranking Computation

The ranking computation is based on the vector model, but adopts termsets instead of index terms

Given a query q, let
  {S_1, S_2, ...} be the set of all termsets originated from q
  N_i be the number of documents in which termset S_i occurs
  N be the total number of documents in the collection
  F_{i,j} be the frequency of termset S_i in document d_j

For each pair [S_i, d_j] we compute a weight W_{i,j} given by

W_{i,j} = (1 + log F_{i,j}) log(1 + N/N_i)   if F_{i,j} > 0
W_{i,j} = 0                                  if F_{i,j} = 0

We also compute a W_{i,q} value for each pair [S_i, q]

Ranking Computation

Consider
  query q = {a, b, d, n}
  document d1 = ``a b c a d a d c a b''

  Termset   Weight
  S_a       W_{a,1} = (1 + log 4) × log(1 + 4/2) = 4.75
  S_b       W_{b,1} = (1 + log 2) × log(1 + 4/3) = 2.44
  S_d       W_{d,1} = (1 + log 2) × log(1 + 4/4) = 2.00
  S_n       W_{n,1} = 0 × log(1 + 4/1) = 0.00
  S_ab      W_{ab,1} = (1 + log 2) × log(1 + 4/1) = 4.64
  S_ad      W_{ad,1} = (1 + log 2) × log(1 + 4/2) = 3.17
  S_bd      W_{bd,1} = (1 + log 2) × log(1 + 4/3) = 2.44
  S_bn      W_{bn,1} = 0 × log(1 + 4/1) = 0.00
  S_dn      W_{dn,1} = 0 × log(1 + 4/1) = 0.00
  S_abd     W_{abd,1} = (1 + log 2) × log(1 + 4/1) = 4.64
  S_bdn     W_{bdn,1} = 0 × log(1 + 4/1) = 0.00

Ranking Computation

A document d_j and a query q are represented as vectors in a 2^t-dimensional space of termsets

\vec{d_j} = (W_{1,j}, W_{2,j}, ..., W_{2^t,j})
\vec{q} = (W_{1,q}, W_{2,q}, ..., W_{2^t,q})

The rank of d_j to the query q is computed as follows

sim(d_j, q) = \vec{d_j} • \vec{q} / ( |\vec{d_j}| × |\vec{q}| ) = Σ_{S_i} W_{i,j} × W_{i,q} / ( |\vec{d_j}| × |\vec{q}| )

For termsets that are not in the query q, W_{i,q} = 0

Ranking Computation

The document norm |\vec{d_j}| is hard to compute in the space of termsets

Thus, its computation is restricted to 1-termsets

Let again q = {a, b, d, n} and d1

The document norm in terms of 1-termsets is given by

|\vec{d_1}| = sqrt( W_{a,1}^2 + W_{b,1}^2 + W_{c,1}^2 + W_{d,1}^2 )
            = sqrt( 4.75^2 + 2.44^2 + 4.64^2 + 2.00^2 )
            = 7.35

Ranking Computation

To compute the rank of d1, we need to consider the seven termsets S_a, S_b, S_d, S_ab, S_ad, S_bd, and S_abd

The rank of d1 is then given by

sim(d1, q) = ( W_{a,1} × W_{a,q} + W_{b,1} × W_{b,q} + W_{d,1} × W_{d,q} +
               W_{ab,1} × W_{ab,q} + W_{ad,1} × W_{ad,q} + W_{bd,1} × W_{bd,q} +
               W_{abd,1} × W_{abd,q} ) / |\vec{d_1}|
           = ( 4.75 × 1.58 + 2.44 × 1.22 + 2.00 × 1.00 +
               4.64 × 2.32 + 3.17 × 1.58 + 2.44 × 1.22 +
               4.64 × 2.32 ) / 7.35
           = 5.71
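The whole computation for d1 can be reproduced in a few lines (log base 2; helper names are ours). Working with unrounded weights gives ≈ 5.72, while the slide's 5.71 comes from the two-decimal rounded weights:

```python
import math

def W(F, N, N_i):
    # termset weight: (1 + log F) * log(1 + N/N_i), zero if F = 0
    return (1 + math.log2(F)) * math.log2(1 + N / N_i) if F > 0 else 0.0

N = 4
# termset -> (frequency in d1, frequency in q, number of docs containing it)
data = {'a': (4, 1, 2), 'b': (2, 1, 3), 'd': (2, 1, 4), 'ab': (2, 1, 1),
        'ad': (2, 1, 2), 'bd': (2, 1, 3), 'abd': (2, 1, 1)}
num = sum(W(fd, N, Ni) * W(fq, N, Ni) for fd, fq, Ni in data.values())
# norm restricted to the 1-termsets of d1: a, b, c, d
norm = math.sqrt(W(4, N, 2)**2 + W(2, N, 3)**2 + W(2, N, 1)**2 + W(2, N, 4)**2)
sim = num / norm  # ~ 5.72
```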

Closed Termsets

The concept of frequent termsets allows simplifying the ranking computation

Yet, there are many frequent termsets in a large collection

The number of termsets to consider might be prohibitively high with large queries

To resolve this problem, we can further restrict the ranking computation to a smaller number of termsets

This can be accomplished by observing some properties of termsets such as the notion of closure

Closed Termsets

The closure of a termset S_i is the set of all frequent termsets that co-occur with S_i in the same set of docs

Given the closure of S_i, the largest termset in it is called a closed termset and is referred to as Φ_i

We formalize, as follows
  Let D_i ⊆ C be the subset of all documents in which termset S_i occurs and is frequent
  Let S(D_i) be a set composed of the frequent termsets that occur in all documents in D_i and only in those

Closed Termsets

Then, the closed termset S_{Φ_i} satisfies the following property

∄ S_j ∈ S(D_i) | S_{Φ_i} ⊂ S_j

Frequent and closed termsets for our example collection, considering a minimum threshold equal to 2

  frequency(S_i)   frequent termsets    closed termset
  4                d                    d
  3                b, bd                bd
  2                a, ad                ad
  2                g, h, gh, ghd        ghd

Closed Termsets

Closed termsets encapsulate smaller termsets occurring in the same set of documents

The ranking sim(d1, q) of document d1 with regard to query q is computed as follows:

  d1 = ``a b c a d a d c a b''
  q = {a, b, d, n}
  minimum frequency threshold = 2

sim(d1, q) = ( W_{d,1} × W_{d,q} + W_{ab,1} × W_{ab,q} + W_{ad,1} × W_{ad,q} +
               W_{bd,1} × W_{bd,q} + W_{abd,1} × W_{abd,q} ) / |\vec{d_1}|
           = ( 2.00 × 1.00 + 4.64 × 2.32 + 3.17 × 1.58 +
               2.44 × 1.22 + 4.64 × 2.32 ) / 7.35
           = 4.28

Closed Termsets

Thus, if we restrict the ranking computation to closed termsets, we can expect a reduction in query time

The smaller the number of closed termsets, the sharper the reduction in query processing time

Extended Boolean Model

Extended Boolean Model

In the Boolean model, no ranking of the answer set is generated

One alternative is to extend the Boolean model with the notions of partial matching and term weighting

This strategy allows one to combine characteristics of the Vector model with properties of Boolean algebra

The Idea

Consider a conjunctive Boolean query given by q = k_x ∧ k_y

For the Boolean model, a doc that contains a single term of q is as irrelevant as a doc that contains none

However, this binary decision criterion frequently is not in accordance with common sense

An analogous reasoning applies when one considers purely disjunctive queries

The Idea

When only two terms x and y are considered, we can plot queries and docs in a two-dimensional space

A document d_j is positioned in this space through the adoption of weights w_{x,j} and w_{y,j}

The Idea

These weights can be computed as normalized tf-idf factors as follows

w_{x,j} = ( f_{x,j} / max_x f_{x,j} ) × ( idf_x / max_i idf_i )

where
  f_{x,j} is the frequency of term k_x in document d_j
  idf_i is the inverse document frequency of term k_i, as before

To simplify notation, let
  w_{x,j} = x and w_{y,j} = y
  \vec{d_j} = (w_{x,j}, w_{y,j}) as the point d_j = (x, y)

The Idea

For a disjunctive query q_or = k_x ∨ k_y, the point (0, 0) is the least interesting one

This suggests taking the distance from (0, 0) as a measure of similarity

sim(q_or, d) = sqrt( (x^2 + y^2) / 2 )

The Idea

For a conjunctive query q_and = k_x ∧ k_y, the point (1, 1) is the most interesting one

This suggests taking the complement of the distance from the point (1, 1) as a measure of similarity

sim(q_and, d) = 1 − sqrt( ((1−x)^2 + (1−y)^2) / 2 )

The Idea

sim(q_or, d) = sqrt( (x^2 + y^2) / 2 )

sim(q_and, d) = 1 − sqrt( ((1−x)^2 + (1−y)^2) / 2 )

Generalizing the Idea

We can extend the previous model to consider Euclidean distances in a t-dimensional space

This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 ≤ p ≤ ∞

A generalized conjunctive query is given by

q_and = k_1 ∧^p k_2 ∧^p ... ∧^p k_m

A generalized disjunctive query is given by

q_or = k_1 ∨^p k_2 ∨^p ... ∨^p k_m

Generalizing the Idea

The query-document similarities are now given by

sim(q_or, d_j) = ( (x_1^p + x_2^p + ... + x_m^p) / m )^{1/p}

sim(q_and, d_j) = 1 − ( ((1−x_1)^p + (1−x_2)^p + ... + (1−x_m)^p) / m )^{1/p}

where each x_i stands for a weight w_{i,d}

If p = 1 then (vector-like)
  sim(q_or, d_j) = sim(q_and, d_j) = (x_1 + ... + x_m) / m

If p = ∞ then (fuzzy-like)
  sim(q_or, d_j) = max(x_i)
  sim(q_and, d_j) = min(x_i)
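The two p-norm similarities are a direct transcription into code, and the limiting behaviours for p = 1 and large p can be observed numerically (a minimal sketch; names are ours):

```python
def sim_or(xs, p):
    """p-norm disjunctive similarity: ((x1^p + ... + xm^p)/m)^(1/p)."""
    return (sum(x**p for x in xs) / len(xs)) ** (1 / p)

def sim_and(xs, p):
    """p-norm conjunctive similarity: 1 - (((1-x1)^p + ...)/m)^(1/p)."""
    return 1 - (sum((1 - x)**p for x in xs) / len(xs)) ** (1 / p)

xs = [0.9, 0.2, 0.5]
# p = 1 collapses both formulas into the plain average (vector-like)
# a large p approaches max() for OR and min() for AND (fuzzy-like)
```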

Properties

By varying p, we can make the model behave as a vector, as a fuzzy, or as an intermediary model

The processing of more general queries is done by grouping the operators in a predefined order

For instance, consider the query q = (k_1 ∧^p k_2) ∨^p k_3
  k_1 and k_2 are to be used as in a vectorial retrieval while the presence of k_3 is required

The similarity sim(q, d_j) is computed as

sim(q, d_j) = [ ( ( 1 − ( ((1−x_1)^p + (1−x_2)^p) / 2 )^{1/p} )^p + x_3^p ) / 2 ]^{1/p}

Conclusions

The model is quite powerful

Its properties are interesting and might be useful

Computation is somewhat complex

However, distributivity does not hold for the ranking computation:

q_1 = (k_1 ∨ k_2) ∧ k_3
q_2 = (k_1 ∧ k_3) ∨ (k_2 ∧ k_3)
sim(q_1, d_j) ≠ sim(q_2, d_j)

Fuzzy Set Model

Fuzzy Set Model

The matching of a document to the query terms is approximate or vague

This vagueness can be modeled using a fuzzy framework, as follows:
  each query term defines a fuzzy set
  each doc has a degree of membership in this set

This interpretation provides the foundation for many IR models based on fuzzy theory

In here, we discuss the model proposed by Ogawa, Morita, and Kobayashi

Fuzzy Set Theory

Fuzzy set theory deals with the representation of classes whose boundaries are not well defined

The key idea is to introduce the notion of a degree of membership associated with the elements of the class

This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership

Thus, membership is now a gradual notion, contrary to the crisp notion enforced by classic Boolean logic

Fuzzy Set Theory

A fuzzy subset A of a universe of discourse U is characterized by a membership function

μ_A : U → [0, 1]

This function associates with each element u of U a number μ_A(u) in the interval [0, 1]

The three most commonly used operations on fuzzy sets are:
  the complement of a fuzzy set
  the union of two or more fuzzy sets
  the intersection of two or more fuzzy sets

Fuzzy Set Theory

Let
  U be the universe of discourse
  A and B be two fuzzy subsets of U
  Ā be the complement of A relative to U
  u be an element of U

Then,
  μ_Ā(u) = 1 − μ_A(u)
  μ_{A∪B}(u) = max(μ_A(u), μ_B(u))
  μ_{A∩B}(u) = min(μ_A(u), μ_B(u))
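These three operations translate directly into code. A tiny illustration with membership functions stored as dicts (the sets and values are made up for the example):

```python
# membership functions over a small universe of discourse U = {u1, u2, u3}
A = {'u1': 0.8, 'u2': 0.3, 'u3': 0.0}
B = {'u1': 0.5, 'u2': 0.9, 'u3': 0.2}

complement_A = {u: 1 - mu for u, mu in A.items()}     # 1 - mu_A(u)
union = {u: max(A[u], B[u]) for u in A}               # max(mu_A, mu_B)
intersection = {u: min(A[u], B[u]) for u in A}        # min(mu_A, mu_B)
```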

Fuzzy Information Retrieval

Fuzzy sets are modeled based on a thesaurus, which defines term relationships

A thesaurus can be constructed by defining a term-term correlation matrix C

Each element of C defines a normalized correlation factor c_{i,l} between two terms k_i and k_l

c_{i,l} = n_{i,l} / (n_i + n_l − n_{i,l})

where
  n_i: number of docs which contain k_i
  n_l: number of docs which contain k_l
  n_{i,l}: number of docs which contain both k_i and k_l

Fuzzy Information Retrieval

We can use the term correlation matrix C to associate a fuzzy set with each index term k_i

In this fuzzy set, a document d_j has a degree of membership μ_{i,j} given by

μ_{i,j} = 1 − ∏_{k_l ∈ d_j} (1 − c_{i,l})

The above expression computes an algebraic sum over all terms in d_j

A document d_j belongs to the fuzzy set associated with k_i if its own terms are associated with k_i

Fuzzy Information Retrieval

If d_j contains a term k_l which is closely related to k_i, we have

c_{i,l} ∼ 1  ⟹  μ_{i,j} ∼ 1

and k_i is a good fuzzy index for d_j

Fuzzy IR: An Example

(Figure: Venn diagram of the fuzzy sets D_a, D_b, D_c with D_q = cc_1 + cc_2 + cc_3)

Consider the query q = k_a ∧ (k_b ∨ ¬k_c)

The disjunctive normal form of q is composed of 3 conjunctive components (cc), as follows:

\vec{q}_{dnf} = (1,1,1) + (1,1,0) + (1,0,0) = cc_1 + cc_2 + cc_3

Let D_a, D_b and D_c be the fuzzy sets associated with the terms k_a, k_b and k_c, respectively

Fuzzy IR: An Example

Let μ_{a,j}, μ_{b,j}, and μ_{c,j} be the degrees of membership of document d_j in the fuzzy sets D_a, D_b, and D_c. Then,

cc_1 = μ_{a,j} μ_{b,j} μ_{c,j}
cc_2 = μ_{a,j} μ_{b,j} (1 − μ_{c,j})
cc_3 = μ_{a,j} (1 − μ_{b,j}) (1 − μ_{c,j})

Fuzzy IR: An Example

μ_{q,j} = μ_{cc_1+cc_2+cc_3, j}
        = 1 − ∏_{i=1}^{3} (1 − μ_{cc_i,j})
        = 1 − (1 − μ_{a,j} μ_{b,j} μ_{c,j}) ×
              (1 − μ_{a,j} μ_{b,j} (1 − μ_{c,j})) × (1 − μ_{a,j} (1 − μ_{b,j}) (1 − μ_{c,j}))
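The algebraic-sum evaluation above is a direct transcription into code, for a document with given memberships in D_a, D_b, D_c (the sample values are illustrative):

```python
def mu_query(mu_a, mu_b, mu_c):
    """Membership of a document in D_q for q = k_a AND (k_b OR NOT k_c),
    evaluated through the three conjunctive components of the DNF."""
    cc1 = mu_a * mu_b * mu_c
    cc2 = mu_a * mu_b * (1 - mu_c)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

# a document fully in D_a and D_b but not in D_c matches via cc_2
mu = mu_query(1.0, 1.0, 0.0)  # = 1.0
```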

Conclusions
FuzzyIRmodelshavebeendiscussedmainlyinthe
literatureassociatedwithfuzzytheory
Theyprovideaninterestingframeworkwhichnaturally
embodiesthenotionoftermdependencies
Experimentswithstandardtestcollectionsarenot
available
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 131

Alternative Algebraic Models

Generalized Vector Model
Latent Semantic Indexing
Neural Network Model

Generalized Vector Model

Generalized Vector Model

Classic models enforce independence of index terms

For instance, in the Vector model
  the set of term vectors {\vec{k_1}, \vec{k_2}, ..., \vec{k_t}} is linearly independent
  frequently, this is interpreted as ∀ i, j ⇒ \vec{k_i} • \vec{k_j} = 0

In the generalized vector space model, two index term vectors might be non-orthogonal

Key Idea

As before, let w_{i,j} be the weight associated with [k_i, d_j] and V = {k_1, k_2, ..., k_t} be the set of all terms

If the w_{i,j} weights are binary, all patterns of occurrence of terms within docs can be represented by minterms:

  (k_1, k_2, k_3, ..., k_t)
  m_1 = (0, 0, 0, ..., 0)
  m_2 = (1, 0, 0, ..., 0)
  m_3 = (0, 1, 0, ..., 0)
  m_4 = (1, 1, 0, ..., 0)
  ...
  m_{2^t} = (1, 1, 1, ..., 1)

For instance, m_2 indicates documents in which solely the term k_1 occurs

Key Idea

For any document d_j, there is a minterm m_r that includes exactly the terms that occur in the document

Let us define the following set of minterm vectors \vec{m_r}, for r = 1, 2, ..., 2^t:

  \vec{m_1} = (1, 0, ..., 0)
  \vec{m_2} = (0, 1, ..., 0)
  ...
  \vec{m_{2^t}} = (0, 0, ..., 1)

Notice that we can associate each unit vector \vec{m_r} with a minterm m_r, and that \vec{m_i} • \vec{m_j} = 0 for all i ≠ j

KeyIdea
Pairwiseorthogonalityamongthe~mrvectorsdoesnot
implyindependenceamongtheindexterms
Onthecontrary,indextermsarenowcorrelatedbythe
~mrvectors
For instance, the vector~m4is associated with the minterm
m4= (1,1,...,0)
This minterm induces a dependency between termsk1andk2 Thus, if such document exists in a collection, we say that the
mintermm4is active Themodeladoptstheideathatco-occurrenceofterms
inducesdependenciesamongtheseterms
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 137

Forming the Term Vectors

Let on(i, m_r) return the weight {0, 1} of the index term k_i in the minterm m_r

The vector associated with the term k_i is computed as:

\vec{k_i} = Σ_{∀r} on(i, m_r) c_{i,r} \vec{m_r} / sqrt( Σ_{∀r} on(i, m_r) c_{i,r}^2 )

c_{i,r} = Σ_{d_j | c(d_j) = m_r} w_{i,j}

Notice that for a collection of size N, only N minterms affect the ranking (and not 2^t)

Dependency between Index Terms

A degree of correlation between the terms k_i and k_j can now be computed as:

\vec{k_i} • \vec{k_j} = Σ_{∀r} on(i, m_r) × c_{i,r} × on(j, m_r) × c_{j,r}

This degree of correlation sums up the dependencies between k_i and k_j induced by the docs in the collection

The Generalized Vector Model: An Example

(Figure: documents d_1 ... d_7 in the space of the terms K1, K2, K3)

       K1  K2  K3
  d1    2   0   1
  d2    1   0   0
  d3    0   1   3
  d4    2   0   0
  d5    1   2   4
  d6    0   2   2
  d7    0   5   0
  q     1   2   3

Computation of c_{i,r}

       K1  K2  K3
  d1    2   0   1
  d2    1   0   0
  d3    0   1   3
  d4    2   0   0
  d5    1   2   4
  d6    0   2   2
  d7    0   5   0
  q     1   2   3

            K1  K2  K3
  d1 = m6    1   0   1
  d2 = m2    1   0   0
  d3 = m7    0   1   1
  d4 = m2    1   0   0
  d5 = m8    1   1   1
  d6 = m7    0   1   1
  d7 = m3    0   1   0
  q  = m8    1   1   1

        c_{1,r}  c_{2,r}  c_{3,r}
  m1       0        0        0
  m2       3        0        0
  m3       0        5        0
  m4       0        0        0
  m5       0        0        0
  m6       2        0        1
  m7       0        3        5
  m8       1        2        4

Computation of \vec{k_i}

\vec{k_1} = (3\vec{m_2} + 2\vec{m_6} + \vec{m_8}) / sqrt(3^2 + 2^2 + 1^2)
\vec{k_2} = (5\vec{m_3} + 3\vec{m_7} + 2\vec{m_8}) / sqrt(5^2 + 3^2 + 2^2)
\vec{k_3} = (1\vec{m_6} + 5\vec{m_7} + 4\vec{m_8}) / sqrt(1^2 + 5^2 + 4^2)

        c_{1,r}  c_{2,r}  c_{3,r}
  m1       0        0        0
  m2       3        0        0
  m3       0        5        0
  m4       0        0        0
  m5       0        0        0
  m6       2        0        1
  m7       0        3        5
  m8       1        2        4
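The c_{i,r} factors in the table can be derived mechanically from the term-document matrix: group the documents by their binary occurrence pattern (minterm) and sum the weights per term. A sketch (names are ours; d6 = (0, 2, 2), as in the minterm assignments above):

```python
from collections import defaultdict

# term-document weights for the example collection
docs = {'d1': (2, 0, 1), 'd2': (1, 0, 0), 'd3': (0, 1, 3), 'd4': (2, 0, 0),
        'd5': (1, 2, 4), 'd6': (0, 2, 2), 'd7': (0, 5, 0)}

# c[minterm][i] = sum of w_{i,j} over documents whose pattern is that minterm
c = defaultdict(lambda: [0, 0, 0])
for w in docs.values():
    minterm = tuple(int(x > 0) for x in w)
    for i, wi in enumerate(w):
        c[minterm][i] += wi

def corr(i, j):
    # unnormalized degree of correlation between terms k_i and k_j:
    # sum of c_{i,r} * c_{j,r} (the on() factors are implicit, since
    # c_{i,r} is zero whenever k_i is absent from minterm m_r)
    return sum(cr[i] * cr[j] for cr in c.values())
```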

Computation of Document Vectors

\vec{d_1} = 2\vec{k_1} + \vec{k_3}
\vec{d_2} = \vec{k_1}
\vec{d_3} = \vec{k_2} + 3\vec{k_3}
\vec{d_4} = 2\vec{k_1}
\vec{d_5} = \vec{k_1} + 2\vec{k_2} + 4\vec{k_3}
\vec{d_6} = 2\vec{k_2} + 2\vec{k_3}
\vec{d_7} = 5\vec{k_2}
\vec{q}  = \vec{k_1} + 2\vec{k_2} + 3\vec{k_3}

       K1  K2  K3
  d1    2   0   1
  d2    1   0   0
  d3    0   1   3
  d4    2   0   0
  d5    1   2   4
  d6    0   2   2
  d7    0   5   0
  q     1   2   3

Conclusions

The model considers correlations among index terms

It is not clear in which situations it is superior to the standard Vector model

Computation costs are higher

The model does introduce interesting new ideas

Latent Semantic Indexing

Latent Semantic Indexing

Classic IR might lead to poor retrieval due to:
  unrelated documents might be included in the answer set
  relevant documents that do not contain at least one index term are not retrieved

Reasoning: retrieval based on index terms is vague and noisy

The user information need is more related to concepts and ideas than to index terms

A document that shares concepts with another document known to be relevant might be of interest

Latent Semantic Indexing

The idea here is to map documents and queries into a dimensional space composed of concepts

Let
  t: total number of index terms
  N: number of documents
  M = [m_{ij}]: term-document matrix, t × N

To each element of M is assigned a weight w_{i,j} associated with the term-document pair [k_i, d_j]

The weight w_{i,j} can be based on a tf-idf weighting scheme

Latent Semantic Indexing

The matrix M = [m_{ij}] can be decomposed into three components using singular value decomposition

M = K · S · D^T

where
  K is the matrix of eigenvectors derived from C = M · M^T
  D^T is the matrix of eigenvectors derived from M^T · M
  S is an r × r diagonal matrix of singular values, where r = min(t, N) is the rank of M

Computing an Example

Let M^T = [m_{ij}] be given by

       K1  K2  K3   q • d_j
  d1    2   0   1      5
  d2    1   0   0      1
  d3    0   1   3     11
  d4    2   0   0      2
  d5    1   2   4     17
  d6    1   2   0      5
  d7    0   5   0     10
  q     1   2   3

Compute the matrices K, S, and D^T

Latent Semantic Indexing

In the matrix S, consider that only the s largest singular values are selected

Keep the corresponding columns in K and D^T

The resultant matrix is called M_s and is given by

M_s = K_s · S_s · D_s^T

where s, s < r, is the dimensionality of a reduced concept space

The parameter s should be
  large enough to allow fitting the characteristics of the data
  small enough to filter out the non-relevant representational details
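The decomposition and truncation can be sketched with NumPy (assumed available here) on the example term-document matrix; `numpy.linalg.svd` returns the singular values in decreasing order, so keeping the first s columns is a slice:

```python
import numpy as np

# term-document matrix M (t x N) from the example; rows are K1, K2, K3
M = np.array([[2, 1, 0, 2, 1, 1, 0],
              [0, 0, 1, 0, 2, 2, 5],
              [1, 0, 3, 0, 4, 0, 0]], dtype=float)

# M = K . S . D^T, with sigma holding the singular values
K, sigma, Dt = np.linalg.svd(M, full_matrices=False)

s = 2  # dimensionality of the reduced concept space (s < r = 3)
Ms = K[:, :s] @ np.diag(sigma[:s]) @ Dt[:s, :]  # rank-s approximation of M
```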

Latent Ranking

The relationship between any two documents in s can be obtained from the M_s^T · M_s matrix given by

M_s^T · M_s = (K_s · S_s · D_s^T)^T · K_s · S_s · D_s^T
            = D_s · S_s · K_s^T · K_s · S_s · D_s^T
            = D_s · S_s · S_s · D_s^T
            = (D_s · S_s) · (D_s · S_s)^T

In the above matrix, the (i, j) element quantifies the relationship between documents d_i and d_j

Latent Ranking

The user query can be modelled as a pseudo-document in the original M matrix

Assume the query is modelled as the document numbered 0 in the M matrix

The matrix M_s^T · M_s quantifies the relationship between any two documents in the reduced concept space

The first row of this matrix provides the rank of all the documents with regard to the user query

Conclusions

Latent semantic indexing provides an interesting conceptualization of the IR problem

Thus, it has its value as a new theoretical framework

From a practical point of view, the latent semantic indexing model has not yielded encouraging results

Neural Network Model

Neural Network Model

Classic IR:
  terms are used to index documents and queries
  retrieval is based on index term matching

Motivation:
  neural networks are known to be good pattern matchers

Neural Network Model

The human brain is composed of billions of neurons

Each neuron can be viewed as a small processing unit

A neuron is stimulated by input signals and emits output signals in reaction

A chain reaction of propagating signals is called a spread activation process

As a result of spread activation, the brain might command the body to take physical reactions

Neural Network Model

A neural network is an oversimplified representation of the neuron interconnections in the human brain:
  nodes are processing units
  edges are synaptic connections
  the strength of a propagating signal is modelled by a weight assigned to each edge
  the state of a node is defined by its activation level
  depending on its activation level, a node might issue an output signal

Neural Network for IR

A neural network model for information retrieval

Neural Network for IR

Three-layer network: one for the query terms, one for the document terms, and a third one for the documents

Signals propagate across the network

First level of propagation:
  query terms issue the first signals
  these signals propagate across the network to reach the document nodes

Second level of propagation:
  document nodes might themselves generate new signals which affect the document term nodes
  document term nodes might respond with new signals of their own

Quantifying Signal Propagation

Normalize signal strength (MAX = 1)

Query terms emit an initial signal equal to 1

The weight associated with an edge from a query term node k_i to a document term node k_i is the normalized

w'_{i,q} = w_{i,q} / sqrt( Σ_{i=1}^{t} w_{i,q}^2 )

The weight associated with an edge from a document term node k_i to a document node d_j is the normalized

w'_{i,j} = w_{i,j} / sqrt( Σ_{i=1}^{t} w_{i,j}^2 )

Quantifying Signal Propagation

After the first level of signal propagation, the activation level of a document node d_j is given by:

Σ_{i=1}^{t} w'_{i,q} w'_{i,j} = Σ_{i=1}^{t} w_{i,q} w_{i,j} / ( sqrt(Σ_{i=1}^{t} w_{i,q}^2) × sqrt(Σ_{i=1}^{t} w_{i,j}^2) )

which is exactly the ranking of the Vector model

New signals might be exchanged among document term nodes and document nodes

A minimum threshold should be enforced to avoid spurious signal generation
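The first-level activation is just the cosine of the Vector model, which is easy to verify in code (a minimal sketch; the function name is ours):

```python
import math

def activation(w_q, w_d):
    """First-level activation of a document node: the sum of products of
    normalized query-term and document-term edge weights, i.e. the
    cosine similarity of the Vector model."""
    norm_q = math.sqrt(sum(w**2 for w in w_q))
    norm_d = math.sqrt(sum(w**2 for w in w_d))
    return sum(q * d for q, d in zip(w_q, w_d)) / (norm_q * norm_d)

# identical weight vectors activate maximally (cosine ~ 1)
act = activation([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```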

Conclusions

The model provides an interesting formulation of the IR problem

The model has not been tested extensively

It is not clear what improvements the model might provide

Modern Information Retrieval
Chapter 3: Modeling
Part III: Alternative Probabilistic Models
BM25
Language Models
Divergence from Randomness
Belief Network Models
Other Models
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 163

BM25 (Best Match 25)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 164

BM25 (Best Match 25)
BM25 was created as the result of a series of experiments on variations of the probabilistic model
A good term weighting is based on three principles:
inverse document frequency
term frequency
document length normalization
The classic probabilistic model covers only the first of these principles
This reasoning led to a series of experiments with the Okapi system, which produced the BM25 ranking formula
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 165

BM1, BM11 and BM15 Formulas
At first, the Okapi system used the equation below as its ranking formula:

sim(d_j, q) ∼ Σ_{k_i ∈ q ∧ k_i ∈ d_j} log( (N − n_i + 0.5) / (n_i + 0.5) )

which is the equation used in the probabilistic model when no relevance information is provided
It was referred to as the BM1 formula (Best Match 1)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 166

BM1, BM11 and BM15 Formulas
The first idea for improving the ranking was to introduce a term-frequency factor F_{i,j} into the BM1 formula
This factor, after some changes, evolved to become

F_{i,j} = S1 × f_{i,j} / (K1 + f_{i,j})

where
f_{i,j} is the frequency of term k_i within document d_j
K1 is a constant set up experimentally for each collection
S1 is a scaling constant, normally set to S1 = (K1 + 1)
If K1 = 0, this whole factor becomes equal to 1 and has no effect on the ranking
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 167

BM1, BM11 and BM15 Formulas
The next step was to modify the F_{i,j} factor by adding document length normalization to it, as follows:

F'_{i,j} = S1 × f_{i,j} / ( K1 × len(d_j)/avg_doclen + f_{i,j} )

where
len(d_j) is the length of document d_j (computed, for instance, as the number of terms in the document)
avg_doclen is the average document length for the collection
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 168

BM1, BM11 and BM15 Formulas
Next, a correction factor G_{j,q}, dependent on the document and query lengths, was added:

G_{j,q} = K2 × len(q) × ( avg_doclen − len(d_j) ) / ( avg_doclen + len(d_j) )

where
len(q) is the query length (number of terms in the query)
K2 is a constant
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 169

BM1, BM11 and BM15 Formulas
A third additional factor, aimed at taking into account term frequencies within queries, was defined as

F_{i,q} = S3 × f_{i,q} / (K3 + f_{i,q})

where
f_{i,q} is the frequency of term k_i within query q
K3 is a constant
S3 is a scaling constant related to K3, normally set to S3 = (K3 + 1)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 170

BM1, BM11 and BM15 Formulas
Introduction of these three factors led to various BM (Best Matching) formulas, as follows:

sim_BM1(d_j, q) ∼ Σ_{k_i[q,d_j]} log( (N − n_i + 0.5) / (n_i + 0.5) )

sim_BM15(d_j, q) ∼ G_{j,q} + Σ_{k_i[q,d_j]} F_{i,j} × F_{i,q} × log( (N − n_i + 0.5) / (n_i + 0.5) )

sim_BM11(d_j, q) ∼ G_{j,q} + Σ_{k_i[q,d_j]} F'_{i,j} × F_{i,q} × log( (N − n_i + 0.5) / (n_i + 0.5) )

where k_i[q,d_j] is a short notation for k_i ∈ q ∧ k_i ∈ d_j
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 171

BM1, BM11 and BM15 Formulas
Experiments using TREC data have shown that BM11 outperforms BM15
Further, empirical considerations can be used to simplify the previous equations, as follows:
Empirical evidence suggests that the best value of K2 is 0, which eliminates the G_{j,q} factor from these equations
Further, good estimates for the scaling constants S1 and S3 are K1 + 1 and K3 + 1, respectively
Empirical evidence also suggests that making K3 very large is better; as a result, the F_{i,q} factor reduces simply to f_{i,q}
For short queries, we can assume that f_{i,q} is 1 for all terms
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 172

BM1, BM11 and BM15 Formulas
These considerations lead to simpler equations, as follows:

sim_BM1(d_j, q) ∼ Σ_{k_i[q,d_j]} log( (N − n_i + 0.5) / (n_i + 0.5) )

sim_BM15(d_j, q) ∼ Σ_{k_i[q,d_j]} ( (K1 + 1) × f_{i,j} / (K1 + f_{i,j}) ) × log( (N − n_i + 0.5) / (n_i + 0.5) )

sim_BM11(d_j, q) ∼ Σ_{k_i[q,d_j]} ( (K1 + 1) × f_{i,j} / ( K1 × len(d_j)/avg_doclen + f_{i,j} ) ) × log( (N − n_i + 0.5) / (n_i + 0.5) )
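The three simplified formulas can be written down directly. A minimal sketch (not from the book): document term frequencies and document frequencies are assumed to be plain dicts, and the function names are hypothetical.

```python
import math

def idf_bm(N, n_i):
    # The probabilistic-model idf shared by the three formulas
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

def sim_bm1(query, doc, N, n):
    return sum(idf_bm(N, n[t]) for t in query if t in doc)

def sim_bm15(query, doc, N, n, K1=1.0):
    return sum((K1 + 1) * doc[t] / (K1 + doc[t]) * idf_bm(N, n[t])
               for t in query if t in doc)

def sim_bm11(query, doc, N, n, doc_len, avg_len, K1=1.0):
    return sum((K1 + 1) * doc[t] / (K1 * doc_len / avg_len + doc[t])
               * idf_bm(N, n[t]) for t in query if t in doc)
```

When len(d_j) equals avg_doclen, BM11 and BM15 produce the same term frequency factor, which is the property the BM25 interpolation exploits.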
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 173

BM25 Ranking Formula
BM25: a combination of BM11 and BM15
The motivation was to combine the BM11 and BM15 term frequency factors as follows:

B_{i,j} = (K1 + 1) × f_{i,j} / ( K1 × [ (1 − b) + b × len(d_j)/avg_doclen ] + f_{i,j} )

where b is a constant with values in the interval [0,1]
If b = 0, it reduces to the BM15 term frequency factor
If b = 1, it reduces to the BM11 term frequency factor
For values of b between 0 and 1, the equation provides a combination of BM11 with BM15
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 174

BM25 Ranking Formula
The ranking equation for the BM25 model can then be written as

sim_BM25(d_j, q) ∼ Σ_{k_i[q,d_j]} B_{i,j} × log( (N − n_i + 0.5) / (n_i + 0.5) )

where K1 and b are empirical constants
K1 = 1 works well with real collections
b should be kept closer to 1 to emphasize the document length normalization effect present in the BM11 formula
For instance, b = 0.75 is a reasonable assumption
Constant values can be fine tuned for particular collections through proper experimentation
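As a sketch, the complete BM25 scoring of a document against a query (hypothetical helper; `doc` is assumed to map terms to their frequencies f_{i,j}, and `n` to map terms to document frequencies n_i):

```python
import math

def bm25(query, doc, N, n, doc_len, avg_len, K1=1.0, b=0.75):
    score = 0.0
    for t in query:
        f = doc.get(t, 0)
        if f == 0:
            continue                     # sum only over k_i[q, d_j]
        # B_{i,j}: the interpolated BM11/BM15 term-frequency factor
        B = (K1 + 1) * f / (K1 * ((1 - b) + b * doc_len / avg_len) + f)
        score += B * math.log((N - n[t] + 0.5) / (n[t] + 0.5))
    return score
```

Setting b = 0 recovers the BM15 factor and b = 1 the BM11 factor, as stated above.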
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 175

BM25 Ranking Formula
Unlike the probabilistic model, the BM25 formula can be computed without relevance information
There is consensus that BM25 outperforms the classic vector model for general collections
Thus, it has been used as a baseline for evaluating new ranking functions, in place of the classic vector model
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 176

Language Models
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 177

Language Models
Language models are used in many natural language processing applications
Ex: part-of-speech tagging, speech recognition, machine translation, and information retrieval
To illustrate, the regularities in spoken language can be modeled by probability distributions
These distributions can be used to predict the likelihood that the next token in a sequence is a given word
These probability distributions are called language models
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 178

Language Models
A language model for IR is composed of the following components:
A set of document language models, one per document d_j of the collection
A probability distribution function that allows estimating the likelihood that a document language model M_j generates each of the query terms
A ranking function that combines these generating probabilities for the query terms into a rank of document d_j with regard to the query
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 179

Statistical Foundation
Let S be a sequence of r consecutive terms that occur in a document of the collection:

S = k1, k2, ..., kr

An n-gram language model uses a Markov process to assign a probability of occurrence to S:

P_n(S) = Π_{i=1}^{r} P(k_i | k_{i−1}, k_{i−2}, ..., k_{i−(n−1)})

where n is the order of the Markov process
The occurrence of a term depends on observing the n−1 terms that precede it in the text
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 180

Statistical Foundation
Unigram language model (n = 1): the estimates are based on the occurrence of individual words
Bigram language model (n = 2): the estimates are based on the co-occurrence of pairs of words
Higher order models, such as trigram language models (n = 3), are usually adopted for speech recognition
Term independence assumption: in the case of IR, the impact of word order is less clear
As a result, unigram models have been used extensively in IR
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 181

Multinomial Process
Ranking in a language model is provided by estimating P(q|M_j)
Several researchers have proposed the adoption of a multinomial process to generate the query
According to this process, if we assume that the query terms are independent among themselves (unigram model), we can write:

P(q|M_j) = Π_{k_i ∈ q} P(k_i|M_j)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 182

Multinomial Process
By taking logs on both sides:

log P(q|M_j) = Σ_{k_i ∈ q} log P(k_i|M_j)
            = Σ_{k_i ∈ q ∧ k_i ∈ d_j} log P_∈(k_i|M_j) + Σ_{k_i ∈ q ∧ k_i ∉ d_j} log P_∉(k_i|M_j)
            = Σ_{k_i ∈ q ∧ k_i ∈ d_j} log( P_∈(k_i|M_j) / P_∉(k_i|M_j) ) + Σ_{k_i ∈ q} log P_∉(k_i|M_j)

where P_∈ and P_∉ are two distinct probability distributions:
The first is a distribution for the query terms in the document
The second is a distribution for the query terms not in the document
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 183

Multinomial Process
For the second distribution, statistics are derived from the whole document collection
Thus, we can write

P_∉(k_i|M_j) = α_j × P(k_i|C)

where α_j is a parameter associated with document d_j and P(k_i|C) is a collection C language model
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 184

Multinomial Process
P(k_i|C) can be estimated in different ways
For instance, Hiemstra suggests an idf-like estimate:

P(k_i|C) = n_i / Σ_i n_i

where n_i is the number of docs in which k_i occurs
Miller, Leek, and Schwartz suggest

P(k_i|C) = F_i / Σ_i F_i

where F_i = Σ_j f_{i,j}
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 185

Multinomial Process
Thus, we obtain

log P(q|M_j) = Σ_{k_i ∈ q ∧ k_i ∈ d_j} log( P_∈(k_i|M_j) / (α_j P(k_i|C)) ) + n_q log α_j + Σ_{k_i ∈ q} log P(k_i|C)
            ∼ Σ_{k_i ∈ q ∧ k_i ∈ d_j} log( P_∈(k_i|M_j) / (α_j P(k_i|C)) ) + n_q log α_j

where n_q stands for the query length and the last sum was dropped because it is constant for all documents
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 186

Multinomial Process
The ranking function is now composed of two separate parts
The first part assigns weights to each query term that appears in the document, according to the expression

log( P_∈(k_i|M_j) / (α_j P(k_i|C)) )

This term weight plays a role analogous to the tf plus idf weight components in the vector model
Further, the parameter α_j can be used for document length normalization
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 187

Multinomial Process
The second part assigns a fraction of probability mass to the query terms that are not in the document, a process called smoothing
The combination of a multinomial process with smoothing leads to a ranking formula that naturally includes tf, idf, and document length normalization
That is, smoothing plays a key role in modern language modeling, as we now discuss
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 188

Smoothing
In our discussion, we estimated P_∉(k_i|M_j) using P(k_i|C) to avoid assigning zero probability to query terms not in document d_j
This process, called smoothing, allows fine tuning the ranking to improve the results
One popular smoothing technique is to move some probability mass from the terms in the document to the terms not in the document, as follows:

P(k_i|M_j) = P^s_∈(k_i|M_j)   if k_i ∈ d_j
           = α_j × P(k_i|C)   otherwise

where P^s_∈(k_i|M_j) is the smoothed distribution for terms in document d_j
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 189

Smoothing
Since Σ_i P(k_i|M_j) = 1, we can write

Σ_{k_i ∈ d_j} P^s_∈(k_i|M_j) + Σ_{k_i ∉ d_j} α_j P(k_i|C) = 1

That is,

α_j = ( 1 − Σ_{k_i ∈ d_j} P^s_∈(k_i|M_j) ) / ( 1 − Σ_{k_i ∈ d_j} P(k_i|C) )
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 190

Smoothing
Under the above assumptions, the smoothing parameter α_j is also a function of P^s_∈(k_i|M_j)
As a result, distinct smoothing methods can be obtained through distinct specifications of P^s_∈(k_i|M_j)
Examples of smoothing methods:
Jelinek-Mercer method
Bayesian smoothing using Dirichlet priors
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 191

Jelinek-Mercer Method
The idea is to do a linear interpolation between the document frequency and the collection frequency distributions:

P^s_∈(k_i|M_j, λ) = (1 − λ) × f_{i,j} / Σ_i f_{i,j} + λ × F_i / Σ_i F_i

where 0 ≤ λ ≤ 1
It can be shown that α_j = λ
Thus, the larger the value of λ, the larger is the effect of smoothing
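The interpolation is a one-liner. A minimal sketch (hypothetical function; the document and collection totals are passed explicitly):

```python
def jelinek_mercer(f_ij, doc_total, F_i, coll_total, lam=0.5):
    """(1 - lam) * document MLE + lam * collection MLE."""
    return (1 - lam) * f_ij / doc_total + lam * F_i / coll_total
```

With lam = 0 the estimate is the pure document distribution; with lam = 1 it is the pure collection distribution.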
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 192

Dirichlet Smoothing
In this method, the language model is a multinomial distribution in which the conjugate prior probabilities are given by the Dirichlet distribution
This leads to

P^s_∈(k_i|M_j, λ) = ( f_{i,j} + λ × F_i / Σ_i F_i ) / ( Σ_i f_{i,j} + λ )

As before, the closer λ is to 0, the higher is the influence of the term document frequency; as λ grows, the influence of the term collection frequency increases
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 193

Dirichlet Smoothing
Contrary to the Jelinek-Mercer method, this influence is always partially mixed with the document frequency
It can be shown that

α_j = λ / ( Σ_i f_{i,j} + λ )

As before, the larger the value of λ, the larger is the effect of smoothing
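A minimal sketch of the Dirichlet estimate (hypothetical function; `doc_total` is Σ_i f_{i,j} and `coll_total` is Σ_i F_i):

```python
def dirichlet(f_ij, doc_total, F_i, coll_total, lam=1000.0):
    """(f_ij + lam * p_coll) / (doc_total + lam), with p_coll = F_i / coll_total."""
    p_coll = F_i / coll_total
    return (f_ij + lam * p_coll) / (doc_total + lam)
```

For a term absent from the document (f_ij = 0), this returns α_j × P(k_i|C) with α_j = λ / (doc_total + λ), matching the smoothing scheme above.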
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 194

Smoothing Computation
In both smoothing methods above, computation can be carried out efficiently
All frequency counts can be obtained directly from the index
The values of α_j can be precomputed for each document
Thus, the complexity is analogous to the computation of a vector space ranking using tf-idf weights
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 195

Applying Smoothing to Ranking
The IR ranking in a multinomial language model is computed as follows:
compute P^s_∈(k_i|M_j) using a smoothing method
compute P(k_i|C) using n_i / Σ_i n_i or F_i / Σ_i F_i
compute α_j from the equation

α_j = ( 1 − Σ_{k_i ∈ d_j} P^s_∈(k_i|M_j) ) / ( 1 − Σ_{k_i ∈ d_j} P(k_i|C) )

compute the ranking using the formula

log P(q|M_j) = Σ_{k_i ∈ q ∧ k_i ∈ d_j} log( P^s_∈(k_i|M_j) / (α_j P(k_i|C)) ) + n_q log α_j
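The steps above can be sketched end to end. This is an illustrative implementation, not the book's: Jelinek-Mercer smoothing is assumed, for which α_j = λ, and P(k_i|C) is estimated from collection frequencies F_i.

```python
import math

def lm_score(query, doc, coll, lam=0.5):
    """log P(q|M_j) for a multinomial model with Jelinek-Mercer smoothing.
    `doc` maps terms to f_ij; `coll` maps terms to F_i."""
    doc_total = sum(doc.values())
    coll_total = sum(coll.values())
    alpha = lam                      # step 3: alpha_j = lam for Jelinek-Mercer
    score = 0.0
    for t in query:
        p_coll = coll[t] / coll_total            # step 2: P(k_i|C)
        f = doc.get(t, 0)
        if f > 0:
            # step 1: smoothed in-document probability
            p_s = (1 - lam) * f / doc_total + lam * p_coll
            score += math.log(p_s / (alpha * p_coll))
    return score + len(query) * math.log(alpha)  # step 4: + n_q log alpha_j
```

A document containing the query term scores above one that does not, as the first sum contributes only positive weights.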
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 196

Bernoulli Process
The first application of language models to IR was due to Ponte & Croft. They proposed a Bernoulli process for generating the query, as we now discuss
Given a document d_j, let M_j be a reference to a language model for that document
If we assume independence of index terms, we can compute P(q|M_j) using a multivariate Bernoulli process:

P(q|M_j) = Π_{k_i ∈ q} P(k_i|M_j) × Π_{k_i ∉ q} [1 − P(k_i|M_j)]

where P(k_i|M_j) are term probabilities
This is analogous to the expression for ranking computation in the classic probabilistic model
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 197

Bernoulli Process
A simple estimate of the term probabilities is

P(k_i|M_j) = f_{i,j} / Σ_ℓ f_{ℓ,j}

which computes the probability that term k_i will be produced by a random draw (taken from d_j)
However, the probability will become zero if k_i does not occur in the document
Thus, we assume that a non-occurring term is related to d_j with the probability P(k_i|C) of observing k_i in the whole collection C
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 198

Bernoulli Process
P(k_i|C) can be estimated in different ways
For instance, Hiemstra suggests an idf-like estimate:

P(k_i|C) = n_i / Σ_ℓ n_ℓ

where n_i is the number of docs in which k_i occurs
Miller, Leek, and Schwartz suggest

P(k_i|C) = F_i / Σ_ℓ F_ℓ

where F_i = Σ_j f_{i,j}
This last equation for P(k_i|C) is adopted here
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 199

Bernoulli Process
As a result, we redefine P(k_i|M_j) as follows:

P(k_i|M_j) = f_{i,j} / Σ_i f_{i,j}   if f_{i,j} > 0
           = F_i / Σ_i F_i           if f_{i,j} = 0

In this expression, P(k_i|M_j) estimation is based only on the document d_j when f_{i,j} > 0
This is clearly undesirable because it leads to instability in the model
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 200

Bernoulli Process
This drawback can be addressed through an average computation, as follows:

P(k_i) = ( Σ_{j | k_i ∈ d_j} P(k_i|M_j) ) / n_i

That is, P(k_i) is an estimate based on the language models of all documents that contain term k_i
However, it is the same for all documents that contain term k_i
That is, using P(k_i) to predict the generation of term k_i by the model M_j involves a risk
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 201

Bernoulli Process
To fix this, let us define the average frequency f̄_{i,j} of term k_i in document d_j as

f̄_{i,j} = P(k_i) × Σ_i f_{i,j}
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 202

Bernoulli Process
The risk R_{i,j} associated with using f̄_{i,j} can be quantified by a geometric distribution:

R_{i,j} = ( 1 / (1 + f̄_{i,j}) ) × ( f̄_{i,j} / (1 + f̄_{i,j}) )^{f_{i,j}}

For terms that occur very frequently in the collection, f̄_{i,j} ≫ 0 and R_{i,j} ∼ 0
For terms that are rare both in the document and in the collection, f_{i,j} ∼ 1, f̄_{i,j} ∼ 1, and R_{i,j} ∼ 0.25
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 203

Bernoulli Process
Let us refer to the probability of observing term k_i according to the language model M_j as P_R(k_i|M_j)
We then use the risk factor R_{i,j} to compute P_R(k_i|M_j), as follows:

P_R(k_i|M_j) = P(k_i|M_j)^{(1 − R_{i,j})} × P(k_i)^{R_{i,j}}   if f_{i,j} > 0
             = F_i / Σ_i F_i                                    otherwise

In this formulation, if R_{i,j} ∼ 0 then P_R(k_i|M_j) is basically a function of P(k_i|M_j)
Otherwise, it is a mix of P(k_i) and P(k_i|M_j)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 204

Bernoulli Process
Substituting into the original P(q|M_j) equation, we obtain

P(q|M_j) = Π_{k_i ∈ q} P_R(k_i|M_j) × Π_{k_i ∉ q} [1 − P_R(k_i|M_j)]

which computes the probability of generating the query from the language (document) model
This is the basic formula for ranking computation in a language model based on a Bernoulli process for generating the query
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 205

Divergence from Randomness
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 206

Divergence from Randomness
A distinct probabilistic model has been proposed by Amati and Rijsbergen
The idea is to compute term weights by measuring the divergence between a term distribution produced by a random process and the actual term distribution
Thus the name: divergence from randomness
The model is based on two fundamental assumptions, as follows
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 207

Divergence from Randomness
First assumption:
Not all words are equally important for describing the content of the documents
Words that carry little information are assumed to be randomly distributed over the whole document collection C
Given a term k_i, its probability distribution over the whole collection is referred to as P(k_i|C)
The amount of information associated with this distribution is given by −log P(k_i|C)
By modifying this probability function, we can implement distinct notions of term randomness
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 208

Divergence from Randomness
Second assumption:
A complementary term distribution can be obtained by considering just the subset of documents that contain term k_i
This subset is referred to as the elite set
The corresponding probability distribution, computed with regard to document d_j, is referred to as P(k_i|d_j)
The smaller the probability of observing a term k_i in a document d_j, the more rare and important the term is considered to be
Thus, the amount of information associated with the term in the elite set is defined as 1 − P(k_i|d_j)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 209

Divergence from Randomness
Given these assumptions, the weight w_{i,j} of a term k_i in a document d_j is defined as

w_{i,j} = [ −log P(k_i|C) ] × [ 1 − P(k_i|d_j) ]

Two term distributions are considered: in the collection and in the subset of docs in which the term occurs
The rank R(d_j, q) of a document d_j with regard to a query q is then computed as

R(d_j, q) = Σ_{k_i ∈ q} f_{i,q} × w_{i,j}

where f_{i,q} is the frequency of term k_i in the query
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 210

Random Distribution
To compute the distribution of terms in the collection, distinct probability models can be considered
For instance, consider that Bernoulli trials are used to model the occurrences of a term in the collection
To illustrate, consider a collection with 1,000 documents and a term k_i that occurs 10 times in the collection
Then, the probability of observing 4 occurrences of term k_i in a document is given by

P(k_i|C) = C(10, 4) × (1/1000)⁴ × (1 − 1/1000)⁶

which is a standard binomial distribution
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 211

Random Distribution
In general, let p = 1/N be the probability of observing a term in a document, where N is the number of docs
The probability of observing f_{i,j} occurrences of term k_i in document d_j is described by a binomial distribution:

P(k_i|C) = C(F_i, f_{i,j}) × p^{f_{i,j}} × (1 − p)^{F_i − f_{i,j}}

where C(F_i, f_{i,j}) is the binomial coefficient
Define λ_i = p × F_i and assume that p → 0 when N → ∞, but that λ_i = p × F_i remains constant
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 212

Random Distribution
Under these conditions, we can approximate the binomial distribution by a Poisson process, which yields

P(k_i|C) = e^{−λ_i} × λ_i^{f_{i,j}} / f_{i,j}!
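In code, the Poisson approximation and its information content can be evaluated directly (hypothetical sketch; base-2 logarithms, as adopted for the information measures in this model):

```python
import math

def poisson_info(f_ij, F_i, N):
    """-log2 P(k_i|C) with P Poisson and lambda_i = p * F_i, p = 1/N."""
    lam = F_i / N
    p = math.exp(-lam) * lam ** f_ij / math.factorial(f_ij)
    return -math.log2(p)
```

Rarer terms (smaller λ_i) carry more information for the same within-document frequency.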
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 213

Random Distribution
The amount of information associated with term k_i in the collection can then be computed as

−log P(k_i|C) = −log( e^{−λ_i} × λ_i^{f_{i,j}} / f_{i,j}! )
             = −f_{i,j} log λ_i + λ_i log e + log(f_{i,j}!)
             ≈ f_{i,j} log( f_{i,j} / λ_i ) + ( λ_i + 1/(12 f_{i,j} + 1) − f_{i,j} ) × log e + (1/2) log(2π f_{i,j})

in which the logarithms are in base 2 and the factorial term f_{i,j}! was approximated by Stirling's formula

f_{i,j}! ≈ √(2π) × f_{i,j}^{(f_{i,j} + 0.5)} × e^{−f_{i,j}} × e^{1/(12 f_{i,j} + 1)}
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 214

Random Distribution
Another approach is to use a Bose-Einstein distribution and approximate it by a geometric distribution:

P(k_i|C) ≈ ( 1/(1 + λ_i) ) × ( λ_i/(1 + λ_i) )^{f_{i,j}}

The amount of information associated with term k_i in the collection can then be computed as

−log P(k_i|C) ≈ −log( 1/(1 + λ_i) ) − f_{i,j} × log( λ_i/(1 + λ_i) )

which provides a second form of computing the term distribution over the whole collection
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 215

Distribution over the Elite Set
The amount of information associated with the term distribution in the elite docs can be computed by using Laplace's law of succession:

1 − P(k_i|d_j) = 1 / (f_{i,j} + 1)

Another possibility is to adopt the ratio of two Bernoulli processes, which yields

1 − P(k_i|d_j) = (F_i + 1) / ( n_i × (f_{i,j} + 1) )

where n_i is the number of documents in which the term occurs, as before
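A sketch combining one choice from each side: the geometric (Bose-Einstein) randomness model for −log P(k_i|C) and Laplace's law of succession for 1 − P(k_i|d_j). The function names are hypothetical, and this is just one of the many DFR variants the two slides above allow.

```python
import math

def dfr_weight(f_ij, F_i, N):
    lam = F_i / N
    # Geometric randomness model: -log2 P(k_i|C)
    info = -math.log2(1 / (1 + lam)) - f_ij * math.log2(lam / (1 + lam))
    gain = 1 / (f_ij + 1)            # Laplace: 1 - P(k_i|d_j)
    return info * gain

def dfr_rank(query_freqs, doc, F, N):
    """R(d_j, q) = sum over query terms of f_iq * w_ij."""
    return sum(f_iq * dfr_weight(doc[t], F[t], N)
               for t, f_iq in query_freqs.items() if t in doc)
```

At equal within-document frequency, a rarer term (smaller F_i) receives a larger weight, as the model intends.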
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 216

Normalization
These formulations do not take into account the length of the document d_j. This can be done by normalizing the term frequency f_{i,j}
Distinct normalizations can be used, such as

f'_{i,j} = f_{i,j} × avg_doclen / len(d_j)

or

f'_{i,j} = f_{i,j} × log( 1 + avg_doclen / len(d_j) )

where avg_doclen is the average document length in the collection and len(d_j) is the length of document d_j
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 217

Normalization
To compute w_{i,j} weights using normalized term frequencies, just substitute the factor f_{i,j} by f'_{i,j}
Here we consider that the same normalization is applied for computing P(k_i|C) and P(k_i|d_j)
By combining different forms of computing P(k_i|C) and P(k_i|d_j) with different normalizations, various ranking formulas can be produced
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 218

Bayesian Network Models
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 219

Bayesian Inference
One approach for developing probabilistic models of IR is to use Bayesian belief networks
Belief networks provide a clean formalism for combining distinct sources of evidence
Types of evidence: past queries, past feedback cycles, distinct query formulations, etc.
Here we discuss two models:
Inference network, proposed by Turtle and Croft
Belief network model, proposed by Ribeiro-Neto and Muntz
Before proceeding, we briefly introduce Bayesian networks
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 220

Bayesian Networks
Bayesian networks are directed acyclic graphs (DAGs) in which:
the nodes represent random variables
the arcs portray causal relationships between these variables
the strengths of these causal influences are expressed by conditional probabilities
The parents of a node are those judged to be direct causes for it
This causal relationship is represented by a link directed from each parent node to the child node
The roots of the network are the nodes without parents
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 221

Bayesian Networks
Let
x_i be a node in a Bayesian network G
Γ_{x_i} be the set of parent nodes of x_i
The influence of Γ_{x_i} on x_i can be specified by any set of functions F_i(x_i, Γ_{x_i}) that satisfy

Σ_{∀x_i} F_i(x_i, Γ_{x_i}) = 1
0 ≤ F_i(x_i, Γ_{x_i}) ≤ 1

where x_i also refers to the states of the random variable associated with the node x_i
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 222

Bayesian Networks
A Bayesian network for a joint probability distribution P(x1, x2, x3, x4, x5)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 223

Bayesian Networks
The dependencies declared in the network allow the natural expression of the joint probability distribution:

P(x1, x2, x3, x4, x5) = P(x1) × P(x2|x1) × P(x3|x1) × P(x4|x2, x3) × P(x5|x3)

The probability P(x1) is called the prior probability for the network
It can be used to model previous knowledge about the semantics of the application
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 224

Inference Network Model
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 225

Inference Network Model
An epistemological view of the information retrieval problem
Random variables associated with documents, index terms and queries
A random variable associated with a document d_j represents the event of observing that document
The observation of d_j asserts a belief upon the random variables associated with its index terms
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 226

Inference Network Model
An inference network for information retrieval
Nodes of the network:
documents (d_j)
index terms (k_i)
queries (q, q1, and q2)
user information need (I)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 227

Inference Network Model
The edges from d_j to the nodes k_i indicate that the observation of d_j increases the belief in the variables k_i
d_j has index terms k2, k_i, and k_t
q has index terms k1, k2, and k_i
q1 and q2 model Boolean formulations
q1 = (k1 ∧ k2) ∨ k_i
I = q ∨ q1
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 228

Inference Network Model
Let
~k = (k1, k2, ..., kt) be a t-dimensional vector
k_i ∈ {0,1}, so ~k has 2^t possible states
Define

on(i, ~k) = 1 if k_i = 1 according to ~k; 0 otherwise

Let d_j ∈ {0,1} and q ∈ {0,1}
The ranking of d_j is a measure of how much evidential support the observation of d_j provides to the query
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 229

Inference Network Model
The ranking is computed as P(q ∧ d_j), where q and d_j are short representations for q = 1 and d_j = 1, respectively
d_j stands for a state where d_j = 1 and ∀l ≠ j ⇒ d_l = 0, because we observe one document at a time

P(q ∧ d_j) = Σ_{~k} P(q ∧ d_j | ~k) × P(~k)
           = Σ_{~k} P(q ∧ d_j ∧ ~k)
           = Σ_{~k} P(q | d_j ∧ ~k) × P(d_j ∧ ~k)
           = Σ_{~k} P(q | ~k) × P(~k | d_j) × P(d_j)

P(¬q ∧ d_j) = 1 − P(q ∧ d_j)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 230

Inference Network Model
The observation of d_j separates its children index term nodes, making them mutually independent
This implies that P(~k | d_j) can be computed in product form, which yields

P(q ∧ d_j) = Σ_{~k} P(q|~k) × P(d_j) × [ Π_{∀i | on(i,~k)=1} P(k_i|d_j) × Π_{∀i | on(i,~k)=0} P(¬k_i|d_j) ]

where P(¬k_i|d_j) = 1 − P(k_i|d_j)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 231

Prior Probabilities
The prior probability P(d_j) reflects the probability of observing a given document d_j
In Turtle and Croft this probability is set to 1/N, where N is the total number of documents in the system:

P(d_j) = 1/N
P(¬d_j) = 1 − 1/N

To include document length normalization in the model, we could also write P(d_j) as follows:

P(d_j) = 1 / |~d_j|
P(¬d_j) = 1 − P(d_j)

where |~d_j| stands for the norm of the vector ~d_j
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 232

Network for Boolean Model
How can an inference network be tuned to subsume the Boolean model?
First, for the Boolean model, the prior probabilities are given by:

P(d_j) = 1/N
P(¬d_j) = 1 − 1/N

Regarding the conditional probabilities P(k_i|d_j) and P(q|~k), the specification is as follows:

P(k_i|d_j) = 1 if k_i ∈ d_j; 0 otherwise
P(¬k_i|d_j) = 1 − P(k_i|d_j)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 233

NetworkforBooleanModel
WecanuseP(ki|dj)andP(q|
~
k)factorstocomputethe
evidentialsupporttheindextermsprovidetoq:
P(q|
~
k) =
(
1 ifc(q)=c(
~
k)
0 otherwise
P(
q|
~
k) = 1−P(q|
~
k)
wherec(q)andc(
~
k)aretheconjunctivecomponents
associatedwithqand
~
k,respectively
Byusin}thesede×nitionsinP(q∧dj)andP(
q∧dj)
equations,weobtaintheBooleanformofretrieval
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 234

Network for TF-IDF Strategies
For a tf-idf ranking strategy, the prior probability P(d_j) reflects the importance of document normalization:

P(d_j) = 1 / |~d_j|
P(¬d_j) = 1 − P(d_j)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 235

Network for TF-IDF Strategies
For the document-term beliefs, we write:

P(k_i|d_j) = α + (1 − α) × f̄_{i,j} × idf_i
P(¬k_i|d_j) = 1 − P(k_i|d_j)

where α varies from 0 to 1, and empirical evidence suggests that α = 0.4 is a good default value
Normalized term frequency and inverse document frequency:

f̄_{i,j} = f_{i,j} / max_i f_{i,j}
idf_i = log(N/n_i) / log N
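As a sketch, the document-term belief can be computed as follows (hypothetical helper; f_ij is the raw term frequency, max_f the largest term frequency in d_j, and n_i the document frequency):

```python
import math

def term_belief(f_ij, max_f, n_i, N, alpha=0.4):
    nf = f_ij / max_f                       # normalized term frequency
    nidf = math.log(N / n_i) / math.log(N)  # normalized idf
    return alpha + (1 - alpha) * nf * nidf
```

The belief stays in [α, 1]: it is α for a term absent from the document, and reaches 1 for the most frequent term of d_j when that term occurs in a single document.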
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 236

Network for TF-IDF Strategies
For the term-query beliefs, we write:

P(q|~k) = Σ_{k_i ∈ q} f̄_{i,j} × w_q
P(¬q|~k) = 1 − P(q|~k)

where w_q is a parameter used to set the maximum belief achievable at the query node
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 237

Network for TF-IDF Strategies
By substituting these definitions into the P(q ∧ d_j) and P(¬q ∧ d_j) equations, we obtain a tf-idf form of ranking
We notice that the ranking computed by the inference network is distinct from that for the vector model
However, an inference network is able to provide good retrieval performance with general collections
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 238

Combining Evidential Sources
In the figure below, the node q is the standard keyword-based query formulation for I
The second query q1 is a Boolean-like query formulation for the same information need
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 239

Combining Evidential Sources
Let I = q ∨ q1
In this case, the ranking provided by the inference network is computed as

P(I ∧ d_j) = Σ_{~k} P(I|~k) × P(~k|d_j) × P(d_j)
           = Σ_{~k} ( 1 − P(¬q|~k) × P(¬q1|~k) ) × P(~k|d_j) × P(d_j)

which might yield a retrieval performance that surpasses that of the query nodes in isolation (Turtle and Croft)
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 240

Belief Network Model
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 241

Belief Network Model
The belief network model is a variant of the inference network model with a slightly different network topology
As in the Inference Network Model:
Epistemological view of the IR problem
Random variables associated with documents, index terms and queries
Contrary to the Inference Network Model:
Clearly defined sample space
Set-theoretic view

Belief Network Model
By applying Bayes' rule, we can write:

P(d_j|q) = P(d_j ∧ q) / P(q)
P(d_j|q) ∼ Σ_{~k} P(d_j ∧ q | ~k) × P(~k)

because P(q) is a constant for all documents in the collection
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 243

Belief Network Model
Instantiation of the index term variables separates the nodes q and d_j, making them mutually independent:

P(d_j|q) ∼ Σ_{~k} P(d_j|~k) × P(q|~k) × P(~k)

To complete the belief network we need to specify the conditional probabilities P(q|~k) and P(d_j|~k)
Distinct specifications of these probabilities allow the modeling of different ranking strategies
Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition – p. 244

Belief Network Model
For the vector model, for instance, we define a vector ~k_i given by

~k_i = ~k | on(i, ~k) = 1 ∧ ∀j ≠ i, on(j, ~k) = 0

The motivation is that tf-idf ranking strategies sum up the individual contributions of index terms
We proceed as follows:

P(q|~k) = w_{i,q} / √( Σ_{i=1}^{t} w_{i,q}² )   if ~k = ~k_i ∧ on(i, ~q) = 1
        = 0                                      otherwise
P(¬q|~k) = 1 − P(q|~k)

Belief Network Model
Further, define

P(dj | k⃗) = wi,j / √(Σi=1..t wi,j²)   if k⃗ = k⃗i ∧ on(i, d⃗j) = 1
P(dj | k⃗) = 0                          otherwise
P(¬dj | k⃗) = 1 − P(dj | k⃗)

Then, the ranking of the retrieved documents coincides with the ranking ordering generated by the vector model
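With these conditional probabilities, only the t single-term states k⃗i contribute to the sum, so the belief-network score reduces to a sum of products of normalized term weights, i.e. the cosine-style ranking of the vector model. A minimal sketch, assuming a uniform prior P(k⃗i) (dropped as a constant) and made-up tf-idf weights:

```python
import math

# Sketch: with the P(q|ki) and P(dj|ki) defined above, the belief-network
# score is a sum over single-term states ki of normalized weight products,
# i.e. a cosine between document and query. Weights are made-up examples.

def score(doc_w, query_w):
    """Sum over states ki of P(dj|ki) * P(q|ki), dropping the constant P(ki)."""
    dn = math.sqrt(sum(w * w for w in doc_w))
    qn = math.sqrt(sum(w * w for w in query_w))
    return sum((dw / dn) * (qw / qn)
               for dw, qw in zip(doc_w, query_w) if dw > 0 and qw > 0)

# Two documents against the same query: the vector-model ordering emerges.
q  = [1.0, 0.8, 0.0]
d1 = [0.9, 0.7, 0.1]
d2 = [0.1, 0.2, 0.9]
print(score(d1, q) > score(d2, q))  # d1 shares more weight with q
```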

Computational Costs
In the inference network model only the states which have a single active document node are considered
Thus, the cost of computing the ranking is linear in the number of documents in the collection
However, the ranking computation is restricted to the documents which have terms in common with the query
The networks do not impose additional costs because they do not include cycles
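The restriction noted above, that only documents sharing terms with the query are scored, is exactly what an inverted index provides: each term maps to its posting list, so other documents are never touched. A minimal sketch with a made-up index:

```python
from collections import defaultdict

# Sketch of why ranking cost is bounded by the documents that share terms
# with the query: an inverted index maps each term to its posting list.
# The index contents below are made-up examples.
index = {
    "network": [1, 3],
    "belief": [1],
    "hypertext": [2],
}

def candidates(query_terms):
    """Documents with at least one query term in common; all others are skipped."""
    hits = defaultdict(int)
    for term in query_terms:
        for doc in index.get(term, []):
            hits[doc] += 1
    return dict(hits)

print(candidates(["belief", "network"]))  # only documents 1 and 3 are scored
```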

Other Models
Hypertext Model
Web-based Models
Structured Text Retrieval
Multimedia Retrieval
Enterprise and Vertical Search

Hypertext Model

The Hypertext Model
Hypertexts provided the basis for the design of the hypertext markup language (HTML)
Written text is usually conceived to be read sequentially
Sometimes, however, we are looking for information that cannot be easily captured through sequential reading
For instance, while glancing at a book about the history of the wars, we might be interested in wars in Europe
In such a situation, a different organization of the text is desired

The Hypertext Model
The solution is to define a new organizational structure besides the one already in existence
One way to accomplish this is through hypertexts, which are high-level interactive navigational structures
A hypertext consists basically of nodes that are correlated by directed links in a graph structure
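The node-and-link structure just described can be sketched as a small directed graph; the node names, texts, and link targets below are hypothetical examples:

```python
# A minimal sketch of a hypertext as a directed graph: each node holds text,
# and directed links let the reader jump from one node to another.
# Node names and link structure here are hypothetical examples.
hypertext = {
    "A": {"text": "Overview of European history", "links": ["B", "C"]},
    "B": {"text": "Wars in Europe", "links": ["C"]},
    "C": {"text": "Timeline of major battles", "links": []},
}

def reachable(start):
    """Return all nodes the reader can reach from `start` by following links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(hypertext[node]["links"])
    return seen

print(sorted(reachable("A")))
```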

The Hypertext Model
Two nodes A and B might be connected by a directed link lAB which correlates the texts of these two nodes
In this case, the reader might move to node B while reading the text associated with node A
When the hypertext is large, the user might lose track of its organizational structure
To avoid this problem, it is desirable that the hypertext include a hypertext map
In its simplest form, this map is a directed graph which displays the current node being visited

The Hypertext Model
Definition of the structure of the hypertext should be accomplished in a domain modeling phase
After the modeling of the domain, a user interface design should be concluded prior to implementation
Only then can we say that we have a proper hypertext structure for the application at hand

Web-based Models

Web-based Models
The first Web search engines were fundamentally IR engines based on the models we have discussed here
The key differences were:
  the collections were composed of Web pages (not documents)
  the pages had to be crawled
  the collections were much larger
This third difference also meant that each query word retrieved too many documents
As a result, the results produced by these engines were frequently unsatisfying

Web-based Models
A key piece of innovation was missing: the use of link information present in Web pages to modify the ranking
There are two fundamental approaches to do this, namely PageRank and Hubs-Authorities
Such approaches are covered in Chapter 11 of the book (Web Retrieval)
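As a rough illustration of the link-based ranking idea mentioned above, a PageRank score can be computed by power iteration. The tiny link graph, the damping factor 0.85, and the iteration count below are assumptions for the example, not details from the slides:

```python
# Minimal sketch of PageRank via power iteration on a toy link graph.
# Graph, damping factor, and iteration count are illustrative assumptions.
links = {0: [1, 2], 1: [2], 2: [0]}  # page -> pages it links to
n = len(links)
d = 0.85                              # damping factor
rank = [1.0 / n] * n                  # start from a uniform distribution

for _ in range(50):
    new = [(1 - d) / n] * n
    for page, outs in links.items():
        for target in outs:
            # each page spreads its rank evenly over its out-links
            new[target] += d * rank[page] / len(outs)
    rank = new

# Ranks sum to 1; page 2, which receives the most link mass, scores highest.
print(rank)
```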

Structured Text Retrieval

Structured Text Retrieval
All the IR models discussed here treat the text as a string with no particular structure
However, information on the structure might be important to the user for particular searches
Ex: retrieve a book that contains a figure of the Eiffel tower in a section whose title contains the term "France"
The solution to this problem is to take advantage of the text structure of the documents to improve retrieval
Structured text retrieval is discussed in Chapter 13 of the book

Multimedia Retrieval

Multimedia Retrieval
Multimedia data, in the form of images, audio, and video, frequently lack text associated with them
The retrieval strategies that have to be applied are quite distinct from text retrieval strategies
However, multimedia data are an integral part of the Web
Multimedia retrieval methods are discussed in great detail in Chapter 14 of the book

Enterprise and Vertical Search

Enterprise and Vertical Search
Enterprise search is the task of searching for information of interest in corporate document collections
Many issues not present in the Web, such as privacy, ownership, and permissions, are important in enterprise search
In Chapter 15 of the book we discuss in detail some enterprise search solutions

Enterprise and Vertical Search
A vertical collection is a repository of documents specialized in a given domain of knowledge
To illustrate, Lexis-Nexis offers full-text search focused on the areas of business and law
Vertical collections present specific challenges with regard to search and retrieval