A similarity measure can represent the similarity between two documents, two queries, or one document and one query
Size: 298.11 KB
Language: en
Added: Nov 10, 2007
Slides: 32 pages
Slide Content
Chapter 3
Similarity Measures
Data Mining Technology
Chapter 3
Similarity Measures
Written by Kevin E. Heinrich
Presented by Zhao Xinyou [email protected]
2007.6.7
Some materials (Examples) are taken from Website.
Searching Process
Input Text
Process
Index
Query
Sorting
Show Text Result
Input Text
Index
Query
Sorting
Show Text Result
Input Key Words
Results
Search
1.XXXXXXX
2.YYYYYYY
3.ZZZZZZZZ
.......................
Example
Similarity Measures
A similarity measure can represent the
similarity between two documents, two
queries, or one document and one query
It is possible to rank the retrieved documents
in the order of presumed importance
A similarity measure is a function which
computes the degree of similarity between a
pair of text objects
There are a large number of similarity
measures proposed in the literature, because
the best similarity measure doesn't exist
(yet!)
PP27-28
Classic Similarity Measures
All similarity measures should map to the
range [-1,1] or [0,1],
0 or -1 shows minimum similarity.
(incompatible similarity)
1 shows maximum similarity. (absolute
similarity)
PP28
Conversion
For example
1 shows incompatible similarity, 10 shows
absolute similarity.
[1, 10]
[0, 1]
s’ = (s – 1 ) / 9
Generally, we may use:
s’ = ( s – min_s ) / ( max_s – min_s )
Linear
Non-linear
PP28
Vector-Space Model-VSM
1960s Salton etc provided VSM, which has
been successfully applied on SMART (a text
searching system).
PP28-29
Example
D is a set, which contains m Web
documents;
D={d
1
, d
2
,…d
i
…d
m
} i=1,2…m
There are n words among m Web
documents. di={w
i1
,w
i2
,…
w
ij
,…w
in
} i=1,2…m ,j=1,2,….n
Q= {q
1
,q
2
,…q
i
,….q
n
} i=1,2,….n
PP28-29
If similarity(q,d
i
) > similarity(q, d
j
)
We may get the result d
i
is more relevant than d
j
Simple Measure Technology
Documents Set
PP29
Retrieved A
Relevant B
Retrieved and Relevant A∩B
Precision = Returned Relevant Documents / Total Returned Documents
Recall = Returned Relevant Documents / Total Relevant Documents
P(A,B) = |A∩B| / |A|
R(A,B) = |A∩B| / |B|
Precision-Recall Graph-Curves
There is a tradeoff between Precision and Recall
So measure Precision at different levels of Recall
PP29-30
One Query
Two Queries
Difficult to determine which of these two
hypothetical results is better
Similarity measures based on VSM
Dice coefficient
Overlap Coefficient
Jaccard
Cosine
Asymmetric
Dissimilarity
Other measures
PP30
Dice Coefficient-Cont’
Definition of Harmonic Mean:
To X
1
,X
2
, …, X
n
,their harmonic mean E equals n divided
by(1/x
1
+1/x
2
+…+1/X
n
), that is
RP
E
11
2
To Harmonic Mean (E) of Precision (P) and Recall
(R)
BA
BA
B
BA
A
BA
22
PP30
n
i
in
x
n
xxx
n
E
1
21
1111
Dice Coefficient-Cont’
Denotation of Dice Coefficient:
)1,0(
)1(
)1(
),(),(
1
2
1
2
1
n
k
kj
n
k
kq
n
k
kjkq
j
ww
ww
BA
BA
BADdqsim
PP30
E
BA
BA
BA
BA
BAthenDif
2
)1(
),(
2
1
α>0.5 : precision is more important
α<0.5 : recall is more important
Usually α=0.5
Overlap Coefficient
PP30-31
Documents Set
A queries
B Documents
n
k
n
k
kjkq
n
k
kjkq
j
ww
ww
BA
BA
BAOdqsim
1 1
22
1
),min(
),min(
),(),(
Jaccard Coefficient-Cont’
Documents Set
A queries
B Documents
n
k
n
k
kjkq
n
k
kjkq
n
k
kjkq
j
wwww
ww
BA
BA
BAJdqsim
1 11
22
1
),(),(
PP31
Example- Jaccard Coefficient
D1 = 2T1 + 3T2 + 5T3, (2,3,5)
D2 = 3T1 + 7T2 + T3 , (3,7,1)
Q = 0T1 + 0T2 + 2T3, (0,0,2)
J(D1 , Q) = 10 / (38+4-10) = 10/32 = 0.31
J(D2 , Q) = 2 / (59+4-2) = 2/61 = 0.04
PP31
n
k
n
k
kjkq
n
k
kjkq
n
k
kjkq
j
wwww
ww
dqsim
1 11
22
1
),(
Cosine Coefficient-Cont’
n
k
kj
n
k
kq
n
k
kjkq
j
j
j
ww
ww
dq
dq
BA
BA
PRBACdqsim
1
2
1
2
1
),(),(
PP31-32
(d
21
,d
22
,…d
2n
)
(d
11
,d
12
,…d
1n
)
(q
1
,q
2
,…q
n
)
Asymmetric
PP31
n
k
kq
n
k
kjkq
jj
w
ww
dqAdqsim
1
1
),min(
),(),(
dj
di
W
ki
-->w
kj
Euclidean distance
n
k
kjkqjEj
wwdqddqdis
1
2
)(),(),(
PP32
Manhattan block distance
n
k
kjkqjMj
wwdqddqdis
1
),(),(
PP32
Other Measures
We may use priori/context knowledge
For example:
Sim(q,d
j
)= [content identifier similarity]+
[objective term similarity]+
[citation similarity]
PP32
Comparison
PP34
Comparison
),min(
*2
2
1
2
1
BA
BA
O
BA
BA
C
BA
BA
J
BA
BA
D
BA
Simple matching
Dice’s Coefficient
Cosine Coefficient
Overlap Coefficient
Jaccard’s Coefficient
|A|+|B|-|A∩B| ≥(|A|+|B|)/2
|A| ≥ |A∩B|
|B| ≥ |A∩B|
(|A|+|B|)/2 ≥√(|A|*|B|)
√(|A|*|B|)≥ min (|A|, |B|)
O≥C≥D≥J
PP34
Example-Documents-Term-Query-Cont’
D1:A search Engine for 3D Models
D2:Design and Implementation of a string database
query language
D3:Ranking of documents by measures considering
conceptual dependence between terms
D4 Exploiting hierarchical domain structure to
compute similarity
D5:an approach for measuring semantic similarity
between words using multiple information sources
D6:determinging semantic similarity among entity
classes from different ontologies
D7:strong similarity measures for ordered sets of
documents in information retrieval
T1:search(ing)
T2:Engine(s)
T3:Models
T4:database
T5:query
T6:language
T7:documents
T8:measur(es,ing)
T9:conceptual
T10:dependence
T11: domain
T12:structure
T13:similarity
T14:semantic
T15: ontologies
T16:information
T17: retrieval
Query:
Semantic similarity measures used by search engines and other
information searching mechanisms
PP33
Dice coefficient
)0*0...1*11*1()0*0...1*12*2(
)0*01*0...0*01*01*11*2(*2
n
k
k
n
k
kq
n
k
kkq
n
k
k
n
k
kq
n
k
kkq
ww
ww
ww
ww
qDD
1
2
1
1
2
1
1
1
2
1
1
2
1
1
2
)
2
1
(
)1(
),1(
5.0
12
6
39
)12(*2
PP30, PP34
Final Results
PP34
O≥C≥D≥J
Current Applications
Multi-Dimensional Modeling
Hierarchical Clustering
Bioinformatics
PP35-38