similarity measure

totoyou 20,470 views 32 slides Nov 10, 2007
Slide 1
Slide 1 of 32
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32

About This Presentation

A similarity measure can represent the similarity between two documents, two queries, or one document and one query


Slide Content

Chapter 3
Similarity Measures
Data Mining Technology

Chapter 3
Similarity Measures
Written by Kevin E. Heinrich
Presented by Zhao Xinyou
[email protected]
2007.6.7
Some materials (Examples) are taken from Website.

Searching Process
Input Text
Process
Index
Query
Sorting
Show Text Result
Input Text
Index
Query
Sorting
Show Text Result
Input Key Words
Results
Search
1.XXXXXXX
2.YYYYYYY
3.ZZZZZZZZ
.......................

Example

Similarity Measures
A similarity measure can represent the
similarity between two documents, two
queries, or one document and one query
It is possible to rank the retrieved documents
in the order of presumed importance
A similarity measure is a function which
computes the degree of similarity between a
pair of text objects
There are a large number of similarity
measures proposed in the literature, because
the best similarity measure doesn't exist
(yet!)
PP27-28

Classic Similarity Measures
All similarity measures should map to the
range [-1,1] or [0,1],
0 or -1 shows minimum similarity.
(incompatible similarity)
1 shows maximum similarity. (absolute
similarity)
PP28

Conversion
For example
1 shows incompatible similarity, 10 shows
absolute similarity.
[1, 10]
[0, 1]
s’ = (s – 1 ) / 9
Generally, we may use:
s’ = ( s – min_s ) / ( max_s – min_s )
Linear
Non-linear
PP28

Vector-Space Model-VSM
1960s Salton etc provided VSM, which has
been successfully applied on SMART (a text
searching system).
PP28-29

Example
D is a set, which contains m Web
documents;
D={d
1
, d
2
,…d
i
…d
m
} i=1,2…m
There are n words among m Web
documents. di={w
i1
,w
i2
,…
w
ij
,…w
in
} i=1,2…m ,j=1,2,….n
Q= {q
1
,q
2
,…q
i
,….q
n
} i=1,2,….n
PP28-29
If similarity(q,d
i
) > similarity(q, d
j
)
We may get the result d
i
is more relevant than d
j

Simple Measure Technology
Documents Set
PP29
Retrieved A
Relevant B
Retrieved and Relevant A∩B
Precision = Returned Relevant Documents / Total Returned Documents
Recall = Returned Relevant Documents / Total Relevant Documents
P(A,B) = |A∩B| / |A|
R(A,B) = |A∩B| / |B|

Example--Simple Measure Technology
PP29
Documents Set
A,C,E,G,
H, I, J
Relevant
B,D,F
Retrieved &
Relevant
W,Y
Retrieved
|B| = {relevant} ={A,B,C,D,E,F,G,H,I,J} = 10
|A| = {retrieved} = {B, D, F,W,Y} = 5
|A∩B| = {relevant} ∩ {retrieved} ={B,D,F} = 3
P = precision = 3/5 = 60%
R = recall = 3/10 = 30%

Precision-Recall Graph-Curves
There is a tradeoff between Precision and Recall
So measure Precision at different levels of Recall
PP29-30
One Query
Two Queries
Difficult to determine which of these two
hypothetical results is better

Similarity measures based on VSM
Dice coefficient
Overlap Coefficient
Jaccard
Cosine
Asymmetric
Dissimilarity
Other measures
PP30

Dice Coefficient-Cont’
Definition of Harmonic Mean:
To X
1
,X
2
, …, X
n
,their harmonic mean E equals n divided
by(1/x
1
+1/x
2
+…+1/X
n
), that is
RP
E
11
2


To Harmonic Mean (E) of Precision (P) and Recall
(R)
BA
BA
B
BA
A
BA 






22
PP30





n
i
in
x
n
xxx
n
E
1
21
1111

Dice Coefficient-Cont’
Denotation of Dice Coefficient:
)1,0(
)1(
)1(
),(),(
1
2
1
2
1












n
k
kj
n
k
kq
n
k
kjkq
j
ww
ww
BA
BA
BADdqsim

PP30
E
BA
BA
BA
BA
BAthenDif







2
)1(
),(
2
1


α>0.5 : precision is more important
α<0.5 : recall is more important
Usually α=0.5

Overlap Coefficient
PP30-31
Documents Set
A queries
B Documents
 

 



n
k
n
k
kjkq
n
k
kjkq
j
ww
ww
BA
BA
BAOdqsim
1 1
22
1
),min(
),min(
),(),(

Jaccard Coefficient-Cont’
Documents Set
A queries
B Documents
 

 




n
k
n
k
kjkq
n
k
kjkq
n
k
kjkq
j
wwww
ww
BA
BA
BAJdqsim
1 11
22
1
),(),(


PP31

Example- Jaccard Coefficient
D1 = 2T1 + 3T2 + 5T3, (2,3,5)
D2 = 3T1 + 7T2 + T3 , (3,7,1)
Q = 0T1 + 0T2 + 2T3, (0,0,2)
J(D1 , Q) = 10 / (38+4-10) = 10/32 = 0.31
J(D2 , Q) = 2 / (59+4-2) = 2/61 = 0.04
PP31
 

 



n
k
n
k
kjkq
n
k
kjkq
n
k
kjkq
j
wwww
ww
dqsim
1 11
22
1
),(

Cosine Coefficient-Cont’











n
k
kj
n
k
kq
n
k
kjkq
j
j
j
ww
ww
dq
dq
BA
BA
PRBACdqsim
1
2
1
2
1
),(),(

PP31-32
(d
21
,d
22
,…d
2n
)
(d
11
,d
12
,…d
1n
)
(q
1
,q
2
,…q
n
)

Example-Cosine Coefficient
Q = 0T1 + 0T2 + 2T3
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
C (D1 , Q) =


=10/ √ (38*4) = 0.81
C (D2 , Q) = 2 / √ (59*4) = 0.13
)200()532(
523020
222222


PP31-32
(3,7,1)
(2,3,5)
(0,0,2)

Asymmetric
PP31





n
k
kq
n
k
kjkq
jj
w
ww
dqAdqsim
1
1
),min(
),(),(
dj
di
W
ki
-->w
kj

Euclidean distance



n
k
kjkqjEj
wwdqddqdis
1
2
)(),(),(
PP32

Manhattan block distance



n
k
kjkqjMj
wwdqddqdis
1
),(),(
PP32

Other Measures
We may use priori/context knowledge
For example:
Sim(q,d
j
)= [content identifier similarity]+
[objective term similarity]+
[citation similarity]



PP32

Comparison
PP34

Comparison
),min(
*2
2
1
2
1
BA
BA
O
BA
BA
C
BA
BA
J
BA
BA
D
BA












Simple matching
Dice’s Coefficient
Cosine Coefficient
Overlap Coefficient
Jaccard’s Coefficient
|A|+|B|-|A∩B| ≥(|A|+|B|)/2
|A| ≥ |A∩B|
|B| ≥ |A∩B|
(|A|+|B|)/2 ≥√(|A|*|B|)
√(|A|*|B|)≥ min (|A|, |B|)
O≥C≥D≥J
PP34

Example-Documents-Term-Query-Cont’
D1:A search Engine for 3D Models
D2:Design and Implementation of a string database
query language
D3:Ranking of documents by measures considering
conceptual dependence between terms
D4 Exploiting hierarchical domain structure to
compute similarity
D5:an approach for measuring semantic similarity
between words using multiple information sources
D6:determinging semantic similarity among entity
classes from different ontologies
D7:strong similarity measures for ordered sets of
documents in information retrieval
T1:search(ing)
T2:Engine(s)
T3:Models
T4:database
T5:query
T6:language
T7:documents
T8:measur(es,ing)
T9:conceptual
T10:dependence
T11: domain
T12:structure
T13:similarity
T14:semantic
T15: ontologies
T16:information
T17: retrieval
Query:
Semantic similarity measures used by search engines and other
information searching mechanisms
PP33

Example-Term-Document Matrix-Cont’
111112Q
11111D7
111D6
1111D5
111D4
1111D3
111D2
111D1
T17T16T15T1413T12T11T10T9T8T7T6T5T4T3T2T1
Matrix[q][A]
PP34

Dice coefficient
)0*0...1*11*1()0*0...1*12*2(
)0*01*0...0*01*01*11*2(*2















n
k
k
n
k
kq
n
k
kkq
n
k
k
n
k
kq
n
k
kkq
ww
ww
ww
ww
qDD
1
2
1
1
2
1
1
1
2
1
1
2
1
1
2
)
2
1
(
)1(
),1( 

5.0
12
6
39
)12(*2




PP30, PP34

Final Results
PP34
O≥C≥D≥J

Current Applications
Multi-Dimensional Modeling
Hierarchical Clustering
Bioinformatics
PP35-38

Discussion
Tags