Cluster Analysis: Measuring Similarity & Dissimilarity


About This Presentation

Dissimilarity of numeric data: Minkowski distance, Euclidean distance, and Manhattan distance; proximity measures for categorical and ordinal attributes and ratio-scaled variables; dissimilarity for attributes of mixed types; cosine similarity; partitioning methods: k-means and k-medoids.


Slide Content

Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited, ISO 9001:2015 Certified
Department of Computer Engineering
(NBA Accredited)
Prof. S. A. Shivarkar
Assistant Professor
Contact No. 8275032712
[email protected]
Subject: Data Mining and Warehousing (CO314)
Unit IV: Cluster Analysis: Measuring Similarity & Dissimilarity

Content
Measuring data similarity and dissimilarity; proximity measures for nominal attributes, binary attributes, and interval-scaled attributes
Dissimilarity of numeric data: Minkowski distance, Euclidean distance, and Manhattan distance
Proximity measures for categorical and ordinal attributes, ratio-scaled variables
Dissimilarity for attributes of mixed types, cosine similarity
Partitioning methods: k-means, k-medoids, outlier detection

What is Clustering?
Clustering is the process of grouping a set of data objects
into multiple groups or clusters so that objects within a
cluster have high similarity, but are very dissimilar to objects
in other clusters.
Dissimilarities and similarities are assessed based on the
attribute values describing the objects and often involve
distance measures.

What is Cluster Analysis?
Cluster: a collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms

Clustering for Data Understanding and Applications
Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus, and species
Information retrieval: document clustering
Land use: identification of areas of similar land use in an earth observation database
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
City planning: identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Climate: understanding Earth's climate; finding patterns in the atmosphere and ocean
Economic science: market research

Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters:
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method,
its implementation, and
its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
Weights should be associated with different variables based on applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that measures the “goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective

Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0, 1]
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity refers to either a similarity or a dissimilarity

Data Matrix and Dissimilarity Matrix
Data matrix
n data points (objects) with p dimensions (attributes)
Two modes: rows represent objects, columns represent attributes
An n-by-p matrix:

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$

Dissimilarity matrix
n data points, but registers only the distance d(i, j) between each pair of objects
A triangular matrix; single mode (rows and columns both represent objects):

$$\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$

Proximity Measure for Nominal Attributes
A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)
Method 1: Simple matching
m: number of matches, p: total number of variables

$$d(i, j) = \frac{p - m}{p}$$

Method 2: Use a large number of binary attributes
Create a new binary attribute for each of the M nominal states

Proximity Measure for Binary Attributes
A contingency table for binary data: for objects i and j,
q = number of attributes that are 1 for both, r = 1 for i but 0 for j,
s = 0 for i but 1 for j, t = 0 for both (p = q + r + s + t)
Distance measure for symmetric binary variables:

$$d(i, j) = \frac{r + s}{q + r + s + t}$$

Distance measure for asymmetric binary variables (negative matches t are ignored):

$$d(i, j) = \frac{r + s}{q + r + s}$$

Jaccard coefficient (similarity measure for asymmetric binary variables):

$$\mathrm{sim}_{\text{Jaccard}}(i, j) = \frac{q}{q + r + s}$$

Note: the Jaccard coefficient is the same as “coherence”.

Dissimilarity between Binary Variables
Example
Gender is a symmetric attribute, so it is not considered.
The remaining attributes are asymmetric binary, so dissimilarity is computed using
d(i, j) = (r + s) / (q + r + s), letting the values Y and P be 1 and the value N be 0.

Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Jack |   M    |   Y   |   N   |   P    |   N    |   N    |   N
Mary |   F    |   Y   |   N   |   P    |   N    |   P    |   N
Jim  |   M    |   Y   |   P   |   N    |   N    |   N    |   N

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
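The hand computation above can be checked with a small Python sketch that counts q, r, and s for asymmetric binary attributes and evaluates d(i, j) = (r + s) / (q + r + s); the variable names are illustrative.

```python
def asymmetric_binary_dissimilarity(x, y):
    """d(i, j) = (r + s) / (q + r + s): negative (0-0) matches are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Y/P mapped to 1, N mapped to 0 (Fever, Cough, Test-1 ... Test-4)
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(round(asymmetric_binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))   # 0.75
```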

Standardizing Numeric Data
Numeric attributes can have very different ranges (e.g., 1, 100, 100,000), which biases distance computations, so the data should be normalized.
Z-score:

$$z = \frac{x - \mu}{\sigma}$$

x: raw score to be standardized, μ: mean of the population, σ: standard deviation
The z-score is the distance between the raw score and the population mean in units of the standard deviation; it is negative when the raw score is below the mean and positive when above.
An alternative way: calculate the mean absolute deviation

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

where

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

Standardized measure (z-score):

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

Using the mean absolute deviation is more robust than using the standard deviation.
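Both standardization options described above can be sketched in a few lines of Python (the population z-score and the mean-absolute-deviation variant); the function names are just for illustration.

```python
import math

def z_scores(values):
    """Classical z-score: z = (x - mu) / sigma, with the population std. deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))
    return [(x - mu) / sigma for x in values]

def z_scores_mad(values):
    """More robust variant: divide by the mean absolute deviation s_f."""
    m = sum(values) / len(values)
    s = sum(abs(x - m) for x in values) / len(values)
    return [(x - m) / s for x in values]

print(z_scores([1, 100, 100000]))      # standard-deviation scaling
print(z_scores_mad([1, 100, 100000]))  # mean-absolute-deviation scaling
```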

Example: Data Matrix and Dissimilarity Matrix
Data Matrix

point | attribute1 | attribute2
x1    |     1      |     2
x2    |     3      |     5
x3    |     2      |     0
x4    |     4      |     5

Dissimilarity Matrix (with Euclidean distance)

     x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.10  0
x4   4.24  1.00  5.39  0

Distance on Numeric Data: Minkowski Distance
Minkowski distance: a popular distance measure

$$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$$

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
d(i, j) = d(j, i) (symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
A distance that satisfies these properties is a metric

Special Cases of Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

E.g., the Hamming distance: the number of bits that are different between two binary vectors
h = 2: Euclidean (L2 norm) distance

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

h → ∞: “supremum” (L_max norm, L_∞ norm) distance
This is the maximum difference between any component (attribute) of the vectors:

$$d(i, j) = \max_{f} |x_{if} - x_{jf}|$$


Example of Minkowski Distance

point | attribute 1 | attribute 2
x1    |      1      |      2
x2    |      3      |      5
x3    |      2      |      0
x4    |      4      |      5

Dissimilarity Matrices

Manhattan (L1)
     x1   x2   x3   x4
x1   0
x2   5    0
x3   3    6    0
x4   6    1    7    0

Euclidean (L2)
     x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.10  0
x4   4.24  1.00  5.39  0

Supremum (L∞)
     x1   x2   x3   x4
x1   0
x2   3    0
x3   2    5    0
x4   3    1    5    0
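The three dissimilarity matrices above can be reproduced with a short Python sketch of the Minkowski distance for h = 1, h = 2, and the supremum case; this is illustrative code, not part of the original slides.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h; h = float('inf') gives the supremum (L_max) distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if h == float("inf"):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
for h in (1, 2, float("inf")):          # Manhattan, Euclidean, supremum
    print(f"h = {h}")
    for name_i, p_i in points.items():
        row = [round(minkowski(p_i, p_j, h), 2) for p_j in points.values()]
        print(name_i, row)
```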

Ordinal Variable
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled:
replace x_{if} by its rank r_{if} ∈ {1, …, M_f}
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

compute the dissimilarity using methods for interval-scaled variables
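A minimal Python sketch of the rank-to-[0, 1] mapping z_{if} = (r_{if} - 1) / (M_f - 1); the level names used in the example are illustrative assumptions.

```python
def ordinal_to_interval(value, ordered_levels):
    """Map an ordinal value to [0, 1]: z = (r - 1) / (M - 1)."""
    r = ordered_levels.index(value) + 1   # rank r, from 1 to M
    M = len(ordered_levels)
    return (r - 1) / (M - 1)

levels = ["fair", "good", "excellent"]    # M = 3
print([ordinal_to_interval(v, levels) for v in levels])  # [0.0, 0.5, 1.0]
```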

Attributes of Mixed Type
A database may contain all attribute types:
nominal, symmetric binary, asymmetric binary, numeric, ordinal
One may use a weighted formula to combine their effects:

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

where the indicator δ_{ij}^{(f)} = 0 if x_{if} or x_{jf} is missing (or both are 0 for an asymmetric binary attribute), and 1 otherwise.
If f is binary or nominal:
d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
If f is numeric: use the normalized distance
If f is ordinal:
compute the ranks r_{if} and z_{if} = (r_{if} - 1) / (M_f - 1)
treat z_{if} as interval-scaled
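Below is a hedged Python sketch of the weighted mixed-type formula, handling nominal, numeric, and ordinal attributes; the attribute descriptions, helper names, and example values are illustrative assumptions rather than part of the slides.

```python
def mixed_dissimilarity(obj_i, obj_j, attr_types, ranges=None, levels=None):
    """attr_types[f] is 'nominal', 'numeric', or 'ordinal'.
    ranges[f] : (min, max) of numeric attribute f, used for normalization.
    levels[f] : ordered list of states for ordinal attribute f."""
    num, den = 0.0, 0.0
    for f, t in enumerate(attr_types):
        a, b = obj_i[f], obj_j[f]
        if a is None or b is None:          # missing value: delta^(f) = 0
            continue
        if t == "nominal":
            d = 0.0 if a == b else 1.0
        elif t == "numeric":
            lo, hi = ranges[f]
            d = abs(a - b) / (hi - lo)      # normalized |x_if - x_jf|
        else:                               # ordinal: map ranks onto [0, 1] first
            M = len(levels[f])
            za = levels[f].index(a) / (M - 1)
            zb = levels[f].index(b) / (M - 1)
            d = abs(za - zb)
        num += d                            # delta^(f) = 1 for present values
        den += 1
    return num / den if den else 0.0

# Example: one nominal, one numeric (range 0..100), one ordinal attribute
print(mixed_dissimilarity(("red", 30, "good"), ("blue", 70, "excellent"),
                          ["nominal", "numeric", "ordinal"],
                          ranges={1: (0, 100)},
                          levels={2: ["fair", "good", "excellent"]}))
```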

Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
Other vector objects: gene features in micro-arrays, …
Applications: information retrieval, biological taxonomy, gene feature mapping, …
Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d

Cosine Similarity Example
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d
Example: find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
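The worked example can be checked with a few lines of Python; this is an illustrative sketch, not part of the original material.

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))   # 0.94
```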

Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning
is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g.,
one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based
(e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-dimensional
clustering)

Requirements and Challenges
Scalability
Clustering all the data instead of only samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of
these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality

Major Clustering Approaches
What are major clustering approaches?
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing the
sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches
Model-based:
A model is hypothesized for each of the clusters, and the aim is to find
the best fit of the data to the given model
Typical methods:EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: p-Cluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
Massive links can be used to cluster objects: SimRank, LinkClus

Partitioning Algorithms: Basic Concept
Partitioning method: partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where c_i is the centroid or medoid of cluster C_i):

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$$

Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: the k-means and k-medoids algorithms
k-means (MacQueen’67, Lloyd’57/’82): each cluster is represented by the center of the cluster
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when the assignment does not change
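A compact Python sketch of these steps (Lloyd-style k-means, here started from caller-supplied seed centers for simplicity); all names are illustrative, not from the slides.

```python
import math

def euclidean(p, q):
    """Euclidean (L2) distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, centers, max_iter=100):
    """Assign each point to its nearest center, recompute each center as the
    mean of its cluster, and stop when the assignment no longer changes."""
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:       # Step 4: stop when nothing changes
            break
        assignment = new_assignment
        for c in range(len(centers)):          # Step 2: recompute centroids (means)
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment
```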

An Example of K-Means Clustering
[Figure: with K = 2, the initial data set is arbitrarily partitioned into k groups, the cluster centroids are updated, objects are reassigned to their nearest centroid, and the loop repeats as needed.]
Partition objects into k nonempty subsets
Repeat
Compute the centroid (i.e., mean point) for each partition
Assign each object to the cluster of its nearest centroid
Until no change

K-Means Clustering:Example
Suppose that the data mining task is to cluster points (with (x, y) representing location) into three clusters, where the points are:
A1 (2, 10), A2 (2, 5), A3 (8, 4), B1 (5, 8), B2 (7, 5), B3 (6, 4), C1 (1, 2), C2 (4, 9).
The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the center of each cluster, respectively.
Use the k-means algorithm to show only:
the three cluster centers after the first round of execution, and
the final three clusters.
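The exercise can be run directly with the illustrative k_means/euclidean helpers sketched two slides earlier (assumed to be in scope); calling with max_iter=1 yields the centers after the first round, and the default call runs to convergence.

```python
# Points A1..C2 from the example; initial centers are A1, B1, C1.
points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
centers, assignment = k_means(points, [(2, 10), (5, 8), (1, 2)])
print("final centers:", centers)
print("cluster index of A1, A2, A3, B1, B2, B3, C1, C2:", assignment)
```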

Comments on the K-Means Method
Strength: efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
Comment: often terminates at a local optimum.
Weakness
Applicable only to objects in a continuous n-dimensional space
Use the k-modes method for categorical data
In comparison, k-medoids can be applied to a wide range of data
Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
Sensitive to noisy data and outliers
Not suitable for discovering clusters with non-convex shapes

Variations of the K-Means Method
Most variants of k-means differ in:
selection of the initial k means
dissimilarity calculations
strategies to calculate cluster means
Handling categorical data: k-modes
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: the k-prototype method

What is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers: an object with an extremely large value may substantially distort the distribution of the data, because the mean is pulled toward it.
K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.

The K-Medoids Clustering Method
K-medoids clustering: find representative objects (medoids) in clusters
PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well to large data sets (due to its computational complexity)
Efficiency improvements on PAM:
CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
CLARANS (Ng & Han, 1994): randomized re-sampling


The K-MedoidClustering Steps
1. Select two medoids randomly.
2. Calculate the Manhattan distance d from each point to each medoid.
3. Assign each data point to the cluster of the medoid with the minimum distance.
4. Calculate the total cost of the clustering.
5. Randomly select one non-medoid point, swap it with one of the existing medoids (from the previous iteration), and recalculate the total cost.
6. Calculate S = current total cost - previous total cost.
7. If S > 0, the old medoids (clusters) are kept; otherwise the swap is accepted and the process repeats.

The K-Medoids Clustering Steps: Example Data

Point |  x  |  y
X1    |  2  |  6
X2    |  3  |  4
X3    |  3  |  8
X4    |  4  |  7
X5    |  6  |  2
X6    |  6  |  4
X7    |  7  |  3
X8    |  7  |  4
X9    |  8  |  5
X10   |  7  |  6
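A rough Python sketch of the swap-based procedure described above, applied to the ten points in the table (Manhattan distance, randomly chosen candidate swaps). This is a simplification of full PAM under stated assumptions, and all names are illustrative.

```python
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    """Sum of each point's Manhattan distance to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k=2, trials=200, seed=0):
    """Start from k random medoids; try random medoid/non-medoid swaps and
    keep a swap only when S = new cost - old cost is negative."""
    random.seed(seed)
    medoids = random.sample(points, k)
    cost = total_cost(points, medoids)
    for _ in range(trials):
        candidate = random.choice([p for p in points if p not in medoids])
        i = random.randrange(k)
        trial = medoids[:i] + [candidate] + medoids[i + 1:]
        s = total_cost(points, trial) - cost   # S > 0 means the old clusters stay
        if s < 0:
            medoids, cost = trial, cost + s
    return medoids, cost

data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
        (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
print(k_medoids(data))   # chosen medoids and their total cost
```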

Reference
Han, Jiawei; Kamber, Micheline; Pei, Jian, “Data Mining: Concepts and Techniques”, Elsevier, ISBN: 9780123814791, 9780123814807.
https://onlinecourses.nptel.ac.in/noc24_cs22