Machine Learning, Chapter 3: Clustering (slides, EN), UC Coimbra, PT


About This Presentation

A course on Machine Learning, Chapter 3, Department of Informatics Engineering, University of Coimbra, Portugal, 2023


Slide Content

CHAPTER 3
CLUSTERING
1. Objectives of data clustering
2. Types of clustering techniques
3. Agglomerative techniques
4. Partition techniques
5. Clustering in Matlab
6. Bibliography

3.1. Objectives and characteristics of Clustering
Clustering objectives:
- division of a big data set into clusters (groups) of similar objects;
- the objects of a cluster are similar among themselves and dissimilar from the objects of the other clusters;
- the overall dataset can be represented by its clusters, in this way modelling the data;
- these clusters correspond to patterns hidden in the data;
- it is widely used in the exploratory analysis of data and to extract information from the data;
- it is an unsupervised learning technique of the structure embedded in the data.

[Figure: original data and the result of hierarchical single-link clustering (Jain, Murty & Flynn)]
Applications: image processing, diagnosis (classification of biological signals), epidemiologic studies, marketing, decision-making, etc.

Steps of a clustering study:
1. Feature extraction and/or selection: select and extract from the data the features that define their patterns.
2. Define a similarity measure appropriate for the problem (usually a distance).
3. Cluster the data with an appropriate algorithm.
4. Data abstraction: when necessary, the data are represented by their clusters.
5. Critical analysis of the results. Go back to step 1 if the results are not satisfactory.
(Jain, Murty & Flynn)

Notation:
A pattern x is a single data item used by the clustering algorithm. x is composed of d measurements:
x = {x_1, x_2, ..., x_d}
- each scalar component x_k of x is a feature (characteristic) or attribute;
- d is the dimensionality of the pattern in the data space.
A pattern may measure a physical object (an apple, a table, a car, ...), a physiological state characterized by measurements of several biosignals, or an abstract concept such as writing style, painting, etc.
The characteristics or attributes can be:
- quantitative: continuous (weight, height, temperature, ...), discrete (number of patients) or interval-valued (e.g., duration of an event);
- qualitative: nominal or unordered (e.g., color), ordered (e.g., position in a professional career).
In this course we will work with quantitative attributes.

Similarity measures (distances) between two points x_i and x_j in a multidimensional space:
Euclidean distance: d(x_i, x_j) = ( \sum_{k=1}^{d} (x_{ik} - x_{jk})^2 )^{1/2}
Squared Euclidean distance: d(x_i, x_j) = \sum_{k=1}^{d} (x_{ik} - x_{jk})^2
Manhattan / city-block distance: d(x_i, x_j) = \sum_{k=1}^{d} |x_{ik} - x_{jk}|
Chebyshev distance: d(x_i, x_j) = \max_{k=1,...,d} |x_{ik} - x_{jk}|
Minkowski distance: d(x_i, x_j) = ( \sum_{k=1}^{d} |x_{ik} - x_{jk}|^m )^{1/m}
There is no best measure in general; it depends on the dataset, the application and the objective. For an interesting comparison of measures see, for example,
https://towardsdatascience.com/log-book-guide-to-distance-measuring-approaches-for-k-means-clustering-f137807e8e21 (11 September 2023)
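These distances are straightforward to compute in Matlab. A minimal sketch for two illustrative points (the variable names are not from the slides; pdist2 belongs to the Statistics and Machine Learning Toolbox):

xi = [1 2 3];  xj = [4 0 3];               % two illustrative points in a d-dimensional space
m  = 3;                                    % order of the Minkowski distance
dEuclid    = sqrt(sum((xi - xj).^2));      % Euclidean distance
dSqEuclid  = sum((xi - xj).^2);            % squared Euclidean distance
dManhattan = sum(abs(xi - xj));            % Manhattan / city-block distance
dChebyshev = max(abs(xi - xj));            % Chebyshev distance
dMinkowski = sum(abs(xi - xj).^m)^(1/m);   % Minkowski distance of order m
% pdist2(xi, xj) returns the Euclidean distance directly.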

d: number of features (dimensionality); n: number of objects (patterns).
Matrix of patterns (data matrix), n by d:
X = [ x_{11} x_{12} ... x_{1d} ; x_{21} x_{22} ... x_{2d} ; ... ; x_{n1} x_{n2} ... x_{nd} ]
Similarity matrix, symmetric, n by n, where d_{ij} is the distance between x_i and x_j (only the upper triangle needs to be stored):
D = [ 0 d_{12} d_{13} ... d_{1n} ; 0 d_{23} ... d_{2n} ; 0 ... d_{3n} ; ... ; 0 ]
The shape of the clusters depends:
- on the distance used: for example, the Euclidean distance produces circular clusters (all points of the cluster can be put into a circle), while the Manhattan and Chebyshev distances produce shapes as in the figure;
- on the method used.
(https://towardsdatascience.com/log-book-guide-to-distance-measuring-approaches-for-k-means-clustering-f137807e8e21, 11 Sept 2023)
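A short Matlab sketch of building these two matrices (random data, illustrative only; pdist and squareform are in the Statistics and Machine Learning Toolbox):

n = 5;  d = 2;
X = rand(n, d);             % matrix of patterns: n objects, d features
D = squareform(pdist(X));   % symmetric n-by-n proximity matrix of pairwise distances
% pdist returns only the upper-triangular distances as a vector;
% squareform rearranges them into the full symmetric matrix.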

P is the m in the formula of the Minkowski distance:
d(x_i, x_j) = ( \sum_{k=1}^{d} |x_{ik} - x_{jk}|^m )^{1/m}
(https://iq.opengenus.org/minkowski-distance/, 13 Nov 2023)

3.2. Types of Clustering techniques
Agglomerative:
Initially each pattern defines a cluster, and the clusters are merged successively until some stopping criterion is reached.
Ex.: hierarchical clustering (bottom-up).
Divisive:
Initially all the patterns belong to a single cluster, and a divisive technique is applied successively until some stopping criterion is reached.
Ex.: hierarchical clustering (top-down).

3.2. Types of Clustering techniques (cont.)
Partitional:
- distance based: a unique partition of the data, into a certain number of clusters, is obtained. Problem: how many clusters? They result from the optimization of a local criterion (on a subset of points) or of a global criterion (on all points).
  Ex.: k-means (or c-means), fuzzy c-means.
- density based: the method itself finds the number of clusters by capturing the density of points in each region.
  Ex.: subtractive, based on a function of the radiating potential of a point; DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

3.3. Agglomerative techniques
Hierarchical clustering
[Figure: grouping 7 points into three clusters, with the corresponding dendrogram (Jain, Murty & Flynn)]
Dendrogram (from the Greek dendro, tree, and gramma, drawing): a tree representing the chained clustering of the patterns and the similarity levels at which the groups change.
The number of clusters depends on the desired level of similarity.

Algorithms for agglomerative hierarchical clustering
1. Initially make each object or pattern a cluster.
2. Compute the similarity (or proximity) matrix, containing the distances between each pair of clusters.
3. Find the most similar pair of clusters using the proximity matrix (those with the shortest distance outside the diagonal). Merge these two clusters into one.
4. Update the proximity matrix resulting from this merging: one row and one column of the pre-merged matrix are deleted, and the new distances between the merged cluster and each of the other clusters must be recomputed; the rest of the proximity matrix remains unchanged, and its dimension decreases by one.
5. Are all objects in a single cluster, i.e., is the proximity matrix reduced to a single element? If yes, stop. If no, go to 2.
How is the update of the proximity matrix performed? Or, equivalently, how do we measure the distance between two clusters? Each method gives an algorithm.

Advantages:
- Eases the interpretation and utilization of the results obtained.
- Allows great flexibility in the analysis of the clusters at the different levels of the dendrogram: we can stop at any number of clusters.
- The task of assigning a concept (label) to the clusters becomes semi-automatic.
[Figure: dendrogram with a similarity scale (Rai P. and Singh S., A Survey of Clustering Techniques)]

Two approaches: bottom-up (agglomerative) and top-down (divisive)
- different structures and operations;
- both use recursive processes;
- stopping criterion: the desired number k of clusters;
- the agglomerative method is more used than the divisive one: it is computationally lighter and produces better results.
(adapted from https://www.slideshare.net/salahecom/10-clusbasic, 11 Sept 2023)

Hierarchical Clustering: AGNES (AGglomerative NESting)
1. At the start, each existing pattern forms a cluster.
2. Using an appropriate metric (e.g., Euclidean), compute the distances between all pairs of clusters.
3. Find the pair of most similar clusters and merge these two into one single cluster.
4. Recalculate the distances between the new cluster and all the other existing clusters.
5. Repeat from step 3 until one single cluster is obtained, containing all of the patterns.
These steps follow the general algorithm given above and are illustrated in the next slide.

AGNES
[Figure: successive merging steps of AGNES, repeated until a single cluster remains (adapted from https://www.displayr.com/what-is-hierarchical-clustering/, 11 Sept 2023)]

Naming the points of the previous example A, B, C, D, E, F, we can build the dendrogram, a tree replicating the agglomerative operations of the previous slide.
The dendrogram has a hierarchical structure that follows the agglomerations done in the example. The level of the hierarchy is related to the level of similarity among clusters. At each level, we can see how many clusters we have.
(https://www.displayr.com/what-is-hierarchical-clustering/, 11 Sept 2023)
The horizontal connection of the points follows the smallest distance in the previous figure. The similarity level of the horizontal connection is inverse to the minimum distance in the last figure (a smaller distance means a higher similarity).

Another example showing the agglomerative clustering and the building up of the dendrogram.
[Figure: agglomerative clustering with the resulting dendrogram and dissimilarity scale (Rai P. and Singh S., A Survey of Clustering Techniques)]
Each node in the tree represents a cluster. Its height is proportional to the dissimilarity of its daughters (a daughter can be a single point or a cluster).
Cutting the tree at some height gives the number of clusters. For example, at 0.15 we have 4 clusters (same figure as in an earlier slide).

Hierarchical Clustering: AGNES, agglomerative
How to compute the distance between two clusters?
single-link:
- Considers the minimum of the distances between all pairs of points (each element of the pair from a different cluster).
- Produces geometrically elongated clusters: the chaining effect.
complete-link:
- Considers the maximum of the distances between all pairs of points (each element of the pair from a different cluster).
- Produces small and compact clusters with well defined frontiers.
average-link: a compromise between the single-link and the complete-link.
- Computes the mean of all the distances between all pairs of points (each element of the pair from a different cluster).
- Produces quite acceptable results.
centroid-link: considers the distance between the centroids of the clusters.
See also https://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf (11/09/2023)
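A minimal Matlab sketch of agglomerative clustering with these linkage criteria (Statistics and Machine Learning Toolbox; the data below are random and purely illustrative):

X = [randn(20,2); randn(20,2) + 4];   % two illustrative groups of points
Z = linkage(X, 'single');             % 'single', 'complete', 'average' or 'centroid' link
dendrogram(Z)                         % draw the tree of the chained agglomerations
T = cluster(Z, 'maxclust', 2);        % cut the tree so as to obtain 2 clusters
% cluster(Z,'Cutoff',C) cuts instead at a given (dis)similarity level.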

Hierarchical clustering: DIANA (DIvisive ANAlysis)
1. All of the points are agglomerated into one cluster.
2. Divide that cluster into two new clusters.
   - polythetic methods: all of the features of the data are used for the division;
   - monothetic methods: only one feature of the data is used for the division.
The process repeats: find the most heterogeneous cluster to be divided into two (higher number of samples, higher variance, higher mean squared distance, ...). This heterogeneous cluster is divided.
Step 2 is repeated while the number of clusters is not equal to the number of patterns in the sample.
To divide a cluster, use for example a cutting algorithm based on optimization ('cut-based optimization', with results similar to those of the k-means). See more in https://www.cs.princeton.edu/courses/archive/spr08/cos435/Class_notes/clustering4.pdf (11/09/2023)
(https://www.slideshare.net/salahecom/10-clusbasic, 11/09/2023)

3.4. Partitional techniques
- Do not produce a hierarchical structure of the data.
- Only one level of clusters is created.
- The result is the centers of the clusters and the final membership of each pattern to its cluster.
Advantage:
- They are very useful when one needs to work with very big datasets, for which the construction of a dendrogram is computationally complex.
Disadvantages:
- For some of the methods it is necessary to choose, a priori, the number of desired clusters.
- They do not produce the same result in each execution (because of different initializations).
- The final result depends on that choice and on the initialization.
Algorithms:
- distance based: k-means (or c-means), k-medoids;
- density based: subtractive, DBSCAN;
- and more ...

3.4.1. c-means (or k-means) clustering
It organizes the data into c clusters, c being fixed a priori.
Each cluster has at least one element.
Each element belongs to one single cluster.
Each element is assigned to the center that is closest to it.
The centers of the clusters are moved from iteration to iteration in order to decrease the distances to the elements of their cluster.
At the end, when the algorithm converges, the sum of the distances of the points of a cluster to its center is minimized and the distance between the centers is maximized.

Distance: Euclidean, d_{p_1 p_2} = \| p_2 - p_1 \|.
Criterion: minimize the sum of the squared distances between the elements and their centers, taken over all clusters.
The algorithm also maximizes the distances among the centers.

Mathematical formulation:
p_i = [p_{i1} p_{i2} ... p_{iR}] \in R^R : the characteristics of each data point
P = {p_1, p_2, ..., p_Q} : set of data
v_i = [v_{i1} v_{i2} ... v_{iR}] : center of cluster i
d_{ik} = d(p_k - v_i) = \| p_k - v_i \| = ( \sum_{j=1}^{R} (p_{kj} - v_{ij})^2 )^{1/2} : Euclidean distance between each data point and the center of the cluster
\chi_{ik} = 1 if p_k belongs to cluster i, \chi_{ik} = 0 if p_k does not belong to cluster i
Criterion to be minimized:
J(P, v) = \sum_{k=1}^{Q} \sum_{i=1}^{c} \chi_{ik} (d_{ik})^2

How to compute the coordinates of a center?
[Figure: a cluster of six points p_1, ..., p_6 with its center, and a cluster of two points with its center]
For the cluster with six points, each coordinate of its center is the average of the corresponding coordinates of the points:
v_1 = (p_{11} + p_{21} + p_{31} + p_{41} + p_{51} + p_{61}) / 6,  v_2 = (p_{12} + p_{22} + p_{32} + p_{42} + p_{52} + p_{62}) / 6
and for the cluster with two points:
v_1 = (p_{11} + p_{21}) / 2,  v_2 = (p_{12} + p_{22}) / 2
In general,
v_{ij} = ( \sum_{k=1}^{Q} \chi_{ik} p_{kj} ) / ( \sum_{k=1}^{Q} \chi_{ik} )
the average of the coordinates, one by one, of the points belonging to cluster i.

Let {A_i, i = 1, 2, ..., c} be a family of sets, the c-partition of P. The following properties apply:
\bigcup_{i=1}^{c} A_i = P
A_i \cap A_j = \emptyset, for all i \neq j
\emptyset \subset A_i \subset P, for all i
In terms of the membership indicators \chi_{ik} = \chi_{A_i}(p_k):
\sum_{i=1}^{c} \chi_{ik} = 1, for all k (each point p_k belongs to exactly one A_i)
\chi_{ik} \chi_{jk} = 0, for all i \neq j and all k (p_k cannot belong to A_i and A_j simultaneously)
0 < \sum_{k=1}^{Q} \chi_{ik} < Q, for all i (no A_i is empty and no A_i contains all the points)
2 \leq c < Q

Matrix of the c-partition of P: a c by Q matrix
U = [ \chi_{11} \chi_{12} ... \chi_{1Q} ; \chi_{21} \chi_{22} ... \chi_{2Q} ; ... ; \chi_{c1} \chi_{c2} ... \chi_{cQ} ]
i.e., any matrix U in the set M_c such that
M_c = { U | \chi_{ik} \in {0,1}; \sum_{i=1}^{c} \chi_{ik} = 1 (sum of the elements of one column of U); 0 < \sum_{k=1}^{Q} \chi_{ik} < Q (sum of the elements of one row of U) }

Any matrix with such a structure defines a partition (division into clusters) of P. The possible number of partitions is (Ross):
\eta(M_c) = (1/c!) \sum_{i=1}^{c} \binom{c}{i} (-1)^{c-i} i^Q
For Q = 25 and c = 10, \eta(M_c) \approx 10^{18}.
Iterative optimization! An NP-hard problem, even for two clusters!

The batch c-means algorithm
1st: Choose initial values for the c centers by a common-sense criterion (or randomly).
2nd: Assign each point, from 1 to Q, to the center that is closest to it (Euclidean distance).
3rd: Recalculate the coordinates of each center (averages of the coordinates of the points assigned to it).
4th: If the centers do not move, stop. End.
5th: Otherwise, go to the 2nd step.

The batch c-means algorithm
[Figure: evolution of the centers from a random initialization (iteration 0) through iterations 1 to 4, when the algorithm STOPs (adapted from https://www.slideshare.net/salahecom/10-clusbasic, 11/09/2023)]

The batch c-means algorithm: pseudo-code
Make initial (random) estimates of the centers v_1, v_2, ..., v_c
While there is some change in any center do
    Use the current centers to cluster the data (i.e., assign each point to the closest center according to a distance)
    For i from 1 to c
        Replace v_i by the average of all points in cluster i
    End_for
End_while
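A compact Matlab sketch of this pseudo-code (an illustrative from-scratch implementation, not the toolbox kmeans; initialization picks c of the points at random, and pdist2 from the Statistics and Machine Learning Toolbox is used for the point-to-center distances):

function [v, idx] = batch_cmeans(P, c)
% P: Q-by-R data matrix, c: number of clusters
Q = size(P, 1);
v = P(randperm(Q, c), :);                     % initial centers chosen among the points
moved = true;
while moved                                   % repeat while some center still moves
    [~, idx] = min(pdist2(P, v), [], 2);      % assign each point to its closest center
    vNew = v;
    for i = 1:c                               % recompute each center as the average
        if any(idx == i)                      % of the points assigned to it
            vNew(i, :) = mean(P(idx == i, :), 1);
        end
    end
    moved = any(abs(vNew(:) - v(:)) > 1e-10); % stop when the centers no longer move
    v = vNew;
end
end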

c-means sequential algorithm (in real time)
(https://www.cs.princeton.edu/courses/archive/fall08/cos436/Duda/C/sk_means.htm, 11/09/2023)
Update the centers whenever a new point appears.
1st: Choose initial values for the c centers by a common-sense criterion (or randomly).
2nd: Acquire the next point p and update the center v_i that is closest to it, taking into account the number of points that have already appeared in that region. For that, one needs to count them; if there are already n_i of them, then
v_i(k+1) = v_i(k) + (1/n_i) (p - v_i(k))
Each center needs a counter associated with it.
3rd: Continue until there are no more points to read.

c-means sequential algorithm (in real time): pseudo-code (Duda)
Make initial guesses for the centers v_1, v_2, ..., v_c
Initialize the counters n_1, n_2, ..., n_c to zero
Until the end do
    Acquire the next point p
    If v_i is the closest to p
        increment n_i
        replace v_i by v_i + (1/n_i) (p - v_i)
    End_if
End_until
The resulting centers v_i are the averages of all points p that, when acquired, were closest to v_i.
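A minimal Matlab sketch of this sequential update (the data stream is simulated here by the rows of a random matrix; all names are illustrative):

c = 2;  R = 2;
v = rand(c, R);                        % initial guesses for the centers
n = zeros(c, 1);                       % one counter per center
stream = rand(100, R);                 % simulated stream of incoming points
for k = 1:size(stream, 1)
    p = stream(k, :);                  % acquire the next point
    [~, i] = min(vecnorm(v - p, 2, 2));              % closest center to p
    n(i) = n(i) + 1;                   % increment its counter
    v(i, :) = v(i, :) + (1/n(i)) * (p - v(i, :));    % running-average update
end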

c-means sequential algorithm (in real time): pseudo-code (Duda), without counters associated with the centers
Make initial guesses for the centers v_1, v_2, ..., v_c
Until the end do
    Acquire the next point p
    If v_i is the closest to p
        replace v_i by v_i + \eta (p - v_i), with a fixed rate 0 < \eta < 1
    End_if
End_until
With this update the centers have the following evolution:
v_i(k) = (1 - \eta)^k v_i(0) + \eta \sum_{l=1}^{k} (1 - \eta)^{k-l} p(l)
i.e., the weight of the initial estimate v_i(0) decays as (1 - \eta)^k and recent points weigh more than older ones: a variable forgetting factor effect.

Advantages of the c-means:
- simple implementation;
- allows fixing the number of clusters a priori.
Disadvantages:
- requires some sensitivity to fix c; otherwise it is done by trial and error;
- the result depends on the initialization of the centers: a different initialization produces a different sequence of assigning points to a center and recalculating the new center, which may end up with different centers and different clusters, particularly in big datasets;
- it is sensitive to outliers: an outlier introduces a big distance that may pull a center too much towards it.

@ADC/DEI/FCTUC/MEI/MEB/2023/MachineLearning/Chapt. 3. Clusterin g109
https://www.researchgate.net/figure/Results-of-k-means-clusteri ng-algorithm-on-a-linearly- separable-input-data-and-b_fig1_323650119 , Mayank Baranwal, 11/Sept/2023 Limitations of standard c-means If the points are not linearly separable, it may not work as we want:
good !
badd!

3.4.2. c-medoids (or k-medoids)
An algorithm very similar to c-means.
The centers (medoids) are patterns (the most central objects in the cluster).
It minimizes the sum of the distances of all points (of the cluster) to the medoid.
Advantage:
- It is more robust than the c-means in the presence of noise and outliers.
Disadvantage:
- It does not work well for big datasets. As the dimension increases it becomes NP-hard.
(Kaufman, L. and Rousseeuw, P.J. (1987), Clustering by means of Medoids, in Statistical Data Analysis Based on the L1-Norm and Related Methods, edited by Y. Dodge, North-Holland, 405-416.)

k-medoids:
1. Define an initial set of medoids.
2. Iteratively replace one of the medoids by one of the non-medoids, if that improves the total distance of the resulting clusters. Hard to compute.
(https://www.slideshare.net/salahecom/10-clusbasic, 11/09/2023)

(https://www.mathworks.com/help/stats/kmedoids.html, 11/09/2023)
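For comparison, a hedged usage sketch of the two toolbox functions mentioned in these slides (Statistics and Machine Learning Toolbox; the dataset is random and illustrative):

X = [randn(50,2); randn(50,2) + 5];     % illustrative dataset with two groups
[idxKmeans,  Ckm ] = kmeans(X, 2);      % c-means: centroids are averages of the points
[idxKmedoids, Cmd] = kmedoids(X, 2);    % c-medoids: centers are actual data points
% kmedoids is typically more robust to outliers, at a higher computational cost.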

3.4.3. The fuzzy c-means clustering
X: universe of points to group (classify)
{A~_i, i = 1, 2, ..., c}: a fuzzy c-partition of X
Each element of X belongs to each of the partitions A~_i with some membership value.
In the limit, one point may belong to all partitions (with the sum of its membership values equal to 1).

Let \mu_{ik} = \mu_{A~_i}(x_k) \in [0,1], with the constraint
\sum_{i=1}^{c} \mu_{ik} = 1, k = 1, 2, ..., n
One class cannot be empty, and one class cannot have all the points with membership 1, so
0 < \sum_{k=1}^{n} \mu_{ik} < n, i = 1, 2, ..., c
It can happen that \mu_{ik} \mu_{jk} > 0, since point k may belong to both classes i and j.

We have
\sum_{i=1}^{c} \mu_{A~_i}(x_k) = 1, for all x_k (the sum of the memberships of each x_k is 1)
0 < \sum_{k=1}^{n} \mu_{A~_i}(x_k) < n, for all i (no class is empty and no class has all the points with membership 1)

Fuzzy c-means clustering algorithm
n points, c clusters with centers v_1, v_2, ..., v_c.
Objective function:
J_m(U, v) = \sum_{k=1}^{n} \sum_{i=1}^{c} (\mu_{ik})^{m'} (d_{ik})^2
d_{ik} = d(x_k - v_i) = \| x_k - v_i \| = ( \sum_{j=1}^{m} (x_{kj} - v_{ij})^2 )^{1/2}
\mu_{ik}: value of the membership of point k in class i
m' \in [1, \infty): weighting coefficient that measures the fuzziness degree of the classification (m': fuzziness degree)
(Crisp case: J(U, v) = \sum_{k=1}^{n} \sum_{i=1}^{c} \chi_{ik} (d_{ik})^2.)

The m coordinates of each center v_i are
v_{ij} = ( \sum_{k=1}^{n} \mu_{ik}^{m'} x_{kj} ) / ( \sum_{k=1}^{n} \mu_{ik}^{m'} ), j = 1, 2, ..., m
The optimal fuzzy c-partition minimizes J_m. A global minimum cannot be guaranteed, only the best solution under a pre-specified accuracy.
(Crisp case (Chap. 6): v_{ij} = ( \sum_{k=1}^{n} \chi_{ik} x_{kj} ) / ( \sum_{k=1}^{n} \chi_{ik} ).)

For two classes A~_i and A~_j the classes may overlap (A~_i \cap A~_j need not be empty), while the union of all classes still covers X.
A family of fuzzy partition matrices M_fc can be defined to classify n points into c classes:
M_fc = { U~ | \mu_{ik} \in [0,1]; \sum_{i=1}^{c} \mu_{ik} = 1; 0 < \sum_{k=1}^{n} \mu_{ik} < n }, i = 1, 2, ..., c, k = 1, 2, ..., n
Any U~ \in M_fc is a fuzzy partition. M_fc has infinite cardinality.

Iterative procedure (Ross, p. 352)
1st: Fix c (2 \leq c < n). Choose a value for m'. Initialize the partition matrix U~^{(0)} (e.g., randomly).
For r = 1, 2, ... do:
2nd: Compute the centers { v_i^{(r)}, i = 1, 2, ..., c }.

3rd: Update the partition matrix at iteration r, U~^{(r)}:
\mu_{ik}^{(r+1)} = [ \sum_{j=1}^{c} ( d_{ik}^{(r)} / d_{jk}^{(r)} )^{2/(m'-1)} ]^{-1}, if I_k = \emptyset
where I_k = { i | 2 \leq c < n; d_{ik}^{(r)} = 0 } is the set of classes whose centers are at zero distance from point x_k.
If I_k \neq \emptyset (the point coincides with one or more centers):
\mu_{ik}^{(r+1)} = 0 for all classes i \notin I_k (centers whose distance to the point is non-null), and the memberships of the classes i \in I_k are chosen so that \sum_{i \in I_k} \mu_{ik}^{(r+1)} = 1 (the point will be a center, so membership 1).

4th: If \| U~^{(r+1)} - U~^{(r)} \| \leq \varepsilon_L, stop. If not, set r = r + 1 and go to the 2nd step.
J_m: criterion of minimum squared distances, with the squared distances weighted by the membership values (\mu_{ik})^{m'}.
J_m minimizes the squared distances of the points to their centers and maximizes the distances between the centers of the clusters.
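A compact Matlab sketch of this iterative procedure (m' is written as m; the partition matrix is initialized randomly; pdist2 is from the Statistics and Machine Learning Toolbox; this is an illustrative from-scratch implementation, not the toolbox fcm):

function [v, U] = fcm_sketch(X, c, m, maxIter, tol)
% X: n-by-R data, c: number of clusters, m > 1: fuzziness degree m'
n = size(X, 1);
U = rand(c, n);  U = U ./ sum(U, 1);          % random fuzzy partition, columns sum to 1
for r = 1:maxIter
    Um = U.^m;
    v = (Um * X) ./ sum(Um, 2);               % centers: weighted averages of the points
    d = max(pdist2(v, X), 1e-12);             % c-by-n distances d_ik (avoid division by 0)
    Unew = 1 ./ ( d.^(2/(m-1)) .* sum(d.^(-2/(m-1)), 1) );   % membership update
    delta = max(abs(Unew(:) - U(:)));
    U = Unew;
    if delta < tol, break, end                % stop when the partition matrix settles
end
end

Usage, e.g.: [v, U] = fcm_sketch(X, 2, 2, 100, 1e-5).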

Which is a good value for m'?
As m' \to 1 (m' - 1 \to 0, so the exponent 2/(m'-1) \to \infty), the update gives
\mu_{ik}^{(r+1)} \to 1 if d_{ik} < d_{jk} for all j \neq i, and \mu_{ik}^{(r+1)} \to 0 if d_{ik} > d_{jk} for some j (crisp case).
As m' \to \infty, \mu_{ik} \to 1/c and J_m(U, v) \to 0 (completely fuzzy).

@ADC/DEI/FCTUC/MEI/MEB/2023/MachineLearning/Chapt. 3. Clustering123
The greater
m’
, the stronger is the fuzziness of the membership
to the clusters;

controls this fuzzy character of membership to
the clusters.
m’
increases,
J
m
decreases (keeping constant all the other
parameters), slower is the convergence.
There is no theoretical optimum for
m’
;
generallyit is chosen
between 1.25 and 2 (Ross).In Matlabdefault is 2.

Measures for the fuzzy classification
- What is the uncertainty (fuzziness) degree of a classification? 'How fuzzy is a fuzzy c-partition?' (Ross)
- What is the level of superposition (overlap) of the defined classes?
Let x_k be a classified element of the universe:
\mu_i(x_k): value of the membership of x_k in class i
\mu_j(x_k): value of the membership of x_k in class j
The algebraic product \mu_i \mu_j depends directly on the relative superposition between non-empty clusters. This is a good measure of the uncertainty of the classification.

Coefficient of the fuzzy partition
F_C(U~) = tr( U~ U~^T ) / n, with U~ = [\mu_{ik}] the matrix of the fuzzy partition
It interprets the results of the fuzzy partition. Properties:
F_C(U~) = 1 if the partition is crisp
F_C(U~) = 1/c if \mu_{ik} = 1/c for i = 1, 2, ..., c (total ambiguity)
1/c \leq F_C(U~) \leq 1 in any case

The elements of the diagonal of U~ U~^T are proportional to the amount of non-shared membership in the fuzzy clusters.
The elements outside the diagonal of U~ U~^T represent the amount of partition shared between pairs of fuzzy clusters. If they are null, the partition is crisp.
As the fuzzy partition coefficient F_C(U~) approaches 1, the fuzzy uncertainty in the overlapping of the clusters is minimized. The greater F_C(U~), the more successful the partition of the dataset into clusters.
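Given a c-by-n fuzzy partition matrix U, the coefficient can be computed directly (a one-line Matlab sketch, with U and n assumed already defined):

Fc = trace(U * U') / n;   % fuzzy partition coefficient F_C(U)
% Fc = 1 for a crisp partition; Fc = 1/c corresponds to total ambiguity.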

Remarks:
1. If x_k belongs with memberships \mu_i(k) and \mu_j(k) to the classes i and j, then min(\mu_i(k), \mu_j(k)) gives the quantity of membership claimed both by i and by j, and so not shared. It is a good measure of self-superposition.
2. The elements of the diagonal of U~ U~^T are the sums of the squares of the elements of each row of U~.

Defuzzification of a fuzzy partition
Method of maximum membership: the highest element of each column of U~ is set to 1, and all the others to zero:
\mu_{ik} = 1 if \mu_{ik} = \max_{j \leq c} \mu_{jk}, and \mu_{jk} = 0 for j \neq i, k = 1, 2, ..., n
Classification by the nearest center: each point is assigned to the cluster whose center is closest to it:
\mu_{ik} = 1 if d_{ik} = \min_{j \leq c} d_{jk}, with d_{jk} = \| x_k - v_j \|, and \mu_{jk} = 0 for j \neq i

In the Fuzzy Logic Toolbox: fcm
GUI for clustering: findcluster in the command line.
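A hedged usage sketch of fcm on illustrative data (the exponent defaults to 2, as noted above; the defuzzification lines follow the two rules of the previous slide):

X = [randn(50,2); randn(50,2) + 4];             % illustrative data
[centers, U] = fcm(X, 2);                       % 2 fuzzy clusters, default fuzziness m' = 2
[~, idx1] = max(U, [], 1);                      % defuzzify by maximum membership
[~, idx2] = min(pdist2(X, centers), [], 2);     % or by the nearest center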

3.4.4. Subtractive clustering

Methods based on a potential function
A set of points v_i is defined as possible centers (by a grid or by some heuristic). Each center is then considered as radiating energy, and one measures the energy received by each point in its neighborhood. The energy received at a point (e.g., p_1) is a measure of the potential P of the center v_i at that point p_1; this energy decreases exponentially with the distance:
P(v_i, p_j) = e^{ -\alpha \| v_i - p_j \|^2 }
The constant \alpha fixes the speed of decay of the potential, i.e., it fixes the area of influence of each center.
The potential of v_i is the sum of its potentials at all the points.

The subtractive clustering method
Consider Q points in the R-dimensional space, P = { p_1, p_2, ..., p_Q }. Normalize the data to [0,1] or [-1,1].
1st: Each sample p_i defines a possible center.
2nd: The potential P_i associated with p_i is the sum of the energies (radiated by p_i) received at each point, from 1 to Q:
P_i^{(0)} = \sum_{j=1}^{Q} e^{ -\alpha \| p_i - p_j \|^2 }, i = 1, ..., Q, with \alpha = 4 / r_a^2, r_a > 0, p \in P
r_a: radius of the neighborhood of each point.
(S. L. Chiu, 'Fuzzy model identification based on cluster estimation', Journal of Intelligent and Fuzzy Systems, Vol. 2, No. 3, 1994.)

3rd: Choose the first center v_1 as the point with the highest potential, P_1^*.
4th: Reduce the potential of all the points:
P_i^{(1)} = P_i^{(0)} - P_1^* e^{ -\beta \| p_i - v_1 \|^2 }, with \beta = 4 / r_b^2
r_b: radius of the neighborhood with a significant reduction of potential. A point next to v_1 has its potential strongly reduced, preventing the concentration of the centers (p_i is 'emptied' and will never be 'filled' again; the same for its neighbors).
5th: Choose the second center v_2 as the point that remains with the highest potential, P_2^*.

6th: Reduce again the potential of all points:
P_i^{(2)} = P_i^{(1)} - P_2^* e^{ -\beta \| p_i - v_2 \|^2 }
7th: Choose the 3rd center v_3 as the point remaining with the highest potential, P_3^*.
8th: Reduce again the potential of all the points:
P_i^{(3)} = P_i^{(2)} - P_3^* e^{ -\beta \| p_i - v_3 \|^2 }
And so on, until the potential of all points is residual, i.e., under a fixed threshold.
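A short usage sketch of the Fuzzy Logic Toolbox implementation of this method (the cluster influence range plays the role of r_a; the data are illustrative and already normalized, as required above):

X = rand(100, 2);               % illustrative data, already in [0,1]
centers = subclust(X, 0.5);     % cluster influence range 0.5
% The number of rows of centers is the number of clusters found by the method.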

Advantages:
It overcomes the curse-of-dimensionality problem: the obtained number of clusters reflects the variation of the density of the points in the data space.
The number of obtained clusters depends on r_a and r_b, the two degrees of freedom of the method.
It is also widely used in fuzzy and neuro-fuzzy systems.

Fuzzy c-means and subtractive clustering GUI:
>> findcluster (Clustering GUI), Matlab Fuzzy Logic Toolbox, 2023b, pp. 4-32

3.4.5. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- It is a density-oriented approach, proposed in 1996 by [Ester, Kriegel, Sander and Xu]. It is probably one of the most used clustering methods.
- If in a region there exists a high density of points, then there exists a cluster.
The DBSCAN algorithm uses two parameters:
minPts: the minimum number of points huddled together for a region to be considered dense. It is a threshold.
eps (ε): a distance measure that will be used to find the number of points in the ε-neighborhood of any point and compare it with minPts.

It is based on two concepts: density reachability and connectivity.
- Density reachability: a point p is reachable from another point q if it lies within a particular distance (ε) of it and minPts of q is greater than the threshold. It is not (always) a symmetric relation.
Note that in this figure, from the original work, counting gives minPts(p) = 2 and minPts(q) = 6. Let the threshold be fixed at 5.
minPts of p does not respect the threshold, so q is not directly density-reachable from p, while, on the contrary, p is directly density-reachable from q because minPts of q respects the threshold. This means that q is in a dense region, while p is not in a dense region. The outliers are never in a dense region, so they will not be included in any cluster by the DBSCAN method.
(Ester, Kriegel, Sander and Xu)

- Connectivity involves a transitivity-based chaining approach to determine whether points are located in a particular cluster. For example, the points p and q are connected if there is a chain p, r, s, t, q in which each point is in the ε-neighborhood of the previous one.
(Ester, Kriegel, Sander and Xu)
In (a), from q to p, we pass by r. So p is density-reachable from q, because minPts of both q and r are greater than the threshold. But q is not density-reachable from p because minPts(p) is lower than the threshold.
In (b), from p to q we pass by successive points such that each one is in the neighborhood of the next, until we reach q. (3/11/2023)
See also a nice presentation at https://pt.slideshare.net/ssakpi/density-based-clustering.

The algorithm works as follows (https://iq.opengenus.org/dbscan-clustering-algorithm/, with an animated demonstration, 11/Sept/2023):
1. Pick an arbitrary data point p as your first point.
2. Mark p as visited.
3. Find all points present in its neighborhood (up to ε distance from the point), and call it the set np.
4. If np >= minPts, then
   a. Consider p as the first point of a new cluster and mark it as visited.
   b. Find all points within ε distance (members of np) as other points of this cluster and mark them as visited.
   c. Repeat step b for all points in np.
5. Else, if np < minPts, label p as noise and mark it as visited.
6. Repeat steps 1-5 until the entire dataset has been visited (labeled), i.e., the clustering is complete.
After the algorithm is executed, we should ideally have a dataset separated into a number of clusters, and eventually some points labeled as noise which do not belong to any cluster. It is this particularity that gives the name to the method.
The method is particularly suitable for non-linearly separated clusters.

Advantages and disadvantages of DBSCAN
Advantages:
- it does not require a pre-set number of clusters;
- it also identifies outliers as noise, unlike other methods;
- it is able to find arbitrarily sized and arbitrarily shaped clusters quite well.
Disadvantages:
- it does not perform as well as others when the clusters are of varying density. This is because the setting of the distance threshold ε and of minPts for identifying the neighborhood points will vary from cluster to cluster when the density varies. This drawback also occurs with very high-dimensional data, since the distance threshold ε again becomes challenging to estimate in each dimension (it may be different among the dimensions).
In MATLAB:
idx = dbscan(X,epsilon,minpts)
idx = dbscan(X,epsilon,minpts,Name,Value)
idx = dbscan(D,epsilon,minpts,'Distance','precomputed')
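A short usage sketch of the Matlab call above (random, well-separated blobs used only for illustration; epsilon and minpts would need tuning on real data):

X = [randn(100,2); randn(100,2) + 6];   % illustrative data
idx = dbscan(X, 1.0, 5);                % epsilon = 1.0, minPts = 5
% idx == -1 marks the points labeled as noise (outliers);
% the remaining values are the cluster labels found by the method.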

Comparison of standard c-means with DBSCAN for several types of datasets.
[Figure: c-means vs. DBSCAN results]
See a visualization at https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 (11/09/2023)

Comparison of the most used methods of this chapter

Method        | Type                        | Needs a-priori number of clusters | Computational complexity (n = nº of points) | Parameters to control the number of clusters | Shape of clusters
AGNES         | agglomerative, hierarchical | no  | O(n^3) | height of the dendrogram | any
c-means       | partitional, distance       | yes | O(n)   | trial and error          | hyperspherical
c-medoids     | partitional, distance       | yes | O(n^2) | trial and error          | hyperspherical
fuzzy c-means | partitional, distance       | yes | O(n)   | trial and error          | hyperspherical
subtractive   | partitional, density        | no  | O(n)   | 2 parameters             | any
DBSCAN        | partitional, density        | no  | O(n^2) | 2 parameters             | any

Other clustering methods
There are several other methods. One interesting method, which surpasses some drawbacks of k-means, is Expectation-Maximization (EM) clustering using Gaussian Mixture Models (GMM); it can be seen at https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 (11/09/2023), and can also be implemented with the Matlab Statistics and Machine Learning Toolbox 2023b, pp. 17-39, https://www.mathworks.com/help/stats/clustering-using-gaussian-mixture-models.html.
Mean-Shift Clustering is a sliding-window-based algorithm that attempts to find dense areas of data points. In contrast to k-means clustering, there is no need to select the number of clusters, as mean-shift automatically discovers it. See also https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 (11/09/2023).
Spectral clustering is graph-based. It very often outperforms traditional algorithms such as the k-means algorithm. See https://towardsdatascience.com/spectral-clustering-for-beginners-d08b7d25b4d8 (11/09/2023).
Another method frequently used in classification is the k-nearest neighbors, see https://www.mathworks.com/help/stats/nearest-neighbors-1.html?s_tid=CRUX_lftnav (11/09/2023).
However, the six methods included in this chapter give a broad idea of the clustering problem, as an unsupervised machine learning technique, of its particularities and of its difficulties. The figure in the next slide is very illustrative of the challenges of a good clustering.

Comparison of 9 clustering algorithms for different types of data
(from https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68, 11/09/2023)

3.5. Clustering in Matlab
Statistics and Machine Learning Toolbox, Chapter 17:
Hierarchical clustering: T = cluster(Z,'Cutoff',C), pp. 17-6.
K-means: IDX = kmeans(X, K), pp. 17-33: partitions the points in the N-by-P data matrix X into K clusters; by default it uses the squared Euclidean distance. Distances: pdist.
K-medoids: IDX = kmedoids(X, K): partitions the points in the N-by-P data matrix X into K clusters.
Spectral clustering: spectralcluster function, pp. 17-26.
DBSCAN: IDX = dbscan(X, EPSILON, MINPTS): partitions the points in the N-by-P data matrix X into clusters based on the parameters EPSILON and MINPTS, pp. 17-19.
Fuzzy Logic Toolbox:
Fuzzy c-means: [centers,U] = fcm(data,Nc,options): the membership function matrix U contains the grade of membership of each DATA point in each cluster, pp. 4-13.
Subtractive: centers = subclust(data,clusterInfluenceRange): clusters the input data using subtractive clustering with the specified cluster influence range, and returns the computed cluster centers, pp. 4-5.

3.6. Bibliography
Data Clustering: A Review, Jain, A.K., M.N. Murty, and P.J. Flynn, ACM Computing Surveys, Vol. 31, No. 3, September 1999.
Survey of Clustering Data Mining Techniques, Pavel Berkhin, Accrue Software, https://www.cc.gatech.edu/~isbell/reading/papers/berkhin02survey.pdf (11 Sept 2023).
Matlab Statistics and Machine Learning Toolbox User's Guide, 2020b, Chapter 16, Mathworks Inc.
Matlab Fuzzy Logic Toolbox, 2020b, Chapter 4, Mathworks Inc.
Fuzzy Logic With Engineering Applications, Timothy Ross, 4th Ed., Wiley, 2016.
Handbook of Cluster Analysis, Christian Hennig, Marina Meila, Fionn Murtagh, Roberto Rocci, Chapman and Hall/CRC, 2015.
Comparing the performance of biomedical clustering methods, Christian Wiwie, Jan Baumbach, Richard Röttger, Nature Methods, 12, 1021-1031 (2015), DOI: 10.1038/nmeth.3623.
Clustering Algorithms in Biomedical Research: A Review, Rui Xu, Donald C. Wunsch II, IEEE Reviews in Biomedical Engineering, Vol. 3, pp. 120-154, Oct 2010, https://ieeexplore.ieee.org/document/5594620/ (11 Sept 2023).

3.6. Bibliography (cont.)
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, KDD-96 Proceedings, AAAI, 1996, https://aaai.org/papers/037-a-density-based-algorithm-for-discovering-clusters-in-large-spatial-databases-with-noise/ (11 Sept 2023).
https://www.slideshare.net/salahecom/10-clusbasic: slides for Chapter 10 of the book Data Mining: Concepts and Techniques (3rd Ed.), Jiawei Han, Micheline Kamber, and Jian Pei, University of Illinois Urbana-Champaign & Simon Fraser University, Morgan Kaufmann, 2013.
A Survey of Clustering Techniques, Rai P. and Singh S., International Journal of Computer Applications, Vol. 7, No. 12, October 2010, doi: 10.5120/1326-1808.
Fuzzy model identification based on cluster estimation, S. L. Chiu, Journal of Intelligent and Fuzzy Systems, Vol. 2, No. 3, 1994 (subtractive clustering), DOI: 10.3233/IFS-1994-2306.
https://sites.google.com/site/dataclusteringalgorithms/hierarchical-clustering-algorithm (11 Sept 2023).
A Tutorial on Clustering Algorithms, with interactive demonstrations, Matteo Matteucci, https://matteucci.faculty.polimi.it/Clustering/tutorial_html/index.html (11 Sept 2023).