Unsupervised Learning Clustering - Mathematics

About This Presentation

Mathematics of Learning
Statistics
Probability
Clustering
Data Science
Unsupervised Learning


Slide Content

Unsupervised Learning: Clustering
Lecture "Mathematics of Learning" 2023/24
Jan Rolfes
Friedrich-Alexander-Universität Erlangen-Nürnberg

Rough Differentiation in Learning Methods
Supervised learning:
• predict values of an outcome measure based on a number of input measures
  (e.g., given some patient data together with the label 'has illness' / 'does not have
  illness', and new patient data comes in, predict whether the patient is ill or not)
Unsupervised learning:
• no outcome measure given. Goal: find structures among the data.
...also something in between: semi-supervised learning.

Unsupervised Learning: Clustering of Data
Is there any structure in this data?

There are two categories/clusters such that objects within a cluster resemble
each other, but objects from different clusters look different.

What's clustering?
Given fruit measurement data (height_i, width_i), i = 1, ..., N.
• Visually: There are multiple "categories" of data.
• How can we sort data into categories/clusters? → clustering problem
• More measurements help.
• Clustering the raw data by hand is cumbersome.

Clustering Algorithm
Given
• N: number of data points
• M: number of variables (e.g., "mass", "price", "color", ...)
• Data X = {x_1, ..., x_N}, where x_n ∈ R^M for all n = 1, ..., N
• K: assumed number of clusters
Want
• Assignment: x_n ↦ k_n ∈ {1, ..., K} for all n = 1, ..., N
• Assignment rule: x ↦ k(x) ∈ {1, ..., K} for all x ∈ R^M
• Reconstruction rule ('representative'): k ↦ m_k ∈ R^M
On an abstract level:
• Determining the best possible clustering (w.r.t. some objective) is a classical
  combinatorial optimization problem.
• K-means clustering: Determine K points, i.e., centers, that minimize the sum of the
  squared Euclidean distances of the data points to their closest center.

Clustering Problems and Algorithms
• No general definition of clustering here; see Theoretical Computer Science
  for further details.
• However: Already in simplified/restricted situations, clustering is difficult,
  i.e., NP-hard.
• In particular: We cannot expect an algorithm that determines the best clustering
  in time polynomial in the input size.
• Further reading: M. Mahajan, P. Nimbhorkar, K. Varadarajan: The planar
  k-means problem is NP-hard. Proceedings of WALCOM: Algorithms and
  Computation, pp. 274-285 (2009).
• Clustering is a very basic learning task. Depending on the application,
  different clustering algorithms work best.
• Here: Focus on the (standard) algorithms K-means and Expectation
  Maximization.
• Plus: Hierarchical clustering and principal component analysis for clustering
  and data reduction.

K-means clustering as optimization problem
Find a partition C = {C_1, ..., C_K} of the data X into sets C_k ⊂ X and
means m = {m_1, ..., m_K}, with m_k associated to C_k, which minimize the energy
E(C, m) := \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \| x - m_k \|^2 .
Observations
• The clustering energy has local minima.
(picture from: https://upload.wikimedia.org/wikipedia/commons/7/7c/K-means_convergence_to_a_local_minimum.png, modified)
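To illustrate the observation about local minima, here is a small experiment added for this write-up (not part of the original slides) using scikit-learn; the synthetic data and all parameter choices are invented for the example. Running Lloyd's algorithm once per random initialization typically ends at different energy values:

import numpy as np
from sklearn.cluster import KMeans

# three well-separated blobs; the positions are arbitrary illustration data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (40, 2)) for c in ([0, 0], [3, 0], [0, 3])])

# one run of Lloyd's algorithm per random initialization (n_init=1)
energies = [KMeans(n_clusters=3, init="random", n_init=1, random_state=s).fit(X).inertia_
            for s in range(10)]
print(sorted(set(np.round(energies, 2))))  # often more than one value, i.e., local minima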

Derivation of the K-means algorithm
Let us fix the clustering C in
E(C, m) := \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \| x - m_k \|^2 .
Necessary first-order optimality condition: the gradient with respect to m_k is zero,
i.e., m_k is a critical point. Taking the gradient with respect to m_k, we obtain the condition
0 = \nabla_{m_k} E(C, m) = \sum_{x \in C_k} (m_k - x) = |C_k| \, m_k - \sum_{x \in C_k} x
and hence
m_k = \frac{1}{|C_k|} \sum_{x \in C_k} x  ≙  mean of the cluster.
Problem: Neither the clusters nor the means are known in advance, so we heuristically
start searching for good means.

Derivation of the K-means algorithm
Conversely, let us fix the means m in
E(C, m) := \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \| x - m_k \|^2
and perform the simple assignment step
C_k = \{ x \in X : \| x - m_k \| \le \| x - m_j \| \text{ for all } j = 1, ..., K \}
≙ Voronoi cell of m_k.

K-means clustering algorithm
Data: X = {x_1, ..., x_N} and number of clusters K ∈ N
Result: cluster means m = (m_1, ..., m_K)
initialize m randomly;
repeat
    // assignment step
    for n ← 1 to N do   // assign the n-th point to the cluster with the nearest mean
        k_n ← argmin_k ‖x_n − m_k‖
    end
    // update step
    for k ← 1 to K do
        C_k ← {n ∈ {1, ..., N} : k_n = k}   // indices of the points assigned to cluster k
        if |C_k| > 0 then
            m_k ← (1/|C_k|) Σ_{n ∈ C_k} x_n   // new mean of cluster k
        end
    end
until the assignment step does not change anything;
• Assignment rule: x ↦ argmin_k ‖x − m_k‖.
• Reconstruction rule: k ↦ m_k.
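A minimal NumPy sketch of this pseudocode, added here as an illustration; the function name kmeans and the toy data are invented for the example:

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    m = X[rng.choice(N, size=K, replace=False)].copy()  # initialize means with K data points
    assign = None
    for _ in range(max_iter):
        # assignment step: index of the nearest mean for every point
        dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)  # shape (N, K)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignment step did not change anything
        assign = new_assign
        # update step: recompute the mean of every non-empty cluster
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:
                m[k] = members.mean(axis=0)
    return m, assign

# usage: two blobs in the plane
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (50, 2)),
               np.random.default_rng(2).normal(3, 0.3, (50, 2))])
means, labels = kmeans(X, K=2)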

Back to our fruits dataset

Different metrics lead to different (convex) clusters
Using the squared Euclidean distance (L2-norm):
E(C, m) = \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \| x - m_k \|^2
Using the Manhattan distance (L1-norm):
E(C, m) = \sum_{k=1}^{K} \sum_{x \in C_k} \sum_{i=1}^{M} | x^i - m_k^i |
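As a small sketch (not from the slides), both objectives can be evaluated for the same clustering and the same representatives; the data and names are illustrative only:

import numpy as np

def energy_l2(X, labels, m):
    """E(C, m) with squared Euclidean distance (including the 1/2 factor)."""
    return 0.5 * sum(np.sum((X[labels == k] - m[k]) ** 2) for k in range(len(m)))

def energy_l1(X, labels, m):
    """E(C, m) with Manhattan distance: sum of coordinate-wise absolute deviations."""
    return sum(np.sum(np.abs(X[labels == k] - m[k])) for k in range(len(m)))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
labels = np.array([0, 0, 0, 1, 1, 1])
m = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(energy_l2(X, labels, m), energy_l1(X, labels, m))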

Advantages and Disadvantages of K-Means
A very well-known clustering algorithm that is used often.
Advantages:
• easy to implement
• can run with only a set of real data vectors and a value of K
• A feasible clustering is always available.
Disadvantages:
• Choosing a good value of K can be difficult. Potential way out: test several values.
• Assumes numerical data, not categorical data such as 'car', 'truck', etc. We
  will see a clustering approach for categorical data later.
• K-means aims at minimizing Euclidean distances. This is not always the right objective.
• The result strongly depends on the initialization, but some improvements are known.
• Assumes that clusters are convex.

Advantages and Disadvantages of K-Means
Disadvantages:
• K-means sometimes does not work well, in particular for non-spherical/nonconvex
  data or for unevenly sized clusters, i.e., it has some implicit assumptions.
(figures from varianceexplained.org)
Next: some improvements.

Expectation-Maximization Clustering Algorithm (EM)
• Further reading: The Elements of Statistical Learning, Chapter 14
• Recall: K-means has implicit assumptions (clusters are convex & roughly
  equally sized) that may not be satisfied.
• Alternative way of thinking: Decide for each data point the probability with
  which it belongs to a certain cluster, i.e., a 'soft' clustering. This allows clusters of
  different sizes and can detect correlations.
• Problem: This probability distribution is unknown.
• Task: Estimate the probability distribution and improve the estimate iteratively.

Mixture of Gaussian Distributions
• Make a quite general assumption: the unknown distribution is a mixture, i.e., a
  superposition, of K (multi-dimensional) Gaussian distributions. This means that
  x ~ P(· | p, µ, Σ) with a probability density of the form
P(x | p, \mu, \Sigma) = \sum_{k=1}^{K} p_k \, \mathcal{N}(x | \mu_k, \Sigma_k),
• where p, µ, Σ are unknown parameters with
  p = (p_1, ..., p_K),  p_k ∈ R        probability vector
  µ = (µ_1, ..., µ_K),  µ_k ∈ R^M      vector of means
  Σ = (Σ_1, ..., Σ_K),  Σ_k ∈ R^{M×M}  covariance matrices,
  and M is the dimension of a data point x, i.e., x ∈ R^M.
Recall the Gaussian distribution in dimension M with mean vector µ ∈ R^M and
covariance Σ ∈ R^{M×M}:
\mathcal{N}(x | \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^M \det(\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)
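A short sketch of evaluating such a mixture density with SciPy, added as an illustration; the K = 2 parameters below are invented:

import numpy as np
from scipy.stats import multivariate_normal

# a hypothetical K = 2 mixture in M = 2 dimensions
p   = np.array([0.3, 0.7])                        # mixture weights, sum to 1
mu  = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sig = [np.eye(2), np.array([[1.0, 0.6], [0.6, 1.0]])]

def mixture_density(x):
    """P(x | p, mu, Sigma) = sum_k p_k * N(x | mu_k, Sigma_k)."""
    return sum(p[k] * multivariate_normal(mean=mu[k], cov=Sig[k]).pdf(x)
               for k in range(len(p)))

print(mixture_density(np.array([1.0, 0.5])))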

Recall: Covariance Matrix
Let x = (x_1, ..., x_M)^T be a random vector with finite variance and mean. The
covariance matrix Σ = (Σ_{x_i, x_j}) ∈ R^{M×M} is defined by
\Sigma_{x_i, x_j} = E\big( (x_i - E(x_i)) (x_j - E(x_j)) \big),
where E denotes the expected value, µ_X = E(X).
The covariance matrix
• represents important statistical information, in particular correlations between the data,
• is a real, square, symmetric matrix,
• is positive semidefinite.
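A small NumPy illustration (not part of the slides) of estimating a covariance matrix from sample data and checking these properties; the data is invented:

import numpy as np

# sample covariance of N observations of an M-dimensional random vector
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))             # rows are observations, columns are variables
X[:, 2] = 0.8 * X[:, 0] + 0.2 * X[:, 2]   # introduce correlation between variables 0 and 2

Sigma = np.cov(X, rowvar=False)           # M x M estimate
print(np.allclose(Sigma, Sigma.T))                    # True: symmetric
print(np.all(np.linalg.eigvalsh(Sigma) >= -1e-10))    # True: positive semidefinite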

More Details on Mixture of Gaussians
P(x | p, \mu, \Sigma) = \sum_{k=1}^{K} p_k \, \mathcal{N}(x | \mu_k, \Sigma_k)
Clustering task: Given data points, estimate p, µ, Σ.
Output: Yields, for each data point n and each cluster k, an estimated probability
that n was generated from k.
Assign each data point to the mean with the highest probability.
Advantage: clusters that are (partially) contained in each other are also possible:
Figure: K = 3 Gaussian mixture in 1d and generated data.

Responsibility Calculation
Define the responsibility of cluster k for x via the observed relative frequencies,
i.e., the probability that x is generated by Gaussian k:
\gamma(x; k) = \frac{p_k \, \mathcal{N}(x | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} p_j \, \mathcal{N}(x | \mu_j, \Sigma_j)}
Roughly speaking: "How much (in %) of data point x is attributed to cluster k?"
Let the density function of the (standard) normal distribution be denoted by ϕ.
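A sketch of the responsibility computation in NumPy/SciPy, added for illustration; the parameters p, mu, Sig and the data points are hypothetical:

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, p, mu, Sig):
    """gamma[n, k] = p_k N(x_n | mu_k, Sig_k) / sum_j p_j N(x_n | mu_j, Sig_j)."""
    # unnormalized weights p_k * N(x_n | mu_k, Sigma_k), shape (N, K)
    w = np.column_stack([p[k] * multivariate_normal(mu[k], Sig[k]).pdf(X)
                         for k in range(len(p))])
    return w / w.sum(axis=1, keepdims=True)   # each row sums to 1

p   = [0.5, 0.5]
mu  = [np.zeros(2), np.array([3.0, 0.0])]
Sig = [np.eye(2), np.eye(2)]
X = np.array([[0.2, 0.1], [2.8, -0.3], [1.5, 0.0]])
print(responsibilities(X, p, mu, Sig))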

Outline of Expectation-Maximization Clustering
The update step takes the data points into account via relative frequencies.
Compare this to K-means... Indeed, K-means can be seen as a special case of EM.

A visual comparison
Drawback: K-means and EM only find local optima. (Recall that already the K-means
problem is NP-hard...)
Play around with scikit-learn (a machine learning library in Python), for example as sketched below.
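A minimal scikit-learn comparison, added as an illustration with invented, stretched blobs on which the spherical-cluster assumption of K-means is violated:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# anisotropic (stretched) blobs, offset in the second coordinate
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [0.0, 0.2]])
X = np.vstack([rng.normal(size=(100, 2)) @ A,
               rng.normal(size=(100, 2)) @ A + [0.0, 2.0]])

labels_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_em = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
# GaussianMixture estimates full covariance matrices and usually separates
# such stretched clusters better than K-means does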

Hierarchical Clustering Methods
• Sometimes there is a hierarchical structure in the data that we want to
  disclose in a clustering.
• In hierarchical methods, we do not need to specify the number of clusters K as input.
• Instead: a measure of dissimilarity, e.g., a 'distance' between (disjoint) groups of
  observations, typically based on pairwise dissimilarities.
• Clusters at each level of the hierarchy are created by merging clusters at the next
  lower level.
• At the lowest level, each cluster contains a single point; the highest level contains all points.
• Both agglomerative (bottom-up) and divisive (top-down) methods exist.

Hierarchical Clustering Methods
• Idea of the agglomerative approach: start at the bottom. At each level, recursively
  merge a selected pair of clusters into a single cluster, namely the two clusters with
  the smallest dissimilarity. The next higher level then contains one cluster less.
  ⇒ This leads to N − 1 levels in the hierarchy.
• Clusterings are often drawn as a rooted binary tree: it has a root node, and
  each node has at most two children.

Visualization as a Dendrogram
(Figure: dendrogram example from the statistical learning book.)
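A sketch of producing such a dendrogram with SciPy's agglomerative hierarchical clustering, added here as an illustration; the data and linkage choice are invented:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

# agglomerative clustering: repeatedly merge the two groups with smallest dissimilarity
Z = linkage(X, method="average", metric="euclidean")  # N - 1 merge steps
dendrogram(Z)
plt.show()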

Dendrogram
• A word of caution: small changes in the data can lead to quite different dendrograms.
• For a clustering application, we need to decide whether a hierarchical structure is
  actually intrinsic to the data or not.

Clustering in More General Contexts
How can we compute (pairwise) distances in a clustering algorithm, for example in
ordinal or in categorical contexts? E.g., suppose we have measurements or averages
over personal judgements by participants who are asked to judge differences between objects.
Often, one uses dissimilarities based on attributes for the distance calculation in
a clustering algorithm.
Dissimilarities are calculated as follows:
• Suppose we have, for objects i = 1, ..., N, measurements x_{ij} for variables
  (attributes) j = 1, ..., p.
• We define the dissimilarity between objects i and i' as
D(x_i, x_{i'}) = \sum_{j=1}^{p} d_j(x_{ij}, x_{i'j})
• In this definition, different attributes could also be weighted differently.
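A minimal sketch of this attribute-wise dissimilarity, added for illustration; the per-attribute functions d_j, the weights, and the example objects are invented:

import numpy as np

def dissimilarity(x_i, x_ip, d_funcs, weights=None):
    """D(x_i, x_i') = sum_j w_j * d_j(x_ij, x_i'j) over the p attributes."""
    if weights is None:
        weights = np.ones(len(d_funcs))
    return sum(w * d(a, b) for w, d, a, b in zip(weights, d_funcs, x_i, x_ip))

# hypothetical example with p = 2 attributes: one categorical, one numerical
d_categorical = lambda a, b: 0.0 if a == b else 1.0   # 0/1 loss L_{rr'}
d_numeric = lambda a, b: (a - b) ** 2                  # squared difference
print(dissimilarity(["car", 1.2], ["truck", 0.7], [d_categorical, d_numeric]))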

Alternatives for Calculating Distances
Depending on the variable types, distances need to be calculated differently.
• Ordinal variables: e.g., contiguous integers, or ordered sets such as academic
  grades (A, B, C, D, F) or degree of preference (can't stand, dislike, OK, like,
  terrific). Dissimilarities for ordinal variables are typically defined by replacing
  their M original values with
  \frac{i - 1/2}{M},  i = 1, ..., M,
  in the prescribed order of their original values.
• (Unordered) categorical variables: Dissimilarities between pairs of values must
  be delineated explicitly. If a variable assumes M distinct values, these can be
  arranged in a symmetric M × M matrix with L_{rr'} = L_{r'r}, L_{rr} = 0, L_{rr'} ≥ 0.
  Often: L_{rr'} = 1 for all r ≠ r'.
  Unequal losses can be used to emphasize some distances more than others.
How can we cluster data in such contexts?
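A small sketch (not from the slides) of both conventions, the ordinal replacement (i − 1/2)/M and an explicit categorical loss matrix L; the concrete values and orderings are invented:

import numpy as np

# ordinal variable with M ordered values, replaced by (i - 1/2)/M, i = 1, ..., M
grades = ["F", "D", "C", "B", "A"]                  # prescribed order, M = 5
M = len(grades)
ordinal_score = {g: (i - 0.5) / M for i, g in enumerate(grades, start=1)}
d_ordinal = lambda a, b: abs(ordinal_score[a] - ordinal_score[b])

# unordered categorical variable with an explicit symmetric loss matrix L
categories = ["car", "truck", "bike"]
L = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])                     # L_rr' = 1 for all r != r'
d_categorical = lambda a, b: L[categories.index(a), categories.index(b)]

print(d_ordinal("A", "C"), d_categorical("car", "bike"))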

K-medoids Algorithm
As discussed before, the K-means clustering algorithm is designed for
numerical and quantitative values.
Look at K-means again: iteratively,
1. for the current cluster assignment, choose each cluster center so that it minimizes
   the total distance (or dissimilarity) to the points of its cluster;
2. given the current centers, assign each observation to its closest center.
The optimization step 1) can easily be generalized to dissimilarities stored in
matrices, as studied on the previous slides.

K-Medoids Algorithm
1. For a given cluster assignment C, find the observation in the cluster that
   minimizes the total distance to the other points in that cluster:
   i^*_k = \arg\min_{\{i : C(i) = k\}} \sum_{i' : C(i') = k} D(x_i, x_{i'}).
   Then
   m_k = x_{i^*_k},  k = 1, 2, ..., K
   are the current estimates of the cluster centers.
2. Given a current set of cluster centers {m_1, ..., m_K}, minimize the total
   dissimilarity by assigning each observation to the closest (current) cluster center:
   C(i) = \arg\min_{1 \le k \le K} D(x_i, m_k).
3. Iterate steps 1 and 2 until the assignments do not change.
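A minimal NumPy sketch of these steps operating on a precomputed dissimilarity matrix D, added as an illustration; the function name k_medoids and the toy data are invented:

import numpy as np

def k_medoids(D, K, max_iter=100, seed=0):
    """Alternate medoid-update and assignment steps on a dissimilarity matrix D (N x N)."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    medoids = rng.choice(N, size=K, replace=False)   # indices of the current centers
    assign = D[:, medoids].argmin(axis=1)            # step 2: closest center
    for _ in range(max_iter):
        # step 1: within each cluster, pick the point with minimal total dissimilarity
        for k in range(K):
            members = np.where(assign == k)[0]
            if len(members) > 0:
                within = D[np.ix_(members, members)].sum(axis=1)
                medoids[k] = members[within.argmin()]
        # step 2: reassign every observation to the closest medoid
        new_assign = D[:, medoids].argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                                    # step 3: assignments did not change
        assign = new_assign
    return medoids, assign

# usage with a Euclidean dissimilarity matrix (any dissimilarity matrix works)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
medoids, labels = k_medoids(D, K=2)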

How to choose the number of clusters K?
For a clustering problem, the best number of clusters depends on the goal and on
knowledge about the application.
• Sometimes, the best value of K is given as input (e.g., K salespeople are
  employed, and the task is to cluster a database into K segments).
• However, if 'natural' clusters need to be determined, the best value of K is
  unknown and needs to be estimated from the data as well.

(Heuristic) estimate of good K values
• Examine the within-cluster dissimilarity D_{k'} as a function of the number of clusters k'.
• Separate solutions are obtained for k' ∈ {1, 2, ..., K_max}.
• The values {D_1, D_2, ..., D_{K_max}} decrease with increasing k'.
• Intuition: if there are K natural groupings, then for k' < K, clusters will each
  contain a subset of the true underlying groups, i.e., the dissimilarity D_{k'} will
  (noticeably) decrease for increasing k'.
• For k' > K instead, at least one natural group will be split across separate
  clusters, i.e., D_{k'} will decrease only mildly.
• Estimate the best number of clusters K by identifying a 'kink' in the plot of D_{k'} as
  a function of k' (see the sketch below).
Next, we will turn to a fundamental method in learning that can be used for
clustering as well, and also more generally for reducing data dimensions, called
Principal Component Analysis.
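A sketch of this heuristic using the K-means inertia as the within-cluster dissimilarity; it is an illustration with invented data, and in general any D_{k'} produced by the chosen clustering method can be plotted the same way:

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# data with K = 3 natural groupings (positions chosen for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in ([0, 0], [3, 0], [0, 3])])

ks = range(1, 9)
# within-cluster dissimilarity D_k' for each candidate k'
D = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), D, marker="o")
plt.xlabel("number of clusters k'")
plt.ylabel("within-cluster dissimilarity D_k'")
plt.show()   # the 'kink' at k' = 3 suggests K = 3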
