Unsupervised Learning: Clustering
Lecture “Mathematics of Learning” 2023/24
Jan Rolfes
Friedrich-Alexander-Universität Erlangen-Nürnberg
Rough Differentiation in Learning Methods
Supervised learning:
• predict values of an outcome measure based on a number of input measures
(e.g., given some patient data together with the label ’has illness’ / ’does not have illness’: when new patient data comes in, predict whether the patient is ill or not).
Unsupervised learning:
• no outcome measure given; goal: find structures among the data.
...there is also something in between: semi-supervised learning.
Unsupervised Learning: Clustering of Data
Is there any structure in this data?
There are two categories/clusters such that objects within a cluster resemble each other but objects from different clusters look different.
What’s clustering?
Given fruit measurement data $(\mathrm{height}_i, \mathrm{width}_i)_{i=1}^{N}$.
• Visually: There are multiple “categories” of data.
• How can we sort data into categories/clusters? → clustering problem
• More measurements help.
• Clustering the raw data by hand is cumbersome.
Clustering Algorithm
Given
• N: number of data points
• M: number of variables (e.g., “mass”, “price”, “color”, ...)
• Data $X = \{x_1, \dots, x_N\}$, where $x_n \in \mathbb{R}^M$ for all $n = 1, \dots, N$
• K: assumed number of clusters
Want
• Assignment: $x_n \mapsto k_n \in \{1, \dots, K\}$ for all $n = 1, \dots, N$
• Assignment rule: $x \mapsto k(x) \in \{1, \dots, K\}$ for all $x \in \mathbb{R}^M$
• Reconstruction rule (’representative’): $k \mapsto m_k \in \mathbb{R}^M$
On an abstract level:
• Determining the best possible clustering (w.r.t. some objective) is a classical combinatorial optimization problem.
• K-means clustering: Determine K points, i.e., centers, such that the sum of squared Euclidean distances of the data points to their closest center is minimized.
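As a small illustrative sketch (not part of the original slides; the data and centers are made up), the assignment rule and the reconstruction rule could be written in NumPy as follows:

```python
import numpy as np

# Hypothetical toy data: N = 6 fruit measurements (height, width), so M = 2
X = np.array([[7.1, 7.0], [7.4, 7.2], [6.9, 6.8],
              [11.5, 5.0], [12.0, 5.2], [11.8, 4.9]])

# Assume K = 2 representatives m_1, m_2 are already given
m = np.array([[7.1, 7.0], [11.8, 5.0]])

def assign(x, m):
    """Assignment rule: x -> k(x) = argmin_k ||x - m_k||."""
    return int(np.argmin(np.linalg.norm(m - x, axis=1)))

def reconstruct(k, m):
    """Reconstruction rule: k -> m_k (the representative of cluster k)."""
    return m[k]

k_n = np.array([assign(x, m) for x in X])   # assignment x_n -> k_n for all n
print(k_n)                                  # e.g. [0 0 0 1 1 1]
print(reconstruct(k_n[3], m))               # representative of the cluster of x_4
```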
Clustering Problems and Algorithms
• No general definition of clustering here; see Theoretical Computer Science for further details.
• However: Already in simplified / restricted situations, clustering is difficult, i.e., NP-hard.
• In particular: We cannot expect to find an algorithm that determines the best clustering in time polynomial in the input size.
• Further Reading: M. Mahajan, P. Nimbhorkar, K. Varadarajan: The planar k-means problem is NP-hard. Proceedings of WALCOM: Algorithms and Computation, pp. 274–285 (2009).
• Clustering is a very basic learning task. Depending on the application, different clustering algorithms work best.
• Here: Focus on the (standard) algorithms K-means and Expectation Maximization.
• Plus: Hierarchical clustering and principal component analysis for clustering and data reduction.
K-means clustering as optimization problem
Find a partition $C = \{C_1, \dots, C_K\}$ of the data into sets $C_k \subset X$ and means $m = \{m_1, \dots, m_K\}$ with $m_k$ associated to $C_k$, which minimize the energy
$$E(C, m) := \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - m_k\|^2.$$
Observations
• The clustering energy has local minima.
(picture from: https://upload.wikimedia.org/wikipedia/commons/7/7c/K-means_convergence_to_a_local_minimum.png, modified)
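For illustration, a minimal NumPy sketch (my own example, with made-up data) that evaluates this energy for a given assignment and given means:

```python
import numpy as np

def kmeans_energy(X, labels, m):
    """Clustering energy E(C, m) = 1/2 * sum_k sum_{x in C_k} ||x - m_k||^2."""
    return 0.5 * sum(
        np.sum(np.linalg.norm(X[labels == k] - m[k], axis=1) ** 2)
        for k in range(len(m))
    )

# Hypothetical toy example: two clusters in the plane
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
labels = np.array([0, 0, 1, 1])
m = np.array([[0.5, 0.0], [10.5, 10.0]])
print(kmeans_energy(X, labels, m))  # 0.5 * (4 * 0.25) = 0.5
```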
Derivation of the K-means algorithm
Let us fix the clustering $C$ in
$$E(C, m) := \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - m_k\|^2.$$
Necessary first-order optimality condition: the gradient with respect to $m_k$ is zero, i.e., $m_k$ is a critical point. Taking the gradient with respect to $m_k$ we obtain the condition
$$0 = \nabla_{m_k} E(C, m) = -\sum_{x \in C_k} (x - m_k) = |C_k|\, m_k - \sum_{x \in C_k} x$$
and hence
$$m_k = \frac{1}{|C_k|} \sum_{x \in C_k} x \;\; \widehat{=} \;\; \text{mean of the cluster.}$$
Problem: Neither clusters nor means are known in advance; thus we heuristically start searching for good means.
Derivation of the K-means algorithm
Conversely, let us fix the means $m$ in
$$E(C, m) := \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - m_k\|^2$$
and perform the simple assignment step
$$C_k = \{x \in X : \|x - m_k\| \le \|x - m_j\| \text{ for all } j = 1, \dots, K\}$$
$\widehat{=}$ Voronoi cell of $m_k$.
K-means clustering algorithm
Data: X = {x_1, ..., x_N} and number of clusters K ∈ ℕ
Result: cluster means m = (m_1, ..., m_K)
initialize m randomly;
repeat
    // assignment step
    for n ← 1 to N do  // assign n-th point to cluster with nearest mean
        k_n ← argmin_k ||x_n − m_k||
    end
    // update step
    for k ← 1 to K do
        C_k ← {n ∈ {1, ..., N} : k_n = k}  // indices currently assigned to cluster k
        if |C_k| > 0 then
            m_k ← (1 / |C_k|) Σ_{n ∈ C_k} x_n  // recompute the mean of cluster k
        end
    end
until the assignment step does not change anything;
• Assignment rule: x ↦ argmin_k ||x − m_k||.
• Reconstruction rule: k ↦ m_k.
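A compact NumPy sketch of this loop (an illustrative implementation under the above conventions, not the code used in the lecture) might look like this:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means: returns cluster means m (K x M) and labels k_n (length N)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    m = X[rng.choice(N, size=K, replace=False)].astype(float)  # random data points as initial means
    labels = np.zeros(N, dtype=int)
    for it in range(max_iter):
        # assignment step: nearest mean for every point
        dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)  # N x K distance matrix
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break                                   # assignment step changed nothing -> stop
        labels = new_labels
        # update step: recompute the mean of every non-empty cluster
        for k in range(K):
            if np.any(labels == k):
                m[k] = X[labels == k].mean(axis=0)
    return m, labels

# Hypothetical usage on toy data with two well-separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)), rng.normal(5.0, 0.3, size=(50, 2))])
m, labels = kmeans(X, K=2)
print(m)
```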
Different metrics lead to different (convex) clusters
Using the squared Euclidean distance (L2-norm):
$$E(C, m) = \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - m_k\|^2$$
Using the Manhattan distance (L1-norm):
$$E(C, m) = \sum_{k=1}^{K} \sum_{x \in C_k} \sum_{i=1}^{M} |x^{(i)} - m_k^{(i)}|$$
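Side note (a sketch that goes beyond the slides): under the Manhattan objective the update step separates by coordinate, and the coordinate-wise minimizer is the median rather than the mean; this is the idea behind K-medians. A tiny comparison with made-up data:

```python
import numpy as np

# For a fixed cluster C_k, compare the optimal representative under L2 and L1:
C_k = np.array([[1.0, 2.0], [2.0, 3.0], [100.0, 4.0]])  # hypothetical cluster with an outlier

m_L2 = C_k.mean(axis=0)         # minimizes the sum of squared Euclidean distances
m_L1 = np.median(C_k, axis=0)   # minimizes the Manhattan objective, coordinate by coordinate

print(m_L2)  # [34.33...  3.] - pulled strongly towards the outlier
print(m_L1)  # [ 2.  3.]       - robust against the outlier
```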
Advantages and Disadvantages of K-Means
A very well-known clustering algorithm that is used often.
Advantages:
• easy to implement
• can run with only a set of real data vectors and a value of K
• A feasible clustering is always available.
Disadvantages:
• Choosing a good value of K can be difficult. Potential way out: test several values.
• Assumes numerical data, not categorical data such as ’car’, ’truck’, etc. We will see a clustering approach for categorical data later.
• K-means aims at minimizing Euclidean distances. This is not always the right objective.
• The result strongly depends on the initialization, but some improvements are known.
• Assumes that clusters are convex.
Advantages and Disadvantages of K-Means
Disadvantages (continued):
• K-means sometimes does not work well, in particular for non-spherical / nonconvex data or for unevenly sized clusters, i.e., it has some implicit assumptions.
(figure from varianceexplained.org)
Next: some improvements.
Expectation-Maximization Clustering Algorithm (EM)
• Further reading: The Elements of Statistical Learning, Chapter 14.
• Recall: K-means has implicit assumptions (clusters are convex & roughly equally sized) that may not be satisfied.
• Alternative way of thinking: decide for each data point the probability with which it belongs to a certain cluster, i.e., a ’soft’ clustering. This allows clusters of different size and can detect correlations.
• Problem: This probability distribution is unknown.
• Task: Estimate the probability distribution and improve the estimate iteratively.
Mixture of Gaussian Distributions
• Make a quite general assumption: the unknown distribution is a mixture, i.e., superposition, of K many (multi-dimensional) Gaussian distributions. This means that $x \sim P(\cdot \mid p, \mu, \Sigma)$ with a probability density of the form
$$P(x \mid p, \mu, \Sigma) = \sum_{k=1}^{K} p_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k),$$
• where $p, \mu, \Sigma$ are unknown parameters with
$p = (p_1, \dots, p_K)$, $p_k \in \mathbb{R}$: probability vector,
$\mu = (\mu_1, \dots, \mu_K)$, $\mu_k \in \mathbb{R}^M$: vector of means,
$\Sigma = (\Sigma_1, \dots, \Sigma_K)$, $\Sigma_k \in \mathbb{R}^{M \times M}$: covariance matrices,
where $M$ is the dimension of a data point $x$, i.e., $x \in \mathbb{R}^M$.
Recall the Gaussian distribution in dimension $M$ with mean vector $\mu \in \mathbb{R}^M$ and covariance $\Sigma \in \mathbb{R}^{M \times M}$:
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^M \det(\Sigma)}} \exp\Big(-\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu)\Big)$$
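As a small sketch (illustrative only; the parameter values are made up), this mixture density can be evaluated with SciPy as follows:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, p, mu, Sigma):
    """Evaluate P(x | p, mu, Sigma) = sum_k p_k * N(x | mu_k, Sigma_k)."""
    return sum(p_k * multivariate_normal.pdf(x, mean=mu_k, cov=S_k)
               for p_k, mu_k, S_k in zip(p, mu, Sigma))

# Hypothetical K = 2 mixture in M = 2 dimensions
p     = [0.3, 0.7]
mu    = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigma = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

print(mixture_density(np.array([0.5, 0.2]), p, mu, Sigma))
```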
Recall: Covariance Matrix
Let $x = (x_1, \dots, x_M)^\top$ be a random vector with finite variance and mean. The covariance matrix $\Sigma = (\Sigma_{x_i, x_j}) \in \mathbb{R}^{M \times M}$ is defined as
$$\Sigma_{x_i, x_j} = \mathbb{E}\big((x_i - \mathbb{E}(x_i))(x_j - \mathbb{E}(x_j))\big),$$
where $\mathbb{E}$ denotes the expected value, $\mu_X = \mathbb{E}(X)$.
The covariance matrix
• represents important statistical information, in particular correlations between the data,
• is a real, square, symmetric matrix,
• is positive semidefinite.
More Details on Mixture of Gaussians
$$P(x \mid p, \mu, \Sigma) = \sum_{k=1}^{K} p_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Clustering task: Given the data points, estimate $p, \mu, \Sigma$.
Output: For each data point $n$ and for each cluster $k$, an estimated probability that $x_n$ was generated by component $k$.
Assign each data point to the mean with the highest probability.
Advantage: clusters that are (partially) contained in each other are also possible:
Figure: K = 3 Gaussian mixture in 1d and generated data.
Responsibility Calculation
Define the responsibility of cluster k for x by observed relative frequencies, i.e., the probability that x is generated by Gaussian k:
$$\gamma(x; k) = \frac{p_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} p_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$
Roughly speaking: “How much (in %) of data point x is attributed to cluster k?”
Let the density function of the (standard) normal distribution be denoted by $\phi$.
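A small sketch (illustrative; the parameter values are assumptions) that computes these responsibilities for all data points at once:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, p, mu, Sigma):
    """E-step: gamma[n, k] = p_k N(x_n | mu_k, Sigma_k) / sum_j p_j N(x_n | mu_j, Sigma_j)."""
    N, K = X.shape[0], len(p)
    gamma = np.zeros((N, K))
    for k in range(K):
        gamma[:, k] = p[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize each row to sum to 1
    return gamma

# Hypothetical usage with a K = 2 mixture in 2 dimensions
X = np.array([[0.1, -0.2], [4.2, 3.9], [2.0, 2.0]])
p, mu, Sigma = [0.5, 0.5], [np.zeros(2), np.full(2, 4.0)], [np.eye(2), np.eye(2)]
print(responsibilities(X, p, mu, Sigma))
```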
Outline of Expectation Maximization Clustering
The update step takes the data points into account via relative frequencies.
Compare this to K-means... Indeed, K-means can be seen as a special case of EM.
A visual comparison
Drawback: K-means and EM only find local optima. (Recall that already the K-means problem is NP-hard...)
Play around with scikit-learn (a machine learning library in Python).
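For example, a minimal scikit-learn comparison of hard K-means labels and soft EM responsibilities on synthetic data (the data set and parameters are illustrative) could look like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated, overlapping blobs: a case where soft EM clustering can help
X = np.vstack([rng.normal([0, 0], [3.0, 0.5], size=(200, 2)),
               rng.normal([0, 2], [3.0, 0.5], size=(200, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)

print(km.labels_[:5])          # hard assignments from K-means
print(gm.predict(X)[:5])       # hard assignments derived from the mixture
print(gm.predict_proba(X)[:5]) # soft responsibilities gamma(x; k)
```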
Hierarchical Clustering Methods
• Sometimes there is a hierarchical structure in the data that we want to disclose in a clustering.
• In hierarchical methods, we do not need to specify the number of clusters K as input.
• Instead: a measure of dissimilarity, e.g., a ’distance’ between (disjoint) groups of observations, typically based on pairwise dissimilarities.
• Clusters at each level of the hierarchy are created by merging clusters at the next lower level.
• At the lowest level, each cluster contains a single point; the highest level contains all points in one cluster.
• Both agglomerative (bottom-up) and divisive (top-down) methods exist.
Hierarchical Clustering Methods
• Idea of the agglomerative approach: start at the bottom. At each level, recursively merge a selected pair of clusters into a single cluster, so that the next higher level contains one cluster less. Merge the two clusters with the smallest dissimilarity. ⇒ This leads to N − 1 levels in the hierarchy.
• Clusterings are often drawn as a rooted binary tree: it contains one node as root, and each node has at most two children.
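A short illustrative sketch (my own example, not from the slides) of agglomerative clustering with SciPy, including the dendrogram discussed next:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

# Hypothetical toy data: three small groups in the plane
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, size=(10, 2)) for c in ([0, 0], [3, 0], [0, 3])])

# Agglomerative clustering: repeatedly merge the pair of clusters with smallest dissimilarity
Z = linkage(X, method='average')                  # N - 1 merge steps based on pairwise distances

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
print(labels)

dendrogram(Z)                                     # visualize the hierarchy as a dendrogram
plt.show()
```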
Visualization as a Dendrogram
(Figure: dendrogram example from The Elements of Statistical Learning.)
Dendrogram
• A word of caution: small changes in the data can lead to quite different dendrograms.
• For a clustering application, we need to decide whether a hierarchical structure is actually intrinsic to the data or not.
Clustering in More General Contexts
How can we compute (pairwise) distances in a clustering algorithm, for example in ordinal or in categorical contexts? E.g., suppose we have measurements or averages over personal judgements by participants who are asked to judge differences between objects.
Often, one uses dissimilarities based on attributes for the distance calculation in a clustering algorithm.
Dissimilarities are calculated as follows:
• Suppose we have, for objects $i = 1, \dots, N$, measurements $x_{ij}$ for variables (attributes) $j = 1, \dots, p$.
• We define the dissimilarity between objects $i$ and $i'$ as
$$D(x_i, x_{i'}) = \sum_{j=1}^{p} d_j(x_{ij}, x_{i'j}).$$
• In this definition, different attributes could also be weighted differently.
Alternatives for Calculating Distances
Depending on the variable types, distances need to be calculated differently.
• Ordinal variables: e.g., contiguous integers, ordered sets such as academic grades (A, B, C, D, F), degree of preference (can’t stand, dislike, OK, like, terrific). Dissimilarities for ordinal variables are typically defined by replacing their M original values with
$$\frac{i - \tfrac{1}{2}}{M}, \quad i = 1, \dots, M$$
in the prescribed order of their original values.
• (Unordered) categorical variables: the dissimilarity between pairs of values must be delineated explicitly. If a variable assumes M distinct values, these can be arranged in a symmetric $M \times M$ matrix with $L_{rr'} = L_{r'r}$, $L_{rr} = 0$, $L_{rr'} \ge 0$. Often: $L_{rr'} = 1$ for all $r \ne r'$. Unequal losses can be used to emphasize some distances more than others.
How can we cluster data in such contexts?
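As a small illustrative sketch (the attribute names and values are made up), ordinal and categorical attributes can be turned into such dissimilarities like this:

```python
import numpy as np

# Hypothetical objects with an ordinal attribute (grade) and a categorical one (vehicle type)
grades  = ['A', 'B', 'B', 'D']          # ordered: A < B < C < D < F
vehicle = ['car', 'truck', 'car', 'bike']

# Ordinal: replace the M ordered values by (i - 1/2) / M and use absolute differences
order = ['A', 'B', 'C', 'D', 'F']
M = len(order)
grade_score = {g: (i + 0.5) / M for i, g in enumerate(order)}   # (i - 1/2)/M for i = 1,...,M
g = np.array([grade_score[x] for x in grades])
d_grade = np.abs(g[:, None] - g[None, :])

# Categorical: 0/1 loss matrix L_{rr'} (here simply 1 whenever the values differ)
v = np.array(vehicle)
d_vehicle = (v[:, None] != v[None, :]).astype(float)

# Total dissimilarity D(x_i, x_i') as the (possibly weighted) sum over attributes
D = d_grade + d_vehicle
print(np.round(D, 2))
```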
K-medoids Algorithm
As discussed before, the K-means clustering algorithm is designed for numerical and quantitative values.
Look at K-means again: iteratively,
1. for a given cluster assignment, choose each cluster center so that it minimizes the total distance (or dissimilarity) to the points in its cluster;
2. given the current centers, assign each observation to the closest center.
The optimization step 1 can easily be generalized to dissimilarities stored in matrices, as studied on the last slides.
K-Medoids Algorithm
1. For a given cluster assignment C, find the observation in each cluster that minimizes the total distance to the other points in that cluster:
$$i^\star_k = \operatorname{argmin}_{\{i : C(i) = k\}} \sum_{C(i') = k} D(x_i, x_{i'}).$$
Then
$$m_k = x_{i^\star_k}, \quad k = 1, 2, \dots, K$$
are the current estimates of the cluster centers.
2. Given a current set of cluster centers $\{m_1, \dots, m_K\}$, minimize the total dissimilarity by assigning each observation to the closest (current) cluster center:
$$C(i) = \operatorname{argmin}_{1 \le k \le K} D(x_i, m_k).$$
3. Iterate steps 1 and 2 until the assignments do not change.
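A sketch of this procedure on a precomputed dissimilarity matrix (an illustrative implementation, not the lecture's code):

```python
import numpy as np

def k_medoids(D, K, max_iter=100, seed=0):
    """K-medoids on a precomputed N x N dissimilarity matrix D (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    medoids = rng.choice(N, size=K, replace=False)          # indices of the current centers
    assign = D[:, medoids].argmin(axis=1)                   # step 2: closest medoid
    for _ in range(max_iter):
        # step 1: in each cluster, pick the point with minimal total dissimilarity
        for k in range(K):
            members = np.where(assign == k)[0]
            if members.size > 0:
                medoids[k] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        # step 2: reassign each observation to its closest medoid
        new_assign = D[:, medoids].argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                                            # step 3: assignments unchanged
        assign = new_assign
    return medoids, assign

# Usage, e.g., with the attribute-based dissimilarity matrix D from the previous sketch:
# medoids, assign = k_medoids(D, K=2)
```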
How to choose the number of clusters K?
For a clustering problem, the best number of clusters depends on the goal and on the knowledge about the application.
• Sometimes, the best value of K is given as input (e.g., K salespeople are employed and the task is to cluster a database into K segments).
• However, if ’natural’ clusters need to be determined, the best value of K is unknown and needs to be estimated from the data as well.
(Heuristic) estimate of good K values
• Examine the within-cluster dissimilarity $D_{k'}$ as a function of the number of clusters $k'$.
• Separate solutions are obtained for $k' \in \{1, 2, \dots, K_{\max}\}$.
• The values $\{D_1, D_2, \dots, D_{K_{\max}}\}$ decrease with increasing $k'$.
• Intuition: if there are K natural groupings, then for $k' < K$ the clusters will each contain a subset of the true underlying groups, i.e., the dissimilarity $D_{k'}$ will decrease noticeably for increasing $k'$.
• For $k' > K$ instead, at least one natural group is split across separate clusters, i.e., $D_{k'}$ will decrease only mildly.
• Estimate the best number of clusters K by identifying a ’kink’ in the plot of $D_{k'}$ as a function of $k'$.
Next, we will turn to a fundamental method in learning that can be used for clustering as well, and also more generally for reducing data dimensions, called Principal Component Analysis.
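A small sketch of this heuristic (illustrative; here $D_{k'}$ is taken as the within-cluster sum of squares reported by scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical data with K = 3 natural groups
X = np.vstack([rng.normal(c, 0.4, size=(100, 2)) for c in ([0, 0], [4, 0], [2, 4])])

K_max = 8
D = []
for k_prime in range(1, K_max + 1):
    km = KMeans(n_clusters=k_prime, n_init=10, random_state=0).fit(X)
    D.append(km.inertia_)        # within-cluster sum of squared distances D_{k'}

plt.plot(range(1, K_max + 1), D, marker='o')
plt.xlabel("number of clusters k'")
plt.ylabel("within-cluster dissimilarity D_{k'}")
plt.show()                       # look for the 'kink' (elbow); here around k' = 3
```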