Unsupervised Learning: Clustering
Lecture “Mathematics of Learning” 2023/24
Jan Rolfes
Friedrich-Alexander-Universität Erlangen-Nürnberg
Rough Differentiation in Learning Methods
Supervised learning:
• predict values of an outcome measure based on a number of input measures
(e.g., given some patient data together with the label ’has illness’ / ’does not have illness’: when new patient data comes in, predict whether the patient is ill or not).
Unsupervised learning:
• no outcome measure given; goal: find structures among the data.
...there is also something in between: semi-supervised learning.
Unsupervised Learning: Clustering of Data
Is there any structure in this data?
There are two categories/clusters such that objects within a cluster resemble each other but objects from different clusters look different.
What’s clustering?
Given fruit measurement data $(\mathrm{height}_i, \mathrm{width}_i)_{i=1}^{N}$.
• Visually: There are multiple “categories” of data.
• How can we sort data into categories/clusters? → clustering problem
• More measurements help.
• Clustering the raw data by hand is cumbersome.
Clustering Algorithm
Given
• N: number of data points
• M: number of variables (e.g., “mass”, “price”, “color”, ...)
• Data $X = \{x_1, \dots, x_N\}$, where $x_n \in \mathbb{R}^M$ for all $n = 1, \dots, N$
• K: assumed number of clusters
Want
• Assignment: $x_n \mapsto k_n \in \{1, \dots, K\}$ for all $n = 1, \dots, N$
• Assignment rule: $x \mapsto k(x) \in \{1, \dots, K\}$ for all $x \in \mathbb{R}^M$
• Reconstruction rule (’representative’): $k \mapsto m_k \in \mathbb{R}^M$
On an abstract level:
• Determining the best possible clustering (w.r.t. some objective) is a classical combinatorial optimization problem.
• K-means clustering: Determine K points, i.e., centers, such that the sum of squared Euclidean distances of the data points to their closest center is minimized.
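As a small illustrative sketch (not part of the original slides; the data and centers are made up), the assignment rule and the reconstruction rule could be written in NumPy as follows:

```python
import numpy as np

# Hypothetical toy data: N = 6 fruit measurements (height, width), so M = 2
X = np.array([[7.1, 7.0], [7.4, 7.2], [6.9, 6.8],
              [11.5, 5.0], [12.0, 5.2], [11.8, 4.9]])

# Assume K = 2 representatives m_1, m_2 are already given
m = np.array([[7.1, 7.0], [11.8, 5.0]])

def assign(x, m):
    """Assignment rule: x -> k(x) = argmin_k ||x - m_k||."""
    return int(np.argmin(np.linalg.norm(m - x, axis=1)))

def reconstruct(k, m):
    """Reconstruction rule: k -> m_k (the representative of cluster k)."""
    return m[k]

k_n = np.array([assign(x, m) for x in X])   # assignment x_n -> k_n for all n
print(k_n)                                  # e.g. [0 0 0 1 1 1]
print(reconstruct(k_n[3], m))               # representative of the cluster of x_4
```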
Clustering Problems and Algorithms
• No general definition of clustering here; see Theoretical Computer Science for further details.
• However: Already in simplified / restricted situations, clustering is difficult, i.e., NP-hard.
• In particular: We cannot expect to find an algorithm that determines the best clustering in time polynomial in the input size.
• Further Reading: M. Mahajan, P. Nimbhorkar, K. Varadarajan: The planar k-means problem is NP-hard. Proceedings of WALCOM: Algorithms and Computation, pp. 274–285 (2009).
• Clustering is a very basic learning task. Depending on the application, different clustering algorithms work best.
• Here: Focus on the (standard) algorithms K-means and Expectation Maximization.
• Plus: Hierarchical clustering and principal component analysis for clustering and data reduction.
K-means clustering as optimization problem
Find a partition $C = \{C_1, \dots, C_K\}$ of the data into sets $C_k \subset X$ and means $m = \{m_1, \dots, m_K\}$ with $m_k$ associated to $C_k$, which minimize the energy
$$E(C, m) := \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - m_k\|^2.$$
Observations
• The clustering energy has local minima.
(picture from: https://upload.wikimedia.org/wikipedia/commons/7/7c/K-means_convergence_to_a_local_minimum.png, modified)
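For illustration, a minimal NumPy sketch (my own example, with made-up data) that evaluates this energy for a given assignment and given means:

```python
import numpy as np

def kmeans_energy(X, labels, m):
    """Clustering energy E(C, m) = 1/2 * sum_k sum_{x in C_k} ||x - m_k||^2."""
    return 0.5 * sum(
        np.sum(np.linalg.norm(X[labels == k] - m[k], axis=1) ** 2)
        for k in range(len(m))
    )

# Hypothetical toy example: two clusters in the plane
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
labels = np.array([0, 0, 1, 1])
m = np.array([[0.5, 0.0], [10.5, 10.0]])
print(kmeans_energy(X, labels, m))  # 0.5 * (4 * 0.25) = 0.5
```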
Derivation of the K-means algorithm
Let us fix the clustering $C$ in
$$E(C, m) := \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - m_k\|^2.$$
Necessary first-order optimality condition: the gradient with respect to $m_k$ is zero, i.e., $m_k$ is a critical point. Taking the gradient with respect to $m_k$ we obtain the condition
$$0 = \nabla_{m_k} E(C, m) = -\sum_{x \in C_k} (x - m_k) = |C_k|\, m_k - \sum_{x \in C_k} x$$
and hence
$$m_k = \frac{1}{|C_k|} \sum_{x \in C_k} x \;\; \widehat{=} \;\; \text{mean of the cluster.}$$
Problem: Neither clusters nor means are known in advance; thus we heuristically start searching for good means.
Derivation of the K-means algorithm
Conversely, let us fix the means $m$ in
$$E(C, m) := \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - m_k\|^2$$
and perform the simple assignment step
$$C_k = \{x \in X : \|x - m_k\| \le \|x - m_j\| \text{ for all } j = 1, \dots, K\}$$
$\widehat{=}$ Voronoi cell of $m_k$.
K-means clustering algorithm
Data: X = {x_1, ..., x_N} and number of clusters K ∈ ℕ
Result: cluster means m = (m_1, ..., m_K)
initialize m randomly;
repeat
    // assignment step
    for n ← 1 to N do  // assign n-th point to cluster with nearest mean
        k_n ← argmin_k ||x_n − m_k||
    end
    // update step
    for k ← 1 to K do
        C_k ← {n ∈ {1, ..., N} : k_n = k}  // indices currently assigned to cluster k
        if |C_k| > 0 then
            m_k ← (1 / |C_k|) Σ_{n ∈ C_k} x_n  // recompute the mean of cluster k
        end
    end
until the assignment step does not change anything;
• Assignment rule: x ↦ argmin_k ||x − m_k||.
• Reconstruction rule: k ↦ m_k.
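A compact NumPy sketch of this loop (an illustrative implementation under the above conventions, not the code used in the lecture) might look like this:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means: returns cluster means m (K x M) and labels k_n (length N)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    m = X[rng.choice(N, size=K, replace=False)].astype(float)  # random data points as initial means
    labels = np.zeros(N, dtype=int)
    for it in range(max_iter):
        # assignment step: nearest mean for every point
        dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)  # N x K distance matrix
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break                                   # assignment step changed nothing -> stop
        labels = new_labels
        # update step: recompute the mean of every non-empty cluster
        for k in range(K):
            if np.any(labels == k):
                m[k] = X[labels == k].mean(axis=0)
    return m, labels

# Hypothetical usage on toy data with two well-separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)), rng.normal(5.0, 0.3, size=(50, 2))])
m, labels = kmeans(X, K=2)
print(m)
```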
Different metrics lead to different (convex) clusters
Using the squared Euclidean distance (L2-norm):
$$E(C, m) = \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - m_k\|^2$$
Using the Manhattan distance (L1-norm):
$$E(C, m) = \sum_{k=1}^{K} \sum_{x \in C_k} \sum_{i=1}^{M} |x^{(i)} - m_k^{(i)}|$$
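Side note (a sketch that goes beyond the slides): under the Manhattan objective the update step separates by coordinate, and the coordinate-wise minimizer is the median rather than the mean; this is the idea behind K-medians. A tiny comparison with made-up data:

```python
import numpy as np

# For a fixed cluster C_k, compare the optimal representative under L2 and L1:
C_k = np.array([[1.0, 2.0], [2.0, 3.0], [100.0, 4.0]])  # hypothetical cluster with an outlier

m_L2 = C_k.mean(axis=0)         # minimizes the sum of squared Euclidean distances
m_L1 = np.median(C_k, axis=0)   # minimizes the Manhattan objective, coordinate by coordinate

print(m_L2)  # [34.33...  3.] - pulled strongly towards the outlier
print(m_L1)  # [ 2.  3.]       - robust against the outlier
```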
Advantages and Disadvantages of K-Means
A very well-known clustering algorithm that is used often.
Advantages:
• easy to implement
• can run with only a set of real data vectors and a value of K
• A feasible clustering is always available.
Disadvantages:
• Choosing a good value of K can be difficult. Potential way out: test several values.
• Assumes numerical data, not categorical data such as ’car’, ’truck’, etc. We will see a clustering approach for categorical data later.
• K-means aims at minimizing Euclidean distances. This is not always the right objective.
• The result strongly depends on the initialization, but some improvements are known.
• Assumes that clusters are convex.
Advantages and Disadvantages of K-Means
Disadvantages (continued):
• K-means sometimes does not work well, in particular for non-spherical / nonconvex data or for unevenly sized clusters, i.e., it has some implicit assumptions.
(figure from varianceexplained.org)
Next: some improvements.
Expectation-Maximization Clustering Algorithm (EM)
• Further reading: The Elements of Statistical Learning, Chapter 14.
• Recall: K-means has implicit assumptions (clusters are convex & roughly equally sized) that may not be satisfied.
• Alternative way of thinking: decide for each data point the probability with which it belongs to a certain cluster, i.e., a ’soft’ clustering. This allows clusters of different size and can detect correlations.
• Problem: This probability distribution is unknown.
• Task: Estimate the probability distribution and improve the estimate iteratively.
Mixture of Gaussian Distributions
• Make a quite general assumption: the unknown distribution is a mixture, i.e., superposition, of K many (multi-dimensional) Gaussian distributions. This means that $x \sim P(\cdot \mid p, \mu, \Sigma)$ with a probability density of the form
$$P(x \mid p, \mu, \Sigma) = \sum_{k=1}^{K} p_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k),$$
• where $p, \mu, \Sigma$ are unknown parameters with
$p = (p_1, \dots, p_K)$, $p_k \in \mathbb{R}$: probability vector,
$\mu = (\mu_1, \dots, \mu_K)$, $\mu_k \in \mathbb{R}^M$: vector of means,
$\Sigma = (\Sigma_1, \dots, \Sigma_K)$, $\Sigma_k \in \mathbb{R}^{M \times M}$: covariance matrices,
where $M$ is the dimension of a data point $x$, i.e., $x \in \mathbb{R}^M$.
Recall the Gaussian distribution in dimension $M$ with mean vector $\mu \in \mathbb{R}^M$ and covariance $\Sigma \in \mathbb{R}^{M \times M}$:
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^M \det(\Sigma)}} \exp\Big(-\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu)\Big)$$
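As a small sketch (illustrative only; the parameter values are made up), this mixture density can be evaluated with SciPy as follows:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, p, mu, Sigma):
    """Evaluate P(x | p, mu, Sigma) = sum_k p_k * N(x | mu_k, Sigma_k)."""
    return sum(p_k * multivariate_normal.pdf(x, mean=mu_k, cov=S_k)
               for p_k, mu_k, S_k in zip(p, mu, Sigma))

# Hypothetical K = 2 mixture in M = 2 dimensions
p     = [0.3, 0.7]
mu    = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigma = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

print(mixture_density(np.array([0.5, 0.2]), p, mu, Sigma))
```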
Recall: Covariance Matrix
Let $x = (x_1, \dots, x_M)^\top$ be a random vector with finite variance and mean. The covariance matrix $\Sigma = (\Sigma_{x_i, x_j}) \in \mathbb{R}^{M \times M}$ is defined as
$$\Sigma_{x_i, x_j} = \mathbb{E}\big((x_i - \mathbb{E}(x_i))(x_j - \mathbb{E}(x_j))\big),$$
where $\mathbb{E}$ denotes the expected value, $\mu_X = \mathbb{E}(X)$.
The covariance matrix
• represents important statistical information, in particular correlations between the data,
• is a real, square, symmetric matrix,
• is positive semidefinite.
More Details on Mixture of Gaussians
$$P(x \mid p, \mu, \Sigma) = \sum_{k=1}^{K} p_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Clustering task: Given the data points, estimate $p, \mu, \Sigma$.
Output: For each data point $n$ and for each cluster $k$, an estimated probability that $x_n$ was generated by component $k$.
Assign each data point to the mean with the highest probability.
Advantage: clusters that are (partially) contained in each other are also possible:
Figure: K = 3 Gaussian mixture in 1d and generated data.
Responsibility Calculation
Define the responsibility of cluster k for x by observed relative frequencies, i.e., the probability that x is generated by Gaussian k:
$$\gamma(x; k) = \frac{p_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} p_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$
Roughly speaking: “How much (in %) of data point x is attributed to cluster k?”
Let the density function of the (standard) normal distribution be denoted by $\phi$.
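A small sketch (illustrative; the parameter values are assumptions) that computes these responsibilities for all data points at once:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, p, mu, Sigma):
    """E-step: gamma[n, k] = p_k N(x_n | mu_k, Sigma_k) / sum_j p_j N(x_n | mu_j, Sigma_j)."""
    N, K = X.shape[0], len(p)
    gamma = np.zeros((N, K))
    for k in range(K):
        gamma[:, k] = p[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize each row to sum to 1
    return gamma

# Hypothetical usage with a K = 2 mixture in 2 dimensions
X = np.array([[0.1, -0.2], [4.2, 3.9], [2.0, 2.0]])
p, mu, Sigma = [0.5, 0.5], [np.zeros(2), np.full(2, 4.0)], [np.eye(2), np.eye(2)]
print(responsibilities(X, p, mu, Sigma))
```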
Outline of Expectation Maximization Clustering
The update step takes the data points into account via relative frequencies.
Compare this to K-means... Indeed, K-means can be seen as a special case of EM.
A visual comparison
Drawback: K-means and EM only find local optima. (Recall that already the K-means problem is NP-hard...)
Play around with scikit-learn (a machine learning library in Python).
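For example, a minimal scikit-learn comparison of hard K-means labels and soft EM responsibilities on synthetic data (the data set and parameters are illustrative) could look like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated, overlapping blobs: a case where soft EM clustering can help
X = np.vstack([rng.normal([0, 0], [3.0, 0.5], size=(200, 2)),
               rng.normal([0, 2], [3.0, 0.5], size=(200, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)

print(km.labels_[:5])          # hard assignments from K-means
print(gm.predict(X)[:5])       # hard assignments derived from the mixture
print(gm.predict_proba(X)[:5]) # soft responsibilities gamma(x; k)
```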
Hierarchical Clustering Methods
• Sometimes there is a hierarchical structure in the data that we want to disclose in a clustering.
• In hierarchical methods, we do not need to specify the number of clusters K as input.
• Instead: a measure of dissimilarity, e.g., a ’distance’ between (disjoint) groups of observations, typically based on pairwise dissimilarities.
• Clusters at each level of the hierarchy are created by merging clusters at the next lower level.
• At the lowest level, each cluster contains a single point; the highest level contains all points in one cluster.
• Both agglomerative (bottom-up) and divisive (top-down) methods exist.
Hierarchical Clustering Methods
• Idea of the agglomerative approach: start at the bottom. At each level, recursively merge a selected pair of clusters into a single cluster, so that the next higher level contains one cluster less. Merge the two clusters with the smallest dissimilarity. ⇒ This leads to N − 1 levels in the hierarchy.
• Clusterings are often drawn as a rooted binary tree: it contains one node as root, and each node has at most two children.
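A short illustrative sketch (my own example, not from the slides) of agglomerative clustering with SciPy, including the dendrogram discussed next:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

# Hypothetical toy data: three small groups in the plane
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, size=(10, 2)) for c in ([0, 0], [3, 0], [0, 3])])

# Agglomerative clustering: repeatedly merge the pair of clusters with smallest dissimilarity
Z = linkage(X, method='average')                  # N - 1 merge steps based on pairwise distances

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
print(labels)

dendrogram(Z)                                     # visualize the hierarchy as a dendrogram
plt.show()
```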
Visualization as a Dendrogram
(Figure: dendrogram example from The Elements of Statistical Learning.)
Dendrogram
• A word of caution: small changes in the data can lead to quite different dendrograms.
• For a clustering application, we need to decide whether a hierarchical structure is actually intrinsic to the data or not.
Clustering in More General Contexts
How can we compute (pairwise) distances in a clustering algorithm, for example in ordinal or in categorical contexts? E.g., suppose we have measurements or averages over personal judgements by participants who are asked to judge differences between objects.
Often, one uses dissimilarities based on attributes for the distance calculation in a clustering algorithm.
Dissimilarities are calculated as follows:
• Suppose we have, for objects $i = 1, \dots, N$, measurements $x_{ij}$ for variables (attributes) $j = 1, \dots, p$.
• We define the dissimilarity between objects $i$ and $i'$ as
$$D(x_i, x_{i'}) = \sum_{j=1}^{p} d_j(x_{ij}, x_{i'j}).$$
• In this definition, different attributes could also be weighted differently.
Alternatives for Calculating Distances
Depending on the variable types, distances need to be calculated differently.
• Ordinal variables: e.g., contiguous integers, ordered sets such as academic grades (A, B, C, D, F), degree of preference (can’t stand, dislike, OK, like, terrific). Dissimilarities for ordinal variables are typically defined by replacing their M original values with
$$\frac{i - \tfrac{1}{2}}{M}, \quad i = 1, \dots, M$$
in the prescribed order of their original values.
• (Unordered) categorical variables: the dissimilarity between pairs of values must be delineated explicitly. If a variable assumes M distinct values, these can be arranged in a symmetric $M \times M$ matrix with $L_{rr'} = L_{r'r}$, $L_{rr} = 0$, $L_{rr'} \ge 0$. Often: $L_{rr'} = 1$ for all $r \ne r'$. Unequal losses can be used to emphasize some distances more than others.
How can we cluster data in such contexts?
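As a small illustrative sketch (the attribute names and values are made up), ordinal and categorical attributes can be turned into such dissimilarities like this:

```python
import numpy as np

# Hypothetical objects with an ordinal attribute (grade) and a categorical one (vehicle type)
grades  = ['A', 'B', 'B', 'D']          # ordered: A < B < C < D < F
vehicle = ['car', 'truck', 'car', 'bike']

# Ordinal: replace the M ordered values by (i - 1/2) / M and use absolute differences
order = ['A', 'B', 'C', 'D', 'F']
M = len(order)
grade_score = {g: (i + 0.5) / M for i, g in enumerate(order)}   # (i - 1/2)/M for i = 1,...,M
g = np.array([grade_score[x] for x in grades])
d_grade = np.abs(g[:, None] - g[None, :])

# Categorical: 0/1 loss matrix L_{rr'} (here simply 1 whenever the values differ)
v = np.array(vehicle)
d_vehicle = (v[:, None] != v[None, :]).astype(float)

# Total dissimilarity D(x_i, x_i') as the (possibly weighted) sum over attributes
D = d_grade + d_vehicle
print(np.round(D, 2))
```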
K-medoids Algorithm
As discussed before, the K-means clustering algorithm is designed for numerical and quantitative values.
Look at K-means again: iteratively,
1. for a given cluster assignment, choose each cluster center so that it minimizes the total distance (or dissimilarity) to the points in its cluster;
2. given the current centers, assign each observation to the closest center.
The optimization step 1 can easily be generalized to dissimilarities stored in matrices, as studied on the last slides.
K-Medoids Algorithm
1. For a given cluster assignment C, find the observation in each cluster that minimizes the total distance to the other points in that cluster:
$$i^\star_k = \operatorname{argmin}_{\{i : C(i) = k\}} \sum_{C(i') = k} D(x_i, x_{i'}).$$
Then
$$m_k = x_{i^\star_k}, \quad k = 1, 2, \dots, K$$
are the current estimates of the cluster centers.
2. Given a current set of cluster centers $\{m_1, \dots, m_K\}$, minimize the total dissimilarity by assigning each observation to the closest (current) cluster center:
$$C(i) = \operatorname{argmin}_{1 \le k \le K} D(x_i, m_k).$$
3. Iterate steps 1 and 2 until the assignments do not change.
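A sketch of this procedure on a precomputed dissimilarity matrix (an illustrative implementation, not the lecture's code):

```python
import numpy as np

def k_medoids(D, K, max_iter=100, seed=0):
    """K-medoids on a precomputed N x N dissimilarity matrix D (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    medoids = rng.choice(N, size=K, replace=False)          # indices of the current centers
    assign = D[:, medoids].argmin(axis=1)                   # step 2: closest medoid
    for _ in range(max_iter):
        # step 1: in each cluster, pick the point with minimal total dissimilarity
        for k in range(K):
            members = np.where(assign == k)[0]
            if members.size > 0:
                medoids[k] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        # step 2: reassign each observation to its closest medoid
        new_assign = D[:, medoids].argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                                            # step 3: assignments unchanged
        assign = new_assign
    return medoids, assign

# Usage, e.g., with the attribute-based dissimilarity matrix D from the previous sketch:
# medoids, assign = k_medoids(D, K=2)
```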
How to choose the number of clusters K?
For a clustering problem, the best number of clusters depends on the goal and on the knowledge about the application.
• Sometimes, the best value of K is given as input (e.g., K salespeople are employed and the task is to cluster a database into K segments).
• However, if ’natural’ clusters need to be determined, the best value of K is unknown and needs to be estimated from the data as well.
(Heuristic) estimate of good K values
• Examine the within-cluster dissimilarity $D_{k'}$ as a function of the number of clusters $k'$.
• Separate solutions are obtained for $k' \in \{1, 2, \dots, K_{\max}\}$.
• The values $\{D_1, D_2, \dots, D_{K_{\max}}\}$ decrease with increasing $k'$.
• Intuition: if there are K natural groupings, then for $k' < K$ the clusters will each contain a subset of the true underlying groups, i.e., the dissimilarity $D_{k'}$ will decrease noticeably for increasing $k'$.
• For $k' > K$ instead, at least one natural group is split across separate clusters, i.e., $D_{k'}$ will decrease only mildly.
• Estimate the best number of clusters K by identifying a ’kink’ in the plot of $D_{k'}$ as a function of $k'$.
Next, we will turn to a fundamental method in learning that can be used for clustering as well, and also more generally for reducing data dimensions, called Principal Component Analysis.
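A small sketch of this heuristic (illustrative; here $D_{k'}$ is taken as the within-cluster sum of squares reported by scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical data with K = 3 natural groups
X = np.vstack([rng.normal(c, 0.4, size=(100, 2)) for c in ([0, 0], [4, 0], [2, 4])])

K_max = 8
D = []
for k_prime in range(1, K_max + 1):
    km = KMeans(n_clusters=k_prime, n_init=10, random_state=0).fit(X)
    D.append(km.inertia_)        # within-cluster sum of squared distances D_{k'}

plt.plot(range(1, K_max + 1), D, marker='o')
plt.xlabel("number of clusters k'")
plt.ylabel("within-cluster dissimilarity D_{k'}")
plt.show()                       # look for the 'kink' (elbow); here around k' = 3
```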