unsupervised - clustering - k-Means CAH.pptx

About This Presentation

INTRODUCTION TO K-MEANS


Slide Content

Lecturer: Dr. Ahmed Khanfir. Term: Fall 2024. Chapter 4, Part 2: Unsupervised learning – Clustering: k-means and Hierarchical Clustering Analysis (HCA). Some slides are borrowed from those of Dr. Oumayma Bounouh.

Recap
- Unsupervised => unlabeled data.
- PCA is an unsupervised technique.
- Clustering is among the most commonly used unsupervised techniques.

Association (Association Rule Mining): a rule-based machine learning technique used to discover the probability of the co-occurrence of items in a collection.

Clustering: the method of dividing objects into clusters such that objects within the same cluster are similar to each other and dissimilar to objects belonging to other clusters.

Examples
E1: Clustering books of different unknown themes.
E2: Clustering medical scans.
E3: Clustering pictures.
In fact, clustering is one of the most utilized data mining techniques. It has a long history and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, and libraries. In recent years, due to the rapid increase of online documents, text clustering has become more and more important.

Main Aspects
- Clustering algorithm: groups data points into clusters based on similarity. The choice of a clustering algorithm depends on the data structure, size, dimensionality, and desired outcome.
- A distance (similarity, or dissimilarity) function.
- Clustering quality: based on how well the inter-cluster distance is maximized and the intra-cluster distance is minimized.
The quality of a clustering result depends on the algorithm, the distance function, and the application.
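To make the inter-/intra-cluster criterion concrete, here is a minimal NumPy sketch (not from the slides; the dataset and the two-cluster assignment are illustrative) that measures both quantities directly.

```python
# Minimal sketch (not from the slides): quantifying intra- vs. inter-cluster
# distance with plain NumPy for a toy two-cluster assignment.
import numpy as np

points = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)
labels = np.array([0, 1, 0, 1, 0])  # illustrative assignment into two clusters

def intra_cluster_distance(points, labels):
    """Mean distance of each point to its own cluster centroid (smaller is better)."""
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        centroid = members.mean(axis=0)
        total += np.linalg.norm(members - centroid, axis=1).sum()
    return total / len(points)

def inter_cluster_distance(points, labels):
    """Distance between the two cluster centroids (larger is better)."""
    centroids = [points[labels == k].mean(axis=0) for k in np.unique(labels)]
    return np.linalg.norm(centroids[0] - centroids[1])

print("intra:", intra_cluster_distance(points, labels))
print("inter:", inter_cluster_distance(points, labels))
```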

Recap: Minkowski distance. For p = 1 it reduces to the Manhattan distance; for p = 2 it reduces to the Euclidean distance.
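The formula itself was shown on the slide as an image; the standard definition it refers to is

d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p},

with p = 1 giving the Manhattan distance \sum_{i=1}^{n} |x_i - y_i| and p = 2 giving the Euclidean distance \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.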

Recap: d(·, ·) denotes the Euclidean distance and ⟨x, y⟩ denotes the dot product.
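Only these two definitions survive in the extracted text; the identity the slide most likely recaps, relating the squared Euclidean distance to the dot product, is the standard one:

d(x, y)^2 = \lVert x - y \rVert^2 = \lVert x \rVert^2 - 2\,\langle x, y \rangle + \lVert y \rVert^2.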

Example Algorithms
Partition-based clustering:
1. K-Means: divides data into k clusters by minimizing variance within clusters.
2. K-Means++: same as K-Means except that it favors the selection of distant seeds. It is now the default initialization method in many libraries (e.g., Scikit-learn).
3. K-Medoids (PAM): similar to K-Means but uses medoids (actual data points) instead of centroids.
Hierarchical clustering:
1. Agglomerative (bottom-up): starts with individual points and merges clusters.
2. Divisive (top-down): starts with one cluster and splits it iteratively.
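Since the slide notes that k-means++ is the default initialization in Scikit-learn, here is a short usage sketch (the dataset is the one reused in the walkthrough that follows; it is purely illustrative):

```python
# Short sketch: k-means with k-means++ seeding in scikit-learn
# (init="k-means++" is the library default, as noted on the slide).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)

print("labels:   ", labels)
print("centroids:", km.cluster_centers_)
print("inertia:  ", km.inertia_)  # within-cluster sum of squared distances
```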

K-Means Algorithm
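The algorithm itself appears on this slide as an image; the following is a from-scratch sketch of the usual k-means loop (initialize, assign, update, repeat until the centroids stop moving), not a transcription of the slide's exact pseudocode.

```python
# From-scratch sketch of the standard k-means loop (the slide's pseudocode
# was an image, so this follows the usual formulation, not the exact slide).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points
        #    (empty-cluster handling is omitted in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)
print(kmeans(X, k=2))
```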

K-Means Example: Walkthrough
Input: dataset {(1,1), (1,4), (4,1), (4,4), (8,1)}, k = 2 (two clusters).
Step 1: Initialization. Randomly select two points as the initial centroids:
Centroid 1: (1,1)
Centroid 2: (4,4)

K-Means Example
Input: dataset {(1,1), (1,4), (4,1), (4,4), (8,1)}, k = 2 (two clusters).
Step 1: Initialization. Centroid 1: (1,1), Centroid 2: (4,4).
Step 2: Assign points to clusters. Compute the distance of each point to each centroid:
Point (1,1): closest to Centroid 1.
Point (1,4): closest to Centroid 2.
Point (4,1): closest to Centroid 1.
Point (4,4): closest to Centroid 2.
Point (8,1): closest to Centroid 1.
Cluster assignments:
Cluster 1: (1,1), (4,1), (8,1)
Cluster 2: (1,4), (4,4)
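A small check of these distances (illustrative code, not part of the deck). Note that (1,4) and (4,1) are in fact equidistant from both initial centroids, so the assignment above reflects one particular way of breaking those ties.

```python
# Reproducing the Step-2 distance computations for the example dataset.
import numpy as np

points = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)
centroids = np.array([[1, 1], [4, 4]], dtype=float)  # initial centroids from Step 1

dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
for p, d in zip(points, dists):
    print(f"point ({p[0]:.0f},{p[1]:.0f}): d(c1)={d[0]:.2f}, d(c2)={d[1]:.2f}")
# point (1,1): d(c1)=0.00, d(c2)=4.24
# point (1,4): d(c1)=3.00, d(c2)=3.00   (tie)
# point (4,1): d(c1)=3.00, d(c2)=3.00   (tie)
# point (4,4): d(c1)=4.24, d(c2)=0.00
# point (8,1): d(c1)=7.00, d(c2)=5.00
```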

Key Notes
Distance metric: Euclidean distance is the most common, but it can be replaced (e.g., Manhattan, cosine).
Initialization sensitivity: results depend on the initial centroid selection.
Convergence criteria: the algorithm stops when cluster assignments stabilize or changes in centroid positions are minimal.
Iterative: typical runtime per iteration is linear in the number of data points and clusters, but the algorithm may get stuck in a local minimum.

K-Means exercise (the exercise and its worked solution appear as figures on the original slides and are not included in the extracted text).

Partition-Based Clustering
Example algorithms: K-Means, K-Medoids (PAM).
When to use: when clusters are spherical or convex; with moderate-sized datasets; when k, the number of clusters, is known beforehand.
Limitations: struggles with non-spherical clusters; sensitive to outliers.

Hierarchical Clustering

Hierarchical Clustering
Example algorithms: Agglomerative (bottom-up), Divisive (top-down). Hierarchical Agglomerative Clustering (HAC) is also called Hierarchical Bottom-Up Clustering or Agglomerative Hierarchical Clustering (AHC).
When to use: when you need a hierarchy of clusters (dendrogram); when the number of clusters is not predefined; with small to medium-sized datasets.
Limitations: computationally expensive for large datasets; sensitive to noise.

Hard vs. Soft Clustering
Hard clustering: each observation belongs to exactly one cluster; this is more common and easier to do.
Soft clustering: an observation can belong to more than one cluster. It makes more sense for applications like creating browsable hierarchies; e.g., you may want to put a pair of sneakers in two clusters: (i) sports and (ii) shoes.
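The slides do not name a specific soft-clustering algorithm; as one common illustration (my choice, not the deck's), a Gaussian mixture model assigns each point a membership probability for every cluster instead of a single hard label.

```python
# Illustration only: a Gaussian mixture model as an example of soft clustering.
# predict() returns hard labels; predict_proba() returns soft memberships.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))        # hard labels (argmax of the memberships)
print(gmm.predict_proba(X))  # soft memberships, one row per point
```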

HAC Algorithm
0. Start with each data point as its own cluster (singleton clusters).
1. Iteratively compute the similarity between clusters and merge the closest (most similar) clusters until all data points are in a single cluster or a desired number of clusters is reached.
The process creates a hierarchy (dendrogram) of clusters that can be cut at different levels to obtain varying numbers of clusters.
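A minimal sketch of this bottom-up procedure using SciPy (the library choice and dataset are illustrative, not from the slides): build the full merge hierarchy, then cut it to obtain two clusters.

```python
# Minimal HAC sketch: agglomerative merges, then a cut of the dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)

Z = linkage(X, method="single", metric="euclidean")  # bottom-up merge history
print(Z)  # each row: (cluster i, cluster j, merge distance, new cluster size)

print(fcluster(Z, t=2, criterion="maxclust"))  # cut the hierarchy into 2 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw the hierarchy
# if matplotlib is available.
```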

Initial step (shown as a figure on the slide: every point is its own singleton cluster).

Intermediate step (shown as a figure on the slide: the closest clusters have been merged).

Similarity Metrics: as with k-Means, the most commonly used are the Euclidean distance and the Manhattan distance.

Distance Between Clusters: Linkage Criteria for Merging
Single linkage (single-link distance): minimum distance between points in two clusters.
Complete linkage (complete-link distance): maximum distance between points in two clusters.
Average linkage (group-average distance): average distance between points in two clusters.
Ward's method: minimizes the increase in total within-cluster variance.
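A short, illustrative comparison of these linkage criteria using scikit-learn's AgglomerativeClustering (library and dataset are my assumptions, not the slide's):

```python
# Comparing the linkage criteria listed above on the same toy dataset.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)

for linkage in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    print(f"{linkage:>8}: {labels}")
```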
