Chapter 4, part 2: Unsupervised Learning – Clustering: k-means and Hierarchical Clustering Analysis (HCA)
Lecturer: Dr. Ahmed Khanfir
Term: Fall 2024
Some slides are borrowed from those of Dr. Oumayma Bounouh
Recap
- Unsupervised => unlabeled data
- PCA is an unsupervised technique
- Clustering is the most commonly used of the unsupervised techniques.
ASSOCIATION (Association Rule Mining)
Association is a rule-based machine learning method for discovering the probability of the co-occurrence of items in a collection.
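To make the two basic rule measures concrete, here is a minimal sketch (the toy transactions and item names are illustrative, not from the lecture) computing the support of an itemset and the confidence of a rule A -> B:

# Toy association-rule measures: support of an itemset and confidence of A -> B.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent) estimated from the transactions.
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 0.5
print(confidence({"bread"}, {"milk"}))  # 2/3 ≈ 0.67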
CLUSTERING
- Clustering is the method of dividing objects into groups (clusters) such that objects in the same cluster are similar to each other and dissimilar to objects belonging to other clusters.
Examples
E1: Clustering books of different, unknown themes
E2: Clustering medical scans
E3: Clustering pictures
- In fact, clustering is one of the most utilized data mining techniques. It has a long history and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc. In recent years, due to the rapid increase of online documents, text clustering has become more and more important.
Main Aspects
- Clustering algorithm: groups data points into clusters based on similarity. The choice of a clustering algorithm depends on the data structure, size, dimensionality, and desired outcome.
- A distance (similarity, or dissimilarity) function
- Clustering quality, based on how well the
  • inter-cluster distance is maximized
  • intra-cluster distance is minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.
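One standard score that combines the two quality criteria above is the silhouette coefficient (an assumption for illustration; it is not named on the slide). A minimal sketch, assuming scikit-learn is available:

import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)
labels = np.array([0, 1, 0, 1, 0])  # an example hard assignment of the points

# Values close to +1 mean tight, well-separated clusters;
# values near 0 or negative mean overlapping or misassigned points.
print(silhouette_score(X, labels, metric="euclidean"))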
Recap
Minkowski distance: d(x, y) = (Σ_i |x_i − y_i|^p)^(1/p)
For p=1: Manhattan distance
For p=2: Euclidean distance
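A minimal sketch of the Minkowski distance (assuming NumPy is available); setting p=1 gives the Manhattan distance and p=2 the Euclidean distance:

import numpy as np

def minkowski(x, y, p):
    """d(x, y) = (sum_i |x_i - y_i|**p) ** (1/p)"""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

a, b = (1, 1), (4, 4)
print(minkowski(a, b, p=1))  # Manhattan: 6.0
print(minkowski(a, b, p=2))  # Euclidean: ~4.243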
Example Algorithms
- Partition-Based Clustering:
  1. K-Means: divides data into k clusters by minimizing variance within clusters.
  2. K-Means++: same as K-Means except that it favors the selection of distant seeds. It is now the default initialization method in many libraries (e.g., scikit-learn).
  3. K-Medoids (PAM): similar to K-Means but uses medoids (actual data points) instead of centroids.
- Hierarchical Clustering:
  1. Agglomerative (Bottom-Up): starts with individual points and merges clusters.
  2. Divisive (Top-Down): starts with one cluster and splits it iteratively.
K-Means Algorithm
1. Choose k and select k initial centroids (e.g., k randomly chosen data points).
2. Assign each data point to its closest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2–3 until the assignments no longer change (convergence).
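A minimal from-scratch sketch of this loop, assuming NumPy (the function name and defaults are illustrative, not part of the lecture):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # step 4: convergence
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels, centroids)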
K-Means Example: Walkthrough
Input:
Dataset: {(1,1), (1,4), (4,1), (4,4), (8,1)}
k = 2 (two clusters)
Step 1: Initialization
Randomly select two points as initial centroids:
Centroid 1: (1,1)
Centroid 2: (4,4)
K-Means Example
Input: Dataset: {(1,1), (1,4), (4,1), (4,4), (8,1)} and k=2 (two clusters)
Step 1: Initialization: Centroid 1: (1,1) and Centroid 2: (4,4)
Step 2: Assign Points to Clusters
Compute the Euclidean distance of each point to the centroids:
Point (1,1): d to C1 = 0, d to C2 = √18 ≈ 4.24 → Cluster 1
Point (1,4): d to C1 = 3, d to C2 = 3 → tie, assigned to Cluster 2
Point (4,1): d to C1 = 3, d to C2 = 3 → tie, assigned to Cluster 1
Point (4,4): d to C1 = √18 ≈ 4.24, d to C2 = 0 → Cluster 2
Point (8,1): d to C1 = 7, d to C2 = 5 → Cluster 2
Cluster assignments:
Cluster 1: (1,1), (4,1)
Cluster 2: (1,4), (4,4), (8,1)
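A short sketch (assuming NumPy) that reproduces the Step 2 distance computations above:

import numpy as np

points = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]])
centroids = np.array([[1, 1], [4, 4]])  # Centroid 1 and Centroid 2

# Euclidean distance from every point to every centroid: shape (5, 2).
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
# Note: argmin breaks ties in favour of Centroid 1, so the two equidistant
# points (1,4) and (4,1) would both land in Cluster 1 here.
assignments = distances.argmin(axis=1)  # 0 -> Cluster 1, 1 -> Cluster 2

for p, d, a in zip(points, distances, assignments):
    print(p, d.round(2), f"-> Cluster {a + 1}")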
Key Notes:
- Distance metric: Euclidean distance is most common but can be replaced (e.g., Manhattan, cosine).
- Initialization sensitivity: results depend on the initial centroid selection.
- Convergence criteria: stops when cluster assignments stabilize or changes in centroid positions are minimal.
- Iterative: each iteration is roughly linear in the number of data points and clusters, but the algorithm may get stuck in a local minimum.
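A short usage sketch with scikit-learn's KMeans illustrating these notes: k-means++ initialization, several restarts (n_init) to reduce the risk of a poor local minimum, and a tolerance-based stopping criterion (parameter values are illustrative):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)

km = KMeans(n_clusters=2, init="k-means++", n_init=10, tol=1e-4, random_state=0)
labels = km.fit_predict(X)
print(labels)               # cluster index of each point
print(km.cluster_centers_)  # final centroids
print(km.inertia_)          # within-cluster sum of squared distances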
K-Means Exercise
Partition-Based Clustering
Example algorithms: K-Means, K-Medoids (PAM)
When to use:
- When clusters are spherical or convex.
- With moderate-sized datasets.
- When k, the number of clusters, is known beforehand.
Limitations:
- Struggles with non-spherical clusters.
- Sensitive to outliers.
Hierarchical Clustering
Hierarchical Clustering
Example algorithms: Agglomerative (Bottom-Up), Divisive (Top-Down)
Hierarchical Agglomerative Clustering (HAC) is also called Hierarchical Bottom-Up Clustering or Agglomerative Hierarchical Clustering (AHC).
When to use:
- When you need a hierarchy of clusters (dendrogram), as sketched below.
- When the number of clusters is not predefined.
- Small to medium-sized datasets.
Limitations:
- Computationally expensive for large datasets.
- Sensitive to noise.
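A minimal sketch (assuming SciPy and Matplotlib are available) that builds and draws the dendrogram mentioned above for a small toy dataset:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)

Z = linkage(X, method="average")  # agglomerative merge history
dendrogram(Z)                     # plot the hierarchy of merges
plt.xlabel("data point index")
plt.ylabel("merge distance")
plt.show()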
Hard vs. Soft Clustering
Hard clustering:
- Each observation belongs to exactly one cluster.
- More common and easier to do.
Soft clustering:
- An observation can belong to more than one cluster.
- It makes more sense for applications like creating browsable hierarchies.
- e.g., you may want to put a pair of sneakers in two clusters: (i) sports and (ii) shoes.
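The slide describes soft clustering only conceptually; as one possible illustration (a Gaussian mixture model, which is an assumption and not part of the lecture), scikit-learn's predict_proba gives each point a membership degree in every cluster instead of a single hard label:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))        # hard labels (argmax of the soft memberships)
print(gmm.predict_proba(X))  # soft memberships: one probability per cluster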
HAC Algorithm
0. Start with each data point as its own cluster (singleton clusters).
1. Iteratively compute the similarity between clusters and merge the closest (most similar) pair, until all data points are in a single cluster or a desired number of clusters is reached.
The process creates a hierarchy (dendrogram) of clusters that can be cut at different levels to obtain varying numbers of clusters.
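A minimal sketch of this procedure using SciPy (an assumed library choice): linkage() builds the merge hierarchy and fcluster() "cuts" the dendrogram to obtain a chosen number of clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)

Z = linkage(X, method="average", metric="euclidean")  # merge history
labels = fcluster(Z, t=2, criterion="maxclust")       # cut into 2 clusters
print(labels)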
Initial step: each data point forms its own singleton cluster.
Intermediate step: at each iteration, the two closest clusters are merged into one.
Similarity Metrics
As with k-means, the most commonly used metrics are the Euclidean distance and the Manhattan distance.
Distance between clusters? Linkage criteria for merging
- Single linkage (single-link distance): minimum distance between points in two clusters.
- Complete linkage (complete-link distance): maximum distance between points in two clusters.
- Average linkage (group average distance): average distance between points in two clusters.
- Ward's method: minimizes the increase in total within-cluster variance.
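A short sketch (assuming scikit-learn) that runs the four linkage criteria listed above on the same toy dataset; note that 'ward' requires Euclidean distances:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1, 4], [4, 1], [4, 4], [8, 1]], dtype=float)

for linkage in ("single", "complete", "average", "ward"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    print(linkage, model.fit_predict(X))  # cluster label of each point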