K-means machine learning clustering .pptx

asmaashalma456 21 views 21 slides Sep 21, 2024

About This Presentation

Machine learning


Slide Content

Clustering: the process of partitioning a set of data objects into subsets (clusters) so that objects in a cluster are similar to one another, yet dissimilar to objects in other clusters. It is considered unsupervised learning: there are no predefined classes (learning by observation rather than learning by examples). Clustering is a form of descriptive data mining.

Clustering example data:

OBJECT   X   Y
A        1   1
B        2   1
C        4   3
D        5   4

Types of Clustering: Partitioning approach: construct various partitions and then evaluate them by some criterion (e.g. minimize the sum of squared errors). Hierarchical approach: create a hierarchical decomposition of the set of data using some criterion.

Partitioning approach. Partitioning methods: partition a dataset D of n objects into a set of k clusters. A centroid-based partitioning technique uses the centroid of a cluster, ci, to represent that cluster. The centroid can be defined in various ways, such as by the mean or the medoid of the objects (or points) in the cluster.

What is K-Means Clustering? It is an algorithm that groups objects, based on their attributes/features, into K groups, where K is a positive integer. The cluster representative can be: the mean/centroid (the average of the cluster's data points), or the median/medoid (the data point closest to the mean).

Distance Functions. Euclidean distance: d(p, q) = sqrt(Σi (pi − qi)²). Manhattan distance: d(p, q) = Σi |pi − qi|.
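The two distance functions can be sketched in a few lines of Python (a minimal illustration; the function names are ours, not from the slides), using points from the upcoming example:

```python
import math

def euclidean(p, q):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

# Distance between object A = (1,1) and centroid c2 = (2,1)
print(euclidean((1, 1), (2, 1)))  # 1.0
print(manhattan((1, 1), (2, 1)))  # 1
```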

Ex: Find the centroid and medoid of a cluster containing the three two-dimensional points (1,1), (2,3) and (6,2).
Centroid (mean) = ((1+2+6)/3, (1+3+2)/3) = (3, 2).
To find the medoid, find the data point closest to the mean (Manhattan distance):
For (1,1): |3−1| + |2−1| = 3
For (2,3): |3−2| + |2−3| = 2   ← closest point
For (6,2): |3−6| + |2−2| = 3
Medoid = (2,3).
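The centroid/medoid calculation above can be sketched as follows (a minimal illustration with hypothetical helper names; the medoid is chosen by Manhattan distance to the mean, as in the worked example):

```python
def centroid(points):
    # coordinate-wise mean of the points
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def medoid(points):
    # the data point closest (Manhattan distance) to the centroid
    m = centroid(points)
    return min(points, key=lambda p: sum(abs(a - b) for a, b in zip(p, m)))

pts = [(1, 1), (2, 3), (6, 2)]
print(centroid(pts))  # (3.0, 2.0)
print(medoid(pts))    # (2, 3)
```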

Partitioning approach. The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroid. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared error between all objects in Ci and the centroid ci, defined as E = Σ over all objects p in Ci of dist(p, ci)².

Main steps for K-means:
1. Determine the number of clusters K and choose K initial centroids.
2. Assign each object to the cluster with the nearest centroid.
3. Recompute each centroid as the mean of the objects assigned to it.
4. Repeat steps 2–3 until the centroids no longer change.
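The main steps can be sketched in plain Python (a minimal sketch; the `kmeans` function and its signature are ours, not from the slides), run here on the 4-object example worked through in the following slides:

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign points to the nearest centroid (Euclidean),
    recompute centroids as cluster means, stop when centroids stop changing."""
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

# Objects A..D with K = 2 and initial centroids c1 = (1,1), c2 = (2,1)
data = [(1, 1), (2, 1), (4, 3), (5, 4)]
clusters, centers = kmeans(data, [(1.0, 1.0), (2.0, 1.0)])
print(clusters)  # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
print(centers)   # [(1.5, 1.0), (4.5, 3.5)]
```

The run converges to the same final clusters {A, B} and {C, D} that the slides reach by hand.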

Example: Suppose we have 4 objects as training data points, and each object has 2 attributes. Each attribute represents a coordinate of the object.

OBJECT   X   Y
A        1   1
B        2   1
C        4   3
D        5   4

The first step is to determine the number of clusters: K = 2. Initial centroids: c1 = (1,1) and c2 = (2,1).

Calculate the distance between each object and the centroids c1 = (1,1) and c2 = (2,1) (Euclidean distance):

OBJECT    d to c1=(1,1)   d to c2=(2,1)   Assigned to
A (1,1)   0 (min)         1               c1
B (2,1)   1               0 (min)         c2
C (4,3)   3.61            2.83 (min)      c2
D (5,4)   5               4.24 (min)      c2

Cluster 1 = {A}, Cluster 2 = {B, C, D}.

New centroids: c1 = (1,1) (cluster 1 contains only A); c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3) ≈ (3.67, 2.67).

Calculate the distance between each object and the new centroids c1 = (1,1) and c2 ≈ (3.67, 2.67):

OBJECT    d to c1=(1,1)   d to c2≈(3.67,2.67)   Assigned to
A (1,1)   0 (min)         3.14                   c1
B (2,1)   1 (min)         2.36                   c1
C (4,3)   3.61            0.47 (min)             c2
D (5,4)   5               1.89 (min)             c2

Cluster 1 = {A, B}, Cluster 2 = {C, D}.

New centroids: c1 = ((1+2)/2, (1+1)/2) = (1.5, 1); c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5).

Calculate the distance between each object and the new centroids c1 = (1.5, 1) and c2 = (4.5, 3.5):

OBJECT    d to c1=(1.5,1)   d to c2=(4.5,3.5)   Assigned to
A (1,1)   0.5 (min)         4.30                 c1
B (2,1)   0.5 (min)         3.54                 c1
C (4,3)   3.20              0.71 (min)           c2
D (5,4)   4.61              0.71 (min)           c2

New centroids: c1 = ((1+2)/2, (1+1)/2) = (1.5, 1); c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5). The centroids have not changed, so stop.

EX: The following is a set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}. For each of the following sets of initial centroids, create two clusters by assigning each point to the nearest centroid, and then calculate the total squared error for each set of two clusters. Show both the clusters and the total squared error for each set of centroids. (a) {18, 45}. (b) {15, 40}.

Sol (for initial centroids {18, 45}): First round of k-means. Cluster assignment: {6, 12, 18, 24, 30} (nearest to 18) and {42, 48} (nearest to 45). Recompute means: (6+12+18+24+30)/5 = 18 and (42+48)/2 = 45. The new centroids are the same as the previous centroids, so the final clusters are {6, 12, 18, 24, 30} and {42, 48}.

Squared error for {6, 12, 18, 24, 30} with centroid 18: (6−18)² + (12−18)² + (18−18)² + (24−18)² + (30−18)² = 144 + 36 + 0 + 36 + 144 = 360. Squared error for {42, 48} with centroid 45: (42−45)² + (48−45)² = 9 + 9 = 18. Total squared error = 360 + 18 = 378.
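Both initial centroid sets from the exercise can be checked with a short sketch (assumed helper names; as in the worked solution, the error is measured against each cluster's mean):

```python
def assign(points, centroids):
    # nearest-centroid assignment for one-dimensional points
    clusters = [[] for _ in centroids]
    for p in points:
        i = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
        clusters[i].append(p)
    return clusters

def sse(clusters):
    # total squared error w.r.t. each cluster's mean
    total = 0
    for cl in clusters:
        m = sum(cl) / len(cl)
        total += sum((p - m) ** 2 for p in cl)
    return total

points = [6, 12, 18, 24, 30, 42, 48]
for init in ([18, 45], [15, 40]):
    clusters = assign(points, init)
    print(init, clusters, sse(clusters))
# [18, 45] gives clusters {6,12,18,24,30}, {42,48} with total error 378
# [15, 40] gives clusters {6,12,18,24}, {30,42,48} with total error 348
```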

True or false:
1. K-means is a hierarchical clustering method.
2. In k-means clustering, the number of clusters produced is not known.
3. A partitional clustering is a division of data objects into overlapping clusters.
4. K-means results in optimal data clustering.
5. A centroid must be an actual data object.