Clustering
Clustering: the process of partitioning a set of data objects into subsets (clusters) so that objects in a cluster are similar to one another, yet dissimilar to objects in other clusters.
Considered unsupervised learning: no predefined classes (learning by observation vs. learning by examples). A form of descriptive data mining.
Clustering

OBJECT   X   Y
A        1   1
B        2   1
C        4   3
D        5   4
Types of Clustering:
- Partitioning approach: construct various partitions and then evaluate them by some criterion (e.g., minimize the sum of squared errors).
- Hierarchical approach: create a hierarchical decomposition of the set of data using some criterion.
Partitioning approach
Partitioning methods: partition a dataset D of n objects into a set of k clusters.
A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. The centroid can be defined in various ways, such as by the mean or medoid of the objects (or points).
What is K-Means Clustering?
An algorithm to group your objects, based on their attributes/features, into K groups, where K is a positive integer.
The cluster representative can be:
- mean / centroid (the average of the data points)
- median / medoid (the actual data point closest to the mean)
Distance Functions
- Euclidean distance: d(p, q) = sqrt(sum_i (p_i - q_i)^2)
- Manhattan distance: d(p, q) = sum_i |p_i - q_i|
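The two distance measures can be sketched in Python (a minimal illustration; the function names are my own):

```python
import math

def euclidean(p, q):
    # Euclidean distance: square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Manhattan (city-block) distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))
```

For example, euclidean((1, 1), (4, 3)) gives about 3.61 and manhattan((3, 2), (1, 1)) gives 3, matching the worked computations in these notes.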
Ex: Find the centroid and medoid of a cluster containing the three two-dimensional points (1,1), (2,3), and (6,2).
Centroid (mean) = ((1+2+6)/3, (1+3+2)/3) = (3, 2)
To find the medoid, find the point closest to the mean (using Manhattan distance):
- For (1,1): |3-1| + |2-1| = 3
- For (2,3): |3-2| + |2-3| = 2  (closest point)
- For (6,2): |3-6| + |2-2| = 3
Medoid = (2,3)
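The centroid and medoid computation above can be checked with a short Python sketch (variable names are illustrative):

```python
points = [(1, 1), (2, 3), (6, 2)]

# Centroid: coordinate-wise mean of the points -> (3.0, 2.0)
centroid = tuple(sum(coord) / len(points) for coord in zip(*points))

# Medoid: the actual data point with the smallest Manhattan distance
# to the centroid -> (2, 3), whose distance to the mean is |3-2| + |2-3| = 2
medoid = min(points, key=lambda p: sum(abs(a - b) for a, b in zip(p, centroid)))
```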
Partitioning approach
The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroids. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared errors between all objects in Ci and the centroid ci, defined as
E = sum over p in Ci of dist(p, ci)^2
Main steps for K-means:
1. Choose K initial centroids.
2. Assign each object to the cluster with the nearest centroid.
3. Recompute each centroid as the mean of the objects assigned to its cluster.
4. Repeat steps 2-3 until the centroids no longer change.
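These steps can be sketched as a plain Python implementation (a minimal version that assumes no cluster ever becomes empty; the names are illustrative):

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Assign points to the nearest centroid, recompute centroids
    as cluster means, and repeat until the centroids stop changing."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean)
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: new centroid = coordinate-wise mean of each cluster
        new_centroids = [tuple(sum(c) / len(cl) for c in zip(*cl))
                         for cl in clusters]
        if new_centroids == centroids:  # centroids unchanged -> converged
            break
        centroids = new_centroids
    return clusters, centroids
```

Running it on the four objects of the next example, kmeans([(1, 1), (2, 1), (4, 3), (5, 4)], [(1, 1), (2, 1)]), converges to the clusters {A, B} and {C, D} with centroids (1.5, 1.0) and (4.5, 3.5), matching the iterations worked out below.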
Example: Suppose we have 4 objects as training data points, and each object has 2 attributes. Each attribute represents a coordinate of the object.

OBJECT   X   Y
A        1   1
B        2   1
C        4   3
D        5   4
The first step is to determine the number of clusters K. Let K = 2, with initial centroids c1 = (1,1) and c2 = (2,1).
Calculate the distance between each object and the centroids c1 = (1,1) and c2 = (2,1):

OBJECT    d to c1=(1,1)   d to c2=(2,1)   Nearest
A (1,1)   0               1               c1
B (2,1)   1               0               c2
C (4,3)   3.61            2.83            c2
D (5,4)   5               4.24            c2
New centroids: c1 = (1,1) (cluster {A}); c2 = ((2+4+5)/3, (1+3+4)/3) = (3.67, 2.67) (cluster {B, C, D}).
Calculate the distance between each object and the new centroids c1 = (1,1) and c2 = (3.67, 2.67):

OBJECT    d to c1=(1,1)   d to c2=(3.67,2.67)   Nearest
A (1,1)   0               3.14                  c1
B (2,1)   1               2.36                  c1
C (4,3)   3.61            0.47                  c2
D (5,4)   5               1.89                  c2
New centroids: c1 = ((1+2)/2, (1+1)/2) = (1.5, 1); c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5).
Calculate the distance between each object and the new centroids c1 = (1.5, 1) and c2 = (4.5, 3.5):

OBJECT    d to c1=(1.5,1)   d to c2=(4.5,3.5)   Nearest
A (1,1)   0.5               4.30                c1
B (2,1)   0.5               3.54                c1
C (4,3)   3.20              0.71                c2
D (5,4)   4.61              0.71                c2
New centroids: c1 = (1.5, 1); c2 = (4.5, 3.5). The centroids have not changed, so stop. Final clusters: {A, B} and {C, D}.
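That the final centroids are a fixed point can be verified with a small Python check (using the coordinates from the tables above):

```python
import math

points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
c1, c2 = (1.5, 1.0), (4.5, 3.5)  # final centroids from the last iteration

# Assign each object to the nearer of the two centroids
cluster1 = [n for n, p in points.items() if math.dist(p, c1) <= math.dist(p, c2)]
cluster2 = [n for n, p in points.items() if math.dist(p, c1) > math.dist(p, c2)]

def mean(pts):
    # Coordinate-wise mean of a list of points
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

# Recomputing the means reproduces the same centroids, so k-means stops here
assert mean([points[n] for n in cluster1]) == c1
assert mean([points[n] for n in cluster2]) == c2
```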
EX: The following is a set of one-dimensional points: {6, 12, 18, 24, 30, 42, 48}. For each of the following sets of initial centroids, create two clusters by assigning each point to the nearest centroid, and then calculate the total squared error. Show both the clusters and the total squared error for each set of centroids.
(a) {18, 45}
(b) {15, 40}
Sol (a): First round of k-means.
Cluster assignment: {6, 12, 18, 24, 30} (nearest to 18) and {42, 48} (nearest to 45).
Recompute means: (6+12+18+24+30)/5 = 18 and (42+48)/2 = 45.
The new centroids are the same as the previous centroids, so stop. The final clusters are {6, 12, 18, 24, 30} and {42, 48}.
SSE1 = (6-18)^2 + (12-18)^2 + (18-18)^2 + (24-18)^2 + (30-18)^2 = 144 + 36 + 0 + 36 + 144 = 360
SSE2 = (42-45)^2 + (48-45)^2 = 9 + 9 = 18
Total squared error = 360 + 18 = 378
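The same computation can be scripted for both sets of initial centroids (a one-round sketch; since each set happens to be a fixed point here, one round suffices):

```python
points = [6, 12, 18, 24, 30, 42, 48]

def one_round(centroids):
    # Assign each point to its nearest centroid, then recompute cluster means
    clusters = {c: [] for c in centroids}
    for p in points:
        nearest = min(centroids, key=lambda c: abs(p - c))
        clusters[nearest].append(p)
    groups = list(clusters.values())
    means = [sum(g) / len(g) for g in groups]
    # Total squared error: squared distance of every point to its cluster mean
    sse = sum((p - m) ** 2 for g, m in zip(groups, means) for p in g)
    return groups, means, sse
```

For the initial centroids {18, 45} this reproduces the clusters and total squared error of 378 derived above; part (b) with {15, 40} can be checked the same way.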
True or false:
1. K-means is a hierarchical clustering method.
2. In k-means clustering, the number of clusters produced is not known.
3. A partitional clustering is a division of data objects into overlapping clusters.
4. K-means results in optimal data clustering.
5. A centroid must be an actual data object.