Hierarchical clustering

tilanigunawardena 5,333 views 84 slides Sep 04, 2016

About This Presentation

Hierarchical clustering


Slide Content

Tilani Gunawardena Algorithms: Clustering

What is Clustering? Grouping of records, observations, or cases into classes of similar objects. A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters.

Clustering

Clustering

Difference between Clustering and Classification: There is no target variable for clustering. Clustering does not try to classify or predict the values of a target variable. Instead, clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters, where the similarity of the records within a cluster is maximized and the similarity to records outside the cluster is minimized.

Goal of Clustering: Identification of groups of records such that similarity within a group is very high while similarity to records in other groups is very low. Group data points that are close (or similar) to each other, and identify such groupings (or clusters) in an unsupervised manner. Unsupervised: no information is provided to the algorithm on which data points belong to which clusters. In other words, a clustering algorithm seeks to construct clusters of records such that the between-cluster variation (BCV) is large compared to the within-cluster variation (WCV).

Goal of Clustering: the between-cluster variation (BCV) should be large compared to the within-cluster variation (WCV). Within-cluster variation (intra-cluster distance): the sum of distances between objects in the same cluster is minimized. Between-cluster variation (inter-cluster distance): the distances between different clusters are maximized.

Clustering: Clustering techniques apply when there is no class to be predicted. As we've seen, clusters can be: disjoint vs. overlapping, deterministic vs. probabilistic, flat vs. hierarchical. The k-means algorithm produces clusters that are disjoint, deterministic, and flat.

Inter/Intra Cluster Distances. Intra-cluster distance (sum/min/max/avg): the (absolute/squared) distance between all pairs of points in the cluster, or between the centroid and all points in the cluster, or between the "medoid" and all points in the cluster. Inter-cluster distance: sum the (squared) distances between all pairs of clusters, where the distance between two clusters is defined as the distance between their centroids/medoids (spherical clusters) or the distance between the closest pair of points belonging to the clusters (chain-shaped clusters).

Issues Related to Clustering: How to measure similarity? Euclidean distance, city-block distance, Minkowski distance. How to recode categorical variables? How to standardize or normalize numerical variables? Min-max normalization; Z-score standardization ((X − mean) / standard deviation). How many clusters do we expect to uncover?
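As an illustration of the two rescaling options above, here is a minimal sketch in plain Python (the example values and variable names are only illustrative):

# Min-max normalization and Z-score standardization of one numeric attribute.
values = [12.0, 15.0, 20.0, 22.0, 31.0]   # illustrative attribute values

lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]    # rescaled to [0, 1]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_scores = [(v - mean) / std for v in values]       # mean 0, standard deviation 1

print(min_max)
print(z_scores)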

Types of Clustering. Partitional clustering: partitional algorithms determine all clusters at once; they include k-means clustering, fuzzy c-means clustering, and QT clustering. Hierarchical clustering: Agglomerative ("bottom-up"): agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive ("top-down"): divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

K-means Clustering

k-Means Clustering. Input: n objects (or points) and a number k. Algorithm: 1) Randomly assign k records to be the initial cluster center locations. 2) Assign each object to the group that has the closest centroid. 3) When all objects have been assigned, recalculate the positions of the k centroids. 4) Repeat steps 2 and 3 until convergence or termination.
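A minimal sketch of these four steps in plain Python, using Euclidean distance (the function and variable names are illustrative, not part of the original slides):

import random
from math import dist   # Euclidean distance (Python 3.8+)

def k_means(points, k, max_iters=100):
    # Step 1: randomly pick k records as the initial cluster centers.
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each object to the group with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 3: recalculate the positions of the k centroids.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: repeat until the centroids no longer change (convergence).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters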

K -Mean Clustering

Termination Conditions: The algorithm terminates when the centroids no longer change, or when the SSE (sum of squared errors) value is less than some small threshold value: SSE = Σ_{i=1..k} Σ_{p ∈ C_i} d(p, m_i)², where p ∈ C_i represents each data point in cluster i and m_i represents the centroid of cluster i.
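The SSE can be computed directly from the clusters and centroids; a small sketch matching the formula above (assuming the output of a routine like the one sketched earlier):

from math import dist

def sse(clusters, centroids):
    # Sum of squared Euclidean distances from each point to its own cluster centroid.
    return sum(dist(p, m) ** 2
               for cluster, m in zip(clusters, centroids)
               for p in cluster)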

Example 1: Let's suppose the following points are the delivery locations for pizza.

Let's locate three cluster centers randomly.

Find the distance from each point to each cluster center as shown.

Assign the points to the nearest cluster center based on the distance between each center and the point

Recompute the cluster centers and locate the nearest points.

Recompute the cluster centers, locate the nearest points, and calculate the distances.

Form the three clusters

Example 2: Suppose that we have eight data points in two-dimensional space as follows, and suppose that we are interested in uncovering k = 2 clusters.

Point | Distance from m1 | Distance from m2 | Cluster membership
a     |                  |                  |
b     |                  |                  |
c     |                  |                  |
d     |                  |                  |
e     |                  |                  |
f     |                  |                  |
g     |                  |                  |
h     |                  |                  |

Point | Distance from m1 | Distance from m2 | Cluster membership
a     | 2.00             | 2.24             |
b     | 2.83             | 2.24             |
c     | 3.61             | 2.83             |
d     | 4.47             | 3.61             |
e     | 1.00             | 1.41             |
f     | 3.16             | 2.24             |
g     | 0.00             | 1.00             |
h     | 1.00             | 0.00             |

Point | Distance from m1 | Distance from m2 | Cluster membership
a     | 2.00             | 2.24             | C1
b     | 2.83             | 2.24             | C2
c     | 3.61             | 2.83             | C2
d     | 4.47             | 3.61             | C2
e     | 1.00             | 1.41             | C1
f     | 3.16             | 2.24             | C2
g     | 0.00             | 1.00             | C1
h     | 1.00             | 0.00             | C2

SSE = 2.00² + 2.24² + 2.83² + 3.61² + 1.00² + 2.24² + 0² + 0² ≈ 36
d(m1, m2) = 1

Centroid of cluster 1 is [(1+1+1)/3, (3+2+1)/3] = (1, 2). Centroid of cluster 2 is [(3+4+5+4+2)/5, (3+3+3+2+1)/5] = (3.6, 2.4).

m1 = (1, 2), m2 = (3.6, 2.4)

Point | Distance from m1 | Distance from m2 | Cluster membership
a     |                  |                  |
b     |                  |                  |
c     |                  |                  |
d     |                  |                  |
e     |                  |                  |
f     |                  |                  |
g     |                  |                  |
h     |                  |                  |

m1 = (1, 2), m2 = (3.6, 2.4)

Point | Distance from m1 | Distance from m2 | Cluster membership
a     | 1.00             | 2.67             |
b     | 2.24             | 0.85             |
c     | 3.61             | 0.72             |
d     | 4.12             | 1.52             |
e     | 0.00             | 2.63             |
f     | 3.00             | 0.57             |
g     | 1.00             | 2.95             |
h     | 1.41             | 2.13             |

m1 = (1, 2), m2 = (3.6, 2.4)

Point | Distance from m1 | Distance from m2 | Cluster membership
a     | 1.00             | 2.67             | C1
b     | 2.24             | 0.85             | C2
c     | 3.61             | 0.72             | C2
d     | 4.12             | 1.52             | C2
e     | 0.00             | 2.63             | C1
f     | 3.00             | 0.57             | C2
g     | 1.00             | 2.95             | C1
h     | 1.41             | 2.13             | C1

SSE = 1.00² + 0.85² + 0.72² + 1.52² + 0² + 0.57² + 1.00² + 1.41² ≈ 7.88
d(m1, m2) = 2.63

Centroid of cluster 1 is [(1+1+1+2)/4, (3+2+1+1)/4] = (1.25, 1.75). Centroid of cluster 2 is [(3+4+5+4)/4, (3+3+3+2)/4] = (4, 2.75).

m1 = (1.25, 1.75), m2 = (4, 2.75)

Point | Distance from m1 | Distance from m2 | Cluster membership
a     |                  |                  |
b     |                  |                  |
c     |                  |                  |
d     |                  |                  |
e     |                  |                  |
f     |                  |                  |
g     |                  |                  |
h     |                  |                  |

m1 = (1.25, 1.75), m2 = (4, 2.75)

Point | Distance from m1 | Distance from m2 | Cluster membership
a     | 1.27             | 3.01             |
b     | 2.15             | 1.03             |
c     | 3.02             | 0.25             |
d     | 3.95             | 1.03             |
e     | 0.35             | 3.09             |
f     | 2.76             | 0.75             |
g     | 0.79             | 3.47             |
h     | 1.06             | 2.66             |

m1 = (1.25, 1.75), m2 = (4, 2.75)

Point | Distance from m1 | Distance from m2 | Cluster membership
a     | 1.27             | 3.01             | C1
b     | 2.15             | 1.03             | C2
c     | 3.02             | 0.25             | C2
d     | 3.95             | 1.03             | C2
e     | 0.35             | 3.09             | C1
f     | 2.76             | 0.75             | C2
g     | 0.79             | 3.47             | C1
h     | 1.06             | 2.66             | C1

SSE = 1.27² + 1.03² + 0.25² + 1.03² + 0.35² + 0.75² + 0.79² + 1.06² ≈ 6.25
d(m1, m2) = 2.93

Final Results
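The final iteration of Example 2 can be double-checked with a few lines of NumPy; the point coordinates below are inferred from the centroid calculations on the earlier slides, so treat them as an assumption:

import numpy as np

points = {'a': (1, 3), 'b': (3, 3), 'c': (4, 3), 'd': (5, 3),
          'e': (1, 2), 'f': (4, 2), 'g': (1, 1), 'h': (2, 1)}
m1, m2 = np.array([1.25, 1.75]), np.array([4.0, 2.75])

sse = 0.0
for name, p in points.items():
    p = np.array(p, dtype=float)
    d1, d2 = np.linalg.norm(p - m1), np.linalg.norm(p - m2)
    cluster = 'C1' if d1 < d2 else 'C2'
    sse += min(d1, d2) ** 2
    print(f"{name}: d(m1)={d1:.2f}  d(m2)={d2:.2f}  -> {cluster}")
print(f"SSE = {sse:.2f}")   # 6.25, matching the table above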

Example 2:

An issue with the k-means algorithm: the algorithm cannot guarantee finding the global minimum SSE, and instead often settles at a local minimum. To improve the probability of achieving the global minimum, it is suggested to place the first cluster center on a random data point and to place the subsequent cluster centers on points as far away from the previous centers as possible.
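A minimal sketch of that initialization idea, picking a random first center and then repeatedly taking the point whose nearest chosen center is farthest away (the helper name is illustrative):

import random
from math import dist

def farthest_point_init(points, k):
    # First center: a random data point.
    centers = [random.choice(points)]
    # Subsequent centers: the point farthest from all centers chosen so far.
    while len(centers) < k:
        next_center = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(next_center)
    return centers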

How to decide k? Unless the analyst has prior knowledge of the number of underlying clusters, clustering solutions for each candidate value of k are compared, and the value of k resulting in the smallest SSE is selected.
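A sketch of that comparison, reusing the k_means and sse helpers sketched earlier in this write-up (an assumption of this example); note that SSE keeps shrinking as k grows, so in practice one usually looks for an "elbow" in the curve rather than the raw minimum:

points = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]  # Example 2 data
for k in range(1, 6):
    centroids, clusters = k_means(points, k)         # helper sketched above
    print(k, round(sse(clusters, centroids), 2))     # compare SSE across k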

Summary: The k-means algorithm is a simple yet popular method for clustering analysis. Low complexity: O(nkt), where t = number of iterations. Its performance is determined by initialisation and an appropriate distance measure. There are several variants of k-means to overcome its weaknesses: K-Medoids: resistance to noise and/or outliers (data that do not comply with the general behaviour or model of the data); K-Modes: extension to categorical data clustering analysis; CLARA: extension to deal with large data sets; Gaussian mixture models (EM algorithm): handling uncertainty of clusters.

Hierarchical Clustering

Hierarchical Clustering: Build a tree-based hierarchical taxonomy (dendrogram). One approach: recursive application of a partitional clustering algorithm. (Example taxonomy: animal splits into vertebrate {fish, reptile, amphibian, mammal} and invertebrate {worm, insect, crustacean}.)

Hierarchical clustering and dendrograms: A hierarchical clustering on a set of objects D is a set of nested partitions of D. It is represented by a binary tree such that: the root node is a cluster that contains all data points; each (parent) node is a cluster made of two subclusters (children); each leaf node represents one data point (a singleton, i.e. a cluster with only one item). A hierarchical clustering scheme is also called a taxonomy. In data clustering, the binary tree is called a dendrogram.

Dendrogram: Hierarchical Clustering. A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Hierarchical clustering: forming clusters from dendrograms.

Hierarchical clustering: There are two styles of hierarchical clustering algorithms to build a tree from the input set S. Agglomerative (bottom-up): begins with singletons (sets with one element) and merges them until S is achieved as the root. At each step, the two closest clusters are aggregated into a new combined cluster; in this way, the number of clusters in the data set is reduced at each step, and eventually all records/elements are combined into a single huge cluster. It is the most common approach. Divisive (top-down): all records are combined into one big cluster, then the most dissimilar records are split off, recursively partitioning S until singleton sets are reached. Hierarchical clustering does not require the number of clusters k in advance.

Two types of hierarchical clustering algorithms: Agglomerative: "bottom-up". Divisive: "top-down".

Hierarchical Agglomerative Clustering (HAC) Algorithm: Start with all instances in their own cluster. Until there is only one cluster: among the current clusters, determine the two clusters, c_i and c_j, that are most similar; replace c_i and c_j with a single cluster c_i ∪ c_j. Assumes a similarity function for determining the similarity of two instances. Starts with all instances in separate clusters and then repeatedly joins the two clusters that are most similar until there is only one cluster.
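A compact sketch of that loop over a plain list of points, using single-link distance as the (assumed) similarity choice; the slides discuss the other linkage options next:

from math import dist

def hac_single_link(points):
    # Start with every instance in its own cluster.
    clusters = [[p] for p in points]
    merges = []                                   # record of (cluster_i, cluster_j) merges
    while len(clusters) > 1:
        # Find the two most similar clusters (single link: closest pair of points).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        # Replace c_i and c_j with the single cluster c_i ∪ c_j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

Running this on the five points of the single-link example later in the slides reproduces the merge order shown there (x1 with x2, then x3 with x4, then {x3, x4} with x5, then the final merge).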

Classification of AHC: We can distinguish AHC algorithms according to the type of distance measure used. There are two approaches. Graph methods: single link method, complete link method, group average method (UPGMA), weighted group average method (WPGMA). Geometric methods: Ward's method, centroid method, median method.

Lance-Williams Algorithm. Definition (Lance-Williams formula): In AHC algorithms, the Lance-Williams formula [Lance and Williams, 1967] is a recurrence equation used to calculate the dissimilarity between a cluster C_k and a cluster formed by merging two other clusters, C_l ∪ C_l': d(C_k, C_l ∪ C_l') = α_l d(C_k, C_l) + α_l' d(C_k, C_l') + β d(C_l, C_l') + γ |d(C_k, C_l) − d(C_k, C_l')|, where α_l, α_l', β, γ are real numbers.
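As an illustration of how the recurrence is used, here is a small sketch with the standard coefficient choices for single link and complete link (the helper itself is illustrative; the coefficient values are the usual textbook ones):

def lance_williams(d_kl, d_kl2, d_ll2, alpha_l, alpha_l2, beta, gamma):
    # Dissimilarity d(C_k, C_l ∪ C_l') from the pairwise dissimilarities already known.
    return (alpha_l * d_kl + alpha_l2 * d_kl2
            + beta * d_ll2 + gamma * abs(d_kl - d_kl2))

# Single link:   alpha_l = alpha_l' = 1/2, beta = 0, gamma = -1/2  -> min(d_kl, d_kl')
# Complete link: alpha_l = alpha_l' = 1/2, beta = 0, gamma = +1/2  -> max(d_kl, d_kl')
print(lance_williams(2.0, 3.0, 1.0, 0.5, 0.5, 0.0, -0.5))   # 2.0, the minimum
print(lance_williams(2.0, 3.0, 1.0, 0.5, 0.5, 0.0, +0.5))   # 3.0, the maximum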

AHC methods and the Lance-Williams formula

Single link method Also known as the nearest neighbor method, since it employs the nearest neighbor to measure the dissimilarity between two clusters

Cluster distance measures: Single link: distance between the closest elements of the two clusters. Complete link: distance between the farthest elements of the two clusters. Centroids: distance between the centroids (means) of the two clusters.

Single-link clustering: nested clusters (points 1-6) and the corresponding dendrogram.

Example 1: Single link method

Example 02: Single link method. x1 = (1, 2), x2 = (1, 2.5), x3 = (3, 1), x4 = (4, 0.5), x5 = (4, 2)

x1 = (1, 2), x2 = (1, 2.5), x3 = (3, 1), x4 = (4, 0.5), x5 = (4, 2). Merge X1 and X2

Merge X3 and X4

Merge {X3,X4 } and X5

Merge {X1,X2} and {X3,X4,X5}

Example 3: Complete link method. x1 = (1, 2), x2 = (1, 2.5), x3 = (3, 1), x4 = (4, 0.5), x5 = (4, 2). Merge X1 and X2

Merge X3 and X4

Merge {X3,X4} and X5

Merge {X1,X2} and {X3,X4,X5}

The dendrogram:
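The single-link and complete-link examples above can also be reproduced with SciPy; a sketch (linkage and dendrogram are the standard scipy.cluster.hierarchy functions, while the plotting details are just one possible choice):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# The five points from the single-link and complete-link examples.
X = np.array([[1, 2], [1, 2.5], [3, 1], [4, 0.5], [4, 2]])
labels = ['x1', 'x2', 'x3', 'x4', 'x5']

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, method in zip(axes, ['single', 'complete']):
    Z = linkage(X, method=method)        # agglomerative clustering with the given linkage
    dendrogram(Z, labels=labels, ax=ax)
    ax.set_title(f'{method} link')
plt.tight_layout()
plt.show()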

Summary: Hierarchical Clustering. For a dataset consisting of n points: O(n²) space, since it requires storing the distance matrix; O(n³) time complexity in most cases (agglomerative clustering). Advantages: dendrograms are great for visualization; provides hierarchical relations between clusters. Disadvantages: not easy to define levels for clusters; can never undo what was done previously; sensitive to the cluster distance measure and to noise/outliers; experiments showed that other clustering techniques outperform hierarchical clustering. There are several variants to overcome its weaknesses: BIRCH: scalable to large data sets; ROCK: clustering categorical data; CHAMELEON: hierarchical clustering using dynamic modelling.