Introduction to clustering: - Clustering is an unsupervised machine learning technique used to group similar data points into clusters based on their similarities or patterns. Types of Clustering: - Clustering is broadly of two types: partitional clustering (for example, k-means) and hierarchical clustering.
Definition of hierarchical clustering: - Hierarchical clustering is an unsupervised machine learning algorithm used to group unlabeled data points into clusters. It builds a hierarchy of clusters in the form of a tree, known as a dendrogram. Key characteristics: - It does not require the number of clusters to be specified in advance, and it represents the hierarchy of clusters as a tree-like structure called a dendrogram. Types of Hierarchical Clustering: -
1. Agglomerative hierarchical clustering: - This is a bottom-up approach where each data point starts as a single cluster. The algorithm then merges the closest pairs of clusters until only one cluster remains. It is also known as hierarchical agglomerative clustering (HAC). How Agglomerative Clustering Works: - 1). Start with individual points as clusters. 2). Calculate the distance (or dissimilarity) between each pair of clusters. 3). Merge the two closest clusters. 4). Repeat steps 2-3 until all points are merged into one large cluster.
Example: - Consider the following six one-dimensional data points: 18, 22, 25, 42, 27, 43.
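Code sketch (Python): - The short sketch below runs agglomerative clustering on these six points with SciPy and draws the resulting dendrogram. The single-linkage method and Euclidean metric are assumptions chosen for illustration; the example above does not fix a particular linkage.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# The six one-dimensional points from the example, as a column vector.
points = np.array([18, 22, 25, 42, 27, 43]).reshape(-1, 1)

# Bottom-up merging: every point starts as its own cluster and the two
# closest clusters are merged repeatedly (single linkage and Euclidean
# distance are assumed here for illustration).
Z = linkage(points, method='single', metric='euclidean')

# Draw the dendrogram; leaf labels show the original values.
dendrogram(Z, labels=[str(int(p)) for p in points.ravel()])
plt.xlabel('Data point')
plt.ylabel('Merge distance')
plt.show()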
2. Divisive hierarchical clustering: - This is a top-down approach where all data points start in a single cluster. The algorithm then recursively splits clusters until each data point is its own cluster. This method is more complex than agglomerative clustering. Example: - Consider the dataset {a, b, c, d, e} with given pairwise dissimilarities d(x, y).
STEP 1: - Initially C1 = {a, b, c, d, e}.
STEP 2: - Set C2 = C1 and C3 = {}.
STEP 3: - First iteration: compute each object's average dissimilarity to the rest of C2. For a: 1/4 * (d(a,b) + d(a,c) + d(a,d) + d(a,e)) = 7.25. Similarly, the average dissimilarity of b = 7.75, of c = 5.25, of d = 7.00, and of e = 7.75.
b and e tie for the highest average dissimilarity; choose b arbitrarily, remove it from C2, and assign it to C3, i.e. C2 = {a, c, d, e} and C3 = {b}.
STEP 4: - Second iteration: for each remaining object, compute its average dissimilarity to the rest of C2 minus its average dissimilarity to C3. For a: 1/3 * (d(a,c) + d(a,d) + d(a,e)) - d(a,b) = -2.33. Similarly, for c: -2.33; for d: 2.67; for e: -3.00. The largest value belongs to d and is positive, so d is moved to the splinter group: C2 = {a, c, e} and C3 = {b, d}.
STEP 5: - Computing diameters: diameter(C2) = max{d(a,c), d(a,e), d(c,e)} = max{3, 11, 2} = 11, and diameter(C3) = max{d(b,d)} = 5. Since diameter(C2) > diameter(C3), continue the process by splitting C2.
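Code sketch (Python): - The splitting procedure above can be sketched as follows. The notes do not reproduce the dissimilarity matrix the example uses, so the matrix D below is an assumption, reconstructed to be consistent with the averages and diameters worked out above; the loop itself follows the divisive rule of repeatedly moving the point that is, on average, farthest from its own group and closest to the splinter group.

import numpy as np

labels = ['a', 'b', 'c', 'd', 'e']
# Assumed dissimilarity matrix, chosen to match the worked values above.
D = np.array([
    [ 0,  9,  3,  6, 11],
    [ 9,  0,  7,  5, 10],
    [ 3,  7,  0,  9,  2],
    [ 6,  5,  9,  0,  8],
    [11, 10,  2,  8,  0],
], dtype=float)

def split_cluster(D, cluster):
    # Split one cluster into (remaining, splinter) the divisive way.
    remaining, splinter = list(cluster), []
    while len(remaining) > 1:
        scores = {}
        for i in remaining:
            others = [j for j in remaining if j != i]
            avg_in = D[i, others].mean()                          # average dissimilarity to own cluster
            avg_out = D[i, splinter].mean() if splinter else 0.0  # average dissimilarity to splinter group
            scores[i] = avg_in - avg_out
        best = max(scores, key=scores.get)
        if splinter and scores[best] <= 0:  # stop once no point prefers the splinter group
            break
        remaining.remove(best)
        splinter.append(best)
    return remaining, splinter

C2, C3 = split_cluster(D, range(5))
print('C2 =', [labels[i] for i in C2])  # ['a', 'c', 'e']
print('C3 =', [labels[i] for i in C3])  # ['b', 'd']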
DENDROGRAM: - A dendrogram is a tree-like diagram that represents the hierarchical relationships between clusters. It is created by iteratively merging or splitting clusters based on a measure of similarity or distance between data points. The dendrogram can be sliced at various heights to determine the number of clusters.
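Code sketch (Python): - To make the idea of cutting the dendrogram concrete, the sketch below reuses the six points from the agglomerative example and cuts the tree at a height of 10; the cut height is an arbitrary assumption chosen for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([18, 22, 25, 42, 27, 43]).reshape(-1, 1)
Z = linkage(points, method='single', metric='euclidean')

# Cutting at height 10 keeps only the merges made below that distance,
# which for these points separates {18, 22, 25, 27} from {42, 43}.
flat = fcluster(Z, t=10, criterion='distance')
print(flat)  # one cluster label per input point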
Advantages of Hierarchical Clustering: - Unlike k-means, hierarchical clustering automatically builds a full hierarchy that can be cut at any level. It lets users see how clusters are formed at each level of distance. It works with various distance metrics and linkage methods.
Disadvantages of Hierarchical Clustering: - It is sensitive to noise and outliers. It does not scale to large datasets as well as methods like k-means, since standard agglomerative algorithms need the full pairwise distance matrix (quadratic memory and at least quadratic time in the number of points).
Applications of Hierarchical Clustering: - Bioinformatics: gene expression data, protein clustering, phylogenetic trees. Market research: segmenting consumers based on buying behavior. Image segmentation: grouping similar pixels to partition an image into meaningful regions. Document clustering: grouping similar documents based on content.