Clustering: Grouping Data for Insights

sasankkandru1439 21 views 15 slides Jun 02, 2024

About This Presentation

Clustering: Grouping Data for Insights

Clustering is a fundamental method in data analysis and machine learning that focuses on the task of dividing a set of data points into groups or clusters. The primary goal is to ensure that data points within the same cluster are more similar to each other th...


Slide Content

CLUSTERING TECHNIQUES OVERVIEW AND APPLICATIONS

INDEX

INTRODUCTION
Clustering was first employed in biology in the 1960s to classify species. In this data-driven era, effective methods for organizing and analyzing data play a major role in gaining insights from it. From marketing to social network analysis, clustering has evolved into an essential tool for sorting and categorizing data, supporting pattern detection, data analysis, and interpretation. Clustering is an unsupervised data analysis technique that groups a set of objects such that objects in the same group (cluster) are more similar to each other than to those in other groups.

Example: Clustering Grocery Items
eggs, bananas, milk, bread

TECHNIQUES

Partitional Clustering - K Means
Partitional clustering divides a dataset into non-overlapping partitions or clusters, where each data point belongs to exactly one cluster. K-means clustering groups an unlabelled dataset into a predefined number of clusters, k, so that similar data points are grouped together to discover underlying patterns.
Phases:
Initialization: choose k initial centroids.
Categorize and update centroids: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points.
Repeat until the centroids stop changing.
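The three phases can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the 2-D sample points, the fixed random seed, and the choice to initialize from randomly picked data points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialization: pick k data points as starting centroids (an assumption;
    # other initialization schemes exist).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Categorize: assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Repeat until the centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical sample data: two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

On this sample the first three points end up in one cluster and the last three in the other. Which cluster receives index 0 depends on the random initialization.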

Hierarchical Based Clustering – BIRCH (Unsupervised)
Hierarchical clustering organizes elements in a hierarchical, tree-like structure. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) clusters a large dataset with a single scan and improves the quality of the clustering with a few additional scans.
BIRCH consists of two main stages, followed by an optional refinement:
Building the CF (Clustering Feature) tree.
Global clustering.
Cluster refinement for accuracy.
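The building block of the CF tree is the clustering feature itself: a triple (N, LS, SS) holding the number of points, their linear sum, and their sum of squares. Two CFs can be merged by simple addition, which is what lets BIRCH summarize data in one scan. The sketch below shows only this summary, not the tree construction or global clustering; the class and method names are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusteringFeature:
    n: int      # number of points summarized
    ls: tuple   # linear sum of the points, per dimension
    ss: float   # sum of squared norms of the points

    @classmethod
    def from_point(cls, point):
        return cls(1, tuple(point), sum(x * x for x in point))

    def merge(self, other):
        # CF additivity: the summary of a union is the sum of the summaries.
        return ClusteringFeature(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # Root-mean-square distance of the summarized points to the centroid.
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


# Hypothetical sample: summarize three points without storing them.
cf = ClusteringFeature.from_point((1.0, 2.0))
for p in [(3.0, 4.0), (5.0, 6.0)]:
    cf = cf.merge(ClusteringFeature.from_point(p))
```

The centroid and radius are recovered from the triple alone, so the original points never need to be revisited.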

Density Based Clustering – DBSCAN
Density-based clustering methods form clusters from dense regions of data points in the feature space. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines clusters around core points, that is, points that have at least a minimum number of data points within a specific radius.
Steps in the DBSCAN algorithm:
Classify the points as core, border, or noise, and discard the noise.
Assign a cluster to each core point.
Color all the density-connected points and boundary points according to the nearest core point.
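A compact sketch of these steps follows, using a brute-force neighbor search for clarity. The names `dbscan`, `eps`, and `min_pts` and the sample points are assumptions made for the example. A point counts itself among its neighbors, and a border point is given to the first cluster that reaches it, which is a common simplification of "nearest core point".

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force radius query; the point itself is included.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # provisional; may become a border point later
            continue
        # i is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                # j is also a core point, so its neighbors are density-connected.
                queue.extend(j_neighbors)
    return labels


# Hypothetical sample: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Here the first four points form one cluster, the next three form a second, and the isolated point (5, 5) is labelled as noise.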

Grid-Based Clustering – STING
Grid-based clustering partitions the dataset into a grid structure, organizing data points into cells for efficient clustering based on spatial proximity. The STING (Statistical Information Grid) approach partitions the data into a hierarchical grid and investigates the clusters at different levels of detail.
Phases of STING:
Grid construction and cell assignment.
Density calculation and cluster identification.
Border point assignment and noise identification.
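The general grid-based idea can be illustrated with a deliberately simplified, single-level sketch. It is not STING itself, which keeps statistics for cells at several levels of a hierarchy. The function name `grid_clusters`, the parameters `cell_size` and `min_count`, and the sample points are assumptions for the example.

```python
import math
from collections import defaultdict


def grid_clusters(points, cell_size, min_count):
    """Label 2-D points by connected groups of dense grid cells (-1 = noise)."""

    def cell_of(point):
        x, y = point
        return (math.floor(x / cell_size), math.floor(y / cell_size))

    # Grid construction and cell assignment.
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p)].append(p)

    # Density calculation: a cell is dense if it holds enough points.
    dense = {c for c, members in cells.items() if len(members) >= min_count}

    # Cluster identification: flood fill over adjacent dense cells.
    cell_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = next_label
        stack = [start]
        while stack:
            cx, cy = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in cell_label:
                        cell_label[nb] = next_label
                        stack.append(nb)
        next_label += 1

    # Noise identification: points in sparse cells get -1.
    return [cell_label.get(cell_of(p), -1) for p in points]


# Hypothetical sample data on a grid of unit cells.
points = [(0.1, 0.2), (0.4, 0.3), (1.2, 0.5), (1.6, 0.9),
          (7.1, 7.2), (7.5, 7.8), (4.5, 4.5)]
labels = grid_clusters(points, cell_size=1.0, min_count=2)
```

The work after cell assignment depends on the number of occupied cells rather than the number of points, which is the main source of efficiency in grid-based methods.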

Model-Based Clustering – Gaussian Mixture
Model-based clustering assigns data points to clusters based on probabilistic models representing the data distribution. "Gaussian Mixture is a statistical model that identifies subgroups within a population using a combination of Gaussian distributions." It repeatedly optimizes its parameters using the expectation-maximization (EM) algorithm, which estimates the cluster means, covariances, and mixture weights.
Steps of Gaussian mixture fitting:
Initialization.
Expectation step (E-step) and maximization step (M-step).
Convergence check.
Iteration.
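The EM loop can be sketched for the one-dimensional case, where each covariance reduces to a single variance. This is a minimal illustration under stated assumptions: the function name `fit_gmm_1d`, the starting means, the unit starting variances, the variance floor, and the sample data are all chosen for the example.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, init_means, max_iter=200, tol=1e-8):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    # Initialization: given means, unit variances, equal weights.
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        ll = 0.0
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed component
            weights[j] = nj / len(data)
        # Convergence check on the log-likelihood; otherwise iterate again.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, variances


# Hypothetical sample: two groups, centered near 1 and near 5.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
weights, means, variances = fit_gmm_1d(data, init_means=[0.0, 6.0])
```

Unlike K-means, each point receives a soft responsibility for every component rather than a single hard label. A hard assignment can be obtained by taking the component with the largest responsibility.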

PROS AND CONS OF CLUSTERING TECHNIQUES
Pros: pattern finding, exploration, feature discovery, data compression, scalability.
Cons: parameter subjectivity, high-dimensional data challenges, evaluation difficulty, shape assumptions, noise handling.

Applications of Clustering Techniques
Customer Segmentation: Grouping customers into distinct segments based on attitudes and behavior for targeted marketing strategies.
Anomaly Detection: Identifying unusual patterns or outliers in datasets that deviate significantly from normal behavior.
Image Segmentation: Partitioning an image into regions with similar attributes for object recognition and image analysis tasks.
Recommendation Systems: Grouping users or items into clusters based on preferences or similarities to provide personalized recommendations in e-commerce or content platforms.
Document Clustering: Automatically grouping similar documents for efficient information retrieval, text summarization, and content-based recommendation systems.

CONCLUSION
Clustering techniques offer a flexible approach to unsupervised learning, applicable across diverse datasets and domains. By grouping similar data points, clustering facilitates exploration and recognition of underlying patterns, leading to valuable insights. Clustering algorithms automate data grouping tasks, saving time and enabling efficient analysis of large datasets. Clustering finds use in marketing, healthcare, finance, and more, for tasks like customer segmentation and anomaly detection.

FUTURE RESEARCH

References
T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases" in ACM SIGMOD Record, ACM, vol. 25, pp. 103–114.
M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques" in Proceedings of the KDD Workshop on Text Mining, ACM, 2000.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise" in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), AAAI Press, 1996.