Unsupervised Learning
Unsupervised learning is a class of machine learning techniques for finding patterns in data. The data given to an unsupervised algorithm are not labelled: only the input variables (X) are provided, with no corresponding output variables, so there are no defined dependent and independent variables. The patterns in the data are used to identify and group similar observations.
Clustering
Clustering is primarily an exploratory technique for discovering hidden structure in data, often as a prelude to a more focused analysis or decision process. It decomposes a data set into subsets, each representing a group with similar characteristics: objects in the same group are more similar to each other, in some sense, than to objects in other groups. The groups are known as clusters; each cluster gets a distinct label called the cluster ID and is represented by its centroid.
Types of Clustering
Centroid-based clustering, e.g. K-Means clustering
Connectivity-based clustering, e.g. hierarchical clustering
K-Means Clustering Steps
K-Means Internal Process
Step 0 – Actual Data Points
Step 1 – Select random points as centroids: randomly assign k centroids (here k = 3).
Step 2 – Calculate distance: the distance from each point to each centroid can be calculated with one of the three methods below: Manhattan distance, Euclidean distance, Chebyshev distance.
Step 2 – Continued… Euclidean distance method
Step 2 – Continued… Manhattan distance method
Step 2 – Continued… Chebyshev distance method
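The three distance methods above can be sketched in Python (NumPy assumed; function names are illustrative):

```python
import numpy as np

def manhattan(a, b):
    # L1 norm: sum of absolute coordinate differences
    return np.abs(a - b).sum()

def euclidean(a, b):
    # L2 norm: straight-line distance
    return np.sqrt(((a - b) ** 2).sum())

def chebyshev(a, b):
    # L-infinity norm: largest single coordinate difference
    return np.abs(a - b).max()

p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])
# manhattan(p, q) -> 7.0, euclidean(p, q) -> 5.0, chebyshev(p, q) -> 4.0
```

K-Means conventionally uses the Euclidean distance when assigning points to centroids.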
Step 3 – Update the centroids by taking the average of the data points assigned to each cluster.
Step 4 – Repeat Steps 2 and 3 until each data point is closer to its own cluster mean than to any other cluster's mean.
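The steps above can be sketched as a minimal from-scratch implementation (a sketch only; the function and parameter names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: random init, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment of every point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids
```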
Elbow Method
Used to find the optimum K value: run K-Means for a range of K, plot the within-cluster sum of squares against K, and choose the K at the "elbow", where adding more clusters yields only a small improvement.
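A minimal sketch of the elbow method, assuming scikit-learn is available (the toy data and parameter choices are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs (illustrative)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances
# Plotting k against inertia shows a sharp drop up to k = 3 and only small
# gains afterwards, so the "elbow" suggests K = 3 for this data.
```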
Performance Estimation – Silhouette Coefficient
The silhouette coefficient measures how similar a data point is to its own cluster compared to other clusters. It lies in the range [-1, 1]:
+1 = the data point is far from the neighboring cluster and close to its own
-1 = the data point is closer to a neighboring cluster than to its own cluster
0 = the data point lies on the boundary between its own and the neighboring cluster
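The silhouette coefficient can be computed with scikit-learn; a sketch on illustrative toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated toy blobs (illustrative)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 8)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)  # mean silhouette over all points, in [-1, 1]
# Tight, well-separated clusters push the score toward +1.
```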
K-Means Summary
Summary: The simplest unsupervised learning algorithm. It classifies a data set through a number of clusters fixed a priori. The idea is to define k centroids, one for each cluster; the centroids should be placed carefully, as far from each other as possible. Each point is assigned to the nearest centroid, and the centroids are then re-calculated to minimize the MSE. Uses Euclidean distance.
Pros: With a large number of variables, K-Means is faster than other clustering algorithms. Produces tighter clusters than hierarchical clustering.
Cons: Predicting the number of clusters is a tedious task. Sensitive to outliers, since cluster means are pulled by extreme values. If the data contains overlapping or non-spherical clusters, the algorithm does not work well.
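The workflow in the summary maps directly onto the scikit-learn API; a minimal sketch (assuming scikit-learn, with illustrative toy data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Toy data: two blobs around (0, 0) and (6, 6) (illustrative)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(6, 0.5, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_                      # one learned centroid per cluster
new_labels = km.predict([[0.2, 0.1], [5.8, 6.2]])  # nearest-centroid assignment
```

fit recomputes centroids until convergence; predict then assigns unseen points to the nearest learned centroid using Euclidean distance.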
Industry Applications Customer segmentation – buying patterns, income, spending behavior, loyalty, customer lifetime value Anomaly detection Creating newsfeeds – cluster articles based on their similarity Pattern detection in medical imaging for diagnostics