2. Unsupervised Learning: Uses unlabeled inputs to learn about patterns in the data.
3. Reinforcement Learning: Learns by interacting with an environment, using rewards and penalties as feedback.
Key features of Unsupervised Learning:
- Unlabeled data
- Pattern discovery
- No external guidance
- Exploratory analysis
- Clustering, dimensionality reduction, and anomaly detection
Common Applications:
- Customer Segmentation: Grouping customers based on purchasing behavior, demographics, or other characteristics.
- Recommendation Systems: Identifying items that users might be interested in based on their past behavior.
- Anomaly Detection: Identifying unusual data points or events, such as fraudulent transactions or network intrusions.
- Dimensionality Reduction: Reducing the number of variables in a dataset while retaining important information.
- Data Compression: Reducing the size of data while preserving its essential information.
- Natural Language Processing (NLP): Understanding the structure of language and the relationships between words.
- Association Rule Learning: Discovering relationships between items in a dataset, such as items frequently purchased together.
- Clustering: K-means, K-medoids, hierarchical clustering.
Clustering
Clustering is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different. Similarly to classification algorithms, clustering algorithms assign (or predict) a number to each data point, indicating which cluster a particular point belongs to.
Note: The primary driver of clustering is knowledge discovery rather than prediction, because we may not even know what we are looking for before starting the clustering analysis.
Different Clustering Methods
K-means Clustering
A simple outline of the K-means algorithm (a runnable sketch follows the steps):
Step 1: Select K points in the data space and mark them as initial centroids.
loop
Step 2: Assign each point in the data space to the nearest centroid to form K clusters.
Step 3: Measure the distance of each point in a cluster from its centroid.
Step 4: Calculate the Sum of Squared Error (SSE) to measure the quality of the clusters (described later in this chapter).
Step 5: Recompute the centroid of each cluster as the mean of the points assigned to it.
Step 6: Repeat Steps 2 to 5 until the centroids no longer change.
end loop
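As a concrete illustration, here is a minimal NumPy sketch of the loop above, not a production implementation. The synthetic data X, the value of K, and the function name kmeans are illustrative assumptions, and empty-cluster handling is omitted for brevity.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose K points from the data as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 3-4: Sum of Squared Error (SSE, also called inertia) of the clustering
        sse = ((X - centroids[labels]) ** 2).sum()
        # Step 5: recompute each centroid as the mean of its assigned points
        # (a cluster that loses all its points is not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, sse

# Illustrative usage on synthetic 2-D data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
labels, centroids, sse = kmeans(X, k=3)
print("SSE:", sse)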
Inertia: the sum of squared distances from each point to the centroid of its assigned cluster (the SSE from Step 4). Lower inertia indicates tighter clusters, and inertia always decreases as K increases.
The Elbow Plot
The elbow method is a graphical technique used in K-means clustering to find the optimal K value (the number of clusters into which the data is partitioned). Inertia is plotted against K, and K is typically chosen at the elbow, the point where the curve noticeably bends. However, this is not always the best way to find the optimal K.
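A sketch of producing an elbow plot, assuming scikit-learn and matplotlib are available; the make_blobs data and the range of K values are illustrative choices.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with an (unknown to the algorithm) number of true clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit K-means for each candidate K and record the inertia (SSE)
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia (SSE)")
plt.title("Elbow plot")
plt.show()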
Drawbacks of the Elbow Plot: Inertia always decreases as K grows, so on real-world data the curve often bends gradually and the elbow is ambiguous or absent, which makes the choice of K subjective.
Silhouette Score
The Silhouette score is a very useful method for finding the number of clusters K when the elbow method does not show a clear elbow point. The Silhouette score ranges from -1 to 1 and is interpreted as follows:
1: Points are perfectly assigned to their clusters and the clusters are easily distinguishable.
0: Clusters overlap.
-1: Points are assigned to the wrong clusters.
For a single point, Silhouette = (b - a) / max(a, b), where:
a = the average intra-cluster distance (the average distance from the point to all other points in its own cluster).
b = the average inter-cluster distance (the average distance from the point to all points in the nearest neighbouring cluster).
The Silhouette score of a clustering is the mean of this value over all points.
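A sketch of using the silhouette score to pick K, assuming scikit-learn is available; the synthetic data and the candidate range of K are illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

best_k, best_score = None, -1.0
for k in range(2, 11):                       # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)      # mean of (b - a) / max(a, b) over all points
    print(f"K={k}: silhouette={score:.3f}")
    if score > best_score:
        best_k, best_score = k, score
print("Best K by silhouette:", best_k)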
Elbow Method vs. Silhouette Score Method
The elbow curve and the silhouette plot are both very useful techniques for finding the optimal K for K-means clustering. In real-world data sets you will find quite a lot of cases where the elbow curve is not sufficient to find the right K; in such cases, use the silhouette plot to figure out the optimal number of clusters for your data set. Note: Use both techniques together to figure out the optimal K for K-means clustering.
K-Medoids or PAM (Partitioning Around Medoids)
A medoid is a point in the cluster whose total dissimilarity to all the other points in the cluster is minimal. The dissimilarity between a point P and a medoid C is E = |P - C|, and the quality of a clustering is the sum of E over all points.
Algorithm:
1. Initialize k medoids: select k random data points from the dataset as the initial medoids.
2. Assign data points to medoids: assign each data point to the nearest medoid.
3. Update medoids: for each cluster, select the data point that minimizes the sum of distances to all the other data points in the cluster, and set it as the new medoid.
4. Repeat steps 2 and 3 until convergence or a maximum number of iterations is reached.
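Below is a minimal NumPy sketch of this loop (a simplified PAM-style update, not a production implementation). The Manhattan dissimilarity, the synthetic data, and the function name kmedoids are illustrative choices; libraries such as scikit-learn-extra also provide a ready-made KMedoids estimator.

import numpy as np

def kmedoids(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Manhattan dissimilarities, matching E = |P - C|
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    # Step 1: select k random data points as the initial medoids
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest medoid
        labels = D[:, medoids].argmin(axis=1)
        # Step 3: in each cluster, pick the member with the smallest total distance to the others
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            costs = D[np.ix_(members, members)].sum(axis=0)
            new_medoids[j] = members[costs.argmin()]
        # Step 4: stop at convergence (the set of medoids no longer changes)
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return labels, medoids

# Illustrative usage on synthetic 2-D data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
labels, medoids = kmedoids(X, k=3)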
The Time Complexity: O(k*(n-k)^2)
Big O notation denotes the upper bound of an algorithm. Assume the first set of medoids is the worst possible one, so the cost function calculated from these medoids is the maximum over all possible sets of medoids. Each time we try a random non-medoid point as a replacement, we can find one that decreases the cost function. Now assume, unfortunately, that we always pick the next-worst candidate, so we exhaust all remaining (n-k) non-medoid points before reaching the set of medoids with the minimum cost (an adversary argument). The outermost loop therefore runs over the k medoids, then over the (n-k) non-medoid data points, and then (n-k) times again for choosing the random replacement, giving O(k*(n-k)^2).
Difference between K-Means and K-Medoids Clustering
Representation of clusters: K-Means uses the mean of the points (centroid) to represent a cluster; K-Medoids uses the most centrally located point (medoid).
Sensitivity to outliers: K-Means is highly sensitive to outliers; K-Medoids is more robust to outliers.
Distance metrics: K-Means primarily uses Euclidean distance; K-Medoids can use any distance metric.
Computational efficiency: K-Means is generally faster and more efficient; K-Medoids is slower due to the need to calculate all pairwise distances within clusters.
Cluster shape assumption: K-Means assumes spherical clusters; K-Medoids does not make strong assumptions about cluster shapes.
When to Use K-Means:
- When dealing with large datasets.
- When computational efficiency is a priority.
- When the data is well-behaved and not heavily influenced by outliers.
When to Use K-Medoids:
- When robustness to outliers is important.
- When the dataset is smaller and the flexibility of using different distance measures is beneficial.
- When interpretability of cluster centers as actual data points is needed.