Clustering in Machine Learning: A Brief Overview.ppt
About This Presentation
Clustering in machine learning is an unsupervised learning technique that groups data points into clusters based on their similarities. Unlike supervised learning, clustering does not require labeled data, allowing it to discover patterns and structures inherent in the dataset. Each cluster consists of data points that are more similar to one another than to points in other clusters. Common algorithms include k-means, which partitions data into a predefined number of clusters by minimizing variance within clusters, and hierarchical clustering, which creates a tree-like structure of nested clusters. Clustering is widely used in various applications, including customer segmentation, image analysis, and anomaly detection, where understanding the natural grouping of data is essential.
Slide Content
Clustering in Machine Learning
Unsupervised learning
By Mrs. Shilpa Mathur
Assistant Professor, AI&ML Department
Thakur College of Engg & Tech.
Supervised machine learning algorithms make use of labelled data to make predictions. For example, an email will be classified as spam or ham, or a bank’s customer will be predicted as ‘good’ or ‘bad’. You have a target variable Y which needs to be predicted.
On the other hand, in unsupervised learning, you are not interested in prediction because you do not have a target or outcome variable. The objective is to discover interesting patterns in the data, e.g. are there any subgroups or ‘clusters’ among the bank’s customers?
Unsupervised Learning
Unsupervised learning is a machine learning technique that involves analyzing data without the use of labeled or pre-defined output data. Unsupervised learning algorithms work with only input data, and the goal is to discover patterns or relationships within the data.
There are two main types of unsupervised learning algorithms: clustering and dimensionality reduction.
Clustering
Clustering involves grouping together similar data points based on their characteristics or features, without any prior knowledge of the class labels or categories. Some common clustering algorithms include K-means, hierarchical clustering, and DBSCAN.
PRACTICAL APPLICATIONS OF CLUSTERING
1. Customer Insight: Say, a retail chain with many stores across locations wants to manage its stores well and increase sales and performance. Cluster Analysis can help the retail chain get the desired insights on customer demographics, purchase behaviour and demand patterns across locations. This will help the retail chain with assortment planning, planning promotional activities, and store benchmarking for better performance and higher returns.
2. Marketing: In the field of marketing, Cluster Analysis can help in market segmentation and positioning, and in identifying test markets for new product development.
3. Social Media: In the areas of social networking and social media, Cluster Analysis is used to identify similar communities within larger groups.
4. Medical: Cluster Analysis has also been widely used in the fields of biology and medical science, e.g. human genetic clustering, sequencing into gene families, building groups of genes, and clustering of organisms at the species level.
K-Means Clustering Algorithm
● The steps in the K-Means algorithm
● How to graphically visualise the steps of the K-Means algorithm
● Practical considerations while using the K-Means algorithm
Euclidean Distance
Clustering works on the basis of grouping the observations which are most similar to each other. What does this exactly mean?
In simple terms, the algorithm needs to find data points whose values are similar to each other, and these points would then belong to the same cluster. Any clustering algorithm does this by computing a “distance measure”. The distance measure used in K-means clustering is the Euclidean distance.
Observation   Height (in cm)   Weight (in kg)
A             175 (X1)         65 (Y1)
B             166 (X2)         67 (Y2)
In two-dimensional space, Euclidean distance between two points (x1, y1) and (x2, y2) can be
calculated using the following formula:
Distance = sqrt((x1 - x2)^2 + (y1 - y2)^2)
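As a quick numeric check (a minimal sketch assuming Python with NumPy; the array values come from the table above):

import numpy as np

# Observations from the table above: (height in cm, weight in kg)
a = np.array([175, 65])   # observation A
b = np.array([166, 67])   # observation B

# Euclidean distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
distance = np.sqrt(np.sum((a - b) ** 2))   # equivalently np.linalg.norm(a - b)
print(distance)                            # ≈ 9.22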
➢The observations which are closer or more similar to each other would
have a low Euclidean distance
➢The observations which are farther or less similar to each other would
have a higher Euclidean distance
Centroids
Centroids are essentially the cluster centres of a group of observations.
Calculation of Centroid
The centroid is calculated by computing the mean of each and every column/dimension that you have and then ordering them
in the same way as above
Therefore, Height-mean = ((175+165+183+172)/)/4 = 173.75
Weight-mean = ((83+74+98+80))/4 = 83.75
Age - mean = ((22+25+24+24))/4 =23.75
Thus the centroid of the above group of observations is (173.75, 83.75 and 23.75)
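The same centroid calculation as a minimal NumPy sketch (the array simply repeats the four observations above):

import numpy as np

# Columns: height (cm), weight (kg), age (years)
observations = np.array([
    [175, 83, 22],
    [165, 74, 25],
    [183, 98, 24],
    [172, 80, 24],
])

# The centroid is simply the column-wise mean
centroid = observations.mean(axis=0)
print(centroid)   # [173.75  83.75  23.75]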
Steps of the K-Means Algorithm
Step 1: Assignment Step
1.Choose the number of clusters K: The first step is to determine the number of clusters K that the algorithm
should create. This is typically done based on prior knowledge or by using techniques like the elbow method or
silhouette metric.
2.Initialize the cluster centroids: Next, the algorithm initializes K centroids randomly. Each centroid represents
the mean of the data points assigned to its cluster.
3.Assign data points to the closest centroid: For each data point, the algorithm calculates its distance to each
centroid and assigns the data point to the cluster represented by the closest centroid.
Step 2: Optimization Step
●Recalculate the centroids: After all data points have been assigned to clusters, the algorithm recalculates the
position of each centroid as the mean of the data points in its cluster.
●Repeat steps until convergence: The previous two steps are repeated until the centroids no longer move
significantly or a maximum number of iterations is reached.
●Output the clusters: The algorithm outputs the K clusters along with the data points assigned to each cluster.
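The two alternating steps can be sketched in a few lines of Python (a minimal illustration assuming NumPy; not an optimised or production implementation):

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal K-Means sketch: alternate the assignment and optimization steps."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = distances.argmin(axis=1)
        # Optimization step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
            break
        centroids = new_centroids
    return centroids, labels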
Optimizing Step
The K-Means algorithm aims to minimize the sum of squared distances between each data
point and its assigned centroid. It works well on datasets with well-defined clusters and is
relatively fast and scalable, but can struggle with datasets with complex or overlapping
clusters.
K-Means algorithm in action using a visualisation tool:
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
Major practical considerations involved in K-Means clustering
●The number of clusters that you want to divide your data points into, i.e. the value of
K has to be pre-determined.
●The choice of the initial cluster centres can have an impact on the final cluster
formation.
●The clustering process is very sensitive to the presence of outliers in the data.
●Since the distance metric used in the clustering process is the Euclidean distance, you
need to bring all your attributes on the same scale. This can be achieved through
standardisation.
●The K-Means algorithm does not work with categorical data.
●The process may not converge in the given number of iterations. You should always
check for convergence.
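As an illustration of the last two points (a sketch assuming scikit-learn; the make_blobs toy data is only for the example), features are standardised before clustering and convergence is checked afterwards:

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy numeric data

X_scaled = StandardScaler().fit_transform(X)    # bring all attributes onto the same scale
km = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0).fit(X_scaled)

# If n_iter_ reached max_iter, the algorithm may not have converged
print(km.n_iter_, km.inertia_)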
How to choose the value of k?
● Elbow method
● Silhouette metric
Elbow Method
The elbow method is a technique used to determine the optimal number of clusters in a dataset for a clustering algorithm. The method works by plotting the within-cluster sum of squares (WCSS) against the number of clusters, and selecting the number of clusters at the “elbow” of the resulting curve.
The WCSS is defined as the sum of the squared distances between each data point and its assigned cluster center. The goal of clustering is to minimize the WCSS by finding the optimal number of clusters that can effectively group the data.
To use the elbow method, we first apply a clustering algorithm (such as K-Means) to the dataset with a range of different values for the number of clusters (e.g. 1 to 10). We then calculate the WCSS for each value of k and plot these values on a line graph. The resulting graph should resemble an arm, with a clear elbow point where the rate of decrease in WCSS begins to slow down, indicating that adding more clusters does not significantly improve the clustering result.
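A minimal sketch of the elbow plot (assuming scikit-learn and matplotlib; scikit-learn exposes the WCSS as the inertia_ attribute, and the toy data is only illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# WCSS (inertia) for k = 1..10
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.show()   # look for the 'elbow' where the curve flattens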
Silhouette metric
Silhouette Coefficient or silhouette score is a metric used to calculate the goodness of a clustering
technique. Its value ranges from -1 to 1.
1: Means clusters are well apart from each other and clearly distinguished.
0: Means clusters are indifferent, or we can say that the distance between clusters is not significant.
-1: Means clusters are assigned in the wrong way.
Silhouette score = (b - a) / max(a, b), computed for every data point,
where
a = average intra-cluster distance, i.e. the average distance between the point and all other points within its own cluster (we want this to be as small as possible), and
b = average nearest-cluster distance, i.e. the average distance between the point and the points in the nearest neighbouring cluster (we want this to be as large as possible).
Take the average of the silhouette scores for different values of K and see for which value of K we get a good silhouette score.
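A short sketch of this comparison (assuming scikit-learn; the toy data is only for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 8):            # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))   # pick the k with the highest average score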
Cluster Tendency
Before we apply any clustering algorithm to the given data, it is important to check whether the data contains meaningful clusters, i.e. that the data is not random. The process of evaluating whether the data is feasible for clustering is known as assessing the clustering tendency.
To check cluster tendency, we use the Hopkins test.
The Hopkins test examines whether the data points differ significantly from uniformly distributed data in the multidimensional space. To perform the Hopkins test, a sample of real data points and an equal number of uniformly random points are generated; the distances from each of these points to its nearest neighbour in the dataset are calculated and combined to obtain the Hopkins statistic. This process is repeated a number of times to obtain an estimate of the statistical significance of the clustering tendency of the dataset.
● If the Hopkins statistic is significantly greater than 0.5, it suggests that the dataset has a strong clustering tendency, indicating that clustering algorithms may be effective in partitioning the data.
● On the other hand, if the Hopkins statistic is close to 0.5, it suggests that the dataset is randomly distributed, and clustering algorithms may not be effective.
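A rough sketch of one common formulation of the Hopkins statistic (assuming NumPy and scikit-learn; several variants exist, so treat the exact choices of sample size and bounding box as assumptions for illustration):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, sample_size=None, seed=0):
    """Compare nearest-neighbour distances of real points vs. uniform random points."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = sample_size or min(50, n // 10 or 1)

    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # Distances from m sampled real points to their nearest other real point
    # (k=2 because the nearest neighbour of a real point is itself)
    real = X[rng.choice(n, size=m, replace=False)]
    w = nn.kneighbors(real, n_neighbors=2)[0][:, 1]

    # Distances from m uniform random points (in the data's bounding box) to the data
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]

    # Values well above 0.5 suggest clustering tendency; near 0.5 suggests random data
    return u.sum() / (u.sum() + w.sum())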
K-medoids
The K-means clustering algorithm is sensitive to outliers, because a mean is easily influenced by extreme values. K-medoids clustering is a variant of K-means that is more robust to noise and outliers. Instead of using the mean point as the centre of a cluster, K-medoids uses an actual point in the cluster to represent it. The medoid is the most centrally located object of the cluster, with the minimum sum of distances to the other points.
1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point with the closest medoid using any common distance metric.
3. While the cost decreases: for each medoid m, and for each data point o which is not a medoid:
   a. Swap m and o, associate each data point with the closest medoid, and recompute the cost.
   b. If the total cost is more than that in the previous step, undo the swap.
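A minimal from-scratch sketch of this swap-based procedure (assuming NumPy; a naive illustration, not an optimised PAM implementation):

import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Naive K-medoids: keep swapping medoids with non-medoids while the cost drops."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances

    def cost(meds):
        # total distance of every point to its closest medoid
        return dist[:, meds].min(axis=1).sum()

    medoids = rng.choice(n, size=k, replace=False)
    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o                 # try swapping medoid i with point o
                c = cost(candidate)
                if c < best:                     # keep the swap only if the cost decreases
                    best, medoids, improved = c, candidate, True
        if not improved:
            break
    labels = dist[:, medoids].argmin(axis=1)     # assign each point to its closest medoid
    return medoids, labels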
Hierarchical Clustering
●Hierarchical clustering algorithm
●Interpreting the dendrogram
●Cutting the dendrogram
●Types of linkages
Hierarchical Clustering
One of the major considerations in using the K-means algorithm is deciding the value of K beforehand. The
hierarchical clustering algorithm does not have this restriction.
The output of the hierarchical clustering algorithm is also quite different from that of the K-Means algorithm: it is an inverted tree-shaped structure called a dendrogram, an example of which is shown below.
●In the K-Means algorithm, we divide the data in the first step itself. In the subsequent steps, we refine
our clusters to get the most optimal grouping.
●In hierarchical clustering, the data is not partitioned into a particular cluster in a single step. Instead, a
series of partitions/merges take place, which may run from a single cluster containing all objects to n
clusters that each contain a single object or vice-versa.
●This is very helpful since we don’t have to specify the number of clusters beforehand.
Steps in Hierarchical Clustering
Given a set of N items to be clustered, the steps in hierarchical clustering are:
●Calculate the NxN distance (similarity) matrix, which calculates the distance of each data point from the
other.
●Each item is first assigned to its own cluster, i.e. N clusters are formed
●The clusters which are closest to each other are merged to form a single cluster
●The same step of computing the distance and merging the closest clusters is repeated till all the points
become part of a single cluster
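These steps are what SciPy's linkage function performs; a minimal sketch (assuming SciPy and matplotlib, with random toy data in place of a real dataset):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))          # 10 toy observations

# linkage() performs the agglomerative merging steps described above
Z = linkage(X, method="average", metric="euclidean")
dendrogram(Z)
plt.show()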
In the dendrogram shown above, samples 4 and 5 are the most similar and join to form the first cluster, followed by samples 1 and
10. The last two clusters to fuse together to form the final single cluster are 3-6 and 4-5-2-7-1-10-9-8.
Determining the number of groups in a cluster analysis is often the primary goal. Typically, one looks for natural groupings
defined by long stems. Here, by observation, you can identify that there are 3 major groupings: 3-6, 4-5-2-7 and 1-10-9-8.
Cutting the dendrogram
Once we obtain the dendrogram, the clusters can be obtained by cutting the
dendrogram at an appropriate level. The number of vertical lines intersecting
the cutting line represents the number of clusters.
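Cutting the dendrogram can also be done programmatically with SciPy's fcluster (a sketch on toy data; the threshold values here are only illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
Z = linkage(X, method="average")

# Cut so that at most 3 clusters remain ...
labels = fcluster(Z, t=3, criterion="maxclust")
# ... or cut at a chosen height (distance threshold) on the dendrogram
labels_at_height = fcluster(Z, t=1.5, criterion="distance")
print(labels, labels_at_height)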
Types of Hierarchical Clustering
Agglomerative Clustering
Agglomerative clustering is a hierarchical clustering technique used to group similar data
points together. The basic idea of agglomerative clustering is to start with each data point as its
own cluster and then iteratively merge pairs of clusters based on their similarity, until all the
data points belong to a single cluster.
The algorithm works as follows:
1.Assign each data point to its own cluster.
2.Compute the distance or similarity between all pairs of clusters.
3.Merge the two closest clusters into a single cluster.
4.Recompute the distance or similarity between the new cluster and all the remaining clusters.
5.Repeat steps 3 and 4 until all the data points belong to a single cluster.
Divisive Clustering
Divisive clustering is a hierarchical clustering technique that starts with all data points belonging to a single
cluster and then recursively splits the cluster into smaller sub-clusters. The algorithm works by repeatedly
dividing the data into smaller subsets until each subset contains only a single data point.
The divisive clustering algorithm works as follows:
1.Begin with all the data points in a single cluster.
2.Compute a measure of dissimilarity or distance between all pairs of data points in the cluster.
3.Find the pair of data points that are the most dissimilar or farthest apart.
4.Divide the cluster into two sub-clusters, one containing the selected pair of data points and the other
containing the remaining data points.
5.Recursively apply steps 2 to 4 to each sub-cluster until each sub-cluster contains only a single data point.
Divisive clustering is less commonly used than agglomerative clustering, as it
tends to be computationally expensive and can be less accurate in some
cases. However, it can be useful in situations where the data is highly
complex and agglomerative clustering may not be effective.
Linkages
● Single
● Average
● Complete
Single Linkage
In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest
distance between two points in each cluster. For example, the distance between clusters “r” and “s” to
the left is equal to the length of the arrow between their two closest points. This method tends to
create long, elongated clusters and is sensitive to outliers and noise.
Complete Linkage
In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest
distance between two points in each cluster. For example, the distance between clusters “r” and “s” to
the left is equal to the length of the arrow between their two furthest points. This method tends to
create compact, spherical clusters and is less sensitive to outliers.
Average Linkage
In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster. For example, the distance between clusters “r” and “s” to the left is equal to the average length of the arrows connecting the points of one cluster to the points of the other. This method strikes a balance between single and complete linkage and is less sensitive to outliers compared to single linkage.
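A short sketch (assuming scikit-learn, with toy blob data) that runs the same data through the three linkage types for comparison:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

for method in ("single", "average", "complete"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=method).fit_predict(X)
    print(method, labels[:10])    # compare how the first few points are grouped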
Here are some considerations that can help in choosing a linkage method:
Data type: The type of data being clustered can influence the choice of linkage method. For
example, single linkage is sensitive to outliers, while complete linkage is less sensitive to
outliers but can be affected by non-linear relationships in the data.
Cluster size and shape: Different linkage methods can produce clusters of different sizes
and shapes. For example, single linkage tends to produce elongated clusters, while average
linkage tends to produce more compact clusters.
Objectives of the analysis: The choice of linkage method should be guided by the
objectives of the analysis. For example, if the goal is to identify compact, well-separated
clusters, then complete linkage may be more appropriate. If the goal is to identify clusters
with a specific size or shape, then a different linkage method may be more appropriate.
It is often a good practice to compare the results of clustering using different linkage
methods and choose the one that produces the most meaningful and interpretable results.
DBSCAN Clustering (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based clustering algorithm that divides a data set into subgroups of high-density regions. DBSCAN groups together points that are close to each other based on a distance measurement (usually Euclidean distance) and a minimum number of points. It also marks as outliers the points that lie in low-density regions.
DBSCAN Parameters
The DBSCAN algorithm requires 2 parameters:
1. Epsilon (eps)
2. MinPoints or MinSamples
Why DBSCAN?
Partitioning methods such as K-means, as well as hierarchical clustering, work well for finding spherical-shaped or convex clusters. In other words, they are suitable only for compact and well-separated clusters. Moreover, they are also severely affected by the presence of noise and outliers in the data.
Real life data may contain irregularities, like:
●Clusters can be of arbitrary shape such as those shown in the figure below.
●Data may contain noise.
EPS
EPS is a distance parameter that defines the radius within which to search for nearby neighbours. We can imagine each data point having a circle of radius EPS drawn around it.
The value of EPS used to cluster the data has a significant impact on the results. If the value of EPS chosen is too small, decidedly fewer data points will be considered in one cluster, and a large part of the data will not be clustered. The un-clustered data points will be considered outliers because they do not satisfy the number of points required to create a dense region. If the EPS value is chosen to be very high, no real clusters will be formed, as all of them will merge into the same cluster. The eps should be chosen based on the distances in the dataset (we can use a k-distance graph to find it), but in general small eps values are preferable.
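A minimal sketch of the k-distance graph mentioned above (assuming scikit-learn and matplotlib, on toy "moons" data); eps is commonly picked near the sharp bend of the sorted curve:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

k = 5   # typically set close to the intended MinPoints value
distances, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_distances = np.sort(distances[:, -1])    # distance to the k-th nearest neighbour, sorted

plt.plot(k_distances)
plt.ylabel(f"distance to {k}-th nearest neighbour")
plt.show()   # eps is usually picked around the sharp bend ('knee') of this curve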
Min Samples
Min Samples or Min Points are the number of minimum points to form a
dense region or cluster. For example, if we set the min_samples as 5, we need
at least 5 points to form a dense cluster.
The minimum number of points can be derived from the number of dimensions (D) in the data set; as a general rule, MinPoints >= D + 1.
The DBSCAN algorithm is used to find associations and structures in the data
that are usually hard to find manually.
Algorithm
1. Find all the neighbour points within eps of each point and identify the core points, i.e. the points with at least MinPts neighbours.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
4. Points a and b are said to be density connected if there exists a point c which has a sufficient number of points in its neighbourhood and both a and b are within eps distance of it. This is a chaining process: if b is a neighbour of c, c is a neighbour of d, and d is a neighbour of a, then b is density connected to a.
5. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise.
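Putting it together, a minimal scikit-learn sketch (the eps and min_samples values are illustrative and should be tuned per dataset):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_              # cluster index per point; -1 marks noise/outliers
print(set(labels))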
Gaussian Mixture Model
The K-Means algorithm's inner loop iterates over two steps:
1. Assign each observation Xi to the closest cluster centroid μk.
2. Update each centroid to the mean of the points assigned to it.
So, for K-Means, every data point is assigned to exactly one cluster; this is known as Hard Clustering or hard cluster assignment.
Hard Clustering: In hard clustering, the data points are assigned to one cluster completely. That means there is no consideration of uncertainty in the assignment of a data point.
The limitation of hard clustering is that we force every data point into one of the clusters, even if the data point does not follow the clustering trend completely.
So to overcome this issue, we have a concept of Soft Clustering.
Soft Clustering: In soft clustering, the assignment of a data point to a cluster is based on the probability or likelihood of that data point belonging to that cluster.
For soft clustering, we have an algorithm called GMM, or the Gaussian Mixture Model.
Algorithm
1.Initialize the model: Choose the number of clusters (K) and initialize the parameters
of each Gaussian distribution (mean and covariance matrix) and the mixture weights.
2.E-step (Expectation step): Estimate the probability of each data point belonging to
each of the K Gaussian distributions using Bayes' rule
3.M-step (Maximization step): Update the parameters of each Gaussian distribution
(mean and covariance matrix) and the mixture weights based on the probabilities
estimated in the E-step.
4.Repeat steps 2 and 3 until convergence: Repeat the E-step and M-step until the
change in the log-likelihood of the data is below a certain threshold. This ensures
that the model has converged to a stable solution.
5.Assign clusters to data points: Once the model has converged, assign each data
point to the cluster with the highest probability of belonging to that cluster.
6.Evaluate the quality of clustering: Use various metrics (e.g., silhouette score) to
evaluate the quality of the clustering and select the optimal number of clusters.
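A minimal scikit-learn sketch of fitting a GMM and reading out the soft assignments (the toy data and K = 3 are only for illustration):

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
probs = gmm.predict_proba(X)     # soft assignment: probability of each point per cluster
labels = probs.argmax(axis=1)    # hard assignment: most probable cluster
print(probs[:3].round(3))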
The Gaussian distribution has a bell-shaped curve with its peak at the mean,
and it is symmetric around the mean. The standard deviation determines the
width of the curve, and the larger the standard deviation, the wider the curve.