Unit 4: Clustering (K-means, K-medoids, KNN)


About This Presentation

Distance measures - Euclidean, Manhattan, Hamming, Minkowski Distance Metric, Different clustering methods (Distance, Density, Hierarchical), K-means clustering Algorithm with example, k-medoid algorithm with example, Performance Measures - Rand Index, K-Nearest Neighbor algorithm


Slide Content

Machine Learning

Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423603
(An Autonomous Institute Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited
Department of Information Technology
NBA Accredited-UG Programme
Ms. K. D. Patil
Assistant Professor

Contents - Clustering
•Distance measures - Euclidean, Manhattan, Hamming, Minkowski Distance Metric; Different clustering methods (Distance, Density, Hierarchical); K-means clustering algorithm with example; K-medoid algorithm with example; Performance Measures - Rand Index; K-Nearest Neighbour algorithm



Course Outcome
•CO4: To apply clustering techniques.

Introduction
•Clustering:
•Clustering is an important part of data analysis, used in the fields of artificial intelligence, deep learning, and data science.
•Clustering is an unsupervised machine learning technique that groups
similar data points together into clusters based on their characteristics,
without using any labeled data.
•The objective is to ensure that data points within the same cluster are
more similar to each other than to those in different clusters, enabling
the discovery of natural groupings and hidden patterns in complex
datasets.

Introduction
•Clustering:
•For example, if we have customer
purchase data, clustering can
group customers with similar
shopping habits. These clusters
can then be used for targeted
marketing, personalized
recommendations or customer
segmentation.



Ref: https://www.geeksforgeeks.org/machine-learning/clustering-in-machine-learning/

Distance Metrics - Introduction
•Distance metrics are the backbone of clustering.
•Distance metrics basically deal with finding the proximity or distance
between data points and determining if they can be clustered together.
•Distance Metrics are used in both supervised and unsupervised learning.
•Distance Metrics are used to calculate the distance between data points
and then define the similarity between them.

Distance Metrics - Properties
•Symmetry:
•If x and y are two points in a metric space, then the distance between x and y should be equal to the distance between y and x:
d(x, y) = d(y, x)
•Triangle Inequality:
•Given three points x, y, and z, the distance metric should satisfy the triangle inequality:
d(x, z) ≤ d(x, y) + d(y, z)

Distance Metrics - Properties
•Non-negativity:
•Distances should always be non-negative, meaning greater than or equal to zero:
d(x, y) ≥ 0
•The above inequality holds with equality (d(x, y) = 0) if and only if x and y denote the same point, i.e., x = y.
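•As a quick illustration, the three properties can be checked numerically in Python. A minimal sketch using SciPy (the sample points are our own, chosen arbitrarily):

from scipy.spatial import distance

x, y, z = [3, 6, 9], [1, 0, 1], [2, 2, 2]

# symmetry: d(x, y) = d(y, x)
assert distance.euclidean(x, y) == distance.euclidean(y, x)
# triangle inequality: d(x, z) <= d(x, y) + d(y, z)
assert distance.euclidean(x, z) <= distance.euclidean(x, y) + distance.euclidean(y, z)
# non-negativity: d(x, y) >= 0, with equality only for identical points
assert distance.euclidean(x, y) >= 0 and distance.euclidean(x, x) == 0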

Distance Metrics - Types
•We can calculate the distance between points and then define the similarity
between them
•Types of Distance Metrics in Machine Learning are as follows:
•Euclidean Distance
•Manhattan Distance
•Minkowski Distance
•Hamming Distance

Distance Metrics - Euclidean Distance
•Euclidean Distance represents the shortest distance between two vectors/ any
two points in a metric space.
•It is the square root of the sum of squares of differences between corresponding
elements.
•We use Euclidean distance when calculating the distance between two rows of data that have numerical values, such as floating point or integer values.
•If columns have values with differing scales, it is common to normalize or
standardize the numerical values across all columns prior to calculating the
Euclidean distance.

Distance Metrics - Euclidean Distance
•Consider two points x and y in a two-dimensional plane with coordinates (x1,
x2) and (y1, y2) respectively.
•This distance is given by the square root of the sum of the squared differences
between the corresponding coordinates of the two points.
•Mathematically, the Euclidean distance between the points x and y in a two-dimensional plane is given by:
d(x, y) = √[(x1 − y1)² + (x2 − y2)²]

Distance Metrics - Euclidean Distance
•Example 1: Find the Euclidean distance between data points P(3, 2) and Q(4, 1).
•Solution:
•Given: P(3, 2) = (x1,y1) Q(4, 1) =(x2,y2)
•Using Euclidean distance formula,


•PQ = √[(4 − 3)² + (1 − 2)²]
•PQ = √[(1)² + (−1)²]
•PQ = √2 units.

Distance Metrics - Euclidean Distance
•Euclidean distance in Python:
•The Euclidean distance and the other distance metrics can be computed using convenience functions from the spatial module in SciPy.
•As a first step, let's import distance from SciPy's spatial module:
from scipy.spatial import distance

# two points in 3-dimensional space
x = [3, 6, 9]
y = [1, 0, 1]

# √[(3−1)² + (6−0)² + (9−1)²] = √104
print(distance.euclidean(x, y))
Output >> 10.198039027185569
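•Equivalently, the same value can be computed straight from the definition; a minimal NumPy sketch (not from the slides):

import numpy as np

x = np.array([3, 6, 9])
y = np.array([1, 0, 1])

# square root of the sum of squared coordinate differences
print(np.sqrt(np.sum((x - y) ** 2)))   # 10.198039027185569
print(np.linalg.norm(x - y))           # same value via the L2 norm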

Distance Metrics - Manhattan Distance
•The Manhattan distance, also called taxicab distance or city-block distance, is
another popular distance metric.
•Manhattan Distance is the sum of absolute differences between points across all
the dimensions.
•Manhattan Distance calculates the distance between two real-valued vectors.
•It is perhaps more useful to vectors that describe objects on a uniform grid, like a
chessboard or city blocks.
•It might make sense to calculate Manhattan distance instead of Euclidean
distance for two vectors in an integer feature space.


Distance Metrics - Manhattan Distance
•The Manhattan distance between the points x and y is given by:
Manhattan Distance = |X2 − X1| + |Y2 − Y1|
•In n-dimensional space, where each point has n coordinates, the Manhattan distance is given by:
d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|

Distance Metrics - Manhattan Distance
Example 1: Find the distance between data points P(6, 4) and Q(8, 2).
•Solution:
•Given: P(6, 4) = (x1,y1) Q(8, 2) =(x2,y2)
•Using the Manhattan distance formula,
d = |(x2 – x1)| +|(y2 – y1)|
d = |(8 – 6)| +|(2 – 4)|
d = |2| +|-2|
d = 4

Distance Metrics - Manhattan Distance
•Manhattan distance in Python:
from scipy.spatial import distance

x = [3, 6, 9]
y = [1, 0, 1]

# |3−1| + |6−0| + |9−1| = 2 + 6 + 8 = 16
print(distance.cityblock(x, y))
Output >> 16

Distance Metrics - Minkowski Distance
•It is a generalization of the Euclidean and Manhattan distance measures and adds a parameter, called the "order" or "p", that allows different distance measures to be calculated:
d(x, y) = (|x1 − y1|^p + |x2 − y2|^p + … + |xn − yn|^p)^(1/p)
•It is pretty straightforward to see that,
•for p = 1, the Minkowski distance equation takes the same form as that of
Manhattan distance.
•for p = 2, the Minkowski distance is equivalent to the Euclidean distance.


Distance Metrics - Minkowski Distance
Example: Find the distance between data points in 2D space. Point A: (2, 3), Point B: (5, 7).
•Solution:
•For p = 1 (Manhattan): d = |5 − 2| + |7 − 3| = 3 + 4 = 7
•For p = 2 (Euclidean): d = √[(5 − 2)² + (7 − 3)²] = √(9 + 16) = 5

Distance Metrics - Minkowski Distance
•Minkowski distance in Python:
from scipy.spatial import distance

x = [3, 6, 9]
y = [1, 0, 1]

# p = 1 reduces to the Manhattan distance
print(distance.minkowski(x, y, p=1))
Output >> 16.0

# p = 2 reduces to the Euclidean distance
print(distance.minkowski(x, y, p=2))
Output >> 10.198039027185569

Distance Metrics - Hamming Distance
•Hamming Distance measures the similarity between two strings of the same
length.
•The Hamming Distance between two strings of the same length is the number of
positions at which the corresponding characters are different.
•This specific metric is useful when calculating the distance in between
observations for which we have only binary features.
•This means that the Hamming distance is best calculated when all of the features
of our data take either 0 or 1 as a value.
•It is basically a metric used to compare binary strings (understanding for strings
here a row of binary features).

Distance Metrics - Hamming Distance
•When we want to calculate the similarity between two different users to see if they go in the same cluster, we can easily compute the Hamming Distance by counting the number of features which have different values.

Distance Metrics - Hamming Distance
•Example: Let’s say we have two strings: "euclidean" and "manhattan"
•Since the length of these strings is equal, we can calculate the Hamming Distance.
•We will go character by character and match the strings.
•The first character of both the strings (e and m, respectively) is different.
•Similarly, the second character of both the strings (u and a) is different. and so on.
•Look carefully - seven characters are different, whereas two characters (the last two
characters) are similar: euclidean and manhattan
•Hence, the Hamming Distance here will be 7.
•Note that the larger the Hamming Distance between two strings, the more dissimilar
those strings will be (and vice versa).

Distance Metrics - Hamming Distance
•Hamming distance in Python:
from scipy.spatial import distance

string_1 = 'euclidean' # defining string1
string_2 = 'manhattan' # defining string2

# distance.hamming returns the *proportion* of differing positions,
# so multiply by the string length to get the count of mismatches
hamming_distance = distance.hamming(list(string_1), list(string_2)) * len(string_1)
print('Hamming Distance b/w', string_1, 'and', string_2, 'is:', hamming_distance)

Hamming Distance b/w euclidean and manhattan is: 7.0
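•The same count can also be obtained without SciPy; a minimal plain-Python sketch:

string_1 = 'euclidean'
string_2 = 'manhattan'

# count positions where the corresponding characters differ
print(sum(c1 != c2 for c1, c2 in zip(string_1, string_2)))   # 7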

Types of Clustering
•Clustering broadly divides into two subgroups:
•Hard Clustering:
•Each input data point either fully belongs to exactly one cluster, no overlap is allowed.
•Example: If clustering customer data into 2 segments, each customer belongs fully to
either Cluster 1 or Cluster 2 without partial memberships.
•Soft Clustering:
•Rather than assigning each data point to a distinct cluster, it assigns a probability or
likelihood of the data point being in those clusters.
•Example: A data point may have a 70% membership in Cluster 1 and 30% in Cluster 2,
reflecting uncertainty or overlap in group characteristics.
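•As an illustration of soft clustering, a Gaussian mixture model returns per-cluster membership probabilities instead of hard assignments. A minimal scikit-learn sketch (the toy data below is ours, purely for illustration):

import numpy as np
from sklearn.mixture import GaussianMixture

# toy 1-D data with two loose groups and one borderline point
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [3.0]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.predict(X))        # hard labels, as in hard clustering
print(gmm.predict_proba(X))  # soft memberships; each row sums to 1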

Types of Clustering Methods
•Density Clustering:
•Density-based clustering method considers density ahead of distance.
•Data is clustered by regions of high concentrations of data objects bounded by
areas of low concentrations of data objects.
•The clusters formed are grouped as a maximal set of connected data points.
•The clusters formed vary in arbitrary shapes and sizes and contain a
maximum degree of homogeneity due to similar density.
•This clustering approach handles the noise and outliers in the datasets effectively.


Types of Clustering Methods
•Hierarchical Clustering:
•Hierarchical clustering, as the name suggests, is an algorithm that builds a
hierarchy of clusters.
•This algorithm starts with all the data points assigned to a cluster of their own.
•Then two nearest clusters are merged into the same cluster. In the end, this
algorithm terminates when there is only a single cluster left.
•The results of hierarchical clustering can be shown using a dendrogram.
•This algorithm can be implemented bottom-up (agglomerative, merging clusters as described above) or top-down (divisive), starting with all data points assigned to one cluster and recursively performing splits until each data point forms its own cluster.


K-Means Clustering Algorithm
•Introduction:
•Clustering is the process of dividing the entire data into groups (also known as
clusters) based on the patterns in the data.
•In clustering, we do not have a target to predict. We look at the data, try to club
similar observations, and form different groups. Hence it is an unsupervised
learning problem.
•K-means clustering is an unsupervised machine learning algorithm widely used for cluster analysis, where the aim is to partition a set of objects into K clusters in such a way that the sum of the squared distances between the objects and their assigned cluster mean is minimized.
•K-means clustering divides data into a predefined number of clusters.

K-Means Clustering Algorithm
•Example:
•Let’s try understanding this with a simple example. A bank wants to give credit card offers to
its customers. Currently, they look at the details of each customer and, based on this
information, decide which offer should be given to which customer.
•Now, the bank can potentially have millions of customers. Does it make sense to look at the
details of each customer separately and then make a decision? Certainly not! It is a manual
process and will take a huge amount of time.
•So what can the bank do? One option is to segment its customers into different groups. For
instance, the bank can group the customers based on their income.

K-Means Clustering Algorithm
•Properties of Clusters:
•Case I: All the data points in a cluster should be similar to each other.
•Case II: The data points from different clusters should be as different as
possible. This will intuitively make sense if you’ve grasped the above property.

K-Means Clustering Algorithm
•Algorithm:
•K-means clustering is a method for grouping n observations into K clusters.
•It uses vector quantization and aims to assign each observation to the cluster
with the nearest mean or centroid, which serves as a prototype for the cluster.
•Originally developed for signal processing, K-means clustering is now widely
used in machine learning to partition data points into K clusters based on their
similarity.
•The goal is to minimize the sum of squared distances between the data points
and their corresponding cluster centroids, resulting in clusters that are internally
homogeneous and distinct from each other.
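•Formally, with clusters C1, …, CK and centroids μ1, …, μK, the objective k-means minimizes is the within-cluster sum of squares (a standard formulation, stated here for reference):
J = Σ (k = 1..K) Σ (x ∈ Ck) ‖x − μk‖²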

K-Means Clustering Algorithm
•Algorithm:
•K-means is a centroid-based algorithm or a distance-based algorithm, where we
calculate the distances to assign a point to a cluster.
•In K-Means, each cluster is associated with a centroid.
•Optimization plays a crucial role in the k-means clustering algorithm.
•The goal of the optimization process is to find the best set of centroids that
minimizes the sum of squared distances between each data point and its closest
centroid.
•This process is repeated multiple times until convergence, resulting in the
optimal clustering solution.

K-Means Clustering Algorithm
•Steps to apply K-means clustering algorithm:
•We have these 8 points, and we want to apply k-means to create clusters for these
points. Here's how we can do it.
•Step 1: Choose the number of clusters k
•The first step in k-means is to pick the number of clusters, k.
•Step 2: Select k random points from the data as centroids
•Next, we randomly select the centroid for each cluster. Let's say we want to
have 2 clusters, so k is equal to 2 here. We then randomly select the centroid:
•Here, the red and green circles represent the centroid for these clusters.

K-Means Clustering Algorithm
•Steps to apply K-means clustering algorithm:
•Step 3: Assign all the points to the closest cluster centroid
•Once we have initialized the centroids, we assign each point to the closest cluster
centroid:




•Here you can see that the points closer to the red point are assigned to the red
cluster, whereas the points closer to the green point are assigned to the green cluster.

K-Means Clustering Algorithm
•Steps to apply K-means clustering algorithm:
•Step 4: Recompute the centroids of newly formed clusters
•Now, once we have assigned all of the points to either cluster, the next step is to
compute the centroids of newly formed clusters:




•Here, the red and green crosses are the new centroids.

K-Means Clustering Algorithm
•Steps to apply K-means clustering algorithm:
•Step 5: Repeat steps 3 and 4
•We then repeat steps 3 and 4: reassign the points and recompute the centroids of the newly formed clusters.




•The step of computing the centroid and assigning all the points to the cluster based
on their distance from the centroid is a single iteration.

K-Means Clustering Algorithm
•Stopping Criteria for K-Means Clustering:
•There are essentially three stopping criteria that can be adopted to stop the K-
means algorithm:
•Criteria 1: Centroids of newly formed clusters do not change
•We can stop the algorithm if the centroids of newly formed clusters are not
changing.
•Even after multiple iterations, if we are getting the same centroids for all
the clusters, we can say that the algorithm is not learning any new pattern,
and it is a sign to stop the training.

K-Means Clustering Algorithm
•Stopping Criteria for K-Means Clustering:
•Criteria 2: Points remain in the same cluster
•Another clear sign that we should stop the training process is if the points
remain in the same cluster even after training the algorithm for multiple
iterations.
•Criteria 3: Maximum number of iterations is reached
•Finally, we can stop the training if the maximum number of iterations is
reached.
•Suppose we have set the number of iterations as 100.
•The process will repeat for 100 iterations before stopping.
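•Putting the steps and stopping criteria together, here is a minimal NumPy sketch of k-means (illustrative only; it assumes no cluster becomes empty, and in practice one would typically use sklearn.cluster.KMeans):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and pick k random data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):                                 # Criterion 3: max iterations
        # Step 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the centroid of each newly formed cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):              # Criterion 1: centroids unchanged
            break
        centroids = new_centroids
    return labels, centroids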

K-Means Clustering Algorithm
•Advantages:
•Simple and easy to implement: The k-means algorithm is easy to understand
and implement, making it a popular choice for clustering tasks.
•Fast and efficient: K-means is computationally efficient and can handle large
datasets with high dimensionality.
•Scalability: K-means can handle large datasets with a large number of data
points and can be easily scaled to handle even larger datasets.
•Flexibility: K-means can be easily adapted to different applications and can be
used with different distance metrics and initialization methods.

K-Means Clustering Algorithm
•Disadvantages:
•Sensitivity to initial centroids: K-means is sensitive to the initial selection of
centroids and can converge to a suboptimal solution.
•Requires specifying the number of clusters: The number of clusters k needs to
be specified before running the algorithm, which can be challenging in some
applications.
•Sensitive to outliers: K-means is sensitive to outliers, which can have a
significant impact on the resulting clusters.

K-Means Clustering Algorithm
•Applications:
•K-Means clustering is used in a variety of examples or business cases in real life, like:
•Academic performance: Based on the scores, students are categorized into grades like
A, B, or C.
•Diagnostic systems: The medical profession uses k-means in creating smarter medical
decision support systems, especially in the treatment of liver ailments.
•Search engines: Clustering forms a backbone of search engines. When a search is
performed, the search results need to be grouped, and the search engines very often use
clustering to do this.
•Wireless sensor networks: The clustering algorithm plays the role of finding the cluster heads, which collect all the data in their respective clusters.

K-Means Clustering Algorithm
•How to Choose the Right Number of Clusters in K-Means Clustering?
•The maximum possible number of clusters will be equal to the number of
observations in the dataset.
•But then, how can we decide the optimum number of clusters? One thing we
can do is plot a graph, also known as an elbow curve, where the x-axis will
represent the number of clusters and the y-axis will be an evaluation metric.
•We will increase the number of clusters, train the model again, and plot the inertia value.
•The cluster value where this decrease in inertia value becomes constant can be chosen as the right cluster value for our data.
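•A minimal sketch of the elbow method with scikit-learn (X is assumed to be a numeric feature matrix defined elsewhere; it is not part of the slides):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X: (n_samples, n_features) array, defined elsewhere
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # sum of squared distances to the closest centroid

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()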


K-Means Clustering Algorithm
•Example: Apply K(=2)-Means algorithm over the data (185, 72), (170, 56), (168, 60),
(179,68), (182,72), (188,77) up to two iterations and show the clusters. Initially choose first
two objects as initial centroids. (Ref: https://medium.com/@karna.sujan52/k-means-algorithm-solved-numerical-3c94d25076e8)
•Solution: Given, number of clusters to be created (K) = 2 say c1 and c2,
number of iterations = 2 and
The given data points (call them x1 through x6) are: x1 (185, 72), x2 (170, 56), x3 (168, 60), x4 (179, 68), x5 (182, 72), x6 (188, 77).
First two objects as initial centroids:
Centroid for first cluster c1 = (185, 72)
Centroid for second cluster c2 = (170, 56)

K-Means Clustering Algorithm
•Iteration 1: Now calculating similarity using the Euclidean distance measure (values recomputed from the given points):

Point          Dist. from c1 (185, 72)   Dist. from c2 (170, 56)   Assigned cluster
x1 (185, 72)   0.00                      21.93                     c1
x2 (170, 56)   21.93                     0.00                      c2
x3 (168, 60)   20.81                     4.47                      c2
x4 (179, 68)   7.21                      15.00                     c1
x5 (182, 72)   3.00                      20.00                     c1
x6 (188, 77)   5.83                      27.66                     c1

The resulting clusters after the first iteration are:
c1: {(185, 72), (179, 68), (182, 72), (188, 77)}
c2: {(170, 56), (168, 60)}

K-Means Clustering Algorithm
•Iteration 2: Now calculating the centroid for each cluster:
c1 = ((185 + 179 + 182 + 188)/4, (72 + 68 + 72 + 77)/4) = (183.5, 72.25)
c2 = ((170 + 168)/2, (56 + 60)/2) = (169, 58)
•Now, again calculating similarity:
K-Means Clustering Algorithm
•Iteration 2: distances from the new centroids (values recomputed from the given points):

Point          Dist. from c1 (183.5, 72.25)   Dist. from c2 (169, 58)   Assigned cluster
x1 (185, 72)   1.52                           21.26                     c1
x2 (170, 56)   21.13                          2.24                      c2
x3 (168, 60)   19.76                          2.24                      c2
x4 (179, 68)   6.19                           14.14                     c1
x5 (182, 72)   1.52                           19.10                     c1
x6 (188, 77)   6.54                           26.87                     c1

The resulting clusters after the second iteration are:
c1: {(185, 72), (179, 68), (182, 72), (188, 77)}
c2: {(170, 56), (168, 60)}
As we have already completed the two iterations asked for in the question, the numerical example ends here.

Since the clustering doesn't change after the second iteration, we would terminate the iterations here even if the question didn't say so.
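•The same example can be reproduced in code. A small scikit-learn sketch, passing the first two objects as explicit initial centroids (note: KMeans runs to convergence rather than exactly two iterations, but for this data the assignments stop changing after iteration two):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77]])
init = np.array([[185, 72], [170, 56]])   # first two objects as initial centroids

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # expected: [0 1 1 0 0 0]
print(km.cluster_centers_)  # expected: [[183.5, 72.25], [169., 58.]]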

K-Medoid Clustering Algorithm
•Introduction:
•K-Medoids is an unsupervised clustering algorithm using the partition method of
clustering.
•It is an improvised version of the K-Means clustering algorithm especially used
to handle outlier data.
•It requires unlabelled data to work with.
•In the K-Medoids algorithm, each cluster is represented by one of its actual data points, called a medoid. The medoids serve as cluster centres.
•The medoid is the point whose sum of distances from all other points in the same cluster is minimum. For distance, any suitable metric like Euclidean distance or Manhattan distance can be used.

K-Medoid Clustering Algorithm
•Steps to apply the K-Medoids clustering algorithm:
•Step 1: First, we select K random data points from the dataset and use them as
medoids.
•Step 2: Now, we will calculate the distance of each data point from the medoids. You can use any suitable metric, such as Euclidean or Manhattan distance.
•Step 3: Once we find the distance of each data point from the medoids, we will
assign the data points to the clusters associated with each medoid. The data
points are assigned to the medoids at the closest distance.
•Step 4: After determining the clusters, we will calculate the sum of the distance
of all the non-medoid data points to the medoid of each cluster. Let the cost be
Ci.

K-Medoid Clustering Algorithm
•Steps to apply the K-Medoids clustering algorithm (continued):
•Step 5: Now, we will select a random data point Dj from the dataset and swap it
with a medoid Mi. Here, Dj becomes a temporary medoid. After swapping, we
will calculate the distance of all the non-medoid data points to the current
medoid of each cluster. Let this cost be Cj.
•Step 6: If Ci>Cj, the current medoids with Dj as one of the medoids are made
permanent medoids. Otherwise, we undo the swap, and Mi is reinstated as the
medoid.
•Step 7: Repeat steps 4 to 6 until no change occurs in the clusters.

K-Medoid Clustering Algorithm
•The K-Medoids algorithm is applied in such a way that:
•A single point can belong to only one cluster
•Each cluster has at least one point
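•A minimal NumPy sketch of the swap procedure above (a simplified PAM-style implementation assuming Manhattan distance; a maintained alternative is KMedoids from the scikit-learn-extra package):

import numpy as np

def k_medoids(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # pairwise Manhattan distances between all points
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    medoids = rng.choice(n, size=k, replace=False)   # Step 1: k random medoids

    def cost(meds):
        # Steps 2-4: assign each point to its nearest medoid, sum the distances
        return D[:, meds].min(axis=1).sum()

    current = cost(medoids)
    for _ in range(max_iters):
        improved = False
        # Steps 5-6: try swapping each medoid with each non-medoid point,
        # keeping the swap only if the total cost decreases
        for i in range(k):
            for j in range(n):
                if j in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = j
                c = cost(trial)
                if c < current:
                    medoids, current, improved = trial, c, True
        if not improved:   # Step 7: stop when no swap improves the cost
            break
    labels = D[:, medoids].argmin(axis=1)
    return X[medoids], labels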

K-Medoid Clustering Algorithm
•Advantages:
•K-Medoids clustering is guaranteed to converge. Hence, we are guaranteed to
get results when we perform k-medoids clustering on any dataset.
•K-Medoids clustering doesn't apply to a specific domain. Owing to the
generalization, we can use k-medoids clustering in different machine learning
applications ranging from text data to Geo-spatial data and financial data to e-
commerce data.
•The medoids in k-medoids clustering are selected randomly. We can choose
the initial medoids in a way such that the number of iterations can be
minimized. For improving the performance of the algorithm, you can warm
start the choice of medoids by selecting specific data points as medoids after
data analysis.

K-Medoid Clustering Algorithm
•Advantages:
•Compared to other partitioning clustering algorithms such as K-median and k-
modes clustering, the k-medoids clustering algorithm is faster in execution.
•K-Medoids clustering algorithm is very robust and it effectively deals with
outliers and noise in the dataset. Compared to k-means clustering, the k-
medoids clustering algorithm is a better choice for analyzing data with
significant noise and outliers.

K-Medoid Clustering Algorithm
•Disadvantages:
•It is not suitable for clustering non-spherical (arbitrarily shaped) groups of
objects. This is because it relies on minimizing the distances between the non-
medoid objects and the medoid (the cluster center) - briefly, it uses compactness as the clustering criterion instead of connectivity.
•It may obtain different results for different runs on the same dataset because
the first k medoids are chosen randomly.

K-Medoid Clustering Algorithm
•Applications:
•E-commerce websites, supermarkets, and malls use transaction data for
customer segmentation based on buying behavior. It helps them to increase sales
by recommending products according to user behavior using recommendation
systems.
•K-Medoids clustering can also be used in cyber profiling. It involves grouping
people based on their connection to each other. By collecting usage data from
individual users, you can cluster them into groups for profiling.
•By grouping similar pixels into clusters, you can also perform image
segmentation using k-medoids clustering.

K-Medoid Clustering Algorithm
•Applications:
•You can also perform fraud detection on banking and insurance data using k-
medoids clustering. By using historical data on frauds, you can find probable
fraudulent transactions from a given dataset.
•You can also use k-medoids clustering in various other tasks such as social
media profiling, ride-share data analysis, identification of localities with a high
rate of crime, etc.

K-Medoid Clustering Algorithm
•Example 1:
•For the given dataset (the ten points A1-A10 listed under Example 2 below), apply the K-Medoids clustering algorithm to form two clusters. Use Manhattan distance to find the distance between each data point and the medoids.

K-Medoid Clustering Algorithm
•Step 1:
•Select two medoids: C1 = (3, 4) and C2 = (7, 4)
•Manhattan Distance = |x1 − x2| + |y1 − y2|
•Mdist[(2,6), (3,4)] = |2 − 3| + |6 − 4| = 3
•Computing the distance of each data point from C1 and C2 in the same way (values recomputed from the given points):

Point     Dist. from C1 (3, 4)   Dist. from C2 (7, 4)   Closer medoid
(2, 6)    3                      7                      C1
(3, 8)    4                      8                      C1
(4, 7)    4                      6                      C1
(6, 2)    5                      3                      C2
(6, 4)    3                      1                      C2
(7, 3)    5                      1                      C2
(7, 4)    4                      0                      C2
(8, 5)    6                      2                      C2
(7, 6)    6                      2                      C2
(3, 4)    0                      4                      C1

K-Medoid Clustering Algorithm
•Step 2:
•So the clusters are,
C1: {(2,6), (3,4), (3,8), (4,7)}
C2: {(6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7,6)}

K-Medoid Clustering Algorithm
•Step 3:
•Randomly select one non-medoid point and
recalculate the cost.
•C1=(3, 4) and C2=(7, 4)
•O=(7, 3)

•Swap C2 with O
•New Medoids
•C1=(3, 4) and O=(7, 3)

K-Medoid Clustering Algorithm
•Step 3 (continued):
•New Medoids C1 = (3, 4) and O = (7, 3)
•Manhattan Dist = |x1 − x2| + |y1 − y2|
•Mdist[(2, 6), (7, 3)] = |2 − 7| + |6 − 3| = 8
•Accordingly, we calculate the distance of the remaining data points from the new medoids (values recomputed from the given points):

Point     Dist. from C1 (3, 4)   Dist. from O (7, 3)   Closer medoid
(2, 6)    3                      8                     C1
(3, 8)    4                      9                     C1
(4, 7)    4                      7                     C1
(6, 2)    5                      2                     O
(6, 4)    3                      2                     O
(7, 3)    5                      0                     O
(7, 4)    4                      1                     O
(8, 5)    6                      3                     O
(7, 6)    6                      3                     O
(3, 4)    0                      5                     C1

K-Medoid Clustering Algorithm
•Step 3:
•So, new clusters are:

C1: {(2,6), (3,4), (3,8), (4,7)}
O: {(6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7,6)}

K-Medoid Clustering Algorithm
•Step 4:
•For the clusters,
•C1: {(2,6), (3,4), (3,8), (4,7)}
•O: {(6, 2), (6, 4), (7,3), (7, 4), (8, 5), (7,6)}
•Calculate the Total Cost as Cost(c, x) = Σ |c − x|, the sum of the Manhattan distances from each non-medoid point x to its medoid c:
Current Total Cost = {Cost((3,4), (2,6)) + Cost((3,4), (3,8)) + Cost((3,4), (4,7)) + Cost((7,3), (6,2)) + Cost((7,3), (6,4)) + Cost((7,3), (7,4)) + Cost((7,3), (8,5)) + Cost((7,3), (7,6))}
Current Total Cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22

K-Medoid Clustering Algorithm
•Step 4 (continued):
•Cost of swapping medoid C2 with O:
•S = Current Total Cost − Previous Total Cost
•(The previous total cost, with medoids C1 = (3, 4) and C2 = (7, 4), is 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20.)
•S = 22 − 20 = 2 > 0
•Hence, swapping C2 with O is not a good idea.
•Final Medoids are C1=(3, 4) and C2=(7, 4)
•Clusters are
C1: {(2,6), (3,4), (3,8), (4,7)}
C2: {(6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7,6)}

K-Medoid Clustering Algorithm
•Example 2: The dataset for clustering is as follows.

Point Coordinates
A1 (2,6)
A2 (3,8)
A3 (4,7)
A4 (6,2)
A5 (6,4)
A6 (7,3)
A7 (7,4)
A8 (8,5)
A9 (7,6)
A10 (3,4)
Ref: https://codinginfinite.com/k-medoids-clustering-algorithm-with-numerical-example/

K-Medoid Clustering Algorithm
•Iteration 1:
•Suppose that we want to group the above dataset into two clusters. So, we will
randomly choose two medoids.
•Here, the choice of medoids is important for efficient execution. Hence, we have
selected two points from the dataset that can be potential medoid for the final clusters.
Following are two points from the dataset that we have selected as medoids.
•M1 = (3, 4)
•M2 = (7, 3)
•Now, we will calculate the distance between each data point and the medoids using the
Manhattan distance measure. The results have been tabulated as follows.

K-Medoid Clustering Algorithm
•Iteration 1 (distances recomputed from the given points):

Point        Dist. from M1 (3, 4)   Dist. from M2 (7, 3)   Assigned cluster
A1 (2, 6)    3                      8                      Cluster 1
A2 (3, 8)    4                      9                      Cluster 1
A3 (4, 7)    4                      7                      Cluster 1
A4 (6, 2)    5                      2                      Cluster 2
A5 (6, 4)    3                      2                      Cluster 2
A6 (7, 3)    5                      0                      Cluster 2
A7 (7, 4)    4                      1                      Cluster 2
A8 (8, 5)    6                      3                      Cluster 2
A9 (7, 6)    6                      3                      Cluster 2
A10 (3, 4)   0                      5                      Cluster 1
K-Medoid Clustering Algorithm
•Iteration 1:
•The clusters made with medoids (3, 4) and (7, 3) are as follows.
•Points in cluster1= {(2, 6), (3, 8), (4, 7), (3, 4)}
•Points in cluster 2= {(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}
•After assigning clusters, we will calculate the cost for each cluster and find their sum.
The cost is nothing but the sum of distances of all the data points from the medoid of
the cluster they belong to.
•Hence, the cost for the current cluster will be 3+4+4+2+2+0+1+3+3+0=22.

K-Medoid Clustering Algorithm
•Iteration 2:
•Now, we will select another non-
medoid point (7, 4) and make it a
temporary medoid for the second
cluster. Hence,
•M1 = (3, 4)
•M2 = (7, 4)
•Now, let us calculate the distance between all the data points and the current medoids (values recomputed from the given points):

Point        Dist. from M1 (3, 4)   Dist. from M2 (7, 4)   Assigned cluster
A1 (2, 6)    3                      7                      Cluster 1
A2 (3, 8)    4                      8                      Cluster 1
A3 (4, 7)    4                      6                      Cluster 1
A4 (6, 2)    5                      3                      Cluster 2
A5 (6, 4)    3                      1                      Cluster 2
A6 (7, 3)    5                      1                      Cluster 2
A7 (7, 4)    4                      0                      Cluster 2
A8 (8, 5)    6                      2                      Cluster 2
A9 (7, 6)    6                      2                      Cluster 2
A10 (3, 4)   0                      4                      Cluster 1

K-Medoid Clustering Algorithm
•Iteration 2:
•The data points haven’t changed in the clusters after changing the medoids. Hence,
clusters are:
•Points in cluster1:{(2, 6), (3, 8), (4, 7), (3, 4)}
•Points in cluster 2:{(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}
•Now, let us again calculate the cost for each cluster and find their sum. The total cost this
time will be 3+4+4+3+1+1+0+2+2+0=20.
•Here, the current cost is less than the cost calculated in the previous iteration. Hence, we
will make the swap permanent and make (7,4) the medoid for cluster 2. If the cost this
time was greater than the previous cost i.e. 22, we would have to revert the change. New
medoids after this iteration are (3, 4) and (7, 4) with no change in the clusters.

K-Medoid Clustering Algorithm
•Iteration 3:
•Now, let us again change the medoid of cluster 2 to (6, 4). Hence, the new medoids for the clusters are M1 = (3, 4) and M2 = (6, 4).
•Let us calculate the distance between the data points and the above medoids to find the new clusters (values recomputed from the given points):

Point        Dist. from M1 (3, 4)   Dist. from M2 (6, 4)   Assigned cluster
A1 (2, 6)    3                      6                      Cluster 1
A2 (3, 8)    4                      7                      Cluster 1
A3 (4, 7)    4                      5                      Cluster 1
A4 (6, 2)    5                      2                      Cluster 2
A5 (6, 4)    3                      0                      Cluster 2
A6 (7, 3)    5                      2                      Cluster 2
A7 (7, 4)    4                      1                      Cluster 2
A8 (8, 5)    6                      3                      Cluster 2
A9 (7, 6)    6                      3                      Cluster 2
A10 (3, 4)   0                      3                      Cluster 1

K-Medoid Clustering Algorithm
•Iteration 3:
•Again, the clusters haven’t changed. Hence, clusters are:
•Points in cluster1:{(2, 6), (3, 8), (4, 7), (3, 4)}
•Points in cluster 2:{(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}
•Now, let us again calculate the cost for each cluster and find their sum. The total cost
this time will be 3+4+4+2+0+2+1+3+3+0=22.
•The current cost is 22 which is greater than the cost in the previous iteration i.e. 20.
Hence, we will revert the change and the point (7, 4) will again be made the medoid
for cluster 2.

K-Medoid Clustering Algorithm
•Iteration 3:
•So, the clusters after this iteration will be cluster1 = {(2, 6), (3, 8), (4, 7), (3, 4)}
and cluster 2= {(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}. The medoids are (3,4) and (7,4).
•We keep replacing the medoids with non-medoid data points. The set of medoids for which the cost is least, together with its associated clusters, is made permanent. So, after all the iterations, you will get the final clusters and their medoids.
•The K-Medoids clustering algorithm is a computation-intensive algorithm that requires
many iterations. In each iteration, we need to calculate the distance between the
medoids and the data points, assign clusters, and compute the cost. Hence, K-Medoids
clustering is not well suited for large data sets.

K-NN Algorithm
•Introduction:
•The K-Nearest Neighbors (KNN) algorithm is a popular supervised machine learning technique used for classification and regression tasks.
•It relies on the idea that similar data points tend to have similar labels or values.
•During the training phase, the KNN algorithm stores the entire training dataset
as a reference.
•When making predictions, it calculates the distance between the input data
point and all the training examples, using a chosen distance metric such as
Euclidean distance.
•K represents the number of nearest neighbors that need to be considered while making a prediction.

K-NN Algorithm
•Introduction:
•In the case of classification, the algorithm assigns the most common class label
among the K neighbours as the predicted label for the input data point.
•For regression, it calculates the average or weighted average of the target
values of the K neighbours to predict the value for the input data point.
•KNN Algorithm can be used for both classification and regression predictive
problems.
•However, it is more widely used in classification problems in the industry.

K-NN Algorithm
•Introduction:
•To evaluate any technique, we generally look at 3 important aspects:
•Ease of interpreting output
•Calculation time
•Predictive Power

K-NN Algorithm
•Working of KNN:
•Let’s take a simple case to understand this algorithm. Following is a spread of
red circles (RC) and green squares (GS):
Ref: https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/

K-NN Algorithm
•Working of KNN:
•You intend to find out the class of the blue star (BS). BS can either be RC or GS and nothing else. The "K" in the KNN algorithm is the number of nearest neighbors we wish to take the vote from.
•Let’s say K = 3.
•Hence, we will now make a circle with BS as the center just as big as to enclose only
three data points on the plane.

K-NN Algorithm
•Working of KNN:
•The three closest points to BS are all RC.
•Hence, with a good confidence level, we can say that the BS should belong to the
class RC.
•Here, the choice became obvious as all three votes from the closest neighbor went to
RC.
•The choice of the parameter K is very crucial in this algorithm.

K-NN Algorithm
•We can implement a KNN model by following the below steps:
•Step 1: Load the data
•Step 2: Initialize the value of k
•Step 3: For getting the predicted class, iterate from 1 to total number of training data points
•Calculate the distance between test data and each row of training dataset. Here we will use
Euclidean distance as our distance metric since it’s the most popular method. The other distance
functions or metrics that you can use include Manhattan distance, Minkowski distance,
Chebyshev, cosine, etc. If you have categorical variables, you can use Hamming distance.
•Sort the calculated distances in ascending order based on distance values
•Get top k rows from the sorted array
•Get the most frequent class of these rows
•Return the predicted class
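•A minimal NumPy sketch of these steps (illustrative only; Euclidean distance, simple majority vote, no tie-breaking beyond Counter's ordering):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Step 3a: Euclidean distance from the test point to every training row
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Steps 3b-c: sort the distances in ascending order and take the top k rows
    nearest = np.argsort(dists)[:k]
    # Steps 3d-e: return the most frequent class among the k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]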

K-NN Algorithm
•Advantages:
•Easy to implement as the complexity of the algorithm is not that high.
•Adapts Easily: KNN stores all the training data in memory, so whenever a new example or data point is added, the algorithm adjusts automatically and the new example contributes to future predictions as well.
•Few Hyperparameters: The only parameters required when fitting a KNN model are the value of k and the choice of distance metric.

K-NN Algorithm
•Disadvantages:
•It fails when variables have different scales.
•It is difficult to choose K-value.
•It leads to ambiguous interpretations.
•It is sensitive to outliers and missing values.
•Does not work well with large datasets.
•It does not work well with high dimensions.

K-NN Algorithm
•Applications:
•Credit rating: The KNN algorithm helps determine an individual's credit rating by comparing
them with the ones with similar characteristics.
•Loan approval: identifying individuals who are more likely to default on loans by comparing
their traits with similar individuals.
•Data preprocessing: The KNN algorithm is used for a process called missing data imputation that
estimates the missing values.
•Pattern recognition: Pattern detection is also useful in identifying patterns in customer purchase
behavior.
•Stock price prediction: It's useful in predicting the future value of stocks based on historical data.
•Recommendation systems: It can be used in an online video streaming platform to suggest
content a user is more likely to watch by analyzing what similar users watch.

K-NN Algorithm
•Example:
•From the given dataset, predict the class of x = (Maths=6, CS=8), treated as (X2, Y2) in the distance formula below.

Sr. No.   Maths   CS   Result
1         4       3    Fail
2         6       7    Pass
3         7       8    Pass
4         5       5    Fail
5         8       8    Pass

•Step 1: Select K neighbors. K = 3

K-NN Algorithm
•Example:
•Step 2: Calculate the Euclidean distance to the K nearest neighbors.
•Euclidean Distance Formula:
d = √[(X2 − X1)² + (Y2 − Y1)²]
•X2, Y2 = Observed values (the query point)
•X1, Y1 = Actual values (a training point)

K-NN Algorithm
•Example:
•Step 2 (continued): Calculate the Euclidean distance from x = (6, 8) to each training point (values recomputed from the dataset above):

Sr. No.   Point    Distance to (6, 8)
1         (4, 3)   √[(6−4)² + (8−3)²] = √29 ≈ 5.39
2         (6, 7)   √[(6−6)² + (8−7)²] = 1
3         (7, 8)   √[(6−7)² + (8−8)²] = 1
4         (5, 5)   √[(6−5)² + (8−5)²] = √10 ≈ 3.16
5         (8, 8)   √[(6−8)² + (8−8)²] = 2

•Step 3: As per Result, the 3 nearest neighbours are rows 2, 3, and 5: Pass, Pass, Pass.
•We have 3 Pass & 0 Fail, so 3 > 0: the probability of Pass is higher.
•Hence, the given query x = (Maths=6, CS=8) is classified as Pass.
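•The prediction can be reproduced with the knn_predict sketch given earlier (the variable names are ours, not from the slides):

import numpy as np

X_train = np.array([[4, 3], [6, 7], [7, 8], [5, 5], [8, 8]])
y_train = np.array(['Fail', 'Pass', 'Pass', 'Fail', 'Pass'])

print(knn_predict(X_train, y_train, np.array([6, 8]), k=3))  # -> 'Pass'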

K-NN Algorithm
•Example 2:
•Apply the K-Nearest Neighbour classifier to predict whether a patient is diabetic, given the features BMI and Age. Assume K=3. Test example: BMI=43.6, Age=40, Sugar=?
•(Table of training examples with columns BMI, Age, and Sugar, shown in the original slide.)

K-NN Algorithm
•Example 2:
•Step 1: Calculate the Euclidean distance from the test example (BMI=43.6, Age=40) to each training example.
•Euclidean Distance Formula:
d = √[(X2 − X1)² + (Y2 − Y1)²]
K-NN Algorithm
•Example 2:
•Step 2: Identify the class of the top 3 neighbours.
•The majority class is 1, so for the test example, Sugar = 1.



Thank You!!!

Happy Learning!!!