Cluster Analysis K-Means Clustering Typically measured by Euclidean distance
MSridhar18
10 slides
Feb 26, 2025
Cluster Analysis
K-Means Clustering
Overview:
K-Means is a partition-based clustering method that groups data points into k clusters
based on their similarity (typically measured by Euclidean distance).
How It Works:
Initialization: Choose k random centroids.
Assignment: Assign each data point to the nearest centroid.
Update: Recalculate centroids as the mean of the assigned points.
Repeat: Iterate the assignment and update steps until convergence (when centroids
no longer change significantly).
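The four steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name kmeans, the parameters max_iter, tol and seed, and the sample points are all assumptions made for this example. It initializes centroids by sampling k data points, which is one common way to pick "random centroids".

```python
import math
import random

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment: each point joins the cluster of its nearest centroid
        # (Euclidean distance).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Assumption: keep the old centroid if a cluster is empty.
                new_centroids.append(centroids[j])
        # Convergence: stop once no centroid moves more than `tol`.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels

# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
```

The result depends on the random initialization, so in practice the algorithm is often run several times with different seeds and the best run is kept.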
Density-Based Clustering
Overview:
Density-based methods group together data points that are close to each other in
high-density regions. DBSCAN (Density-Based Spatial Clustering of Applications with
Noise) is one of the most popular density-based algorithms.
How It Works:
Core Points: Identify points that have at least a minimum number of neighboring
points (MinPts) within a specified radius (epsilon).
Cluster Expansion: Each core point and all points within epsilon of it are grouped
into one cluster; neighbors that are themselves core points extend the cluster further.
Noise Points: Points that are neither core points nor within epsilon of a core point
are labeled as noise and do not belong to any cluster.
Result: Clusters are formed based on regions of high point density, and the noise
points are excluded.
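The DBSCAN procedure can be sketched as follows. This is a simplified illustration that uses a brute-force neighbor search; the names dbscan, eps, min_pts and NOISE, as well as the sample points, are assumptions made for this example. Here a point counts itself among its neighbors, which is a common convention.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # All points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # Point i is a core point: start a new cluster and expand it.
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
        cluster_id += 1
    return labels

# Hypothetical sample data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With these values the two squares form clusters 0 and 1, and the isolated point (5, 5) is labeled as noise.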
Grid-Based Clustering
Overview:
Grid-based clustering methods divide the data space into a finite number of cells or
grids and then perform clustering based on these grids. The CLIQUE (Clustering in
Quest) algorithm is a popular example.
How It Works:
Grid Division: Divide the data space into a predefined grid (i.e., uniform partitions or
cells).
Density Calculation: Calculate the density of points within each grid cell.
Cluster Formation: Group adjacent cells with high density into clusters.
Result: The algorithm identifies dense regions in the grid and forms clusters by
merging these regions.
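The three steps above can be sketched as a simple grid clustering routine. This is not the CLIQUE algorithm itself, which additionally searches for dense regions in subspaces; it only illustrates grid division, density calculation and merging of adjacent dense cells. The names grid_cluster, cell_size and min_count and the sample points are assumptions made for this example.

```python
from collections import Counter, deque

def grid_cluster(points, cell_size, min_count):
    """Label each point with the id of its dense region, or -1 if sparse."""
    def cell_of(p):
        # Grid division: map a point to the integer index of its cell.
        return tuple(int(c // cell_size) for c in p)

    # Density calculation: count the points that fall in each cell.
    counts = Counter(cell_of(p) for p in points)
    dense = {cell for cell, n in counts.items() if n >= min_count}

    # Cluster formation: flood-fill over dense cells that share a face.
    cell_label = {}
    next_id = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = next_id
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for axis in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if nb in dense and nb not in cell_label:
                        cell_label[nb] = next_id
                        queue.append(nb)
        next_id += 1
    # Points in cells below the density threshold get label -1.
    return [cell_label.get(cell_of(p), -1) for p in points]

# Hypothetical sample data on a grid with unit-sized cells.
points = [(0.1, 0.2), (0.4, 0.9), (1.2, 0.3), (1.7, 0.8),
          (5.5, 5.5), (5.1, 5.9), (9.0, 0.5)]
labels = grid_cluster(points, cell_size=1.0, min_count=2)
```

The first four points fall into two adjacent dense cells and are merged into one cluster, the next two form a second cluster, and the last point sits alone in a sparse cell.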
Introduction to Outliers
Types of Outliers
Global Outliers (Point Outliers):
•These are individual data points that are far removed from the other observations in the
dataset. For example, a person with an income of $1 million when most people earn
between $50,000 and $100,000.
Contextual Outliers (Conditional Outliers):
•A data point may be an outlier in one context but not in another. For example, a
temperature of 40°C could be considered an outlier in winter but normal during
summer.
Collective Outliers:
•A group of data points that together deviate from the expected pattern. Even though
each point in the group may not be an outlier by itself, the combination is unusual. For
example, multiple consecutive days of very high or low sales may indicate a collective
anomaly.
Types of Outliers
Outliers can be classified into different types based
on their nature.
Global Outliers (Point Outliers)
Definition:
•Global outliers are individual data points that are
significantly different from the majority of the data in the
entire dataset. They stand out as extreme values when
compared to the rest of the dataset.
Example:
•In a dataset of people's ages ranging from 20 to 60, a
person aged 120 would be a global outlier.
Contextual Outliers (Conditional Outliers)
Definition:
•Contextual outliers are data points that are considered
outliers only within a specific context or condition. A data
point that might appear normal in one context may be an
outlier in another context.
Example:
•A temperature of 40°C is a typical value during summer but
would be an outlier in winter. Similarly, a person's income
might be normal in one country but an outlier in another
country.
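One possible way to detect contextual outliers is to compute a Z-score within each context rather than over the whole dataset. The sketch below assumes the context is a season label; the function name contextual_outliers, the threshold of 3 and the temperature readings are hypothetical values chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def contextual_outliers(records, threshold=3.0):
    """records: list of (context, value) pairs. Returns the flagged pairs."""
    # Group the values by context (here: season).
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)
    flagged = []
    for context, value in records:
        values = by_context[context]
        mu, sigma = mean(values), pstdev(values)
        # Compare each value only against others from the same context.
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical temperature readings in degrees Celsius.
winter = [2, 4, 3, 5, 1, 3, 4, 2, 3, 4, 2, 40]
summer = [38, 40, 39, 41, 37, 40, 39, 38, 40, 41, 39, 40]
records = [("winter", t) for t in winter] + [("summer", t) for t in summer]
flagged = contextual_outliers(records)
```

Only the 40°C winter reading is flagged; the same value in summer is close to the summer mean and passes unnoticed.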
Collective Outliers
Definition:
•Collective outliers occur when a group of data points
together is unusual, even though each individual point in
the group may not be an outlier. The combination of these
points forms an outlier when compared to the rest of the
dataset.
Example:
•A series of daily stock prices showing a drastic drop over a
period of time. Each day's drop might not seem unusual in
isolation, but collectively the drops indicate a
significant anomaly.
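A simple way to look for this kind of collective anomaly is to examine sliding windows instead of single values. The sketch below is one illustrative approach among many; the function name collective_windows, the window length of 5, the limit of -8.0 and the daily percentage changes are all hypothetical.

```python
def collective_windows(changes, window, total_limit):
    """Return start indices of windows whose summed change is below total_limit."""
    flagged = []
    for start in range(len(changes) - window + 1):
        # Sum the changes over `window` consecutive days.
        if sum(changes[start:start + window]) < total_limit:
            flagged.append(start)
    return flagged

# Hypothetical daily price changes in percent; no single day is extreme.
daily_change = [0.5, -1.0, 1.5, -2.0, -2.0, -2.0, -2.0, -2.0, 1.0, -0.5]
starts = collective_windows(daily_change, window=5, total_limit=-8.0)
```

Here the five consecutive 2% drops starting at index 3 add up to a 10% decline, so that window is flagged even though each individual day looks ordinary.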
Outlier Detection Methods
•Statistical Methods
•Z-Score: Identifies points more than 3 standard deviations from the mean (|Z| > 3).
•IQR: Flags points outside the range from Q1 − 1.5 × IQR
to Q3 + 1.5 × IQR.
•Distance- and Density-Based Methods
•KNN: Points whose distance to their k nearest neighbors is unusually large are outliers.
•LOF (Local Outlier Factor): Identifies points with lower local
density than neighbors.
•DBSCAN: Groups points into clusters; points outside clusters
are outliers.
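The Z-score, IQR and KNN rules from the list above can be sketched with the Python standard library. The function names, the sample data and the choice of the k-th neighbor distance as the KNN score are assumptions made for this example. Note that statistics.quantiles uses its default "exclusive" method here, and other quartile conventions give slightly different fences.

```python
import math
from statistics import mean, pstdev, quantiles

def zscore_outliers(values, threshold=3.0):
    """Values whose absolute Z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if sigma > 0 and abs(v - mu) / sigma > threshold]

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

def knn_distance(points, k):
    """Distance from each point to its k-th nearest neighbor (larger = more isolated)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical ages: mostly between 20 and 60, plus one person aged 120.
ages = [22, 25, 27, 30, 31, 33, 35, 38, 40, 42, 45, 48, 50, 55, 60, 120]
z_flagged = zscore_outliers(ages)
iqr_flagged = iqr_outliers(ages)

# Hypothetical 2-D points: a tight square and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (6, 6)]
scores = knn_distance(points, k=2)
most_isolated = points[scores.index(max(scores))]
```

Both statistical rules flag the age 120, and the point (6, 6) receives the largest KNN score. The Z-score rule needs a reasonably large sample: with very few values, a single extreme point inflates the standard deviation so much that it can never reach |Z| > 3.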
Model-Based Methods
•Isolation Forest: Points that can be isolated with fewer random
splits than most others are marked as outliers.
•One-Class SVM: Identifies points outside the learned boundary
of normal data.
•Autoencoders: Points with high reconstruction error are outliers.
Visualization Methods
•Box Plot: Points more than 1.5 × IQR below Q1 or above Q3 (beyond the whiskers) are outliers.
•Scatter Plot & Histogram: Isolated points in the plot are
considered outliers.
Clustering-Based Methods
•K-Means: Points far from cluster centroids are outliers.
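The K-Means rule can be sketched as a distance check against centroids that are assumed to come from an earlier K-Means run. The function name centroid_distance_outliers, the fixed threshold of 2.0, the centroids and the sample points are hypothetical; in practice the threshold is often derived from the distribution of distances, for example a high percentile.

```python
import math

def centroid_distance_outliers(points, centroids, threshold):
    """Return the points whose nearest centroid is farther than `threshold`."""
    flagged = []
    for p in points:
        # Distance from the point to the closest cluster centroid.
        nearest = min(math.dist(p, c) for c in centroids)
        if nearest > threshold:
            flagged.append(p)
    return flagged

# Assumed centroids from a previous K-Means run with k = 2.
centroids = [(1.0, 1.0), (8.0, 8.0)]
points = [(1.1, 0.9), (0.8, 1.2), (8.1, 7.9), (7.8, 8.2), (4.5, 4.5)]
outliers = centroid_distance_outliers(points, centroids, threshold=2.0)
```

The point (4.5, 4.5) lies roughly halfway between the two centroids and is the only one flagged.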