Presented by – Rohit Paul Density-Based Clustering Algorithm
Disadvantages of Partitioning Method Choosing k manually Clustering data of varying sizes and density Sensitive to outliers Figure 2 Figure 1
Density-based Method Cluster in a data space is a contiguous region of high point density Separated by lower density of points Density within the areas of noise is assumed to be lower
DBSCAN Density Based Spatial Clustering of Applications with Noise How is the density estimated? Estimates the density by counting the number of points in a fixed-radius neighborhood KEY IDEA: For each point of a cluster the neighborhood of a given radius has to contain at least a minimum number of points Two input parameters required: Epsilon(ε) MinPts
ε-Neighborhood: Objects within a radius ε from an object Three types of points based on ε-Neighborhood : Core Point Border Point Noise Point
Terminology used: Directly density-reachable A point q is directly density-reachable from a core point p if q is within the Eps-neighborhood of p. Density-reachable A point q is density reachable from p if there are a set of core points leading from p to q.
Density-connected Two points p and q are density connected if there are a core point o, such that both p and q are density reachable from o
A cluster satisfy the following conditions: If p be a core point and the set of all point which are density-reachable from p be O. Then, this set O is a cluster with respect to Eps and Minpts . For all p and q in cluster C, p is density-connected to q with respect to Eps and Minpts . Noise, a set of points which do not belongs to any clusters
Pseudo code of DBSCAN Algorithm
Scikit -learn implementation
Reference https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages https://towardsdatascience.com/dbscan-clustering-explained-97556a2ad556 https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html Martin Ester, Hans-Peter Kriegel , J¨org Sander, and Xiaowei Xu . A density-based algorithm for discovering clusters in large spatial databases with noise.