Density based clustering

626 views 29 slides Sep 22, 2019
Slide 1
Slide 1 of 29
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29

About This Presentation

Machine learning


Slide Content

Density based Clustering

Review: Supervised Learning From example pairs, learn a function that maps from input to output Fruit example:

Unsupervised Learning Given a pile of unlabelled data, what can you learn from it?

Examples how many types of fruits are there group types of customers detect anomalies or “odd behavior”

Clustering Most of these applications are related to the task of Clustering. In Classification, we draw separators to differentiate known classes of data. In Clustering, we make up different types of data and draw separators . Types: Connectivity models Centroid models Density Models Distribution models

Non-globular clusters What if the clusters are weird shapes? How would K-means fare for this data? Is there another way we could find clusters? Source:  https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

Non-globular clusters What if the clusters are weird shapes? How would K-means fare for this data? Is there another way we could find clusters? Source:  https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

Heirarchical clustering Much better clusters on this (made-up) data Still not perfect, what should we do with outliers? Does every point need to be assigned to a cluster? Source:  https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

Density-based clustering Uses the density of surrounding points to assign clusters Not every point assigned to a cluster Source:  https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

DBSACN DBSCAN is a density-based algorithm DBScan stands for Density-Based Spatial Clustering of Applications with Noise Density-based Clustering locates regions of high density that are separated from one another by regions of low density Density = number of points within a specified radius (Eps)

Concepts: Preliminary A point is a core point if it has more than a specified number of points (MinPts) within Eps These are points that are at the interior of a cluster A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point A noise point is any point that is not a core point or a border point

Parameter Estimation

Determining the parameters Eps and MinPts The parameters Eps and MinPts can be determined by a heuristic . Observation For points in a cluster , their k-th nearest neighbors are at roughly the same distance . Noise points have the k-th nearest neighbor at farther distance .  Plot sorted distance of every point to its k-th nearest neighbor.

Determining the parameters Eps and MinPts database Procedure Define a function k-dist from the database to the real numbers, mapping each point to the distance from its k-th nearest neighbor . S ort the points of the database in descending order of their k-dist values . k-dist

Determining the parameters Eps and MinPts Procedure Choose an arbitrary point p set Eps = k-dist(p) set MinPts = k . All points with an equal or smaller k-dist value will be cluster points k-dist p cluster points noise

DBSCAN : Advantages

DBSCAN : Disadvantages DBSCAN is not entirely deterministic: Border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data is processed. The quality of DBSCAN depends on the distance measure used in the function regionQuery. (such as Euclidean distance) Has problems of identifying clusters of varying densities (SSN algorithm) If the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.

OPTICS It computes an ordering of all objects in a given database. And It stores the core-distance and a suitable reachability-distance for each object in the database. OPTICS maintains a list called OrderSeeds to generate the output ordering. Objects in OrderSeeds are sorted by the reachability-distance from their respective closest core objects, that is, by the smallest reachability-distance of each object.

Optics Concepts: Core Distance-  the minimum epsilon to make a distinct point a core point, given a finite MinPts parameter. Reachability Distance-  the reachability-distance of an object p with respect to another object o is the smallest distance from o if o is a core object. It also cannot be smaller than the core distance of o.

OPTICS ALGORITHM EXAMPLE A I J L R P B K M N D C E F G H 44  re a ch seedlist: Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 April 30,2012 <number>

OPTICS ALGORITHM EXAMPLE se e d l is t : A I J L R P K M N D C E F G H A 44 re a ch   B c o re- distance (B,40) (I, 40) Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 April 30,2012 <number>

OPTICS ALGORITHM EXAMPLE 44 re a ch  A B A I J L R P B K M N D C E F G H Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 seedlist: (I, 40) (C, 40) April 30,2012 <number>

OPTICS ALGORITHM EXAMPLE 44 re a ch  A I J L R P B K M N D C E F G H A B I Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 seedlist: (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43) April 30,2012 <number>

OPTICS ALGORITHM EXAMPLE 44 re a ch  A I J L R P B K M N D C E F G H A B I J Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 seedlist: (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40) April 30,2012

OPTICS ALGORITHM EXAMPLE 44 re a ch  A I B J L R P K M N D C E F G H A B I J L … Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 seedlist: (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40) April 30,2012

OPTICS ALGORITHM EXAMPLE A I J L R P B K M N D C E F G H A B I J L M K N R P C D F G E H 44 re a ch  Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 seedlist: - April 30,2012

OPTICS ALGORITHM EXAMPLE A I J L R P B K M N D C E F G H A B I J L M K N R P C D F G E H 44 re a ch  Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 seedlist: - April 30,2012

GRAPHICAL REPRESENTATION A data set’s cluster ordering can be represented graphically. It helps to visualize and understand the clustering structure in a data set. 32
Tags