Review: Supervised Learning From example pairs, learn a function that maps from input to output Fruit example:
Unsupervised Learning Given a pile of unlabelled data, what can you learn from it?
Examples how many types of fruits are there group types of customers detect anomalies or “odd behavior”
Clustering Most of these applications are related to the task of Clustering. In Classification, we draw separators to differentiate known classes of data. In Clustering, we make up different types of data and draw separators . Types: Connectivity models Centroid models Density Models Distribution models
Non-globular clusters What if the clusters are weird shapes? How would K-means fare for this data? Is there another way we could find clusters? Source: https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
Non-globular clusters What if the clusters are weird shapes? How would K-means fare for this data? Is there another way we could find clusters? Source: https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
Heirarchical clustering Much better clusters on this (made-up) data Still not perfect, what should we do with outliers? Does every point need to be assigned to a cluster? Source: https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
Density-based clustering Uses the density of surrounding points to assign clusters Not every point assigned to a cluster Source: https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
DBSACN DBSCAN is a density-based algorithm DBScan stands for Density-Based Spatial Clustering of Applications with Noise Density-based Clustering locates regions of high density that are separated from one another by regions of low density Density = number of points within a specified radius (Eps)
Concepts: Preliminary A point is a core point if it has more than a specified number of points (MinPts) within Eps These are points that are at the interior of a cluster A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point A noise point is any point that is not a core point or a border point
Parameter Estimation
Determining the parameters Eps and MinPts The parameters Eps and MinPts can be determined by a heuristic . Observation For points in a cluster , their k-th nearest neighbors are at roughly the same distance . Noise points have the k-th nearest neighbor at farther distance . Plot sorted distance of every point to its k-th nearest neighbor.
Determining the parameters Eps and MinPts database Procedure Define a function k-dist from the database to the real numbers, mapping each point to the distance from its k-th nearest neighbor . S ort the points of the database in descending order of their k-dist values . k-dist
Determining the parameters Eps and MinPts Procedure Choose an arbitrary point p set Eps = k-dist(p) set MinPts = k . All points with an equal or smaller k-dist value will be cluster points k-dist p cluster points noise
DBSCAN : Advantages
DBSCAN : Disadvantages DBSCAN is not entirely deterministic: Border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data is processed. The quality of DBSCAN depends on the distance measure used in the function regionQuery. (such as Euclidean distance) Has problems of identifying clusters of varying densities (SSN algorithm) If the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.
OPTICS It computes an ordering of all objects in a given database. And It stores the core-distance and a suitable reachability-distance for each object in the database. OPTICS maintains a list called OrderSeeds to generate the output ordering. Objects in OrderSeeds are sorted by the reachability-distance from their respective closest core objects, that is, by the smallest reachability-distance of each object.
Optics Concepts: Core Distance- the minimum epsilon to make a distinct point a core point, given a finite MinPts parameter. Reachability Distance- the reachability-distance of an object p with respect to another object o is the smallest distance from o if o is a core object. It also cannot be smaller than the core distance of o.
OPTICS ALGORITHM EXAMPLE A I J L R P B K M N D C E F G H 44 re a ch seedlist: Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 April 30,2012 <number>
OPTICS ALGORITHM EXAMPLE se e d l is t : A I J L R P K M N D C E F G H A 44 re a ch B c o re- distance (B,40) (I, 40) Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 April 30,2012 <number>
OPTICS ALGORITHM EXAMPLE 44 re a ch A B A I J L R P B K M N D C E F G H Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 seedlist: (I, 40) (C, 40) April 30,2012 <number>
OPTICS ALGORITHM EXAMPLE 44 re a ch A I J L R P B K M N D C E F G H A B I Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 seedlist: (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43) April 30,2012 <number>
OPTICS ALGORITHM EXAMPLE 44 re a ch A I J L R P B K M N D C E F G H A B I J Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 seedlist: (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40) April 30,2012
OPTICS ALGORITHM EXAMPLE 44 re a ch A I B J L R P K M N D C E F G H A B I J L … Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 seedlist: (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40) April 30,2012
OPTICS ALGORITHM EXAMPLE A I J L R P B K M N D C E F G H A B I J L M K N R P C D F G E H 44 re a ch Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 seedlist: - April 30,2012
OPTICS ALGORITHM EXAMPLE A I J L R P B K M N D C E F G H A B I J L M K N R P C D F G E H 44 re a ch Example Database (2-dimensional, 16 points) ε = 44, MinPts = 3 seedlist: - April 30,2012
GRAPHICAL REPRESENTATION A data set’s cluster ordering can be represented graphically. It helps to visualize and understand the clustering structure in a data set. 32