This document discusses density-based clustering algorithms. It begins by outlining the limitations of k-means clustering, such as its inability to find non-convex clusters or determine the intrinsic number of clusters. It then introduces DBSCAN, a density-based algorithm that can identify clusters of arbitrary shapes and handle noise. The key definitions and algorithm of DBSCAN are described. While effective, DBSCAN relies on parameter selection and cannot handle varying densities well. OPTICS is then presented as an augmentation of DBSCAN that produces a reachability plot to provide insight into the underlying cluster structure and avoid specifying the cluster count.
3. K-means Clustering
– An unsupervised approach for partitioning a data set into K distinct,
non-overlapping clusters. [Lloyd, 1982]
– We must first specify the desired number of clusters ‘K’.
– Then the K-means algorithm will assign each observation to exactly one of the
K clusters.
– The optimization problem that defines K-means clustering is to minimize the
total within-cluster sum of squared distances to the cluster means.
– Solving this problem exactly is NP-hard.
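The K-means objective is not written out in the text here. In its standard form it can be stated as below, where the symbols C_k (the k-th cluster) and mu_k (its mean) are notation assumed for this sketch rather than taken from the slides:

```latex
\[
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
\]
```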
4. K-means : Algorithm
• Lloyd’s Algorithm
– Mathematically, the assignment step partitions the observations according to
the Voronoi diagram generated by the means.
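A minimal sketch of Lloyd's algorithm in plain Python may help make the Voronoi remark concrete. The function name `lloyd_kmeans`, the random initialisation, and the toy data are assumptions made for illustration, not taken from the slides:

```python
import math
import random


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    # Assumption: initialise the means with k distinct observations chosen at random.
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest mean (its Voronoi cell).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, means[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the means will not move
        labels = new_labels
        # Update step: move each mean to the centroid of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old mean if a cluster ends up empty
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, means


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, means = lloyd_kmeans(points, k=2)
```

On this toy data the two groups end up with different labels. Which group receives label 0 depends on the random initialisation.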
6. Problems with K-means
– K-means partitions the space into
Voronoi cells, which are convex.
– Thus k-means does not perform
well when the clusters are
non-convex.
– We have to provide the number of
clusters beforehand.
– Sometimes, we want to find out
the intrinsic number of clusters
within the dataset.
– No way of handling noise
separately.
7. Problems with K-means
• Non-convex Clusters
• When we do not know the number of clusters.
• To solve these issues, density-based clustering was introduced.
8. DBSCAN
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Inventors:
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu.
• Paper : “A Density-Based Algorithm for Discovering Clusters in Large Spatial
Databases with Noise”
• Presented at the International Conference on Knowledge Discovery and Data
Mining (KDD) in 1996. The KDD conference is now run by SIGKDD, a SIG of ACM.
• Citations: 13,293 (as of 11/04/2018)
• The ‘2014 Test of Time’ award recognized DBSCAN as an influential
contribution to SIGKDD that has withstood the test of time.
• This is an unsupervised algorithm.
9. Definitions
– The shape of a neighborhood is determined by the choice of a distance
measure between two points p and q, denoted by d(p,q).
– For instance, when using the Manhattan distance in 2D space, the shape of
the neighborhood is rectangular (a square standing on one corner).
– For the purpose of proper visualization, all examples will be in 2D space
using the Euclidean distance.
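To illustrate how the distance measure d(p, q) shapes a neighborhood, here is a small sketch. The helper names (`eps_neighborhood`, `euclidean`, `manhattan`) and the grid data are hypothetical and chosen only for this example:

```python
import math


def euclidean(p, q):
    return math.dist(p, q)


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def eps_neighborhood(points, p, eps, d=euclidean):
    """All points q with d(p, q) <= eps; p itself is included."""
    return [q for q in points if d(p, q) <= eps]


# Hypothetical 2D sample: integer grid points around the origin.
grid = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]
origin = (0, 0)

euclid_nb = eps_neighborhood(grid, origin, eps=1.5, d=euclidean)
manhat_nb = eps_neighborhood(grid, origin, eps=1.5, d=manhattan)
```

With the same radius, the diagonal point (1, 1) lies inside the Euclidean neighborhood but outside the Manhattan one, so the two measures yield differently shaped neighborhoods.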
18. Heuristics for Choosing DBSCAN Parameters
– Let d be the distance of a point p to its k-th nearest neighbor. Then the
d-neighborhood of p contains exactly k+1 points for almost all points p.
– For a given k we define a function k-dist (= d) from the database D to the
real numbers, mapping each point to the distance from its k-th nearest
neighbor.
– When sorting the points of the database in descending order of their k-dist
values, the graph of this function gives some hints concerning the density
distribution in the database.
– If we choose an arbitrary point p, set the parameter Eps to k-dist(p) and set
the parameter MinPts to k, all points with an equal or smaller k-dist value
will be core points.
– All points with a higher k-dist value (left of the threshold) are
considered to be noise; all other points (right of the threshold) are assigned
to some cluster.
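The sorted k-dist graph described above can be sketched with a brute-force computation. The function names and the sample data are assumptions for illustration, and the k-th nearest neighbor is taken to exclude the point itself:

```python
import math


def k_dist(points, k):
    """Distance from each point to its k-th nearest neighbor (excluding itself)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def sorted_k_dist(points, k):
    """k-dist values in descending order, as plotted in the sorted k-dist graph."""
    return sorted(k_dist(points, k), reverse=True)


# Hypothetical data: a tight group of five points plus one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
graph = sorted_k_dist(points, k=2)
```

Here the outlier has by far the largest k-dist value, so it sits at the left end of the sorted graph. A threshold placed just after it would treat it as noise and leave the remaining points eligible for clusters.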
19. DBSCAN : Parameter Selection
– The easier-to-set parameter of DBSCAN is the minPts parameter.
– Sander et al. suggest setting it to twice the dataset dimensionality, i.e.,
minPts = 2 · dim.
– Ester et al. provide a heuristic for choosing the ε parameter based on the
distance to the fourth nearest neighbor (for two-dimensional data).
– In Generalized DBSCAN, Sander et al. suggested using the (2 · dim - 1)
nearest neighbors and minPts = 2 · dim
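As a rough sketch, the rule of thumb attributed to Sander et al. above can be written as a tiny helper. The function name `suggest_dbscan_params` and its return format are hypothetical, and the values are only starting points to be checked against a k-dist graph:

```python
def suggest_dbscan_params(dim):
    """Rule-of-thumb starting values following the heuristics quoted above."""
    min_pts = 2 * dim   # minPts = 2 * dim
    k = 2 * dim - 1     # neighbor rank used for the k-dist graph
    return {"min_pts": min_pts, "k": k}


params_2d = suggest_dbscan_params(2)
params_5d = suggest_dbscan_params(5)
```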
20. OPTICS
• Ordering Points To Identify the Clustering Structure (OPTICS)
– Inventors (1999):
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander
– Paper: “OPTICS: Ordering Points To Identify the Clustering Structure”
– OPTICS takes the same ε and minPts parameters as DBSCAN. However, the
ε parameter is theoretically unnecessary and is only used for
the practical purpose of reducing the runtime complexity of the algorithm.
– While DBSCAN may be thought of as a clustering algorithm, searching
for natural groups in data, OPTICS is an augmented ordering algorithm.
– OPTICS introduces two more definitions: core distance and reachability
distance.
– Here, we fix only the minPts parameter and gain insight into the
underlying cluster structure from a plot called the ‘reachability plot’.
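The two additional definitions OPTICS relies on, core distance and reachability distance, can be sketched as follows. This is a brute-force illustration and not the full OPTICS ordering. The function names and data are hypothetical, and the sketch assumes a point counts toward its own minPts neighborhood (implementations differ on this convention):

```python
import math


def core_distance(points, p, min_pts, eps=math.inf):
    """Distance from p to its min_pts-th nearest point, counting p itself.

    Returns None (undefined) when fewer than min_pts points lie within eps.
    """
    dists = sorted(math.dist(p, q) for q in points)  # includes d(p, p) = 0
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]


def reachability_distance(points, o, p, min_pts, eps=math.inf):
    """Reachability distance of o with respect to p: max(core-dist(p), d(p, o))."""
    cd = core_distance(points, p, min_pts, eps)
    if cd is None:
        return None  # p is not a core point, so the value is undefined
    return max(cd, math.dist(p, o))


# Hypothetical data: three nearby points on a line and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
p = (0.0, 0.0)
cd = core_distance(points, p, min_pts=3)
near = reachability_distance(points, (1.0, 0.0), p, min_pts=3)
far = reachability_distance(points, (10.0, 0.0), p, min_pts=3)
capped = core_distance(points, (10.0, 0.0), min_pts=3, eps=5.0)
```

Leaving `eps` at infinity mirrors the remark that ε is theoretically unnecessary. A finite `eps` only cuts off the neighbor search, which is why the distant point has no core distance when `eps=5.0`.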
31. R Package & Examples
• dbscan: Density Based Clustering of Applications with Noise
(DBSCAN) and Related Algorithms
– Published: May 19, 2018
– From the order discovered by OPTICS, two ways to group points into
clusters were discussed.
36. Conclusion
• Reachability plots are helpful to determine the number of clusters.
• Can be applied to find clusters in high-dimensional data (such as images).
• Both DBSCAN and OPTICS are unsupervised techniques.