Fa18_P2.pptx

Department of Electrical Engineering
University of Arkansas
Density-Based Spatial Clustering
Md Abul Hayat
mahayat@uark.edu

OUTLINE
• Introduction
• k-means clustering
• DBSCAN
– Definitions
– Advantages
– Limitations
• OPTICS
– Definitions
– Advantages
– Limitations

K-means Clustering
– An Unsupervised approach for partitioning a data set into K distinct, non-
overlapping clusters. [Lloyd, 1982]
– We must first specify the desired number of clusters ‘K’.
– Then the K-means algorithm will assign each observation to exactly one of the
K clusters.
– The optimization problem that defines K-means clustering,
– The problem is computationally NP –hard.

K-means : Algorithm
• Lloyd’s Algorithm
– Mathematically, this is partitioning the observations according to
the Voronoi diagram generated by the means.

Problems with K-means
– K-means partition the space in
Voronoi cells and they are convex
in nature.
– Thus k-means does not perform
good when we have non-convex
clusters
– We have to provide the number of
clusters beforehand.
– Sometimes, we want to find out
the intrinsic number of clusters
within the dataset.
– No way of handling noise
separately.

Problems with K-means
• Non-convex Clusters
• When we do not know the number of clusters.
• To solve these issues, density based clustering was introduced.

DBSCAN
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Inventors:
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu.
• Paper : “A Density-Based Algorithm for Discovering Clusters in Large Spatial
Databases with Noise”
• Presented at the International Conference of Knowledge Discovery and Data
Mining (KDD) in 1996. KDD is a SIG of ACM.
• Citations: 13,293 (till 11/04/2018)
• The ‘2014 Test of Time’ award recognized DBSCAN as an influential
contributions to SIGKDD that have withstood the test of time.
• This is an unsupervised algorithm.

Definitions
– The shape of a neighborhood is determined by the choice of a distance
measure between two points p and q, denoted by d(p,q).
– For instance, when using the Manhattan distance in 2D space, the shape of
the neighborhood is rectangular.
– For the purpose of proper visualization, all examples will be in 2D space
using the Euclidean distance.

Distance Measures
– If we have two points
– Minkowski Distance:

Definitions
• Introduction
• [ Link: Funny Visualization ]

DBSCAN: Algorithm
• Introduction

DCSCAN : Examples
• Resistant to Noise (unlike k-means)
• Can handle clusters of different shapes and sizes
Original Data After DBSCAN

DBSCAN Limitations
• Introduction
(MinPts=4, Eps=9.92).
(MinPts=4, Eps=9.75)
Original Data
• Cannot handle varying densities.
• Sensitive to parameter selection.

Heuristics for Choosing DBSCAN Parameters
– Let d be the distance of a point p to its k-th nearest neighbor, then the d-
neighborhood of p contains exactly k+1 points for almost all points p.
– For a given k we define a function k-dist (= d) from the database D to the
real numbers, mapping each point to the distance from its k-th nearest
neighbor.
– When sorting the points of the database in descending order of their k-dist
values, the graph of this function gives some hints concerning the density
distribution in the database.
– If we choose an arbitrary point p, set the parameter Eps to k-dist(p) and set
the parameter MinPts to k, all points with an equal or smaller k-dist value
will be core points.
– All points with a higher k-dist value ( left of the threshold) are
considered to be noise, all other points (right of the threshold) are assigned
to some cluster

DBSCAN : Parameter Selection
– The easier-to-set parameter of DBSCAN is the minPts parameter.
– Sander et al. suggest setting it to twice the dataset dimensionality, i.e.,
minPts = 2 · dim.
– Ester et al. provide a heuristic for choosing the ε parameter based on the
distance to the fourth nearest neighbor (for two/dimensional data).
– In Generalized DBSCAN, Sander et al. suggested using the (2 · dim - 1)
nearest neighbors and minPts = 2 · dim

OPTICS
• Ordering Points To Identify the Clustering Structure (OPTICS)
– Inventors: (1999)
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander
– Paper: “OPTICS: Ordering Points To Identify the Clustering Structure”
– OPTICS requires the same ε and minPts parameters as DBSCAN,
however, the ε parameter is theoretically unnecessary and is only used for
the practical purpose of reducing the runtime complexity of the algorithm.
– While DBSCAN may be thought of as a clustering algorithm, searching
for natural groups in data, OPTICS is an augmented ordering algorithm.
– In OPTICS, we have to introduce two more definitions.
– Here, we just fix the minPts parameter and we can get the insight of the
underlying clusters using a plot called ‘Reachability Plot’.

OPTICS : Definitions
• Introduction

OPTICS : Definitions
• Introduction
ε = Generating Distance
ε’ = Core Distance

Reachability Graph : Toy Example

Reachability Graph : Toy Example
• Introduction

R Package & Examples
• dbscan: Density Based Clustering of Applications with Noise
(DBSCAN) and Related Algorithms
– Published: May 19, 2018
– From the order discovered by OPTICS, two ways to group points into
clusters was discussed

Conclusion
• Reachability plots are helpful to determine the number of clusters.
• Can be applied to find clusters in high dim-data (like image).
• DBSCAN and OPTICS, both are unsupervised techniques.

Fa18_P2.pptx

Recommended

Recommended

More Related Content

Similar to Fa18_P2.pptx

Similar to Fa18_P2.pptx (20)

More from Md Abul Hayat

More from Md Abul Hayat (13)

Recently uploaded

Recently uploaded (20)

Fa18_P2.pptx