Density-based clustering finds clusters of arbitrary shape by looking for dense regions of points separated by low-density regions. It includes DBSCAN, which defines clusters in terms of core points that have many nearby neighbors and border points that lie near core points. DBSCAN has two parameters: the neighborhood radius (Eps) and the minimum number of points (MinPts). OPTICS is a density-based algorithm that computes an ordering of all objects together with their reachability distances, so a single neighborhood radius does not have to be fixed in advance.
4. Examples
• how many types of fruits are there
• group types of customers
• detect anomalies or “odd behavior”
5. Clustering
Most of these applications are related to the task of Clustering.
In Classification, we draw separators to differentiate known classes of data.
In Clustering, the classes are not known in advance: we discover groups in the data and draw separators between them.
Types:
Connectivity models
Centroid models
Density models
Distribution models
6. Non-globular clusters
What if the clusters are weird shapes?
How would K-means fare for this data?
Is there another way we could find clusters?
Source: https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
8. Hierarchical clustering
Much better clusters on this (made-up) data
Still not perfect: what should we do with outliers?
Does every point need to be assigned to a cluster?
Source: https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
9. Density-based clustering
Uses the density of surrounding points to assign clusters
Not every point is assigned to a cluster
Source: https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
10. DBSCAN
DBSCAN is a density-based algorithm
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise
Density-based clustering locates regions of high density that are separated from one another by regions of low density
Density = number of points within a specified radius (Eps)
11. Concepts: Preliminary
A point is a core point if it has at least a specified number of points (MinPts) within Eps
These are points that are in the interior of a cluster
A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
A noise point is any point that is neither a core point nor a border point
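The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `classify_points` and the toy data are made up here, and it assumes that a point counts itself among its neighbors and that a core point needs at least MinPts points within Eps.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    Assumption: a point counts itself as a neighbor, and a core point
    needs at least min_pts points within eps.
    """
    # Indices of all points within eps of each point (itself included).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]

    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("border")   # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical toy data: a tight square, one point on its edge, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(points, eps=1.0, min_pts=3)
```

With these values the four square corners come out as core points, (2, 1) as a border point (its only dense neighbor is (1, 1)), and (10, 10) as noise.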
13. Determining the parameters Eps and MinPts
The parameters Eps and MinPts can be determined by a heuristic.
Observation
For points in a cluster, their k-th nearest neighbors are at roughly the same distance.
Noise points have their k-th nearest neighbor at a farther distance.
Plot the sorted distance of every point to its k-th nearest neighbor.
14. Determining the parameters Eps and MinPts
Procedure
• Define a function k-dist from the database to the real numbers, mapping each point to the distance from its k-th nearest neighbor.
• Sort the points of the database in descending order of their k-dist values.
[Figure: sorted k-dist graph of the database]
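15. Determining the parameters Eps and MinPts
Procedure
• Choose a threshold point p (typically the first point in the first "valley" of the sorted k-dist graph), set Eps = k-dist(p), and set MinPts = k.
• All points with an equal or smaller k-dist value will be cluster points; points with a larger k-dist value are treated as noise.
[Figure: sorted k-dist graph with the threshold point p separating noise from cluster points]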
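The k-dist heuristic can be tried out on a small example. The sketch below is only an illustration under stated assumptions: `k_dist` and `sorted_k_dist` are hypothetical helper names, the data is made up, a point is not counted as its own neighbor, and the threshold is read off by hand rather than detected automatically.

```python
from math import dist

def k_dist(points, k):
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    result = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return result

def sorted_k_dist(points, k):
    """k-dist values in descending order, as plotted in the sorted k-dist graph."""
    return sorted(k_dist(points, k), reverse=True)

# Hypothetical toy data: a 3x3 grid (one cluster) plus two isolated points.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10), (20, 0)]
k = 3
curve = sorted_k_dist(points, k)

# Assumption: the threshold point is chosen by eye, just after the sharp drop
# at the start of the curve (the two isolated points come first).
eps = curve[2]
cluster_points = sum(d <= eps for d in curve)
```

Here the first two values of `curve` are large (the isolated points), and the remaining nine are small and nearly equal, so taking Eps at the third value and MinPts = k keeps all nine grid points as cluster points.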
17. DBSCAN: Disadvantages
• DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order in which the data is processed.
• The quality of DBSCAN depends on the distance measure (such as Euclidean distance) used in the function regionQuery.
• It has problems identifying clusters of varying densities (the SNN, or shared nearest neighbor, algorithm addresses this).
• If the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.
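The first disadvantage, order dependence of border points, can be reproduced with a small sketch. The `dbscan` function below is a simplified version written for this illustration (it precomputes all neighborhoods instead of calling a regionQuery function, and counts a point as its own neighbor); the data is made up so that one border point lies within Eps of a core point of each cluster.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Simplified DBSCAN sketch; returns one cluster id per point (NOISE = -1)."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n              # None = not visited yet
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE        # may be claimed later as a border point
            continue
        labels[i] = cluster          # i is a core point: start a new cluster
        seeds = list(neighbors[i])
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: the first cluster to reach it wins
                continue
            if labels[j] is not None:
                continue             # already assigned, never reassigned
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                seeds.extend(neighbors[j])   # j is core, keep expanding
        cluster += 1
    return labels

# Hypothetical data: two small groups sharing one border point at (1, 0).
left = [(0, 0), (-0.5, 0.5), (-0.5, -0.5)]
right = [(2, 0), (2.5, 0.5), (2.5, -0.5)]
border = (1, 0)

order_a = left + [border] + right
order_b = list(reversed(order_a))
labels_a = dict(zip(order_a, dbscan(order_a, eps=1.0, min_pts=4)))
labels_b = dict(zip(order_b, dbscan(order_b, eps=1.0, min_pts=4)))

joins_left_a = labels_a[border] == labels_a[(0, 0)]
joins_left_b = labels_b[border] == labels_b[(0, 0)]
```

With the left group processed first, the border point ends up in the left cluster; with the order reversed, it ends up in the right cluster. The core points are assigned the same way in both runs.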
19. OPTICS
It computes an ordering of all objects in a given database, and it stores the core-distance and a suitable reachability-distance for each object in the database.
OPTICS maintains a list called OrderSeeds to generate the output ordering.
Objects in OrderSeeds are sorted by their reachability-distance from their respective closest core objects, that is, by the smallest reachability-distance of each object.
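The role of OrderSeeds can be sketched with a priority queue. The code below is a simplified illustration, not the published algorithm verbatim: `optics_order` is a hypothetical name, a binary heap with stale entries stands in for the OrderSeeds list, a point counts itself as a neighbor, and an undefined distance is represented by `inf`.

```python
import heapq
from math import dist, inf

def optics_order(points, eps, min_pts):
    """Return (ordering, reachability): point indices in output order and
    the reachability-distance recorded for each of them."""
    n = len(points)
    reach = [inf] * n
    processed = [False] * n
    ordering = []

    def neighbors(i):
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    def core_dist(i, nbrs):
        # Distance to the min_pts-th nearest point (self included), if i is core.
        if len(nbrs) < min_pts:
            return inf
        return sorted(dist(points[i], points[j]) for j in nbrs)[min_pts - 1]

    for start in range(n):
        if processed[start]:
            continue
        seeds = [(inf, start)]       # plays the role of OrderSeeds
        while seeds:
            _, i = heapq.heappop(seeds)   # smallest reachability-distance first
            if processed[i]:
                continue             # stale entry left over from an update
            processed[i] = True
            ordering.append(i)
            nbrs = neighbors(i)
            cd = core_dist(i, nbrs)
            if cd == inf:
                continue             # not a core object: nothing to expand
            for j in nbrs:
                if processed[j]:
                    continue
                new_reach = max(cd, dist(points[i], points[j]))
                if new_reach < reach[j]:
                    reach[j] = new_reach
                    heapq.heappush(seeds, (new_reach, j))
    return ordering, [reach[i] for i in ordering]

# Hypothetical toy data: two groups of three points on a line.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
ordering, reachability = optics_order(points, eps=3, min_pts=2)
```

For this data the ordering visits the first group, then the second, and the reachability values are undefined (`inf`) at the first object of each group and 1.0 elsewhere.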
20. OPTICS Concepts:
Core Distance - the smallest radius (at most ε) at which an object becomes a core object, that is, the distance to its MinPts-th nearest neighbor; it is undefined if the object is not a core object within ε.
Reachability Distance - the reachability-distance of an object p with respect to another object o is the larger of the core-distance of o and the distance between o and p; it is undefined if o is not a core object.
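A small worked example may make the two definitions concrete. The helper names and the data below are hypothetical; the sketch assumes that an object counts itself among its neighbors and uses `inf` to stand for "undefined".

```python
from math import dist, inf

def core_distance(points, o, eps, min_pts):
    """Distance from o to its min_pts-th nearest point (o itself included),
    or inf if o is not a core object within eps."""
    d = sorted(dist(o, q) for q in points)      # d[0] is o itself (0.0)
    kth = d[min_pts - 1] if len(d) >= min_pts else inf
    return kth if kth <= eps else inf

def reachability_distance(points, p, o, eps, min_pts):
    """max(core-distance of o, distance from o to p); inf if o is not core."""
    cd = core_distance(points, o, eps, min_pts)
    return inf if cd == inf else max(cd, dist(o, p))

# Hypothetical toy data on a line.
points = [(0, 0), (1, 0), (2, 0), (5, 0)]

cd_origin = core_distance(points, (0, 0), eps=3, min_pts=3)
rd_near = reachability_distance(points, (1, 0), (0, 0), eps=3, min_pts=3)
cd_isolated = core_distance(points, (5, 0), eps=3, min_pts=3)
```

The core-distance of (0, 0) is 2.0 (its third-nearest point, itself included, is (2, 0)). The reachability-distance of (1, 0) from (0, 0) is also 2.0, not 1.0, because it cannot drop below the core-distance. The point (5, 0) has no core-distance for ε = 3.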
23. OPTICS ALGORITHM EXAMPLE
Example Database (2-dimensional, 16 points: A to N, P, R)
• ε = 44, MinPts = 3
[Figure: point layout and reachability plot (y-axis: reach, tick at 44); objects output so far: A B]
seedlist: (I, 40) (C, 40)
April 30, 2012
24. OPTICS ALGORITHM EXAMPLE
Example Database (2-dimensional, 16 points: A to N, P, R)
• ε = 44, MinPts = 3
[Figure: point layout and reachability plot (y-axis: reach, tick at 44); objects output so far: A B I]
seedlist: (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43)
26. OPTICS ALGORITHM EXAMPLE
Example Database (2-dimensional, 16 points: A to N, P, R)
• ε = 44, MinPts = 3
[Figure: point layout and reachability plot (y-axis: reach, tick at 44); objects output so far: A B I J L …]
seedlist: (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40)
29. GRAPHICAL REPRESENTATION
A data set’s cluster ordering can be represented graphically.
It helps to visualize and understand the clustering structure in a data set.
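One common graphical form is a bar chart of reachability-distances in cluster order, where valleys correspond to clusters. The sketch below draws such a chart as text and splits the ordering at a threshold. Both helpers and the numbers are made up for illustration, and the splitting rule is a simplification (it ignores core-distances and does not separate noise).

```python
from math import inf

def reachability_bars(names, reach, cap, width=8):
    """One text bar per object, in cluster order; values above cap
    (including undefined ones) are drawn at full width."""
    lines = []
    for name, r in zip(names, reach):
        bar = "#" * round(width * min(r, cap) / cap)
        lines.append(f"{name:>2} {bar}")
    return lines

def split_at_threshold(reach, threshold):
    """Start a new group at every position whose reachability exceeds
    the threshold (simplified cluster extraction)."""
    groups, current = [], []
    for pos, r in enumerate(reach):
        if r > threshold and current:
            groups.append(current)
            current = []
        current.append(pos)
    if current:
        groups.append(current)
    return groups

# Hypothetical cluster ordering of six objects with their reachability values.
names = ["a", "b", "c", "d", "e", "f"]
reach = [inf, 1.0, 1.0, inf, 1.0, 1.0]

lines = reachability_bars(names, reach, cap=4)
groups = split_at_threshold(reach, threshold=2)
print("\n".join(lines))
```

The two tall bars mark the start of each dense region, and the short bars after them form the valleys; cutting at a threshold of 2 yields two groups of three objects each.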