This document discusses density-based clustering and the DBSCAN algorithm. Density-based clustering uses a local density criterion: a cluster is a maximal set of density-connected points. DBSCAN discovers clusters of arbitrary shape by finding core points, which have at least MinPts points within a given radius (Eps), and connecting them to nearby core and border points. The algorithm iterates through the points, growing a cluster from each unassigned core point and labeling the remaining points as border or noise. It works well for clusters of varying shapes and sizes but can fail when densities vary or when the data are high-dimensional.
4. Prof. Pier Luca Lanzi
What is density-based clustering?
• Clustering based on density (local cluster criterion),
such as density-connected points
• Major features:
§ Discover clusters of arbitrary shape
§ Handle noise
§ One scan
§ Need density parameters as termination condition
• Several interesting studies:
§ DBSCAN: Ester et al. (KDD’96)
§ OPTICS: Ankerst et al. (SIGMOD’99)
§ DENCLUE: Hinneburg & Keim (KDD’98)
§ CLIQUE: Agrawal et al. (SIGMOD’98) (more grid-based)
DBSCAN: Basic Concepts
• The neighborhood within a radius ε of a given object is called
the ε-neighborhood of the object
• If the ε-neighborhood of an object contains at least MinPts
objects, then the object is a core object
• An object p is directly density-reachable from object q if p is
within the ε-neighborhood of q and q is a core object
• An object p is density-reachable from object q if there is a chain
of objects p1, …, pn, where p1 = q and pn = p, such that pi+1 is
directly density-reachable from pi
• An object p is density-connected to q with respect to ε and
MinPts if there is an object o such that both p and q are
density-reachable from o
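The definitions above can be made concrete in a few lines of base R. This is a minimal sketch and not part of the original slides: the helper names (eps_neighborhood, is_core, directly_reachable) and the toy data are illustrative assumptions, and the ε-neighborhood is assumed to include the object itself.

```r
# Toy data (hypothetical): four points close together and one isolated point
pts <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(0.1, 0.1), c(1, 1))
D <- as.matrix(dist(pts))   # pairwise Euclidean distances

# Indices of the objects in the eps-neighborhood of object i (i itself included)
eps_neighborhood <- function(D, i, eps) which(D[i, ] <= eps)

# Object i is a core object if its eps-neighborhood holds at least MinPts objects
is_core <- function(D, i, eps, MinPts) {
  length(eps_neighborhood(D, i, eps)) >= MinPts
}

# p is directly density-reachable from q if q is a core object
# and p lies in the eps-neighborhood of q
directly_reachable <- function(D, p, q, eps, MinPts) {
  is_core(D, q, eps, MinPts) && p %in% eps_neighborhood(D, q, eps)
}

eps <- 0.15
MinPts <- 4
core <- sapply(seq_len(nrow(pts)), function(i) is_core(D, i, eps, MinPts))
```

Note that direct density-reachability is not symmetric in general: it requires q, but not p, to be a core object.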
DBSCAN: Basic Concepts
• Density = number of points within a specified radius (Eps)
• A border point has fewer than MinPts points within Eps,
but is in the neighborhood of a core point
• A noise point is any point that is not a core point
or a border point
• A density-based cluster is a set of density-connected objects that
is maximal with respect to density-reachability
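The core, border, and noise labels can be computed directly from a distance matrix. The following base R sketch is illustrative only: the function name classify_points and the toy data are assumptions, and each point is counted as lying within Eps of itself.

```r
classify_points <- function(x, eps, MinPts) {
  D <- as.matrix(dist(x))
  within_eps <- D <= eps               # TRUE where two points are within Eps
  density <- rowSums(within_eps)       # number of points within Eps (self included)
  core <- density >= MinPts
  # Border: not core, but within Eps of at least one core point
  near_core <- rowSums(within_eps[, core, drop = FALSE]) > 0
  type <- ifelse(core, "core", ifelse(near_core, "border", "noise"))
  unname(type)
}

# Hypothetical data: a tight group, one point on its edge, one far away
x <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(0.1, 0.1), c(0.22, 0.1), c(1, 1))
types <- classify_points(x, eps = 0.15, MinPts = 4)
```

Here the fifth point has too few neighbors to be a core point but lies within Eps of one, so it is a border point; the last point is near no core point and is noise.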
Density-Reachable and Density-Connected
[Figure: three diagrams illustrating directly density-reachable, density-reachable, and density-connected points (labels p, q, p1, o), drawn with MinPts = 5 and Eps = 1 cm]
DBSCAN: Core, Border, and Noise Points
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
• Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with
noise
• The Algorithm
§ Arbitrarily select a point p
§ Retrieve all points density-reachable
from p given Eps and MinPts
§ If p is a core point, a cluster is formed.
§ If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database
§ Continue the process until all of the points have been
processed
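The steps above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the reference implementation: the function name dbscan_sketch is hypothetical, distances are Euclidean, all pairwise distances are precomputed (quadratic memory), and a border point joins whichever cluster reaches it first.

```r
dbscan_sketch <- function(x, eps, MinPts) {
  D <- as.matrix(dist(x))
  n <- nrow(D)
  # eps-neighborhood of every point (the point itself included)
  neighbors <- lapply(seq_len(n), function(i) which(D[i, ] <= eps))
  core <- lengths(neighbors) >= MinPts
  cluster <- integer(n)   # 0 = unassigned; points still 0 at the end are noise
  k <- 0L
  for (p in seq_len(n)) {
    # Skip points already in a cluster and points that are not core
    if (cluster[p] != 0L || !core[p]) next
    k <- k + 1L           # p is an unassigned core point: start a new cluster
    cluster[p] <- k
    queue <- neighbors[[p]]
    while (length(queue) > 0) {
      q <- queue[1]
      queue <- queue[-1]
      if (cluster[q] != 0L) next
      cluster[q] <- k     # q is density-reachable from p
      # Only core points extend the cluster; border points stop the expansion
      if (core[q]) queue <- c(queue, neighbors[[q]])
    }
  }
  cluster
}

# Hypothetical data: two tight groups and one isolated point
x <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(0.1, 0.1),
           c(5, 5), c(5.1, 5), c(5, 5.1), c(5.1, 5.1),
           c(10, 0))
cl <- dbscan_sketch(x, eps = 0.15, MinPts = 4)
```

On this toy data the two groups receive labels 1 and 2, and the isolated point keeps label 0 (noise).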
DBSCAN: Core, Border and Noise Points
[Figure, two panels: "Original Points" and "Point types: core, border and noise"; Eps = 10, MinPts = 4]
When DBSCAN Works Well
• Resistant to Noise
• Can handle clusters of different shapes and sizes
[Figure, two panels: "Original Points" and "Clusters"]
When DBSCAN May Fail?
• Varying densities
• High-dimensional data
[Figure, three panels: "Original Points", and the clusters found with (MinPts=4, Eps=9.75) and with (MinPts=4, Eps=9.92)]
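One way to see why varying densities are a problem is to compare, for a dense and a sparse group, the distance from each point to its 4th nearest neighbor. The sketch below uses synthetic data and a hypothetical helper kdist; it is an illustration of the issue, not a procedure from the slides.

```r
set.seed(1)
# Two synthetic groups whose spread differs by a factor of 10 (assumed values)
dense  <- cbind(rnorm(100, 0, 0.1), rnorm(100, 0, 0.1))
sparse <- cbind(rnorm(100, 5, 1.0), rnorm(100, 5, 1.0))

# Distance from each point to its k-th nearest neighbor
# (after sorting a row of distances, position 1 is the point itself at distance 0)
kdist <- function(x, k) {
  apply(as.matrix(dist(x)), 1, function(d) sort(d)[k + 1])
}

dense_k  <- kdist(dense, 4)
sparse_k <- kdist(sparse, 4)
summary(dense_k)
summary(sparse_k)
```

The typical 4th-neighbor distance differs widely between the two groups, so a single Eps small enough for the dense group tends to label much of the sparse group as noise, while an Eps large enough for the sparse group can merge dense clusters that lie close together.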
Clusters found in Random Data
[Figure, four scatter plots on the unit square (axes x and y, each from 0 to 1): "Random Points", "K-means", "DBSCAN", "Complete Link"]
Density-Based Clustering in R
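library(fpc)
set.seed(665544)
n <- 600
# 10 random centers (recycled across the n points) plus Gaussian noise
x <- cbind(runif(10, 0, 10) + rnorm(n, sd=0.2),
           runif(10, 0, 10) + rnorm(n, sd=0.2))
par(bg="grey40")
# eps = 0.2; MinPts is left at its default
ds <- dbscan(x, 0.2, showplot=1)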
Density-Based Clustering in R
library(fpc)
set.seed(665544)
x <- seq(0, 6.28, 0.1)
y <- sin(x)
# x and y (63 values each) are recycled to length 630
xd <- x + rnorm(630, sd=0.2)
yd <- y + rnorm(630, sd=0.2)
plot(xd, yd)
par(bg="grey40")
d <- cbind(xd, yd)
# this works nicely since epsilon is
# the same size as the standard deviation (0.2)
# used to generate the data
ds <- dbscan(d, 0.2, showplot=1)
# this does not work so nicely
ds <- dbscan(d, 0.1, showplot=1)
Clustering Comparisons on Sin Data
[Figure, two panels: hierarchical clustering and k-means clustering]
Clustering Comparisons on Sin Data
(k-means with 10 clusters)
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Density-Based_Clustering
Software Packages