3.4 density and grid methods

Clustering
Density and Grid Based

Density based methods
 Clusters – dense regions of objects
 Low density regions – Noise
 DBSCAN
 Density Based Spatial Clustering of Applications with
Noise
 OPTICS
 Ordering Points To Identify the Clustering Structure
 DENCLUE
 DENsity Based CLUstEring

DBSCAN
 Cluster – maximal set of density connected points
 Grows regions with sufficiently high density into clusters
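 ε-neighborhood: the objects within a radius ε of a given object
 MinPts and Core object: an object is a core object if its ε-neighborhood contains at least MinPts objects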
 Directly Density Reachable
 An object p is directly density reachable from object q if
p is within the ε-neighborhood of q and q is a core
object
[Figure: objects p and q, with MinPts = 5]
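The definitions on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Euclidean distance is used, an object counts as a member of its own ε-neighborhood, and the function names and sample points are made up for the example.

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

def is_core(points, i, eps, min_pts):
    """points[i] is a core object if its eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if points[p] is directly density reachable from points[q]."""
    return is_core(points, q, eps, min_pts) and p in eps_neighborhood(points, q, eps)

# Hypothetical sample data: a small dense group, one point on its edge, one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5),
          (0.3, 0.3), (1.2, 0.3), (5.0, 5.0)]
eps, min_pts = 1.0, 5

forward = directly_density_reachable(points, 5, 4, eps, min_pts)   # edge point from core point
backward = directly_density_reachable(points, 4, 5, eps, min_pts)  # core point from edge point
```

With this data `forward` is true and `backward` is false, which shows that direct density reachability is not symmetric: the edge point is not a core object, so nothing is directly density reachable from it.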

DBSCAN
 Density Reachable
 An object p is density reachable from q if there is a chain of objects p1, …, pn with p1 = q and pn = p such that pi+1 is directly density reachable from pi
[Figure: objects q, p1 and p]

DBSCAN
 Density Connected
 An object p is density connected to object q if there is an object o such that both p and q are density reachable from o.
[Figure: objects p, q and o]
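Density reachability and density connectivity can be checked by following chains of core objects. The sketch below is a minimal illustration, not an efficient implementation: it assumes Euclidean distance, counts an object in its own ε-neighborhood, and treats every object as trivially reachable from itself. The function names and the sample points are hypothetical.

```python
import math
from collections import deque

def neighbors(points, i, eps):
    """Indices of points within distance eps of points[i] (i itself included)."""
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

def reachable_from(points, q, eps, min_pts):
    """Indices that are density reachable from points[q], found breadth-first."""
    seen = {q}
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        nbrs = neighbors(points, cur, eps)
        if len(nbrs) < min_pts:
            continue  # cur is not a core object, so no chain continues through it
        for j in nbrs:
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return seen

def density_connected(points, p, q, eps, min_pts):
    """True if some object o exists from which both p and q are density reachable."""
    for o in range(len(points)):
        r = reachable_from(points, o, eps, min_pts)
        if p in r and q in r:
            return True
    return False

# Hypothetical data: five points on a line plus one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
eps, min_pts = 1.1, 3
```

Here the two end points of the line (indices 0 and 4) are border points: neither is density reachable from the other, yet they are density connected through any of the core objects between them.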

DBSCAN
 Arbitrarily select a point p
 Retrieve all points density-reachable from p
 If p is a core point, a cluster is formed.
 If p is a border point, no points are density-reachable from p, so DBSCAN visits the next point of the database.
 Continue the process until all of the points have been processed.
 Complexity: O(n log n) with a spatial index, O(n²) otherwise
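The procedure above can be sketched as follows. This is a minimal version under stated assumptions: it uses a brute-force neighborhood search (so it runs in O(n²), with no spatial index), Euclidean distance, and an object counts in its own ε-neighborhood. The names `dbscan`, `region_query` and `NOISE` are chosen for the example.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    def region_query(i):
        return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

    labels = [None] * len(points)      # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = region_query(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE          # may be relabelled later as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = list(nbrs)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue               # already assigned, do not expand again
            labels[j] = cluster
            j_nbrs = region_query(j)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)   # j is a core point: keep growing the region
    return labels

# Hypothetical data: two groups on a line and one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0), (20, 0), (21, 0), (22, 0)]
labels = dbscan(points, eps=1.1, min_pts=3)
# labels == [0, 0, 0, 0, 0, -1, 1, 1, 1]
```

Points first marked as noise (indices 0 and 6) are picked up as border points once a neighboring core point is expanded, while the isolated point stays noise.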

OPTICS: A Cluster-Ordering Method
 OPTICS: Ordering Points To Identify the Clustering
Structure
 Produces a special order of the database with respect
to its density-based clustering structure
 Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure
 Can be represented graphically or using visualization
techniques

OPTICS
 In DBSCAN, for a constant MinPts value, density-based clusters with respect to a higher density (lower value of ε) are completely contained in density-connected sets obtained with respect to a lower density.
 DBSCAN is extended so that objects are processed in a specific order.
 Selects first the object that is density-reachable with respect to the lowest ε value, so that higher-density clusters are finished first
 Core distance of an object p: smallest ε′ value that makes p a core object; undefined if p is not a core object with respect to ε
 Reachability distance of an object q with respect to p = max(core-distance of p, d(p,q)); undefined if p is not a core object
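The two distances can be computed directly from their definitions. The sketch below assumes Euclidean distance, counts an object among its own neighbors, and returns `None` where the distance is undefined. The function names and sample points are hypothetical.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Smallest eps' <= eps that makes points[p] a core object, or None if undefined."""
    dists = sorted(math.dist(points[p], x) for x in points)  # includes 0.0 for p itself
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]

def reachability_distance(points, q, p, eps, min_pts):
    """max(core-distance of p, d(p, q)), or None if p is not a core object."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[p], points[q]))

# Hypothetical data on a line.
points = [(0.0, 0.0), (1.0, 0.0), (1.5, 0.0), (2.5, 0.0), (9.0, 0.0)]
eps, min_pts = 3.0, 3

cd0 = core_distance(points, 0, eps, min_pts)                 # 1.5
near = reachability_distance(points, 1, 0, eps, min_pts)     # max(1.5, 1.0) = 1.5
far = reachability_distance(points, 3, 0, eps, min_pts)      # max(1.5, 2.5) = 2.5
lonely = reachability_distance(points, 3, 4, eps, min_pts)   # None: point 4 is not core
```

An object closer to p than the core distance still gets the core distance as its reachability distance, which is what smooths the reachability plot inside dense regions.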

OPTICS
 Complexity: O(n log n) with a spatial index

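OPTICS
[Figure: reachability plot. Reachability-distance (vertical axis) against the cluster-order of the objects (horizontal axis), with the thresholds ε and ε′ marked; the reachability-distance of some objects is undefined.]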

DENCLUE: using density functions
 DENsity-based CLUstEring
 Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
 Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
 But needs a large number of parameters

DENCLUE
 Influence function: describes the impact of a data point within its neighborhood.
 x, y: objects in F^d, a d-dimensional input space
 Influence of object y on x is: $f_B^y(x) = f_B(x, y)$
 Can be determined by distance:
$$f_{\text{square}}(x, y) = \begin{cases} 0 & \text{if } d(x, y) > \sigma \\ 1 & \text{otherwise} \end{cases}$$
$$f_{\text{Gaussian}}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$
 Overall density of the data space can be calculated as the sum of the influence function of all data points.
 Clusters can be determined mathematically by identifying density attractors.
 Density attractors are local maxima of the overall density function.
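The influence functions and the overall density translate almost literally into code. The sketch below assumes Euclidean distance for d(x, y); the function names and the sample data are illustrative only.

```python
import math

def square_wave_influence(x, y, sigma):
    """0 if d(x, y) > sigma, 1 otherwise."""
    return 0.0 if math.dist(x, y) > sigma else 1.0

def gaussian_influence(x, y, sigma):
    """exp(-d(x, y)^2 / (2 sigma^2))."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma, influence=gaussian_influence):
    """Overall density at x: the sum of the influences of all data points on x."""
    return sum(influence(x, y, sigma) for y in data)

# Hypothetical data: two nearby points and one distant point.
data = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0)]
sigma = 1.0

square_at_origin = density((0.0, 0.0), data, sigma, square_wave_influence)  # 2.0
dense_spot = density((0.25, 0.0), data, sigma)   # between the two nearby points
sparse_spot = density((3.0, 0.0), data, sigma)   # at the distant point
```

With the Gaussian influence, the density between the two nearby points is higher than at the distant point, even though the distant point coincides with a data object.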

DENCLUE
 Density attractor: a local maximum of the overall density function
 A point x is said to be density attracted to a density attractor x* if there exists a set of points x0, x1, …, xk such that x0 = x, xk = x*, and the gradient of the density function at xi-1 is in the direction of xi
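Density attraction can be illustrated with a simple hill-climbing loop that follows the gradient of the density function. The sketch below assumes the Gaussian influence function, a fixed step length, and a stopping rule of "stop when the density no longer increases"; the step size, iteration limit, function names and data are all choices made for the example.

```python
import math

def density(x, data, sigma):
    """Overall density at x using the Gaussian influence function."""
    return sum(math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2)) for y in data)

def gradient(x, data, sigma):
    """Gradient direction of the density at x (the constant 1/sigma^2 factor is omitted)."""
    g = [0.0] * len(x)
    for y in data:
        w = math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))
        for k in range(len(x)):
            g[k] += (y[k] - x[k]) * w
    return g

def climb(x, data, sigma, step=0.05, max_iter=1000):
    """Follow the gradient from x in fixed-length steps until the density stops rising."""
    x = list(x)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = math.hypot(*g)
        if norm == 0.0:
            break
        nxt = [xi + step * gi / norm for xi, gi in zip(x, g)]
        if density(nxt, data, sigma) <= density(x, data, sigma):
            break
        x = nxt
    return x

# Hypothetical data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
a = climb((0.8, 0.8), data, sigma=0.5)   # ends near the first group
b = climb((4.3, 4.5), data, sigma=0.5)   # ends near the second group
```

The returned point only approximates the attractor to within about one step length; points whose climbs end near the same attractor would be grouped together.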

DENCLUE
 Center-defined clusters
 For a density attractor x*: the subset of points that are density attracted by x*, where the density function at x* is no less than a threshold ξ
 Others are outliers
 Arbitrary-shape clusters
 For a set of density attractors: a set of subsets C, each density attracted to its respective attractor
 There should be a path from each density attractor to another where the density function value for each point is no less than ξ

Grid Based Methods
 Uses a Multi-resolution grid data structure
 Quantizes space into a finite number of cells
that form a grid structure
 Fast processing time
 STING
 WaveCluster
 CLIQUE – CLustering In QUEst

STING
 STatistical Information Grid
 Spatial area is divided into rectangular cells
 Several levels of cells – at different levels of
resolution
 High level cell is partitioned into several lower
level cells
 Statistical attributes are stored in each cell
 Mean, Maximum, Minimum

STING
 Parameters of higher level cells are computed
from those at lower levels
 To answer queries
 Identify level
 Estimate cell’s relevance to query
 Process relevant cells at lower levels
 Continue to lowest level
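Computing the parameters of a higher-level cell from its lower-level cells can be sketched as below. The example assumes each parent covers a 2×2 block of children and that every cell also stores a point count, which is needed to combine the means; the dictionary layout and function names are illustrative.

```python
def leaf_stats(values):
    """Statistics for one bottom-level cell holding the given attribute values."""
    if not values:
        return {"count": 0, "mean": 0.0, "min": None, "max": None}
    return {"count": len(values), "mean": sum(values) / len(values),
            "min": min(values), "max": max(values)}

def merge_stats(children):
    """Parent-cell statistics computed only from the child statistics."""
    nonempty = [c for c in children if c["count"] > 0]
    n = sum(c["count"] for c in nonempty)
    if n == 0:
        return leaf_stats([])
    return {"count": n,
            "mean": sum(c["mean"] * c["count"] for c in nonempty) / n,  # weighted mean
            "min": min(c["min"] for c in nonempty),
            "max": max(c["max"] for c in nonempty)}

def build_level_above(grid):
    """Combine each 2x2 block of cells into one cell of the next higher level."""
    size = len(grid) // 2
    return [[merge_stats([grid[2 * r][2 * c], grid[2 * r][2 * c + 1],
                          grid[2 * r + 1][2 * c], grid[2 * r + 1][2 * c + 1]])
             for c in range(size)] for r in range(size)]

# Hypothetical attribute values falling into a 2x2 bottom-level grid.
raw = [[[1.0, 3.0], []],
       [[5.0], [7.0, 9.0, 11.0]]]
leaves = [[leaf_stats(cell) for cell in row] for row in raw]
top = build_level_above(leaves)[0][0]
# top == {"count": 6, "mean": 6.0, "min": 1.0, "max": 11.0}
```

Because the parent is built from the stored statistics alone, the raw data only has to be read once, when the bottom level is filled.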

STING
 Computation is query independent
 Parallel processing – supported
 Data is processed in a single pass
 Quality depends on granularity

WaveCluster
 A multi-resolution clustering approach which applies
wavelet transform to the feature space
 A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands.
 Both grid-based and density-based
 Input parameters:
 # of grid cells for each dimension
 the wavelet, and the # of applications of wavelet
transform.

WaveCluster
 Using wavelet transform to find clusters
 Summarises the data by imposing a multidimensional
grid structure onto data space
 These multidimensional spatial data objects are represented in an n-dimensional feature space
 Apply wavelet transform on feature space to find the dense regions in the feature space
 Apply wavelet transform multiple times, which results in clusters at different scales from fine to coarse
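The grid-then-transform idea can be illustrated with a deliberately simplified sketch. It is not the WaveCluster algorithm itself: it keeps only the low-pass (averaging) sub-band of a 2-D Haar transform, up to a scale factor, and works on a square two-dimensional grid whose side is a power of two. Function names, the grid size and the data are assumptions made for the example.

```python
def quantize(points, n_cells, lo, hi):
    """Count the points in each cell of an n_cells x n_cells grid over [lo, hi)^2."""
    width = (hi - lo) / n_cells
    grid = [[0.0] * n_cells for _ in range(n_cells)]
    for x, y in points:
        r = min(int((y - lo) / width), n_cells - 1)
        c = min(int((x - lo) / width), n_cells - 1)
        grid[r][c] += 1.0
    return grid

def haar_lowpass(grid):
    """One level of the Haar approximation sub-band: the average of each 2x2 block."""
    half = len(grid) // 2
    return [[(grid[2 * r][2 * c] + grid[2 * r][2 * c + 1]
              + grid[2 * r + 1][2 * c] + grid[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(half)] for r in range(half)]

# Hypothetical data: four points close together and one isolated point.
points = [(0.5, 0.5), (0.6, 1.5), (1.5, 0.5), (1.4, 1.6), (3.5, 3.5)]
grid = quantize(points, n_cells=4, lo=0.0, hi=4.0)
level1 = haar_lowpass(grid)      # [[1.0, 0.0], [0.0, 0.25]]
level2 = haar_lowpass(level1)    # [[0.3125]]
dense = [(r, c) for r, row in enumerate(level1)
         for c, v in enumerate(row) if v > 0.5]   # [(0, 0)]
```

After one application the isolated point contributes only 0.25 to its cell and falls below the threshold of 0.5, while the group of four stands out; a second application gives a coarser view of the same space.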

WaveCluster
 Reasons for using Wavelet transformation in clustering
 Unsupervised clustering: it uses filters to emphasize regions where points cluster while simultaneously suppressing weaker information at their boundaries
 Effective removal of outliers
 Multi-resolution
 Cost efficiency
 Major features:
 Complexity O(N)
 Detect arbitrary shaped clusters at different scales
 Not sensitive to noise, not sensitive to input order
 Only applicable to low dimensional data

CLIQUE (Clustering In QUEst)
 Automatically identifying subspaces of a high dimensional data space
that allow better clustering than original space
 CLIQUE can be considered as both density-based and grid-based
 It partitions each dimension into the same number of equal-length intervals
 It partitions an m-dimensional data space into non-overlapping rectangular units
 A unit is dense if the fraction of total data points contained in the unit exceeds a density threshold given as an input model parameter
 A cluster is a maximal set of connected dense units within a
subspace
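Finding dense units and grouping them into clusters within one chosen subspace can be sketched as follows. The sketch leaves out the search over subspaces; it assumes all dimensions share the same range [lo, hi), that units are connected when they share a face, and that the parameter names (`n_intervals`, `tau`) are free choices for the example.

```python
from collections import Counter

def dense_units(points, dims, n_intervals, lo, hi, tau):
    """Units of the subspace `dims` that hold more than a fraction tau of all points."""
    width = (hi - lo) / n_intervals
    def unit_of(p):
        return tuple(min(int((p[d] - lo) / width), n_intervals - 1) for d in dims)
    counts = Counter(unit_of(p) for p in points)
    return {u for u, c in counts.items() if c / len(points) > tau}

def clusters(units):
    """Group dense units into maximal sets connected through a common face."""
    units = set(units)
    result = []
    while units:
        stack = [units.pop()]
        comp = set(stack)
        while stack:
            u = stack.pop()
            for k in range(len(u)):
                for delta in (-1, 1):
                    v = u[:k] + (u[k] + delta,) + u[k + 1:]
                    if v in units:
                        units.remove(v)
                        comp.add(v)
                        stack.append(v)
        result.append(comp)
    return result

# Hypothetical 2-D data in [0, 10)^2.
points = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5), (1.2, 0.8),
          (8.5, 8.5), (8.2, 8.9), (8.7, 8.1), (4.5, 5.5)]
units_2d = dense_units(points, dims=(0, 1), n_intervals=10, lo=0.0, hi=10.0, tau=0.2)
found = clusters(units_2d)
units_1d = dense_units(points, dims=(0,), n_intervals=10, lo=0.0, hi=10.0, tau=0.2)
```

With this data the units (0, 0) and (1, 0) are dense and adjacent, so they form one cluster, the unit (8, 8) forms another, and the single point in unit (4, 5) is left out.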
