dm_clustering2.ppt

1
Clustering Part2
1. BIRCH
2. Density-based Clustering --- DBSCAN and
DENCLUE
3. GRID-based Approaches --- STING and ClIQUE
4. SOM
5. Outlier Detection
6. Summary
Remark: Only DENCLUE and briefly grid-based clusterin will be
covered in 2007.

2
BIRCH (1996)
 Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
 Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster
the leaf nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to the
order of the data record.

3
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: Number of data points
LS: N
i=1=Xi
SS: N
i=1=Xi
2
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
CF = (5, (16,30),(54,190))
(3,4)
(2,6)
(4,5)
(4,7)
(3,8)

4
CF Tree
CF1
child1
CF3
child3
CF2
child2
CF6
child6
CF1
child1
CF3
child3
CF2
child2
CF5
child5
CF1 CF2 CF6
prev next CF1 CF2 CF4
prev next
B = 7
L = 6
Root
Non-leaf node
Leaf node Leaf node

5
Chapter 8. Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary

6
Density-Based Clustering Methods
 Clustering based on density (local cluster criterion),
such as density-connected points or based on an
explicitly constructed density function
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)
 OPTICS: Ankerst, et al (SIGMOD’99).
 DENCLUE: Hinneburg & D. Keim (KDD’98)
 CLIQUE: Agrawal, et al. (SIGMOD’98)

7
Density-Based Clustering: Background
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) <= Eps}
 Directly density-reachable: A point p is directly density-
reachable from a point q wrt. Eps, MinPts if
 1) p belongs to NEps(q)
 2) core point condition:
|NEps (q)| >= MinPts
p
q
MinPts = 5
Eps = 1 cm

8
Density-Based Clustering: Background (II)
 Density-reachable:
 A point p is density-reachable from
a point q wrt. Eps, MinPts if there
is a chain of points p1, …, pn, p1 =
q, pn = p such that pi+1 is directly
density-reachable from pi
 Density-connected
 A point p is density-connected to a
point q wrt. Eps, MinPts if there is
a point o such that both, p and q
are density-reachable from o wrt.
Eps and MinPts.
p
q
p1
p q
o

9
DBSCAN: Density Based Spatial
Clustering of Applications with Noise
 Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise
Core
Border
Outlier
Eps = 1cm
MinPts = 5
Density reachable
from core point
Not density reachable
from core point

10
DBSCAN: The Algorithm
 Arbitrary select a point p
 Retrieve all points density-reachable from p wrt Eps
and MinPts.
 If p is a core point, a cluster is formed.
 If p ia not a core point, no points are density-
reachable from p and DBSCAN visits the next point of
the database.
 Continue the process until all of the points have been
processed.

11
DENCLUE: using density functions
 DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
 Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
 Significant faster than existing algorithm (faster than
DBSCAN by a factor of up to 45)
 But needs a large number of parameters

12
 Uses grid cells but only keeps information about grid
cells that do actually contain data points and manages
these cells in a tree-based access structure.
 Influence function: describes the impact of a data point
within its neighborhood.
 Overall density of the data space can be calculated as
the sum of the influence function of all data points.
 Clusters can be determined mathematically by
identifying density attractors.
 Density attractors are local maximal of the overall
density function.
Denclue: Technical Essence

13
Gradient: The steepness of a slope
 Example
 


N
i
x
x
d
D
Gaussian
i
e
x
f 1
2
)
,
(
2
2
)
( 






N
i
x
x
d
i
i
D
Gaussian
i
e
x
x
x
x
f 1
2
)
,
(
2
2
)
(
)
,
( 
f x y e
Gaussian
d x y
( , )
( , )


2
2
2

14
Example: Density Computation
D={x1,x2,x3,x4}
fD
Gaussian(x)= influence(x1) + influence(x2) + influence(x3) +
influence(x4)=0.04+0.06+0.08+0.6=0.78
x1
x2
x3
x4
x 0.6
0.08
0.06
0.04
y
Remark: the density value of y would be larger than the one for x

16
Examples of DENCLUE Clusters

17
Basic Steps DENCLUE Algorithms
1. Determine density attractors
2. Associate data objects with density
attractors ( initial clustering)
3. Merge the initial clusters further relying
on a hierarchical clustering approach
(optional)

18
 Summary

19
Steps of Grid-based Clustering Algorithms
Basic Grid-based Algorithm
1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and
compute the density of each cell.
3. Eliminate cells, whose density is below a
certain threshold t.
4. Form clusters from contiguous (adjacent)
groups of dense cells (usually minimizing a
given objective function)

20
Advantages of Grid-based Clustering
Algorithms
 fast:
 No distance computations
 Clustering is performed on summaries and not
individual objects; complexity is usually O(#-
populated-grid-cells) and not O(#objects)
 Easy to determine which clusters are
neighboring
 Shapes are limited to union of grid-cells

21
Grid-Based Clustering Methods
 Using multi-resolution grid data structure
 Clustering complexity depends on the number of
populated grid cells and not on the number of objects in
the dataset
 Several interesting methods (in addition to the basic grid-
based algorithm)
 STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
 CLIQUE: Agrawal, et al. (SIGMOD’98)

22
STING: A Statistical Information
Grid Approach
 Wang, Yang and Muntz (VLDB’97)
 The spatial area area is divided into rectangular cells
 There are several levels of cells corresponding to different
levels of resolution

23
Grid Approach (2)
 Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
 Statistical info of each cell is calculated and stored beforehand
and is used to answer queries
 Parameters of higher level cells can be easily calculated from
parameters of lower level cell
 count, mean, s, min, max
 type of distribution—normal, uniform, etc.
 Use a top-down approach to answer spatial data queries

24
STING: Query Processing(3)
Used a top-down approach to answer spatial data queries
1. Start from a pre-selected layer—typically with a small number of
cells
2. From the pre-selected layer until you reach the bottom layer do
the following:
 For each cell in the current level compute the confidence interval
indicating a cell’s relevance to a given query;
 If it is relevant, include the cell in a cluster
 If it irrelevant, remove cell from further consideration
 otherwise, look for relevant cells at the next lower layer
3. Combine relevant cells into relevant regions (based on grid-
neighborhood) and return the so obtained clusters as your
answers.

25
Grid Approach (3)
 Advantages:
 Query-independent, easy to parallelize, incremental
update
 O(K), where K is the number of grid cells at the
lowest level
 Disadvantages:
 All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected

26
CLIQUE (Clustering In QUEst)
 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
 Automatically identifying subspaces of a high dimensional
data space that allow better clustering than original space
 CLIQUE can be considered as both density-based and grid-
based
 It partitions each dimension into the same number of
equal length interval
 It partitions an m-dimensional data space into non-
overlapping rectangular units
 A unit is dense if the fraction of total data points
contained in the unit exceeds the input model parameter
 A cluster is a maximal set of connected dense units
within a subspace

27
CLIQUE: The Major Steps
 Partition the data space and find the number of points that
lie inside each cell of the partition.
 Identify the subspaces that contain clusters using the
Apriori principle
 Identify clusters:
 Determine dense units in all subspaces of interests
 Determine connected dense units in all subspaces of
interests.
 Generate minimal description for the clusters
 Determine maximal regions that cover a cluster of
connected dense units for each cluster
 Determination of minimal cover for each cluster

28
Salary
(10,000)
20 30 40 50 60
age
5
4
3
1
2
6
7
0
20 30 40 50 60
age
5
4
3
1
2
6
7
0
Vacation
(week)
age
Vacation
30 50
t = 3

29
Strength and Weakness of CLIQUE
 Strength
 It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
 It is insensitive to the order of records in input and
does not presume some canonical data distribution
 It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
 Weakness
 The accuracy of the clustering result may be
degraded at the expense of simplicity of the method

30
 Summary

31
Self-organizing feature maps (SOMs)
 Clustering is also performed by having several
units competing for the current object
 The unit whose weight vector is closest to the
current object wins
 The winner and its neighbors learn by having
their weights adjusted
 SOMs are believed to resemble processing that
can occur in the brain
 Useful for visualizing high-dimensional data in
2- or 3-D space

32
 Summary

33
What Is Outlier Discovery?
 What are outliers?
 The set of objects are considerably dissimilar from
the remainder of the data
 Example: Sports: Michael Jordon, Wayne Gretzky, ...
 Problem
 Find top n outlier points
 Applications:
 Credit card fraud detection
 Telecom fraud detection
 Customer segmentation
 Medical analysis

34
Outlier Discovery:
Statistical Approaches
 Assume a model underlying distribution that generates
data set (e.g. normal distribution)
 Use discordancy tests depending on
 data distribution
 distribution parameter (e.g., mean, variance)
 number of expected outliers
 Drawbacks
 most tests are for single attribute
 In many cases, data distribution may not be known

35
Outlier Discovery: Distance-
Based Approach
 Introduced to counter the main limitations imposed by
statistical methods
 We need multi-dimensional analysis without knowing
data distribution.
 Distance-based outlier: A DB(p, D)-outlier is an object O
in a dataset T such that at least a fraction p of the
objects in T lies at a distance greater than D from O
 Algorithms for mining distance-based outliers (see
textbook)
 Index-based algorithm
 Nested-loop algorithm
 Cell-based algorithm

36
 Summary

37
Problems and Challenges
 Considerable progress has been made in scalable clustering
methods
 Partitioning/Representative-based: k-means, k-medoids,
CLARANS, EM
 Hierarchical: BIRCH, CURE
 Density-based: DBSCAN, DENCLUE, CLIQUE, OPTICS
 Grid-based: STING, CLIQUE
 Model-based: Autoclass, Cobweb, SOM
 Current clustering techniques do not address all the requirements
adequately

38
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of
high dimensional data for data mining applications. SIGMOD'98
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify
the clustering structure, SIGMOD’99.
 P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scietific, 1996
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing
techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning,
2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based
on dynamic systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Printice Hall, 1988.

39
References (2)
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets.
VLDB’98.
 G. J. McLachlan and K.E. Bkasford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
 P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
 W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to Spatial
Data Mining, VLDB’97.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method
for very large databases. SIGMOD'96.

dm_clustering2.ppt

Recommended

Recommended

More Related Content

Similar to dm_clustering2.ppt

Similar to dm_clustering2.ppt (20)

Recently uploaded

Recently uploaded (20)

dm_clustering2.ppt