CLIQUE is an algorithm for subspace clustering of high-dimensional data. It works in two steps: (1) It partitions each dimension of the data space into intervals of equal length to form a grid, (2) It identifies dense units within this grid and finds clusters as maximal sets of connected dense units. CLIQUE efficiently discovers clusters by identifying dense units in subspaces and intersecting them to obtain candidate dense units in higher dimensions. It automatically determines relevant subspaces for clustering and scales well with large, high-dimensional datasets.
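The two steps above can be sketched in a few lines of Python. This is an illustrative toy (the function and parameter names `dense_units`, `xi`, `tau` are our own, not from any CLIQUE library): partition each dimension into `xi` equal-length intervals to form a grid, then keep the units whose fraction of points exceeds the density threshold `tau`.

```python
# Toy sketch of CLIQUE's two steps on 2-D data (illustrative names only).
import random
from collections import defaultdict

def dense_units(points, xi=10, tau=0.05):
    """Step 1: partition each dimension into xi equal-length intervals,
    forming a grid; Step 2: keep units whose fraction of points exceeds tau."""
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]
    counts = defaultdict(int)
    for p in points:
        cell = tuple(
            min(int((p[i] - lo[i]) / (hi[i] - lo[i] + 1e-12) * xi), xi - 1)
            for i in range(d)
        )
        counts[cell] += 1
    n = len(points)
    return {cell for cell, c in counts.items() if c / n > tau}

random.seed(0)
# one crowded region plus uniform background noise
pts = [(random.gauss(2, 0.2), random.gauss(3, 0.2)) for _ in range(200)]
pts += [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(100)]
dense = dense_units(pts, xi=10, tau=0.05)
print(len(dense))  # only a few of the 100 grid cells are dense
```

Finding clusters then amounts to taking maximal connected components among these dense units, and the Apriori-style intersection step generates higher-dimensional candidates only from dense lower-dimensional projections.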
1. CLIQUE and STING
Dr S.Natarajan
Professor and Key Resource Person
Department of Information Science and
Engineering
PES Institute of Technology
Bengaluru
natarajan@pes.edu
995280225
2. High-dimensional integration
• High-dimensional integrals in statistics, ML, physics
• Expectations / model averaging
• Marginalization
• Partition function / rank models / parameter learning
• Curse of dimensionality:
• Quadrature involves weighted sum over exponential
number of items (e.g., units of volume)
[Figure: an n-dimensional hypercube with side length L; the number of units of volume grows as L, L², L³, …, Lⁿ]
3. High Dimensional Indexing Techniques
• Index trees (e.g., X-tree, TV-tree, SS-tree, SR-tree, M-tree, Hybrid Tree)
– Sequential scan better at high dim. (Dimensionality Curse)
• Dimensionality reduction (e.g., Principal Component
Analysis (PCA)), then build index on reduced space
4. Datasets
• Synthetic dataset:
– 64-d data, 100,000 points, generates clusters in different
subspaces (cluster sizes and subspace dimensionalities follow
Zipf distribution), contains noise
• Real dataset:
– 64-d data (8X8 color histograms extracted from 70,000
images in Corel collection), available at
http://kdd.ics.uci.edu/databases/CorelFeatures
5. Preliminaries – Nearest Neighbor Search
• Given a collection of data points and a query
point in m-dimensional metric space, find the
data point that is closest to the query point
• Variation: k-nearest neighbor
• Relevant to clustering and similarity search
• Applications: Geographical Information
Systems, similarity search in multimedia
databases
8. Problems (Cont.)
• NN query cost degrades – there are more strong candidates to compare with
• In as few as 10 dimensions, linear scan
outperforms some multidimensional indexing
structures (e.g. SS tree, R* tree, SR tree)
• Biology and genomic data can have
dimensions in the 1000’s.
9. Problems (Cont.)
• The presence of irrelevant attributes decreases the tendency for clusters to form
• Points in high-dimensional space have a high degree of freedom; they could be so scattered that they appear uniformly distributed
11. The Curse
• Refers to the decrease in performance of query processing when the dimensionality increases
• The focus of this talk will be on quality issues of NN search, not on performance issues
• In particular, under certain conditions, the distance between the nearest point and the query point equals the distance between the farthest point and the query point as dimensionality approaches infinity
12. Curse (Cont.)
Source: N. Katayama, S. Satoh. Distinctiveness Sensitive Nearest Neighbor Search for Efficient Similarity
Retrieval of Multimedia Information. ICDE Conference, 2001.
13. Unstable NN-Query
A nearest neighbor query is unstable for a given ε > 0 if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor
Source: [2]
16. Rate of Convergence
• At what dimensionality do NN-queries become unstable? Not easy to answer, so experiments were performed on real and synthetic data.
• If the conditions of the theorem are met, DMAXm/DMINm should decrease with increasing dimensionality
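This shrinking DMAX/DMIN ratio is easy to reproduce. The experiment below is our own illustration (not from the talk): for i.i.d. uniform data, the ratio of the farthest to the nearest neighbor distance of a random query point collapses toward 1 as dimensionality grows.

```python
# Illustrative experiment: distance concentration in high dimensions.
import math
import random

def dmax_dmin_ratio(dim, n=1000, seed=42):
    """Ratio of farthest to nearest neighbor distance for a random query
    point among n uniform points in the dim-dimensional unit cube."""
    rnd = random.Random(seed)
    query = [rnd.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rnd.random() for _ in range(dim)])
        for _ in range(n)
    ]
    return max(dists) / min(dists)

for dim in (2, 20, 200):
    print(dim, round(dmax_dmin_ratio(dim), 2))
# the ratio drops sharply with dimension, making NN queries "unstable"
```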
17. Conclusions
• Make sure there is enough contrast between the query and data points. If the distance to the NN is not much different from the average distance, the NN may not be meaningful
• When evaluating high-dimensional indexing
techniques, should use data that do not satisfy
Theorem 1 and should compare with linear scan
• Meaningfulness also depends on how you
describe the object that is represented by the
data point (i.e., the feature vector)
18. Other Issues
• After selecting relevant attributes, the
dimensionality could still be high
• Reporting cases when data does not yield any
meaningful nearest neighbor, i.e. indistinctive
nearest neighbors
19. Sudoku
• How many ways to fill a valid sudoku square?
• Sum over 9^81 ≈ 10^77 possible squares (items)
• w(x) = 1 if it is a valid square, w(x) = 0 otherwise
• Accurate solution within seconds:
• 1.634×10^21 vs 6.671×10^21
[Figure: partially filled sudoku grid]
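The weighted-sum view on this slide can be made concrete on the 4×4 variant (Shidoku), where the exact count is small enough to enumerate. The sketch below is our own illustration: rather than summing w(x) over all 4^16 grids, backtracking visits only partial grids that can still be valid.

```python
# Count valid 4x4 sudoku (Shidoku) squares: the sum of w(x) over all
# grids, computed by backtracking instead of brute-force enumeration.
def count_shidoku():
    grid = [[0] * 4 for _ in range(4)]

    def valid(r, c, v):
        # v must not repeat in row r, column c, or the 2x2 box
        if any(grid[r][j] == v for j in range(4)):
            return False
        if any(grid[i][c] == v for i in range(4)):
            return False
        br, bc = 2 * (r // 2), 2 * (c // 2)
        return all(grid[br + i][bc + j] != v
                   for i in range(2) for j in range(2))

    def fill(pos):
        if pos == 16:
            return 1  # a complete valid square: w(x) = 1
        r, c = divmod(pos, 4)
        total = 0
        for v in (1, 2, 3, 4):
            if valid(r, c, v):
                grid[r][c] = v
                total += fill(pos + 1)
                grid[r][c] = 0
        return total

    return fill(0)

print(count_shidoku())  # → 288 valid 4x4 squares
```

The 9×9 case on the slide is far too large for this exact approach, which is why approximate counting methods that still land close to the true 6.671×10^21 are remarkable.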
21. Minimum Description Length Principle
Occam's razor: prefer the simplest hypothesis
Simplest hypothesis = hypothesis with shortest description length
Minimum description length: prefer the shortest hypothesis
L_C(x) is the description length for message x under coding scheme C
h_MDL = argmin_{h ∈ H} [ L_C1(h) + L_C2(D|h) ]
where L_C1(h) is the number of bits to encode hypothesis h (complexity of the model), and L_C2(D|h) is the number of bits to encode data D given h (number of mistakes)
22. MDL: Interpretation of –log P(D|H) + K(H)
Interpreting –log P(D|H) + K(H):
K(H) is the minimum description length of H
–log P(D|H) is the minimum description length of D (experimental data) given H. That is, if H perfectly explains D, then P(D|H) = 1 and this term is 0. If not perfect, it is interpreted as the number of bits needed to encode the errors.
MDL: Minimum Description Length principle (J. Rissanen): given data D, the best theory for D is the theory H which minimizes the sum of
– the length of encoding H
– the length of encoding D, based on H (encoding errors)
23. CLIQUE: A Dimension-Growth Subspace Clustering Method
First dimension-growth subspace clustering algorithm
Clustering starts at single-dimension subspaces and moves upwards towards higher-dimension subspaces
This algorithm can be viewed as the integration of density-based and grid-based algorithms
24. CLIQUE (CLustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
• Automatically identifying subspaces of a high dimensional data
space that allow better clustering than original space
• CLIQUE can be considered as both density-based and grid-based
– It partitions each dimension into the same number of equal length intervals
– It partitions an m-dimensional data space into non-overlapping rectangular
units
– A unit is dense if the fraction of total data points contained in the unit
exceeds the input model parameter
– A cluster is a maximal set of connected dense units within a subspace
25. Definitions That Need to Be Known
Unit : After forming a grid structure on
the space, each rectangular cell is
called a Unit.
Dense: A unit is dense, if the fraction of
total data points contained in the
unit exceeds the input model
parameter.
Cluster: A cluster is defined as a maximal
set of connected dense units.
26. Informal problem statement
Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points.
CLIQUE's clustering identifies the sparse and the "crowded" areas in space (or units), thereby discovering the overall distribution patterns of the dataset.
A unit is dense if the fraction of total data points contained in it exceeds an input model parameter.
In CLIQUE, a cluster is defined as a maximal set of connected dense units.
27. Formal Problem Statement
Let A = {A1, A2, . . . , Ad} be a set of bounded, totally ordered domains and S = A1 × A2 × · · · × Ad a d-dimensional numerical space.
We will refer to A1, . . . , Ad as the dimensions (attributes) of S.
The input consists of a set of d-dimensional points V = {v1, v2, . . . , vm}, where vi = (vi1, vi2, . . . , vid). The j-th component of vi is drawn from domain Aj.
28. The CLIQUE Algorithm (cont.)
3. Minimal description of clusters
The minimal description of a cluster C, produced by the above procedure, is the minimum possible union of hyperrectangular regions.
For example
• A ∪ B is the minimal cluster description of the shaded region.
• C ∪ D ∪ E is a non-minimal cluster description of the same region.
29. CLIQUE Working
2-Step Process
1st step – Partitioning the d-dimensional data space
2nd step – Generating the minimal description of each cluster
32. Continued…
The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist.
This approach of selecting candidates is quite similar to the Apriori-Gen process of generating candidates.
Here it is expected that if something is dense in a higher-dimensional space, it cannot be sparse in the lower-dimensional state.
33. More formally
If a k-dimensional unit is dense, then so are its projections in (k−1)-dimensional space.
Given a k-dimensional candidate dense unit, if any of its (k−1)-dimensional projection units is not dense, then the k-dimensional unit cannot be dense.
So, we can generate candidate dense units in k-dimensional space from the dense units found in (k−1)-dimensional space.
The resulting space searched is much smaller than the original space.
The dense units are then examined in order to determine the clusters.
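The candidate-generation rule above can be sketched directly. In this illustrative code (representation and names are our own, not the authors'), a k-dimensional unit is a set of (dimension, interval) pairs; two (k−1)-dimensional dense units are joined if their union spans k distinct dimensions, and the candidate is pruned unless every (k−1)-dimensional projection was itself dense.

```python
# Apriori-style candidate generation for CLIQUE (illustrative sketch).
from itertools import combinations

def candidates(dense_km1, k):
    """Join (k-1)-dim dense units into k-dim candidates, then prune any
    candidate that has a non-dense (k-1)-dim projection."""
    out = set()
    for a, b in combinations(list(dense_km1), 2):
        u = a | b
        if len(u) != k or len({d for d, _ in u}) != k:
            continue  # the union must span exactly k distinct dimensions
        if all(frozenset(s) in dense_km1 for s in combinations(u, k - 1)):
            out.add(frozenset(u))
    return out

# 1-D dense units: interval 2 on dim 0, interval 3 on dim 1, interval 7 on dim 2
d1 = {frozenset({(0, 2)}), frozenset({(1, 3)}), frozenset({(2, 7)})}
print(candidates(d1, 2))  # every cross-dimension pair survives: 3 candidates
```

Applying the same function again on the 2-D result yields 3-D candidates only where all three 2-D faces are dense, which is exactly the pruning the slide describes.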
34. Intersection
Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate search space for dense units of higher dimensionality.
35. 2nd Stage – Minimal Description
For each cluster, CLIQUE determines the maximal region that covers the cluster of connected dense units.
It then determines a minimal cover (logic description) for each cluster.
36. Effectiveness of CLIQUE
CLIQUE automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
It is insensitive to the order of input objects.
It scales linearly with the size of the input.
It is easily scalable with the number of dimensions in the data.
37. GRID-BASED CLUSTERING METHODS
This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.
So, for example, assume that we have a set of records and we want to cluster with respect to two attributes; then we divide the related space (plane) into a grid structure and then we find the clusters.
39. Techniques for Grid-Based Clustering
The following are some techniques that are used to perform Grid-Based Clustering:
CLIQUE (CLustering In QUEst)
STING (STatistical Information Grid)
WaveCluster
40. Looking at CLIQUE as an Example
CLIQUE is used for the clustering of high-
dimensional data present in large tables.
By high-dimensional data we mean
records that have many attributes.
CLIQUE identifies the dense units in the
subspaces of high dimensional data
space, and uses these subspaces to
provide more efficient clustering.
41. How Does CLIQUE Work?
Let us say that we have a set of records
that we would like to cluster in terms of
n-attributes.
So, we are dealing with an n-
dimensional space.
MAJOR STEPS :
CLIQUE partitions each subspace that has
dimension 1 into the same number of equal
length intervals.
Using this as basis, it partitions the n-
dimensional data space into non-overlapping
rectangular units.
42. CLIQUE: Major Steps (Cont.)
Now CLIQUE’S goal is to identify the dense n-
dimensional units.
It does this in the following way:
CLIQUE finds dense units of higher
dimensionality by finding the dense units in the
subspaces.
So, for example if we are dealing with a 3-
dimensional space, CLIQUE finds the dense
units in the 3 related PLANES (2-dimensional
subspaces.)
It then intersects the extension of the
subspaces representing the dense units to
form a candidate search space in which dense
units of higher dimensionality would exist.
43. CLIQUE: Major Steps. (Cont.)
Each maximal set of connected dense units is
considered a cluster.
Using this definition, the dense units in the
subspaces are examined in order to find
clusters in the subspaces.
The information of the subspaces is then used
to find clusters in the n-dimensional space.
It must be noted that all cluster boundaries are
either horizontal or vertical. This is due to the
nature of the rectangular grid cells.
44. Example for CLIQUE
Let us say that we want to cluster a set of records that have three attributes, namely salary, vacation, and age.
The data space for this data would be 3-dimensional.
[Figure: 3-D axes labeled age, salary, vacation]
45. Example (Cont.)
After plotting the data objects,
each dimension, (i.e., salary,
vacation and age) is split into
intervals of equal length.
Then we form a 3-dimensional grid
on the space, each unit of which
would be a 3-D rectangle.
Now, our goal is to find the dense
3-D rectangular units.
46. Example (Cont.)
To do this, we find the dense units of the subspaces of this 3-d space.
So, we find the dense units with respect to age for salary. This means that we look at the salary-age plane and find all the 2-D rectangular units that are dense.
We also find the dense 2-D rectangular units for the vacation-age plane.
47. Example 1
[Figure: three 2-D plots – salary (×10,000, 0–7) vs. age (20–60) and vacation (weeks, 0–7) vs. age (20–60), showing the dense rectangular units in each plane]
48. Example (Cont.)
Now let us try to visualize the dense units of the two planes on the following 3-d figure:
[Figure: 3-D plot with axes age, vacation, salary; dense regions highlighted between age 30 and 50; threshold = 3]
49. Example (Cont.)
We can extend the dense areas in the
vacation-age plane inwards.
We can extend the dense areas in the
salary-age plane upwards.
The intersection of these two spaces
would give us a candidate search space in
which 3-dimensional dense units exist.
We then find the dense units in the
salary-vacation plane and we form an
extension of the subspace that represents
these dense units.
50. Example (Cont.)
Now, we perform an intersection of
the candidate search space with the
extension of the dense units of the
salary-vacation plane, in order to
get all the 3-d dense units.
So, What was the main idea?
We used the dense units in
subspaces in order to find the dense
units in the 3-dimensional space.
After finding the dense units, it is
very easy to find clusters.
51. Reflecting upon CLIQUE
Why does CLIQUE confine its search for
dense units in high dimensions to the
intersection of dense units in subspaces?
Because the Apriori property employs
prior knowledge of the items in the search
space so that portions of the space can be
pruned.
The property for CLIQUE says that if a k-
dimensional unit is dense then so are its
projections in the (k-1) dimensional
space.
52. Strength and Weakness of CLIQUE
Strength
It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
It is quite efficient.
It is insensitive to the order of records in input and does not presume some canonical data distribution.
It scales linearly with the size of input and has good scalability as the number of dimensions in the data increases.
Weakness
The accuracy of the clustering result may be degraded at the expense of the simplicity of this method.
53. CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters:
– Determine dense units in all subspaces of interest
– Determine connected dense units in all subspaces of interest
• Generate minimal description for the clusters
– Determine maximal regions that cover a cluster of connected dense units, for each cluster
– Determine the minimal cover for each cluster
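The major steps above can be sketched end-to-end in 2-D. This is an illustrative toy, not the authors' implementation (names like `clique_2d`, `xi`, `tau` are our own): grid the space, keep dense units, and grow clusters as maximal sets of face-connected dense units via BFS.

```python
# End-to-end CLIQUE sketch in 2-D: grid, dense units, connected components.
import random
from collections import defaultdict, deque

def clique_2d(points, xi=10, tau=0.05):
    lo = [min(p[i] for p in points) for i in (0, 1)]
    hi = [max(p[i] for p in points) for i in (0, 1)]
    counts = defaultdict(int)
    for p in points:
        cell = tuple(
            min(int((p[i] - lo[i]) / (hi[i] - lo[i] + 1e-12) * xi), xi - 1)
            for i in (0, 1)
        )
        counts[cell] += 1
    dense = {c for c, n in counts.items() if n / len(points) > tau}
    # a cluster = maximal set of dense units connected via shared faces
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        comp, q = set(), deque([start])
        seen.add(start)
        while q:
            x, y = q.popleft()
            comp.add((x, y))
            for nb in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    q.append(nb)
        clusters.append(comp)
    return clusters

random.seed(1)
pts = [(random.gauss(2, 0.3), random.gauss(2, 0.3)) for _ in range(150)]
pts += [(random.gauss(7, 0.3), random.gauss(7, 0.3)) for _ in range(150)]
clusters = clique_2d(pts)
print(len(clusters))  # the two well-separated blobs end up in separate clusters
```

Because cluster boundaries follow grid cells, the output regions are unions of rectangles, which is exactly what the minimal-description step then covers with as few rectangles as possible.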
54. [Figure: the salary–age and vacation–age grids from the earlier example (salary ×10,000 and vacation in weeks, 0–7, vs. age 20–60), with the resulting dense region between age 30 and 50; threshold = 3]
55. Strength and Weakness of CLIQUE
• Strength
– It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in those
subspaces
– It is insensitive to the order of records in input and does not
presume some canonical data distribution
– It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
• Weakness
– The accuracy of the clustering result may be degraded at
the expense of simplicity of the method
56. Global Dimensionality Reduction (GDR)
[Figure: first principal component (PC) fitted to two datasets]
• Works well only when data is globally correlated
• Otherwise too many false positives result in high query cost
• Solution: find local correlations instead of global correlation
58. Correlated Cluster
[Figure: a locally correlated cluster – first PC (retained dim.), second PC (eliminated dim.), mean of all points in the cluster, and the centroid of the cluster (projection of the mean on the eliminated dim.)]
A set of locally correlated points = <PCs, subspace dim, centroid, points>
61. Other constraints
• Dimensionality bound: a cluster must not retain any more dimensions than necessary, and subspace dimensionality ≤ MaxDim
• Size bound: number of points in the cluster ≥ MinSize
62. Clustering Algorithm
Step 1: Construct Spatial Clusters
• Choose a set of well-scattered points as centroids (piercing set) from a random sample
• Group each point P in the dataset with its closest centroid C if the Dist(P,C)
64. Clustering Algorithm
Step 3: Compute Subspace Dimensionality
[Chart: fraction of points obeying the reconstruction bound (0-1) vs. #dims retained (0-16)]
• Assign each point to the cluster that needs the fewest dimensions to accommodate it
• The subspace dim. for each cluster is the minimum number of dims to retain so as to keep most points
65. Clustering Algorithm
Step 4: Recluster points
• Assign each point P to the cluster S such that ReconDist(P,S) ≤ MaxReconDist
• If there are multiple such clusters, assign P to the first one (overcomes the "splitting" problem)
• [Figure: reclustering; some clusters may end up empty]
66. Clustering Algorithm
Step 5: Map points
• Eliminate small clusters
• Map each point to its subspace (also store the reconstruction dist.)
67. Clustering Algorithm
Step 6: Iterate
• Iterate for more clusters as long as new clusters are being found among the outliers
• Overall complexity: 3 passes, O(ND²K)
68. Experiments (Part 1)
• Precision Experiments:
– Compare information loss in GDR and LDR for same reduced
dimensionality
– Precision = |Orig. Space Result|/|Reduced Space Result| (for
range queries)
– Note: precision measures efficiency, not answer quality
69. Datasets
• Synthetic dataset:
– 64-d data, 100,000 points, generates clusters in different
subspaces (cluster sizes and subspace dimensionalities follow
Zipf distribution), contains noise
• Real dataset:
– 64-d data (8X8 color histograms extracted from 70,000
images in Corel collection), available at
http://kdd.ics.uci.edu/databases/CorelFeatures
70. Precision Experiments (1)
[Charts: precision (0-1) vs. skew in cluster size (0, 0.5, 1, 2) and precision vs. number of clusters (1, 2, 5, 10), comparing LDR and GDR]
71. Precision Experiments (2)
[Charts: precision (0-1) vs. degree of correlation (0, 0.02, 0.05, 0.1, 0.2) and precision vs. reduced dim (7, 10, 12, 14, 23, 42), comparing LDR and GDR]
72. Index structure
[Diagram: a root node containing pointers to the root of each cluster index (it also stores the PCs and subspace dim. of each cluster); below it, one index per cluster (Cluster 1 … Cluster K) plus the set of outliers, which has no index and is scanned sequentially]
Properties: (1) disk-based, (2) height = 1 + height(original space index), (3) almost balanced
73. Cluster Indices
• For each cluster S, build a multidimensional index on the (d+1)-dimensional space instead of the d-dimensional space:
– NewImage(P,S)[j] = projection of P along the jth PC, for 1 ≤ j ≤ d
– NewImage(P,S)[d+1] = ReconDist(P,S)
• Better estimate:
D(NewImage(P,S), NewImage(Q,S)) ≥ D(Image(P,S), Image(Q,S))
• Correctness: Lower Bounding Lemma
D(NewImage(P,S), NewImage(Q,S)) ≤ D(P,Q)
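The Lower Bounding Lemma can be checked numerically. A sketch with NumPy, using a hypothetical cluster (random centroid, random orthonormal PCs; none of this is data from the paper): NewImage keeps the d retained projections plus ReconDist as the (d+1)th coordinate, and the distance in that space never exceeds the true distance:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 8, 3                        # original and retained dimensionality

# Hypothetical cluster: a centroid plus an orthonormal set of PCs.
centroid = rng.normal(size=D)
pcs, _ = np.linalg.qr(rng.normal(size=(D, D)))   # columns are orthonormal
retained = pcs[:, :d]              # first d PCs span the cluster subspace

def recon_dist(p):
    """Distance from p to its reconstruction in the retained subspace."""
    v = p - centroid
    proj = retained @ (retained.T @ v)
    return np.linalg.norm(v - proj)

def new_image(p):
    """(d+1)-dim image: retained projections plus the reconstruction distance."""
    v = p - centroid
    return np.append(retained.T @ v, recon_dist(p))

# Lemma: the distance in the (d+1)-dim index space never exceeds the
# true distance, so index scans produce no false dismissals.
P, Q = rng.normal(size=D), rng.normal(size=D)
lhs = np.linalg.norm(new_image(P) - new_image(Q))
rhs = np.linalg.norm(P - Q)
```

The inequality holds because the retained and eliminated components are orthogonal, and the reverse triangle inequality bounds the difference of the two reconstruction distances.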
75. Outlier Index
• Retain all dimensions
• May build an index, else use sequential scan
(we use sequential scan for our
experiments)
76. Query Support
• Correctness:
– Query result same as original space index
• Point query, Range Query, k-NN query
– similar to algorithms in multidimensional index structures
– see paper for details
• Dynamic insertions and deletions
– see paper for details
77. Experiments (Part 2)
• Cost Experiments:
– Compare linear scan, Original Space Index (OSI), GDR, and LDR in terms of I/O and CPU costs. We used the hybrid tree index structure for OSI, GDR, and LDR.
• Cost Formulae:
– Linear scan: I/O cost (#rand accesses) = file_size/10; CPU cost
– OSI: I/O cost = num index nodes visited; CPU cost
– GDR: I/O cost = index cost + post-processing cost (to eliminate false positives); CPU cost
– LDR: I/O cost = index cost + post-processing cost + outlier_file_size/10; CPU cost
78. I/O Cost (#random disk accesses)
[Chart: #random disk accesses (0-3000) vs. reduced dim (7-60) for LDR, GDR, OSI, and linear scan]
79. CPU Cost (only computation time)
[Chart: CPU cost in seconds (0-80) vs. reduced dim (7-42) for LDR, GDR, OSI, and linear scan]
80. Conclusion
• LDR is a powerful dimensionality reduction technique
for high dimensional data
– reduces dimensionality with lower loss in distance
information compared to GDR
– achieves significantly lower query cost compared to linear
scan, original space index and GDR
• LDR has applications beyond high dimensional indexing
81. 10/8/2016 CLIQUE clustering algorithm 81
Motivation
An object typically has dozens of attributes, and the domain of each attribute can be large.
Existing methods require the user to specify the subspace for cluster analysis, but user identification of subspaces is quite error-prone.
82. The Contribution of CLIQUE
Automatically find subspaces with
high-density clusters in high
dimensional attribute space
83. Background
A1, A2, …, Ad are the dimensions of S:
A = {A1, A2, …, Ad}
S = A1 × A2 × … × Ad
Units:
Partition every dimension into ξ intervals of equal length.
A unit u is {u1, u2, …, ud}, where ui = [li, hi).
84. Background (Cont.)
Selectivity: the fraction of total data points contained in the unit
Dense unit: selectivity(u) > τ (a density threshold)
Cluster: a maximal set of connected dense units
86. Background (Cont.)
Region: an axis-parallel rectangular set
R ⊆ C: R is contained in C
Maximal region: no proper superset of R is contained in C
Minimal description: a non-redundant covering of the cluster with maximal regions
88. CLIQUE Algorithm
1. Identification of dense units
2. Identification of clusters.
3. Generation of minimal description
89. Identification of dense units
Bottom-up algorithm, like the Apriori algorithm.
Monotonicity: if a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k−1)-dimensional projection of this space.
90. Algorithm
1. determine 1-dimensional dense units
2. k = 2
3. generate candidate k-dimensional units from
(k-1)-dimensional dense units
4. if candidates are not empty
find dense units
k = k + 1
go to step 3
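Step 3's Apriori-style join can be sketched as follows, assuming dense units are represented as sets of (dimension, interval) pairs. The dimension names and the simplified all-pairs join are illustrative, not the paper's exact procedure:

```python
from itertools import combinations

# Hypothetical dense 1-d units: (dimension, interval) pairs.
dense = {frozenset([("age", 2)]), frozenset([("age", 3)]),
         frozenset([("salary", 5)]), frozenset([("vacation", 1)])}

def candidates(dense_k, k):
    """Join (k-1)-dim dense units into k-dim candidates, then prune any
    candidate having a non-dense (k-1)-dim projection (Apriori style)."""
    out = set()
    for a, b in combinations(dense_k, 2):
        cand = a | b
        dims = {d for d, _ in cand}
        # k intervals over k distinct dimensions
        if len(cand) == k and len(dims) == k:
            # every (k-1)-subset must itself be dense
            if all(frozenset(s) in dense_k
                   for s in combinations(cand, k - 1)):
                out.add(cand)
    return out

c2 = candidates(dense, 2)
# ("age", 2) cannot join ("age", 3): same dimension. Each
# cross-dimension pair becomes a candidate, which CLIQUE then keeps
# only if it is actually dense in the data.
```

The algorithm then counts the points in each candidate, keeps the dense ones, and repeats with k = k + 1 until no candidates remain.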
93. Prune subspaces
Objective: use only the dense units that lie in
“interesting” subspaces
MDL principle:
encode the input data under a given model and
select the encoding that minimizes the code
length.
94. Prune subspaces (Cont.)
Group together dense units in the same subspace
Compute the number of points covered in each subspace: x_Sj = Σ_{ui ∈ Sj} count(ui)
Sort subspaces in the descending order of their coverage
Minimize the total length of the encoding over the cut point i between the selected set I and the pruned set P:

CL(i) = log₂(μ_I(i)) + Σ_{j ≤ i} log₂(|x_Sj − μ_I(i)|) + log₂(μ_P(i)) + Σ_{i < j ≤ n} log₂(|x_Sj − μ_P(i)|)

where μ_I(i) and μ_P(i) are the mean coverages of the selected and pruned subspaces.
95. Prune subspaces (Cont.)
Partitioning of the subspaces into selected and pruned sets
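The MDL-based cut-point selection can be sketched with toy coverage values. The +1 inside the deviation logarithm (to avoid log₂ 0) is a simplification, not the paper's exact coding:

```python
import math

# Hypothetical subspace coverages, already sorted in descending order.
x = [310, 290, 40, 35, 30]

def code_length(i):
    """Code length when subspaces x[:i] are selected and x[i:] pruned:
    each group stores its mean plus every coverage's deviation from it.
    (+1 inside the log avoids log2(0); a simplification.)"""
    def part(vals):
        if not vals:
            return 0.0
        mu = sum(vals) / len(vals)
        return math.log2(mu) + sum(math.log2(abs(v - mu) + 1) for v in vals)
    return part(x[:i]) + part(x[i:])

# Choose the cut point that minimizes the total encoding length.
best = min(range(1, len(x)), key=code_length)
# With these coverages, the two high-coverage subspaces are selected
# and the three low-coverage ones are pruned.
```

The intuition matches the slide: a cut that separates high-coverage from low-coverage subspaces makes both groups tight around their means, so the deviations (and hence the code length) are small.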
97. Generating minimal cluster descriptions
A set R of regions is a cover of C if every dense unit of C is contained in some region.
Finding an optimal cover is NP-hard.
Solution: greedily cover the cluster by a number of maximal regions, then discard the redundant regions.
98. Greedy growth
1) Begin with an arbitrary dense unit u ∈ C
2) Greedily grow a maximal region covering u; add it to R
3) Repeat 2) until all units u ∈ C are covered by some maximal region in R
99. Minimal Cover
Remove from the cover the smallest
maximal region which is redundant.
Repeat the procedure until no maximal
region can be removed.
101. Comparison with BIRCH, DBSCAN
The paper concludes that CLIQUE performs better than BIRCH and DBSCAN.
102. Real data experimental result
datasets:
insurance industry (Insur1, Insur2)
department store (Store)
bank (Bank)
In all cases, we discovered
meaningful clusters
embedded in lower
dimensional subspaces.
103. Strength
automatically finds clusters in subspaces
insensitive to the order of records
not presume some canonical data
distribution
scales linearly with the size of input
tolerant of missing values
104. Weakness
Depends on some parameters that are hard to pre-select:
ξ (partition threshold)
τ (density threshold)
Some potential clusters may be lost in the dense-unit pruning procedure, so the correctness of the algorithm degrades.
105. What or who is STING?
A singer who was the lead singer of the band Police, then took up a solo career and won many Grammys.
The bite of a scorpion.
A Statistical
Information Grid
Approach to Spatial
Data Mining.
All of the above.
106. What is Spatial Data?
Many definitions according to specific areas
According to GIS
Spatial data may be thought of as features
located on or referenced to the Earth's
surface, such as roads, streams, political
boundaries, schools, land use classifications,
property ownership parcels, drinking water
intakes, pollution discharge sites - in short,
anything that can be mapped.
Geographic features are stored as a series of
coordinate values. Each point along a road or
other feature is defined by positional
coordinate value, such as longitude and
latitude.
The GIS stores and manages the data not as
a map but as a series of layers or, as they
are sometimes called, themes
When viewed in a GIS, these layers
visually appear as one graphic, but are
actually still independent of each other.
This allows changes to specific themes,
without affecting the others.
Discussion Question 1: So can you define spatial data generically?
107. •Spatial database systems aim at storing, retrieving, manipulating, querying, and
analyzing geometric data.
•Special data types are necessary to model geometry and to suitably represent
geometric data in database systems. These data types are usually called spatial
data types, such as point, line, and region but also include more complex types like
partitions and graphs (networks).
•Data Type understanding is a prerequisite for an effective construction of important
components of a spatial database system (like spatial index structures, optimizers
for spatial data, spatial query languages, storage management, and graphical user
interfaces) and for a cooperation with extensible DBMS providing spatial type
extension packages (like spatial data blades and cartridges).
•Excellent tutorial on spatial data and data types available at:
http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
What are Spatial Databases?
110. Pennsylvania Spatial Data Access
http://www.pasda.psu.edu/
The Missouri Spatial Data Information Service
http://msdis.missouri.edu/
National Spatial Data Infrastructure
http://www.fgdc.gov/nsdi/nsdi.html
Michigan Department of Natural Resources Online
www.dnr.state.mi.us/spatialdatalibrary/
Georgia Spatial Data Infrastructure Home Page
www.gis.state.ga.us/
Free GIS Data - GIS Data Depot
www.gisdatadepot.com
Spatial Data Resources
111. Spatial Data Mining
Discovery of interesting characteristics and
patterns that may implicitly exist in spatial
databases.
Huge amount of data specialized in nature.
Clustering and region oriented queries are
common problems in this domain.
We deal with high dimensional data
generally.
Applications: GIS, Medical Imaging etc.
112. Problems
•Huge amount of data, specialized in nature
•Complexity
•Defining geometric patterns and region-oriented queries
•Conceptual nature of the problem
•Spatial data access
113. STING-An Introduction
•STING is a grid based method to efficiently process many
common region oriented queries on a set of points
•What defines region? You tell me! Essentially it is a set of points
satisfying some criterion
•It is a hierarchical Method. The idea is to capture statistical
information associated with spatial cells in such a manner that
the whole classes of queries can be answered without referring
to the individual objects.
•Complexity is hence even less than O(n); in fact, what do you think it will be?
•Link to Paper: http://citeseer.nj.nec.com/wang97sting.html
114. Related Work
Spatial Data Mining:
– Generalization-Based Knowledge Discovery: Spatial-Data-Dominant and Non-Spatial-Data-Dominant
– Clustering-Based Methods: CLARANS, BIRCH, DBSCAN
Great comparison of clustering algorithms:
http://www.cs.ualberta.ca/~joerg/papers/KI-Journal.pdf
115. Generalization Based Approaches
Two types: Spatial Data Dominant and Non-
Spatial Data Dominant
Both of these require that a generalization
hierarchy is given explicitly by experts or is
somehow generated automatically.
Quality of mined data depends on the structure of
the hierarchy.
Computational Complexity O(nlogn)
So the onus shifted to developing algorithms
which discover characteristics directly from data.
This was the motivation to move to clustering
algorithms
116. Clustering Based Approaches
BIRCH: Already covered Remember it??
Complexity??
The problem with BIRCH is that it does not work
well with clusters which are not spherical.
DBSCAN: Already covered Remember it??
Complexity??
The Global Parameter Eps determination in
DBSCAN requires human participation
When the point set to be clustered is the response
set of objects with some qualifications, then
determination of Eps must be done each time and
cost is hence higher.
117. Clustering Based Approaches
CLARANS: Clustering Large Applications based upon
RANdomized Search.
Although claims have been made that it is linear, it is essentially quadratic.
The computational complexity is at least Ω(KN²), where N is the number of data points and K is the number of clusters.
The quality of results cannot be guaranteed when N is large, since randomized search is used.
See: Optimization with Randomized Search Heuristics: The (A)NFL Theorem, Realistic Scenarios, and Difficult Functions
118. Related Work
All the approaches described in previous slides
are all query dependent approaches
The structure of queries influence the structure of
the algorithm and cannot be generalized to all
queries.
As they scan all the data points the complexity
will at least be O(N)
119. STING THE OVERVIEW
Spatial Area is divided into rectangular cells
Different levels of cells corresponding to different
resolution and these cells have a hierarchical structure.
Each cell at a higher level is partitioned into number of
cells of the next lower level
Statistical information of each cell is calculated and stored
beforehand and is used to answer queries
120. GRID CELL HIERARCHY
Each Cell at (i-1)th level has 4
children at ith level (can be
changed)
The size of a leaf cell depends on the density of objects; generally a leaf cell should contain from several dozen to several thousand objects.
121. GRID CELL HIERARCHY
For each cell we have attribute-dependent and attribute-independent parameters.
The attribute-independent parameter is the number of objects in the cell, n.
For the attribute-dependent parameters, it is assumed that each object's attributes have numerical values.
For each numerical attribute we have the following five parameters:
122. GRID CELL HIERARCHY
m - mean of all values in this cell
s - standard deviation of all values in this cell
min - the minimum value of the attribute in this cell
max - the maximum value of the attribute in this cell
distribution - the type of distribution this cell follows (of enumeration type)
123. Parameter Generation
•The dist parameter is determined as follows.
•First, dist is set to the distribution followed by most points.
•An estimate is made of the number of conflicting points, confl, according to the following rules:
1) if disti ≠ dist, m ≈ mi and s ≈ si, then confl is increased by ni
2) if disti ≠ dist, and m ≈ mi or s ≈ si but not both, then confl is set to n
3) if disti = dist, m ≈ mi and s ≈ si, then confl is not changed
4) if disti = dist, and m ≈ mi or s ≈ si but not both, then confl is set to n
Finally, if confl/n is greater than a threshold (say 0.05), dist is set to NONE; otherwise the original dist is retained.
124. Parameter Generation

i      1       2       3       4
ni     100     50      60      10
mi     20.1    19.7    21      20.5
si     2.3     2.2     2.4     2.1
mini   4.5     5.5     3.8     7
maxi   36      34      37      40
disti  Normal  Normal  Normal  None

The parameters of the current cell are:
n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL
This is so because there are 220 data points, of which 10 are not NORMAL,
so confl/n = 10/220 = 0.045 < 0.05; hence the cell is still NORMAL.
The parameters are calculated only once, so the overall compilation time is O(N).
But querying requires much less time, as we only scan the K grid cells, i.e., O(K).
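The aggregation in this example can be reproduced directly. This sketch uses the pooled-variance identity E[X²] − m² (which yields s ≈ 2.35, close to the slide's rounded 2.37), and applies conflict rule 1 for the fourth child, whose m and s are close to the parent's:

```python
import math

# Child-cell parameters from the example table (i = 1..4).
children = [
    dict(n=100, m=20.1, s=2.3, mn=4.5, mx=36, dist="NORMAL"),
    dict(n=50,  m=19.7, s=2.2, mn=5.5, mx=34, dist="NORMAL"),
    dict(n=60,  m=21.0, s=2.4, mn=3.8, mx=37, dist="NORMAL"),
    dict(n=10,  m=20.5, s=2.1, mn=7.0, mx=40, dist="NONE"),
]

n = sum(c["n"] for c in children)
m = sum(c["n"] * c["m"] for c in children) / n
# Pooled variance: aggregate E[X^2] over children, then subtract m^2.
ex2 = sum(c["n"] * (c["s"] ** 2 + c["m"] ** 2) for c in children) / n
s = math.sqrt(ex2 - m ** 2)
mn = min(c["mn"] for c in children)
mx = max(c["mx"] for c in children)

# dist: the type followed by most points (210 of 220 are NORMAL),
# then count conflicting points. Rule 1 applies to child 4 here, so
# confl grows by its n; the full approximate-equality test is omitted.
dist = "NORMAL"
confl = sum(c["n"] for c in children if c["dist"] != dist)
if confl / n > 0.05:
    dist = "NONE"
# confl/n = 10/220 ≈ 0.045 < 0.05, so the parent keeps dist = NORMAL.
```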
125. Query Types
If hierarchical structure cannot
answer a query then can go to
underlying database
SQL like Language used to describe
queries
Two common query types: one finds regions satisfying certain constraints; the other takes a region and returns some attribute of the region.
127. Algorithm
Top-down querying: examine cells at a higher level and determine whether each cell is relevant to the query at some confidence level. This likelihood can be defined as the proportion of objects in the cell that satisfy the query conditions. After obtaining the confidence interval, we label the cell as relevant or not relevant at the specified confidence level.
After doing so for the present layer, the process is repeated for the children of the RELEVANT cells in the present layer only.
The procedure continues until the bottom-most layer.
Find the regions formed by relevant cells and return them.
If the result is not satisfactory, retrieve the data that fall into the relevant cells from the database and do further processing.
128. Algorithm
After all cells are labeled as relevant or not relevant, we can easily find all regions that satisfy the specified density by Breadth-First Search.
For a relevant cell, we examine cells within a certain distance d from the center of the current cell to see if the average density within this small area is greater than the specified density.
If yes, the cells are put into a queue.
Steps 2 and 3 are repeated for all the cells in the queue, except that previously examined cells are omitted.
When the queue is empty, we have one region.
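The region-growing BFS can be sketched over a toy bottom-layer grid; the relevant cells, counts, and density specification below are made up for illustration:

```python
from collections import deque

# Hypothetical bottom-layer grid: relevant cells with their point counts.
relevant = {(0, 0): 12, (0, 1): 15, (1, 1): 11, (5, 5): 20, (5, 6): 18}
density_spec = 10                  # required points per unit-area cell

def neighbors(cell):
    """Cells within distance one of the current cell (8-connected)."""
    x, y = cell
    return [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

regions, seen = [], set()
for start in relevant:
    if start in seen:
        continue
    region, queue = set(), deque([start])
    seen.add(start)
    while queue:
        cell = queue.popleft()
        region.add(cell)
        # a neighbor joins the region if it is relevant, unexamined,
        # and dense enough for the query's density specification
        for nb in neighbors(cell):
            if nb in relevant and nb not in seen and relevant[nb] >= density_spec:
                seen.add(nb)
                queue.append(nb)
    regions.append(region)

# Two regions emerge: {(0,0), (0,1), (1,1)} and {(5,5), (5,6)}.
```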
129. Algorithm
The distance d = max(l, √(f/(cπ))), where l, c, and f are the side length of a bottom-layer cell, the specified density, and a small constant number set by STING (it does not vary from query to query).
l is usually the dominant term, so we generally only have to examine the neighborhood. Only if the granularity is very fine do we need to examine every cell within that distance rather than just the neighbors.
130. Example
Given data: houses, where one of the attributes is price.
Query: "Find those regions with area at least A where the number of houses per unit area is at least c and at least b% of the houses have price between a and b, with (1 − α) confidence", where a < b. Here, a could be −∞ and b could be +∞.
This query can be written as an SQL-like statement.
We begin from the top level, working our way down. Assume the dist type is NORMAL.
First we calculate the proportion of houses whose price lies between [a, b].
The probability that the price lies between a and b is Φ((b − m)/s) − Φ((a − m)/s), where m and s are the mean and standard deviation of all prices.
131. Example
Now, assuming prices to be independent, the number of houses with price in the range [a, b] has a binomial distribution with parameters n and p, where n is the number of houses. We consider the following cases according to n, np, and n(1 − p):
a) n ≤ 30: the binomial distribution is used to determine the confidence interval of the number of houses whose prices fall into [a, b]; dividing by n gives the confidence interval for the proportion.
b) When n > 30, np ≥ 5, and n(1 − p) ≥ 5, the proportion of prices falling in [a, b] has approximately a normal distribution, and the 100(1 − α)% confidence interval of the proportion is the standard normal-approximation interval.
c) When n > 30 but np < 5, the Poisson distribution with parameter np is used for approximation.
d) When n > 30 but n(1 − p) < 5, we can calculate the proportion of houses (X) whose price is not in [a, b] using the Poisson distribution with parameter n(1 − p); then 1 − X is the proportion of houses whose price is in [a, b].
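Case (b) can be computed directly; a sketch with hypothetical cell values (n, the sample proportion, and the 90% z-value are illustrative, not from the slides):

```python
import math

# Case (b): n > 30, np >= 5, n(1-p) >= 5, so the proportion of houses
# priced in [a, b] is approximately normally distributed.
n, p_hat = 200, 0.4        # hypothetical cell: 80 of 200 houses qualify
z = 1.645                  # z-value for a 90% two-sided interval

# Normal-approximation confidence interval for a proportion.
half = z * math.sqrt(p_hat * (1 - p_hat) / n)
p1, p2 = p_hat - half, p_hat + half
# [p1, p2] is roughly [0.343, 0.457]; STING compares this range
# against the query's required percentage to label the cell.
```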
132. Example
Once we have the confidence interval or the estimated range [p1, p2], we can label this cell as relevant or not relevant.
Let S be the area of the cells at the bottom layer. If p1 × n < S × c × b%, we can label the cell as not relevant; otherwise as relevant.
133. Analysis of STING
Step one takes constant time
Step 2 and 3 total time is proportional
to the total number of cells in the
hierarchy.
Total number of cells is 1.33K, where K
is number of cells at bottom layer.
In all cases it is found or claimed to be
O(K)
Discussion Question: what is the
complexity if we need to go to step 7 in
the algorithm??
134. Quality
Under the following sufficient condition, STING guarantees that if a region satisfies the specification of the query, then it is returned.
Let F be a region. The width of F is defined as the
side length of the maximum square that can fit in F.
135. Limiting Behavior of STING
The regions returned by Sting are
an approximation of the result by
DBSCAN. As the granularity
approaches zero the regions
returned by STING approaches
result of DBSCAN.
So the worst-case complexity is O(n log n)!
136. Performance measure
Case A: Normal distribution
Query in the example answered in 0.2 sec
Structure generation: 9.8 seconds
Case B: None
Query in the example answered in 0.22 sec
Structure generation: 9.7 seconds
137. Performance measure
Used a benchmark called SEQUOIA 2000 to compare STING, DBSCAN, and CLARANS
All the previous algorithms have three phases in
query answering
1. Find Query Response
2. Build auxiliary structure
3. Do clustering
STING does all of this in one step so is inherently
better.
138. Discussion Question
“STING is trivially parallelizable.”
Comment why and what is the
importance of this statement?
139. References
STING : Statistical Information Grid approach to spatial data
mining. Wei Wang et al.
Optimization with Randomized Search Heuristics: The (A)NFL Theorem, Realistic Scenarios, and Difficult Functions. Stefan Droste et al.
Efficient and Effective clustering Method for spatial data
mining. R. Ng et al.
BIRCH: An efficient data clustering method for very large
databases. T Zhang et al.
Tutorial on Spatial data types: http://www.informatik.fernuni-
hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
An efficient Approach to Clustering in Large Multimedia
Databases with Noise. A Hinneburg et al.
Comparison of clustering algorithms :
http://www.cs.ualberta.ca/~joerg/papers/KI-Journal.pdf
140. Motivation
All previous clustering algorithms are query-dependent.
They are built for one query and are generally of no use for other queries.
They need a separate scan for each query.
So computation is more complex, at least O(n).
So we need a structure built from the database so that various queries can be answered without rescanning.
141. Basics
Grid-based method: quantizes the object space into a finite number of cells that form a grid structure on which all of the clustering operations are performed.
Develops a hierarchical structure out of the given data and answers various queries efficiently.
Every level of the hierarchy consists of cells.
Answering a query is not O(n), where n is the number of elements in the database.
143. continue…..
The root of the hierarchy is at level 1.
A cell at level i corresponds to the union of the areas of its children at level i + 1.
A cell at a higher level is partitioned to form a number of cells at the next lower level.
Statistical information of each cell is calculated and stored beforehand and is used to answer queries.
144. Cell parameters
Attribute-independent parameter:
n - number of objects (points) in this cell
Attribute-dependent parameters:
m - mean of all values in this cell
s - standard deviation of all values of the attribute in this cell
min - the minimum value of the attribute in this cell
max - the maximum value of the attribute in this cell
distribution - the type of distribution that the attribute value in this cell follows
145. Parameter Generation
n, m, s, min, and max of bottom-level cells are calculated directly from the data.
Distribution can be either assigned by the user or obtained by hypothesis tests, e.g. the χ2 test.
Parameters of higher-level cells are calculated from the parameters of lower-level cells.
146. continue…..
Let n, m, s, min, max, dist be the parameters of the current cell, and ni, mi, si, mini, maxi, disti be the parameters of the corresponding lower-level cells.
147. dist for Parent Cell
Set dist to the distribution type followed by most points in this cell.
Now check for conflicting points in the child cells; call this count confl:
1. If disti ≠ dist, mi ≈ m and si ≈ s, then confl is increased by an amount ni;
2. If disti ≠ dist, but either mi ≈ m or si ≈ s is not satisfied, then set confl to n;
3. If disti = dist, mi ≈ m and si ≈ s, then confl is increased by 0;
4. If disti = dist, but either mi ≈ m or si ≈ s is not satisfied, then confl is set to n.
148. continue…..
If confl/n is greater than a threshold t, set dist to NONE.
Otherwise keep the original type.
Example:
149. continue…..
The parameters for the parent cell would be:
n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL
210 points have distribution type NORMAL, so set dist of the parent to NORMAL; confl = 10.
confl/n = 10/220 = 0.045 < 0.05, so keep the original type.
150. Query types
The STING structure is capable of answering various queries.
But if it doesn't, we always have the underlying database.
Even if the statistical information is not sufficient to answer a query, we can still generate a possible set of answers.
151. Common queries
Select regions that satisfy certain conditions, e.g. select the maximal regions that have at least 100 houses per unit area and at least 70% of the house prices above $400K, with total area at least 100 units, with 90% confidence:
SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9
152. continue….
Select regions and return some function of the region, e.g. select the range of age of houses in those maximal regions where there are at least 100 houses per unit area and at least 70% of the houses have price between $150K and $300K, with area at least 100 units, in California:
SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND LOCATION California
153. Algorithm
With the hierarchical structure of grid cells on hand, we can use a top-down approach to answer spatial data mining queries.
For any query, we begin by examining cells on a high-level layer.
We calculate the likelihood that each cell is relevant to the query at some confidence level, using the parameters of the cell.
If the distribution type is NONE, we estimate the likelihood using distribution-free techniques instead.
154. continue….
After we obtain the confidence interval, we label the cell as relevant or not relevant at the specified confidence level.
Proceed to the next layer, but only consider the children of the relevant cells of the upper layer.
We repeat this until we reach the final layer.
Relevant cells of the final layer have enough statistical information to give a satisfactory answer to the query.
However, for accurate mining we may refer to the data corresponding to the relevant cells and process it further.
155. Finding regions
After we have got all the relevant cells at the final level, we need to output the regions that satisfy the query.
We can do this using Breadth-First Search.
156. Breadth First Search
We examine cells within a certain distance from the center of the current cell.
If the average density within this small area is greater than the specified density, mark this area.
Put the relevant cells just examined into the queue.
Take an element from the queue and repeat the same procedure, except that only relevant cells not examined before are enqueued. When the queue is empty, we have identified one region.
157. Statistical Information Grid-based Algorithm
1. Determine a layer to begin with.
2. For each cell of this layer, calculate the confidence interval (or estimated range) of the probability that this cell is relevant to the query.
3. From the interval calculated above, label the cell as relevant or not relevant.
4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
5. Go down the hierarchy structure by one level. Go to Step 2 for those cells that form the relevant cells of the higher-level layer.
6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7.
7. Retrieve the data that fall into the relevant cells and do further processing. Return the result that meets the requirement of the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the requirement of the query. Go to Step 9.
9. Stop.
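Steps 1-5 above (the top-down labeling) can be sketched as a traversal over a hypothetical two-level hierarchy; each cell stores a count and an estimated proportion of qualifying points, and a subtree is pruned as soon as its cell is labeled not relevant. The fallback to raw data (Step 7) is omitted here:

```python
# A minimal sketch of STING's top-down query answering over a
# hypothetical quadtree-like hierarchy; all numbers are illustrative.

class Cell:
    def __init__(self, n, proportion, children=()):
        self.n = n                # number of points in the cell
        self.p = proportion       # est. fraction satisfying the query
        self.children = list(children)

def query(cell, min_p, out):
    """Label cells top-down; descend only into relevant cells."""
    if cell.p < min_p:            # not relevant: prune the whole subtree
        return
    if not cell.children:         # bottom layer: part of the answer
        out.append(cell)
        return
    for child in cell.children:
        query(child, min_p, out)

leaves = [Cell(50, 0.9), Cell(40, 0.1), Cell(60, 0.8), Cell(30, 0.85)]
root = Cell(180, 0.7, [Cell(90, 0.55, leaves[:2]),
                       Cell(90, 0.82, leaves[2:])])
result = []
query(root, 0.5, result)
# Three leaves survive; the low-proportion leaf is pruned without
# its data ever being touched.
```

Because irrelevant subtrees are never visited, the work is bounded by the number of cells in the hierarchy rather than the number of data points.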
158. Time Analysis:
Step 1 takes constant time. Steps 2 and 3 require constant time per cell.
The total time is less than or equal to the total number of cells in our hierarchical structure.
Notice that the total number of cells is 1.33K, where K is the number of cells at the bottom layer.
So the overall computational complexity on the grid hierarchy structure is O(K).
159. Time Analysis:
STING goes through the database once to compute the statistical parameters of the cells, so the time complexity of generating clusters is O(n), where n is the total number of objects.
After generating the hierarchical structure, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.
161. Definitions That Need to Be Known
Spatial Data:
Data that have a spatial or location
component.
These are objects that themselves are located
in physical space.
Examples: my house, Lake Geneva, New York City, etc.
Spatial Area:
The area that encompasses the locations of all
the spatial data is called spatial area.
162. STING (Introduction)
STING is used for performing clustering
on spatial data.
STING uses a hierarchical multi resolution
grid data structure to partition the spatial
area.
STING's big benefit is that it processes many common "region oriented" queries on a set of points efficiently.
We want to cluster the records that are in
a spatial table in terms of location.
Placement of a record in a grid cell is
completely determined by its physical
location.
163. Hierarchical Structure of Each Grid Cell
The spatial area is divided into
rectangular cells. (Using latitude and
longitude.)
Each cell forms a hierarchical structure.
This means that each cell at a higher
level is further partitioned into 4 smaller
cells in the lower level.
In other words each cell at the ith level
(except the leaves) has 4 children in the
i+1 level.
The union of the 4 children cells would
give back the parent cell in the level
above them.
164. Hierarchical Structure of Cells (Cont.)
The size of the leaf level cells and the
number of layers depends upon how
much granularity the user wants.
So, Why do we have a hierarchical
structure for cells?
We have them in order to provide a
better granularity, or higher resolution.
166. Statistical Parameters Stored in each
Cell
For each cell in each layer we have
attribute dependent and attribute
independent parameters.
Attribute Independent Parameter:
Count : number of records in this cell.
Attribute Dependent Parameter:
(We are assuming that our attribute
values are real numbers.)
167. Statistical Parameters (Cont.)
For each attribute of each cell we store
the following parameters:
M mean of all values of each attribute in
this cell.
S Standard Deviation of all values of
each attribute in this cell.
Min The minimum value for each attribute
in this cell.
Max The maximum value for each
attribute in this cell.
Distribution The type of distribution that
the attribute value in this cell follows. (e.g.
normal, exponential, etc.) None is assigned
to “Distribution” if the distribution is
unknown.
168. Storing of Statistical Parameters
Statistical information regarding the
attributes in each grid cell, for each layer
are pre-computed and stored before
hand.
The statistical parameters for the cells in
the lowest layer is computed directly from
the values that are present in the table.
The Statistical parameters for the cells in
all the other levels are computed from
their respective children cells that are in
the lower level.
169. How are Queries Processed ?
STING can answer many queries, (especially
region queries) efficiently, because we don’t have
to access full database.
How are spatial data queries processed?
We use a top-down approach to answer spatial
data queries.
Start from a pre-selected layer, typically one with a small number of cells.
The pre-selected layer does not have to be the
top most layer.
For each cell in the current layer compute the
confidence interval (or estimated range of
probability) reflecting the cells relevance to the
given query.
170. Query Processing (Cont.)
The confidence interval is calculated by
using the statistical parameters of each
cell.
Remove irrelevant cells from further
consideration.
When finished with the current layer,
proceed to the next lower level.
Processing of the next lower level
examines only the remaining relevant
cells.
Repeat this process until the bottom layer
is reached.
171. Sample Query Examples
Assume that the spatial area is the map of the
regions of Long Island, Brooklyn and Queens.
Our records represent apartments that are
present throughout the above region.
Query : “ Find all the apartments that are for
rent near Stony Brook University that have a
rent range of: $800 to $1000”
The above query depends upon the parameter "near." For our example, near means within 15 miles of Stony Brook University.