CLIQUE is an algorithm for subspace clustering of high-dimensional data. It works in two steps: (1) It partitions each dimension of the data space into intervals of equal length to form a grid, (2) It identifies dense units within this grid and finds clusters as maximal sets of connected dense units. CLIQUE efficiently discovers clusters by identifying dense units in subspaces and intersecting them to obtain candidate dense units in higher dimensions. It automatically determines relevant subspaces for clustering and scales well with large, high-dimensional datasets.
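The two steps above can be sketched in a few lines of Python. This is an illustrative toy (the function and parameter names `dense_units`, `xi`, `tau` are our own, not from any CLIQUE library): partition each dimension into `xi` equal-length intervals to form a grid, then keep the units whose fraction of points exceeds the density threshold `tau`.

```python
# Toy sketch of CLIQUE's two steps on 2-D data (illustrative names only).
import random
from collections import defaultdict

def dense_units(points, xi=10, tau=0.05):
    """Step 1: partition each dimension into xi equal-length intervals,
    forming a grid; Step 2: keep units whose fraction of points exceeds tau."""
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]
    counts = defaultdict(int)
    for p in points:
        cell = tuple(
            min(int((p[i] - lo[i]) / (hi[i] - lo[i] + 1e-12) * xi), xi - 1)
            for i in range(d)
        )
        counts[cell] += 1
    n = len(points)
    return {cell for cell, c in counts.items() if c / n > tau}

random.seed(0)
# one crowded region plus uniform background noise
pts = [(random.gauss(2, 0.2), random.gauss(3, 0.2)) for _ in range(200)]
pts += [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(100)]
dense = dense_units(pts, xi=10, tau=0.05)
print(len(dense))  # only a few of the 100 grid cells are dense
```

Finding clusters then amounts to taking maximal connected components among these dense units, and the Apriori-style intersection step generates higher-dimensional candidates only from dense lower-dimensional projections.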
1. CLIQUE and STING
Dr S.Natarajan
Professor and Key Resource Person
Department of Information Science and
Engineering
PES Institute of Technology
Bengaluru
natarajan@pes.edu
995280225
2. High-dimensional integration
• High-dimensional integrals in statistics, ML, physics
• Expectations / model averaging
• Marginalization
• Partition function / rank models / parameter learning
• Curse of dimensionality:
• Quadrature involves weighted sum over exponential
number of items (e.g., units of volume)
[Figure: an n-dimensional hypercube with side length L; the number of units of volume grows as L, L², L³, …, Lⁿ]
3. High Dimensional Indexing Techniques
• Index trees (e.g., X-tree, TV-tree, SS-tree, SR-tree, M-tree, Hybrid Tree)
– Sequential scan better at high dim. (Dimensionality Curse)
• Dimensionality reduction (e.g., Principal Component
Analysis (PCA)), then build index on reduced space
4. Datasets
• Synthetic dataset:
– 64-d data, 100,000 points, generates clusters in different
subspaces (cluster sizes and subspace dimensionalities follow
Zipf distribution), contains noise
• Real dataset:
– 64-d data (8X8 color histograms extracted from 70,000
images in Corel collection), available at
http://kdd.ics.uci.edu/databases/CorelFeatures
5. Preliminaries – Nearest Neighbor Search
• Given a collection of data points and a query
point in m-dimensional metric space, find the
data point that is closest to the query point
• Variation: k-nearest neighbor
• Relevant to clustering and similarity search
• Applications: Geographical Information
Systems, similarity search in multimedia
databases
8. Problems (Cont.)
• NN query cost degrades – there are more strong candidates to compare with
• In as few as 10 dimensions, linear scan
outperforms some multidimensional indexing
structures (e.g. SS tree, R* tree, SR tree)
• Biology and genomic data can have
dimensions in the 1000’s.
9. Problems (Cont.)
• The presence of irrelevant attributes decreases the tendency for clusters to form
• Points in high-dimensional space have a high degree of freedom; they could be so scattered that they appear uniformly distributed
11. The Curse
• Refers to the decrease in performance of query processing when the dimensionality increases
• The focus of this talk will be on quality issues of NN search, not on performance issues
• In particular, under certain conditions, the distance between the nearest point and the query point equals the distance between the farthest point and the query point as dimensionality approaches infinity
12. Curse (Cont.)
Source: N. Katayama, S. Satoh. Distinctiveness Sensitive Nearest Neighbor Search for Efficient Similarity
Retrieval of Multimedia Information. ICDE Conference, 2001.
13. Unstable NN-Query
A nearest neighbor query is unstable for a given ε > 0 if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor
Source: [2]
16. Rate of Convergence
• At what dimensionality do NN-queries become unstable? Not easy to answer, so experiments were performed on real and synthetic data.
• If the conditions of the theorem are met, DMAXm/DMINm should decrease with increasing dimensionality
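This shrinking DMAX/DMIN ratio is easy to reproduce. The experiment below is our own illustration (not from the talk): for i.i.d. uniform data, the ratio of the farthest to the nearest neighbor distance of a random query point collapses toward 1 as dimensionality grows.

```python
# Illustrative experiment: distance concentration in high dimensions.
import math
import random

def dmax_dmin_ratio(dim, n=1000, seed=42):
    """Ratio of farthest to nearest neighbor distance for a random query
    point among n uniform points in the dim-dimensional unit cube."""
    rnd = random.Random(seed)
    query = [rnd.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rnd.random() for _ in range(dim)])
        for _ in range(n)
    ]
    return max(dists) / min(dists)

for dim in (2, 20, 200):
    print(dim, round(dmax_dmin_ratio(dim), 2))
# the ratio drops sharply with dimension, making NN queries "unstable"
```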
17. Conclusions
• Make sure there is enough contrast between the query and data points. If the distance to the NN is not much different from the average distance, the NN may not be meaningful
• When evaluating high-dimensional indexing
techniques, should use data that do not satisfy
Theorem 1 and should compare with linear scan
• Meaningfulness also depends on how you
describe the object that is represented by the
data point (i.e., the feature vector)
18. Other Issues
• After selecting relevant attributes, the
dimensionality could still be high
• Reporting cases when data does not yield any
meaningful nearest neighbor, i.e. indistinctive
nearest neighbors
19. Sudoku
• How many ways to fill a valid sudoku square?
• Sum over 9^81 ≈ 10^77 possible squares (items)
• w(x) = 1 if it is a valid square, w(x) = 0 otherwise
• Accurate solution within seconds:
• 1.634×10^21 vs 6.671×10^21
[Figure: partially filled sudoku grid]
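The weighted-sum view on this slide can be made concrete on the 4×4 variant (Shidoku), where the exact count is small enough to enumerate. The sketch below is our own illustration: rather than summing w(x) over all 4^16 grids, backtracking visits only partial grids that can still be valid.

```python
# Count valid 4x4 sudoku (Shidoku) squares: the sum of w(x) over all
# grids, computed by backtracking instead of brute-force enumeration.
def count_shidoku():
    grid = [[0] * 4 for _ in range(4)]

    def valid(r, c, v):
        # v must not repeat in row r, column c, or the 2x2 box
        if any(grid[r][j] == v for j in range(4)):
            return False
        if any(grid[i][c] == v for i in range(4)):
            return False
        br, bc = 2 * (r // 2), 2 * (c // 2)
        return all(grid[br + i][bc + j] != v
                   for i in range(2) for j in range(2))

    def fill(pos):
        if pos == 16:
            return 1  # a complete valid square: w(x) = 1
        r, c = divmod(pos, 4)
        total = 0
        for v in (1, 2, 3, 4):
            if valid(r, c, v):
                grid[r][c] = v
                total += fill(pos + 1)
                grid[r][c] = 0
        return total

    return fill(0)

print(count_shidoku())  # → 288 valid 4x4 squares
```

The 9×9 case on the slide is far too large for this exact approach, which is why approximate counting methods that still land close to the true 6.671×10^21 are remarkable.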
21. Minimum Description Length Principle
Occam's razor: prefer the simplest hypothesis
Simplest hypothesis = hypothesis with shortest description length
Minimum description length: prefer the shortest hypothesis
L_C(x) is the description length for message x under coding scheme C
h_MDL = argmin_{h ∈ H} [ L_C1(h) + L_C2(D|h) ]
where L_C1(h) is the number of bits to encode hypothesis h (complexity of the model), and L_C2(D|h) is the number of bits to encode data D given h (number of mistakes)
22. MDL: Interpretation of –log P(D|H) + K(H)
Interpreting –log P(D|H) + K(H):
K(H) is the minimum description length of H
–log P(D|H) is the minimum description length of D (experimental data) given H. That is, if H perfectly explains D, then P(D|H) = 1 and this term is 0. If not perfect, it is interpreted as the number of bits needed to encode the errors.
MDL: Minimum Description Length principle (J. Rissanen): given data D, the best theory for D is the theory H which minimizes the sum of
– the length of encoding H
– the length of encoding D, based on H (encoding errors)
23. CLIQUE: A Dimension-Growth Subspace Clustering Method
First dimension-growth subspace clustering algorithm
Clustering starts at single-dimension subspaces and moves upwards towards higher-dimension subspaces
This algorithm can be viewed as the integration of density-based and grid-based algorithms
24. CLIQUE (CLustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
• Automatically identifying subspaces of a high dimensional data
space that allow better clustering than original space
• CLIQUE can be considered as both density-based and grid-based
– It partitions each dimension into the same number of equal length intervals
– It partitions an m-dimensional data space into non-overlapping rectangular
units
– A unit is dense if the fraction of total data points contained in the unit
exceeds the input model parameter
– A cluster is a maximal set of connected dense units within a subspace
25. Definitions That Need to Be Known
Unit : After forming a grid structure on
the space, each rectangular cell is
called a Unit.
Dense: A unit is dense, if the fraction of
total data points contained in the
unit exceeds the input model
parameter.
Cluster: A cluster is defined as a maximal
set of connected dense units.
26. Informal problem statement
Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points.
CLIQUE's clustering identifies the sparse and the "crowded" areas in space (or units), thereby discovering the overall distribution patterns of the dataset.
A unit is dense if the fraction of total data points contained in it exceeds an input model parameter.
In CLIQUE, a cluster is defined as a maximal set of connected dense units.
27. Formal Problem Statement
Let A = {A1, A2, . . . , Ad} be a set of bounded, totally ordered domains and S = A1 × A2 × · · · × Ad a d-dimensional numerical space.
We will refer to A1, . . . , Ad as the dimensions (attributes) of S.
The input consists of a set of d-dimensional points V = {v1, v2, . . . , vm}, where vi = (vi1, vi2, . . . , vid). The j-th component of vi is drawn from domain Aj.
28. The CLIQUE Algorithm (cont.)
3. Minimal description of clusters
The minimal description of a cluster C, produced by the above procedure, is the minimum possible union of hyperrectangular regions.
For example
• A ∪ B is the minimal cluster description of the shaded region.
• C ∪ D ∪ E is a non-minimal cluster description of the same region.
29. CLIQUE Working
2-Step Process
1st step – Partitioning the d-dimensional data space
2nd step – Generating the minimal description of each cluster
32. Continued…
The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist.
This approach of selecting candidates is quite similar to the Apriori-Gen process of generating candidates.
Here it is expected that if something is dense in a higher-dimensional space, it cannot be sparse in the lower-dimensional state.
33. More formally
If a k-dimensional unit is dense, then so are its projections in (k−1)-dimensional space.
Given a k-dimensional candidate dense unit, if any of its (k−1)-dimensional projection units is not dense, then the k-dimensional unit cannot be dense.
So, we can generate candidate dense units in k-dimensional space from the dense units found in (k−1)-dimensional space.
The resulting space searched is much smaller than the original space.
The dense units are then examined in order to determine the clusters.
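The candidate-generation rule above can be sketched directly. In this illustrative code (representation and names are our own, not the authors'), a k-dimensional unit is a set of (dimension, interval) pairs; two (k−1)-dimensional dense units are joined if their union spans k distinct dimensions, and the candidate is pruned unless every (k−1)-dimensional projection was itself dense.

```python
# Apriori-style candidate generation for CLIQUE (illustrative sketch).
from itertools import combinations

def candidates(dense_km1, k):
    """Join (k-1)-dim dense units into k-dim candidates, then prune any
    candidate that has a non-dense (k-1)-dim projection."""
    out = set()
    for a, b in combinations(list(dense_km1), 2):
        u = a | b
        if len(u) != k or len({d for d, _ in u}) != k:
            continue  # the union must span exactly k distinct dimensions
        if all(frozenset(s) in dense_km1 for s in combinations(u, k - 1)):
            out.add(frozenset(u))
    return out

# 1-D dense units: interval 2 on dim 0, interval 3 on dim 1, interval 7 on dim 2
d1 = {frozenset({(0, 2)}), frozenset({(1, 3)}), frozenset({(2, 7)})}
print(candidates(d1, 2))  # every cross-dimension pair survives: 3 candidates
```

Applying the same function again on the 2-D result yields 3-D candidates only where all three 2-D faces are dense, which is exactly the pruning the slide describes.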
34. Intersection
Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate search space for dense units of higher dimensionality.
35. 2nd Stage – Minimal Description
For each cluster, CLIQUE determines the maximal region that covers the cluster of connected dense units.
It then determines a minimal cover (logic description) for each cluster.
36. Effectiveness of CLIQUE
CLIQUE automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
It is insensitive to the order of input objects.
It scales linearly with the size of the input.
It is easily scalable with the number of dimensions in the data.
37. GRID-BASED CLUSTERING METHODS
This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.
So, for example, assume that we have a set of records and we want to cluster with respect to two attributes; then we divide the related space (plane) into a grid structure and then we find the clusters.
39. Techniques for Grid-Based Clustering
The following are some techniques that are used to perform Grid-Based Clustering:
CLIQUE (CLustering In QUEst)
STING (STatistical Information Grid)
WaveCluster
40. Looking at CLIQUE as an Example
CLIQUE is used for the clustering of high-
dimensional data present in large tables.
By high-dimensional data we mean
records that have many attributes.
CLIQUE identifies the dense units in the
subspaces of high dimensional data
space, and uses these subspaces to
provide more efficient clustering.
41. How Does CLIQUE Work?
Let us say that we have a set of records
that we would like to cluster in terms of
n-attributes.
So, we are dealing with an n-
dimensional space.
MAJOR STEPS :
CLIQUE partitions each subspace that has
dimension 1 into the same number of equal
length intervals.
Using this as basis, it partitions the n-
dimensional data space into non-overlapping
rectangular units.
42. CLIQUE: Major Steps (Cont.)
Now CLIQUE’S goal is to identify the dense n-
dimensional units.
It does this in the following way:
CLIQUE finds dense units of higher
dimensionality by finding the dense units in the
subspaces.
So, for example if we are dealing with a 3-
dimensional space, CLIQUE finds the dense
units in the 3 related PLANES (2-dimensional
subspaces.)
It then intersects the extension of the
subspaces representing the dense units to
form a candidate search space in which dense
units of higher dimensionality would exist.
43. CLIQUE: Major Steps. (Cont.)
Each maximal set of connected dense units is
considered a cluster.
Using this definition, the dense units in the
subspaces are examined in order to find
clusters in the subspaces.
The information of the subspaces is then used
to find clusters in the n-dimensional space.
It must be noted that all cluster boundaries are
either horizontal or vertical. This is due to the
nature of the rectangular grid cells.
44. Example for CLIQUE
Let us say that we want to cluster a set of records that have three attributes, namely salary, vacation, and age.
The data space for this data would be 3-dimensional.
[Figure: 3-D axes labeled age, salary, vacation]
45. Example (Cont.)
After plotting the data objects,
each dimension, (i.e., salary,
vacation and age) is split into
intervals of equal length.
Then we form a 3-dimensional grid
on the space, each unit of which
would be a 3-D rectangle.
Now, our goal is to find the dense
3-D rectangular units.
46. Example (Cont.)
To do this, we find the dense units of the subspaces of this 3-d space.
So, we find the dense units with respect to age for salary. This means that we look at the salary-age plane and find all the 2-D rectangular units that are dense.
We also find the dense 2-D rectangular units for the vacation-age plane.
47. Example 1
[Figure: three 2-D plots – salary (×10,000, 0–7) vs. age (20–60) and vacation (weeks, 0–7) vs. age (20–60), showing the dense rectangular units in each plane]
48. Example (Cont.)
Now let us try to visualize the dense units of the two planes on the following 3-d figure:
[Figure: 3-D plot with axes age, vacation, salary; dense regions highlighted between age 30 and 50; threshold = 3]
49. Example (Cont.)
We can extend the dense areas in the
vacation-age plane inwards.
We can extend the dense areas in the
salary-age plane upwards.
The intersection of these two spaces
would give us a candidate search space in
which 3-dimensional dense units exist.
We then find the dense units in the
salary-vacation plane and we form an
extension of the subspace that represents
these dense units.
50. Example (Cont.)
Now, we perform an intersection of
the candidate search space with the
extension of the dense units of the
salary-vacation plane, in order to
get all the 3-d dense units.
So, What was the main idea?
We used the dense units in
subspaces in order to find the dense
units in the 3-dimensional space.
After finding the dense units, it is
very easy to find clusters.
51. Reflecting upon CLIQUE
Why does CLIQUE confine its search for
dense units in high dimensions to the
intersection of dense units in subspaces?
Because the Apriori property employs
prior knowledge of the items in the search
space so that portions of the space can be
pruned.
The property for CLIQUE says that if a k-
dimensional unit is dense then so are its
projections in the (k-1) dimensional
space.
52. Strength and Weakness of CLIQUE
Strength
It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
It is quite efficient.
It is insensitive to the order of records in input and does not presume some canonical data distribution.
It scales linearly with the size of input and has good scalability as the number of dimensions in the data increases.
Weakness
The accuracy of the clustering result may be degraded at the expense of the simplicity of this method.
53. CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters:
– Determine dense units in all subspaces of interest
– Determine connected dense units in all subspaces of interest
• Generate minimal description for the clusters
– Determine maximal regions that cover a cluster of connected dense units, for each cluster
– Determine the minimal cover for each cluster
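The major steps above can be sketched end-to-end in 2-D. This is an illustrative toy, not the authors' implementation (names like `clique_2d`, `xi`, `tau` are our own): grid the space, keep dense units, and grow clusters as maximal sets of face-connected dense units via BFS.

```python
# End-to-end CLIQUE sketch in 2-D: grid, dense units, connected components.
import random
from collections import defaultdict, deque

def clique_2d(points, xi=10, tau=0.05):
    lo = [min(p[i] for p in points) for i in (0, 1)]
    hi = [max(p[i] for p in points) for i in (0, 1)]
    counts = defaultdict(int)
    for p in points:
        cell = tuple(
            min(int((p[i] - lo[i]) / (hi[i] - lo[i] + 1e-12) * xi), xi - 1)
            for i in (0, 1)
        )
        counts[cell] += 1
    dense = {c for c, n in counts.items() if n / len(points) > tau}
    # a cluster = maximal set of dense units connected via shared faces
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        comp, q = set(), deque([start])
        seen.add(start)
        while q:
            x, y = q.popleft()
            comp.add((x, y))
            for nb in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    q.append(nb)
        clusters.append(comp)
    return clusters

random.seed(1)
pts = [(random.gauss(2, 0.3), random.gauss(2, 0.3)) for _ in range(150)]
pts += [(random.gauss(7, 0.3), random.gauss(7, 0.3)) for _ in range(150)]
clusters = clique_2d(pts)
print(len(clusters))  # the two well-separated blobs end up in separate clusters
```

Because cluster boundaries follow grid cells, the output regions are unions of rectangles, which is exactly what the minimal-description step then covers with as few rectangles as possible.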
54. [Figure: the salary–age and vacation–age grids from the earlier example (salary ×10,000 and vacation in weeks, 0–7, vs. age 20–60), with the resulting dense region between age 30 and 50; threshold = 3]
55. Strength and Weakness of CLIQUE
• Strength
– It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in those
subspaces
– It is insensitive to the order of records in input and does not
presume some canonical data distribution
– It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
• Weakness
– The accuracy of the clustering result may be degraded at
the expense of simplicity of the method
56. Global Dimensionality Reduction (GDR)
[Figure: first principal component (PC) fitted to two datasets]
• Works well only when data is globally correlated
• Otherwise too many false positives result in high query cost
• Solution: find local correlations instead of global correlation
58. Correlated Cluster
[Figure: a locally correlated cluster – first PC (retained dim.), second PC (eliminated dim.), mean of all points in the cluster, and the centroid of the cluster (projection of the mean on the eliminated dim.)]
A set of locally correlated points = <PCs, subspace dim, centroid, points>
61. Other constraints
• Dimensionality bound: a cluster must not retain any more dimensions than necessary, and subspace dimensionality ≤ MaxDim
• Size bound: number of points in the cluster ≥ MinSize
62. Clustering Algorithm
Step 1: Construct Spatial Clusters
• Choose a set of well-scattered points as centroids (piercing set) from a random sample
• Group each point P in the dataset with its closest centroid C if the Dist(P,C)
64. Clustering Algorithm
Step 3: Compute Subspace Dimensionality
[Chart: fraction of points obeying the reconstruction bound (0-1) vs. #dims retained (0-16)]
• Assign each point to the cluster that needs the fewest dimensions to accommodate it
• The subspace dim. for each cluster is the minimum number of dims to retain so as to keep most points
65. Clustering Algorithm
Step 4: Recluster points
• Assign each point P to the cluster S such that ReconDist(P,S) ≤ MaxReconDist
• If there are multiple such clusters, assign P to the first one (overcomes the "splitting" problem)
• [Figure: reclustering; some clusters may end up empty]
66. Clustering Algorithm
Step 5: Map points
• Eliminate small clusters
• Map each point to its subspace (also store the reconstruction dist.)
67. Clustering Algorithm
Step 6: Iterate
• Iterate for more clusters as long as new clusters are being found among the outliers
• Overall complexity: 3 passes, O(ND²K)
68. Experiments (Part 1)
• Precision Experiments:
– Compare information loss in GDR and LDR for same reduced
dimensionality
– Precision = |Orig. Space Result|/|Reduced Space Result| (for
range queries)
– Note: precision measures efficiency, not answer quality
69. Datasets
• Synthetic dataset:
– 64-d data, 100,000 points, generates clusters in different
subspaces (cluster sizes and subspace dimensionalities follow
Zipf distribution), contains noise
• Real dataset:
– 64-d data (8X8 color histograms extracted from 70,000
images in Corel collection), available at
http://kdd.ics.uci.edu/databases/CorelFeatures
70. Precision Experiments (1)
[Charts: precision (0-1) vs. skew in cluster size (0, 0.5, 1, 2) and precision vs. number of clusters (1, 2, 5, 10), comparing LDR and GDR]
71. Precision Experiments (2)
[Charts: precision (0-1) vs. degree of correlation (0, 0.02, 0.05, 0.1, 0.2) and precision vs. reduced dim (7, 10, 12, 14, 23, 42), comparing LDR and GDR]
72. Index structure
[Diagram: a root node containing pointers to the root of each cluster index (it also stores the PCs and subspace dim. of each cluster); below it, one index per cluster (Cluster 1 … Cluster K) plus the set of outliers, which has no index and is scanned sequentially]
Properties: (1) disk-based, (2) height = 1 + height(original space index), (3) almost balanced
73. Cluster Indices
• For each cluster S, build a multidimensional index on the (d+1)-dimensional space instead of the d-dimensional space:
– NewImage(P,S)[j] = projection of P along the jth PC, for 1 ≤ j ≤ d
– NewImage(P,S)[d+1] = ReconDist(P,S)
• Better estimate:
D(NewImage(P,S), NewImage(Q,S)) ≥ D(Image(P,S), Image(Q,S))
• Correctness: Lower Bounding Lemma
D(NewImage(P,S), NewImage(Q,S)) ≤ D(P,Q)
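The Lower Bounding Lemma can be checked numerically. A sketch with NumPy, using a hypothetical cluster (random centroid, random orthonormal PCs; none of this is data from the paper): NewImage keeps the d retained projections plus ReconDist as the (d+1)th coordinate, and the distance in that space never exceeds the true distance:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 8, 3                        # original and retained dimensionality

# Hypothetical cluster: a centroid plus an orthonormal set of PCs.
centroid = rng.normal(size=D)
pcs, _ = np.linalg.qr(rng.normal(size=(D, D)))   # columns are orthonormal
retained = pcs[:, :d]              # first d PCs span the cluster subspace

def recon_dist(p):
    """Distance from p to its reconstruction in the retained subspace."""
    v = p - centroid
    proj = retained @ (retained.T @ v)
    return np.linalg.norm(v - proj)

def new_image(p):
    """(d+1)-dim image: retained projections plus the reconstruction distance."""
    v = p - centroid
    return np.append(retained.T @ v, recon_dist(p))

# Lemma: the distance in the (d+1)-dim index space never exceeds the
# true distance, so index scans produce no false dismissals.
P, Q = rng.normal(size=D), rng.normal(size=D)
lhs = np.linalg.norm(new_image(P) - new_image(Q))
rhs = np.linalg.norm(P - Q)
```

The inequality holds because the retained and eliminated components are orthogonal, and the reverse triangle inequality bounds the difference of the two reconstruction distances.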
75. Outlier Index
• Retain all dimensions
• May build an index, else use sequential scan
(we use sequential scan for our
experiments)
76. Query Support
• Correctness:
– Query result same as original space index
• Point query, Range Query, k-NN query
– similar to algorithms in multidimensional index structures
– see paper for details
• Dynamic insertions and deletions
– see paper for details
77. Experiments (Part 2)
• Cost Experiments:
– Compare linear scan, Original Space Index (OSI), GDR, and LDR in terms of I/O and CPU costs. We used the hybrid tree index structure for OSI, GDR, and LDR.
• Cost Formulae:
– Linear scan: I/O cost (#rand accesses) = file_size/10; CPU cost
– OSI: I/O cost = num index nodes visited; CPU cost
– GDR: I/O cost = index cost + post-processing cost (to eliminate false positives); CPU cost
– LDR: I/O cost = index cost + post-processing cost + outlier_file_size/10; CPU cost
78. I/O Cost (#random disk accesses)
[Chart: #random disk accesses (0-3000) vs. reduced dim (7-60) for LDR, GDR, OSI, and linear scan]
79. CPU Cost (only computation time)
[Chart: CPU cost in seconds (0-80) vs. reduced dim (7-42) for LDR, GDR, OSI, and linear scan]
80. Conclusion
• LDR is a powerful dimensionality reduction technique
for high dimensional data
– reduces dimensionality with lower loss in distance
information compared to GDR
– achieves significantly lower query cost compared to linear
scan, original space index and GDR
• LDR has applications beyond high dimensional indexing
81. 10/8/2016 CLIQUE clustering algorithm 81
Motivation
An object typically has dozens of attributes, and the domain of each attribute can be large.
Existing methods require the user to specify the subspace for cluster analysis, but user identification of subspaces is quite error-prone.
82. The Contribution of CLIQUE
Automatically find subspaces with
high-density clusters in high
dimensional attribute space
83. Background
A1, A2, …, Ad are the dimensions of S:
A = {A1, A2, …, Ad}
S = A1 × A2 × … × Ad
Units:
Partition every dimension into ξ intervals of equal length.
A unit u is {u1, u2, …, ud}, where ui = [li, hi).
84. Background (Cont.)
Selectivity: the fraction of total data points contained in the unit
Dense unit: selectivity(u) > τ (a density threshold)
Cluster: a maximal set of connected dense units
86. Background (Cont.)
Region: an axis-parallel rectangular set
R ⊆ C: R is contained in C
Maximal region: no proper superset of R is contained in C
Minimal description: a non-redundant covering of the cluster with maximal regions
88. CLIQUE Algorithm
1. Identification of dense units
2. Identification of clusters.
3. Generation of minimal description
89. Identification of dense units
Bottom-up algorithm, like the Apriori algorithm.
Monotonicity: if a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k−1)-dimensional projection of this space.
90. Algorithm
1. determine 1-dimensional dense units
2. k = 2
3. generate candidate k-dimensional units from
(k-1)-dimensional dense units
4. if candidates are not empty
find dense units
k = k + 1
go to step 3
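Step 3's Apriori-style join can be sketched as follows, assuming dense units are represented as sets of (dimension, interval) pairs. The dimension names and the simplified all-pairs join are illustrative, not the paper's exact procedure:

```python
from itertools import combinations

# Hypothetical dense 1-d units: (dimension, interval) pairs.
dense = {frozenset([("age", 2)]), frozenset([("age", 3)]),
         frozenset([("salary", 5)]), frozenset([("vacation", 1)])}

def candidates(dense_k, k):
    """Join (k-1)-dim dense units into k-dim candidates, then prune any
    candidate having a non-dense (k-1)-dim projection (Apriori style)."""
    out = set()
    for a, b in combinations(dense_k, 2):
        cand = a | b
        dims = {d for d, _ in cand}
        # k intervals over k distinct dimensions
        if len(cand) == k and len(dims) == k:
            # every (k-1)-subset must itself be dense
            if all(frozenset(s) in dense_k
                   for s in combinations(cand, k - 1)):
                out.add(cand)
    return out

c2 = candidates(dense, 2)
# ("age", 2) cannot join ("age", 3): same dimension. Each
# cross-dimension pair becomes a candidate, which CLIQUE then keeps
# only if it is actually dense in the data.
```

The algorithm then counts the points in each candidate, keeps the dense ones, and repeats with k = k + 1 until no candidates remain.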
93. Prune subspaces
Objective: use only the dense units that lie in
“interesting” subspaces
MDL principle:
encode the input data under a given model and
select the encoding that minimizes the code
length.
94. Prune subspaces (Cont.)
Group together dense units in the same subspace
Compute the number of points covered in each subspace: x_Sj = Σ_{ui ∈ Sj} count(ui)
Sort subspaces in the descending order of their coverage
Minimize the total length of the encoding over the cut point i between the selected set I and the pruned set P:

CL(i) = log₂(μ_I(i)) + Σ_{j ≤ i} log₂(|x_Sj − μ_I(i)|) + log₂(μ_P(i)) + Σ_{i < j ≤ n} log₂(|x_Sj − μ_P(i)|)

where μ_I(i) and μ_P(i) are the mean coverages of the selected and pruned subspaces.
95. Prune subspaces (Cont.)
Partitioning of the subspaces into selected and pruned sets
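The MDL-based cut-point selection can be sketched with toy coverage values. The +1 inside the deviation logarithm (to avoid log₂ 0) is a simplification, not the paper's exact coding:

```python
import math

# Hypothetical subspace coverages, already sorted in descending order.
x = [310, 290, 40, 35, 30]

def code_length(i):
    """Code length when subspaces x[:i] are selected and x[i:] pruned:
    each group stores its mean plus every coverage's deviation from it.
    (+1 inside the log avoids log2(0); a simplification.)"""
    def part(vals):
        if not vals:
            return 0.0
        mu = sum(vals) / len(vals)
        return math.log2(mu) + sum(math.log2(abs(v - mu) + 1) for v in vals)
    return part(x[:i]) + part(x[i:])

# Choose the cut point that minimizes the total encoding length.
best = min(range(1, len(x)), key=code_length)
# With these coverages, the two high-coverage subspaces are selected
# and the three low-coverage ones are pruned.
```

The intuition matches the slide: a cut that separates high-coverage from low-coverage subspaces makes both groups tight around their means, so the deviations (and hence the code length) are small.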
97. Generating minimal cluster descriptions
A set R of regions is a cover of C if every dense unit of C is contained in some region.
Finding an optimal cover is NP-hard.
Solution: greedily cover the cluster by a number of maximal regions, then discard the redundant regions.
98. Greedy growth
1) Begin with an arbitrary dense unit u ∈ C
2) Greedily grow a maximal region covering u; add it to R
3) Repeat 2) until all units u ∈ C are covered by some maximal region in R
99. Minimal Cover
Remove from the cover the smallest
maximal region which is redundant.
Repeat the procedure until no maximal
region can be removed.
101. Comparison with BIRCH, DBSCAN
The paper concludes that CLIQUE performs better than BIRCH and DBSCAN.
102. Real data experimental result
datasets:
insurance industry (Insur1, Insur2)
department store (Store)
bank (Bank)
In all cases, we discovered
meaningful clusters
embedded in lower
dimensional subspaces.
103. Strength
automatically finds clusters in subspaces
insensitive to the order of records
not presume some canonical data
distribution
scales linearly with the size of input
tolerant of missing values
104. Weakness
Depends on some parameters that are hard to pre-select:
ξ (partition threshold)
τ (density threshold)
Some potential clusters may be lost in the dense-unit pruning procedure, so the correctness of the algorithm degrades.
105. What or who is STING?
A singer who was the lead singer of the band Police, then took up a solo career and won many Grammys.
The bite of a scorpion.
A Statistical
Information Grid
Approach to Spatial
Data Mining.
All of the above.
106. What is Spatial Data?
Many definitions according to specific areas
According to GIS
Spatial data may be thought of as features
located on or referenced to the Earth's
surface, such as roads, streams, political
boundaries, schools, land use classifications,
property ownership parcels, drinking water
intakes, pollution discharge sites - in short,
anything that can be mapped.
Geographic features are stored as a series of
coordinate values. Each point along a road or
other feature is defined by positional
coordinate value, such as longitude and
latitude.
The GIS stores and manages the data not as
a map but as a series of layers or, as they
are sometimes called, themes
When viewed in a GIS, these layers
visually appear as one graphic, but are
actually still independent of each other.
This allows changes to specific themes,
without affecting the others.
Discussion Question 1: So can you define spatial data generically?
107. •Spatial database systems aim at storing, retrieving, manipulating, querying, and
analyzing geometric data.
•Special data types are necessary to model geometry and to suitably represent
geometric data in database systems. These data types are usually called spatial
data types, such as point, line, and region but also include more complex types like
partitions and graphs (networks).
•Data Type understanding is a prerequisite for an effective construction of important
components of a spatial database system (like spatial index structures, optimizers
for spatial data, spatial query languages, storage management, and graphical user
interfaces) and for a cooperation with extensible DBMS providing spatial type
extension packages (like spatial data blades and cartridges).
•Excellent tutorial on spatial data and data types available at:
http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
What are Spatial Databases?
110. Pennsylvania Spatial Data Access
http://www.pasda.psu.edu/
The Missouri Spatial Data Information Service
http://msdis.missouri.edu/
National Spatial Data Infrastructure
http://www.fgdc.gov/nsdi/nsdi.html
Michigan Department of Natural Resources Online
www.dnr.state.mi.us/spatialdatalibrary/
Georgia Spatial Data Infrastructure Home Page
www.gis.state.ga.us/
Free GIS Data - GIS Data Depot
www.gisdatadepot.com
Spatial Data Resources
111. Spatial Data Mining
Discovery of interesting characteristics and
patterns that may implicitly exist in spatial
databases.
Huge amount of data specialized in nature.
Clustering and region oriented queries are
common problems in this domain.
We deal with high dimensional data
generally.
Applications: GIS, Medical Imaging etc.
112. Problems
•Huge amount of data, specialized in nature
•Complexity
•Defining geometric patterns and region-oriented queries
•Conceptual nature of the problem
•Spatial data access
113. STING-An Introduction
•STING is a grid based method to efficiently process many
common region oriented queries on a set of points
•What defines region? You tell me! Essentially it is a set of points
satisfying some criterion
•It is a hierarchical Method. The idea is to capture statistical
information associated with spatial cells in such a manner that
the whole classes of queries can be answered without referring
to the individual objects.
•Complexity is hence even less than O(n); in fact, what do you think it will be?
•Link to Paper: http://citeseer.nj.nec.com/wang97sting.html
114. Related Work
Spatial Data Mining:
– Generalization-Based Knowledge Discovery: Spatial-Data-Dominant and Non-Spatial-Data-Dominant
– Clustering-Based Methods: CLARANS, BIRCH, DBSCAN
Great comparison of clustering algorithms:
http://www.cs.ualberta.ca/~joerg/papers/KI-Journal.pdf
115. Generalization Based Approaches
Two types: Spatial Data Dominant and Non-
Spatial Data Dominant
Both of these require that a generalization
hierarchy is given explicitly by experts or is
somehow generated automatically.
Quality of mined data depends on the structure of
the hierarchy.
Computational Complexity O(nlogn)
So the onus shifted to developing algorithms
which discover characteristics directly from data.
This was the motivation to move to clustering
algorithms
116. Clustering Based Approaches
BIRCH: Already covered Remember it??
Complexity??
The problem with BIRCH is that it does not work
well with clusters which are not spherical.
DBSCAN: Already covered Remember it??
Complexity??
The Global Parameter Eps determination in
DBSCAN requires human participation
When the point set to be clustered is the response
set of objects with some qualifications, then
determination of Eps must be done each time and
cost is hence higher.
117. Clustering Based Approaches
CLARANS: Clustering Large Applications based upon
RANdomized Search.
Although claims have been made that it is linear, it is essentially quadratic.
The computational complexity is at least Ω(KN²), where N is the number of data points and K is the number of clusters.
The quality of results cannot be guaranteed when N is large, since randomized search is used.
See: Optimization with Randomized Search Heuristics: The (A)NFL Theorem, Realistic Scenarios, and Difficult Functions
118. Related Work
All the approaches described in previous slides
are all query dependent approaches
The structure of queries influence the structure of
the algorithm and cannot be generalized to all
queries.
As they scan all the data points the complexity
will at least be O(N)
119. STING THE OVERVIEW
Spatial Area is divided into rectangular cells
Different levels of cells corresponding to different
resolution and these cells have a hierarchical structure.
Each cell at a higher level is partitioned into number of
cells of the next lower level
Statistical information of each cell is calculated and stored
beforehand and is used to answer queries
120. GRID CELL HIERARCHY
Each Cell at (i-1)th level has 4
children at ith level (can be
changed)
The size of a leaf cell depends on the density of objects; generally a leaf cell should contain from several dozen to several thousand objects.
121. GRID CELL HIERARCHY
For each cell we have attribute-dependent and attribute-independent parameters.
The attribute-independent parameter is the number of objects in the cell, n.
For the attribute-dependent parameters, it is assumed that each object's attributes have numerical values.
For each numerical attribute we have the following five parameters:
122. GRID CELL HIERARCHY
m - mean of all values in this cell
s - standard deviation of all values in this cell
min - the minimum value of the attribute in this cell
max - the maximum value of the attribute in this cell
distribution - the type of distribution this cell follows (of enumeration type)
123. Parameter Generation
•The dist parameter is determined as follows.
•First, dist is set to the distribution followed by most points.
•An estimate is made of the number of conflicting points, confl, according to the following rules:
1) if disti ≠ dist, m ≈ mi and s ≈ si, then confl is increased by ni
2) if disti ≠ dist, and m ≈ mi or s ≈ si but not both, then confl is set to n
3) if disti = dist, m ≈ mi and s ≈ si, then confl is not changed
4) if disti = dist, and m ≈ mi or s ≈ si but not both, then confl is set to n
Finally, if confl/n is greater than a threshold (say 0.05), dist is set to NONE; otherwise the original dist is retained.
124. Parameter Generation

i      1       2       3       4
ni     100     50      60      10
mi     20.1    19.7    21      20.5
si     2.3     2.2     2.4     2.1
mini   4.5     5.5     3.8     7
maxi   36      34      37      40
disti  Normal  Normal  Normal  None

The parameters of the current cell are:
n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL
This is so because there are 220 data points, of which 10 are not NORMAL,
so confl/n = 10/220 = 0.045 < 0.05; hence the cell is still NORMAL.
The parameters are calculated only once, so the overall compilation time is O(N).
But querying requires much less time, as we only scan the K grid cells, i.e., O(K).
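The aggregation in this example can be reproduced directly. This sketch uses the pooled-variance identity E[X²] − m² (which yields s ≈ 2.35, close to the slide's rounded 2.37), and applies conflict rule 1 for the fourth child, whose m and s are close to the parent's:

```python
import math

# Child-cell parameters from the example table (i = 1..4).
children = [
    dict(n=100, m=20.1, s=2.3, mn=4.5, mx=36, dist="NORMAL"),
    dict(n=50,  m=19.7, s=2.2, mn=5.5, mx=34, dist="NORMAL"),
    dict(n=60,  m=21.0, s=2.4, mn=3.8, mx=37, dist="NORMAL"),
    dict(n=10,  m=20.5, s=2.1, mn=7.0, mx=40, dist="NONE"),
]

n = sum(c["n"] for c in children)
m = sum(c["n"] * c["m"] for c in children) / n
# Pooled variance: aggregate E[X^2] over children, then subtract m^2.
ex2 = sum(c["n"] * (c["s"] ** 2 + c["m"] ** 2) for c in children) / n
s = math.sqrt(ex2 - m ** 2)
mn = min(c["mn"] for c in children)
mx = max(c["mx"] for c in children)

# dist: the type followed by most points (210 of 220 are NORMAL),
# then count conflicting points. Rule 1 applies to child 4 here, so
# confl grows by its n; the full approximate-equality test is omitted.
dist = "NORMAL"
confl = sum(c["n"] for c in children if c["dist"] != dist)
if confl / n > 0.05:
    dist = "NONE"
# confl/n = 10/220 ≈ 0.045 < 0.05, so the parent keeps dist = NORMAL.
```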
125. Query Types
If hierarchical structure cannot
answer a query then can go to
underlying database
SQL like Language used to describe
queries
Two common query types: one finds regions satisfying certain constraints; the other takes a region and returns some attribute of the region.
127. Algorithm
Top-down querying: examine cells at a higher level and determine whether each cell is relevant to the query at some confidence level. This likelihood can be defined as the proportion of objects in the cell that satisfy the query conditions. After obtaining the confidence interval, we label the cell as relevant or not relevant at the specified confidence level.
After doing so for the present layer, the process is repeated for the children of the RELEVANT cells in the present layer only.
The procedure continues until the bottom-most layer.
Find the regions formed by relevant cells and return them.
If the result is not satisfactory, retrieve the data that fall into the relevant cells from the database and do further processing.
128. Algorithm
After all cells are labeled as relevant or not relevant, we can easily find all regions that satisfy the specified density by Breadth-First Search.
For a relevant cell, we examine cells within a certain distance d from the center of the current cell to see if the average density within this small area is greater than the specified density.
If yes, the cells are put into a queue.
Steps 2 and 3 are repeated for all the cells in the queue, except that previously examined cells are omitted.
When the queue is empty, we have one region.
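The region-growing BFS can be sketched over a toy bottom-layer grid; the relevant cells, counts, and density specification below are made up for illustration:

```python
from collections import deque

# Hypothetical bottom-layer grid: relevant cells with their point counts.
relevant = {(0, 0): 12, (0, 1): 15, (1, 1): 11, (5, 5): 20, (5, 6): 18}
density_spec = 10                  # required points per unit-area cell

def neighbors(cell):
    """Cells within distance one of the current cell (8-connected)."""
    x, y = cell
    return [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

regions, seen = [], set()
for start in relevant:
    if start in seen:
        continue
    region, queue = set(), deque([start])
    seen.add(start)
    while queue:
        cell = queue.popleft()
        region.add(cell)
        # a neighbor joins the region if it is relevant, unexamined,
        # and dense enough for the query's density specification
        for nb in neighbors(cell):
            if nb in relevant and nb not in seen and relevant[nb] >= density_spec:
                seen.add(nb)
                queue.append(nb)
    regions.append(region)

# Two regions emerge: {(0,0), (0,1), (1,1)} and {(5,5), (5,6)}.
```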
129. Algorithm
The distance d = max(l, √(f/(cπ))), where l, c, and f are the side length of a bottom-layer cell, the specified density, and a small constant number set by STING (it does not vary from query to query).
l is usually the dominant term, so we generally only have to examine the neighborhood. Only if the granularity is very fine do we need to examine every cell within that distance rather than just the neighbors.
130. Example
Given data: houses, where one of the attributes is price.
Query: "Find those regions with area at least A where the number of houses per unit area is at least c and at least b% of the houses have price between a and b, with (1 − α) confidence", where a < b. Here, a could be −∞ and b could be +∞.
This query can be written as an SQL-like statement.
We begin from the top level, working our way down. Assume the dist type is NORMAL.
First we calculate the proportion of houses whose price lies between [a, b].
The probability that the price lies between a and b is Φ((b − m)/s) − Φ((a − m)/s), where m and s are the mean and standard deviation of all prices.
131. Example
Now, assuming prices to be independent, the number of houses with price in the range [a, b] has a binomial distribution with parameters n and p, where n is the number of houses. We consider the following cases according to n, np, and n(1 − p):
a) n ≤ 30: the binomial distribution is used to determine the confidence interval of the number of houses whose prices fall into [a, b]; dividing by n gives the confidence interval for the proportion.
b) When n > 30, np ≥ 5, and n(1 − p) ≥ 5, the proportion of prices falling in [a, b] has approximately a normal distribution, and the 100(1 − α)% confidence interval of the proportion is the standard normal-approximation interval.
c) When n > 30 but np < 5, the Poisson distribution with parameter np is used for approximation.
d) When n > 30 but n(1 − p) < 5, we can calculate the proportion of houses (X) whose price is not in [a, b] using the Poisson distribution with parameter n(1 − p); then 1 − X is the proportion of houses whose price is in [a, b].
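Case (b) can be computed directly; a sketch with hypothetical cell values (n, the sample proportion, and the 90% z-value are illustrative, not from the slides):

```python
import math

# Case (b): n > 30, np >= 5, n(1-p) >= 5, so the proportion of houses
# priced in [a, b] is approximately normally distributed.
n, p_hat = 200, 0.4        # hypothetical cell: 80 of 200 houses qualify
z = 1.645                  # z-value for a 90% two-sided interval

# Normal-approximation confidence interval for a proportion.
half = z * math.sqrt(p_hat * (1 - p_hat) / n)
p1, p2 = p_hat - half, p_hat + half
# [p1, p2] is roughly [0.343, 0.457]; STING compares this range
# against the query's required percentage to label the cell.
```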
132. Example
Once we have the confidence interval or the estimated range [p1, p2], we can label this cell as relevant or not relevant.
Let S be the area of the cells at the bottom layer. If p1 × n < S × c × b%, we can label the cell as not relevant; otherwise as relevant.
133. Analysis of STING
Step one takes constant time
Step 2 and 3 total time is proportional
to the total number of cells in the
hierarchy.
Total number of cells is 1.33K, where K
is number of cells at bottom layer.
In all cases it is found or claimed to be
O(K)
Discussion Question: what is the
complexity if we need to go to step 7 in
the algorithm??
134. Quality
Under the following sufficient condition, STING guarantees that if a region satisfies the specification of the query, then it is returned.
Let F be a region. The width of F is defined as the
side length of the maximum square that can fit in F.
135. Limiting Behavior of STING
The regions returned by Sting are
an approximation of the result by
DBSCAN. As the granularity
approaches zero the regions
returned by STING approaches
result of DBSCAN.
So the worst-case complexity is O(n log n)!
136. Performance measure
Case A: Normal distribution
Query in the example answered in 0.2 sec
Structure generation: 9.8 seconds
Case B: None
Query in the example answered in 0.22 sec
Structure generation: 9.7 seconds
137. Performance measure
Used a benchmark called SEQUOIA 2000 to compare STING, DBSCAN, and CLARANS
All the previous algorithms have three phases in
query answering
1. Find Query Response
2. Build auxiliary structure
3. Do clustering
STING does all of this in one step so is inherently
better.
138. Discussion Question
“STING is trivially parallelizable.”
Comment why and what is the
importance of this statement?
139. References
STING : Statistical Information Grid approach to spatial data
mining. Wei Wang et al.
Optimization with Randomized Search Heuristics: The (A)NFL Theorem, Realistic Scenarios, and Difficult Functions. Stefan Droste et al.
Efficient and Effective clustering Method for spatial data
mining. R. Ng et al.
BIRCH: An efficient data clustering method for very large
databases. T Zhang et al.
Tutorial on Spatial data types: http://www.informatik.fernuni-
hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
An efficient Approach to Clustering in Large Multimedia
Databases with Noise. A Hinneburg et al.
Comparison of clustering algorithms :
http://www.cs.ualberta.ca/~joerg/papers/KI-Journal.pdf
140. Motivation
All previous clustering algorithms are query-dependent.
They are built for one query and are generally of no use for other queries.
They need a separate scan for each query.
So computation is more complex, at least O(n).
So we need a structure built from the database so that various queries can be answered without rescanning.
141. Basics
Grid-based method: quantizes the object space into a finite number of cells that form a grid structure on which all of the clustering operations are performed.
Develops a hierarchical structure out of the given data and answers various queries efficiently.
Every level of the hierarchy consists of cells.
Answering a query is not O(n), where n is the number of elements in the database.
143. continue…..
The root of the hierarchy is at level 1.
A cell at level i corresponds to the union of the areas of its children at level i + 1.
A cell at a higher level is partitioned to form a number of cells at the next lower level.
Statistical information of each cell is calculated and stored beforehand and is used to answer queries.
144. Cell parameters
Attribute-independent parameter:
n - number of objects (points) in this cell
Attribute-dependent parameters:
m - mean of all values in this cell
s - standard deviation of all values of the attribute in this cell
min - the minimum value of the attribute in this cell
max - the maximum value of the attribute in this cell
distribution - the type of distribution that the attribute value in this cell follows
145. Parameter Generation
n, m, s, min, and max of bottom-level cells are calculated directly from the data.
Distribution can be either assigned by the user or obtained by hypothesis tests, e.g. the χ2 test.
Parameters of higher-level cells are calculated from the parameters of lower-level cells.
146. continue…..
Let n, m, s, min, max, dist be the parameters of the current cell, and ni, mi, si, mini, maxi, disti be the parameters of the corresponding lower-level cells.
147. dist for Parent Cell
Set dist to the distribution type followed by most points in this cell.
Now check for conflicting points in the child cells; call this count confl:
1. If disti ≠ dist, mi ≈ m and si ≈ s, then confl is increased by an amount ni;
2. If disti ≠ dist, but either mi ≈ m or si ≈ s is not satisfied, then set confl to n;
3. If disti = dist, mi ≈ m and si ≈ s, then confl is increased by 0;
4. If disti = dist, but either mi ≈ m or si ≈ s is not satisfied, then confl is set to n.
148. continue…..
If confl/n is greater than a threshold t, set dist to NONE.
Otherwise keep the original type.
Example:
149. continue…..
The parameters for the parent cell would be:
n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL
210 points have distribution type NORMAL, so set dist of the parent to NORMAL; confl = 10.
confl/n = 10/220 = 0.045 < 0.05, so keep the original type.
150. Query types
The STING structure is capable of answering various queries.
But if it doesn't, we always have the underlying database.
Even if the statistical information is not sufficient to answer a query, we can still generate a possible set of answers.
151. Common queries
Select regions that satisfy certain conditions, e.g. select the maximal regions that have at least 100 houses per unit area and at least 70% of the house prices above $400K, with total area at least 100 units, with 90% confidence:
SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9
152. continue….
Select regions and return some function of the region, e.g. select the range of age of houses in those maximal regions where there are at least 100 houses per unit area and at least 70% of the houses have price between $150K and $300K, with area at least 100 units, in California:
SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND LOCATION California
153. Algorithm
With the hierarchical structure of grid cells on hand, we can use a top-down approach to answer spatial data mining queries.
For any query, we begin by examining cells on a high-level layer.
We calculate the likelihood that each cell is relevant to the query at some confidence level, using the parameters of the cell.
If the distribution type is NONE, we estimate the likelihood using distribution-free techniques instead.
154. continue….
After we obtain the confidence interval, we label the cell as relevant or not relevant at the specified confidence level.
Proceed to the next layer, but only consider the children of the relevant cells of the upper layer.
We repeat this until we reach the final layer.
Relevant cells of the final layer have enough statistical information to give a satisfactory answer to the query.
However, for accurate mining we may refer to the data corresponding to the relevant cells and process it further.
155. Finding regions
After we have got all the relevant cells at the final level, we need to output the regions that satisfy the query.
We can do this using Breadth-First Search.
156. Breadth First Search
We examine cells within a certain distance from the center of the current cell.
If the average density within this small area is greater than the specified density, mark this area.
Put the relevant cells just examined into the queue.
Take an element from the queue and repeat the same procedure, except that only relevant cells not examined before are enqueued. When the queue is empty, we have identified one region.
157. Statistical Information Grid-based Algorithm
1. Determine a layer to begin with.
2. For each cell of this layer, calculate the confidence interval (or estimated range) of the probability that this cell is relevant to the query.
3. From the interval calculated above, label the cell as relevant or not relevant.
4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
5. Go down the hierarchy structure by one level. Go to Step 2 for those cells that form the relevant cells of the higher-level layer.
6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7.
7. Retrieve the data that fall into the relevant cells and do further processing. Return the result that meets the requirement of the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the requirement of the query. Go to Step 9.
9. Stop.
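Steps 1-5 above (the top-down labeling) can be sketched as a traversal over a hypothetical two-level hierarchy; each cell stores a count and an estimated proportion of qualifying points, and a subtree is pruned as soon as its cell is labeled not relevant. The fallback to raw data (Step 7) is omitted here:

```python
# A minimal sketch of STING's top-down query answering over a
# hypothetical quadtree-like hierarchy; all numbers are illustrative.

class Cell:
    def __init__(self, n, proportion, children=()):
        self.n = n                # number of points in the cell
        self.p = proportion       # est. fraction satisfying the query
        self.children = list(children)

def query(cell, min_p, out):
    """Label cells top-down; descend only into relevant cells."""
    if cell.p < min_p:            # not relevant: prune the whole subtree
        return
    if not cell.children:         # bottom layer: part of the answer
        out.append(cell)
        return
    for child in cell.children:
        query(child, min_p, out)

leaves = [Cell(50, 0.9), Cell(40, 0.1), Cell(60, 0.8), Cell(30, 0.85)]
root = Cell(180, 0.7, [Cell(90, 0.55, leaves[:2]),
                       Cell(90, 0.82, leaves[2:])])
result = []
query(root, 0.5, result)
# Three leaves survive; the low-proportion leaf is pruned without
# its data ever being touched.
```

Because irrelevant subtrees are never visited, the work is bounded by the number of cells in the hierarchy rather than the number of data points.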
158. Time Analysis:
Step 1 takes constant time. Steps 2 and 3 require constant time per cell.
The total time is less than or equal to the total number of cells in our hierarchical structure.
Notice that the total number of cells is 1.33K, where K is the number of cells at the bottom layer.
So the overall computational complexity on the grid hierarchy structure is O(K).
159. Time Analysis:
STING goes through the database once to compute the statistical parameters of the cells, so the time complexity of generating clusters is O(n), where n is the total number of objects.
After generating the hierarchical structure, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.
161. Definitions That Need to Be Known
Spatial Data:
Data that have a spatial or location
component.
These are objects that themselves are located
in physical space.
Examples: my house, Lake Geneva, New York City, etc.
Spatial Area:
The area that encompasses the locations of all
the spatial data is called spatial area.
162. STING (Introduction)
STING is used for performing clustering
on spatial data.
STING uses a hierarchical multi resolution
grid data structure to partition the spatial
area.
STING's big benefit is that it processes many common "region oriented" queries on a set of points efficiently.
We want to cluster the records that are in
a spatial table in terms of location.
Placement of a record in a grid cell is
completely determined by its physical
location.
163. Hierarchical Structure of Each Grid Cell
The spatial area is divided into
rectangular cells. (Using latitude and
longitude.)
Each cell forms a hierarchical structure.
This means that each cell at a higher
level is further partitioned into 4 smaller
cells in the lower level.
In other words each cell at the ith level
(except the leaves) has 4 children in the
i+1 level.
The union of the 4 children cells would
give back the parent cell in the level
above them.
164. Hierarchical Structure of Cells (Cont.)
The size of the leaf level cells and the
number of layers depends upon how
much granularity the user wants.
So, Why do we have a hierarchical
structure for cells?
We have them in order to provide a
better granularity, or higher resolution.
166. Statistical Parameters Stored in each
Cell
For each cell in each layer we have
attribute dependent and attribute
independent parameters.
Attribute Independent Parameter:
Count : number of records in this cell.
Attribute Dependent Parameter:
(We are assuming that our attribute
values are real numbers.)
167. Statistical Parameters (Cont.)
For each attribute of each cell we store
the following parameters:
M mean of all values of each attribute in
this cell.
S Standard Deviation of all values of
each attribute in this cell.
Min The minimum value for each attribute
in this cell.
Max The maximum value for each
attribute in this cell.
Distribution The type of distribution that
the attribute value in this cell follows. (e.g.
normal, exponential, etc.) None is assigned
to “Distribution” if the distribution is
unknown.
168. Storing of Statistical Parameters
Statistical information regarding the
attributes in each grid cell, for each layer
are pre-computed and stored before
hand.
The statistical parameters for the cells in
the lowest layer is computed directly from
the values that are present in the table.
The Statistical parameters for the cells in
all the other levels are computed from
their respective children cells that are in
the lower level.
169. How are Queries Processed ?
STING can answer many queries, (especially
region queries) efficiently, because we don’t have
to access full database.
How are spatial data queries processed?
We use a top-down approach to answer spatial
data queries.
Start from a pre-selected layer, typically one with a small number of cells.
The pre-selected layer does not have to be the
top most layer.
For each cell in the current layer compute the
confidence interval (or estimated range of
probability) reflecting the cells relevance to the
given query.
170. Query Processing (Cont.)
The confidence interval is calculated by
using the statistical parameters of each
cell.
Remove irrelevant cells from further
consideration.
When finished with the current layer,
proceed to the next lower level.
Processing of the next lower level
examines only the remaining relevant
cells.
Repeat this process until the bottom layer
is reached.
171. Sample Query Examples
Assume that the spatial area is the map of the
regions of Long Island, Brooklyn and Queens.
Our records represent apartments that are
present throughout the above region.
Query : “ Find all the apartments that are for
rent near Stony Brook University that have a
rent range of: $800 to $1000”
The above query depends upon the parameter "near." For our example, near means within 15 miles of Stony Brook University.