Author
Rakesh Agrawal , Johannes Gehrke, Dimitrios Gunopulos,
Prabhakar Raghavan
Prepared by : Raed T Aldahdooh
 Introduction
 Motivation
 Contributions Of The Paper
 Subspace Clustering
 CLIQUE(Clustering in Quest)
 Performance Experiments
 Conclusions
 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
 CLIQUE can be considered both density-based and grid-based.
 It clusters high-dimensional data.
 It automatically identifies subspaces of a high-dimensional data space that
allow better clustering than the original space.
 Many irrelevant dimensions may mask clusters.
 Distance measures become meaningless because points tend to be equidistant.
 Clusters may exist only in some subspaces.
 In a single dimension the data are relatively tightly packed.
 Adding a dimension "stretches" the points across that dimension, making
them further apart.
 Density decreases dramatically.
 Distance measures become meaningless because points tend to be equidistant.
 Methods
◦ Feature transformation: only effective if most dimensions are relevant
 PCA (principal component analysis) and SVD (singular
value decomposition) are useful only when features are highly
correlated or redundant
◦ Feature selection: wrapper or filter approaches
 useful to find a subspace where the data have nice clusters
◦ Subspace-clustering: find clusters in all the possible subspaces
 CLIQUE, ProClus, and frequent pattern-based clustering
The need for developing new algorithms
 Effective treatment of high dimensionality:
◦ Information must be extracted effectively from the huge amounts of data held in
databases; the running time of the algorithm must be predictable and usable on
large databases.
 Interpretability of results:
◦ Users expect clustering results on high-dimensional data to be interpretable and
comprehensible.
 Scalability and usability:
◦ Many clustering algorithms do not work well on a large database that may contain millions
of objects, and clustering only a sample of the data set may lead to biased results.
The clustering technique should be fast, scale with the number of dimensions and the
size of the input, and be insensitive to the order of the input data.
 CLIQUE satisfies the above desiderata
(effectiveness, interpretability, scalability and usability).
 CLIQUE automatically finds subspaces with high-density
clusters.
 CLIQUE generates a minimal description for each cluster as a
DNF expression.
 Empirical evaluation shows that CLIQUE scales linearly with
the number of input records and scales well as the number of
dimensions in the data or the dimensionality of the hidden
clusters increases.
 A disjunctive normal form (DNF) is a
standardization (or normalization) of a logical
formula as a disjunction of conjunctive clauses.
 In the full DNF, every variable or
its negation appears exactly once in each conjunction
(a minterm)
◦ each minterm appears only once
Example: the DNF of p ⊕ q is
(p ∧ ¬q) ∨ (¬p ∧ q).
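The definition of a full DNF can be checked mechanically by enumerating a truth table and emitting one minterm per row where the formula is true. The sketch below is only an illustration: the helper name `full_dnf` is hypothetical, and exclusive-or is used as an assumed example function.

```python
from itertools import product

def full_dnf(func, names):
    """Build the full DNF of a Boolean function: one minterm per true row."""
    minterms = []
    for values in product([True, False], repeat=len(names)):
        if func(*values):
            # Each variable appears once, negated when its value is False.
            literals = [n if v else f"¬{n}" for n, v in zip(names, values)]
            minterms.append("(" + " ∧ ".join(literals) + ")")
    return " ∨ ".join(minterms)

# Assumed example: exclusive-or of p and q.
xor_dnf = full_dnf(lambda p, q: p != q, ["p", "q"])
print(xor_dnf)
```

Each minterm corresponds to exactly one truth-table row, which is why no minterm can appear twice.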
 Clusters may exist only in some
subspaces.
 Subspace-clustering: find clusters in
all the subspaces.
 Definitions: (a) a unit, (b) a dense unit, (c) a cluster, (d) a minimal description of a cluster.
 In Figure 1, the two-dimensional space (age, salary) has been partitioned by a 10×10 grid (ξ = 10).
 An example unit is u = (30 ≤ age < 35) ∧ (1 ≤ salary < 2).
 A and B are both regions:
 A = (30 ≤ age < 50) ∧ (4 ≤ salary < 8)
 B = (40 ≤ age < 60) ∧ (2 ≤ salary < 6)
 Assuming the dense units have been shaded,
 A ∪ B is a cluster (A and B are connected regions).
 A ∩ B is not a maximal region.
 The minimal description for this cluster A ∪ B is the DNF expression
((30 ≤ age < 50) ∧ (4 ≤ salary < 8)) ∨ ((40 ≤ age < 60) ∧ (2 ≤ salary < 6))
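One way to read the DNF description is as a membership test: a point belongs to the cluster if it satisfies at least one conjunction of interval conditions. The following minimal sketch encodes regions A and B from Figure 1; the dictionary layout and the function names `in_region` and `in_cluster` are assumptions made for illustration.

```python
# Each region is a conjunction of half-open intervals [lo, hi), one per dimension.
A = {"age": (30, 50), "salary": (4, 8)}
B = {"age": (40, 60), "salary": (2, 6)}
description = [A, B]  # the disjunction A ∨ B

def in_region(point, region):
    """True if the point satisfies every interval condition of the region."""
    return all(lo <= point[dim] < hi for dim, (lo, hi) in region.items())

def in_cluster(point, description):
    """True if the point satisfies at least one region of the DNF description."""
    return any(in_region(point, region) for region in description)

print(in_cluster({"age": 35, "salary": 5}, description))  # inside A
print(in_cluster({"age": 35, "salary": 3}, description))  # in neither A nor B
```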
 In Figure 2, assume τ = 20% (the density threshold, i.e. 3 points). If
selectivity(u) > τ, then u is a dense unit,
 where selectivity is the fraction of the total data points contained in the unit.
 No 2-dimensional unit is dense, so there are no clusters in the
original data space.
 When the points are projected onto the salary dimension, there are three 1-dimensional dense
units and two clusters in the 1-dimensional salary subspace:
C = (5 ≤ salary < 7) and D = (2 ≤ salary < 3).
 There is no dense unit, and hence no cluster, in the 1-dimensional age subspace.
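The selectivity test can be sketched in a few lines: discretise each point onto the grid, count points per unit in the chosen subspace, and keep units whose fraction of points exceeds the threshold. The data below are hypothetical (they are not the points of Figure 2), and the function name `dense_units` and its parameters are assumptions made for illustration.

```python
from collections import Counter

def dense_units(points, dims, xi, lo, hi, tau):
    """Map each dense unit of the subspace `dims` to its selectivity.

    A unit is a tuple of interval indices (0 .. xi-1), one per dimension in `dims`.
    """
    def interval(value, d):
        idx = int((value - lo[d]) * xi / (hi[d] - lo[d]))
        return min(idx, xi - 1)  # a value equal to hi[d] goes in the last interval

    counts = Counter(tuple(interval(p[d], d) for d in dims) for p in points)
    n = len(points)
    return {u: c / n for u, c in counts.items() if c / n > tau}

# Hypothetical (age, salary) points, chosen so that only salary has dense units.
points = [(22, 5.5), (33, 5.2), (47, 5.8), (58, 5.1),
          (26, 2.5), (41, 2.3), (63, 2.7),
          (37, 8.5), (52, 0.5), (68, 7.2)]
lo, hi = (20, 0), (70, 10)

both_dense = dense_units(points, (0, 1), 10, lo, hi, 0.2)   # full 2-dim space
age_dense = dense_units(points, (0,), 10, lo, hi, 0.2)      # age subspace
salary_dense = dense_units(points, (1,), 10, lo, hi, 0.2)   # salary subspace
print(both_dense, age_dense, salary_dense)
```

As in the Figure 2 discussion, this toy data set has no dense unit in the full space or in the age subspace, while the salary projection does contain dense units.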
 CLIQUE consists of the following three steps:
1) Identification of subspaces that contain clusters.
2) Identification of clusters.
3) Generation of minimal description for the clusters.
 Downward closure (DC) property: if a set of points forms a cluster
in a k-dimensional space, it is also part of a cluster in every
(k-1)-dimensional projection of that space.
 Because of the DC property, subspaces are identified iteratively,
bottom-up (from lower- to higher-dimensional subspaces).
 The difficulty in identifying subspaces that contain clusters
lies in finding dense units in different subspaces.
 A. A bottom-up algorithm finds dense units by exploiting the
monotonicity of the clustering criterion with respect to
dimensionality to prune the search space.
 Lemma 1 (monotonicity): if a k-dim unit is dense, then so are
its projections in (k-1)-dim space.
 The bottom-up algorithm:
 Determine the 1-dim dense units, then self-join them to get candidate 2-dim units.
In general, once the (k-1)-dim dense units Dk-1 are known, self-join Dk-1 to get the candidate k-dim units Ck.
 Discard those candidates from Ck that have a (k-1)-dim projection
not included in Dk-1.
 B. Make the bottom-up algorithm faster with MDL-based
pruning.
 A. Determination of dense units
◦ Determine the set D1 of all one-dimensional dense units.
◦ k=1
◦ While Dk ≠ ∅ do
 k=k+1
 Determine the set Dk as the set of all k-dimensional dense units
all of whose (k-1)-dimensional projections belong to Dk-1.
◦ End while
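The loop in part A can be sketched as an Apriori-style level-wise search. The code below is a minimal illustration under stated assumptions, not the authors' implementation: points are assumed to be already discretised into tuples of interval indices, a unit is represented as a tuple of (dimension, interval) pairs sorted by dimension, and the name `find_dense_units` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def find_dense_units(cells, tau):
    """Return {k: set of dense k-dim units} for discretised points `cells`."""
    n, d = len(cells), len(cells[0])

    def dense(candidates):
        # One pass over the data: count the points that fall in each candidate.
        counts = Counter()
        for cell in cells:
            for u in candidates:
                if all(cell[dim] == iv for dim, iv in u):
                    counts[u] += 1
        return {u for u, c in counts.items() if c / n > tau}

    level = dense({((dim, cell[dim]),) for cell in cells for dim in range(d)})
    result, k = {}, 1
    while level:
        result[k] = level
        k += 1
        # Join step: merge two (k-1)-dim units that agree on all but the last pair.
        candidates = set()
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1] and a[-1][0] < b[-1][0]:
                candidates.add(a + (b[-1],))
        # Prune step: every (k-1)-dim projection must itself be dense.
        candidates = {u for u in candidates
                      if all(u[:i] + u[i + 1:] in level for i in range(k))}
        level = dense(candidates)
    return result

# Hypothetical data: four points share interval 1 in dimensions 0 and 1.
cells = [(1, 1, 0), (1, 1, 1), (1, 1, 2), (1, 1, 3), (2, 5, 4),
         (3, 6, 5), (4, 7, 6), (5, 8, 7), (6, 9, 8), (7, 0, 9)]
result = find_dense_units(cells, tau=0.2)
print(result)
```

The prune step is where monotonicity pays off: a candidate is never counted against the data unless all of its lower-dimensional projections are already known to be dense.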
 B. Determination of high coverage subspaces.
◦ Determine all the subspaces that contain at least one dense
unit.
◦ Sort these subspaces in descending order according to their
coverage (fraction of the num. of points of the original data set
they contain).
◦ Optimize a suitably defined Minimum Description Length
criterion function and determine a threshold under which a
coverage is considered “low”.
◦ Select the subspaces with “high” coverage.
 The input to the cluster-finding step is a set of dense units
D, all in the same k-dim space.
 Depth-first search algorithm
◦ Treat the dense units as the vertices of a graph, with an edge between two units
that share a common face, and use depth-first search to find the connected components:
start with some unit u in D, assign it the first cluster
number and find all the units it is connected to; if there are still
units in D that have not been visited, pick one and repeat the
procedure.
 For each high coverage subspace S do
◦ Consider the set E of all the dense units in S.
◦ m = 0
◦ While E ≠ ∅
 m = m+1
 Select a randomly chosen unit u from E.
 Assign to Cm the unit u and all units of E that are connected to u.
 E = E - Cm
◦ End while
 End for
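The cluster-finding step amounts to computing connected components over the dense units of one subspace. The sketch below uses an iterative depth-first search and assumes that two units are connected when they differ by exactly one interval along a single dimension (a shared face); units are tuples of interval indices, and the function name `find_clusters` is hypothetical. The input reuses the two-dimensional dense units D2 from the worked example later in the slides.

```python
def find_clusters(dense_units):
    """Split dense units of one subspace into connected components."""
    unvisited = set(dense_units)
    clusters = []
    while unvisited:
        start = unvisited.pop()          # pick any unit not yet assigned
        component, stack = {start}, [start]
        while stack:
            u = stack.pop()
            # Visit the two face neighbours along every dimension.
            for i in range(len(u)):
                for step in (-1, 1):
                    v = u[:i] + (u[i] + step,) + u[i + 1:]
                    if v in unvisited:
                        unvisited.remove(v)
                        component.add(v)
                        stack.append(v)
        clusters.append(component)
    return clusters

# D2 = {u21, u22, u32, u33, u83, u93}, written as (i, j) index pairs.
D2 = [(2, 1), (2, 2), (3, 2), (3, 3), (8, 3), (9, 3)]
clusters = find_clusters(D2)
print(clusters)
```

Because each unit is removed from `unvisited` as soon as it is reached, every unit is pushed on the stack at most once.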
 The input to this step consists of disjoint clusters in k-
dim subspace.
 The goal is to generate a minimal description of each
cluster with two steps:
◦ Covering with maximal region.
◦ Minimal cover.
 The CLIQUE Algorithm (cont.)
3. Minimal description of clusters
The minimal description of a cluster C, produced by this last
procedure, is the minimum possible union of hyper-rectangular
regions.
For example:
 A ∪ B is the minimal cluster description of the shaded region.
 C ∪ D ∪ E is a non-minimal cluster description of the same
region.
 The CLIQUE Algorithm (cont.)
3. Minimal description of clusters (algorithm)
For each cluster C do
1st stage
• c=0
• While C ≠ ∅
 c=c+1
 Choose a dense unit in C
 For i=1 to l (l is the number of dimensions of the subspace)
o Grow the unit in both directions along the i-th dimension, trying to cover as
many units in C as possible (units that do not belong to C must not be
covered).
 End for
 Define the set I containing all the units covered by the above procedure
 C=C-I
• End while
2nd stage
• Remove every cover whose units are all covered by other covers.
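The two stages can be sketched as a greedy growth pass followed by a redundancy-removal pass. This is an illustration under stated assumptions rather than the exact procedure on the slide: covered units are tracked in a separate `uncovered` set so that regions are grown against the full cluster and may overlap, seeds are picked deterministically with `min`, and redundant regions are checked smallest first. All function names are hypothetical; the input is cluster C5 from the worked example.

```python
from itertools import product

def grow_region(seed, cluster):
    """Greedily grow an axis-parallel box of units around `seed` inside `cluster`.

    Returns (lo, hi): inclusive interval-index bounds per dimension.
    """
    lo, hi = list(seed), list(seed)

    def slab_ok(dim, index):
        # The whole cross-section of the box at `index` along `dim` must be in the cluster.
        ranges = [range(l, h + 1) for l, h in zip(lo, hi)]
        ranges[dim] = [index]
        return all(u in cluster for u in product(*ranges))

    for dim in range(len(seed)):
        while slab_ok(dim, lo[dim] - 1):
            lo[dim] -= 1
        while slab_ok(dim, hi[dim] + 1):
            hi[dim] += 1
    return tuple(lo), tuple(hi)

def units_of(region):
    """Set of all units contained in a (lo, hi) box."""
    lo, hi = region
    return set(product(*(range(l, h + 1) for l, h in zip(lo, hi))))

def minimal_description(cluster):
    cluster = set(cluster)
    uncovered, regions = set(cluster), []
    # Stage 1: cover the cluster with greedily grown regions.
    while uncovered:
        region = grow_region(min(uncovered), cluster)
        regions.append(region)
        uncovered -= units_of(region)
    # Stage 2: drop regions whose units are all covered by the others.
    for region in sorted(regions, key=lambda r: len(units_of(r))):
        others = [r for r in regions if r != region]
        if others and units_of(region) <= set().union(*(units_of(r) for r in others)):
            regions = others
    return regions

# C5 = {u21, u22, u32, u33}, written as (i, j) index pairs.
regions = minimal_description([(2, 1), (2, 2), (3, 2), (3, 3)])
print(regions)
```

On this input the first stage produces three overlapping boxes and the second stage removes the middle one, leaving two regions that together cover the cluster.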
 A two dimensional grid of lines of edge size ξ applied in the
two-dimensional feature space.
 Two-dimensional and one-dimensional units are defined:
◦ u^q_i denotes the i-th one-dimensional unit along xq
◦ uij denotes the two-dimensional unit resulting from the Cartesian
product of the i-th and j-th intervals along x1 and x2, respectively.
 ξ=10 and τ=8% (thus, each unit containing more than
5 points is considered to be dense).
 The points in u48 and u58, u75 and u76, u83 and u93 are
collinear.
One-dimensional dense units:
D1={u^1_2, u^1_3, u^1_4, u^1_5, u^1_8, u^1_9, u^2_1, u^2_2, u^2_3, u^2_5, u^2_6}
Two-dimensional dense units:
D2={u21, u22, u32, u33, u83, u93}
Notes:
• Although each of u48, u75 and u76
contains more than 5 points, they are not
included in D2.
• Although it seems unnatural for u83 and
u93 to be included in D2, they are
included since u^2_3 is dense.
• All subspaces of the two-dimensional
space contain clusters.
One-dimensional clusters:
C1={u^1_2, u^1_3, u^1_4, u^1_5}
C2={u^1_8, u^1_9}
C3={u^2_1, u^2_2, u^2_3}
C4={u^2_5, u^2_6}
Two-dimensional clusters:
C5={u21, u22, u32, u33}
C6={u83, u93}
C1={(x1): 1 ≤ x1 < 5}
C2={(x1): 7 ≤ x1 < 9}
C3={(x2): 0 ≤ x2 < 3}
C4={(x2): 4 ≤ x2 < 6}
C5={(x1, x2): 1 ≤ x1 < 2, 0 ≤ x2 < 2} ∪ {(x1, x2): 2 ≤ x1 < 3, 1 ≤ x2 < 3}
C6={(x1, x2): 7 ≤ x1 < 9, 2 ≤ x2 < 3}
Note that C2 and C6 are
essentially the same cluster,
which is reported twice by
the algorithm.
 We now empirically evaluate CLIQUE using synthetic data (generated with the
generator from M. Zait and H. Messatfa, "A comparative study of clustering methods").
 The goals of the experiments are to assess the efficiency and accuracy of
CLIQUE:
 Efficiency: determine how the running time scales with
◦ Dimensionality of the data space.
◦ Dimensionality of clusters.
◦ Size of data.
 Accuracy: test whether CLIQUE recovers known clusters in some
subspaces of a high-dimensional data space.
Using clusters embedded in 5-dim subspaces while varying
the dimensionality of the space from 5 to 50,
CLIQUE was able to recover all clusters in every case.
 Strength
◦ automatically finds subspaces of the highest dimensionality such that
high density clusters exist in those subspaces
◦ insensitive to the order of records in input and does not presume some
canonical data distribution
◦ scales linearly with the size of input and has good scalability as the
number of dimensions in the data increases
 Weakness
◦ The accuracy of the clustering result may be degraded as the price of
the method's simplicity
 The problem of high dimensionality is often tackled by requiring the
user to specify the subspace for cluster analysis, but user identification
of subspaces is quite error-prone. CLIQUE can find clusters embedded in subspaces of
high-dimensional data without requiring the user to guess subspaces that
might have interesting clusters.
 CLIQUE generates cluster descriptions in the form of DNF expressions
that are minimized for ease of comprehension.
 CLIQUE is insensitive to the order of input records, whereas some clustering
algorithms are sensitive to the order of input data.
 Empirical evaluation shows that CLIQUE scales linearly with the size of
input and scales well as the number of dimensions in the data increases.
 CLIQUE can accurately discover clusters embedded in lower-dimensional
subspaces.
CLIQUE Automatic subspace clustering of high dimensional data for data mining application

More Related Content

What's hot

K-Means clustring @jax
K-Means clustring @jaxK-Means clustring @jax
K-Means clustring @jaxAjay Iet
 
CART: Not only Classification and Regression Trees
CART: Not only Classification and Regression TreesCART: Not only Classification and Regression Trees
CART: Not only Classification and Regression TreesMarc Garcia
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data MiningValerii Klymchuk
 
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Subrata Kumer Paul
 
Document clustering and classification
Document clustering and classification Document clustering and classification
Document clustering and classification Mahmoud Alfarra
 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptxNTUConcepts1
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision treeKrish_ver2
 
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...Simplilearn
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithmhadifar
 
Unsupervised learning: Clustering
Unsupervised learning: ClusteringUnsupervised learning: Clustering
Unsupervised learning: ClusteringDeepak George
 
Hierarchical clustering
Hierarchical clustering Hierarchical clustering
Hierarchical clustering Ashek Farabi
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysisKrish_ver2
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithmRashid Ansari
 
Cluster Analysis : Assignment & Update
Cluster Analysis : Assignment & UpdateCluster Analysis : Assignment & Update
Cluster Analysis : Assignment & UpdateBilly Yang
 

What's hot (20)

K-Means clustring @jax
K-Means clustring @jaxK-Means clustring @jax
K-Means clustring @jax
 
Density based clustering
Density based clusteringDensity based clustering
Density based clustering
 
CART: Not only Classification and Regression Trees
CART: Not only Classification and Regression TreesCART: Not only Classification and Regression Trees
CART: Not only Classification and Regression Trees
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...
 
Document clustering and classification
Document clustering and classification Document clustering and classification
Document clustering and classification
 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptx
 
My8clst
My8clstMy8clst
My8clst
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
 
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
Hierarchical Clustering | Hierarchical Clustering in R |Hierarchical Clusteri...
 
Data clustering
Data clustering Data clustering
Data clustering
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
Unsupervised learning: Clustering
Unsupervised learning: ClusteringUnsupervised learning: Clustering
Unsupervised learning: Clustering
 
Hierarchical clustering
Hierarchical clustering Hierarchical clustering
Hierarchical clustering
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysis
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithm
 
Lect12 graph mining
Lect12 graph miningLect12 graph mining
Lect12 graph mining
 
Cluster Analysis : Assignment & Update
Cluster Analysis : Assignment & UpdateCluster Analysis : Assignment & Update
Cluster Analysis : Assignment & Update
 

Similar to CLIQUE Automatic subspace clustering of high dimensional data for data mining application

CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Zihui Li
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methodsKrish_ver2
 
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...Salah Amean
 
Clustering of graphs and search of assemblages
Clustering of graphs and search of assemblagesClustering of graphs and search of assemblages
Clustering of graphs and search of assemblagesData-Centric_Alliance
 
Chapter 11. Cluster Analysis Advanced Methods.ppt
Chapter 11. Cluster Analysis Advanced Methods.pptChapter 11. Cluster Analysis Advanced Methods.ppt
Chapter 11. Cluster Analysis Advanced Methods.pptSubrata Kumer Paul
 
Clustering Algorithms.pdf
Clustering Algorithms.pdfClustering Algorithms.pdf
Clustering Algorithms.pdfLibya Thomas
 
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...butest
 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetAlaaZ
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithmLaura Petrosanu
 
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...IJERA Editor
 
11ClusAdvanced.ppt
11ClusAdvanced.ppt11ClusAdvanced.ppt
11ClusAdvanced.pptSueMiu
 

Similar to CLIQUE Automatic subspace clustering of high dimensional data for data mining application (20)

CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
dm_clustering2.ppt
dm_clustering2.pptdm_clustering2.ppt
dm_clustering2.ppt
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
 
Project PPT
Project PPTProject PPT
Project PPT
 
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
 
Clustering of graphs and search of assemblages
Clustering of graphs and search of assemblagesClustering of graphs and search of assemblages
Clustering of graphs and search of assemblages
 
Chapter 11. Cluster Analysis Advanced Methods.ppt
Chapter 11. Cluster Analysis Advanced Methods.pptChapter 11. Cluster Analysis Advanced Methods.ppt
Chapter 11. Cluster Analysis Advanced Methods.ppt
 
Lect4
Lect4Lect4
Lect4
 
11 clusadvanced
11 clusadvanced11 clusadvanced
11 clusadvanced
 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
 
Clustering Algorithms.pdf
Clustering Algorithms.pdfClustering Algorithms.pdf
Clustering Algorithms.pdf
 
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial Dataset
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
 
kmean clustering
kmean clusteringkmean clustering
kmean clustering
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
 
11ClusAdvanced.ppt
11ClusAdvanced.ppt11ClusAdvanced.ppt
11ClusAdvanced.ppt
 
Clustering
ClusteringClustering
Clustering
 

Recently uploaded

scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...HenryBriggs2
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksMagic Marks
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapRishantSharmaFr
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdfKamal Acharya
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Arindam Chakraborty, Ph.D., P.E. (CA, TX)
 
Introduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdfIntroduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdfsumitt6_25730773
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxpritamlangde
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdfKamal Acharya
 
Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Ramkumar k
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxmaisarahman1
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxSCMS School of Architecture
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesMayuraD1
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdfAldoGarca30
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARKOUSTAV SARKAR
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptxrouholahahmadi9876
 

Recently uploaded (20)

scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic Marks
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Introduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdfIntroduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdf
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptx
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
 

CLIQUE Automatic subspace clustering of high dimensional data for data mining application

  • 1. Author Rakesh Agrawal , Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan Prepared by : Raed T Aldahdooh
  • 2.  Introduction  Motivation  Contributions Of The Paper  Subspace Clustering  CLIQUE(Clustering in Quest)  Performance Experiments  Conclusions
  • 3.  Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)  CLIQUE can be considered as both density-based and grid- based  Clustering high-dimensional data.  Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space
  • 4.  Many irrelevant dimensions may mask clusters.  Distance measure becomes meaningless—due to equi-distance.  Clusters may exist only in some subspaces.
  • 5.  Only data in one dimension is relatively packed.  Adding a dimension “stretch” the points across that dimension, making them further apart.  Density decrease dramatically.  Distance measure becomes meaningless—due to equi-distance.
  • 6. Methods: ◦ Feature transformation: only effective if most dimensions are relevant; PCA (principal component analysis) and SVD (singular value decomposition) are useful only when features are highly correlated or redundant. ◦ Feature selection: wrapper or filter approaches; useful for finding a subspace where the data have nice clusters. ◦ Subspace clustering: find clusters in all the possible subspaces; examples are CLIQUE, ProClus, and frequent pattern-based clustering.
  • 7. The need for developing new algorithms. Effective treatment of high dimensionality: ◦ Information must be extracted effectively from a huge amount of data in databases; in other words, the running time of the algorithms must be predictable and usable on large databases. Interpretability of results: ◦ Users expect clustering results on high-dimensional data to be interpretable and comprehensible. Scalability and usability: ◦ Many clustering algorithms do not work well on a large database that may contain millions of objects, and clustering on a sample of a given data set may lead to biased results. In other words, the clustering technique should be fast, scale with the number of dimensions and the size of the input, and be insensitive to the order of the input data.
  • 8. CLIQUE satisfies the above desiderata (effectiveness, interpretability, scalability, and usability). CLIQUE automatically finds subspaces with high-density clusters. CLIQUE generates a minimal description for each cluster as a DNF expression. Empirical evaluation shows that CLIQUE scales linearly with the number of input records and has good scalability as the number of dimensions in the data or the dimensionality of the hidden clusters increases.
  • 9. A disjunctive normal form (DNF) is a standardization (or normalization) of a logical formula as a disjunction of conjunctive clauses. In its full form it is a disjunction of conjunctions in which every variable or its negation appears exactly once in each conjunction (a minterm) ◦ and each minterm appears only once. Example: the DNF of p ⊕ q is (p ∧ ¬q) ∨ (¬p ∧ q).
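
A full DNF can be read directly off a truth table: each row where the formula is true contributes one minterm. The following is a minimal sketch of that idea; the helper `full_dnf` and the choice of exclusive-or as the example formula are assumptions made for illustration.

```python
from itertools import product


def full_dnf(func, names):
    """Build the full DNF of a Boolean function as a string.

    Every assignment that makes `func` true yields one minterm in which
    each variable appears exactly once, either plain or negated.
    """
    terms = []
    for values in product([False, True], repeat=len(names)):
        if func(*values):
            literals = [n if v else f"¬{n}" for n, v in zip(names, values)]
            terms.append("(" + " ∧ ".join(literals) + ")")
    return " ∨ ".join(terms)


# Exclusive-or is true exactly when p and q differ.
xor_dnf = full_dnf(lambda p, q: p != q, ["p", "q"])
print(xor_dnf)  # (¬p ∧ q) ∨ (p ∧ ¬q)
```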
  • 10. Clusters may exist only in some subspaces. Subspace clustering: find clusters in all the subspaces.
  • 11. What is (a) a unit, (b) a dense unit, (c) a cluster, (d) a minimal description of a cluster? In Figure 1, the two-dimensional space (age, salary) has been partitioned by a 10×10 grid (ξ = 10). The unit u = (30 ≤ age < 35) ∧ (1 ≤ salary < 2). A and B are both regions: A = (30 ≤ age < 50) ∧ (4 ≤ salary < 8) and B = (40 ≤ age < 60) ∧ (2 ≤ salary < 6). Assuming the dense units have been shaded, A ∪ B is a cluster (A and B are connected regions). A ∩ B is not a maximal region. The minimal description for this cluster A ∪ B is the DNF expression ((30 ≤ age < 50) ∧ (4 ≤ salary < 8)) ∨ ((40 ≤ age < 60) ∧ (2 ≤ salary < 6)).
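
The DNF description of slide 11 can be evaluated directly as a membership test. This is a minimal sketch using the regions A and B from the slide; the dictionary representation and the names `in_region` and `in_cluster` are illustrative assumptions.

```python
# Each region is a conjunction of half-open intervals, one per attribute.
A = {"age": (30, 50), "salary": (4, 8)}
B = {"age": (40, 60), "salary": (2, 6)}

# The cluster description is a disjunction (list) of regions: A ∪ B.
CLUSTER = [A, B]


def in_region(point, region):
    """True if the point satisfies lo <= value < hi for every attribute."""
    return all(lo <= point[attr] < hi for attr, (lo, hi) in region.items())


def in_cluster(point, dnf):
    """True if the point satisfies at least one region of the DNF."""
    return any(in_region(point, region) for region in dnf)


print(in_cluster({"age": 35, "salary": 5}, CLUSTER))  # inside A only
print(in_cluster({"age": 55, "salary": 3}, CLUSTER))  # inside B only
print(in_cluster({"age": 35, "salary": 3}, CLUSTER))  # in neither region
```

Keeping the description as a short list of rectangles is what makes the result easy to read and to translate into a query predicate.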
  • 12. In Figure 2, assume τ = 20% (density threshold: 3 points). If selectivity(u) > τ, then u is a dense unit, where selectivity is the fraction of the total data points contained in the unit. No 2-dimensional unit is dense, and there are no clusters in the original data space. If the points are projected onto the salary dimension, there are three 1-dimensional dense units and two clusters in the 1-dimensional salary subspace, C = (5 ≤ salary < 7) and D = (2 ≤ salary < 3). There is no dense unit and no cluster in the 1-dimensional age subspace.
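
The selectivity test of slide 12 can be sketched in a few lines. The ten (age, salary) points below are invented to mimic the situation in Figure 2 and are not the figure's actual data; the attribute ranges (age 20 to 70, salary 0 to 10) and the function names are likewise assumptions.

```python
from collections import Counter

XI = 10            # intervals per dimension
LO = (20.0, 0.0)   # assumed lower bounds of (age, salary)
HI = (70.0, 10.0)  # assumed upper bounds of (age, salary)
TAU = 0.2          # density threshold on selectivity

# Hypothetical data: salaries bunch together, ages are spread out.
POINTS = [(22, 5.2), (37, 5.5), (52, 5.8), (27, 6.3), (42, 6.6),
          (62, 6.1), (32, 2.4), (47, 2.7), (57, 2.2), (67, 8.5)]


def unit_of(point, dims):
    """Grid unit of a point in the subspace `dims` (one index per dimension)."""
    cell = []
    for d in dims:
        width = (HI[d] - LO[d]) / XI
        cell.append(min(int((point[d] - LO[d]) / width), XI - 1))
    return tuple(cell)


def dense_units(points, dims, tau):
    """Units whose selectivity (fraction of all points) exceeds tau."""
    counts = Counter(unit_of(p, dims) for p in points)
    return {u for u, c in counts.items() if c / len(points) > tau}


print(dense_units(POINTS, (0, 1), TAU))  # full (age, salary) space
print(dense_units(POINTS, (0,), TAU))    # age subspace
print(dense_units(POINTS, (1,), TAU))    # salary subspace
```

With this data only the salary projection yields dense units: intervals 5 and 6 are adjacent and form one cluster (5 ≤ salary < 7), and interval 2 forms another (2 ≤ salary < 3).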
  • 13. CLIQUE consists of the following three steps: 1) Identification of subspaces that contain clusters. 2) Identification of clusters. 3) Generation of minimal descriptions for the clusters.
  • 14. Downward closure (DC) property: if a collection of points forms a cluster in a k-dimensional space, then those points are also part of a cluster in every (k-1)-dimensional projection of that space. Because of the DC property, identification of subspaces is carried out in an iterative bottom-up fashion (from lower to higher dimensional subspaces).
  • 15. The difficulty in identifying subspaces that contain clusters lies in finding dense units in different subspaces. A. Use a bottom-up algorithm to find dense units; it exploits the monotonicity of the clustering criterion with respect to dimensionality to prune the search space. Lemma 1 (monotonicity): if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space. The bottom-up algorithm: determine the 1-dimensional dense units and self-join them to get the candidate 2-dimensional units; in general, once the (k-1)-dimensional dense units Dₖ₋₁ are known, self-join Dₖ₋₁ to get the candidate k-dimensional units Cₖ. Discard from Cₖ every candidate that has a (k-1)-dimensional projection not included in Dₖ₋₁. B. Make the bottom-up algorithm faster with MDL-based pruning.
  • 16. A. Determination of dense units: ◦ Determine the set D₁ of all one-dimensional dense units. ◦ k = 1 ◦ While Dₖ ≠ ∅ do: k = k + 1; determine the set Dₖ as the set of all dense k-dimensional units all of whose (k-1)-dimensional projections belong to Dₖ₋₁. ◦ End while
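
The bottom-up loop above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: a unit is represented as a tuple of (dimension, interval) pairs sorted by dimension, candidates are produced by an Apriori-style self-join, and the function name `find_dense_units` and the toy data are invented.

```python
from collections import Counter
from itertools import combinations


def find_dense_units(points, xi, lo, hi, tau):
    """Return {k: set of dense k-dimensional units}, built level by level."""
    n, d = len(points), len(points[0])
    # Interval index of every point in every dimension.
    grid = [tuple(min(int((p[j] - lo[j]) * xi / (hi[j] - lo[j])), xi - 1)
                  for j in range(d)) for p in points]

    def dense(candidates):
        """One pass over the data: keep candidates with selectivity > tau."""
        counts = Counter()
        for g in grid:
            for u in candidates:
                if all(g[dim] == idx for dim, idx in u):
                    counts[u] += 1
        return {u for u in candidates if counts[u] / n > tau}

    level = dense({((j, i),) for j in range(d) for i in range(xi)})
    result, k = {}, 1
    while level:
        result[k] = level
        k += 1
        candidates = set()
        for a, b in combinations(sorted(level), 2):
            # Self-join: same first k-2 pairs, last pair in a later dimension.
            if a[:-1] == b[:-1] and a[-1][0] < b[-1][0]:
                c = a + (b[-1],)
                # Prune: every (k-1)-dimensional projection must be dense.
                if all(c[:i] + c[i + 1:] in level for i in range(k)):
                    candidates.add(c)
        level = dense(candidates)
    return result


points = [(0.2, 0.3), (0.5, 0.5), (0.8, 0.6), (1.2, 0.4), (1.5, 0.7),
          (1.8, 0.2), (2.5, 1.5), (3.5, 1.5), (2.5, 2.5), (3.5, 3.5)]
dense_by_level = find_dense_units(points, xi=4, lo=(0, 0), hi=(4, 4), tau=0.2)
for k, units in sorted(dense_by_level.items()):
    print(k, sorted(units))
```

Because of the monotonicity lemma, pruning never discards a unit that could be dense, so only the surviving candidates need to be counted against the data.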
  • 17. B. Determination of high-coverage subspaces: ◦ Determine all the subspaces that contain at least one dense unit. ◦ Sort these subspaces in descending order of coverage (the fraction of the points of the original data set that fall in their dense units). ◦ Optimize a suitably defined Minimum Description Length criterion function to determine a threshold below which a coverage is considered "low". ◦ Select the subspaces with "high" coverage.
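
The slide does not spell out the MDL criterion. The sketch below follows the form usually attributed to the CLIQUE paper, reproduced here from memory and therefore to be treated as an assumption: subspaces sorted by coverage are split into a selected and a pruned group, and the cut that minimizes the bits needed to store each group's rounded-up mean plus every deviation from that mean is chosen. Coverage is taken as a point count, log2(0) is treated as 0, and the names `code_length` and `mdl_cut` and the coverage values are invented.

```python
import math


def code_length(values, mean):
    """Bits to store the mean and each value's deviation from it.
    A zero mean or zero deviation is charged 0 bits (an assumption)."""
    bits = math.log2(mean) if mean > 0 else 0.0
    for v in values:
        diff = abs(v - mean)
        bits += math.log2(diff) if diff > 0 else 0.0
    return bits


def mdl_cut(coverages):
    """Return the coverages of the selected (high-coverage) subspaces."""
    xs = sorted(coverages, reverse=True)
    best_i, best_bits = None, math.inf
    for i in range(1, len(xs)):          # try every split position
        selected, pruned = xs[:i], xs[i:]
        mu_selected = math.ceil(sum(selected) / len(selected))
        mu_pruned = math.ceil(sum(pruned) / len(pruned))
        bits = (code_length(selected, mu_selected)
                + code_length(pruned, mu_pruned))
        if bits < best_bits:
            best_i, best_bits = i, bits
    return xs[:best_i]                   # best_i is None for fewer than 2 values


# Hypothetical coverages: three subspaces clearly dominate the rest.
high_coverage = mdl_cut([1000, 950, 900, 20, 10, 5])
print(high_coverage)
```

The cut lands where the two groups are each most homogeneous, which here separates the three large coverages from the three small ones.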
  • 18. The input to the cluster-finding step is a set of dense units D, all in the same k-dimensional space. Depth-first search algorithm: ◦ A depth-first search finds the connected components of the graph whose vertices are the dense units. Start with some unit u in D, assign it the first cluster number, and find all the units it is connected to; if there are still units in D that have not been visited, pick one of them and repeat the procedure.
  • 19. For each high-coverage subspace S do: ◦ Consider the set E of all the dense units in S. ◦ m′ = 0 ◦ While E ≠ ∅: m′ = m′ + 1; select a randomly chosen unit u from E; assign to Cₘ′ the unit u and all units of E that are connected to u; E = E - Cₘ′. ◦ End while. End for
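
The cluster-finding step amounts to computing connected components. This is a minimal sketch with an explicit stack for the depth-first search, assuming each dense unit is a tuple of interval indices in one fixed subspace and that two units are neighbours when they share a face (they differ by one interval in exactly one dimension); the name `find_clusters` is illustrative. The input is the set D2 from the worked example on slide 26.

```python
def find_clusters(dense_units):
    """Group dense units of one subspace into connected components."""
    unvisited = set(dense_units)
    clusters = []
    while unvisited:
        start = unvisited.pop()          # any unit not yet assigned
        component, stack = {start}, [start]
        while stack:                     # depth-first traversal
            u = stack.pop()
            for i in range(len(u)):
                for step in (-1, 1):
                    # Neighbour sharing a face with u along dimension i.
                    v = u[:i] + (u[i] + step,) + u[i + 1:]
                    if v in unvisited:
                        unvisited.remove(v)
                        component.add(v)
                        stack.append(v)
        clusters.append(component)
    return clusters


D2 = {(2, 1), (2, 2), (3, 2), (3, 3), (8, 3), (9, 3)}
for cluster in sorted(find_clusters(D2), key=min):
    print(sorted(cluster))
```

On this input the search returns the two clusters listed as C5 and C6 on slide 27.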
  • 20. The input to this step consists of disjoint clusters in a k-dimensional subspace. The goal is to generate a minimal description of each cluster in two steps: ◦ Covering with maximal regions. ◦ Minimal cover.
  • 21. The CLIQUE Algorithm (cont.) 3. Minimal description of clusters. The minimal description of a cluster C, produced by the last procedure, is the minimum possible union of hyper-rectangular regions. For example, A ∪ B is the minimum cluster description of the shaded region, while C ∪ D ∪ E is a non-minimal cluster description of the same region.
  • 22. The CLIQUE Algorithm (cont.) 3. Minimal description of clusters (algorithm). For each l-dimensional cluster C do: 1st stage: • c = 0; U = C (the units not yet covered) • While U ≠ ∅: c = c + 1; choose a dense unit in U; for i = 1 to l, grow the unit in both directions along the i-th dimension, trying to cover as many units of C as possible (units that do not belong to C must not be covered); define the set I containing all the units covered by this procedure; U = U - I. • End while. 2nd stage: • Remove every cover whose units are all covered by the remaining covers.
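
The two stages can be sketched as follows. This is an illustrative implementation, not the authors' code: a region is a tuple of inclusive (lo, hi) interval-index ranges, the seed unit is chosen deterministically as the smallest uncovered unit rather than arbitrarily, and redundant regions are removed smallest first. The input is cluster C5 from the worked example on slide 27.

```python
from itertools import product


def units_in(region):
    """All grid units inside a region of inclusive (lo, hi) index ranges."""
    return set(product(*(range(lo, hi + 1) for lo, hi in region)))


def grow(seed, cluster):
    """Stage 1: extend the seed along each dimension, in both directions,
    for as long as the rectangle stays entirely inside the cluster."""
    region = [(i, i) for i in seed]
    for d in range(len(seed)):
        for step in (-1, 1):
            while True:
                lo, hi = region[d]
                trial = list(region)
                trial[d] = (lo - 1, hi) if step == -1 else (lo, hi + 1)
                if not units_in(trial) <= cluster:
                    break
                region = trial
    return tuple(region)


def minimal_description(cluster):
    cluster = set(cluster)
    uncovered, regions = set(cluster), []
    while uncovered:                       # cover every unit at least once
        region = grow(min(uncovered), cluster)
        regions.append(region)
        uncovered -= units_in(region)
    # Stage 2: drop regions whose units are all covered by the others.
    for region in sorted(regions, key=lambda r: len(units_in(r))):
        others = [r for r in regions if r != region]
        if units_in(region) <= set().union(*(units_in(r) for r in others)):
            regions = others
    return regions


C5 = {(2, 1), (2, 2), (3, 2), (3, 3)}
description = minimal_description(C5)
print(description)
```

With unit index i standing for the interval [i-1, i), the two remaining regions correspond to {1 ≤ x1 < 2, 0 ≤ x2 < 2} and {2 ≤ x1 < 3, 1 ≤ x2 < 3}, the description given for C5 on slide 28. A greedy cover of this kind is a heuristic and is not guaranteed to be the smallest possible in general.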
  • 23.
  • 24. A two-dimensional grid of edge size ξ is applied to the two-dimensional feature space. Two-dimensional and one-dimensional units are defined: ◦ u_i^q denotes the i-th one-dimensional unit along x_q ◦ u_ij denotes the two-dimensional unit resulting from the Cartesian product of the i-th and j-th intervals along x1 and x2, respectively. ξ = 10 and τ = 8% (thus, each unit containing more than 5 points is considered to be dense).
  • 25. The points in u_48 and u_58, u_75 and u_76, u_83 and u_93 are collinear.
  • 26. One-dimensional dense units: D1 = {u_2^1, u_3^1, u_4^1, u_5^1, u_8^1, u_9^1, u_1^2, u_2^2, u_3^2, u_5^2, u_6^2}. Two-dimensional dense units: D2 = {u_21, u_22, u_32, u_33, u_83, u_93}. Notes: • Although each of u_48, u_75, u_76 contains more than 5 points, they are not included in D2, because one of their one-dimensional projections (u_8^2 for u_48, u_7^1 for u_75 and u_76) is not dense. • Although it seems unnatural for u_83 and u_93 to be included in D2, they are included since u_3^2 is dense. • All subspaces of the two-dimensional space contain clusters.
  • 27. One-dimensional clusters: C1 = {u_2^1, u_3^1, u_4^1, u_5^1}, C2 = {u_8^1, u_9^1}, C3 = {u_1^2, u_2^2, u_3^2}, C4 = {u_5^2, u_6^2}. Two-dimensional clusters: C5 = {u_21, u_22, u_32, u_33}, C6 = {u_83, u_93}.
  • 28. C1 = {(x1): 1 ≤ x1 < 5}, C2 = {(x1): 7 ≤ x1 < 9}, C3 = {(x2): 0 ≤ x2 < 3}, C4 = {(x2): 4 ≤ x2 < 6}, C5 = {(x1, x2): 1 ≤ x1 < 2, 0 ≤ x2 < 2} ∪ {(x1, x2): 2 ≤ x1 < 3, 1 ≤ x2 < 3}, C6 = {(x1, x2): 7 ≤ x1 < 9, 2 ≤ x2 < 3}. Note that C2 and C6 are essentially the same cluster, which is reported twice by the algorithm.
  • 29. We now empirically evaluate CLIQUE using synthetic data (generator from M. Zait and H. Messatfa, "A comparative study of clustering methods"). The goals of the experiments are to assess the efficiency and accuracy of CLIQUE. Efficiency: determine how the running time scales with ◦ the dimensionality of the data space, ◦ the dimensionality of the clusters, ◦ the size of the data. Accuracy: test whether CLIQUE recovers known clusters in some subspaces of a high-dimensional data space.
  • 30.
  • 31.
  • 32. Using clusters embedded in 5-dimensional subspaces while varying the dimensionality of the space from 5 to 50, CLIQUE was able to recover all clusters in every case.
  • 33.
  • 34. Strengths: ◦ automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces ◦ insensitive to the order of records in the input and does not presume any canonical data distribution ◦ scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases. Weakness: ◦ the accuracy of the clustering result may be degraded at the expense of the simplicity of the method.
  • 35. The problem of high dimensionality is often tackled by requiring the user to specify the subspace for cluster analysis, but user identification of subspaces is quite error-prone. CLIQUE can find clusters embedded in subspaces of high-dimensional data without requiring the user to guess subspaces that might have interesting clusters. CLIQUE generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. CLIQUE is insensitive to the order of input records, whereas some clustering algorithms are sensitive to the order of input data. Empirical evaluation shows that CLIQUE scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases. CLIQUE can accurately discover clusters embedded in lower-dimensional subspaces.