CLIQUE 09mx Crew Members ~ K. Kanagaraj 14 S. Karthikeyan 17 S. Kathiresan 19 N. PadmaShree 28 M. RamKumar 33 S. Sowmya 45
GRID-BASED CLUSTERING METHOD Uses a multi-resolution grid data structure. Clustering complexity depends on the number of populated grid cells and not on the number of objects in the dataset. The method quantizes the space into a finite number of cells that form a grid structure, on which all of the clustering operations are performed. For example, assume that we have a set of records that we want to cluster with respect to two attributes. We divide the related space (a plane) into a grid structure and then find the clusters.
[Figure: the “space” is the plane of Salary (10,000), axis marked 0 to 8, against Age, axis marked 20 to 60.]
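The two-attribute example can be sketched in Python as follows. This is only an illustration: the records, the grid origin and the cell size (10 years by 1 salary unit) are made-up values, and the names grid_cell and populated_cells are hypothetical, not taken from the slides.

```python
from collections import Counter

def grid_cell(point, origin, cell_size):
    """Map a point to the integer index of the grid cell that contains it."""
    return tuple(int((x - o) // s) for x, o, s in zip(point, origin, cell_size))

def populated_cells(points, origin, cell_size):
    """Count the points in each cell; empty cells are never stored."""
    return Counter(grid_cell(p, origin, cell_size) for p in points)

# Hypothetical (age, salary in units of 10,000) records.
records = [(23, 2.5), (24, 2.8), (27, 3.1), (41, 6.2), (43, 6.9), (58, 4.0)]
cells = populated_cells(records, origin=(20, 0), cell_size=(10, 1))

for cell, count in sorted(cells.items()):
    print(cell, count)
```

After the single pass that fills the counter, later steps touch only the populated cells, which is why the cost is tied to their number rather than to the number of records.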
Advantages of Grid-Based Clustering: It is fast, since no distance computations are needed. Complexity usually depends on the number of populated grid cells and not on the number of objects. It is easy to determine which clusters are neighboring. Limitation: cluster shapes are limited to unions of grid cells.
Techniques for Grid-Based Clustering The following are some techniques that are used to perform grid-based clustering: CLIQUE (CLustering In QUEst), STING (STatistical Information Grid), and WaveCluster.
CLIQUE CLustering In QUEst, by Agrawal, Gehrke, Gunopulos and Raghavan, published in SIGMOD ’98 (Special Interest Group on Management of Data). Clustering: grouping a number of similar things according to characteristic or behavior. Quest: to make a search (for). CLIQUE performs automatic subspace clustering of high-dimensional data.
Looking at CLIQUE as an Example CLIQUE is used for the clustering of high-dimensional data present in large tables. By high-dimensional data we mean records that have many attributes. CLIQUE identifies the dense units in the subspaces of a high-dimensional data space, and uses these subspaces to provide more efficient clustering.
Definitions That Need to Be Known Unit: After forming a grid structure on the space, each rectangular cell is called a unit. Dense: A unit is dense if the fraction of the total data points contained in the unit exceeds an input model parameter (the density threshold). Cluster: A cluster is defined as a maximal set of connected dense units.
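The "dense" definition can be illustrated with a short Python sketch. The function name dense_units, the parameter name tau and the input format (one unit id per data point) are assumptions made for this example only.

```python
from collections import Counter

def dense_units(unit_of_point, tau):
    """Return the units whose share of all points exceeds the threshold tau.

    unit_of_point: one unit id per data point (hypothetical input format).
    tau: density threshold, a fraction between 0 and 1.
    """
    total = len(unit_of_point)
    counts = Counter(unit_of_point)
    return {unit for unit, count in counts.items() if count / total > tau}

# Ten points assigned to 1-D intervals (the interval ids are illustrative).
units = [0, 0, 0, 0, 1, 2, 2, 2, 3, 4]
dense = dense_units(units, tau=0.25)
print(sorted(dense))
```

Here intervals 0 and 2 hold 40% and 30% of the points, so they are the only ones that exceed the 25% threshold.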
How Does CLIQUE Work? Let us say that we have a set of records that we would like to cluster in terms of n attributes. So, we are dealing with an n-dimensional space. MAJOR STEPS: CLIQUE partitions each dimension (each 1-dimensional subspace) into the same number of equal-length intervals. Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units.
CLIQUE: Major Steps (Cont.) Now CLIQUE’s goal is to identify the dense n-dimensional units. It does this in the following way: CLIQUE finds dense units of higher dimensionality by first finding the dense units in the subspaces. So, for example, if we are dealing with a 3-dimensional space, CLIQUE finds the dense units in the 3 related planes (2-dimensional subspaces). It then intersects the extensions of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality would exist.
CLIQUE: Major Steps (Cont.) Each maximal set of connected dense units is considered a cluster. Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces. The information of the subspaces is then used to find clusters in the n-dimensional space. It must be noted that all cluster boundaries are either horizontal or vertical. This is due to the nature of the rectangular grid cells.
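Finding the maximal sets of connected dense units can be sketched as a connected-components search. This sketch assumes that two units are connected when they share a common face, directly or through a chain of other dense units, and that each unit is a tuple of interval indices. The name find_clusters is hypothetical.

```python
def find_clusters(dense):
    """Group dense units (tuples of interval indices) into maximal connected sets."""
    remaining = set(dense)
    clusters = []
    while remaining:
        start = remaining.pop()
        cluster, stack = {start}, [start]
        while stack:
            unit = stack.pop()
            # Face-sharing neighbours differ by one step along exactly one dimension.
            for d in range(len(unit)):
                for step in (-1, 1):
                    neighbour = unit[:d] + (unit[d] + step,) + unit[d + 1:]
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        cluster.add(neighbour)
                        stack.append(neighbour)
        clusters.append(cluster)
    return clusters

# Illustrative dense 2-D units: one L-shaped group and one separate pair.
dense = {(0, 0), (0, 1), (1, 1), (3, 3), (3, 4)}
clusters = find_clusters(dense)
print(sorted(sorted(c) for c in clusters))
```

Because every cluster is a union of grid cells, its boundary follows the grid lines, which matches the remark about horizontal and vertical boundaries.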
Example for CLIQUE Let us say that we want to cluster a set of records that have three attributes, namely salary, vacation and age. The data space for this data would be 3-dimensional. [Figure: three axes labelled vacation, age and salary.]
Example (Cont.) After plotting the data objects, each dimension (i.e., salary, vacation and age) is split into intervals of equal length. Then we form a 3-dimensional grid on the space, each unit of which would be a 3-D rectangle. Now, our goal is to find the dense 3-D rectangular units.
Example (Cont.) To do this, we find the dense units of the subspaces of this 3-D space. First, we find the dense units with respect to salary and age. This means that we look at the salary-age plane and find all the 2-D rectangular units that are dense. We also find the dense 2-D rectangular units for the vacation-age plane.
Example (Cont.) Now let us try to visualize the dense units of the two planes on the following 3-d figure :
Example (Cont.) We can extend the dense areas in the vacation-age plane inwards. We can extend the dense areas in the salary-age plane upwards. The intersection of these two spaces would give us a candidate search space in which 3-dimensional dense units exist. We then find the dense units in the salary-vacation plane and we form an extension of the subspace that represents these dense units.
Example (Cont.) Now, we perform an intersection of the candidate search space with the extension of the dense units of the salary-vacation plane. The 3-D units in this intersection are the only ones that can be dense, so checking them against the data gives us all the 3-D dense units. So, what was the main idea? We used the dense units in subspaces in order to find the dense units in the 3-dimensional space. After finding the dense units, it is very easy to find clusters.
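The intersection step of the salary, vacation and age example can be sketched as follows. The interval indices are made-up values, the name candidate_3d_units is hypothetical, and the result is only a candidate set: each surviving unit would still need a density check against the data.

```python
from itertools import product

def candidate_3d_units(dense_sa, dense_va, dense_sv):
    """Intersect the extensions of dense 2-D units from the three planes.

    dense_sa: dense (salary, age) units
    dense_va: dense (vacation, age) units
    dense_sv: dense (salary, vacation) units
    A 3-D unit (s, v, a) survives only if all three of its 2-D projections are dense.
    """
    salaries = {s for s, _ in dense_sa}
    vacations = {v for v, _ in dense_va}
    ages = {a for _, a in dense_sa} & {a for _, a in dense_va}
    return {
        (s, v, a)
        for s, v, a in product(salaries, vacations, ages)
        if (s, a) in dense_sa and (v, a) in dense_va and (s, v) in dense_sv
    }

# Illustrative interval indices for the dense units in each plane.
dense_sa = {(2, 3), (3, 3), (5, 1)}
dense_va = {(1, 3), (4, 1)}
dense_sv = {(2, 1), (3, 1), (5, 0)}
candidates = candidate_3d_units(dense_sa, dense_va, dense_sv)
print(sorted(candidates))
```

The unit (5, 4, 1) passes the salary-age and vacation-age tests but is dropped because its salary-vacation projection (5, 4) is not dense.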
Reflecting upon CLIQUE Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces? Because the Apriori property employs prior knowledge of the items in the search space so that portions of the space can be pruned. The property for CLIQUE says that if a k-dimensional unit is dense then so are its projections in the (k-1) dimensional space.
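The pruning that the Apriori property allows can be sketched for any dimensionality k. This is an Apriori-style join and prune written for illustration, not code from the CLIQUE authors. It assumes each unit is a tuple of (dimension, interval) pairs sorted by dimension, and the name generate_candidates is hypothetical.

```python
from itertools import combinations

def generate_candidates(dense_prev):
    """Build k-dimensional candidate units from dense (k-1)-dimensional units.

    Each unit is a tuple of (dimension, interval) pairs sorted by dimension.
    """
    dense_prev = set(dense_prev)
    candidates = set()
    for u1, u2 in combinations(sorted(dense_prev), 2):
        # Join step: same leading pairs, last pairs on different dimensions.
        if u1[:-1] == u2[:-1] and u1[-1][0] < u2[-1][0]:
            cand = u1 + (u2[-1],)
            # Prune step: every (k-1)-dimensional projection must be dense.
            if all(cand[:i] + cand[i + 1:] in dense_prev for i in range(len(cand))):
                candidates.add(cand)
    return candidates

# Dimensions: 0 = age, 1 = salary, 2 = vacation (interval indices are made up).
dense_2d = {
    ((0, 3), (1, 2)), ((0, 3), (2, 1)), ((1, 2), (2, 1)),
    ((0, 1), (1, 5)), ((0, 1), (2, 4)),
}
candidates_3d = generate_candidates(dense_2d)
print(candidates_3d)
```

The second possible join, ((0, 1), (1, 5), (2, 4)), is pruned because its projection ((1, 5), (2, 4)) is not dense, so by the property it cannot be a dense 3-D unit.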
Strength and Weakness of CLIQUE Strength: It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces. It is quite efficient. It is insensitive to the order of records in the input and does not presume some canonical data distribution. It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases. Weakness: The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.
Although the study of complete subgraphs goes back at least to the graph-theoretic reformulation of Ramsey theory by Erdős & Szekeres (1935), the term "clique" comes from Luce & Perry (1949), who used complete subgraphs in social networks to model cliques of people; that is, groups of people all of whom know each other. Cliques have many other applications in the sciences and particularly in bioinformatics.
A maximal clique is a clique that cannot be extended by including one more adjacent vertex, that is, a clique which does not exist exclusively within the vertex set of a larger clique. A maximum clique is a clique of the largest possible size in a given graph. The clique number ω(G) of a graph G is the number of vertices in a maximum clique in G.