V.Kavitha, Dr.M.Punithavalli / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622, Vol. 3, Issue 1, January-February 2013, pp. 1627-1633

Evaluating Cluster Quality Using a Modified Density Subspace Clustering Approach

V.Kavitha (1), Dr.M.Punithavalli (2)
(1) Research Scholar, Dept. of Computer Science, Karpagam University, Coimbatore, India.
(2) Director, Dept. of MCA, Sri Ramakrishna College of Engineering, Coimbatore, India.

Abstract
Clustering of time series data faces the curse of dimensionality, since real-world data consist of many dimensions. Finding clusters in the full feature space is difficult, and subspace clustering is a growing task; a density-based approach can identify clusters in high-dimensional point sets. Density subspace clustering is a method to detect the density-connected clusters in all subspaces of high-dimensional data when clustering time series data streams, and multidimensional clustering evaluation can be done through a density-based approach. In the proposed approach, a density subspace clustering algorithm is used to find the best cluster result from the dataset. The algorithm selects a set P of attributes from the dataset and then applies density clustering to the selected attributes. From the resulting clusters it calculates the intra-cluster and inter-cluster distances. Measuring the similarity between data objects in sparse, high-dimensional data plays a very important role in the success or failure of a clustering method: we evaluate the similarity between data points and consequently formulate criterion functions for clustering, improving accuracy. The proposed algorithm also addresses the density divergence problem (outliers). The clustering results of the proposed system are compared with existing clustering algorithms in terms of execution time and cluster quality. Experimental results show that the proposed system improves clustering quality and takes less time than the existing clustering methods.

Keywords — Density Subspace Clustering, Intra Cluster, Inter Cluster, Outlier, Hierarchical Clustering.

I. INTRODUCTION
Clustering is one of the most important tasks in the data mining process for discovering similar groups and identifying interesting distributions and patterns in the underlying data. The time series data stream clustering problem is about partitioning a given data set into groups such that the data points in a cluster are more similar to each other than to points in different clusters. Cluster analysis [3] aims at identifying groups of similar objects and helps to discover the distribution of patterns and interesting correlations in large data sets. It has been a subject of wide research since it arises in many application domains in engineering, business and the social sciences. Especially in recent years, the availability of huge transactional and experimental data sets and the arising requirements for data mining have created a need for clustering algorithms that scale and can be applied in diverse domains.

The clustering of time series data streams has become one of the important problems in the data mining domain. Most traditional algorithms do not support the fast arrival of large amounts of stream data. Among the traditional algorithms, a few one-pass clustering algorithms have been proposed for the data stream problem. These methods address the scalability issues of the clustering problem and the evolution of the data in the result, but do not address the following issues:

(1) The quality of the clusters is poor when the data evolves considerably over time.
(2) A data stream clustering algorithm requires much greater functionality to discover accurate results and explore clusters over different portions of the stream.

Charu C. Aggarwal [5] proposed a micro-clustering phase in the online statistical data collection portion of the algorithm. This method does not depend on any user input such as the time horizon or the required granularity of the clustering process. The aim of the method is to maintain statistics at a sufficiently high level of granularity to be used by the online components, such as horizon-specific macro-clustering as well as evolution analysis.

The clustering of time series data streams with incremental models requires decisions before all the data are available in the dataset, and such models are not guaranteed to find the best clustering result. For finding clusters in the feature space, subspace clustering is an emergent task. Clustering with a dissimilarity measure is a robust method to handle large amounts of data and is able to estimate the number of clusters automatically while avoiding overlap.
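The micro-clustering approach described above boils down to maintaining additive sufficient statistics per micro-cluster, so that they can be updated in a single pass over the stream. A minimal sketch of that bookkeeping (an illustration only, not the published CluStream implementation; the class and method names are our own):

```python
# One-pass micro-cluster statistics (illustrative sketch): each
# micro-cluster keeps additive sufficient statistics so that its
# centroid can be computed at any time without revisiting the stream.

class MicroCluster:
    def __init__(self, dim):
        self.n = 0                  # number of absorbed points
        self.ls = [0.0] * dim       # linear sum per dimension
        self.ss = [0.0] * dim       # squared sum per dimension

    def absorb(self, point):
        # O(dim) update; no stored points, so memory stays constant
        self.n += 1
        for d, x in enumerate(point):
            self.ls[d] += x
            self.ss[d] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]

mc = MicroCluster(dim=2)
for p in [(1.0, 2.0), (3.0, 4.0)]:
    mc.absorb(p)
print(mc.centroid())   # [2.0, 3.0]
```

Because the statistics are additive, two micro-clusters can also be merged by summing their fields, which is what makes horizon-specific macro-clustering cheap.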
Density subspace clustering is a method to detect the density-connected clusters in all subspaces of high-dimensional data. In our proposed approach, the density subspace clustering algorithm is used to find the best cluster result from the dataset. The algorithm selects a set P of attributes from the dataset, then applies density clustering to the selected attributes and calculates the intra-cluster and inter-cluster distances from the resulting clusters. The method finds the best cluster distance and repeats these steps until all the attributes in the dataset have been clustered, finally returning the best cluster result.

Clustering methods of this kind estimate a data distribution for newly arriving data and are able to generate cluster boundaries of arbitrary shape and size efficiently. Density-based clustering measures a dynamic dissimilarity based on a dynamical system associated with density-estimating functions. The theoretical basics of the proposed measure are developed and applied to construct a clustering method that can efficiently partition the whole data space. Clustering based on the density-based dissimilarity measure is robust in handling large amounts of data and able to estimate the number of clusters automatically while avoiding overlap. The dissimilarity values are evaluated and the clustering process is carried out with the density values.

Similarity measures that take the context of the features into consideration have also been employed, but they refer to continuous data, e.g., the Mahalanobis distance. Dino Ienco proposed a context-based distance for categorical attributes. The motivation of this work is that the distance between two values of a categorical attribute Ai can be determined by the way the values of the other attributes Aj are distributed over the dataset objects: if they are similarly distributed in the groups of objects corresponding to the distinct values of Ai, a low distance is obtained. The author also proposes a solution to the critical point of choosing the attributes Aj. The result was validated with various real-world and synthetic datasets, by embedding the distance learning method in both a partitional and a hierarchical clustering algorithm.

II. RELATED WORK
This work is based on a hierarchical approach, so the process is an incremental clustering process.

A. Hierarchical Clustering
A hierarchical clustering algorithm creates a hierarchical decomposition of the given set of data objects. Depending on the decomposition approach, hierarchical algorithms are classified as agglomerative (merging) or divisive (splitting). The agglomerative approach starts with each data point in a separate cluster, or with a certain large number of clusters. Each step merges the two clusters that are most similar; thus after each step the total number of clusters decreases. This is repeated until the desired number of clusters is obtained or only one cluster remains. By contrast, the divisive approach starts with all data objects in the same cluster. In each step, one cluster is split into smaller clusters, until a termination condition holds. Agglomerative algorithms are more widely used in practice, so the similarities between clusters are more heavily researched.

Hierarchical divisive methods generate a classification in a top-down manner, by progressively sub-dividing the single cluster which represents the entire dataset. Monothetic hierarchical divisive methods (divisions based on just a single descriptor) are generally much faster in operation than the corresponding polythetic methods (divisions based on all descriptors) and than hierarchical agglomerative methods, but tend to give poor results. Given a set of N items to be clustered and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering is:

   1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
   2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
   3. Compute the distances (similarities) between the new cluster and each of the old clusters.
   4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

B. Non-Hierarchical Clustering
A non-hierarchical method generates a classification by partitioning a dataset, giving a set of (generally) non-overlapping groups having no hierarchical relationships between them. A systematic evaluation of all possible partitions is quite infeasible, and many different heuristics have thus been described to allow the identification of good, but possibly sub-optimal, partitions. Non-hierarchical methods are generally much less demanding of computational resources than the hierarchical methods, since only a single partition of the dataset has to be formed.
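The four-step agglomerative procedure of Section II-A can be sketched directly from an N*N distance matrix. A minimal single-linkage version (illustrative only; the function name is ours, and it stops at k clusters rather than merging all the way down to one):

```python
# Naive agglomerative (single-linkage) clustering following the four
# steps above: start with one cluster per item, repeatedly merge the
# closest pair, and recompute cluster-to-cluster distances.

def single_linkage(dist, k):
    """dist: symmetric N*N distance matrix; k: desired cluster count."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between clusters is the
                # minimum pairwise distance between their members
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])   # merge the closest pair
        del clusters[b]
    return clusters

m = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 2],
     [9, 9, 2, 0]]
print(single_linkage(m, 2))   # [[0, 1], [2, 3]]
```

This naive scan is O(N^3) overall; production implementations keep a priority queue of pairwise distances to avoid rescanning the matrix after every merge.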
Three of the main categories of non-hierarchical method are single-pass, relocation and nearest-neighbour methods:
 Single-pass methods (e.g. Leader) produce clusters that depend upon the order in which the compounds are processed, and so will not be considered further;
 Relocation methods, such as k-means, assign compounds to a user-defined number of seed clusters and then iteratively reassign compounds to see if better clusters result. Such methods are prone to reaching local optima rather than a global optimum, and it is generally not possible to determine when or whether the globally optimal solution has been reached;
 Nearest-neighbour methods, such as the Jarvis-Patrick method, assign compounds to the same cluster as some number of their nearest neighbours. User-defined parameters determine how many nearest neighbours need to be considered, and the necessary level of similarity between nearest-neighbour lists.

III. METHODOLOGY
In this section, we discuss the existing ODAC system and the proposed IHCA and MDSC algorithms.

A. Online Divisive-Agglomerative Clustering
The Online Divisive-Agglomerative Clustering (ODAC) system is an incremental approach for clustering streaming time series using a hierarchical procedure over time. It constructs a tree-like hierarchy of clusters of streams, using a top-down strategy based on the correlation between streams, and also possesses an agglomerative phase that enables dynamic behaviour capable of structural change detection. ODAC is a variable clustering algorithm that constructs a hierarchical tree-shaped structure of clusters using a top-down strategy. The leaves are the resulting clusters, with a set of variables at each leaf: the union of the leaves is the complete set of variables, and the intersection of any two leaves is the empty set. The system encloses an incremental distance measure and executes procedures for expansion and aggregation of the tree-based structure, based on the diameters of the clusters.

The main mechanism of the system is the monitoring of the existing clusters' diameters. In a divisive hierarchical structure of clusters, considering stationary data streams, the overall intra-cluster dissimilarity should decrease with each split. For each existing cluster, the system finds the two variables defining the diameter of that cluster. If a given heuristic condition is met on this diameter, the system splits the cluster and assigns each of the chosen variables to one of the new clusters, which becomes the pivot variable for that cluster. Afterwards, all remaining variables of the old cluster are assigned to the new cluster with the closest pivot. New leaves start new statistics, assuming that only forthcoming information will be useful to decide whether or not a cluster should be split. This feature increases the system's ability to cope with changing concepts: later on, a test is performed such that if the diameters of the child leaves approach the parent's diameter, the previously taken decision may no longer reflect the structure of the data, so the system re-aggregates the leaves on the parent node, restarting the statistics. We propose our work to solve the density-based subspace problem by comparing the different densities at each of the subspace attributes as the time series data streams arrive.

B. Disadvantages of the existing system
 The split decision used in the algorithm focuses only on measuring the distance between the two groups, which carries a high risk when solving the density problems at different densities.
 The different densities at sub-attribute values change both the intra-cluster and inter-cluster measures.

C. IHCA (Improved Hierarchical Clustering Algorithm)
The Improved Hierarchical Clustering Algorithm (IHCA) is an algorithm for incremental clustering of streaming time sequences. It constructs a hierarchical tree-shaped structure of clusters using a top-down strategy. The leaves are the resulting clusters, with each leaf grouping a set of variables. The system includes an incremental distance measure and executes procedures for expansion and aggregation of the tree-based structure. The system monitors the flow of continuous time series data; a time interval is fixed, and within that interval the data points are partitioned. For each partition the diameter is calculated, where the diameter is simply the maximum distance between two points. Every data point of the partition is compared with the diameter value: if a data point exceeds the diameter value, the split process is executed; otherwise the aggregate (merge) process is performed. Based on these criteria the hierarchical tree grows. The splitting process must be observed carefully, because splitting decides the growth of the clusters. In ODAC the Hoeffding bound is used to observe the splitting process; in IHCA the Vapnik-Chervonenkis inequality is used instead. Using this technique the observation of the splitting process is improved, so the clusters are grouped properly.
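The Hoeffding bound that governs the split test can be computed directly. A small helper (our own naming; R is the range of the observed variable and delta the allowed failure probability) showing how the bound tightens as observations accumulate:

```python
# Hoeffding bound used to decide when enough stream observations have
# been seen to trust a split decision: after n observations of a
# random variable with range R, the true mean is within epsilon of
# the observed mean with confidence 1 - delta.

import math

def hoeffding_bound(R, delta, n):
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))

# The bound shrinks as more observations arrive, so a split whose
# advantage exceeds epsilon can be taken with confidence 1 - delta.
eps_small = hoeffding_bound(R=1.0, delta=0.05, n=100)
eps_large = hoeffding_bound(R=1.0, delta=0.05, n=10000)
assert eps_large < eps_small
print(round(eps_small, 4), round(eps_large, 4))   # 0.1224 0.0122
```

In a streaming split test, the difference between the two best candidate splits is compared against epsilon: once the observed difference exceeds the bound, the split is considered statistically safe.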
In the Hoeffding bound,

   ε = sqrt( R² ln(1/δ) / (2n) )                                (1)

which states that after n independent observations of a real-valued random variable r with range R, the observed mean is within ε of the true mean with confidence 1 − δ. In the proposed algorithm, the range value is increased from R² [1] to R^N [2], so the observation process is not fixed: depending on the number of nodes, the system generates the observation process.

D. Modified Density Subspace Clustering
Instead of finding clusters in the full feature space, subspace clustering is an emergent task which aims at detecting clusters embedded in subspaces. A cluster corresponds to a high-density region in a subspace, and identifying the dense regions is a major problem. Data points that fall outside the cluster range constitute "the density divergence problem". We propose a novel modified subspace clustering algorithm to discover clusters based on the relative region densities in the subspace attributes, where clusters are regarded as regions whose densities are relatively high compared to the region densities in a subspace. Based on this idea, different density thresholds are adaptively determined to discover the clusters in different subspace attributes. We also devise an innovative algorithm, referred to as MDSC (Modified Density Subspace Clustering), which adopts a divide-and-conquer scheme to efficiently discover clusters satisfying different density thresholds in different subspace cardinalities.

Advantages of the proposed system:
 The proposed system efficiently discovers clusters satisfying different density thresholds in different subspace attributes.
 It reduces the density divergence problem.
 It reduces the outliers (the data points which are out of range).
 It improves the intra-cluster and inter-cluster performance.

E. Algorithm for Modified Density Subspace Clustering
1. Initialize the selected attribute set and the total attribute set An; let X = {x1, x2, x3, ..., xn} be the set of data points.
2. Select a subset of attributes from the dataset, d = 1...n.
   While (the attribute subset is not empty)
   {
      Set eps
      Set minPts
      C = 0
      best distance = 0
      For each unvisited point P in dataset D
         mark P as visited
         N = getNeighbours(P, eps)
         if sizeof(N) < minPts
            mark P as NOISE
         else
            C = next cluster
            expandCluster(P, N, C, eps, minPts)
      Calculate the average cluster distance
         f_i = sum over j of d(x_ij, c_i), i = 1...Nc
      where x_ij denotes the jth data point belonging to cluster i, Nc stands for the number of clusters, and d(x_ij, c_i) is the distance between data point x_ij and the cluster centroid c_i.
      best distance = 0
      for each value f_i
         if (f_i > best distance)
         {
            best distance = f_i
            selected attribute set = attributes giving best distance f_i
            Hierarchical_clustering()
         }
      d = d + 1
   }
   End while

expandCluster(P, N, C, eps, minPts)
{
   add P to cluster C
   for each point P' in N
      if P' is not visited
         mark P' as visited
         N' = getNeighbours(P', eps)
         if sizeof(N') >= minPts
            N = N joined with N'
      if P' is not yet a member of any cluster
         add P' to cluster C
   return the cluster C grown from data point P
}

Hierarchical_clustering()
I. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
II. Find the least-distance pair of clusters in the current clustering, say pair (Ci, Cj), according to d[Ci, Cj] = min d[i, j] = best distance, where the minimum is over all pairs of clusters in the current clustering.
III. Increment the sequence number: m = m + 1. Merge clusters Ci and Cj into a single cluster to form clustering m. Set the level of this clustering to L(m) = d[Ci, Cj].
IV. Update the distance matrix D by deleting the rows and columns corresponding to clusters Ci and Cj and adding a row and column corresponding to the newly formed cluster. The distance between the new cluster (Ci ∪ Cj) and an old cluster Ck is defined as d[Ck, Ci ∪ Cj] = min[d(k,i), d(k,j)].
V. If all the data points are in one cluster then stop, else repeat from step II.
3. End
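The density step above follows the familiar eps/minPts neighbourhood expansion of density-based clustering [2]. A compact, self-contained sketch on one-dimensional points (illustrative only; the real algorithm runs this per selected attribute subset, and a full implementation would also deduplicate the expansion queue):

```python
# A DBSCAN-style pass matching the expandCluster pseudocode above:
# points with at least min_pts eps-neighbours seed clusters, which
# grow by absorbing the neighbourhoods of their core members.

def dbscan(points, eps, min_pts):
    labels = {}                        # point index -> cluster id or 'NOISE'

    def neighbours(i):
        return [j for j in range(len(points))
                if abs(points[j] - points[i]) <= eps]

    cid = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = 'NOISE'        # may later become a border point
            continue
        cid += 1
        labels[i] = cid
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == 'NOISE':
                labels[j] = cid        # border point reachable from a core
            if j in labels:
                continue
            labels[j] = cid
            more = neighbours(j)
            if len(more) >= min_pts:   # j is also a core point: keep growing
                queue.extend(more)
    return labels

pts = [1.0, 1.1, 1.2, 5.0, 5.1, 9.9]
labels = dbscan(pts, eps=0.3, min_pts=2)
```

On this input the three points near 1.0 form one cluster, the pair near 5.0 a second, and 9.9 is labelled NOISE, which is exactly the density-divergence case the paper targets.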
IV. EXPERIMENTAL RESULTS
The main objective of this section is to measure the results of the proposed system against the existing systems. The performance of the cluster results and the cluster analysis is measured in terms of cluster quality (intra cluster and inter cluster) and computation time. The proposed system adapts well to the dynamic behaviour of time series data streams, so we evaluate it with real data produced by applications that generate time series data streams.

A. Evaluation Criteria for Clustering Quality
Generally, the criteria used to evaluate clustering methods concentrate on the quality of the resulting clusters. Given the hierarchical characteristics of the system, one criterion is the quality of the hierarchy constructed by our algorithm; another evaluation criterion is the computation time of the system.

B. Cluster Quality
A good clustering algorithm will produce high-quality clusters based on intra-cluster similarity and inter-cluster similarity measures. The quality of the clustering result depends on the similarity measure used by the method and its implementation, and is also measured by the method's ability to discover some or all of the hidden patterns. The criteria for measuring cluster quality are that intra-cluster similarity should be high and inter-cluster similarity should be low. Analysing cluster quality takes two forms: first, finding groups of objects that are related to one another; and second, finding groups of objects that differ from the objects in other groups.

C. Computation Time
Another evaluation of this work is calculating the computation time of the process. The complexity of the execution time decreases when using the proposed work.

D. Outlier
An outlier is a data point which is out of the range of its cluster. The outlier count is calculated for the existing methods ODAC (Online Divisive-Agglomerative Clustering) and IHCA and for the proposed method MDSC (Modified Density Subspace Clustering).

Outlier Calculation
Step 1: The intra-cluster value is calculated for all clusters.
Step 2: The mean of the intra-cluster values is found.
Step 3: All the data points of the clusters are compared with the mean value.
Step 4: After comparison, each data point is decided to lie either within its cluster or outside it.

V. SYSTEM EVALUATION ON TIME SERIES DATA SETS
The proposed method is evaluated with different kinds of time series data sets. Three types of data sets are used to evaluate the proposed algorithm: ECG data, EEG data and network sensor data. The ECG data set is used for anomaly identification; it has three attributes: time (seconds), left peak and right peak. The EEG data set is used to detect abnormal personality; its attributes are trial number, sensor value, sensor position and sample number. The third data set is network sensor data; its attributes are total bytes, in bytes, out bytes, total packages, in packages, out packages and events.

A. Record Set Specification

TABLE I. DATA SET SPECIFICATION

   Data Set         Number of Instances   Number of Attributes
   ECG              1800                  3
   EEG              1644                  4
   Sensor Network   2500                  7

Using the above three data sets we calculate the execution time of the system and the intra-cluster, inter-cluster and outlier values.

B. Result of Outlier

TABLE 2. OUTLIER SPECIFICATION

   Technique                Outlier Points
   Existing System (ODAC)   152
   Existing System (IHCA)   123
   Proposed System (MDSC)   63
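The four-step outlier calculation of Section IV-D can be sketched as follows. The threshold multiplier is our assumption: the paper compares points against the mean intra-cluster distance but does not give an exact cutoff.

```python
# Sketch of the outlier rule described above: compute each point's
# distance to its cluster centroid, take the mean of those distances,
# and flag points that lie well beyond that mean.
# NOTE: the 'factor' multiplier is an assumption, not from the paper.

def centroid(cluster):
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def outliers(cluster, factor=1.5):
    c = centroid(cluster)
    ds = [dist(p, c) for p in cluster]
    mean_d = sum(ds) / len(ds)
    # flag points much farther from the centroid than the mean distance
    return [p for p, d in zip(cluster, ds) if d > factor * mean_d]

cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(outliers(cluster))   # [(5.0, 5.0)]
```

Points the rule flags would be counted towards the outlier totals reported in Table 2.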
C. Result of Execution Time
The following table shows the difference in execution time between the existing and proposed techniques.

TABLE 3. EXECUTION TIME BETWEEN EXISTING AND PROPOSED SYSTEMS (seconds)

   No. of Clusters   ODAC     IHCA     MDSC
   2                 2.0343   1.9066   1.3782
   4                 2.0493   1.9216   1.3664
   6                 2.1043   1.9766   1.3641
   8                 2.1115   1.9838   1.3901
   10                2.0536   1.9259   1.3251

FIGURE 1. EXECUTION TIME BETWEEN THE EXISTING AND PROPOSED SYSTEMS

TABLE 4. INTRA CLUSTER BETWEEN EXISTING AND PROPOSED SYSTEMS

   No. of Clusters   ODAC     IHCA     MDSC
   2                 898.94   865.15   831.72
   4                 448.87   415.63   382.21
   6                 346.56   313.41   279.98
   8                 271.54   238.54   205.11
   10                199.04   166.04   132.61

FIGURE 2. INTRA CLUSTER BETWEEN EXISTING AND PROPOSED SYSTEMS

TABLE 5. INTER CLUSTER BETWEEN EXISTING AND PROPOSED SYSTEMS

   No. of Clusters   ODAC     IHCA     MDSC
   2                 295.64   375.84   393.26
   4                 262.72   279.07   296.42
   6                 233.27   204.84   222.26
   8                 156.65   148.74   166.16
   10                136.54   119.89   137.31

FIGURE 3. INTER CLUSTER BETWEEN EXISTING AND PROPOSED SYSTEMS

VI. CONCLUSION AND FUTURE WORK
Clustering of time series data faces the curse of dimensionality, since real-world data consist of many dimensions. Finding clusters in the full feature space is difficult, and subspace clustering is a growing task.
A density-based approach identifies clusters in high-dimensional point sets. Density subspace clustering is a method to detect the density-connected clusters in all subspaces of high-dimensional data for clustering time series data streams, and multidimensional clustering evaluation can be done through a density-based approach. The Modified Density Subspace Clustering algorithm is used to find the best cluster result from the dataset, improving cluster quality and evaluating the similarity between the data points in the clustering. The MDSC algorithm also addresses the density divergence problem (outliers). The clustering results of the proposed system were compared with existing clustering algorithms in terms of execution time and cluster quality. Experimental results show that the proposed system improves clustering quality and takes less time than the existing clustering methods. Future work is to optimize the centroid point using a multi-viewpoint approach, and to apply this technique to non-time-series data sets as well.

References
[1] Yi-Hong Chu, Jen-Wei Huang, Kun-Ta Chuang, De-Nian Yang, "Density Conscious Subspace Clustering for High Dimensional Data," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 1, January 2010.
[2] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, E. Simoudis, J. Han, and U. Fayyad, Eds. AAAI Press, 1996, pp. 226–.
[3] Y. Kim, W. Street, and F. Menczer, "Feature Selection in Unsupervised Learning via Evolutionary Search," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 365-369, 2000.
[4] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "On clustering validation techniques," Journal of Intelligent Information Systems, vol. 17, no. 2-3, pp. 107–145, 2001.
[5] Pedro Pereira Rodrigues and Joao Pedro Pedroso, "Hierarchical Clustering of Time Series Data Streams."
[6] Ashish Singhal and Dale E. Seborg, "Clustering Multivariate Time Series Data," Journal of Chemometrics, vol. 19, pp. 427-438, Jan 2006.
[7] Sudipto Guha, Adam Meyerson, Nina Mishra and Rajeev Motwani, "Clustering Data Streams: Theory and Practice," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 515-528, May/June 2003.
[8] Ashish Singhal and Dale E. Seborg, "Clustering Multivariate Time Series Data," Journal of Chemometrics, vol. 19, pp. 427-438, Jan 2006.
[6] Ashish Singhal, and Dale E Seborg, ―Clustering Multivarriate Time SeriesReferences Data,‖ Journal of Chemometrics, vol. 19, [1] Yi-Hong Chu, Jen-Wei Huang, Kun-Ya pp. 427-438, Jan 2006. Chuang, DeNian Yang, ―Density Conscious [7] Sudipto Guha, Adam Meyerson, Nine Subspace Clustering for High Dimensional Mishra and Rajeev Motiwani, ―Clustering Data‖ IEEE Transactions on Knowledge Data Streams: Theory and Practice‖, IEEE and Data Engineering. Vol 22, No 1, Transactions on Knowledge and Data January 2010. Engineering. Vol. 15, no. 3, pp. 515-528, [2] M. Ester, H.-P. Kriegel, J. Sander, and X. May/June 2003. Xu, ―A density-based algorithm for [8] Ashish Singhal, and Dale E Seborg, discovering clusters in large spatial ―Clustering Multivarriate Time Series databases with noise,‖ in Proceedings of the Data,‖ Journal of Chemometrics, vol. 19, Second International Conference on pp. 427-438, Jan 2006. Knowledge Discovery and Data Mining, E. 1633 | P a g e