International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 - 6367(Print), ISSN 0976 - 6375(Online), Volume 3, Issue 3, October-December (2012), pp. 377-383
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2012): 3.9580 (Calculated by GISI), www.jifactor.com

A FRAMEWORK FOR CLUSTERING TIME EVOLVING DATA USING SLIDING WINDOW TECHNIQUE

Y. Swapna (Faculty, CSE Department, National Institute of Technology, Goa, India, spr@nitgoa.ac.in)
S. Ravi Sankar (Faculty, CSE Department, National Institute of Technology, Goa, India, srs@nitgoa.ac.in)

ABSTRACT

Clustering is the process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another and different groups are as "far" as possible from one another. Sampling represents a large data set by a smaller random sample of the data and is used to improve the efficiency of clustering. When sampling is applied, however, the points that are not sampled are left without cluster labels after the normal clustering process. This problem has been solved for the numerical domain, whereas clustering of time-evolving data in the categorical domain remains a challenging issue. In this paper, a sliding window of specified size is used to form subsets of data from the dataset, which are transferred to the clustering module. We propose drifting concept detection, a new algorithm that counts the outliers that cannot be assigned to any cluster and compares the distribution of clusters and outliers between the last clustering result and the current temporal clustering result.
The experimental evaluation shows that performing DCD is faster than clustering the entire data set once, and that DCD provides high-quality clustering results with correctly detected drifting concepts.

Keywords: clustering, sampling, categorical domain, labels, sliding window, drifting concept detection.

I. INTRODUCTION

Our present information-age society thrives and evolves on knowledge. Knowledge is derived from information gleaned from a wide variety of reservoirs of data (databases). Clustering is an important technique for exploratory data analysis and has been the focus of substantial research in several domains for decades. Clusters are connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such
regions by a region containing a low density of points. Clustering is useful for classification, statistical pattern recognition, machine learning, and information retrieval because of its use in a wide range of applications; it can reveal the structure in high-dimensional data spaces, where the outliers themselves may be interesting. Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. It helps us gain insight into the data distribution. In real-world domains, the concept of interest may depend on some hidden context, not given explicitly in the form of predictive features, which becomes a problem as these concepts drift with time. A suitable example is the buying preferences of customers, which may change with time depending on their needs, climatic conditions, discounts, etc. Since the concepts behind the data evolve with time, the underlying clusters may also change significantly with time. Concept drift not only decreases the quality of clusters but also disregards the expectations of users, who usually require recent clustering results. Many works have explored the problem of clustering time-evolving data in the numerical domain.

Categorical attributes also prevalently exist in real data with drifting concepts; for example, Web logs that record the browsing history of users, stock market details, and buying records of customers often evolve with time. Previous works on clustering categorical data focus on clustering the entire data set, and drifting concepts were not taken into consideration. Consequently, the problem of clustering time-evolving data in the categorical domain remains a challenging issue. The objective is to propose a framework for performing clustering on categorical time-evolving data.
The goal is to use a generalized clustering framework that utilizes existing clustering algorithms and detects whether there is a drifting concept in the incoming data, instead of designing a specific clustering algorithm. The sliding window technique is adopted to detect the drifting concepts.

II. RELATED WORK

Many different numerical clustering algorithms have been proposed that consider time-evolving data, alongside traditional categorical clustering algorithms [1]. An effective and efficient method called CluStream for clustering large evolving data streams was proposed in [5]. This method tries to cluster the whole stream at one time rather than viewing the stream as a changing process over time. A density-based method called DenStream was proposed in [2] for discovering clusters in an evolving data stream. Evolutionary clustering algorithms were proposed in [5] and [3]. They adopted the same approach, performing data clustering over time while trying to optimize two potentially conflicting criteria: first, without a drifting concept, the previous and the present clusterings must be similar; and second, with a drifting concept, the clustering should reflect the data that arrived at that time step. In [6], a generic framework for this problem extended the k-means and agglomerative hierarchical clustering algorithms according to the problem domain. In [5], a measure of temporal smoothness is integrated into the overall measure of clustering quality. As a result, the proposed method produces stable and consistent clustering results that are less sensitive to short-term noise while remaining adaptive to long-term cluster drifts. These previously proposed methods have concentrated on the problem of clustering time-evolving data in the numerical domain. In [4], the problem of clustering categorical data is discussed, performing clustering on customer transaction data in a market database.
In [6], [4], a framework to perform clustering on categorical time-evolving data has been proposed. In particular, the rough membership function in rough set theory represents a concept that induces a fuzzy set. Several extension works based on k-modes are presented for different objectives: fuzzy k-modes [6], initial points refinement [2], etc. These categorical algorithms focus on performing clustering on the entire data set and do not consider time-evolving trends.

III. THE PROPOSED APPROACH

We propose a generalized clustering framework that utilizes existing clustering algorithms and detects whether there is a drifting concept in the incoming data. To detect drifting concepts at different sliding windows, we propose the algorithm DCD, which compares the cluster distributions between the last clustering result and the current temporal clustering result.

The input is a collection of data extracted from the database that we are going to cluster; the data are time-evolving categorical data (not necessarily arriving in a sequential manner). We used a synthetic data generator [5] to generate data sets with different numbers of data points and attributes. The number of data points varies from 10,000 to 100,000, and the dimensionality is in the range of 10-50. In all synthetic data sets, each dimension possesses 20 attribute values.

A sliding window of specified size is used to form subsets of data from the dataset, which are transferred to the clustering module. In this paper, a practical categorical clustering representative, named "Node Importance Representative" (abbreviated as NIR), is utilized. It represents clusters by measuring the importance of each attribute value in the clusters.
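The sliding-window partitioning and the NIR table can be sketched as follows. The paper does not reproduce the exact NIR weighting formula in this transcript, so this sketch approximates the importance of a node by its relative frequency within the cluster; the function and variable names are illustrative.

```python
from collections import Counter

def sliding_windows(points, size):
    """Partition the data set into consecutive sliding windows S1, S2, ..."""
    return [points[i:i + size] for i in range(0, len(points), size)]

def build_nir(cluster):
    """Build a Node Importance Representative (NIR) table for one cluster.

    Each node is an (attribute_index, value) pair; its importance is
    approximated here by its relative frequency inside the cluster (a stand-in
    for the paper's weighting w(c, I), which this transcript does not give).
    """
    n = len(cluster)
    counts = Counter((a, v) for point in cluster for a, v in enumerate(point))
    return {node: c / n for node, c in counts.items()}

# Example: the first sliding window S1 from Fig. 1 (points p1..p5).
s1 = [("C", "W", "D"), ("I", "W", "M"), ("C", "W", "N"),
      ("S", "W", "M"), ("C", "W", "D")]
nir = build_nir(s1)
# Every point in S1 has A2 = W, so that node gets the maximal importance 1.0.
```

Under this frequency-based approximation, nodes shared by all points of a cluster dominate the table, which matches the intent of NIR: attribute values that characterize a cluster carry the most weight.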
The Drifting Concept Detection (DCD) algorithm, given below, is used to detect the difference in cluster distribution between the current data subset and the last clustering result. To perform a proper evaluation, we label the clusters, and a point that does not belong to any cluster is called an outlier. Reclustering is performed if the difference between the clusters is large enough. A data point p_j resembles a cluster c_k, 1 ≤ k ≤ l, when that cluster attains the maximum of the following measure. The resemblance for a given data point p_j and the NIR table of cluster c_k is defined by:

    R(p_j, c_k) = Σ_{r=1}^{q} w(c_k, I_kr)    (1)

where I_kr is one entry in the NIR table of cluster c_k. As shown in equation (1), the resemblance is obtained directly by summing up the nodes' importance in the NIR table of cluster c_k. The resemblance is larger when the data point contains nodes that are more important in one cluster than in another, and that cluster is considered to obtain the maximal resemblance. If the resemblance values are small for every cluster, the point is treated as an outlier. Therefore, a threshold λ_i is set in each cluster to identify outliers. The decision function is defined as follows:

    Label(p_j) = c_i*,     if max R(p_j, c_i) ≥ λ_i, where 1 ≤ i ≤ l;
                 outlier,  otherwise.

As shown in Fig. 1, the data points in the second sliding window undergo data labeling with thresholds λ1 = λ2 = 0.5. The first data point p6 = (B, E, F) in S2 is decomposed
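Equation (1) and the decision function can be sketched directly: resemblance is the sum of the point's node importances in a cluster's NIR table, and a point whose maximal resemblance falls below that cluster's threshold λ is marked as an outlier. The names are illustrative, and the NIR values below are chosen only to reproduce the worked numbers from the text (0.037 and 1.537 for p7), not taken from the paper's actual tables.

```python
def resemblance(point, nir_table):
    """R(p_j, c_k): sum of the importances of the point's nodes in cluster c_k."""
    return sum(nir_table.get((a, v), 0.0) for a, v in enumerate(point))

def label(point, clusters, thresholds):
    """Assign the point to the cluster with maximal resemblance, or mark it
    as an outlier when that maximum is below the cluster's threshold lambda_i."""
    best = max(clusters, key=lambda name: resemblance(point, clusters[name]))
    if resemblance(point, clusters[best]) >= thresholds[best]:
        return best
    return "outlier"

# Illustrative NIR tables reproducing the example: p7 = (I, T, H) has
# resemblance 0.037 in c1 and 1.537 (= 0.5 + 0.037 + 1.0) in c2.
clusters = {
    "c1": {(0, "I"): 0.037},
    "c2": {(0, "I"): 0.5, (1, "T"): 0.037, (2, "H"): 1.0},
}
thresholds = {"c1": 0.5, "c2": 0.5}
```

With λ = 0.5, `label(("I", "T", "H"), ...)` assigns p7 to c2, while a point such as p6 = (B, E, F), whose nodes appear in neither table, gets resemblance zero everywhere and is labeled an outlier, exactly as in the decision function above.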
into three nodes, i.e., {[A1=B]}, {[A2=E]}, {[A3=F]}. The resemblance of p6 in c1^1 is zero, and in c2^1 it is also zero; since the maximal resemblance is not larger than the threshold, the data point p6 is considered an outlier. The resemblance of p7 in c1^1 is 0.037, and in c2^1 it is 1.537 (0.5 + 0.037 + 1). The maximal resemblance value is therefore R(p7, c2^1); since it is larger than the threshold λ2 = 0.5, p7 is labeled with cluster c2^1. (Here c_i^t denotes the i-th cluster at sliding window t.)

         p1  p2  p3  p4  p5 | p6  p7  p8  p9  p10 | p11 p12 p13 p14 p15
    A1   C   I   C   S   C  | B   I   B   S   B   | S   I   Z   I   S
    A2   W   W   W   W   W  | E   T   E   I   O   | W   W   P   W   W
    A3   D   M   N   M   D  | F   H   G   H   G   | P   P   T   P   P
             S1             |         S2          |         S3

    Fig. 1: The temporal clustering result C'^2 obtained by data labeling
    (the clusters c1^1 and c2^1 of S1 yield c'1^2, c'2^2, and a set of
    outliers on S2).

Algorithm DriftingConceptDetecting(temp, S^t), where temp = C^[te, t-1]:

    out = 0
    while there is a next tuple in S^t do
        read in data point p_j from S^t
        divide p_j into nodes I_1 to I_q
        for all clusters temp_i in temp do
            calculate the resemblance R(p_j, temp_i)
        end for
        find the cluster temp_m with maximal resemblance
        if R(p_j, temp_m) ≥ λ_m then
            p_j is assigned to c'_m^t
        else
            out = out + 1
        end if
    end while
    Outlier = out        {data labeling on the current sliding window is done}
    numdiffclusters = 0
    for all clusters temp_i in temp do
        if | m_i^[te, t-1] / Σ_{x=1}^{k} m_x^[te, t-1] - m_i^t / Σ_{x=1}^{k} m_x^t | > ε then
            numdiffclusters = numdiffclusters + 1
        end if
    end for
    if Outlier / N > θ or numdiffclusters / k^[te, t-1] > η then
        {concept drifts}
        dump out temp
        call initial clustering on S^t
    else
        {concept does not drift}
        add C'^t into temp
        update NIR as C^[te, t]
    end if

Since we measure the similarity between the data point p_j and cluster c_i as R(p_j, c_i), the cluster with the maximal resemblance is the most appropriate cluster for that data point. If the maximal resemblance (that of the most appropriate cluster) is smaller than the threshold λ_i of that cluster, the data point is treated as an outlier. To observe the relationship between different clustering results, cluster relationship analysis is used to analyze and show the changes between clustering results. It measures the similarity of clusters between the clustering results at different time stamps and links the similar clusters.

                c1^1    c2^1
        c1^2    0.012   0.567
        c2^2    0.182   0

                c1^2    c2^2
        c1^3    1       0
        c2^3    0       0

    Fig. 2: The similarity table between clustering results

The cosine measure CM(c1^2, c2^1) = 0.567, which is larger than CM(c1^2, c1^1) = 0.012; therefore cluster c1^2 is said to be more similar to c2^1 than to cluster c1^1.
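The DCD algorithm above can be sketched end to end: label the current window with the last clustering result, then flag a drift when the outlier ratio exceeds θ or when enough clusters change their share of points by more than ε. The helper names and the structure of the NIR tables are illustrative sketches, not the paper's exact implementation.

```python
def dcd(last_result, window, lambdas, theta=0.1, epsilon=0.1, eta=0.5):
    """Drifting Concept Detection: compare the last clustering result with a
    temporal clustering of the current sliding window.

    last_result: {cluster_name: (nir_table, point_count)} from C^[te, t-1],
                 where nir_table maps (attribute_index, value) to a weight
    window:      list of categorical data points in S^t
    lambdas:     per-cluster outlier thresholds
    Returns True when a drifting concept is detected (reclustering needed).
    """
    counts = {name: 0 for name in last_result}
    outliers = 0
    for point in window:  # data labeling on S^t via maximal resemblance
        scores = {name: sum(nir.get((a, v), 0.0) for a, v in enumerate(point))
                  for name, (nir, _) in last_result.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= lambdas[best]:
            counts[best] += 1
        else:
            outliers += 1

    # Compare the cluster distributions of the two results.
    old_total = sum(n for _, n in last_result.values())
    new_total = max(len(window) - outliers, 1)
    numdiff = sum(1 for name, (_, n) in last_result.items()
                  if abs(n / old_total - counts[name] / new_total) > epsilon)

    k = len(last_result)
    return outliers / len(window) > theta or numdiff / k > eta
```

If `dcd(...)` returns True, the last result is dumped and initial clustering is rerun on S^t; otherwise the temporal result C'^t is kept and the NIR is updated to cover C^[te, t], as in the algorithm above.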
Table 1: Symbols used in the algorithm

    A_a            The a-th attribute in the data set.
    C^[t1, t2]     The clustering result from t1 to t2.
    C^t            The clustering result on sliding window t.
    C'^t           The temporal clustering result on sliding window t.
    c_j            The j-th cluster in C.
    c̄_i            The node importance vector of c_i.
    I_ir           The r-th node in c_i.
    |I_ir|         The number of occurrences of I_ir.
    k              The number of clusters in C.
    m_i            The number of data points in c_i.
    N              The size of the sliding window.
    S^t            The sliding window t.
    t              The timestamp index of the sliding window.
    w(c_i, I_ir)   The importance of I_ir in c_i.
    θ              The outlier threshold.
    ε              The cluster variation threshold.
    η              The cluster difference threshold.
    CM(c_i, c_j)   The cosine measure between cluster vectors c̄_i and c̄_j.

IV. RESULTS

The following results show that, in terms of precision and recall, DCD is effective at detecting drifting concepts.

    N = 1,000

    Settings   Drifting   Precision   Recall
    D1         35.6       0.557       0.873
    D2         39.2       0.825       0.992
    D3         46         0.816       0.98
    D4         44.5       0.443       0.97

    Fig. 3: The precision and recall of the DCD

We change clustering pairs to obtain data sets with drifting concepts and then test the detection accuracy of the DCD algorithm on those data sets. The outlier threshold θ is set to 0.1, the cluster variation threshold ε is set to 0.1, and the cluster difference threshold η is set to 0.5. The number of clusters k, the required parameter of the initial clustering and reclustering steps, is set to the maximum number of clusters in each setting, e.g., k = 10 in D1 and k = 20 in D3. In addition, each synthetic data set is generated by randomly combining 50 clustering results on that data set setting, and the precision and recall shown in Fig. 3 are the averages of 20 experiments.
The precision and recall are more than 80 percent when the size of the sliding window is larger than 2,000. They are somewhat lower when the size of the sliding window is set to 1,000, because drifting concepts often cross two windows: we count only one window as a
correct hit, and the other window is considered a miss. However, the detection recall is highest when the size of the sliding window is set to 1,000: drifting concepts are less likely to be missed when the data set is partitioned at this finer granularity. If we take two example bank datasets synthesized with settings D1 and D2 and evaluate the clustering results on each sliding window, the framework generates a new clustering result whenever a drifting concept is detected, and it responds quickly to the trend of the evolving dataset.

V. CONCLUSION

In this paper, we have proposed a framework to perform clustering on categorical time-evolving data. To detect drifting concepts at different sliding windows, we proposed the algorithm DCD, which compares the cluster distributions between the last clustering result and the current temporal clustering result. If the results are quite different, the last clustering result is dumped, and the current data in the sliding window are reclustered. To observe the relationship between different clustering results, cluster relationship analysis is used to analyze and show the changes between clustering results. The experimental evaluation shows that performing DCD is faster than clustering the entire data set once, and that DCD provides high-quality clustering results with correctly detected drifting concepts. Therefore, the results demonstrate that our framework is practical for detecting drifting concepts in time-evolving categorical data.

VI. REFERENCES

[1] D. Barbara, Y. Li, and J. Couto, Coolcat: An Entropy-Based Algorithm for Categorical Clustering, Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM), 2002.
[2] F. Cao, M. Ester, W. Qian, and A.
Zhou, Density-Based Clustering over an Evolving Data Stream with Noise, Proc. Sixth SIAM Int'l Conf. Data Mining (SDM), 2006.
[3] H.-L. Chen, K.-T. Chuang, and M.-S. Chen, Labeling Unclustered Categorical Data into Clusters Based on the Important Attribute Values, Proc. Fifth IEEE Int'l Conf. Data Mining (ICDM), 2005.
[4] O. Nasraoui, M. Soliman, E. Saka, A. Badia, and R. Germain, A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites, IEEE Trans. Knowledge and Data Eng., vol. 20, no. 2, pp. 202-215, Feb. 2008.
[5] H.-L. Chen, M.-S. Chen, and S.-C. Lin, Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data, IEEE Trans. Knowledge and Data Eng., vol. 21, no. 5, May 2009.
[6] Z. Huang and M.K. Ng, A Fuzzy k-Modes Algorithm for Clustering Categorical Data, IEEE Trans. Fuzzy Systems, 1999.