
Clustering with Multiviewpoint-Based Similarity Measure
Duc Thang Nguyen, Lihui Chen, Senior Member, IEEE, and Chee Keong Chan
IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 6, June 2012, pp. 988-1001


Abstract—All clustering methods have to assume some cluster relationship among the data objects that they are applied on. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multiviewpoint-based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints, which are objects assumed to not be in the same cluster with the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity could be achieved. Theoretical analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.

Index Terms—Document clustering, text mining, similarity measure.

The authors are with the Division of Information Engineering, School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Nanyang Avenue, Republic of Singapore 639798. E-mail: victorthang.ng@pmail.ntu.edu.sg, {elhchen, eckchan}@ntu.edu.sg. Manuscript received 16 July 2010; revised 25 Feb. 2011; accepted 10 Mar. 2011; published online 22 Mar. 2011. Recommended for acceptance by M. Ester. Digital Object Identifier no. 10.1109/TKDE.2011.86.

1 INTRODUCTION

Clustering is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data, and organize them into meaningful subgroups for further study and analysis. There have been many clustering algorithms published every year. They can be proposed for very distinct research fields, and developed using totally different techniques and approaches. Nevertheless, according to a recent study [1], more than half a century after it was introduced, the simple algorithm k-means still remains as one of the top 10 data mining algorithms nowadays. It is the most frequently used partitional clustering algorithm in practice. Another recent scientific discussion [2] states that k-means is the favorite algorithm that practitioners in the related fields choose to use. Needless to mention, k-means has more than a few basic drawbacks, such as sensitiveness to initialization and to cluster size, and its performance can be worse than other state-of-the-art algorithms in many domains. In spite of that, its simplicity, understandability, and scalability are the reasons for its tremendous popularity. An algorithm with adequate performance and usability in most application scenarios could be preferable to one with better performance in some cases but limited usage due to high complexity. While offering reasonable results, k-means is fast and easy to combine with other methods in larger systems.

A common approach to the clustering problem is to treat it as an optimization process. An optimal partition is found by optimizing a particular function of similarity (or distance) among data. Basically, there is an implicit assumption that the true intrinsic structure of data could be correctly described by the similarity formula defined and embedded in the clustering criterion function. Hence, the effectiveness of clustering algorithms under this approach depends on the appropriateness of the similarity measure to the data at hand. For instance, the original k-means has a sum-of-squared-error objective function that uses euclidean distance. In a very sparse and high-dimensional domain like text documents, spherical k-means, which uses cosine similarity (CS) instead of euclidean distance as the measure, is deemed to be more suitable [3], [4].

In [5], Banerjee et al. showed that euclidean distance was indeed one particular form of a class of distance measures called Bregman divergences. They proposed the Bregman hard-clustering algorithm, in which any kind of the Bregman divergences could be applied. Kullback-Leibler divergence was a special case of Bregman divergences that was said to give good clustering results on document data sets. Kullback-Leibler divergence is a good example of a nonsymmetric measure. Also on the topic of capturing dissimilarity in data, Pekalska et al. [6] found that the discriminative power of some distance measures could increase when their noneuclidean and nonmetric attributes were increased. They concluded that noneuclidean and nonmetric measures could be informative for statistical learning of data. In [7], Pelillo even argued that the symmetry and nonnegativity assumption of similarity measures was actually a limitation of current state-of-the-art clustering approaches. Simultaneously, clustering still requires more robust dissimilarity or similarity measures; recent works such as [8] illustrate this need.

The work in this paper is motivated by investigations from the above and similar research findings. It appears to us that the nature of the similarity measure plays a very important role in the success or failure of a clustering method. Our first objective is to derive a novel method for measuring similarity between data objects in sparse and high-dimensional domains, particularly text documents.
From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable like k-means, but are also capable of providing high-quality and consistent performance.

The remainder of this paper is organized as follows: In Section 2, we review related literature on similarity and clustering of documents. We then present our proposal for a document similarity measure in Section 3.1. It is followed by two criterion functions for document clustering and their optimization algorithms in Section 4. Extensive experiments on real-world benchmark data sets are presented and discussed in Sections 5 and 6. Finally, conclusions and potential future work are given in Section 7.

2 RELATED WORK

First of all, Table 1 summarizes the basic notations that will be used extensively throughout this paper to represent documents and related concepts.

[TABLE 1: Notations.]

Each document in a corpus corresponds to an m-dimensional vector d, where m is the total number of terms that the document corpus has. Document vectors are often subjected to some weighting scheme, such as the standard Term Frequency-Inverse Document Frequency (TF-IDF), and normalized to have unit length.

The principal definition of clustering is to arrange data objects into separate clusters such that the intracluster similarity as well as the intercluster dissimilarity is maximized. The problem formulation itself implies that some form of measurement is needed to determine such similarity or dissimilarity. There are many state-of-the-art clustering approaches that do not employ any specific form of measurement, for instance, the probabilistic model-based method [9], nonnegative matrix factorization [10], information theoretic coclustering [11] and so on. In this paper, though, we primarily focus on methods that indeed do utilize a specific measure. In the literature, euclidean distance is one of the most popular measures:

$$\mathrm{Dist}(d_i, d_j) = \|d_i - d_j\|. \qquad (1)$$

It is used in the traditional k-means algorithm. The objective of k-means is to minimize the euclidean distance between objects of a cluster and that cluster's centroid:

$$\min \sum_{r=1}^{k} \sum_{d_i \in S_r} \|d_i - C_r\|^2. \qquad (2)$$

However, for data in a sparse and high-dimensional space, such as that in document clustering, cosine similarity is more widely used. It is also a popular similarity score in text mining and information retrieval [12]. Particularly, the similarity of two document vectors $d_i$ and $d_j$, $\mathrm{Sim}(d_i, d_j)$, is defined as the cosine of the angle between them. For unit vectors, this equals their inner product:

$$\mathrm{Sim}(d_i, d_j) = \cos(d_i, d_j) = d_i^t d_j. \qquad (3)$$

The cosine measure is used in a variant of k-means called spherical k-means [3]. While k-means aims to minimize euclidean distance, spherical k-means intends to maximize the cosine similarity between documents in a cluster and that cluster's centroid:

$$\max \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{d_i^t C_r}{\|C_r\|}. \qquad (4)$$

The major difference between euclidean distance and cosine similarity, and therefore between k-means and spherical k-means, is that the former focuses on vector magnitudes, while the latter emphasizes vector directions. Besides its direct application in spherical k-means, the cosine of document vectors is also widely used in many other document clustering methods as a core similarity measurement. The min-max cut graph-based spectral method is an example [13]. In the graph partitioning approach, the document corpus is considered as a graph $G = (V, E)$, where each document is a vertex in $V$ and each edge in $E$ has a weight equal to the similarity between a pair of vertices. The min-max cut algorithm tries to minimize the criterion function

$$\min \sum_{r=1}^{k} \frac{\mathrm{Sim}(S_r, S \setminus S_r)}{\mathrm{Sim}(S_r, S_r)}, \quad \text{where } \mathrm{Sim}(S_q, S_r) = \sum_{d_i \in S_q,\, d_j \in S_r} \mathrm{Sim}(d_i, d_j), \; 1 \le q, r \le k, \qquad (5)$$

and when the cosine as in (3) is used, minimizing the criterion in (5) is equivalent to

$$\min \sum_{r=1}^{k} \frac{D_r^t D}{\|D_r\|^2}. \qquad (6)$$

There are many other graph partitioning methods with different cutting strategies and criterion functions, such as Average Weight [14] and Normalized Cut [15], all of which have been successfully applied for document clustering using cosine as the pairwise similarity score [16], [17]. In [18], an empirical study was conducted to compare a variety of criterion functions for document clustering.

Another popular graph-based clustering technique is implemented in a software package called CLUTO [19]. This method first models the documents with a nearest neighbor graph, and then splits the graph into clusters using a min-cut algorithm. Besides the cosine measure, the extended Jaccard coefficient can also be used in this method to represent similarity between nearest documents. Given nonunit document vectors $u_i$, $u_j$ ($d_i = u_i/\|u_i\|$, $d_j = u_j/\|u_j\|$), their extended Jaccard coefficient is

$$\mathrm{Sim}_{eJacc}(u_i, u_j) = \frac{u_i^t u_j}{\|u_i\|^2 + \|u_j\|^2 - u_i^t u_j}. \qquad (7)$$
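To make the two baseline measures in (3) and (7) concrete, here is a minimal sketch (not from the paper; the variable names, the toy TF-IDF values, and the use of NumPy are my own assumptions) that computes cosine similarity for unit-normalized document vectors and the extended Jaccard coefficient for the corresponding unnormalized ones.

```python
import numpy as np

def cosine_sim(di, dj):
    """Eq. (3): for unit-length document vectors, cosine similarity is the dot product."""
    return float(np.dot(di, dj))

def extended_jaccard(ui, uj):
    """Eq. (7): extended Jaccard coefficient for (possibly non-unit) document vectors."""
    dot = float(np.dot(ui, uj))
    return dot / (np.dot(ui, ui) + np.dot(uj, uj) - dot)

# Toy example with two 5-term TF-IDF vectors (hypothetical values).
u1 = np.array([0.0, 1.2, 0.4, 0.0, 0.3])
u2 = np.array([0.5, 0.9, 0.0, 0.0, 0.2])
d1, d2 = u1 / np.linalg.norm(u1), u2 / np.linalg.norm(u2)
print(cosine_sim(d1, d2), extended_jaccard(u1, u2))
```

As the text notes, when the inputs are already unit vectors the two measures coincide; the difference only appears for vectors of unequal length.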
Compared with euclidean distance and cosine similarity, the extended Jaccard coefficient takes into account both the magnitude and the direction of the document vectors. If the documents are instead represented by their corresponding unit vectors, this measure has the same effect as cosine similarity. In [20], Strehl et al. compared four measures: euclidean, cosine, Pearson correlation, and extended Jaccard, and concluded that cosine and extended Jaccard are the best ones on web documents.

In nearest neighbor graph clustering methods, such as CLUTO's graph method above, the concept of similarity is somewhat different from the previously discussed methods. Two documents may have a certain value of cosine similarity, but if neither of them is in the other one's neighborhood, they have no connection between them. In such a case, some context-based knowledge or relativeness property is already taken into account when considering similarity. Recently, Ahmad and Dey [21] proposed a method to compute distance between two categorical values of an attribute based on their relationship with all other attributes. Subsequently, Ienco et al. [22] introduced a similar context-based distance learning method for categorical data. However, for a given attribute, they only selected a relevant subset of attributes from the whole attribute set to use as the context for calculating distance between its two values.

More related to text data, there are phrase-based and concept-based document similarities. Lakkaraju et al. [23] employed a conceptual tree-similarity measure to identify similar documents. This method requires representing documents as concept trees with the help of a classifier. For clustering, Chim and Deng [24] proposed a phrase-based document similarity by combining the suffix tree model and the vector space model. They then used a Hierarchical Agglomerative Clustering algorithm to perform the clustering task. However, a drawback of this approach is the high computational complexity due to the need of building the suffix tree and calculating pairwise similarities explicitly before clustering. There are also measures designed specifically for capturing structural similarity among XML documents [25]. They are essentially different from the document-content measures that are discussed in this paper.

In general, cosine similarity still remains the most popular measure because of its simple interpretation and easy computation, though its effectiveness is yet fairly limited. In the following sections, we propose a novel way to evaluate similarity between documents, and consequently formulate new criterion functions for document clustering.

3 MULTIVIEWPOINT-BASED SIMILARITY

3.1 Our Novel Similarity Measure

The cosine similarity in (3) can be expressed in the following form without changing its meaning:

$$\mathrm{Sim}(d_i, d_j) = \cos(d_i - 0, d_j - 0) = (d_i - 0)^t (d_j - 0), \qquad (8)$$

where 0 is the vector 0 that represents the origin point. According to this formula, the measure takes 0 as the one and only reference point. The similarity between two documents $d_i$ and $d_j$ is determined w.r.t. the angle between the two points when looking from the origin.

To construct a new concept of similarity, it is possible to use more than just one point of reference. We may have a more accurate assessment of how close or distant a pair of points are if we look at them from many different viewpoints. From a third point $d_h$, the directions and distances to $d_i$ and $d_j$ are indicated, respectively, by the difference vectors $(d_i - d_h)$ and $(d_j - d_h)$. By standing at various reference points $d_h$ to view $d_i$, $d_j$ and working on their difference vectors, we define similarity between the two documents as

$$\mathrm{Sim}(d_i, d_j \mid d_i, d_j \in S_r) = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}(d_i - d_h, d_j - d_h). \qquad (9)$$

As described by the above equation, the similarity of two documents $d_i$ and $d_j$, given that they are in the same cluster, is defined as the average of similarities measured relatively from the views of all other documents outside that cluster. What is interesting is that the similarity here is defined in a close relation to the clustering problem. A presumption of cluster memberships has been made prior to the measure. The two objects to be measured must be in the same cluster, while the points from where to establish this measurement must be outside of the cluster. We call this proposal the Multiviewpoint-based Similarity, or MVS. From this point onwards, we will denote the proposed similarity measure between two document vectors $d_i$ and $d_j$ by $\mathrm{MVS}(d_i, d_j \mid d_i, d_j \in S_r)$, or occasionally $\mathrm{MVS}(d_i, d_j)$ for short.

The final form of MVS in (9) depends on the particular formulation of the individual similarities within the sum. If the relative similarity is defined by the dot-product of the difference vectors, we have

$$\mathrm{MVS}(d_i, d_j \mid d_i, d_j \in S_r) = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h) = \frac{1}{n - n_r} \sum_{d_h} \cos(d_i - d_h, d_j - d_h)\,\|d_i - d_h\|\,\|d_j - d_h\|. \qquad (10)$$

The similarity between two points $d_i$ and $d_j$ inside cluster $S_r$, viewed from a point $d_h$ outside this cluster, is equal to the product of the cosine of the angle between $d_i$ and $d_j$ looking from $d_h$ and the euclidean distances from $d_h$ to these two points. This definition is based on the assumption that $d_h$ is not in the same cluster with $d_i$ and $d_j$. The smaller the distances $\|d_i - d_h\|$ and $\|d_j - d_h\|$ are, the higher the chance that $d_h$ is in fact in the same cluster with $d_i$ and $d_j$, and the similarity based on $d_h$ should also be small to reflect this potential. Therefore, through these distances, (10) also provides a measure of intercluster dissimilarity, given that points $d_i$ and $d_j$ belong to cluster $S_r$, whereas $d_h$ belongs to another cluster. The overall similarity between $d_i$ and $d_j$ is determined by taking the average over all the viewpoints not belonging to cluster $S_r$. It is possible to argue that while most of these viewpoints are useful, there may be some of them giving misleading information, just as may happen with the origin point. However, given a large enough number of viewpoints and their variety, it is reasonable to assume that the majority of them will be useful. Hence, the effect of misleading viewpoints is constrained and reduced by the averaging step. It can be seen that this method offers a more informative assessment of similarity than the single origin point-based similarity measure.
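The definition in (9)-(10) can be read directly off as code. The following is a deliberately naive sketch, not the authors' implementation (which is in Java and uses the closed form derived in the next section); the helper names and the random toy vectors are my own.

```python
import numpy as np

def mvs(di, dj, viewpoints):
    """Eq. (10): multiviewpoint-based similarity of two documents assumed to share a
    cluster, averaged over viewpoint documents d_h taken from outside that cluster."""
    sims = [np.dot(di - dh, dj - dh) for dh in viewpoints]
    return float(np.mean(sims))

# Hypothetical unit-length document vectors; d1 and d2 are assumed to be in the same
# cluster, and the rows of `outside` stand for the documents of all other clusters.
rng = np.random.default_rng(0)
def unit(v): return v / np.linalg.norm(v)
d1, d2 = unit(rng.random(8)), unit(rng.random(8))
outside = np.array([unit(rng.random(8)) for _ in range(5)])
print(mvs(d1, d2, outside))
```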
[Fig. 1. Procedure: Build MVS similarity matrix.]
[Fig. 2. Procedure: Get validity score.]

3.2 Analysis and Practical Examples of MVS

In this section, we present an analytical study to show that the proposed MVS could be a very effective similarity measure for data clustering. In order to demonstrate its advantages, MVS is compared with cosine similarity on how well they reflect the true group structure in document collections. First, exploring (10), we have

$$\begin{aligned} \mathrm{MVS}(d_i, d_j \mid d_i, d_j \in S_r) &= \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \left( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h \right) \\ &= d_i^t d_j - \frac{1}{n - n_r}\, d_i^t D_{S \setminus S_r} - \frac{1}{n - n_r}\, d_j^t D_{S \setminus S_r} + 1, \qquad \|d_h\| = 1 \\ &= d_i^t d_j - d_i^t C_{S \setminus S_r} - d_j^t C_{S \setminus S_r} + 1, \end{aligned} \qquad (11)$$

where $D_{S \setminus S_r} = \sum_{d_h \in S \setminus S_r} d_h$ is the composite vector of all the documents outside cluster $r$, called the outer composite w.r.t. cluster $r$, and $C_{S \setminus S_r} = D_{S \setminus S_r}/(n - n_r)$ the outer centroid w.r.t. cluster $r$, $\forall r = 1, \ldots, k$. From (11), when comparing two pairwise similarities $\mathrm{MVS}(d_i, d_j)$ and $\mathrm{MVS}(d_i, d_l)$, document $d_j$ is more similar to document $d_i$ than the other document $d_l$ is, if and only if

$$d_i^t d_j - d_j^t C_{S \setminus S_r} > d_i^t d_l - d_l^t C_{S \setminus S_r} \;\Leftrightarrow\; \cos(d_i, d_j) - \cos(d_j, C_{S \setminus S_r})\|C_{S \setminus S_r}\| > \cos(d_i, d_l) - \cos(d_l, C_{S \setminus S_r})\|C_{S \setminus S_r}\|. \qquad (12)$$

From this condition, it is seen that even when $d_l$ is considered "closer" to $d_i$ in terms of CS, i.e., $\cos(d_i, d_j) \le \cos(d_i, d_l)$, $d_l$ can still possibly be regarded as less similar to $d_i$ based on MVS if, on the contrary, it is "close" enough to the outer centroid $C_{S \setminus S_r}$ compared with $d_j$. This is intuitively reasonable, since the "closer" $d_l$ is to $C_{S \setminus S_r}$, the greater the chance it actually belongs to another cluster rather than $S_r$ and is, therefore, less similar to $d_i$. For this reason, MVS brings to the table an additional useful measure compared with CS.

To further justify the above proposal and analysis, we carried out a validity test for MVS and CS. The purpose of this test is to check how much a similarity measure coincides with the true class labels. It is based on one principle: if a similarity measure is appropriate for the clustering problem, then for any document in the corpus, the documents that are closest to it based on this measure should be in the same cluster with it.

The validity test is designed as follows: For each type of similarity measure, a similarity matrix $A = \{a_{ij}\}_{n \times n}$ is created. For CS, this is simple, as $a_{ij} = d_i^t d_j$. The procedure for building the MVS matrix is described in Fig. 1. First, the outer composite w.r.t. each class is determined. Then, for each row $a_i$ of $A$, $i = 1, \ldots, n$, if the pair of documents $d_i$ and $d_j$, $j = 1, \ldots, n$, are in the same class, $a_{ij}$ is calculated as in line 10, Fig. 1. Otherwise, $d_j$ is assumed to be in $d_i$'s class, and $a_{ij}$ is calculated as in line 12, Fig. 1. After matrix $A$ is formed, the procedure in Fig. 2 is used to get its validity score.

For each document $d_i$ corresponding to row $a_i$ of $A$, we select the $q_r$ documents closest to $d_i$. The value of $q_r$ is chosen relatively as a percentage of the size of the class $r$ that contains $d_i$, where percentage $\in (0, 1]$. Then, validity w.r.t. $d_i$ is calculated by the fraction of these $q_r$ documents having the same class label as $d_i$, as in line 12, Fig. 2. The final validity is determined by averaging over all the rows of $A$, as in line 14, Fig. 2. It is clear that the validity score is bounded within 0 and 1. The higher validity score a similarity measure has, the more suitable it should be for the clustering task.

Two real-world document data sets are used as examples in this validity test. The first is reuters7, a subset of the famous collection Reuters-21578 Distribution 1.0 of Reuters newswire articles,¹ one of the most widely used test collections for text categorization.

¹ http://www.daviddlewis.com/resources/testcollections/reuters21578/.
In our validity test, we selected 2,500 documents from the largest seven categories: "acq," "crude," "interest," "earn," "money-fx," "ship," and "trade" to form reuters7. Some of the documents may appear in more than one category. The second data set is k1b, a collection of 2,340 webpages from the Yahoo! subject hierarchy, including six topics: "health," "entertainment," "sport," "politics," "tech," and "business." It was created from a past study in information retrieval called WebACE [26], and is now available with the CLUTO toolkit [19].

The two data sets were preprocessed by stop-word removal and stemming. Moreover, we removed words that appear in fewer than two documents or in more than 99.5 percent of the total number of documents. Finally, the documents were weighted by TF-IDF and normalized to unit vectors. The full characteristics of reuters7 and k1b are presented in Fig. 3.

[Fig. 3. Characteristics of reuters7 and k1b data sets.]
[Fig. 4. CS and MVS validity test.]

Fig. 4 shows the validity scores of CS and MVS on the two data sets relative to the parameter percentage. The value of percentage is set at 0.001, 0.01, 0.05, 0.1, 0.2, ..., 1.0. According to Fig. 4, MVS is clearly better than CS for both data sets in this validity test. For example, with the k1b data set at percentage = 1.0, MVS' validity score is 0.80, while that of CS is only 0.67. This indicates that, on average, when we pick up any document and consider its neighborhood of size equal to its true class size, only 67 percent of that document's neighbors based on CS actually belong to its class. If based on MVS, the number of valid neighbors increases to 80 percent. The validity test has illustrated the potential advantage of the new multiviewpoint-based similarity measure compared to the cosine measure.

4 MULTIVIEWPOINT-BASED CLUSTERING

4.1 Two Clustering Criterion Functions IR and IV

Having defined our similarity measure, we now formulate our clustering criterion functions. The first function, called $I_R$, is the cluster size-weighted sum of average pairwise similarities of documents in the same cluster. First, let us express this sum in a general form by function $F$:

$$F = \sum_{r=1}^{k} n_r \left[ \frac{1}{n_r^2} \sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) \right]. \qquad (13)$$

We would like to transform this objective function into some suitable form such that it could facilitate the optimization procedure to be performed in a simple, fast and effective way. According to (10),

$$\sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) = \sum_{d_i, d_j \in S_r} \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h) = \frac{1}{n - n_r} \sum_{d_i, d_j} \sum_{d_h} \left( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h \right).$$

Since $\sum_{d_i \in S_r} d_i = \sum_{d_j \in S_r} d_j = D_r$, $\sum_{d_h \in S \setminus S_r} d_h = D - D_r$, and $\|d_h\| = 1$, we have

$$\sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) = D_r^t D_r - \frac{2 n_r}{n - n_r} D_r^t (D - D_r) + n_r^2 = \frac{n + n_r}{n - n_r}\|D_r\|^2 - \frac{2 n_r}{n - n_r} D_r^t D + n_r^2.$$

Substituting into (13), we get

$$F = \sum_{r=1}^{k} \frac{1}{n_r} \left[ \frac{n + n_r}{n - n_r}\|D_r\|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right] + n.$$

Because $n$ is constant, maximizing $F$ is equivalent to maximizing

$$F = \sum_{r=1}^{k} \frac{1}{n_r} \left[ \frac{n + n_r}{n - n_r}\|D_r\|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right]. \qquad (14)$$
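Because the algebra behind (13)-(14) is easy to mistrust, here is a quick numerical check (my own script, not from the paper) that the naive double sum of viewpoint-averaged dot products over one cluster equals the closed form in $n_r$, $D_r$, and $D$ used above. All vectors are unit length, as the derivation assumes.

```python
import numpy as np

rng = np.random.default_rng(1)
def unit(v): return v / np.linalg.norm(v)
cluster = np.array([unit(rng.random(6)) for _ in range(4)])   # documents of S_r
outside = np.array([unit(rng.random(6)) for _ in range(7)])   # documents of S \ S_r

n_r, n = len(cluster), len(cluster) + len(outside)
D_r = cluster.sum(axis=0)
D = D_r + outside.sum(axis=0)

# Double sum over all ordered pairs (d_i, d_j) of the cluster, including d_i = d_j,
# of the viewpoint-averaged dot products of (10).
naive = sum(np.dot(di - dh, dj - dh)
            for di in cluster for dj in cluster for dh in outside) / (n - n_r)
closed = ((n + n_r) / (n - n_r)) * D_r @ D_r \
         - (2 * n_r / (n - n_r)) * D_r @ D + n_r**2
print(np.isclose(naive, closed))   # True
```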
If we compare $F$ with the min-max cut in (5), both functions contain the two terms $\|D_r\|^2$ (an intracluster similarity measure) and $D_r^t D$ (an intercluster similarity measure). Nonetheless, while the objective of min-max cut is to minimize the inverse ratio between these two terms, our aim here is to maximize their weighted difference. In $F$, this difference term is determined for each cluster. The terms are weighted by the inverse of the cluster's size before being summed up over all the clusters. One problem is that this formulation is expected to be quite sensitive to cluster size. From the formulation of COSA [27], a widely known subspace clustering algorithm, we have learned that it is desirable to have a set of weight factors $\lambda = \{\lambda_r\}_1^k$ to regulate the distribution of these cluster sizes in clustering solutions. Hence, we integrate $\lambda$ into the expression of $F$ to have it become

$$F = \sum_{r=1}^{k} \frac{\lambda_r}{n_r} \left[ \frac{n + n_r}{n - n_r}\|D_r\|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right]. \qquad (15)$$

In common practice, $\{\lambda_r\}_1^k$ are often taken to be simple functions of the respective cluster sizes $\{n_r\}_1^k$ [28]. Let us use a parameter $\alpha$ called the regulating factor, which has some constant value ($\alpha \in [0, 1]$), and let $\lambda_r = n_r^\alpha$ in (15); the final form of our criterion function $I_R$ is

$$I_R = \sum_{r=1}^{k} \frac{1}{n_r^{1-\alpha}} \left[ \frac{n + n_r}{n - n_r}\|D_r\|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right]. \qquad (16)$$

In the empirical study of Section 5.4, it appears that $I_R$'s performance dependency on the value of $\alpha$ is not very critical. The criterion function yields relatively good clustering results for $\alpha \in (0, 1)$.

In the formulation of $I_R$, a cluster's quality is measured by the average pairwise similarity between documents within that cluster. However, such an approach can lead to sensitiveness to the size and tightness of the clusters. With CS, for example, the pairwise similarity of documents in a sparse cluster is usually smaller than that in a dense cluster. Though not as clear as with CS, it is still possible that the same effect may hinder MVS-based clustering if pairwise similarity is used. To prevent this, an alternative approach is to consider similarity between each document vector and its cluster's centroid instead. This is expressed in the objective function $G$:

$$G = \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}\!\left( d_i - d_h,\; \frac{C_r}{\|C_r\|} - d_h \right) = \sum_{r=1}^{k} \frac{1}{n - n_r} \sum_{d_i \in S_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t \left( \frac{C_r}{\|C_r\|} - d_h \right). \qquad (17)$$

Similar to the formulation of $I_R$, we would like to express this objective in a simple form that we could optimize more easily. Expanding the vector dot product, we get

$$\begin{aligned} \sum_{d_i \in S_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t \left( \frac{C_r}{\|C_r\|} - d_h \right) &= \sum_{d_i} \sum_{d_h} \left( d_i^t \frac{C_r}{\|C_r\|} - d_i^t d_h - d_h^t \frac{C_r}{\|C_r\|} + 1 \right) \\ &= (n - n_r)\, D_r^t \frac{D_r}{\|D_r\|} - D_r^t (D - D_r) - n_r (D - D_r)^t \frac{D_r}{\|D_r\|} + n_r (n - n_r), \qquad \text{since } \frac{C_r}{\|C_r\|} = \frac{D_r}{\|D_r\|} \\ &= (n + \|D_r\|)\,\|D_r\| - (n_r + \|D_r\|)\,\frac{D_r^t D}{\|D_r\|} + n_r (n - n_r). \end{aligned}$$

Substituting the above into (17), we have

$$G = \sum_{r=1}^{k} \left[ \frac{n + \|D_r\|}{n - n_r}\,\|D_r\| - \left( \frac{n + \|D_r\|}{n - n_r} - 1 \right) \frac{D_r^t D}{\|D_r\|} \right] + n.$$

Again, we could eliminate $n$ because it is a constant. Maximizing $G$ is equivalent to maximizing $I_V$ below:

$$I_V = \sum_{r=1}^{k} \left[ \frac{n + \|D_r\|}{n - n_r}\,\|D_r\| - \left( \frac{n + \|D_r\|}{n - n_r} - 1 \right) \frac{D_r^t D}{\|D_r\|} \right]. \qquad (18)$$

$I_V$ calculates the weighted difference between the two terms $\|D_r\|$ and $D_r^t D/\|D_r\|$, which again represent an intracluster similarity measure and an intercluster similarity measure, respectively. The first term is actually equivalent to an element of the sum in the spherical k-means objective function in (4); the second one is similar to an element of the sum in the min-max cut criterion in (6), but with $\|D_r\|$ as scaling factor instead of $\|D_r\|^2$. We have presented our clustering criterion functions $I_R$ and $I_V$ in simple forms. Next, we show how to perform clustering by using a greedy algorithm to optimize these functions.

4.2 Optimization Algorithm and Complexity

We denote our clustering framework by MVSC, meaning Clustering with Multiviewpoint-based Similarity. Subsequently, we have MVSC-IR and MVSC-IV, which are MVSC with criterion function $I_R$ and $I_V$, respectively. The main goal is to perform document clustering by optimizing $I_R$ in (16) and $I_V$ in (18). For this purpose, the incremental k-way algorithm [18], [29], a sequential version of k-means, is employed. Considering that the expression of $I_V$ in (18) depends only on $n_r$ and $D_r$, $r = 1, \ldots, k$, $I_V$ can be written in a general form

$$I_V = \sum_{r=1}^{k} I_r(n_r, D_r), \qquad (19)$$

where $I_r(n_r, D_r)$ corresponds to the objective value of cluster $r$. The same applies to $I_R$. With this general form, the incremental optimization algorithm, which has two major steps, Initialization and Refinement, is described in Fig. 5.

At Initialization, k arbitrary documents are selected to be the seeds from which initial partitions are formed. Refinement is a procedure that consists of a number of iterations. During each iteration, the n documents are visited one by one in a totally random order. Each document is checked to see whether its move to another cluster results in an improvement of the objective function. If yes, the document is moved to the cluster that leads to the highest improvement. If no cluster is better than the current cluster, the document is not moved. The clustering process terminates when an iteration completes without any documents being moved to new clusters. Unlike the traditional k-means, this algorithm is a stepwise optimal procedure. While k-means only updates after all n documents have been reassigned, the incremental clustering algorithm updates immediately whenever a document is moved to a new cluster. Since every move, when it happens, increases the objective function value, convergence to a local optimum is guaranteed.

[Fig. 5. Algorithm: Incremental clustering.]
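The following is a simplified reading of the incremental k-way optimization applied to $I_R$, not the authors' Java implementation: the initial partition is formed by assigning each document to its cosine-nearest seed (an assumption; the paper only says that k arbitrary documents are selected as seeds), and the full objective is re-evaluated for every tentative move for clarity, whereas a production version would update it incrementally from $n_r$ and $D_r$ alone.

```python
import numpy as np

def I_R(sizes, composites, D, n, alpha=0.3):
    """Criterion function I_R of (16), computed from cluster sizes n_r and composites D_r."""
    total = 0.0
    for n_r, D_r in zip(sizes, composites):
        if n_r == 0 or n_r == n:
            return -np.inf                       # (16) is undefined for empty or full clusters
        w = (n + n_r) / (n - n_r)
        total += (w * D_r @ D_r - (w - 1.0) * D_r @ D) / n_r**(1.0 - alpha)
    return total

def mvsc_ir(docs, k, alpha=0.3, max_iter=20, seed=0):
    """Sketch of MVSC-IR: incremental move-based refinement of a seeded partition."""
    docs = np.asarray(docs, dtype=float)
    n = len(docs)
    rng = np.random.default_rng(seed)
    seeds = docs[rng.choice(n, size=k, replace=False)]
    assign = np.argmax(docs @ seeds.T, axis=1)   # initial partition: cosine-nearest seed
    sizes = np.bincount(assign, minlength=k).astype(float)
    composites = np.array([docs[assign == r].sum(axis=0) for r in range(k)])
    D = docs.sum(axis=0)
    for _ in range(max_iter):
        moved = False
        for i in rng.permutation(n):             # visit documents in random order
            best_r, best_val = assign[i], I_R(sizes, composites, D, n, alpha)
            for r in range(k):
                if r == assign[i]:
                    continue
                # Tentatively move document i to cluster r and re-evaluate the objective.
                sizes[assign[i]] -= 1; sizes[r] += 1
                composites[assign[i]] -= docs[i]; composites[r] += docs[i]
                val = I_R(sizes, composites, D, n, alpha)
                sizes[assign[i]] += 1; sizes[r] -= 1
                composites[assign[i]] += docs[i]; composites[r] -= docs[i]
                if val > best_val:
                    best_r, best_val = r, val
            if best_r != assign[i]:
                sizes[assign[i]] -= 1; composites[assign[i]] -= docs[i]
                sizes[best_r] += 1; composites[best_r] += docs[i]
                assign[i] = best_r
                moved = True
        if not moved:
            break                                # no document changed cluster: local optimum
    return assign
```

The MVSC-IV variant would only swap `I_R` for a function implementing (18); everything else in the refinement loop stays the same.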
During the optimization procedure, in each iteration, the main sources of computational cost are:
- Searching for the optimum clusters to move individual documents to: O(nz · k).
- Updating composite vectors as a result of such moves: O(m · k).
Here nz is the total number of nonzero entries in all document vectors. Our clustering approach is partitional and incremental; therefore, computing a similarity matrix is absolutely not needed. If τ denotes the number of iterations the algorithm takes, then, since nz is often several tens of times larger than m for the document domain, the computational complexity required for clustering with IR and IV is O(nz · k · τ).

5 PERFORMANCE EVALUATION OF MVSC

To verify the advantages of our proposed methods, we evaluate their performance in experiments on document data. The objective of this section is to compare MVSC-IR and MVSC-IV with the existing algorithms that also use specific similarity measures and criterion functions for document clustering. The similarity measures to be compared include euclidean distance, cosine similarity, and the extended Jaccard coefficient.

5.1 Document Collections

The data corpora that we used for experiments consist of 20 benchmark document data sets. Besides reuters7 and k1b, which have been described in detail earlier, we included another 18 text collections so that the examination of the clustering methods is more thorough and exhaustive. Similar to k1b, these data sets are provided together with CLUTO by the toolkit's authors [19]. They had been used for experimental testing in previous papers, and their source and origin had also been described in detail [30], [31]. Table 2 summarizes their characteristics. The corpora present a diversity of size, number of classes and class balance. They were all preprocessed by standard procedures, including stop-word removal, stemming, removal of too rare as well as too frequent words, TF-IDF weighting and normalization.

[TABLE 2: Document data sets.]

5.2 Experimental Setup and Evaluation

To demonstrate how well MVSCs can perform, we compare them with five other clustering methods on the 20 data sets in Table 2. In summary, the seven clustering algorithms are:
- MVSC-IR: MVSC using criterion function IR
- MVSC-IV: MVSC using criterion function IV
- k-means: standard k-means with euclidean distance
- Spkmeans: spherical k-means with CS
- graphCS: CLUTO's graph method with CS
- graphEJ: CLUTO's graph with extended Jaccard
- MMC: Spectral Min-Max Cut algorithm [13].

Our MVSC-IR and MVSC-IV programs are implemented in Java. The regulating factor α in IR is always set at 0.3 during the experiments. We observed that this is one of the most appropriate values. A study on MVSC-IR's performance relative to different α values is presented in a later section. The other algorithms are provided by the C library interface which is available freely with the CLUTO toolkit [19].
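The preprocessing pipeline described in Section 5.1 can be approximated in a few lines (a rough sketch of my own, with stemming omitted): stop-word removal, dropping terms that occur in fewer than 2 documents or in more than 99.5 percent of them, TF-IDF weighting, and unit-length (l2) normalization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw documents; in the experiments these would be the corpus texts.
corpus = ["crude oil prices rose", "oil exports fell", "interest rates rose", "rates were cut"]
vectorizer = TfidfVectorizer(stop_words="english", min_df=2, max_df=0.995, norm="l2")
docs = vectorizer.fit_transform(corpus).toarray()   # each row is a unit document vector
```

The resulting rows are exactly the kind of unit vectors assumed by the MVSC sketches above.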
For each data set, the cluster number is predefined to be equal to the number of true classes, i.e., k = c.

None of the above algorithms is guaranteed to find the global optimum, and all of them are initialization-dependent. Hence, for each method, we performed clustering a few times with randomly initialized values, and chose the best trial in terms of the corresponding objective function value. In all the experiments, each test run consisted of 10 trials. Moreover, the result reported here on each data set by a particular clustering method is the average of 10 test runs.

After a test run, the clustering solution is evaluated by comparing the documents' assigned labels with their true labels provided by the corpus. Three types of external evaluation metric are used to assess clustering performance. They are the FScore, Normalized Mutual Information (NMI), and Accuracy. FScore is an equally weighted combination of the "precision" (P) and "recall" (R) values used in information retrieval. Given a clustering solution, FScore is determined as

$$FScore = \sum_{i=1}^{k} \frac{n_i}{n} \max_j (F_{i,j}), \quad \text{where } F_{i,j} = \frac{2 \times P_{i,j} \times R_{i,j}}{P_{i,j} + R_{i,j}},\; P_{i,j} = \frac{n_{i,j}}{n_j},\; R_{i,j} = \frac{n_{i,j}}{n_i},$$

where $n_i$ denotes the number of documents in class $i$, $n_j$ the number of documents assigned to cluster $j$, and $n_{i,j}$ the number of documents shared by class $i$ and cluster $j$. From another aspect, NMI measures the information that the true class partition and the cluster assignment share. It measures how much knowing about the clusters helps us know about the classes:

$$NMI = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} n_{i,j} \log \frac{n \cdot n_{i,j}}{n_i n_j}}{\sqrt{\left( \sum_{i=1}^{k} n_i \log \frac{n_i}{n} \right)\left( \sum_{j=1}^{k} n_j \log \frac{n_j}{n} \right)}}.$$

Finally, Accuracy measures the fraction of documents that are correctly labeled, assuming a one-to-one correspondence between true classes and assigned clusters. Let $q$ denote any possible permutation of the index set $\{1, \ldots, k\}$; Accuracy is calculated by

$$Accuracy = \frac{1}{n} \max_q \sum_{i=1}^{k} n_{i, q(i)}.$$

The best mapping $q$ to determine Accuracy can be found by the Hungarian algorithm.² For all three metrics, the range is from 0 to 1, and a greater value indicates a better clustering solution.

² http://en.wikipedia.org/wiki/Hungarian_algorithm.

5.3 Results

Fig. 6 shows the Accuracy of the seven clustering algorithms on the 20 text collections. Presented in a different way, clustering results based on FScore and NMI are reported in Tables 3 and 4, respectively. For each data set in a row, the value in bold and underlined is the best result, while the value in bold only is the second best.

It can be observed that MVSC-IR and MVSC-IV perform consistently well. In Fig. 6, on 19 out of 20 data sets (all except reviews), either both or one of the MVSC approaches are among the top two algorithms. The next most consistent performer is Spkmeans. The other algorithms may work well on certain data sets. For example, graphEJ yields an outstanding result on classic; graphCS and MMC are good on reviews. But they do not fare very well on the rest of the collections.

To have a statistical justification of the clustering performance comparisons, we also carried out statistical significance tests. Each of MVSC-IR and MVSC-IV was paired up with one of the remaining algorithms for a paired t-test [32]. Given two paired sets X and Y of N measured values, the null hypothesis of the test is that the differences between X and Y come from a population with mean 0. The alternative hypothesis is that the paired sets differ from each other in a significant way. In our experiment, these tests were done based on the evaluation values obtained on the 20 data sets. The typical 5 percent significance level was used. For example, considering the pair (MVSC-IR, k-means), from Table 3 it is seen that MVSC-IR dominates k-means w.r.t. FScore. If the paired t-test returns a p-value smaller than 0.05, we reject the null hypothesis and say that the dominance is significant. Otherwise, the null hypothesis is true and the comparison is considered insignificant.

The outcomes of the paired t-tests are presented in Table 5. As the paired t-tests show, the advantage of MVSC-IR and MVSC-IV over the other methods is statistically significant. A special case is the graphEJ algorithm. On the one hand, MVSC-IR is not significantly better than graphEJ if based on FScore or NMI. On the other hand, even where MVSC-IR and MVSC-IV test as clearly better than graphEJ, the p-values can still be considered relatively large, although they are smaller than 0.05. The reason is that, as observed before, graphEJ's results on the classic data set are very different from those of the other algorithms. While interesting, these values can be considered as outliers, and including them in the statistical tests would affect the outcomes greatly. Hence, we also report in Table 5 the tests where classic was excluded and only results on the other 19 data sets were used. Under this circumstance, both MVSC-IR and MVSC-IV outperform graphEJ significantly, with good p-values.

5.4 Effect of α on MVSC-IR's Performance

It has been known that criterion function-based partitional clustering methods can be sensitive to cluster size and balance. In the formulation of IR in (16), there exists a parameter α called the regulating factor, α ∈ [0, 1]. To examine how the determination of α could affect MVSC-IR's performance, we evaluated MVSC-IR with different values of α from 0 to 1, with a 0.1 incremental interval. The assessment was done based on the clustering results in NMI, FScore, and Accuracy, each averaged over all the 20 given data sets. Since the evaluation metrics for different data sets could be very different from each other, simply taking the average over all the data sets would not be very meaningful. Hence, we employed the method used in [18] to transform the metrics into relative metrics before averaging.
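The Accuracy and NMI scores defined above, as well as the paired t-tests, are straightforward to compute with standard libraries; the snippet below is an illustrative sketch (toy labels are hypothetical), using SciPy's Hungarian solver and scikit-learn's NMI with geometric averaging, which corresponds to the square-root normalization in the formula above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import ttest_rel
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(true_labels, assigned):
    """Accuracy as defined above: best one-to-one matching between true classes and
    clusters, found with the Hungarian algorithm (scipy's linear_sum_assignment)."""
    true_labels, assigned = np.asarray(true_labels), np.asarray(assigned)
    classes, clusters = np.unique(true_labels), np.unique(assigned)
    # Contingency counts n_{i,j}: documents of class i that were assigned to cluster j.
    counts = np.array([[np.sum((true_labels == c) & (assigned == q)) for q in clusters]
                       for c in classes])
    rows, cols = linear_sum_assignment(-counts)      # negate to maximize matched documents
    return counts[rows, cols].sum() / len(true_labels)

truth = [0, 0, 0, 1, 1, 2, 2, 2]
found = [1, 1, 0, 0, 0, 2, 2, 2]
print(clustering_accuracy(truth, found))
print(normalized_mutual_info_score(truth, found, average_method="geometric"))
# A paired t-test over two algorithms' per-data-set scores (as summarized in Table 5)
# would be: ttest_rel(scores_of_algorithm_a, scores_of_algorithm_b).
```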
[Fig. 6. Clustering results in Accuracy. Left-to-right in legend corresponds to left-to-right in the plot.]

On a particular document collection S, the relative FScore measure of MVSC-IR with α = αi is determined as follows:

$$\mathrm{relative\_FScore}(I_R; S; \alpha_i) = \frac{\max_j \{ FScore(I_R; S; \alpha_j) \}}{FScore(I_R; S; \alpha_i)},$$

where $\alpha_i, \alpha_j \in \{0.0, 0.1, \ldots, 1.0\}$ and $FScore(I_R; S; \alpha_i)$ is the FScore result on data set S obtained by MVSC-IR with α = αi. The same transformation was applied to NMI and Accuracy to yield relative_NMI and relative_Accuracy, respectively. MVSC-IR performs best with a given αi if its relative measure has a value of 1. Otherwise, its relative measure is greater than 1; the larger this value is, the worse MVSC-IR with αi performs in comparison with other settings of α. Finally, the average relative measures were calculated over all the data sets to present the overall performance.

[Fig. 7. MVSC-IR's performance with respect to α.]

Fig. 7 shows the plot of average relative FScore, NMI, and Accuracy w.r.t. different values of α. In a broad view, MVSC-IR performs the worst at the extreme values of α (0 and 1), and tends to get better when α is set at some soft value in between 0 and 1. Based on our experimental study, MVSC-IR always produces results within 5 percent of the best case, regarding any type of evaluation metric, with α from 0.2 to 0.8.
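A small helper (my paraphrase of the relative-metric formula above, with hypothetical FScore values) makes the transformation explicit: the best-performing α setting receives value 1, and larger values indicate worse performance relative to that best setting.

```python
def relative_scores(scores_by_alpha):
    """Relative metric of Section 5.4: best score divided by each alpha's score."""
    best = max(scores_by_alpha.values())
    return {alpha: best / score for alpha, score in scores_by_alpha.items()}

# Hypothetical FScore results of MVSC-IR on one data set for a few alpha settings.
print(relative_scores({0.0: 0.61, 0.3: 0.68, 0.5: 0.67, 1.0: 0.58}))
```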
6 MVSC AS REFINEMENT FOR k-MEANS

From the analysis of (12) in Section 3.2, MVS provides an additional criterion for measuring the similarity among documents compared with CS. Alternatively, MVS can be considered as a refinement for CS, and consequently MVSC algorithms as refinements for spherical k-means, which uses CS. To further investigate the appropriateness and effectiveness of MVS and its clustering algorithms, we carried out another set of experiments in which solutions obtained by Spkmeans were further optimized by MVSC-IR and MVSC-IV. The rationale for doing so is that if the final solutions by MVSC-IR and MVSC-IV are better than the intermediate ones obtained by Spkmeans, MVS is indeed good for the clustering problem. These experiments would reveal more clearly whether MVS actually improves the clustering performance compared with CS.

[TABLE 3: Clustering results in FScore.]
[TABLE 4: Clustering results in NMI.]

In the previous section, MVSC algorithms were compared against the existing algorithms that are closely related to them, i.e., ones that also employ similarity measures and criterion functions. In this section, we make use of the extended experiments to further compare MVSC with a different type of clustering approach, the NMF methods [10], which do not use any form of explicitly defined similarity measure for documents.

6.1 TDT2 and Reuters-21578 Collections

For variety and thoroughness, in this empirical study, we used two new document corpora described in Table 6: TDT2 and Reuters-21578. The original TDT2 corpus,³ which consists of 11,201 documents in 96 topics (i.e., classes), has been one of the most standard sets for the document clustering purpose. We used a subcollection of this corpus which contains 10,021 documents in the largest 56 topics. The Reuters-21578 Distribution 1.0 has been mentioned earlier in this paper.

³ http://www.nist.gov/speech/tests/tdt/tdt98/index.html.
The original corpus consists of 21,578 documents in 135 topics. We used a subcollection having 8,213 documents from the largest 41 topics. The same two document collections had been used in the paper of the NMF methods [10]. Documents that appear in two or more topics were removed, and the remaining documents were preprocessed in the same way as in Section 5.1.

[TABLE 5: Statistical significance of comparisons based on paired t-tests with 5 percent significance level.]
[TABLE 6: TDT2 and Reuters-21578 document corpora.]

6.2 Experiments and Results

The following clustering methods:
- Spkmeans: spherical k-means
- rMVSC-IR: refinement of Spkmeans by MVSC-IR
- rMVSC-IV: refinement of Spkmeans by MVSC-IV
- MVSC-IR: normal MVSC using criterion IR
- MVSC-IV: normal MVSC using criterion IV,
and two new document clustering approaches that do not use any particular form of similarity measure:
- NMF: Nonnegative Matrix Factorization method
- NMF-NCW: Normalized Cut Weighted NMF
were involved in the performance comparison. When used as a refinement for Spkmeans, the algorithms rMVSC-IR and rMVSC-IV worked directly on the output solution of Spkmeans. The cluster assignment produced by Spkmeans was used as the initialization for both rMVSC-IR and rMVSC-IV. We also investigated the performance of the original MVSC-IR and MVSC-IV further on the new data sets. Besides, it would be interesting to see how they and their Spkmeans-initialized versions fare against each other. What is more, two well-known document clustering approaches based on nonnegative matrix factorization, NMF and NMF-NCW [10], are also included for a comparison with our algorithms, which use the explicit MVS measure.

During the experiments, each of the two corpora in Table 6 was used to create six different test cases, each of which corresponded to a distinct number of topics used (c = 5, ..., 10). For each test case, c topics were randomly selected from the corpus and their documents were mixed together to form a test set. This selection was repeated 50 times so that each test case had 50 different test sets. The average performance of the clustering algorithms with k = c was calculated over these 50 test sets. This experimental setup is inspired by the similar experiments conducted in the NMF paper [10]. Furthermore, similar to the previous experimental setup in Section 5.2, each algorithm (including NMF and NMF-NCW) actually considered 10 trials on any test set before using the solution with the best obtainable objective function value as its final output.

The clustering results on TDT2 and Reuters-21578 are shown in Tables 7 and 8, respectively. For each test case in a column, the value in bold and underlined is the best among the results returned by the algorithms, while the value in bold only is the second best. From the tables, several observations can be made. First, MVSC-IR and MVSC-IV continue to show that they are good clustering algorithms by outperforming other methods frequently. They are always the best in every test case of TDT2.
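For reference, a minimal Spkmeans baseline could look like the sketch below (an illustrative assumption of mine, not the CLUTO implementation or the authors' code). The partition it returns is what rMVSC-IR and rMVSC-IV take as their starting point: in the earlier mvsc_ir sketch, one would simply use this assignment in place of the random seeding before running the refinement loop.

```python
import numpy as np

def spherical_kmeans(docs, k, max_iter=50, seed=0):
    """Minimal spherical k-means (Spkmeans) sketch: documents and centroids are unit
    vectors and assignment maximizes cosine similarity, as in (4)."""
    docs = np.asarray(docs, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), size=k, replace=False)]
    assign = np.zeros(len(docs), dtype=int)
    for _ in range(max_iter):
        new_assign = np.argmax(docs @ centroids.T, axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
        for r in range(k):
            members = docs[assign == r]
            if len(members):
                c = members.sum(axis=0)
                centroids[r] = c / np.linalg.norm(c)   # renormalized mean direction
    return assign
```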
[TABLE 7: Clustering results on TDT2.]
[TABLE 8: Clustering results on Reuters-21578.]
[Fig. 8. Accuracies on the 50 test sets (in sorted order of Spkmeans) in the test case k = 5.]

Compared with NMF-NCW, they are better in almost all the cases, except only the case of Reuters-21578, k = 5, where NMF-NCW is the best based on Accuracy.

The second observation, which is also the main objective of this empirical study, is that by applying MVSC to refine the output of spherical k-means, clustering solutions are improved significantly. Both rMVSC-IR and rMVSC-IV lead to higher NMIs and Accuracies than Spkmeans in all the cases. Interestingly, there are many circumstances where Spkmeans' result is worse than that of the NMF clustering methods, but after being refined by MVSCs, it becomes better. To have a more descriptive picture of the improvements, we can refer to the radar charts in Fig. 8. The figure shows details of a particular test case where k = 5. Remember that a test case consists of 50 different test sets. The charts display the result on each test set, including the accuracy result obtained by Spkmeans, and the results after refinement by MVSC, namely rMVSC-IR and rMVSC-IV.
For effective visualization, they are sorted in ascending order of the accuracies by Spkmeans (clockwise). As the patterns in both Figs. 8a and 8b reveal, improvement in accuracy is most likely attainable by rMVSC-IR and rMVSC-IV. Many of the improvements are by a considerably large margin, especially when the original accuracy obtained by Spkmeans is low. There are only a few exceptions where accuracy becomes worse after refinement. Nevertheless, the decreases in such cases are small.

Finally, it is also interesting to notice from Tables 7 and 8 that MVSC preceded by spherical k-means does not necessarily yield better clustering results than MVSC with random initialization. There are only a small number of cases in the two tables where rMVSC can be found better than MVSC. This phenomenon, however, is understandable. Given a locally optimal solution returned by spherical k-means, the rMVSC algorithms, as a refinement method, would be constrained by this local optimum itself and, hence, their search space might be restricted. The original MVSC algorithms, on the other hand, are not subjected to this constraint, and are able to follow the search trajectory of their objective function from the beginning. Hence, while the performance improvement after refining spherical k-means' result by MVSC proves the appropriateness of MVS and its criterion functions for document clustering, this observation in fact only reaffirms its potential.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we propose a Multiviewpoint-based Similarity measuring method, named MVS. Theoretical analysis and empirical examples show that MVS is potentially more suitable for text documents than the popular cosine similarity. Based on MVS, two criterion functions, IR and IV, and their respective clustering algorithms, MVSC-IR and MVSC-IV, have been introduced. Compared with other state-of-the-art clustering methods that use different types of similarity measure, on a large number of document data sets and under different evaluation metrics, the proposed algorithms show that they can provide significantly improved clustering performance.

The key contribution of this paper is the fundamental concept of similarity measure from multiple viewpoints. Future methods could make use of the same principle, but define alternative forms for the relative similarity in (10), or not use the average but have other methods to combine the relative similarities according to the different viewpoints. Besides, this paper focuses on partitional clustering of documents. In the future, it would also be possible to apply the proposed criterion functions to hierarchical clustering algorithms. Finally, we have shown the application of MVS and its clustering algorithms to text data. It would be interesting to explore how they work on other types of sparse and high-dimensional data.

REFERENCES

[1] X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.-H. Zhou, M. Steinbach, D.J. Hand, and D. Steinberg, "Top 10 Algorithms in Data Mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37, 2007.
[2] I. Guyon, U.V. Luxburg, and R.C. Williamson, "Clustering: Science or Art?," Proc. NIPS Workshop Clustering Theory, 2009.
[3] I. Dhillon and D. Modha, "Concept Decompositions for Large Sparse Text Data Using Clustering," Machine Learning, vol. 42, nos. 1/2, pp. 143-175, Jan. 2001.
[4] S. Zhong, "Efficient Online Spherical K-means Clustering," Proc. IEEE Int'l Joint Conf. Neural Networks (IJCNN), pp. 3180-3185, 2005.
[5] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman Divergences," J. Machine Learning Research, vol. 6, pp. 1705-1749, Oct. 2005.
[6] E. Pekalska, A. Harol, R.P.W. Duin, B. Spillmann, and H. Bunke, "Non-Euclidean or Non-Metric Measures Can Be Informative," Structural, Syntactic, and Statistical Pattern Recognition, vol. 4109, pp. 871-880, 2006.
[7] M. Pelillo, "What Is a Cluster? Perspectives from Game Theory," Proc. NIPS Workshop Clustering Theory, 2009.
[8] D. Lee and J. Lee, "Dynamic Dissimilarity Measure for Support Based Clustering," IEEE Trans. Knowledge and Data Eng., vol. 22, no. 6, pp. 900-905, June 2010.
[9] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Clustering on the Unit Hypersphere Using Von Mises-Fisher Distributions," J. Machine Learning Research, vol. 6, pp. 1345-1382, Sept. 2005.
[10] W. Xu, X. Liu, and Y. Gong, "Document Clustering Based on Non-Negative Matrix Factorization," Proc. 26th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 267-273, 2003.
[11] I.S. Dhillon, S. Mallela, and D.S. Modha, "Information-Theoretic Co-Clustering," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 89-98, 2003.
[12] C.D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge Univ. Press, 2009.
[13] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, "A Min-Max Cut Algorithm for Graph Partitioning and Data Clustering," Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 107-114, 2001.
[14] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, "Spectral Relaxation for K-Means Clustering," Proc. Neural Information Processing Systems (NIPS), pp. 1057-1064, 2001.
[15] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[16] I.S. Dhillon, "Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning," Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 269-274, 2001.
[17] Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis. Springer-Verlag, 2007.
[18] Y. Zhao and G. Karypis, "Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering," Machine Learning, vol. 55, no. 3, pp. 311-331, June 2004.
[19] G. Karypis, "CLUTO a Clustering Toolkit," technical report, Dept. of Computer Science, Univ. of Minnesota, http://glaros.dtc.umn.edu/~gkhome/views/cluto, 2003.
[20] A. Strehl, J. Ghosh, and R. Mooney, "Impact of Similarity Measures on Web-Page Clustering," Proc. 17th Nat'l Conf. Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI), pp. 58-64, July 2000.
[21] A. Ahmad and L. Dey, "A Method to Compute Distance Between Two Categorical Values of Same Attribute in Unsupervised Learning for Categorical Data Set," Pattern Recognition Letters, vol. 28, no. 1, pp. 110-118, 2007.
[22] D. Ienco, R.G. Pensa, and R. Meo, "Context-Based Distance Learning for Categorical Data Clustering," Proc. Eighth Int'l Symp. Intelligent Data Analysis (IDA), pp. 83-94, 2009.
[23] P. Lakkaraju, S. Gauch, and M. Speretta, "Document Similarity Based on Concept Tree Distance," Proc. 19th ACM Conf. Hypertext and Hypermedia, pp. 127-132, 2008.
[24] H. Chim and X. Deng, "Efficient Phrase-Based Document Similarity for Clustering," IEEE Trans. Knowledge and Data Eng., vol. 20, no. 9, pp. 1217-1229, Sept. 2008.
[25] S. Flesca, G. Manco, E. Masciari, L. Pontieri, and A. Pugliese, "Fast Detection of XML Structural Similarity," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 2, pp. 160-175, Feb. 2005.
[26] E.-H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "WebACE: A Web Agent for Document Categorization and Exploration," Proc. Second Int'l Conf. Autonomous Agents (AGENTS '98), pp. 408-415, 1998.
[27] J. Friedman and J. Meulman, "Clustering Objects on Subsets of Attributes," J. Royal Statistical Soc. Series B (Statistical Methodology), vol. 66, no. 4, pp. 815-839, 2004.
[28] L. Hubert, P. Arabie, and J. Meulman, Combinatorial Data Analysis: Optimization by Dynamic Programming. SIAM, 2001.
[29] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. John Wiley & Sons, 2001.
[30] S. Zhong and J. Ghosh, "A Comparative Study of Generative Models for Document Clustering," Proc. SIAM Int'l Conf. Data Mining Workshop Clustering High Dimensional Data and Its Applications, 2003.
[31] Y. Zhao and G. Karypis, "Criterion Functions for Document Clustering: Experiments and Analysis," technical report, Dept. of Computer Science, Univ. of Minnesota, 2002.
[32] T.M. Mitchell, Machine Learning. McGraw-Hill, 1997.

Duc Thang Nguyen received the BEng degree in electrical and electronic engineering from Nanyang Technological University, Singapore, where he is also working toward the PhD degree in the Division of Information Engineering. Currently, he is an operations planning analyst at PSA Corporation Ltd., Singapore. His research interests include algorithms, information retrieval, data mining, optimizations, and operations research.

Lihui Chen received the BEng degree in computer science and engineering at Zhejiang University, China, and the PhD degree in computational science at the University of St. Andrews, United Kingdom. Currently, she is an associate professor in the Division of Information Engineering at Nanyang Technological University in Singapore. Her research interests include machine learning algorithms and applications, data mining and web intelligence. She has published more than 70 refereed papers in international journals and conferences in these areas. She is a senior member of the IEEE, and a member of the IEEE Computational Intelligence Society.

Chee Keong Chan received the BEng degree from the National University of Singapore in electrical and electronic engineering, the MSc degree and DIC in computing from Imperial College, University of London, and the PhD degree from Nanyang Technological University. Upon graduation from NUS, he worked as an R&D engineer in Philips Singapore for several years. Currently, he is an associate professor in the Information Engineering Division, lecturing in subjects related to computer systems, artificial intelligence, software engineering, and cyber security. Through his years in NTU, he has published numerous research papers in conferences, journals, books, and book chapters. He has also provided numerous consultations to industry. His current research interest areas include data mining (text and solar radiation data mining), evolutionary algorithms (scheduling and games), and renewable energy.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.