Published in IJERA (International Journal of Engineering Research and Applications), an international online peer-reviewed journal. For more details or to submit an article, visit www.ijera.com.

Preeti Kashyap, Babita Ujjainiya / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 3, Issue 3, May-Jun 2013, pp. 274-282

A Survey on Seeds Affinity Propagation

Preeti Kashyap, Babita Ujjainiya
(Department of Information Technology, SATI College, RGPV University, Vidisha (M.P.), India)
(Assistant Professor, Department of Information Technology, SATI College, RGPV University, Vidisha (M.P.), India)

ABSTRACT
Affinity propagation (AP) is a clustering method that can find data centers, or clusters, by sending messages between pairs of data points. Seeds Affinity Propagation is a novel semi-supervised text clustering algorithm based on AP. The AP algorithm cannot directly exploit partially labeled data. Focusing on this issue, a semi-supervised scheme called incremental affinity propagation clustering is presented in this paper, in which prior knowledge is represented by adjusting the similarity matrix. The standard affinity propagation clustering algorithm also suffers from the limitation that it is hard to know the value of the parameter "preference" that will yield an optimal clustering solution. This limitation can be overcome by a method named adaptive affinity propagation. The method first finds the range of the preference, then searches that range to find a good value that optimizes the clustering result.

Keywords - Affinity propagation, Clustering, Incremental, Partition adaptive affinity propagation, Text clustering.

I. INTRODUCTION
Clustering is the process of organizing objects into groups whose members are similar in some way. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than to points in different groups [1]. A cluster is therefore a collection of objects that are similar among themselves and dissimilar to the objects belonging to other clusters.
Text clustering divides a set of texts into clusters so that the texts within each cluster are similar in content. Clustering is an active area of research and finds applications in many fields. One of the most popular clustering methods is the k-means algorithm. K arbitrary points are generated as initial centroids, where k is a user-specified parameter. Each point is assigned to the cluster with the closest centroid, and the centroid of each cluster is then updated as the mean of the data points in that cluster. Some data points may move from one cluster to another; the new centroids are then recomputed and the data points reassigned to the suitable clusters. The assignment and update steps are repeated until the convergence criteria are met. In this algorithm, Euclidean distance is usually used to measure the distance between data points and centroids. Standard k-means has two limitations: (1) the number of clusters needs to be specified first, and (2) the clustering result is sensitive to the initial cluster centers. Traditional approaches to clustering data are based on metric similarities, i.e., measures that are symmetric, nonnegative, and satisfy the triangle inequality. More recent approaches, such as the Affinity Propagation (AP) algorithm [2], can also take general non-metric similarities as input. In the domain of image clustering, AP can take as input similarities computed on selected segments of image pairs [3]. AP has been used to solve a wide range of clustering problems [4] and to predict individual preferences [5]. The clustering performance depends on the message-updating frequency and the similarity measure. AP has been used in text clustering for its simplicity, good performance, and general applicability, for example to preprocess text before clustering, and it has been combined with a parallel strategy for clustering e-learning resources. However, AP was used only as an unsupervised algorithm and did not consider any structural information derived from the specific documents.
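As a concrete illustration of the k-means procedure described above, the following sketch implements the assignment and update steps in plain Python. The function name and the convergence test on unchanged centroids are illustrative choices, not from the surveyed paper:

```python
import random

def kmeans(points, k, iters=100):
    """Plain k-means: pick k random centroids, assign, update, repeat."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[c])))
            clusters[idx].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for c, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster)
                                           for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # assignments stopped changing
            break
        centroids = new_centroids
    return centroids, clusters
```

Both limitations noted above show up directly in this sketch: k must be passed in by the user, and `random.sample` makes the outcome depend on the initial centroids.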
Text mining tasks commonly use the vector space model (VSM), which treats a document as a bag of words and uses plain language words as features [6]. This model can represent text mining problems directly and easily. As the data set grows, however, the vector space becomes sparse and high dimensional, and the computational complexity grows exponentially. In many practical applications, unsupervised learning lacks relevant information, whereas supervised learning needs a large amount of initial class-label information, which requires time and expensive human labor [7], [8]. In recent years, semi-supervised learning has therefore captured a great deal of attention [9], [10]. Semi-supervised learning is a form of machine learning in which the model is constructed using both labeled and unlabeled data for training, typically a small amount of labeled data and a large amount of unlabeled data.

II. RELATED WORK
This section reviews the work on Affinity Propagation to date.
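A minimal sketch of the vector space model mentioned above, using term-count dictionaries as sparse vectors (the helper name and example documents are illustrative):

```python
from collections import Counter

def to_vector(doc):
    """Bag-of-words: a document becomes a sparse term -> count mapping."""
    return Counter(doc.lower().split())

docs = ["affinity propagation clusters text",
        "text clustering groups similar text"]
vectors = [to_vector(d) for d in docs]
# The vocabulary spans the whole collection, while each individual vector
# stores only its own terms: this gap is exactly the sparsity and high
# dimensionality noted in the text.
vocabulary = sorted(set().union(*vectors))
```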
2.1 Affinity Propagation
AP is a new and powerful technique for exemplar learning. It is a fast clustering algorithm, especially when the number of clusters is large, and has several advantages: speed, general applicability, and good performance. In brief, AP works on similarities between pairs of data points and simultaneously considers all data points as potential cluster centers, called exemplars. It computes two kinds of messages exchanged between data points: the first is called "responsibility" and the second "availability". Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity s(i,k) indicates how well the data point with index k is suited to be the exemplar for data point i. When the goal is to minimize squared error, each similarity is set to a negative squared Euclidean distance: for points xi and xk,

s(i,k) = -||xi - xk||^2    (1)

The two kinds of messages exchanged between data points each take into account a different kind of competition. Messages can be combined at any stage to decide which points are exemplars and, for every other point, which exemplar it belongs to. The "availability" a(i,k) message is sent from candidate exemplar point k to point i and reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar. The "responsibility" r(i,k) message is sent from data point i to candidate exemplar point k and reflects the accumulated evidence for how well suited point k is to serve as the exemplar for point i.
At the beginning, the availabilities are initialized to zero: a(i,k) = 0. The responsibilities are then computed using the rule

r(i,k) ← s(i,k) - max_{k'≠k} { a(i,k') + s(i,k') }    (2)

Since the availabilities are zero in the first iteration, r(i,k) is set to the input similarity between point i and candidate exemplar k, minus the largest of the similarities between point i and the other candidate exemplars. This update is data-driven and does not take into account how many other points favor each candidate exemplar. In later iterations, when some points are effectively assigned to other exemplars, their availabilities will drop below zero. These negative availabilities decrease the effective values of some of the input similarities s(i,k'), removing the corresponding candidate exemplars from competition. For k = i, the self-responsibility r(k,k) is set to the input preference that point k be chosen as an exemplar, s(k,k), minus the largest of the similarities between point k and all other candidate exemplars. This self-responsibility reflects the accumulated evidence that point k is an exemplar.

a(i,k) ← min{ 0, r(k,k) + Σ_{i'∉{i,k}} max{0, r(i',k)} }    (3)

The availability a(i,k) is set to the self-responsibility r(k,k) plus the sum of the positive responsibilities that candidate exemplar k receives from other points. Only the positive portions of incoming responsibilities are added, since a good exemplar need only explain some data points well, regardless of how poorly it explains others. A negative self-responsibility r(k,k) indicates that point k is currently better suited to belonging to another exemplar than to being an exemplar itself; the availability of point k as an exemplar can still be increased if some other points have positive responsibilities for point k being their exemplar. The total sum is thresholded so that it cannot go above zero, to limit the influence of strong incoming positive responsibilities.
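The update rules of Eqs. (2) and (3), together with the standard self-availability update for the diagonal, can be sketched with NumPy as follows (the damping factor and iteration count are illustrative defaults):

```python
import numpy as np

def affinity_propagation(S, max_iter=200, damping=0.5):
    """Message passing over a similarity matrix S (preferences on the
    diagonal), following Eqs. (2)-(4)."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i,k)
    A = np.zeros((n, n))  # availabilities  a(i,k)
    for _ in range(max_iter):
        # Eq. (2): r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf        # mask the row maximum
        second = AS.max(axis=1)                # second-largest value per row
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * R_new
        # Eq. (3): a(i,k) = min{0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))}
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))       # keep self-responsibility r(k,k)
        A_new = Rp.sum(axis=0)[None, :] - Rp   # column sums minus own contribution
        dA = np.diag(A_new).copy()             # Eq. (4): a(k,k) = sum of positive r(i',k)
        A_new = np.minimum(A_new, 0)           # threshold off-diagonal entries at zero
        np.fill_diagonal(A_new, dA)
        A = damping * A + (1 - damping) * A_new
    # The exemplar of point i is the k that maximizes a(i,k) + r(i,k).
    return np.argmax(A + R, axis=1)
```

With the preferences on the diagonal set, for example, to the median of the pairwise similarities, the number of clusters emerges from the data rather than being specified in advance.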
The "self-availability" a(k,k) is updated differently:

a(k,k) ← Σ_{i'≠k} max{ 0, r(i',k) }    (4)

This message reflects the accumulated evidence that point k is an exemplar, based on the positive responsibilities sent to candidate exemplar k from the other points. The above update rules require only simple, local computations, and messages need only be exchanged between pairs of points with known similarities. Availabilities and responsibilities can be combined to identify exemplars at any point during affinity propagation: for point i, the value of k that maximizes a(i,k) + r(i,k) either identifies point i as an exemplar (if k = i) or identifies the data point that is the exemplar for point i. The message-passing procedure may be terminated after a fixed number of iterations or after the changes in the messages fall below a threshold. To avoid the numerical oscillations that arise in some circumstances, it is important that the messages be damped when they are updated: each message is set to λ times its value from the previous iteration plus (1 - λ) times its prescribed updated value, where the damping factor λ is between 0 and 1. Each iteration of affinity propagation consists of:
1. updating all responsibilities given the availabilities;
2. updating all availabilities given the responsibilities; and
3. combining availabilities and responsibilities to monitor the exemplar decisions and terminate the algorithm.

2.1.1 Disadvantages of Affinity Propagation
1. It is hard to know the value of the parameter "preference" that will yield an optimal clustering solution.
2. When oscillations occur, AP cannot automatically eliminate them.

2.2 Seeds Affinity Propagation
Seeds Affinity Propagation is based on the AP method. The main new features of the new
algorithm are: Tri-Set computation, similarity computation, seeds construction, and message transmission [11]. The similarity measurement uses three feature sets: the Co-feature Set (CFS), the Unilateral Feature Set (UFS), and the Significant Co-feature Set (SCS). The structural information of the text documents is included in the new similarity measurement.

2.2.1 Similarity Measurement
Similarity measurement plays a major role in Affinity Propagation clustering. To give a specific and effective similarity measurement for our particular domain, i.e., text documents, three feature sets are used: the Co-feature Set, the Unilateral Feature Set, and the Significant Co-feature Set. In this approach, each term in a text is still deemed a feature and each document is still deemed a vector; the features and vectors are not all computed simultaneously, but one at a time. Let D = {d1, d2, …, dN} be a set of texts. Suppose di and dj are two objects in D, represented using the following two subsets:

di = {<fi1, ni1>, <fi2, ni2>, …, <fiL, niL>},
dj = {<fj1, nj1>, <fj2, nj2>, …, <fjM, njM>},

where fix and fjy (1 ≤ x ≤ L, 1 ≤ y ≤ M) in the two-tuples <fix, nix> and <fjy, njy> represent the xth and yth features of di and dj, respectively; nix and njy are the values of fix and fjy; and L and M are the counts of the objects' features.

Let Fi and Fj be the feature sets of the two objects: Fi = {fi1, fi2, …, fiL}, Fj = {fj1, fj2, …, fjM}. Let the set DFj be composed of the "most significant" features of dj, meaning features capable of representing crucial aspects of the document. These most significant features could be key phrases and/or tags associated with each document, when available, or all the words except stop words in the title of each document. The Venn diagram is shown in Fig. 1.

Definition 1.
Co-feature Set. Let di and dj be two objects in a data set. Suppose that some features of di also belong to dj. A new two-tuples subset consisting of these features and their values in dj can be constructed.

Definition 2. Unilateral Feature Set. Suppose that some features of di do not belong to dj. A two-tuples subset consisting of these features and their values in di can be constructed.

Definition 3. Significant Co-feature Set. Suppose that some features of di also belong to the most significant features of dj. A new two-tuples subset consisting of these features and their values as most significant features in dj can be constructed.

The generic definition of similarity measures based on the Cosine coefficient is thus extended by introducing the three new sets CFS, UFS, and SCS, yielding the Tri-Set similarity. This extended similarity measure can reveal both the difference and the asymmetric nature of similarities between documents. It is quite effective in the application of Affinity Propagation clustering to text documents, image processing, and so on, since it is capable of dealing with asymmetric problems. The combination of this new similarity with conventional Affinity Propagation is named the Tri-Set Affinity Propagation (AP(Tri-Set)) clustering algorithm.

Fig 1. Three kinds of relations between the two feature subsets of di and dj. Fi and Fj are their feature subsets, DFj is the set of most significant features of dj, and D is the whole data set.

2.2.2 Seed Construction
In semi-supervised clustering, the main goal is to efficiently cluster a large number of unlabeled objects starting from a relatively small number of initially labeled objects. Given a few initially labeled objects, they can be used to construct efficient initial seeds for our Affinity Propagation clustering
algorithm. To avoid a blind search, guarantee precision for the seeds, and avoid imbalance errors, a specific seed-construction method is presented, namely Mean Features Selection. Let NO, ND, NF, and FC represent, respectively, the object number, the most significant feature number, the feature number, and the feature set of cluster C in the labeled set. Suppose F is the feature set and DF is the most significant feature set of seed c. For example, the DF of this manuscript could be all the words except stop words in the title, i.e., {Survey, Seeds, Affinity, Propagation}. Let fk ∈ FC with value nk in cluster c, and let nDk (0 ≤ nDk ≤ nk) be its value as a most significant feature. The seed-construction method is prescribed as:

1. fk ∈ F iff nk ≥ (1/NO) Σ nk;
2. fk ∈ DF iff nDk ≥ (1/NO) Σ nDk.

This method can quickly find the representative features in the labeled objects. Seeds are made up of these features and their values in the different clusters; they should be more representative and discriminative than normal objects. For seeds, the self-similarities are set to +∞ to ensure that the seeds will be chosen as exemplars and to help the algorithm obtain the exact cluster number. The combination of this semi-supervised strategy with the classical similarity measurement and conventional Affinity Propagation is named the Seeds Affinity Propagation with Cosine coefficient (SAP(CC)) clustering algorithm. By introducing both the seed-construction method and the new similarity measurement into conventional AP, the complete Seeds Affinity Propagation algorithm can be defined.

2.2.3 Seeds Affinity Propagation Algorithm
Based on the definitions of SCS and UFS and the described seed-construction method, the SAP algorithm is developed with this sequence of steps:
1.
Initialization: Let the data set D be an N-term (N > 0) superset where each term consists of a sequence of two-tuples:
D = {<f11, n11>, <f12, n12>, …, <f1M1, n1M1>}, …, {<fN1, nN1>, <fN2, nN2>, …, <fNMN, nNMN>},
where Mx represents the count of the xth object's features.
2. Seeds construction: Construct seeds from a few labeled objects according to Mean Features Selection. Add these new objects to the data set D, obtaining a new data set D′ that contains N′ terms (N ≤ N′).
3. Tri-Set computation: Compute the Co-feature Set, Unilateral Feature Set, and Significant Co-feature Set between objects i and j.
4. Similarity computation: Compute the similarities among the objects in D′.
5. Self-similarity computation: Compute the self-similarities for each object in D′.
6. Initialize messages: Initialize the message matrices: r(i,j) = s(i,j) − max_{j′≠j} { s(i,j′) }, a(i,j) = 0.
7. Message matrix computation: Compute the message matrices.
8. Exemplar selection: Add the two message matrices and search for the exemplar of each object i, which is the maximum of r(i,j) + a(i,j).
9. Update the messages.
10. Iterate steps 7, 8, and 9 until the exemplar-selection outcome stays constant for a number of iterations, or end the algorithm after a fixed number of iterations.

To summarize: start with the definition of three new relations between objects; then assign the three feature sets different weights and present a new similarity measurement; finally, define a fast initial seed-construction method and detail the steps of the new Seeds Affinity Propagation algorithm in the general case.

2.3.4 Advantages of Seeds Affinity Propagation
1. It reduces the time complexity and improves the accuracy.
2. It avoids random initialization and being trapped in local minima.

3.1 Incremental Affinity Propagation
A semi-supervised scheme called incremental affinity propagation clustering is presented in [12]. Incremental Affinity Propagation integrates the AP algorithm and semi-supervised learning.
The labeled data information is coded into the similarity matrix in some way, and an incremental study is performed to amplify the prior knowledge. In the scheme, the pre-known information is represented by adjusting the similarity matrix. To examine the effectiveness of this method, it is applied to the text clustering problem. In this method, the known class information is first coded into the similarity matrix; then, after running AP for a certain number of iterations, the most convinced data are put into the "labeled data set" and the similarity matrix is reset. This process is repeated until all the data are labeled. Compared with the method in [13], the handling of the constrained condition in this scheme is soft and objective. Furthermore, the introduction of the incremental study amplifies the prior knowledge about the target data set and therefore leads to a more accurate result. Also, the parameter
of "preference" in the method is self-adaptive, set mainly according to the target number of clusters. Focusing on the text clustering problem, the Cosine coefficient method [14] is used to compute the similarity between two different points (texts) in the specific I-APC algorithm.

3.1.1 Incremental Affinity Propagation Algorithm
In most real clustering problems, a part of the data set, generally a minority of the whole, may have been labeled beforehand. To take advantage of this valuable, and possibly expensive, resource, a semi-supervised scheme based on AP called the Incremental Affinity Propagation Clustering (I-APC) algorithm is presented.

The similarity between data points i and j, s(i,j), indicates how well data point i is suited to be the exemplar for data point j. If both points are labeled with the same class, they are more likely to support each other as exemplars; therefore, s(i,j) should be set much larger than usual. On the contrary, if the two points are labeled differently, s(i,j) should be set much smaller.

In the method, an incremental learning technique is applied to amplify the prior knowledge. After each run of AP, the most convinced data point is added to the labeled data set. For a non-labeled data point k, a score function score(k,i) represents how likely the point belongs to cluster i:

score(k,i) = α · Σ_{j∈L∩Ii} (a(k,j) + r(k,j)) + β · Σ_{j∈Ii∖L} (a(k,j) + r(k,j))    (5)

where L is the labeled data set, Ii is the data set contained in the ith cluster according to the last run of AP, and α and β are the coefficients of the so-called convinced and non-convinced items, respectively. Here α > β > 0.
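A sketch of the score function of Eq. (5). Since the printed summation indices are garbled in the original, the split into labeled ("convinced") and unlabeled cluster members is an assumption, as are the function and argument names:

```python
def score(k, cluster_members, labeled, A, R, alpha=1.0, beta=0.5):
    """Eq. (5) sketch: how strongly unlabeled point k is attached to a cluster.
    cluster_members: index set of cluster i from the last AP run.
    labeled: the labeled data set L.
    A, R: availability and responsibility matrices (nested lists or arrays).
    alpha > beta > 0 weight the convinced and non-convinced contributions."""
    convinced = sum(A[k][j] + R[k][j] for j in cluster_members if j in labeled)
    other = sum(A[k][j] + R[k][j] for j in cluster_members if j not in labeled)
    return alpha * convinced + beta * other
```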
A convinced function of point k can then be defined from the score function as the margin between its largest and second-largest cluster scores:

conv(k) = max_m score(k,m) − max_{n≠m*} score(k,n)    (6)

where m* is the cluster attaining the first maximum. The most convinced data point is then selected by maximizing this convinced-function value over the non-labeled data.

After the labeled data set is updated, the similarity matrix is reset according to the newly labeled data. The effect of the labeled data decreases as the labeling time increases, so if points i and j are both in the labeled data set and at least one of them is newly labeled, s(i,j) is set to be a function of the program time t:

s(i,j) = −|Xi − Xj| · (A + e^(−Bt))^(1 − 2|sign(Ci − Cj)|)    (7)

where

sign(x) = 1 if x > 0; 0 if x = 0; −1 if x < 0,

A and B are two constants, t is the program time, which can be considered as the iteration number of the AP run, and Ci and Cj are the class labels of Xi and Xj, respectively.

During the process, the preference s(i,i) is also adjustable. After a run of AP, if the resulting number of clusters is larger than expected, the preference values should be reduced as a whole, and vice versa. The rule is as follows:

s(i,i) = s(i,i) · 1 / (1 + e^(−K′/K))    (8)

where K′ is the expected cluster number and K the resulting cluster number. The responsibilities and availabilities keep their values from the end of the last run as the initial values of the next run, which speeds up the convergence of the algorithm.

In summary, the I-APC scheme can be described as follows:
1. Initialize: the labeled data set L, the responsibility and availability matrices, the similarities between different points (according to Eqs. (1) and (7)), and the self-similarities, set to a common value.
2. If the size of L reaches a pre-given number P, go to step 5; else run AP.
3. Select the most convinced data point according to Eqs. (5)-(6) and update L. Reset the similarities between different points according to Eq. (7) and the self-similarities according to Eq. (8).
4. Go to step 2.
5. Output the results and end.

The flow chart of the I-APC scheme is shown in Fig 2.
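The Cosine coefficient used as the text-similarity measure in I-APC can be sketched for sparse term-weight vectors as follows (the function name is illustrative):

```python
import math

def cosine_coefficient(u, v):
    """Cosine similarity between two sparse term -> weight dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)           # dot product over shared terms
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Unlike the negative squared Euclidean distance of Eq. (1), this measure stays in [0, 1] for nonnegative term weights and ignores document length, which is why it suits text.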
Fig 2. Flowchart of the I-APC scheme

3.1.2 Incremental Affinity Propagation for Text Clustering
Text clustering is the process of dividing a whole set of texts into several groups according to the similarities among the texts. It is an important research field of text mining and is often used as a benchmark for newly developed clustering methods. For text clustering, each text is considered a point in the problem space, so the similarity between two different points can no longer be set as the negative squared Euclidean distance. Several similarity measurements are commonly used in information retrieval. The simplest is the simple matching coefficient, which counts the number of shared terms. A more powerful method is the Cosine coefficient, which is therefore used to measure point similarity in this method. During the process, the similarity matrix is reset according to the newly labeled data, and s(i,j) is a time-dependent function based on the original distance.

3.1.3 Disadvantages of Incremental Affinity Propagation
1. The I-APC method costs more CPU time than AP.
2. Performance degrades when the incremental process is run too often, so the selection of the threshold is important.

4.1 Adaptive Affinity Propagation
The affinity propagation clustering algorithm suffers from the limitation that it is hard to know the value of the parameter "preference" that will yield an optimal clustering solution. This limitation can be overcome by a method named adaptive affinity propagation [15]. The method first finds the range of the preference, then searches that range to find a good value that optimizes the clustering result.

4.1.1 Computing the range of preferences
Affinity propagation tries to maximize the net similarity [16].
Net similarity is a score for explaining the data, and it represents how appropriate the exemplars are. The score sums all similarities between the data points and their exemplars; the similarity between an exemplar and itself is the preference of that exemplar. Affinity propagation aims at maximizing the net similarity and tests whether each data point is an exemplar. The method for computing the range of preferences can therefore be developed as shown in Fig. 3.

Fig3: The procedure of computing the range of preferences

Algorithm 1: Preference Range Computing
Input: s(i,k), the similarity between point i and point k (i ≠ k).
Output: the maximal and minimal values of the preference, pmax and pmin.
Step 1: Initialize s(k,k) to zero: s(k,k) = 0.
Step 2: Compute the maximal value of the preference: pmax = max { s(i,k) }.
Step 3: Compute the minimal value of the preference:
Step 3.1: Compute the net similarity when the number of clusters is 1: dpsim1 = max_j { Σ_i s(i,j) }.
Step 3.2: Compute the net similarity when the number of clusters is 2: dpsim2 = max_{i,j} { Σ_k max{ s(i,k), s(j,k) } }.
Step 3.3: Compute the minimal value of the preference: pmin = dpsim1 − dpsim2.

The maximum preference (pmax) in the range is the value that clusters the N data points
into N clusters, and it equals the maximum similarity, since a preference lower than that would make it better for the data point associated with that maximum similarity to be assigned as a cluster member rather than as an exemplar.

The derivation of pmin is similar. Suppose there is a particular preference p such that the optimal net similarity for one cluster (K = 1) and the optimal net similarity for two clusters (K = 2) are the same. The optimal net similarity for two clusters can be obtained by searching through all possible pairs of exemplars, and its value is dpsim2 + 2p. If there is one cluster, the optimal net similarity is dpsim1 + p. The minimum preference pmin leads to clustering the N data points into one cluster. Since affinity propagation aims at maximizing the net similarity, dpsim1 + p ≥ dpsim2 + 2p, so p ≤ dpsim1 − dpsim2. As pmin is no more than p, the minimum value for the preference is pmin = dpsim1 − dpsim2.

4.1.2 Adaptive Affinity Propagation Clustering
After computing the range of preferences, the preference space can be scanned to find the optimal clustering result. Different preferences will yield different clusterings, and cluster validation techniques are used to evaluate which clustering result is optimal for the data set. The preference step is very important for scanning the space adaptively. It is denoted as:

pstep = (pmax − pmin) / (N · (0.1·K + 50))

In order to sample the whole space, the base of the scanning step is set as (pmax − pmin)/N. A fixed increasing step cannot meet the different requirements of different cases, such as more clusters and fewer clusters.
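Algorithm 1 above (pmax as the largest off-diagonal similarity, pmin = dpsim1 − dpsim2) can be sketched as:

```python
import numpy as np

def preference_range(S):
    """Preference range [pmin, pmax] for AP, assuming the diagonal of S
    has already been set to zero (Step 1 of Algorithm 1)."""
    n = S.shape[0]
    off = ~np.eye(n, dtype=bool)
    pmax = S[off].max()              # Step 2: largest pairwise similarity
    # Step 3.1: best net similarity with a single exemplar j
    dpsim1 = S.sum(axis=0).max()
    # Step 3.2: best net similarity with two exemplars i and j
    dpsim2 = max(np.maximum(S[i], S[j]).sum()
                 for i in range(n) for j in range(i + 1, n))
    pmin = dpsim1 - dpsim2           # Step 3.3
    return pmin, pmax
```

The pairwise search in Step 3.2 makes this O(n^3); it is a direct transcription of the algorithm rather than an optimized implementation.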
This is because the more-clusters case is more sensitive than the fewer-clusters case. Therefore, following an adaptive step method similar to Wang's [17], an adaptive coefficient is used:

q = 1 / (0.1·K + 50)

In this way, the step pstep varies with the count of clusters (K): when K is large, pstep will be small, and vice versa.

In this paper, the global silhouette index is used as the validity index. The silhouette was introduced by Rousseeuw [18] as a general graphical aid for the interpretation and validation of cluster analysis; it provides a measure of a data point's classification when it is assigned to a cluster, according to both the tightness of the clusters and the separation between them. The global silhouette index is defined as:

GS = (1/nc) Σ_{j=1}^{nc} Sj

where the local silhouette index is:

Sj = (1/rj) Σ_{i=1}^{rj} (b(i) − a(i)) / max{ b(i), a(i) }

Here rj is the count of the objects in class j, a(i) is the average distance between object i and the objects in the same class j, and b(i) is the minimum average distance between object i and the objects in the class closest to class j.

Fig. 4 shows the procedure of the adaptive affinity propagation clustering method. The largest global silhouette index indicates the best clustering quality and the optimal number of clusters [19]. A series of Sil values corresponding to clustering results with different numbers of clusters is calculated, and the optimal clustering result is found where Sil is largest.

Fig4: The procedure of adaptive propagation clustering

4.1.3 Adaptive Affinity Propagation Document Clustering
This section discusses adaptive affinity propagation document clustering, which applies the adaptive affinity propagation algorithm to clustering documents, combined with the Vector Space Model (VSM). The Vector Space Model is the most common model for representing documents among the many models of document representation.
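The global silhouette index GS defined above can be sketched as follows (this sketch assumes Euclidean distances and at least two clusters; the names are illustrative):

```python
import numpy as np

def global_silhouette(X, labels):
    """GS = mean over clusters of Sj, where Sj averages (b - a) / max(a, b)
    over the points of cluster j."""
    clusters = sorted(set(labels.tolist()))
    # Pairwise Euclidean distance matrix.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    local = []
    for c in clusters:
        sils = []
        for i in np.where(labels == c)[0]:
            same = (labels == c) & (np.arange(len(labels)) != i)
            a = D[i, same].mean() if same.any() else 0.0   # tightness a(i)
            b = min(D[i, labels == c2].mean()              # separation b(i)
                    for c2 in clusters if c2 != c)
            sils.append((b - a) / max(a, b))
        local.append(sum(sils) / len(sils))                # Sj for cluster j
    return sum(local) / len(local)                         # GS
```

Values near 1 indicate tight, well-separated clusters; adaptive AP keeps the preference whose clustering maximizes this score.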
Algorithm 3: Adaptive Affinity Propagation Clustering
Input: s(i,k), the similarity between point i and point k (i ≠ k).
Output: the clustering result.
Step 1: Apply the Preference Range algorithm to compute the range of preferences [pmin, pmax].
Step 2: Initialize the preference: preference = pmin − pstep.
Step 3: Update the preference: preference = preference + pstep.
Step 4: Apply the Affinity Propagation algorithm to generate K clusters.
Step 5: Terminate when Sil is largest.

In the VSM, every document is represented as a vector: V(d) = (t1, w1(d); t2, w2(d); …; tm, wm(d)), where ti is the ith word item and wi(d) is the weight of ti in the
document d. The most widely used weighting scheme is Term Frequency with Inverse Document Frequency (TF-IDF).

5.1 Partition Adaptive Affinity Propagation
Affinity propagation exhibits fast execution speed and finds clusters with a low error rate when clustering sparsely related data, but its parameter values are fixed. Partition adaptive affinity propagation can automatically eliminate oscillations and adjust the values of the parameters when rerunning the affinity propagation procedure, yielding optimal clustering results with high execution speed and precision [20].

The premise is that both AP and AAP are message-communication processes between data points in a dense matrix, and the time spent is in direct ratio to the number of iterations. During each iteration of AP, each element r(i,k) of the responsibility matrix must be calculated once, and each calculation involves N − 1 elements, where N is the size of the input similarity matrix, according to Eq. (2); each element of the availability matrix is calculated in the same way. During an iteration of AAP, the convergence of K is detected, but the execution speed is still much lower than that of AP.

Partition adaptive affinity propagation (PAAP) is a modified algorithm that can eliminate oscillations using the method of AAP. The adaptive technique consists of two parts: one is called fine adjustment and the other coarse adjustment. Fine adjustment is used to decrease the value of the parameter "preference" slowly, and coarse adjustment is used to decrease it rapidly. The original similarity matrix is decomposed into sub-matrices to gain higher execution speed [21], [22] when executing the method.
PAAP can yield optimal clustering solutions on both dense and sparse datasets. Assume that Cmax is the expected maximal number of clusters, Cmin is the expected minimal number of clusters, K(i) is the number of clusters in the ith iteration, and maxits is the maximal number of iterations. λstep and pstep are the adaptive factors, as in AAP. The PAAP algorithm goes as follows:

Algorithm: PAAP
1. Execute the AP procedure; get the number of clusters K(i).
2. If K(i) ≤ K(i + 1), go to step 4. Else, count ← 0 and go to step 3.
3. λ ← λ + λstep, then go to step 1. If λ > 0.85, then p ← p + pstep and s(i,i) ← p; else go to step 1.
4. If │Cmax − K(i)│ > CK, then Astep ← −20 · │K(i) − Cmin│ and go to step 6. Else, delay 10 iterations and then go to step 5.
5. If K(i) ≤ K(i + 1), then count ← count + 1 and Astep ← count · pstep; go to step 6. Else, go to step 1.
6. p ← p + Astep, then s(i,i) ← p.
7. If i = maxits or K(i) ≤ Cmin, the algorithm terminates. Else, go to step 1.

PAAP can find the true, or a better, number of clusters with high execution speed on dense or sparse datasets; meanwhile, it can automatically detect number oscillations and eliminate them. This verifies that both the acceleration technique and the partition technique are effective. If Kparts and Astep (the acceleration factor) are well chosen, the average number of iterations can be reduced effectively.

5.1.2 Advantages of Partition Adaptive Affinity Propagation
1. PAAP is an improved approach based on affinity propagation. It can automatically escape from oscillation and adjust the values of the parameters λ and p.

III. CONCLUSION
In this survey, various clustering approaches and algorithms in document clustering are described. A new clustering algorithm which combines Affinity Propagation with semi-supervised learning, i.e., the Seeds Affinity Propagation (SAP) algorithm, is presented.
In comparison with the classical clustering algorithm k-means, SAP not only improves the accuracy and reduces the computational complexity of text clustering but also effectively avoids being trapped in local minima and random initialization. Incremental Affinity Propagation, in turn, integrates the AP algorithm with semi-supervised learning: the labeled data information is coded into the similarity matrix. The Adaptive Affinity Propagation algorithm first computes the range of preferences and then searches this space to find the preference value that generates the optimal clustering result, in contrast to the plain AP approach, which cannot guarantee optimal clustering results because it simply sets the preferences to the median of the similarities.
The area of document clustering still has many open issues. We hope the paper gives interested readers a broad overview of the existing techniques. As future work, improvements over the existing systems, offering better results and new information-representation capabilities with different techniques, can be attempted.

REFERENCES
[1] S. Deelers and S. Auwatanamongkol, "Enhancing K-Means Algorithm with Initial Cluster Centers Derived from Data Partitioning along the Data Axis with the Highest Variance,"
International Journal of Electrical and Computer Engineering, vol. 2, no. 4, 2007.
[2] B.J. Frey and D. Dueck, "Clustering by Passing Messages between Data Points," Science, vol. 315, no. 5814, pp. 972-976, Feb. 2007.
[3] B.J. Frey and D. Dueck, "Non-Metric Affinity Propagation for Un-Supervised Image Categorization," Proc. 11th IEEE Int'l Conf. Computer Vision (ICCV '07), pp. 1-8, Oct. 2007.
[4] L. Michele, Sumedha, and W. Martin, "Clustering by Soft-Constraint Affinity Propagation: Applications to Gene-Expression Data," Bioinformatics, vol. 23, no. 20, pp. 2708-2715, Sept. 2007.
[5] T.Y. Jiang and A. Tuzhilin, "Dynamic Micro Targeting: Fitness-Based Approach to Predicting Individual Preferences," Proc. Seventh IEEE Int'l Conf. Data Mining (ICDM '07), pp. 173-182, Oct. 2007.
[6] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, pp. 1-47, 2002.
[7] F. Wang and C.S. Zhang, "Label Propagation through Linear Neighborhoods," IEEE Trans. Knowledge and Data Eng., vol. 20, no. 1, pp. 55-67, Jan. 2008.
[8] Z.H. Zhou and M. Li, "Semi-Supervised Regression with Co-Training Style Algorithms," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 11, pp. 1479-1493, Aug. 2007.
[9] A. Blum and T. Mitchell, "Combining Labeled and Unlabeled Data with Co-Training," Proc. 11th Ann. Conf. Computational Learning Theory, pp. 92-100, 1998.
[10] Z.H. Zhou, D.C. Zhan, and Q. Yang, "Semi-Supervised Learning with Very Few Labeled Training Examples," Proc. 22nd AAAI Conf. Artificial Intelligence, pp. 675-680, 2007.
[11] R. Guan, X. Shi, M. Marchese, C. Yang, and Y. Liang, "Text Clustering with Seeds Affinity Propagation," IEEE Trans. Knowledge and Data Eng., vol. 23, no. 4, Apr. 2011.
[12] H.F. Ma, X.H. Fan, and J. Chen, "An Incremental Chinese Text Classification Algorithm Based on Quick Clustering," Proc. 2008 Int'l Symp. Information Processing (ISIP '08), pp. 308-312, May 2008.
[13] Y. Xiao and J. Yu, "Semi-Supervised Clustering Based on Affinity Propagation," Journal of Software, vol. 19, no. 11, pp. 2803-2813, Nov. 2008.
[14] C.J. van Rijsbergen, Information Retrieval, 2nd ed., Butterworth, London, pp. 23-28, 1979.
[15] K.J. Wang, J.Y. Zhang, D. Li, X.N. Zhang, and T. Guo, "Adaptive Affinity Propagation Clustering," Acta Automatica Sinica, vol. 33, no. 12, pp. 1242-1246, Dec. 2007.
[16] FAQ of Affinity Propagation Clustering: http://www.psi.toronto.edu/affinitypropagation/faq.html
[17] K.J. Wang, J.Y. Zhang, and D. Li, "Adaptive Affinity Propagation Clustering," Acta Automatica Sinica, vol. 33, no. 12, pp. 1242-1246, 2007.
[18] P.J. Rousseeuw, "Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis," Computational and Applied Mathematics, vol. 20, pp. 53-65, 1987.
[19] S. Dudoit and J. Fridlyand, "A Prediction-Based Resampling Method for Estimating the Number of Clusters in a Dataset," Genome Biology, vol. 3, no. 7, pp. 0036.1-0036.21, 2002.
[20] C. Sun, C. Wang, S. Song, and Y. Wang, "A Local Approach of Adaptive Affinity Propagation Clustering for Large Scale Data," Proc. Int'l Joint Conf. Neural Networks (IJCNN), Atlanta, Georgia, USA, June 2009.
[21] S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Information Systems, vol. 26, no. 1, pp. 35-58, 2001.
[22] D.Y. Xia, F. Wu, X.Q. Zhang, and Y.T. Zhuang, "Local and Global Approaches of Affinity Propagation Clustering for Large Scale Data," J. Zhejiang Univ. Science A, pp. 1373-1381, 2008.