Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clustering
International Journal of Research in Computer Science, eISSN 2249-8265, Volume 2, Issue 4 (2012), pp. 7-12. © White Globe Publications. www.ijorcs.org

PRIVACY PRESERVING MFI BASED SIMILARITY MEASURE FOR HIERARCHICAL DOCUMENT CLUSTERING

P. Rajesh (1), G. Narasimha (2), N. Saisumanth (3)
(1, 3) Department of CSE, VVIT, Nambur, Andhra Pradesh, India. Email: rajesh.pleti@gmail.com, saisumanth.nanduri@gmail.com
(2) Department of CSE, JNTUH, Hyderabad, Andhra Pradesh, India. Email: narasimha06@gmail.com

Abstract: The growth of the World Wide Web has imposed great challenges on researchers to improve search efficiency over the internet. Nowadays web document clustering has become an important research topic for extracting the most relevant documents from the huge volume of results returned in response to a simple query. In this paper we first propose a novel approach that precisely defines clusters based on maximal frequent item sets (MFI) found by the Apriori algorithm, and then uses the same MFI based similarity measure for hierarchical document clustering. By considering maximal frequent item sets, the dimensionality of the document set is decreased. Secondly, we provide privacy preservation of open web documents by avoiding duplicate documents, thereby protecting the individual copyrights of documents. This is achieved using an equivalence relation.

Keywords: Maximal Frequent Item set, Apriori algorithm, Hierarchical document clustering, Equivalence relation.

I. INTRODUCTION

Document clustering has been studied intensively because of its wide applicability in areas such as web mining, search engines, text mining and information retrieval. The rapid growth of databases in every aspect of human activity has resulted in enormous demand for efficient algorithms for turning data into valuable knowledge.

Document clustering has gone through various methods, yet it remains inefficient at providing exactly or approximately the information the user needs. Suppose the user makes an incorrect selection while browsing the documents in a hierarchy and does not notice the mistake until he has browsed deep into the hierarchy; this decreases the efficiency of search and increases the number of navigation steps needed to find relevant documents. So we need a hierarchical clustering that is relatively flat, which reduces the number of navigation steps. Therefore there is a great need for new document clustering algorithms that are more efficient than conventional clustering algorithms [1, 2].

The growth of the World Wide Web has also imposed great challenges on researchers to cluster similar documents on the internet and thereby improve the efficiency of search. Search engine users are getting more confused when selecting the relevant documents among the huge volumes of search results returned for a simple query. A potential solution to this problem is to cluster similar web documents, which helps the user identify the relevant data easily and effectively [3].

The outline of this paper is as follows. Section II briefly discusses related work. Section III presents our proposed algorithm, including common preprocessing steps and the pseudo code of the algorithm; it also shows how clusters are precisely defined from the maximal frequent item sets (MFI) produced by the Apriori algorithm. Section IV describes how the same MFI based similarity measure is exploited for hierarchical document clustering, with a running example. Section V provides privacy preservation of open web documents using an equivalence relation to protect the individual copyrights of a document. Section VI gives the conclusion and future scope.

II. RELATED WORK

The related work on using maximal frequent item sets in web document clustering is as follows. Ling Zhuang and Honghua Dai [4] introduced a new criterion to specifically locate the initial points using maximal frequent item sets; these initial points are then used as centers for the k-means algorithm. However, k-means clustering is a completely unstructured approach, sensitive to noise, and produces an unorganized
collection of clusters that is not favorable to interpretation [5, 6]. To minimize the overlapping of documents, Beil and Ester [7] proposed HFTC (Hierarchical Frequent Text Clustering), another frequent item set based approach, which chooses the next frequent item sets in sequence. But the clustering result depends on the order of choosing the next frequent item sets, and the resulting hierarchy in HFTC usually contains many clusters at the first level. As a result, documents of the same class are distributed into different branches of the hierarchy, which decreases the overall clustering accuracy.

C.M. Fung [8] introduced the FIHC (Frequent Item set based Hierarchical Clustering) method for document clustering, in which a cluster topic tree is constructed based on the similarity among clusters. FIHC uses efficient child pruning when the number of clusters is large and applies the more elaborate sibling merging only when the number of clusters is small. Experimental results show that FIHC actually outperforms other algorithms (bisecting k-means, UPGMA) in accuracy for most numbers of clusters.

The Apriori algorithm [9] is a well-known method for computing frequent item sets in a transaction database. Documents under the same topic share more common frequent item sets (terms) than documents of different topics. The main advantage of using frequent item sets is that they can identify the relation among more than two documents at a time in a document collection, unlike a similarity measure between two documents [10, 11]. By means of maximal frequent item sets, the dimensionality of the document set is reduced; moreover, maximal frequent item sets capture the most related document sets. On the other hand, hierarchical clustering is most suitable for browsing and maps the most specific documents to generalized documents in the whole collection.

A conventional hierarchical clustering method constructs the hierarchy by subdividing a parent cluster or merging similar child clusters. It usually suffers from an inability to adjust once a merge or split decision has been performed, and this rigidity may lower the clustering accuracy. Furthermore, because a parent cluster in the hierarchy always contains all objects of its children, this kind of hierarchy is not suitable for browsing: the user may have difficulty locating his intended object in such a large cluster.

Our hierarchical clustering method is completely different. First we form all the clusters by assigning documents to the most similar cluster using maximal frequent item sets found by the Apriori algorithm, and then we construct the hierarchical document clustering based on inter-cluster similarities via the same maximal frequent item set (MFI) based similarity measure. The clusters in the resulting hierarchy are non-overlapping, and a parent cluster contains only the more general documents.

III. ALGORITHM DESCRIPTION

In this section we explain our proposed algorithm, including common preprocessing steps and the pseudo code of the algorithm, and show how clusters are precisely defined from the maximal frequent item sets (MFI) produced by the Apriori algorithm. First, we describe some common preprocessing steps for representing each document by item sets (terms). Second, we bring in the vector space model by assigning weights to the terms in all document sets. Finally, we explain the initialization of cluster seeds using MFI to perform hierarchical clustering. Let Ds represent the set of all documents in the database collection: Ds = {d1, d2, d3, ..., dM}, 1 <= i <= M.

A. Pre-Processing

The document set Ds is converted from unstructured format into a common representation using text preprocessing techniques, in which words or terms are extracted (tokenization). The input documents in Ds are preprocessed by first removing HTML tags, then applying a stop-word list and a stemming algorithm, as in the sketch following this list:
a) HTML tags: parse and remove HTML tags.
b) Stop words: remove stop words such as conjunctions, connectives and prepositions.
c) Stemming: we utilize the Porter2 stemmer algorithm in our approach.
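The following minimal Python sketch illustrates steps a)-c); it is ours, not the paper's. The tiny STOP_WORDS list and the crude suffix stripper are stand-ins for a full stop-word list and the Porter2 stemmer the paper actually uses.

import re

# Tiny stand-in stop-word list; a real pipeline would use a full list.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "or", "the", "to"}

def naive_stem(word):
    # Crude suffix stripper standing in for the Porter2 stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(html):
    # a) remove HTML tags, then tokenize; b) drop stop words; c) stem.
    text = re.sub(r"<[^>]+>", " ", html)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>Clustering the indexed web documents</p>"))
# -> ['cluster', 'index', 'web', 'document']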
B. Vector Representation of Documents

The vector space model is the most commonly used document representation model in text mining, web mining and information retrieval. In this model each document is represented as an n-dimensional term vector whose entries reflect the importance of the corresponding terms in the document. Let N be the total number of terms and M the number of documents; each document can be denoted as D_i = (term_i1, term_i2, ..., term_in), 1 <= i <= M. Terms whose document frequency satisfies df(term_ij) < threshold are retained, to avoid the problem that the more often a term appears throughout all documents in the whole collection, the more poorly it discriminates between documents [12]. The term frequency tf is the number of times a term appears in a document; the document frequency df of a term is the number of documents that contain it. We then construct the weighted document vectors D_i = (w_i1, w_i2, w_i3, ..., w_in), where w_ij = tf_ij * IDf(j) and IDf(j) = log(M / df_j), 1 <= j <= n, IDf being the inverse document frequency.
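As a sketch of this weighting, assuming documents arrive as token lists from the preprocessing step (tfidf_vectors is our illustrative name, and the natural log stands in for the unspecified log base):

import math
from collections import Counter

def tfidf_vectors(docs):
    # w_ij = tf_ij * log(M / df_j), as defined in Section III.B.
    M = len(docs)
    df = Counter()                # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)         # term frequency within this document
        vectors.append({t: tf[t] * math.log(M / df[t]) for t in tf})
    return vectors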
Table 1: Transactional database representation of documents

Terms    | Doc 1 | Doc 2 | Doc 3 | ... | Doc M
Java     |   1   |   1   |   0   | ... |   1
Beans    |   0   |   1   |   0   | ... |   0
...      |  ...  |  ...  |  ...  | ... |  ...
Servlets |   1   |   0   |   1   | ... |   1

By representing documents in vector form, we can easily identify which documents contain the same features: the more features documents have in common, the more related they are. Thus it is realistic to find well related documents. Assume that each document is an item in the transactional database and each term corresponds to a transaction. Our aim is to search for highly related documents "appearing" together with the same features (documents whose MFI features are closed). Likewise, maximal frequent item set discovery in the transaction database serves the purpose of finding documents appearing together in many transactions, i.e., document sets which have a large number of features in common.

A frequent item set, denoted FI, is a set of words which occur frequently together; such sets are good candidates for clusters. An item set X is closed if there does not exist an item set X1 such that X is a proper subset of X1 and t(X) = t(X1), where t(X) is defined as the set of transactions that contain item set X; the closed frequent item sets are denoted FCI. If X is frequent and no superset of X is frequent among the set of items I in the transactional database, then we say that X is a maximal frequent item set, and the maximal frequent item sets are denoted MFI. Thus MFI is a subset of FCI, which is a subset of FI. Whenever very long patterns are present in the data, it is often impractical to generate the entire set of frequent item sets or closed item sets [16]; in that case maximal frequent item sets are adequate. We employ the maximal frequent item set algorithm from [17], which uses Apriori. These maximal frequent item sets are the initial seeds for hierarchical document clustering.

C. Apriori for Maximal Frequent Item Sets

Mining frequent item sets is a core data mining task that focuses on finding the relations of different items in a large database. Mining frequent patterns is a crucial problem in many data mining applications, such as the discovery of association rules, correlations, multidimensional patterns, and numerous other important patterns inferred from consumer market basket analysis, web access logs, etc. The association mining problem is formulated as follows: given a large database of transactions over a set of items, find all frequent item sets, where a frequent item set is one that occurs in at least a user-specified threshold fraction of the database. Many of the proposed item set mining algorithms are variants of Apriori, which employs a bottom-up, breadth-first search that enumerates every single frequent item set. Apriori is a conventional algorithm that was first introduced for mining association rules. Association rule mining can be viewed as a two-step process: (1) identify all frequent item sets; (2) generate strong association rules from the frequent item sets. First, candidate item sets are generated, and afterwards frequent item sets are mined with the help of these candidate item sets. In the proposed approach we use only the frequent item sets for further processing, so we carry out only the first step (generation of maximal frequent item sets) of the Apriori algorithm.
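The paper adopts the MFI algorithm of [17], so the following is only a stand-in sketch: a plain bottom-up Apriori enumeration followed by a maximality filter, assuming transactions are given as Python sets (in the paper's formulation, each transaction is a term and its items are the documents containing it).

from itertools import combinations

def maximal_frequent_itemsets(transactions, minsup):
    # Support of an itemset = number of transactions containing it.
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    frequent = list(level)
    while level:
        # Join step: merge pairs of k-itemsets into (k+1)-candidates, keep frequent ones.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= minsup]
        frequent.extend(level)
    # Maximal = no frequent strict superset exists.
    return [x for x in frequent if not any(x < y for y in frequent)]

# Documents d1 and d2 co-occur in two of the three term-transactions:
print(maximal_frequent_itemsets([{"d1", "d2", "d3"}, {"d1", "d2"}, {"d2", "d4"}], minsup=2))
# -> [frozenset({'d1', 'd2'})]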
D. Pseudo Code of the Algorithm

MFI Based Similarity Measure for Hierarchical Document Clustering.
Input: document set Ds.
Definitions: MFI: maximal frequent item set; tf: term frequency; df: document frequency.

Step 1. For each document in Ds, remove the HTML tags and perform stop-word removal and stemming.
Step 2. Calculate the term frequency (tf) and document frequency (df), and represent each document as D_i = (term_i1, term_i2, ..., term_in), 1 <= i <= M, where df(term_ij) < threshold value.
Step 3. Construct the weighted document vectors D_i = (w_i1, w_i2, w_i3, ..., w_in) for all documents, where w_ij = tf_ij * IDf(j) and IDf(j) = log(M / df_j), 1 <= j <= n.
Step 4. Represent each document by the keywords whose tf > support, and calculate the maximal frequent item sets of terms using the Apriori algorithm: MFI = {F1, F2, F3, ..., Fn}, where each Fi = {d1, d2, d3, ..., dk}.
Step 5. If a document d_i occurs in more than one maximal frequent item set, choose I_d = {I_d0, I_d1, ...} as the set of maximal frequent item sets containing d_i, and initialize I_x = I_d0. For each maximal frequent item set I_di containing d_i: if jaccard(center(I_di), d_i) > jaccard(center(I_x), d_i), then set I_x = I_di. Finally assign document d_i to I_x and discard d_i from the other maximal frequent item sets. Repeat this process for every document that occurs in more than one maximal frequent item set (a sketch of this step follows the pseudo code).
Step 6. Apply hierarchical document clustering, treating the maximal frequent item sets Fi as clusters: combine the documents in each Fi into a single new document and represent it by its center, obtained by combining the features of the maximal frequent item set of terms that groups the documents.
Step 7. Repeat the same process of hierarchical document clustering based on maximal frequent item sets for all levels of the hierarchy; stop if the total number of documents equals one, else go to step 4.
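A sketch of step 5 under one explicit assumption: following step 6's wording ("combining the features"), a cluster's center is taken here as the union of its members' feature sets; jaccard and assign_shared_docs are our illustrative names.

def jaccard(a, b):
    # J(A, B) = |A intersect B| / |A union B|
    return len(a & b) / len(a | b) if a | b else 0.0

def assign_shared_docs(clusters, features):
    # clusters: {name: set of doc ids}; features: {doc id: set of terms}.
    # A document found in several MFIs keeps only the cluster whose
    # center it is closest to by Jaccard, and is discarded elsewhere.
    for d in sorted({d for members in clusters.values() for d in members}):
        homes = [c for c, members in clusters.items() if d in members]
        if len(homes) <= 1:
            continue
        def center(c):
            return set().union(*(features[x] for x in clusters[c]))
        best = max(homes, key=lambda c: jaccard(center(c), features[d]))
        for c in homes:
            if c != best:
                clusters[c].discard(d)
    return clusters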
IV. HIERARCHICAL CLUSTERS BASED ON MAXIMAL FREQUENT ITEM SETS

After finding the maximal frequent item sets (MFI) with the Apriori algorithm, we turn to the creation of the hierarchical document clustering using the same MFI based similarity measure. A simple example is provided to demonstrate the entire process. The set of maximal frequent item sets over the whole collection Ds produced by the Apriori algorithm is MFI = {F1, F2, F3, ..., Fn}, where each member consists of a set of documents, Fi = {d1, d2, d3, ..., dk}. Consider the collection of documents occurring in the maximal frequent item sets of MFI:

MFI covers {d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, d15}

F1 = {d2, d4, d6}
F2 = {d3, d4, d8}
F3 = {d1, d5, d7}
F4 = {d4, d2, d14}
F5 = {d10, d12, d15}
F6 = {d9, d11, d13}

The clusters in the resulting hierarchy are non-overlapping. This is achieved through the following cases.

Case 1: If Fi and Fj are the same, choose one of them at random to form the cluster.

Case 2: If Fi and Fj are different, form clusters of the documents contained in Fi and Fj independently, and represent each by its center (as in step 6). In our example the maximal frequent item sets F3, F5 and F6 are pairwise different, so we form clusters according to the documents they contain; e.g., F3 = {d1, d5, d7} becomes one cluster in the hierarchy.

Case 3: If Fi and Fj contain some of the same documents, proceed as in step 5. In our example document d2 is repeated in more than one maximal frequent item set, {F1, F4}; similarly d4 is repeated in {F1, F2, F4}. For document d4, choose I_d = {F1, F2, F4} = {I_d0, I_d1, I_d2} and initialize I_x = I_d0 = F1. For each of the maximal frequent item sets I_d0 to I_d2 containing d4, calculate the measure: if jaccard(center(I_di), d4) > jaccard(center(I_x), d4), then set I_x = I_di. Using this Jaccard measure we identify, among the maximal frequent item sets containing d4, the one to which d4 is closest. Suppose d4 is closest to the maximal frequent item set F4; then assign d4 to I_x = F4 and discard d4 from the other maximal frequent item sets. After this step each document belongs to exactly one cluster. Similarly, d2 is assigned to F1. This process is repeated for every document that occurs in more than one maximal frequent item set. Since documents d2 and d4 were repeated in F1 and F4, the clusters formed at the first level of the hierarchy by applying steps 5 and 6 are as follows:

F1 = {d2, d6}
F2 = {d3, d8}
F3 = {d1, d5, d7}
F4 = {d4, d14}
F5 = {d10, d12, d15}
F6 = {d9, d11, d13}

The hierarchy built from these maximal frequent item set clusters is shown in Figure 1. We repeat the same process of hierarchical document clustering based on maximal frequent item sets for all levels of the hierarchy, stopping when the total number of documents equals one, otherwise going back to step 4.

[Figure 1: Hierarchical document clustering using MFI]
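Running the step-5 sketch above on the example's six item sets demonstrates the non-overlap guarantee; the toy feature sets are invented purely so the call executes, so which cluster keeps d2 and d4 depends on them (the paper simply supposes F4 wins d4 and F1 wins d2).

clusters = {"F1": {"d2", "d4", "d6"}, "F2": {"d3", "d4", "d8"},
            "F3": {"d1", "d5", "d7"}, "F4": {"d4", "d2", "d14"},
            "F5": {"d10", "d12", "d15"}, "F6": {"d9", "d11", "d13"}}
features = {d: {d, "shared-term"}        # toy one-term-per-document features
            for members in clusters.values() for d in members}
resolved = assign_shared_docs(clusters, features)
assert sum(len(m) for m in resolved.values()) == 15   # every document in exactly one cluster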
Each new document L_ij in the hierarchy is represented by its maximal frequent item set of terms as center (as in step 6). These maximal frequent item sets are obtained by combining the features of the maximal frequent item sets of terms that group the documents, and each new document carries the correspondingly updated weights of those terms. Here L_ij denotes the jth document at level L_i of the hierarchy. In the figure, {L12 = L21} means that the maximal frequent item set of terms of the 2nd document at level L1 did not match any other document's MFI set at the same level L1, so it is carried unchanged to the next level; the same holds for {L13 = L22}. The documents {L11, L15} and {L14, L16} at the first level are combined using MFI based hierarchical clustering and are represented at the second level as L23 and L24.

V. PRIVACY PRESERVING OF WEB DOCUMENTS USING EQUIVALENCE RELATION

Most internet web documents are publicly available for providing the services required by the user; such documents contain no confidential or sensitive data (they are open to all). How, then, can we provide privacy for such documents? Nowadays the same information exists in duplicate form in more than one document, so privacy preservation of documents can be provided by avoiding duplicate documents; thereby we can protect the individual copyrights of documents. Many duplicate document detection techniques are available, such as syntactic, URL based, and semantic approaches, but each carries the processing overhead of maintaining shinglings, signatures, or fingerprints [13, 14, 15, 18]. In this paper we propose a new technique for avoiding duplicate documents using an equivalence relation.

Let Ds, the input duplicate document set, be a subset of the web document collection. First find the Jaccard similarity measure for every pair of documents in Ds using the weighted feature representation of maximal frequent item sets discussed in steps 2 and 3 of the algorithm. If the similarity measure of two documents equals 1, the two documents are most similar; if the measure is 0, they are not duplicates. The Jaccard index, or Jaccard similarity coefficient, is a statistical measure of similarity between sample sets: for two sets it is the cardinality of their intersection divided by the cardinality of their union. Mathematically,

J(d1, d2) = |d1 intersect d2| / |d1 union d2|

For every pair of documents we calculate the Jaccard measure. All the diagonal elements of the matrix are ones, because every document is most related to itself; when classifying documents into equivalence classes we do not consider these entries and set them to zero. The Jaccard similarity coefficient matrix R_alpha for four documents, where alpha is a threshold, can be represented as follows:

R_alpha:
        d1    d2    d3    d4
d1     1.0   0.4   0.8   0.5
d2     0.4   1.0   0.8   0.4
d3     0.8   0.8   1.0   0.9
d4     0.5   0.4   0.9   1.0
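A short sketch of how such a matrix would be built from the documents' feature sets (here plain term sets rather than weighted vectors, for brevity; similarity_matrix is our name, not the paper's):

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_matrix(features):
    # Pairwise Jaccard coefficients; diagonal entries are 1.0 because
    # every document is fully similar to itself.
    return [[jaccard(a, b) for b in features] for a in features]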
First find the jaccard similarity can perform the clustering on duplicate documents.measure for every pair of documents in Ds using Apart from the representation of feature documentweighted feature representation of maximal frequent vector by MFI, we also need to consider that who isitem sets discussed in step 2 and step 3 in algorithm. If the author of document, when the document wasthe similarity measure of two documents is equal to 1, created, where it is available, helps in effectivelythen the two documents are most similar. If the finding the duplicate documents. Each document inmeasure is 0, then they are not duplicates. The Jaccard input Ds must belong to unique equivalence class. If Rindex or the Jaccard similarity coefficient is a is equivalence relation on Ds = {d1, d2, d3, d4 …..dn}.statistical measure of similarity between sample sets. Then number of equivalence relations on Ds is alwaysFor two sets, it is denoted as the cardinality of their lies between n ≤ | R|≤ n2. i.e the time complexity ofintersection divided by the cardinality of their union. |𝑑1 ∩ 𝑑2 | calculating equivalence relation on Ds is O(n2). .i.e𝐽 �𝑑 𝑖 , 𝑑 𝑗 � ≥ 0.8. Since the matrix is symmetric, theMathematically 𝐽(𝑑1 , 𝑑2 ) = Choose the threshold α in equivalence relation as 0.8 |𝑑1 ∩ 𝑑2 | documents sets {(𝑑3 , 𝑑1 ), (𝑑3 , 𝑑2 ), (𝑑4 , 𝑑3 )} are mostly related. Hence the documents are near For every pair of two documents calculate jaccard duplicates and grouping the documents into clustersmeasure of d1, d2.All the diagonal elements in matrix thereby providing privacy of individual copy rights ofare ones, because every document mostly related to documents. www.ijorcs.org
VI. CONCLUSION AND FUTURE SCOPE

Cluster analysis can be used as a powerful, stand-alone data mining technique that gains insight and knowledge from huge unstructured databases. Most conventional clustering methods do not satisfy document clustering requirements such as high dimensionality, huge volume, and easy access to meaningful cluster labels. In this paper we presented a novel approach, a maximal frequent item set (MFI) based similarity measure for hierarchical document clustering, to address these issues. Dimensionality reduction is achieved through MFI, and by using the same MFI similarity measure in hierarchical document clustering the number of levels is decreased, which makes browsing easy. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning, and by applying MFI based techniques to clusters in these areas we can obtain high quality clusters. Moreover, by means of maximal frequent item sets we can predict the most influential objects of clusters in entire datasets for applications such as business, marketing, the world wide web, and social network analysis.

VII. REFERENCES

[1] Xu, R., Wunsch, D. (2005). "A Survey of Clustering Algorithms". IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
[2] Jain, A.K., Murty, M.N., Flynn, P.J. (1999). "Data Clustering: A Review". ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323.
[3] Kleinberg, J.M. (1999). "Authoritative Sources in a Hyperlinked Environment". Journal of the ACM, Vol. 46, No. 5, pp. 604-632.
[4] Ling Zhuang, Honghua Dai (2004). "A Maximal Frequent Item Set Approach for Web Document Clustering". In Proceedings of the Fourth IEEE International Conference on Computer and Information Technology (CIT-2004).
[5] Trosset, M.W. (2008). "Representing Clusters: K-Means Clustering, Self-Organizing Maps and Multidimensional Scaling". Technical Report, Department of Statistics, Indiana University, Bloomington.
[6] Steinbach, M., Karypis, G., Kumar, V. (2000). "A Comparison of Document Clustering Techniques". In Proceedings of the KDD Workshop on Text Mining (KDD-2000), Boston, pp. 109-111.
[7] Beil, F., Ester, M., Xu, X. (2002). "Frequent Term-Based Text Clustering". In Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Alberta, Canada.
[8] Fung, B.C.M., Wang, K., Ester, M. (2003). "Hierarchical Document Clustering Using Frequent Item Sets". In Proceedings of the SIAM International Conference on Data Mining (SIAM DM-2003), pp. 59-70.
[9] Agrawal, R., Srikant, R. (1994). "Fast Algorithms for Mining Association Rules". In Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, pp. 487-499.
[10] Liu, W.L., Zeng, X.S. (2005). "Document Clustering Based on Frequent Term Sets". In Proceedings of Intelligent Systems and Control, 2005.
[11] Zamir, O., Etzioni, O. (1998). "Web Document Clustering: A Feasibility Demonstration". In Proceedings of ACM SIGIR-98, pp. 46-54.
[12] Kjersti, A. (1997). "A Survey on Personalized Information Filtering Systems for the World Wide Web". Technical Report 922, Norwegian Computing Center.
[13] Prasannakumar, J., Govindarajulu, P. (2009). "Duplicate and Near Duplicate Documents Detection: A Review". European Journal of Scientific Research, ISSN 1450-216X, Vol. 32, No. 4, pp. 514-527.
[14] Syed Mudhasir, Y., Deepika, J. (2011). "Near Duplicate Detection and Elimination Based on Web Provenance for Efficient Web Search". International Journal on Internet and Distributed Computing Systems, Vol. 1, No. 1.
[15] Alsulami, B.S., Abulkhair, F., Essa, E. "Near Duplicate Document Detection Survey". International Journal of Computer Science and Communications Networks, Vol. 2, No. 2, pp. 147-151.
[16] Burdick, D., Calimlim, M., Gehrke, J. (2001). "MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases". In Proceedings of the 17th International Conference on Data Engineering (ICDE-2001).
[17] Murali Krishna, S., Durga Bhavani, S. (2010). "An Efficient Approach for Text Clustering Based On Frequent Item Sets". European Journal of Scientific Research, ISSN 1450-216X, Vol. 42, No. 3, pp. 399-410.
[18] Lopresti, D.P. (1999). "Models and Algorithms for Duplicate Document Detection". In Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR-1999), 20-22 September, pp. 297-300.