These slides present a paper on document classification.
A novel approach based on prototypes and rough sets for document and feature reductions in text categorization Shing-Hua Ho and Jung-Hsien Chiang Reporter: CHE-MIN LIAO, 2007/8/27
Outline <ul><li>Introduction </li></ul><ul><li>Document reduction based on prototype concept </li></ul><ul><li>Feature reduction based on rough sets </li></ul><ul><li>Performance evaluation and comparison </li></ul><ul><li>Experimental design and result analysis </li></ul><ul><li>Conclusion </li></ul>
Introduction <ul><li>Text categorization is the task of automatically assigning predefined category labels to new texts. </li></ul><ul><li>Because of the high dimensionality of document representation, many statistical methods have recently been applied in text categorization to reduce the number of terms describing the content of documents. </li></ul>
Introduction <ul><li>Feature selection offers a means of choosing a smaller subset of original features to represent the original dataset. </li></ul><ul><li>Rough set theory can discover hidden patterns and dependency relationships among large numbers of feature terms in text datasets. </li></ul><ul><li>Rough set theory has been applied to many domains, including text categorization, and has been shown to be effective. </li></ul>
Introduction <ul><li>This paper presents a new prototype-based approach to document selection that greatly reduces the dataset size while preserving classification accuracy. </li></ul><ul><li>The number of feature terms is reduced via the proposed rough-based algorithm. </li></ul><ul><li>Properties of rough set theory are used to identify which feature terms should be deleted and which should be retained. </li></ul>
Document reduction based on prototype concept <ul><li>The idea of document reduction is to form a number of groups, each of which contains several documents of the same label; each group mean is used as the prototype for the group. </li></ul><ul><li>At the beginning, the documents of each label form one group, and their mean is calculated as the initial prototype. </li></ul>
Document reduction based on prototype concept <ul><li>We perform the algorithm according to the following four situations, for all documents within a group: </li></ul><ul><li>1) If the closest prototype is the group's own prototype, then no modification is performed for this group. </li></ul><ul><li>2) If the closest prototype has an incorrect label, then the group is split into several subgroups according to the documents' label types. </li></ul>
Document reduction based on prototype concept <ul><li>3) If the closest prototype belongs to a different group but has the same label, these documents are shifted to the group of that closest prototype. </li></ul><ul><li>4) If the closest prototype belongs to a different group and has an incorrect label, these documents are removed to form a new group, and its mean is computed as a new prototype. </li></ul>
Document reduction based on prototype concept <ul><li>Input: V document-label pairs { dv , s ( dv )}, v =1,..., V , where </li></ul><ul><li> s ( dv )∈{1,..., U } is the label for document dv . </li></ul><ul><li>Output: Prototype set { Pz }, z =1,…, Z and their </li></ul><ul><li> corresponding labels. </li></ul><ul><li>Procedure: </li></ul><ul><li>Step 01: Set Gz ={ dv | s ( dv )= z }, z =1,…, U . </li></ul><ul><li>Step 02: For z =1 to U </li></ul><ul><li>Calculate the initial prototypes Pz =mean( Gz ) </li></ul><ul><li>and their labels s ( Pz )= z , z =1,…, U </li></ul><ul><li> End For </li></ul><ul><li>Step 03: Set z =1, Z = U </li></ul><ul><li>Step 04: For t =1 to Z </li></ul><ul><li> Calculate Dvt =|| dv − Pt ||², ∀ dv ∈ Gz </li></ul><ul><li> End For </li></ul>
Document reduction based on prototype concept Step 05: Determine the index of the closest prototype to each document dv as Iv =arg min( Dvt ) Step 06: If Iv = z , ∀ dv ∈ Gz Then go to Step 11 End If Step 07: If s ( PIv )≠ s ( Pz ) ∀ dv ∈ Gz Then Set Z = Z +1 and split Gz into two subgroups Ga and Gb Update their means: Pa =mean( Ga ) and Pb =mean( Gb ) If s ( Pa )= s ( Pb ) Then go to Step 04 End If End If
Document reduction based on prototype concept <ul><li>Step 08: If s ( PIv )= s ( Pz ), PIv ≠ Pz for some dv ∈ Gz Then </li></ul><ul><li>Remove these documents from Gz and </li></ul><ul><li>include them in group GIv </li></ul><ul><li>Update their means: PIv =mean( GIv ) and </li></ul><ul><li>Pz =mean( Gz ) </li></ul><ul><li> End If </li></ul><ul><li>Step 09: If s ( PIv )≠ s ( Pz ) for some dv ∈ Gz Then </li></ul><ul><li>Set Z = Z +1 </li></ul><ul><li>Remove these documents from Gz and </li></ul><ul><li>create a new group Gn containing these documents </li></ul><ul><li>Update the means: Pz =mean( Gz ) and </li></ul><ul><li>Pn =mean( Gn ) </li></ul><ul><li> End If </li></ul><ul><li>Step 10: If z ≠ Z Then </li></ul><ul><li>Set z = z +1 and go to Step 04 </li></ul><ul><li> End If </li></ul><ul><li>Step 11: If z = Z and no change in groups or </li></ul><ul><li>prototypes Then STOP </li></ul><ul><li> End If </li></ul>
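The reduction procedure above can be sketched in code. Below is a minimal Python sketch, not the authors' implementation: it assumes integer class labels, recomputes all prototypes once per pass, and for simplicity merges situations 2 and 4 by splitting every wrongly-attracted document of a group into a single new group per pass.

```python
import numpy as np
from itertools import count

def reduce_documents(docs, labels, max_iter=20):
    """Prototype-based document reduction (simplified sketch of Steps 01-11).

    Returns the final prototypes and their class labels."""
    docs = np.asarray(docs, dtype=float)
    labels = np.asarray(labels)
    gid = labels.copy()                      # Steps 01-02: one group per label
    next_id = count(int(gid.max()) + 1)      # fresh ids for groups created by splits
    for _ in range(max_iter):
        ids = np.unique(gid)
        protos = np.stack([docs[gid == i].mean(axis=0) for i in ids])
        plab = np.array([labels[gid == i][0] for i in ids])  # groups are label-pure
        # Steps 04-05: index of the closest prototype for every document.
        closest = np.linalg.norm(docs[:, None] - protos[None], axis=2).argmin(axis=1)
        new_gid = gid.copy()
        for z_idx, z in enumerate(ids):
            in_group = np.flatnonzero(gid == z)
            movers = in_group[closest[in_group] != z_idx]
            if movers.size == 0:                 # situation 1: group unchanged
                continue
            same = movers[plab[closest[movers]] == labels[movers]]
            diff = movers[plab[closest[movers]] != labels[movers]]
            new_gid[same] = ids[closest[same]]   # situation 3: shift to that group
            if diff.size:                        # situations 2/4: split off
                new_gid[diff] = next(next_id)
        if np.array_equal(new_gid, gid):         # Step 11: no change, stop
            break
        gid = new_gid
    ids = np.unique(gid)
    protos = np.stack([docs[gid == i].mean(axis=0) for i in ids])
    plab = np.array([labels[gid == i][0] for i in ids])
    return protos, plab
```

After convergence, each group is summarized by a single prototype, so a classifier can be trained on the (much smaller) prototype set instead of all V documents.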
Feature reduction based on rough sets <ul><li>The proposed algorithm is based on the following three properties : </li></ul><ul><li>(1) An object can be a member of one lower </li></ul><ul><li> bound at most. </li></ul><ul><li>(2) An object that is a member of the lower </li></ul><ul><li> bound of a cluster is also a member of the </li></ul><ul><li> upper bound of the same cluster. </li></ul><ul><li>(3) An object that does not belong to any lower </li></ul><ul><li> bound is a member of at least two upper </li></ul><ul><li> bounds. </li></ul>
Feature reduction based on rough sets <ul><li>Using the prototype document space model, every original feature term Xn can be represented as Xn = (xn1, …, xnZ)T with respect to the Z prototype documents. The distance between the object Xn and the cluster mean mk is defined as the Euclidean distance d(Xn, mk) = ||Xn − mk||. </li></ul>
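This clustering step behaves like rough k-means: an object enters the lower bound of the nearest cluster only when it is clearly closest, and otherwise enters the upper bounds of every comparably near cluster. A minimal sketch of such an assignment rule, consistent with properties (1)-(3) above; the `threshold` ratio is a hypothetical tuning parameter, not taken from the slides.

```python
import numpy as np

def rough_assign(X, means, threshold=1.3):
    """Rough-clustering assignment: lower bound when clearly closest,
    otherwise the upper bounds of all comparably near clusters."""
    X, means = np.asarray(X, float), np.asarray(means, float)
    lower = [set() for _ in means]
    upper = [set() for _ in means]
    for n, x in enumerate(X):
        d = np.linalg.norm(means - x, axis=1)         # d(X_n, m_k) for every k
        k = int(d.argmin())
        near = np.flatnonzero(d <= threshold * d[k])  # comparably near clusters
        if near.size == 1:
            lower[k].add(n)      # property (1): at most one lower bound
            upper[k].add(n)      # property (2): also in that upper bound
        else:
            for j in near:       # property (3): at least two upper bounds
                upper[int(j)].add(n)
    return lower, upper
```

Feature terms that fall only into overlapping upper bounds are the ambiguous ones; exclusive lower-bound members are the clear-cut candidates for selection or deletion.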
Feature reduction based on rough sets [slides 14–17: equation and figure content not preserved in this export]
Feature reduction based on rough sets The rough-based feature selection algorithm yields exclusive clusters and requires the desired number of clusters to be specified. Theoretically, the suitable maximum number of clusters is estimated as [formula not preserved in this export], where N is the number of features.
Feature reduction based on rough sets [slide 19: content not preserved in this export]
Performance evaluation and comparison [slide 20: content not preserved in this export]
Performance evaluation and comparison <ul><li>Four Feature Selection Methods : </li></ul><ul><li>Document frequency (DF) </li></ul><ul><li>Information gain (IG) </li></ul><ul><li>Mutual information (MI) </li></ul><ul><li>χ² statistic method </li></ul>
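For illustration, document frequency and the χ² statistic can both be computed from binary term-presence vectors; χ² uses the standard 2×2 term/category contingency table (as in Yang and Pedersen's formulation). A minimal sketch; function and variable names are my own.

```python
def document_frequency(term_doc):
    """DF: the number of documents in which the term occurs.
    `term_doc` is a list of 0/1 term-presence flags, one per document."""
    return sum(term_doc)

def chi_square(term_doc, in_category):
    """Chi-square statistic of a term w.r.t. one category, computed from the
    2x2 contingency table of term presence vs. category membership."""
    N = len(term_doc)
    A = sum(1 for t, c in zip(term_doc, in_category) if t and c)      # term, in cat
    B = sum(1 for t, c in zip(term_doc, in_category) if t and not c)  # term, not cat
    C = sum(1 for t, c in zip(term_doc, in_category) if not t and c)  # no term, in cat
    D = N - A - B - C                                                 # neither
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0
```

A term perfectly correlated with the category gets the maximum score N, while a term independent of the category scores 0, which is what makes the statistic usable as a ranking criterion.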
Performance evaluation and comparison <ul><li>Four Classifiers : </li></ul><ul><li>K-Nearest-Neighbor (KNN) </li></ul><ul><li>Naïve Bayes (NB) </li></ul><ul><li>Rocchio method </li></ul><ul><li>Support Vector Machine (SVM) </li></ul>
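Of these, the Rocchio method is the simplest to sketch: each class is summarized by the centroid of its training vectors, and a new document is assigned to the nearest centroid. A minimal nearest-centroid sketch, omitting the negative-example weighting some Rocchio variants use.

```python
import numpy as np

class Rocchio:
    """Nearest-centroid classifier: each class is represented by the
    mean of its training vectors."""
    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance of every sample to every class centroid; pick the closest.
        d = np.linalg.norm(np.asarray(X, float)[:, None] - self.centroids_[None], axis=2)
        return self.classes_[d.argmin(axis=1)]
```

Note the symmetry with the document-reduction step: once documents are reduced to prototypes, a Rocchio-style classifier over those prototypes is almost free.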
Experimental design and result analysis <ul><li>The Reuters-21578 dataset is a collection of newswire stories from 1987, compiled by David Lewis. </li></ul><ul><li>The experimental results were obtained with 10-fold cross-validation for all classifiers. </li></ul>
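For reference, 10-fold cross-validation partitions the dataset into ten folds and repeatedly trains on nine folds while testing on the held-out one. A minimal sketch of the index split (no shuffling or stratification, which a real evaluation would likely add):

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k consecutive folds and yield
    (train, test) index lists, one pair per fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

Averaging the classifier's accuracy over the ten test folds gives the reported cross-validated figure.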
Experimental design and result analysis [slides 24–25: result tables and figures not preserved in this export]
Experimental design and result analysis <ul><li>20_newsgroups was assembled by Ken Lang at Carnegie Mellon University. </li></ul><ul><li>There are 20,000 documents in the dataset, collected from 20 different newsgroups, each contributing 1,000 documents. The dataset contains 111,446 features in all. </li></ul>
Experimental design and result analysis [slides 27–28: result tables and figures not preserved in this export]
Conclusion <ul><li>The best classification accuracy is achieved by using a subset of features chosen by the information gain method with the LSVM classifier and our proposed method. </li></ul><ul><li>Another point worth noting is that not only classification accuracy but also computational efficiency is improved through feature reduction and document reduction. </li></ul>
