A Novel Approach Based on Prototypes and Rough Sets for Document and Feature Reductions in Text Categorization
These are the slides for a paper on document classification.

A Novel Approach Based on Prototypes and Rough Sets for Document and Feature Reductions in Text Categorization: Presentation Transcript

  • 1. A Novel Approach Based on Prototypes and Rough Sets for Document and Feature Reductions in Text Categorization. Shing-Hua Ho and Jung-Hsien Chiang. Reporter: CHE-MIN LIAO, 2007/8/27
  • 2. Outline
    • Introduction
    • Document reduction based on prototype concept
    • Feature reduction based on rough sets
    • Performance evaluation and comparison
    • Experimental design and result analysis
    • Conclusion
  • 3. Introduction
    • Text categorization is the task of automatically assigning predefined category labels to new texts.
    • Recently, many statistical methods have been applied in text categorization to reduce the number of terms describing the content of documents, owing to the high dimensionality of document representations.
  • 4. Introduction
    • Feature selection offers a means of choosing a smaller subset of original features to represent the original dataset.
    • Rough set theory can discover hidden patterns and dependency relationships among a large number of feature terms in text datasets.
    • Rough set theory has been applied to many domains, including the task of text categorization, and has been shown to be effective.
  • 5. Introduction
    • In this paper, a new approach based on the prototype concept is provided for document selection, which is guaranteed to greatly reduce the dataset size while preserving classification accuracy.
    • The number of feature terms is reduced via our proposed rough-based algorithm.
    • The properties of rough set theory are used to identify which feature terms should be deleted and which feature terms should be selected.
  • 6. Document reduction based on prototype concept
    • The idea of document reduction is to form a number of groups, each of which contains several documents with the same label, and to use each group mean as the prototype for that group.
    • At the beginning, the documents of each label form one group, and their mean is calculated as the initial prototype.
  • 7. Document reduction based on prototype concept
    • The algorithm is based on the following four situations, applied to all documents within a group:
    • 1) If the closest prototype is the group's own prototype, then no modification is performed for this group.
    • 2) If the closest prototype has an incorrect label, then the group is split into several subgroups according to the documents' label types.
  • 8. Document reduction based on prototype concept
    • 3) If the closest prototype belongs to a different group but has the same label, these documents are shifted to the group of that closest prototype.
    • 4) If the closest prototype belongs to a different group and has an incorrect label, these documents are removed to form a new group, and its mean is computed as a new prototype.
  • 9. Document reduction based on prototype concept
    • Input: V document-label pairs {dv, s(dv)}, v = 1, ..., V, where s(dv) ∈ {1, ..., U} is the label of document dv.
    • Output: prototype set {Pz}, z = 1, ..., Z, and their corresponding labels.
    • Procedure:
    • Step 01: Set Gz = {dv | s(dv) = z}, z = 1, ..., U.
    • Step 02: For z = 1 to U: calculate the initial prototypes Pz = mean(Gz) and set their labels s(Pz) = z.
    • Step 03: Set z = 1, Z = U.
    • Step 04: For t = 1 to Z: calculate Dvt = ||dv − Pt||², for all dv ∈ Gz.
  • 10. Document reduction based on prototype concept
    • Step 05: Determine the index of the closest prototype to each document dv as Iv = arg min_t (Dvt).
    • Step 06: If Iv = z for all dv ∈ Gz, then go to Step 11.
    • Step 07: If s(PIv) ≠ s(Pz) for all dv ∈ Gz, then set Z = Z + 1 and split Gz into two subgroups Ga and Gb; update their means Pa = mean(Ga) and Pb = mean(Gb). If s(Pa) = s(Pb), then go to Step 04.
  • 11. Document reduction based on prototype concept
    • Step 08: If s(PIv) = s(Pz) and PIv ≠ Pz for some dv ∈ Gz, then remove these documents from Gz, include them in group GIv, and update the means: PIv = mean(GIv) and Pz = mean(Gz).
    • Step 09: If s(PIv) ≠ s(Pz) for some dv ∈ Gz, then set Z = Z + 1, remove these documents from Gz, create a new group Gn containing them, and update the means: Pz = mean(Gz) and Pn = mean(Gn).
    • Step 10: If z ≠ Z, then set z = z + 1 and go to Step 04.
    • Step 11: If z = Z and there is no change in groups or prototypes, then STOP.
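The procedure above can be sketched in Python. This is a simplified, illustrative reading of the algorithm, not the authors' implementation: cases 2 and 4 are collapsed into a single split-off rule, and `max_iter` is an assumed safeguard against oscillation.

```python
# Simplified sketch of prototype-based document reduction (assumptions noted
# in the lead-in): documents are vectors, prototypes are group means.
import numpy as np

def reduce_documents(docs, labels, max_iter=50):
    """Iteratively form labelled groups; return their mean prototypes."""
    docs = np.asarray(docs, dtype=float)
    labels = np.asarray(labels)
    # Steps 01-02: one initial group per label, prototype = group mean.
    groups = [np.where(labels == u)[0].tolist() for u in np.unique(labels)]
    group_labels = list(np.unique(labels))

    for _ in range(max_iter):
        protos = np.array([docs[g].mean(axis=0) for g in groups])
        changed = False
        new_groups = [[] for _ in groups]
        extra = {}  # documents split off into new groups, keyed by label
        for gi, g in enumerate(groups):
            for v in g:
                # Steps 04-05: index of the closest prototype.
                iv = int(np.argmin(((protos - docs[v]) ** 2).sum(axis=1)))
                if iv == gi:
                    new_groups[gi].append(v)          # case 1: no change
                elif group_labels[iv] == group_labels[gi]:
                    new_groups[iv].append(v)          # case 3: shift group
                    changed = True
                else:
                    # cases 2/4 (simplified): split off into a new group
                    extra.setdefault(group_labels[gi], []).append(v)
                    changed = True
        for lab, members in extra.items():
            new_groups.append(members)
            group_labels.append(lab)
        groups = [g for g in new_groups if g]
        group_labels = [l for l, g in zip(group_labels, new_groups) if g]
        if not changed:
            break
    protos = np.array([docs[g].mean(axis=0) for g in groups])
    return protos, np.array(group_labels)
```

The output prototypes replace the original documents as training data, which is where the size reduction comes from.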
  • 12. Feature reduction based on rough sets
    • The proposed algorithm is based on the following three properties:
    • (1) An object can be a member of at most one lower bound.
    • (2) An object that is a member of the lower bound of a cluster is also a member of the upper bound of the same cluster.
    • (3) An object that does not belong to any lower bound is a member of at least two upper bounds.
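One way these three properties can be realized is an assignment rule in the style of rough k-means (an illustrative sketch, not necessarily the paper's exact rule): an object whose nearest centroid is clearly closest goes into that cluster's lower and upper bounds; an ambiguous object goes only into the upper bounds of all nearby clusters. The `threshold` parameter is an assumption of this sketch.

```python
# Sketch of a rough-clustering assignment honouring the three properties.
import numpy as np

def rough_assign(X, centroids, threshold=1.5):
    """Return (lower, upper) membership index lists for each cluster."""
    centroids = np.asarray(centroids, dtype=float)
    k = len(centroids)
    lower = [[] for _ in range(k)]
    upper = [[] for _ in range(k)]
    for n, x in enumerate(np.asarray(X, dtype=float)):
        d = np.linalg.norm(centroids - x, axis=1)
        nearest = int(np.argmin(d))
        # Other clusters whose distance is within `threshold` times the minimum.
        close = [j for j in range(k)
                 if j != nearest and d[j] <= threshold * d[nearest]]
        if close:
            # Property (3): ambiguous object joins >= 2 upper bounds only.
            upper[nearest].append(n)
            for j in close:
                upper[j].append(n)
        else:
            # Properties (1)+(2): one lower bound, plus that upper bound.
            lower[nearest].append(n)
            upper[nearest].append(n)
    return lower, upper
```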
  • 13. Feature reduction based on rough sets
    • Using the prototype document space model, every original feature term Xn can be represented as Xn = (X1, ..., XZ)^T with respect to the Z prototype documents. The distance between an object Xn and a cluster mean mk is defined by the equation in the slide (shown as an image).
  • 14. Feature reduction based on rough sets
  • 15. Feature reduction based on rough sets
  • 16. Feature reduction based on rough sets
  • 17. Feature reduction based on rough sets
  • 18. Feature reduction based on rough sets: The rough-based feature selection algorithm achieves exclusive clusters and requires the desired number of clusters to be determined. Theoretically, the suitable maximum number of clusters is estimated by the formula in the slide (shown as an image), where N is the number of features.
  • 19. Feature reduction based on rough sets
  • 20. Performance evaluation and comparison
  • 21. Performance evaluation and comparison
    • Four Feature Selection Methods:
    • Document frequency (DF)
    • Information gain (IG)
    • Mutual information (MI)
    • χ² statistic
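Two of these scoring functions are easy to illustrate on a tiny term-document setup. The sketch below (document frequency and information gain, in their standard definitions; function names are this sketch's own) scores a term by how many documents contain it, and by how much knowing its presence reduces label entropy.

```python
# Illustrative implementations of DF and IG term scoring
# (standard definitions; not taken verbatim from the paper).
import math

def document_frequency(docs, term):
    """Number of documents (token lists) containing `term`."""
    return sum(term in d for d in docs)

def information_gain(docs, labels, term):
    """IG of `term` for the label: H(C) - H(C | term present/absent)."""
    def entropy(ls):
        total = len(ls)
        return -sum((ls.count(c) / total) * math.log2(ls.count(c) / total)
                    for c in set(ls))
    present = [l for d, l in zip(docs, labels) if term in d]
    absent = [l for d, l in zip(docs, labels) if term not in d]
    gain = entropy(labels)
    for part in (present, absent):
        if part:
            gain -= len(part) / len(labels) * entropy(part)
    return gain
```

Terms are then ranked by score and only the top-scoring subset is kept.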
  • 22. Performance evaluation and comparison
    • Four Classifiers:
    • K-Nearest Neighbor (KNN)
    • Naïve Bayes (NB)
    • Rocchio method
    • Support Vector Machine (SVM)
  • 23. Experimental design and result analysis
    • The Reuters-21578 dataset is a collection of newswire stories from 1987, compiled by David Lewis.
    • The experimental results were obtained with 10-fold cross-validation for all classifiers.
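The 10-fold protocol mentioned above can be sketched as a plain index split: partition the samples into 10 folds, then train on 9 folds and test on the held-out one, rotating through all folds (a minimal sketch of the protocol, independent of any particular classifier).

```python
# Minimal k-fold cross-validation index generator (k = 10 in the experiments).
def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    # Spread the remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

Reported accuracy is then the average of the 10 per-fold test accuracies.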
  • 24. Experimental design and result analysis
  • 25. Experimental design and result analysis
  • 26. Experimental design and result analysis
    • The 20_newsgroups dataset was assembled by Ken Lang at Carnegie Mellon University.
    • There are 20,000 documents in the dataset, collected from 20 different newsgroups, each containing 1,000 documents. The dataset contains 111,446 features in all.
  • 27. Experimental design and result analysis
  • 28. Experimental design and result analysis
  • 29. Conclusion
    • The best classification accuracy is achieved by using a subset of features chosen by the information gain method with the LSVM classifier, and by our proposed method.
    • Another point worth noting is that not only classification accuracy but also computational efficiency is improved through feature reduction and document reduction.