Automatic Document Categorization using Support Vector Machines  Prashanth Kumar Muthoju [email_address] Advisor: Dr. Zubair

pmuthoju_presentation.ppt


  1. Automatic Document Categorization using Support Vector Machines. Prashanth Kumar Muthoju [email_address] Advisor: Dr. Zubair
  2. Overview <ul><li>Introduction </li></ul><ul><li>Problem </li></ul><ul><li>Proposed Solution </li></ul><ul><li>Improvements </li></ul><ul><li>Results </li></ul><ul><li>Future Work </li></ul><ul><li>Conclusion </li></ul><ul><li>References </li></ul>
  3. Introduction <ul><li>What is Categorization? </li></ul><ul><ul><li>Sorting a set of documents into categories from a predefined set. [ link ] </li></ul></ul><ul><ul><li>Assigning a document to a category based on its contents. </li></ul></ul>
  4. Introduction (cont'd) <ul><li>Types of Categorization: </li></ul><ul><ul><li>Manual </li></ul></ul><ul><ul><li>Automatic (Machine Learning) </li></ul></ul><ul><ul><ul><li>Probabilistic (e.g., Naïve Bayes) </li></ul></ul></ul><ul><ul><ul><li>Decision Structures (e.g., Decision Trees) </li></ul></ul></ul><ul><ul><ul><li>Support Vector Machines (SVM) </li></ul></ul></ul>
  5. Introduction (cont'd) <ul><li>Why automate? </li></ul><ul><ul><li>Manual categorization </li></ul></ul><ul><ul><ul><li>needs a large number of people </li></ul></ul></ul><ul><ul><ul><li>is expensive </li></ul></ul></ul><ul><ul><ul><li>is time-consuming </li></ul></ul></ul>
  6. Introduction (cont'd) <ul><li>Applications of Automatic Categorization: </li></ul><ul><ul><li>Indexing of scientific articles </li></ul></ul><ul><ul><li>Spam filtering of e-mails </li></ul></ul><ul><ul><li>Authorship attribution </li></ul></ul>
  7. Problem <ul><li>The DTIC document base has to be categorized into 25 fields (broad) and 251 groups (narrow). </li></ul><ul><ul><li>The fields/groups are listed at http://www.dtic.mil/trail/fieldgrp.html </li></ul></ul>
  8. Towards the solution <ul><li>Strategy: </li></ul><ul><ul><li>Exploit an existing collection of categorized documents </li></ul></ul><ul><ul><ul><li>One portion is used as the training set </li></ul></ul></ul><ul><ul><ul><li>The other portion is used as the testing set </li></ul></ul></ul><ul><ul><ul><li>This allows the classifier to be tuned for maximum effectiveness </li></ul></ul></ul>
  9. Towards the solution <ul><li>What is a Support Vector Machine? </li></ul><ul><li>A binary classifier </li></ul><ul><ul><li>Finds the line with the largest margin that separates two classes </li></ul></ul><ul><ul><li>Subsequently classifies items based on which side of the line they fall </li></ul></ul>
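As a rough illustration of the "which side of the line" idea, the sketch below classifies 2-D points against a fixed linear boundary. The weights `w` and bias `b` are made-up values chosen for the example; a real SVM would learn them so that the margin is as large as possible.

```python
def side_of_line(w, b, x):
    """Return +1 or -1 depending on which side of the line w.x + b = 0 the point x falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def distance_to_line(w, b, x):
    """Unsigned distance from x to the line; the margin is the smallest such
    distance over the training points."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Hypothetical boundary x1 + x2 - 3 = 0 (not learned from data).
w, b = (1.0, 1.0), -3.0
points = [(0.0, 0.0), (2.0, 2.0), (3.0, 1.0)]
labels = [side_of_line(w, b, p) for p in points]
```

Here the origin lands on the negative side and the other two points on the positive side.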
  10. Towards the solution <ul><li>Why was SVM chosen for Automatic Categorization? </li></ul><ul><ul><li>Prior studies have reported good results with SVM </li></ul></ul><ul><ul><li>Relatively immune to overfitting (fitting to coincidental relations encountered during training) </li></ul></ul>
  11. Towards the solution <ul><li>SVM Library (LibSVM 2.85) </li></ul><ul><li>Java </li></ul>
  12. Solution <ul><li>Before we can train the SVM for a Field/Group using LibSVM, we have to prepare a dataset for that Field/Group. </li></ul><ul><li>Each file is represented by </li></ul><ul><li><label> <feature1>:<value1> <feature2>:<value2> ... </li></ul><ul><li>(Sparse vector representation) </li></ul><ul><li><label> is 1 for a positive file and 0 for a negative file. </li></ul><ul><li>Each <feature>:<value> pair is a <word>:<tfidf> pair. </li></ul><ul><li>(Common words are eliminated before preparing the dataset.) </li></ul>
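The dataset preparation described on this slide can be sketched as follows. This is a minimal illustration, not the presentation's Java implementation. It rests on several assumptions: the tiny stop-word list, raw counts for TF, `log(N / df)` for IDF, and the mapping of each word to an integer index (LibSVM expects ascending integer feature indices) are all choices made for the example.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}  # illustrative only


def tokenize(text):
    """Lower-case, split on whitespace, and drop common words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_vocab_and_idf(docs):
    """Map each word to a 1-based index and compute idf = log(N / df)."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    vocab = {word: i + 1 for i, word in enumerate(sorted(df))}
    idf = {word: math.log(len(docs) / df[word]) for word in df}
    return vocab, idf


def to_sparse_line(label, text, vocab, idf):
    """Format one file as '<label> <index>:<tfidf> ...' with ascending indices."""
    tf = Counter(t for t in tokenize(text) if t in vocab)
    pairs = sorted((vocab[w], count * idf[w]) for w, count in tf.items())
    body = " ".join(f"{i}:{v:.4f}" for i, v in pairs if v != 0.0)
    return f"{label} {body}".rstrip()


docs = [
    "the radar detects aircraft",
    "radar signal processing",
    "soil chemistry of the field",
]
vocab, idf = build_vocab_and_idf(docs)
line = to_sparse_line(1, docs[0], vocab, idf)  # label 1 marks a positive file
```

Words that occur in every document get an IDF of zero and are left out of the line, which keeps the vector sparse.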
  13. Solution <ul><li>The following procedure is repeated for each Field/Group (training phase): </li></ul>[Flowchart; box labels: Collection; Model by Dr. Zeil; Download Documents (PDF); Convert PDF to Text; Model Documents Using TF and IDF; Positive Training Set for Field/Group K; Negative Training Set for Field/Group K; SVM for Field/Group K]
  14. Solution <ul><li>(Testing phase) </li></ul>[Flowchart; box labels: Input Test Document (PDF); Convert PDF to Text; Model Documents Using TF and IDF; Trained SVM for Field/Group 1; Trained SVM for Field/Group K; Trained SVM for Field/Group N] Each trained SVM produces an estimate in the range 0 to 1 indicating how likely it is that Field/Group K applies to the test document.
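The testing phase amounts to scoring one document with every per-Field/Group model and comparing the estimates. The sketch below shows that loop only. The `scorers` are hypothetical stand-ins for trained SVMs (simple lambdas over a sparse `{feature_index: value}` vector), not calls into LibSVM.

```python
def rank_field_groups(scorers, vector):
    """Score one document vector with every Field/Group model, best first."""
    scores = {code: scorer(vector) for code, scorer in scorers.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Stand-in scorers (hypothetical): each pretends to be a trained SVM and
# returns an estimate in [0, 1].
scorers = {
    "140200": lambda v: min(1.0, v.get(1, 0.0)),
    "120200": lambda v: min(1.0, v.get(2, 0.0)),
    "250400": lambda v: min(1.0, v.get(3, 0.0)),
}
ranking = rank_field_groups(scorers, {1: 0.2, 2: 0.9})
best_code, best_score = ranking[0]
```

With a single-label setup, `best_code` would be the category reported for the document.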
  15. Improving the results <ul><li>Scaling the vectors in the datasets </li></ul><ul><ul><li>Brings the <value> in every <feature>:<value> pair into the range 0 to 1 </li></ul></ul>
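One simple way to bring every value into the range 0 to 1 is to divide each feature by the largest value it takes in the training set. The sketch below does that for sparse vectors stored as dictionaries. It assumes non-negative values (true of TF-IDF weights) and treats absent features as zero. The slides do not say which scaling method was actually used, so this is only an illustration.

```python
def fit_feature_max(vectors):
    """Record the largest value seen for each feature in the training vectors."""
    maxima = {}
    for vec in vectors:
        for idx, val in vec.items():
            maxima[idx] = max(maxima.get(idx, 0.0), val)
    return maxima


def scale_vector(vec, maxima):
    """Divide each value by its feature's training maximum, clipping at 1.0.

    Features never seen in training are dropped."""
    scaled = {}
    for idx, val in vec.items():
        top = maxima.get(idx, 0.0)
        if top > 0:
            scaled[idx] = min(val / top, 1.0)
    return scaled


train = [{1: 2.0, 3: 4.0}, {1: 1.0, 2: 0.5}]
maxima = fit_feature_max(train)
scaled_train = [scale_vector(v, maxima) for v in train]
```

The maxima are computed on the training set and reused for test documents so that both are scaled the same way.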
  16. Experiment <ul><li>Five Field/Groups were selected at random: </li></ul><ul><ul><li>140200, 120200, 201300, 220200, 250400. </li></ul></ul><ul><li>For each Field/Group, </li></ul><ul><ul><li>70 PDF files were downloaded. </li></ul></ul><ul><ul><ul><li>50 files were used as positive files for training </li></ul></ul></ul><ul><ul><ul><li>20 files were used for testing </li></ul></ul></ul><ul><ul><li>An additional 50 files were taken randomly from all other Field/Groups as negative files for training. </li></ul></ul>
  17. Experiment <ul><li>Metrics: </li></ul><ul><ul><li>Recall = #Correct Answers / #Total Possible Answers </li></ul></ul><ul><ul><li>Precision = #Correct Answers / #Answers Produced </li></ul></ul>
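  18. Results (rows: actual Field/Group of the 20 test files; columns: Field/Group assigned by the classifier)
<table>
<tr><th></th><th>140200</th><th>120200</th><th>201300</th><th>220200</th><th>250400</th></tr>
<tr><th>140200</th><td>13</td><td>2</td><td>1</td><td>2</td><td>2</td></tr>
<tr><th>120200</th><td>1</td><td>16</td><td>0</td><td>3</td><td>0</td></tr>
<tr><th>201300</th><td>0</td><td>5</td><td>13</td><td>2</td><td>0</td></tr>
<tr><th>220200</th><td>1</td><td>0</td><td>2</td><td>17</td><td>0</td></tr>
<tr><th>250400</th><td>0</td><td>0</td><td>1</td><td>0</td><td>19</td></tr>
</table>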
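  19. Results (cont'd)
<table>
<tr><th>Category</th><th>Precision</th><th>Recall</th></tr>
<tr><td>140200</td><td>0.87</td><td>0.65</td></tr>
<tr><td>120200</td><td>0.70</td><td>0.80</td></tr>
<tr><td>201300</td><td>0.76</td><td>0.65</td></tr>
<tr><td>220200</td><td>0.71</td><td>0.85</td></tr>
<tr><td>250400</td><td>0.90</td><td>0.95</td></tr>
</table>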
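The per-category figures can be recomputed from the confusion matrix on the previous Results slide. The sketch below assumes that each row holds the 20 test files of one Field/Group and each column holds the files the classifier assigned to that Field/Group. This reading is an inference, but it is consistent with the reported recall values.

```python
CATEGORIES = ["140200", "120200", "201300", "220200", "250400"]

# Rows: actual Field/Group; columns: assigned Field/Group (assumed orientation).
CONFUSION = [
    [13, 2, 1, 2, 2],
    [1, 16, 0, 3, 0],
    [0, 5, 13, 2, 0],
    [1, 0, 2, 17, 0],
    [0, 0, 1, 0, 19],
]


def precision_recall(matrix):
    """Return a (precision, recall) pair for each category index."""
    results = []
    for k in range(len(matrix)):
        correct = matrix[k][k]
        possible = sum(matrix[k])                 # total possible answers (row sum)
        produced = sum(row[k] for row in matrix)  # answers produced (column sum)
        results.append((correct / produced, correct / possible))
    return results


metrics = dict(zip(CATEGORIES, precision_recall(CONFUSION)))
```

For example, 140200 has 13 correct answers out of 15 produced and 20 possible, giving a precision of about 0.87 and a recall of 0.65.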
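  20. Future Work <ul><li>Hierarchical Model </li></ul>In the flat model, we consider each field/group to be independent. In the hierarchical model, we consider all files under a branch as positive files for training. <ul><li>150000 </li></ul><ul><ul><li>150300 </li></ul></ul><ul><ul><ul><li>150301 </li></ul></ul></ul><ul><ul><ul><li>150302 </li></ul></ul></ul><ul><ul><li>150600 </li></ul></ul><ul><ul><ul><li>150601 </li></ul></ul></ul><ul><ul><ul><li>150602 </li></ul></ul></ul>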
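Collecting the positive files for a branch could look like the sketch below. It assumes six-digit codes with two digits per level, as in the example tree (150000, 150300, 150301), so that a node's descendants share its prefix once trailing "00" pairs are stripped. Both the coding assumption and the `docs_by_code` mapping are hypothetical.

```python
def branch_prefix(code):
    """Strip trailing '00' pairs: '150000' -> '15', '150300' -> '1503'."""
    prefix = code
    while len(prefix) > 2 and prefix.endswith("00"):
        prefix = prefix[:-2]
    return prefix


def positives_for(code, docs_by_code):
    """Gather the files of the node itself and of every node beneath it."""
    prefix = branch_prefix(code)
    return [
        doc
        for node, docs in sorted(docs_by_code.items())
        if node.startswith(prefix)
        for doc in docs
    ]


# Hypothetical file names keyed by field/group code.
docs_by_code = {
    "150300": ["a.pdf"],
    "150301": ["b.pdf"],
    "150302": ["c.pdf"],
    "150600": ["d.pdf"],
    "150601": ["e.pdf"],
    "140200": ["x.pdf"],
}
branch_files = positives_for("150300", docs_by_code)
```

In the flat model, only the files stored under the exact code would be used as positives.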
  21. Future Work <ul><li>Multi-Label classification </li></ul><ul><ul><li>In practice, a document may belong to multiple field/groups. </li></ul></ul>
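One common way to get multiple labels out of per-category estimates is to keep every field/group whose score clears a threshold. The sketch below shows that idea. The threshold of 0.5, the fallback to the single best category, and the example scores are all assumptions, since the slides leave the multi-label method open.

```python
def assign_field_groups(scores, threshold=0.5):
    """Return every field/group scoring at least `threshold`, best first.

    Falls back to the single highest-scoring field/group if none qualifies."""
    chosen = [code for code, score in scores.items() if score >= threshold]
    if not chosen:
        chosen = [max(scores, key=scores.get)]
    return sorted(chosen, key=lambda code: scores[code], reverse=True)


# Hypothetical estimates for one test document.
scores = {"140200": 0.62, "120200": 0.81, "201300": 0.10}
labels = assign_field_groups(scores)
```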
  22. Conclusion <ul><li>Classification of DTIC documents by Field/Group gave encouraging results: on the five Field/Groups tested, precision ranged from 0.70 to 0.90 and recall from 0.65 to 0.95. </li></ul><ul><li>Ways to improve the results, such as scaling the vectors, have been identified. </li></ul><ul><li>Two directions were suggested for future work: a hierarchical model and multi-label classification. </li></ul>
  23. References <ul><li>Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), pp. 1-47. </li></ul><ul><li>Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. ( http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf ) </li></ul><ul><li>Kwok, J.T. (1998). Automated text categorization using support vector machine. In Proceedings of the International Conference on Neural Information Processing, Kitakyushu, Japan, Oct. 1998, pp. 347-351. </li></ul>