pmuthoju_presentation.ppt
    Presentation Transcript

    • Automatic Document Categorization using Support Vector Machines
      • Prashanth Kumar Muthoju
      • Advisor: Dr. Zubair
    • Overview
      • Introduction
      • Problem
      • Proposed Solution
      • Improvements
      • Results
      • Future Work
      • Conclusion
      • References
    • Introduction
      • What is Categorization
        • Sorting a set of documents into categories from a predefined set.
        • Assigning a document to a category based on its contents.
    • Introduction (cont'd)
      • Types of categorization:
        • Manual
        • Automatic (Machine Learning)
          • Probabilistic (e.g., Naïve Bayesian)
          • Decision Structures (e.g., Decision Trees)
          • Support Vector Machines (SVM)
    • Introduction (cont'd)
      • Why ‘Automation’?
        • Manual categorization
          • needs a large number of people
          • is expensive
          • is time-consuming
    • Introduction (cont'd)
      • Applications of Automatic Categorization:
        • Indexing of scientific articles
        • Spam filtering of e-mails
        • Authorship attribution
    • Problem
      • The DTIC (Defense Technical Information Center) document base has to be categorized into 25 broad fields and 251 narrow groups.
        • The Fields/Groups are listed at http://www.dtic.mil/trail/fieldgrp.html
    • Towards the solution ..
      • Strategy:
        • Exploit an existing collection with categorized documents
          • One portion is used as the training set
          • The other portion is used as the testing set
          • Allow tuning of the classifier to yield maximum effectiveness
    • Towards the solution ..
      • What is a Support Vector Machine?
      • Binary classifier
        • Finds the line (hyperplane) with the largest margin separating two classes
        • Subsequently classifies items based on which side of the line they fall
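The idea of classifying by "which side of the line" can be sketched as follows. This is a minimal illustration that assumes a linear SVM has already been trained; the weights `w` and bias `b` are made-up values, not taken from the presentation.

```python
import math

# Hypothetical, hand-picked parameters of an already-trained linear SVM.
w = [2.0, -1.0]   # normal vector of the separating line
b = -1.0          # offset of the line

def decision_value(x):
    """Signed score w.x + b; its sign tells which side of the line x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x):
    """Return +1 for the positive side of the line, -1 otherwise."""
    return 1 if decision_value(x) >= 0 else -1

def margin_width():
    """Margin width 2 / ||w|| for a hyperplane in canonical SVM form."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

print(classify([2.0, 1.0]))   # 2*2 - 1*1 - 1 = 2, positive side
print(classify([0.0, 1.0]))   # 0 - 1 - 1 = -2, negative side
```

Training an SVM amounts to choosing `w` and `b` so that this margin is as wide as possible while the training items stay on their correct sides.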
    • Towards the solution ..
      • Why is SVM chosen for Automatic Categorization?
        • Prior studies have suggested good results with SVM
        • Relatively immune to overfitting (fitting to coincidental relations encountered during training).
    • Towards the solution ..
      • SVM Library (LibSVM 2.85)
      • Java
    • Solution
      • Before we can train the SVM for a Field/Group using LibSVM, we have to prepare a dataset for that Field/Group.
      • Each file is represented by one line in sparse vector representation:
        • <label> <feature1>:<value1> <feature2>:<value2> ...
      • <label> is 1 for a positive file and 0 for a negative file.
      • Each <feature>:<value> pair is a <word>:<tfidf> pair.
      • Common words are eliminated before the dataset is prepared.
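The dataset preparation could look roughly like the sketch below. Several details are assumptions, since the slides do not spell them out: the stop-word list is a tiny illustrative one, TF is taken as relative frequency, IDF as log(N / df), and each word is mapped to a 1-based integer index because LibSVM's file format expects numeric feature indices in ascending order.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}  # illustrative only

def tokenize(text):
    """Lower-case, split on whitespace, and drop common words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_vocabulary(docs):
    """Map each word to a 1-based integer feature index."""
    words = sorted({w for d in docs for w in tokenize(d)})
    return {w: i + 1 for i, w in enumerate(words)}

def tfidf_vector(doc, docs, vocab):
    """Sparse {index: tf * idf} for a doc that is itself a member of docs."""
    tokens = tokenize(doc)
    if not tokens:
        return {}
    vec = {}
    for word, count in Counter(tokens).items():
        df = sum(1 for d in docs if word in tokenize(d))  # document frequency
        value = (count / len(tokens)) * math.log(len(docs) / df)
        if value > 0:  # keep the vector sparse
            vec[vocab[word]] = value
    return vec

def to_libsvm_line(label, vec):
    """Format one file as '<label> <index>:<value> ...' with ascending indices."""
    pairs = " ".join(f"{i}:{v:.4f}" for i, v in sorted(vec.items()))
    return f"{label} {pairs}".rstrip()

# Toy corpus (hypothetical text); label 1 = positive file, 0 = negative file.
docs = ["the radar antenna design", "antenna of the ship", "ship hull design"]
labels = [1, 1, 0]
vocab = build_vocabulary(docs)
for label, doc in zip(labels, docs):
    print(to_libsvm_line(label, tfidf_vector(doc, docs, vocab)))
```

Words that occur in every document get an IDF of zero and drop out of the vector, which has an effect similar to stop-word removal.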
    • Solution
      • For each Field/Group, the following procedure is repeated (training phase):
      [Diagram, box labels: Collection Model by Dr. Zeil; Download Documents (PDF); Convert PDF to Text; Model Documents Using TF and IDF; Positive Training Set for Field/Group K; Negative Training Set for Field/Group K; SVM for Field/Group K]
    • Solution
      • (Testing phase)
      [Diagram, box labels: Input Test Document (PDF); Convert PDF to Text; Model Documents Using TF and IDF; Trained SVM for Field/Group 1; Trained SVM for Field/Group K; Trained SVM for Field/Group N. Output: an estimate in the range 0 to 1 indicating how likely Field/Group K maps to the test document.]
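The testing phase can be pictured as running one test document through every per-Field/Group classifier and comparing the estimates. The sketch below assumes each trained SVM is wrapped in a function that returns an estimate between 0 and 1; the `scorers` shown are stand-ins with fixed outputs, and `rank_field_groups` is a hypothetical helper name.

```python
def rank_field_groups(document_vector, scorers):
    """Score one document with every Field/Group scorer, best estimate first."""
    estimates = {code: scorer(document_vector) for code, scorer in scorers.items()}
    return sorted(estimates.items(), key=lambda item: item[1], reverse=True)

# Stand-in scorers; a real one would wrap a trained SVM's 0-to-1 estimate.
scorers = {
    "140200": lambda vec: 0.15,
    "120200": lambda vec: 0.80,
    "201300": lambda vec: 0.35,
}

# Hypothetical sparse TF-IDF vector of a test document.
ranking = rank_field_groups({1: 0.4, 7: 0.9}, scorers)
print(ranking[0])  # ('120200', 0.8)
```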
    • Improving the results
      • Scaling the vectors in the datasets
        • Rescales the <value>s in the <feature>:<value> pairs to lie between 0 and 1
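One way to bring the values into the range 0 to 1 is sketched below. It assumes non-negative TF-IDF values and divides each feature by its maximum over the training set, which keeps the vectors sparse. The slides do not say which scaling rule was used, so this is an illustration only, and the function names are hypothetical.

```python
def fit_feature_maxima(training_vectors):
    """Record the largest value seen for each feature index in the training set."""
    maxima = {}
    for vec in training_vectors:
        for index, value in vec.items():
            maxima[index] = max(maxima.get(index, 0.0), value)
    return maxima

def scale_vector(vec, maxima):
    """Divide each value by its training maximum, clamping to at most 1."""
    scaled = {}
    for index, value in vec.items():
        top = maxima.get(index, 0.0)
        if top > 0:  # features never seen in training are dropped
            scaled[index] = min(value / top, 1.0)
    return scaled

train = [{1: 0.2, 3: 4.0}, {1: 0.5, 2: 1.5}]
maxima = fit_feature_maxima(train)
print([scale_vector(v, maxima) for v in train])
```

The same `maxima` learned from the training set should be reused when scaling test documents, so that training and test vectors share one scale.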
    • Experiment
      • Five Field/Groups were selected at random:
        • 140200, 120200, 201300, 220200, 250400.
      • For each Field/Group:
        • 70 PDF files were downloaded.
          • 50 files were used as positive files for training.
          • 20 files were used for testing.
        • An additional 50 files were taken at random from all other Field/Groups as negative files for training.
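The split described above can be sketched as follows. The file names and the fixed random seed are assumptions made for reproducibility of the example, and `build_split` is a hypothetical helper.

```python
import random

def build_split(category_files, other_files, n_train=50, n_test=20,
                n_negative=50, seed=0):
    """Return (positive training files, negative training files, test files)."""
    rng = random.Random(seed)
    chosen = rng.sample(category_files, n_train + n_test)
    positives, test = chosen[:n_train], chosen[n_train:]
    negatives = rng.sample(other_files, n_negative)
    return positives, negatives, test

# Hypothetical file names for one Field/Group and for all other Field/Groups.
category_files = [f"140200_{i}.pdf" for i in range(70)]
other_files = [f"other_{i}.pdf" for i in range(500)]

positives, negatives, test = build_split(category_files, other_files)
print(len(positives), len(negatives), len(test))  # 50 50 20
```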
    • Experiment
      • Metric:
        • Recall = #Correct Answers / #Total Possible Answers
        • Precision = #Correct Answers / #Answers Produced
    • Results
      (rows: actual Field/Group, 20 test files each; columns: assigned Field/Group)
                 140200  120200  201300  220200  250400
      140200         13       2       1       2       2
      120200          1      16       0       3       0
      201300          0       5      13       2       0
      220200          1       0       2      17       0
      250400          0       0       1       0      19
    • Results (cont'd)
      Category  Precision  Recall
      140200         0.87    0.65
      120200         0.70    0.80
      201300         0.76    0.65
      220200         0.71    0.85
      250400         0.90    0.95
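The precision and recall figures can be recomputed from the result counts. The sketch below reads the rows as the actual Field/Group and the columns as the assigned Field/Group. That reading is an inference, although it is consistent with each row summing to the 20 test files and with the reported values.

```python
CATEGORIES = ["140200", "120200", "201300", "220200", "250400"]

# Rows: actual Field/Group; columns: assigned Field/Group (inferred layout).
CONFUSION = [
    [13, 2, 1, 2, 2],
    [1, 16, 0, 3, 0],
    [0, 5, 13, 2, 0],
    [1, 0, 2, 17, 0],
    [0, 0, 1, 0, 19],
]

def precision_recall(matrix):
    """Return a (precision, recall) pair for each category index."""
    results = []
    for k in range(len(matrix)):
        correct = matrix[k][k]
        total_possible = sum(matrix[k])                    # row sum
        answers_produced = sum(row[k] for row in matrix)   # column sum
        results.append((correct / answers_produced, correct / total_possible))
    return results

for code, (p, r) in zip(CATEGORIES, precision_recall(CONFUSION)):
    print(f"{code}  precision={p:.2f}  recall={r:.2f}")
```

For example, 140200 was assigned to 15 documents in total (13 + 1 + 0 + 1 + 0), of which 13 were correct, giving a precision of 13/15, or about 0.87.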
    • Future Work
      • Hierarchical model
        • In the flat model, each field/group is considered independently.
        • In the hierarchical model, all files under a branch are considered positive files for training.
      [Diagram: example hierarchy with nodes 150000, 150300, 150600, 150301, 150302, 150601, 150602]
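A small sketch of how positive files might be gathered per branch in the hierarchical model. The parent/child shape of the tree is an assumption read off the node codes in the diagram, and the file names are hypothetical.

```python
# Parent -> children; the tree shape is inferred from the node codes.
CHILDREN = {
    "150000": ["150300", "150600"],
    "150300": ["150301", "150302"],
    "150600": ["150601", "150602"],
}

def branch_nodes(node):
    """Return the node itself followed by all of its descendants."""
    nodes = [node]
    for child in CHILDREN.get(node, []):
        nodes.extend(branch_nodes(child))
    return nodes

def positive_files(node, files_by_node):
    """All files filed anywhere under the branch count as positives for node."""
    return [f for n in branch_nodes(node) for f in files_by_node.get(n, [])]

# Hypothetical files attached to the leaf groups.
files_by_node = {
    "150301": ["a.pdf"],
    "150302": ["b.pdf", "c.pdf"],
    "150601": ["d.pdf"],
}

print(positive_files("150300", files_by_node))  # ['a.pdf', 'b.pdf', 'c.pdf']
print(len(positive_files("150000", files_by_node)))  # 4
```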
    • Future Work
      • Multi-Label classification
        • In practice, a document may belong to multiple field/groups.
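One simple way to allow several labels per document is to keep every Field/Group whose estimate clears a threshold. This is only a sketch of that idea: the threshold of 0.5 and the example estimates are assumptions, not values from the presentation.

```python
def assign_labels(estimates, threshold=0.5):
    """Return every Field/Group whose estimate reaches the threshold."""
    return sorted(code for code, score in estimates.items() if score >= threshold)

# Hypothetical per-Field/Group estimates for one test document.
estimates = {"140200": 0.72, "120200": 0.64, "201300": 0.10}
print(assign_labels(estimates))  # ['120200', '140200']
```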
    • Conclusion
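      • The classification results of DTIC documents based on Field/Groups were impressive: on the five sampled Field/Groups, precision ranged from 0.70 to 0.90 and recall from 0.65 to 0.95.
      • A way to improve the results has been identified: scaling the feature vectors.
      • Two suggestions were given for future work in this area: a hierarchical model and multi-label classification.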
    • References
      • Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, Vol. 34(1), pp. 1-47.
      • Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. ( http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf )
      • Kwok, J.T. (1998). Automated text categorization using support vector machine. In Proceedings of the International Conference on Neural Information Processing, Kitakyushu, Japan, Oct. 1998, pp. 347-351.