Aggressive feature selection for text categorization
Paper review at Machine Learning :: Reading Group, 2/2014

Presentation Transcript

    • Aggressive feature selection for text categorization. E. Gabrilovich and S. Markovitch, "Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5," 21st International Conference on Machine Learning, ACM, 2004. Presented by Hershel Safer at the Machine Learning :: Reading Group Meetup on 12 February 2014.
    • Results. Key result: they introduce a measure of the redundancy of words in a collection of documents that predicts whether feature selection will improve categorization of the documents. Also: a method to generate labeled datasets for testing text-categorization algorithms (previous work), and a platform for testing text-categorization algorithms.
    • Background: Text categorization. Text categorization: given a set of natural-language documents and a set of labels, assign one or more labels to each document. Most algorithms treat a document as a collection of words, with each word as a feature; so even modest collections have thousands or tens of thousands of features. For such high-dimensional problems, feature selection is often used to reduce noise and avoid overfitting.
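    The word-as-feature representation described on this slide can be sketched in a few lines. This is a minimal illustration with hypothetical toy documents, not the paper's actual preprocessing pipeline:

    ```python
    from collections import Counter

    def bag_of_words(documents):
        """Represent each document as a word-count vector over the shared vocabulary.

        Every distinct word in the collection becomes one feature dimension,
        which is why even modest collections yield thousands of features.
        """
        tokenized = [doc.lower().split() for doc in documents]
        vocabulary = sorted({word for doc in tokenized for word in doc})
        vectors = []
        for doc in tokenized:
            counts = Counter(doc)
            vectors.append([counts[word] for word in vocabulary])
        return vocabulary, vectors

    # Hypothetical toy documents:
    docs = ["the cat sat", "the dog sat on the mat"]
    vocab, vecs = bag_of_words(docs)
    ```

    Even these two tiny documents already span a six-word feature space; real collections blow this up to tens of thousands of dimensions, which is what motivates feature selection.
    
    
    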
    • Background: Feature selection. Use various methods to measure how well specific words discriminate between categories: information gain (IG), chi-squared, bi-normal separation, document frequency, etc. Feature selection: choose the most informative features using a score cutoff or a fixed percentage of the top-scoring features. Previous work on standard document collections found that even words with low discriminative power improved classification. Question asked by this work: when does aggressive feature selection (using ~1% of the words in the collection) improve text categorization?
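    A sketch of IG scoring and top-percentage selection for the two-category case discussed on the slide. The function and parameter names are mine, and the contingency-table formulation is the standard textbook one rather than anything quoted from the paper:

    ```python
    import math

    def entropy(pos, neg):
        """Binary entropy of a class split with `pos` and `neg` documents."""
        total = pos + neg
        h = 0.0
        for n in (pos, neg):
            if n:
                p = n / total
                h -= p * math.log2(p)
        return h

    def information_gain(n11, n10, n01, n00):
        """IG of a word for a 2-category problem.

        n11: category-1 docs containing the word, n10: category-1 docs without it,
        n01: category-2 docs containing the word, n00: category-2 docs without it.
        """
        total = n11 + n10 + n01 + n00
        base = entropy(n11 + n10, n01 + n00)
        with_word = n11 + n01
        without_word = n10 + n00
        conditional = (with_word / total) * entropy(n11, n01) \
                    + (without_word / total) * entropy(n10, n00)
        return base - conditional

    def select_top_features(ig_by_word, fraction=0.01):
        """Aggressive selection: keep the top `fraction` of words by IG score."""
        ranked = sorted(ig_by_word, key=ig_by_word.get, reverse=True)
        k = max(1, int(len(ranked) * fraction))
        return ranked[:k]
    ```

    A perfectly discriminating word scores IG = 1 bit, a word distributed evenly across both categories scores 0; "aggressive" selection at `fraction=0.01` keeps roughly the top 1% of words, matching the slide's ~1% figure.
    
    
    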
    • The data. The data consist of 100 datasets created from Web directories, each containing documents from 2 categories. The categorization difficulty ranges from very easy to very hard. Baseline accuracy of categorization using SVM is fairly uniformly distributed between 0.6 and 0.92.
    • Distribution of IG and effect of feature selection. The key is not the level of the IG values but rather their rate of decrease. For a dataset D with feature set F, the Outlier Count (OC) is the number of features whose IG is at least 3 standard deviations above the mean IG.
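    The slide's formula for the Outlier Count did not survive extraction; the sketch below assumes the straightforward reading of the prose definition (count features with IG at least 3 standard deviations above the mean), with the population standard deviation as an assumption on my part:

    ```python
    import statistics

    def outlier_count(ig_scores, k=3.0):
        """Outlier Count: number of features whose IG score is at least
        `k` standard deviations above the mean IG.

        Assumption: population standard deviation; the paper's exact
        convention is not recoverable from this transcript.
        """
        mean = statistics.mean(ig_scores)
        std = statistics.pstdev(ig_scores)
        return sum(1 for s in ig_scores if s >= mean + k * std)
    ```

    Intuitively, a large OC means many strongly discriminative words dominate the IG distribution, which is the regime where the slide reports that aggressive feature selection stops helping.
    
    
    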
    • Effect of Outlier Count on SVM accuracy. OC has a strong negative correlation with the improvement in SVM accuracy that results from aggressive feature selection. Studies that found no benefit from aggressive feature selection used datasets with a very large OC.
    • Choosing classifier and feature-selection methods. Using feature selection may affect the choice of classifier method. Different feature-selection methods give different results; they report information gain, chi-squared, and bi-normal separation as best.