  1. A Survey on Text Categorization with Machine Learning
     Chikayama lab. Dai Saito
  2. Introduction: Text Categorization
     - Many digital texts are available
       - E-mail, online news, blogs, ...
     - The need for automatic text categorization is increasing
       - No human labor is required
       - Saves time and cost
  3. Introduction: Text Categorization
     - Applications
       - Spam filtering
       - Topic categorization
  4. Introduction: Machine Learning
     - Builds categorization rules automatically from features of the text
     - Types of machine learning (ML)
       - Supervised learning
         - Labeling
       - Unsupervised learning
         - Clustering
  5. Introduction: Flow of ML
     - Prepare labeled training text data
       - Extract features of the text
     - Learn
     - Categorize new texts
     (Diagram: training texts under Label1 and Label2, and a new text marked "?")
  6. Outline
     - Introduction
     - Text Categorization
     - Feature of Text
     - Learning Algorithm
     - Conclusion
  7. Number of labels
     - Binary labels
       - True or false (e.g., spam or not)
       - The other label types can be reduced to this case
     - Multi-label
       - Many labels, but each text gets exactly one
     - Overlapping labels
       - One text can carry several labels
     (Diagrams: Yes/No; one label chosen from L1-L4; several labels from L1-L4)
  8. Types of labels
     - Topic categorization
       - The basic task
       - Compares individual words
     - Author categorization
     - Sentiment categorization
       - e.g., reviews of products
       - Needs more linguistic information
  9. Outline
     - Introduction
     - Text Categorization
     - Feature of Text
     - Learning Algorithm
     - Conclusion
  10. Feature of Text
     - How do we express the features of a text?
       - "Bag of words"
         - Ignores the order of words
       - Structure
         - e.g., "I like this car." vs. "I don't like this car."
           - "Bag of words" will not work well here
       - (d: document = text, t: term = word)
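The "bag of words" idea, and why it struggles with the example above, can be sketched in a few lines of Python (the function name and example sentences are ours, not from the survey):

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts, ignoring word order."""
    return Counter(text.lower().split())

# The two example sentences differ in a single token ("don't"),
# so their bags are nearly identical even though the sentiments
# are opposite.
pos = bag_of_words("I like this car")
neg = bag_of_words("I don't like this car")
shared = pos & neg  # multiset intersection: everything but "don't"
```

Because the representation discards word order and negation scope, a classifier sees these two sentences as almost the same vector.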
  11. Preprocessing
     - Remove stop words
       - "the", "a", "for", ...
     - Stemming
       - relational -> relate, truly -> true
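A minimal preprocessing pipeline might look like the following sketch. The stop-word list and suffix rules are toy placeholders; a real system would use a proper stemmer such as the Porter stemmer, which produces the slide's examples (relational -> relate, truly -> true) more faithfully:

```python
STOP_WORDS = {"the", "a", "for", "of", "and", "to"}

def stem(word):
    """Crude suffix stripping; real systems use the Porter stemmer."""
    for suffix in ("ational", "ly", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, then stem each remaining token."""
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return [stem(w) for w in tokens]
```

For example, `preprocess("the cars")` drops "the" and strips the plural "s", leaving `["car"]`.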
  12. Term Weighting
     - Term frequency (tf)
       - The number of occurrences of a term in a document
       - Terms that are frequent in a document seem important for categorization
     - tf·idf
       - Terms appearing in many documents are not useful for categorization
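The tf·idf weighting described above can be sketched directly, using the common tf(t, d) · log(N / df(t)) form (the survey does not specify which tf·idf variant is meant, so this is one standard choice):

```python
import math

def tf_idf(docs):
    """Weight each term by tf(t, d) * log(N / df(t))."""
    n = len(docs)
    df = {}  # document frequency: in how many docs does each term occur?
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in doc:
            w[term] = w.get(term, 0) + 1  # raw term frequency
        for term in w:
            w[term] *= math.log(n / df[term])  # discount common terms
        weights.append(w)
    return weights

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["cherry", "durian"]]
w = tf_idf(docs)
```

Here "banana" appears in two of the three documents, so its idf factor is smaller than that of "apple", which appears in only one.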
  13. Sentiment Weighting
     - For sentiment classification, weight each word as positive or negative
     - Constructing a sentiment dictionary
     - WordNet [04 Kamps et al.]
       - A synonym database
       - Uses the distance from 'good' and from 'bad'
       - Ex) d(good, happy) = 2, d(bad, happy) = 4
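The distance in question is a shortest-path length in the synonym graph. Below is a sketch using breadth-first search over a tiny hand-made graph that stands in for WordNet (the graph, including the filler word "low", is invented so that the slide's distances come out: d(good, happy) = 2, d(bad, happy) = 4):

```python
from collections import deque

# Toy synonym graph standing in for WordNet (illustrative only).
GRAPH = {
    "good": ["fine"],
    "fine": ["good", "happy"],
    "happy": ["fine", "glad"],
    "glad": ["happy", "low"],
    "low": ["glad", "sad"],
    "sad": ["low", "bad"],
    "bad": ["sad"],
}

def distance(a, b):
    """Shortest-path length between two words via breadth-first search."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        word, d = queue.popleft()
        if word == b:
            return d
        for nxt in GRAPH.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # not connected

# A word is scored positive when it lies closer to 'good' than to 'bad'.
```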
  14. Dimension Reduction
     - The size of the feature matrix is (#terms) x (#documents)
       - #terms ≈ size of the dictionary
       - High computational cost
       - Risk of overfitting
         - Best for the training data ≠ best for real data
     - Choosing effective features
       - Improves accuracy and computational cost
  15. Dimension Reduction
     - df-threshold
       - Terms appearing in very few documents (e.g., only one) are not important
     - Score
       - If t and c_j are independent, the score equals zero
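The slide's score formula did not survive extraction, but the stated property (zero when term t and category c_j are independent) matches the chi-square feature-selection statistic, a common choice in this literature. As an assumption-labeled sketch, here is the chi-square score computed from the 2x2 term/category contingency table:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score of a (term, category) pair.

    n11: docs in the category that contain the term
    n10: docs outside the category that contain the term
    n01: docs in the category without the term
    n00: docs outside the category without the term
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0
```

When the term occurs at the same rate inside and outside the category (independence), the cross term n11*n00 - n10*n01 vanishes and the score is zero; strongly category-correlated terms score high and are kept.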
  16. Outline
     - Introduction
     - Text Categorization
     - Feature of Text
     - Learning Algorithm
     - Conclusion
  17. Learning Algorithm
     - Many (almost all?) algorithms have been used for text categorization
       - Simple approaches
         - Naïve Bayes
         - k-Nearest Neighbor
       - High-performance approaches
         - Boosting
         - Support Vector Machine
       - Hierarchical learning
  18. Naïve Bayes
     - Bayes' rule: P(c|d) ∝ P(c) P(d|c)
     - P(d|c) is hard to calculate directly
     - Assumption: each term occurs independently
       - P(d|c) ≈ Π_t P(t|c)
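Under the independence assumption, the classifier picks the class maximizing log P(c) + Σ_t log P(t|c). A minimal multinomial Naive Bayes sketch, with add-one smoothing so unseen terms do not zero out the product (the class name and toy spam/ham data are ours):

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)                 # class counts for P(c)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)           # term counts per class
        self.vocab = {t for doc in docs for t in doc}
        return self

    def predict(self, doc):
        def log_score(c):
            total = sum(self.counts[c].values())
            s = math.log(self.prior[c])
            for t in doc:
                # P(t|c) with add-one smoothing over the vocabulary
                s += math.log((self.counts[c][t] + 1) / (total + len(self.vocab)))
            return s
        return max(self.classes, key=log_score)

nb = NaiveBayes().fit(
    [["cheap", "pills"], ["meeting", "today"], ["cheap", "offer"]],
    ["spam", "ham", "spam"],
)
```

Working in log space avoids numeric underflow when documents contain many terms.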
  19. k-Nearest Neighbor
     - Define a "distance" (similarity) between two texts
       - Ex) Sim(d1, d2) = d1 · d2 / (|d1| |d2|) = cos θ
     - Find the k most similar texts and categorize by majority vote
     - The larger the training data, the higher the memory and search costs
     (Diagram: vectors d1 and d2 at angle θ; a k = 3 neighborhood)
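The cosine similarity and majority vote above can be sketched as follows, with documents as term-frequency dicts (the helper names and toy sports/politics data are ours):

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity between two term-frequency dicts."""
    dot = sum(d1[t] * d2.get(t, 0) for t in d1)
    norm = (math.sqrt(sum(v * v for v in d1.values()))
            * math.sqrt(sum(v * v for v in d2.values())))
    return dot / norm if norm else 0.0

def knn_predict(train, query, k=3):
    """Majority vote among the k training texts most similar to query."""
    ranked = sorted(train, key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [
    ({"goal": 2, "match": 1}, "sports"),
    ({"match": 1, "team": 2}, "sports"),
    ({"vote": 2, "party": 1}, "politics"),
]
```

Note how prediction scans the whole training set, which is exactly the memory and search cost the slide warns about.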
  20. Boosting
     - BoosTexter [00 Schapire et al.]
     - AdaBoost
       - Builds many "weak learners" with different parameters
       - The k-th weak learner checks the performance of learners 1..k-1 and tries to correctly classify the training examples they scored worst on
       - BoosTexter uses decision stumps as its weak learners
  21. Simple example of Boosting
     (Diagram: three rounds of weak learners splitting a row of +/- training points)
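The rounds in the diagram can be reproduced with a minimal AdaBoost over 1-D decision stumps. This is a generic sketch, not BoosTexter itself; in AdaBoost the "focus on the worst-scored examples" happens through example reweighting, as in the update below:

```python
import math

def train_stump(xs, ys, ws):
    """Best weighted threshold classifier sign * (x >= thr ? +1 : -1)."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if sign * (1 if x >= thr else -1) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, rounds=3):
    """Each round fits a stump, then upweights misclassified examples."""
    n = len(xs)
    ws = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = train_stump(xs, ys, ws)
        err = max(err, 1e-10)                      # avoid log(1/0)
        alpha = 0.5 * math.log((1 - err) / err)    # stump's vote weight
        ensemble.append((alpha, thr, sign))
        ws = [w * math.exp(-alpha * y * sign * (1 if x >= thr else -1))
              for x, y, w in zip(xs, ys, ws)]
        total = sum(ws)
        ws = [w / total for w in ws]               # renormalize
    return ensemble

def predict(ensemble, x):
    s = sum(alpha * sign * (1 if x >= thr else -1)
            for alpha, thr, sign in ensemble)
    return 1 if s >= 0 else -1
```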
  22. Support Vector Machine
     - Text categorization with SVM [98 Joachims]
     - Maximize the margin
  23. Text Categorization with SVM
     - SVM works well for text categorization
       - Robust in high-dimensional spaces
         - Robust against overfitting
       - Most text categorization problems are linearly separable
         - All of OHSUMED (MEDLINE collection)
         - Most of Reuters-21578 (news collection)
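Joachims' work used a dedicated SVM solver; as a hedged sketch of the margin-maximization idea only, here is a Pegasos-style stochastic subgradient solver for a linear SVM on hinge loss (the parameter values and toy spam/ham word-count data are our own choices):

```python
import random

def train_linear_svm(data, dim, lam=0.01, epochs=200, seed=0):
    """Minimize lam/2 * |w|^2 + average hinge loss by SGD."""
    rng = random.Random(seed)
    data = list(data)
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)                      # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [(1 - eta * lam) * wi for wi in w]     # regularization shrink
            if margin < 1:                             # margin violator:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

# Toy documents as [count of "cheap", count of "meeting"]; spam = +1.
data = [([2, 0], 1), ([1, 0], 1), ([0, 2], -1), ([0, 1], -1)]
w = train_linear_svm(data, dim=2)
```

Only examples inside the margin pull on w, which is the SVM intuition that a few "support" documents determine the separating hyperplane.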
  24. Comparison of these methods
     - [02 Sebastiani]
     - Reuters-21578 (2 versions)
       - Difference: number of categories

     | Method      | Ver.1 (90) | Ver.2 (10) |
     |-------------|------------|------------|
     | SVM         | .920       | .870       |
     | Boosting    | .878       | -          |
     | Naïve Bayes | .795       | .815       |
     | k-NN        | .860       | .823       |
  25. Hierarchical Learning
     - TreeBoost [06 Esuli et al.]
       - A boosting algorithm for hierarchical labels
       - Training data: a label hierarchy and texts with labels
       - Applies AdaBoost recursively
       - A better classifier than 'flat' AdaBoost
         - Accuracy: 2-3% higher
         - Time: both training and categorization time decrease
     - Hierarchical SVM [04 Cai et al.]
  26. TreeBoost
     (Diagram: a label tree; root with children L1-L4, L1 with children L11 and L12, L4 with children L41-L43, and L42 with children L421 and L422)
  27. Outline
     - Introduction
     - Text Categorization
     - Feature of Text
     - Learning Algorithm
     - Conclusion
  28. Conclusion
     - Overview of text categorization with machine learning
       - Features of text
       - Learning algorithms
     - Future work
       - Natural language processing with machine learning, especially for Japanese
       - Computational cost