A Survey on Text Categorization with Machine Learning Chikayama lab. Dai Saito
Introduction: Text Categorization. Many digital texts are available: e-mail, online news, blogs, and so on. The need for automatic text categorization, without relying on human effort, is increasing because of its merits in time and cost.
Introduction: Text Categorization. Applications: spam filtering, topic categorization.
Introduction: Machine Learning. Categorization rules are built automatically from features of the text. Types of machine learning (ML): supervised learning (labeling) and unsupervised learning (clustering).
Introduction: Flow of ML. Prepare labeled training texts, extract features of the text, learn a classifier, and then categorize new texts (e.g., decide whether an unseen text belongs to Label1 or Label2).
Outline: Introduction, Text Categorization, Feature of Text, Learning Algorithm, Conclusion.
Number of labels. Binary-label: true or false (e.g., spam or not); techniques for this case can be applied to the other types. Multi-label: there are many labels, but each text has exactly one. Overlapping-label: one text can have several labels. (Figure: Yes/No for binary; one of L1-L4 for multi-label; several of L1-L4 for overlapping labels.)
Types of labels. Topic categorization is the basic task and can be handled by comparing individual words. Author categorization and sentiment categorization (e.g., reviews of products) need more linguistic information.
Outline: Introduction, Text Categorization, Feature of Text, Learning Algorithm, Conclusion.
Feature of Text. How can the features of a text be expressed? "Bag of Words": the order of words is ignored (d: document = text, t: term = word). When structure matters, e.g., "I like this car." vs. "I don't like this car.", bag of words will not work well.
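As a minimal sketch of the bag-of-words idea (not part of the original slides), a document can be mapped to term counts, which makes the loss of word order visible:

```python
from collections import Counter

def bag_of_words(document):
    """Map a document to term counts, ignoring word order."""
    return Counter(document.lower().replace(".", "").split())

# The two example sentences differ only in the extra token "don't":
print(bag_of_words("I like this car."))
print(bag_of_words("I don't like this car."))
```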
Preprocessing. Remove stop words: "the", "a", "for", ... Stemming: relational -> relate, truly -> true.
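A minimal preprocessing sketch; the stop-word list and the suffix-stripping rules below are illustrative stand-ins rather than a real stemmer such as Porter's:

```python
STOP_WORDS = {"the", "a", "for", "this", "i", "are"}   # illustrative, not exhaustive

def naive_stem(word):
    """Very rough suffix stripping, e.g. 'relational' -> 'relate', 'truly' -> 'true'."""
    for suffix, replacement in (("ional", "e"), ("uly", "ue"), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word

def preprocess(document):
    tokens = [w.strip(".,!?").lower() for w in document.split()]
    return [naive_stem(w) for w in tokens if w and w not in STOP_WORDS]

print(preprocess("The relational databases are truly useful for this task."))
# -> ['relate', 'database', 'true', 'useful', 'task']
```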
Term Weighting. Term frequency: the number of times a term occurs in a document; terms that are frequent in a document seem important for categorization. tf·idf: terms appearing in many documents are not useful for categorization, so term frequency is combined with inverse document frequency.
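The slide does not spell out the formula; one common tf·idf variant (given here as an assumption) is, for term t in document d, with N documents in total and df(t) the number of documents containing t:

    w(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}

A term that is frequent in d but appears in few documents overall receives a high weight.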
Sentiment Weighting. For sentiment classification, weight a word as positive or negative by constructing a sentiment dictionary. WordNet [04 Kamps et al.], a synonym database, is used to measure the distance of a word from 'good' and 'bad': for example, d(good, happy) = 2 and d(bad, happy) = 4, so 'happy' is closer to 'good'.
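One way to turn these WordNet path distances into a single orientation score, roughly following Kamps et al. (stated here as an assumption, since the slide only shows the distances):

    \mathrm{EVA}(w) = \frac{d(w, \mathrm{bad}) - d(w, \mathrm{good})}{d(\mathrm{good}, \mathrm{bad})}

With d(good, happy) = 2 and d(bad, happy) = 4, 'happy' gets a positive score.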
Dimension Reduction. The size of the feature representation is (#terms) × (#documents), where #terms is roughly the size of the dictionary. This causes high calculation cost and a risk of overfitting (best for the training data ≠ best for real data). Choosing effective features improves both accuracy and calculation cost.
Dimension Reduction. df-threshold: terms appearing in very few documents (e.g., only one) are not important and can be removed. Alternatively, a statistical score of term t with respect to category cj can be used; if t and cj are independent, the score is equal to zero (one common choice is shown below).
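The score formula itself is not recoverable from the extracted slide; a widely used choice with the stated property (zero when t and cj are independent) is the χ² statistic, shown here as one example. With A = documents of cj containing t, B = documents outside cj containing t, C = documents of cj without t, D = documents outside cj without t, and N = A + B + C + D:

    \chi^2(t, c_j) = \frac{N \cdot (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}

When t and cj are independent, AD ≈ CB, so the score is close to zero; terms with low scores for every category can be dropped.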
Outline: Introduction, Text Categorization, Feature of Text, Learning Algorithm, Conclusion.
Learning Algorithm. Many (almost all?) learning algorithms are used in text categorization. Simple approaches: Naïve Bayes, k-Nearest Neighbor. High-performance approaches: Boosting, Support Vector Machine, Hierarchical Learning.
Naïve Bayes. Bayes rule: P(c|d) = P(d|c) P(c) / P(d). The likelihood P(d|c) is hard to calculate directly, so assume that each term occurs independently: P(d|c) = Π_i P(t_i | c).
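A minimal multinomial Naïve Bayes sketch over bag-of-words counts (an illustrative implementation with add-one smoothing, not taken from the survey):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists, labels: list of class labels."""
    class_counts = Counter(labels)                 # for the prior P(c)
    term_counts = defaultdict(Counter)             # term counts per class
    vocab = set()
    for tokens, c in zip(docs, labels):
        term_counts[c].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def predict_nb(tokens, class_counts, term_counts, vocab):
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        # log P(c) + sum_i log P(t_i | c), with add-one (Laplace) smoothing
        score = math.log(n_c / total_docs)
        denom = sum(term_counts[c].values()) + len(vocab)
        for t in tokens:
            score += math.log((term_counts[c][t] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

docs = [["cheap", "viagra", "offer"], ["meeting", "agenda", "notes"]]
labels = ["spam", "ham"]
model = train_nb(docs, labels)
print(predict_nb(["cheap", "offer"], *model))   # -> 'spam'
```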
k-Nearest Neighbor. Define a "distance" between two texts, e.g., Sim(d1, d2) = d1 · d2 / (|d1| |d2|) = cos θ. Check the k texts with the highest similarity (e.g., k = 3) and categorize by majority vote. Since all training texts must be stored and searched for each new text, memory and search costs become high for large data.
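A sketch of the cosine similarity and the k-NN vote over bag-of-words Counters (illustrative; it assumes the whole training set fits in memory, which is exactly the cost issue noted above):

```python
import math
from collections import Counter

def cosine_sim(d1, d2):
    """Sim(d1, d2) = d1 . d2 / (|d1| |d2|), i.e. cos(theta) between count vectors."""
    dot = sum(d1[t] * d2[t] for t in d1)
    norm = math.sqrt(sum(v * v for v in d1.values())) * math.sqrt(sum(v * v for v in d2.values()))
    return dot / norm if norm else 0.0

def knn_classify(query, training, k=3):
    """training: list of (Counter, label) pairs; majority vote over the k most similar texts."""
    neighbors = sorted(training, key=lambda item: cosine_sim(query, item[0]), reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```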
Boosting. BoosTexter [00 Schapire et al.] is based on AdaBoost, which builds many "weak learners" with different parameters: the K-th weak learner checks the performance of learners 1..K-1 and tries to classify correctly the training data on which they score worst. BoosTexter uses a decision stump as the weak learner (a code sketch follows the example below).
Simple example of Boosting. (Figure: a set of + and - training points shown over three boosting rounds, 1-3.)
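A hedged sketch of boosting with decision stumps, using scikit-learn (assumed installed); this is plain AdaBoost over bag-of-words counts, not the BoosTexter algorithm itself. The default base estimator of AdaBoostClassifier is a depth-1 decision tree, i.e. a decision stump:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import make_pipeline

texts = ["cheap offer now", "meeting agenda attached", "win a cheap prize", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

# Each boosting round fits a new stump, focusing on the examples the previous stumps got wrong.
model = make_pipeline(CountVectorizer(), AdaBoostClassifier(n_estimators=50))
model.fit(texts, labels)
print(model.predict(["cheap prize offer"]))   # e.g. ['spam']
```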
Support Vector Machine. Text categorization with SVM [98 Joachims]: learn the separating hyperplane that maximizes the margin.
Text Categorization with SVM. SVM works well for text categorization: it is robust to high-dimensional input and to overfitting, and most text categorization problems are linearly separable (all of OHSUMED, a MEDLINE collection, and most of Reuters-21578, a news collection).
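A hedged sketch of a linear SVM for text categorization, assuming scikit-learn is installed (the toy data and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["oil prices rise", "share market falls", "team wins the final", "new coach appointed"]
labels = ["economy", "economy", "sports", "sports"]

# tf-idf features are high-dimensional but sparse; a linear kernel is usually enough for text.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["market prices fall"]))   # e.g. ['economy']
```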
Comparison of these methods [02 Sebastiani], on Reuters-21578 (two versions; the difference is the number of categories):

  Method         Ver.1 (90)   Ver.2 (10)
  SVM            .920         .870
  Boosting       .878         -
  Naïve Bayes    .795         .815
  k-NN           .860         .823
Hierarchical Learning. TreeBoost [06 Esuli et al.] is a boosting algorithm for hierarchical labels: given a label hierarchy and labeled texts as training data, it applies AdaBoost recursively. It gives a better classifier than 'flat' AdaBoost: accuracy improves by 2-3%, and both training and categorization time go down. Hierarchical SVM [04 Cai et al.] is a related approach. (A rough sketch of the recursive idea follows the figure below.)
TreeBoost. (Figure: an example label hierarchy, root -> L1-L4; L1 -> L11, L12; L4 -> L41-L43; L42 -> L421, L422.)
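A rough sketch of the recursive idea behind hierarchical classification (not the actual TreeBoost algorithm; the helper names are invented, and scikit-learn is assumed installed): train one classifier per internal node to route a document to one of that node's children, then walk down the tree until a leaf label is reached.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import make_pipeline

def train_node(docs_by_child):
    """docs_by_child: {child_label: [texts under that child]}; returns a classifier over the children."""
    texts, targets = [], []
    for child, docs in docs_by_child.items():
        texts.extend(docs)
        targets.extend([child] * len(docs))
    return make_pipeline(CountVectorizer(), AdaBoostClassifier(n_estimators=20)).fit(texts, targets)

def classify(text, node_classifiers, node="root"):
    """node_classifiers: {node_name: trained classifier}; walk down until a leaf (no classifier) is reached."""
    while node in node_classifiers:
        node = node_classifiers[node].predict([text])[0]
    return node
```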
Outline: Introduction, Text Categorization, Feature of Text, Learning Algorithm, Conclusion.
Conclusion. This survey gave an overview of text categorization with machine learning: features of text and learning algorithms. Future work: natural language processing with machine learning, especially for Japanese, and reducing calculation cost.


Editor's Notes

  • #3 As the Internet has spread and documents have increasingly been digitized, large amounts of electronic data such as e-mail, news, and blogs have become available. Accordingly, from the standpoint of time and labor cost, the need to classify large volumes of documents efficiently without human intervention has been growing.
  • #4 Applications include, for example, automatically determining which topic a text belongs to, or extracting opinions about products from the Web.
  • #5 The most widely used approach for classifying texts automatically is machine learning based on textual information such as words. Machine learning is broadly divided into supervised and unsupervised learning; this survey focuses on supervised learning.
  • #6 This shows the main flow of machine learning for text categorization. First, text written in natural language is converted into a form a machine can handle (feature extraction). A learner is then trained on those features (learning). When unseen data arrives, it is classified by the trained learner (categorization). Since this flow is almost the same as that of machine learning in general, text categorization is studied widely in the machine learning field. This survey examines the methods used at each of these stages.
  • #8 This part explains feature extraction from text data. First, data written in natural language must be converted into some kind of numerical data, for example by using morphological analysis.
  • #11 In this case, words that appear very frequently, such as "the" and "for" in English, need to be removed as "stop words".
  • #13 The simplest method that comes to mind is to count the occurrences of each word: consider a (number of documents) × (number of words) matrix that records how many times each word appears in each document. This makes the data very simple to handle, but since it looks only at raw counts, accuracy is limited.
  • #14 The tf-idf method addresses this. It multiplies (the frequency of a term in a document) by (the inverse of the number of documents the term appears in), so terms that appear frequently in a document but rarely in the collection as a whole receive a high weight; it is widely used for feature extraction in text categorization. Using tf-idf, or a normalized version of it, is the de facto standard for representing document features, and there is little new research in this area.
  • #15 As it stands, the vectors representing the documents become quite large: (number of documents) × (number of words in the dictionary). Feature selection is therefore used to reduce this dimensionality.
  • #16 One technique used here is to set a threshold on document frequency: words that do not appear in at least a certain number of documents are not used for learning. This is based on the assumption that words appearing in only a very few documents will not help classification.