
20070702 Text Categorization


Based on Chapter 16 of the book Foundations of Statistical Natural Language Processing.

Published in: Education, Technology


  1. Text Categorization (Chapter 16, Foundations of Statistical Natural Language Processing)
  2. Outline
     - Preparation
     - Decision Trees
     - Maximum Entropy Modeling
     - Perceptrons
     - k Nearest Neighbor Classification
  3. Part I: Preparation
  4. Classification
     - Classification / categorization: the task of assigning objects from a
       universe to two or more classes (categories). E.g.:

       Problem                  Object              Categories
       -----------------------  ------------------  ---------------------
       Tagging                  Context of a word   The word's POS tags
       Disambiguation           Context of a word   The word's senses
       PP attachment            Sentence            Parse trees
       Author identification    Document            Authors
       Language identification  Document            Languages
       Text categorization      Document            Topics
  5. Task Description
     - Goal: given the classification scheme, the system decides which
       class(es) a document belongs to.
     - A mapping from the document space to the classification scheme
       (one-to-one or one-to-many).
     - To build the mapping:
       - observe known samples already classified in the scheme,
       - summarize their features and create rules/formulas,
       - decide the classes of new documents according to those rules.
  6. Task Formulation
     - Training set: pairs (text document, category); for text categorization,
       each document is represented as a vector of (possibly weighted) word
       counts.
     - Model class: a parameterized family of classifiers.
     - Training procedure: selects one classifier from this family.
     - E.g. a linear classifier g(x) = w . x + b with w = (1, 1), b = -1:
       points x with g(x) > 0 are assigned to the class, and the decision
       boundary g(x) = 0 passes through (0, 1) and (1, 0).
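The linear model class on this slide can be sketched in a few lines. This is a minimal illustration using the slide's example parameters w = (1, 1), b = -1; the function names are mine, and in practice the training procedure (not hand-picking) would select w and b.

```python
# Sketch of the slide's linear model class: g(x) = w . x + b.
# w = (1, 1) and b = -1 are the example parameters from the slide.

def g(x, w=(1.0, 1.0), b=-1.0):
    """Decision function of a linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x):
    """Assign class 1 if g(x) > 0, else class 0."""
    return 1 if g(x) > 0 else 0

print(classify((2.0, 2.0)))  # above the boundary -> 1
print(classify((0.0, 0.0)))  # below the boundary -> 0
```

Note that the boundary points (0, 1) and (1, 0) from the slide give g(x) = 0 exactly.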
  7. Evaluation (1)
     - Evaluate on a held-out test set.
     - For binary classification, build a contingency table:

                             Yes is correct   No is correct
        Yes was assigned           a                b
        No was assigned            c                d

     - Precision = a / (a + b), recall = a / (a + c), and accuracy
       (the proportion of correctly classified objects)
       = (a + d) / (a + b + c + d).
  8. Evaluation (2)
     - More than two categories:
       - Macro-averaging: create a contingency table for each category,
         compute precision/recall separately, then average the evaluation
         measure over categories (e.g. macro-averaged precision is the mean
         of the per-category precisions).
       - Micro-averaging: make a single contingency table for all the data by
         summing the scores in each cell over all categories.
       - Macro-averaging gives equal weight to each class; micro-averaging
         gives equal weight to each object.
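The macro/micro distinction is easy to see in code. A small sketch, using the (a, b, c, d) cells from the previous slide; the two per-category tables below are made-up illustration numbers:

```python
# Macro- vs micro-averaged precision from per-category contingency tables.
# Each table is (a, b, c, d) as on the Evaluation (1) slide; toy numbers.

tables = [
    (10, 10, 4, 76),    # category 1: precision 10/20 = 0.5
    (90, 10, 10, 890),  # category 2: precision 90/100 = 0.9
]

def precision(a, b, c, d):
    return a / (a + b)

# Macro-averaging: average per-category precisions (equal weight per class).
macro = sum(precision(*t) for t in tables) / len(tables)

# Micro-averaging: sum the cells first, then compute a single precision
# (equal weight per object).
A = sum(t[0] for t in tables)
B = sum(t[1] for t in tables)
micro = A / (A + B)

print(round(macro, 3))  # 0.7
print(round(micro, 3))  # 0.833
```

The gap between 0.7 and 0.833 shows how micro-averaging is dominated by the large category.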
  9. Part II: Decision Trees
  10. E.g. a trained decision tree for the category "earnings".
      Test document: Doc = {cts = 1, net = 3}.

      Node 1: 7681 articles, P(c|n1) = 0.300, split on cts at 2
        cts < 2  -> Node 2: 5977 articles, P(c|n2) = 0.116, split on net at 1
                      net < 1  -> Node 3: 5436 articles, P(c|n3) = 0.050
                      net >= 1 -> Node 4: 541 articles,  P(c|n4) = 0.649
        cts >= 2 -> Node 5: 1704 articles, P(c|n5) = 0.943, split on vs at 2
                      vs < 2  -> Node 6: 301 articles,  P(c|n6) = 0.694
                      vs >= 2 -> Node 7: 1403 articles, P(c|n7) = 0.996
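Walking the slide's document through this tree can be sketched directly, with the tree hand-coded from the figure (thresholds and P(c|node) values are the ones shown above):

```python
# Classify a document with the slide's trained "earnings" decision tree.
# The tree structure and probabilities are transcribed from the figure.

def p_earnings(doc):
    if doc.get("cts", 0) < 2:       # Node 1 -> Node 2
        if doc.get("net", 0) < 1:   # Node 2 -> Node 3
            return 0.050
        return 0.649                # Node 2 -> Node 4
    if doc.get("vs", 0) < 2:        # Node 1 -> Node 5 -> Node 6
        return 0.694
    return 0.996                    # Node 5 -> Node 7

doc = {"cts": 1, "net": 3}
print(p_earnings(doc))  # 0.649: cts < 2, net >= 1, so the doc lands in Node 4
```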
  11. A Closer Look at the Example
      - Doc = {cts = 1, net = 3}  <-  the data representation model
      - Model class: the tree structure is decided, but what are the
        parameters?
      - Training procedure
  12. Data Representation Model (1)
      - An art in itself; usually depends on the particular categorization
        method used.
      - In this book, as an example, each document is represented as a
        weighted word vector.
      - The words are chosen by the chi-square method from the training
        corpus (Ref.: Chapter 5); 20 words are chosen, e.g. vs, mln, 1000,
        loss, profit, ...
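Chi-square term selection scores each candidate term by how strongly its presence is associated with the class, using a 2x2 term/class contingency table. A sketch with made-up counts (the helper name and toy numbers are mine):

```python
# Chi-square term selection sketch: score each term's 2x2 contingency table
# and keep the highest-scoring terms. All counts below are toy numbers.

def chi_square(a, b, c, d):
    """a: term & class, b: term & not class, c: no term & class, d: neither."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

counts = {           # term -> (a, b, c, d) over a 200-document toy corpus
    "profit": (40, 10, 10, 140),   # strongly associated with the class
    "cts":    (30, 20, 20, 130),   # moderately associated
    "the":    (48, 148, 2, 2),     # frequent everywhere -> weak association
}

ranked = sorted(counts, key=lambda t: chi_square(*counts[t]), reverse=True)
print(ranked)  # ['profit', 'cts', 'the']
```

A real selection step would compute these counts from the training corpus and keep the top 20 terms, as on the slide.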
  13. Data Representation Model (2)
      - Each document j is then represented as a vector of K = 20 integers
        s(1j), ..., s(Kj), where
        s(ij) = round(10 * (1 + log tf(ij)) / (1 + log l(j))) if tf(ij) > 0,
        and 0 otherwise.
      - tf(ij): the number of occurrences of term i in document j.
      - l(j): the length of document j.
      - E.g. the weight of the term 'profit'.
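A sketch of a weighting of this form, i.e. the (1 + log tf) / (1 + log l) scheme scaled to roughly 0-10 and rounded. Treat the exact constants as an assumption from my reading of the formula, and the tf = 6, l = 89 numbers as made-up inputs:

```python
# Term weighting sketch: s = round(10 * (1 + log tf) / (1 + log l)),
# with s = 0 when the term does not occur. Natural logarithms.
import math

def weight(tf, doc_len):
    if tf == 0:
        return 0
    return round(10 * (1 + math.log(tf)) / (1 + math.log(doc_len)))

# e.g. a term like 'profit' occurring 6 times in an 89-word document
print(weight(6, 89))   # 5
print(weight(0, 89))   # 0: absent term
```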
  14. Training Procedure: Growing (1)
      - Growing a tree:
        - Splitting criterion: find the feature and the value to split on,
          e.g. by information gain
          G(a, y) = H(t) - (pL * H(tL) + pR * H(tR)),
          where H(t) is the entropy of the parent node and pL is the
          proportion of elements passed on to the left node.
        - Stopping criterion: determines when to stop splitting, e.g. when
          all elements at a node have an identical representation or the same
          category.
      - Ref.: Machine Learning.
  15. Training Procedure: Growing (2)
      - E.g. the value of G('cts', 2) at the root of the earlier tree
        (Node 1: 7681 articles, P(c|n1) = 0.300; cts < 2 -> Node 2: 5977
        articles, P(c|n2) = 0.116; cts >= 2 -> Node 5: 1704 articles,
        P(c|n5) = 0.943):
      - H(t) = -0.3 * log(0.3) - 0.7 * log(0.7) = 0.611
      - pL = 5977 / 7681 = 0.778
      - G('cts', 2) = 0.611 - (0.778 * 0.359 + 0.222 * 0.219) = 0.283
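The slide's numbers can be reproduced directly from the node counts (natural logarithms, as in the H(t) line above):

```python
# Recompute the slide's information gain G('cts', 2) from the node counts.
import math

def entropy(p):
    """Binary entropy in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

h_parent = entropy(0.300)   # Node 1, 7681 articles
p_left = 5977 / 7681        # fraction of articles going to Node 2 (cts < 2)
h_left = entropy(0.116)     # Node 2
h_right = entropy(0.943)    # Node 5
gain = h_parent - (p_left * h_left + (1 - p_left) * h_right)

print(round(h_parent, 3))   # 0.611
print(round(gain, 3))       # 0.283
```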
  16. Training Procedure: Pruning (1)
      - Overfitting:
        - introduced by errors or noise in the training set,
        - or by the insufficiency of the training set.
      - Solution: pruning.
        - Grow a detailed decision tree, then prune it back to an appropriate
          size.
        - Approaches: Quinlan 1987, Quinlan 1993, Magerman 1994.
      - Ref.: Chapter 3 (3.7.1) of Machine Learning.
  17. Training Procedure: Pruning (2)
      - Validation: decide how far to prune using a held-out validation set
        (or cross-validation).
  18. Discussion
      - Learning curve: a large training set vs. optimal performance.
      - Decision trees can be interpreted easily, their greatest advantage.
        - The model is more complicated than classifiers like Naive Bayes,
          linear regression, etc.
      - Splitting divides the training set into smaller and smaller subsets,
        which makes correct generalization harder (not enough data for
        reliable prediction).
        - Pruning addresses the problem to some extent.
  19. Part III: Maximum Entropy Modeling
      - Data representation model
      - Model class
      - Training procedure
  20. Basic Idea
      - Given a set of training documents and their categories:
      - Select features that represent the empirical data.
      - Select a probability density function to generate the empirical data.
      - Find the probability distribution (i.e. decide the parameters of the
        probability function) that:
        - has the maximum entropy H(p) of all possible p,
        - satisfies the constraints given by the features,
        - maximizes the likelihood of the data.
      - New documents are classified under this probability distribution.
  21. Data Representation Model
      - Recall the data representation model used by the decision tree: each
        document is represented as a vector of K = 20 integers s(ij), where
        s(ij) is the weight of the feature word.
      - Here, the features f(i) are defined to characterize any property of a
        pair (x, c).
  22. Model Class
      - Loglinear models:
        p(x, c) = (1/Z) * prod_{i=1..K} alpha_i ^ f_i(x, c),
        where K is the number of features, alpha_i is the weight of feature
        f_i, and Z is a normalizing constant that ensures a probability
        distribution results.
      - Classifying a new document: compute p(x, 0) and p(x, 1), and choose
        the class label with the greater probability.
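Classification with a loglinear model can be sketched as below. The features and weights are made up for illustration; note that Z cancels when comparing classes, so only unnormalized products are needed:

```python
# Sketch of loglinear classification: p(x, c) proportional to the product
# of alpha_i ** f_i(x, c). Features and alpha weights below are hypothetical.

def unnormalised_p(x, c, alphas, features):
    p = 1.0
    for alpha, f in zip(alphas, features):
        p *= alpha ** f(x, c)
    return p

# Toy binary features f_i(x, c): fire when a word appears in x AND c == 1.
features = [
    lambda x, c: 1.0 if ("profit" in x and c == 1) else 0.0,
    lambda x, c: 1.0 if ("loss" in x and c == 1) else 0.0,
]
alphas = [3.0, 2.5]   # hypothetical trained weights; alpha > 1 favours c = 1

def classify(x):
    scores = {c: unnormalised_p(x, c, alphas, features) for c in (0, 1)}
    return max(scores, key=scores.get)   # Z cancels in the comparison

print(classify({"profit", "loss"}))  # 1: both features fire, score 7.5 vs 1.0
```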
  23. Training Process: Generalized Iterative Scaling
      - Given the loglinear model, impose a set of constraints: the expected
        value of each f_i under p* is the same as its expected value under
        the empirical distribution.
      - Under these constraints there is a unique maximum entropy
        distribution p*.
      - There is a computable procedure that converges to the distribution p*
        (Section 16.2.1).
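A toy run of generalized iterative scaling, sketched under simplifying assumptions of mine: events are pairs in {0,1}^2, the two features are the coordinates, and a slack feature keeps the per-event feature sum at the constant C = 2 that GIS requires. The empirical distribution is made up.

```python
# Toy generalized iterative scaling: iteratively rescale each alpha_i until
# the model's feature expectations match the empirical ones.

events = [(0, 0), (0, 1), (1, 0), (1, 1)]
empirical = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
C = 2

def feats(e):
    f1, f2 = float(e[0]), float(e[1])
    return [f1, f2, C - f1 - f2]   # slack feature keeps the sum at C

alphas = [1.0, 1.0, 1.0]           # start from the uniform model

def model():
    """Current loglinear distribution p(e) = (1/Z) prod alpha_i ** f_i(e)."""
    un = {e: 1.0 for e in events}
    for e in events:
        for a, f in zip(alphas, feats(e)):
            un[e] *= a ** f
    z = sum(un.values())
    return {e: v / z for e, v in un.items()}

emp_exp = [sum(empirical[e] * feats(e)[i] for e in events) for i in range(3)]

for _ in range(200):               # GIS update: alpha *= (emp/model)^(1/C)
    p = model()
    mod_exp = [sum(p[e] * feats(e)[i] for e in events) for i in range(3)]
    alphas = [a * (emp / mod) ** (1 / C)
              for a, emp, mod in zip(alphas, emp_exp, mod_exp)]

p = model()   # expectations now (approximately) satisfy the constraints
```

After the loop, the model's expectations match the empirical ones, which is exactly the constraint set on this slide.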
  24. The Principle of Maximum Entropy
      - Given by E. T. Jaynes in 1957.
      - The distribution with maximum entropy is more likely to occur than
        other distributions (cf. the entropy of a closed system continuously
        increasing).
      - Information entropy as a measure of 'uninformativeness':
        - if we chose a model with less entropy, we would add 'information'
          constraints to the model that are not justified by the empirical
          evidence available to us.
  25. Application to Text Categorization
      - Feature selection: in maximum entropy modeling, feature selection and
        training are usually integrated.
      - Test for convergence: compare the log difference between empirical
        and estimated feature expectations.
      - Generalized iterative scaling is computationally expensive due to
        slow convergence.
      - Vs. Naive Bayes:
        - both use the prior probability;
        - Naive Bayes assumes no dependency between variables, while maximum
          entropy modeling does not.
      - Strengths:
        - arbitrarily complex features can be defined if the experimenter
          believes these features may contribute useful information for the
          classification decision;
        - a unified framework for feature selection and classification.
  26. Part IV: Perceptrons
  27. Models
      - Data representation model: each text document is represented as a
        term vector.
      - Model class: binary classification. For any input text document x,
        class(x) = c iff f(x) > 0; else class(x) != c.
      - Algorithm:
        - the perceptron learning algorithm is a simple example of a gradient
          descent algorithm;
        - the goal is to learn the weight vector w and a threshold theta.
  28. Perceptron Learning Procedure: Gradient Descent
      - Gradient descent is an optimization algorithm: to find a local
        minimum of a function, take steps proportional to the negative of the
        gradient of the function at the current point.
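The step-against-the-gradient idea can be shown in a few lines on a one-dimensional function (my own toy example, not from the slides):

```python
# Minimal gradient descent: minimize f(x) = (x - 3)^2 by stepping against
# the gradient f'(x) = 2 * (x - 3). Converges to the minimum at x = 3.

x = 0.0
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * (x - 3)
    x -= learning_rate * gradient   # step in the negative gradient direction

print(round(x, 4))  # 3.0
```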
  29. Perceptron Learning Procedure: Basic Idea
      - Goal: find a linear division of the training set.
      - Procedure: estimate w and theta; when they make a mistake, move them
        in the direction of greatest change for the optimality criterion.
        - For each (x, y) pair, apply the update rule
          w(j)' = w(j) + alpha * (delta - y) * x(j),
          where w(j) is the j-th item of the weight vector, x(j) is the j-th
          item of the input vector, and delta and y are the expected and
          actual outputs.
      - Perceptron convergence theorem: Novikoff (1962) proved that the
        perceptron algorithm converges after a finite number of iterations if
        the data is linearly separable.
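A sketch of this update rule in action, with the threshold theta folded in as an extra weight on a constant -1 input (the trick on the next slide). The training data here is my own toy example: the linearly separable AND function.

```python
# Perceptron learning with the rule w(j)' = w(j) + alpha*(delta - y)*x(j).
# Each input carries a trailing -1 so theta is learned as the last weight.

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# (x1, x2, -1) -> target: the AND function, which is linearly separable.
data = [((0, 0, -1), 0), ((0, 1, -1), 0), ((1, 0, -1), 0), ((1, 1, -1), 1)]
w = [0.0, 0.0, 0.0]
alpha = 1.0

for _ in range(20):                     # converges well before 20 epochs
    for x, delta in data:
        y = predict(w, x)               # actual output
        w = [wj + alpha * (delta - y) * xj for wj, xj in zip(w, x)]

print([predict(w, x) for x, _ in data])  # [0, 0, 0, 1]
```

When an example is misclassified, (delta - y) is +1 or -1, so the input vector is added to or subtracted from w; correctly classified examples leave w unchanged.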
  30. Why?
      - w(j)' = w(j) + alpha * (delta - y) * x(j)
        - Ref.: Sections 8.1 and 8.2 on reinforcement learning; refer to
          formula 8.2.
        - The update moves along the greatest gradient.
        - The threshold is folded in: w' = {w1, w2, ..., wk, theta},
          x' = {x1, x2, ..., xk, -1}.
  31. E.g. [figure: a misclassified example x is added to the weight vector
      w; the new vector w + x moves the separating boundary from s to s',
      turning the 'No' answer into a 'Yes']
  32. Discussion
      - The data set should be linearly separable.
        - By 1969, researchers had realized this limitation, and interest in
          perceptrons remained low.
      - As a gradient descent algorithm, it does not suffer from the local
        optimum problem.
      - Back-propagation algorithms, etc. (1980s): multi-layer perceptrons,
        neural networks, connectionist models.
        - They overcome the shortcoming of perceptrons and can ideally learn
          any classification function (e.g. XOR).
        - But they converge more slowly and can get caught in local optima.
  33. Part V: k Nearest Neighbor Classification
  34. Nearest Neighbor
      - [figure] The new point's nearest neighbor is purple, so
        Category = purple.
  35. K Nearest Neighbor
      - [figure] With k = 4, the majority of the 4 nearest neighbors are
        blue, so Category = blue.
  36. Discussion
      - Similarity metric: the complexity of KNN lies in finding a good
        measure of similarity; its performance is very dependent on the right
        similarity metric.
      - Efficiency: however, there are ways of implementing KNN search
        efficiently, and often there is an obvious choice for a similarity
        metric.
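For term vectors, cosine similarity is the obvious choice of metric mentioned above. A minimal kNN sketch with made-up two-dimensional vectors and labels (k = 3):

```python
# k nearest neighbor classification with cosine similarity; toy data.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

train = [                          # (term vector, category), toy examples
    ((1.0, 0.0), "earnings"),
    ((0.9, 0.1), "earnings"),
    ((0.8, 0.3), "earnings"),
    ((0.1, 1.0), "grain"),
    ((0.0, 0.9), "grain"),
]

def knn(x, k=3):
    # take the k training vectors most similar to x, then majority-vote
    neighbors = sorted(train, key=lambda tv: cosine(x, tv[0]), reverse=True)[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

print(knn((1.0, 0.2)))  # 'earnings': all 3 nearest neighbors are earnings
```

The sort over the whole training set is the naive O(n) search; the efficient implementations alluded to on the slide would replace it with an index structure.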
  37. 37. Thanks!