20070702 Text Categorization


Chapter 16 of the book Foundations of Statistical Natural Language Processing

Published in: Education, Technology
  • 20070702 Text Categorization

    1. 1. Text Categorization Chapter 16 Foundations of Statistical Natural Language Processing
    2. 2. Outline <ul><li>Preparation </li></ul><ul><li>Decision Tree </li></ul><ul><li>Maximum Entropy Modeling </li></ul><ul><li>Perceptrons </li></ul><ul><li>K Nearest Neighbor Classification </li></ul>
    3. 3. Part I <ul><li>Preparation </li></ul>
    4. 4. Classification <ul><li>Classification / Categorization </li></ul><ul><ul><li>The task of assigning objects from a universe to two or more classes (categories) </li></ul></ul>
       Problem | Object | Categories
       Tagging | Context of a word | The word’s (POS) tags
       Author identification | Document | Document authors
       Language identification | Document | Languages
       Text categorization | Document | Topics
       Disambiguation | Context of a word | The word’s senses
       PP attachment | Sentence | Parse trees
    5. 5. Task Description <ul><li>Goal: Given the classification scheme, the system decides which class(es) a document is related to. </li></ul><ul><li>A mapping from document space to classification scheme. </li></ul><ul><ul><li>1 to 1 / 1 to many </li></ul></ul><ul><li>To build the mapping: </li></ul><ul><ul><li>Observe the known samples classified in the scheme, </li></ul></ul><ul><ul><li>Summarize the features and create rules/formulas </li></ul></ul><ul><ul><li>Decide the classes for the new documents according to the rules. </li></ul></ul>
    6. 6. Task Formulation <ul><li>Training set: </li></ul><ul><ul><li>(text doc, category) pairs, encoded by a data representation model; for TC, each doc is represented as a vector of (possibly weighted) word counts </li></ul></ul><ul><li>Model class </li></ul><ul><ul><li>a parameterized family of classifiers </li></ul></ul><ul><li>Training procedure </li></ul><ul><ul><li>selects one classifier from this family. </li></ul></ul><ul><li>E.g. a linear classifier g(x) = w · x + b with w = (1,1), b = -1 </li></ul>
       [Figure: the decision boundary g(x) = 0 passes through (0,1) and (1,0); a point x1 with w · x1 + b > 0 lies on one side and a point x2 with w · x2 + b < 0 on the other.]
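The linear classifier on this slide can be sketched in a few lines. This is a minimal illustration, assuming g(x) = w · x + b with the slide's values w = (1, 1) and b = -1; the function names and the 0/1 class labels are hypothetical.

```python
# Linear classifier sketch: g(x) = w . x + b, with the slide's w = (1, 1), b = -1.

def g(x, w=(1.0, 1.0), b=-1.0):
    """Return the signed score w . x + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x, w=(1.0, 1.0), b=-1.0):
    """Assign class 1 when g(x) > 0, otherwise class 0 (labels are hypothetical)."""
    return 1 if g(x, w, b) > 0 else 0


# (0, 1) and (1, 0) lie on the decision boundary g(x) = 0.
print(g((0, 1)), g((1, 0)))  # 0.0 0.0

# Two made-up points, one on each side of the boundary.
print(classify((1, 1)), classify((0, 0)))  # 1 0
```

Training then amounts to choosing w and b from the parameterized family so that the boundary separates the training documents as well as possible.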
    7. 7. Evaluation(1) <ul><li>Test set </li></ul><ul><li>For binary classification </li></ul><ul><ul><li>Accuracy = (a + d) / (a + b + c + d) (proportion of correctly classified objects) </li></ul></ul>
       Contingency table:
                        | Yes is correct | No is correct
       Yes was assigned | a              | b
       No was assigned  | c              | d
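The contingency table cells can be turned into evaluation measures as follows. This is a sketch: accuracy matches the "proportion of correctly classified objects" on the slide, while the precision and recall definitions are the standard ones and the counts are made up for illustration.

```python
def binary_metrics(a, b, c, d):
    """Compute (accuracy, precision, recall) from a 2x2 contingency table.

    a: yes assigned, yes correct    b: yes assigned, no correct
    c: no assigned,  yes correct    d: no assigned,  no correct
    """
    total = a + b + c + d
    accuracy = (a + d) / total if total else 0.0
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    return accuracy, precision, recall


# Hypothetical counts, not taken from the slides.
accuracy, precision, recall = binary_metrics(a=40, b=10, c=20, d=30)
print(accuracy, precision)  # 0.7 0.8
```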
    8. 8. Evaluation(2) <ul><li>More than two categories </li></ul><ul><ul><li>Macro-averaging </li></ul></ul><ul><ul><ul><li>For each category, create a contingency table, then compute the precision/recall separately </li></ul></ul></ul><ul><ul><ul><li>Average the evaluation measure over categories. </li></ul></ul></ul><ul><ul><li>Micro-averaging: make a single contingency table for all the data by summing the scores in each cell for all categories. </li></ul></ul><ul><ul><ul><li>Macro-avg: gives equal weight to each class </li></ul></ul></ul><ul><ul><ul><li>Micro-avg: gives equal weight to each object </li></ul></ul></ul>
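The difference between the two averaging schemes shows up clearly when one category is much larger than another. The sketch below uses precision as the measure and two made-up per-category tables in (a, b, c, d) order; all names and counts are illustrative assumptions.

```python
def precision(a, b):
    """Precision from the 'yes was assigned' row of a contingency table."""
    return a / (a + b) if (a + b) else 0.0


def macro_precision(tables):
    """Average the per-category precision: equal weight to each class."""
    return sum(precision(a, b) for a, b, _c, _d in tables) / len(tables)


def micro_precision(tables):
    """Sum the cells over categories first: equal weight to each object."""
    a_sum = sum(t[0] for t in tables)
    b_sum = sum(t[1] for t in tables)
    return precision(a_sum, b_sum)


# Hypothetical tables: a large, easy category and a small, hard one.
tables = [(90, 10, 5, 895), (1, 9, 4, 986)]
print(macro_precision(tables))            # 0.5
print(round(micro_precision(tables), 3))  # 0.827
```

The small category pulls the macro-average down to 0.5, while the micro-average stays close to the precision of the large category.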
    9. 9. Part II <ul><li>Decision Tree </li></ul>
    10. 10. E.g. A trained decision tree for category “earnings”; Doc = {cts=1, net=3}
        Node1: 7681 articles, P(c|n1) = 0.300, split: cts, value: 2
          cts < 2: Node2: 5977 articles, P(c|n2) = 0.116, split: net, value: 1
            net < 1: Node3: 5436 articles, P(c|n3) = 0.050
            net >= 1: Node4: 541 articles, P(c|n4) = 0.649
          cts >= 2: Node5: 1704 articles, P(c|n5) = 0.943, split: vs, value: 2
            vs < 2: Node6: 301 articles, P(c|n6) = 0.694
            vs >= 2: Node7: 1403 articles, P(c|n7) = 0.996
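Classifying a document with this tree means following the splits from the root to a leaf. The sketch below encodes the tree from the slide as nested dictionaries (the encoding and function name are illustrative choices, and it assumes a feature missing from the document has weight 0).

```python
# The "earnings" tree from the slide, encoded as nested dicts.
TREE = {
    "id": 1, "p": 0.300, "feature": "cts", "value": 2,
    "left": {"id": 2, "p": 0.116, "feature": "net", "value": 1,
             "left": {"id": 3, "p": 0.050},
             "right": {"id": 4, "p": 0.649}},
    "right": {"id": 5, "p": 0.943, "feature": "vs", "value": 2,
              "left": {"id": 6, "p": 0.694},
              "right": {"id": 7, "p": 0.996}},
}


def classify(doc, node=TREE):
    """Walk from the root to a leaf; return (leaf id, P(c | leaf))."""
    while "feature" in node:
        weight = doc.get(node["feature"], 0)  # assumption: missing feature = 0
        node = node["left"] if weight < node["value"] else node["right"]
    return node["id"], node["p"]


leaf, p = classify({"cts": 1, "net": 3})
print(leaf, p)  # 4 0.649
```

The example document goes left at Node1 (cts < 2) and right at Node2 (net >= 1), ending at Node4, where the estimated probability of "earnings" is 0.649.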
    11. 11. A Closer Look at the Example <ul><li>Doc = {cts=1, net=3} </li></ul><ul><li>Data representation model </li></ul><ul><li>Model class </li></ul><ul><ul><li>Structure decided </li></ul></ul><ul><ul><li>Parameters ? </li></ul></ul><ul><li>Training Procedure </li></ul>
    12. 12. Data Representation Model (1) <ul><li>An art in itself. </li></ul><ul><ul><li>Usually depends on the particular categorization method used. </li></ul></ul><ul><li>In this book, as an example, each document is represented as a weighted word vector. </li></ul><ul><li>The words are chosen from the training corpus by the χ² (chi-square) method </li></ul><ul><ul><li>20 words are chosen, e.g. vs, mln, 1000, loss, profit… </li></ul></ul>Ref.: Chap. 5
    13. 13. Data Representation Model (2) <ul><li>Each document j is then represented as a vector of K = 20 integers, $x_j = (s_{1j}, \ldots, s_{Kj})$, with $s_{ij} = \mathrm{round}\left(10 \times \frac{1 + \log(tf_{ij})}{1 + \log(l_j)}\right)$ </li></ul><ul><li>$tf_{ij}$: the number of occurrences of term i in document j </li></ul><ul><li>$l_j$: the length of document j </li></ul><ul><li>E.g. profit </li></ul>
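The term weight can be computed directly from the two quantities defined on this slide. The sketch below assumes the chapter's weighting s_ij = round(10 × (1 + log tf_ij) / (1 + log l_j)) with natural logarithms and a weight of 0 for absent terms; the counts used for "profit" are illustrative assumptions.

```python
import math


def term_weight(tf, doc_length):
    """Integer weight of a term that occurs tf times in a document of doc_length words."""
    if tf == 0:
        return 0  # assumption: a term that does not occur gets weight 0
    return round(10 * (1 + math.log(tf)) / (1 + math.log(doc_length)))


# Illustrative counts: "profit" occurring 6 times in an 89-word document.
print(term_weight(6, 89))  # 5
```

Taking logarithms dampens the effect of raw counts, and dividing by the document-length term keeps long documents from dominating.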
    14. 14. Training Procedure: Growing (1) <ul><li>Growing a tree </li></ul><ul><ul><li>Splitting criterion: </li></ul></ul><ul><ul><ul><li>finds the feature and the value to split on </li></ul></ul></ul><ul><ul><ul><li>Information gain: $G(a, y) = H(t) - \left(p_L H(t_L) + p_R H(t_R)\right)$, where $H(t)$ is the entropy of the parent node, $t_L$ and $t_R$ are the left and right child nodes, $p_L$ is the proportion of elements passed on to the left node, and $p_R = 1 - p_L$ </li></ul></ul></ul><ul><ul><li>Stopping criterion </li></ul></ul><ul><ul><ul><li>Determines when to stop splitting </li></ul></ul></ul><ul><ul><ul><li>e.g. all elements at a node have an identical representation or the same category </li></ul></ul></ul>Ref.: Machine Learning
    15. 15. Training Procedure: Growing (2) <ul><li>E.g. the value of G(‘cts’, 2) </li></ul><ul><li>H(t) = -0.3*log(0.3) - 0.7*log(0.7) = 0.611 (natural log) </li></ul><ul><li>pL = 5977/7681, pR = 1704/7681 </li></ul><ul><li>G(‘cts’,2) = 0.611 - (pL*H(tL) + pR*H(tR)) = 0.611 - 0.328 = 0.283, where tL = Node2 and tR = Node5 </li></ul>
        Node1: 7681 articles, P(c|n1) = 0.300, split: cts, value: 2
          cts < 2: Node2: 5977 articles, P(c|n2) = 0.116
          cts >= 2: Node5: 1704 articles, P(c|n5) = 0.943
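The worked value of G(‘cts’, 2) can be reproduced numerically. The sketch below assumes binary entropy with natural logarithms (which matches H(t) = 0.611 on the slide) and uses the node sizes and probabilities shown for Node1, Node2, and Node5; the function names are illustrative.

```python
import math


def entropy(p):
    """Binary entropy (natural log) of a node where a fraction p belongs to the category."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def information_gain(p_parent, n_left, p_left, n_right, p_right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = n_left + n_right
    p_l = n_left / n
    p_r = n_right / n
    return entropy(p_parent) - (p_l * entropy(p_left) + p_r * entropy(p_right))


gain = information_gain(0.300, 5977, 0.116, 1704, 0.943)
print(round(entropy(0.300), 3), round(gain, 3))  # 0.611 0.283
```

Growing the tree repeats this computation for every candidate feature and split value at a node and keeps the split with the largest gain.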
    16. 16. Training Procedure: Pruning (1) <ul><li>Overfitting: </li></ul><ul><ul><li>Introduced by errors or noise in the training set </li></ul></ul><ul><ul><li>Or by an insufficient training set </li></ul></ul><ul><li>Solution: Pruning </li></ul><ul><ul><li>Create a detailed decision tree, then prune the tree to an appropriate size </li></ul></ul><ul><ul><li>Approaches: </li></ul></ul><ul><ul><ul><li>Quinlan 1987 </li></ul></ul></ul><ul><ul><ul><li>Quinlan 1993 </li></ul></ul></ul><ul><ul><ul><li>Magerman 1994 </li></ul></ul></ul>Ref.: Machine Learning, chap. 3 (3.7.1)
    17. 17. Training Procedure: Pruning (2) <ul><li>Validation </li></ul><ul><ul><li>Validation set (cross-validation) </li></ul></ul>
    18. 18. Discussion <ul><li>Learning Curve </li></ul><ul><ul><li>Large training set vs. optimal performance </li></ul></ul><ul><li>Can be interpreted easily: the greatest advantage </li></ul><ul><ul><li>However, the model is more complicated than classifiers like Naïve Bayes, linear regression, etc. </li></ul></ul><ul><li>Splits the training set into smaller and smaller subsets, which makes correct generalization harder (not enough data for reliable prediction) </li></ul><ul><ul><li>Pruning addresses the problem to some extent. </li></ul></ul>
    19. 19. Part III <ul><li>Maximum Entropy Modeling </li></ul><ul><ul><li>Data representation model </li></ul></ul><ul><ul><li>Model class </li></ul></ul><ul><ul><li>Training procedure </li></ul></ul>
    20. 20. Basic Idea <ul><li>Given a set of training documents and their categories </li></ul><ul><li>Select features that represent the empirical data. </li></ul><ul><li>Select a probability density function to generate the empirical data </li></ul><ul><li>Find the probability distribution (= decide the parameters of the probability function) that </li></ul><ul><ul><li>has the maximum entropy H(p) of all the possible p </li></ul></ul><ul><ul><li>Satisfies the constraints given by the features </li></ul></ul><ul><ul><li>Maximizes the likelihood of the data </li></ul></ul><ul><li>A new document is classified under this probability distribution </li></ul>
    21. 21. Data Representation Model <ul><li>Recall the data representation model used by the decision tree: </li></ul><ul><ul><li>Each document j is represented as a vector of K = 20 integers, $x_j = (s_{1j}, \ldots, s_{Kj})$, where $s_{ij}$ is an integer giving the weight of feature word i in document j; </li></ul></ul><ul><li>The features $f_i$ are defined to characterize any property of a pair (x, c). </li></ul>
    22. 22. Model Class <ul><li>Loglinear models: $p(x, c) = \frac{1}{Z} \prod_{i=1}^{K} \alpha_i^{f_i(x, c)}$ </li></ul><ul><li>K is the number of features, $\alpha_i$ is the weight of feature $f_i$, and Z is a normalizing constant that ensures a probability distribution results. </li></ul><ul><li>Classify a new document: </li></ul><ul><ul><li>Compute p(x, 0) and p(x, 1), and choose the class label with the greater probability </li></ul></ul>
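The classification step of a loglinear model can be sketched as below. This is a hedged illustration, assuming the form p(x, c) = (1/Z) ∏ α_i^{f_i(x, c)} and binary features that fire when word i occurs in the document and the class is 1; the feature definition, the weights, and the function names are assumptions for illustration, not trained values.

```python
import math


def feature(i, x, c):
    """Assumed binary feature: 1 if word i has positive weight in x and c == 1."""
    return 1 if x[i] > 0 and c == 1 else 0


def unnormalized(x, c, alphas):
    """Product of alpha_i ** f_i(x, c); the constant 1/Z is left out."""
    return math.prod(a ** feature(i, x, c) for i, a in enumerate(alphas))


def classify(x, alphas):
    """Choose the class with the larger p(x, c); Z cancels in the comparison."""
    return 1 if unnormalized(x, 1, alphas) > unnormalized(x, 0, alphas) else 0


alphas = [2.0, 0.5, 3.0]            # hypothetical feature weights
print(classify([1, 0, 2], alphas))  # 1
print(classify([0, 4, 0], alphas))  # 0
```

Because Z is the same constant for p(x, 0) and p(x, 1), the decision only needs the unnormalized products.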
    23. 23. Training Process: Generalized Iterative Scaling <ul><li>Given the loglinear model equation </li></ul><ul><li>Under a set of constraints: </li></ul><ul><ul><li>The expected value of $f_i$ for p* is the same as its expected value for the empirical distribution </li></ul></ul><ul><li>There is a unique maximum entropy distribution p* </li></ul><ul><li>There is a computable procedure that converges to the distribution p* (16.2.1) </li></ul>
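Written out, the constraints and the iterative update look roughly as follows. This is a sketch of the standard generalized iterative scaling formulation, assuming a constant C equal to the maximum feature sum and an extra correction feature f_{K+1} that makes the features sum to C for every (x, c); the notation \tilde{p} for the empirical distribution and p^{(n)} for the n-th estimate is an assumption of this sketch.

```latex
% Constraints: model expectations match empirical expectations
E_{p^*} f_i = E_{\tilde{p}} f_i, \qquad i = 1, \ldots, K+1

% Correction feature so that all features sum to the constant C
f_{K+1}(x, c) = C - \sum_{i=1}^{K} f_i(x, c)

% Model form at iteration n
p^{(n)}(x, c) = \frac{1}{Z} \prod_{i=1}^{K+1} \left(\alpha_i^{(n)}\right)^{f_i(x, c)}

% Multiplicative update of each weight
\alpha_i^{(n+1)} = \alpha_i^{(n)} \left( \frac{E_{\tilde{p}} f_i}{E_{p^{(n)}} f_i} \right)^{1/C}
```

Each iteration raises the weight of a feature whose model expectation is too low and lowers it when the expectation is too high, until the two expectations agree.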
    24. 24. The Principle of Maximum Entropy <ul><li>Proposed by E. T. Jaynes in 1957 </li></ul><ul><li>The distribution with maximum entropy is more likely to occur than other distributions. </li></ul><ul><ul><li>(the entropy of a closed system is continuously increasing?) </li></ul></ul><ul><li>Information entropy as a measure of ‘uninformativeness’ </li></ul><ul><ul><li>If we chose a model with less entropy, we would add ‘information’ constraints to the model that are not justified by the empirical evidence available to us </li></ul></ul>
    25. 25. Application to Text Categorization <ul><li>Feature selection </li></ul><ul><ul><li>In maximum entropy modeling, feature selection and training are usually integrated </li></ul></ul><ul><li>Test for convergence </li></ul><ul><ul><li>Compare the log difference between empirical and estimated feature expectations </li></ul></ul><ul><li>Generalized iterative scaling </li></ul><ul><ul><li>Computationally expensive due to slow convergence </li></ul></ul><ul><li>vs. Naïve Bayes </li></ul><ul><ul><li>Both use the prior probability </li></ul></ul><ul><ul><li>NB assumes that the variables are independent, while MEM does not make this assumption </li></ul></ul><ul><li>Strengths </li></ul><ul><ul><li>Arbitrarily complex features can be defined if the experimenter believes that these features may contribute useful information for the classification decision. </li></ul></ul><ul><ul><li>Unified framework for feature selection and classification </li></ul></ul>
    26. 26. Part IV <ul><li>Perceptrons </li></ul>
    27. 27. Models <ul><li>Data representation model: </li></ul><ul><ul><li>Text documents are represented as term vectors. </li></ul></ul><ul><li>Model class </li></ul><ul><li>Binary classification </li></ul><ul><ul><li>For any input text document x, </li></ul></ul><ul><ul><li>class(x) = c iff f(x) > 0; otherwise class(x) ≠ c </li></ul></ul><ul><li>Algorithm: </li></ul><ul><ul><li>The perceptron learning algorithm is a simple example of a gradient descent algorithm </li></ul></ul><ul><ul><li>The goal is to learn the weight vector w and a threshold theta. </li></ul></ul>
    28. 28. Perceptron Learning Procedure: Gradient Descent <ul><li>Gradient descent </li></ul><ul><ul><li>an optimization algorithm. </li></ul></ul><ul><ul><li>To find a local minimum of a function using gradient descent, take steps proportional to the negative of the gradient of the function at the current point. </li></ul></ul>
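Gradient descent itself fits in a few lines. The sketch below minimizes a made-up one-dimensional function f(x) = (x - 3)^2, whose gradient is 2(x - 3); the function, the step size alpha, and the number of steps are illustrative assumptions.

```python
def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)  # move in the direction of steepest descent
    return x


# Hypothetical objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 6))  # 3.0
```

With this step size the distance to the minimum shrinks by a constant factor of 0.8 per step, so 100 steps are more than enough here.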
    29. 29. Perceptron Learning Procedure: Basic Idea <ul><li>To find a linear division of the training set </li></ul><ul><li>Procedure: </li></ul><ul><ul><li>Start with an estimate of w and theta; whenever they make a mistake, move them in the direction of greatest change for the optimality criterion. </li></ul></ul><ul><ul><ul><li>For each ( x , y ) pair </li></ul></ul></ul><ul><ul><ul><li>Pass ( xi , yi , wi ) to the update rule w ( j )' = w ( j ) + α(δ − y ) x ( j ), where w ( j ) is the j-th item in the weight vector, x ( j ) is the j-th item in the input vector, δ is the expected output, y is the actual output, and α is the learning rate </li></ul></ul></ul><ul><li>Perceptron convergence theorem </li></ul><ul><ul><li>Novikoff (1962) proved that the perceptron algorithm converges after a finite number of iterations if the data is linearly separable </li></ul></ul>
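The update rule can be exercised on a tiny linearly separable problem. This is a minimal sketch, assuming the rule w(j)' = w(j) + α(δ − y)x(j) with the threshold theta treated as the weight of a constant input -1; the training data (logical AND), the learning rate, and the function names are illustrative assumptions.

```python
def predict(w, theta, x):
    """Output 1 if w . x exceeds the threshold theta, otherwise 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > theta else 0


def train_perceptron(data, alpha=1.0, max_epochs=100):
    """Apply the update rule to every misclassified example until none remain."""
    w = [0.0] * len(data[0][0])
    theta = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, delta in data:          # delta is the expected output
            y = predict(w, theta, x)   # y is the actual output
            if y != delta:
                mistakes += 1
                w = [wj + alpha * (delta - y) * xj for wj, xj in zip(w, x)]
                theta = theta + alpha * (delta - y) * (-1)  # constant input -1
        if mistakes == 0:
            break
    return w, theta


# Hypothetical, linearly separable training set: logical AND.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, theta = train_perceptron(data)
print([predict(w, theta, x) for x, _ in data])  # [0, 0, 0, 1]
```

Because the data is linearly separable, the convergence theorem guarantees that the loop stops after a finite number of corrections.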
    30. 30. Why <ul><li>w ( j )' = w ( j ) + α(δ − y ) x ( j ) </li></ul><ul><ul><li>Ref </li></ul></ul><ul><ul><ul><li>www.cs.ualberta.ca/~sutton/book/8/node3.thml </li></ul></ul></ul><ul><ul><ul><li>Sections 8.1 and 8.2 of Reinforcement Learning </li></ul></ul></ul><ul><ul><ul><li>Refer to formula 8.2 </li></ul></ul></ul><ul><ul><li>The direction of the greatest gradient </li></ul></ul><ul><ul><li>where w’ = {w1, w2, …, wk, theta}, x’ = {x1, x2, …, xk, -1} </li></ul></ul>
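    31. 31. E.g. [Figure: one update step. The input vector x is added to the weight vector w, giving w + x; the decision boundary moves from s to s’, with the Yes and No regions on either side.]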
    32. 32. Discussion <ul><li>The data set should be linearly separable </li></ul><ul><ul><li>(1969) Researchers realized the limitations, and interest in perceptrons remained low </li></ul></ul><ul><li>Although it is a gradient descent algorithm, it doesn’t suffer from the local optimum problem when the data is linearly separable </li></ul><ul><li>Back propagation algorithm, etc. </li></ul><ul><ul><li>1980s </li></ul></ul><ul><ul><li>Multi-layer perceptrons, neural networks, connectionist models. </li></ul></ul><ul><ul><li>Overcome the shortcomings of perceptrons; ideally they can learn any classification function (e.g. XOR). </li></ul></ul><ul><ul><li>Converge more slowly </li></ul></ul><ul><ul><li>Can get caught in local optima </li></ul></ul>
    33. 33. Part V <ul><li>K Nearest Neighbor Classification </li></ul>
    34. 34. Nearest Neighbor <ul><li>Category = purple </li></ul>
    35. 35. K Nearest Neighbor <ul><li>N = 4 </li></ul><ul><li>Category = Blue </li></ul>
    36. 36. Discussion <ul><li>Similarity metric </li></ul><ul><ul><li>The complexity of KNN is in finding a good measure of similarity </li></ul></ul><ul><li>Its performance is very dependent on the right similarity metric </li></ul><ul><li>Efficiency </li></ul><ul><ul><li>Computing the similarity to every training document can be expensive; however, there are ways of implementing KNN search efficiently, and often there is an obvious choice for a similarity metric </li></ul></ul>
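A brute-force version of K nearest neighbor classification makes the role of the similarity metric concrete. The sketch below assumes cosine similarity between term vectors and a simple majority vote; the metric choice, the training vectors, the labels, and the function names are all illustrative assumptions.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(x, training, k=3):
    """Label x by majority vote among its k most similar training documents."""
    neighbors = sorted(training, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    votes = Counter(label for _vector, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training documents as (term vector, category) pairs.
training = [
    ((5, 0, 1), "earnings"), ((4, 1, 0), "earnings"), ((3, 0, 0), "earnings"),
    ((0, 5, 4), "other"), ((1, 4, 5), "other"),
]
print(knn_classify((4, 0, 1), training, k=3))  # earnings
```

This version compares the new document with every training document, which is the cost that more efficient search structures aim to avoid.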
    37. 37. Thanks!