1. Statistische Methoden in der Computerlinguistik / Statistical Methods in Computational Linguistics
   7. Machine Learning
   Jonas Kuhn, Universität Potsdam, 2007
2. Addition: Good-Turing Smoothing
   - Use discounted estimates (r*) only up to a threshold k.
   - For types that occurred more often than k, relative frequency estimates are used (Jurafsky/Martin, p. 216).
   - With threshold k: see the reconstructed formula below.
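The formula on this slide did not survive extraction. As a hedged reconstruction: combining Good-Turing discounting with a count threshold k is usually done in the Katz (1987) form presented in Jurafsky/Martin, where N_r is the number of types occurring exactly r times; the notation on the original slide may differ.

```latex
r^{*} =
\begin{cases}
\dfrac{(r+1)\dfrac{N_{r+1}}{N_{r}} \;-\; r\,\dfrac{(k+1)N_{k+1}}{N_{1}}}
      {1 \;-\; \dfrac{(k+1)N_{k+1}}{N_{1}}} & \text{for } 1 \le r \le k,\\[2ex]
r & \text{for } r > k.
\end{cases}
```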
3. Machine Learning Overview
   - Tom Mitchell (1997): Machine Learning. McGraw-Hill.
   - Slides mainly based on slides by Joakim Nivre and Tom Mitchell.
4. Machine Learning
   - Idea: Synthesize computer programs by learning from representative examples of input (and output) data.
   - Rationale:
     - For many problems, there is no known method for computing the desired output from a set of inputs.
     - For other problems, computation according to the known correct method may be too expensive.
5. Well-Posed Learning Problems
   - A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
   - Examples:
     - Learning to classify chemical compounds
     - Learning to drive an autonomous vehicle
     - Learning to play bridge
     - Learning to parse natural language sentences
6. Designing a Learning System
   - In designing a learning system, we have to deal with (at least) the following issues:
     - Training experience
     - Target function
     - Learned function
     - Learning algorithm
   - Example: Consider the task T of parsing English sentences, using the performance measure P of labeled precision and recall in a given test corpus (gold standard).
7. Training Experience
   - Issues concerning the training experience:
     - Direct or indirect evidence (supervised or unsupervised).
     - Controlled or uncontrolled sequence of training examples.
     - Representativity of training data in relation to test data.
   - Training data for a syntactic parser:
     - Treebank versus raw text corpus.
     - Constructed test suite versus random sample.
     - Training and test data from the same/similar/different sources with the same/similar/different annotations.
8. Target Function and Learned Function
   - The problem of improving performance can often be reduced to the problem of learning some particular target function.
     - A shift-reduce parser can be trained by learning a transition function f : C → C, where C is the set of possible parser configurations.
   - In many cases we can only hope to acquire some approximation to the ideal target function.
     - The transition function f can be approximated by a function from stack (top) symbols to parse actions.
9. Learning Algorithm
   - In order to learn the (approximated) target function we require:
     - A set of training examples (input arguments)
     - A rule for estimating the value corresponding to each training example (if this is not directly available)
     - An algorithm for choosing the function that best fits the training data
   - Given a treebank on which we can simulate the shift-reduce parser, we may decide to choose the function that maps each stack symbol to the action that occurs most frequently when that symbol is on top of the stack (see the sketch below).
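To make the "most frequent action per stack symbol" idea concrete, here is a minimal Python sketch. The input format (pairs of stack-top symbol and gold action obtained by simulating the parser on a treebank), the function name, and the toy data are illustrative assumptions, not the course's actual implementation.

```python
from collections import Counter, defaultdict

def learn_action_function(oracle_runs):
    """Learn a mapping from stack-top symbol to its most frequent parse action.

    oracle_runs: iterable of (stack_top_symbol, gold_action) pairs, assumed to
    come from simulating the shift-reduce parser on a treebank.
    """
    counts = defaultdict(Counter)
    for symbol, action in oracle_runs:
        counts[symbol][action] += 1
    # For each stack symbol, pick the action observed most often with it.
    return {symbol: actions.most_common(1)[0][0]
            for symbol, actions in counts.items()}

# Toy usage with made-up symbols and actions:
runs = [("NP", "reduce"), ("NP", "reduce"), ("NP", "shift"), ("Det", "shift")]
print(learn_action_function(runs))  # {'NP': 'reduce', 'Det': 'shift'}
```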
10. Supervised Learning
   - Let X and Y be the set of possible inputs and outputs, respectively.
     - Target function: Function f from X to Y.
     - Training data: Finite sequence D of pairs <x, f(x)> (x ∈ X).
     - Hypothesis space: Subset H of functions from X to Y.
     - Learning algorithm: Function A mapping a training set D to a hypothesis h ∈ H.
   - If Y is a subset of the real numbers, we have a regression problem; otherwise we have a classification problem.
11. Variations of Machine Learning
   - Unsupervised learning: Learning without output values (data exploration, e.g. clustering).
   - Query learning: Learning where the learner can query the environment about the output associated with a particular input.
   - Reinforcement learning: Learning where the learner has a range of actions which it can take to attempt to move towards states where it can expect high rewards.
   - Batch vs. online learning: All training examples at once or one at a time (with estimate and update after each example).
12. Learning and Generalization
   - Any hypothesis that correctly classifies all the training examples is said to be consistent. However:
     - The training data may be noisy, so that there is no consistent hypothesis at all.
     - The real target function may be outside the hypothesis space and has to be approximated.
     - A rote learner, which simply outputs y for every x such that <x, y> ∈ D, is consistent but fails to classify any x not in D.
   - A better criterion of success is generalization, the ability to correctly classify instances not represented in the training data.
13. Approaches to Machine Learning
   - Decision trees
   - Artificial neural networks
   - Bayesian learning
   - Instance-based learning (cf. memory-based learning, MBL)
   - Genetic algorithms
   - Relational learning (cf. inductive logic programming, ILP)
   First focus: Naive Bayes classifier
14. Decision Tree Example: Name Recognition
   [Figure: decision tree testing "Capitalized?" and "Sentence-Initial?", with Yes/No branches and leaves labeled 0 and 1]
15. Decision Tree Example: Name Recognition
16. Bayesian Learning
   - Two reasons for studying Bayesian learning methods:
     - Efficient learning algorithms for certain kinds of problems
     - Analysis framework for other kinds of learning algorithms
   - Features of Bayesian learning methods:
     - Assign probabilities to hypotheses (not accept or reject)
     - Combine prior knowledge with observed data
     - Permit hypotheses that make probabilistic predictions
     - Permit predictions based on multiple hypotheses, weighted by their probabilities
17. Towards Naive Bayes Classification
   - Let H be a hypothesis space defined over the instance space X, where the task is to learn a target function f : X → Y, where Y is a finite set of classes used to classify instances in X, and where a_1, ..., a_n are the attributes used to represent an instance x ∈ X.
   - Bayes' Theorem (see the reconstruction below):
     - Maximize the numerator to find the most probable y for a given x (and its attribute representation a_1, ..., a_n).
     - The numerator is equivalent to the joint probability.
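The equations on this slide were lost in extraction. A hedged reconstruction from the surrounding text (Bayes' theorem for a class y given the attribute representation, and the equivalence of the numerator to the joint probability):

```latex
P(y \mid a_1,\dots,a_n) \;=\; \frac{P(a_1,\dots,a_n \mid y)\,P(y)}{P(a_1,\dots,a_n)},
\qquad
P(a_1,\dots,a_n \mid y)\,P(y) \;=\; P(a_1,\dots,a_n,\, y).
```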
18. Towards Naive Bayes Classification
   - Reformulate the numerator (using the chain rule).
   - (Naive) conditional independence assumption: for any attributes a_i, a_j (i ≠ j), a_i is independent of a_j given the class.
   - Hence the joint probability factorizes; see the reconstructed equations below.
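The equations themselves were images; a hedged reconstruction of the standard derivation (chain-rule expansion, the conditional independence assumption, and the resulting factorization):

```latex
P(a_1,\dots,a_n,\, y) \;=\; P(y)\,P(a_1 \mid y)\,P(a_2 \mid a_1, y)\cdots P(a_n \mid a_1,\dots,a_{n-1}, y)

P(a_i \mid a_j, y) \;=\; P(a_i \mid y) \quad (i \neq j)
\qquad\Longrightarrow\qquad
P(a_1,\dots,a_n,\, y) \;=\; P(y)\prod_{i=1}^{n} P(a_i \mid y).
```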
19. Naive Bayes Classifier
   - The naive Bayes classification of a new instance is given by the decision rule reconstructed below.
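The decision rule on the slide was an image; the standard naive Bayes rule implied by the previous two slides is:

```latex
y_{\mathrm{NB}} \;=\; \operatorname*{arg\,max}_{y \in Y} \; P(y)\prod_{i=1}^{n} P(a_i \mid y).
```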
20. Learning a Naive Bayes Classifier
   - Estimate probabilities from training data using ML (maximum-likelihood) estimation.
   - Smooth probability estimates to compensate for sparse data, e.g. using an m-estimate (see below), where m is a constant called the equivalent sample size and p is a prior probability (usually assumed to be uniform).
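The estimation formulas were lost in extraction. A hedged reconstruction following Mitchell (1997), where n is the number of training examples with class y and n_c is the number of those that additionally have attribute value a_i = v:

```latex
\hat{P}(y) \;=\; \frac{|\{\,x \in D : f(x) = y\,\}|}{|D|},
\qquad
\hat{P}(a_i = v \mid y) \;=\; \frac{n_c}{n}
\quad\text{(ML estimates)}

\hat{P}(a_i = v \mid y) \;=\; \frac{n_c + m\,p}{n + m}
\quad\text{(m-estimate)}
```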
21. Naive Bayes
   - Naive Bayes classifiers work surprisingly well for text classification tasks (a minimal sketch follows).
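To make the preceding formulas concrete, here is a minimal, self-contained Python sketch of a naive Bayes text classifier with m-estimate smoothing. The class and variable names, the whitespace tokenization, and the toy data are illustrative assumptions only, not the setup used in the course.

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayesTextClassifier:
    """Bag-of-words naive Bayes with m-estimate smoothing."""

    def __init__(self, m=1.0):
        self.m = m  # equivalent sample size for the m-estimate

    def train(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.vocab = set()
        for doc, y in zip(documents, labels):
            for word in doc.split():
                self.word_counts[y][word] += 1
                self.vocab.add(word)
        self.n_docs = len(documents)

    def classify(self, document):
        p_uniform = 1.0 / len(self.vocab)  # uniform prior p for the m-estimate
        best, best_score = None, float("-inf")
        for y, n_y in self.class_counts.items():
            score = log(n_y / self.n_docs)             # log P(y)
            total = sum(self.word_counts[y].values())  # tokens observed in class y
            for word in document.split():
                n_c = self.word_counts[y][word]
                # m-estimate: (n_c + m*p) / (n + m)
                score += log((n_c + self.m * p_uniform) / (total + self.m))
            if score > best_score:
                best, best_score = y, score
        return best

# Toy usage with made-up data:
clf = NaiveBayesTextClassifier(m=2.0)
clf.train(["buy cheap meds now", "meeting agenda attached"], ["spam", "ham"])
print(clf.classify("cheap meds"))  # likely 'spam'
```

With a uniform prior p = 1/|V| and m = |V|, the m-estimate reduces to Laplace (add-one) smoothing of the word counts.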
22. Decision Tree Learning
   - Decision trees classify instances by sorting them down the tree from the root to some leaf node, where:
     - Each internal node specifies a test of some attribute.
     - Each branch corresponds to a value for the tested attribute.
     - Each leaf node provides a classification for the instance.
   - Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances.
     - Each path from root to leaf specifies a conjunction of tests.
     - The tree itself represents the disjunction of all paths.
23. Example: Name Recognition
   [Figure: decision tree testing "Capitalized?" and "Sentence-Initial?", with Yes/No branches and leaves labeled 0 and 1]
24. PlayTennis example
25. Appropriate Problems for Decision Tree Learning
   - Instances are represented by attribute-value pairs.
   - The target function has discrete output values.
   - Disjunctive descriptions may be required.
   - The training data may contain errors.
   - The training data may contain missing attribute values.
26. The ID3 Learning Algorithm
   - ID3(X = instances, Y = classes, A = attributes):
     1. Create a root node R for the tree.
     2. If all instances in X are in class y, return R with label y.
     3. Else let the decision attribute for R be the attribute a ∈ A that best classifies X, and for each value v_i of a:
        (a) Add a branch below R for the test a = v_i.
        (b) Let X_i be the subset of X that has a = v_i. If X_i is empty, add a leaf labeled with the most common class in X; else add the subtree ID3(X_i, Y, A − {a}).
     4. Return R.
   - A Python sketch of this algorithm follows.
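A minimal Python sketch of ID3 as described above, using the information gain criterion from the next slide. It assumes examples are dictionaries of attribute values, omits the empty-subset case (which cannot arise when branching only on values that occur in X), and uses made-up toy data; names and data format are illustrative assumptions only.

```python
from collections import Counter
from math import log2

def entropy(examples, target):
    """Entropy of the class distribution of `examples` (a list of dicts)."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    """Expected reduction in entropy from splitting `examples` on `attribute`."""
    total = len(examples)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Return a tree: a class label (leaf) or (attribute, {value: subtree})."""
    classes = [ex[target] for ex in examples]
    if len(set(classes)) == 1:          # all examples in one class
        return classes[0]
    if not attributes:                  # no attributes left: majority class
        return Counter(classes).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    branches = {}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        branches[value] = id3(subset, rest, target)
    return (best, branches)

# Toy usage with made-up name-recognition-style data:
data = [
    {"Capitalized": "Yes", "SentenceInitial": "No",  "Name": "1"},
    {"Capitalized": "Yes", "SentenceInitial": "Yes", "Name": "0"},
    {"Capitalized": "No",  "SentenceInitial": "Yes", "Name": "0"},
    {"Capitalized": "No",  "SentenceInitial": "No",  "Name": "0"},
]
print(id3(data, ["Capitalized", "SentenceInitial"], "Name"))
```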
27. Selecting the Best Attribute
   - ID3 uses the measure Information Gain (IG) to decide which attribute a best classifies a set of examples X, where V_a is the set of possible values for a, X_v is the subset of X for which a = v, and Entropy(X) is defined over the class distribution of X.
   - An alternative measure is Gain Ratio (GR).
   - Reconstructed definitions of both measures are given below.
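The definitions on this slide were images. A hedged reconstruction using the standard formulations (Quinlan's ID3/C4.5, as presented in Mitchell 1997), where p_y is the proportion of examples in X belonging to class y and the denominator of GR is the split information of attribute a:

```latex
\mathit{Entropy}(X) \;=\; -\sum_{y \in Y} p_y \log_2 p_y
\qquad
\mathit{IG}(X, a) \;=\; \mathit{Entropy}(X) \;-\; \sum_{v \in V_a} \frac{|X_v|}{|X|}\,\mathit{Entropy}(X_v)

\mathit{GR}(X, a) \;=\; \frac{\mathit{IG}(X, a)}
{\,-\sum_{v \in V_a} \frac{|X_v|}{|X|}\log_2\frac{|X_v|}{|X|}\,}
```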
28. Information Gain
29. Training examples
30. Selecting the next attribute
32. Hypothesis Space Search and Inductive Bias
   - Characteristics of ID3:
     - Searches a complete hypothesis space of target functions
     - Maintains a single current hypothesis throughout the search
     - Performs a hill-climbing search (susceptible to local minima)
     - Uses all training examples at each step in the search
   - Inductive bias:
     - Prefers shorter trees over longer ones
     - Prefers trees with informative attributes close to the root
     - Preference bias (incomplete search of a complete space)
33. Overfitting
   - The problem of overfitting:
     - A hypothesis h is overfitted to the training data if there exists an alternative hypothesis h' with higher training error but lower test error.
   - Two approaches for avoiding overfitting in decision tree learning:
     - Stop growing the tree before it overfits the training data.
     - Allow overfitting and then post-prune the tree.
   - Both methods require a stopping criterion and can be validated using held-out data.
