Speaker notes:
- ReliefF-IG: a variant of Kononenko's ReliefF function that determines the usefulness of the different attributes by taking into account the interrelations among them.
- Last point: there are many functions for attribute selection, stopping criteria, pruning methods, etc.
- Maximizing the functional margin is equivalent to normalizing it to 1 (canonical hyperplanes) and minimizing the norm of the weight vector.
1. Seminar: Statistical NLP (Girona, June 2003)
Machine Learning for Natural Language Processing
Lluís Màrquez, TALP Research Center, Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya
2-3. Outline
- Machine Learning for NLP
- The Classification Problem
- Three ML Algorithms
- Applications to NLP
4. Machine Learning
There are many general-purpose definitions of Machine Learning (or artificial learning): making a computer automatically acquire some kind of knowledge from a concrete data domain.
- Learners are computers: we study learning algorithms.
- Resources are scarce: time, memory, data, etc.
- It has (almost) nothing to do with cognitive science, neuroscience, the theory of scientific discovery and research, etc.
- Biological plausibility is welcome but not the main goal.

5. Machine Learning
We will concentrate on supervised inductive learning for classification (= discriminative learning).
Learning... but what for?
- To perform some particular task
- To react to environmental inputs
- Concept learning from data: modelling the concepts underlying the data, predicting unseen observations, compacting the knowledge representation, knowledge discovery for expert systems

6. Machine Learning
What to read? Machine Learning (Mitchell, 1997).
A more precise definition: obtaining a description of the concept in some representation language that explains the observations and helps to predict new instances of the same distribution.
7. Empirical NLP
The 90's: application of Machine Learning (ML) techniques to NLP problems.
Lexical and structural ambiguity problems (classification problems):
- Word selection (SR, MT)
- Part-of-speech tagging
- Semantic ambiguity (polysemy)
- Prepositional phrase attachment
- Reference ambiguity (anaphora)
- etc.
What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999).
8. NLP "classification" problems
Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity resolution = classification.
"He was shot in the hand as he chased the robbers in the back street" (The Wall Street Journal Corpus)
9-10. NLP "classification" problems: morpho-syntactic ambiguity, resolved by Part-of-Speech Tagging
"He was shot in the hand as he chased the robbers in the back street" (The Wall Street Journal Corpus)
Several words in the sentence are morpho-syntactically ambiguous; the slides show candidate tags such as NN, VB and JJ over them.
11-12. NLP "classification" problems: semantic (lexical) ambiguity, resolved by Word Sense Disambiguation
"He was shot in the hand as he chased the robbers in the back street" (The Wall Street Journal Corpus)
Here "hand" may denote a body part or a clock part.
13-15. NLP "classification" problems: structural (syntactic) ambiguity, resolved by PP-attachment disambiguation
"He was shot in the hand as he chased the robbers in the back street" (The Wall Street Journal Corpus)
The prepositional phrase "in the back street" may attach to the verb or to the noun phrase; the intended reading is "He was shot in the hand as he (chased (the robbers)_NP (in the back street)_PP)".
16. Outline
- The Classification Problem
- Three ML Algorithms in detail
- Applications to NLP
- Machine Learning for NLP
17. Feature Vector Classification (AI perspective)
- An instance is a vector x = <x_1, ..., x_n> whose components, called features (or attributes), are discrete or real-valued.
- Let X be the space of all possible instances.
- Let Y = {y_1, ..., y_m} be the set of categories (or classes).
- The goal is to learn an unknown target function f: X → Y.
- A training example is an instance x belonging to X, labelled with the correct value of f(x), i.e., a pair <x, f(x)>.
- Let D be the set of all training examples.

18. Feature Vector Classification
- The goal is to find a function h belonging to H such that for every pair <x, f(x)> in D, h(x) = f(x).
- The hypothesis space H is the set of functions h: X → Y that the learner can consider as possible definitions.
19. An Example
Rules: (COLOR = red) ∧ (SHAPE = circle) → positive; otherwise → negative.
Decision tree: the root tests COLOR; the red branch tests SHAPE (circle → positive, triangle → negative); the blue branch → negative.

20. An Example
Rules: (SIZE = small) ∧ (SHAPE = circle) → positive; (SIZE = big) ∧ (COLOR = red) → positive; otherwise → negative.
Decision tree: the root tests SIZE; the small branch tests SHAPE (circle → pos, triangle → neg); the big branch tests COLOR (red → pos, blue → neg). A code sketch of this equivalence follows below.
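To make the rules/tree equivalence of slide 20 concrete, here is a small Python sketch (an illustration added to this transcript, not part of the original slides); it checks that the rule set and the decision tree classify every object of the toy domain identically.

    # Rule-set representation of slide 20
    def classify_by_rules(size, color, shape):
        if size == "small" and shape == "circle":
            return "positive"
        if size == "big" and color == "red":
            return "positive"
        return "negative"

    # Decision-tree representation of slide 20 (the root tests SIZE)
    def classify_by_tree(size, color, shape):
        if size == "small":
            return "positive" if shape == "circle" else "negative"
        return "positive" if color == "red" else "negative"

    # Both representations agree on the whole toy domain
    for size in ("small", "big"):
        for color in ("red", "blue"):
            for shape in ("circle", "triangle"):
                assert classify_by_rules(size, color, shape) == classify_by_tree(size, color, shape)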
21. Some important concepts
- Inductive bias: "Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias" (Mooney & Cardie, 99).
- Language bias / search bias.
(The slide repeats the COLOR/SHAPE decision tree of slide 19 as an illustration.)

22. Some important concepts
- Inductive bias
- Training error and generalization error
- Generalization ability and overfitting
- Batch learning vs. on-line learning
- Symbolic vs. statistical learning
- Propositional vs. first-order learning
23. Propositional vs. Relational Learning
- Propositional learning: color(red) ∧ shape(circle) → classA
- Relational learning = ILP (induction of logic programs):
  course(X) ∧ person(Y) ∧ link_to(Y,X) → instructor_of(X,Y)
  research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) → member_proj(X,Z)

24. The Classification Setting (CoLT/SLT perspective): class, point, example, data set, ...
- Input space: X ⊆ R^n
- (Binary) output space: Y = {+1, -1}
- A point, pattern or instance: x ∈ X, x = (x_1, x_2, ..., x_n)
- Example: (x, y) with x ∈ X, y ∈ Y
- Training set: a set of m examples generated i.i.d. according to an unknown distribution P(x, y): S = {(x_1, y_1), ..., (x_m, y_m)} ∈ (X × Y)^m
25. The Classification Setting: learning, error, ...
- The hypothesis space H is the set of functions h: X → Y that the learner can consider as possible definitions. In the case of SVMs, these are (thresholded) linear functions.
- The goal is to find a function h belonging to H such that the expected misclassification error on new examples, also drawn from P(x, y), is minimal (Risk Minimization, RM).

26. The Classification Setting: learning, error, ...
- Expected error (risk): the expected misclassification error under P(x, y); see the reconstruction below.
- Problem: P itself is unknown. Only the training examples are known ⇒ an induction principle is needed.
- Empirical Risk Minimization (ERM): find the function h belonging to H for which the training error (empirical risk) is minimal.
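The risk formulas themselves are not preserved in this transcript (they appear as images on the original slides); a standard formulation that matches the surrounding text is:

\[
R(h) = \int L\bigl(h(\mathbf{x}), y\bigr)\, dP(\mathbf{x}, y)
\qquad\text{and}\qquad
R_{\mathrm{emp}}(h) = \frac{1}{m}\sum_{i=1}^{m} L\bigl(h(\mathbf{x}_i), y_i\bigr),
\]

where \(L\) is the 0-1 loss (\(L(h(\mathbf{x}), y) = 1\) if \(h(\mathbf{x}) \neq y\), and \(0\) otherwise). ERM then selects \(\hat{h} = \arg\min_{h \in H} R_{\mathrm{emp}}(h)\).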
27. The Classification Setting: error, over(under)fitting, ...
- Low training error ⇒ low true error?
- The overfitting dilemma (Müller et al., 2001): underfitting vs. overfitting.
- Trade-off between training error and complexity.
- Different learning biases can be used.
28. Outline
- The Classification Problem
- Three ML Algorithms
- Applications to NLP
- Machine Learning for NLP

29. Outline
- The Classification Problem
- Three ML Algorithms: Decision Trees, AdaBoost, Support Vector Machines
- Applications to NLP
- Machine Learning for NLP
30. Learning Paradigms
- Statistical learning: HMM, Bayesian networks, ME, CRF, etc.
- Traditional methods from Artificial Intelligence (ML, AI): decision trees/lists, exemplar-based learning, rule induction, neural networks, etc.
- Methods from Computational Learning Theory (CoLT/SLT): Winnow, AdaBoost, SVMs, etc.

31. Learning Paradigms
- Classifier combination: bagging, boosting, randomization, ECOC, stacking, etc.
- Semi-supervised learning (learning from labelled and unlabelled examples): bootstrapping, EM, transductive learning (SVMs, AdaBoost), co-training, etc.
- etc.
32. Decision Trees
- Decision trees are a way to represent the rules underlying training data, with hierarchical structures that recursively partition the data.
- They have been used by many research communities (pattern recognition, statistics, ML, etc.) for data exploration, with purposes such as description, classification and generalization.
- From a machine-learning perspective: decision trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes.

33. Decision Trees
- Acquisition: Top-Down Induction of Decision Trees (TDIDT).
- Systems: CART (Breiman et al. 84); ID3, C4.5, C5.0 (Quinlan 86, 93, 98); ASSISTANT, ASSISTANT-R (Cestnik et al. 87; Kononenko et al. 95); etc.
34. An Example
[Figure: a generic decision tree with attribute nodes A1, A2, A3, A5, branch values v1...v7 and leaf classes C1, C2, C3, shown next to the SIZE/SHAPE/COLOR decision tree of slide 20.]

35. Learning Decision Trees
- Training: Training Set + TDIDT → DT
- Test: Example + DT → Class
36-37. General Induction Algorithm

    function TDIDT (X: set-of-examples; A: set-of-features)
    var
      tree1, tree2: decision-tree; X': set-of-examples; A': set-of-features
    end-var
    if (stopping_criterion(X)) then
      tree1 := create_leaf_tree(X)
    else
      a_max := feature_selection(X, A);
      tree1 := create_tree(X, a_max);
      for-all val in values(a_max) do
        X' := select_examples(X, a_max, val);
        A' := A - {a_max};
        tree2 := TDIDT(X', A');
        tree1 := add_branch(tree1, tree2, val)
      end-for
    end-if
    return(tree1)
    end-function

(A Python sketch of this loop follows below.)
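A minimal Python sketch of the TDIDT loop above, added for illustration. Examples are represented as dictionaries with a special "class" key, and feature_score is any "higher is better" selection criterion (slide 38 lists the standard choices); these representation details are assumptions of the sketch, not taken from any of the cited systems.

    from collections import Counter

    def majority(examples):
        return Counter(e["class"] for e in examples).most_common(1)[0][0]

    def purity_score(examples, feature):
        """Fraction of examples that majority-vote leaves (one per value) would classify correctly."""
        correct = 0
        for val in {e[feature] for e in examples}:
            subset = [e for e in examples if e[feature] == val]
            correct += sum(1 for e in subset if e["class"] == majority(subset))
        return correct / len(examples)

    def tdidt(examples, features, feature_score=purity_score):
        if len({e["class"] for e in examples}) == 1 or not features:     # stopping_criterion
            return ("leaf", majority(examples))                          # create_leaf_tree
        a_max = max(features, key=lambda f: feature_score(examples, f))  # feature_selection
        branches = {}
        for val in {e[a_max] for e in examples}:                         # values(a_max)
            subset = [e for e in examples if e[a_max] == val]            # select_examples
            branches[val] = tdidt(subset, features - {a_max}, feature_score)
        return ("node", a_max, branches)

    # e.g. the data behind slide 20 could be passed as
    # tdidt([{"SIZE": "small", "SHAPE": "circle", "COLOR": "red", "class": "pos"}, ...],
    #       {"SIZE", "SHAPE", "COLOR"})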
38. Feature Selection Criteria
- Functions derived from Information Theory: Information Gain, Gain Ratio (Quinlan 86)
- Functions derived from distance measures: Gini Diversity Index (Breiman et al. 84), RLM (López de Mántaras 91)
- Statistically based: Chi-square test (Sestito & Dillon 94), Symmetrical Tau (Zhou & Dillon 91)
- ReliefF-IG: a variant of ReliefF (Kononenko 94)
(Sketches of two of these criteria follow below.)
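Two of the criteria above (Information Gain and the Gini Diversity Index), written as a sketch so that either can be plugged into the tdidt function of the previous slide as feature_score; simplified illustrations only, not the exact definitions used by any of the cited systems.

    import math
    from collections import Counter

    def class_distribution(examples):
        counts = Counter(e["class"] for e in examples)
        return [c / len(examples) for c in counts.values()]

    def entropy(examples):
        return -sum(p * math.log2(p) for p in class_distribution(examples) if p)

    def gini(examples):
        return 1.0 - sum(p * p for p in class_distribution(examples))

    def impurity_reduction(examples, feature, impurity):
        """Impurity of the parent minus the weighted impurity of the split on `feature`."""
        weighted = 0.0
        for val in {e[feature] for e in examples}:
            subset = [e for e in examples if e[feature] == val]
            weighted += len(subset) / len(examples) * impurity(subset)
        return impurity(examples) - weighted

    def information_gain(examples, feature):
        return impurity_reduction(examples, feature, entropy)

    def gini_reduction(examples, feature):
        return impurity_reduction(examples, feature, gini)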
39. Extensions of DTs (Murthy 95)
- Pruning (pre/post)
- Minimizing the effect of the greedy approach: lookahead
- Non-linear splits
- Combination of multiple models
- Incremental learning (on-line)
- etc.
40. Decision Trees and NLP
- Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
- POS tagging (Cardie 93; Schmid 94b; Magerman 95; Màrquez & Rodríguez 95, 97; Màrquez et al. 00)
- Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
- Parsing (Magerman 95, 96; Haruno et al. 98, 99)
- Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
- Text summarization (Mani & Bloedorn 98)
- Dialogue act tagging (Samuel et al. 98)

41. Decision Trees and NLP
- Noun phrase coreference (Aone & Benett 95; McCarthy & Lehnert 95)
- Discourse analysis in information extraction (Soderland & Lehnert 94)
- Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)
- Verb classification in Machine Translation (Tanaka 96; Siegel 97)
42. Decision Trees: pros & cons
Advantages:
- Acquire symbolic knowledge in an understandable way
- Very well studied ML algorithms and variants
- Can be easily translated into rules
- Available software exists: C4.5, C5.0, etc.
- Can be easily integrated into an ensemble

43. Decision Trees: pros & cons
Drawbacks:
- Computationally expensive when scaling to large natural language domains: many training examples, features, etc.
- Data sparseness and data fragmentation: the problem of small disjuncts ⇒ probability estimation
- DTs are a model with high variance (unstable)
- Tendency to overfit the training data: pruning is necessary
- Require quite a big effort in tuning the model
44. Boosting algorithms
- Idea: "to combine many simple and moderately accurate hypotheses (weak classifiers) into a single and highly accurate classifier"
- AdaBoost (Freund & Schapire 95) has been studied extensively, both theoretically and empirically
- Many other variants and extensions (1997-2003): http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html
45. AdaBoost: general scheme
[Figure: the training scheme. At each round t, a weak learner is trained on the training sample TS_t weighted by a probability distribution D_t and outputs a weak hypothesis h_t; the distribution is then updated. The final classifier used at test time is a linear combination F(h_1, h_2, ..., h_T).]

46. AdaBoost: algorithm (Freund & Schapire 97)
[The slide shows the AdaBoost pseudocode; a sketch of the standard algorithm follows below.]
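Since the pseudocode on slide 46 is not preserved in this transcript, here is a sketch of the standard discrete AdaBoost of Freund & Schapire (97) for labels y ∈ {-1, +1}, using one-dimensional threshold "stumps" as the weak learner; the stump learner and the data layout are assumptions of this illustration.

    import math

    def train_stump(X, y, w):
        """Return the (error, feature, threshold, polarity) stump with lowest weighted error."""
        best = None
        for j in range(len(X[0])):
            for thr in sorted({x[j] for x in X}):
                for pol in (+1, -1):
                    pred = [pol if x[j] >= thr else -pol for x in X]
                    err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        return best

    def adaboost(X, y, T=10):
        m = len(X)
        w = [1.0 / m] * m                                  # initial distribution D_1
        ensemble = []
        for _ in range(T):
            err, j, thr, pol = train_stump(X, y, w)
            err = max(err, 1e-12)                          # avoid division by zero
            alpha = 0.5 * math.log((1 - err) / err)        # weight of the weak hypothesis
            ensemble.append((alpha, j, thr, pol))
            pred = [pol if x[j] >= thr else -pol for x in X]
            w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
            z = sum(w)                                     # renormalize: distribution D_{t+1}
            w = [wi / z for wi in w]
        return ensemble

    def predict(ensemble, x):
        score = sum(a * (pol if x[j] >= thr else -pol) for a, j, thr, pol in ensemble)
        return 1 if score >= 0 else -1                     # sign of the linear combination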
47. AdaBoost: example
Weak hypotheses = vertical/horizontal hyperplanes.

48-50. AdaBoost: rounds 1, 2 and 3
[Figures only.]

51. Combined Hypothesis
[Figure; see www.research.att.com/~yoav/adaboost]
52. AdaBoost and NLP
- POS tagging (Abney et al. 99; Màrquez 99)
- Text and speech categorization (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99)
- PP-attachment disambiguation (Abney et al. 99)
- Parsing (Haruno et al. 99)
- Word sense disambiguation (Escudero et al. 00, 01)
- Shallow parsing (Carreras & Màrquez 01a, 02)
- Email spam filtering (Carreras & Màrquez 01b)
- Term extraction (Vivaldi et al. 01)
53. AdaBoost: pros & cons
- Easy to implement and few parameters to set
- Time and space grow linearly with the number of examples; ability to manage very large learning problems
- Does not explicitly constrain the complexity of the learner
- Naturally combines feature selection with learning
- Has been successfully applied to many practical problems

54. AdaBoost: pros & cons
- Seems to be rather robust to overfitting (number of rounds) but sensitive to noise
- Performance is very good when there are relatively few relevant terms (features)
- Can perform poorly when there is insufficient training data relative to the complexity of the base classifiers, or when the training errors of the base classifiers become too large too quickly
55-56. SVM: A General Definition
"Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory." (Cristianini & Shawe-Taylor, 2000)
(Slide 56 highlights the key concepts of this definition.)
57. Linear Classifiers
- Hyperplanes in R^N.
- Defined by a weight vector (w) and a threshold (b).
- They induce a classification rule (reconstructed below).
[Figure: a hyperplane with normal vector w separating positive (+) from negative (-) points.]
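The classification rule itself is not preserved in the transcript; for a hyperplane with weight vector \(\mathbf{w}\) and threshold \(b\) it is presumably the usual one:

\[
h(\mathbf{x}) = \operatorname{sign}\bigl(\langle \mathbf{w}, \mathbf{x}\rangle + b\bigr) =
\begin{cases}
+1 & \text{if } \langle \mathbf{w}, \mathbf{x}\rangle + b \ge 0,\\
-1 & \text{otherwise.}
\end{cases}
\]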
58-59. Optimal Hyperplane: Geometric Intuition
[Figures: among the separating hyperplanes, the maximal margin hyperplane; the training points lying on the margin are the support vectors.]

60. Linearly separable data
The maximal margin hyperplane is obtained by solving a quadratic programming problem.

61. Non-separable case (soft margin)
[Slide: the soft-margin formulation; a reconstruction follows below.]
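The optimisation problems referred to on slides 60-61 are not preserved in the transcript; the standard soft-margin formulation (which also matches the speaker note about canonical hyperplanes and minimizing the norm of the weight vector) is:

\[
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}\xi_i
\quad \text{subject to} \quad
y_i\bigl(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\bigr) \ge 1 - \xi_i,\;\; \xi_i \ge 0,
\]

and the linearly separable case of slide 60 is recovered by dropping the slack variables \(\xi_i\) (equivalently, letting \(C \to \infty\)).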
62. Non-linear SVMs
- Implicit mapping into feature space via kernel functions.
[The slide shows the non-linear mapping, the resulting set of hypotheses, the dual formulation, the kernel function and the evaluation of the classifier.]

63. Non-linear SVMs
Kernel functions:
- Must be efficiently computable
- Characterization via Mercer's theorem
- "One of the curious facts about using a kernel is that we do not need to know the underlying feature map in order to be able to learn in the feature space!" (Cristianini & Shawe-Taylor, 2000)
- Examples: polynomials, Gaussian radial basis functions, two-layer sigmoidal neural networks, etc.
(A small numeric illustration of the kernel trick follows below.)
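A small numeric check of the kernel trick, added as an illustration with made-up data: the homogeneous degree-2 polynomial kernel K(x, z) = (x · z)^2 in two dimensions corresponds to the explicit feature map φ(x) = (x1², √2·x1·x2, x2²), so the kernel value equals a dot product in the feature space without ever building φ explicitly.

    import math

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    def poly2_kernel(x, z):
        return dot(x, z) ** 2              # kernel evaluated in the input space

    def phi(x):
        x1, x2 = x
        return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

    x, z = (1.0, 2.0), (3.0, -1.0)
    print(poly2_kernel(x, z))              # 1.0
    print(dot(phi(x), phi(z)))             # 1.0 as well, computed in feature space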
64. Non-linear SVMs
[Figures: a degree-3 polynomial kernel on a linearly separable and on a linearly non-separable data set.]

65. Toy Examples
- All examples have been run with the 2D graphic interface of LIBSVM (Chang and Lin, National Taiwan University).
- "LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMLight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs..."
- Available from www.csie.ntu.edu.tw/~cjlin/libsvm (it includes a Web-integrated demo tool).
66. Toy Examples (I)
Linearly separable data set; linear SVM; maximal margin hyperplane. What happens if we add a blue training example here?

67. Toy Examples (I)
(Still) linearly separable data set; linear SVM with a high value of the C parameter; maximal margin hyperplane. The new example is correctly classified.

68. Toy Examples (I)
(Still) linearly separable data set; linear SVM with a low value of the C parameter; trade-off between margin and training error. The new example is now a bounded support vector. (A code sketch of this experiment follows below.)
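A rough modern equivalent of this toy experiment (an assumption of this transcript, not part of the original LIBSVM demo) using scikit-learn's SVC class, which wraps LIBSVM; the tiny 2D data set and the two C values are made up to show the margin versus training-error trade-off.

    import math
    from sklearn.svm import SVC

    X = [[1, 1], [2, 1], [1, 2],        # three points of class -1
         [4, 4], [5, 4], [4, 5],        # three points of class +1
         [3.6, 3.6]]                    # an awkward extra -1 point near the +1 cluster
    y = [-1, -1, -1, 1, 1, 1, -1]

    for C in (100.0, 0.1):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        w = clf.coef_[0]
        margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))
        print(f"C={C}: {len(clf.support_vectors_)} support vectors, margin width {margin:.2f}")

    # A large C forces the hyperplane to respect the awkward point (narrow margin);
    # a small C lets it fall inside the margin as a bounded support vector,
    # possibly trading some training error for a wider margin.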
69-71. Toy Examples (II)
[Figures only.]

72. Toy Examples (III)
[Figure only.]
73. SVM: Summary
- SVMs were introduced in COLT'92 (Boser, Guyon & Vapnik, 1992). Great development since then.
- Kernel-induced feature spaces: SVMs work efficiently in very high dimensional feature spaces (+)
- Learning bias: maximal margin optimisation. Reduces the danger of overfitting. Generalization bounds for SVMs (+)
- Compact representation of the induced hypothesis: the solution is sparse in terms of SVs (+)

74. SVM: Summary
- Due to Mercer's conditions on the kernels, the optimisation problems are convex. No local minima (+)
- Optimisation theory guides the implementation. Efficient learning (+)
- Mainly for classification, but also for regression, density estimation, clustering, etc.
- Success in many real-world applications: OCR, vision, bioinformatics, speech recognition, NLP (text categorization, POS tagging, chunking, parsing, etc.) (+)
- Parameter tuning (-), with implications for convergence times, sparsity of the solution, etc.
75. Outline
- The Classification Problem
- Three ML Algorithms
- Applications to NLP
- Machine Learning for NLP

76. NLP problems
- Warning! We will not focus on final NLP applications, but on intermediate tasks...
- We will classify the NLP tasks according to their (structural) complexity.

77. NLP problems: structural complexity
- Decisional problems: text categorization, document filtering, word sense disambiguation, etc.
- Sequence tagging and detection of sequential structures: POS tagging, named entity extraction, syntactic chunking, etc.
- Hierarchical structures: clause detection, full parsing, IE of complex concepts, composite named entities, etc.
78. POS tagging
Morpho-syntactic ambiguity, resolved by Part-of-Speech Tagging: "He was shot in the hand as he chased the robbers in the back street" (The Wall Street Journal Corpus), with candidate tags such as NN, VB and JJ over the ambiguous words.
79. POS tagging
The "preposition-adverb" decision tree for the word forms "As"/"as":
- root: P(IN) = 0.81, P(RB) = 0.19
- after the word-form test ("As"/"as" vs. others): P(IN) = 0.83, P(RB) = 0.17
- after the test tag(+1) = RB (vs. others): P(IN) = 0.13, P(RB) = 0.87
- after the test tag(+2) = IN (leaf): P(IN) = 0.013, P(RB) = 0.987
Probabilistic interpretation:
P̂(RB | word = "As"/"as" ∧ tag(+1) = RB ∧ tag(+2) = IN) = 0.987
P̂(IN | word = "As"/"as" ∧ tag(+1) = RB ∧ tag(+2) = IN) = 0.013

80. POS tagging
The same "preposition-adverb" tree captures collocations such as "as_RB much_RB as_IN", "as_RB well_RB as_IN", "as_RB soon_RB as_IN". (A code sketch of the tree lookup follows below.)
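A sketch of how the statistical "preposition-adverb" tree would be consulted for an occurrence of "as"; the nested tests mirror the path described on slide 79, and since the probabilities of the "others" branches are not preserved in the transcript, the internal-node estimates are returned there as placeholders.

    def p_tags_for_as(word, next_tag, next_next_tag):
        """Return (P(IN), P(RB)) along the tree path shown on slide 79."""
        if word not in ("As", "as"):
            return (0.81, 0.19)      # root estimate; other word forms go to branches not shown
        if next_tag != "RB":
            return (0.83, 0.17)      # estimate after the word-form test
        if next_next_tag != "IN":
            return (0.13, 0.87)      # estimate after the tag(+1) = RB test
        return (0.013, 0.987)        # leaf: tag(+1) = RB and tag(+2) = IN

    # e.g. the first "as" of "as much as" (context RB, IN) comes out strongly RB:
    print(p_tags_for_as("as", "RB", "IN"))    # (0.013, 0.987)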
81. POS tagging
RTT (Màrquez & Rodríguez 97): raw text → morphological analysis → disambiguation loop (classify, filter, update, until a stopping condition is met) using a language model → tagged text.
See also: A Sequential Model for Multi-class Classification: NLP/POS tagging (Even-Zohar & Roth, 01).

82. POS tagging
STT (Màrquez & Rodríguez 97): raw text → morphological analysis → Viterbi algorithm (disambiguation with lexical probabilities + contextual probabilities from the language model) → tagged text. (A Viterbi sketch follows below.)
See also: The Use of Classifiers in Sequential Inference: chunking (Punyakanok & Roth, 00).
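A minimal Viterbi sketch in the spirit of the STT tagger above: lexical probabilities P(word | tag) plus contextual bigram probabilities P(tag | previous tag). The function and the table layout are illustrative assumptions; this is not the language model of Màrquez & Rodríguez (97).

    def viterbi(words, tags, lex, ctx, start):
        """lex[(word, tag)], ctx[(prev_tag, tag)] and start[tag] are probability tables."""
        # delta[t] = probability of the best tag sequence for the words seen so far ending in tag t
        delta = {t: start.get(t, 0.0) * lex.get((words[0], t), 0.0) for t in tags}
        backpointers = []
        for w in words[1:]:
            new_delta, pointers = {}, {}
            for t in tags:
                best_prev = max(tags, key=lambda p: delta[p] * ctx.get((p, t), 0.0))
                new_delta[t] = delta[best_prev] * ctx.get((best_prev, t), 0.0) * lex.get((w, t), 0.0)
                pointers[t] = best_prev
            delta = new_delta
            backpointers.append(pointers)
        # recover the best path by following the back-pointers from the best final tag
        path = [max(tags, key=lambda t: delta[t])]
        for pointers in reversed(backpointers):
            path.append(pointers[path[-1]])
        return list(reversed(path))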
83. Detection of sequential and hierarchical structures
- Named entity recognition
- Clause detection
84. Summary/conclusions
We have briefly outlined:
- The ML setting: "supervised learning for classification"
- Three concrete machine learning algorithms
- How to apply them to solve intermediate NLP tasks

85. Summary/conclusions
Any ML algorithm for NLP should be:
- Robust to noise and outliers
- Efficient in large feature/example spaces
- Adaptive to new/changing domains: portability, tuning, etc.
- Able to take advantage of unlabelled examples: semi-supervised learning

86. Summary/conclusions
Statistical and ML-based Natural Language Processing is a very active and multidisciplinary area of research.
87. Some current research lines
- An appropriate learning paradigm for all kinds of NLP problems: TiMBL (DBZ 99), TBEDL (Brill 95), ME (Ratnaparkhi 98), SNoW (Roth 98), CRF (Pereira & Singer 02).
- Definition of an adequate (and task-specific) feature space: mapping from the input space to a high-dimensional feature space, kernels, etc.
- Resolution of complex NLP problems: inference with classifiers + constraint satisfaction.
- etc.
88. Bibliography
You may find additional information at:
- http://www.lsi.upc.es/~lluism/ (tesi.html, publicacions/pubs.html, cursos/talks.html, cursos/MLandNL.html, cursos/emnlp1.html)
This talk: http://www.lsi.upc.es/~lluism/udg03.ppt.gz
89. Seminar: Statistical NLP (Girona, June 2003)
Machine Learning for Natural Language Processing
Lluís Màrquez, TALP Research Center, Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya
