Machine Learning for NLP


  1. 1. Seminar: Statistical NLP Machine Learning for Natural Language Processing Lluís Màrquez TALP Research Center Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya Girona, June 2003 Machine Learning for NLP 30/06/2003
  2. 2. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  3. 3. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  4. 4. ML4NLP Machine Learning • There are many general-purpose definitions of Machine Learning (or artificial learning): Making a computer automatically acquire some kind of knowledge from a concrete data domain • Learners are computers: we study learning algorithms • Resources are scarce: time, memory, data, etc. • It has (almost) nothing to do with: Cognitive science, neuroscience, theory of scientific discovery and research, etc. • Biological plausibility is welcome but not the main goal Machine Learning for NLP 30/06/2003
  5. 5. ML4NLP Machine Learning • Learning... but what for? – To perform some particular task – To react to environmental inputs – Concept learning from data: • modelling concepts underlying data • predicting unseen observations • compacting the knowledge representation • knowledge discovery for expert systems • We will concentrate on: – Supervised inductive learning for classification = discriminative learning Machine Learning for NLP 30/06/2003
  6. 6. ML4NLP Machine Learning A more precise definition: Obtaining a description of the concept in some representation language that explains observations and helps predicting new instances of the same distribution • What to read? – Machine Learning (Mitchell, 1997) Machine Learning for NLP 30/06/2003
  7. 7. ML4NLP Empirical NLP 90’s: Application of Machine Learning techniques (ML) to NLP problems • Lexical and structural ambiguity problems (classification): – Word selection (SR, MT) – Part-of-speech tagging – Semantic ambiguity (polysemy) problems – Prepositional phrase attachment – Reference ambiguity (anaphora) – etc. • What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999) Machine Learning for NLP 30/06/2003
  8. 8. ML4NLP NLP “classification” problems • Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity Resolution = Classification He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  9. 9. ML4NLP NLP “classification” problems • Morpho-syntactic ambiguity He was shot in the hand as he chased the robbers in the back street JJ NN NN VB VB VB (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  10. 10. ML4NLP NLP “classification” problems • Morpho-syntactic ambiguity: Part of Speech Tagging He was shot in the hand as he chased the robbers in the back street JJ NN NN VB VB VB (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  11. 11. ML4NLP NLP “classification” problems • Semantic (lexical) ambiguity He was shot in the hand as he chased the robbers in the back street (candidate senses of “back”: body-part vs. clock-part) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  12. 12. ML4NLP NLP “classification” problems • Semantic (lexical) ambiguity: Word Sense Disambiguation He was shot in the hand as he chased the robbers in the back street (candidate senses of “back”: body-part vs. clock-part) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  13. 13. ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  15. 15. ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity: PP-attachment disambiguation He was shot in the hand as he (chased (the robbers)NP (in the back street)PP) (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  16. 16. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms in detail • Applications to NLP Machine Learning for NLP 30/06/2003
  17. 17. Classification Feature Vector Classification AI perspective • An instance is a vector: x = <x1, …, xn> whose components, called features (or attributes), are discrete or real-valued. • Let X be the space of all possible instances. • Let Y = {y1, …, ym} be the set of categories (or classes). • The goal is to learn an unknown target function, f : X → Y • A training example is an instance x belonging to X, labelled with the correct value for f(x), i.e., a pair <x, f(x)> • Let D be the set of all training examples. Machine Learning for NLP 30/06/2003
  18. 18. Classification Feature Vector Classification • The hypotheses space, H, is the set of functions h: X → Y that the learner can consider as possible definitions • The goal is to find a function h belonging to H such that for every pair <x, f(x)> belonging to D, h(x) = f(x) Machine Learning for NLP 30/06/2003
  19. 19. Classification An Example Training examples: (1) small, red, circle → positive; (2) big, red, circle → positive; (3) small, red, triangle → negative; (4) big, blue, circle → negative. Rules: (COLOR=red) ∧ (SHAPE=circle) → positive; otherwise → negative. Equivalent decision tree: test COLOR; if red, test SHAPE (circle → positive, triangle → negative); if blue → negative Machine Learning for NLP 30/06/2003
  20. 20. Classification An Example The same training examples: (1) small, red, circle → positive; (2) big, red, circle → positive; (3) small, red, triangle → negative; (4) big, blue, circle → negative. Alternative rules: (SIZE=small) ∧ (SHAPE=circle) → positive; (SIZE=big) ∧ (COLOR=red) → positive; otherwise → negative. Equivalent decision tree: test SIZE; if small, test SHAPE (circle → positive, triangle → negative); if big, test COLOR (red → positive, blue → negative) Machine Learning for NLP 30/06/2003
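As an aside (not part of the slides), the two hypotheses above can be written as plain Python predicates; the sketch below uses invented names and simply checks that both are consistent with the four training examples, which is exactly why an inductive bias is needed to prefer one over the other:

```python
# Minimal sketch (hypothetical names): the four training examples from the slide
# and the two hypotheses (rule sets / decision trees) shown above.
EXAMPLES = [
    ({"size": "small", "color": "red",  "shape": "circle"},   "positive"),
    ({"size": "big",   "color": "red",  "shape": "circle"},   "positive"),
    ({"size": "small", "color": "red",  "shape": "triangle"}, "negative"),
    ({"size": "big",   "color": "blue", "shape": "circle"},   "negative"),
]

def hypothesis_color_shape(x):
    # (COLOR=red) and (SHAPE=circle) -> positive; otherwise -> negative
    return "positive" if x["color"] == "red" and x["shape"] == "circle" else "negative"

def hypothesis_size_based(x):
    # (SIZE=small) and (SHAPE=circle) -> positive; (SIZE=big) and (COLOR=red) -> positive
    if x["size"] == "small":
        return "positive" if x["shape"] == "circle" else "negative"
    return "positive" if x["color"] == "red" else "negative"

# Both hypotheses satisfy h(x) == f(x) on every training example,
# yet they disagree on unseen instances (e.g. a big blue triangle).
for h in (hypothesis_color_shape, hypothesis_size_based):
    assert all(h(x) == y for x, y in EXAMPLES)
```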
  21. 21. Classification Some important concepts • Inductive Bias: “Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias” (Mooney & Cardie, 99) – Language / Search bias. Example of a decision-tree bias: COLOR=red → test SHAPE (circle → positive, triangle → negative); COLOR=blue → negative Machine Learning for NLP 30/06/2003
  22. 22. Classification Some important concepts • Inductive Bias • Training error and generalization error • Generalization ability and overfitting • Batch Learning vs. on-line Leaning • Symbolic vs. statistical Learning • Propositional vs. first-order learning Machine Learning for NLP 30/06/2003
  23. 23. Classification Propositional vs. Relational Learning • Propositional learning: color(red) ∧ shape(circle) → classA • Relational learning = ILP (induction of logic programs): course(X) ∧ person(Y) ∧ link_to(Y,X) → instructor_of(X,Y); research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) → member_proj(X,Z) Machine Learning for NLP 30/06/2003
  24. 24. Classification The Classification Setting Class, Point, Example, Data Set, ... CoLT/SLT perspective • Input Space: X ⊆ Rⁿ • (binary) Output Space: Y = {+1, -1} • A point, pattern or instance: x ∈ X, x = (x1, x2, …, xn) • Example: (x, y) with x ∈ X, y ∈ Y • Training Set: a set of m examples generated i.i.d. according to an unknown distribution P(x,y): S = {(x1, y1), …, (xm, ym)} ⊆ (X × Y)^m Machine Learning for NLP 30/06/2003
  25. 25. Classification The Classification Setting Learning, Error, ... • The hypotheses space, H, is the set of functions h: X → Y that the learner can consider as possible definitions. In SVMs, they are of the form: h(x) = Σ_{i=1..n} w_i φ_i(x) + b • The goal is to find a function h belonging to H such that the expected misclassification error on new examples, also drawn from P(x,y), is minimal (Risk Minimization, RM) Machine Learning for NLP 30/06/2003
  26. 26. Classification The Classification Setting Learning, Error, ... • Expected error (risk): R(h) = ∫ loss(h(x), y) dP(x, y) • Problem: P itself is unknown. Only the training examples are known ⇒ an induction principle is needed • Empirical Risk Minimization (ERM): find the function h belonging to H for which the training error (empirical risk) is minimal: R_emp(h) = (1/m) Σ_{i=1..m} loss(h(x_i), y_i) Machine Learning for NLP 30/06/2003
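As an illustration (not from the slides), the empirical risk under 0/1 loss is straightforward to compute; a minimal sketch with made-up names:

```python
def empirical_risk(h, sample, loss=lambda y_pred, y: 0 if y_pred == y else 1):
    """Empirical risk R_emp(h) = (1/m) * sum_i loss(h(x_i), y_i) over the sample."""
    m = len(sample)
    return sum(loss(h(x), y) for x, y in sample) / m

# ERM picks the h in the hypothesis space H with the smallest empirical risk,
# e.g. over a finite candidate set H:
#   best_h = min(H, key=lambda h: empirical_risk(h, sample))
```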
  27. 27. Classification The Classification Setting Error, Over(under)fitting, ... • Low training error ⇒ low true error? • The overfitting dilemma (figure: underfitting vs. overfitting) • Trade-off between training error and complexity • Different learning biases can be used Machine Learning for NLP 30/06/2003
  28. 28. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  29. 29. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms −Decision Trees −AdaBoost −Support Vector Machines • Applications to NLP Machine Learning for NLP 30/06/2003
  30. 30. Algorithms Learning Paradigms • Statistical learning: – HMM, Bayesian Networks, ME, CRF, etc. • Traditional methods from Artificial Intelligence (ML, AI) – Decision trees/lists, exemplar-based learning, rule induction, neural networks, etc. • Methods from Computational Learning Theory (CoLT/SLT) – Winnow, AdaBoost, SVM’s, etc. Machine Learning for NLP 30/06/2003
  31. 31. Algorithms Learning Paradigms • Classifier combination: – Bagging, Boosting, Randomization, ECOC, Stacking, etc. • Semi-supervised learning: learning from labelled and unlabelled examples – Bootstrapping, EM, Transductive learning (SVM’s, AdaBoost), Co-Training, etc. • etc. Machine Learning for NLP 30/06/2003
  32. 32. Algorithms Decision Trees • Decision trees are a way to represent rules underlying training data, with hierarchical structures that recursively partition the data. • They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: Description, Classification, and Generalization. • From a machine-learning perspective: Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes Machine Learning for NLP 30/06/2003
  33. 33. Algorithms Decision Trees • Acquisition: Top-Down Induction of Decision Trees (TDIDT) • Systems: CART (Breiman et al. 84), ID3, C4.5, C5.0 (Quinlan 86,93,98), ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95) etc. Machine Learning for NLP 30/06/2003
  34. 34. Algorithms An Example (figure: a generic n-ary decision tree with features A1, A2, A3, A5, branch values v1..v7 and leaf classes C1, C2, C3, shown next to the SIZE/SHAPE/COLOR tree from the earlier example: SIZE=small → SHAPE (circle → pos, triangle → neg); SIZE=big → COLOR (red → pos, blue → neg)) Machine Learning for NLP 30/06/2003
  35. 35. Algorithms Learning Decision Trees • Training: Training Set + TDIDT → DT • Test: Example + DT → Class Machine Learning for NLP 30/06/2003
  36. 36. Algorithms General Induction Algorithm
      function TDIDT (X: set-of-examples; A: set-of-features)
        var tree1, tree2: decision-tree; X’: set-of-examples; A’: set-of-features end-var
        if stopping_criterion(X) then
          tree1 := create_leaf_tree(X)
        else
          amax := feature_selection(X, A);
          tree1 := create_tree(X, amax);
          for-all val in values(amax) do
            X’ := select_examples(X, amax, val);
            A’ := A - {amax};
            tree2 := TDIDT(X’, A’);
            tree1 := add_branch(tree1, tree2, val)
          end-for
        end-if
        return (tree1)
      end-function
      Machine Learning for NLP 30/06/2003
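A possible Python transcription of the TDIDT pseudocode above; this is only a sketch with invented helper names, leaving the stopping criterion and feature-selection function as parameters:

```python
def tdidt(examples, features, select_feature, stopping_criterion, majority_class):
    """Top-Down Induction of Decision Trees, following the pseudocode above.

    examples: list of (feature_dict, class_label) pairs
    features: set of feature names still available for splitting
    select_feature(examples, features): chooses the best feature (e.g. information gain)
    stopping_criterion(examples, features): decides when to stop splitting
    majority_class(examples): class label assigned to a leaf
    """
    if stopping_criterion(examples, features):
        return {"leaf": majority_class(examples)}              # create_leaf_tree
    a_max = select_feature(examples, features)                 # feature_selection
    tree = {"feature": a_max, "branches": {}}                  # create_tree
    for val in {x[a_max] for x, _ in examples}:                # values(a_max)
        subset = [(x, y) for x, y in examples if x[a_max] == val]   # select_examples
        subtree = tdidt(subset, features - {a_max},
                        select_feature, stopping_criterion, majority_class)
        tree["branches"][val] = subtree                        # add_branch
    return tree
```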
  38. 38. Algorithms Feature Selection Criteria • Functions derived from Information Theory: – Information Gain, Gain Ratio (Quinlan 86) • Functions derived from Distance Measures – Gini Diversity Index (Breiman et al. 84) – RLM (López de Mántaras 91) • Statistically-based – Chi-square test (Sestito & Dillon 94) – Symmetrical Tau (Zhou & Dillon 91) • RELIEFF-IG: variant of RELIEFF (Kononenko 94) Machine Learning for NLP 30/06/2003
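For example, Information Gain (Quinlan 86) could serve as the feature-selection function of the TDIDT sketch above; a minimal version, with helper names of my own:

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy of the class distribution of a set of (features, label) examples."""
    counts = Counter(y for _, y in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, feature):
    """Entropy reduction obtained by splitting the examples on `feature`."""
    total = len(examples)
    remainder = 0.0
    for val in {x[feature] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[feature] == val]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def select_by_information_gain(examples, features):
    # Pick the feature with maximal information gain on the current example set.
    return max(features, key=lambda a: information_gain(examples, a))
```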
  39. 39. Algorithms Extensions of DTs (Murthy 95) • Pruning (pre/post) • Minimize the effect of the greedy approach: lookahead • Non-linear splits • Combination of multiple models • Incremental learning (on-line) • etc. Machine Learning for NLP 30/06/2003
  40. 40. Algorithms Decision Trees and NLP • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99) • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00) • Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96) • Parsing (Magerman 95,96; Haruno et al. 98,99) • Text categorization (Lewis & Ringuette 94; Weiss et al. 99) • Text summarization (Mani & Bloedorn 98) • Dialogue act tagging (Samuel et al. 98) Machine Learning for NLP 30/06/2003
  41. 41. Algorithms Decision Trees and NLP • Noun phrase coreference (Aone & Benett 95; Mc Carthy & Lehnert 95) • Discourse analysis in information extraction (Soderland & Lehnert 94) • Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94) • Verb classification in Machine Translation (Tanaka 96; Siegel 97) Machine Learning for NLP 30/06/2003
  42. 42. Algorithms Decision Trees: pros&cons • Advantages – Acquires symbolic knowledge in an understandable way – Very well studied ML algorithms and variants – Can be easily translated into rules – Existence of available software: C4.5, C5.0, etc. – Can be easily integrated into an ensemble Machine Learning for NLP 30/06/2003
  43. 43. Algorithms Decision Trees: pros&cons • Drawbacks – Computationally expensive when scaling to large natural language domains: training examples, features, etc. – Data sparseness and data fragmentation: the problem of the small disjuncts => Probability estimation – DTs are models with high variance (unstable) – Tendency to overfit training data: pruning is necessary – Requires quite a big effort in tuning the model Machine Learning for NLP 30/06/2003
  44. 44. Algorithms Boosting algorithms • Idea: “to combine many simple and moderately accurate hypotheses (weak classifiers) into a single and highly accurate classifier” • AdaBoost (Freund & Schapire 95) has been studied extensively, both theoretically and empirically • Many other variants and extensions (1997-2003): http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html Machine Learning for NLP 30/06/2003
  45. 45. Algorithms AdaBoost: general scheme (figure: a linear combination F(h1, h2, ..., hT) of weak hypotheses h1, ..., hT; each weak learner is trained on training set TS_t under a probability distribution D_t, which is updated from round to round) Machine Learning for NLP 30/06/2003
  46. 46. Algorithms AdaBoost: algorithm (Freund & Schapire 97) Machine Learning for NLP 30/06/2003
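The slide shows the original algorithm as a figure; as a rough illustration, here is a hedged Python sketch of binary AdaBoost in the spirit of (Freund & Schapire 97), with the weak learner passed in as a parameter and all helper names invented:

```python
import math

def adaboost(examples, labels, weak_learner, T):
    """Binary AdaBoost: labels in {-1, +1}; weak_learner(examples, labels, D) -> h.

    Returns a weighted list of weak hypotheses [(alpha_t, h_t), ...];
    the combined classifier is sign(sum_t alpha_t * h_t(x)).
    """
    m = len(examples)
    D = [1.0 / m] * m                          # initial uniform distribution over examples
    ensemble = []
    for _ in range(T):
        h = weak_learner(examples, labels, D)  # weak hypothesis trained on distribution D
        # weighted training error of h
        eps = sum(D[i] for i in range(m) if h(examples[i]) != labels[i])
        if eps == 0.0:                         # perfect weak hypothesis: keep it and stop
            ensemble.append((1.0, h))
            break
        if eps >= 0.5:                         # no better than chance: stop boosting
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # re-weight: increase the weight of misclassified examples, then normalise
        D = [D[i] * math.exp(-alpha * labels[i] * h(examples[i])) for i in range(m)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def classify(ensemble, x):
    """Final hypothesis: sign of the weighted vote of the weak hypotheses."""
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```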
  47. 47. Algorithms AdaBoost: example Weak hypotheses = vertical/horizontal hyperplanes Machine Learning for NLP 30/06/2003
  48. 48. Algorithms AdaBoost: round 1 Machine Learning for NLP 30/06/2003
  49. 49. Algorithms AdaBoost: round 2 Machine Learning for NLP 30/06/2003
  50. 50. Algorithms AdaBoost: round 3 Machine Learning for NLP 30/06/2003
  51. 51. Algorithms Combined Hypothesis www.research.att.com/~yoav/adaboost Machine Learning for NLP 30/06/2003
  52. 52. Algorithms AdaBoost and NLP • POS Tagging (Abney et al. 99; Màrquez 99) • Text and Speech Categorization (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99) • PP-attachment Disambiguation (Abney et al. 99) • Parsing (Haruno et al. 99) • Word Sense Disambiguation (Escudero et al. 00, 01) • Shallow parsing (Carreras & Màrquez, 01a; 02) • Email spam filtering (Carreras & Màrquez, 01b) • Term Extraction (Vivaldi, et al. 01) Machine Learning for NLP 30/06/2003
  53. 53. Algorithms AdaBoost: pros&cons + Easy to implement and few parameters to set + Time and space grow linearly with the number of examples. Ability to manage very large learning problems + Does not constrain explicitly the complexity of the learner + Naturally combines feature selection with learning + Has been successfully applied to many practical problems Machine Learning for NLP 30/06/2003
  54. 54. Algorithms AdaBoost: pros&cons ± Seems to be rather robust to overfitting (number of rounds) but sensitive to noise ± Performance is very good when there are relatively few relevant terms (features) – Can perform poorly when there is insufficient training data relative to the complexity of the base classifiers, so that the training errors of the base classifiers become too large too quickly Machine Learning for NLP 30/06/2003
  55. 55. Algorithms SVM: A General Definition • “Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory”. (Cristianini & Shawe-Taylor, 2000) Machine Learning for NLP 30/06/2003
  56. 56. Algorithms SVM: A General Definition • “Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory”. (Cristianini & Shawe-Taylor, 2000) Key Concepts Machine Learning for NLP 30/06/2003
  57. 57. Algorithms Linear Classifiers • Hyperplanes in R^N • Defined by a weight vector (w) and a threshold (b) • They induce a classification rule: h(x) = sign(Σ_{i=1..N} w_i x_i + b), i.e. +1 if Σ_{i=1..N} w_i x_i + b ≥ 0 and -1 otherwise (figure: positive and negative points separated by the hyperplane defined by w and b) Machine Learning for NLP 30/06/2003
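Spelled out in code (a trivial sketch with made-up weight values):

```python
def linear_classifier(w, b):
    """Classification rule induced by a hyperplane (w, b): h(x) = sign(w . x + b)."""
    def h(x):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if score >= 0 else -1
    return h

h = linear_classifier(w=[2.0, -1.0], b=0.5)   # toy weight vector and threshold
print(h([1.0, 1.0]), h([-1.0, 2.0]))          # -> 1 -1
```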
  58. 58. Algorithms Optimal Hyperplane: Geometric Intuition Machine Learning for NLP 30/06/2003
  59. 59. Algorithms Optimal Hyperplane: Geometric Intuition These are the Support Vectors Maximal Margin Hyperplane Machine Learning for NLP 30/06/2003
  60. 60. Algorithms Linearly separable data • Geometric margin: 2 / ‖w‖ • Maximizing the margin is equivalent to minimizing ‖w‖² subject to the constraints: y_i (w · x_i + b) ≥ 1 for all i = 1, …, l • This is a Quadratic Programming problem Machine Learning for NLP 30/06/2003
  61. 61. Algorithms Non-separable case (soft margin) • ξ_1, …, ξ_l: positive slack variables for introducing costs • Minimize ‖w‖² + C Σ_{i=1..l} ξ_i subject to the constraints: y_i (w · x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i = 1, …, l Machine Learning for NLP 30/06/2003
  62. 62. Algorithms Non-linear SVMs • Implicit mapping into feature space via kernel functions • Non-linear mapping: φ : X → F • Set of hypotheses: f(x) = Σ_{i=1..n} w_i φ_i(x) + b • Dual formulation: f(x) = Σ_{i=1..l} α_i y_i φ(x_i) · φ(x) + b • Kernel function: K(x, z) = φ(x) · φ(z) • Evaluation: f(x) = Σ_{i=1..l} α_i y_i K(x_i, x) + b Machine Learning for NLP 30/06/2003
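A small sketch of the dual-form evaluation with two of the kernels mentioned on the next slide (polynomial and Gaussian RBF); the support vectors, alphas and bias below are invented for illustration rather than obtained from training:

```python
import math

def poly_kernel(x, z, degree=3, coef0=1.0):
    """K(x, z) = (x . z + coef0)^degree"""
    return (sum(xi * zi for xi, zi in zip(x, z)) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, alphas, ys, b, kernel):
    """Dual-form evaluation: f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, ys)) + b

# Toy (hand-picked, not learned) values, just to show the call shape:
f = svm_decision([0.2, 0.1],
                 support_vectors=[[0.0, 0.0], [1.0, 1.0]],
                 alphas=[0.7, 0.7], ys=[-1, +1], b=0.0,
                 kernel=rbf_kernel)
label = 1 if f >= 0 else -1
```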
  63. 63. Algorithms Non-linear SVMs • Kernel functions – Must be efficiently computable – Characterization via Mercer’s theorem – One of the curious facts about using a kernel is that we do not need to know the underlying feature map in order to be able to learn in the feature space! (Cristianini & Shawe-Taylor, 2000) – Examples: polynomials, Gaussian radial basis functions, two-layer sigmoidal neural networks, etc. Machine Learning for NLP 30/06/2003
  64. 64. Algorithms Non-linear SVMs (figure: degree-3 polynomial kernel on a linearly separable and a linearly non-separable data set) Machine Learning for NLP 30/06/2003
  65. 65. Algorithms Toy Examples • All examples have been run with the 2D graphic interface of LIBSVM (Chang and Lin, National Taiwan University) “LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMLight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs…” • Available from: www.csie.ntu.edu.tw/~cjlin/libsvm (it includes a Web integrated demo tool) Machine Learning for NLP 30/06/2003
  66. 66. Algorithms Toy Examples (I) Linearly separable data set; linear SVM; maximal margin hyperplane. What happens if we add a blue training example here? Machine Learning for NLP 30/06/2003
  67. 67. Algorithms Toy Examples (I) (still) Linearly separable data set Linear SVM High value of C parameter Maximal margin Hyperplane The example is correctly classified Machine Learning for NLP 30/06/2003
  68. 68. Algorithms Toy Examples (I) (still) Linearly separable data set Linear SVM Low value of C parameter Trade-off between: margin and training error The example is now a bounded SV Machine Learning for NLP 30/06/2003
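A similar experiment can be reproduced outside the LIBSVM GUI; for instance, a hedged sketch using scikit-learn's SVC (which wraps LIBSVM), on a toy data set invented for illustration of the high-C vs. low-C behaviour described above:

```python
from sklearn.svm import SVC

# Toy, still linearly separable 2D data: two clusters plus one extra point
# that lies close to the other class (the "added blue training example").
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3], [1.6, 1.6]]
y = [-1, -1, -1, +1, +1, +1, +1]

for C in (1000.0, 0.1):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # High C: training error dominates, the margin shrinks so the extra point
    #         is classified correctly.
    # Low C:  the wide margin is preferred, and the extra point may end up as a
    #         bounded support vector inside (or on the wrong side of) the margin.
    print(C, clf.predict([[1.6, 1.6]]), "support vectors:", len(clf.support_))
```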
  69. 69. Algorithms Toy Examples (II) Machine Learning for NLP 30/06/2003
  70. 70. Algorithms Toy Examples (II) Machine Learning for NLP 30/06/2003
  71. 71. Algorithms Toy Examples (II) Machine Learning for NLP 30/06/2003
  72. 72. Algorithms Toy Examples (III) Machine Learning for NLP 30/06/2003
  73. 73. Algorithms SVM: Summary • SVMs introduced in COLT’92 (Boser, Guyon, & Vapnik, 1992). Great development since then • Kernel-induced feature spaces: SVMs work efficiently in very high dimensional feature spaces (+) • Learning bias: maximal margin optimisation. Reduces the danger of overfitting. Generalization bounds for SVMs (+) • Compact representation of the induced hypothesis. The solution is sparse in terms of SVs (+) Machine Learning for NLP 30/06/2003
  74. 74. Algorithms SVM: Summary • Due to Mercer’s conditions on the kernels the optimisation problems are convex. No local minima (+) • Optimisation theory guides the implementation. Efficient learning (+) • Mainly for classification but also for regression, density estimation, clustering, etc. • Success in many real-world applications: OCR, vision, bioinformatics, speech recognition, NLP: TextCat, POS tagging, chunking, parsing, etc. (+) • Parameter tuning (–). Implications in convergence times, sparsity of the solution, etc. Machine Learning for NLP 30/06/2003
  75. 75. Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP Machine Learning for NLP 30/06/2003
  76. 76. Applications NLP problems • Warning! We will not focus on final NLP applications, but on intermediate tasks... • We will classify the NLP tasks according to their (structural) complexity Machine Learning for NLP 30/06/2003
  77. 77. Applications NLP problems: structural complexity • Decisional problems − Text Categorization, Document filtering, Word Sense Disambiguation, etc. • Sequence tagging and detection of sequential structures − POS tagging, Named Entity extraction, syntactic chunking, etc. • Hierarchical structures − Clause detection, full parsing, IE of complex concepts, composite Named Entities, etc. Machine Learning for NLP 30/06/2003
  78. 78. Applications POS tagging • Morpho-syntactic ambiguity: Part of Speech Tagging He was shot in the hand as he chased the robbers in the back street JJ NN NN VB VB VB (The Wall Street Journal Corpus) Machine Learning for NLP 30/06/2003
  79. 79. Applications POS tagging (figure: “preposition-adverb” decision tree for the word form “As”/“as”. Root: P(IN)=0.81, P(RB)=0.19. The tree tests the word form, then tag(+1), then tag(+2). Probabilistic interpretation at the leaf: P̂(RB | word=“as”, tag(+1)=RB, tag(+2)=IN) = 0.987; P̂(IN | word=“as”, tag(+1)=RB, tag(+2)=IN) = 0.013) Machine Learning for NLP 30/06/2003
  80. 80. Applications POS tagging (figure: the same “preposition-adverb” tree, illustrating the collocations it captures: “as_RB much_RB as_IN”, “as_RB soon_RB as_IN”, “as_RB well_RB as_IN”) Machine Learning for NLP 30/06/2003
  81. 81. Applications POS tagging RTT (Màrquez & Rodríguez 97) (figure: raw text → morphological analysis → disambiguation loop driven by the language model (classify → update → filter), repeated until a stop condition holds → tagged text) See also: A Sequential Model for Multi-class Classification: NLP/POS Tagging (Even-Zohar & Roth, 01) Machine Learning for NLP 30/06/2003
  82. 82. Applications POS tagging STT (Màrquez & Rodríguez 97) (figure: raw text → morphological analysis → disambiguation with the Viterbi algorithm, using a language model of lexical probabilities + contextual probabilities → tagged text) See also: The Use of Classifiers in Sequential Inference: Chunking (Punyakanok & Roth, 00) Machine Learning for NLP 30/06/2003
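The disambiguation step of STT is standard Viterbi decoding over the lexical and contextual probabilities; below is a minimal sketch, assuming the probability tables are supplied by the language model (all names are hypothetical):

```python
def viterbi(words, tags, lexical_p, contextual_p, initial_p):
    """Most probable tag sequence under prod_t P(word_t | tag_t) * P(tag_t | tag_{t-1}).

    lexical_p[tag][word], contextual_p[prev_tag][tag] and initial_p[tag] are
    (smoothed) probabilities coming from the language model.
    """
    # delta[i][tag] = best score of a tag sequence ending in `tag` at position i
    delta = [{t: initial_p[t] * lexical_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):
        delta.append({})
        back.append({})
        for t in tags:
            prev, score = max(((p, delta[i - 1][p] * contextual_p[p][t]) for p in tags),
                              key=lambda pair: pair[1])
            delta[i][t] = score * lexical_p[t].get(w, 1e-6)
            back[i][t] = prev
    # backtrace from the best final tag
    best = max(tags, key=lambda t: delta[-1][t])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```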
  83. 83. Applications Detection of sequential and hierarchical structures • Named Entity recognition • Clause detection Machine Learning for NLP 30/06/2003
  84. 84. Conclusions Summary/conclusions • We have briefly outlined: −The ML setting: “supervised learning for classification” −Three concrete machine learning algorithms −How to apply them to solve intermediate NLP tasks Machine Learning for NLP 30/06/2003
  85. 85. Conclusions Summary/conclusions • Any ML algorithm for NLP should be: – Robust to noise and outliers – Efficient in large feature/example spaces – Adaptive to new/changing domains: portability, tuning, etc. – Able to take advantage of unlabelled examples: semi-supervised learning Machine Learning for NLP 30/06/2003
  86. 86. Conclusions Summary/conclusions • Statistical and ML-based Natural Language Processing is a very active and multidisciplinary area of research Machine Learning for NLP 30/06/2003
  87. 87. Conclusions Some current research lines • Appropriate learning paradigms for all kinds of NLP problems: TiMBL (DBZ99), TBEDL (Brill95), ME (Ratnaparkhi98), SNoW (Roth98), CRF (Pereira & Singer 02). • Definition of an adequate (and task-specific) feature space: mapping from the input space to a high dimensional feature space, kernels, etc. • Resolution of complex NLP problems: inference with classifiers + constraint satisfaction • etc. Machine Learning for NLP 30/06/2003
  88. 88. Conclusions Bibliography • You may find additional information at: http://www.lsi.upc.es/~lluism/ tesi.html publicacions/pubs.html cursos/talks.html cursos/MLandNL.html cursos/emnlp1.html • This talk at: http://www.lsi.upc.es/~lluism/udg03.ppt.gz Machine Learning for NLP 30/06/2003
  89. 89. Seminar: Statistical NLP Machine Learning for Natural Language Processing Lluís Màrquez TALP Research Center Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya Girona, June 2003 Machine Learning for NLP 30/06/2003
