  • Pos, spelling
  • Task is relevant because words can take multiple POS tags, correct one determined by context.
  • Again, the issue of multiple representations and ambiguity (JFK); a ‘harder’ task, because it is phrase-level
  • (in a principled way, given ML theory, that is...)
  • Just to r^d -> r, instance space, output space, single (multi) label
  • pictures of a 4 class classifier...with the OvA definition
  • Learning a Linear Discriminant Function in Online
  • (in a principled way, given ML theory, that is...)
  • Well-posed learning problems
  • Framing the classification task
  • POS tagger requires one sentence per line.
  • A machine learning system
  • Preprocessing text
  • A machine learning system
  • Feature Generation (Kernels)
  • Today, we’re focusing on the second form. The 3rd form is deprecated.
  • Emphasize that it provides you a means to define the types of features you like. Feature engineering is in fact an important part of practice. We’ll generate two versions of examples using two different script files.
  • CS spelling: do we want features that include the target word itself?
  • generates active features -> if the pattern isn’t there, no feature generated.
  • disjunction: a shorthand unless combined with e.g. existential; w|t is a bit silly
  • Note: don’t have ‘inc’; we don’t want the target word to be part of the example (because if another word is substituted by mistake in a test example, your features won’t fire!)

    1. 1. Overview of Machine Learning for NLP Tasks: part I (based partly on slides by Kevin Small and Scott Yih)
    2. 2. Goals of Introduction <ul><li>Frame specific natural language processing (NLP) tasks as machine learning problems </li></ul><ul><li>Provide an overview of a general machine learning system architecture </li></ul><ul><ul><li>Introduce a common terminology </li></ul></ul><ul><ul><li>Identify typical needs of an ML system </li></ul></ul><ul><li>Describe some specific aspects of our tool suite with regard to the general architecture </li></ul><ul><ul><li>Build some intuition for using the tools </li></ul></ul><ul><li>Focus here is on supervised learning </li></ul>
    3. 3. Overview <ul><li>Some Sample NLP Problems </li></ul><ul><li>Solving Problems with Supervised Learning </li></ul><ul><li>Framing NLP Problems as Supervised Learning Tasks </li></ul><ul><li>Preprocessing: cleaning up and enriching text </li></ul><ul><li>Machine Learning System Architecture </li></ul><ul><li>Feature Extraction using FEX </li></ul>
    4. 4. Context Sensitive Spelling [2] <ul><li>A word level tagging task: </li></ul><ul><li>I would like a peace of cake for desert. </li></ul><ul><li>I would like a piece of cake for dessert . </li></ul><ul><li>In principal, we can use the solution to the </li></ul><ul><li>duel problem . </li></ul><ul><li>In principle , we can use the solution to the </li></ul><ul><li>dual problem. </li></ul>
    5. 5. Part of Speech (POS) Tagging <ul><li>Another word-level task: </li></ul><ul><li>Allen Iverson is an inconsistent player. While he can shoot very well, some nights he will score only a few points. </li></ul><ul><li>(NNP Allen) (NNP Iverson) (VBZ is) (DT an) (JJ inconsistent) (NN player) (. .) (IN While) (PRP he) (MD can) (VB shoot) (RB very) (RB well) (, ,) (DT some) (NNS nights) (PRP he) (MD will) (VB score) (RB only) (DT a) (JJ few) (NNS points) (. .) </li></ul>
    6. 6. Phrase Tagging <ul><li>Named Entity Recognition – a phrase-level task: </li></ul><ul><ul><li>After receiving his M.B.A. from Harvard Business School, Richard F. America accepted a faculty position at the McDonough School of Business (Georgetown University) in Washington. </li></ul></ul><ul><ul><li>After receiving his [MISC M.B.A.] from [ORG Harvard Business School] , [PER Richard F. America] accepted a faculty position at the [ORG McDonough School of Business] ( [ORG Georgetown University] ) in [LOC Washington] . </li></ul></ul>
    7. 7. Some Other Tasks <ul><li>Text Categorization </li></ul><ul><li>Word Sense Disambiguation </li></ul><ul><li>Shallow Parsing </li></ul><ul><li>Semantic Role Labeling </li></ul><ul><li>Preposition Identification </li></ul><ul><li>Question Classification </li></ul><ul><li>Spam Filtering </li></ul><ul><ul><li>: </li></ul></ul><ul><ul><li>: </li></ul></ul>
    8. 8. <ul><li>Supervised Learning/SNoW </li></ul>
    9. 9. Learning Mapping Functions <ul><li>Binary Classification </li></ul><ul><li>Multi-class Classification </li></ul><ul><li>Ranking </li></ul><ul><li>Regression </li></ul><ul><li>{Feature, Instance, Input} Space – space used to describe each instance; often </li></ul><ul><li>Output Space – space of possible output labels; very dependent on problem </li></ul><ul><li>Hypothesis Space – space of functions that can be selected by the machine learning algorithm; algorithm dependent (obviously) </li></ul>
    10. 10. Multi-class Classification [3,4] One Versus All (OvA) Constraint Classification
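The One Versus All scheme named on this slide can be sketched as follows: one binary linear scorer per label, with the highest-scoring label winning. This is a minimal illustration, not SNoW's implementation; the function name `ova_predict`, the weight values, and the feature ids are all assumptions made up for the example.

```python
def ova_predict(weights, features):
    """Return the label whose binary scorer gives the highest score.

    weights:  dict mapping label -> dict mapping feature id -> weight
    features: list of active feature ids for one example
    """
    def score(label):
        w = weights[label]
        # Only active features contribute; unseen features count as 0.
        return sum(w.get(f, 0.0) for f in features)

    # sorted() makes tie-breaking deterministic (alphabetical).
    return max(sorted(weights), key=score)


# Hypothetical weights for the confusion set {to, too, two}.
toy_weights = {
    "to": {1001: 0.9, 1002: -0.2},
    "too": {1001: -0.1, 1003: 0.8},
    "two": {1002: 0.7},
}

print(ova_predict(toy_weights, [1001]))
```

Each label's scorer is trained as its own binary problem (that label versus all others); only the final argmax ties them together.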
    11. 11. Online Learning [5] <ul><li>SNoW algorithms include Winnow, Perceptron </li></ul><ul><li>Learning algorithms are mistake driven </li></ul><ul><li>Search for linear discriminant along function gradient (unconstrained optimization) </li></ul><ul><li>Provides best hypothesis using data presented up to the present example </li></ul><ul><li>Learning rate determines convergence </li></ul><ul><ul><li>Too small and it will take forever </li></ul></ul><ul><ul><li>Too large and it will not converge </li></ul></ul>
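The mistake-driven idea above can be sketched with textbook versions of the two update rules: Perceptron changes weights additively, Winnow multiplicatively, and both change nothing when the prediction is already correct. This is a simplified sketch over sparse active features, not SNoW's code; the function names, the default `rate` and `alpha`, and the thresholds are assumptions.

```python
def perceptron_update(w, active, y, rate=1.0):
    """Additive update on a mistake. y is +1 or -1; w maps feature -> weight."""
    score = sum(w.get(f, 0.0) for f in active)
    pred = 1 if score > 0 else -1
    if pred != y:
        for f in active:
            w[f] = w.get(f, 0.0) + rate * y
        return True   # a mistake was made and the weights changed
    return False


def winnow_update(w, active, y, threshold, alpha=2.0):
    """Multiplicative update on a mistake. Weights start at 1.0."""
    score = sum(w.get(f, 1.0) for f in active)
    pred = 1 if score >= threshold else -1
    if pred != y:
        # Promote active weights on a missed positive, demote on a false positive.
        factor = alpha if y == 1 else 1.0 / alpha
        for f in active:
            w[f] = w.get(f, 1.0) * factor
        return True
    return False


weights = {}
print(perceptron_update(weights, [1001, 1002], 1), weights)
```

Because only active features are touched, the cost of an update depends on how many features fire in the example, not on the size of the feature space.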
    12. 12. <ul><li>Framing NLP Problems as Supervised Learning Tasks </li></ul>
    13. 13. Defining Learning Problems [6] <ul><li>ML algorithms are mathematical formalisms and problems must be modeled accordingly </li></ul><ul><li>Feature Space – space used to describe each instance; </li></ul><ul><li>often R^d, {0,1}^d, N^d </li></ul><ul><li>Output Space – space of possible output labels, e.g. </li></ul><ul><ul><li>Set of Part-of-Speech tags </li></ul></ul><ul><ul><li>Correctly spelled word (possibly from confusion set) </li></ul></ul><ul><li>Hypothesis Space – space of functions that can be selected by the machine learning algorithm, e.g. </li></ul><ul><ul><li>Boolean functions (e.g. decision trees) </li></ul></ul><ul><ul><li>Linear separators in R^d </li></ul></ul>
    14. 14. Context Sensitive Spelling <ul><li>Did anybody (else) want too sleep for to more </li></ul><ul><li>hours this morning? </li></ul><ul><li>Output Space </li></ul><ul><ul><li>Could use the entire vocabulary; Y ={a,aback,...,zucchini} </li></ul></ul><ul><ul><li>Could also use a confusion set ; Y= {to, too, two} </li></ul></ul><ul><li>Model as (single label) multi-class classification </li></ul><ul><li>Hypothesis space is provided by SNoW </li></ul><ul><li>Need to define the feature space </li></ul>
    15. 15. What are ‘feature’, ‘feature type’, anyway? <ul><li>A feature type is any characteristic (relation) you can define over the input representation. </li></ul><ul><ul><li>Example: feature TYPE = word bigrams </li></ul></ul><ul><ul><li>Sentence: </li></ul></ul><ul><ul><li>The man in the moon eats green cheese. </li></ul></ul><ul><ul><li>Features: </li></ul></ul><ul><ul><li>[The_man], [man_in], [in_the], [the_moon]…. </li></ul></ul><ul><li>In Natural Language Text, sparseness is often a problem </li></ul><ul><ul><li>How many times are we likely to see “the_moon”? </li></ul></ul><ul><ul><li>How often will it provide useful information? </li></ul></ul><ul><ul><li>How can we avoid this problem? </li></ul></ul>
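The word-bigram feature type from this slide can be sketched in a few lines. The helper name `word_bigrams` is an assumption for illustration, and tokenization is deliberately naive (whitespace split, trailing period dropped).

```python
def word_bigrams(sentence):
    """Return word-bigram features in the slide's [w1_w2] notation."""
    tokens = sentence.rstrip(".").split()
    # Pair each token with its right-hand neighbor.
    return [f"[{a}_{b}]" for a, b in zip(tokens, tokens[1:])]


features = word_bigrams("The man in the moon eats green cheese.")
print(features)
```

An eight-word sentence already yields seven distinct bigrams, most of which will rarely recur in other text; this is the sparseness problem the slide raises.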
    16. 16. Preprocessing: cleaning up and enriching text <ul><li>Assuming we start with plain text: </li></ul><ul><li>The quick brown fox jumped over the lazy dog. It landed on Mr. Tibbles, the slow blue cat. </li></ul><ul><li>Problems: </li></ul><ul><ul><li>Often, want to work at the level of sentences, words </li></ul></ul><ul><ul><li>Where are sentence boundaries – ‘Mr.’ vs. ‘Cat.’? </li></ul></ul><ul><ul><li>Where are word boundaries -- ‘dog.’ Vs. ‘dog’? </li></ul></ul><ul><li>Enriching the text: e.g. POS-tagging: </li></ul><ul><li>(DT The) (JJ quick) (NN brown) (NN fox) (VBD jumped) </li></ul><ul><li>(IN over) (DT the) (JJ lazy) (NN dog) (. .) </li></ul>
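The two boundary problems above can be illustrated with a rough sketch. It is not the sentence or word segmenter from the tool suite; the function names and the contents of the `HONORIFICS` set are assumptions chosen for the example.

```python
import re

# Assumed, illustrative list of abbreviations that do not end a sentence.
HONORIFICS = {"Mr.", "Mrs.", "Ms.", "Dr."}


def split_sentences(text):
    """Split on tokens ending in . ! or ? unless the token is an honorific."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if re.search(r"[.!?]$", token) and token not in HONORIFICS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def split_words(sentence):
    """Separate punctuation from words, so 'dog.' becomes 'dog' and '.'."""
    return re.findall(r"\w+|[^\w\s]", sentence)


TEXT = ("The quick brown fox jumped over the lazy dog. "
        "It landed on Mr. Tibbles, the slow blue cat.")
for s in split_sentences(TEXT):
    print(s)
```

Note that this naive `split_words` would also detach the period from ‘Mr.’; a real word splitter needs the same kind of exception list as the sentence splitter.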
    17. 17. Download Some Tools <ul><li>http://l2r.cs.uiuc.edu/~cogcomp/ </li></ul><ul><ul><li>Software::tools, Software::packages </li></ul></ul><ul><li>Sentence segmenter </li></ul><ul><li>Word segmenter </li></ul><ul><li>POS-tagger </li></ul><ul><li>FEX </li></ul><ul><li>NB: RIGHT-CLICK on “download” link </li></ul><ul><ul><li>select “save link as...” </li></ul></ul>
    18. 18. Preprocessing scripts <ul><li>http://l2r.cs.uiuc.edu/~cogcomp/ </li></ul><ul><li>sentence-boundary.pl </li></ul><ul><li>./sentence-splitter.pl -d HONORIFICS -i nyttext.txt -o nytsentence.txt </li></ul><ul><li>word-splitter.pl </li></ul><ul><li>./word-splitter.pl nytsentence.txt > nytword.txt </li></ul><ul><li>Invoking the tagger: </li></ul><ul><li>./tagger -i nytword.txt -o nytpos.txt </li></ul><ul><li>Check output </li></ul>
    19. 19. Problems running .pl scripts? <ul><li>Check the first line: </li></ul><ul><ul><li>#!/usr/bin/perl </li></ul></ul><ul><li>Find where perl is installed on your own machine </li></ul><ul><ul><li>E.g. might need... </li></ul></ul><ul><ul><li>#!/local/bin/perl </li></ul></ul><ul><li>Check file permissions... </li></ul><ul><ul><li>> ls -l sentence-boundary.pl </li></ul></ul><ul><ul><li>> chmod 744 sentence-boundary.pl </li></ul></ul>
    20. 20. Minor Problems with install <ul><li>Possible (system-dependent) compilation errors: </li></ul><ul><ul><li>doesn’t recognize ‘optarg’ </li></ul></ul><ul><ul><li>POS-tagger: change Makefile in subdirectory snow/ where indicated </li></ul></ul><ul><ul><li>sentence-boundary.pl: try ‘perl sentence-boundary.pl’ </li></ul></ul><ul><li>Link error (POS tagger): linker can’t find -lxnet </li></ul><ul><ul><li>remove ‘-lxnet’ entry from Makefile </li></ul></ul><ul><ul><li>generally, check README, makefile for hints </li></ul></ul>
    21. 21. <ul><li>The System View </li></ul>
    22. 22. A Machine Learning System <ul><li>[Diagram] Stages: Preprocessing, Feature Extraction, Machine Learner, Classifier(s), Inference </li></ul><ul><li>Data passed between stages: Raw Text, Formatted Text, Feature Vectors, Training Examples, Testing Examples, Function Parameters, Labels </li></ul>
    23. 23. Preprocessing Text <ul><li>Sentence splitting, Word Splitting, etc. </li></ul><ul><li>Put data in a form usable for feature extraction </li></ul>They recently recovered a small piece of a live Elvis concert recording. He was singing gospel songs, including “Peace in the Valley.”
        0 0 0 They
        0 0 1 recently
        0 0 2 recovered
        0 0 3 a
        0 0 4 small
        piece 0 5 piece
        0 0 6 of
        :
        0 1 6 including
        0 1 7 QUOTE
        peace 1 8 Peace
        0 1 9 in
        0 1 10 the
        0 1 11 Valley
        0 1 12 .
        0 1 13 QUOTE
    24. 24. A Machine Learning System <ul><li>[Diagram] Raw Text, Preprocessing, Formatted Text, Feature Extraction, Feature Vectors </li></ul>
    25. 25. <ul><li>Feature Extraction with FEX </li></ul>
    26. 26. Feature Extraction with FEX <ul><li>FEX (Feature Extraction tool) generates abstract representations of text input </li></ul><ul><ul><li>Has a number of specialized modes suited to different types of problem </li></ul></ul><ul><ul><li>Can generate very expressive features </li></ul></ul><ul><ul><li>Works best when text enriched with other knowledge sources – i.e., need to preprocess text </li></ul></ul><ul><ul><li>S = I would like a piece of cake too! </li></ul></ul><ul><li>FEX converts input text into a list of active features… </li></ul><ul><ul><li>1: 1003, 1005, 1101, 1330… </li></ul></ul><ul><ul><li>Where each numerical feature corresponds to a specific textual feature : </li></ul></ul><ul><ul><li>1: label[piece] </li></ul></ul><ul><ul><li>1003: word[like] BEFORE word[a] </li></ul></ul>
    27. 27. Feature Extraction <ul><li>Converts formatted text into feature vectors </li></ul><ul><li>Lexicon file contains feature descriptions </li></ul>
        Formatted text:
        0 0 0 They
        0 0 1 recently
        0 0 2 recovered
        0 0 3 a
        0 0 4 small
        piece 0 5 piece
        0 0 6 of
        :
        0 1 6 including
        0 1 7 QUOTE
        peace 1 8 Peace
        0 1 9 in
        0 1 10 the
        0 1 11 Valley
        0 1 12 .
        0 1 13 QUOTE
        Feature vectors (feature descriptions are in the Lexicon File):
        0, 1001, 1013, 1134, 1175, 1206
        1, 1021, 1055, 1085, 1182, 1252
    28. 28. Role of FEX <ul><li>Why won't you accept the facts? </li></ul><ul><ul><li>Features: lab[accept], w[you], w[the], w[you*], w[*the] </li></ul></ul><ul><ul><li>Example after feature extraction (FEX): 1, 1001, 1003, 1004, 1006: </li></ul></ul><ul><li>No one saw her except the postman. </li></ul><ul><ul><li>Features: lab[except], w[her], w[the], w[her*], w[*the] </li></ul></ul><ul><ul><li>Example after feature extraction (FEX): 2, 1002, 1003, 1005, 1006: </li></ul></ul>
    29. 29. Four Important Files <ul><li>Corpus: a new representation of the raw text data </li></ul><ul><li>Script: </li></ul><ul><ul><li>Control FEX’s behavior </li></ul></ul><ul><ul><li>Define the “types” of features </li></ul></ul><ul><li>Example: feature vectors for SNoW </li></ul><ul><li>Lexicon: mapping of feature and feature id </li></ul>
    30. 30. Corpus – General Linear Format <ul><li>The corpus file contains the preprocessed input with a single sentence per line. </li></ul><ul><li>When generating examples, Fex never crosses line boundaries. </li></ul><ul><li>The input can be any combination of: </li></ul><ul><ul><li>1st form: words separated by white spaces </li></ul></ul><ul><ul><li>2nd form: tag/word pairs in parentheses </li></ul></ul><ul><ul><li>There is a more complicated 3rd form, but it is deprecated in view of an alternative, more general format (later) </li></ul></ul>
    31. 31. Corpus – Context Sensitive Spelling <ul><li>Why won't you accept the facts? </li></ul><ul><ul><li>( WRB Why) ( VBD wo) ( NN n't) ( PRP you) ( VBP accept) ( DT the) ( NNS facts) ( . ?) </li></ul></ul><ul><li>No one saw her except the postman. </li></ul><ul><ul><li>( DT No) ( CD one) ( VBD saw) ( PRP her) ( IN except) ( DT the) ( NN postman) ( . .) </li></ul></ul>
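To inspect a corpus line in the second form (tag/word pairs in parentheses), a small reader like the one below can help. It is a sketch for exploring the format, not part of FEX; the names `PAIR` and `parse_line` are assumptions, and the pattern does not handle words that themselves contain parentheses.

```python
import re

# One "(TAG word)" pair; optional spaces after "(" as in the slide's examples.
PAIR = re.compile(r"\(\s*(\S+)\s+(\S+?)\s*\)")


def parse_line(line):
    """Return a list of (tag, word) tuples from one corpus line."""
    return PAIR.findall(line)


LINE = ("( WRB Why) ( VBD wo) ( NN n't) ( PRP you) "
        "( VBP accept) ( DT the) ( NNS facts) ( . ?)")
for tag, word in parse_line(LINE):
    print(tag, word)
```

Since the corpus holds one sentence per line, calling `parse_line` once per line keeps examples from crossing sentence boundaries, matching the behavior described for Fex.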
    32. 32. Script – Means of Feature Engineering <ul><li>Fex does not decide or find good features. </li></ul><ul><li>Instead, Fex provides you an easy method to define the feature types and extracts the corresponding features from data. </li></ul><ul><li>Feature Engineering is in fact very important in practical learning tasks. </li></ul>
    33. 33. Script – Description of Feature Types <ul><li>What can be good features? </li></ul><ul><ul><li>Let’s try some combinations of words and tags. </li></ul></ul><ul><li>Feature types in mind </li></ul><ul><ul><li>Words around the target word ( accept , except ) </li></ul></ul><ul><ul><li>POS tags around the target </li></ul></ul><ul><ul><li>Conjunctions of words and POS tags? </li></ul></ul><ul><ul><li>Bigrams or trigrams? </li></ul></ul><ul><ul><li>Include relative locations? </li></ul></ul>
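    34. 34. Graphical Representation <table><tr><th>Index</th><td>0</td><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td><td>7</td></tr><tr><th>Offset from target</th><td>-4</td><td>-3</td><td>-2</td><td>-1</td><td>0 (Target)</td><td>1</td><td>2</td><td>3</td></tr><tr><th>Word</th><td>Why</td><td>won</td><td>'t</td><td>you</td><td>accept</td><td>the</td><td>facts</td><td>?</td></tr><tr><th>Tag</th><td>WRB</td><td>VBD</td><td>NN</td><td>PRP</td><td>VBP</td><td>DT</td><td>NNS</td><td>.</td></tr></table><ul><li>Window [-2,2] covers indices 2 through 6 </li></ul>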
    35. 35. Script – Syntax <ul><li>Syntax: </li></ul><ul><ul><li>targ [inc] [loc]: RGF [[left-offset, right-offset]] </li></ul></ul><ul><ul><li>targ – target index </li></ul></ul><ul><ul><ul><li>If targ is ‘-1’… </li></ul></ul></ul><ul><ul><ul><ul><li>target file entries are used to identify the targets </li></ul></ul></ul></ul><ul><ul><ul><ul><li>If no target file is specified, then EVERY word is treated as a target </li></ul></ul></ul></ul><ul><ul><li>inc – use the actual target instead of the generic place-holder (‘*’) </li></ul></ul><ul><ul><li>loc – include the location of feature relative to the target </li></ul></ul><ul><ul><li>RGF – define “types” of features like words, tags, conjunctions, bigrams, trigrams, …, etc </li></ul></ul><ul><ul><li>left-offset and right-offset: specify the window range </li></ul></ul>
    36. 36. Basic RGF’s – Sensors (1/2) <ul><li>Sensor is the fundamental method of defining “feature types.” </li></ul><ul><li>It is applied on the element, and generates active features. </li></ul><table><tr><th>Type</th><th>Mnemonic</th><th>Interpretation</th><th>Example</th></tr><tr><td>Word</td><td>w</td><td>the word (spelling)</td><td>w[you]</td></tr><tr><td>Tag</td><td>t</td><td>part-of-speech tag</td><td>t[NNP]</td></tr><tr><td>Vowel</td><td>v</td><td>active if the word starts with a vowel</td><td>v[eager]</td></tr><tr><td>Length</td><td>len</td><td>length of the word</td><td>len[5]</td></tr></table>
    37. 37. Basic RGF’s – Sensors (2/2) <table><tr><th>Type</th><th>Mnemonic</th><th>Interpretation</th><th>Example</th></tr><tr><td>Verb Class</td><td>vCls</td><td>returns Levin’s verb class</td><td>vCls[51.2]</td></tr><tr><td>City List</td><td>isCity</td><td>active if the phrase is the name of a city</td><td>isCity[Chicago]</td></tr></table><ul><li>More sensors can be found by looking at FEX source (Sensors.h) </li></ul><ul><li>lab: a special RGF that generates labels </li></ul><ul><ul><li>lab(w), lab(t), … </li></ul></ul><ul><li>Sensors are also an elegant way to incorporate our background knowledge. </li></ul>
    38. 38. Complex RGF’s <ul><li>Existential Usage </li></ul><ul><ul><li>len(x=3), v(X) </li></ul></ul><ul><li>Conjunction and Disjunction </li></ul><ul><ul><li>w&t; w|t </li></ul></ul><ul><li>Collocation and Sparse Collocation </li></ul><ul><ul><li>coloc(w,w); coloc(w,t,w); coloc(w|t,w|t) </li></ul></ul><ul><ul><li>scoloc(t,t); scoloc(t,w,t); scoloc(w|t,w|t) </li></ul></ul>
    39. 39. (Sparse) Collocation <ul><li>[Figure: the sentence ‘Why won 't you accept the facts ?’ with tags WRB VBD NN PRP VBP DT NNS ., indices 0 to 7, and offsets -4 to 3 relative to the target ‘accept’] </li></ul><ul><li>-1 inc: coloc(w,t)[-2,2] </li></ul><ul><ul><li>w['t]-t[PRP], w[you]-t[VBP], w[accept]-t[DT], w[the]-t[NNS] </li></ul></ul><ul><li>-1 inc: scoloc(w,t)[-2,2] </li></ul><ul><ul><li>w['t]-t[PRP], w['t]-t[VBP], w['t]-t[DT], w['t]-t[NNS], w[you]-t[VBP], w[you]-t[DT], w[you]-t[NNS], w[accept]-t[DT], w[accept]-t[NNS], w[the]-t[NNS] </li></ul></ul>
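The difference between the two feature lists on this slide can be reproduced with a short sketch: collocation pairs each element with its immediate right neighbor, while sparse collocation pairs it with every later element in the window. This emulates only the `inc` case shown here and is not FEX code; the function names are assumptions.

```python
from itertools import combinations

# (word, tag) pairs as drawn on the slide; the target 'accept' is at index 4.
SENTENCE = [("Why", "WRB"), ("won", "VBD"), ("'t", "NN"), ("you", "PRP"),
            ("accept", "VBP"), ("the", "DT"), ("facts", "NNS"), ("?", ".")]


def window(tokens, target, left, right):
    """Slice the tokens whose offset from target lies in [left, right]."""
    lo = max(0, target + left)
    hi = min(len(tokens), target + right + 1)
    return tokens[lo:hi]


def coloc_wt(tokens, target, left, right):
    """Word of each element joined with the tag of its right neighbor."""
    win = window(tokens, target, left, right)
    return [f"w[{w1}]-t[{t2}]" for (w1, _), (_, t2) in zip(win, win[1:])]


def scoloc_wt(tokens, target, left, right):
    """Word of each element joined with the tag of every later element."""
    win = window(tokens, target, left, right)
    return [f"w[{w1}]-t[{t2}]" for (w1, _), (_, t2) in combinations(win, 2)]


print(coloc_wt(SENTENCE, 4, -2, 2))
print(len(scoloc_wt(SENTENCE, 4, -2, 2)))
```

With a five-element window, collocation yields four features and sparse collocation yields ten, so the sparse form grows quadratically with window size.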
    40. 40. Examples – 2 Scripts <ul><li>Download examples from tutorial page: </li></ul><ul><ul><li>‘ context sensitive spelling materials’ link </li></ul></ul><ul><li>accept-except-simple.scr </li></ul><ul><ul><li>-1: lab(w) </li></ul></ul><ul><ul><li>-1: w[-1,1] </li></ul></ul><ul><li>accept-except.scr </li></ul><ul><ul><li>-1: lab(w) </li></ul></ul><ul><ul><li>-1: w|t [-2,2] </li></ul></ul><ul><ul><li>-1 loc: coloc(w|t,w|t) [-3,-3] </li></ul></ul>
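To build intuition for the line `-1: w[-1,1]` in the simple script, the sketch below emulates a word sensor over a window around the target. It assumes that, without `inc`, the target word itself is not emitted as a feature; the function name and this simplification are assumptions, and FEX's actual output format may differ.

```python
def word_window_features(words, target, left=-1, right=1):
    """Word features for offsets in [left, right], skipping the target itself."""
    feats = []
    for offset in range(left, right + 1):
        i = target + offset
        # Skip the target position and anything outside the sentence.
        if offset == 0 or not 0 <= i < len(words):
            continue
        feats.append(f"w[{words[i]}]")
    return feats


words = ["Why", "wo", "n't", "you", "accept", "the", "facts", "?"]
print(word_window_features(words, target=4))
```

Leaving the target word out means the same context features fire whether the writer typed ‘accept’ or ‘except’, which is what a spelling corrector needs at test time.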
    41. 41. Lexicon & Example (1/3) <ul><li>Corpus: </li></ul><ul><ul><li>… (NNS prices) (CC or) (VB accept) (JJR slimmer) (NNS profits) … </li></ul></ul><ul><li>Script: ae-simple.scr </li></ul><ul><ul><li>-1: lab(w); -1: w[-1,1] </li></ul></ul><ul><li>Lexicon: </li></ul><ul><ul><li>1 label[w[except]] </li></ul></ul><ul><ul><li>2 label[w[accept]] </li></ul></ul><ul><ul><li>1001 w[or] </li></ul></ul><ul><ul><li>1002 w[slimmer] </li></ul></ul><ul><li>Example: </li></ul><ul><ul><li>2, 1001, 1002; </li></ul></ul><ul><li>Entries 1 and 2 are generated by lab(w). Feature indices of lab start from 1. </li></ul><ul><li>Entries 1001 and 1002 are generated by w[-1,1]. Feature indices of regular features start from 1001. </li></ul>
    42. 42. Lexicon & Example (2/3) <ul><li>Target file: fex -t ae.targ … </li></ul><ul><ul><li>accept </li></ul></ul><ul><ul><li>except </li></ul></ul><ul><ul><li>We treat only these two words as targets. </li></ul></ul><ul><li>Lexicon file </li></ul><ul><ul><li>If the file does not exist, fex will create it. </li></ul></ul><ul><ul><li>If the file already exists, fex will first read it, and then append the new entries to this file. </li></ul></ul><ul><ul><li>This is important because we don’t want two different feature indices representing the same feature. </li></ul></ul>
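The reason the lexicon is read before new entries are appended can be seen in a small in-memory sketch: a feature that is already known must keep its id, and only unseen features receive new ones. The class and function names are assumptions, file reading and writing are not modeled, and only the two starting indices (1 for labels, 1001 for regular features) come from the slides.

```python
class Lexicon:
    """Maps feature descriptions to stable integer ids."""

    LABEL_START = 1       # label ids start from 1
    FEATURE_START = 1001  # regular feature ids start from 1001

    def __init__(self):
        self.ids = {}
        self.next_label = self.LABEL_START
        self.next_feature = self.FEATURE_START

    def lookup(self, name, is_label=False):
        """Return the id for name, assigning a fresh one only if unseen."""
        if name not in self.ids:
            if is_label:
                self.ids[name] = self.next_label
                self.next_label += 1
            else:
                self.ids[name] = self.next_feature
                self.next_feature += 1
        return self.ids[name]


def make_example(lexicon, label, features):
    """Format one example line: label id followed by active feature ids."""
    ids = [lexicon.lookup(label, is_label=True)]
    ids += [lexicon.lookup(f) for f in features]
    return ", ".join(str(i) for i in ids) + ";"


lex = Lexicon()
lex.lookup("label[w[except]]", is_label=True)
print(make_example(lex, "label[w[accept]]", ["w[or]", "w[slimmer]"]))
```

Reusing one `Lexicon` object across training and test data plays the role of re-reading the lexicon file: the same description always maps to the same index.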
    43. 43. Lexicon & Example (3/3) <ul><li>Example file </li></ul><ul><ul><li>If the file does not exist, fex will create it. </li></ul></ul><ul><ul><li>If the file already exists, fex will append new examples to it. </li></ul></ul><ul><ul><li>Only active features and their corresponding lexicon items are generated. </li></ul></ul><ul><ul><li>If the read-only lexicon option is set, only those features from the lexicon that are present ( active ) in the current instance are listed. </li></ul></ul>
    44. 44. <ul><li>Now practice – change script, run FEX, look at the resulting lexicon/examples </li></ul><ul><ul><li>> ./fex -t ae.targ ae-simple.scr ae-simple.lex short-ae.pos short-ae.ex </li></ul></ul>
    45. 45. Citations <ul><li>[1] F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002. </li></ul><ul><li>[2] A. R. Golding and D. Roth. A Winnow-Based Approach to Spelling Correction. Machine Learning, 34:107-130, 1999. </li></ul><ul><li>[3] E. Allwein, R. Schapire, and Y. Singer. Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers. Journal of Machine Learning Research, 1:113-141, 2000. </li></ul><ul><li>[4] S. Har-Peled, D. Roth, and D. Zimak. Constraint Classification: A New Approach to Multiclass Classification. In Proc. 13th Annual Intl. Conf. of Algorithmic Learning Theory, pp. 365-379, 2002. </li></ul><ul><li>[5] A. Blum. On-Line Algorithms in Machine Learning. 1996. </li></ul>
    46. 46. Citations <ul><li>[6] T. Mitchell. Machine Learning, McGraw Hill, 1997. </li></ul><ul><li>[7] A. Blum. Learning Boolean Functions in an Infinite Attribute Space. Machine Learning, 9(4):373-386, 1992. </li></ul><ul><li>[8] J. Kivinen and M. Warmuth. The Perceptron Algorithm vs. Winnow: Linear vs. Logarithmic Mistake Bounds when few Input Variables are Relevant. UCSC-CRL-95-44, 1995. </li></ul><ul><li>[9] T. Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7):1895-1923, 1998. </li></ul>