  1. 1. Two Paradigms for Natural-Language Processing Robert C. Moore Senior Researcher Microsoft Research
  2. 2. Why is Microsoft interested in natural-language processing? <ul><li>Make computers/software easier to use. </li></ul><ul><li>Long-term goal: just talk to your computer (Star Trek scenario). </li></ul>
  3. 3. Some of Microsoft’s near(er) term goals in NLP <ul><li>Better search </li></ul><ul><ul><li>Help find things on your computer. </li></ul></ul><ul><ul><li>Help find information on the Internet. </li></ul></ul><ul><li>Document summarization </li></ul><ul><ul><li>Help deal with information overload. </li></ul></ul><ul><li>Machine translation </li></ul>
  4. 4. Why is Microsoft interested in machine translation? <ul><li>Internal: Microsoft is the world’s largest user of translation services. MT can help Microsoft </li></ul><ul><ul><li>Translate documents that would otherwise not be translated – e.g., PSS knowledge base ( http:// = fh;ES-ES;faqtraduccion ). </li></ul></ul><ul><ul><li>Save money on human translation by providing machine translations as a starting point. </li></ul></ul><ul><li>External: Sell similar software/services to other large companies. </li></ul>
  5. 5. Knowledge engineering vs. machine learning in NLP <ul><li>Biggest debate over the last 15 years in NLP has been knowledge engineering vs. machine learning. </li></ul><ul><li>KE approach to NLP usually involves hand-coding of grammars and lexicons by linguistic experts. </li></ul><ul><li>ML approach to NLP usually involves training statistical models on large amounts of annotated or un-annotated text. </li></ul>
  6. 6. Central problems in KE-based NLP <ul><li>Parsing – determining the syntactic structure of a sentence. </li></ul><ul><li>Interpretation – deriving formal representation of the meaning of a sentence. </li></ul><ul><li>Generation – deriving a sentence that expresses a given meaning representation. </li></ul>
  7. 7. Simple examples of KE-based NLP notations <ul><li>Phrase-structure grammar: </li></ul><ul><li>S → Np Vp, Np → Sue, Np → Mary </li></ul><ul><li>Vp → V Np, V → sees </li></ul><ul><li>Syntactic structure: </li></ul><ul><li>[[Sue]Np [[sees]V [Mary]Np]Vp]S </li></ul><ul><li>Meaning representation: </li></ul><ul><li>[see(E), agt(E,sue), pat(E,mary)] </li></ul>
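The toy grammar on this slide can be made executable. Here is a minimal sketch (my illustration, not part of the deck) of a naive enumerate-all-parses parser in Python that recovers the bracketed syntactic structure:

```python
# The slide's toy phrase-structure grammar as Python data.
# Terminals are any symbols that do not appear as grammar keys.
GRAMMAR = {
    "S":  [("Np", "Vp")],
    "Np": [("Sue",), ("Mary",)],
    "Vp": [("V", "Np")],
    "V":  [("sees",)],
}

def parse(sym, words, i):
    """Yield (tree, next_index) for every way sym derives a prefix of words[i:]."""
    if sym not in GRAMMAR:                    # terminal: must match the next word
        if i < len(words) and words[i] == sym:
            yield sym, i + 1
        return
    for rhs in GRAMMAR[sym]:                  # nonterminal: try each production
        for kids, j in parse_seq(rhs, words, i):
            yield (sym, kids), j

def parse_seq(syms, words, i):
    """Yield (child_trees, next_index) for a whole right-hand side."""
    if not syms:
        yield (), i
        return
    for first, j in parse(syms[0], words, i):
        for rest, k in parse_seq(syms[1:], words, j):
            yield (first,) + rest, k

# Complete parses of "Sue sees Mary" mirror the slide's bracketing
# [[Sue]Np [[sees]V [Mary]Np]Vp]S as nested tuples.
trees = [t for t, j in parse("S", "Sue sees Mary".split(), 0) if j == 3]
```

This brute-force generator is exponential in the worst case; it is only meant to show how the productions license the bracketed structure.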
  8. 8. Unification Grammar: the pinnacle of the NLP KE paradigm <ul><li>Provides a uniform declarative formalism. </li></ul><ul><li>Can be used to specify both syntactic and semantic analyses. </li></ul><ul><li>A single grammar can be used for both parsing and generation. </li></ul><ul><li>Supports a variety of efficient parsing and generation algorithms. </li></ul>
  9. 9. Background: Question formation in English <ul><li>To construct a yes/no question: </li></ul><ul><li>Place the tensed auxiliary verb from the corresponding statement at the front of the clause. </li></ul><ul><ul><li>John can see Mary. </li></ul></ul><ul><ul><li>Can John see Mary? </li></ul></ul><ul><li>If there is no tensed auxiliary, add the appropriate form of the semantically empty auxiliary do. </li></ul><ul><ul><li>John sees Mary. </li></ul></ul><ul><ul><li>John does see Mary. </li></ul></ul><ul><ul><li>Does John see Mary? </li></ul></ul>
  10. 10. Question formation in English (continued) <ul><li>To construct a who/what question: </li></ul><ul><li>For a non-subject who/what question, form a corresponding yes/no question. </li></ul><ul><ul><li>Does John see Mary? </li></ul></ul><ul><li>Replace the noun phrase in the position being questioned with a question noun phrase and move it to the front of the clause. </li></ul><ul><ul><li>Who does John see? </li></ul></ul><ul><li>For a subject who/what question, simply replace the subject with a question noun phrase. </li></ul><ul><ul><li>Who sees Mary? </li></ul></ul>
  11. 11. Example of a UG grammar rule involved in who/what questions <ul><li>S1/S_sem ---> [NP/NP_sem, S2/S_sem] :- </li></ul><ul><li>S1::(cat=s, stype=whq, whgap_in=SL, </li></ul><ul><li>whgap_out=SL, vgap=[]), </li></ul><ul><li>NP::(cat=np, wh=y, whgap_in=[], </li></ul><ul><li>whgap_out=[]), </li></ul><ul><li>S2::(cat=s, stype=ynq, </li></ul><ul><li>whgap_in=NP/NP_sem, </li></ul><ul><li>whgap_out=[], vgap=[]). </li></ul>
  12. 12. Context-free backbone of rule <ul><li>S1 /S_sem ---> [ NP /NP_sem, S2 /S_sem] :- </li></ul><ul><li>S1 ::( cat=s , stype=whq, whgap_in=SL, </li></ul><ul><li>whgap_out=SL, vgap=[]), </li></ul><ul><li>NP ::( cat=np , wh=y, whgap_in=[], </li></ul><ul><li>whgap_out=[]), </li></ul><ul><li>S2 ::( cat=s , stype=ynq, </li></ul><ul><li>whgap_in=NP/NP_sem, </li></ul><ul><li>whgap_out=[], vgap=[]). </li></ul>
  13. 13. Category subtype features <ul><li>S1/S_sem ---> [NP/NP_sem, S2/S_sem] :- </li></ul><ul><li>S1::(cat=s, stype=whq , whgap_in=SL, </li></ul><ul><li>whgap_out=SL, vgap=[]), </li></ul><ul><li>NP::(cat=np, wh=y , whgap_in=[], </li></ul><ul><li>whgap_out=[]), </li></ul><ul><li>S2::(cat=s, stype=ynq , </li></ul><ul><li>whgap_in=NP/NP_sem, </li></ul><ul><li>whgap_out=[], vgap=[]). </li></ul>
  14. 14. Features for tracking long distance dependencies <ul><li>S1/S_sem ---> [NP/NP_sem, S2/S_sem] :- </li></ul><ul><li>S1::(cat=s, stype=whq, whgap_in=SL , </li></ul><ul><li>whgap_out=SL , vgap=[] ), </li></ul><ul><li>NP::(cat=np, wh=y, whgap_in=[] , </li></ul><ul><li>whgap_out=[] ), </li></ul><ul><li>S2::(cat=s, stype=ynq, </li></ul><ul><li>whgap_in=NP /NP_sem, </li></ul><ul><li>whgap_out=[] , vgap=[] ). </li></ul>
  15. 15. Semantic features <ul><li>S1/ S_sem ---> [NP/ NP_sem , S2/ S_sem ] :- </li></ul><ul><li>S1::(cat=s, stype=whq, whgap_in=SL, </li></ul><ul><li>whgap_out=SL, vgap=[]), </li></ul><ul><li>NP::(cat=np, wh=y, whgap_in=[], </li></ul><ul><li>whgap_out=[]), </li></ul><ul><li>S2::(cat=s, stype=ynq, </li></ul><ul><li>whgap_in=NP/ NP_sem , </li></ul><ul><li>whgap_out=[], vgap=[]). </li></ul>
  16. 16. Parsing algorithms for UG <ul><li>Virtually any CFG parsing algorithm can be applied to UG by replacing identity tests on nonterminals with unification of nonterminals. </li></ul><ul><li>UG grammars are Turing complete, so grammars have to be written appropriately for parsing to terminate. </li></ul><ul><li>“Reasonable” grammars generally can be parsed in polynomial time, often n³. </li></ul>
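To make the "unification instead of identity" point concrete, here is a minimal sketch of the unification test a chart parser would run on each pair of categories. It handles only flat feature structures (dicts of atomic values); a real UG system, including the one in these slides, also handles shared variables and nested values:

```python
def unify(fs1, fs2):
    """Unify two flat feature structures (dicts of feature -> atomic value).

    Returns the merged structure, or None if any shared feature has
    conflicting values (unification failure)."""
    result = dict(fs1)
    for feat, val in fs2.items():
        if feat in result and result[feat] != val:
            return None          # clash: unification fails
        result[feat] = val
    return result

# The NP slot of the wh-question rule unifies with a wh noun phrase,
# picking up any extra features the phrase carries...
rule_np = {"cat": "np", "wh": "y"}
merged = unify(rule_np, {"cat": "np", "wh": "y", "num": "sg"})

# ...but fails against a non-wh noun phrase, pruning that analysis.
failed = unify(rule_np, {"cat": "np", "wh": "n"})
```

Where a CFG parser asks "are these two nonterminals the same symbol?", the UG parser asks "do these two feature structures unify?" and, if so, continues with the merged result.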
  17. 17. Generation algorithms for UG <ul><li>Since grammar is purely declarative, generation can be done by “running the parser backwards.” </li></ul><ul><li>Efficient generation algorithms are more complicated than that, but still polynomial for “reasonable” grammars and “exact generation.” </li></ul><ul><li>Generation taking into account semantic equivalence is worst-case NP-hard, but still can be efficient in practice. </li></ul>
  18. 18. A Prolog-based UG system to play with <ul><li>Go to </li></ul><ul><li>Download “Unification Grammar Sentence Realization Algorithms,” which includes </li></ul><ul><ul><li>A simple bottom-up parser, </li></ul></ul><ul><ul><li>Two sophisticated generation algorithms, </li></ul></ul><ul><ul><li>A small sample grammar and lexicon, </li></ul></ul><ul><ul><li>A paraphrase demo that </li></ul></ul><ul><ul><ul><li>Parses sentences covered by the grammar into a semantic representation. </li></ul></ul></ul><ul><ul><ul><li>Generates all sentences that have that semantic representation according to the grammar. </li></ul></ul></ul>
  19. 19. A paraphrase example <ul><li>?- paraphrase(s(_,'CAT'([]),'CAT'([]),'CAT'([])), </li></ul><ul><li>[what,direction,was,the,cat,chased,by,the,dog,in]). </li></ul><ul><li>in what direction did the dog __ chase the cat __ </li></ul><ul><li>in what direction was the cat __ chased __ by the dog </li></ul><ul><li>in what direction was the cat __ chased by the dog __ </li></ul><ul><li>what direction did the dog __ chase the cat in __ </li></ul><ul><li>what direction was the cat __ chased in __ by the dog </li></ul><ul><li>what direction was the cat __ chased by the dog in __ </li></ul><ul><li>generation_elapsed_seconds(0.0625) </li></ul>
  20. 20. Whatever happened to UG-based NLP? <ul><li>UG-based NLP is elegant, but lacks robustness for broad-coverage tasks. </li></ul><ul><li>Hard for human experts to incorporate enough details for broad coverage, unless grammar/lexicon are very permissive. </li></ul><ul><li>Too many possible ambiguities arise as coverage increases. </li></ul>
  21. 21. How machine-learning-based NLP addresses these problems <ul><li>Details are learned by processing very large corpora. </li></ul><ul><li>Ambiguities are resolved by choosing most likely answer according to a statistical model. </li></ul>
  22. 22. Increase in stat/ML papers at ACL conferences over 15 years
  23. 23. Characteristics of ML approach to NLP compared to KE approach <ul><li>Model-driven rather than theory-driven. </li></ul><ul><li>Uses shallower analyses and representations. </li></ul><ul><li>More opportunistic and more diverse in range of problems addressed. </li></ul><ul><li>Often driven by availability of training data. </li></ul>
  24. 24. Differences in approaches to stat/ML NLP <ul><li>Type of training data </li></ul><ul><ul><li>Annotated – supervised training </li></ul></ul><ul><ul><li>Un-annotated – unsupervised training </li></ul></ul><ul><li>Type of model </li></ul><ul><ul><li>Joint model – e.g., generative probabilistic </li></ul></ul><ul><ul><li>Conditional model – e.g., conditional maximum entropy </li></ul></ul><ul><li>Type of training </li></ul><ul><ul><li>Joint – maximum likelihood training </li></ul></ul><ul><ul><li>Conditional – discriminative training </li></ul></ul>
  25. 25. Statistical parsing models <ul><li>Most are: </li></ul><ul><ul><li>Generative probabilistic models, </li></ul></ul><ul><ul><li>Trained on annotated data (e.g., Penn Treebank), </li></ul></ul><ul><ul><li>Using maximum likelihood training. </li></ul></ul><ul><li>The simplest such model would be a probabilistic context-free grammar. </li></ul>
  26. 26. Probabilistic context-free grammars (PCFGs) <ul><li>A PCFG is a CFG that assigns to each production a conditional probability of the right-hand side given the left-hand side. </li></ul><ul><li>The probability of a derivation is simply the product of the conditional probabilities of all the productions used in the derivation. </li></ul><ul><li>PCFG-based parsing chooses, as the parse of a sentence, the derivation of the sentence having the highest probability. </li></ul>
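The two probability statements above can be sketched in a few lines of Python, reusing the deck's toy grammar (the rule probabilities here are invented for illustration):

```python
from math import prod

# Each production maps to P(right-hand side | left-hand side).
# Probabilities for a given left-hand side must sum to 1.
PCFG = {
    ("S",  ("Np", "Vp")): 1.0,
    ("Np", ("Sue",)):     0.5,   # P(Np -> Sue | Np), invented for illustration
    ("Np", ("Mary",)):    0.5,
    ("Vp", ("V", "Np")):  1.0,
    ("V",  ("sees",)):    1.0,
}

def derivation_prob(rules):
    """Probability of a derivation, given as the list of productions used."""
    return prod(PCFG[r] for r in rules)

# Derivation of "Sue sees Mary":
deriv = [("S", ("Np", "Vp")), ("Np", ("Sue",)),
         ("Vp", ("V", "Np")), ("V", ("sees",)), ("Np", ("Mary",))]
p = derivation_prob(deriv)   # 1.0 * 0.5 * 1.0 * 1.0 * 0.5 = 0.25
```

A PCFG parser then searches the space of derivations for the one maximizing this product, typically with a Viterbi-style dynamic program rather than by enumeration.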
  27. 27. Problems with simple generative probabilistic models <ul><li>Incorporating more features into the model splits data, resulting in sparse data problems. </li></ul><ul><li>Joint maximum likelihood training “wastes” probability mass predicting the given part of the input data. </li></ul>
  28. 28. A currently popular technique: conditional maximum entropy models <ul><li>Basic models are of the form: </li></ul><ul><li>Advantages: </li></ul><ul><ul><li>Using more features does not require splitting data. </li></ul></ul><ul><ul><li>Training maximizes conditional probability rather than joint probability. </li></ul></ul>
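The formula image on this slide did not survive conversion. The standard form of a conditional maximum entropy model (a reconstruction, so the notation may differ from the original slide) is:

```latex
P(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}
                   {\sum_{y'} \exp\left(\sum_i \lambda_i f_i(x, y')\right)}
```

where the $f_i(x, y)$ are feature functions over the input $x$ and candidate output $y$, the $\lambda_i$ are their learned weights, and the denominator normalizes over all candidate outputs $y'$.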
  29. 29. Unsupervised learning in NLP <ul><li>Tries to infer unknown parameters and alignments of data to “hidden” states that best explain (i.e., assign highest probability to) un-annotated NL data. </li></ul><ul><li>Most common training method is Expectation Maximization (EM): </li></ul><ul><ul><li>Assume initial distributions for joint probability of alignments of hidden states to observable data. </li></ul></ul><ul><ul><li>Compute joint probabilities for observed training data and all possible alignments. </li></ul></ul><ul><ul><li>Re-estimate probability distributions based on probabilistically weighted counts from previous step. </li></ul></ul><ul><ul><li>Iterate last two steps until desired convergence is reached. </li></ul></ul>
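The four EM steps above can be sketched on a toy task (my illustration, not from the deck): learning word-translation probabilities t(f|e) from a tiny un-annotated parallel corpus, where the hidden states are the word alignments. This is the style of training used by the statistical MT models on the following slides.

```python
from collections import defaultdict

corpus = [("the house".split(), "la casa".split()),
          ("the book".split(), "el libro".split()),
          ("a book".split(), "un libro".split())]

# Step 1: assume an initial (uniform) distribution t(f|e).
e_vocab = {e for es, _ in corpus for e in es}
f_vocab = {f for _, fs in corpus for f in fs}
t = {(f, e): 1.0 / len(f_vocab) for e in e_vocab for f in f_vocab}

for _ in range(10):                  # Step 4: iterate until converged
    count = defaultdict(float)
    total = defaultdict(float)
    for es, fs in corpus:
        for f in fs:
            # Step 2: probability of each alignment of f to some e
            # in this sentence pair, under the current model.
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                # Step 3: accumulate probabilistically weighted counts.
                w = t[(f, e)] / norm
                count[(f, e)] += w
                total[e] += w
    # Re-estimate t(f|e) from the expected counts.
    t = {(f, e): count[(f, e)] / total[e] for (f, e) in t}
```

Even on this three-sentence corpus, the expected counts concentrate quickly: after a few iterations, t("libro"|"book") dominates the alternatives, with no alignment annotation ever given.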
  30. 30. Statistical machine translation <ul><li>A leading example of unsupervised learning in NLP. </li></ul><ul><li>Models are trained from parallel bilingual, but otherwise un-annotated corpora. </li></ul><ul><li>Models usually assume a sequence of words in one language is produced by a generative probabilistic process from a sequence of words in another language. </li></ul>
  31. 31. Structure of stat MT models <ul><li>Often a noisy-channel framework is assumed: </li></ul><ul><li>In basic models, each target word is assumed to be generated by one source word. </li></ul>
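The slide's formula image did not survive conversion. The standard noisy-channel decision rule (a reconstruction) for translating a foreign sentence $f$ into English $e$ is:

```latex
\hat{e} = \operatorname*{argmax}_{e} \; P(e)\, P(f \mid e)
```

where $P(e)$ is a language model of fluent English and $P(f \mid e)$ is a translation model of how $e$ could have produced $f$.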
  32. 32. A simple model: IBM model 1 <ul><li>A sentence e produces a sentence f assuming </li></ul><ul><ul><li>The length m of f is independent of the length l of e . </li></ul></ul><ul><ul><li>Each word of f is generated by one word of e (including an empty word e 0 ). </li></ul></ul><ul><ul><li>Each word in e is equally likely to generate the word at any position in f , independently of how any other words are generated. </li></ul></ul><ul><li>Mathematically: </li></ul>
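The formula image was lost in conversion. Under the three assumptions listed above, the standard statement of Model 1 (a reconstruction following Brown et al.'s formulation, where $\epsilon$ is the uniform length probability and $t$ the word-translation table) is:

```latex
P(f \mid e) = \frac{\epsilon}{(l+1)^{m}} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)
```

Each target word $f_j$ sums over all $l+1$ possible generators (the $l$ words of $e$ plus the empty word $e_0$), independently of the other target positions.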
  33. 33. More advanced models <ul><li>Most approaches </li></ul><ul><ul><li>Model how words are reordered (but crudely). </li></ul></ul><ul><ul><li>Model how many words a given word is likely to translate into. </li></ul></ul><ul><li>Best performing approaches model word-sequence-to-word-sequence translations. </li></ul><ul><li>Some initial work has been done on incorporating syntactic structure into models. </li></ul>
  34. 34. Examples of machine learned English/Italian word translations <ul><li>PROCESSOR PROCESSORE </li></ul><ul><li>APPLICATIONS APPLICAZIONI </li></ul><ul><li>SPECIFY SPECIFICARE </li></ul><ul><li>NODE NODO </li></ul><ul><li>DATA DATI </li></ul><ul><li>SERVICE SERVIZIO </li></ul><ul><li>THREE TRE </li></ul><ul><li>IF SE </li></ul><ul><li>SITES SITI </li></ul><ul><li>TARGET DESTINAZIONE </li></ul><ul><li>RESTORATION RIPRISTINO </li></ul><ul><li>ATTENDANT SUPERVISORE </li></ul><ul><li>GROUPS GRUPPI </li></ul><ul><li>MESSAGING MESSAGGISTICA </li></ul><ul><li>MONITORING MONITORAGGIO </li></ul><ul><li>THAT CHE </li></ul><ul><li>FUNCTIONALITY FUNZIONALITÀ </li></ul><ul><li>PHASE FASE </li></ul><ul><li>SEGMENT SEGMENTO </li></ul><ul><li>CUBES CUBI </li></ul><ul><li>VERIFICATION VERIFICA </li></ul><ul><li>ALLOWS CONSENTE </li></ul><ul><li>TABLE TABELLA </li></ul><ul><li>BETWEEN TRA </li></ul><ul><li>DOMAINS DOMINI </li></ul><ul><li>MULTIPLE PIÙ </li></ul><ul><li>NETWORKS RETI </li></ul><ul><li>A UN </li></ul><ul><li>PHYSICALLY FISICAMENTE </li></ul><ul><li>FUNCTIONS FUNZIONI </li></ul>
  35. 35. How do KE and ML approaches to NLP compare today? <ul><li>ML has become the dominant paradigm in NLP. (“Today’s students know everything about maxent modeling, but not what a noun phrase is.”) </li></ul><ul><li>ML results are easier to transfer than KE results. </li></ul><ul><li>We probably now have enough computer power and data to learn more by ML than a linguistic expert could encode in a lifetime. </li></ul><ul><li>In almost every independent evaluation, ML methods outperform KE methods in practice. </li></ul>
  36. 36. Do we still need linguistics in computational linguistics? <ul><li>There are still many things we are not good at modeling statistically. </li></ul><ul><li>For example, stat MT models based on single-words or strings are good at getting the right words, but poor at getting them in the right order. </li></ul><ul><li>Consider: </li></ul><ul><ul><li>La profesora le gusta a tu hermano. </li></ul></ul><ul><ul><li>Your brother likes the teacher. </li></ul></ul><ul><ul><li>The teacher likes your brother. </li></ul></ul>
  37. 37. Concluding thoughts <ul><li>If forced to choose between a pure ML approach and a pure KE approach, ML almost always wins. </li></ul><ul><li>Statistical models still seem to need a lot more linguistic features for really high performance. </li></ul><ul><li>A lot of KE is actually hidden in ML approaches, in the form of annotated data, which is usually expensive to obtain. </li></ul><ul><li>The way forward may be to find methods for experts to give advice to otherwise unsupervised ML methods, which may be cheaper than annotating enough data to learn the content of the advice. </li></ul>