
The NLP Muppets revolution!


The NLP muppets revolution! @ Data Science London 2019

video: https://skillsmatter.com/skillscasts/13940-a-deep-dive-into-contextual-word-embeddings-and-understanding-what-nlp-models-learn

event: https://www.meetup.com/Data-Science-London/events/261483332/



  1. 1. The NLP Muppets revolution! Fabio Petroni Data Science London 28 May 2019
  2. 2. Disclaimer 2 This is not a Data Science talk
  3. 3. Outline 1. NLP pre-2018 (15 mins) 2. The Revolution (30 mins) 3. Related research activities in FAIR London (15 mins) 3
  4. 4. Natural Language Processing Goal: for computers to process or “understand” natural language in order to perform tasks that are useful. Christopher Manning 4 • Fully understanding and representing the meaning of language (or even defining it) is an AI-complete problem.
  5. 5. Discrete representation of word meaning 5 • WordNet: lexical database for the English language • “one-hot” encoding of words • Problem: no notion of relationships (e.g., similarity) between words
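  To make the “no notion of similarity” point concrete, here is a minimal sketch (the three-word toy vocabulary is my own illustration, not from the talk): with one-hot vectors, every pair of distinct words is equally dissimilar.

  ```python
  import numpy as np

  # Toy vocabulary with "one-hot" word vectors (illustrative only).
  vocab = ["hotel", "motel", "cat"]
  one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

  # Every pair of distinct words has dot product 0, so "hotel" is no
  # closer to "motel" than it is to "cat": no notion of similarity.
  print(one_hot["hotel"] @ one_hot["motel"])  # 0.0
  print(one_hot["hotel"] @ one_hot["cat"])    # 0.0
  ```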
  6. 6. Word meaning as a neural word vector • Representing a word by means of its neighbors, learned from large corpora • Unsupervised! Source: https://becominghuman.ai/how-does-word2vecs-skip-gram-work-f92e0525def4 6
  7. 7. Word meaning as a neural word vector Place words in high dimensional vector spaces Source: https://www.tensorflow.org/tutorials/representation/word2vec • Representing a word by means of its neighbors 7
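  As a rough illustration of learning word vectors from their neighbors, here is a sketch using gensim’s skip-gram implementation; the tiny corpus and hyperparameters are placeholders, since real embeddings are trained on billions of tokens.

  ```python
  from gensim.models import Word2Vec

  # Tiny toy corpus; word2vec is normally trained on very large corpora.
  sentences = [
      ["the", "cat", "sat", "on", "the", "mat"],
      ["the", "dog", "sat", "on", "the", "rug"],
      ["a", "cat", "chased", "a", "dog"],
  ]

  # sg=1 selects the skip-gram objective (predict neighbors from the word).
  model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

  print(model.wv["cat"].shape)                 # a dense 50-dimensional vector
  print(model.wv.most_similar("cat", topn=3))  # nearest words in the vector space
  ```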
  8. 8. NLU Models, pre-2018: Architectures Galore! • Design a custom model for each task • Train on as much labeled data as you have • Often use pretrained word embeddings, but not always E.g. BiDAF (Seo et al, 2017) for Reading Comprehension: only the first layer is pre-trained 8
  9. 9. Limitations of word embeddings • Word2vec, GloVe and related methods produce shallow representations of language (akin to edges in an image) and fail to capture its high-level structure • A model that uses such word embeddings needs to learn complex language phenomena by itself: word-sense disambiguation, compositionality, long-term dependencies, negation, etc. • As a result, a huge number of examples is needed to achieve good performance 9
  10. 10. Paradigm Shift: from initializing the first layer of our models to pretraining the entire model 10
  11. 11. Paradigm Shift: from initializing the first layer of our models to pretraining the entire model. Which task to use? 11
  12. 12. Language Modelling (LM) Aim: to predict the next word given the previous words. p(w_1, …, w_n) = ∏_{i=1}^{n} p(w_i | w_1, …, w_{i-1}) e.g.: p(The cat is on the table) = p(The) x p(cat | The) x p(is | cat, The) x p(on | is, cat, The) ... p(w_i | w_1, …, w_{i-1}) is often a (recurrent) neural network 12
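  A minimal sketch of the chain-rule factorisation above, using hand-picked conditional probabilities; the numbers are invented purely to show how the product becomes a sum of log probabilities.

  ```python
  import math

  # Invented conditional probabilities p(w_i | w_1..w_{i-1}) for one sentence.
  cond_prob = {
      ("The", ""): 0.20,
      ("cat", "The"): 0.05,
      ("is", "The cat"): 0.30,
      ("on", "The cat is"): 0.20,
      ("the", "The cat is on"): 0.50,
      ("table", "The cat is on the"): 0.10,
  }

  def sentence_log_prob(words):
      """log p(w_1..w_n) = sum_i log p(w_i | w_1..w_{i-1})."""
      total = 0.0
      for i, w in enumerate(words):
          history = " ".join(words[:i])
          total += math.log(cond_prob[(w, history)])
      return total

  print(sentence_log_prob(["The", "cat", "is", "on", "the", "table"]))
  ```

  In a real language model, each conditional p(w_i | w_1, …, w_{i-1}) comes from a (recurrent or Transformer) network rather than a lookup table.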
  13. 13. What are LMs used for? Until ~2018 • Used in text generation, e.g. machine translation or speech recognition • A long-standing proof of concept for language understanding, largely ignored As of this year • Use for everything! • Pretrain on lots of data • Finetune for specific tasks E.g. some of the embeddings in the NLP-from-scratch papers (Collobert et al., 2009-2011) were extracted from language models: “Following our NLP from scratch philosophy, we now describe how to dramatically improve these embeddings using large unlabeled data sets.” “Very long training times make such strategies necessary for the foreseeable future: if we had been given computers ten times faster, we probably would have found uses for data sets ten times bigger.” 13
  14. 14. Are LMs NLP-complete? AI-complete? • Good LMs definitely won’t solve vision or robotics or grounded language understanding… :) But, for NLP… • How much signal is there in raw text alone? • The signal is weak, and it is unclear how best to learn from it… World knowledge: The Mona Lisa is a portrait painting by [ Leonardo | Obama ] Coreference resolution: John robbed the bank. He was [ arrested | terrified | bored ] Machine translation: Belka is the Russian word for [ cat | squirrel ] 14
  15. 15. Language Model Pretraining * • ELMo: contextual word embeddings [Best paper, NAACL 2018] • GPT: no embeddings, finetune LM • BERT: bidirectional LM finetuning [Best paper, NAACL 2019] • GPT-2: even larger scale, zero-shot *See also Dai & Le (2015), Peters et al (2017), Howard & Ruder (2018), and many others 15
  16. 16. Contextual word embeddings: f(w_k | w_1, …, w_n) ∈ ℝ^N f(play | Elmo and Cookie Monster play a game .) ≠ f(play | The Broadway play premiered yesterday .) 16 The Allen Institute for Artificial Intelligence
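  A sketch of the same point with an off-the-shelf model: the slides use ELMo, but any contextual encoder shows it; here I assume the Hugging Face transformers and PyTorch packages and use BERT instead, so the exact numbers are illustrative only.

  ```python
  import torch
  from transformers import AutoModel, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModel.from_pretrained("bert-base-uncased")

  def word_vector(sentence, word):
      """Return the contextual vector of `word` inside `sentence`."""
      enc = tok(sentence, return_tensors="pt")
      with torch.no_grad():
          hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
      tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
      return hidden[tokens.index(word)]

  v_game = word_vector("elmo and cookie monster play a game .", "play")
  v_theatre = word_vector("the broadway play premiered yesterday .", "play")

  # The two "play" vectors differ: f(play | context1) != f(play | context2).
  print(torch.cosine_similarity(v_game, v_theatre, dim=0).item())
  ```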
  17. 17. Contextual word embeddings AI2 ELMo • Train two LMs: left-to-right and right-to-left [Peters et al, 2018] [figure: a forward LSTM stack reading “The Broadway play premiered yesterday”] 17
  18. 18. Contextual word embeddings AI2 ELMo • Train two LMs: left-to-right and right-to-left [Peters et al, 2018] [figure: forward and backward LSTM stacks reading “The Broadway play premiered yesterday”] 18
  19. 19. Embeddings from Language Models AI2 ELMo • Train two LMs: left-to-right and right-to-left • Extract contextualized vectors from the networks • Use them instead of word embeddings (e.g. FastText) • Still use custom task architectures… [Peters et al, 2018] [figure: the ELMo vector for “play” is a function of the forward and backward LSTM states] 19
  20. 20. AI2 ELMo • Train two LMs: left-to-right and right-to-left • Extract contextualized vectors from the networks • Use them instead of word embeddings (e.g. FastText) • Still use custom task architectures… [Peters et al, 2018] 20
  21. 21. AI2 ELMo • Train two LMs: left-to-right and right-to-left • Extract contextualized vectors from the networks • Use them instead of word embeddings (e.g. FastText) • Still use custom task architectures… [Peters et al, 2018] 21
  22. 22. AI2 ELMo • Train two LMs: left-to-right and right-to-left • Extract contextualized vectors from the networks • Use them instead of word embeddings (e.g. FastText) • Still use custom task architectures… [Peters et al, 2018] • Simple and general, building on recent ideas [text cat. (Dai and Le 2015); tagging (Peters et al, 2017)] • Showed self-pre-training working for lots of problems! • Still assumed custom task-specific architectures… 22
  23. 23. Question: Why Embeddings At All? 23
  24. 24. The Transformer (A Vaswani et al, 2017) 24 Google AI
  25. 25. The Transformer (A Vaswani et al, 2017) 25
  26. 26. 1. Pretrain: train a Transformer LM on huge corpora (billions of words), unsupervised. This takes days/weeks on several GPUs/TPUs, but pre-trained models are publicly available on the web! 2. Fine-tune: add a simple component on top of the pretrained Transformer and fine-tune it for a specific task on as much labeled data as you have (supervised). Fast!
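  A hedged sketch of the fine-tuning step in this recipe, assuming the Hugging Face transformers library and a two-example toy dataset (both are my own illustration, not the talk’s code): a pretrained Transformer plus a small classification head, trained briefly on labeled data.

  ```python
  import torch
  from torch.optim import AdamW
  from transformers import AutoModelForSequenceClassification, AutoTokenizer

  # Toy labeled data; real fine-tuning uses a proper task dataset.
  texts = ["kids love gelato!", "this movie was terrible."]
  labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

  tok = AutoTokenizer.from_pretrained("bert-base-uncased")
  # Pretrained Transformer + a freshly initialised classification layer on top.
  model = AutoModelForSequenceClassification.from_pretrained(
      "bert-base-uncased", num_labels=2
  )

  optimizer = AdamW(model.parameters(), lr=2e-5)
  batch = tok(texts, padding=True, return_tensors="pt")

  model.train()
  for _ in range(3):  # a few gradient steps; real fine-tuning runs longer
      out = model(**batch, labels=labels)
      out.loss.backward()
      optimizer.step()
      optimizer.zero_grad()
      print(out.loss.item())
  ```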
  27. 27. Why embeddings? OpenAI GPT • Train a transformer language model E.g., can we pre-train a common architecture? [Radford et al, 2018] [figure: a stack of Transformer layers predicting the next word after “… The submarine is”] 27 OpenAI
  28. 28. Why embeddings? OpenAI GPT • Train transformer language model • Encode multiple sentences at once E.g., can we pre-train a common architecture? [Radford et al, 2018] Textual Classification (e.g. sentiment) <S> Kids love gelato! <E> Textual Entailment <S> Kids love gelato. <SEP> No one hates gelato. <E> 28
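  A minimal sketch of how different tasks are packed into a single token sequence, using the delimiter notation from the slide; the actual GPT implementation uses its own learned special tokens, so these string helpers are purely illustrative.

  ```python
  # Delimiter tokens as written on the slide (illustrative placeholders).
  S, SEP, E = "<S>", "<SEP>", "<E>"

  def classification_input(text: str) -> str:
      """Single-sentence tasks, e.g. sentiment classification."""
      return f"{S} {text} {E}"

  def entailment_input(premise: str, hypothesis: str) -> str:
      """Sentence-pair tasks, e.g. textual entailment."""
      return f"{S} {premise} {SEP} {hypothesis} {E}"

  print(classification_input("Kids love gelato!"))
  print(entailment_input("Kids love gelato.", "No one hates gelato."))
  ```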
  29. 29. Why embeddings? OpenAI GPT • Train transformer language model • Encode multiple sentences at once • Add small layer on top • Fine tune for each end task E.g., can we pre-train a common architecture? [Radford et al, 2018] 29
  30. 30. Without ensemble for SNLI: ESIM+ELMo (88.7) vs GPT (89.9)
  31. 31. Without ensemble for SNLI: ESIM+ELMo (88.7) vs GPT (89.9) • Train on longer texts, with self-attention • Idea of a unified architecture is currently winning! • Will scale up really nicely (e.g. GPT-2) • Isn’t bidirectional… 31
  32. 32. Question: Why not bidirectional? 32
  33. 33. Google’s BERT • Train a masked language model • Otherwise, largely adopt the OpenAI approach New task: predict the missing/masked word(s) Idea: jointly model left and right context [Devlin et al, 2018] [figure: a bidirectional Transformer stack predicting the masked word in “… The [MASK] is yellow …”] 33 Google AI
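  The masked-word prediction task is easy to try with a released BERT checkpoint; this sketch assumes the Hugging Face transformers fill-mask pipeline.

  ```python
  from transformers import pipeline

  # Ask BERT to fill in the blank, as in the slide's "The [MASK] is yellow" example.
  fill_mask = pipeline("fill-mask", model="bert-base-uncased")

  for pred in fill_mask("The [MASK] is yellow."):
      print(f'{pred["token_str"]:>10}  {pred["score"]:.3f}')
  ```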
  34. 34. Google’s BERT • Train masked language model • Otherwise, largely adopt OpenAI approach • Generalize beyond text classification New task: predict the missing/masked word(s) [Devlin et al, 2018] Idea: jointly model left and right context 34
  36. 36. Google’s BERT • Train masked language model • Otherwise, largely adopt OpenAI approach • Generalize beyond text classification, e.g. for reading comprehension [Devlin et al, 2018] Idea: jointly model left and right context [figure: comparison of task setups] 36
  37. 37. Google’s BERT • Train masked language model • Otherwise, largely adopt OpenAI approach • Generalize beyond text classification New task: predict the missing/masked word(s) [Devlin et al, 2018] Idea: jointly model left and right context • Bidirectional reasoning is important! • Much better comparison, clearly the best approach • Arguably killing off “architecture hacking” research • Currently the focus of intense study in the NLP community 37
  38. 38. Question: Why fine-tune? 38
  39. 39. What happens with more data? OpenAI GPT-2 • Same model as GPT • Still left-to-right • Add more parameters and train on cleaner data! • Adopt a new evaluation scheme… Also, don’t fine-tune for different end tasks… [Radford et al, 2019] [figure: a larger left-to-right Transformer stack predicting the next word after “… premiered yesterday … It was”] 39
  40. 40. Very good language models! Caveat: training sets differ… 40
  41. 41. Very good at generating text! Too good to release to the public… 41
  42. 42. Evaluation: zero shot OpenAI GPT-2 • Same model as GPT • Still left-to-right • Add more parameters and train on cleaner data! • Adopt a new evaluation scheme… [Radford et al, 2019] Assume no labeled data… Question Answering: “Q: Who wrote the origin of species? A: ???” Summarization: “[input document, e.g. news article]. TL;DR: ???” Machine Translation: “[English sentence]1 = [French sentence]1, … [English sentence]n = ???” 42
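  A sketch of the zero-shot idea with the public GPT-2 checkpoint via the Hugging Face transformers pipeline; the prompt format follows the slide, and the small released model will not necessarily answer correctly.

  ```python
  from transformers import pipeline

  # No fine-tuning: just condition the LM on a task-shaped prompt.
  generate = pipeline("text-generation", model="gpt2")

  prompt = "Q: Who wrote the origin of species?\nA:"
  out = generate(prompt, max_new_tokens=10, do_sample=False)
  print(out[0]["generated_text"])
  ```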
  43. 43. Zero shot results… Performs near baseline, with lots of data… 43
  44. 44. Zero shot QA results… Model is learning lots of facts! 44
  45. 45. Zero shot QA results… Model is learning lots of facts! • Left-to-right again... but at least easy to sample from • Zero shot results are still at baseline levels… • LMs clearly improve with more parameters, more data • Very nice job selling the work to the press
  46. 46. FAIR research activities 46
  47. 47. Research Question • How much knowledge do LMs store? • How does their performance compare to automatically constructed KBs?
  48. 48. What does BERT know? [figure: querying an off-the-shelf pretrained LM vs. a knowledge base populated by a relation-extraction system (Sorokin and Gurevych, 2017), with and without oracle information] 48
  49. 49. Methodology • Single-token objects and answers • Unified vocabulary • 50k facts • Language models rank every word in the vocabulary by its probability. Example predictions for “The theory of relativity was developed by ___ .”: Einstein [-1.143], him [-2.994], Newton [-3.758], Aristotle [-4.477], Maxwell [-4.486]
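  A sketch of this ranking methodology (not the LAMA code itself): score every vocabulary token for the masked slot with a pretrained masked LM and report the top candidates with their log probabilities. It assumes the Hugging Face transformers and PyTorch packages.

  ```python
  import torch
  from transformers import AutoModelForMaskedLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("bert-base-cased")
  model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

  text = f"The theory of relativity was developed by {tok.mask_token} ."
  enc = tok(text, return_tensors="pt")
  mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero().item()

  with torch.no_grad():
      logits = model(**enc).logits[0, mask_pos]   # one score per vocabulary token
  log_probs = torch.log_softmax(logits, dim=-1)

  # Rank the whole vocabulary and print the five most probable fillers.
  top = torch.topk(log_probs, k=5)
  for lp, idx in zip(top.values, top.indices):
      print(f"{tok.convert_ids_to_tokens(idx.item()):>12}  {lp.item():.3f}")
  ```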
  50. 50. Factual Knowledge [bar charts: precision@1 on Google-RE and T-REx for the Frequency, KBP, KBP Oracle, ELMo and BERT baselines]
  51. 51. [plot: mean P@k on T-REx as a function of k] Legend: Fs: Fairseq-fconv, Txl: Transformer-XL, Eb: ELMo base, E5B: ELMo 5.5B, Bb: BERT-base, Bl: BERT-large
  52. 52. Common Sense Knowledge (ConceptNet) [bar chart: precision@1 for the Frequency, ELMo and BERT baselines] 54
  53. 53. Examples of Generations [table: BERT-large generations; the last column reports the top-5 tokens generated together with the associated log probability (in square brackets)]
  54. 54. Question Answering 56
  55. 55. Question Answering [bar chart comparing DrQA and BERT at p@1 and p@10]
  56. 56. https://github.com/facebookresearch/LAMA 58
  57. 57. download pretrained language models rather than pretrained word embeddings! 59
  58. 58. THANK YOU
