Ekaterina Vylomova: What Do Neural Models "Know" About Natural Language? (part 1)

  1. What Do Neural Models "Know" About Natural Language? Ekaterina Vylomova
  2. 1943: Artificial Neuron (McCulloch-Pitts) ... or, in other words, $\hat{y} = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$
  3. 1943: Artificial Neuron (McCulloch-Pitts) ... or, in other words, $\hat{y} = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$, and the activation function might be the sigmoid: $\mathrm{sig}(x) = \frac{1}{1+e^{-x}}$
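To make the slide's formula concrete, here is a minimal numpy sketch of such an artificial neuron with a sigmoid activation; the weights, bias, and inputs are arbitrary toy values, not anything from the slides.

```python
import numpy as np

def sigmoid(x):
    # sig(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def neuron(x, w, b):
    # y_hat = f(sum_i w_i * x_i + b), with f = sigmoid
    return sigmoid(np.dot(w, x) + b)

# Toy example with made-up weights: two inputs, one output in (0, 1).
x = np.array([0.5, -1.0])
w = np.array([0.8, 0.3])
b = 0.1
print(neuron(x, w, b))
```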
  4. 1957: Simple Perceptron
     The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain
     Trained with a trial-and-error method
     It can:
     – generalize over characters
     – discover character-specific features
     But:
     – failed to recognize badly written, differently sized, or partially closed characters
  5. 1960s: Single-Layer Perceptron
     The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain
     Perceptrons: An Introduction to Computational Geometry
     The XOR Problem
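The XOR problem can be shown in a few lines: below is a sketch of the classic error-driven perceptron update applied to the XOR truth table (toy data and epoch count chosen for illustration); no single weight vector ever separates the two classes.

```python
import numpy as np

# XOR truth table: no straight line separates class 0 from class 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

w, b = np.zeros(2), 0.0
for epoch in range(100):
    errors = 0
    for xi, target in zip(X, y):
        pred = int(np.dot(w, xi) + b > 0)
        update = target - pred          # classic perceptron rule
        w += update * xi
        b += update
        errors += abs(update)
    if errors == 0:
        break

misclassified = sum(int(np.dot(w, xi) + b > 0) != t for xi, t in zip(X, y))
print("misclassified after training:", misclassified)
# The loop never reaches zero errors: XOR is not linearly separable.
```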
  6. 1980s: Multi-Layer Perceptrons with Back-Propagation
     Learning Internal Representations by Error Propagation
     Solving problems with non-linearly separable cases
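By contrast, a one-hidden-layer MLP trained with back-propagation handles the same XOR data. This is a hedged numpy sketch with arbitrary layer sizes and learning rate, not the original PDP setup.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 4 sigmoid units is enough for XOR.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sig = lambda z: 1 / (1 + np.exp(-z))

lr = 1.0
for _ in range(5000):
    h = sig(X @ W1 + b1)                      # forward pass
    out = sig(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)       # backprop through MSE + sigmoid
    d_h = d_out @ W2.T * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(out.round(3).ravel())  # typically approaches [0, 1, 1, 0]
```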
  7. 1980s: The Past Tense Debate
     Rumelhart & McClelland (1985): On learning the past tenses of English verbs
  8. 1980s: The Past Tense Debate
     Rumelhart & McClelland (1985): On learning the past tenses of English verbs
     Pinker & Prince, 1988: Extremely poor empirical performance!
  9. 1990s: RNNs
     Finding Structure in Time
     Exploring:
     – context-dependent learning
     – structure in letter sequences
     – learning lexical classes from word order
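The recurrence behind "Finding Structure in Time" is a hidden state that mixes the current input with the previous hidden state (the "context units"). A minimal sketch, with toy dimensions and random stand-in weights:

```python
import numpy as np

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    # Elman-style recurrence: the new hidden state depends on the
    # current input and the previous hidden state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(1)
d_in, d_hid = 5, 8                       # toy sizes
W_xh = rng.normal(scale=0.1, size=(d_hid, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
b_h = np.zeros(d_hid)

h = np.zeros(d_hid)
for x_t in rng.normal(size=(3, d_in)):   # a 3-step input sequence
    h = elman_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)                           # (8,)
```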
  10. 1990s: CNNs
      Backpropagation Applied to Handwritten Zip Code Recognition
      Training data: 9,298 segmented numerals from U.S. mail
      Misclassified: Training – 0.14%; Test – 5.0%
  11. Meanwhile in NLP: Language Modelling (mostly n-grams with Kneser-Ney smoothing)
      OK, Marvin, which word comes next: "Two cats are ___"
      Hmmm, let me guess ...
      sitting   3.01 × 10⁻⁴
      play      2.87 × 10⁻⁴
      running   2.53 × 10⁻⁴
      nice      2.32 × 10⁻⁴
      lost      1.97 × 10⁻⁴
      playing   1.66 × 10⁻⁴
      sat       1.54 × 10⁻⁴
      plays     1.32 × 10⁻⁴
      ...
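For contrast with the neural models below, next-word prediction in a count-based model is just conditional counting. A bare-bones bigram sketch on a made-up corpus (real systems of that era used higher-order n-grams with Kneser-Ney discounting and back-off, which this omits):

```python
from collections import Counter, defaultdict

corpus = "two cats are sitting . two cats are playing . two dogs are running .".split()

bigrams = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigrams[prev][word] += 1

def next_word_probs(prev):
    # Maximum-likelihood estimate p(word | prev) from raw counts.
    counts = bigrams[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("are"))   # {'sitting': 1/3, 'playing': 1/3, 'running': 1/3}
```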
  12. 2013: Word2Vec Skip-Gram
      Distributed Representations of Words and Phrases and their Compositionality
      Training objective: $\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$
      $p(w_o \mid w_i) = \frac{\exp(v_{w_o}^{\top} v_{w_i})}{\sum_{w=1}^{W} \exp(v_w^{\top} v_{w_i})}$
      For efficiency, the softmax was replaced with Negative Sampling.
      Levy et al., 2015 experimented with a positive pointwise mutual information (PPMI) matrix and showed that Word2Vec Skip-Gram with negative sampling is implicit matrix factorization.
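As a usage sketch (not from the slides), the Skip-Gram-with-Negative-Sampling setup can be reproduced with gensim; the parameter names assume gensim 4.x and the two-sentence corpus is made up.

```python
# pip install gensim   (gensim 4.x parameter names assumed)
from gensim.models import Word2Vec

sentences = [
    ["two", "cats", "are", "sitting", "on", "the", "mat"],
    ["two", "dogs", "are", "running", "in", "the", "park"],
]

# sg=1 selects the Skip-Gram architecture; negative=5 uses Negative Sampling
# instead of the full softmax, as in Mikolov et al. (2013).
model = Word2Vec(sentences, vector_size=50, window=2, sg=1,
                 negative=5, min_count=1, epochs=50)
print(model.wv.most_similar("cats", topn=3))
```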
  13. 2013: Word2Vec CBOW
      Efficient Estimation of Word Representations in Vector Space
      Training objective: $\frac{1}{T}\sum_{t=1}^{T} \log p(w_t \mid w_{[t-c,\,t+c]})$
      $p(w_o \mid w_i) = \frac{\exp\left(v_{w_o}^{\top} \sum_{-c \le j \le c,\, j \ne 0} v_{w_{i+j}}\right)}{\sum_{w=1}^{W} \exp\left(v_{w}^{\top} \sum_{-c \le j \le c,\, j \ne 0} v_{w_{i+j}}\right)}$
  14. 2013: Word2Vec Linear Relations and Compositionality
  15. 2013: Word2Vec: Word Analogies
      Linear Relations and Compositionality: Russia + river = Volga_river
  16. 2013: Word2Vec: Word Analogies
      Linear Relations and Compositionality: king - man + woman = queen?
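The analogy test itself is vector arithmetic followed by a nearest-neighbour search. A hedged sketch using gensim's downloader with the small pre-trained "glove-wiki-gigaword-50" vectors (any pre-trained embedding would do):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")   # downloads a small GloVe model

# king - man + woman ≈ ?
# (3CosAdd: maximize cosine to king and woman, minimize cosine to man)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is typically the top answer, though not for every relation type.
```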
  17–21. Word Analogies on other embeddings
      Word Embeddings, Analogies, and Machine Learning: Beyond King - Man + Woman = Queen
  22. Pre-trained Word2Vec (Google News): Bias and Stereotypes
      Man is to Computer Programmer as Woman is to Homemaker?
  23. Word2Vec trained on Reddit data: Bias and Stereotypes
      Black is to Criminal as Caucasian is to Police
  24. Data Bias and Stereotypes: Gendered Language
      Positive adjectives describing women are often related to their bodies, while positive adjectives describing men are often related to their behavior.
  25. Word2Vec and similar models: what do the models learn?
      Morphology
      – Capable of learning inflections, but much less so derivations (less regular and compositional)
      Lexical Semantics
      – Challenging, especially meronyms, antonyms, synonyms
      Major Difficulties
      – Polysemy (all word senses collapsed into a single vector)
      – Negation
  26. Broader context – back to RNNs!
  27. Neural Machine Translation: Seq2Seq Models (Sutskever et al., 2014)
      The resulting LSTM has 384M parameters; 64M are pure recurrent connections
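A skeletal PyTorch sketch of the encoder-decoder idea, with toy vocabulary sizes and dimensions rather than the paper's 4-layer, 1000-unit LSTMs: the entire source sentence is squeezed into the encoder's final state, which then conditions the decoder.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=32, hid=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src, tgt):
        # Encode the whole source sentence into the final (h, c) state ...
        _, state = self.encoder(self.src_emb(src))
        # ... and condition the decoder on that single fixed-size vector.
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec_out)                 # logits over target vocabulary

model = Seq2Seq(src_vocab=100, tgt_vocab=120)
src = torch.randint(0, 100, (2, 7))              # batch of 2 source sentences
tgt = torch.randint(0, 120, (2, 5))              # teacher-forced target prefixes
print(model(src, tgt).shape)                     # torch.Size([2, 5, 120])
```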
  28. BUT: Longer contexts – lower quality (vanishing gradient)
      Long Short-Term Memory will solve it!
  29. Neural Machine Translation: Seq2Seq Models (Sutskever et al., 2014)
      PCA projection of the LSTM hidden states of the corresponding sequences
  30. We can also use both directions (to encode source language)
  31–32. Neural Machine Translation: Seq2Seq Models w/ Attention (Bahdanau et al., 2014)
      A whole sentence shouldn't be compressed into a single vector! Use Attention!
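Concretely, "use Attention" amounts to the following scoring step, sketched here in numpy with random stand-in weights in the additive (Bahdanau-style) form: score every encoder state against the current decoder state, softmax the scores into alignment weights, and take the weighted sum as the context vector.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 6, 8, 8, 10       # toy dimensions
H = rng.normal(size=(T, d_enc))            # encoder hidden states h_1..h_T
s = rng.normal(size=d_dec)                 # current decoder state s_{t-1}

W_a = rng.normal(size=(d_att, d_dec))
U_a = rng.normal(size=(d_att, d_enc))
v_a = rng.normal(size=d_att)

# e_j = v_a^T tanh(W_a s + U_a h_j); alpha = softmax(e); context = sum_j alpha_j h_j
e = np.tanh(W_a @ s + H @ U_a.T) @ v_a
alpha = np.exp(e - e.max()); alpha /= alpha.sum()
context = alpha @ H
print(alpha.round(3), context.shape)       # alignment weights over the source, (8,)
```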
  33. Neural Machine Translation: Seq2Seq Models w/ Attention (Bahdanau et al., 2014)
      It learns alignment and it can be visualized!
  34. Neural Machine Translation: Seq2Seq Models w/ Attention (Bahdanau et al., 2014)
      What do the models learn?
      Belinkov et al., 2018a, 2018b
      – Higher layers are better at learning semantics, while lower layers tend to be better for part-of-speech tagging
      – Lower layers of the neural network are better at capturing morphology
      Linzen et al., 2018, 2020: English subject-verb agreement
      – LSTMs were able to learn to perform the verb-number agreement task in most cases, although their error rate increased on particularly difficult sentences
      – The LM objective is not by itself sufficient for learning structure-sensitive dependencies; the authors suggest a joint training objective
  35. Neural Machine Translation: Seq2Seq Models w/ Attention (Bahdanau et al., 2014)
      What do the models learn? Vylomova et al., 2019
      – Contextual inflection in 10 languages: "Three little kitten were _sit_ on the mat." Predict: sitting
      – Agreement: Adjective-Noun is OK, Subject-Verb is more challenging
      – Morphological complexity matters (Uralic languages are more challenging than Germanic)
      – Inherent vs. contextual categories: inherent ones (e.g. tense, noun number, without agreement or an extra signal) cannot be predicted
  36. Back to the Past Tense Debate: Seq2Seq Models w/ Attention
      Kirov & Cotterell, 2018: The model obviates most of Pinker and Prince's criticisms
      SIGMORPHON 2016 Shared Task, Task 1: run + V;PRES;3SG → runs
      On Arabic, Finnish, Georgian, German, Hungarian, Maltese, Navajo, Russian, Spanish
  37. Lake et al., 2018: Compositionality of RNNs
      A simplified version of the CommAI Navigation tasks
  38. Lake et al., 2018: Compositionality of RNNs
      A simplified version of the CommAI Navigation tasks
      Successful zero-shot generalizations when the differences between training and test commands are small
      Trained on "run", "jump", and "run twice", the model fails on "jump twice"
  39. Contextualized Embeddings: Addressing the polysemy problem!
      Context matters! ELMo: Let's make context-specific embeddings!
      Features
      – Two independent(!) LSTMs (forward and backward)
      – Pre-trained embeddings
      – Weighted, task-specific sum of embeddings (two hidden states + word vector)
  40. Self-Attention (Cheng et al., 2016)
      Relate parts of a single sequence to each other to compute its representation
      Shows similarity to other parts! Helpful for coreference resolution!
  41. Contextualized Embeddings
      Transformer: Attention Is All You Need
      Features
      – No recurrence, but a wide window (somewhat similar to CNNs)
      – Positional embeddings (to access token positions)
      – Self-attention with several heads (matrices) and separate key, query and value (masks)
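A toy numpy sketch of a single scaled dot-product attention head with separate query, key, and value projections; the Transformer runs several such heads in parallel and adds positional embeddings to the inputs. All matrices here are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 16, 8                  # sequence length and toy dimensions
X = rng.normal(size=(T, d_model))           # token representations

W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_k)             # how strongly each token attends to each other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V
print(weights.shape, output.shape)          # (5, 5) attention map, (5, 8) new representations
```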
  42. Contextualized Embeddings
      BERT: Deep Bidirectional Transformers
      Features
      – Trained on: masked token prediction + next sentence prediction (binary)
      – BPE-style subword tokenization (WordPiece)
      – Window: 512 tokens; the [CLS] token is used for classification
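The masked-token objective is easy to probe with the Hugging Face transformers library, assuming the bert-base-uncased checkpoint is available:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("Two cats are [MASK] on the mat."):
    print(f'{pred["token_str"]:>10}  {pred["score"]:.3f}')
# The model fills the blank from both left and right context, unlike a
# left-to-right n-gram or RNN language model.
```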
  43. Contextualized Embeddings BERT: Deep Bidirectional Transformers
  44. Contextualized Embeddings: BERTs
      BERT-Base: L=12, H=768, A=12, total parameters = 110M
      BERT-Large: L=24, H=1024, A=16, total parameters = 340M
      (L = number of layers, H = hidden size, A = number of self-attention heads)
  45. Contextualized Embeddings: BERT
  46–47. Contextualized Embeddings: Word Sense Disambiguation
      "A mouse consists of an object held in one's hand, with one or more buttons."
      "Mouse" – an electronic device
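A hedged sketch of why contextual embeddings help with word sense disambiguation: extract BERT's vector for "mouse" in different sentences and compare them. It assumes the transformers package with PyTorch, and the extra sentences are made up for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mouse_vector(sentence):
    # Return the contextual vector of the token "mouse" in this sentence.
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]     # (seq_len, 768)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("mouse"))
    return hidden[idx]

device = mouse_vector("A mouse consists of an object held in one's hand, with one or more buttons.")
device2 = mouse_vector("Click the left button of the mouse to select the file.")
animal = mouse_vector("The cat chased a mouse across the kitchen floor.")

cos = torch.nn.functional.cosine_similarity
print(cos(device, device2, dim=0), cos(device, animal, dim=0))
# The two device uses are typically closer to each other than to the animal use.
```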
  48. Contextualized Embeddings: Coreference Resolution
      Coreference resolution task: "The secretary called the physician and told _him_ about a new patient."
      him → physician
  49. Contextualized Embeddings: Coreference Resolution
      Gender Bias in Coreference Resolution
      WinoBias: Winograd-schema-style sentences with entities corresponding to people referred to by their occupation
  50. Contextualized Embeddings: Bias, bias, bias
      Zhao et al., 2019
      – A SOTA coreference system that depends on ELMo inherits its bias and demonstrates significant bias on WinoBias
      – The training data for ELMo contains significantly more male than female entities
      – The trained ELMo embeddings systematically encode gender information
      – ELMo unequally encodes gender information about male and female entities
  51. Contextualized Embeddings: What does BERT know? (Rogers et al., 2020)
      Syntax
      – Representations are hierarchical rather than linear and encode POS and syntactic roles (Liu et al., 2019a,b)
      – Does not "understand" negation and is insensitive to malformed input (Ettinger, 2019)
      Semantics
      – Has some knowledge of semantic roles (Ettinger, 2019)
      – Struggles with representations of numbers (floating-point notation; Wallace et al., 2019b)
      World Knowledge
      – Cannot reason based on its world knowledge ("A dog entered the room" does not entail that the room is larger than the dog)
  52. Extra resources
      NLP Progress
      Hugging Face – Models
      "Embeddings in Natural Language Processing" (book)
      "Dive into Deep Learning" (interactive book)
  53. Thank you! Questions?