
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Chernodub


Speaker: Artem Chernodub, Chief Scientist at Clikque Technology and Associate Professor at Ukrainian Catholic University

Summary: Sequence Tagging is an important NLP problem that has several applications, including Named Entity Recognition, Part-of-Speech Tagging, and Argument Component Detection. In our talk, we will focus on a BiLSTM+CNN+CRF model — one of the most popular and efficient neural network-based models for tagging. We will discuss task decomposition for this model, explore the internal design of its components, and present an ablation study of them on the well-known CoNLL-2003 NER shared task dataset.


  1. 1. Short BIO: Kyiv Natural Sciences Lyceum #145; Moscow Institute of Physics and Technology (B.Sc., M.Sc.), 2007; Institute of Mathematical Machines and Systems NASU, Kyiv (PhD), 2016; Ukrainian Catholic University, Lviv (teaching now).
  2. 2. Annotation: “Sequence Tagging is an important NLP problem that has several applications, including Named Entity Recognition, Part-of-Speech Tagging, and Argument Component Detection. In our talk, we will focus on a BiLSTM+CNN+CRF model — one of the most popular and efficient neural network-based models for tagging. We will discuss task decomposition for this model, explore the internal design of its components, and provide the ablation study for them on the well-known CoNLL-2003 NER shared task dataset.”
  3. 3. Problem
  4. 4. Sequence tagging (sequence labeling): problem formulation. Input: a sequence of n tokens (words) x_1, x_2, …, x_n. Output: a sequence of n tags (labels) y_1, y_2, …, y_n, where y_i ∈ {1, …, K}, i = 1, …, n. [Diagram: the input tokens x_1 … x_n.]
  5. 5. Sequence tagging (sequence labeling): problem formulation. Input: a sequence of n tokens (words) x_1, x_2, …, x_n. Output: a sequence of n tags (labels) y_1, y_2, …, y_n, where y_i ∈ {1, …, K}, i = 1, …, n. [Diagram: each input token x_i is aligned with its output tag y_i.]
  6. 6. Sequence tagging: examples (NER, POS). Named Entity Recognition (NER): "There is no Starbucks in Kyiv ." with Starbucks = ORG, Kyiv = LOC. Part-Of-Speech (POS) tagging: "Donald Trump is a president of USA" tagged as NOUN NOUN VERB ARTICLE NOUN PREPOS NOUN.
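To make the input/output convention concrete, a minimal illustration (the tagger here is a placeholder; the tags are written in the IOB2 notation introduced a few slides later):

```python
# Minimal illustration of the sequence-tagging interface (the tagger is a stub).
tokens = ["There", "is", "no", "Starbucks", "in", "Kyiv", "."]
gold   = ["O",     "O",  "O",  "B-ORG",     "O",  "B-LOC", "O"]

def tag(tokens):
    """Placeholder tagger: returns exactly one label per input token."""
    return ["O"] * len(tokens)  # real models are discussed in the following slides

predicted = tag(tokens)
assert len(predicted) == len(tokens)  # one tag per token, by definition of the task
```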
  7. 7. Motivation: Argument Mining (scientific project with the LT lab, Hamburg University, 2017 - ). Goals: detection and evaluation of arguments in natural texts; developing automatic systems for making judgments, supporting decision making, finding contradictions in natural text, document summarization, analysis of scientific papers, writing assistance, essay scoring. Domains: law, decision making, philosophy.
  8. 8. Argument Component Detection. List of argument components: "THE" -- this token is part of the thesis of the argument (claim); "PRO" -- this token is part of a statement that supports the thesis (premise); "CON" -- this token is part of a statement that supports the opposite of the thesis. Input: Let us discuss which technology to use for our new project. In my opinion, Python is a good choice for scientific programming, because it is open source and has a rich collection of libraries, such as NumPy. Output: LetO usO discussO whichO technologyO toO useO forO ourO newO projectO. InO myO opinionO, PythonB-THE isI-THE aI-THE goodI-THE choiceI-THE forI-THE scientificI-THE programmingI-THE, becauseI itB-PRO isI-PRO openI-PRO sourceI-PRO andI-PRO hasI-PRO aI-PRO richI-PRO collectionI-PRO ofI-PRO librariesI-PRO, suchI-PRO asI-PRO NumPyI-PRO.
  9. 9. PyTorch tutorial, toy example -> make it accurate and fast 12 Source: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
  10. 10. PyTorch tutorial, toy example -> make it accurate and fast 13 Source: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
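For reference, a condensed version of the kind of toy tagger that tutorial builds (word embeddings, then an LSTM, then a linear layer with log-softmax); class name and dimensions are illustrative, not the tutorial's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMTagger(nn.Module):
    """Toy word-level LSTM tagger, in the spirit of the PyTorch tutorial above."""
    def __init__(self, vocab_size, tagset_size, emb_dim=6, hidden_dim=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim)           # expects (seq_len, batch, emb_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, token_ids):                          # token_ids: (seq_len,)
        emb = self.embedding(token_ids).unsqueeze(1)       # (seq_len, 1, emb_dim)
        lstm_out, _ = self.lstm(emb)                       # (seq_len, 1, hidden_dim)
        logits = self.hidden2tag(lstm_out.squeeze(1))      # (seq_len, tagset_size)
        return F.log_softmax(logits, dim=1)

model = LSTMTagger(vocab_size=10, tagset_size=3)
scores = model(torch.tensor([0, 1, 2, 3]))                 # one row of tag scores per token
```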
  11. 11. CoNLL-2003 NER shared task (English). Four types of named entities: PERSON, LOCATION, ORGANIZATION, MISC. TRAIN / DEV / TEST, sentences: 38,219 / 5,527 / 5,462. TRAIN / DEV / TEST, tokens: 912,344 / 131,768 / 129,654. Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4 (pp. 142-147). Association for Computational Linguistics.
  12. 12. Accuracy
  13. 13. Technical notes. Example: IOB2 tagging scheme (CoNLL format). B-PER: beginning of a new named entity (person); I-PER: inside the named entity (person); O: outside of any named entity. Example: "Italy recalled Marcello Cuttitta ." is tagged B-LOC O B-PER I-PER O; "Friendly against Scotland at Murray" is tagged O O B-LOC O B-LOC.
  14. 14. Various IOB-like tagging schemes (source: http://cs229.stanford.edu/proj2005/KrishnanGanapathy-NamedEntityRecognition.pdf). IOB1: I is a token inside a chunk, O is a token outside a chunk, and B is the beginning of a chunk immediately following another chunk. IOB2: the same as IOB1, except that a B tag is given for every token at the beginning of a chunk. IOE1: an E tag marks the last token of a chunk immediately preceding another chunk of the same named entity. IOE2: the same as IOE1, except that an E tag is given for every token at the end of a chunk. IOBES: tags B, E, I, S, O, where S represents a chunk containing a single token; chunks of length two or more always start with B and end with E. IO: only the I and O labels are used, so adjacent chunks of the same named entity cannot be distinguished.
  15. 15. Various IOB-like tagging schemes (same table as the previous slide). Source: http://cs229.stanford.edu/proj2005/KrishnanGanapathy-NamedEntityRecognition.pdf
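As a concrete example of how these schemes relate, a small sketch converting IOB2 tags to IOBES (assuming the input is valid IOB2):

```python
def iob2_to_iobes(tags):
    """Convert an IOB2 tag sequence to IOBES (S = single-token chunk, E = chunk end).
    A sketch of the standard conversion; assumes the input is well-formed IOB2."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, entity = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{entity}"            # does the chunk go on after this token?
        if prefix == "B":
            out.append(f"B-{entity}" if continues else f"S-{entity}")
        else:  # prefix == "I"
            out.append(f"I-{entity}" if continues else f"E-{entity}")
    return out

print(iob2_to_iobes(["B-LOC", "O", "B-PER", "I-PER", "O"]))
# ['S-LOC', 'O', 'B-PER', 'E-PER', 'O']
```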
  16. 16. What tagging scheme is the best for the NER-2003 shared task (English)? [Ma & Hovy, 2016], [Lample et al., 2016], [Reimers & Gurevych, 2017]
  17. 17. What tagging scheme is the best for the NER-2003 shared task (English)? [Ma & Hovy, 2016]: IOBES is better; [Lample et al., 2016]: the same; [Reimers & Gurevych, 2017]: IOB2 is better.
  18. 18. F1-score definition (source: http://bazhenov.me/blog/2012/07/21/classification-performance-evaluation.html). The F_beta-score is a weighted harmonic mean of Precision and Recall: F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall). 0 < beta < 1: more priority to Precision; beta > 1: more priority to Recall; beta = 1: the balanced F-score, also known as "F1". Averaging options: micro-, macro- and binary.
  19. 19. Precision & Recall definition (source: https://en.wikipedia.org/wiki/F1_score). Precision = TP / (TP + FP); Recall = TP / (TP + FN).
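A direct transcription of these formulas into code, with toy counts:

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F_beta from true positives, false positives and false negatives,
    following the definitions on the two slides above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_beta(tp=8, fp=2, fn=4))            # F1 ~= 0.727 (precision 0.8, recall 0.667)
print(f_beta(tp=8, fp=2, fn=4, beta=0.5))  # beta < 1 weights precision more
```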
  20. 20. F1-score for sequence tagging. Input: "Italy recalled Marcello Cuttitta ." Output: B-LOC O B-PER I-PER O, i.e. a LOC entity and a PER entity, both matching the gold spans with 100% intersection. sklearn's out-of-the-box solution (from sklearn.metrics import f1_score) is not an option because it works on the token level; "fuzzy" intersections (e.g. 50%) can also be used.
  21. 21. F1-score for sequence tagging. Input: "Italy recalled Marcello Cuttitta ." Output: B-LOC O B-PER I-PER O; predicted entities are compared against the gold LOC and PER spans, with exact (100%) matches as well as partial (e.g. 50%) overlaps. sklearn's out-of-the-box f1_score is not an option; "fuzzy" intersections (e.g. 50%) can also be used.
  22. 22. F1-score for the CoNLL-2003 NER shared task, the ultimate solution: the original Perl evaluation script (conlleval) for the NER-2003 shared task by Erik Tjong Kim Sang can be used; IOB2 tagging scheme.
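The chunk-level evaluation can also be approximated in a few lines of Python. This is a hedged sketch (the official conlleval script remains the reference; third-party packages such as seqeval aim to reimplement the same logic):

```python
def iob2_spans(tags):
    """Extract (entity_type, start, end) chunks from an IOB2 tag sequence."""
    spans, start, ent = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):          # sentinel flushes the last chunk
        if tag.startswith("B-") or tag == "O" or (ent and tag != f"I-{ent}"):
            if ent is not None:
                spans.append((ent, start, i))
            ent, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
    return set(spans)

def entity_f1(gold_tags, pred_tags):
    """Chunk-level F1: a predicted entity counts only if its type and boundaries match exactly."""
    gold, pred = iob2_spans(gold_tags), iob2_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = ["B-LOC", "O", "B-PER", "I-PER", "O"]
pred = ["B-LOC", "O", "B-PER", "O",     "O"]   # "Cuttitta" missed, so the partial PER does not count
print(entity_f1(gold, pred))                   # 0.5
```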
  23. 23. Model
  24. 24. End-to-end trainable taggers. [Diagram: each input token (Italy, recalled, Marcello) is fed to the tagger model, which outputs the corresponding tag (<B-LOC>, <O>, <B-PER>).]
  25. 25. Tagger models: BiLSTM + CNN + CRF [Lample et al., 2016], [Ma & Hovy, 2016]. [Diagram: char-level features and word-level embeddings feed a BiLSTM, whose outputs feed a CRF layer.]
  26. 26. Tagger models: BiLSTM + CNN + CRF [Lample et al., 2016], [Ma & Hovy, 2016]. [Diagram: the same stack (char-level features + word-level embeddings, then BiLSTM, then CRF) applied to each token of "Italy recalled Marcello", producing <B-LOC> <O> <B-PER>.]
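Before drilling into each component, here is a compact sketch of how they fit together. It is an illustrative skeleton with made-up dimensions, not the exact reference implementation, and it stops at the per-token emission scores U that the CRF layer (sketched later) consumes:

```python
import torch
import torch.nn as nn

class BiLSTMCNNTagger(nn.Module):
    """Skeleton of the BiLSTM+CNN(+CRF) tagger: char-CNN features are concatenated with
    word embeddings, fed to a BiLSTM, and projected to per-tag emission scores U."""
    def __init__(self, word_vocab, char_vocab, tagset_size,
                 word_dim=100, char_dim=25, char_filters=30, hidden_dim=100):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(word_dim + char_filters, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, word_len)
        b, s, w = char_ids.shape
        chars = self.char_emb(char_ids).view(b * s, w, -1).transpose(1, 2)   # (b*s, char_dim, w)
        char_feat = self.char_cnn(chars).max(dim=2).values.view(b, s, -1)    # max-pool over chars
        x = torch.cat([self.word_emb(word_ids), char_feat], dim=-1)          # word + char features
        h, _ = self.bilstm(x)                                                # (batch, seq_len, 2*hidden)
        return self.emissions(h)                                             # emission scores U

model = BiLSTMCNNTagger(word_vocab=1000, char_vocab=80, tagset_size=9)
U = model(torch.randint(0, 1000, (2, 6)), torch.randint(0, 80, (2, 6, 12)))  # (2, 6, 9)
```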
  27. 27. Word Embeddings. [Architecture diagram: the word-level embeddings highlighted alongside char-level features, BiLSTM, and CRF.]
  28. 28. word2vec vector properties 31 Vectors for King, Man, Queen, & Woman: The result of the vector composition King - Man + Woman = ? Source: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors
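Assuming the pretrained GoogleNews word2vec vectors listed later in the deck, the analogy can be reproduced with gensim roughly as follows (the file path is a placeholder):

```python
from gensim.models import KeyedVectors

# Placeholder path to the GoogleNews vectors (the 3M-token, 300-dim embeddings listed later).
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# King - Man + Woman = ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Typically 'queen' comes out as the top hit.
```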
  29. 29. Semantic similarity 32
  30. 30. word2vec as an MLP. [Diagram: one-hot input vectors x_cat and x_on (V-dim) are projected through a shared weight matrix W of size V×N into an N-dim hidden layer; a matrix W' maps the hidden layer to the V-dim output layer, predicting "sat".] The weight matrices contain the word vectors: we can consider either W or W' as the word's representation, or even take the average.
  31. 31. Features used to train the taggers for AM. Stab, C., & Gurevych, I. (2017). Parsing Argumentation Structures in Persuasive Essays. Computational Linguistics, 43(3), 619-659.
  32. 32. Embeddings vs hand-crafted features in AM: in-domain scenario and cross-domain scenario. Stab, C., & Gurevych, I. (2017). Parsing Argumentation Structures in Persuasive Essays. Computational Linguistics, 43(3), 619-659.
  33. 33. Word embeddings under consideration:
Name | # of tokens | Dim | Size (GB)
word2vec (GoogleNews) | 3,000,000 | 300 | 3.8
GloVe | 400,000 | 100 | 0.34
fastText | 2,519,370 | 300 | 6.6
  34. 34. Experiment: extending the vocabulary for GloVe embeddings. Strategy #1: use only embeddings with the original capitalization (14,618 embedding vectors found). Strategy #2: lowercase words both in the dataset and in the embeddings (22,947). Strategy #3: use the embedding for the word with the original capitalization; if not found, use the embedding for the lowercased word (26,340).
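A sketch of what the three lookup strategies amount to in code (the dictionary here is a tiny stand-in for the real GloVe vocabulary):

```python
def lookup(word, emb):
    """Strategy #3 from the slide: try the original capitalization first,
    then fall back to the lowercased form. `emb` is any dict-like word -> vector map."""
    if word in emb:
        return emb[word]
    if word.lower() in emb:
        return emb[word.lower()]
    return None  # out-of-vocabulary; in practice a random or zero vector is used

# Strategy #1 would be `emb.get(word)` only; strategy #2 lowercases both the
# dataset tokens and the embedding keys before any lookup.
emb = {"Kyiv": [0.1, 0.2], "starbucks": [0.3, 0.4]}
print(lookup("Starbucks", emb))  # found via the lowercased fallback
```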
  35. 35. Experiment: extending the vocabulary for GloVe embeddings, BiLSTM+CNN+CRF. Strategy #1: f1 = 72.21; Strategy #2: f1 = 75.58; Strategy #3: f1 = 90.75.
  36. 36. BiLSTM. [Architecture diagram: the BiLSTM block highlighted between the word-level embeddings / char-level features and the CRF layer.]
  37. 37. Recurrent Neural Network (unrolled). [Diagram: a recurrent cell RNN(W) with input x_t and state h_t.] x_t is the network's input at time t; h_t is the network's state at time t.
  38. 38. Recurrent Neural Network (unrolled). [Diagram: the recurrent cell RNN(W) unrolled over x_1, x_2, x_3, producing h_1, h_2, h_3.] An unrolled RNN is a feed-forward deep neural network with shared weights (the principle behind training with Backpropagation Through Time). http://karpathy.github.io/2015/05/21/rnn-effectiveness/ http://colah.github.io/posts/2015-08-Understanding-LSTMs
  39. 39. Modes of RNN operation (image copyright © Andrej Karpathy blog, 2015): "vanilla" mode; image captioning ("image" -> "sequence of words"); sentiment classification ("sequence of words" -> "sentiment"); machine translation ("sequence of words" -> "sequence of words"); sequence tagging ("sequence of words" -> "sequence of words").
  40. 40. Bi-Recurrent Neural Network. [Diagram: a forward RNN and a backward RNN process x_1, x_2, x_3; their states are concatenated.] h_t = [h_t(forward); h_t(backward)].
  41. 41. Vanilla RNN (source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/). State transition equation: h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b).
  42. 42. Long Short-Term Memory (LSTM) RNN (source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/). State transition equations: f_t = σ(W_f · [h_{t-1}, x_t] + b_f); i_t = σ(W_i · [h_{t-1}, x_t] + b_i); C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C); C_t = f_t * C_{t-1} + i_t * C̃_t; o_t = σ(W_o · [h_{t-1}, x_t] + b_o); h_t = o_t * tanh(C_t).
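In PyTorch both directions of the BiLSTM come from a single bidirectional nn.LSTM; a minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

# Bidirectional LSTM: each h_t is the concatenation of the forward and backward states,
# so the output feature size doubles (2 * hidden_size). Dimensions here are illustrative.
bilstm = nn.LSTM(input_size=100, hidden_size=128, batch_first=True, bidirectional=True)

x = torch.randn(4, 20, 100)        # (batch, seq_len, input_size)
h, _ = bilstm(x)
print(h.shape)                     # torch.Size([4, 20, 256]) -> [h_fwd ; h_bwd] per token
```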
  43. 43. Comparison: RNN types 46 BiLSTM, f1 = 90.88 BiGRU, f1 = 89.89 BiVanillaRNN, f1 = 88.91
  44. 44. Char CNN. [Architecture diagram: the char-level feature extractor (char CNN) highlighted alongside word-level embeddings, BiLSTM, and CRF.]
  45. 45. Convolution function 1D. Convolution 1D: s(t) = (x ∗ w)(t) = Σ_a x(a) · w(t − a), where x is the input and w is the filter. Example: input x = [1, 2, 3, 4, 5, 6, 7], filter w = [3, 4, 5, 4, 3]; at the first position the elementwise multiplication gives [3, 8, 15, 16, 15], so the output (sum) is 57. Key properties: linear operation; adds invariance to shift; parallelizes well.
  46. 46. Convolution function 1D. Convolution 1D: s(t) = (x ∗ w)(t) = Σ_a x(a) · w(t − a), where x is the input and w is the filter. Example: input x = [1, 2, 3, 4, 5, 6, 7], filter w = [3, 4, 5, 4, 3]; at the second position the elementwise multiplication gives [6, 12, 20, 20, 18], so the output (sum) is 76. Key properties: linear operation; adds invariance to shift; parallelizes well.
  47. 47. Convolution function 1D. Convolution 1D: s(t) = (x ∗ w)(t) = Σ_a x(a) · w(t − a), where x is the input and w is the filter. Example: input x = [1, 2, 3, 4, 5, 6, 7], filter w = [3, 4, 5, 4, 3]; at the third position the elementwise multiplication gives [9, 16, 25, 24, 21], so the output (sum) is 95. Key properties: linear operation; adds invariance to shift; parallelizes well.
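The sums on these three slides are easy to verify; note that the sliding dot product computed here (and by deep-learning "convolution" layers) is technically cross-correlation, i.e. the filter is not flipped:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
w = np.array([3, 4, 5, 4, 3])

# Sliding dot product (cross-correlation), matching the per-position sums above.
print(np.correlate(x, w, mode="valid"))   # [57 76 95]

# np.convolve flips the filter first; since this w is symmetric, the result is the same here.
print(np.convolve(x, w, mode="valid"))    # [57 76 95]
```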
  48. 48. CNN character-level representations. [Diagram: the characters of "Marcello" with padding on both sides; each character is mapped to a char embedding, here of dim = 4.]
  49. 49. CNN character-level representations. [Diagram: char embeddings of "Marcello" (with padding); Convolution 1-D with 30 filters, window size = 3; the red dashed line denotes dropout.]
  50. 50. CNN character-level representations. [Diagram: char embeddings of "Marcello" (with padding); Convolution 1-D with 30 filters, window size = 3; the red dashed line denotes dropout.]
  51. 51. CNN character-level representations. [Diagram: char embeddings of "Marcello" (with padding); Convolution 1-D with 30 filters, window size = 3; the red dashed line denotes dropout.]
  52. 52. CNN character-level representations. [Diagram: char embeddings of "Marcello" (with padding); Convolution 1-D with 30 filters, window size = 3; the red dashed line denotes dropout.]
  53. 53. CNN character-level representations. [Diagram: char embeddings of "Marcello" (with padding), then Convolution 1-D (30 filters, window size = 3; the red dashed line denotes dropout), then max-pooling, giving the char-level representation of the word.]
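Putting the char-level pipeline together for a single word, a minimal sketch (character ids and dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Char-level representation of one word ("Marcello"), as in the slides:
# char embeddings (dim 4 here) -> 1-D convolution (30 filters, window 3) -> max-pooling.
char_vocab_size, char_dim, n_filters, window = 80, 4, 30, 3

char_emb = nn.Embedding(char_vocab_size, char_dim)
conv = nn.Conv1d(char_dim, n_filters, kernel_size=window, padding=1)  # padding ~ the <PAD> cells
dropout = nn.Dropout(0.5)                                             # the "red dashed line"

char_ids = torch.randint(0, char_vocab_size, (1, 8))   # 8 characters of "Marcello" (ids are dummy)
e = dropout(char_emb(char_ids)).transpose(1, 2)        # (1, char_dim, 8)
features = conv(e)                                     # (1, 30, 8)
word_char_repr = features.max(dim=2).values            # (1, 30): char-level word representation
print(word_char_repr.shape)
```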
  54. 54. Experiment: BiLSTM+CRF (no CNN) 57 GloVe, f1 = 87.23 fasttext, f1 = 86.85 word2vec, f1 = 82.14
  55. 55. Conditional Random Fields. [Architecture diagram: the CRF layer highlighted on top of the BiLSTM outputs.]
  56. 56. Empirical tag transition table from the CoNLL-2003 NER dataset (IOB2 scheme).
  57. 57. Conditional Random Fields (CRF). Let X = (x_1, …, x_n) be the input sequence and y = (y_1, …, y_n) the sequence of predicted tags, where n is the length of the input sequence. A matrix of transition scores T: T_{k,j} is the score of a transition from tag k to tag j. A matrix of BiLSTM outputs U: U_{k,j} is the score of tag j for the k-th token. The score of one sequence is defined as: s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} U_{i, y_i}.
  58. 58. Conditional Random Fields (CRF). Let X = (x_1, …, x_n) be the input sequence and y = (y_1, …, y_n) the sequence of predicted tags, where n is the length of the input sequence. A matrix of transition scores T: T_{k,j} is the score of a transition from tag k to tag j. A matrix of BiLSTM outputs U: U_{k,j} is the score of tag j for the k-th token. The score of one sequence is defined as: s(X, y) = Σ_{i=0}^{n} T_{y_i, y_{i+1}} + Σ_{i=1}^{n} U_{i, y_i}, where T is a tunable parameter ("weights") of the CRF.
  59. 59. Conditional Random Fields (CRF). The probabilistic model for a sequence CRF defines a family of conditional probabilities over all possible tag sequences Y(X) for the input X: p(y | X) = exp(s(X, y)) / Σ_{y' ∈ Y(X)} exp(s(X, y')). Taking the logarithm: log p(y | X) = s(X, y) − log Σ_{y' ∈ Y(X)} exp(s(X, y')) → max. For inference we search for the tag sequence with the highest conditional probability (Viterbi): y* = argmax_{y ∈ Y(X)} p(y | X).
  60. 60. Conditional Random Fields (CRF). The probabilistic model for a sequence CRF defines a family of conditional probabilities over all possible tag sequences Y(X) for the input X: p(y | X) = exp(s(X, y)) / Σ_{y' ∈ Y(X)} exp(s(X, y')). Taking the logarithm: log p(y | X) = s(X, y) − log Σ_{y' ∈ Y(X)} exp(s(X, y')) → max. For inference we search for the tag sequence with the highest conditional probability (Viterbi): y* = argmax_{y ∈ Y(X)} p(y | X). Here Y(X) denotes all possible tag sequences, and the exponentials are always positive.
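The score function and Viterbi inference above translate almost directly into code. A minimal, unbatched sketch: it handles only a start transition (reference implementations also add a stop transition), and training additionally needs the log-sum-exp over all tag sequences from the denominator above, computed with a similar forward recursion:

```python
import torch

def sequence_score(U, y, T, start_tag):
    """s(X, y) = sum_i T[y_i, y_{i+1}] + sum_i U[i, y_i], in the slide's notation.
    U: (n, K) emission scores, y: list of n tag indices, T: (K, K) transition scores."""
    score = T[start_tag, y[0]] + U[0, y[0]]
    for i in range(1, len(y)):
        score = score + T[y[i - 1], y[i]] + U[i, y[i]]
    return score

def viterbi_decode(U, T, start_tag):
    """argmax_y s(X, y) by dynamic programming (toy, unbatched sketch)."""
    n, K = U.shape
    score = T[start_tag] + U[0]                  # (K,): best score ending in each tag so far
    backptr = []
    for i in range(1, n):
        cand = score.unsqueeze(1) + T + U[i]     # (K, K): previous tag -> next tag
        score, idx = cand.max(dim=0)
        backptr.append(idx)
    best = [int(score.argmax())]
    for idx in reversed(backptr):
        best.append(int(idx[best[-1]]))
    return list(reversed(best))

U = torch.randn(5, 4)                            # 5 tokens, 4 tags (toy numbers)
T = torch.randn(4, 4)
y = viterbi_decode(U, T, start_tag=0)
print(y, float(sequence_score(U, y, T, start_tag=0)))
```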
  61. 61. Experiment: BiLSTM+CNN+CRF [Ma & Hovy, 2016]. GloVe, f1 = 90.88; fasttext, f1 = 88.32; word2vec, f1 = 86.35.
  62. 62. Speed
  63. 63. How to make it fast? Use standard functions as much as possible; code vectorization = train/predict on batches.
  64. 64. How to make it fast? Use standard functions as much as possible; code vectorization = train/predict on batches.
  65. 65. Problem with training on batches: sequences have different lengths, so we add pads. [Diagram: a batch of sentences of different lengths, e.g. "Friendly against Scotland at Murray .", "Nadim Ladki", "AL-AIN United Arab Emirates", "ROME 1996-12", "Two goals in the last minutes".]
  66. 66. Problem with training on batches: sequences have different lengths, so we add zero <PAD>s. [Diagram: the same batch with every sentence padded with <PAD> tokens to the maximum length.]
  67. 67. Straightforward approach: adding a binary mask. [Diagram: the padded batch plus a binary mask marking real tokens vs. <PAD> positions, used when computing the loss.]
  68. 68. No masks, PyTorch, approach #1: adding ignore_index=pad_idx to the loss.
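A minimal sketch of approach #1 (the pad index and the toy tensors are illustrative):

```python
import torch
import torch.nn as nn

pad_idx = 0                                    # index of the <PAD> tag (illustrative)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

logits = torch.randn(8, 5)                     # 8 token positions, 5 tags
tags = torch.tensor([3, 1, 2, 0, 0, 4, 0, 0])  # 0 = <PAD>, silently skipped by the loss
loss = criterion(logits, tags)
```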
  69. 69. No masks, PyTorch, approach #2: PackedSequence for RNNs. 1. Make a PackedSequence of your RNN input (sorting sequences by length is required). 2. Run RNN(PackedSequence). 3. Convert the output from PackedSequence back to a torch.Tensor.
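A minimal sketch of approach #2; note that newer PyTorch versions can sort internally via enforce_sorted=False, so manual sorting by length is no longer required:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

rnn = nn.LSTM(input_size=10, hidden_size=16, batch_first=True, bidirectional=True)

seqs = [torch.randn(6, 10), torch.randn(2, 10), torch.randn(4, 10)]    # variable lengths
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)                          # (3, 6, 10) with zero pads
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
out_packed, _ = rnn(packed)                                            # the RNN skips the pads
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)   # back to a padded Tensor
print(out.shape)                                                       # (3, 6, 32)
```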
  70. 70. No masks, PyTorch, approach #3: PackedSequence for the whole pipeline (source: https://medium.com/@florijan.stamenkovic_99541/rnn-language-modelling-with-pytorch-packed-batching-and-tied-weights-9d8952db35a9). 1. Make a PackedSequence of your sentences (word tokens). 2. Convert the PackedSequence.data member into embedded vectors. 3. Construct a new PackedSequence from the result and the old one's sequence lengths (this is not officially supported by the API). 4. Pass the resulting PackedSequence into a recurrent layer of the net.
  71. 71. Experiment: training time for one epoch for different batch sizes (seconds). [Bar chart: CPU vs GPU (GTX960M) for batch_size = 1, 2, 5, 10; y-axis from 0 to 1200 seconds.]
  72. 72. Final accuracy evaluation on the CoNLL-2003 NER shared task (English). Source: https://nlpprogress.com/named_entity_recognition.html
Tagger model | micro-f1 score on test
BiLSTM + CNN + CRF, GloVe emb. (ours) | 90.88
BiLSTM + CNN + CRF, GloVe emb. (Lample et al., 2016) | 90.94
BiLSTM + CNN + CRF, GloVe emb. (Ma & Hovy, 2016) | 91.21
BiLSTM + CRF + ELMo emb. (Peters et al., 2018) | 92.22
Flair emb. (Akbik et al., 2018) | 93.09
  73. 73. Future plans: ELMo embeddings & ULMFiT. Variational dropout; CRF decoding: beam search, other heuristics; + ELMo embeddings [Peters et al., 2018], Flair embeddings [Akbik et al., 2018]; + ULMFiT, fast.ai. http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html
  74. 74. Sources (PyTorch) https://github.com/achernodub/bilstm-cnn-crf-tagger 77
  75. 75. Alternative neural taggers, links • NeuroNER (Tensorflow) https://github.com/Franck-Dernoncourt/NeuroNER • LM-LSTM-CRF (Pytorch) https://github.com/LiyuanLucasLiu/LM-LSTM-CRF • LD-Net (Pytorch) https://github.com/LiyuanLucasLiu/LD-Net • Reimers & Gurevych, 2017 (Tensorflow & Keras) https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf • Reimers & Gurevych + ELMo, 2017 (Tensorflow & Keras) https://github.com/UKPLab/elmo-bilstm-cnn-crf • Eger et al., 2017 (diff) https://github.com/UKPLab/acl2017-neural_end2end_am
  76. 76. References
1. [Tjong Kim Sang & De Meulder, 2003] Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4 (pp. 142-147). Association for Computational Linguistics.
2. [Collobert et al., 2011] Collobert, R., et al. (2011). Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, 12(Aug), 2493-2537.
3. [Ma & Hovy, 2016] Ma, X., & Hovy, E. (2016). End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.
4. [Lample et al., 2016] Lample, G., et al. (2016). Neural Architectures for Named Entity Recognition. arXiv preprint arXiv:1603.01360.
5. [Eger et al., 2017] Eger, S., Daxenberger, J., & Gurevych, I. (2017). Neural End-to-end Learning for Computational Argumentation Mining. arXiv preprint arXiv:1704.06104.
6. [Reimers & Gurevych, 2017] Reimers, N., & Gurevych, I. (2017). Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. arXiv preprint arXiv:1707.09861.
7. [Peters et al., 2018] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
8. [Akbik et al., 2018] Akbik, A., Blythe, D., & Vollgraf, R. (2018). Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 1638-1649).
  77. 77. Thank you! chernodub@ucu.edu.ua
