NLP State of the Art | BERT

BERT: Bidirectional Encoder Representations from Transformers.
BERT is a pretrained model by Google for state-of-the-art NLP tasks.
BERT can take into account both the syntactic and the semantic meaning of text.

  1. BERT: Bidirectional Encoder Representations from Transformers. By: Shaurya Uppal
  2. Defining Language
     Language is divided into three parts:
     ● Syntax: word ordering and sentence form.
     ● Semantics: the meaning of words.
     ● Pragmatics: the social language skills that we use in our daily interactions with others.
  3. Example of Syntax, Semantics, Pragmatics
     + "This discussion is about BERT." Syntax, semantics and pragmatics all hold (SSP).
     + "The green frogs sleep soundly." Syntax and semantics hold (SS).
     + "BERT play football good." None of the three hold.
  4. Why study BERT?
     BERT achieves state-of-the-art performance on many Natural Language Processing tasks, such as text classification, text similarity, next sentence prediction, question answering, auto-summarization and named entity recognition.
     What is BERT exactly?
     BERT is a model pretrained by Google that learns deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
  5. Dataset used to pre-train BERT
     + BooksCorpus (800M words)
     + English Wikipedia (2,500M words)
     A pretrained model can be applied through a feature-based approach or through fine-tuning.
     + In fine-tuning, all weights change.
     + In the feature-based approach (used by ELMo), only the final layer weights change.
     BERT is pretrained once and then fine-tuned on different NLP tasks.
     Pretraining and fine-tuning: you train a model m on data A, then continue training m on data B from that checkpoint (see slide 17).
  6. Language Training Approach
     There are two approaches to training a language representation:
     Context-free
     + Traditionally we used word2vec or GloVe embeddings, where each word has a single vector regardless of its context.
     Contextual
     + RNN
     + BERT
  7. How does BERT work?
     BERT weights are learned in advance through two unsupervised tasks: masked language modeling (predicting a missing word given the left and right context) and next sentence prediction (predicting whether one sentence follows another).
     BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text (paper: "Attention Is All You Need").
     BERT uses multi-headed attention: multiple layers of attention, each with multiple attention “heads” (12 in BERT-base, 16 in BERT-large). Since model weights are not shared between layers, BERT-base effectively has 12 x 12 = 144 different attention mechanisms, and BERT-large has 24 x 16 = 384.
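To make the attention step on this slide concrete, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions, not BERT's implementation: the toy vectors are made up, and the same vectors are reused as queries, keys and values, whereas a real Transformer layer first applies learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors.

    Returns (outputs, weights); weights[i][j] says how much token i
    attends to token j.
    """
    dim = len(keys[0])
    weights = []
    for q in queries:
        scores = [
            sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights.append(softmax(scores))
    width = len(values[0])
    outputs = [
        [sum(w * v[c] for w, v in zip(row, values)) for c in range(width)]
        for row in weights
    ]
    return outputs, weights


# Toy embeddings for three tokens (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)

# Number of distinct attention heads: layers x heads per layer.
heads_in_base = 12 * 12
heads_in_large = 24 * 16
```

Each row of `weights` is a probability distribution over the input tokens, which is what the attention visualizations later in the deck display.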
  8. What does BERT learn, how does it tokenize, and how does it handle OOV words?
     Consider the input example: I went to the store. At the store, I bought fresh strawberries.
     BERT uses a WordPiece tokenizer, which breaks an OOV (out-of-vocabulary) word into segments. For example, if play, ##ing and ##ed are present in the vocabulary but playing and played are OOV, they are broken down into play + ##ing and play + ##ed respectively (## marks a sub-word that continues the previous piece).
     BERT also requires a special [CLS] classifier token at the beginning of the input and a [SEP] token at the end of each sequence:
     [CLS] I went to the store. [SEP] At the store I bought fresh straw ##berries. [SEP]
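The sub-word splitting described on this slide can be sketched as a greedy longest-match-first search over the vocabulary. This is a simplified sketch: the function name, the toy vocabulary and the `[UNK]` fallback for words that cannot be split are assumptions made for illustration, and a real tokenizer also handles casing, punctuation and a vocabulary of tens of thousands of pieces.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece segments by greedy longest match.

    Pieces after the first are looked up with a "##" prefix. If some
    remainder of the word cannot be matched, the whole word becomes unk_token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary built from the slide's examples.
toy_vocab = {"play", "##ing", "##ed", "straw", "##berries", "store"}

playing = wordpiece_tokenize("playing", toy_vocab)
played = wordpiece_tokenize("played", toy_vocab)
strawberries = wordpiece_tokenize("strawberries", toy_vocab)
```

With this toy vocabulary, "strawberries" comes out as straw + ##berries, matching the tokenized example on the slide.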
  9. Attention
     An attention probe is a task for a pair of tokens (token_i, token_j) where the input is a model-wide attention vector, formed by concatenating the entries a_ij of every attention matrix from every attention head in every layer.
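The construction of the model-wide attention vector can be sketched as follows. The nested-list layout `attention[layer][head][i][j]` and the helper `uniform_attention` are assumptions made for illustration; the uniform weights stand in for the attention matrices a real model would produce.

```python
def model_wide_attention_vector(attention, i, j):
    """Concatenate the entry a_ij from every head in every layer.

    `attention[layer][head]` is a square matrix (list of lists) whose
    entry [i][j] is how much token i attends to token j.
    """
    return [head[i][j] for layer in attention for head in layer]


def uniform_attention(n_layers, n_heads, n_tokens):
    """Hypothetical stand-in for real attention: every weight is 1/n_tokens."""
    weight = 1.0 / n_tokens
    return [
        [[[weight] * n_tokens for _ in range(n_tokens)] for _ in range(n_heads)]
        for _ in range(n_layers)
    ]


# BERT-base shape: 12 layers x 12 heads gives a 144-dimensional vector per pair.
toy = uniform_attention(n_layers=12, n_heads=12, n_tokens=5)
vector = model_wide_attention_vector(toy, i=1, j=3)
```

A probe is then an ordinary classifier trained on such vectors, one per ordered token pair.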
  10. Some visual attention patterns, and why do we use an attention mechanism?
      Reason for attention: attention helps in the two main pretraining tasks of BERT, MLM (Masked Language Model) and NSP (Next Sentence Prediction).
  11. Visual patterns from the attention mechanism
      ● Attention to the next word [Layer 2, Head 0]; behaves like a backward RNN.
      ● Attention to the previous word [Layer 0, Head 2]; behaves like a forward RNN.
      ● Attention to identical or related words.
      ● Attention to identical words in the other sentence; helps in the next sentence prediction task.
      ● Attention to other words that are predictive of the word.
      ● Attention to the delimiter tokens [CLS] and [SEP].
      ● Attention to a bag of words.
  12. How is input fed to BERT?
  13. MLM: Masked Language Model
      Input: My dog is hairy.
      Masking is done randomly: 15% of all WordPiece tokens in each input sequence are selected. Only the selected tokens are predicted, not the entire input sequence.
      Procedure for each selected token:
      + 80% of the time: replace the word with [MASK]. My dog is [MASK].
      + 10% of the time: replace the word with a random word. My dog is apple.
      + 10% of the time: keep the word unchanged. My dog is hairy.
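The 80/10/10 procedure on this slide can be sketched in a few lines of Python. This is a simplified illustration: the function name and the toy replacement vocabulary are hypothetical, and each token is selected independently with probability 15%, whereas a production pipeline would typically select a fixed number of positions per sequence.

```python
import random

SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so the loss is computed only on the selected tokens.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for idx, token in enumerate(tokens):
        if token in SPECIAL_TOKENS:
            continue  # never mask the delimiter tokens
        if rng.random() >= mask_prob:
            continue  # token not selected for prediction
        labels[idx] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[idx] = "[MASK]"  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[idx] = rng.choice(vocab)  # 10%: random word
        # remaining 10%: keep the original word
    return corrupted, labels


# Hypothetical toy vocabulary for random replacements.
toy_vocab = ["apple", "cat", "runs", "blue"]
sentence = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
corrupted, labels = mask_tokens(sentence, toy_vocab, random.Random(0))
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot rely on seeing [MASK] at every position it has to predict.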
  14. Why is MLM best?
  15. NSP: Next Sentence Prediction
      Training method: from unlabelled data we take an input sequence A. 50% of the time the sequence that actually follows A is used as B, and the remaining 50% of the time B is a randomly picked sequence.
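A minimal sketch of how such training pairs could be built from a list of sentences is shown below. The function name and the toy sentences are hypothetical, and the negatives here are drawn from the same small list purely for brevity; drawing them from a separate document would be closer to a real pretraining setup.

```python
import random


def make_nsp_pairs(sentences, rng):
    """Build (A, B, is_next) triples from sentences in document order.

    Half of the time B is the true next sentence; otherwise B is a random
    sentence that is neither A nor the true next sentence.
    Needs at least three sentences so a negative candidate always exists.
    """
    pairs = []
    for idx in range(len(sentences) - 1):
        first = sentences[idx]
        if rng.random() < 0.5:
            pairs.append((first, sentences[idx + 1], True))
        else:
            candidates = [
                s for k, s in enumerate(sentences) if k not in (idx, idx + 1)
            ]
            pairs.append((first, rng.choice(candidates), False))
    return pairs


# Toy document (illustrative sentences only).
document = [
    "I went to the store.",
    "At the store, I bought fresh strawberries.",
    "The green frogs sleep soundly.",
    "This discussion is about BERT.",
]
pairs = make_nsp_pairs(document, random.Random(0))
```

Each triple becomes one `[CLS] A [SEP] B [SEP]` input whose label is the `is_next` flag.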
  16. BERT Architecture
      BERT is a multi-layer bidirectional Transformer encoder. The paper introduces two models:
      ● BERT-base: 12 layers (Transformer blocks), 12 attention heads and 110 million parameters.
      ● BERT-large: 24 layers, 16 attention heads and 340 million parameters.
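As a sanity check on these parameter counts, the sketch below adds up the weights of a standard BERT encoder from its dimensions. Several inputs are assumptions not stated on this slide: a vocabulary of 30,522 WordPiece tokens, hidden sizes of 768 and 1024, feed-forward sizes of 3072 and 4096, 512 positions, 2 segment types, and a pooler layer on top; no task-specific head is counted.

```python
def bert_parameter_count(vocab_size, hidden, layers, intermediate,
                         max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT-style encoder."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the feed-forward block.
    feed_forward = (hidden * intermediate + intermediate) + (
        intermediate * hidden + hidden
    )
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


# Assumed dimensions for the two published model sizes.
base_params = bert_parameter_count(30522, 768, 12, 3072)
large_params = bert_parameter_count(30522, 1024, 24, 4096)
```

Under these assumptions the totals come out at roughly 109 million and 335 million, in line with the rounded figures of 110 million and 340 million quoted above.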
  17. BERT Pretraining and Fine-Tuning Architecture
  18. Illustration of how the BERT pretrained architecture remains the same and only the fine-tuning layer architecture changes for different NLP tasks.
  19. Related Work
      ELMo: a feature-based pretrained model for NLP tasks (only the final layer weights change). Difference: ELMo uses LSTMs, while BERT uses the Transformer, an attention-based model with positional encodings to represent word positions. ELMo also falls short because its left-to-right and right-to-left LSTMs are only concatenated, so its representations are not deeply bidirectional.
      OpenAI GPT: uses a left-to-right architecture where every token can only attend to previous tokens in the self-attention layers of the Transformer. It falls short because it cannot use the context to the right of a token.
  20. How does BERT outperform others?
      The paper "Visualizing and Measuring the Geometry of BERT" shows how BERT holds semantic and syntactic features of a text. The paper aims to show that the attention matrices contain grammatical representations. Turning to semantics, using visualizations of the activations created by different pieces of text, the authors show suggestive evidence that BERT distinguishes word senses at a very fine level.
      BERT’s internals consist of two parts. First, an initial embedding for each token is created by combining a pretrained WordPiece embedding with position and segment information. Next, this initial sequence of embeddings is run through multiple Transformer layers, producing a new sequence of context embeddings at each step. Implicit in each Transformer layer is a set of attention matrices, one for each attention head, each of which contains a scalar value for each ordered pair (token_i, token_j) [slide 11].
  21. Experiment for Syntax Representation
      The experiment uses the Penn Treebank corpus (3.1M dependency relations). Grammatical dependencies were extracted with the PyStanford Dependency library, and BERT-base was run on each sentence to obtain the model-wide attention vector for each token pair [slide 9]. With a 30% train/test split, the probes achieve an accuracy of 85.8% on the binary probe and 71.9% on the multiclass probe.
      Result: the attention mechanism contains syntactic features.
  22. Geometry of Word Sense (Experiment)
      On Wikipedia articles containing a query word, a nearest-neighbor classifier was applied in which each neighbor is the centroid of a given word sense’s BERT-base embeddings in the training data.
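The nearest-centroid idea on this slide can be sketched as follows. The two-dimensional vectors and sense names are made up for illustration (real BERT-base embeddings have 768 dimensions), and the use of cosine similarity as the closeness measure is an assumption of this sketch.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[c] for v in vectors) / count for c in range(len(vectors[0]))]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def classify_sense(query, sense_centroids):
    """Return the sense whose centroid is most similar to the query embedding."""
    return max(sense_centroids, key=lambda s: cosine(query, sense_centroids[s]))


# Hypothetical context embeddings of the word "bank", grouped by sense.
training = {
    "river_bank": [[1.0, 0.1], [0.9, 0.2]],
    "money_bank": [[0.1, 1.0], [0.2, 0.8]],
}
sense_centroids = {sense: centroid(vecs) for sense, vecs in training.items()}
predicted = classify_sense([0.8, 0.3], sense_centroids)
```

Because the classifier has no trainable parameters beyond the centroids, its accuracy directly reflects how well the embeddings separate word senses.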
  23. Conclusion
      BERT is undoubtedly a breakthrough in the use of machine learning for Natural Language Processing. The fact that it is approachable and allows fast fine-tuning will likely allow a wide range of practical applications in the future.
      We tested it on our SupportLen data for text classification. SupportLen has a priority column in which we manually label whether a customer email is urgent or not. On this dataset we used BERT-base-uncased.
      Model config: { "attention_probs_dropout_prob": 0.1, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 12, "type_vocab_size": 2, "vocab_size": 28996 }
  24. Some FAQs on BERT
      1. What is the maximum sequence length of the input? 512 tokens.
      2. Recommended values of the hyperparameters used in fine-tuning:
         ● Dropout: 0.1
         ● Batch size: 16, 32
         ● Learning rate (Adam): 5e-5, 3e-5, 2e-5
         ● Number of epochs: 3, 4
      3. How many layers are frozen in the fine-tuning step? No layers are frozen during fine-tuning. All the pretrained layers are trained together with the task-specific parameters.
      4. How long did Google take to pretrain BERT? 4 days with 16 TPUs.
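The fine-tuning values listed in this FAQ define a small search grid, which can be enumerated as in the sketch below. The dictionary keys and the function name are hypothetical; the actual training call for each configuration is left out.

```python
from itertools import product

# Values taken from the FAQ above.
BATCH_SIZES = (16, 32)
LEARNING_RATES = (5e-5, 3e-5, 2e-5)
EPOCHS = (3, 4)
DROPOUT = 0.1


def fine_tuning_grid():
    """Enumerate every combination of the listed hyperparameters."""
    return [
        {"batch_size": b, "learning_rate": lr, "epochs": e, "dropout": DROPOUT}
        for b, lr, e in product(BATCH_SIZES, LEARNING_RATES, EPOCHS)
    ]


grid = fine_tuning_grid()
```

The grid holds 2 x 3 x 2 = 12 configurations, small enough to try exhaustively when a single fine-tuning run is cheap.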
  25. ULMFiT: Universal Language Model Fine-tuning for Text Classification
      The ULMFiT paper added an intermediate step in which the model is fine-tuned on text from the same domain as the target task. The classification task is then done with this domain-adapted pretrained BERT model, resulting in better accuracy than using the BERT model alone.
      We too fine-tuned BERT on our custom data. It took around 50 minutes per epoch on a Tesla K80 12GB GPU on P2.
  26. Future work and use cases that BERT can solve for us
      + Email prioritization
      + Sentiment analysis of reviews
      + Review tagging
      + Question answering for chatbot and community
      + Similar-products problem (we currently use cosine similarity on description text)
      To be done: test the ULMFiT approach by fine-tuning BERT on our domain dataset.