BERT
Bidirectional Encoder
Representations from Transformers
PHAM QUANG KHANG
Concept
1. A pre-trained language representation that uses the Transformer architecture, trained on 2 tasks:
a. Randomly mask some words within a sequence and let the model predict those masked words (see the sketch below)
b. Predict whether a pair of sequences actually appears one after the other in a larger context: "next sentence
prediction"
2. Can be used for transfer learning (similar to pre-training on ImageNet in computer vision)
a. Pre-train on a large corpus with unsupervised learning to learn the language representation
b. Fine-tune the model for specific tasks: text classification, named entity recognition, SQuAD
Devlin et al., 2018
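As an illustration of task (a), the sketch below queries a pre-trained BERT for a masked word using the Hugging Face transformers library (the successor to the pytorch-pretrained-BERT repository in reference 4). The model name, prompt, and decoding steps are illustrative assumptions, not part of the original slides.

```python
# A minimal sketch of masked-word prediction with a pre-trained BERT.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Pick the most probable token at the [MASK] position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # likely prints "paris"
```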
Architecture
1. Encoder: the encoder from the Transformer
a. Base model: N = 12, hidden dim = 768, heads = 12
b. Large model: N = 24, hidden dim = 1024, heads = 16
2. Embedding: the sum of token, segment, and positional embeddings (see the figure and sketch below)
[Figure: BERT encoder. Input = Token Embedding + Segment Embedding + Positional Embedding, fed into N× blocks of Multi-head Attention → Add & Norm → Feed Forward → Add & Norm, followed by Linear + Softmax to produce the Output.]
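A minimal PyTorch sketch of the embedding block above, assuming BERT-Base sizes (hidden dim = 768, max length 512, 2 segment types); the class and argument names are illustrative.

```python
# Input embedding = token + segment + (learned) positional embeddings,
# followed by layer normalization and dropout.
import torch
import torch.nn as nn

class BertEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden_dim=768, max_len=512,
                 n_segments=2, dropout=0.1):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_dim)
        self.segment = nn.Embedding(n_segments, hidden_dim)
        self.position = nn.Embedding(max_len, hidden_dim)  # learned, not sinusoidal
        self.norm = nn.LayerNorm(hidden_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
        return self.dropout(self.norm(x))
```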
Pre-training tasks
1. Masked LM (masking procedure sketched below)
2. Next sentence prediction
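For the Masked LM task, the BERT paper masks 15% of the tokens and, of those, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged. The sketch below shows one simplified way to build such training examples (special-token handling omitted; function and variable names are illustrative).

```python
# Build masked-LM inputs and labels from a list of token IDs.
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored in the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                               # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random token
            # else 10%: keep the original token unchanged
    return inputs, labels
```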
Fine-tuning on SQuAD
• Use the output hidden states to predict the start and end of the answer span
• Apply one Linear layer (output dim = 2) to the output hidden state vectors T'i
• The output is the predicted start and end positions of the answer within the input paragraph
• The objective function is the log-likelihood of the correct start and end positions (see the sketch below)
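A minimal PyTorch sketch of this span-prediction head: a single Linear(hidden_dim, 2) over the final hidden states T'i yields start and end logits, trained with the cross-entropy (log-likelihood) of the true span. Names and sizes are illustrative, not the authors' code.

```python
# Span-prediction head for SQuAD-style fine-tuning.
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_dim, 2)  # 2 scores per token: start, end

    def forward(self, hidden_states, start_positions=None, end_positions=None):
        # hidden_states: (batch, seq_len, hidden_dim)
        start_logits, end_logits = self.qa_outputs(hidden_states).split(1, dim=-1)
        start_logits, end_logits = start_logits.squeeze(-1), end_logits.squeeze(-1)
        if start_positions is not None:
            # Average the negative log-likelihood of the correct start and end positions.
            loss_fn = nn.CrossEntropyLoss()
            return (loss_fn(start_logits, start_positions)
                    + loss_fn(end_logits, end_positions)) / 2
        return start_logits, end_logits  # take argmax over seq_len at inference
```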
Result on SQuAD
SQuAD 1.1: new SOTA
SQuAD 2.0: BERT is being used as the pre-trained model by leaderboard entries
https://rajpurkar.github.io/SQuAD-explorer/
Improving from BERT
RoBERTa
1. Train longer, with bigger batches, on more data
2. Remove the next-sentence-prediction task
3. Train on longer sequences
4. Dynamically change the masking pattern
ALBERT
1. Factorized embedding parameterization (sketched below)
2. Cross-layer parameter sharing
3. Inter-sentence coherence loss
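A minimal sketch of ALBERT's factorized embedding parameterization: a small V×E embedding table followed by an E×H projection (E ≪ H) replaces the full V×H table. The sizes below roughly follow ALBERT-base and are illustrative.

```python
# Factorized embedding: V x E lookup, then E x H projection.
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # V x E
        self.project = nn.Linear(embed_dim, hidden_dim)   # E x H

    def forward(self, token_ids):
        return self.project(self.embed(token_ids))

# Parameter count: 30000*128 + 128*768 ≈ 3.9M, versus 30000*768 ≈ 23M
# for an unfactorized V x H embedding table.
```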
References
1. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding
2. https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_fi
netuning_with_cloud_tpus.ipynb
3. https://github.com/google-research/bert
4. PyTorch version: https://github.com/huggingface/pytorch-pretrained-BERT
5. Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach
6. Lan et al., ALBERT: A Lite BERT for Self-supervised Learning of Language Representations