BERT
Bidirectional Encoder
Representations from Transformers
PHAM QUANG KHANG
Concept
1. A pre-trained language representation that uses the Transformer architecture, trained on 2 tasks:
a. Randomly mask some words within a sequence and let the model predict those masked words (see the sketch below)
b. Predict whether a pair of sequences actually appears one after the other in a larger context: "next sentence
prediction"
2. Can be used for transfer learning (similar to pre-training on ImageNet in computer vision)
a. Pre-train on a large corpus with unsupervised learning to learn the language representation
b. Fine-tune the model for specific tasks: text classification, named entity recognition, SQuAD
Devlin et al., 2018
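As an illustration of task (a), the sketch below queries a pre-trained BERT for a masked word using the Hugging Face transformers library (the successor to the pytorch-pretrained-BERT repository in reference 4). The model name, prompt, and decoding steps are illustrative assumptions, not part of the original slides.

```python
# A minimal sketch of masked-word prediction with a pre-trained BERT.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Pick the most probable token at the [MASK] position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # likely prints "paris"
```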
Architecture
1. Encoder: the encoder from the Transformer
a. Base model: N = 12, hidden dim = 768, heads = 12
b. Large model: N = 24, hidden dim = 1024, heads = 16
2. Embedding: the sum of token, segment, and positional embeddings (see the figure and sketch below)
[Figure: BERT encoder. Input = Token Embedding + Segment Embedding + Positional Embedding, fed into N× blocks of Multi-head Attention → Add & Norm → Feed Forward → Add & Norm, followed by Linear + Softmax to produce the Output.]
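A minimal PyTorch sketch of the embedding block above, assuming BERT-Base sizes (hidden dim = 768, max length 512, 2 segment types); the class and argument names are illustrative.

```python
# Input embedding = token + segment + (learned) positional embeddings,
# followed by layer normalization and dropout.
import torch
import torch.nn as nn

class BertEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden_dim=768, max_len=512,
                 n_segments=2, dropout=0.1):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_dim)
        self.segment = nn.Embedding(n_segments, hidden_dim)
        self.position = nn.Embedding(max_len, hidden_dim)  # learned, not sinusoidal
        self.norm = nn.LayerNorm(hidden_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
        return self.dropout(self.norm(x))
```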
Pre-training tasks
1. Masked LM (masking procedure sketched below)
2. Next sentence prediction
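For the Masked LM task, the BERT paper masks 15% of the tokens and, of those, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged. The sketch below shows one simplified way to build such training examples (special-token handling omitted; function and variable names are illustrative).

```python
# Build masked-LM inputs and labels from a list of token IDs.
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored in the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                               # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random token
            # else 10%: keep the original token unchanged
    return inputs, labels
```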
Fine-tuning on SQuAD
• Use the output hidden states to predict the start and end of the answer span
• Apply one Linear layer (output dim = 2) to the output hidden state vectors T'i
• The output is the predicted start and end positions of the answer within the input paragraph
• The objective function is the log-likelihood of the correct start and end positions (see the sketch below)
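A minimal PyTorch sketch of this span-prediction head: a single Linear(hidden_dim, 2) over the final hidden states T'i yields start and end logits, trained with the cross-entropy (log-likelihood) of the true span. Names and sizes are illustrative, not the authors' code.

```python
# Span-prediction head for SQuAD-style fine-tuning.
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_dim, 2)  # 2 scores per token: start, end

    def forward(self, hidden_states, start_positions=None, end_positions=None):
        # hidden_states: (batch, seq_len, hidden_dim)
        start_logits, end_logits = self.qa_outputs(hidden_states).split(1, dim=-1)
        start_logits, end_logits = start_logits.squeeze(-1), end_logits.squeeze(-1)
        if start_positions is not None:
            # Average the negative log-likelihood of the correct start and end positions.
            loss_fn = nn.CrossEntropyLoss()
            return (loss_fn(start_logits, start_positions)
                    + loss_fn(end_logits, end_positions)) / 2
        return start_logits, end_logits  # take argmax over seq_len at inference
```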
Result on SQuAD
SQuAD 1.1: new SOTA
SQuAD 2.0: BERT is being used as the pre-trained model by leaderboard entries
https://rajpurkar.github.io/SQuAD-explorer/
Improving from BERT
RoBERTa
1. Train longer, with bigger batches, on more data
2. Remove the next-sentence-prediction task
3. Train on longer sequences
4. Dynamically change the masking pattern
ALBERT
1. Factorized embedding parameterization (sketched below)
2. Cross-layer parameter sharing
3. Inter-sentence coherence loss
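A minimal sketch of ALBERT's factorized embedding parameterization: a small V×E embedding table followed by an E×H projection (E ≪ H) replaces the full V×H table. The sizes below roughly follow ALBERT-base and are illustrative.

```python
# Factorized embedding: V x E lookup, then E x H projection.
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # V x E
        self.project = nn.Linear(embed_dim, hidden_dim)   # E x H

    def forward(self, token_ids):
        return self.project(self.embed(token_ids))

# Parameter count: 30000*128 + 128*768 ≈ 3.9M, versus 30000*768 ≈ 23M
# for an unfactorized V x H embedding table.
```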
References
1. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding
2. https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_fi
netuning_with_cloud_tpus.ipynb
3. https://github.com/google-research/bert
4. PyTorch version: https://github.com/huggingface/pytorch-pretrained-BERT
5. Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach
6. Lan et al., ALBERT: A Lite BERT for Self-supervised Learning of Language Representations