7. ELMo: deep contextualised word representations
Instead of using a fixed embedding for each word, ELMo looks at the
entire sentence before assigning each word in it an embedding.
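As a rough illustration of this idea (not ELMo's actual architecture, which uses a character-based two-layer biLM and a learned weighting of its layers), the PyTorch sketch below runs a bidirectional LSTM over a whole sentence so that each position's vector depends on its full left and right context; all sizes are arbitrary:

```python
import torch
import torch.nn as nn

# Toy contextualiser: a BiLSTM reads the entire sentence, so the vector at
# each position depends on the whole context, not just the word type.
vocab_size, embed_dim, hidden_dim = 10_000, 128, 256
embed = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                 bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 6))   # one 6-token sentence
static = embed(tokens)                          # same vector for a word everywhere
contextual, _ = bilstm(static)                  # shape: (1, 6, 2 * hidden_dim)
# contextual[0, i] now differs for the same word in different sentences.
```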
11. OpenAI GPT (Generative Pre-trained Transformer): (1) Pre-training
• Unsupervised pre-training, maximising the log-likelihood
$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$
• where $\mathcal{U} = \{u_1, \ldots, u_n\}$ is an unsupervised corpus of tokens, $k$ is the size of the context window, and $P$ is modelled as a neural network with parameters $\Theta$.
$h_0 = U W_e + W_p, \qquad h_l = \text{transformer\_block}(h_{l-1}) \;\; \forall l \in [1, n], \qquad P(u) = \text{softmax}(h_n W_e^{\top})$
• where $U$ is the one-hot representation of the tokens in the window, $W_e$ is the token embedding matrix, $W_p$ is the position embedding matrix, $n$ is the total number of transformer layers, and transformer_block() denotes the decoder block of the Transformer model.
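As a concrete illustration (a toy sketch, not OpenAI's code), the next-token log-likelihood $L_1$ can be computed from decoder logits as follows; the logits are random placeholders standing in for a Transformer decoder's output:

```python
import torch
import torch.nn.functional as F

# Toy autoregressive LM loss: maximise sum_i log P(u_i | u_{i-k..i-1}; Theta).
vocab_size, seq_len = 100, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))   # u_1 .. u_n
logits = torch.randn(1, seq_len, vocab_size)          # placeholder model outputs

# Predict token i from positions < i: shift logits and targets by one.
log_probs = F.log_softmax(logits[:, :-1], dim=-1)
ll = log_probs.gather(-1, tokens[:, 1:].unsqueeze(-1)).sum()
loss = -ll   # minimising this maximises the log-likelihood L1
```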
12. GPT: (2) Fine-tuning
Given labelled data $\mathcal{C}$, where each input is a sequence of tokens $x^1, x^2, \ldots, x^m$ with label $y$, the input is passed through the pre-trained model to obtain the final activation $h_l^m$, which is fed to a linear output layer:
$P(y \mid x^1, \ldots, x^m) = \text{softmax}(h_l^m W_y), \qquad L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m)$
Then maximise the final objective function, which keeps language modelling as an auxiliary objective:
$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$
$\lambda$ is set to 0.5 in the experiments.
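The combined objective reads as a weighted multi-task loss; a toy sketch with placeholder loss values and $\lambda = 0.5$ as above:

```python
import torch

# L3 = L2 + lambda * L1: task loss plus auxiliary LM loss (GPT fine-tuning).
# In practice both losses come from heads on the same Transformer;
# here they are placeholder tensors.
clf_loss = torch.tensor(0.87)          # -L2 (classification NLL)
lm_loss = torch.tensor(2.31)           # -L1 (language-modelling NLL)
lam = 0.5
total_loss = clf_loss + lam * lm_loss  # minimising this maximises L3
```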
13. ELMo and GPT are both unidirectional
• OpenAI GPT uses a left-to-right architecture
• ELMo concatenates independently trained forward and backward language models
• Why not just use a bidirectional LSTM or Transformer?
• A bidirectional model would allow each word to indirectly see itself in a multi-layered context.
15. BERT: Bidirectional Encoder Representations from Transformers
• Main ideas
• Propose a new pre-training objective so that a deep
bidirectional Transformer can be trained
• The “masked language model” (MLM): the objective is to
predict the original word of a masked word based only on its
context
• “Next sentence prediction” (NSP)
• Merits of BERT
• Just fine-tune the BERT model for specific tasks to achieve
state-of-the-art performance
• BERT advances the state-of-the-art for eleven NLP tasks
16. Model architecture
• BERT’s model architecture is a multi-layer
bidirectional Transformer encoder
• (Vaswani et al., 2017) “Attention is all you need”
• Two models with different sizes were investigated
• BERT-Base: L=12, H=768, A=12, total parameters = 110M
• (L: number of layers (Transformer blocks); H: hidden size; A: number of self-attention heads)
• BERT-Large: L=24, H=1024, A=16, total parameters = 340M
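A rough back-of-the-envelope check of these totals (a sketch that ignores biases, LayerNorm, and the pooler; 30,522 is the standard BERT vocabulary size and 4H the feed-forward width):

```python
def approx_bert_params(L, H, vocab=30_522, max_pos=512, segs=2):
    """Rough parameter count for a BERT-style encoder (biases/LayerNorm ignored)."""
    embeddings = (vocab + max_pos + segs) * H
    attention = 4 * H * H            # W_Q, W_K, W_V and the output projection
    ffn = 2 * H * (4 * H)            # H -> 4H -> H feed-forward
    return embeddings + L * (attention + ffn)

print(f"base:  {approx_bert_params(12, 768) / 1e6:.0f}M")   # ~109M, close to 110M
print(f"large: {approx_bert_params(24, 1024) / 1e6:.0f}M")  # ~334M, close to 340M
```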
18. Transformer Encoders
• The Transformer is an attention-based architecture for NLP
• It is composed of two parts: an encoding component and a decoding component
• BERT is a multi-layer bidirectional Transformer encoder
[Figure: a stack of encoder blocks over the input sequence]
Attention is all you need. NIPS 2017
19. Input Representation
• Use [CLS] as the first token for classification tasks
• Separate sentences with a special token [SEP]
• Token Embeddings
• Shape = [vocab_size, token_dim]
• Use pretrained WordPiece embeddings (a Byte-Pair-Encoding-style subword vocabulary)
• Chinese is tokenized into individual characters
• Segment Embeddings
• Shape = [token_type, token_dim]
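As an illustration (using the Hugging Face transformers library, not part of the original slides), the standard BERT tokenizer shows how a sentence pair is laid out with [CLS], [SEP], and segment ids:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok("my dog is hairy", "he likes playing")

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'my', 'dog', 'is', 'hairy', '[SEP]', 'he', 'likes', 'playing', '[SEP]']
print(enc["token_type_ids"])   # segment ids: 0 for sentence A, 1 for sentence B
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```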
20. Position Encoding
• Position Encoding is used to make use of the order of the sequence
• Since the model contains no recurrence and no convolution
• Sine and cosine functions of different frequencies:
$PE_{(pos, 2i)} = \sin\big(pos / 10000^{2i/d_{\text{model}}}\big)$
$PE_{(pos, 2i+1)} = \cos\big(pos / 10000^{2i/d_{\text{model}}}\big)$
• $pos$ is the position and $i$ is the dimension
• Learned positional embeddings produce nearly identical results
Attention is all you need. NIPS2017
Convolutional sequence to sequence learning.
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
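A direct implementation of these formulas (toy sizes chosen for the demo):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal position encodings from 'Attention is all you need'."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_pe(seq_len=128, d_model=64)            # one vector per position
```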
21. Input Representation
• Position embeddings: BERT uses learned positional embeddings (LPE)
• Shape = [seq_len, token_dim]
• Why did BERT change PE to LPE?
• Simply fed in at the input layer? Compare reordering embeddings (ACL 2019 reference below)
• Relative position embeddings: a mask operation plus non-linear functions
Neural Machine Translation with Reordering Embeddings. ACL2019
22. Input Representation
Just sum all three embeddings directly?
• Reasonable? The sum of the 3 embeddings ⟺ the concat of 3 one-hot vectors followed by a single linear layer (MLP); see the check below
• Optimal? Untying them brings faster convergence and better performance
Rethinking Positional Encoding in Language Pre-training. arxiv2020.06 MSRA
Expanding the attention logit of summed inputs, $(x_i + p_i) W^Q \big((x_j + p_j) W^K\big)^{\top}$, yields four terms: token-to-token, token-to-position, position-to-token, and position-to-position.
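A NumPy check of the claimed equivalence (toy sizes): summing the three lookups equals multiplying the concatenation of the three one-hot vectors by the stacked embedding matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
V, S, P, d = 8, 2, 4, 3            # toy vocab/segment/position sizes, dim d
Wt = rng.normal(size=(V, d))       # token embeddings
Ws = rng.normal(size=(S, d))       # segment embeddings
Wp = rng.normal(size=(P, d))       # position embeddings

t, s, p = 5, 1, 2                  # a token id, segment id, position id
summed = Wt[t] + Ws[s] + Wp[p]     # what BERT does

onehot = np.zeros(V + S + P)       # concat of the three one-hot vectors
onehot[t], onehot[V + s], onehot[V + S + p] = 1, 1, 1
W = np.vstack([Wt, Ws, Wp])        # a single linear layer on the concat
assert np.allclose(onehot @ W, summed)
```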
23. Task#1: Masked LM
• 15% of the tokens are masked at random
• The task is to predict the masked tokens based on their left and right context
• Static masking vs. dynamic masking in RoBERTa
• Not all selected tokens are masked in the same way (example sentence: “My dog is hairy”)
• 80% are replaced by the [MASK] token: “My dog is [MASK]”
• 10% are replaced by a random token: “My dog is apple”
• 10% are left intact: “My dog is hairy”
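A minimal sketch of this 80/10/10 procedure (toy vocabulary, standard library only; the real implementation works on WordPiece ids and caps the number of predictions per sequence):

```python
import random

MASK = "[MASK]"
VOCAB = ["apple", "dog", "tree", "car"]   # toy vocabulary for random replacement

def mask_tokens(tokens, mask_rate=0.15):
    """BERT-style masking sketch: pick ~15% of positions, then apply 80/10/10."""
    out, targets = list(tokens), {}
    for i in range(len(tokens)):
        if random.random() < mask_rate:
            targets[i] = tokens[i]             # the model must predict this token
            r = random.random()
            if r < 0.8:
                out[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # else: 10% leave the token intact
    return out, targets

print(mask_tokens("my dog is hairy".split()))
```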
24. Whole Word Masking (BERT-WWM)
Pre-Training with Whole Word Masking for Chinese BERT (the technique itself was proposed by Google, 2019/05/31)
https://github.com/ymcui/Chinese-BERT-wwm/issues/4
• Mask the whole word: if any piece of a WordPiece-split word is chosen, mask all of its pieces (superman => super ##man)
• Here “mask” covers all three global operations: [MASK], random replacement, or leaving the piece intact
• Simple but effective
[Example masking samples for “there is an ap ##p ##le tr ##ee nearby .”: raw masking can hit individual WordPieces, e.g. “there is [MASK] ap [MASK] ##le tr ##ee nearby [MASK] .”, whereas Whole Word Masking always covers every piece of a chosen word, e.g. “there is [MASK] ap ##p ##le [MASK] [MASK] nearby .”]
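A minimal sketch of the grouping step, assuming WordPiece's "##" continuation convention; for brevity every selected piece is replaced by [MASK], whereas the real procedure also applies the random/intact variants per piece:

```python
import random

def whole_word_mask(tokens, word_mask_rate=0.15):
    """Whole Word Masking sketch: group '##' continuation pieces with their
    head piece, then mask every piece of a selected word together."""
    words, cur = [], []                    # group WordPiece indices into words
    for i, t in enumerate(tokens):
        if t.startswith("##") and cur:
            cur.append(i)
        else:
            if cur:
                words.append(cur)
            cur = [i]
    if cur:
        words.append(cur)

    out = list(tokens)
    for word in words:
        if random.random() < word_mask_rate:
            for i in word:                 # cover *all* pieces of the word
                out[i] = "[MASK]"
    return out

print(whole_word_mask("there is an ap ##p ##le tr ##ee nearby .".split(), 0.3))
```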
26. Task#2: Next Sentence Prediction
• Motivation
• Many downstream tasks are based on understanding the
relationship between two text sentences
• Question Answering (QA) and Natural Language Inference (NLI)
• Language modeling does not directly capture that
relationship
• The pre-training task is a binarized next-sentence prediction task
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext | NotNext
27. Task#2: Next Sentence Prediction
• Modification in ALBERT
• The NSP task is too easy
• Replace NSP with SOP (sentence-order prediction)
• Modification in RoBERTa
• FULL-SENTENCES: pack multiple sentences into each input until the length reaches 512
• Modification in SpanBERT
• Similar to RoBERTa
• A sentence from another document adds noise to the MLM task
• A longer sentence provides more context information
28. Pre-training procedure
• Training data: BooksCorpus (800M words) + English
Wikipedia (2,500M words)
• To generate each training input sequence, sample two spans of text (A and B) from the corpus
• The combined length is ≤ 512 tokens
• 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus
• The training loss is the sum of the mean masked
LM likelihood and the mean next sentence
prediction likelihood
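A sketch of the sentence-pair sampling described above (the helper and variable names are invented for the example):

```python
import random

def make_nsp_example(doc_sentences, corpus):
    """BERT-style next-sentence sampling sketch: 50% of the time B is the
    true next sentence, 50% of the time a random sentence from the corpus."""
    i = random.randrange(len(doc_sentences) - 1)
    a = doc_sentences[i]
    if random.random() < 0.5:
        b, label = doc_sentences[i + 1], "IsNext"
    else:
        b, label = random.choice(corpus), "NotNext"
    return a, b, label

doc = ["the man went to the store .", "he bought a gallon of milk .", "then he left ."]
corpus = ["penguins are flightless birds .", "the stock market fell today ."]
print(make_nsp_example(doc, corpus))
```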
29. Fine-tuning with BERT
• Context vector $C$: take the final hidden state corresponding to the first token of the input, [CLS]
• Transform it into a probability distribution over the class labels with classification layer weights $W \in \mathbb{R}^{K \times H}$:
$P = \text{softmax}(C W^{\top})$
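A minimal sketch of this classification head (hidden size and label count are examples; the encoder output is a random placeholder):

```python
import torch
import torch.nn as nn

# Fine-tuning sketch: only the classification layer W (and bias) is new.
H, K = 768, 3                      # hidden size, number of class labels
classifier = nn.Linear(H, K)       # implements C @ W^T + b

hidden_states = torch.randn(1, 128, H)         # encoder output, 128 tokens
C = hidden_states[:, 0]                        # final hidden state of [CLS]
probs = torch.softmax(classifier(C), dim=-1)   # P = softmax(C W^T)
```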
32. Unified Language Model Pre-training (UNILM)
• Uses attention masks to control how much context each token may attend to
• Pre-training objectives:
• Unidirectional LM: both left-to-right and right-to-left
• Bidirectional LM
• Sequence-to-sequence LM
• The objectives are jointly pre-trained with shared parameters
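A sketch of the three mask patterns (toy sizes; 1 = may attend, 0 = blocked; UNILM applies these as additive masks inside self-attention):

```python
import torch

n = 4                                             # toy sequence length
bidirectional = torch.ones(n, n)                  # every token sees everything
left_to_right = torch.tril(torch.ones(n, n))      # token i sees positions <= i

# Sequence-to-sequence: the source (first s tokens) is bidirectional,
# the target attends to the whole source plus its own left context.
s = 2
seq2seq = torch.zeros(n, n)
seq2seq[:s, :s] = 1                               # source <-> source
seq2seq[s:, :s] = 1                               # target -> source
seq2seq[s:, s:] = torch.tril(torch.ones(n - s, n - s))  # target left-to-right
```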
35. How does A Lite BERT (ALBERT) reduce parameters?
• Factorized embedding parameterization
• In BERT, the WordPiece embedding size E is tied to the hidden layer size H
• E captures context-independent representations, H context-dependent ones
• $O(V \times H) \Rightarrow O(V \times E + E \times H)$
• ALBERT-xxlarge: V = 30000, H = 4096, E = 128
• V × H = 30000 × 4096 ≈ 123M vs. V × E + E × H = 30000 × 128 + 128 × 4096 ≈ 4.4M
ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations. ICLR2020
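A sketch of the factorization with the sizes quoted above:

```python
import torch.nn as nn

V, H, E = 30_000, 4_096, 128             # ALBERT-xxlarge sizes from the slide

tied = nn.Embedding(V, H)                # BERT-style: V*H ≈ 123M parameters
factorized = nn.Sequential(
    nn.Embedding(V, E),                  # V*E ≈ 3.8M
    nn.Linear(E, H, bias=False),         # E*H ≈ 0.5M
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"{count(tied)/1e6:.1f}M vs {count(factorized)/1e6:.1f}M")  # 122.9M vs 4.4M
```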
36. How does A Lite BERT (ALBERT) reduce parameters?
• Cross-layer parameter sharing
• only sharing the feed-forward network parameters
• only sharing the attention parameters
• sharing all parameters across layers (the default for ALBERT; sketched below)
• most of the performance drop appears to come from sharing the FFN-layer parameters
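A minimal sketch of all-parameter sharing using PyTorch's stock encoder layer (sizes are BERT-Base-like; ALBERT's actual block differs in detail):

```python
import torch
import torch.nn as nn

# One encoder layer's weights are reused at every depth, so increasing the
# depth does not increase the parameter count.
shared_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)

def albert_like_encoder(x, depth=12):
    for _ in range(depth):        # same weights applied at every layer
        x = shared_layer(x)
    return x

out = albert_like_encoder(torch.randn(1, 16, 768))   # (batch, seq, hidden)
```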
42. Three fine-tuning methods
• Fine-Tuning Strategies
• Preprocessing of long text
• truncation methods
• hierarchical methods
• Features from different layers
• Catastrophic forgetting
• pre-trained knowledge is erased while learning new knowledge
• a lower learning rate (usually {2, 3, 4, 5}e−5) is necessary for BERT to overcome the catastrophic forgetting problem
• Layer-wise Decreasing Layer Rate (see the sketch below)
How to Fine-Tune BERT for Text Classification? 2019
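A sketch of layer-wise decreasing learning rates via optimizer parameter groups (the decay factor of 0.95 is an assumption in the spirit of the cited paper; plain Linear layers stand in for Transformer blocks):

```python
import torch

def layerwise_lr_groups(layers, base_lr=2e-5, decay=0.95):
    """Lower layers get smaller learning rates than layers near the output."""
    groups = []
    for depth, layer in enumerate(reversed(layers)):   # top layer first
        groups.append({"params": layer.parameters(),
                       "lr": base_lr * decay ** depth})
    return groups

layers = [torch.nn.Linear(768, 768) for _ in range(12)]  # stand-ins for blocks
optimizer = torch.optim.AdamW(layerwise_lr_groups(layers))
```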
43. Three fine-tuning methods
• Further Pre-training
• Within-Task Further Pre-Training
• In-Domain Further Pre-Training
• Cross-Domain Further Pre-Training
• Multi-Task Fine-Tuning
How to Fine-Tune BERT for Text Classification? 2019
44. Inside an Encoder Block
In the BERT experiments, the number of blocks N was chosen to be 12 and 24.
Blocks do not share weights with each other.
47. Self-Attention in Detail
• Attention maps a query and a set of key-value pairs to an output
• The query, keys, values, and output are all vectors
[Figure: input vectors X1, X2 are projected into queries q1, q2, keys k1, k2 and values v1, v2]
• Use matrices $W^Q$, $W^K$ and $W^V$ to project the input into query, key and value vectors
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$
• $d_k$ is the dimension of the key vectors
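A NumPy sketch of the computation above for the two-token example (random toy weights standing in for learned projections):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence X of shape (seq_len, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scale by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
d, d_k = 8, 4
X = rng.normal(size=(2, d))                          # two tokens: X1, X2
out = self_attention(X, *(rng.normal(size=(d, d_k)) for _ in range(3)))
```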