7. ELMo: deep contextualised word representations
Instead of using a fixed embedding for each word, ELMo looks at the
entire sentence before assigning each word in it an embedding.
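As a rough illustration of this idea (not ELMo's actual architecture, which uses a character-based two-layer biLM and a learned weighting of its layers), the PyTorch sketch below runs a bidirectional LSTM over a whole sentence so that each position's vector depends on its full left and right context; all sizes are arbitrary:

```python
import torch
import torch.nn as nn

# Toy contextualiser: a BiLSTM reads the entire sentence, so the vector at
# each position depends on the whole context, not just the word type.
vocab_size, embed_dim, hidden_dim = 10_000, 128, 256
embed = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                 bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 6))   # one 6-token sentence
static = embed(tokens)                          # same vector for a word everywhere
contextual, _ = bilstm(static)                  # shape: (1, 6, 2 * hidden_dim)
# contextual[0, i] now differs for the same word in different sentences.
```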
11. OpenAI GPT (Generative Pre-trained Transformer): (1) Pre-training
• Unsupervised pre-training, maximising the log-likelihood
$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$
• where $\mathcal{U} = \{u_1, \ldots, u_n\}$ is an unsupervised corpus of tokens, $k$ is the size of the context window, and $P$ is modelled as a neural network with parameters $\Theta$.
$h_0 = U W_e + W_p, \qquad h_l = \text{transformer\_block}(h_{l-1}) \;\; \forall l \in [1, n], \qquad P(u) = \text{softmax}(h_n W_e^{\top})$
• where $U$ is the one-hot representation of the tokens in the window, $W_e$ is the token embedding matrix, $W_p$ is the position embedding matrix, $n$ is the total number of transformer layers, and transformer_block() denotes the decoder block of the Transformer model.
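As a concrete illustration (a toy sketch, not OpenAI's code), the next-token log-likelihood $L_1$ can be computed from decoder logits as follows; the logits are random placeholders standing in for a Transformer decoder's output:

```python
import torch
import torch.nn.functional as F

# Toy autoregressive LM loss: maximise sum_i log P(u_i | u_{i-k..i-1}; Theta).
vocab_size, seq_len = 100, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))   # u_1 .. u_n
logits = torch.randn(1, seq_len, vocab_size)          # placeholder model outputs

# Predict token i from positions < i: shift logits and targets by one.
log_probs = F.log_softmax(logits[:, :-1], dim=-1)
ll = log_probs.gather(-1, tokens[:, 1:].unsqueeze(-1)).sum()
loss = -ll   # minimising this maximises the log-likelihood L1
```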
12. GPT: (2) Fine-tuning
Given labelled data $\mathcal{C}$, where each input is a sequence of tokens $x^1, x^2, \ldots, x^m$ with label $y$, the input is passed through the pre-trained model to obtain the final activation $h_l^m$, which is fed to a linear output layer:
$P(y \mid x^1, \ldots, x^m) = \text{softmax}(h_l^m W_y), \qquad L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m)$
Then maximise the final objective function, which keeps language modelling as an auxiliary objective:
$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$
$\lambda$ is set to 0.5 in the experiments.
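The combined objective reads as a weighted multi-task loss; a toy sketch with placeholder loss values and $\lambda = 0.5$ as above:

```python
import torch

# L3 = L2 + lambda * L1: task loss plus auxiliary LM loss (GPT fine-tuning).
# In practice both losses come from heads on the same Transformer;
# here they are placeholder tensors.
clf_loss = torch.tensor(0.87)          # -L2 (classification NLL)
lm_loss = torch.tensor(2.31)           # -L1 (language-modelling NLL)
lam = 0.5
total_loss = clf_loss + lam * lm_loss  # minimising this maximises L3
```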
13. ELMo and GPT are both unidirectional
• OpenAI GPT uses a left-to-right architecture
• ELMo concatenates independently trained forward and backward language models
• Why not just use a bidirectional LSTM or Transformer?
• A bidirectional model would allow each word to indirectly see itself in a multi-layered context.
15. BERT: Bidirectional Encoder Representations from Transformers
• Main ideas
• Propose a new pre-training objective so that a deep
bidirectional Transformer can be trained
• The “masked language model” (MLM): the objective is to
predict the original word of a masked word based only on its
context
• “Next sentence prediction” (NSP)
• Merits of BERT
• Just fine-tune the BERT model for specific tasks to achieve
state-of-the-art performance
• BERT advances the state-of-the-art for eleven NLP tasks
16. Model architecture
• BERT’s model architecture is a multi-layer
bidirectional Transformer encoder
• (Vaswani et al., 2017) “Attention is all you need”
• Two models with different sizes were investigated
• BERT-Base: L=12, H=768, A=12, total parameters = 110M
• (L: number of layers (Transformer blocks); H: hidden size; A: number of self-attention heads)
• BERT-Large: L=24, H=1024, A=16, total parameters = 340M
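A rough back-of-the-envelope check of these totals (a sketch that ignores biases, LayerNorm, and the pooler; 30,522 is the standard BERT vocabulary size and 4H the feed-forward width):

```python
def approx_bert_params(L, H, vocab=30_522, max_pos=512, segs=2):
    """Rough parameter count for a BERT-style encoder (biases/LayerNorm ignored)."""
    embeddings = (vocab + max_pos + segs) * H
    attention = 4 * H * H            # W_Q, W_K, W_V and the output projection
    ffn = 2 * H * (4 * H)            # H -> 4H -> H feed-forward
    return embeddings + L * (attention + ffn)

print(f"base:  {approx_bert_params(12, 768) / 1e6:.0f}M")   # ~109M, close to 110M
print(f"large: {approx_bert_params(24, 1024) / 1e6:.0f}M")  # ~334M, close to 340M
```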
18. Transformer Encoders
• The Transformer is an attention-based architecture for NLP
• It is composed of two parts: an encoding component and a decoding component
• BERT is a multi-layer bidirectional Transformer encoder
[Figure: a stack of encoder blocks over the input sequence]
Attention is all you need. NIPS 2017
19. Input Representation
• Use [CLS] as the first token for classification tasks
• Separate sentences with a special token [SEP]
• Token Embeddings
• Shape = [vocab_size, token_dim]
• Use pretrained WordPiece embeddings (a Byte-Pair-Encoding-style subword vocabulary)
• Chinese is tokenized into individual characters
• Segment Embeddings
• Shape = [token_type, token_dim]
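As an illustration (using the Hugging Face transformers library, not part of the original slides), the standard BERT tokenizer shows how a sentence pair is laid out with [CLS], [SEP], and segment ids:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok("my dog is hairy", "he likes playing")

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'my', 'dog', 'is', 'hairy', '[SEP]', 'he', 'likes', 'playing', '[SEP]']
print(enc["token_type_ids"])   # segment ids: 0 for sentence A, 1 for sentence B
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```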
20. Position Encoding
• Position Encoding is used to make use of the order of the sequence
• Since the model contains no recurrence and no convolution
• Sine and cosine functions of different frequencies:
$PE_{(pos, 2i)} = \sin\big(pos / 10000^{2i/d_{\text{model}}}\big)$
$PE_{(pos, 2i+1)} = \cos\big(pos / 10000^{2i/d_{\text{model}}}\big)$
• $pos$ is the position and $i$ is the dimension
• Learned positional embeddings produce nearly identical results
Attention is all you need. NIPS2017
Convolutional sequence to sequence learning.
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
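A direct implementation of these formulas (toy sizes chosen for the demo):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal position encodings from 'Attention is all you need'."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_pe(seq_len=128, d_model=64)            # one vector per position
```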
21. Input Representation
• Position embeddings: BERT uses learned positional embeddings (LPE)
• Shape = [seq_len, token_dim]
• Why did BERT change PE to LPE?
• Simply fed in at the input layer? Compare reordering embeddings (ACL 2019 reference below)
• Relative position embeddings: a mask operation plus non-linear functions
Neural Machine Translation with Reordering Embeddings. ACL2019
22. Input Representation
Just sum all three embeddings directly?
• Reasonable? The sum of the 3 embeddings ⟺ the concat of 3 one-hot vectors followed by a single linear layer (MLP); see the check below
• Optimal? Untying them brings faster convergence and better performance
Rethinking Positional Encoding in Language Pre-training. arxiv2020.06 MSRA
Expanding the attention logit of summed inputs, $(x_i + p_i) W^Q \big((x_j + p_j) W^K\big)^{\top}$, yields four terms: token-to-token, token-to-position, position-to-token, and position-to-position.
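A NumPy check of the claimed equivalence (toy sizes): summing the three lookups equals multiplying the concatenation of the three one-hot vectors by the stacked embedding matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
V, S, P, d = 8, 2, 4, 3            # toy vocab/segment/position sizes, dim d
Wt = rng.normal(size=(V, d))       # token embeddings
Ws = rng.normal(size=(S, d))       # segment embeddings
Wp = rng.normal(size=(P, d))       # position embeddings

t, s, p = 5, 1, 2                  # a token id, segment id, position id
summed = Wt[t] + Ws[s] + Wp[p]     # what BERT does

onehot = np.zeros(V + S + P)       # concat of the three one-hot vectors
onehot[t], onehot[V + s], onehot[V + S + p] = 1, 1, 1
W = np.vstack([Wt, Ws, Wp])        # a single linear layer on the concat
assert np.allclose(onehot @ W, summed)
```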
23. Task#1: Masked LM
• 15% of the tokens are masked at random
• The task is to predict the masked tokens based on their left and right context
• Static masking vs. dynamic masking in RoBERTa
• Not all selected tokens are masked in the same way (example sentence: “My dog is hairy”)
• 80% are replaced by the [MASK] token: “My dog is [MASK]”
• 10% are replaced by a random token: “My dog is apple”
• 10% are left intact: “My dog is hairy”
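A minimal sketch of this 80/10/10 procedure (toy vocabulary, standard library only; the real implementation works on WordPiece ids and caps the number of predictions per sequence):

```python
import random

MASK = "[MASK]"
VOCAB = ["apple", "dog", "tree", "car"]   # toy vocabulary for random replacement

def mask_tokens(tokens, mask_rate=0.15):
    """BERT-style masking sketch: pick ~15% of positions, then apply 80/10/10."""
    out, targets = list(tokens), {}
    for i in range(len(tokens)):
        if random.random() < mask_rate:
            targets[i] = tokens[i]             # the model must predict this token
            r = random.random()
            if r < 0.8:
                out[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # else: 10% leave the token intact
    return out, targets

print(mask_tokens("my dog is hairy".split()))
```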
24. Whole Word Masking (BERT-WWM)
Pre-Training with Whole Word Masking for Chinese BERT (the technique itself was proposed by Google, 2019/05/31)
https://github.com/ymcui/Chinese-BERT-wwm/issues/4
• Mask the whole word: if any piece of a WordPiece-split word is chosen, mask all of its pieces (superman => super ##man)
• Here “mask” covers all three global operations: [MASK], random replacement, or leaving the piece intact
• Simple but effective
[Example masking samples for “there is an ap ##p ##le tr ##ee nearby .”: raw masking can hit individual WordPieces, e.g. “there is [MASK] ap [MASK] ##le tr ##ee nearby [MASK] .”, whereas Whole Word Masking always covers every piece of a chosen word, e.g. “there is [MASK] ap ##p ##le [MASK] [MASK] nearby .”]
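A minimal sketch of the grouping step, assuming WordPiece's "##" continuation convention; for brevity every selected piece is replaced by [MASK], whereas the real procedure also applies the random/intact variants per piece:

```python
import random

def whole_word_mask(tokens, word_mask_rate=0.15):
    """Whole Word Masking sketch: group '##' continuation pieces with their
    head piece, then mask every piece of a selected word together."""
    words, cur = [], []                    # group WordPiece indices into words
    for i, t in enumerate(tokens):
        if t.startswith("##") and cur:
            cur.append(i)
        else:
            if cur:
                words.append(cur)
            cur = [i]
    if cur:
        words.append(cur)

    out = list(tokens)
    for word in words:
        if random.random() < word_mask_rate:
            for i in word:                 # cover *all* pieces of the word
                out[i] = "[MASK]"
    return out

print(whole_word_mask("there is an ap ##p ##le tr ##ee nearby .".split(), 0.3))
```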
26. Task#2: Next Sentence Prediction
• Motivation
• Many downstream tasks are based on understanding the
relationship between two text sentences
• Question Answering (QA) and Natural Language Inference (NLI)
• Language modeling does not directly capture that
relationship
• The pre-training task is a binarized next-sentence prediction task
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext | NotNext
27. Task#2: Next Sentence Prediction
• Modification in ALBERT
• The NSP task is too easy
• Replace NSP with SOP (sentence-order prediction)
• Modification in RoBERTa
• FULL-SENTENCES: pack multiple sentences into each input until the length reaches 512
• Modification in SpanBERT
• Similar to RoBERTa
• A sentence from another document adds noise to the MLM task
• A longer sentence provides more context information
28. Pre-training procedure
• Training data: BooksCorpus (800M words) + English
Wikipedia (2,500M words)
• To generate each training input sequence, sample two spans of text (A and B) from the corpus
• The combined length is ≤ 512 tokens
• 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus
• The training loss is the sum of the mean masked
LM likelihood and the mean next sentence
prediction likelihood
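A sketch of the sentence-pair sampling described above (the helper and variable names are invented for the example):

```python
import random

def make_nsp_example(doc_sentences, corpus):
    """BERT-style next-sentence sampling sketch: 50% of the time B is the
    true next sentence, 50% of the time a random sentence from the corpus."""
    i = random.randrange(len(doc_sentences) - 1)
    a = doc_sentences[i]
    if random.random() < 0.5:
        b, label = doc_sentences[i + 1], "IsNext"
    else:
        b, label = random.choice(corpus), "NotNext"
    return a, b, label

doc = ["the man went to the store .", "he bought a gallon of milk .", "then he left ."]
corpus = ["penguins are flightless birds .", "the stock market fell today ."]
print(make_nsp_example(doc, corpus))
```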
29. Fine-tuning with BERT
• Context vector $C$: take the final hidden state corresponding to the first token of the input, [CLS]
• Transform it into a probability distribution over the class labels with classification layer weights $W \in \mathbb{R}^{K \times H}$:
$P = \text{softmax}(C W^{\top})$
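A minimal sketch of this classification head (hidden size and label count are examples; the encoder output is a random placeholder):

```python
import torch
import torch.nn as nn

# Fine-tuning sketch: only the classification layer W (and bias) is new.
H, K = 768, 3                      # hidden size, number of class labels
classifier = nn.Linear(H, K)       # implements C @ W^T + b

hidden_states = torch.randn(1, 128, H)         # encoder output, 128 tokens
C = hidden_states[:, 0]                        # final hidden state of [CLS]
probs = torch.softmax(classifier(C), dim=-1)   # P = softmax(C W^T)
```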
32. Unified Language Model Pre-training (UNILM)
• Uses attention masks to control how much context each token may attend to
• Pre-training objectives:
• Unidirectional LM: both left-to-right and right-to-left
• Bidirectional LM
• Sequence-to-sequence LM
• The objectives are jointly pre-trained with shared parameters
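A sketch of the three mask patterns (toy sizes; 1 = may attend, 0 = blocked; UNILM applies these as additive masks inside self-attention):

```python
import torch

n = 4                                             # toy sequence length
bidirectional = torch.ones(n, n)                  # every token sees everything
left_to_right = torch.tril(torch.ones(n, n))      # token i sees positions <= i

# Sequence-to-sequence: the source (first s tokens) is bidirectional,
# the target attends to the whole source plus its own left context.
s = 2
seq2seq = torch.zeros(n, n)
seq2seq[:s, :s] = 1                               # source <-> source
seq2seq[s:, :s] = 1                               # target -> source
seq2seq[s:, s:] = torch.tril(torch.ones(n - s, n - s))  # target left-to-right
```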
35. How does A Lite BERT (ALBERT) reduce parameters?
• Factorized embedding parameterization
• In BERT, the WordPiece embedding size E is tied to the hidden layer size H
• E captures context-independent representations, H context-dependent ones
• $O(V \times H) \Rightarrow O(V \times E + E \times H)$
• ALBERT-xxlarge: V = 30000, H = 4096, E = 128
• V × H = 30000 × 4096 ≈ 123M vs. V × E + E × H = 30000 × 128 + 128 × 4096 ≈ 4.4M
ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations. ICLR2020
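A sketch of the factorization with the sizes quoted above:

```python
import torch.nn as nn

V, H, E = 30_000, 4_096, 128             # ALBERT-xxlarge sizes from the slide

tied = nn.Embedding(V, H)                # BERT-style: V*H ≈ 123M parameters
factorized = nn.Sequential(
    nn.Embedding(V, E),                  # V*E ≈ 3.8M
    nn.Linear(E, H, bias=False),         # E*H ≈ 0.5M
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"{count(tied)/1e6:.1f}M vs {count(factorized)/1e6:.1f}M")  # 122.9M vs 4.4M
```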
36. How does A Lite BERT (ALBERT) reduce parameters?
• Cross-layer parameter sharing
• only sharing the feed-forward network parameters
• only sharing the attention parameters
• sharing all parameters across layers (the default for ALBERT; sketched below)
• most of the performance drop appears to come from sharing the FFN-layer parameters
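A minimal sketch of all-parameter sharing using PyTorch's stock encoder layer (sizes are BERT-Base-like; ALBERT's actual block differs in detail):

```python
import torch
import torch.nn as nn

# One encoder layer's weights are reused at every depth, so increasing the
# depth does not increase the parameter count.
shared_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)

def albert_like_encoder(x, depth=12):
    for _ in range(depth):        # same weights applied at every layer
        x = shared_layer(x)
    return x

out = albert_like_encoder(torch.randn(1, 16, 768))   # (batch, seq, hidden)
```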
42. Three fine-tuning methods
• Fine-Tuning Strategies
• Preprocessing of long text
• truncation methods
• hierarchical methods
• Features from different layers
• Catastrophic forgetting
• pre-trained knowledge is erased while learning new knowledge
• a lower learning rate (usually {2, 3, 4, 5}e−5) is necessary for BERT to overcome the catastrophic forgetting problem
• Layer-wise Decreasing Layer Rate (see the sketch below)
How to Fine-Tune BERT for Text Classification? 2019
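A sketch of layer-wise decreasing learning rates via optimizer parameter groups (the decay factor of 0.95 is an assumption in the spirit of the cited paper; plain Linear layers stand in for Transformer blocks):

```python
import torch

def layerwise_lr_groups(layers, base_lr=2e-5, decay=0.95):
    """Lower layers get smaller learning rates than layers near the output."""
    groups = []
    for depth, layer in enumerate(reversed(layers)):   # top layer first
        groups.append({"params": layer.parameters(),
                       "lr": base_lr * decay ** depth})
    return groups

layers = [torch.nn.Linear(768, 768) for _ in range(12)]  # stand-ins for blocks
optimizer = torch.optim.AdamW(layerwise_lr_groups(layers))
```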
43. Three fine-tuning methods
• Further Pre-training
• Within-Task Further Pre-Training
• In-Domain Further Pre-Training
• Cross-Domain Further Pre-Training
• Multi-Task Fine-Tuning
How to Fine-Tune BERT for Text Classification? 2019
44. Inside an Encoder Block
In the BERT experiments, the number of blocks N was chosen to be 12 and 24.
Blocks do not share weights with each other.
47. Self-Attention in Detail
• Attention maps a query and a set of key-value pairs to an output
• The query, keys, values, and output are all vectors
[Figure: input vectors X1, X2 are projected into queries q1, q2, keys k1, k2 and values v1, v2]
• Use matrices $W^Q$, $W^K$ and $W^V$ to project the input into query, key and value vectors
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$
• $d_k$ is the dimension of the key vectors
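A NumPy sketch of the computation above for the two-token example (random toy weights standing in for learned projections):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence X of shape (seq_len, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scale by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
d, d_k = 8, 4
X = rng.normal(size=(2, d))                          # two tokens: X1, X2
out = self_attention(X, *(rng.normal(size=(d, d_k)) for _ in range(3)))
```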