An Introduction to Pre-training
General Language Representations
zhangpengfei36
meituan-ai-dm, 2020-07-17
Outline
• Research Context
• ELMo&GPT
• BERT
• BERT Extensions
• Static Embeddings
• Word2Vec, Glove, …
• Fixed: one vector per word type, so they cannot handle problems such as polysemy
• Dynamic Embeddings
• Autoregressive LM
• Left-to-right or right-to-left
• ELMo/GPT1.0/GPT2.0
• Autoencoder LM
• Denoising Autoencoder
• BERT/ERNIE1.0/MTDNN/SpanBERT/RoBERTa
• XLNet
• Bidirectional + autoregressive
Pre-training general language representations
Feature extraction
• RNNs: ELMo/ULMFiT/SiATL
• Transformer: GPT1.0/GPT2.0/BERT series
• Transformer-XL: XLNet
Feature usage
• Feature-based: pre-trained representations are fed into a task-specific model (ELMo)
• Fine-tune: add task-specific parameters and fine-tune the whole model (GPT, BERT, …)
Fine-tuning approaches
Outline
• Research Context
• ELMo&GPT
• BERT
• BERT Extensions
ELMo: deep contextualised word representation
Instead of using a fixed embedding for each word, ELMo looks at the
entire sentence before assigning each word in it an embedding.
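To make this concrete, here is a minimal PyTorch sketch (illustrative, not ELMo's actual code) of the scalar-mix step that collapses the biLM's per-layer hidden states into one contextual embedding per token; the layer states, weights and scale are assumed inputs.

```python
import torch

def elmo_embedding(layer_states, s_logits, gamma):
    """Collapse L biLM layers into one contextual embedding per token.

    layer_states: tensor [L, seq_len, dim] -- hidden states of each biLM layer
    s_logits:     tensor [L]               -- learned scalar weights (pre-softmax)
    gamma:        scalar                   -- learned global scale
    """
    s = torch.softmax(s_logits, dim=0)  # normalised layer weights
    # weighted sum over layers: each token's embedding depends on the whole sentence
    return gamma * torch.einsum('l,lsd->sd', s, layer_states)

# toy usage: 3 layers (char-CNN + 2 biLSTMs), 5 tokens, 1024-dim states
states = torch.randn(3, 5, 1024)
emb = elmo_embedding(states, torch.zeros(3), torch.tensor(1.0))
print(emb.shape)  # torch.Size([5, 1024])
```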
OpenAI GPT (Generative Pre-trained Transformer)
• Unsupervised pre-training, maximising the log-likelihood
$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$
• where $\mathcal{U} = \{u_1, \dots, u_n\}$ is an unsupervised corpus of tokens, $k$ is the size of the context window, and $P$ is modelled as a neural network with parameters $\Theta$
$$h_0 = U W_e + W_p, \quad h_l = \mathrm{transformer\_block}(h_{l-1}) \;\; \forall l \in [1, n], \quad P(u) = \mathrm{softmax}(h_n W_e^{\top})$$
• where $U$ is the one-hot representation of the tokens in the window, $n$ is the total number of transformer layers, and transformer_block() denotes the decoder of the Transformer model
(1) pre-training
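A minimal sketch of the autoregressive objective $L_1$, assuming PyTorch and using random logits as a stand-in for the transformer_block stack:

```python
import torch
import torch.nn.functional as F

def autoregressive_lm_loss(logits, tokens):
    """Negative of L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1}).

    logits: [seq_len, vocab] -- the model's output at position i predicts token i+1
    tokens: [seq_len]        -- input token ids
    """
    # shift by one: the prediction at position i is scored against the next token
    return F.cross_entropy(logits[:-1], tokens[1:])  # = -mean log-likelihood

vocab, seq_len = 100, 16
tokens = torch.randint(vocab, (seq_len,))
logits = torch.randn(seq_len, vocab)  # stand-in for the softmax(h_n We^T) inputs
print(autoregressive_lm_loss(logits, tokens))
```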
GPT: (2) Fine-tuning
Given labelled data $\mathcal{C}$, where each input is a sequence of tokens $x^1, x^2, \dots, x^m$ with label $y$.
Then maximise the final objective function
$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$
where $L_2$ is the supervised classification objective, $L_1$ the auxiliary language-modelling objective, and $\lambda$ is set to 0.5 in the experiments.
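A sketch of this combined fine-tuning loss under the same assumptions (tensor shapes are hypothetical, for illustration only):

```python
import torch
import torch.nn.functional as F

def gpt_finetune_loss(clf_logits, labels, lm_logits, tokens, lam=0.5):
    """L3(C) = L2(C) + lambda * L1(C): supervised loss plus auxiliary LM loss."""
    l2 = F.cross_entropy(clf_logits, labels)  # task objective L2
    l1 = F.cross_entropy(lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
                         tokens[:, 1:].reshape(-1))  # auxiliary LM objective L1
    return l2 + lam * l1

# toy batch: 4 sequences of 16 tokens, vocab of 100, 3 classes
clf_logits, labels = torch.randn(4, 3), torch.randint(3, (4,))
lm_logits, tokens = torch.randn(4, 16, 100), torch.randint(100, (4, 16))
print(gpt_finetune_loss(clf_logits, labels, lm_logits, tokens))
```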
ELMo and GPT are both essentially unidirectional
• OpenAI GPT uses a left-to-right architecture
• ELMo concatenates independently trained forward and backward language models
• Why not just use bidirectional LSTMs or Transformers?
• bidirectional conditioning would allow each word to indirectly "see itself" in a multi-layered context
Outline
• Research Context
• ELMo&GPT
• BERT
• BERT Extensions
BERT: Bidirectional Encoder Representations from Transformers
• Main ideas
• Propose a new pre-training objective so that a deep
bidirectional Transformer can be trained
• The "masked language model" (MLM): the objective is to predict the original word at a masked position based only on its context
• "Next sentence prediction" (NSP)
• Merits of BERT
• Simply fine-tuning the BERT model on specific tasks achieves state-of-the-art performance
• BERT advances the state-of-the-art for eleven NLP tasks
Model architecture
• BERT’s model architecture is a multi-layer
bidirectional Transformer encoder
• (Vaswani et al., 2017) “Attention is all you need”
• Two models with different sizes were investigated
• BERT-Base: L=12, H=768, A=12, total parameters = 110M
• (L: the number of layers (Transformer blocks), H: the hidden size, A: the number of self-attention heads)
• BERT-Large: L=24, H=1024, A=16, total parameters = 340M
Differences in pre-training model architectures:
BERT, OpenAI GPT, and ELMo
Transformer Encoders
• The Transformer is an attention-based architecture for NLP
• It is composed of two parts: an encoding component and a decoding component
• BERT is a multi-layer bidirectional Transformer encoder
[Diagram: the input sequence feeding a stack of encoder blocks]
Attention is all you need. NIPS 2017
Input Representation
• Use [CLS] for the classification tasks
• Separate sentences by using a special token [SEP]
• Token Embeddings
• Shape=[vocab_size, token_dim]
• Use pretrained WordPiece embeddings (Byte-Pair Encoding)
• character-level tokenisation for Chinese
• Segment Embeddings
• Shape=[token_type, token_dim]
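Together with the position embeddings covered on the next slides, a minimal PyTorch sketch of how BERT-style inputs are assembled by summing the three embedding tables (sizes are the usual BERT-Base defaults, assumed here for illustration):

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """input = token_emb + segment_emb + position_emb (all of width token_dim)."""
    def __init__(self, vocab_size=30522, token_dim=768, max_len=512, token_types=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, token_dim)   # [vocab_size, token_dim]
        self.seg = nn.Embedding(token_types, token_dim)  # [token_type, token_dim]
        self.pos = nn.Embedding(max_len, token_dim)      # [seq_len, token_dim]

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)

# "[CLS] a b [SEP] c [SEP]": segment 0 for sentence A, segment 1 for sentence B
ids  = torch.tensor([[101, 7, 8, 102, 9, 102]])
segs = torch.tensor([[0, 0, 0, 0, 1, 1]])
print(BertEmbeddings()(ids, segs).shape)  # torch.Size([1, 6, 768])
```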
Position Encoding
• Position Encoding is used to make use of the order of the sequence
• Since the model contains no recurrence and no convolution
• Sine and cosine functions of different frequencies:
$$PE_{(pos, 2i)} = \sin\!\big(pos / 10000^{2i/d_{model}}\big), \quad PE_{(pos, 2i+1)} = \cos\!\big(pos / 10000^{2i/d_{model}}\big)$$
• where $pos$ is the position and $i$ is the dimension
• Learned positional embeddings produce nearly identical results
Attention is all you need. NIPS2017
Convolutional sequence to sequence learning.
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
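A small NumPy sketch of the sinusoidal encoding above (dimensions and names are illustrative):

```python
import numpy as np

def sinusoidal_pe(seq_len, dim):
    """PE[pos, 2i] = sin(pos / 10000^(2i/dim)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]              # [seq_len, 1]
    i = np.arange(dim // 2)[None, :]               # [1, dim/2]
    angles = pos / np.power(10000.0, 2 * i / dim)  # one frequency per dimension pair
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

print(sinusoidal_pe(4, 8)[0])  # position 0 -> [0, 1, 0, 1, ...]
```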
Input Representation
• Position Embeddings: Use Learned Positional Embeddings
• Shape=[seq_len, token_dim]
• Why was PE changed to LPE in BERT?
• Is simply adding it at the input layer enough? cf. reordering embeddings
• Relative position embeddings: a mask operation plus non-linear functions
Neural Machine Translation with Reordering Embeddings. ACL2019
Input Representation
Is it reasonable to just sum all three embeddings directly?
• Reasonable? A sum of 3 embeddings ⟺ a concat of 3 one-hot vectors followed by an MLP
• Optimal? Untying them yields faster convergence and better performance
Rethinking Positional Encoding in Language Pre-training. arXiv 2020.06, MSRA
Expanding the attention score of the summed embeddings yields four terms: token-to-token, token-to-position, position-to-token, and position-to-position.
Task#1: Masked LM
• 15% of the tokens are masked at random
• The task is to predict the masked tokens based on their left and right context
• Static mask vs Dynamic mask in RoBERTa
• Not all selected tokens are masked in the same way (example sentence: "My dog is hairy"); the procedure is sketched below
• 80% are replaced by the [MASK] token: "My dog is [MASK]"
• 10% are replaced by a random token: "My dog is apple"
• 10% are left intact: "My dog is hairy"
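A minimal sketch of the 80/10/10 procedure (plain Python; the vocabulary is illustrative):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", p=0.15):
    """BERT-style static masking: pick ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    out, targets = list(tokens), {}
    for i in range(len(tokens)):
        if random.random() < p:
            targets[i] = tokens[i]             # what the model must predict
            r = random.random()
            if r < 0.8:
                out[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = random.choice(vocab)  # 10%: replace with a random token
            # else 10%: leave the token intact
    return out, targets

random.seed(0)
print(mask_tokens(["my", "dog", "is", "hairy"], vocab=["apple", "tree", "cat"]))
```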
Pre-Training with Whole Word Masking for Chinese BERT (whole word masking itself was proposed by Google, 2019-05-31)
https://github.com/ymcui/Chinese-BERT-wwm/issues/4
Whole Word Masking (BERT-WWM)
• Mask whole words: superman => super ##man is masked as a unit
• Here "mask" means any of the three global corruption modes: [MASK], [RANDOM], or left intact
• Simple but effective
Sampled maskings of "there is an ap ##p ##le tr ##ee nearby .":

Raw mask (WordPieces corrupted independently, so a word can be only partially masked):
there [MASK] an ap [MASK] ##le tr [RANDOM] nearby .

Whole Word Masking (all pieces of a word are corrupted together):
there is an ap ##p ##le [MASK] [MASK] nearby [MASK] .
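A small helper sketch showing how WordPieces can be grouped into whole-word spans before masking (assumes the standard "##" continuation prefix):

```python
def whole_word_spans(wordpieces):
    """Group WordPiece tokens into whole-word spans, so that a word like
    'superman' -> ['super', '##man'] is masked as one unit."""
    spans, cur = [], []
    for i, tok in enumerate(wordpieces):
        if tok.startswith("##") and cur:
            cur.append(i)          # continuation piece: same word as the previous token
        else:
            if cur:
                spans.append(cur)
            cur = [i]              # start of a new word
    if cur:
        spans.append(cur)
    return spans

print(whole_word_spans(["there", "is", "an", "ap", "##p", "##le", "tr", "##ee"]))
# [[0], [1], [2], [3, 4, 5], [6, 7]] -- mask a whole span, not a single piece
```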
• ERNIE 1.0: basic-level masking, phrase-level masking, entity-level masking
• MT-BERT: knowledge-aware masking
Task#2: Next Sentence Prediction
• Motivation
• Many downstream tasks are based on understanding the
relationship between two text sentences
• Question Answering (QA) and Natural Language Inference (NLI)
• Language modeling does not directly capture that
relationship
• BERT therefore pre-trains on a binarized next-sentence-prediction task
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon
[MASK] milk [SEP]
Label = isNext | NotNext
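A minimal sketch of how such training pairs can be generated (plain Python; the doc/corpus structure is assumed for illustration):

```python
import random

def make_nsp_example(doc, corpus):
    """50%: (A, the actual next sentence, isNext); 50%: (A, a random sentence, NotNext)."""
    i = random.randrange(len(doc) - 1)
    a = doc[i]
    if random.random() < 0.5:
        return a, doc[i + 1], "isNext"
    other = random.choice(corpus)        # random document from the corpus
    return a, random.choice(other), "NotNext"

doc = ["the man went to the store .", "he bought a gallon of milk ."]
corpus = [doc, ["penguins are flightless ."]]
random.seed(1)
print(make_nsp_example(doc, corpus))
```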
Task#2: Next Sentence Prediction
• Modification in ALBERT
• the NSP task is too easy
• replace NSP with SOP (sentence-order prediction)
• Modification in RoBERTa
• FULL-SENTENCES: pack multiple sentences into one input until the length reaches 512 tokens
• Modification in SpanBERT
• similar to RoBERTa
• sentences from another document are noise for the MLM task
• a longer contiguous span provides more context
Pre-training procedure
• Training data: BooksCorpus (800M words) + English
Wikipedia (2,500M words)
• To generate each training input sequence: sample two spans of text (A and B) from the corpus
• The combined length is ≤ 512 tokens
• 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus
• The training loss is the sum of the mean masked
LM likelihood and the mean next sentence
prediction likelihood
Fine-tuning with BERT
• Context vector $C$: take the final hidden state corresponding to the first token of the input, [CLS].
• Transform it into a probability distribution over the class labels: $P = \mathrm{softmax}(C W^{\top})$
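The slide's figure reduces to a single linear layer plus softmax over the [CLS] state; a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class BertClassifier(nn.Module):
    """P = softmax(C @ W^T): one classification layer over the final
    hidden state C of the [CLS] token."""
    def __init__(self, hidden=768, num_labels=2):
        super().__init__()
        self.W = nn.Linear(hidden, num_labels)

    def forward(self, sequence_output):          # [batch, seq_len, hidden]
        c = sequence_output[:, 0]                # hidden state of [CLS]
        return torch.softmax(self.W(c), dim=-1)  # class probabilities

print(BertClassifier()(torch.randn(2, 6, 768)))
```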
Outline
• Research Context
• ELMo&GPT
• BERT
• BERT Extensions
Improvements based on BERT
Unified Language Model Pre-training (UniLM)
• uses self-attention masks to control how much context each token can attend to
• Pre-training objectives:
• Unidirectional LM: both left-to-right and right-to-left
• Bidirectional LM
• Sequence-to-sequence LM
• Jointly pre-trained with shared parameters
Model compression and light-weighting
How does A Lite BERT (ALBERT) reduce parameters?
Ø Factorized embedding parameterization
• the WordPiece embedding size E is tied with the hidden layer size H
• E: context-independent H: context-dependent
• $O(V \times H) \Rightarrow O(V \times E + E \times H)$
• ALBERT-xxlarge: V = 30000, H = 4096, E = 128
• $V \times H$ = 30000 × 4096 ≈ 123M vs. $V \times E + E \times H$ = 30000 × 128 + 128 × 4096 ≈ 4.4M
ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations. ICLR2020
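A PyTorch sketch of the factorization (the parameter counts match the arithmetic above):

```python
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """O(V*H) -> O(V*E + E*H): embed into a small E, then project up to H."""
    def __init__(self, vocab=30000, e=128, h=4096):
        super().__init__()
        self.emb = nn.Embedding(vocab, e)         # V*E = 30000*128 = 3.84M params
        self.proj = nn.Linear(e, h, bias=False)   # E*H = 128*4096  = 0.52M params

    def forward(self, ids):
        return self.proj(self.emb(ids))           # vs. a direct V*H table: ~123M

m = FactorizedEmbedding()
print(sum(p.numel() for p in m.parameters()))     # 4,364,288 ~= 4.4M
```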
How does A Lite BERT (ALBERT) reduce parameters?
Ø Cross-layer parameter sharing
• share only the feed-forward network parameters
• share only the attention parameters
• share all parameters across layers (ALBERT's default)
• most of the performance drop appears to come from sharing the FFN-layer parameters
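A minimal sketch of cross-layer sharing: one set of layer weights applied at every depth (torch.nn.TransformerEncoderLayer stands in for ALBERT's actual block):

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """ALBERT-style sharing: one encoder layer's weights reused at every depth."""
    def __init__(self, hidden=768, heads=12, depth=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.depth = depth                 # 12 passes, but only 1 set of parameters

    def forward(self, x):
        for _ in range(self.depth):
            x = self.layer(x)              # the same module applied repeatedly
        return x

print(SharedEncoder()(torch.randn(1, 5, 768)).shape)  # torch.Size([1, 5, 768])
```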
Speedup comparison
https://github.com/brightmart/albert_zh
An important next step is thus to speed up training and inference!
MT-BERT
Exploration and Practice of BERT at Meituan (美团BERT的探索和实践), 2019-11-14
Making MT-BERT lightweight
• Low-precision quantization: use low-precision representations in training and inference
• FP32 => FP16, or even INT8 and binary networks
• Model pruning: reduce the number of layers and parameters
• removing Transformer layers has only a small impact on short text
• MT-BERT => MT-BERT-MINI (a 4-layer Transformer structure)
• online-serving TP999 latency: 50ms+ => 12-14ms
[Figure: F1 of MT-BERT on the query intent classification dataset before vs. after pruning]
Making MT-BERT lightweight
• Model distillation
• for long sentences, directly pruning the model causes a larger performance loss
• under a given accuracy requirement, transfer the knowledge learned by a large model to a lightweight small model
• train a large model A on the training set (usually called the teacher model)
• use model A to generate a soft target for every sample in the transfer set
• train a student model B on the transfer set with a cross-entropy loss (soft + hard)
• keep only the student model for online prediction: drop the soft targets and keep the ordinary classification softmax
Distilling the Knowledge in a Neural Network. G. Hinton. NIPS 2014.
[Figure: pruning vs. knowledge distillation on the Query-Doc relevance task]
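A common formulation of the distillation loss described above (a sketch, assuming PyTorch; temperature T and mixing weight alpha are illustrative hyper-parameters, not values from the slides):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    """Cross-entropy loss (soft + hard): match the teacher's soft targets at
    temperature T, plus the usual loss on the gold labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # soft-target term
    hard = F.cross_entropy(student_logits, hard_labels)  # gold-label term
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(8, 3), torch.randn(8, 3)
print(distillation_loss(s, t, torch.randint(3, (8,))))
```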
Thanks & QA
Three fine-tuning methods
Ø Fine-Tuning Strategies
• Preprocessing of long text
• truncation methods
• hierarchical methods
• Features from different layers
• Catastrophic forgetting
• pre-trained knowledge is erased while learning new knowledge
• a lower learning rate is necessary for BERT to overcome the catastrophic forgetting problem (usually $\{2,3,4,5\} \times 10^{-5}$)
• Layer-wise decreasing learning rate (a sketch follows below)
How to Fine-Tune BERT for Text Classification? 2019
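A sketch of layer-wise decreasing learning rates (the 0.95 decay factor is illustrative, in the spirit of the cited paper):

```python
def layerwise_lrs(num_layers=12, base_lr=2e-5, decay=0.95):
    """Assign lr_k = base_lr * decay^(num_layers - k): the top layer gets the
    base rate, lower layers progressively smaller rates."""
    return {f"layer_{k}": base_lr * decay ** (num_layers - k)
            for k in range(1, num_layers + 1)}

for name, lr in list(layerwise_lrs().items())[:3]:
    print(name, f"{lr:.2e}")   # layer_1 (the bottom layer) has the smallest rate
```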
Three fine-tuning methods
• Further Pre-training
• Within-Task Further Pre-Training
• In-Domain Further Pre-Training
• Cross-Domain Further Pre-Training
• Multi-Task Fine-Tuning
How to Fine-Tune BERT for Text Classification? 2019
Inside an Encoder Block
In BERT experiments, the number of blocks N was chosen to be 12 and 24.
Blocks do not share weights with each other.
Transformer Encoders: Key Concepts
• Self-attention and multi-head self-attention
• Position encoding
• Layer normalization
• Residual connections
• Position-wise feed-forward network
Self-Attention
https://jalammar.github.io/illustrated-transformer/
Self-Attention in Detail
• Attention maps a query and a set of key-value pairs to an output
• queries, keys, values, and outputs are all vectors
• Use matrices $W^Q$, $W^K$ and $W^V$ to project each input $x_i$ into query, key and value vectors $q_i$, $k_i$, $v_i$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
• where $d_k$ is the dimension of the key vectors
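A direct PyTorch sketch of the formula:

```python
import math
import torch

def attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # pairwise similarities
    return torch.softmax(scores, dim=-1) @ v           # weighted sum of values

q = k = v = torch.randn(2, 64)   # two tokens (x1, x2), d_k = 64
print(attention(q, k, v).shape)  # torch.Size([2, 64])
```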
Multi-Head Attention
• Run several attention heads in parallel (Head #0, Head #1, …, Head #7) over the inputs $x_1, x_2, \dots$
• Concatenate the head outputs and apply a linear projection with a weight matrix $W^O$
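A compact PyTorch sketch of the whole multi-head block (dim=512 and 8 heads, as in the original Transformer; this is an illustrative implementation, not BERT's code):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """8 heads attend in parallel; outputs are concatenated and projected by W^O."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.h, self.d_k = heads, dim // heads
        self.wq, self.wk, self.wv = (nn.Linear(dim, dim) for _ in range(3))
        self.wo = nn.Linear(dim, dim)                  # the weight matrix W^O

    def forward(self, x):                              # x: [seq, dim]
        def split(t):                                  # [seq, dim] -> [h, seq, d_k]
            return t.view(-1, self.h, self.d_k).transpose(0, 1)
        q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))
        scores = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        heads = (scores @ v).transpose(0, 1).reshape(-1, self.h * self.d_k)  # concat
        return self.wo(heads)                          # final linear projection

print(MultiHeadAttention()(torch.randn(5, 512)).shape)  # torch.Size([5, 512])
```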
1. Mixed precision for faster training
• Deep models are typically trained in single precision (FP32) or double precision (FP64)
• Constrained by GPU memory, the batch size becomes too small when the network is large
• training becomes unstable, hurting final model quality
• data throughput drops, slowing down training
• Mixed-precision training: mix FP32 and FP16
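A minimal mixed-precision training step with PyTorch AMP (the model and optimiser are illustrative stand-ins; requires a CUDA GPU):

```python
import torch

model = torch.nn.Linear(512, 2).cuda()
opt = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()   # rescales FP16 gradients to avoid underflow

x, y = torch.randn(32, 512).cuda(), torch.randint(2, (32,)).cuda()
with torch.cuda.amp.autocast():        # forward pass runs in FP16 where safe
    loss = torch.nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()          # scaled backward keeps small gradients alive
scaler.step(opt)                       # unscales, then takes the FP32 optimiser step
scaler.update()
```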
3. Knowledge infusion
• Missing common sense + limited reasoning ability
• Inject knowledge-graph information into MT-BERT pre-training: knowledge-aware masking
• Meituan Brain (美团大脑): a large-scale knowledge graph of dining and entertainment
• Before pre-training, segment the corpus and align the segmentation results with knowledge-graph entities
[Figure: performance comparison of BERT and ALBERT on sentence-completeness classification]