XLNet, RoBERTa, Reformer
San Kim
2020.02.07
XLNet
Independence Assumption
• BERT – assumes the masked targets are conditionally independent of each other
• XLNet – captures the dependency between the target tokens
Context dependency
• AR LM – conditioned only on the tokens up to position t
• XLNet – has access to contextual information on both sides
Noise
• BERT – the input contains artificial symbols like [MASK] that never occur in
downstream tasks
• XLNet – does not rely on any input corruption
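A worked example, adapted from the comparison in [1] (the sentence "New York is a city" with both "New" and "York" as prediction targets), makes the first point concrete:

J_BERT  = log p(New | is a city) + log p(York | is a city)
J_XLNet = log p(New | is a city) + log p(York | New, is a city)

BERT predicts both targets from the same corrupted context, so it cannot model that "York" becomes far more likely once "New" is known; a factorization order that places New before York keeps that dependency.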
Notation
Z_T: the set of all possible permutations of the length-T index sequence [1, 2, …, T].
z_t: the t-th element; z_{<t} = z_{1:t-1}: the first t−1 elements, for z ∈ Z_T.
e.g. for Z_5, if z = [4, 5, 1, 3, 2] ∈ Z_5 and t = 4, then z_t = 3 and z_{<t} = [4, 5, 1].
x_t: the t-th element of 𝒙; x_{<t}: the first t−1 elements of 𝒙.
e(x): the embedding of x; m_t = 1 indicates that x_t is masked.
h_θ(x_{1:t-1}): a context representation produced by a neural model.
𝒙 = [x_1, …, x_T]: a text sequence.
x̂: the corrupted version of 𝒙 (a random portion of tokens set to the [MASK] symbol).
x̄: the masked tokens (e.g. if x̂ = [KETI, is, [MASK], [MASK], company], then x̄ = [a, good]).
H_θ: a Transformer that maps a length-T text sequence x̂ into a sequence of hidden
vectors H_θ(x̂) = [H_θ(x̂)_1, H_θ(x̂)_2, …, H_θ(x̂)_T].
Objective
BERT:
max_θ log p_θ(x̄ | x̂) ≈ Σ_{t=1}^{T} m_t log p_θ(x_t | x̂)
                      = Σ_{t=1}^{T} m_t log [ exp(H_θ(x̂)_t^⊤ e(x_t)) / Σ_{x'} exp(H_θ(x̂)_t^⊤ e(x')) ]
XLNet:
max_θ 𝔼_{z∼Z_T} [ Σ_{t=1}^{T} log p_θ(x_{z_t} | x_{z_{<t}}) ]
Objective
Example: 𝒙 = [Hello_1, my_2, name_3, is_4, San_5], 𝒛 = [3, 5, 4, 2, 1] (𝒛 is a sample from Z_5)

Target      Condition
name_3      (none)
San_5       name_3
is_4        name_3, San_5
my_2        name_3, San_5, is_4
Hello_1     name_3, San_5, is_4, my_2

max_θ log p_θ(Target | Condition)
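The same procedure can be written as a short training-loss sketch. The model interface below (vocabulary logits for one target position, given a mask over the visible positions) is a simplifying assumption for illustration, not the two-stream attention implementation of [1]:

import torch

def permutation_lm_loss(model, x):
    """Toy permutation-LM objective: sample one factorization order z and
    accumulate -log p(x_{z_t} | x_{z_<t}) over all positions.
    `model(x, visible, target_pos)` is a hypothetical interface returning
    vocabulary logits for the token at `target_pos` while attending only
    to the positions marked in `visible`."""
    T = x.size(0)
    z = torch.randperm(T)                          # z ~ Z_T
    loss = 0.0
    for t in range(T):
        target_pos = z[t]
        visible = torch.zeros(T, dtype=torch.bool)
        visible[z[:t]] = True                      # condition on x_{z_<t} only
        logits = model(x, visible, target_pos)     # shape: (vocab_size,)
        loss = loss - torch.log_softmax(logits, dim=-1)[x[target_pos]]
    return loss / T

In expectation over sampled orders z, every position sees every possible bidirectional context, which is how the permutation objective avoids both the AR left-to-right restriction and BERT's [MASK] corruption.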
Capture bidirectional context (figure adopted from [1])
In BERT: the same distribution regardless of the target position
XLNet: target-position aware
XLNet (figures adopted from [1])
Partial Prediction
Example: 𝒙 = [Hello_1, my_2, name_3, is_4, San_5], 𝒛 = [3, 5, 4, 2, 1], c: cutting point
Only the targets that come after the cutting point c in the factorization order are predicted.

Target      Condition
name_3      (none)
San_5       name_3
is_4        name_3, San_5
my_2        name_3, San_5, is_4
Hello_1     name_3, San_5, is_4, my_2
Partial Prediction
• Predict only the tokens that have sufficient context
• Faster convergence
• Saves computation time and memory
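Written out (this is the partial-prediction objective from [1]), the permutation z is split at the cutting point c and only the tokens after c are predicted, so every predicted token is conditioned on a reasonably long prefix:

max_θ 𝔼_{z∼Z_T} [ log p_θ(x_{z_{>c}} | x_{z_{≤c}}) ] = 𝔼_{z∼Z_T} [ Σ_{t=c+1}^{|z|} log p_θ(x_{z_t} | x_{z_{<t}}) ]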
Incorporating Ideas from Transformer-XL
• Relative positional encoding
• Segment recurrence mechanism (a minimal sketch follows below)
• + Relative segment encodings
Adopted from [1]
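A minimal sketch of the segment-recurrence mechanism (relative positional and relative segment encodings are omitted; the shapes and projection names are illustrative, not the code from [1]):

import torch

def attention_with_memory(h_cur, mem, w_q, w_k, w_v):
    """Transformer-XL style recurrence: hidden states cached from the
    previous segment are prepended (without gradient) before keys and
    values are formed, so the current segment can attend beyond its own
    boundary. h_cur: [cur_len, d], mem: [mem_len, d]."""
    h_ext = torch.cat([mem.detach(), h_cur], dim=0)     # stop-gradient on the cache
    q = h_cur @ w_q                                     # queries come from the current segment only
    k, v = h_ext @ w_k, h_ext @ w_v
    attn = torch.softmax(q @ k.t() / k.size(-1) ** 0.5, dim=-1)
    out = attn @ v                                      # [cur_len, d_v]
    new_mem = h_cur.detach()                            # cache for the next segment
    return out, new_mem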
Comparing with BERT
• Similarity
• Both perform partial prediction
• Reduces optimization difficulty by giving the targets sufficient context
(e.g. p(? | the) alone is difficult to predict)
• Difference
• XLNet captures the dependency between targets (given the same set of targets; cf.
RoBERTa)
Experiments
• 32.89B subword pieces
• 512 TPU v3 chips for 500K steps, batch size 8192, about 5.5 days
• Estimated cost: $540K (on-demand), $162K (preemptible)
Adopted from [1]
Experiments
SQuAD, Adopted from [1]
GLUE, Adopted from [1]
RoBERTa
1. Training the model longer
2. Bigger batches over more data
3. Removing next sentence prediction objective (w/ DOC-
SENTENCES)
4. Training on longer sequences
5. Dynamically changing the masking pattern (see the sketch below)
Pretrain the model using 1024 V100 GPUs (32GB) for approximately one day.
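A minimal sketch of dynamic masking (point 5): a fresh mask pattern is drawn every time a sequence is fed to the model, instead of being fixed once during preprocessing. The 80/10/10 replacement rule used by BERT/RoBERTa is omitted here for brevity, and the names are illustrative:

import random

def dynamic_mask(tokens, mask_token="[MASK]", mask_prob=0.15):
    """Draw a new masking pattern on every call, so the same sentence is
    masked differently across epochs (dynamic masking), unlike static
    masking where the pattern is fixed during preprocessing."""
    return [mask_token if random.random() < mask_prob else tok for tok in tokens]

# The same sentence gets a different pattern each epoch:
# dynamic_mask(["KETI", "is", "a", "good", "company"])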
RoBERTa
SEGMENT-PAIR: Each input has a pair of segments.
SENTENCE-PAIR: Each input contains a pair of natural sentences.
FULL-SENTENCES: Inputs may cross document boundaries. Add
an extra separator token between documents.
DOC-SENTENCES: Inputs may not cross document boundaries.
RoBERTa
NSP, Adopted from [2]
Adopted from [2]
RoBERTa
Additional data, pretrain longer, Adopted from [2]
Adopted from [2]
Reformer – The Efficient Transformer
• Large-scale long-sequence models yield great results but strain resources to
the point where some argue that this trend is breaking NLP research.
• Many large Transformer models can only realistically be trained in large
industrial research laboratories, and models trained with model parallelism
cannot even be fine-tuned on a single GPU, since their memory requirements
demand a multi-accelerator hardware setup even for a single training step.
Efficiency!! [5, 6]
Reformer – The Efficient Transformer
1. Memory in a model with N layers is N times larger than in a single-layer
model, because activations need to be stored for back-propagation.
2. Since the depth d_ff of the intermediate feed-forward layers is often much
larger than the depth d_model of the attention activations, it accounts for a
large fraction of memory use.
3. Attention on sequences of length L is O(L²) in both computational and
memory complexity, so even a single sequence of 64K tokens can exhaust
accelerator memory.
Reformer – The Efficient Transformer
1. Reversible layers enable storing only a single copy of activations for the
whole model, so the N factor disappears.
2. Splitting activations inside feed-forward layers and processing them in
chunks removes the d_ff factor and saves memory inside feed-forward layers
(see the sketch below).
3. Approximate attention based on locality-sensitive hashing replaces the
O(L²) factor in attention layers with O(L log L), allowing operation on long
sequences.
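For point 2, a minimal sketch of activation chunking, assuming a generic position-wise feed-forward module (not the Reformer code itself): because the layer acts on every position independently, the sequence can be processed in slices so that only one slice's d_ff-sized intermediate activations exist at a time.

import torch

def chunked_feed_forward(ff, x, num_chunks=4):
    """Apply a position-wise feed-forward module `ff` to the sequence `x`
    ([seq_len, d_model]) in chunks along the sequence axis. The result is
    identical to ff(x), but the large d_ff intermediate activations are
    only ever materialized for one chunk at a time."""
    return torch.cat([ff(chunk) for chunk in x.chunk(num_chunks, dim=0)], dim=0)

# Example usage with a stand-in two-layer MLP:
# ff = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.ReLU(), torch.nn.Linear(256, 64))
# y = chunked_feed_forward(ff, torch.randn(1024, 64))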
Reformer – The Efficient Transformer
Adopted from [3]
Reformer – The Efficient Transformer
Adopted from [3]
3. LSH (locality-sensitive hashing)
Reformer – The Efficient Transformer
Adopted from [3]
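A minimal sketch of the angular LSH used to bucket query/key vectors (a single hash round; the full scheme in [3] uses several rounds and shared-QK attention restricted to each bucket):

import torch

def lsh_buckets(vecs, n_buckets, seed=0):
    """Angular LSH: project vectors onto a random matrix R and take the
    argmax over the concatenation [xR ; -xR]. Vectors pointing in similar
    directions tend to receive the same bucket id, so attention can be
    restricted to tokens that share a bucket."""
    torch.manual_seed(seed)
    r = torch.randn(vecs.size(-1), n_buckets // 2)   # random projection, shared by all vectors
    rotated = vecs @ r                               # [n, n_buckets // 2]
    return torch.cat([rotated, -rotated], dim=-1).argmax(dim=-1)   # bucket id per vector

# lsh_buckets(torch.randn(16, 64), n_buckets=8) -> tensor of 16 bucket ids in [0, 8)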
Reformer – The Efficient Transformer
Adopted from [4]
1. Reversible residual networks
Reformer – The Efficient Transformer
1. Reversible Transformer
Y_1 = X_1 + Attention(X_2),  Y_2 = X_2 + FeedForward(Y_1)
• Sharing QK (the same projection for Query and Key)

Residual networks:            x → y,  y = x + F(x)
Reversible residual networks: (x_1, x_2) → (y_1, y_2)
  Forward:  y_1 = x_1 + F(x_2),  y_2 = x_2 + G(y_1)
  Inverse:  x_2 = y_2 − G(y_1),  x_1 = y_1 − F(x_2)
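A minimal sketch of the reversible block defined above: the inputs can be recomputed exactly from the outputs, so activations do not have to be stored for back-propagation (F and G stand in for the Attention and FeedForward sub-layers):

import torch

class ReversibleBlockSketch:
    """y1 = x1 + F(x2), y2 = x2 + G(y1); the inverse recovers (x1, x2)
    exactly from (y1, y2), so intermediate activations can be recomputed
    during the backward pass instead of being stored."""
    def __init__(self, f, g):
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)   # only needs the outputs
        x1 = y1 - self.f(x2)
        return x1, x2

# f, g = torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)   # stand-ins for Attention / FeedForward
# block = ReversibleBlockSketch(f, g)
# y1, y2 = block.forward(torch.randn(4, 8), torch.randn(4, 8))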
Reformer – The Efficient Transformer
Adopted from [3]
Reformer – The Efficient Transformer
Adopted from [3]
Reformer – The Efficient Transformer
Adopted from [3]
Reference
[1] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le.
XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint
arXiv:1906.08237, 2019.
[2] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach.
arXiv preprint arXiv:1907.11692, 2019.
[3] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient Transformer. arXiv preprint
arXiv:2001.04451, 2020.
[4] Nikita Kitaev and Lukasz Kaiser. Reformer: The Efficient Transformer. Google AI Blog, 2020.
https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html
[5] Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Large batch
optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
[6] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut.
ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint
arXiv:1909.11942, 2019.
Next topic
• Reinforcement Learning (2018~)
• Compositionality, Modularity (VQA, GNNs, Causal modeling, NMNs)
• Graph Neural Networks (from the GNN model to Hyperbolic GNNs, HCGNNs)