XLNet, RoBERTa, and Reformer are state-of-the-art language models. XLNet improves on BERT by capturing the dependencies between prediction targets. RoBERTa further improves pre-training by removing the next sentence prediction objective and training on longer sequences with bigger batches over more data. Reformer introduces efficient attention and feed-forward mechanisms, such as reversible layers and locality-sensitive hashing, to process long sequences with less memory.
2. XLNet
Independence assumption
• BERT – assumes the masked targets are independent of each other
• XLNet – captures the dependency between the target pairs
Context dependency
• AR – conditioned only on the tokens up to position t
• XLNet – has access to the contextual information on both sides
Noise
• BERT – the input contains artificial symbols like [MASK] that never occur in downstream tasks
• XLNet – does not rely on any input corruption
3. Notation
$\mathcal{Z}_T$: the set of all possible permutations of the length-$T$ index sequence $[1, 2, \ldots, T]$
$z_t$: the $t$-th element; $\mathbf{z}_{<t} = \mathbf{z}_{1:t-1}$: the first $t-1$ elements, where $\mathbf{z} \in \mathcal{Z}_T$
e.g. in the case of $\mathcal{Z}_5$, if $\mathbf{z} = [4, 5, 1, 3, 2] \in \mathcal{Z}_5$ and $t = 4$, then $z_t = 3$ and $\mathbf{z}_{<t} = [4, 5, 1]$
$x_t$: the $t$-th element of $\mathbf{x}$; $\mathbf{x}_{<t}$: the first $t-1$ elements of $\mathbf{x}$
$e(x)$: the embedding of $x$; $m_t = 1$ indicates that $x_t$ is masked
$h_\theta(\mathbf{x}_{1:t-1})$: a context representation produced by a neural model
$\mathbf{x} = [x_1, \ldots, x_T]$: a text sequence
$\hat{\mathbf{x}}$: the corrupted version of $\mathbf{x}$ (a random portion of tokens set to the [MASK] symbol)
$\bar{\mathbf{x}}$: the masked tokens (e.g. if $\hat{\mathbf{x}} = [\text{KETI}, \text{is}, \text{[MASK]}, \text{[MASK]}, \text{company}]$, then $\bar{\mathbf{x}} = [\text{a}, \text{good}]$)
$H_\theta(\mathbf{x})$: a Transformer that maps a length-$T$ text sequence $\mathbf{x}$ into a sequence of hidden vectors $H_\theta(\mathbf{x}) = [H_\theta(\mathbf{x})_1, H_\theta(\mathbf{x})_2, \ldots, H_\theta(\mathbf{x})_T]$
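With this notation, the pre-training objectives contrasted above can be written side by side; the following formulas restate the AR, BERT, and XLNet (permutation language modeling) objectives as given in [1].

```latex
% AR language modeling: each token is predicted from its left context only
\max_\theta \; \log p_\theta(\mathbf{x})
  = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t})
  = \sum_{t=1}^{T} \log
    \frac{\exp\!\big(h_\theta(\mathbf{x}_{1:t-1})^{\top} e(x_t)\big)}
         {\sum_{x'} \exp\!\big(h_\theta(\mathbf{x}_{1:t-1})^{\top} e(x')\big)}

% BERT: reconstruct the masked tokens from the corrupted input;
% the approximation is the independence assumption over the targets
\max_\theta \; \log p_\theta(\bar{\mathbf{x}} \mid \hat{\mathbf{x}})
  \approx \sum_{t=1}^{T} m_t \, \log p_\theta(x_t \mid \hat{\mathbf{x}})

% XLNet: AR factorization averaged over sampled factorization orders z,
% so both-side context is seen in expectation and no [MASK] is needed
\max_\theta \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \sum_{t=1}^{T} \log p_\theta(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}) \right]
```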
16. Comparing with BERT
• Similarity
• Both perform partial prediction: only a subset of tokens is predicted
• This reduces optimization difficulty by ensuring sufficient context
(e.g. predicting a word from p(? | the) alone is too difficult)
• Difference
• XLNet models the dependency between targets, while BERT predicts them
independently given the same target set (cf. RoBERTa); see the worked example below
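As a concrete illustration of the difference (the example used in [1]): take the sentence [New, York, is, a, city] and suppose both models select the two tokens [New, York] as prediction targets.

```latex
% BERT: the two targets are predicted independently of each other
\mathcal{J}_{\text{BERT}}
  = \log p(\text{New} \mid \text{is a city})
  + \log p(\text{York} \mid \text{is a city})

% XLNet (for a factorization order that places "New" before "York"):
% the second target is additionally conditioned on the first
\mathcal{J}_{\text{XLNet}}
  = \log p(\text{New} \mid \text{is a city})
  + \log p(\text{York} \mid \text{New}, \text{is a city})
```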
17. Experiments
• 32.89B subword pieces
• 512 TPU v3 chips for 500K steps, batch size 8192, about 5.5 days
• Estimated training cost: $540K (on-demand), $162K (preemptible)
Adapted from [1]
19. RoBERTa
1. Training the model longer
2. Using bigger batches over more data
3. Removing the next sentence prediction objective (w/ DOC-SENTENCES)
4. Training on longer sequences
5. Dynamically changing the masking pattern applied to the training data (see the sketch below)
Pretrain the model using 1024 V100 GPUs (32GB) for approximately one day.
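A minimal sketch of point 5 (dynamic masking), assuming BERT-style masking with a 15% rate and the usual 80/10/10 replacement split; the token ids and helper below are hypothetical, not RoBERTa's actual code. The key point is that the mask is re-sampled every time a sequence is fed to the model, rather than fixed once during preprocessing.

```python
import random

MASK_ID = 4                 # hypothetical [MASK] token id
VOCAB_SIZE = 50265          # e.g. the RoBERTa BPE vocabulary size
SPECIAL_IDS = {0, 1, 2, 3}  # hypothetical <s>, </s>, <pad>, <unk>

def dynamic_mask(token_ids, mask_prob=0.15):
    """Sample a fresh masking pattern for one training sequence.

    Returns (inputs, labels): labels is -100 (ignored by the loss)
    everywhere except at the masked positions.
    """
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if tok in SPECIAL_IDS or random.random() >= mask_prob:
            continue
        labels[i] = tok                      # predict the original token
        r = random.random()
        if r < 0.8:                          # 80%: replace with [MASK]
            inputs[i] = MASK_ID
        elif r < 0.9:                        # 10%: replace with a random token
            inputs[i] = random.randrange(VOCAB_SIZE)
        # remaining 10%: keep the original token unchanged
    return inputs, labels

# Each call re-samples the pattern, so the same sequence is masked
# differently every epoch (unlike static masking done in preprocessing).
example = [0, 100, 200, 300, 400, 2]
print(dynamic_mask(example))
```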
20. RoBERTa
SEGMENT-PAIR (w/ NSP): each input has a pair of segments, each of which can contain multiple natural sentences.
SENTENCE-PAIR (w/ NSP): each input contains a pair of natural sentences.
FULL-SENTENCES (w/o NSP): each input is packed with full sentences and may cross document boundaries; an extra separator token is added between documents (see the packing sketch below).
DOC-SENTENCES (w/o NSP): like FULL-SENTENCES, but inputs may not cross document boundaries.
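A rough sketch of how FULL-SENTENCES inputs could be packed, assuming sentences are already tokenized into ids; the separator id, the 512-token budget, and the function name are illustrative, not the paper's implementation.

```python
SEP_ID = 2        # hypothetical separator / end-of-document token id
MAX_LEN = 512     # RoBERTa's maximum input length

def pack_full_sentences(documents):
    """Pack tokenized sentences into inputs of at most MAX_LEN tokens.

    `documents` is a list of documents, each a list of sentences,
    each a list of token ids. Inputs may cross document boundaries;
    an extra separator token marks each boundary.
    """
    # Flatten into a stream of pieces, inserting a separator between documents.
    pieces = []
    for d, doc in enumerate(documents):
        pieces.extend(doc)
        if d < len(documents) - 1:
            pieces.append([SEP_ID])

    inputs, current = [], []
    for piece in pieces:
        if current and len(current) + len(piece) > MAX_LEN:
            inputs.append(current)           # flush the full input
            current = []
        current.extend(piece)
    if current:
        inputs.append(current)
    return inputs
```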
23. Reformer – The Efficient Transformer
• Large-scale long-sequence models yield great results but strain resources to the point where some argue that this trend is breaking NLP research.
• Many large Transformer models can only realistically be trained in large industrial research laboratories, and models trained with model parallelism cannot even be fine-tuned on a single GPU, since their memory requirements demand a multi-accelerator hardware setup even for a single training step.
Efficiency!! [5-6]
24. Reformer – The Efficient Transformer
1. Memory in a model with $N$ layers is $N$ times larger than in a single-layer model, because activations need to be stored for back-propagation.
2. Since the depth $d_{ff}$ of the intermediate feed-forward layers is often much larger than the depth $d_{model}$ of the attention activations, it accounts for a large fraction of memory use.
3. Attention on sequences of length $L$ is $O(L^2)$ in both computational and memory complexity, so even a single sequence of 64K tokens can exhaust accelerator memory (see the estimate below).
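A quick back-of-the-envelope check of point 3, assuming a dense float32 attention score matrix for a single sequence and a single head (the same 64K-token argument made in [3]):

```python
# Memory needed just for the L x L attention score matrix,
# for one sequence and one head, stored in float32 (4 bytes per entry).
L = 64 * 1024                      # 64K tokens
bytes_per_entry = 4                # float32
attention_matrix_bytes = L * L * bytes_per_entry
print(attention_matrix_bytes / 2**30)   # ~16 GiB, more than a typical accelerator's memory
```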
25. Reformer – The Efficient Transformer
1. Reversible layers enable storing only a single copy of activations for the whole model, so the $N$ factor disappears (see the sketch below).
2. Splitting activations inside the feed-forward layers and processing them in chunks removes the $d_{ff}$ factor and saves memory inside the feed-forward layers.
3. Approximate attention computation based on locality-sensitive hashing replaces the $O(L^2)$ factor in the attention layers with $O(L \log L)$, and so allows operating on long sequences.
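A minimal NumPy sketch of the reversible-residual idea behind point 1 (RevNet-style coupling, as used in Reformer [3]); `attention` and `feed_forward` below are toy stand-ins for the real sub-layers. Because each block's inputs can be recomputed exactly from its outputs, per-layer activations need not be stored for back-propagation.

```python
import numpy as np

# Toy stand-ins for the real sub-layers (illustrative only).
def attention(x):
    return np.tanh(x)

def feed_forward(x):
    return np.maximum(x, 0.0)

def reversible_forward(x1, x2):
    """One reversible block: y1 = x1 + Attn(x2), y2 = x2 + FF(y1)."""
    y1 = x1 + attention(x2)
    y2 = x2 + feed_forward(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    """Recover the block's inputs from its outputs, with no stored activations."""
    x2 = y2 - feed_forward(y1)
    x1 = y1 - attention(x2)
    return x1, x2

x1 = np.random.randn(4, 8)
x2 = np.random.randn(4, 8)
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)  # inputs recovered exactly
```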
34. References
[1] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237, 2019.
[2] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692, 2019.
[3] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The Efficient Transformer. arXiv preprint arXiv:2001.04451, 2020.
[4] Nikita Kitaev and Lukasz Kaiser. Reformer: The Efficient Transformer. Google AI Blog, 2020. https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html
[5] Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes. arXiv preprint arXiv:1904.00962, 2019.
[6] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942, 2019.