XLNet, RoBERTa, and Reformer are state-of-the-art language models. XLNet improves on BERT by capturing the dependencies between prediction targets. RoBERTa further improves pre-training by removing the next sentence prediction objective and training on longer sequences with bigger batches over more data. Reformer introduces efficient attention and feed-forward mechanisms, such as reversible layers and locality-sensitive hashing, to process long sequences with less memory.
2. XLNet
Independence assumption
• BERT – assumes the masked targets are independent of each other
• XLNet – captures the dependency between the target pairs
Context dependency
• AR – conditioned only on the tokens up to position t
• XLNet – has access to the contextual information on both sides
Noise
• BERT – the input contains artificial symbols like [MASK] that never occur in downstream tasks
• XLNet – does not rely on any input corruption
3. Notation
$\mathcal{Z}_T$: the set of all possible permutations of the length-$T$ index sequence $[1, 2, \ldots, T]$
$z_t$: the $t$-th element; $\mathbf{z}_{<t} = \mathbf{z}_{1:t-1}$: the first $t-1$ elements, where $\mathbf{z} \in \mathcal{Z}_T$
e.g. in the case of $\mathcal{Z}_5$, if $\mathbf{z} = [4, 5, 1, 3, 2] \in \mathcal{Z}_5$ and $t = 4$, then $z_t = 3$ and $\mathbf{z}_{<t} = [4, 5, 1]$
$x_t$: the $t$-th element of $\mathbf{x}$; $\mathbf{x}_{<t}$: the first $t-1$ elements of $\mathbf{x}$
$e(x)$: the embedding of $x$; $m_t = 1$ indicates that $x_t$ is masked
$h_\theta(\mathbf{x}_{1:t-1})$: a context representation produced by a neural model
$\mathbf{x} = [x_1, \ldots, x_T]$: a text sequence
$\hat{\mathbf{x}}$: the corrupted version of $\mathbf{x}$ (a random portion of tokens set to the [MASK] symbol)
$\bar{\mathbf{x}}$: the masked tokens (e.g. if $\hat{\mathbf{x}} = [\text{KETI}, \text{is}, \text{[MASK]}, \text{[MASK]}, \text{company}]$, then $\bar{\mathbf{x}} = [\text{a}, \text{good}]$)
$H_\theta(\mathbf{x})$: a Transformer that maps a length-$T$ text sequence $\mathbf{x}$ into a sequence of hidden vectors $H_\theta(\mathbf{x}) = [H_\theta(\mathbf{x})_1, H_\theta(\mathbf{x})_2, \ldots, H_\theta(\mathbf{x})_T]$
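With this notation, the pre-training objectives contrasted above can be written side by side; the following formulas restate the AR, BERT, and XLNet (permutation language modeling) objectives as given in [1].

```latex
% AR language modeling: each token is predicted from its left context only
\max_\theta \; \log p_\theta(\mathbf{x})
  = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t})
  = \sum_{t=1}^{T} \log
    \frac{\exp\!\big(h_\theta(\mathbf{x}_{1:t-1})^{\top} e(x_t)\big)}
         {\sum_{x'} \exp\!\big(h_\theta(\mathbf{x}_{1:t-1})^{\top} e(x')\big)}

% BERT: reconstruct the masked tokens from the corrupted input;
% the approximation is the independence assumption over the targets
\max_\theta \; \log p_\theta(\bar{\mathbf{x}} \mid \hat{\mathbf{x}})
  \approx \sum_{t=1}^{T} m_t \, \log p_\theta(x_t \mid \hat{\mathbf{x}})

% XLNet: AR factorization averaged over sampled factorization orders z,
% so both-side context is seen in expectation and no [MASK] is needed
\max_\theta \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \sum_{t=1}^{T} \log p_\theta(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}) \right]
```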
16. Comparing with BERT
• Similarity
• Both perform partial prediction: only a subset of tokens is predicted
• This reduces optimization difficulty by ensuring sufficient context
(e.g. predicting a word from p(? | the) alone is too difficult)
• Difference
• XLNet models the dependency between targets, while BERT predicts them
independently given the same target set (cf. RoBERTa); see the worked example below
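As a concrete illustration of the difference (the example used in [1]): take the sentence [New, York, is, a, city] and suppose both models select the two tokens [New, York] as prediction targets.

```latex
% BERT: the two targets are predicted independently of each other
\mathcal{J}_{\text{BERT}}
  = \log p(\text{New} \mid \text{is a city})
  + \log p(\text{York} \mid \text{is a city})

% XLNet (for a factorization order that places "New" before "York"):
% the second target is additionally conditioned on the first
\mathcal{J}_{\text{XLNet}}
  = \log p(\text{New} \mid \text{is a city})
  + \log p(\text{York} \mid \text{New}, \text{is a city})
```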
17. Experiments
• 32.89B subword pieces
• 512 TPU v3 chips for 500K steps, batch size 8192, about 5.5 days
• Estimated training cost: $540K (on-demand), $162K (preemptible)
Adapted from [1]
19. RoBERTa
1. Training the model longer
2. Using bigger batches over more data
3. Removing the next sentence prediction objective (w/ DOC-SENTENCES)
4. Training on longer sequences
5. Dynamically changing the masking pattern applied to the training data (see the sketch below)
Pretrain the model using 1024 V100 GPUs (32GB) for approximately one day.
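A minimal sketch of point 5 (dynamic masking), assuming BERT-style masking with a 15% rate and the usual 80/10/10 replacement split; the token ids and helper below are hypothetical, not RoBERTa's actual code. The key point is that the mask is re-sampled every time a sequence is fed to the model, rather than fixed once during preprocessing.

```python
import random

MASK_ID = 4                 # hypothetical [MASK] token id
VOCAB_SIZE = 50265          # e.g. the RoBERTa BPE vocabulary size
SPECIAL_IDS = {0, 1, 2, 3}  # hypothetical <s>, </s>, <pad>, <unk>

def dynamic_mask(token_ids, mask_prob=0.15):
    """Sample a fresh masking pattern for one training sequence.

    Returns (inputs, labels): labels is -100 (ignored by the loss)
    everywhere except at the masked positions.
    """
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if tok in SPECIAL_IDS or random.random() >= mask_prob:
            continue
        labels[i] = tok                      # predict the original token
        r = random.random()
        if r < 0.8:                          # 80%: replace with [MASK]
            inputs[i] = MASK_ID
        elif r < 0.9:                        # 10%: replace with a random token
            inputs[i] = random.randrange(VOCAB_SIZE)
        # remaining 10%: keep the original token unchanged
    return inputs, labels

# Each call re-samples the pattern, so the same sequence is masked
# differently every epoch (unlike static masking done in preprocessing).
example = [0, 100, 200, 300, 400, 2]
print(dynamic_mask(example))
```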
20. RoBERTa
SEGMENT-PAIR (w/ NSP): each input has a pair of segments, each of which can contain multiple natural sentences.
SENTENCE-PAIR (w/ NSP): each input contains a pair of natural sentences.
FULL-SENTENCES (w/o NSP): each input is packed with full sentences and may cross document boundaries; an extra separator token is added between documents (see the packing sketch below).
DOC-SENTENCES (w/o NSP): like FULL-SENTENCES, but inputs may not cross document boundaries.
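A rough sketch of how FULL-SENTENCES inputs could be packed, assuming sentences are already tokenized into ids; the separator id, the 512-token budget, and the function name are illustrative, not the paper's implementation.

```python
SEP_ID = 2        # hypothetical separator / end-of-document token id
MAX_LEN = 512     # RoBERTa's maximum input length

def pack_full_sentences(documents):
    """Pack tokenized sentences into inputs of at most MAX_LEN tokens.

    `documents` is a list of documents, each a list of sentences,
    each a list of token ids. Inputs may cross document boundaries;
    an extra separator token marks each boundary.
    """
    # Flatten into a stream of pieces, inserting a separator between documents.
    pieces = []
    for d, doc in enumerate(documents):
        pieces.extend(doc)
        if d < len(documents) - 1:
            pieces.append([SEP_ID])

    inputs, current = [], []
    for piece in pieces:
        if current and len(current) + len(piece) > MAX_LEN:
            inputs.append(current)           # flush the full input
            current = []
        current.extend(piece)
    if current:
        inputs.append(current)
    return inputs
```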
23. Reformer – The Efficient Transformer
• Large-scale long-sequence models yield great results but strain resources to the point where some argue that this trend is breaking NLP research.
• Many large Transformer models can only realistically be trained in large industrial research laboratories, and models trained with model parallelism cannot even be fine-tuned on a single GPU, since their memory requirements demand a multi-accelerator hardware setup even for a single training step.
Efficiency!! [5-6]
24. Reformer – The Efficient Transformer
1. Memory in a model with $N$ layers is $N$ times larger than in a single-layer model, because activations need to be stored for back-propagation.
2. Since the depth $d_{ff}$ of the intermediate feed-forward layers is often much larger than the depth $d_{model}$ of the attention activations, it accounts for a large fraction of memory use.
3. Attention on sequences of length $L$ is $O(L^2)$ in both computational and memory complexity, so even a single sequence of 64K tokens can exhaust accelerator memory (see the estimate below).
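A quick back-of-the-envelope check of point 3, assuming a dense float32 attention score matrix for a single sequence and a single head (the same 64K-token argument made in [3]):

```python
# Memory needed just for the L x L attention score matrix,
# for one sequence and one head, stored in float32 (4 bytes per entry).
L = 64 * 1024                      # 64K tokens
bytes_per_entry = 4                # float32
attention_matrix_bytes = L * L * bytes_per_entry
print(attention_matrix_bytes / 2**30)   # ~16 GiB, more than a typical accelerator's memory
```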
25. Reformer – The Efficient Transformer
1. Reversible layers enable storing only a single copy of activations for the whole model, so the $N$ factor disappears (see the sketch below).
2. Splitting activations inside the feed-forward layers and processing them in chunks removes the $d_{ff}$ factor and saves memory inside the feed-forward layers.
3. Approximate attention computation based on locality-sensitive hashing replaces the $O(L^2)$ factor in the attention layers with $O(L \log L)$, and so allows operating on long sequences.
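A minimal NumPy sketch of the reversible-residual idea behind point 1 (RevNet-style coupling, as used in Reformer [3]); `attention` and `feed_forward` below are toy stand-ins for the real sub-layers. Because each block's inputs can be recomputed exactly from its outputs, per-layer activations need not be stored for back-propagation.

```python
import numpy as np

# Toy stand-ins for the real sub-layers (illustrative only).
def attention(x):
    return np.tanh(x)

def feed_forward(x):
    return np.maximum(x, 0.0)

def reversible_forward(x1, x2):
    """One reversible block: y1 = x1 + Attn(x2), y2 = x2 + FF(y1)."""
    y1 = x1 + attention(x2)
    y2 = x2 + feed_forward(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    """Recover the block's inputs from its outputs, with no stored activations."""
    x2 = y2 - feed_forward(y1)
    x1 = y1 - attention(x2)
    return x1, x2

x1 = np.random.randn(4, 8)
x2 = np.random.randn(4, 8)
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)  # inputs recovered exactly
```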
34. References
[1] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237, 2019.
[2] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692, 2019.
[3] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The Efficient Transformer. arXiv preprint arXiv:2001.04451, 2020.
[4] Nikita Kitaev and Lukasz Kaiser. Reformer: The Efficient Transformer. Google AI Blog, 2020. https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html
[5] Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes. arXiv preprint arXiv:1904.00962, 2019.
[6] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942, 2019.