XLNet
Generalized Autoregressive Pretraining for Language Understanding
by Zhilin Yang, Zihang Dai, et al.
Presented by:
V S Siva Kumar Lakkoju
CS2139: Information Retrieval
MTech CS
Indian Statistical Institute, Kolkata
June 23, 2022
Outline
• Introduction
• Proposed Method
• Design of XLNet
• HyperParameter
• Results
Transformer XL [2]
Increases context through segment-level recurrence and a relative positional encoding scheme:
• Caches and reuses the previous segment's hidden states
• Allows variable-length context, which helps capture long-term dependencies
• Resolves the problem of context fragmentation
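The caching idea above can be sketched in a few lines. This is a toy stand-in (random linear map instead of attention, hypothetical function name), not the actual Transformer-XL layer; it only illustrates how the previous segment's hidden states are cached and prepended as fixed context for the next segment.

```python
import numpy as np

def segment_recurrence(tokens, seg_len, d_model=8, rng=None):
    """Toy sketch of Transformer-XL segment-level recurrence:
    process a long sequence segment by segment, reusing the previous
    segment's cached hidden states as extra (non-trainable) context."""
    rng = rng or np.random.default_rng(0)
    W = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    memory = None                      # cached hidden states of the previous segment
    outputs = []
    for start in range(0, len(tokens), seg_len):
        seg = tokens[start:start + seg_len]
        h = rng.standard_normal((len(seg), d_model))   # toy token embeddings
        # Extended context = [cached memory ; current segment]
        ctx = h if memory is None else np.concatenate([memory, h], axis=0)
        h_out = np.tanh(ctx @ W)[-len(seg):]  # keep states for current tokens only
        memory = h_out.copy()          # cache; no gradient flows through it
        outputs.append(h_out)
    return np.concatenate(outputs, axis=0)
```

Because the memory is detached, each segment's cost stays bounded while the effective context grows with every cached segment.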
– XLNet 7/35
The Idea
• Permutation applies only to the factorization order, not the original sequence order
• Attention masks provide the context for each prediction
• Two-stream self-attention makes each prediction aware of its target position
• Partial prediction: only predict the last 1/K tokens in each permutation
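The bullets above can be sketched as follows. This is a minimal illustration (the function name and layout are my own, not from the paper) of sampling a factorization order, deriving the attention mask that defines each token's context, and selecting the last 1/K positions as prediction targets.

```python
import numpy as np

def permutation_masks(seq_len, K, rng=None):
    """Sample a factorization order z and build a content-attention mask.
    mask[i, j] = 1 means position i may attend to position j, i.e. j comes
    earlier in the factorization order. Only the last seq_len // K positions
    of the order are prediction targets (partial prediction)."""
    rng = rng or np.random.default_rng(0)
    z = rng.permutation(seq_len)        # permute factorization order, not word order
    rank = np.empty(seq_len, dtype=int)
    rank[z] = np.arange(seq_len)        # rank[pos] = index of pos within z
    # Position i sees position j iff j precedes i in the factorization order.
    mask = (rank[None, :] < rank[:, None]).astype(int)
    targets = z[-(seq_len // K):]       # last 1/K tokens are predicted
    return z, mask, targets
```

Note the token order of the input is untouched; only the mask changes per sampled permutation, which is exactly why positional encodings remain consistent.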
How to reparameterize?
The standard softmax does not work, because it ignores the target position $z_t$:

$$P(X_{z_t} = x \mid \mathbf{x}_{z_{<t}}) = \frac{\exp\left(e(x)^\top h_\theta(\mathbf{x}_{z_{<t}})\right)}{\sum_{x'} \exp\left(e(x')^\top h_\theta(\mathbf{x}_{z_{<t}})\right)} \quad (10)$$

Solution: incorporate the target position $z_t$ into the hidden state:

$$P(X_{z_t} = x \mid \mathbf{x}_{z_{<t}}) = \frac{\exp\left(e(x)^\top g_\theta(\mathbf{x}_{z_{<t}}, z_t)\right)}{\sum_{x'} \exp\left(e(x')^\top g_\theta(\mathbf{x}_{z_{<t}}, z_t)\right)} \quad (11)$$

This is implemented using two-stream self-attention.
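Given a hidden state that already encodes the target position, Eq. (11) reduces to an ordinary softmax over embedding dot products. A minimal sketch (function name is my own; `E` and `g` stand in for the embedding matrix $e(\cdot)$ and the query-stream output $g_\theta$):

```python
import numpy as np

def target_aware_softmax(E, g):
    """Target-aware distribution of Eq. (11).
    E: (V, d) token embedding matrix e(x).
    g: (d,) hidden state g_theta(x_{z<t}, z_t) that encodes the target position.
    Returns P(X_{z_t} = x | x_{z<t}) over the vocabulary."""
    logits = E @ g               # e(x)^T g for every candidate token x
    logits -= logits.max()       # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```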
Information about specific datasets
• RACE dataset: 512 sequence length
• SQuAD: during finetuning on SQuAD 2.0, logistic regression was jointly applied to predict whether a question is answerable
• Layer-wise decay: if the learning rate of the 24th (top) layer is $l$ and the decay rate is $q$, then the learning rate of layer $m$ is

$$\mathrm{lr}_m = l \cdot q^{\,24 - m} \quad (15)$$
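Eq. (15) translates directly into code. A small sketch (function name is my own; the example values of `top_lr` and `decay` are illustrative, not the paper's hyperparameters):

```python
def layerwise_lr(top_lr, decay, num_layers=24):
    """Layer-wise learning-rate decay (Eq. 15): layer m gets
    top_lr * decay**(num_layers - m), so lower layers update more slowly."""
    return {m: top_lr * decay ** (num_layers - m)
            for m in range(1, num_layers + 1)}
```

For example, with `top_lr=1e-5` and `decay=0.75`, layer 24 trains at 1e-5, layer 23 at 7.5e-6, and so on down the stack, preserving more of the pretrained lower layers.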
XLNet vs BERT
Input sequence: New York is a city
XLNet factorization order: [is, a, city, New, York]
For BERT,

$$\log p(\text{New York} \mid \text{is a city}) = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{is a city}) \quad (16)$$

For XLNet,

$$\log p(\text{New York} \mid \text{is a city}) = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{New, is a city}) \quad (17)$$
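A tiny numeric illustration of Eqs. (16) and (17). The probabilities below are hypothetical toy values, not from any trained model; they only show that conditioning "York" on "New" (XLNet) can assign a much higher likelihood than treating the two masked tokens as independent (BERT).

```python
import math

# Hypothetical toy probabilities (illustrative only, not from any model)
p_new_given_ctx = 0.20       # p(New  | is a city)
p_york_given_ctx = 0.05      # p(York | is a city)        -- BERT: independent
p_york_given_new = 0.60      # p(York | New, is a city)   -- XLNet: sees "New"

bert_ll = math.log(p_new_given_ctx) + math.log(p_york_given_ctx)    # Eq. (16)
xlnet_ll = math.log(p_new_given_ctx) + math.log(p_york_given_new)   # Eq. (17)
# XLNet's objective captures the dependency between the masked tokens,
# so under these numbers xlnet_ll > bert_ll.
```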
References
[1] Z. Yang, Z. Dai, et al., "XLNet: Generalized autoregressive pretraining for language understanding," in 33rd Conference on Neural Information Processing Systems (NeurIPS), 2019.
[2] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," 2019.
[3] "Transformer-XL Explained: Combining Transformers and RNNs into a State-of-the-art Language Model," Towards Data Science. https://towardsdatascience.com/transformer-xl-explained-combining-transformers-and-rnns-into-a-state-of-the-art-language-model-c0cfe9e5a924