XLNet
Generalized Autoregressive Pretraining for Language Understanding
by Zhilin Yang, Zihang Dai, et al.
Presented by:
V S Siva Kumar Lakkoju
CS2139: Information Retrieval
MTech CS
Indian Statistical Institute, Kolkata
June 23, 2022
Outline
• Introduction
• Proposed Method
• Design of XLNet
• HyperParameter
• Results
Transformer XL [2]
Increases context through segment-level recurrence and a relative positional encoding scheme:
• Caches and reuses the previous segment's hidden states
• Allows variable-length context, which helps capture long-term dependencies
• Resolves the problem of context fragmentation
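The caching idea above can be sketched in a few lines. This is a toy stand-in (random linear map instead of attention, hypothetical function name), not the actual Transformer-XL layer; it only illustrates how the previous segment's hidden states are cached and prepended as fixed context for the next segment.

```python
import numpy as np

def segment_recurrence(tokens, seg_len, d_model=8, rng=None):
    """Toy sketch of Transformer-XL segment-level recurrence:
    process a long sequence segment by segment, reusing the previous
    segment's cached hidden states as extra (non-trainable) context."""
    rng = rng or np.random.default_rng(0)
    W = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    memory = None                      # cached hidden states of the previous segment
    outputs = []
    for start in range(0, len(tokens), seg_len):
        seg = tokens[start:start + seg_len]
        h = rng.standard_normal((len(seg), d_model))   # toy token embeddings
        # Extended context = [cached memory ; current segment]
        ctx = h if memory is None else np.concatenate([memory, h], axis=0)
        h_out = np.tanh(ctx @ W)[-len(seg):]  # keep states for current tokens only
        memory = h_out.copy()          # cache; no gradient flows through it
        outputs.append(h_out)
    return np.concatenate(outputs, axis=0)
```

Because the memory is detached, each segment's cost stays bounded while the effective context grows with every cached segment.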
– XLNet 7/35
The Idea
• Permutation applies only to the factorization order, not the original sequence order
• Attention masks provide the context for each prediction
• Two-stream self-attention makes each prediction aware of its target position
• Partial prediction: only predict the last 1/K tokens in each permutation
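The bullets above can be sketched as follows. This is a minimal illustration (the function name and layout are my own, not from the paper) of sampling a factorization order, deriving the attention mask that defines each token's context, and selecting the last 1/K positions as prediction targets.

```python
import numpy as np

def permutation_masks(seq_len, K, rng=None):
    """Sample a factorization order z and build a content-attention mask.
    mask[i, j] = 1 means position i may attend to position j, i.e. j comes
    earlier in the factorization order. Only the last seq_len // K positions
    of the order are prediction targets (partial prediction)."""
    rng = rng or np.random.default_rng(0)
    z = rng.permutation(seq_len)        # permute factorization order, not word order
    rank = np.empty(seq_len, dtype=int)
    rank[z] = np.arange(seq_len)        # rank[pos] = index of pos within z
    # Position i sees position j iff j precedes i in the factorization order.
    mask = (rank[None, :] < rank[:, None]).astype(int)
    targets = z[-(seq_len // K):]       # last 1/K tokens are predicted
    return z, mask, targets
```

Note the token order of the input is untouched; only the mask changes per sampled permutation, which is exactly why positional encodings remain consistent.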
How to reparameterize?
The standard softmax does not work, because it ignores the target position $z_t$:

$$P(X_{z_t} = x \mid \mathbf{x}_{z_{<t}}) = \frac{\exp\left(e(x)^\top h_\theta(\mathbf{x}_{z_{<t}})\right)}{\sum_{x'} \exp\left(e(x')^\top h_\theta(\mathbf{x}_{z_{<t}})\right)} \quad (10)$$

Solution: incorporate the target position $z_t$ into the hidden state:

$$P(X_{z_t} = x \mid \mathbf{x}_{z_{<t}}) = \frac{\exp\left(e(x)^\top g_\theta(\mathbf{x}_{z_{<t}}, z_t)\right)}{\sum_{x'} \exp\left(e(x')^\top g_\theta(\mathbf{x}_{z_{<t}}, z_t)\right)} \quad (11)$$

This is implemented using two-stream self-attention.
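Given a hidden state that already encodes the target position, Eq. (11) reduces to an ordinary softmax over embedding dot products. A minimal sketch (function name is my own; `E` and `g` stand in for the embedding matrix $e(\cdot)$ and the query-stream output $g_\theta$):

```python
import numpy as np

def target_aware_softmax(E, g):
    """Target-aware distribution of Eq. (11).
    E: (V, d) token embedding matrix e(x).
    g: (d,) hidden state g_theta(x_{z<t}, z_t) that encodes the target position.
    Returns P(X_{z_t} = x | x_{z<t}) over the vocabulary."""
    logits = E @ g               # e(x)^T g for every candidate token x
    logits -= logits.max()       # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```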
Information about specific datasets
• RACE dataset: 512 sequence length
• SQuAD: during finetuning on SQuAD 2.0, logistic regression was jointly applied to predict whether a question is answerable
• Layer-wise decay: if the learning rate of the 24th (top) layer is $l$ and the decay rate is $q$, then the learning rate of layer $m$ is

$$\mathrm{lr}_m = l \cdot q^{\,24 - m} \quad (15)$$
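Eq. (15) translates directly into code. A small sketch (function name is my own; the example values of `top_lr` and `decay` are illustrative, not the paper's hyperparameters):

```python
def layerwise_lr(top_lr, decay, num_layers=24):
    """Layer-wise learning-rate decay (Eq. 15): layer m gets
    top_lr * decay**(num_layers - m), so lower layers update more slowly."""
    return {m: top_lr * decay ** (num_layers - m)
            for m in range(1, num_layers + 1)}
```

For example, with `top_lr=1e-5` and `decay=0.75`, layer 24 trains at 1e-5, layer 23 at 7.5e-6, and so on down the stack, preserving more of the pretrained lower layers.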
XLNet vs BERT
Input sequence: New York is a city
XLNet factorization order: [is, a, city, New, York]
For BERT,

$$\log p(\text{New York} \mid \text{is a city}) = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{is a city}) \quad (16)$$

For XLNet,

$$\log p(\text{New York} \mid \text{is a city}) = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{New, is a city}) \quad (17)$$
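A tiny numeric illustration of Eqs. (16) and (17). The probabilities below are hypothetical toy values, not from any trained model; they only show that conditioning "York" on "New" (XLNet) can assign a much higher likelihood than treating the two masked tokens as independent (BERT).

```python
import math

# Hypothetical toy probabilities (illustrative only, not from any model)
p_new_given_ctx = 0.20       # p(New  | is a city)
p_york_given_ctx = 0.05      # p(York | is a city)        -- BERT: independent
p_york_given_new = 0.60      # p(York | New, is a city)   -- XLNet: sees "New"

bert_ll = math.log(p_new_given_ctx) + math.log(p_york_given_ctx)    # Eq. (16)
xlnet_ll = math.log(p_new_given_ctx) + math.log(p_york_given_new)   # Eq. (17)
# XLNet's objective captures the dependency between the masked tokens,
# so under these numbers xlnet_ll > bert_ll.
```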
References
[1] Z. Yang, Z. Dai, et al., "XLNet: Generalized autoregressive pretraining for language understanding," in 33rd Conference on Neural Information Processing Systems (NeurIPS), 2019.
[2] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," 2019.
[3] "Transformer-XL Explained: Combining Transformers and RNNs into a State-of-the-art Language Model," Towards Data Science. https://towardsdatascience.com/transformer-xl-explained-combining-transformers-and-rnns-into-a-state-of-the-art-language-model-c0cfe9e5a924