Transformer-XL: Attentive Language Models
Beyond a Fixed-Length Context
San Kim
2019. 08. 23
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov
Transformer architecture [1] (model overview figures omitted)
Transformer architecture [1]
Example input
• Input tokens: [CLS] family isn ' t about whose blood you have . it ' s about who you care about . [SEP] [pad] [pad] [pad]
• Input ids: 101 11214 65148 112 162 10935 17060 15465 10855 10574 119 10197 112 161 10935 10488 10855 11258 10935 119 102 0 0 0
• Embedding: each id is mapped to a column of a d_emb × seq_len embedding matrix (numeric example omitted)
Transformer architecture [1]
Positional Encoding (sinusoidal or learned positional embedding): a d_emb × seq_len matrix
Word embedding: a d_emb × seq_len matrix
Positional Encoding + Word embedding: their element-wise sum, also d_emb × seq_len (numeric example omitted)
Transformer architecture [1]
Sinusoidal positional encoding
• For any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.
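A minimal NumPy sketch of this encoding (the function name sinusoidal_pe and the dimensions are illustrative, not from the slides): even dimensions get sine, odd dimensions get cosine, and each frequency's sin/cos pair is what makes PE_{pos+k} a fixed linear function of PE_{pos}.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_emb):
    """Sinusoidal positional encoding from [1]; returns a (seq_len, d_emb) array."""
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    i = np.arange(d_emb // 2)[None, :]                   # (1, d_emb // 2)
    angles = pos / np.power(10000.0, 2.0 * i / d_emb)    # one frequency per sin/cos pair
    pe = np.zeros((seq_len, d_emb))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

pe = sinusoidal_pe(seq_len=24, d_emb=768)                # d_emb assumed even
```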
Transformer architecture [1]
Sinusoidal positional embeddings
• A model trained on a memory of a certain length can automatically generalize to a memory several times longer during evaluation.
Learned positional embeddings
• A lookup table over a fixed number of positions; they cannot extrapolate beyond the context length seen in training.
Transformer architecture [1]
Positional Encoding + Word embedding Emb_q: d_emb × seq_len
W_q: d_q × d_emb, B_q: 1 × d_q
Q_proj = Emb_q^T W_q^T + B_q^T
K_proj = Emb_k^T W_k^T + B_k^T
V_proj = Emb_v^T W_v^T + B_v^T
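A hedged NumPy sketch of these projections. It assumes a row-major (seq_len × d_emb) layout instead of the slide's transposed (d_emb × seq_len) layout; the variable names and the 0.02 initialization scale are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_emb, d_q = 24, 768, 64

emb = rng.normal(size=(seq_len, d_emb))        # positional encoding + word embedding
W_q = 0.02 * rng.normal(size=(d_q, d_emb))     # learned projection weights
W_k = 0.02 * rng.normal(size=(d_q, d_emb))
W_v = 0.02 * rng.normal(size=(d_q, d_emb))
b_q, b_k, b_v = np.zeros(d_q), np.zeros(d_q), np.zeros(d_q)

Q_proj = emb @ W_q.T + b_q                     # (seq_len, d_q)
K_proj = emb @ W_k.T + b_k                     # (seq_len, d_k)
V_proj = emb @ W_v.T + b_v                     # (seq_len, d_v)
```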
Transformer architecture [1]
Q_proj: d_q × seq_len, K_proj: d_k × seq_len (numeric examples omitted)
attention_scores = Q_proj^T K_proj / √d_q, with d_q = d_k
Transformer architecture [1]
att_probs = softmax(att_scores)
Transformer architecture [1]
att_probs: seq_len_q × seq_len_k (columns for [pad] keys receive zero probability)
V_proj: d_v × seq_len_v
context_vectors = att_probs × V_proj^T, with seq_len_k = seq_len_v
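Putting the last few slides together, a minimal sketch of scaled dot-product attention in NumPy (the helper name, the boolean pad_mask argument, and the -1e9 masking constant are implementation choices, not from the slides):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, pad_mask=None):
    """att_probs = softmax(Q K^T / sqrt(d_q)); context_vectors = att_probs V.
    Q: (seq_len_q, d_q), K: (seq_len_k, d_q), V: (seq_len_k, d_v)."""
    d_q = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_q)                   # (seq_len_q, seq_len_k)
    if pad_mask is not None:                          # True where the key is [pad]
        scores = np.where(pad_mask[None, :], -1e9, scores)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)        # each row sums to 1
    return probs @ V                                  # (seq_len_q, d_v)
```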
Transformer architecture [1]
Family isn't about whose blood you have. It's about who you care about.
Example attention weights (each query token's context vector is a weighted sum of value vectors):
you_q = 0.63 × have_v + 0.08 × blood_v + 0.07 × whose_v + ⋯
who_q = 0.32 × you_v + 0.24 × about_v + 0.11 × have_v + ⋯
you_q = 0.34 × who_v + 0.27 × care_v + ⋯
Transformer architecture [1]
context_vectors: d_v × seq_len_q
att_layer_out: d_emb × seq_len_q
att_layer_out = concat({context_vectors_i | i ∈ [0, N), i < num_head})
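A self-contained sketch of the multi-head version. It assumes the per-head projections are packed into single weight matrices and includes the output projection W_o from [1], which the slide folds into the concatenation; all names are illustrative.

```python
import numpy as np

def multi_head_attention(emb, W_q, W_k, W_v, W_o, num_heads, d_head):
    """emb: (seq_len, d_emb); W_q/W_k/W_v: (num_heads * d_head, d_emb);
    W_o: (d_emb, num_heads * d_head). Each head attends independently, the
    per-head context vectors are concatenated, then projected back to d_emb."""
    seq_len = emb.shape[0]

    def split_heads(x):  # (seq_len, num_heads * d_head) -> (num_heads, seq_len, d_head)
        return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (split_heads(emb @ W.T) for W in (W_q, W_k, W_v))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    context = probs @ V                                    # (heads, seq, d_head)
    context = context.transpose(1, 0, 2).reshape(seq_len, num_heads * d_head)
    return context @ W_o.T                                 # att_layer_out: (seq_len, d_emb)
```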
Transformer architecture [1]
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
x: seq_len_q × d_emb
W_1: d_emb × d_inter, b_1: 1 × d_inter
W_2: d_inter × d_emb, b_2: 1 × d_emb
Each sub-layer is wrapped as LayerNorm(x + dropout(sublayer(x)))
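A sketch of the position-wise FFN and the post-norm residual wrapper. layer_norm omits the learned gain and bias for brevity, and the inverted-dropout mask is only applied when training=True; names are illustrative.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2.
    x: (seq_len_q, d_emb), W1: (d_emb, d_inter), W2: (d_inter, d_emb)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer_connection(x, sublayer, dropout_p=0.1, training=False):
    """LayerNorm(x + dropout(sublayer(x))), the post-norm residual used in [1]."""
    y = sublayer(x)
    if training:
        keep = (np.random.rand(*y.shape) > dropout_p) / (1.0 - dropout_p)
        y = y * keep
    return layer_norm(x + y)
```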
Transformer architecture [1]
The memory cost for scaled
dot-product attention is
quadratic w.r.t. the sequence
length.
Transformer architecture [1]
Pros
• It enables the learning of long-term dependencies.
• It is less affected by the vanishing-gradient problem than RNNs.
Cons
• The model cannot capture any dependency longer than the predefined context length.
• The model lacks the contextual information needed to predict the first few symbols of a segment well (context fragmentation).
• Longer sequences are disproportionately expensive because attention is quadratic in the sequence length.
Transformer-XL addresses these issues with a segment-level recurrence mechanism and a novel positional encoding scheme, which "not only enables capturing longer-term dependency, but also resolves the context fragmentation problem" [2].
Vanilla Transformer
Vanilla Transformer with a fixed-length context at training time.
Vanilla Transformer with a fixed-
length context at evaluation time.
• Context fragmentation: information never flows across segments.
• Dependency length is upper-bounded by the segment length.
• At evaluation, the whole segment has to be processed from scratch to predict each token.
• This evaluation procedure is extremely expensive.
Transformer XL (extra long) [2]
Transformer-XL with segment-level
recurrence at training time.
Transformer-XL with segment-level
recurrence at evaluation time.
• It can capture dependencies longer than the segment length.
• It is faster than the vanilla Transformer during evaluation.
Transformer XL (extra long) [2]
Major contributions
• Segment-Level Recurrence with State Reuse
• Relative Positional Encodings
Embedding and loss
• Adaptive Input representation
• Adaptive softmax
Transformer XL (extra long) [2]
Segment-Level Recurrence with State Reuse
Transformer XL (extra long) [2]
Segment-Level Recurrence with State Reuse
h̃_{τ+1}^{n−1} = [SG(h_τ^{n−1}) ∘ h_{τ+1}^{n−1}],
q_{τ+1}^n, k_{τ+1}^n, v_{τ+1}^n = h_{τ+1}^{n−1} W_q^T, h̃_{τ+1}^{n−1} W_k^T, h̃_{τ+1}^{n−1} W_v^T,
h_{τ+1}^n = Transformer-Layer(q_{τ+1}^n, k_{τ+1}^n, v_{τ+1}^n).
Let the two consecutive segments of length L be s_τ = [x_{τ,1}, ⋯, x_{τ,L}] and s_{τ+1} = [x_{τ+1,1}, ⋯, x_{τ+1,L}], respectively.
Denote the n-th layer hidden state sequence produced for the τ-th segment s_τ by h_τ^n ∈ ℝ^{L×d}, where d is the hidden dimension.
SG(·): stop-gradient.
h_u ∘ h_v: the concatenation of two hidden sequences along the length dimension.
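A minimal sketch of the state-reuse step, assuming fixed-length segments cached in a rolling memory. NumPy has no autograd, so SG(·) is only indicated by the copy; in a real framework the cached states would be detached from the graph. The function name is mine.

```python
import numpy as np

def extend_with_memory(h_prev, h_curr, mem_len):
    """h_tilde = [SG(h_prev) o h_curr]: concatenate the cached hidden states of
    the previous segment with the current segment along the length dimension.
    h_prev, h_curr: (L, d) -> (mem_len + L, d)."""
    memory = h_prev[-mem_len:].copy()   # SG(.): treated as a constant, no gradient
    return np.concatenate([memory, h_curr], axis=0)

# Queries are computed from the current segment only, while keys and values use
# the extended sequence, e.g.:
#   h_ext = extend_with_memory(h_prev, h_curr, mem_len)
#   q, k, v = h_curr @ W_q.T, h_ext @ W_k.T, h_ext @ W_v.T
```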
Transformer XL (extra long) [2]
Segment-Level Recurrence with State Reuse
(Figure: the layer-n hidden states h_{τ+1}^n of the current segment are computed from h_{τ+1}^{n−1} together with the cached h_τ^{n−1} of the previous segment.)
Transformer XL (extra long) [2]
Absolute positional encodings (sinusoidal)
A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j
Relative positional encodings
A_{i,j}^{rel} = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i−j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i−j}
Changes from the absolute form:
• U_j → R_{i−j}
• U_i^T W_q^T → trainable parameters u ∈ ℝ^d and v ∈ ℝ^d
• W_k → W_{k,E} and W_{k,R}, for the content term E_{x_j} and the positional term R_{i−j} respectively
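For a single (i, j) pair, the four terms can be written out directly; a hedged sketch (names are mine; in practice all pairs are computed with matrix products plus the shift trick on the following slides):

```python
import numpy as np

def rel_attention_score(E_xi, E_xj, R_rel, W_q, W_kE, W_kR, u, v):
    """A_{i,j}^{rel} as the sum of the four terms above.
    E_xi, E_xj: word embeddings (d,); R_rel: sinusoidal encoding of the offset i-j (d,);
    W_q, W_kE, W_kR: (d, d); u, v: trainable global biases (d,)."""
    q = W_q @ E_xi
    a = q @ (W_kE @ E_xj)     # (a) content-based addressing
    b = q @ (W_kR @ R_rel)    # (b) content-dependent positional bias
    c = u @ (W_kE @ E_xj)     # (c) global content bias
    d = v @ (W_kR @ R_rel)    # (d) global positional bias
    return a + b + c + d
```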
Transformer XL (extra long) [2]
h̃_τ^{n−1} = [SG(m_τ^{n−1}) ∘ h_τ^{n−1}]
q_τ^n, k_τ^n, v_τ^n = h_τ^{n−1} (W_q^n)^T, h̃_τ^{n−1} (W_{k,E}^n)^T, h̃_τ^{n−1} (W_v^n)^T
A_{τ,i,j}^n = (q_{τ,i}^n)^T k_{τ,j}^n + (q_{τ,i}^n)^T W_{k,R}^n R_{i−j} + u^T k_{τ,j} + v^T W_{k,R}^n R_{i−j}
a_τ^n = Masked_Softmax(A_τ^n) v_τ^n
o_τ^n = LayerNorm(Linear(a_τ^n) + h_τ^{n−1})
h_τ^n = Positionwise_Feed_Forward(o_τ^n)
with h_τ^0 := E_{s_τ}, the word embedding sequence of segment s_τ.
Transformer XL (extra long) [2]
Efficient Computation of the Attention with Relative Positional Embedding
The positional terms E_{x_i}^T W_q^T W_{k,R} R_{i−j} and v^T W_{k,R} R_{i−j} can be computed for all (i, j) at once by stacking the relative encodings into a matrix Q with the relative positions in reversed order:
Q_k = W_{k,R} R_{M+L−1−k}
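One common way to realize this is to multiply the queries against Q (relative positions in reversed order) and then realign the result with a pad-reshape-trim "relative shift", in the spirit of Appendix B of [2]. A hedged NumPy sketch (the function name is mine):

```python
import numpy as np

def rel_shift(bd):
    """bd: (qlen, klen) matrix of q_i^T W_{k,R} R terms computed against the
    reversed-order Q above. Padding one zero column, reshaping, and trimming
    realigns the rows so that entry (i, j) holds the term for the relative
    distance between query i and key j, without gathering R_{i-j} explicitly."""
    qlen, klen = bd.shape
    padded = np.concatenate([np.zeros((qlen, 1)), bd], axis=1)  # (qlen, klen + 1)
    padded = padded.reshape(klen + 1, qlen)                     # rows slide by one
    return padded[1:].reshape(qlen, klen)
```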
Transformer XL (extra long) [2]
Efficient Computation of the Attention with Relative Positional Embedding (remaining derivation steps are figures; omitted here)
Transformer XL (extra long) [2] (figure/table-only slides; content omitted)
Transformer XL (extra long) [2] Ablation study (table omitted)
Efficient softmax approximation for GPUs [3]
Most of the probability mass is covered by a small fraction of the
dictionary, e.g., 87% of the document is covered by only 20% of the
vocabulary in the Penn TreeBank.
Hierarchical softmax. [Figure adapted from Hugo Larochelle's YouTube lectures.]
• Hierarchical softmax
• Differentiated softmax
• Importance sampling
• Negative sampling
• Noise sampling
Efficient softmax approximation for GPUs [3]
Notation
• B: batch size
• k = |𝒱|: cardinality of the total vocabulary
• g(k) = max(c + λk_0, c + λk) = c_m + max(0, λ(k − k_0)): computation time
1. The computation time g(k) is constant for low values of k, up to an inflection point k_0 ≈ 50, and becomes affine for values k > k_0.
2. Empirically, c_m = 0.40 ms on a K40 and 0.22 ms on an M40.
Efficient softmax approximation for GPUs [3]
Two-cluster partition: head 𝒱_h and tail 𝒱_t
Notation
• 𝒱_h: the word set of the head
• 𝒱_t: the word set of the tail
• k_h = |𝒱_h|, k_t = |𝒱_t|
• p_i: the probability that a word occurs in the set 𝒱_i
C = g(k_h + 1, B) + g(k_t, p_t B)
Efficient softmax approximation for GPUs [3]
C_h = g(J + k_h, B)
∀i, C_i = g(k_i, p_i B)
C = g(J + k_h, B) + Σ_i g(k_i, p_i B).
Constraint: kB ≥ k_0 B_0
C = c + λB(J + k_h) + Σ_i (c + λ k_i p_i B)
  = (J + 1)c + λB [J + k_h + Σ_i p_i k_i].
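For concreteness, a tiny helper (all names are mine) that evaluates this cost model for a candidate partition:

```python
def cluster_cost(J, k_h, cluster_sizes, cluster_probs, B, c, lam):
    """C = (J + 1) c + lambda * B * [J + k_h + sum_i p_i k_i].
    J: number of tail clusters, k_h: head size, cluster_sizes/cluster_probs:
    size k_i and probability mass p_i of each tail cluster, B: batch size,
    c, lam: constants of the affine timing model g(k)."""
    tail = sum(p * k for p, k in zip(cluster_probs, cluster_sizes))
    return (J + 1) * c + lam * B * (J + k_h + tail)

# Illustrative numbers only: a 20k-word head plus tails of 30k and 200k words.
cost = cluster_cost(J=2, k_h=20000, cluster_sizes=[30000, 200000],
                    cluster_probs=[0.10, 0.03], B=512, c=0.4e-3, lam=1e-8)
```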
Efficient softmax approximation for GPUs [3]
p_i k_i + p_j k_j = p_i (k_i − k_j) + p_{i+j} k_j, where p_{i+j} = p_i + p_j
C = (J + 1)c + λB [J + k_h + Σ_i p_i k_i]
Assume that k_i > k_j, and fix the quantities p_{i+j}, k_i and k_j.
The best strategy is then trivially to minimize the probability p_i of the larger cluster 𝒱_i.
For a fixed number of clusters of given sizes, the best strategy is to assign the words by decreasing probability to clusters of increasing size (see the sketch below).
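The assignment itself is then a single pass over the frequency-sorted vocabulary; a hypothetical sketch:

```python
def assign_clusters(word_freqs, cluster_sizes):
    """Assign words, sorted by decreasing frequency, to clusters of increasing
    size (head first). word_freqs: dict word -> count; cluster_sizes: e.g.
    [k_h, k_1, ..., k_J] in increasing order; returns one word list per cluster."""
    ranked = sorted(word_freqs, key=word_freqs.get, reverse=True)
    clusters, start = [], 0
    for size in cluster_sizes:
        clusters.append(ranked[start:start + size])
        start += size
    return clusters
```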
Efficient softmax approximation for GPUs [3] (remaining slides are figures; omitted here)
Adaptive Input Representations [4]
(Diagram) Linear_{𝒱_1}, Linear_{𝒱_2}, ⋯, Linear_{𝒱_n} feed Output layer_{𝒱_1}, Output layer_{𝒱_2}, ⋯, Output layer_{𝒱_n}:
• the projections shrink geometrically: d for 𝒱_1, d → d/k_1 for 𝒱_2, ⋯, d → d/k^{n−1} for 𝒱_n
• the output layers cover |𝒱_1| + n − 1 entries (the head plus n − 1 cluster entries), |𝒱_2|, ⋯, |𝒱_n|
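A rough sketch of the adaptive input lookup, assuming per-cluster embedding tables of geometrically decreasing dimension and per-cluster linear projections back to d, in the spirit of [4]; the helper name, argument layout, and cluster boundaries are all illustrative.

```python
import numpy as np

def adaptive_input(token_id, cluster_bounds, emb_tables, proj):
    """Look up a token in the embedding table of its frequency cluster and
    project it back to the model dimension d.
    cluster_bounds: cumulative vocabulary boundaries, e.g. [20000, 60000, 260000];
    emb_tables[i]: (cluster_size_i, d_i) with d_i = d / k**i;
    proj[i]: (d, d_i) projection back to d."""
    lo = 0
    for i, hi in enumerate(cluster_bounds):
        if token_id < hi:
            return proj[i] @ emb_tables[i][token_id - lo]   # (d,)
        lo = hi
    raise ValueError("token_id outside the vocabulary")
```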
References
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
[2] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860, 2019.
[3] Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. Efficient softmax approximation for GPUs. CoRR, abs/1609.04309, 2016.
[4] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. CoRR, abs/1809.10853, 2018.