Transformer-XL: Attentive Language Models
Beyond a Fixed-Length Context
San Kim
2019. 08. 23
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov
Transformer architecture [1] (model overview figures omitted)
Transformer architecture [1]
Example input
• Input tokens: [CLS] family isn ' t about whose blood you have . it ' s about who you care about . [SEP] [pad] [pad] [pad]
• Input ids: 101 11214 65148 112 162 10935 17060 15465 10855 10574 119 10197 112 161 10935 10488 10855 11258 10935 119 102 0 0 0
• Embedding: each id is mapped to a column of a d_emb × seq_len embedding matrix (numeric example omitted)
Transformer architecture [1]
Positional Encoding (sinusoidal or learned positional embedding): a d_emb × seq_len matrix
Word embedding: a d_emb × seq_len matrix
Positional Encoding + Word embedding: their element-wise sum, also d_emb × seq_len (numeric example omitted)
Transformer architecture [1]
Sinusoidal positional encoding
• For any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.
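A minimal NumPy sketch of this encoding (the function name sinusoidal_pe and the dimensions are illustrative, not from the slides): even dimensions get sine, odd dimensions get cosine, and each frequency's sin/cos pair is what makes PE_{pos+k} a fixed linear function of PE_{pos}.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_emb):
    """Sinusoidal positional encoding from [1]; returns a (seq_len, d_emb) array."""
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    i = np.arange(d_emb // 2)[None, :]                   # (1, d_emb // 2)
    angles = pos / np.power(10000.0, 2.0 * i / d_emb)    # one frequency per sin/cos pair
    pe = np.zeros((seq_len, d_emb))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

pe = sinusoidal_pe(seq_len=24, d_emb=768)                # d_emb assumed even
```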
Transformer architecture [1]
Sinusoidal positional embeddings
• A model trained on a memory of a certain length can automatically generalize to a memory several times longer during evaluation.
Learned positional embeddings
• A lookup table over a fixed number of positions; they cannot extrapolate beyond the context length seen in training.
Transformer architecture [1]
Positional Encoding + Word embedding Emb_q: d_emb × seq_len
W_q: d_q × d_emb, B_q: 1 × d_q
Q_proj = Emb_q^T W_q^T + B_q^T
K_proj = Emb_k^T W_k^T + B_k^T
V_proj = Emb_v^T W_v^T + B_v^T
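A hedged NumPy sketch of these projections. It assumes a row-major (seq_len × d_emb) layout instead of the slide's transposed (d_emb × seq_len) layout; the variable names and the 0.02 initialization scale are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_emb, d_q = 24, 768, 64

emb = rng.normal(size=(seq_len, d_emb))        # positional encoding + word embedding
W_q = 0.02 * rng.normal(size=(d_q, d_emb))     # learned projection weights
W_k = 0.02 * rng.normal(size=(d_q, d_emb))
W_v = 0.02 * rng.normal(size=(d_q, d_emb))
b_q, b_k, b_v = np.zeros(d_q), np.zeros(d_q), np.zeros(d_q)

Q_proj = emb @ W_q.T + b_q                     # (seq_len, d_q)
K_proj = emb @ W_k.T + b_k                     # (seq_len, d_k)
V_proj = emb @ W_v.T + b_v                     # (seq_len, d_v)
```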
Transformer architecture [1]
Q_proj: d_q × seq_len, K_proj: d_k × seq_len (numeric examples omitted)
attention_scores = Q_proj^T K_proj / √d_q, with d_q = d_k
Transformer architecture [1]
att_probs = softmax(att_scores)
Transformer architecture [1]
att_probs: seq_len_q × seq_len_k (columns for [pad] keys receive zero probability)
V_proj: d_v × seq_len_v
context_vectors = att_probs × V_proj^T, with seq_len_k = seq_len_v
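Putting the last few slides together, a minimal sketch of scaled dot-product attention in NumPy (the helper name, the boolean pad_mask argument, and the -1e9 masking constant are implementation choices, not from the slides):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, pad_mask=None):
    """att_probs = softmax(Q K^T / sqrt(d_q)); context_vectors = att_probs V.
    Q: (seq_len_q, d_q), K: (seq_len_k, d_q), V: (seq_len_k, d_v)."""
    d_q = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_q)                   # (seq_len_q, seq_len_k)
    if pad_mask is not None:                          # True where the key is [pad]
        scores = np.where(pad_mask[None, :], -1e9, scores)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)        # each row sums to 1
    return probs @ V                                  # (seq_len_q, d_v)
```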
Transformer architecture [1]
Family isn't about whose blood you have. It's about who you care about.
Example attention weights (each query token's context vector is a weighted sum of value vectors):
you_q = 0.63 × have_v + 0.08 × blood_v + 0.07 × whose_v + ⋯
who_q = 0.32 × you_v + 0.24 × about_v + 0.11 × have_v + ⋯
you_q = 0.34 × who_v + 0.27 × care_v + ⋯
Transformer architecture [1]
context_vectors: d_v × seq_len_q
att_layer_out: d_emb × seq_len_q
att_layer_out = concat({context_vectors_i | i ∈ [0, N), i < num_head})
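A self-contained sketch of the multi-head version. It assumes the per-head projections are packed into single weight matrices and includes the output projection W_o from [1], which the slide folds into the concatenation; all names are illustrative.

```python
import numpy as np

def multi_head_attention(emb, W_q, W_k, W_v, W_o, num_heads, d_head):
    """emb: (seq_len, d_emb); W_q/W_k/W_v: (num_heads * d_head, d_emb);
    W_o: (d_emb, num_heads * d_head). Each head attends independently, the
    per-head context vectors are concatenated, then projected back to d_emb."""
    seq_len = emb.shape[0]

    def split_heads(x):  # (seq_len, num_heads * d_head) -> (num_heads, seq_len, d_head)
        return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (split_heads(emb @ W.T) for W in (W_q, W_k, W_v))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    context = probs @ V                                    # (heads, seq, d_head)
    context = context.transpose(1, 0, 2).reshape(seq_len, num_heads * d_head)
    return context @ W_o.T                                 # att_layer_out: (seq_len, d_emb)
```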
Transformer architecture [1]
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
x: seq_len_q × d_emb
W_1: d_emb × d_inter, b_1: 1 × d_inter
W_2: d_inter × d_emb, b_2: 1 × d_emb
Each sub-layer is wrapped as LayerNorm(x + dropout(sublayer(x)))
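A sketch of the position-wise FFN and the post-norm residual wrapper. layer_norm omits the learned gain and bias for brevity, and the inverted-dropout mask is only applied when training=True; names are illustrative.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2.
    x: (seq_len_q, d_emb), W1: (d_emb, d_inter), W2: (d_inter, d_emb)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer_connection(x, sublayer, dropout_p=0.1, training=False):
    """LayerNorm(x + dropout(sublayer(x))), the post-norm residual used in [1]."""
    y = sublayer(x)
    if training:
        keep = (np.random.rand(*y.shape) > dropout_p) / (1.0 - dropout_p)
        y = y * keep
    return layer_norm(x + y)
```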
Transformer architecture [1]
The memory cost for scaled
dot-product attention is
quadratic w.r.t. the sequence
length.
Transformer architecture [1]
Pros
• It enables the learning of long-term dependencies.
• It is less affected by the vanishing-gradient problem than RNNs.
Cons
• The model cannot capture any dependency longer than the predefined context length.
• The model lacks the contextual information needed to predict the first few symbols of a segment well (context fragmentation).
• Longer sequences are disproportionately expensive because attention is quadratic in the sequence length.
Transformer-XL addresses these issues with a segment-level recurrence mechanism and a novel positional encoding scheme, which "not only enables capturing longer-term dependency, but also resolves the context fragmentation problem" [2].
Vanilla Transformer
Vanilla Transformer with a fixed-length context at training time.
Vanilla Transformer with a fixed-
length context at evaluation time.
• Context fragmentation: information never flows across segments.
• Dependency length is upper-bounded by the segment length.
• At evaluation, the whole segment has to be processed from scratch to predict each token.
• This evaluation procedure is extremely expensive.
Transformer XL (extra long) [2]
Transformer-XL with segment-level
recurrence at training time.
Transformer-XL with segment-level
recurrence at evaluation time.
• It can capture dependencies longer than the segment length.
• It is faster than the vanilla Transformer during evaluation.
Transformer XL (extra long) [2]
Major contributions
• Segment-Level Recurrence with State Reuse
• Relative Positional Encodings
Embedding and loss
• Adaptive Input representation
• Adaptive softmax
Transformer XL (extra long) [2]
Segment-Level Recurrence with State Reuse
Transformer XL (extra long) [2]
Segment-Level Recurrence with State Reuse
h̃_{τ+1}^{n−1} = [SG(h_τ^{n−1}) ∘ h_{τ+1}^{n−1}],
q_{τ+1}^n, k_{τ+1}^n, v_{τ+1}^n = h_{τ+1}^{n−1} W_q^T, h̃_{τ+1}^{n−1} W_k^T, h̃_{τ+1}^{n−1} W_v^T,
h_{τ+1}^n = Transformer-Layer(q_{τ+1}^n, k_{τ+1}^n, v_{τ+1}^n).
Let the two consecutive segments of length L be s_τ = [x_{τ,1}, ⋯, x_{τ,L}] and s_{τ+1} = [x_{τ+1,1}, ⋯, x_{τ+1,L}], respectively.
Denote the n-th layer hidden state sequence produced for the τ-th segment s_τ by h_τ^n ∈ ℝ^{L×d}, where d is the hidden dimension.
SG(·): stop-gradient.
h_u ∘ h_v: the concatenation of two hidden sequences along the length dimension.
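A minimal sketch of the state-reuse step, assuming fixed-length segments cached in a rolling memory. NumPy has no autograd, so SG(·) is only indicated by the copy; in a real framework the cached states would be detached from the graph. The function name is mine.

```python
import numpy as np

def extend_with_memory(h_prev, h_curr, mem_len):
    """h_tilde = [SG(h_prev) o h_curr]: concatenate the cached hidden states of
    the previous segment with the current segment along the length dimension.
    h_prev, h_curr: (L, d) -> (mem_len + L, d)."""
    memory = h_prev[-mem_len:].copy()   # SG(.): treated as a constant, no gradient
    return np.concatenate([memory, h_curr], axis=0)

# Queries are computed from the current segment only, while keys and values use
# the extended sequence, e.g.:
#   h_ext = extend_with_memory(h_prev, h_curr, mem_len)
#   q, k, v = h_curr @ W_q.T, h_ext @ W_k.T, h_ext @ W_v.T
```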
Transformer XL (extra long) [2]
Segment-Level Recurrence with State Reuse
(Figure: the layer-n hidden states h_{τ+1}^n of the current segment are computed from h_{τ+1}^{n−1} together with the cached h_τ^{n−1} of the previous segment.)
Transformer XL (extra long) [2]
Absolute positional encodings (sinusoidal)
A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j
Relative positional encodings
A_{i,j}^{rel} = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i−j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i−j}
Changes from the absolute form:
• U_j → R_{i−j}
• U_i^T W_q^T → trainable parameters u ∈ ℝ^d and v ∈ ℝ^d
• W_k → W_{k,E} and W_{k,R}, for the content term E_{x_j} and the positional term R_{i−j} respectively
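For a single (i, j) pair, the four terms can be written out directly; a hedged sketch (names are mine; in practice all pairs are computed with matrix products plus the shift trick on the following slides):

```python
import numpy as np

def rel_attention_score(E_xi, E_xj, R_rel, W_q, W_kE, W_kR, u, v):
    """A_{i,j}^{rel} as the sum of the four terms above.
    E_xi, E_xj: word embeddings (d,); R_rel: sinusoidal encoding of the offset i-j (d,);
    W_q, W_kE, W_kR: (d, d); u, v: trainable global biases (d,)."""
    q = W_q @ E_xi
    a = q @ (W_kE @ E_xj)     # (a) content-based addressing
    b = q @ (W_kR @ R_rel)    # (b) content-dependent positional bias
    c = u @ (W_kE @ E_xj)     # (c) global content bias
    d = v @ (W_kR @ R_rel)    # (d) global positional bias
    return a + b + c + d
```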
Transformer XL (extra long) [2]
h̃_τ^{n−1} = [SG(m_τ^{n−1}) ∘ h_τ^{n−1}]
q_τ^n, k_τ^n, v_τ^n = h_τ^{n−1} (W_q^n)^T, h̃_τ^{n−1} (W_{k,E}^n)^T, h̃_τ^{n−1} (W_v^n)^T
A_{τ,i,j}^n = (q_{τ,i}^n)^T k_{τ,j}^n + (q_{τ,i}^n)^T W_{k,R}^n R_{i−j} + u^T k_{τ,j} + v^T W_{k,R}^n R_{i−j}
a_τ^n = Masked_Softmax(A_τ^n) v_τ^n
o_τ^n = LayerNorm(Linear(a_τ^n) + h_τ^{n−1})
h_τ^n = Positionwise_Feed_Forward(o_τ^n)
with h_τ^0 := E_{s_τ}, the word embedding sequence of segment s_τ.
Transformer XL (extra long) [2]
Efficient Computation of the Attention with Relative Positional Embedding
The positional terms E_{x_i}^T W_q^T W_{k,R} R_{i−j} and v^T W_{k,R} R_{i−j} can be computed for all (i, j) at once by stacking the relative encodings into a matrix Q with the relative positions in reversed order:
Q_k = W_{k,R} R_{M+L−1−k}
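One common way to realize this is to multiply the queries against Q (relative positions in reversed order) and then realign the result with a pad-reshape-trim "relative shift", in the spirit of Appendix B of [2]. A hedged NumPy sketch (the function name is mine):

```python
import numpy as np

def rel_shift(bd):
    """bd: (qlen, klen) matrix of q_i^T W_{k,R} R terms computed against the
    reversed-order Q above. Padding one zero column, reshaping, and trimming
    realigns the rows so that entry (i, j) holds the term for the relative
    distance between query i and key j, without gathering R_{i-j} explicitly."""
    qlen, klen = bd.shape
    padded = np.concatenate([np.zeros((qlen, 1)), bd], axis=1)  # (qlen, klen + 1)
    padded = padded.reshape(klen + 1, qlen)                     # rows slide by one
    return padded[1:].reshape(qlen, klen)
```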
Transformer XL (extra long) [2]
Efficient Computation of the Attention with Relative Positional Embedding (remaining derivation steps are figures; omitted here)
Transformer XL (extra long) [2] (figure/table-only slides; content omitted)
Transformer XL (extra long) [2] Ablation study (table omitted)
Efficient softmax approximation for GPUs [3]
Most of the probability mass is covered by a small fraction of the
dictionary, e.g., 87% of the document is covered by only 20% of the
vocabulary in the Penn TreeBank.
Hierarchical softmax. [Figure adapted from Hugo Larochelle's YouTube lectures.]
• Hierarchical softmax
• Differentiated softmax
• Importance sampling
• Negative sampling
• Noise sampling
Efficient softmax approximation for GPUs [3]
Notation
• B: batch size
• k = |𝒱|: cardinality of the total vocabulary
• g(k) = max(c + λk_0, c + λk) = c_m + max(0, λ(k − k_0)): computation time
1. The computation time g(k) is constant for low values of k, up to an inflection point k_0 ≈ 50, and becomes affine for values k > k_0.
2. Empirically, c_m = 0.40 ms on a K40 and 0.22 ms on an M40.
Efficient softmax approximation for GPUs [3]
Two-cluster partition: head 𝒱_h and tail 𝒱_t
Notation
• 𝒱_h: the word set of the head
• 𝒱_t: the word set of the tail
• k_h = |𝒱_h|, k_t = |𝒱_t|
• p_i: the probability that a word occurs in the set 𝒱_i
C = g(k_h + 1, B) + g(k_t, p_t B)
Efficient softmax approximation for GPUs [3]
C_h = g(J + k_h, B)
∀i, C_i = g(k_i, p_i B)
C = g(J + k_h, B) + Σ_i g(k_i, p_i B).
Constraint: kB ≥ k_0 B_0
C = c + λB(J + k_h) + Σ_i (c + λ k_i p_i B)
  = (J + 1)c + λB [J + k_h + Σ_i p_i k_i].
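For concreteness, a tiny helper (all names are mine) that evaluates this cost model for a candidate partition:

```python
def cluster_cost(J, k_h, cluster_sizes, cluster_probs, B, c, lam):
    """C = (J + 1) c + lambda * B * [J + k_h + sum_i p_i k_i].
    J: number of tail clusters, k_h: head size, cluster_sizes/cluster_probs:
    size k_i and probability mass p_i of each tail cluster, B: batch size,
    c, lam: constants of the affine timing model g(k)."""
    tail = sum(p * k for p, k in zip(cluster_probs, cluster_sizes))
    return (J + 1) * c + lam * B * (J + k_h + tail)

# Illustrative numbers only: a 20k-word head plus tails of 30k and 200k words.
cost = cluster_cost(J=2, k_h=20000, cluster_sizes=[30000, 200000],
                    cluster_probs=[0.10, 0.03], B=512, c=0.4e-3, lam=1e-8)
```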
Efficient softmax approximation for GPUs [3]
p_i k_i + p_j k_j = p_i (k_i − k_j) + p_{i+j} k_j, where p_{i+j} = p_i + p_j
C = (J + 1)c + λB [J + k_h + Σ_i p_i k_i]
Assume that k_i > k_j, and fix the quantities p_{i+j}, k_i and k_j.
The best strategy is then trivially to minimize the probability p_i of the larger cluster 𝒱_i.
For a fixed number of clusters of given sizes, the best strategy is to assign the words by decreasing probability to clusters of increasing size (see the sketch below).
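The assignment itself is then a single pass over the frequency-sorted vocabulary; a hypothetical sketch:

```python
def assign_clusters(word_freqs, cluster_sizes):
    """Assign words, sorted by decreasing frequency, to clusters of increasing
    size (head first). word_freqs: dict word -> count; cluster_sizes: e.g.
    [k_h, k_1, ..., k_J] in increasing order; returns one word list per cluster."""
    ranked = sorted(word_freqs, key=word_freqs.get, reverse=True)
    clusters, start = [], 0
    for size in cluster_sizes:
        clusters.append(ranked[start:start + size])
        start += size
    return clusters
```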
Efficient softmax approximation for GPUs [3] (remaining slides are figures; omitted here)
Adaptive Input Representations [4]
(Diagram) Linear_{𝒱_1}, Linear_{𝒱_2}, ⋯, Linear_{𝒱_n} feed Output layer_{𝒱_1}, Output layer_{𝒱_2}, ⋯, Output layer_{𝒱_n}:
• the projections shrink geometrically: d for 𝒱_1, d → d/k_1 for 𝒱_2, ⋯, d → d/k^{n−1} for 𝒱_n
• the output layers cover |𝒱_1| + n − 1 entries (the head plus n − 1 cluster entries), |𝒱_2|, ⋯, |𝒱_n|
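A rough sketch of the adaptive input lookup, assuming per-cluster embedding tables of geometrically decreasing dimension and per-cluster linear projections back to d, in the spirit of [4]; the helper name, argument layout, and cluster boundaries are all illustrative.

```python
import numpy as np

def adaptive_input(token_id, cluster_bounds, emb_tables, proj):
    """Look up a token in the embedding table of its frequency cluster and
    project it back to the model dimension d.
    cluster_bounds: cumulative vocabulary boundaries, e.g. [20000, 60000, 260000];
    emb_tables[i]: (cluster_size_i, d_i) with d_i = d / k**i;
    proj[i]: (d, d_i) projection back to d."""
    lo = 0
    for i, hi in enumerate(cluster_bounds):
        if token_id < hi:
            return proj[i] @ emb_tables[i][token_id - lo]   # (d,)
        lo = hi
    raise ValueError("token_id outside the vocabulary")
```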
References
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
[2] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860, 2019.
[3] Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. Efficient softmax approximation for GPUs. CoRR, abs/1609.04309, 2016.
[4] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. CoRR, abs/1809.10853, 2018.