- The document presents a new formulation of the attention mechanism in Transformers using kernels. This allows defining attention over a larger space and integrating positional embeddings.
- Experiments on neural machine translation and sequence prediction tasks study different kernel forms like RBF and their combination with positional encodings.
- Results show that the RBF kernel performs best and that the resulting models achieve performance competitive with the state of the art at lower computational cost.
Paper Study: Transformer dissection
1. Transformer Dissection: A Unified Understanding of Transformer’s Attention via the Lens of Kernel
Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency and Ruslan Salakhutdinov
CMU, Kyoto University and RIKEN AIP
EMNLP 2019
2. Abstract
• Transformer is a powerful architecture that achieves superior performance in the NLP domain.
• Presents a new formulation of attention via the lens of kernels.
• Achieves performance competitive with current state-of-the-art models with less computation in the experiments.
3. Introduction
• Transformer is a relatively new architecture that outperforms traditional deep learning models such as RNNs and Temporal Convolutional Networks (TCNs) in the NLP and CV domains.
• Instead of performing recurrence or convolution, Transformer concurrently processes the entire sequence in a feed-forward manner.
4. Introduction (cont’d)
• At the core of the Transformer is its attention mechanism, which
can be seen as a weighted combination of the input sequence,
where the weights are determined by the similarities between
elements of the input sequence.
• The authors are inspired to connect Transformer’s attention to kernel learning because both compute similarities between elements of given sequences.
5. Introduction (cont’d)
• Develop a new variant of attention which considers a product of
symmetric kernels.
• Conduct experiments on neural machine translation and sequence prediction.
• Empirically study multiple kernel forms and find that the best
kernel is the RBF kernel.
7. Linear algebra (over the real numbers)
• Symmetric matrix
• 𝐴 = 𝐴⊤
• 𝐴 = 𝑄Λ𝑄⁻¹ = 𝑄Λ𝑄⊤, where 𝑄 is an orthogonal matrix
• Real eigenvalues
• For an 𝑚 × 𝑛 matrix 𝐴 and its transpose 𝐴⊤, 𝐴𝐴⊤ is a symmetric matrix.
• Proof: (𝐴𝐴⊤)⊤ = (𝐴⊤)⊤𝐴⊤ = 𝐴𝐴⊤
• Positive-definite matrix
• Also a symmetric matrix
• All eigenvalues are positive
• All leading principal minors (sub-determinants) are positive
Source from: MIT Linear Algebra - Symmetric matrices and positive definiteness
• E.g., [5 2; 2 3] is positive definite; [−1 0; 0 −3] is not (both eigenvalues negative).
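A minimal NumPy sketch (mine, not from the slides) checking these facts numerically; the matrices reuse the two examples above:

```python
# Numerical check of the linear-algebra facts above (illustration only).
import numpy as np

A = np.array([[5.0, 2.0],
              [2.0, 3.0]])                  # symmetric: A == A.T

# Spectral decomposition A = Q diag(lam) Q.T with orthogonal Q, real eigenvalues
lam, Q = np.linalg.eigh(A)
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)
print(lam)                                  # all positive -> A is positive definite

# A A^T is symmetric for any m x n matrix A
B = np.random.randn(3, 5)
assert np.allclose(B @ B.T, (B @ B.T).T)

C = np.array([[-1.0, 0.0],
              [0.0, -3.0]])
print(np.linalg.eigvalsh(C))                # all negative -> C is not positive definite
```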
8. Kernels
• A function 𝐾: 𝒳 × 𝒳 → ℝ is called a kernel over 𝒳.
• For any two points 𝑥, 𝑥′ ∈ 𝒳, 𝐾(𝑥, 𝑥′) is equal to an inner product of vectors Φ(𝑥) and Φ(𝑥′):
∀𝑥, 𝑥′ ∈ 𝒳, 𝐾(𝑥, 𝑥′) = ⟨Φ(𝑥), Φ(𝑥′)⟩,
for some mapping Φ: 𝒳 → ℍ to a *Hilbert space ℍ called a feature space.
Source from: Foundations of Machine Learning (2nd edition)
*Hilbert space: a vector space equipped with an inner product
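As a concrete illustration (my example, not the book’s): the degree-2 polynomial kernel on ℝ² has an explicit feature map, so the kernel value equals an inner product in feature space:

```python
# K(x, x') = <x, x'>^2 on R^2 with explicit feature map
# Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so K(x, x') = <Phi(x), Phi(x')>.
import numpy as np

def kernel(x, xp):
    return np.dot(x, xp) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(kernel(x, xp), np.dot(phi(x), phi(xp)))
```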
9. Kernel
• A kernel 𝐾: 𝒳 × 𝒳 → ℝ is said to be positive definite symmetric (PDS) if for any {𝑥₁, …, 𝑥ₘ} ⊆ 𝒳, the matrix 𝑲 = [𝐾(𝑥ᵢ, 𝑥ⱼ)]ᵢⱼ ∈ ℝ^(𝑚×𝑚) is symmetric positive semidefinite (SPSD).
• For a sample 𝑆 = (𝑥₁, …, 𝑥ₘ), 𝑲 = [𝐾(𝑥ᵢ, 𝑥ⱼ)]ᵢⱼ ∈ ℝ^(𝑚×𝑚) is called the kernel matrix or the Gram matrix associated to 𝐾 and the sample 𝑆.
Source from: Foundations of Machine Learning (2nd edition)
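A small sketch (assumptions mine: RBF kernel, random sample) building the Gram matrix and checking it is SPSD:

```python
# Gram matrix K = [K(x_i, x_j)]_ij of the RBF kernel for a sample S.
import numpy as np

def rbf(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

S = np.random.randn(6, 3)                      # sample of m = 6 points in R^3
K = np.array([[rbf(xi, xj) for xj in S] for xi in S])

assert np.allclose(K, K.T)                     # symmetric
assert np.linalg.eigvalsh(K).min() >= -1e-10   # PSD up to round-off
```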
12. Transformer
• Encoder-decoder model
• Layers:
• Embedding Layer
• Positional Encoding
• Encoder/Decoder
• Output Probability Layer
Source from Attention Is All You Need and Transformer Dissection
13. Attention
• Core inside Encoder/Decoder:
• Scaled Dot-Product Attention:
Attention(𝑄, 𝐾, 𝑉) = softmax(𝑄𝐾⊤ / √𝑑ₖ) 𝑉
• Encoder-encoder attention
• Decoder-decoder attention
• Encoder-decoder attention
Source from Attention Is All You Need and Transformer Dissection
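A minimal NumPy sketch of the scaled dot-product attention above (shapes are my assumptions: T_q queries and T_k key/value pairs of dimension d_k):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (T_q, T_k) scaled similarities
    return softmax(scores, axis=-1) @ V        # weighted combination of the values

Q, K, V = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 8)
out = attention(Q, K, V)                       # shape (4, 8)
```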
14. Multi-head attention
• Considers attention in different representation subspaces
MultiHead(𝑄, 𝐾, 𝑉) = Concat(head₁, ⋯, headₕ) 𝑊^𝑂
where headᵢ = Attention(𝑄𝑊ᵢ^𝑄, 𝐾𝑊ᵢ^𝐾, 𝑉𝑊ᵢ^𝑉)
Source from Attention Is All You Need and Transformer Dissection
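A sketch of multi-head attention reusing attention() from the previous block; the projections 𝑊ᵢ^𝑄, 𝑊ᵢ^𝐾, 𝑊ᵢ^𝑉, 𝑊^𝑂 are random stand-ins rather than learned weights:

```python
import numpy as np

def multi_head(Q, K, V, h=2, d_model=8):
    d_k = d_model // h
    heads = []
    for _ in range(h):
        # Per-head projections (random placeholders for learned parameters)
        Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))
        heads.append(attention(Q @ Wq, K @ Wk, V @ Wv))
    W_O = np.random.randn(h * d_k, d_model)        # output projection
    return np.concatenate(heads, axis=-1) @ W_O
```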
16. • Transformer’s attention is an order-agnostic operation with respect to the order of the inputs.
• Transformer therefore introduces positional embeddings to indicate the positional relations of the inputs.
• 𝒙 = [𝑥₁, 𝑥₂, ⋯, 𝑥_𝑇]
• 𝑥ᵢ = (𝑓ᵢ, 𝑡ᵢ) with
• 𝑓ᵢ ∈ ℱ a non-temporal feature (e.g., word representation, frame in a video, etc.)
• 𝑡ᵢ ∈ 𝒯 a temporal feature (e.g., sine and cosine functions)
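For reference, the sinusoidal temporal features of the original Transformer, PE(pos, 2j) = sin(pos / 10000^(2j/d)) and PE(pos, 2j+1) = cos(pos / 10000^(2j/d)), can be sketched as follows (assuming an even model dimension d):

```python
import numpy as np

def positional_encoding(T, d):
    # Sinusoidal encoding; row i is the temporal feature t_i (d must be even).
    pos = np.arange(T)[:, None]                # positions 0 .. T-1
    j = np.arange(0, d, 2)[None, :]            # even dimension indices
    angle = pos / np.power(10000.0, j / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angle)                # even dims: sine
    pe[:, 1::2] = np.cos(angle)                # odd dims: cosine
    return pe
```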
17. • Definition. Given a non-negative kernel function 𝑘(⋅,⋅): 𝒳 × 𝒳 → ℝ₊, a set filtering function 𝑀(⋅,⋅): 𝒳 × 𝒮 → 𝒮, and a value function 𝑣(⋅): 𝒳 → 𝒴, the Attention function taking the input of a query feature 𝑥_𝑞 ∈ 𝒳 is defined as
Attention(𝑥_𝑞; 𝑀(𝑥_𝑞, 𝑆_𝑥ₖ)) = Σ_{𝑥ₖ ∈ 𝑀(𝑥_𝑞, 𝑆_𝑥ₖ)} [ 𝑘(𝑥_𝑞, 𝑥ₖ) / Σ_{𝑥ₖ′ ∈ 𝑀(𝑥_𝑞, 𝑆_𝑥ₖ)} 𝑘(𝑥_𝑞, 𝑥ₖ′) ] 𝑣(𝑥ₖ)
• The set filtering function 𝑀(𝑥_𝑞, 𝑆_𝑥ₖ): 𝒳 × 𝒮 → 𝒮 returns the set of elements that operate with 𝑥_𝑞 (e.g., the mask in decoder self-attention).
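A direct transcription of this definition into NumPy (function names mine; M defaults to an identity filter that returns the whole set):

```python
import numpy as np

def kernel_attention(x_q, S, k, v, M=lambda x_q, S: S):
    keys = M(x_q, S)                              # set filtering function M(x_q, S_xk)
    scores = np.array([k(x_q, x_k) for x_k in keys])
    weights = scores / scores.sum()               # normalized kernel scores
    return sum(w * v(x_k) for w, x_k in zip(weights, keys))
```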
18. Attention in Transformer
• Recall the attention mechanism in the original Transformer:
Attention(𝑥_𝑞; 𝑆_𝑥ₖ) = softmax(𝑥_𝑞𝑊_𝑞 (𝑥ₖ𝑊ₖ)⊤ / √𝑑ₖ) 𝑥ₖ𝑊_𝑣
with 𝑥_𝑞 = 𝑓_𝑞 + 𝑡_𝑞, 𝑥ₖ = 𝑓ₖ + 𝑡ₖ
• Note that the input sequences are
• the same (𝑥_𝑞 = 𝑥ₖ) for self-attention
• different (𝑥_𝑞 from the decoder and 𝑥ₖ from the encoder) for encoder-decoder attention
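This softmax form is a special case of the kernel definition on the previous slide: with the (asymmetric) exponential kernel k(x_q, x_k) = exp(⟨x_q W_q, x_k W_k⟩ / √d_k) and value function v(x_k) = x_k W_v, the two coincide. A quick numerical check, reusing kernel_attention() and attention() from the sketches above (random weights, self-attention setting):

```python
import numpy as np

d_k = 8
Wq, Wk, Wv = (np.random.randn(d_k, d_k) for _ in range(3))
k = lambda xq, xk: np.exp((xq @ Wq) @ (xk @ Wk) / np.sqrt(d_k))
v = lambda xk: xk @ Wv

S = np.random.randn(5, d_k)                # key set S_xk; query drawn from it
x_q = S[0]
out_kernel = kernel_attention(x_q, S, k, v)
out_softmax = attention((x_q @ Wq)[None, :], S @ Wk, S @ Wv)[0]
assert np.allclose(out_kernel, out_softmax)
```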
26. Conclusions
• Present a kernel formulation of the attention mechanism in Transformer, which defines a larger space for designing attention.
• Study different kernel forms and ways to integrate positional embeddings on neural machine translation (NMT) and sequence prediction (SP) tasks.