Introduction
From Ashish Vaswani’s Talk
… the purpose is … not going to be just to talk about a particular model; the starting point is always asking a question: what is the structure in my dataset, or what are the symmetries in my dataset, and is there a model that has the inductive biases to model these properties that exist in my dataset?
… deep learning … is all about representation learning … building the right tools for learning representations is an important factor in achieving empirical success.
Abstract
The dominant sequence transduction models are based on complex recurrent
or convolutional neural networks that include an encoder and a decoder. The
best performing models also connect the encoder and decoder through an
attention mechanism. We propose a new simple network architecture, the
Transformer, based solely on attention mechanisms, dispensing with
recurrence and convolutions entirely. Experiments on two machine translation
tasks show these models to be superior in quality while being more
parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation
task, improving over the existing best results, including ensembles, by over 2
BLEU. We show that the Transformer generalizes well to other tasks by
applying it successfully to English constituency parsing both with large and
limited training data.
Background
Sequence-to-sequence with pure RNN
Sequence-to-sequence with Attention
What Attention Does in VQA
Transformer
Paradigm Shift in Seq2Seq!
Why not RNN?
The inherently sequential nature of recurrent models precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in the case of the latter. The fundamental constraint of sequential computation, however, remains.
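To make that constraint concrete, here is a minimal sketch of a toy vanilla RNN forward pass (function names and shapes are illustrative, not from the paper): each hidden state depends on the previous one, so the time steps must be computed one after another.

```python
import numpy as np

def rnn_forward(x, h0, W_xh, W_hh):
    """Toy vanilla RNN (illustrative). Each hidden state depends on the
    previous one, so the T time steps cannot be parallelized over time."""
    h, states = h0, []
    for x_t in x:                              # sequential loop over time steps
        h = np.tanh(x_t @ W_xh + h @ W_hh)     # h_t depends on h_{t-1}
        states.append(h)
    return np.stack(states)                    # (T, hidden_dim)
```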
Convolutional Neural Networks for Sentence Classification
In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions, whereas self-attention connects any two positions with a constant number of operations.
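A rough back-of-the-envelope illustration of that claim, assuming kernel width k = 3 and sequence length n = 1024 (numbers chosen purely for illustration):

```python
import math

# Layers needed to connect two positions n apart (illustrative estimate):
n, k = 1024, 3
conv_layers = math.ceil((n - 1) / (k - 1))   # stacked contiguous convolutions (ConvS2S-style)
dilated_layers = math.ceil(math.log(n, k))   # stacked dilated convolutions (ByteNet-style)
attention_layers = 1                         # self-attention relates all positions directly
print(conv_layers, dilated_layers, attention_layers)   # 512 7 1
```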
Transformer Architecture
Positional Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
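A minimal NumPy sketch of the sinusoidal encoding defined above (function and argument names are my own, and d_model is assumed to be even):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions get cosine
    return pe                                       # added to the input embeddings
```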
Self Attention
Attention(Q, K, V) = softmax(QK^T / √d_k) V
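A minimal NumPy sketch of the scaled dot-product attention above (the mask argument is an assumption added here so the same function can be reused for the masked decoder attention later):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)       # (..., len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)            # hide masked positions
    scores = scores - scores.max(axis=-1, keepdims=True) # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                          # output and attention weights
```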
Multi-Head Attention
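A minimal sketch of multi-head attention, reusing the scaled_dot_product_attention sketch above; the weight matrices, their shapes, and the function name are illustrative assumptions:

```python
import numpy as np

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o, n_heads):
    """Project queries/keys/values into n_heads subspaces, attend in each
    head independently, concatenate the heads, and project back to d_model."""
    def split_heads(x):                                  # (L, d_model) -> (n_heads, L, d_head)
        L, d_model = x.shape
        return x.reshape(L, n_heads, d_model // n_heads).transpose(1, 0, 2)

    Q = split_heads(X_q @ W_q)                           # queries from X_q
    K = split_heads(X_kv @ W_k)                          # keys and values from X_kv
    V = split_heads(X_kv @ W_v)
    heads, _ = scaled_dot_product_attention(Q, K, V)     # (n_heads, L_q, d_head)
    concat = heads.transpose(1, 0, 2).reshape(X_q.shape[0], -1)
    return concat @ W_o                                  # (L_q, d_model)
```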
Decoder: Masked Multi-Head Attention
The decoder is trained with teacher forcing, so future positions are masked: position i may attend only to positions up to i.
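A small sketch of the causal mask implied above, usable with the attention sketch earlier (the True-means-visible boolean convention is my own):

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i,
    so the decoder cannot peek at future target tokens during teacher forcing."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# e.g. scaled_dot_product_attention(Q, K, V, mask=causal_mask(Q.shape[-2]))
```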
Transformer Architecture
The final encoder outputs are used as the Key and Value vectors of the multi-head (encoder-decoder) attention in the decoder, while the Queries come from the decoder's previous layer.
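An illustrative usage of the multi_head_attention sketch above for this encoder-decoder attention: Queries come from the decoder states, Keys and Values from the final encoder output (all data and shapes here are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
encoder_output = rng.normal(size=(5, d_model))    # 5 source positions
decoder_states = rng.normal(size=(3, d_model))    # 3 target positions
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

cross = multi_head_attention(decoder_states, encoder_output, W_q, W_k, W_v, W_o, n_heads)
print(cross.shape)                                # (3, 8): one vector per target position
```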
Regularization
1. Label Smoothing
2. Residual Dropout
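A minimal sketch of label smoothing as commonly formulated (mixing the one-hot target with a uniform distribution; the paper sets eps = 0.1):

```python
import numpy as np

def smooth_labels(labels, n_classes, eps=0.1):
    """(1 - eps) * one_hot + eps / n_classes: the target distribution is
    softened so the model is not pushed toward fully confident predictions."""
    one_hot = np.eye(n_classes)[labels]
    return (1.0 - eps) * one_hot + eps / n_classes

# e.g. smooth_labels(np.array([2, 0]), n_classes=4) for two training tokens
```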
Results
References
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. “Attention Is All You Need.” NIPS 2017.
Paper link: https://arxiv.org/pdf/1706.03762.pdf
Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention
YouTube: https://youtu.be/5vcj8kSwBCY
Transformer (Attention Is All You Need) by Minsuk Heo
YouTube: https://youtu.be/mxGCEWOxfe8