Visual transformers
Leo Pauly
PhD student | Visual AI
Advisors: Prof. David Hogg, Prof. Raul Fuentes
University of Leeds, UK
Sutskever et al., NeurIPS 2014
Bahdanau et al., ICLR 2015
Vaswani et al., NeurIPS 2017
Dosovitskiy et al., ICLR 2021
Attention Mechanism
y_i = RNN(y_{i-1}, c, s_{i-1})
[Figure: encoder–decoder RNN; a single context vector c feeds the decoder states s_1, s_2, … producing outputs y_0, y_1, y_2, y_3]
Bahdanau et al., ICLR 2015

Problems with a single fixed context vector:
• Bottleneck at the context vector (c)
• Information loss
• Backpropagation issues
Attention Mechanism
With attention, the fixed c is replaced by a per-output-step context vector c_i computed from the encoder hidden states:
y_i = RNN(y_{i-1}, c_i, s_{i-1})
c_i = f(h_j), j = 1…T_x
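A minimal NumPy sketch of one way to realise c_i = f(h_j): additive (Bahdanau-style) scoring followed by a softmax-weighted sum of the encoder states. The weight matrices W_a, U_a and vector v_a are illustrative learned parameters (random here), not values from the slides.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_vector(s_prev, H, W_a, U_a, v_a):
    """Additive attention: c_i = sum_j alpha_ij * h_j.

    s_prev : previous decoder state s_{i-1}, shape (d,)
    H      : encoder hidden states h_1..h_Tx, shape (Tx, d)
    """
    # Alignment scores e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
    scores = np.tanh(s_prev @ W_a + H @ U_a) @ v_a   # (Tx,)
    alpha = softmax(scores)                          # attention weights over h_j
    return alpha @ H                                 # weighted sum -> c_i, shape (d,)

# Toy usage: Tx = 4 encoder states of dimension d = 3
rng = np.random.default_rng(0)
d, Tx = 3, 4
H = rng.normal(size=(Tx, d))
c_i = context_vector(rng.normal(size=d), H,
                     rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                     rng.normal(size=d))
print(c_i.shape)  # (3,)
```

Because alpha is recomputed from s_{i-1} at every decoding step, each output gets its own context vector instead of squeezing the whole input through one fixed c.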
Attention Mechanism
y_i = RNN(y_{i-1}, c_i, s_{i-1})
[Figure: decoder computing a separate context vector c_i for each output step]
Figure from: https://medium.datadriveninvestor.com/attention-in-rnns-321fbcd64f05

Attention Mechanism
[Figure: attention-based neural machine translation example, with input sentence x and output sentence y]
Figure from: https://trungtran.io/2019/03/29/neural-machine-translation-with-attention-mechanism/
Attention Is All You Need
Vaswani et al., NeurIPS 2017
• Scaled dot-product attention
• Multi-head attention
• Self-attention
Basics explained
Queries Y = (y_1, y_2, y_3); the keys and values both come from X = (x_1, x_2, x_3)
Attention map = Q · K^T (each entry scores how strongly query y_i attends to key x_j)
Output = (Q · K^T) · V
[Figure: Q multiplied by K^T gives the attention map; multiplying the map by V gives the output]
Translation example: the output words 'Je', 'suis', 'leo' (queries) attend over the input words 'I', 'am', 'Leo' (keys/values)
In full, the paper scales and normalises the map: Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
Attention Is All You Need
Self-attention: the queries, keys, and values are all derived from the same input X
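In self-attention, Q, K, and V are all linear projections of the same input X. A minimal sketch, assuming learned projection matrices W_q, W_k, W_v (random here for illustration):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Self-attention: Q, K, V are all projections of the same sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # row-wise softmax
    return w @ V                             # each token mixes information from all tokens

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))                          # 5 tokens, dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 8)
```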
Attention Is All You Need
Transformer Architecture
[Figure: the Transformer encoder–decoder architecture]
Vision Transformers
Dosovitskiy et al., ICLR 2021
Vision Transformers
The input image x is split into a sequence of N flattened patches: x_p = (x_p^1, …, x_p^N)
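The split of x into the patch sequence x_p can be done with a reshape and transpose. A sketch assuming a square image whose side is divisible by the patch size P (16 is the patch size used by ViT-Base; values here are toy data):

```python
import numpy as np

def patchify(img, P):
    """Split an H x W x C image into N = (H/P)*(W/P) flattened P x P patches."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0
    x = img.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/P, W/P, P, P, C)
    return x.reshape(-1, P * P * C)              # (N, P*P*C)

img = np.arange(32 * 32 * 3, dtype=float).reshape(32, 32, 3)
xp = patchify(img, 16)
print(xp.shape)  # (4, 768): N = 4 patches, each flattened to 16*16*3 = 768 values
```

Each row of `xp` is one patch x_p^i; a learned linear projection then maps each row to the model dimension before the transformer encoder.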
Vision Transformers
z_0 = [x_class; x_p^1 E; …; x_p^N E] + E_pos
z'_l = MSA(LN(z_{l-1})) + z_{l-1}
z_l = MLP(LN(z'_l)) + z'_l
(the encoder block is repeated L times, l = 1…L)
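The per-layer ViT updates, z'_l = MSA(LN(z_{l-1})) + z_{l-1} and z_l = MLP(LN(z'_l)) + z'_l, can be sketched in NumPy. Assumptions for brevity: single-head attention (the paper uses multi-head), ReLU instead of GELU, no learned LayerNorm scale/shift, and random weights.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    mu = z.mean(-1, keepdims=True)
    var = z.var(-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def msa(z, W_q, W_k, W_v):
    # Single-head self-attention (the paper uses multi-head; one head keeps this short)
    Q, K, V = z @ W_q, z @ W_k, z @ W_v
    s = Q @ K.T / np.sqrt(K.shape[-1])
    a = np.exp(s - s.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    return a @ V

def mlp(z, W1, W2):
    return np.maximum(z @ W1, 0) @ W2            # ReLU stand-in for the paper's GELU

def encoder_block(z, params):
    W_q, W_k, W_v, W1, W2 = params
    z = z + msa(layer_norm(z), W_q, W_k, W_v)    # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
    z = z + mlp(layer_norm(z), W1, W2)           # z_l  = MLP(LN(z'_l)) + z'_l
    return z

rng = np.random.default_rng(0)
D = 8
z = rng.normal(size=(5, D))                      # 5 tokens: class token + patches
params = [rng.normal(size=(D, D)) * 0.1 for _ in range(3)] + \
         [rng.normal(size=(D, 4 * D)) * 0.1, rng.normal(size=(4 * D, D)) * 0.1]
for _ in range(3):                               # "L times" (L = 3 here)
    z = encoder_block(z, params)
print(z.shape)  # (5, 8)
```

The pre-norm placement (LN inside the residual branch) matches the ViT equations; the token shape is preserved through every block.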
Vision Transformers
y = LN(z_L^0) (the class-token representation fed to the classification head)
Vision Transformers
Results
Vision Transformers
Insights
• Transformers vs CNNs: Is it worth the hype?
• MaaS?
• Higher resolutions?
Ref: https://youtu.be/TvVc1e_4648
Vision Transformers
Insights
• Can we do (un)self-supervised pre-training? (Goyal et al., arXiv 2021)
• Architecture-level unification across domains → multi-modal AI systems
Questions?

Introduction to Visual transformers