Visual Transformers
Kwanghee Choi (Jonas)
Table of Contents
● Preliminary
○ Key, Value, Query, Attention
○ Pooling
○ Multi-head Attention
○ Unsupervised Representation Learning
○ Syntactic Knowledge
● State-of-the-art Papers
○ Generative Pretraining from Pixels (ICML 2020)
○ An Image is Worth 16x16 Words (ICLR 2021)
○ End-to-End Object Detection with Transformers (ECCV 2020)
○ Additional Works
Key, Value, Query, Attention
● Problem: Given a set of data points (xᵢ, yᵢ), find the unknown y for a query x.
● Simplest approach: average all the yᵢ, ignoring x.
● A bit more complicated approach: Watson-Nadaraya Estimator (1964)
● Key, value pairs (xᵢ, yᵢ)
● Query x
● Attention ⍺
Reference. Attention in Deep Learning (Alex Smola, ICML 2019 Tutorials, http://alex.smola.org/talks/ICML19-attention.pdf )
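A minimal NumPy sketch of the Watson-Nadaraya estimator in attention terms (the Gaussian kernel and bandwidth are illustrative choices): the attention weights ⍺ are normalized similarities between the query x and the keys xᵢ, and the prediction is an attention-weighted pooling of the values yᵢ.

```python
import numpy as np

def nadaraya_watson(x_query, x_keys, y_values, bandwidth=1.0):
    """y(x) = sum_i alpha_i(x) * y_i, where alpha is a softmax over
    Gaussian-kernel similarities between the query and the keys."""
    # Kernel similarity between the query and every key.
    logits = -0.5 * ((x_query - x_keys) / bandwidth) ** 2
    # Attention weights: normalize similarities so they sum to 1.
    alpha = np.exp(logits) / np.exp(logits).sum()
    # Weighted pooling of the values.
    return alpha @ y_values

x_keys = np.linspace(0, 5, 50)
y_values = np.sin(x_keys) + 0.1 * np.random.randn(50)
print(nadaraya_watson(2.0, x_keys, y_values))
```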
Pooling
● Nonlinearity ⍴, ɸ, learnable weight w
● Deep sets (Zaheer et al. 2017)
○ Permutation Invariant
● Word2Vec (Mikolov et al. 2013)
○ Embed each word in a sentence
● Attention Weighting (Wang et al. 2016)
○ Query x depends on the context ⍺
● Iterative Attention Pooling (Yang et al. 2016)
○ Repeatedly update internal state qₜ
Reference. Attention in Deep Learning (Alex Smola, ICML 2019 Tutorials, http://alex.smola.org/talks/ICML19-attention.pdf )
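A minimal PyTorch sketch of the Deep Sets form f(X) = ⍴(Σᵢ ɸ(xᵢ)); the dimensions and two-layer networks are illustrative. Because sum pooling ignores element order, the whole model is permutation invariant.

```python
import torch
import torch.nn as nn

class DeepSets(nn.Module):
    """f(X) = rho(sum_i phi(x_i)): sum pooling makes it permutation invariant."""
    def __init__(self, in_dim=8, hidden=32, out_dim=1):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.rho = nn.Linear(hidden, out_dim)

    def forward(self, x):                 # x: (batch, set_size, in_dim)
        pooled = self.phi(x).sum(dim=1)   # pool over the set axis
        return self.rho(pooled)

x = torch.randn(4, 10, 8)
model = DeepSets()
perm = torch.randperm(10)
# Permuting the set elements leaves the output (numerically) unchanged.
assert torch.allclose(model(x), model(x[:, perm]), atol=1e-5)
```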
Multi-head Attention
● Attention module
○ Softmax acts as an attention function.
○ Dot product of Q and K acts as a similarity.
○ sqrt(dₖ): standard deviation of the dot product when Q, K ~ N(0, 1)
● Multi-head Attention
○ A single head limits the model's ability to focus on specific positions.
○ Multiple heads give attention layers different representation subspaces.
Attention Is All You Need (Vaswani et al. NeurIPS 2017)
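A minimal sketch of scaled dot-product and multi-head attention as described above (the input/output projections of the full Transformer layer are omitted for brevity; shapes are illustrative): dividing by sqrt(dₖ) keeps the logit variance near 1, and each head attends within its own slice of the model dimension.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

def multi_head(Q, K, V, n_heads):
    """Split the model dimension into n_heads subspaces, attend in each,
    then concatenate the heads back together."""
    def split(x):  # (batch, seq, d) -> (batch, heads, seq, d // heads)
        b, s, d = x.shape
        return x.view(b, s, n_heads, d // n_heads).transpose(1, 2)
    out = scaled_dot_product_attention(split(Q), split(K), split(V))
    b, h, s, dh = out.shape
    return out.transpose(1, 2).reshape(b, s, h * dh)

x = torch.randn(2, 5, 16)
print(multi_head(x, x, x, n_heads=4).shape)  # torch.Size([2, 5, 16])
```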
Unsupervised Representation Learning
● Input sequence x = (x₁, x₂, …)
● Autoregressive (AR)
○ ex) ELMo, GPT
○ No bidirectional context.
○ ELMo: needs to train forward and backward contexts separately.
● Auto Encoding (AE)
○ Corrupted input x′ = (x₁, x₂, …, [MASK], …)
○ ex) BERT
○ Bi-directional self-attention
○ Different input distribution due to corruption
Understanding XLNet https://www.borealisai.com/en/blog/understanding-xlnet
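A toy illustration of the two objectives on a token sequence (the sentence and masking scheme are made up for the example): AR predicts each token from its left context only, while AE corrupts the input with [MASK] and predicts the masked token from both sides.

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Autoregressive (AR): predict each token from its left context only.
ar_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Auto-encoding (AE): corrupt the input with [MASK] and predict the
# masked token from bidirectional context (BERT-style).
masked = list(tokens)
target_pos = random.randrange(len(tokens))
masked[target_pos] = "[MASK]"
ae_pair = (masked, tokens[target_pos])

print(ar_pairs[2])  # (['the', 'cat'], 'sat')
print(ae_pair)      # input with [MASK], plus the token to recover
```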
Syntactic Knowledge
● BERT representations are hierarchical rather
than linear.
○ Open Sesame: Getting Inside BERT’s Linguistic Knowledge
(Lin et al. ACLW 2019)
● BERT “naturally” learns some syntactic
information, although it is not very similar to
linguistic annotated resources.
○ Perturbed Masking: Parameter-free Probing for Analyzing
and Interpreting BERT (Wu et al. ACL 2020)
A Primer in BERTology: What we know about how BERT works (Rogers et al. TACL 2020)
Generative Pretraining from Pixels
ICML 2020, OpenAI
Towards a general “image” model
● Just as a general LM can generate coherent text, Image GPT can
generate coherent images.
● “Analysis by Synthesis” suggests that the model will also learn about
object categories once it learns to generate coherent images.
● Generative sequence modeling is a universal unsupervised algorithm.
Image GPT (https://openai.com/blog/image-gpt/)
Approach
Generative Pretraining from Pixels (Chen et al. ICML 2020)
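A minimal sketch of the input side of the approach: quantize pixel colors to a small palette and unroll the image in raster order into a 1-D token sequence, so a GPT-style model can do next-pixel prediction. The uniform binning and resolution here are illustrative stand-ins for the paper's reduced-resolution, clustered color palette.

```python
import numpy as np

def image_to_sequence(img, n_bins=16):
    """img: (H, W, 3) uint8 -> raster-order sequence of coarse color tokens."""
    coarse = (img.astype(np.int64) * n_bins) // 256            # (H, W, 3)
    tokens = (coarse[..., 0] * n_bins ** 2
              + coarse[..., 1] * n_bins
              + coarse[..., 2])                                 # (H, W)
    return tokens.reshape(-1)                                   # raster order

img = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
seq = image_to_sequence(img)
# Training pairs for the AR objective: (seq[:t], seq[t]) for each position t.
print(seq.shape, seq[:5])
```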
What representation works best?
● In supervised pre-training, representation quality tends to increase
monotonically with depth, but with generative pre-training, it is not
obvious whether a task like pixel prediction is relevant to image
classification.
● Representations first improve as a function of depth, and then,
starting around the middle layer, begin to deteriorate.
○ In the first phase, each position gathers information from its surrounding context in
order to build a more global image representation.
○ In the second phase, this contextualized input is used to solve the conditional next
pixel prediction task.
○ This could resemble the behavior of encoder-decoder architectures, but learned
within a monolithic architecture via a pre-training objective.
Generative Pretraining from Pixels (Chen et al. ICML 2020)
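A minimal sketch of how "representation quality vs. depth" is typically measured with layer-wise linear probes (the stand-in blocks and dimensions here are illustrative): freeze the pretrained model, pool the features at layer L, and train only a linear classifier on top.

```python
import torch
import torch.nn as nn

def probe_features(blocks, x, probe_layer):
    """Run the frozen model up to probe_layer and average-pool the features."""
    with torch.no_grad():                 # the pretrained model stays frozen
        h = x
        for block in blocks[:probe_layer]:
            h = block(h)
    return h.mean(dim=1)                  # pool over the sequence axis

blocks = nn.ModuleList([nn.Linear(16, 16) for _ in range(6)])  # stand-in layers
x = torch.randn(8, 10, 16)
feats = probe_features(blocks, x, probe_layer=3)
probe = nn.Linear(16, 10)                 # the only trainable part
print(probe(feats).shape)                 # torch.Size([8, 10])
```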
Performance on the CIFAR dataset
● We find that both increasing the
scale of our models and training for
more iterations result in better
generative performance, which
directly translates into better
feature quality.
● Generative models produce much
better features than BERT models
after pre-training, but BERT
models catch up after fine-tuning.
Generative Pretraining from Pixels (Chen et al. ICML 2020)
An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale
ICLR 2021, Google
When do Transformers work?
● When trained on mid-sized datasets (e.g. ImageNet), Transformers
yield modest accuracies, a few percent below ResNets of comparable size.
● However, large-scale training (14M-300M images) trumps the inductive
biases of CNNs such as translation equivariance & locality.
● Naive application of self-attention to images would require that each
pixel attends to every other pixel. With quadratic cost in the number
of pixels, this does not scale to realistic input sizes.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
Model overview
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
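A minimal sketch of the ViT front end (dimensions are ViT-B/16-style but illustrative): split the image into 16x16 patches, linearly embed each patch, prepend a learnable [class] token, and add learned position embeddings. The strided convolution is a common implementation trick equivalent to flattening each patch and applying a linear layer.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img=224, patch=16, dim=768):
        super().__init__()
        self.n = (img // patch) ** 2                     # 14 x 14 = 196 patches
        # Strided conv == "flatten each 16x16 patch + Linear projection".
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.n + 1, dim))

    def forward(self, x):                                # (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, 196, dim)
        cls = self.cls.expand(x.size(0), -1, -1)         # prepend [class] token
        return torch.cat([cls, x], dim=1) + self.pos     # (B, 197, dim)

print(PatchEmbed()(torch.randn(2, 3, 224, 224)).shape)  # (2, 197, 768)
```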
Performance
With self-supervised pre-training (masked patch prediction), the smaller ViT-B/16 model achieves 79.9% accuracy
on ImageNet, a significant improvement of 2% over training from scratch, but still 4% behind supervised pre-training.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
Interpreting the Results
● Positional embeddings
○ We speculate that learning to represent the spatial relations at
this resolution (14 x 14) is equally easy for the different strategies.
○ Closer patches tend to have more similar position embeddings.
○ Row-column and sinusoidal structures appear.
● Self-attention
○ “Attention distance” is analogous to “receptive field size” (see the sketch below).
○ Highly localized attention may serve a similar function as early
convolutional layers in CNNs.
○ Model attends to image regions that are semantically relevant
for classification.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
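One plausible way to compute attention distance, assuming a 14 x 14 patch grid (the function name and setup are illustrative): for each query patch, take the average spatial distance to the patches it attends to, weighted by the attention probabilities.

```python
import torch

def attention_distance(attn, grid=14):
    """attn: (n_patches, n_patches) attention weights of one head."""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid),
                            indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()
    dists = torch.cdist(coords, coords)    # pairwise patch-grid distances
    # Expected attended distance per query, averaged over queries.
    return (attn * dists).sum(-1).mean()

attn = torch.softmax(torch.randn(196, 196), dim=-1)  # dummy attention map
print(attention_distance(attn))
```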
End-to-End Object Detection
with Transformers
ECCV 2020, Facebook
End-to-end object detection
Object detection as a direct set prediction problem.
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Removing NMS
● Conventional CNN to learn a 2D representation + Positional encoding
● 100 learned positional embeddings as object queries
● Global reasoning using pairwise relations
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
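A minimal sketch of why NMS can be removed: each of the 100 object queries emits one box and class, and predictions are matched one-to-one against ground truth with the Hungarian algorithm. The cost below is simplified to class probability plus L1 box distance (the actual DETR cost also includes a generalized IoU term).

```python
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """One-to-one bipartite matching between queries and ground-truth boxes."""
    prob = pred_logits.softmax(-1)                    # (N, n_classes)
    cost_cls = -prob[:, gt_labels]                    # (N, M) class cost
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1) # (N, M) L1 box cost
    cost = (cost_cls + cost_box).numpy()
    rows, cols = linear_sum_assignment(cost)          # one query per GT box
    return rows, cols

pred_logits, pred_boxes = torch.randn(100, 92), torch.rand(100, 4)
gt_labels, gt_boxes = torch.tensor([3, 17]), torch.rand(2, 4)
print(match(pred_logits, pred_boxes, gt_labels, gt_boxes))
```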
Encoder’s attention mechanism in action
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Decoder’s attention mechanism in action
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Performance in Object Detection
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Panoptic Segmentation
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Performance in Panoptic Segmentation
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Additional Works
Notable Extensions
● Training data-efficient image transformers & distillation through
attention (Touvron et al. Arxiv 2021)
○ Adds a distillation token to ViT alongside the classification token; using only the
classification token doesn’t help much.
○ Soft distillation (match the teacher model’s softmax output) and hard distillation
(cross-entropy against the teacher’s argmax, with label smoothing); see the sketch after this list.
○ Surpasses SOTA yet again.
● DALL·E: Creating Images from Text (Ramesh et al. 2021)
○ Decoder-only transformer that receives both the text and the image as a single
stream of tokens (Text: 256, Image: 1024) and models all of them autoregressively.
○ Creates images from text captions for a wide range of concepts expressible in natural
language.
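A minimal sketch of the two distillation losses mentioned above (the temperature and smoothing values are illustrative, and the full DeiT objective also combines these with the ordinary classification loss): soft distillation matches the teacher's softened distribution with KL divergence; hard distillation treats the teacher's argmax as a label.

```python
import torch
import torch.nn.functional as F

def soft_distill(student_logits, teacher_logits, tau=3.0):
    """KL between temperature-softened student and teacher distributions."""
    return F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau * tau

def hard_distill(student_logits, teacher_logits, smoothing=0.1):
    """Cross-entropy against the teacher's argmax, with label smoothing."""
    hard_labels = teacher_logits.argmax(dim=-1)
    return F.cross_entropy(student_logits, hard_labels,
                           label_smoothing=smoothing)

s, t = torch.randn(8, 1000), torch.randn(8, 1000)
print(soft_distill(s, t).item(), hard_distill(s, t).item())
```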
Task-specific: Object Detection
● End-to-End Object Detection with Adaptive Clustering Transformer
(Zheng et al. Arxiv 2020)
○ ACT clusters the query features adaptively using Locality Sensitive Hashing (LSH) and
approximates the query-key interaction with a prototype-key interaction.
○ ACT can replace the original self-attention module in DETR without degrading the
performance of the pre-trained DETR model.
● Deformable DETR: Deformable Transformers for End-to-End Object
Detection (Zhu et al. ICLR 2021)
○ Deformable DETR achieves better performance than DETR (especially on small
objects) with 10× fewer training epochs.
○ Deformable attention module: attend to only a few prominent feature-map pixels and
aggregate multi-scale features (see the sketch below).
A Survey on Visual Transformer (Han et al. Arxiv 2021)
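A minimal single-scale, single-head sketch of the deformable attention idea (all names, shapes, and the offset scale here are illustrative; the real module also learns the offsets and weights from the queries and aggregates across feature scales): instead of attending over every pixel, each query samples K offset locations from the feature map and takes an attention-weighted sum of just those K values.

```python
import torch
import torch.nn.functional as F

def deformable_sample(feat, ref_points, offsets, weights):
    """feat: (B, C, H, W); ref_points: (B, Q, 2) in [-1, 1] grid coords;
    offsets: (B, Q, K, 2); weights: (B, Q, K), softmax-normalized over K."""
    loc = (ref_points.unsqueeze(2) + offsets).clamp(-1, 1)   # (B, Q, K, 2)
    sampled = F.grid_sample(feat, loc, align_corners=False)  # (B, C, Q, K)
    return (sampled * weights.unsqueeze(1)).sum(-1)          # (B, C, Q)

feat = torch.randn(1, 64, 32, 32)            # one feature map
ref = torch.rand(1, 100, 2) * 2 - 1          # reference point per query
off = torch.randn(1, 100, 4, 2) * 0.1        # K = 4 sampling offsets
w = torch.softmax(torch.randn(1, 100, 4), dim=-1)
print(deformable_sample(feat, ref, off, w).shape)  # (1, 64, 100)
```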
Task-specific: Object Detection
● UP-DETR: Unsupervised Pre-training for Object Detection with
Transformers (Dai et al. Arxiv 2020)
○ Proposes a pretext task named random query patch detection to pretrain
DETR without supervision (UP-DETR) for object detection.
● Rethinking Transformer-based Set Prediction for Object Detection
(Sun et al. Arxiv 2020)
○ Encoder-only DETR significantly accelerates training for small-object detection, as
it removes cross-attention.
○ Feature generation for transformer encoders with FCOS (Fully Convolutional
One-Stage object detector) or RCNN
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Task-specific: Segmentation
● MaX-DeepLab: End-to-End Panoptic Segmentation with Mask
Transformers (Wang et al. Arxiv 2020)
○ Infers masks and classes directly without hand-coded priors like object boxes.
○ Dual-path transformer enables CNNs to read and write a global memory at any layer.
● End-to-End Video Instance Segmentation with Transformers (Wang
et al. Arxiv 2020)
○ Three-dimensional (temporal, horizontal, and vertical) positional encoding
○ Instance sequence matching strategy: applying the loss across different time
steps
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Additional Tasks
● Learning Joint Spatial-Temporal Transformations for Video
Inpainting (Zeng et al. ECCV 2020)
● End-to-End Dense Video Captioning with Masked Transformer (Zhou
et al. CVPR 2018)
● Hand-Transformer: Non-Autoregressive Structured Modeling for 3D
Hand Pose Estimation (Huang et al. ECCV 2020)
● Taming Transformers for High-Resolution Image Synthesis (Esser et
al. Arxiv 2020)
● Pre-Trained Image Processing Transformer (Chen et al. Arxiv 2020)
○ ImageNet pre-training for image denoising/super-resolution
A Survey on Visual Transformer (Han et al. Arxiv 2021)
