Masked Self-supervised
Pre-training for Visual
Recognition
by: Jefferson Hernandez
There has been a divergence between how we do
pre-training in Vision vs NLP
NLP models are usually pre-trained using masked or autoregressive methods:
Masked language model Autoregressive language model
Images from: Jay Alammar's blog
Instead the most successful pre-training in Vision is done using
contrastive methods
SimCLR (from Ziyan's talk)
How can we make Vision pre-training more
similar to NLP pre-training?
Masked and autoregressive methods in NLP are at heart
Denoising autoencoders
● They are a class of autoencoders that corrupt the input and ask the model to
predict the uncorrupted version
● For images this would mean applying geometric transformations, color
transformations, masking pixels, shuffling pixels, etc.
Masked image modelling (MIM) has been done using
convolutions
The paper Context Encoders: Feature Learning by Inpainting (2016) pioneered
masked image modelling, using convolutional neural networks to fill in the
masked part of an image.
[Figure: CNN encoder followed by a CNN decoder]
But the results are very poor...
So the authors needed to add an adversarial loss (GAN) to get better visual results,
but even then fine-tuning accuracies were low by today's standards
Can we do better than this?
How to tokenize images the same way as text?
The paper AN IMAGE IS WORTH 16X16 WORDS introduces the main way to
tokenize images for transformers: split them into patches of 16 by 16 pixels
and pass them through a linear layer
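A minimal NumPy sketch of this tokenization step, under stated assumptions: the 224x224 input size, the embedding width of 1024, and the names patchify, W_embed and b_embed are illustrative choices, not taken from the slides. Position embeddings and the class token used by ViT are left out.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # dummy image (assumed input size)
patches = patchify(image)                # (196, 768)

# The "linear layer": one learned matrix shared by all patches.
embed_dim = 1024                         # assumed width, for illustration only
W_embed = rng.normal(0, 0.02, (patches.shape[1], embed_dim))
b_embed = np.zeros(embed_dim)
tokens = patches @ W_embed + b_embed     # (196, 1024), one token per patch
```

Each row of tokens plays the role that a word embedding plays in a text transformer.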
(MAE) Masked Autoencoders Are Scalable Vision
Learners
● With the introduction of ViT, we can do masked image modelling the same
way we do masked language modelling in BERT.
● Unlike BERT, MAE uses an asymmetric design: the encoder operates only on the
visible (unmasked) patches, with no [MASKED] tokens, and a lightweight decoder
reconstructs the full signal from the latent representation and [MASKED]
tokens.
MAE Architecture
1) Mask original image
2) Encode visible tokens
3) Add [M] tokens
4) Predict image
5) L2 pixel loss
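The five steps can be sketched end to end in NumPy. This is a toy illustration, not the authors' implementation: single linear maps stand in for the ViT encoder and the lightweight decoder, the widths enc_dim and dec_dim are made up, position embeddings are omitted, and all variable names are hypothetical. Averaging the loss over masked patches only follows the MAE paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim = 196, 768   # e.g. a 224x224 image in 16x16x3 patches
enc_dim, dec_dim = 64, 32           # toy widths, not the paper's
mask_ratio = 0.75

patches = rng.random((num_patches, patch_dim))   # stand-in for a patchified image

# 1) Mask: shuffle patch indices and keep the first 25% as visible.
num_keep = int(num_patches * (1 - mask_ratio))
shuffle = rng.permutation(num_patches)
keep_idx, masked_idx = shuffle[:num_keep], shuffle[num_keep:]

# 2) Encode visible tokens only (a linear map stands in for the ViT encoder).
W_enc = rng.normal(0, 0.02, (patch_dim, enc_dim))
latent = patches[keep_idx] @ W_enc               # (49, enc_dim)

# 3) Add [M] tokens: project to the decoder width, then put a shared mask
#    token at every masked position, restoring the original patch order.
W_proj = rng.normal(0, 0.02, (enc_dim, dec_dim))
mask_token = np.zeros(dec_dim)                   # a learned vector in a real model
decoder_in = np.tile(mask_token, (num_patches, 1))
decoder_in[keep_idx] = latent @ W_proj

# 4) Predict image: a linear map stands in for the lightweight decoder.
W_dec = rng.normal(0, 0.02, (dec_dim, patch_dim))
pred = decoder_in @ W_dec                        # (196, patch_dim)

# 5) L2 pixel loss, averaged over the masked patches only.
loss = np.mean((pred[masked_idx] - patches[masked_idx]) ** 2)
```

Because the encoder only ever processes the 49 visible tokens, most of the compute is spent on a quarter of the sequence; the full 196-token sequence is seen only by the small decoder.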
Qualitative Results
Results
The authors do self-supervised pre-training on the ImageNet-1K (IN1K) training
set. Then they do supervised training to evaluate the representations with (i)
end-to-end fine-tuning or (ii) linear probing.
Baseline model: ViT-Large
● ViT-Large (ViT-L/16) is the backbone in their ablation study.
● ViT-L is very big and tends to overfit.
● It is very hard to train supervised ViT-L from scratch, and a good recipe with
strong regularization is needed.
We need high masking ratios
● The optimal ratios are surprisingly
high. The ratio of 75% is good for both
linear probing and fine-tuning.
● This is in contrast with BERT (15%)
and similar works in CV (20%–50%)
● For linear probing, the accuracy
increases steadily with the masking
ratio until 75% masking: the accuracy
gap is up to ∼20% (54.6% vs. 73.5%).
For fine-tuning, the results are less
sensitive to the ratios, and a wide
range of masking ratios (40–80%)
works well.
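To make the ratios concrete, the short sketch below counts how many patches the encoder actually sees at a few masking ratios. The 224x224 input size with 16x16 patches is an assumption for illustration, and visible_patches is a hypothetical helper name.

```python
def visible_patches(image_size=224, patch_size=16, mask_ratio=0.75):
    """Return (total patches, patches the encoder sees) for a square image."""
    total = (image_size // patch_size) ** 2
    visible = int(total * (1 - mask_ratio))
    return total, visible

for ratio in (0.15, 0.50, 0.75):
    total, visible = visible_patches(mask_ratio=ratio)
    print(f"mask {ratio:.0%}: encoder sees {visible} of {total} patches")
```

At 75% masking the encoder input is a quarter of the full sequence, which is where the compute savings of the asymmetric design come from.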
Mask Token
● If the encoder uses mask tokens, it
performs worse: its accuracy drops
by 14% in linear probing.
● By removing the mask token from
the encoder, they constrain the
encoder to always see real patches
and thus improve accuracy.
Reconstruction target
● Using pixels with normalization improves accuracy.
● In another variant, the authors perform PCA in the patch space and use the
largest PCA coefficients (96 here) as the target. Doing so degrades accuracy.
● The authors also compare an MAE variant that predicts tokens, the target
used in BEiT. Specifically for this variant, they use the DALLE pre-trained
dVAE as the tokenizer, following BEiT.
● The dVAE tokenizer requires one more pre-training stage, which may depend
on extra data (250M images). The dVAE encoder is a large convolutional
network (40% FLOPs of ViT-L) and adds nontrivial overhead.
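As a sketch of the "pixels with normalization" target: in the MAE paper each patch is normalized by the mean and standard deviation of its own pixels. The function name and the small eps constant below are assumptions added for numerical safety, not details from the slides.

```python
import numpy as np

def normalized_pixel_target(patches, eps=1e-6):
    """Normalize each flattened patch by its own mean and standard deviation."""
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
patches = rng.random((196, 768))             # stand-in for a patchified image
target = normalized_pixel_target(patches)    # same shape, per-patch zero mean
```

The decoder is then trained to regress target instead of the raw pixel values.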
Comparison with other
self-supervised Methods
Comparison with
supervised pre-training
Transfer learning experiments
● Object detection and instance segmentation
○ Mask R-CNN is fine-tuned on COCO. The ViT backbone is adapted to work with FPN.
● Semantic segmentation:
○ Experiments on ADE20K use UperNet with a ViT backbone.
Extending MAE to other modalities (Video)
Masked Autoencoders As Spatiotemporal Learners
● Basic idea: extend MAE to spatiotemporal learning
How to mask spatiotemporal data?
(a): Random sampling that is spacetime-agnostic. (b): Space-only random
sampling, broadcasted to all time steps (“tube” masking). (c): Time-only random
sampling, broadcasted to all spatial locations (“frame” masking). (d): Block-wise
sampling in spacetime, removing large regions (“cube” masking).
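Strategies (a) to (c) can be sketched as boolean masks over a T x H x W token grid, where True means masked. The grid size, the function names and the 0.9 ratio used in the example are assumptions for illustration; block-wise "cube" masking (d) is omitted to keep the sketch short.

```python
import numpy as np

def agnostic_mask(t, h, w, ratio, rng):
    """(a) Sample masked positions uniformly over all T*H*W tokens."""
    n = t * h * w
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[: int(n * ratio)]] = True
    return mask.reshape(t, h, w)

def tube_mask(t, h, w, ratio, rng):
    """(b) Sample a spatial mask once and repeat it at every time step."""
    spatial = agnostic_mask(1, h, w, ratio, rng)     # (1, H, W)
    return np.repeat(spatial, t, axis=0)

def frame_mask(t, h, w, ratio, rng):
    """(c) Sample whole time steps and mask every spatial location in them."""
    temporal = agnostic_mask(t, 1, 1, ratio, rng)    # (T, 1, 1)
    return np.broadcast_to(temporal, (t, h, w)).copy()

rng = np.random.default_rng(0)
t, h, w = 8, 14, 14   # hypothetical token grid
masks = {
    name: fn(t, h, w, 0.9, rng)
    for name, fn in [("agnostic", agnostic_mask),
                     ("tube", tube_mask),
                     ("frame", frame_mask)]
}
```

All three reuse the same uniform sampler; they differ only in which axes the sampled pattern is shared across.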
What is the optimal masking ratio for spatiotemporal data?
The optimal ratio is ~90%, much higher than for images.
Qualitative Results
Influence of pre-training data
Results on Kinetics Dataset (400)
Multi-Modal MAE (Image + Text)
Masked Vision and Language Modeling for Multi-modal
Representation Learning
Basic idea: model p(img | text) and p(text | img)
Qualitative Results
They don't show image reconstructions
Image-Text Retrieval (Finetuned)
Image-Text Retrieval (Zero-Shot)
Retrieval is done using img_features @ text_features^T
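A minimal sketch of this retrieval step in NumPy. The slide only states the matrix product; L2-normalizing the features first (so the scores are cosine similarities), the function name retrieve and the toy data are assumptions for illustration.

```python
import numpy as np

def retrieve(img_features, text_features, k=1):
    """Rank texts for each image by similarity of L2-normalized features."""
    img = img_features / np.linalg.norm(img_features, axis=1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    sim = img @ txt.T                          # (num_images, num_texts)
    top_k = np.argsort(-sim, axis=1)[:, :k]    # best-matching text indices per image
    return sim, top_k

rng = np.random.default_rng(0)
text_features = rng.normal(size=(5, 16))
# Toy images: each one is its paired text embedding plus a little noise.
img_features = text_features + 0.01 * rng.normal(size=(5, 16))
sim, top_k = retrieve(img_features, text_features)
```

Text-to-image retrieval uses the same matrix, ranking along the other axis (sim.T).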
Visual Question Answering (VQA) and Natural Language
for Visual Reasoning (NLVR)
VQA NLVR
Results
