MobileViT: Light-weight, General-purpose,
and Mobile-friendly Vision Transformer
Published at ICLR 2022 (Apple), cited by 61
(arXiv, Oct. 2021)
Speaker: Jemin Lee
https://leejaymin.github.io/index.html
Backbone
• MobileViT: "MobileViT: Light-weight, General-purpose, and
Mobile-friendly Vision Transformer", ICLR, 2022 (Apple).
- https://iclr.cc/virtual/2022/poster/6230
• MobileViTv2: "Separable Self-attention for Mobile Vision Transformers",
arXiv 2022.06
• EfficientFormer: "EfficientFormer: Vision Transformers at MobileNet
Speed", arXiv 2022.06
• TinyViT, ECCV 2022
• EfficientViT, arXiv 2022.06
• EdgeViTs, arXiv 2022.07
• EdgeNeXt, arXiv 2022.07
• LightViT, arXiv 2022.07
Up-to-date List
Efficient Vision Transformer
| 2 |
Apple
Machine Learning Research
| 3 |
Learning Visual Representations in CNN
Background
| 4 |
Pros
• Inductive bias
• Variably-sized datasets
• Fast
• Accurate
• General
Cons
• Local representations
• Weak data scalability
Pros
• Non-local representations
• Competitive with heavy-weight CNNs (e.g., ResNet-50)
Cons
• No inductive bias
• Requires large datasets
• Slow
• Poor performance without large-scale pre-training
Learning Visual Representations in ViT
Background
| 5 |
Vision Transformer (ViT), ICLR 2021 (arXiv, Oct. 2020)
Cited by 6k within 2 years
Recap ViT
Background
| 6 |
https://youtu.be/L7_nNfVeIw8
Transformer
• Non-local representations
• Competitive with CNNs (light: MobileNet; heavy: ResNet)
CNN
• Inductive bias
• Variably-sized datasets
Fast
Accurate
General
What the MobileViT design focuses on
MobileViT
| 7 |
Model the local & global information
MobileViT Architecture
| 8 |
Self-attention cost: O(N²d) in ViT vs. O(N²Pd) in MobileViT
Local: n×n conv (3×3)
Point-wise conv -> produces an H×W×d tensor (d > C)
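As a concrete illustration of the local step above, here is a minimal PyTorch sketch (hypothetical module and parameter names, not the CVNets implementation): an n×n convolution encodes local spatial context, and a point-wise convolution projects the C input channels to d > C dimensions.

```python
import torch
import torch.nn as nn

class LocalRepresentation(nn.Module):
    """Local modeling step of a MobileViT-style block (sketch):
    an n x n conv encodes local spatial context, then a 1 x 1
    (point-wise) conv projects C input channels to d > C dims."""
    def __init__(self, in_channels: int, d: int, kernel_size: int = 3):
        super().__init__()
        self.conv_nxn = nn.Conv2d(in_channels, in_channels,
                                  kernel_size, padding=kernel_size // 2)
        self.conv_1x1 = nn.Conv2d(in_channels, d, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv_1x1(self.conv_nxn(x))

# Example: a C=64 feature map projected to d=96 (illustrative width)
x = torch.randn(2, 64, 32, 32)
print(LocalRepresentation(64, 96)(x).shape)  # torch.Size([2, 96, 32, 32])
```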
Dilated convolutions could learn long-range dependencies in input
tensors.
However, they require careful selection of dilation rates (see the sketch
after this slide).
Alternative: self-attention (ViT) -> heavy-weight
MobileViT -> learn global representations with a spatial inductive
bias
Modeling Long-range Dependencies with Dilated Conv.
MobileViT Architecture
| 9 |
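To make the dilation-rate point concrete, a tiny sketch (shapes are assumptions, illustrative only): the same 3×3 kernel with dilation r spans a (2r+1)×(2r+1) window, so the effective receptive field depends directly on the chosen rates.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)
for rate in (1, 2, 4):
    # Same 3x3 kernel; dilation r makes it span a (2*r + 1) x (2*r + 1) window.
    conv = nn.Conv2d(16, 16, kernel_size=3, dilation=rate, padding=rate)
    print(rate, conv(x).shape)  # padding=rate keeps the spatial size at 32x32
```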
MobileViT block
MobileViT Architecture
| 10 |
(Figure: MobileViT block, intermediate feature tensors)
| 11 |
Appendix
Details of MobileViT Block
MobileViT Architecture
| 12 |
Unfold: [512, 64, 32, 32] -> [512×4 (B·P), 16×16 (N), 64 (d)]
(3×3 local conv; H = W = 32, d = 64; 2×2 patches -> P = 4, N = 16×16 = 256)
Multi-head attention: X -> [2048, 4 (heads), 256, 16 (d // heads)]
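The shape bookkeeping above can be reproduced with plain tensor reshapes; the sketch below assumes 2×2 patches and 4 attention heads to match the numbers on the slide, and is not the exact unfold/fold code from CVNets.

```python
import torch

B, d, H, W = 512, 64, 32, 32        # feature map entering the MobileViT block
ph = pw = 2                          # patch size -> P = ph * pw = 4 pixels/patch
x = torch.randn(B, d, H, W)

# Unfold: [B, d, H, W] -> [B*P, N, d] with N = (H/ph) * (W/pw) patches
x = x.reshape(B, d, H // ph, ph, W // pw, pw)        # [B, d, 16, 2, 16, 2]
x = x.permute(0, 3, 5, 2, 4, 1)                      # [B, ph, pw, 16, 16, d]
x = x.reshape(B * ph * pw, (H // ph) * (W // pw), d)
print(x.shape)                                       # torch.Size([2048, 256, 64])

# Split channels into heads for multi-head self-attention
heads = 4
x_mha = x.reshape(x.shape[0], x.shape[1], heads, d // heads).transpose(1, 2)
print(x_mha.shape)                                   # torch.Size([2048, 4, 256, 16])
```

After the transformer layers, the inverse reshape (fold) restores the [B, d, H, W] layout.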
Relationship to convolution
• The MobileViT block replaces the local processing in convolutions (the
matrix multiplication, i.e., GEMM) with deeper global processing
• Transformer as a convolution
Light-weight
• ViT, DeiT -> L = 12, d = 192
• MobileViT -> L = {2, 4, 3}, d = {96, 120, 144} at spatial levels 32×32, 16×16,
and 8×8
MobileViT block
MobileViT Architecture
| 13 |
Three variants of MobileViT (XXS, XS, S)
MobileViT Arch.
| 14 |
Input = (100,3,256,256)
X = (400, 256, 64)
(2048, 256, 64)
Linear bottlenecks
Inverted residuals
Two block types
• Stride 1 and stride 2 (see the sketch after this slide)
MobileNetV2 Block
Recap
| 15 |
MobileNetV2: Inverted Residuals and Linear Bottlenecks, cited by 10K
https://gaussian37.github.io/dl-concept-mobilenet_v2/
Expansion factor t = 6
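For reference, a minimal PyTorch sketch of a MobileNetV2-style inverted residual block with a linear bottleneck (expansion factor t = 6, residual only at stride 1 with matching channels); the layer choices are illustrative rather than the exact torchvision/CVNets implementation.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style block (sketch): 1x1 expand (t=6) -> 3x3 depthwise
    -> 1x1 linear bottleneck; residual only when stride=1 and C_in == C_out."""
    def __init__(self, c_in, c_out, stride=1, t=6):
        super().__init__()
        hidden = c_in * t
        self.use_res = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),   # linear: no activation
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out

x = torch.randn(1, 32, 56, 56)
print(InvertedResidual(32, 32, stride=1)(x).shape)  # [1, 32, 56, 56]
print(InvertedResidual(32, 64, stride=2)(x).shape)  # [1, 64, 28, 28]
```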
However, most of these works sample a new spatial resolution
after a fixed number of iterations.
The multi-scale sampler
• Reduces the training time as it requires fewer optimizer updates with
variably-sized batches
• Improves performance by about 0.5%
• Forces the network to learn better multi-scale representations (a sampler
sketch follows this slide)
Multi-Scale Sampler for Training Efficiency
How to efficiently train MobileViT
| 16 |
CVNets: High Performance Library for Computer Vision, arXiv 2022
Good point!
Other models are not left out of the comparison
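A minimal sketch of the multi-scale sampling idea (function and constant names are hypothetical, not the CVNets MultiScaleSampler API): each batch gets a resolution drawn from S, and the batch size is scaled inversely with the number of pixels, so larger images come in smaller, variably-sized batches.

```python
import random

# Resolutions from the paper's sampler set S; 128 is the per-GPU batch at
# 256x256 (effective batch 1,024 across 8 GPUs).
RESOLUTIONS = [(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)]
BASE_RES, BASE_BATCH = (256, 256), 128

def sample_batch_config():
    """Pick a resolution for the next batch and scale the batch size so the
    per-batch pixel budget stays roughly constant (larger images -> fewer of them)."""
    h, w = random.choice(RESOLUTIONS)
    scale = (BASE_RES[0] * BASE_RES[1]) / (h * w)
    batch_size = max(1, int(BASE_BATCH * scale))
    return (h, w), batch_size

for _ in range(3):
    print(sample_batch_config())   # e.g. ((320, 320), 81) or ((160, 160), 327)
```

Because lower resolutions use larger batches, the sampler needs fewer optimizer updates per epoch than fixed-resolution training at the largest size.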
Imagenet-1k
• The dataset provides 1.28 million and 50 thousand images for training
and validation, respectively.
The MobileViT networks are trained using PyTorch for 300 epochs on
8 NVIDIA GPUs with
• an effective batch size of 1,024 images using the AdamW optimizer
(Loshchilov & Hutter, 2019),
• label smoothing cross-entropy loss (smoothing = 0.1), and
• a multi-scale sampler (S = {(160, 160), (192, 192), (256, 256), (288, 288),
(320, 320)}).
The learning rate is
• increased from 0.0002 to 0.002 for the first 3k iterations and then
annealed to 0.0002 using a cosine schedule (Loshchilov & Hutter, 2017).
We use an L2 weight decay of 0.01.
We use basic data augmentation (i.e., random resized cropping and
horizontal flipping) and evaluate the performance using a single
crop top-1 accuracy.
For inference, an exponential moving average of model weights is
used (a training-setup sketch follows this slide).
Implementation details
How to efficiently train MobileViT
| 17 |
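A minimal PyTorch sketch matching the hyper-parameters listed above (AdamW with weight decay 0.01, label-smoothing cross entropy, linear warm-up from 0.0002 to 0.002 over 3k iterations, then cosine annealing back to 0.0002). The stand-in model and the iterations-per-epoch estimate are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn

model = nn.Linear(10, 1000)               # stand-in for a MobileViT model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # label-smoothing loss

warmup_iters = 3_000
max_iters = 300 * 1_250    # 300 epochs, ~1,250 iters/epoch (1.28M images / 1,024)
lr_min, lr_max = 2e-4, 2e-3

def lr_at(it: int) -> float:
    """Linear warm-up from lr_min to lr_max, then cosine anneal back to lr_min."""
    if it < warmup_iters:
        return lr_min + (lr_max - lr_min) * it / warmup_iters
    t = (it - warmup_iters) / max(1, max_iters - warmup_iters)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# LambdaLR multiplies the base lr (2e-4) by the lambda; call scheduler.step()
# once per iteration during training.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: lr_at(it) / lr_min)

print([round(lr_at(i), 5) for i in (0, 3_000, 200_000, 375_000)])
```

An EMA of the weights for inference could be maintained with, for example, torch.optim.swa_utils.AveragedModel; the slide does not specify the exact implementation.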
Comparison with CNNs
Experimental Results
| 18 |
Comparison with ViTs
Experimental Results
| 19 |
MobileViT performs well on object detection and semantic
segmentation.
Evaluation for the general-purpose nature
Experimental Results
| 20 |
CoreML on iPhone 12
Performance on Mobile Devices
Experimental Results
| 21 |
Dedicated CUDA kernels exist for transformers on GPUs.
CNNs benefit from several device-level optimizations, including
batch-normalization fusion with convolutional layers.
• Such transformer-specific optimizations are not available for mobile devices
• The resulting inference graph of MobileViT and ViT-based networks
on mobile devices is therefore sub-optimal
Discussion
Experimental Results
| 22 |
A systematic measurement
Experimental Results
| 23 |
Towards Efficient Vision Transformer Inference: A First Study of Transformers on Mobile Devices, HotMobile22
Source code
Closing Remarks
| 24 |
CUDA_VISIBLE_DEVICES=0 cvnets-eval --common.config-file $CFG_FILE/mobilevit_xxs.yaml --common.results-loc classification_results --model.classification.pretrained $MODEL_WEIGHTS/mobilevit_xxs.pt
CVNets v0.1
Source Code
| 25 |
Timm
• XXS: "top1": 68.906, 0.107s
• XS: "top1": 74.634, 0.178s
• S: "top1": 78.316, 0.219s
Reproduce accuracy with Timm
Alternative code
| 26 |
GitHub CVNets: README.md
Paper
Comparison of latency with the timm library
Source Code
| 27 |
/ssd/dataset/imagenet --model vit_base_patch16_224 --pretrained
/ssd/dataset/imagenet --model mobilevit_xxs --pretrained
/ssd/dataset/imagenet --model mobilevit_s --pretrained
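For a quick sanity check of the latency comparison above, a minimal timm sketch (it assumes a timm version that ships the mobilevit_xxs / mobilevit_xs / mobilevit_s weights; the timing loop is illustrative, not the benchmark behind the reported numbers).

```python
import time
import timm
import torch

# Model name assumes a recent timm release that includes MobileViT weights.
model = timm.create_model("mobilevit_xxs", pretrained=True).eval()

x = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    for _ in range(10):                 # warm-up iterations
        model(x)
    start = time.perf_counter()
    for _ in range(50):
        model(x)
    latency = (time.perf_counter() - start) / 50

print(f"mobilevit_xxs: {latency * 1000:.1f} ms / image (CPU, batch 1)")
```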