MobileViT: Light-weight, General-purpose,
and Mobile-friendly Vision Transformer
Published in ICLR'22 (Apple), cited by 61
(arXiv, Oct. 2021)
Speaker: Jemin Lee
https://leejaymin.github.io/index.html
Backbone
• MobileViT: "MobileViT: Light-weight, General-purpose, and
Mobile-friendly Vision Transformer", ICLR, 2022 (Apple).
- https://iclr.cc/virtual/2022/poster/6230
• MobileViTv2: "Separable Self-attention for Mobile Vision Transformers",
arXiv 22.06
• EfficientFormer: "EfficientFormer: Vision Transformers at MobileNet
Speed", arXiv 22.06
• TinyViT, ECCV22
• EfficientViT, arXiv 22.06
• EdgeViTs, arXiv 22.07
• EdgeNeXt, arXiv 22.07
• LightViT, arXiv 22.07
Up-to-date List
Efficient Vision Transformer
| 2 |
Apple
Machine Learning Research
| 3 |
Learning Visual Representations in CNN
Background
| 4 |
Pros
• Spatial inductive biases
• Learn well from variably-sized datasets
• Fast
• Accurate
• General-purpose
Cons
• Local representations only
• Weak data scalability
Pros
• Non-local (global) representations
• Competitive with heavy-weight CNNs (e.g., ResNet-50)
Cons
• No image-specific inductive bias
• Require large datasets
• Slow
• Poor performance at light-weight model sizes
Learning Visual Representations in ViT
Background
| 5 |
Vision Transformer (ViT), ICLR 2021 (arXiv, Oct. 2020)
Cited by 6k within 2 years
Recap ViT
Background
| 6 |
https://youtu.be/L7_nNfVeIw8
Transformer
• Non-local representations
• Competitive with CNNs (light-weight (MobileNet) and heavy-weight (ResNet))
CNN
• Inductive bias
• Variably-sized datasets
Design goals: fast, accurate, general-purpose
What MobileViT is designed for
MobileViT
| 7 |
Modeling local & global information
MobileViT Architecture
| 8 |
O(N²d): self-attention cost in ViTs
O(N²Pd): self-attention cost in MobileViT (N patches, P pixels per patch, d dimensions)
Local: n×n conv (3×3)
Point-wise conv -> produces an H×W×d tensor (d > C)
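To make the local-representation stage concrete, here is a minimal PyTorch sketch (layer sizes are illustrative and normalization/activation layers are omitted, so this is not the paper's exact configuration): a 3×3 convolution encodes local spatial context, then a point-wise convolution projects the C input channels to d > C dimensions.

```python
import torch
import torch.nn as nn

# Minimal sketch of the local-representation stage described above
# (illustrative sizes; normalization/activation layers are omitted):
# an n x n (here 3x3) conv encodes local spatial context, then a 1x1
# point-wise conv projects the C input channels to d > C dimensions.
C, d = 64, 96
local_rep = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1),    # local n x n convolution
    nn.Conv2d(C, d, kernel_size=1),               # point-wise projection to d channels
)

x = torch.randn(1, C, 32, 32)                     # B x C x H x W feature map
y = local_rep(x)                                  # -> 1 x d x 32 x 32 (spatial size kept)
print(y.shape)
```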
Dilated convolutions could learn long-range dependencies in input
tensors.
However, they require careful selection of dilation rates.
Alternative: self-attention (ViT) -> heavy-weight
MobileViT -> learns global representations with a spatial inductive
bias
Modeling long-range dependencies with dilated convolutions
MobileViT Architecture
| 9 |
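For reference, a tiny PyTorch sketch of the dilation trade-off mentioned above (channel counts are illustrative): a 3×3 convolution with dilation r covers a (2r+1)×(2r+1) window, which is why stacked dilated convolutions can reach long-range context but need carefully chosen rates.

```python
import torch.nn as nn

# Illustrative only: a 3x3 conv with dilation r sees a (2r+1) x (2r+1) window,
# so stacking dilated convs enlarges the receptive field quickly, but poorly
# chosen rates leave "holes" (gridding) in the coverage.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)  # 3x3 window
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)   # 5x5 window
```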
MobileViT block
MobileViT Architecture
| 10 |
(Figure: intermediate tensors X_L and X_G of the MobileViT block)
| 11 |
Appendix
Details of MobileViT Block
MobileViT Architecture
| 12 |
Unfold: [512, 64, 32, 32] (B, C, H, W) -> [512×4 (B·P), 16×16 = 256 (N), 64 (C)]
(2×2 patches -> P = 4 pixels per patch, N = 256 patches; local conv is 3×3)
Multi-head attention: X -> [2048, 4 (heads), 256 (N), 16 (C // heads)]
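The reshaping above can be sanity-checked with a short PyTorch sketch; this is one possible rearrangement that reproduces the shapes on the slide, not necessarily the exact pixel ordering of the official unfold implementation.

```python
import torch

# Sanity check of the shapes on this slide: a [B, C, H, W] = [512, 64, 32, 32]
# feature map is split into 2x2 patches (P = 4 pixels per patch,
# N = 16 * 16 = 256 patches) and rearranged to [B*P, N, C] = [2048, 256, 64]
# so the transformer attends across patches. The exact pixel ordering of the
# official unfold may differ; only the shapes are matched here.
B, C, H, W = 512, 64, 32, 32
ph = pw = 2                                        # patch height / width
P, N = ph * pw, (H // ph) * (W // pw)              # 4, 256

x = torch.randn(B, C, H, W)
x = x.reshape(B, C, H // ph, ph, W // pw, pw)      # [512, 64, 16, 2, 16, 2]
x = x.permute(0, 3, 5, 2, 4, 1).reshape(B * P, N, C)   # [2048, 256, 64]

# Multi-head view with 4 heads: [2048, 4, 256, 16] (C // heads = 16).
heads = 4
x_heads = x.reshape(B * P, N, heads, C // heads).transpose(1, 2)
print(x.shape, x_heads.shape)
```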
Relationship to convolution
• MobileViT block replaces the local processing (GEMM) in convolutions
with deeper global processing
• Transformer as a convolution
Light-weight
• ViT, DeiT -> L = 12, d = 192
• MobileViT -> L = {2, 4, 3}, d = {96, 120, 144} at spatial levels 32×32, 16×16,
and 8×8
MobileViT block
MobileViT Architecture
| 13 |
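The depth/width settings above, written out as a small illustrative config (the keys are hypothetical, not CVNets' actual configuration names):

```python
# Illustrative only (the keys are hypothetical, not CVNets' config names):
vit_cfg = {"layers": 12, "dim": 192}              # ViT / DeiT transformer
mobilevit_cfg = {                                 # transformer per spatial level
    "32x32": {"layers": 2, "dim": 96},
    "16x16": {"layers": 4, "dim": 120},
    "8x8":   {"layers": 3, "dim": 144},
}
```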
Three types of MobileViT
MobileViT Arch.
| 14 |
Input = (100,3,256,256)
X = (400, 256, 64)
(2048, 256, 64)
Linear bottlenecks
Inverted residuals
Two blocks
• Stride 1, stride 2
MobileNetv2 Block
Recap
| 15 |
MobileNetV2: Inverted Residuals and Linear Bottlenecks, cited by 10K
https://gaussian37.github.io/dl-concept-mobilenet_v2/
t = 6 (expansion factor)
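A minimal PyTorch sketch of the MV2 (inverted residual) block recapped above, assuming the standard MobileNetV2 formulation: 1×1 expansion with factor t, 3×3 depthwise convolution with stride 1 or 2, and a 1×1 linear bottleneck, with a residual connection only when stride = 1 and the channel counts match.

```python
import torch.nn as nn

# Minimal sketch of the MV2 (inverted residual) block, assuming the standard
# MobileNetV2 formulation: 1x1 expansion (t=6) -> 3x3 depthwise conv (stride 1
# or 2) -> 1x1 linear bottleneck; the residual connection is used only when
# stride == 1 and the input/output channel counts match.
class InvertedResidual(nn.Module):
    def __init__(self, c_in, c_out, stride=1, t=6):
        super().__init__()
        hidden = c_in * t
        self.use_residual = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),           # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride,
                      padding=1, groups=hidden, bias=False),  # 3x3 depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),          # linear bottleneck
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```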
However, most of these works sample a new spatial resolution
after a fixed number of iterations.
The multi-scale sampler
• Reduces the training time as it requires fewer optimizer updates with
variably-sized batches
• Improves performance by about 0.5%
• Forces the network to learn better multi-scale representations
Multi-Scale Sampler for Training Efficiency
How to efficiently train MobileViT
| 16 |
CVNets: High Performance Library for Computer Vision, arXiv'22
Good point: other models are not left out of the comparison.
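A hedged sketch of the multi-scale batching idea (this is not the CVNets sampler API): each batch draws a resolution from S and scales the batch size inversely with the pixel count, so memory stays roughly constant and smaller resolutions need fewer optimizer updates.

```python
import random

# Hedged sketch of the multi-scale batching idea (not the CVNets sampler API):
# each batch draws a resolution from S and scales the batch size inversely
# with the pixel count, so GPU memory stays roughly constant and smaller
# resolutions need fewer optimizer updates per epoch.
S = [(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)]
base_res, base_bs = (256, 256), 1024             # effective batch size at 256x256

def sample_batch_spec():
    h, w = random.choice(S)
    bs = max(1, (base_bs * base_res[0] * base_res[1]) // (h * w))
    return (h, w), bs

for _ in range(3):
    print(sample_batch_spec())                   # e.g. ((160, 160), 2621)
```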
ImageNet-1k
• The dataset provides 1.28 million and 50 thousand images for training
and validation, respectively.
The MobileViT networks are trained using PyTorch for 300 epochs on
• 8 NVIDIA GPUs with an effective batch size of 1,024 images, using
• the AdamW optimizer (Loshchilov & Hutter, 2019),
• label-smoothing cross-entropy loss (smoothing = 0.1), and
• a multi-scale sampler (S = {(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)}).
The learning rate is
• increased from 0.0002 to 0.002 for the first 3k iterations and then
annealed to 0.0002 using a cosine schedule (Loshchilov & Hutter, 2017).
• We use L2 weight decay of 0.01.
We use basic data augmentation (i.e., random resized cropping and
horizontal flipping) and evaluate the performance using single-crop
top-1 accuracy.
For inference, an exponential moving average of the model weights is used.
Implementation details
How to efficiently train MobileViT
| 17 |
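A condensed PyTorch sketch of the recipe above (the model is a placeholder and the data pipeline is omitted): AdamW with weight decay 0.01, label-smoothing cross-entropy (0.1), and a learning rate warmed up linearly from 0.0002 to 0.002 over 3k iterations, then cosine-annealed back to 0.0002.

```python
import math
import torch
import torch.nn as nn

# Condensed sketch of the recipe above (placeholder model; the real setup
# trains MobileViT with a multi-scale sampler on 8 GPUs): AdamW with weight
# decay 0.01, label-smoothing cross-entropy (0.1), linear warmup of the
# learning rate from 2e-4 to 2e-3 over the first 3k iterations, then cosine
# annealing back to 2e-4.
model = nn.Linear(10, 1000)                     # stand-in for MobileViT
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

warmup_iters, min_lr, max_lr = 3000, 2e-4, 2e-3

def lr_at(step, total_steps):
    """Learning rate at a given iteration: linear warmup, then cosine decay."""
    if step < warmup_iters:
        return min_lr + (max_lr - min_lr) * step / warmup_iters
    t = (step - warmup_iters) / max(1, total_steps - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))

# Inside the training loop, before optimizer.step():
#   for g in optimizer.param_groups:
#       g["lr"] = lr_at(step, total_steps)
```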
Comparison with CNNs
Experimental Results
| 18 |
Comparison with ViTs
Experimental Results
| 19 |
MobileViT performs well on object detection and semantic
segmentation.
Evaluation for the general-purpose nature
Experimental Results
| 20 |
CoreML on iPhone 12
Performance on Mobile Devices
Experimental Results
| 21 |
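The iPhone 12 numbers are measured with CoreML; below is a hedged sketch of one way to export a PyTorch model with coremltools (the tiny placeholder model stands in for MobileViT, and the paper's exact benchmark harness is not reproduced here).

```python
import torch
import torch.nn as nn
import coremltools as ct  # pip install coremltools

# Hedged sketch of a CoreML export path like the one behind the iPhone 12
# numbers (the paper's exact benchmark harness is not reproduced here).
# The tiny placeholder model keeps the snippet self-contained; swap in a
# MobileViT (e.g. from timm or CVNets) for real on-device measurements.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1000),
).eval()

example = torch.randn(1, 3, 256, 256)            # MobileViT's 256x256 input size
traced = torch.jit.trace(model, example)
mlmodel = ct.convert(traced,
                     inputs=[ct.TensorType(name="image", shape=example.shape)],
                     convert_to="mlprogram")
mlmodel.save("model.mlpackage")                  # profile on-device with Xcode
```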
Dedicated CUDA kernels exist for transformers on GPUs.
CNNs benefit from several device-level optimizations, including fusing
batch normalization with convolutional layers.
• These optimizations are not available on mobile devices.
• As a result, the inference graphs of MobileViT and ViT-based networks
are sub-optimal on mobile devices.
Discussion
Experimental Results
| 22 |
A systematic measurement
Experimental Results
| 23 |
Towards Efficient Vision Transformer Inference: A First Study of Transformers on Mobile Devices, HotMobile22
Source code
Closing Remarks
| 24 |
CUDA_VISIBLE_DEVICES=0 cvnets-eval --common.config-file
$CFG_FILE/mobilevit_xxs.yaml --common.results-loc
classification_results --model.classification.pretrained
$MODEL_WEIGHTS/mobilevit_xxs.pt
CVNets v0.1
Source Code
| 25 |
timm
• XXS: top-1 68.906, 0.107 s
• XS: top-1 74.634, 0.178 s
• S: top-1 78.316, 0.219 s
Reproduce accuracy with Timm
Alternative code
| 26 |
GitHub CVNets: README.md
Paper
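A hedged sketch of reproducing the top-1 numbers above with timm (it assumes the standard ImageNet ImageFolder layout under the /ssd/dataset/imagenet path used on the next slide; timm's validate.py script does the same with more options).

```python
import torch
import timm
from timm.data import resolve_data_config, create_transform
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Hedged sketch: evaluate a pretrained MobileViT from timm on the ImageNet-1k
# validation set (assumes the standard ImageFolder layout under the path used
# on the next slide; timm's validate.py script does the same with more options).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = timm.create_model("mobilevit_xxs", pretrained=True).eval().to(device)

cfg = resolve_data_config({}, model=model)       # picks up the 256x256 eval config
transform = create_transform(**cfg)
loader = DataLoader(ImageFolder("/ssd/dataset/imagenet/val", transform=transform),
                    batch_size=256, num_workers=8)

correct = total = 0
with torch.no_grad():
    for images, targets in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == targets).sum().item()
        total += targets.numel()
print(f"top-1: {100.0 * correct / total:.3f}%")
```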
Comparison of latency with the timm library
Source Code
| 27 |
/ssd/dataset/imagenet --model vit_base_patch16_224 --pretrained
/ssd/dataset/imagenet --model mobilevit_xxs --pretrained
/ssd/dataset/imagenet --model mobilevit_s --pretrained
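A hedged sketch of the batch-1 latency comparison above using timm (each model uses its own default input size; absolute times depend heavily on the device and backend, so read them as relative only).

```python
import time
import torch
import timm
from timm.data import resolve_data_config

# Hedged sketch of a batch-1 latency comparison with timm. Each model uses its
# own default input size (224 for ViT-Base/16, 256 for MobileViT); absolute
# times depend on the device and backend, so read them as relative only.
def latency(name, runs=50, device="cpu"):
    model = timm.create_model(name, pretrained=True).eval().to(device)
    c, h, w = resolve_data_config({}, model=model)["input_size"]
    x = torch.randn(1, c, h, w, device=device)
    with torch.no_grad():
        for _ in range(10):                      # warm-up iterations
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs

for name in ["vit_base_patch16_224", "mobilevit_xxs", "mobilevit_s"]:
    print(f"{name}: {latency(name):.3f} s / image")
```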