You Only Look at One Sequence (YOLOS):
Rethinking Transformer in Vision through
Object Detection
김병현
Image Processing Team
김선옥, 안종식, 이찬혁, 홍은기
Here comes YOLOS!!
• YOLOS
Transformer-based 2D object detection model
Uses only a Transformer encoder and MLP heads
[Figures: YOLOS (Transformer Encoder); YOLOS performance comparison with SOTA object detectors; YOLOS detection example]
Transformer is Born to Transfer
Transformer is for sequential data such as natural language!!
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
Vision Transformer
• AN IMAGE IS WORTH 16X16 WORDS
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Can an image be sequential data?
• In object detection ...
[Figure: detection example with scores Dog: 0.89, Dog: 0.69, Person: 0.51]
Hard spatial information loss during position embedding
How to Apply Transformer to Object Detection
• ViT-FRCNN
Strategy 1: Reassemble the patches into a 2D feature map again
Beal, J., Kim, E., Tzeng, E., Park, D. H., Zhai, A., & Kislyuk, D. (2020). Toward transformer-based object detection. arXiv preprint arXiv:2012.09958.
How to Apply Transformer to Object Detection
• DETR
Strategy 2: CNN feature extractor + positional encoding + bipartite matching loss
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020, August). End-to-end object detection with transformers. In European Conference on Computer Vision (pp. 213-229). Springer, Cham.
How to Apply Transformer to Object Detection
• Swin Transformer
Strategy 3: Patch embedding with different patch sizes
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030.
How to Apply Transformer to Object Detection
Can a Transformer perform 2D object detection as a pure sequence-to-sequence method?
Q & A
YOLOS = ViT + Bipartite Loss (from DETR)
Architecture of YOLOS
1. Patch Token & Patch Embedding
2. Transformer Encoder
3. Bipartite Loss & Detection Token
Q & A
Component 1 – Patch Token & Patch Embedding
Conv2d: 16 × 16 kernel, stride = 16, embedding dimension = 768
Original image (1280 × 960) → feature map (80 × 60 × 768) → flattened feature map (4800 × 768)
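The shape arithmetic on this slide can be checked with a short sketch. It assumes a Conv2d with no padding whose kernel size equals its stride, which is equivalent to applying one linear projection to each non-overlapping patch. The function names `patch_grid` and `embed_patches` are hypothetical, and the toy projection handles a single channel with no bias, a simplification of the real layer.

```python
def patch_grid(width, height, patch=16, stride=16):
    """Output grid (columns, rows) of a Conv2d with kernel=patch, stride=stride, no padding."""
    cols = (width - patch) // stride + 1
    rows = (height - patch) // stride + 1
    return cols, rows


def embed_patches(image, weight, patch):
    """Project each non-overlapping patch of a single-channel image.

    image:  H x W nested list of numbers.
    weight: D x (patch * patch) nested list (one row per embedding dimension).
    Returns the flattened token list in row-major patch order, each token of length D.
    """
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            flat = [image[top + i][left + j] for i in range(patch) for j in range(patch)]
            tokens.append([sum(w * x for w, x in zip(row, flat)) for row in weight])
    return tokens


# Shapes from the slide: 1280 x 960 image, 16 x 16 patches, stride 16.
cols, rows = patch_grid(1280, 960)
print(cols, rows, cols * rows)  # 80 60 4800

# Toy check of the projection itself: 4 x 4 image, 2 x 2 patches, embedding dimension 1.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_tokens = embed_patches(toy_image, weight=[[1, 1, 1, 1]], patch=2)
print(toy_tokens)  # [[10], [18], [42], [50]]
```

With an embedding dimension of 768, the 4800 tokens form the 4800 × 768 flattened feature map shown on the slide.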
Component 2 – Vision Transformer (Backbone)
Input sequence: patch tokens (flattened feature map) + detection tokens, with position embedding
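A minimal sketch of how the encoder input could be assembled, assuming the detection tokens are appended after the patch tokens and one position embedding is added to every token. The count of 4800 patch tokens comes from the patch embedding slide and the count of 100 detection tokens from the bipartite matching slide. The embedding dimension of 4, the random initialisation, and the function name `build_input_sequence` are illustrative assumptions, not the authors' code.

```python
import random


def build_input_sequence(patch_tokens, det_tokens, pos_embed):
    """Concatenate patch and detection tokens, then add one position embedding per token."""
    sequence = patch_tokens + det_tokens
    if len(pos_embed) != len(sequence):
        raise ValueError("need exactly one position embedding per token")
    return [[t + p for t, p in zip(token, pos)] for token, pos in zip(sequence, pos_embed)]


rng = random.Random(0)
dim = 4  # tiny embedding dimension for the sketch; the slides use 768


def random_tokens(count):
    """Stand-in for learned parameters or embedded patches (assumption: random values)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(count)]


patch_tokens = random_tokens(4800)  # flattened feature map
det_tokens = random_tokens(100)     # detection tokens
pos_embed = random_tokens(4900)     # one entry per token in the sequence

sequence = build_input_sequence(patch_tokens, det_tokens, pos_embed)
print(len(sequence), len(sequence[0]))  # 4900 4
```

The Transformer encoder then processes this single sequence, so the detection tokens attend to the patch tokens without any 2D feature map being rebuilt.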
Component 2 – Vision Transformer (Backbone)
Detection token → Multi-Layer Perceptron → No. of Class
Detection token → Multi-Layer Perceptron → Sigmoid → x, y, w, h (normalized to [0, 1])
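The two heads can be sketched as follows, assuming each head is a small MLP with ReLU between layers and that the sigmoid is applied only to the box output. The layer sizes, the number of classes (3), the random weights, and the helper names `init_mlp` and `detection_heads` are hypothetical choices for illustration.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function; maps any real number into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def linear(x, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp(x, layers):
    """Apply dense layers with ReLU between them (no activation after the last layer)."""
    for i, (weight, bias) in enumerate(layers):
        x = linear(x, weight, bias)
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def init_mlp(sizes, rng):
    """Random weights and zero biases for consecutive layer sizes (illustrative only)."""
    return [
        ([[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)], [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def detection_heads(det_token, class_mlp, box_mlp):
    """Map one detection token to class scores and a normalized (x, y, w, h) box."""
    class_scores = mlp(det_token, class_mlp)
    box = [sigmoid(v) for v in mlp(det_token, box_mlp)]
    return class_scores, box


rng = random.Random(0)
embed_dim, hidden, num_classes = 8, 16, 3  # hypothetical sizes
class_mlp = init_mlp([embed_dim, hidden, num_classes], rng)
box_mlp = init_mlp([embed_dim, hidden, 4], rng)
det_token = [rng.gauss(0.0, 1.0) for _ in range(embed_dim)]

class_scores, box = detection_heads(det_token, class_mlp, box_mlp)
print(len(class_scores), len(box))
```

Because the box values are normalized, they can be scaled by the image width and height to recover pixel coordinates.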
Component 3 – Bipartite Matching Loss
Prediction: 100 entries (1., 2., 3., ..., 100.), each with No. of Class and x, y, w, h
Ground Truth: n entries (1., 2., 3., ..., n.), each with No. of Class and x, y, w, h
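The matching step can be sketched on a toy example: each ground-truth object is assigned to a distinct prediction so that the total cost is minimal, and predictions left unmatched are treated as "no object". Two simplifications are assumed here. The cost is only the negative class probability plus the L1 box distance, and the search is brute force over permutations, which is feasible only for a handful of boxes. DETR describes using the Hungarian algorithm for this step. The names `match_cost` and `bipartite_match` are hypothetical.

```python
from itertools import permutations


def match_cost(pred, target):
    """Simplified pairwise cost: -(probability of the target class) + L1 box distance."""
    probs, box = pred
    label, gt_box = target
    l1 = sum(abs(a - b) for a, b in zip(box, gt_box))
    return -probs[label] + l1


def bipartite_match(preds, targets):
    """Return (prediction index, target index) pairs with minimal total cost.

    Brute force over all ways to pick one distinct prediction per target.
    """
    best_cost, best = float("inf"), ()
    for pred_ids in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(pred_ids))
        if cost < best_cost:
            best_cost, best = cost, pred_ids
    return sorted((p, t) for t, p in enumerate(best))


# Three predictions: (class probabilities, (x, y, w, h)), boxes normalized to [0, 1].
preds = [
    ([0.1, 0.9], (0.5, 0.5, 0.2, 0.2)),
    ([0.8, 0.2], (0.2, 0.2, 0.1, 0.1)),
    ([0.5, 0.5], (0.9, 0.9, 0.1, 0.1)),
]
# Two ground-truth objects: (class index, (x, y, w, h)).
targets = [
    (0, (0.2, 0.2, 0.1, 0.1)),
    (1, (0.5, 0.5, 0.2, 0.2)),
]

matches = bipartite_match(preds, targets)
print(matches)  # [(0, 1), (1, 0)]

matched = {p for p, _ in matches}
unmatched = [i for i in range(len(preds)) if i not in matched]
print(unmatched)  # [2]
```

With 100 predictions and n ground-truth objects, the same idea leaves 100 - n predictions unmatched, and the loss is then computed over the matched pairs plus the "no object" predictions.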
Q & A
Experiments - Model Variants
Experiments - The Effects of Pre-training
Rethinking ImageNet Pre-training (He et al., 2018)
Self-Supervised Learning
Experiments - Comparisons with CNN
Experiments - Comparison with DETR
Experiments - Comparisons with Other Models
Meanings of the Results
• Each token specializes in a certain region and size
[Figure: center coordinates of bounding box predictions for Det-Tok 1 to Det-Tok 10; sizes: Small, Medium, Large]
• Category insensitive
[Figure: number of objects per object category, ground truth vs. prediction]
Discussion
• Points discussed within the Image Processing Team
Is there any reason to insist on the Transformer?
• It learns long-distance dependencies well. 1)
• Unlike a CNN, a Transformer lacks inductive bias, so it is hard to train, but once trained properly it can outperform a CNN. 2)
• Wouldn't using a CNN and a Transformer together be complementary?
Note: CNN's inductive bias
→ "In computer vision tasks, spatial information helps learning."
This model is easy to implement given an understanding of NLP models
Reconfirms the contribution of the Bipartite Matching Loss
• An object detector can be trained even with a relatively simple model structure
1) Intriguing Properties of Vision Transformers https://arxiv.org/pdf/2105.10497.pdf
2) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Q & A
