Hello, this is 딥논읽 (DNR)!
The paper introduced today is 'YOLOS'. Briefly, YOLOS is a model that performs 2D object detection using only a Transformer. Architecturally it relies on nothing but the Transformer encoder, and unlike other CNN-based object detectors, whose per-object AP varies widely even when trained on an evenly balanced dataset, one of its notable characteristics is that its AP is fairly uniform across all categories.
Today's detailed review was prepared by 김병현 of the Image Processing Team. Thank you in advance for your interest!
1. You Only Look at One Sequence (YOLOS): Rethinking Transformer in Vision through Object Detection
김병현
Image Processing Team
김선옥, 안종식, 이찬혁, 홍은기
2. Here comes YOLOS!!
YOLOS
Transformer-based 2D object detection model
Uses only a Transformer encoder & MLP heads
[Figures: YOLOS performance compared with SOTA object detectors; YOLOS detection examples]
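To make the "encoder + MLP heads only" design concrete, here is a minimal, illustrative sketch of a YOLOS-style forward pass. The module names, widths, and token counts are assumptions for the example, not the paper's exact configuration: patch tokens and learnable [DET] tokens pass through a plain Transformer encoder, and MLP heads read the [DET] tokens out as class and box predictions.

```python
# Minimal YOLOS-style forward pass (illustrative sketch, not the official code).
import torch
import torch.nn as nn

class TinyYOLOS(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, depth=6, heads=3,
                 num_det_tokens=100, num_classes=91):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.det_tokens = nn.Parameter(torch.zeros(1, num_det_tokens, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + num_det_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # MLP heads: class logits (+1 for "no object") and box (cx, cy, w, h)
        self.class_head = nn.Linear(dim, num_classes + 1)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 4), nn.Sigmoid())

    def forward(self, x):
        b = x.size(0)
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)   # [B, N, dim]
        tokens = torch.cat([patches, self.det_tokens.expand(b, -1, -1)], dim=1)
        tokens = self.encoder(tokens + self.pos_embed)
        det = tokens[:, -self.det_tokens.size(1):]                  # only the [DET] tokens
        return self.class_head(det), self.box_head(det)

logits, boxes = TinyYOLOS()(torch.randn(1, 3, 224, 224))
print(logits.shape, boxes.shape)  # torch.Size([1, 100, 92]) torch.Size([1, 100, 4])
```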
4. Transformer is Born to Transfer
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
Transformer is for sequential data such as natural language!!
[Figure: the Transformer architecture]
5. Vision Transformer
AN IMAGE IS WORTH 16X16 WORDS
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
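The "image as 16x16 words" idea comes down to a few tensor operations; this is only a sketch of the patchification step, assuming a 224x224 input and 16x16 patches.

```python
# Turning a 224x224 image into a sequence of 16x16 patches (illustrative).
import torch

img = torch.randn(3, 224, 224)                        # C, H, W
patches = img.unfold(1, 16, 16).unfold(2, 16, 16)     # [3, 14, 14, 16, 16]
seq = patches.permute(1, 2, 0, 3, 4).reshape(14 * 14, 3 * 16 * 16)
print(seq.shape)  # torch.Size([196, 768]) -> "an image is worth 16x16 words"
```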
6. Can an image be sequential data?
In object detection…
7. Can an image be sequential data?
In object detection…
[Figure: example detections (Dog: 0.89, Dog: 0.69, Person: 0.51)]
10. Can an image be sequential data?
In object detection…
[Figure: the image flattened into a sequence of patches]
Severe spatial information loss during position embedding
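A tiny illustration of this point: two patches that are vertical neighbours in the image end up far apart once the patch grid is flattened into a sequence, so the 2D neighbourhood has to be recovered from position embeddings alone (the 14x14 grid here is only an assumed example).

```python
# Why flattening the patch grid obscures 2D layout (illustrative numbers only).
grid = 14  # 14x14 patch grid, e.g. a 224x224 image with 16x16 patches

def seq_index(row, col, grid=grid):
    """Row-major position of a patch after the grid is flattened into a sequence."""
    return row * grid + col

# Two patches that touch vertically in the image...
a, b = (3, 5), (4, 5)
print(abs(seq_index(*a) - seq_index(*b)))  # 14 -> far apart in the 1D sequence
```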
11. How to Apply Transformer to Object Detection
ViT-FRCNN
Strategy 1: Concatenate patches back into a 2D feature map
Beal, J., Kim, E., Tzeng, E., Park, D. H., Zhai, A., & Kislyuk, D. (2020). Toward transformer-based object detection. arXiv preprint arXiv:2012.09958.
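A rough sketch of Strategy 1, assuming a standard ViT token layout (196 tokens of width 768 for a 224x224 image): the token sequence is simply folded back into a 2D feature map that a conventional detection head such as Faster R-CNN can consume.

```python
# Strategy 1 sketch: fold the patch-token sequence back into a 2D feature map.
import torch

tokens = torch.randn(1, 196, 768)                       # [B, N_patches, dim] from ViT
fmap = tokens.transpose(1, 2).reshape(1, 768, 14, 14)   # [B, dim, H/16, W/16]
print(fmap.shape)  # torch.Size([1, 768, 14, 14]) -> usable like a CNN feature map
```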
13. How to Apply Transformer to Object Detection
DETR
Strategy 2: CNN feature extractor + positional encoding + bipartite matching loss
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020, August). End-to-end object detection with transformers. In European Conference on Computer Vision (pp. 213-229). Springer, Cham.
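A loose, illustrative sketch of Strategy 2 under assumed shapes (it is not the official DETR code): CNN features are flattened, positional encodings are added, and a Transformer encoder-decoder with learned object queries emits per-query class and box predictions.

```python
# Strategy 2 sketch (DETR-style): CNN features + positional encoding +
# encoder-decoder with learned object queries. Names and sizes are illustrative.
import torch
import torch.nn as nn

backbone = nn.Sequential(                       # stand-in for a ResNet feature extractor
    nn.Conv2d(3, 256, 7, stride=32, padding=3), nn.ReLU())
transformer = nn.Transformer(d_model=256, nhead=8, num_encoder_layers=3,
                             num_decoder_layers=3, batch_first=True)
queries = nn.Parameter(torch.zeros(1, 100, 256))   # 100 learned object queries
pos = nn.Parameter(torch.zeros(1, 49, 256))        # positional encoding for the 7x7 map

x = torch.randn(1, 3, 224, 224)
feat = backbone(x).flatten(2).transpose(1, 2)      # [1, 49, 256]
hs = transformer(src=feat + pos, tgt=queries)      # decoder output per query
cls = nn.Linear(256, 92)(hs)                       # class logits (91 classes + "no object")
box = nn.Linear(256, 4)(hs).sigmoid()              # normalized (cx, cy, w, h)
print(cls.shape, box.shape)
```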
14. How to Apply Transformer to Object Detection
Swin Transformer
Strategy 3: Patch embedding with different patch sizes
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030.
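A small sketch of the patch-merging step that gives Swin its hierarchical feature maps (the token grid and channel sizes are assumed for illustration): 2x2 neighbouring tokens are concatenated and projected, halving the spatial resolution and doubling the channel width.

```python
# Strategy 3 sketch: Swin-style patch merging for hierarchical feature maps.
import torch
import torch.nn as nn

x = torch.randn(1, 56, 56, 96)                      # [B, H, W, C] patch tokens on a grid
merged = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                    x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # [1, 28, 28, 384]
merged = nn.Linear(4 * 96, 2 * 96)(merged)          # project 4C -> 2C
print(merged.shape)  # torch.Size([1, 28, 28, 192])
```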
15. How to Apply Transformer to Object Detection
Can a Transformer perform 2D object detection as a pure sequence-to-sequence method?
29. Component 3 – Bipartite Matching Loss
Prediction: boxes 1 … 100, each with a class number and (x, y, w, h)
Ground truth: boxes 1 … n, each with a class number and (x, y, w, h)
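A toy sketch of how the bipartite matching works, with an assumed, simplified cost (class probability plus L1 box distance; the real matching cost also includes a GIoU term): the Hungarian algorithm assigns each of the n ground-truth boxes to exactly one of the 100 predictions before the loss is computed.

```python
# Bipartite matching sketch (toy cost, illustrative only).
import torch
from scipy.optimize import linear_sum_assignment

num_preds, num_gt = 100, 3
pred_prob = torch.rand(num_preds, 92).softmax(-1)   # class probabilities per prediction
pred_box = torch.rand(num_preds, 4)                  # (cx, cy, w, h), normalized
gt_class = torch.tensor([1, 17, 17])                 # ground-truth class indices
gt_box = torch.rand(num_gt, 4)

cost_class = -pred_prob[:, gt_class]                 # [100, n] negative class probability
cost_box = torch.cdist(pred_box, gt_box, p=1)        # [100, n] L1 box distance
cost = cost_class + cost_box
pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
print(list(zip(pred_idx, gt_idx)))  # one prediction matched per ground-truth object
```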
38. Meanings of the Results
Each [DET] token specializes in a certain region and object size
[Figure: center coordinates of bounding box predictions for Det-Tok 1-10, by object size (small, medium, large)]
40. Meanings of the Results
Category Insensitive
[Figure: number of objects per category, ground truth vs. prediction]
41. Discussion
Points discussed by the Image Processing Team
Is there really a reason to insist on the Transformer?
• It learns long-distance dependencies well.1)
• Unlike a CNN, the Transformer has no inductive bias, so it is harder to train, but once trained properly it can outperform a CNN.2)
• Wouldn't a CNN and a Transformer be complementary when combined?
Note: the inductive bias of a CNN
→ "In computer vision tasks, spatial information helps learning."
This model is easy to implement given some familiarity with NLP models.
The contribution of the bipartite matching loss is confirmed once again.
• Even a relatively simple model architecture can be trained as an object detector.
1) Intriguing Properties of Vision Transformers https://arxiv.org/pdf/2105.10497.pdf
2) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.