End-to-End Object Detection with
Transformers
(DETR)
Nicolas Carion & Francisco Massa et al., “End-to-End Object Detection with Transformers”, ECCV2020
8th November, 2020
PR12 Paper Review
JinWon Lee
Samsung Electronics
End-to-End Object Detection with Transformers
Reference
• Facebook AI Blog
 https://ai.facebook.com/blog/end-to-end-object-detection-with-transformers/
• Author's YouTube
 https://youtu.be/utxbUlo9CyY
• Official Code
 https://github.com/facebookresearch/detr
Object Detection
• Predict a set of bounding boxes with labels, given an image
Classical Approach to Detection
• Popular approach: detection := classification of boxes
• Requires selecting a subset of candidate boxes
• Regression step to refine the predictions
• Typically non-differentiable
DETR in a nutshell
• Set prediction formulation
• No geometric priors (NMS, anchors, …)
• Fully differentiable
• Competitive performance
• Extensible to other tasks
Reframing the Task of Object Detection
• DETR casts the object detection task as an image-to-set problem.
• Given an image, the model must predict an unordered set of all the
objects present, each represented by its class, along with a tight
bounding box surrounding each one.
• This formulation is particularly suitable for Transformers.
Transformers
• Transformers are a deep learning architecture that has gained
popularity in recent years.
• They rely on a simple yet powerful mechanism called attention,
which enables AI models to selectively focus on certain parts of their
input and thus reason more effectively.
• Transformers have been widely applied on problems with sequential
data, and have also been extended to tasks as diverse as speech
recognition, symbolic mathematics, and reinforcement learning.
• But, perhaps surprisingly, computer vision has not yet been swept up
by the Transformer revolution.
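The attention mechanism itself can be illustrated in a few lines of PyTorch. This is only a toy single-head sketch with illustrative shapes, not DETR's multi-head implementation:

import torch
import torch.nn.functional as F

# Toy scaled dot-product attention: each query position produces a weighted
# mix of the values, with weights given by query-key similarity.
def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(10, 64)   # a sequence of 10 tokens with dimension 64
out = attention(q, k, v)          # (10, 64)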
Streamlined Detection Pipeline
DETR
Object Detection Set Prediction Loss
• DETR infers a fixed-size set of N predictions, where N is set to be significantly larger than the typical number of objects in an image.
• Let us denote by $y$ the ground truth set of objects, and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of N predictions.
• $y$ is also considered as a set of size N padded with $\varnothing$ (no object).
Object Detection Set Prediction Loss
• To find a bipartite matching between these two sets, we search for a permutation of N elements $\sigma \in \mathfrak{S}_N$ with the lowest cost:

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i}^{N} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$$

where $y_i = (c_i, b_i)$, $c_i$ is the target class label (which may be $\varnothing$) and $b_i \in [0, 1]^4$ is a vector of box center coordinates and its height and width.

$$\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)})$$

where $\hat{p}_{\sigma(i)}(c_i)$ is the probability of class $c_i$ and $\hat{b}_{\sigma(i)}$ is the predicted box.
• $\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$ is a pair-wise matching cost, and this optimal assignment is computed efficiently with the Hungarian algorithm (a minimal sketch follows below).
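A minimal sketch of this matching step, assuming softmax class probabilities prob (N × num_classes), predicted boxes pred_box (N × 4) and targets tgt_cls (M,), tgt_box (M × 4) in normalized (cx, cy, w, h) format; the cost weights are illustrative, not necessarily the values used in the official implementation:

import torch
from torchvision.ops import box_convert, generalized_box_iou
from scipy.optimize import linear_sum_assignment

def hungarian_match(prob, pred_box, tgt_cls, tgt_box, w_cls=1.0, w_l1=5.0, w_giou=2.0):
    cost_cls = -prob[:, tgt_cls]                                  # (N, M) classification term
    cost_l1 = torch.cdist(pred_box, tgt_box, p=1)                 # (N, M) L1 box distance
    giou = generalized_box_iou(box_convert(pred_box, "cxcywh", "xyxy"),
                               box_convert(tgt_box, "cxcywh", "xyxy"))
    cost = w_cls * cost_cls + w_l1 * cost_l1 - w_giou * giou      # total pair-wise matching cost
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(tgt_idx)    # optimal assignment sigma-hat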
Loss Function – Hungarian Loss
• A linear combination of a negative log-likelihood for class prediction and a box loss:

$$\mathcal{L}_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{box}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$

• In practice, the log-probability term is down-weighted by a factor of 10 when $c_i = \varnothing$ to account for class imbalance (see the classification-loss sketch below).
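Given the matching, the classification part of the Hungarian loss reduces to a cross-entropy over all N slots, with unmatched slots supervised as "no object" and down-weighted. A hedged sketch; the eos_coef name and the value 0.1 stand in for the 10x down-weighting and are illustrative:

import torch
import torch.nn.functional as F

def classification_loss(logits, tgt_cls, pred_idx, tgt_idx, num_classes, eos_coef=0.1):
    # logits: (N, num_classes + 1); unmatched slots get the no-object class (index num_classes).
    target = torch.full((logits.shape[0],), num_classes, dtype=torch.long)
    target[pred_idx] = tgt_cls[tgt_idx]
    weight = torch.ones(num_classes + 1)
    weight[num_classes] = eos_coef        # down-weight the no-object term by a factor of 10
    return F.cross_entropy(logits, target, weight=weight)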
Bounding Box Loss
• Unlike many detectors that predict boxes as a Δ w.r.t. some initial guesses, DETR makes box predictions directly.
• The most commonly used ℓ1 loss will have different scales for small and large boxes even if their relative errors are similar.
• To mitigate this issue, DETR uses a linear combination of the ℓ1 loss and the generalized IoU loss, which is scale-invariant (sketched below):

$$\mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{iou}\, \mathcal{L}_{iou}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{L1}\, \| b_i - \hat{b}_{\sigma(i)} \|_1$$
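A sketch of this box loss for already-matched pairs, assuming boxes in normalized (cx, cy, w, h) format and using torchvision's generalized_box_iou; the lambda weights are illustrative:

import torch
from torchvision.ops import box_convert, generalized_box_iou

def box_loss(pred_box, tgt_box, lambda_iou=2.0, lambda_l1=5.0):
    # pred_box, tgt_box: matched pairs of shape (M, 4) in normalized cxcywh.
    l1 = (pred_box - tgt_box).abs().sum(-1)
    giou = torch.diag(generalized_box_iou(box_convert(pred_box, "cxcywh", "xyxy"),
                                          box_convert(tgt_box, "cxcywh", "xyxy")))
    return lambda_iou * (1.0 - giou) + lambda_l1 * l1   # L_iou taken as 1 - GIoU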
Overall Architecture of DETR
Backbone
• Starting from the initial image $x_{img} \in \mathbb{R}^{3 \times H_0 \times W_0}$ (with 3 color channels), a conventional CNN backbone generates a lower-resolution activation map $f \in \mathbb{R}^{C \times H \times W}$.
• Typical values DETR uses are $C = 2048$ and $H, W = \frac{H_0}{32}, \frac{W_0}{32}$ (a rough sketch follows below).
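A rough sketch of this backbone stage, assuming a torchvision ResNet-50 with its classification head removed; the input size is illustrative:

import torch
import torchvision

resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool and fc

x_img = torch.randn(1, 3, 800, 1056)        # x_img in R^{3 x H0 x W0}
f = backbone(x_img)                          # f in R^{C x H x W}: (1, 2048, 25, 33), i.e. H0/32 x W0/32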
Transformer Encoder
• A 1×1 convolution first reduces the channel dimension, creating a new feature map $z_0 \in \mathbb{R}^{d \times H \times W}$, where $d$ is smaller than $C$.
• The encoder expects a sequence as input, hence the spatial dimensions of $z_0$ are collapsed into one dimension, resulting in a $d \times HW$ feature map.
• Since the transformer architecture is permutation-invariant, DETR supplements it with fixed positional encodings that are added to the input of each attention layer (see the sketch below).
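A sketch of preparing the encoder input, assuming d = 256. A learned positional embedding and PyTorch's off-the-shelf encoder stand in for simplicity; DETR itself uses fixed sine encodings added to the queries and keys of every attention layer:

import torch

C, d, H, W = 2048, 256, 25, 33
proj = torch.nn.Conv2d(C, d, kernel_size=1)               # reduce channels from C to d

f = torch.randn(1, C, H, W)                               # backbone feature map
z0 = proj(f)                                              # (1, d, H, W)
src = z0.flatten(2).permute(2, 0, 1)                      # (HW, batch, d) token sequence
pos = torch.nn.Parameter(torch.randn(H * W, 1, d) * 0.02) # positional embedding stand-in

encoder_layer = torch.nn.TransformerEncoderLayer(d_model=d, nhead=8)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=6)
memory = encoder(src + pos)                                # (HW, batch, d) encoder output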
Transformer Decoder
• The decoder follows the standard architecture of the transformer,
transforming N embeddings of size d using multi-headed self- and
encoder-decoder attention mechanisms.
• The difference with the original transformer is that our model decodes the N objects in parallel at each decoder layer.
• Since the decoder is also permutation-invariant, the N input
embeddings must be different to produce different results.
• These input embeddings are learnt positional encodings that we refer to as object queries, and similarly to the encoder, they are added to the input of each attention layer (a sketch follows below).
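A sketch of the decoder side with N learned object queries, assuming N = 100 and d = 256. In the real model the query embeddings are added at every attention layer; here they are only added at the decoder input for brevity:

import torch

N, d, HW = 100, 256, 25 * 33
query_embed = torch.nn.Embedding(N, d)               # learned object queries

decoder_layer = torch.nn.TransformerDecoderLayer(d_model=d, nhead=8)
decoder = torch.nn.TransformerDecoder(decoder_layer, num_layers=6)

memory = torch.randn(HW, 1, d)                        # encoder output
tgt = torch.zeros(N, 1, d) + query_embed.weight.unsqueeze(1)
hs = decoder(tgt, memory)                             # (N, 1, d): N objects decoded in parallel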
Architecture of DETR's Transformer
Prediction FFN and Auxiliary Decoding Losses
• The final prediction is computed by a 3-layer perceptron with ReLU
and hidden dimension 𝑑.
• The FFN predicts the normalized center coordinates, height and width of the box (sketched below).
• Authors add prediction FFNs and Hungarian loss after each decoder layer. All prediction FFNs share their parameters.
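A sketch of the prediction heads on top of the decoder outputs, assuming d = 256 and 91 COCO classes plus one no-object class; the 3-layer MLP box head ends in a sigmoid so boxes come out as normalized (cx, cy, w, h):

import torch

d, num_classes, N = 256, 91, 100
class_head = torch.nn.Linear(d, num_classes + 1)      # +1 for the no-object class
box_head = torch.nn.Sequential(
    torch.nn.Linear(d, d), torch.nn.ReLU(),
    torch.nn.Linear(d, d), torch.nn.ReLU(),
    torch.nn.Linear(d, 4), torch.nn.Sigmoid(),        # normalized (cx, cy, w, h)
)

hs = torch.randn(N, 1, d)                             # decoder output embeddings
logits, boxes = class_head(hs), box_head(hs)          # (N, 1, 92) and (N, 1, 4)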
DETR
https://youtu.be/utxbUlo9CyY
Experiments
• Experiments on the COCO 2017 detection and panoptic segmentation datasets. There are 7 instances per image on average, and up to 63 instances in a single image in the training set.
• Two different backbones are used: a ResNet-50 (DETR) and a ResNet-101 (DETR-R101).
• Dilation in the last stage of the backbone is also used to increase the feature resolution: DETR-DC5 and DETR-DC5-R101 (dilated C5 stage).
• This modification increases the resolution by a factor of 2, thus improving performance for small objects, at the cost of a 16x higher cost in the self-attention of the encoder, leading to an overall 2x increase in computational cost.
Results
Deformable DETR
(https://arxiv.org/abs/2010.04159)
Number of Encoder Layers
• Without encoder layers, overall AP drops by 3.9 points, with a more significant drop of 6.0 AP on large objects.
• Through global scene reasoning, the encoder is important for disentangling objects.
Encoder Self-Attention
Number of Decoder Layers
• A single decoding layer of the transformer is not able to compute any cross-correlations between the output elements, and thus it is prone to making multiple predictions for the same object.  NMS improves performance
Decoder Attention
• Decoder attention is fairly local, meaning that it mostly attends to object extremities such as heads or legs.
Importance of Positional Encodings
Loss Ablations
Analysis
• Decoder output slot analysis
 DETR learns different specialization for each
query slot.
• Generalization to unseen numbers of
instances
 There is no image with more than 13 giraffes in
the training set.
 This experiment confirms that there is no strong
class-specialization in each object query.
Increasing the Number of Instances
• While the model detects all instances when up to 50 are visible, it
then starts saturating and misses more and more instances. Notably,
when the image contains all 100 instances, the model only detects 30
on average, which is less than if the image contains only 50 instances
that are all detected.
DETR for Panoptic Segmentation
• Similarly to the extension of Faster R-CNN to Mask R-CNN, DETR
can be naturally extended by adding a mask head on top of the
decoder outputs.
DETR for Panoptic Segmentation
DETR for Panoptic Segmentation
https://youtu.be/utxbUlo9CyY
Panoptic Segmentation Results
Panoptic Segmentation Results
Thank you