End-to-End Object Detection with
Transformers
(DETR)
Nicolas Carion & Francisco Massa et al., “End-to-End Object Detection with Transformers”, ECCV2020
8th November, 2020
PR12 Paper Review
JinWon Lee
Samsung Electronics
End-to-End Object Detection with Transformers
Reference
• Facebook AI Blog
 https://ai.facebook.com/blog/end-to-end-object-detection-with-transformers/
• Author's YouTube
 https://youtu.be/utxbUlo9CyY
• Official Code
 https://github.com/facebookresearch/detr
Object Detection
• Predict a set of bounding boxes with labels, given an image
Classical Approach to Detection
• Popular approach: detection := classification of boxes
• Requires selecting a subset of candidate boxes
• Regression step to refine the predictions
• Typically non-differentiable
DETR in a nutshell
• Set prediction formulation
• No geometric priors (NMS, anchors, …)
• Fully differentiable
• Competitive performance
• Extensible to other tasks
Reframing the Task of Object Detection
• DETR casts the object detection task as an image-to-set problem.
• Given an image, the model must predict an unordered set of all the
objects present, each represented by its class, along with a tight
bounding box surrounding each one.
• This formulation is particularly suitable for Transformers.
Transformers
• Transformers are a deep learning architecture that has gained
popularity in recent years.
• They rely on a simple yet powerful mechanism called attention,
which enables AI models to selectively focus on certain parts of their
input and thus reason more effectively.
• Transformers have been widely applied on problems with sequential
data, and have also been extended to tasks as diverse as speech
recognition, symbolic mathematics, and reinforcement learning.
• But, perhaps surprisingly, computer vision has not yet been swept up
by the Transformer revolution.
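The attention mechanism itself can be illustrated in a few lines of PyTorch. This is only a toy single-head sketch with illustrative shapes, not DETR's multi-head implementation:

import torch
import torch.nn.functional as F

# Toy scaled dot-product attention: each query position produces a weighted
# mix of the values, with weights given by query-key similarity.
def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(10, 64)   # a sequence of 10 tokens with dimension 64
out = attention(q, k, v)          # (10, 64)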
Streamlined Detection Pipeline
DETR
Object Detection Set Prediction Loss
• DETR infers a fixed-size set of N predictions, where N is set to be significantly larger than the typical number of objects in an image.
• Let us denote by $y$ the ground truth set of objects, and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of N predictions.
• $y$ is also considered as a set of size N padded with $\varnothing$ (no object).
Object Detection Set Prediction Loss
• To find a bipartite matching between these two sets, we search for a permutation of N elements $\sigma \in \mathfrak{S}_N$ with the lowest cost:

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i}^{N} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$$

where $y_i = (c_i, b_i)$, $c_i$ is the target class label (which may be $\varnothing$) and $b_i \in [0, 1]^4$ is a vector of box center coordinates and its height and width.

$$\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)})$$

where $\hat{p}_{\sigma(i)}(c_i)$ is the probability of class $c_i$ and $\hat{b}_{\sigma(i)}$ is the predicted box.
• $\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$ is a pair-wise matching cost, and this optimal assignment is computed efficiently with the Hungarian algorithm (a minimal sketch follows below).
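A minimal sketch of this matching step, assuming softmax class probabilities prob (N × num_classes), predicted boxes pred_box (N × 4) and targets tgt_cls (M,), tgt_box (M × 4) in normalized (cx, cy, w, h) format; the cost weights are illustrative, not necessarily the values used in the official implementation:

import torch
from torchvision.ops import box_convert, generalized_box_iou
from scipy.optimize import linear_sum_assignment

def hungarian_match(prob, pred_box, tgt_cls, tgt_box, w_cls=1.0, w_l1=5.0, w_giou=2.0):
    cost_cls = -prob[:, tgt_cls]                                  # (N, M) classification term
    cost_l1 = torch.cdist(pred_box, tgt_box, p=1)                 # (N, M) L1 box distance
    giou = generalized_box_iou(box_convert(pred_box, "cxcywh", "xyxy"),
                               box_convert(tgt_box, "cxcywh", "xyxy"))
    cost = w_cls * cost_cls + w_l1 * cost_l1 - w_giou * giou      # total pair-wise matching cost
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(tgt_idx)    # optimal assignment sigma-hat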
Loss Function – Hungarian Loss
• A linear combination of a negative log-likelihood for class prediction and a box loss:

$$\mathcal{L}_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{box}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$

• In practice, the log-probability term is down-weighted by a factor of 10 when $c_i = \varnothing$ to account for class imbalance (see the classification-loss sketch below).
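Given the matching, the classification part of the Hungarian loss reduces to a cross-entropy over all N slots, with unmatched slots supervised as "no object" and down-weighted. A hedged sketch; the eos_coef name and the value 0.1 stand in for the 10x down-weighting and are illustrative:

import torch
import torch.nn.functional as F

def classification_loss(logits, tgt_cls, pred_idx, tgt_idx, num_classes, eos_coef=0.1):
    # logits: (N, num_classes + 1); unmatched slots get the no-object class (index num_classes).
    target = torch.full((logits.shape[0],), num_classes, dtype=torch.long)
    target[pred_idx] = tgt_cls[tgt_idx]
    weight = torch.ones(num_classes + 1)
    weight[num_classes] = eos_coef        # down-weight the no-object term by a factor of 10
    return F.cross_entropy(logits, target, weight=weight)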
Bounding Box Loss
• Unlike many detectors that predict boxes as a Δ w.r.t. some initial guesses, DETR makes box predictions directly.
• The most commonly used ℓ1 loss will have different scales for small and large boxes even if their relative errors are similar.
• To mitigate this issue, DETR uses a linear combination of the ℓ1 loss and the generalized IoU loss, which is scale-invariant (sketched below):

$$\mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{iou}\, \mathcal{L}_{iou}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{L1}\, \| b_i - \hat{b}_{\sigma(i)} \|_1$$
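A sketch of this box loss for already-matched pairs, assuming boxes in normalized (cx, cy, w, h) format and using torchvision's generalized_box_iou; the lambda weights are illustrative:

import torch
from torchvision.ops import box_convert, generalized_box_iou

def box_loss(pred_box, tgt_box, lambda_iou=2.0, lambda_l1=5.0):
    # pred_box, tgt_box: matched pairs of shape (M, 4) in normalized cxcywh.
    l1 = (pred_box - tgt_box).abs().sum(-1)
    giou = torch.diag(generalized_box_iou(box_convert(pred_box, "cxcywh", "xyxy"),
                                          box_convert(tgt_box, "cxcywh", "xyxy")))
    return lambda_iou * (1.0 - giou) + lambda_l1 * l1   # L_iou taken as 1 - GIoU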
Overall Architecture of DETR
Backbone
• Starting from the initial image $x_{img} \in \mathbb{R}^{3 \times H_0 \times W_0}$ (with 3 color channels), a conventional CNN backbone generates a lower-resolution activation map $f \in \mathbb{R}^{C \times H \times W}$.
• Typical values DETR uses are $C = 2048$ and $H, W = \frac{H_0}{32}, \frac{W_0}{32}$ (a rough sketch follows below).
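A rough sketch of this backbone stage, assuming a torchvision ResNet-50 with its classification head removed; the input size is illustrative:

import torch
import torchvision

resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool and fc

x_img = torch.randn(1, 3, 800, 1056)        # x_img in R^{3 x H0 x W0}
f = backbone(x_img)                          # f in R^{C x H x W}: (1, 2048, 25, 33), i.e. H0/32 x W0/32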
Transformer Encoder
• A 1×1 convolution first reduces the channel dimension, creating a new feature map $z_0 \in \mathbb{R}^{d \times H \times W}$, where $d$ is smaller than $C$.
• The encoder expects a sequence as input, hence the spatial dimensions of $z_0$ are collapsed into one dimension, resulting in a $d \times HW$ feature map.
• Since the transformer architecture is permutation-invariant, DETR supplements it with fixed positional encodings that are added to the input of each attention layer (see the sketch below).
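A sketch of preparing the encoder input, assuming d = 256. A learned positional embedding and PyTorch's off-the-shelf encoder stand in for simplicity; DETR itself uses fixed sine encodings added to the queries and keys of every attention layer:

import torch

C, d, H, W = 2048, 256, 25, 33
proj = torch.nn.Conv2d(C, d, kernel_size=1)               # reduce channels from C to d

f = torch.randn(1, C, H, W)                               # backbone feature map
z0 = proj(f)                                              # (1, d, H, W)
src = z0.flatten(2).permute(2, 0, 1)                      # (HW, batch, d) token sequence
pos = torch.nn.Parameter(torch.randn(H * W, 1, d) * 0.02) # positional embedding stand-in

encoder_layer = torch.nn.TransformerEncoderLayer(d_model=d, nhead=8)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=6)
memory = encoder(src + pos)                                # (HW, batch, d) encoder output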
Transformer Decoder
• The decoder follows the standard architecture of the transformer,
transforming N embeddings of size d using multi-headed self- and
encoder-decoder attention mechanisms.
• The difference with the original transformer is that our model decodes the N objects in parallel at each decoder layer.
• Since the decoder is also permutation-invariant, the N input
embeddings must be different to produce different results.
• These input embeddings are learnt positional encodings that we refer to as object queries, and similarly to the encoder, they are added to the input of each attention layer (a sketch follows below).
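A sketch of the decoder side with N learned object queries, assuming N = 100 and d = 256. In the real model the query embeddings are added at every attention layer; here they are only added at the decoder input for brevity:

import torch

N, d, HW = 100, 256, 25 * 33
query_embed = torch.nn.Embedding(N, d)               # learned object queries

decoder_layer = torch.nn.TransformerDecoderLayer(d_model=d, nhead=8)
decoder = torch.nn.TransformerDecoder(decoder_layer, num_layers=6)

memory = torch.randn(HW, 1, d)                        # encoder output
tgt = torch.zeros(N, 1, d) + query_embed.weight.unsqueeze(1)
hs = decoder(tgt, memory)                             # (N, 1, d): N objects decoded in parallel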
Architecture of DETR's Transformer
Prediction FFN and Auxiliary Decoding Losses
• The final prediction is computed by a 3-layer perceptron with ReLU
and hidden dimension 𝑑.
• The FFN predicts the normalized center coordinates, height and width of the box (sketched below).
• Authors add prediction FFNs and Hungarian loss after each decoder layer. All prediction FFNs share their parameters.
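A sketch of the prediction heads on top of the decoder outputs, assuming d = 256 and 91 COCO classes plus one no-object class; the 3-layer MLP box head ends in a sigmoid so boxes come out as normalized (cx, cy, w, h):

import torch

d, num_classes, N = 256, 91, 100
class_head = torch.nn.Linear(d, num_classes + 1)      # +1 for the no-object class
box_head = torch.nn.Sequential(
    torch.nn.Linear(d, d), torch.nn.ReLU(),
    torch.nn.Linear(d, d), torch.nn.ReLU(),
    torch.nn.Linear(d, 4), torch.nn.Sigmoid(),        # normalized (cx, cy, w, h)
)

hs = torch.randn(N, 1, d)                             # decoder output embeddings
logits, boxes = class_head(hs), box_head(hs)          # (N, 1, 92) and (N, 1, 4)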
DETR
https://youtu.be/utxbUlo9CyY
Experiments
• Experiments on the COCO 2017 detection and panoptic segmentation datasets. There are 7 instances per image on average, and up to 63 instances in a single image in the training set.
• Two different backbones are used: a ResNet-50 (DETR) and a ResNet-101 (DETR-R101).
• Dilation in the last stage of the backbone is also used to increase the feature resolution: DETR-DC5 and DETR-DC5-R101 (dilated C5 stage).
• This modification increases the resolution by a factor of 2, thus improving performance for small objects, at the cost of a 16x higher cost in the self-attention of the encoder, leading to an overall 2x increase in computational cost.
Results
Deformable DETR
(https://arxiv.org/abs/2010.04159)
Number of Encoder Layers
• Without encoder layers, overall AP drops by 3.9 points, with a more significant drop of 6.0 AP on large objects.
• Through global scene reasoning, the encoder is important for disentangling objects.
Encoder Self-Attention
Number of Decoder Layers
• A single decoding layer of the transformer is not able to compute any cross-correlations between the output elements, and thus it is prone to making multiple predictions for the same object.  NMS improves performance
Decoder Attention
• Decoder attention is fairly local, meaning that it mostly attends to object extremities such as heads or legs.
Importance of Positional Encodings
Loss Ablations
Analysis
• Decoder output slot analysis
 DETR learns different specialization for each
query slot.
• Generalization to unseen numbers of
instances
 There is no image with more than 13 giraffes in
the training set.
 This experiment confirms that there is no strong
class-specialization in each object query.
Increasing the Number of Instances
• While the model detects all instances when up to 50 are visible, it
then starts saturating and misses more and more instances. Notably,
when the image contains all 100 instances, the model only detects 30
on average, which is less than if the image contains only 50 instances
that are all detected.
DETR for Panoptic Segmentation
• Similarly to the extension of Faster R-CNN to Mask R-CNN, DETR
can be naturally extended by adding a mask head on top of the
decoder outputs.
DETR for Panoptic Segmentation
DETR for Panoptic Segmentation
https://youtu.be/utxbUlo9CyY
Panoptic Segmentation Results
Panoptic Segmentation Results
Thank you