4. Self Attention
Jaemin Jeong, Seminar
The self-attention mechanisms of transformers, which explicitly model all pairwise interactions between elements in a sequence, make these architectures particularly suitable for specific constraints of set prediction, such as removing duplicate predictions.
https://jalammar.github.io/illustrated-transformer/
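As a rough illustration of those pairwise interactions, here is a minimal single-head scaled dot-product self-attention sketch (toy sizes, not DETR's multi-head implementation):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (n, d).

    The (n, n) attention matrix scores every element against every other
    element, which is the explicit pairwise interaction mentioned above.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # project to queries, keys, values
    scores = q @ k.T / k.shape[-1] ** 0.5     # (n, n) pairwise similarities
    attn = F.softmax(scores, dim=-1)          # one distribution per query
    return attn @ v                           # values mixed across the whole sequence

# toy usage: 5 elements with 8-dim features (illustrative sizes only)
n, d = 5, 8
x = torch.randn(n, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)        # shape (5, 8)
```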
8. Positional Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Position k = [sin(k/x_0), cos(k/x_0), sin(k/x_1), cos(k/x_1), …], where x_i = 10000^(2i/d)
https://inmoonlight.github.io/2020/01/26/Positional-Encoding/
For each dimension pair i (2i < d), write x_i = 10000^(2i/d), so that
PE(pos, 2i) = sin(pos / x_i)
PE(pos, 2i+1) = cos(pos / x_i)

For an offset k:
PE(pos+k, 2i) = sin((pos + k) / x_i)
              = sin(pos/x_i) cos(k/x_i) + cos(pos/x_i) sin(k/x_i)
              = PE(pos, 2i) cos(k/x_i) + PE(pos, 2i+1) sin(k/x_i)
PE(pos+k, 2i+1) = cos((pos + k) / x_i)
                = cos(pos/x_i) cos(k/x_i) − sin(pos/x_i) sin(k/x_i)
                = PE(pos, 2i+1) cos(k/x_i) − PE(pos, 2i) sin(k/x_i)

So PE(pos+k) is a linear function of PE(pos): the coefficients depend only on the offset k, not on the position pos.

Rules a positional encoding should satisfy:
- The distance between two tokens a fixed offset apart must be constant, regardless of where they sit in the sequence.
- Each position must map to a unique encoding.
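A minimal NumPy sketch of this encoding, with a numerical check of the linearity property just derived (sizes here are illustrative, not DETR's):

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d):
    """PE[pos, 2i] = sin(pos / 10000**(2i/d)), PE[pos, 2i+1] = cos(pos / 10000**(2i/d))."""
    pos = np.arange(num_positions)[:, None]   # (num_positions, 1)
    i = np.arange(d // 2)[None, :]            # (1, d/2)
    x_i = 10000 ** (2 * i / d)                # wavelength per dimension pair
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(pos / x_i)           # even dimensions
    pe[:, 1::2] = np.cos(pos / x_i)           # odd dimensions
    return pe

# check the linearity shown above:
# PE(pos+k, 2i) = PE(pos, 2i)*cos(k/x_i) + PE(pos, 2i+1)*sin(k/x_i)
pe = sinusoidal_positional_encoding(64, 8)
pos, k, i, d = 10, 7, 2, 8
x_i = 10000 ** (2 * i / d)
lhs = pe[pos + k, 2 * i]
rhs = pe[pos, 2 * i] * np.cos(k / x_i) + pe[pos, 2 * i + 1] * np.sin(k / x_i)
assert np.isclose(lhs, rhs)
```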
12. DETR
Two ingredients are essential for direct set prediction in detection:
1. a set prediction loss that forces unique matching between predicted and ground-truth boxes (see the matching sketch below);
2. an architecture that predicts (in a single pass) a set of objects and models their relations.
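For ingredient 1, the unique matching is a bipartite assignment problem. A minimal sketch using SciPy's Hungarian solver on a made-up cost matrix (DETR's real cost combines class probability, L1 box distance, and generalized IoU):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical matching costs: rows = 4 predictions, columns = 3 ground-truth boxes.
# The numbers are invented purely to show the mechanics of the assignment.
cost = np.array([
    [0.9, 0.1, 0.5],
    [0.4, 0.8, 0.2],
    [0.3, 0.6, 0.7],
    [0.2, 0.5, 0.9],
])

pred_idx, gt_idx = linear_sum_assignment(cost)   # minimum-cost one-to-one assignment
print(list(zip(pred_idx, gt_idx)))               # each ground-truth box gets a unique prediction
```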
14. What are object queries?
Object queries are the input to the decoder (a sketch follows below):
- learned embeddings
- 100 of them are used
- no built-in geometric prior
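A minimal sketch of such learned query embeddings (d_model = 256 matches the width reported later; the zero decoder target mirrors the public DETR implementation and is stated here as an assumption):

```python
import torch
import torch.nn as nn

num_queries, d_model = 100, 256                    # 100 queries, as on this slide
query_embed = nn.Embedding(num_queries, d_model)   # learned embeddings, no geometric prior

# The same 100 learned vectors are fed to the decoder for every image;
# each query specializes in certain regions and box sizes during training.
queries = query_embed.weight                       # shape (100, 256)
tgt = torch.zeros_like(queries)                    # decoder target starts at zero (assumed,
                                                   # following the reference implementation)
```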
16. Experiments
- Optimizer: AdamW (parameter-group sketch after this list)
- Transformer learning rate: 10^-4
- Backbone learning rate: 10^-5
- Weight initialization: Xavier init
- Backbone: ImageNet-pretrained ResNet (torchvision), frozen BatchNorm layers
  - DETR: ResNet-50
  - DETR-R101: ResNet-101
  - DETR-DC5: feature resolution is also increased by adding a dilation to the last stage of the backbone and removing the stride from the first convolution of this stage
- Data augmentation
- Dropout: 0.1
- Epochs: 300 (learning rate multiplied by 0.1 at epoch 200)
- Hardware: 16 V100 GPUs for 3 days, 4 images per GPU
- Faster R-CNN baseline: 500 epochs (learning rate multiplied by 0.1 at epoch 400)
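A sketch of how the two learning rates above can be wired up as AdamW parameter groups; the `backbone` / `transformer` module names are placeholders, not DETR's exact layout:

```python
import torch
import torch.nn as nn

# Stand-in model: the two submodules only exist to demonstrate separate
# parameter groups with different learning rates.
model = nn.ModuleDict({
    "backbone": nn.Conv2d(3, 64, kernel_size=7),
    "transformer": nn.Linear(256, 256),
})

param_groups = [
    {"params": model["transformer"].parameters(), "lr": 1e-4},  # transformer lr: 10^-4
    {"params": model["backbone"].parameters(), "lr": 1e-5},     # backbone lr: 10^-5
]
optimizer = torch.optim.AdamW(param_groups)

# drop the learning rate by 10x after epoch 200 of 300, as listed above
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.1)
```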
18. Number of encoder layers
Selected configuration: 6 encoder and 6 decoder layers of width 256 with 8 attention heads
AP_L: +7.8
AP_S: −5.5
19. Encoder self-attention
In Figure 3, we visualize the attention maps of the last encoder layer of a trained model, focusing on a few points in the image.
The encoder seems to separate instances already, which likely simplifies object extraction and localization for the decoder.
20. Number of decoder layers
This can be explained by the fact that a single decoding layer of the transformer is not able to compute any cross-correlations between the output elements, and thus it is prone to making multiple predictions for the same object.
22. Importance of positional encodings
none: no positional encoding at all
at input: added once, at the input
at attn: added at every attention layer (sketched below)
sine: fixed sinusoidal positional encoding
learned: learnable positional embedding
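A rough sketch of the difference between "at input" and "at attn" for a single encoder self-attention layer; in DETR the spatial encodings are added to queries and keys only (not to values), which is what the second variant shows:

```python
import torch
import torch.nn as nn

d_model, n_tokens = 256, 50
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
x = torch.randn(1, n_tokens, d_model)    # flattened image features
pos = torch.randn(1, n_tokens, d_model)  # spatial positional encoding (sine or learned)

# "at input": add the encoding once, before the first layer, then attend normally
x_in = x + pos
out_at_input, _ = attn(x_in, x_in, x_in)

# "at attn": re-add the encoding to queries and keys inside every attention layer
# (DETR's default); the values carry no positional information
q = k = x + pos
out_at_attn, _ = attn(q, k, x)
```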