4. Self Attention
Jaemin Jeong, Seminar
The self-attention mechanisms of transformers, which explicitly model all pairwise interactions between elements in a sequence, make these architectures particularly suitable for specific constraints of set prediction, such as removing duplicate predictions.
https://jalammar.github.io/illustrated-transformer/
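As a rough illustration of those pairwise interactions, here is a minimal single-head scaled dot-product self-attention sketch (toy sizes, not DETR's multi-head implementation):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (n, d).

    The (n, n) attention matrix scores every element against every other
    element, which is the explicit pairwise interaction mentioned above.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # project to queries, keys, values
    scores = q @ k.T / k.shape[-1] ** 0.5     # (n, n) pairwise similarities
    attn = F.softmax(scores, dim=-1)          # one distribution per query
    return attn @ v                           # values mixed across the whole sequence

# toy usage: 5 elements with 8-dim features (illustrative sizes only)
n, d = 5, 8
x = torch.randn(n, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)        # shape (5, 8)
```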
8. Positional Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Position k = [sin(k/x_0), cos(k/x_0), sin(k/x_1), cos(k/x_1), …], where x_i = 10000^(2i/d)
https://inmoonlight.github.io/2020/01/26/Positional-Encoding/
For each dimension pair i (2i < d), write x_i = 10000^(2i/d), so that
PE(pos, 2i) = sin(pos / x_i)
PE(pos, 2i+1) = cos(pos / x_i)

For an offset k:
PE(pos+k, 2i) = sin((pos + k) / x_i)
              = sin(pos/x_i) cos(k/x_i) + cos(pos/x_i) sin(k/x_i)
              = PE(pos, 2i) cos(k/x_i) + PE(pos, 2i+1) sin(k/x_i)
PE(pos+k, 2i+1) = cos((pos + k) / x_i)
                = cos(pos/x_i) cos(k/x_i) − sin(pos/x_i) sin(k/x_i)
                = PE(pos, 2i+1) cos(k/x_i) − PE(pos, 2i) sin(k/x_i)

So PE(pos+k) is a linear function of PE(pos): the coefficients depend only on the offset k, not on the position pos.

Rules a positional encoding should satisfy:
- The distance between two tokens a fixed offset apart must be constant, regardless of where they sit in the sequence.
- Each position must map to a unique encoding.
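A minimal NumPy sketch of this encoding, with a numerical check of the linearity property just derived (sizes here are illustrative, not DETR's):

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d):
    """PE[pos, 2i] = sin(pos / 10000**(2i/d)), PE[pos, 2i+1] = cos(pos / 10000**(2i/d))."""
    pos = np.arange(num_positions)[:, None]   # (num_positions, 1)
    i = np.arange(d // 2)[None, :]            # (1, d/2)
    x_i = 10000 ** (2 * i / d)                # wavelength per dimension pair
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(pos / x_i)           # even dimensions
    pe[:, 1::2] = np.cos(pos / x_i)           # odd dimensions
    return pe

# check the linearity shown above:
# PE(pos+k, 2i) = PE(pos, 2i)*cos(k/x_i) + PE(pos, 2i+1)*sin(k/x_i)
pe = sinusoidal_positional_encoding(64, 8)
pos, k, i, d = 10, 7, 2, 8
x_i = 10000 ** (2 * i / d)
lhs = pe[pos + k, 2 * i]
rhs = pe[pos, 2 * i] * np.cos(k / x_i) + pe[pos, 2 * i + 1] * np.sin(k / x_i)
assert np.isclose(lhs, rhs)
```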
12. DETR
Two ingredients are essential for direct set prediction in detection:
1. a set prediction loss that forces unique matching between predicted and ground-truth boxes (see the matching sketch below);
2. an architecture that predicts (in a single pass) a set of objects and models their relations.
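For ingredient 1, the unique matching is a bipartite assignment problem. A minimal sketch using SciPy's Hungarian solver on a made-up cost matrix (DETR's real cost combines class probability, L1 box distance, and generalized IoU):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical matching costs: rows = 4 predictions, columns = 3 ground-truth boxes.
# The numbers are invented purely to show the mechanics of the assignment.
cost = np.array([
    [0.9, 0.1, 0.5],
    [0.4, 0.8, 0.2],
    [0.3, 0.6, 0.7],
    [0.2, 0.5, 0.9],
])

pred_idx, gt_idx = linear_sum_assignment(cost)   # minimum-cost one-to-one assignment
print(list(zip(pred_idx, gt_idx)))               # each ground-truth box gets a unique prediction
```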
14. What are object queries?
Object queries are the input to the decoder (a sketch follows below):
- learned embeddings
- 100 of them are used
- no built-in geometric prior
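A minimal sketch of such learned query embeddings (d_model = 256 matches the width reported later; the zero decoder target mirrors the public DETR implementation and is stated here as an assumption):

```python
import torch
import torch.nn as nn

num_queries, d_model = 100, 256                    # 100 queries, as on this slide
query_embed = nn.Embedding(num_queries, d_model)   # learned embeddings, no geometric prior

# The same 100 learned vectors are fed to the decoder for every image;
# each query specializes in certain regions and box sizes during training.
queries = query_embed.weight                       # shape (100, 256)
tgt = torch.zeros_like(queries)                    # decoder target starts at zero (assumed,
                                                   # following the reference implementation)
```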
16. Experiments
- Optimizer: AdamW (parameter-group sketch after this list)
- Transformer learning rate: 10^-4
- Backbone learning rate: 10^-5
- Weight initialization: Xavier init
- Backbone: ImageNet-pretrained ResNet (torchvision), frozen BatchNorm layers
  - DETR: ResNet-50
  - DETR-R101: ResNet-101
  - DETR-DC5: feature resolution is also increased by adding a dilation to the last stage of the backbone and removing the stride from the first convolution of this stage
- Data augmentation
- Dropout: 0.1
- Epochs: 300 (learning rate multiplied by 0.1 at epoch 200)
- Hardware: 16 V100 GPUs for 3 days, 4 images per GPU
- Faster R-CNN baseline: 500 epochs (learning rate multiplied by 0.1 at epoch 400)
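A sketch of how the two learning rates above can be wired up as AdamW parameter groups; the `backbone` / `transformer` module names are placeholders, not DETR's exact layout:

```python
import torch
import torch.nn as nn

# Stand-in model: the two submodules only exist to demonstrate separate
# parameter groups with different learning rates.
model = nn.ModuleDict({
    "backbone": nn.Conv2d(3, 64, kernel_size=7),
    "transformer": nn.Linear(256, 256),
})

param_groups = [
    {"params": model["transformer"].parameters(), "lr": 1e-4},  # transformer lr: 10^-4
    {"params": model["backbone"].parameters(), "lr": 1e-5},     # backbone lr: 10^-5
]
optimizer = torch.optim.AdamW(param_groups)

# drop the learning rate by 10x after epoch 200 of 300, as listed above
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.1)
```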
18. Number of encoder layers
Selected configuration: 6 encoder and 6 decoder layers of width 256 with 8 attention heads
AP_L: +7.8
AP_S: −5.5
19. Encoder self-attention
In Figure 3, we visualize the attention maps of the last encoder layer of a trained model, focusing on a few points in the image.
The encoder seems to separate instances already, which likely simplifies object extraction and localization for the decoder.
20. Number of decoder layers
This can be explained by the fact that a single decoding layer of the transformer is not able to compute any cross-correlations between the output elements, and thus it is prone to making multiple predictions for the same object.
22. Importance of positional encodings
none: no positional encoding at all
at input: added once, at the input
at attn: added at every attention layer (sketched below)
sine: fixed sinusoidal positional encoding
learned: learnable positional embedding
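A rough sketch of the difference between "at input" and "at attn" for a single encoder self-attention layer; in DETR the spatial encodings are added to queries and keys only (not to values), which is what the second variant shows:

```python
import torch
import torch.nn as nn

d_model, n_tokens = 256, 50
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
x = torch.randn(1, n_tokens, d_model)    # flattened image features
pos = torch.randn(1, n_tokens, d_model)  # spatial positional encoding (sine or learned)

# "at input": add the encoding once, before the first layer, then attend normally
x_in = x + pos
out_at_input, _ = attn(x_in, x_in, x_in)

# "at attn": re-add the encoding to queries and keys inside every attention layer
# (DETR's default); the values carry no positional information
q = k = x + pos
out_at_attn, _ = attn(q, k, x)
```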