Deformable DETR: Deformable
Transformers for End-to-End
Object Detection
2021.01.10
Deep Learning Paper Reading Group, Image Processing Team
홍은기, 김병현, 안종식, 허다운
Contents
1. Object Detection
2. DETR
3. Deformable DETR
Object Detection
• Problem definition: detect the n objects present in an image (n ≥ 0)
• Input: an image of shape (C x H x W)
• Output: a set of bounding boxes {class_label, x, y, w, h}
DETR (DEtection TRansformer)
• Unlike existing detectors such as R-CNN and YOLO, DETR brings in the Transformer used in natural language processing
• By recasting object detection as a "direct set prediction" problem, it removes hand-designed components such as anchor boxes and NMS
End-to-End Object Detection with Transformers (Facebook AI, 2020)
YOLOv4 vs DETR
• For an input image of size (416 x 416), YOLOv4 outputs 9 bboxes per cell, 10,647 bboxes in total
• Pipeline: inference → keep boxes with objectness score > threshold → NMS
YOLOv4 vs DETR
• DETR outputs exactly as many bounding boxes as the preset number of object queries (default = 100)
• Among the 100 bboxes, those whose class label is no_object are discarded
DETR pipeline – backbone
1. The input image is fed to a ResNet-50 backbone to extract a feature map
   e.g., an input of (3 x 416 x 416) yields a (2048 x 13 x 13) feature map
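A minimal PyTorch sketch of this step, assuming a torchvision ResNet-50 with the classification head stripped off (DETR loads ImageNet-pretrained weights in practice):

```python
import torch
import torchvision

# Backbone sketch: ResNet-50 with global pooling and the fc layer removed,
# so the last conv stage (2048 channels, stride 32) is returned as the feature map.
backbone = torchvision.models.resnet50()
layers = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

image = torch.randn(1, 3, 416, 416)   # dummy (3 x 416 x 416) input
features = layers(image)              # -> (1, 2048, 13, 13)
print(features.shape)
```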
DETR pipeline – positional encoding
2. A spatial positional encoding is added to preserve the positional relationships between pixels
   1) With hidden_dim = 256, a positional embedding of shape (169, 1, 256) is generated
   2) A 1 x 1 convolution projects the (2048, 13, 13) feature map to (256, 13, 13), which is then flattened to (169, 1, 256)
   3) The flattened feature map and the positional embedding are summed
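A minimal sketch of the projection and flattening, with a learned positional embedding standing in for DETR's 2-D sine/cosine encoding purely for illustration:

```python
import torch
import torch.nn as nn

hidden_dim = 256
input_proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)   # 1x1 conv: 2048 -> 256 channels

features = torch.randn(1, 2048, 13, 13)                   # backbone output
src = input_proj(features)                                 # (1, 256, 13, 13)
src = src.flatten(2).permute(2, 0, 1)                      # (169, 1, 256): (HW, batch, dim)

# Learned positional embedding used here only for illustration.
pos_embed = nn.Parameter(torch.randn(13 * 13, 1, hidden_dim))
encoder_input = src + pos_embed                             # feature map + positional embedding
print(encoder_input.shape)                                  # torch.Size([169, 1, 256])
```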
DETR pipeline – encoder
3. Self-attention is performed over the input vectors
   1) Query: pixel features, Key: pixel features, Value: pixel features
   2) num_heads = 8
   3) Captures the relationships between pixels
Attention Is All You Need (2017)
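A minimal sketch of one encoder layer over the 169 flattened pixel features (DETR stacks 6 such layers and also re-adds the positional encoding to Q and K at every layer, omitted here):

```python
import torch
import torch.nn as nn

# Encoder sketch: multi-head self-attention where Q, K, V are all pixel features.
hidden_dim, num_heads = 256, 8
encoder_layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads)

encoder_input = torch.randn(169, 1, hidden_dim)   # (HW, batch, dim) from the previous step
memory = encoder_layer(encoder_input)
print(memory.shape)                                # torch.Size([169, 1, 256])
```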
DETR pipeline – encoder (figure from End-to-End Object Detection with Transformers, Facebook AI, 2020)
DETR pipeline – decoder
4. Self-attention and cross-attention are performed on the object queries
   1) Self-attention – Q: object queries, K: object queries, V: object queries
   2) Cross-attention – Q: object queries, K: encoder output, V: encoder output
   * object query: torch.nn.Embedding (num_queries=100, hidden_dim=256)
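A minimal sketch of one decoder layer, with the 100 learned object queries as the target sequence and the encoder output as memory (DETR stacks 6 layers and re-adds the query and positional embeddings at every layer):

```python
import torch
import torch.nn as nn

hidden_dim, num_heads, num_queries = 256, 8, 100
decoder_layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=num_heads)

query_embed = nn.Embedding(num_queries, hidden_dim)   # learned object queries
tgt = query_embed.weight.unsqueeze(1)                  # (100, 1, 256)
memory = torch.randn(169, 1, hidden_dim)               # encoder output

decoder_output = decoder_layer(tgt, memory)            # self-attn on queries + cross-attn to memory
print(decoder_output.shape)                            # torch.Size([100, 1, 256])
```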
DETR pipeline – prediction heads
5. Prediction heads
   1) Two fully-connected heads take the decoder output and predict, respectively, the class label and the bounding box
   2) As many bounding boxes are produced as there are object queries (100)
   3) Only bounding boxes whose class label is not no_object are kept as detections
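A minimal sketch of the two heads, assuming the COCO setting of 91 classes plus one no_object class; DETR's box head is actually a small 3-layer MLP, but a single linear layer is shown for brevity:

```python
import torch
import torch.nn as nn

hidden_dim, num_classes = 256, 91
class_head = nn.Linear(hidden_dim, num_classes + 1)   # +1 for the no_object class
bbox_head = nn.Linear(hidden_dim, 4)                   # (cx, cy, w, h), normalized

decoder_output = torch.randn(100, 1, hidden_dim)       # one vector per object query
class_logits = class_head(decoder_output)              # (100, 1, 92)
boxes = bbox_head(decoder_output).sigmoid()            # (100, 1, 4) in [0, 1]

# Keep only queries whose most likely class is not no_object.
keep = class_logits.softmax(-1).argmax(-1) != num_classes
print(keep.sum().item(), "boxes kept")
```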
DETR – Loss Function
Loss calculation steps:
1. Find the bipartite matching that minimizes the matching cost
2. Compute the Hungarian loss over the matched prediction–ground-truth pairs
(both terms are detailed on the following slides)
DETR – Loss Function
1. Using the Hungarian algorithm, find the bipartite matching σ̂ that minimizes the matching cost:
   σ̂ = argmin_σ Σ_i L_match(y_i, ŷ_σ(i)),
   L_match(y_i, ŷ_σ(i)) = −1{c_i ≠ ∅} p̂_σ(i)(c_i) + 1{c_i ≠ ∅} L_box(b_i, b̂_σ(i))

   ŷ_1 = (ĉ_1, b̂_1)    y_1 = (c_1, b_1)
   ŷ_2 = (ĉ_2, b̂_2)    y_2 = (c_2, b_2)
   ...                  ...
   ŷ_N = (ĉ_N, b̂_N)    y_N = (c_N, b_N)

   * Matching criteria:
   (1) class label (predicted class probability)
   (2) bounding-box similarity (L1 loss & GIoU)
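A minimal sketch of the matching step on one image using scipy's Hungarian solver, keeping only the class-probability term of the cost for brevity (the real cost also includes the weighted L1 and GIoU box terms):

```python
import torch
from scipy.optimize import linear_sum_assignment

num_queries, num_classes, num_gt = 100, 92, 3
pred_prob = torch.rand(num_queries, num_classes).softmax(-1)   # predicted class probabilities
gt_labels = torch.tensor([1, 17, 56])                          # ground-truth class ids

# Cost of assigning query i to ground-truth j: -p_i(c_j)
cost = -pred_prob[:, gt_labels]                                 # (100, 3)
row, col = linear_sum_assignment(cost.numpy())                  # Hungarian algorithm
print(list(zip(row.tolist(), col.tolist())))                    # matched (query, gt) pairs
```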
DETR – Loss Function
2. Compute the Hungarian loss over the matched prediction–ground-truth pairs:
   L_Hungarian(y, ŷ) = Σ_i [ −log p̂_σ̂(i)(c_i) + 1{c_i ≠ ∅} L_box(b_i, b̂_σ̂(i)) ]
   - class prediction loss: negative log-likelihood
   - bounding-box loss: L_box = GIoU loss + L1 loss
   * To mitigate class imbalance, the log-probability term is down-weighted by a factor of 10 when the class label is no_object.
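A minimal sketch of the Hungarian loss for the matched pairs above, using the 1 / 5 / 2 weighting of CE / L1 / GIoU losses from the DETR paper; boxes are assumed here to be in (x1, y1, x2, y2) format as required by torchvision's generalized_box_iou:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

no_object = 91
class_weights = torch.ones(92)
class_weights[no_object] = 0.1                                  # down-weight no_object by 1/10

pred_logits = torch.randn(100, 92)                              # all queries
target_classes = torch.full((100,), no_object, dtype=torch.long)  # unmatched queries -> no_object
target_classes[[4, 17, 63]] = torch.tensor([1, 17, 56])         # matched queries get GT labels
loss_ce = F.cross_entropy(pred_logits, target_classes, weight=class_weights)

matched_pred_boxes = torch.tensor([[0.1, 0.1, 0.4, 0.5],
                                   [0.5, 0.2, 0.9, 0.6],
                                   [0.2, 0.6, 0.3, 0.9]])
gt_boxes = torch.tensor([[0.1, 0.1, 0.5, 0.5],
                         [0.5, 0.2, 0.8, 0.7],
                         [0.2, 0.5, 0.35, 0.9]])
loss_l1 = F.l1_loss(matched_pred_boxes, gt_boxes)
loss_giou = (1 - torch.diag(generalized_box_iou(matched_pred_boxes, gt_boxes))).mean()

hungarian_loss = loss_ce + 5 * loss_l1 + 2 * loss_giou
print(hungarian_loss.item())
```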
Deformable DETR – Contribution
• Problems with DETR:
(1) Training takes a very long time (500 epochs)
   - When the number of keys N_k is large, the attention weights are close to uniform (≈ 1/N_k), which produces ambiguous gradients
   - At the start of training, the attention module assigns nearly equal weight to every pixel in the feature map
   - Learning to focus on the sparse, meaningful locations therefore requires many epochs
(2) Poor accuracy on small objects
   - Detectors such as YOLO and SSD use multi-scale features, whereas DETR uses a single-level feature map
   - The computational complexity of the Transformer encoder grows quadratically with the spatial size of the feature map, so high-resolution feature maps are impractical
• Deformable DETR
(1) Solves the first problem with a deformable attention module that applies the idea of deformable convolution to attention
(2) Solves the second problem with the deformable attention module combined with multi-scale features
Deformable Convolution
• Motivation: a standard convolution uses a fixed sampling grid, so it has limited ability to model geometric transformations
• Solution: additionally learn offsets that deform the filter's sampling locations so that it adapts to such variations
Deformable Convolutional Networks (2017)
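A minimal sketch using torchvision's DeformConv2d, where a small convolution predicts the (dx, dy) offsets that deform the 3 x 3 sampling grid:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

in_ch, out_ch, k = 64, 128, 3
offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=1)  # 18 offset channels
deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=1)

x = torch.randn(1, in_ch, 32, 32)
offsets = offset_conv(x)        # (1, 18, 32, 32): learned (dx, dy) per kernel element
out = deform_conv(x, offsets)   # convolution sampled at the deformed locations
print(out.shape)                # torch.Size([1, 128, 32, 32])
```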
Deformable DETR – Architecture
Instead of attending to every pixel (key) in the feature map, sample the pixels most likely to be informative and perform the attention computation only over those samples.
Deformable Attention
DeformAttn(z_q, p_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_mqk · W'_m x(p_q + Δp_mqk) ]
where:
   x: input feature map (C x H x W)
   q: index of a query element (with feature z_q)
   k: index of a sampled key element
   K: number of sampled keys per head (K = 4)
   p_q: reference point of query q
   Δp_mqk: sampling offset, predicted from z_q
   A_mqk: attention weight (attention score), predicted from z_q and normalized so that Σ_k A_mqk = 1
   m: attention-head index; W_m, W'_m: output / value projection matrices
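A minimal single-scale, batch-size-1 sketch of the formula above, using grid_sample in place of the paper's CUDA kernel; shapes and layer names are illustrative only, and the offset normalization is simplified:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, H, W = 256, 13, 13
M, K = 8, 4                                        # attention heads, sampled keys per head
num_queries = 100
head_dim = C // M

value_proj = nn.Linear(C, C)                       # W'_m (all heads at once)
output_proj = nn.Linear(C, C)                      # W_m (all heads at once)
offset_pred = nn.Linear(C, M * K * 2)              # sampling offsets Δp_mqk
weight_pred = nn.Linear(C, M * K)                  # attention weights A_mqk

x = torch.randn(1, C, H, W)                        # input feature map
z_q = torch.randn(num_queries, C)                  # query features
p_q = torch.rand(num_queries, 2)                   # reference points, normalized to [0, 1]

# Project values and reshape to (M, head_dim, H, W) for per-head sampling.
value = value_proj(x.flatten(2).transpose(1, 2))   # (1, HW, C)
value = value.transpose(1, 2).reshape(M, head_dim, H, W)

offsets = offset_pred(z_q).reshape(num_queries, M, K, 2) / torch.tensor([W, H])
weights = weight_pred(z_q).reshape(num_queries, M, K).softmax(-1)

# Sampling locations in [-1, 1] for grid_sample: reference point + offset.
loc = (p_q[:, None, None, :] + offsets) * 2 - 1                  # (num_queries, M, K, 2)
grid = loc.permute(1, 0, 2, 3)                                   # (M, num_queries, K, 2)
sampled = F.grid_sample(value, grid, align_corners=False)        # (M, head_dim, num_queries, K)

# Weighted sum over the K sampled keys, then merge heads and project.
out = (sampled * weights.permute(1, 0, 2)[:, None]).sum(-1)      # (M, head_dim, num_queries)
out = out.permute(2, 0, 1).reshape(num_queries, C)
out = output_proj(out)
print(out.shape)                                                 # torch.Size([100, 256])
```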
Multi-scale Deformable Attention
The same deformable attention is extended to sample K points from each of the multi-scale feature levels, so queries aggregate information across scales without an explicit FPN.
Experiment
Ablation Studies
• multi-scale inputs: +1.7% AP (+2.9% APs on small objects)
• increasing the number of sampling points K: +0.9% AP
• multi-scale deformable attention: +1.5% AP
• Adding FPN does not improve performance.
Q & A