TOOD: Task-aligned One-stage Object Detection

Contents
▪ Introduction
▪ Task-aligned One-stage Object Detection
• Task-aligned Head
• Task Alignment Learning
▪ Experiments and Results
▪ Conclusion

Introduction
▪ Object detection
• Localize and recognize objects of interest from natural images
• Classification
• Discriminative features that focus on the key or salient part of an object
• Localization
• Precisely locating the whole object with its boundaries
[Figure: one-stage detectors shown for reference: FCOS (Fully Convolutional One-Stage Object Detection), RetinaNet, ATSS, TOOD]
▪ Using two separate branches for classification and localization causes a certain level of misalignment between the two predictions
• Result : object detection results
• Yellow box : ground-truth
• White arrow : the main direction of the best anchor
• Patch : the location of the best anchor / Box : predicted box
• Red : classification / Green : localization
• Score : spatial distributions of classification scores
• IoU : spatial distributions of localization scores
• Task-aligned One-stage Object Detection (TOOD)

Task-aligned One-stage Object Detection
▪ Overview
• Overall pipeline : backbone-FPN-head
• Use a single anchor per location like ATSS
• anchor : an anchor point for anchor-free detector, or an anchor box for anchor-based detector
• Align the two tasks (classification and localization) using
• Task-aligned head (T-head)
• Task Alignment Learning (TAL)

▪ Task-aligned Head (T-head)
• A simple feature extractor with two Task-Aligned Predictors (TAP)
• Let $X^{fpn} \in \mathbb{R}^{H \times W \times C}$ be the FPN features
• $X^{inter}_k$ : task-interactive features
• $conv_k$ and $\delta$ : the $k$-th conv layer and a ReLU
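The definitions above describe a stack of N conv layers applied one after another to the FPN features. A reconstruction of the corresponding recursion (Eq. 1 in the TOOD paper, as far as it can be recovered from the symbols defined here; the cases layout is a typesetting choice) is:

```latex
X^{inter}_k =
\begin{cases}
\delta\big(conv_k(X^{fpn})\big), & k = 1 \\
\delta\big(conv_k(X^{inter}_{k-1})\big), & k > 1
\end{cases}
\qquad \forall k \in \{1, 2, \dots, N\}
```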

• Task-Aligned Predictors (TAP)
• A layer attention mechanism (Eqs. 2, 3)
• encourage task decomposition by dynamically computing such task-specific features at the layer level
• $w_k$ : the $k$-th element of the learned layer attention $w \in \mathbb{R}^{N}$
• $fc_1$ and $fc_2$ : two fully-connected layers
• $\sigma$ : a sigmoid function
• $x^{inter}$ : an average pooling of $X^{inter}$
• The results of classification or localization (Eq. 4)
• $Z^{task}$ is converted into
• dense classification scores $P \in \mathbb{R}^{H \times W \times 80}$ using sigmoid
• object bounding boxes $B \in \mathbb{R}^{H \times W \times 4}$ with distance-to-bbox conversion
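The layer attention step can be illustrated with a small NumPy sketch. It assumes that X^inter is the channel-wise concatenation of the N task-interactive feature maps, that the pooling is a global average over H and W, and that both convs are 1x1 (written as matrix products) to keep the example short. The function name, weight shapes, and hidden size are hypothetical and not taken from the slides.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def task_aligned_predictor(x_inter_list, fc1, fc2, conv1, conv2):
    """Layer attention over N task-interactive feature maps (illustrative only).

    x_inter_list: list of N arrays, each of shape (H, W, C)
    fc1: (N*C, hidden), fc2: (hidden, N)      -- hypothetical weight shapes
    conv1: (N*C, C_mid), conv2: (C_mid, C_out) -- 1x1 convs as matrices (assumption)
    """
    n = len(x_inter_list)
    x_cat = np.concatenate(x_inter_list, axis=-1)   # X^inter: (H, W, N*C)
    x_pooled = x_cat.mean(axis=(0, 1))              # x^inter: (N*C,)
    # w = sigma(fc2(delta(fc1(x^inter)))), one weight per layer
    w = sigmoid(relu(x_pooled @ fc1) @ fc2)         # (N,)
    # X^task_k = w_k * X^inter_k, then concatenate over k
    x_task = np.concatenate(
        [w[k] * x_inter_list[k] for k in range(n)], axis=-1
    )                                               # (H, W, N*C)
    # Z^task = conv2(delta(conv1(X^task)))
    z_task = relu(x_task @ conv1) @ conv2           # (H, W, C_out)
    return w, z_task

rng = np.random.default_rng(0)
H, W, C, N = 4, 5, 8, 6
feats = [rng.standard_normal((H, W, C)) for _ in range(N)]
fc1 = rng.standard_normal((N * C, 16)) * 0.1
fc2 = rng.standard_normal((16, N)) * 0.1
conv1 = rng.standard_normal((N * C, C)) * 0.1
conv2 = rng.standard_normal((C, 80)) * 0.1

w, z_cls = task_aligned_predictor(feats, fc1, fc2, conv1, conv2)
scores = sigmoid(z_cls)  # dense classification scores P, shape (H, W, 80)
```

Each task (classification, localization) would use its own set of weights, so the two predictors can weight the shared layers differently.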


• Prediction alignment
• Adjust the spatial distributions of the two predictions using $X^{inter}$
• $P^{align}$ : aligned classification prediction
• $M \in \mathbb{R}^{H \times W \times 1}$ : a spatial probability map
• $conv_1$ : 1x1 conv for dimension reduction
• $B^{align}$ : aligned object bounding boxes
• $O \in \mathbb{R}^{H \times W \times 8}$ : spatial offset maps
• $conv_3$ : 1x1 conv for dimension reduction
• $M$ and $O$ are learned through Task Alignment Learning (TAL)
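The quantities listed above are tied together by four relations. The following is a reconstruction from the TOOD paper (its Eqs. 5 to 8), written with the symbols of this slide; here $i, j$ index the spatial location and $c$ the box channel, and $conv_2$, $conv_4$ denote the second conv of each branch:

```latex
\begin{aligned}
P^{align} &= \sqrt{P \times M} \\
B^{align}(i, j, c) &= B\big(i + O(i, j, 2c),\; j + O(i, j, 2c + 1),\; c\big) \\
M &= \sigma\big(conv_2(\delta(conv_1(X^{inter})))\big) \\
O &= conv_4\big(\delta(conv_3(X^{inter}))\big)
\end{aligned}
```

The 8 channels of $O$ thus provide one (row, column) offset pair for each of the 4 box channels, so every box side can be taken from a nearby location that predicts it more accurately.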

▪ Task Alignment Learning (TAL)
• Task-aligned Sample Assignment
• The anchor assignment for a training instance should satisfy the following rules:
• a well-aligned anchor should jointly predict a high classification score and a precise localization
• a misaligned anchor should have a low classification score and be suppressed subsequently
• Anchor alignment metric
• The following metric measures anchor-level alignment for each instance: $t = s^{\alpha} \times u^{\beta}$
• $s$ and $u$ : the classification score and the IoU value
• $\alpha$ and $\beta$ : control the impact of the two tasks in the anchor alignment metric
• $t$ plays a critical role in the joint optimization of the two tasks towards the goal of task-alignment
• Training sample alignment
• Select 𝑚 anchors having the largest 𝑡 values as positive samples, while using the remaining anchors as negative ones
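A minimal sketch of this assignment for a single ground-truth instance is shown below, using the metric t = s^α · u^β. The function name is hypothetical, and the default values α = 1, β = 6, m = 13 are recalled from the paper's experiments and should be treated as assumptions. Practical implementations typically add further filtering (for example, handling anchors claimed by several instances), which is omitted here.

```python
import numpy as np

def task_aligned_assign(scores, ious, alpha=1.0, beta=6.0, m=13):
    """Pick positives for one instance by the alignment metric (illustrative).

    scores: (num_anchors,) predicted score for the instance's class
    ious:   (num_anchors,) IoU between each predicted box and the ground truth
    Returns the metric t and a boolean mask of positive anchors.
    """
    t = scores ** alpha * ious ** beta
    m = min(m, t.size)
    top = np.argsort(-t, kind="stable")[:m]  # indices of the m largest t
    pos = np.zeros(t.shape, dtype=bool)
    pos[top] = True
    return t, pos

scores = np.array([0.9, 0.2, 0.6, 0.8])
ious = np.array([0.5, 0.9, 0.8, 0.7])
t, pos = task_aligned_assign(scores, ious, alpha=1.0, beta=6.0, m=2)
```

In this toy input, anchor 0 has the highest classification score but a poor IoU, so it is not among the two positives; anchors 1 and 2 are selected.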

• Task-aligned Loss
• Classification objective
• Use $\hat{t}$ to replace the binary label of the positive anchor
• $\hat{t}$ : a normalized $t$ for stable training
• to ensure effective learning of hard instances (which usually have a small 𝑡 for all corresponding positive anchors)
• to preserve the rank between instances based on the precision of the predicted bounding boxes
• Binary Cross Entropy (BCE) computed on the positive anchors for the classification task
• The final loss function with modified focal loss
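The classification objective can be sketched as follows for one instance and one class. The normalization used here (rescale t over the positive anchors so that its maximum equals the largest IoU among them) and the focal-style weighting |t̂ − s|^γ for positives and s^γ for negatives follow my reading of the TOOD paper; the function names are hypothetical.

```python
import numpy as np

def bce(s, y, eps=1e-12):
    """Element-wise binary cross entropy between score s and soft label y."""
    s = np.clip(s, eps, 1.0 - eps)
    return -(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))

def normalize_t(t, ious, pos, eps=1e-12):
    """Rescale t on positive anchors so that max(t_hat) equals the largest IoU."""
    t_hat = np.zeros_like(t)
    if pos.any():
        t_hat[pos] = t[pos] / (t[pos].max() + eps) * ious[pos].max()
    return t_hat

def task_aligned_cls_loss(scores, t_hat, pos, gamma=2.0):
    """Positives: |t_hat - s|^gamma * BCE(s, t_hat); negatives: s^gamma * BCE(s, 0)."""
    pos_term = np.abs(t_hat[pos] - scores[pos]) ** gamma * bce(scores[pos], t_hat[pos])
    neg_term = scores[~pos] ** gamma * bce(scores[~pos], 0.0)
    return float(pos_term.sum() + neg_term.sum())

scores = np.array([0.9, 0.2, 0.6, 0.8])
ious = np.array([0.5, 0.9, 0.8, 0.7])
pos = np.array([False, True, True, False])
t = scores * ious ** 6                    # alignment metric with alpha=1, beta=6
t_hat = normalize_t(t, ious, pos)
loss = task_aligned_cls_loss(scores, t_hat, pos)

# A perfectly aligned prediction (s == t_hat on positives, 0 elsewhere) has zero loss.
perfect = task_aligned_cls_loss(t_hat.copy(), t_hat, pos)
```

Because the soft label of the best positive anchor equals the best IoU of the instance, an instance whose boxes are all poorly localized receives lower targets than a well-localized one, which preserves the rank between instances.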

• Localization objective
• Focus on the well-aligned anchors (with a large 𝑡) to improve the task alignment and regression precision
• Reduce the impact of the misaligned anchors (with a small 𝑡)
• Re-weight the loss of bbox regression computed for each anchor based on $\hat{t}$ and reformulate the GIoU loss ($L_{GIoU}$)
• $b$ and $\bar{b}$ : the predicted bboxes and the corresponding ground-truth boxes
• The total training loss for TAL = $L_{cls} + L_{reg}$
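The re-weighted regression objective can be sketched as the sum over positive anchors of t̂ times the GIoU loss, with the GIoU loss taken as 1 − GIoU. The sketch below assumes boxes in (x1, y1, x2, y2) format; the function names are hypothetical.

```python
import numpy as np

def giou(b, g, eps=1e-12):
    """Generalized IoU for paired boxes of shape (n, 4) in (x1, y1, x2, y2) format."""
    ix1 = np.maximum(b[:, 0], g[:, 0])
    iy1 = np.maximum(b[:, 1], g[:, 1])
    ix2 = np.minimum(b[:, 2], g[:, 2])
    iy2 = np.minimum(b[:, 3], g[:, 3])
    inter = np.clip(ix2 - ix1, 0.0, None) * np.clip(iy2 - iy1, 0.0, None)
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    area_g = (g[:, 2] - g[:, 0]) * (g[:, 3] - g[:, 1])
    union = area_b + area_g - inter
    iou = inter / (union + eps)
    # Smallest axis-aligned box enclosing both boxes
    hull = (np.maximum(b[:, 2], g[:, 2]) - np.minimum(b[:, 0], g[:, 0])) * (
        np.maximum(b[:, 3], g[:, 3]) - np.minimum(b[:, 1], g[:, 1])
    )
    return iou - (hull - union) / (hull + eps)

def task_aligned_reg_loss(pred, gt, t_hat):
    """Sum over positive anchors of t_hat_i * (1 - GIoU(b_i, b_bar_i))."""
    return float(np.sum(t_hat * (1.0 - giou(pred, gt))))

pred = np.array([[0.0, 0.0, 2.0, 2.0], [1.0, 1.0, 3.0, 3.0]])
gt = np.array([[1.0, 1.0, 3.0, 3.0], [1.0, 1.0, 3.0, 3.0]])
t_hat = np.array([0.5, 0.9])
g = giou(pred, gt)
loss_reg = task_aligned_reg_loss(pred, gt, t_hat)
```

The second box matches its ground truth exactly, so it contributes nothing to the loss even though it carries the larger weight; the loss comes entirely from the first, misplaced box, scaled by its t̂ of 0.5.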

Experiments and Results
▪ Dataset and evaluation protocol
• Dataset : COCO 2017
• Metric : COCO Average Precision (AP)
▪ Implementation details
• Detection pipeline : backbone – FPN – head
• Backbone : ResNet-50, ResNet-101, ResNeXt-101-64x4d
• $N$ (# of interactive layers) : 6 (chosen so that the T-head has a similar number of parameters to the conventional parallel head)
• 𝛾 (the focusing parameter for focal loss) : 2

▪ Ablation Study
• On head structures
• ResNet-50 / 12 epochs

• On sample assignments
• * : 18 epochs

• TOOD
• On hyper-parameters

▪ Comparison with the State-of-the-Art

▪ Quantitative Analysis for Task-alignment
• Without NMS
• Pearson Correlation Coefficient (PCC) between the rankings of classification and localization, computed over the top-50 most confident predictions of each instance
• Mean IoU of the top-10 most confident predictions, averaged over instances
• With NMS
• #Correct boxes & #Redundant boxes : IoU >= 0.5
• #Error boxes : 0.1 < IoU < 0.5
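The two "without NMS" statistics can be sketched for a single instance as below. The slides do not say how the rankings are formed, so this sketch assumes that predictions are ranked by descending value, that ties are ignored, and that PCC is the ordinary Pearson coefficient applied to the two rank vectors. The function names are hypothetical.

```python
import numpy as np

def ranks(x):
    """Rank positions (0 = largest value); ties are broken by original order."""
    order = np.argsort(-x, kind="stable")
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def alignment_pcc(scores, ious, top_k=50):
    """Pearson correlation between score ranks and IoU ranks of the top-k predictions."""
    k = min(top_k, len(scores))
    keep = np.argsort(-scores, kind="stable")[:k]
    return float(np.corrcoef(ranks(scores[keep]), ranks(ious[keep]))[0, 1])

def mean_topk_iou(scores, ious, top_k=10):
    """Mean IoU of the top-k most confident predictions."""
    k = min(top_k, len(scores))
    keep = np.argsort(-scores, kind="stable")[:k]
    return float(ious[keep].mean())

scores = np.linspace(0.1, 0.9, 9)
aligned = alignment_pcc(scores, scores ** 2)        # same ordering in both tasks
misaligned = alignment_pcc(scores, 1.0 - scores)    # opposite ordering
top2 = mean_topk_iou(np.array([0.9, 0.1, 0.5]), np.array([0.8, 0.2, 0.6]), top_k=2)
```

A value near 1 means the most confident predictions are also the best localized ones; both statistics would then be averaged over all instances.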


Conclusion
▪ Task-aligned One-stage Object Detection (TOOD)
• To align classification and localization tasks
• Task-aligned Head (T-head) to enhance the interaction of two tasks
• Task Alignment Learning (TAL) with a sample assignment scheme and new loss functions
• Surpass the state-of-the-art one-stage detectors by a large margin
