Sungchul Kim
2022. 11. 10
Contents
▪ Introduction
▪ Task-aligned One-stage Object Detection
• Task-aligned Head
• Task Alignment Learning
▪ Experiments and Results
▪ Conclusion
Introduction
▪ Object detection
• Localize and recognize objects of interest from natural images
• Classification
• Discriminative features that focus on the key or salient part of an object
• Localization
• Precisely locating the whole object with its boundaries
[Figure: FCOS (Fully Convolutional One-Stage Object Detection) and RetinaNet]
Introduction
[Figure: detection results of ATSS and TOOD]
▪ There is a certain level of misalignment when using two separate branches
• Result
• Object detection results
• Yellow box : ground-truth
• White arrow : the main direction of the best anchor
• Patch : the location of the best anchor / Box : predicted box
• Red : classification / Green : localization
• Score : spatial distributions of classification scores
• IoU : spatial distributions of localization scores
• Task-aligned One-stage Object Detection (TOOD)
Task-aligned One-stage Object Detection
▪ Overview
• Overall pipeline : backbone-FPN-head
• Use a single anchor per location like ATSS
• anchor : an anchor point for anchor-free detector, or an anchor box for anchor-based detector
• Align the two tasks (classification and localization) using
• Task-aligned head (T-head)
• Task Alignment Learning (TAL)
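As a rough illustration of "a single anchor per location", the sketch below generates one anchor point per feature-map cell, as an anchor-free detector might. The function name `anchor_points`, the cell-centre placement, and the stride value are assumptions made for this example and are not taken from the slides.

```python
def anchor_points(height, width, stride):
    """Return one (x, y) anchor point per feature-map location, in image coordinates."""
    points = []
    for row in range(height):
        for col in range(width):
            # Assumption: the point sits at the centre of the image region
            # covered by this feature-map cell.
            points.append(((col + 0.5) * stride, (row + 0.5) * stride))
    return points


# A 2 x 3 feature map with stride 8 yields 6 anchor points.
points = anchor_points(height=2, width=3, stride=8)
```

An anchor-based detector would attach a single box (for example, a square centred on each point) instead of using the bare point.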
Task-aligned One-stage Object Detection
▪ Task-aligned Head (T-head)
• A simple feature extractor with two Task-Aligned Predictors (TAP)
• Let $X^{fpn} \in \mathbb{R}^{H \times W \times C}$ be the FPN features
• $X^{inter}_k$ : task-interactive features
• $conv_k$ and $\delta$ : the $k$-th conv layer and a ReLU
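The feature extractor can be sketched as a stack of conv + ReLU layers whose intermediate outputs are all kept. This is a minimal NumPy sketch under two assumptions that are not stated on the slide: each layer is applied to the previous layer's output, and a 1x1 convolution (a per-pixel matrix product) stands in for the real convolution. The name `interactive_features` is hypothetical.

```python
import numpy as np


def interactive_features(x_fpn, weights):
    """Apply N conv+ReLU layers in sequence and return every layer's output.

    x_fpn:   array of shape (H, W, C), the FPN features.
    weights: list of N arrays of shape (C, C); each one stands in for conv_k
             (assumption: a 1x1 convolution, i.e. a per-pixel linear map).
    """
    features = []
    x = x_fpn
    for w_k in weights:
        # delta(conv_k(.)): linear map over the channel axis, then ReLU.
        x = np.maximum(x @ w_k, 0.0)
        features.append(x)
    return features
```

The returned list plays the role of the task-interactive features for k = 1..N; both predictors read from the same list.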
Task-aligned One-stage Object Detection
▪ Task-aligned Head (T-head)
• Task-Aligned Predictors (TAP)
• A layer attention mechanism (Eqs. 2, 3)
• encourages task decomposition by dynamically computing task-specific features at the layer level
• $w_k$ : the $k$-th element of the learned layer attention $w \in \mathbb{R}^N$
• $fc_1$ and $fc_2$ : two fully-connected layers
• $\sigma$ : a sigmoid function
• $x^{inter}$ : average pooling of $X^{inter}$
• The results of classification or localization (Eq. 4)
• $Z^{task}$ is converted into
• dense classification scores $P \in \mathbb{R}^{H \times W \times 80}$ using sigmoid
• object bounding boxes $B \in \mathbb{R}^{H \times W \times 4}$ with distance-to-bbox conversion
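The layer attention can be sketched from the symbols listed above. The sketch assumes the attention has the form w = σ(fc_2(δ(fc_1(x_inter)))), with x_inter obtained by global average pooling over the concatenated interactive features, and that each layer's features are scaled by its own w_k. Biases are omitted, and the function name and the hidden width are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def layer_attention(inter_feats, fc1, fc2):
    """Scale each interactive feature map by an input-dependent scalar weight.

    inter_feats: list of N arrays of shape (H, W, C).
    fc1: array of shape (N * C, hidden); fc2: array of shape (hidden, N).
    """
    x_cat = np.concatenate(inter_feats, axis=-1)   # (H, W, N * C)
    x_inter = x_cat.mean(axis=(0, 1))              # global average pooling -> (N * C,)
    hidden = np.maximum(x_inter @ fc1, 0.0)        # fc_1 followed by ReLU
    w = sigmoid(hidden @ fc2)                      # (N,), one weight per layer
    task_feats = [w_k * x_k for w_k, x_k in zip(w, inter_feats)]
    return w, task_feats
```

Each predictor (classification and localization) would hold its own `fc1` and `fc2`, so the two tasks can weight the same shared layers differently.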
Task-aligned One-stage Object Detection
▪ Task-aligned Head (T-head)
• Prediction alignment
• adjust the spatial distributions of the two predictions using $X^{inter}$
• $P^{align}$ : aligned classification prediction
• $M \in \mathbb{R}^{H \times W \times 1}$ : a spatial probability map
• $conv_1$ : 1x1 conv for dimension reduction
• $B^{align}$ : aligned object bounding boxes
• $O \in \mathbb{R}^{H \times W \times 8}$ : spatial offset maps
• $conv_3$ : 1x1 conv for dimension reduction
• $M$ and $O$ are trained through Task Alignment Learning (TAL)
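For the classification side, the alignment step can be sketched as follows. The sketch assumes the aligned score is the geometric mean of the dense score and the spatial probability map, P_align = sqrt(P × M); the function name `align_scores` is hypothetical, and the offset-based box alignment is left out.

```python
import numpy as np


def align_scores(p, m):
    """Combine class scores with the spatial probability map.

    p: array of shape (H, W, num_classes) with scores in [0, 1].
    m: array of shape (H, W, 1) with probabilities in [0, 1].
    """
    # m broadcasts over the class axis, so every class at a location
    # is rescaled by the same spatial probability.
    return np.sqrt(p * m)


p = np.full((2, 2, 3), 0.81)
m = np.full((2, 2, 1), 0.25)
p_align = align_scores(p, m)
```

Because the result is a geometric mean, a location needs both a high class score and a high map value to keep a high aligned score.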
Task-aligned One-stage Object Detection
▪ Task Alignment Learning (TAL)
• Task-aligned Sample Assignment
• The anchor assignment for a training instance should satisfy the following rules:
• a well-aligned anchor should be able to predict a high classification score with a precise localization jointly
• a misaligned anchor should have a low classification score and be suppressed subsequently
• Anchor alignment metric
• Design the following metric to compute anchor-level alignment for each instance:
• $t = s^{\alpha} \times u^{\beta}$
• $s$ and $u$ : a classification score and an IoU value
• $\alpha$ and $\beta$ : control the impact of the two tasks in the anchor alignment metric
• $t$ plays a critical role in the joint optimization of the two tasks towards the goal of task alignment
• Training sample assignment
• Select the $m$ anchors with the largest $t$ values as positive samples, and use the remaining anchors as negative ones
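The assignment rule can be sketched in a few lines of plain Python. The sketch assumes the metric takes the product form t = s^α × u^β and treats one ground-truth instance at a time. The function names and the default values of `alpha` and `beta` are illustrative choices for this example, not settings taken from the slides.

```python
def alignment_metric(score, iou, alpha=1.0, beta=6.0):
    """Anchor alignment metric t for one anchor and one instance."""
    return (score ** alpha) * (iou ** beta)


def assign_positives(scores, ious, m, alpha=1.0, beta=6.0):
    """Pick the m anchors with the largest t as positives; the rest are negatives."""
    t = [alignment_metric(s, u, alpha, beta) for s, u in zip(scores, ious)]
    order = sorted(range(len(t)), key=lambda i: t[i], reverse=True)
    positives = sorted(order[:m])
    negatives = sorted(order[m:])
    return t, positives, negatives


# Four candidate anchors for one instance: anchor 0 has the highest score
# but a poor IoU, so it is not selected.
scores = [0.9, 0.2, 0.6, 0.8]
ious = [0.5, 0.9, 0.8, 0.3]
t, positives, negatives = assign_positives(scores, ious, m=2)
```

With these inputs the positives are anchors 1 and 2: a high score alone is not enough, because the IoU term dominates when `beta` is large.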
Task-aligned One-stage Object Detection
▪ Task Alignment Learning (TAL)
• Task-aligned Loss
• Classification objective
• Use $\hat{t}$ to replace the binary label of the positive anchor
• $\hat{t}$ : a normalized $t$ for stable training
• to ensure effective learning of hard instances (which usually have a small $t$ for all corresponding positive anchors)
• to preserve the rank between instances based on the precision of the predicted bounding boxes
• Binary Cross Entropy (BCE) computed on the positive anchors for the classification task
• The final loss function with modified focal loss
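A sketch of the classification objective, under stated assumptions: the normalization rescales t within an instance so that its maximum equals the largest IoU among that instance's positives, and the focal-style loss weights BCE by |t̂ − s|^γ for positives and by s^γ for negatives. All function names are hypothetical, and the loss is left as a plain sum.

```python
import math


def normalize_t(t_pos, iou_pos):
    """Rescale t for one instance so that max(t_hat) equals the largest IoU."""
    t_max = max(t_pos)
    u_max = max(iou_pos)
    if t_max <= 0.0:
        return [0.0 for _ in t_pos]
    return [t * u_max / t_max for t in t_pos]


def bce(s, target, eps=1e-12):
    """Binary cross entropy between a predicted score s and a soft target."""
    s = min(max(s, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(target * math.log(s) + (1.0 - target) * math.log(1.0 - s))


def classification_loss(s_pos, t_hat, s_neg, gamma=2.0):
    """Focal-style loss with t_hat as the soft label of each positive anchor."""
    pos = sum(abs(t - s) ** gamma * bce(s, t) for s, t in zip(s_pos, t_hat))
    neg = sum(s ** gamma * bce(s, 0.0) for s in s_neg)
    return pos + neg


t_hat = normalize_t([0.1, 0.05], [0.8, 0.6])
```

The rescaling keeps the ordering of anchors within an instance while lifting the targets of hard instances, whose raw t values are all small.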
Task-aligned One-stage Object Detection
▪ Task Alignment Learning (TAL)
• Task-aligned Loss
• Localization objective
• Focus on the well-aligned anchors (with a large $t$) to improve the task alignment and regression precision
• Reduce the impact of the misaligned anchors (with a small $t$)
• Re-weight the bbox regression loss computed for each anchor by $\hat{t}$ and reformulate the GIoU loss ($L_{GIoU}$)
• $b$ and $\bar{b}$ : the predicted bboxes and the corresponding ground-truth boxes
• The total training loss for TAL = $L_{cls} + L_{reg}$
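The re-weighted regression loss can be sketched as below. The sketch assumes the standard GIoU definition with L_GIoU = 1 − GIoU, boxes given as (x1, y1, x2, y2), and a plain sum over positive anchors. The function names are hypothetical.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2), x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def regression_loss(pred_boxes, gt_boxes, t_hat):
    """Sum of GIoU losses over positive anchors, each weighted by its t_hat."""
    return sum(
        t * (1.0 - giou(b, g))
        for b, g, t in zip(pred_boxes, gt_boxes, t_hat)
    )
```

An anchor with a small t̂ contributes little to the sum even when its box is poor, which is how misaligned anchors are down-weighted.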
Experiments and Results
▪ Dataset and evaluation protocol
• Dataset : COCO 2017
• Metric : COCO Average Precision (AP)
▪ Implementation details
• Detection pipeline : backbone – FPN – head
• Backbone : ResNet-50, ResNet-101, ResNeXt-101-64x4d
• $N$ (number of interactive layers) : 6 (to keep the number of parameters similar to the conventional parallel head)
• $\gamma$ (the focusing parameter for focal loss) : 2
Experiments and Results
▪ Ablation Study
• On head structures
• ResNet-50 / 12 epochs
Experiments and Results
▪ Ablation Study
• On sample assignments
• ResNet-50 / 12 epochs
• * : 18 epochs
Experiments and Results
▪ Ablation Study
• TOOD
• ResNet-50 / 12 epochs
• On hyper-parameters
• ResNet-50 / 12 epochs
Experiments and Results
▪ Comparison with the State-of-the-Art
Experiments and Results
▪ Quantitative Analysis for Task-alignment
• Without NMS
• Pearson Correlation Coefficient (PCC) between the rankings of classification and localization, computed on the top-50 confident predictions for each instance
• Mean IoU of the top-10 confident predictions, averaged over instances
• With NMS
• #Correct boxes & #Redundant boxes : IoU >= 0.5
• #Error boxes : 0.1 < IoU < 0.5
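The ranking-based PCC can be sketched as below for a single instance. The sketch assumes that "rankings" means ordinal ranks of the scores and of the IoUs among the top-k most confident predictions, with ties broken by position; the function names are hypothetical. It needs at least two predictions with distinct ranks, otherwise the denominator is zero.

```python
import math


def ranks(values):
    """Rank values from 1 (largest) downward; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranking_pcc(scores, ious, top_k=50):
    """PCC between score ranks and IoU ranks of the top_k most confident predictions."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return pearson(ranks([scores[i] for i in top]), ranks([ious[i] for i in top]))
```

A value near 1 means the most confident predictions are also the best localized ones, which is the behaviour task alignment aims for.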
Conclusion
▪ Task-aligned One-stage Object Detection (TOOD)
• Aligns the classification and localization tasks
• Task-aligned Head (T-head) to enhance the interaction of the two tasks
• Task Alignment Learning (TAL) with a sample assignment scheme and new loss functions
• Surpasses the state-of-the-art one-stage detectors by a large margin

TOOD: Task-aligned One-stage Object Detection
