BoxInst: High-Performance Instance
Segmentation with Box Annotations
Zhi Tian et al (CVPR 2021)
Choi Dongmin
Abstract
• A high-performance mask-level instance segmentation

- with only bounding-box annotations for training

- dramatic improvement; AP 21.1% → 31.6 on COCO

- redesign the loss of learning masks

- can supervise the mask training without relying on mask annotations

- narrows the performance gap between weakly and fully supervision
Introduction
• Instance Segmentation
https://ai-pool.com/d/could-you-explain-me-how-instance-segmentation-works
• Weakly-Supervised Instance Segmentation (WSIS)

- Image-level annotation (e.g., CAM)

- Bounding-box annotation
Introduction
Chang et al. Weakly-Supervised Semantic Segmentation via Sub-category Exploration. CVPR 2020
• BBTB: Previous S.O.T.A of WSIS

- still low performance (21.1 AP on COCO benchmark)
Introduction
Hsu et al. Weakly Supervised Instance Segmentation using the Bounding Box Tightness Prior. NIPS 2019
• BoxInst

- only with bounding-box annotation

- COCO AP: 21.1 (previous S.O.T.A BBTA) → 31.6

- with aug. and 3x scheduling, achieved 32.1 AP
Introduction
NOTE: Mask R-CNN

achieved 37.5 AP
• BoxInst

- only with bounding-box annotation

- COCO AP: 21.1 (previous S.O.T.A BBTA) → 31.6

- with aug. and 3x scheduling, achieved 32.1 AP
Introduction
NOTE: Mask R-CNN

achieved 37.5 AP
• FCOS

- anchor-free one-stage object detector

- FPN + Detection head
Related Work
Tian et al. FCOS: Fully Convolutional One-Stage Object Detection. ICCV 2019
• CondInst

- solve instance segmentation in an RoI-free fully conv. way

- employ dynamic
fi
lters instead of a
fi
xed mask head

- FCOS + Mask head
Related Work
Tian et al. Conditional Convolutions for Instance Segmentation. ECCV 2020
• CondInst

- FCOS + Mask head
Related Work
Tian et al. Conditional Convolutions for Instance Segmentation. ECCV 2020
• CondInst

- FCOS + Mask head
Related Work
Tian et al. Conditional Convolutions for Instance Segmentation. ECCV 2020
predict the parameters of the mask head
• Two proposed loss term: 1. Projection loss term

- projections of the mask and box should be the same
Approach
Lproj = L(Projx(m̃), Projx(b)) + L(Projy(m̃), Projy(b))
: Dice Loss

: the network predictions for the instance mask

: ground-truth bounding box
L( ⋅ , ⋅ )
m̃ ∈ (0,1)H×W
b
• Two proposed loss term: 2. Pairwise a
ffi
nity loss term

- supervise the mask in a pairwise way
Approach
an undirected graph ; : the set of pixels, : the set of edges

the label for the edge 

if the two pixels have the same G.T. label or if not



G = (V, E) V E
ye ∈ {0,1}
ye = 1 ye = 0
P(ye = 1) = m̃i,j ⋅ m̃k,l + (1 − m̃i,j) ⋅ (1 − m̃k,l)
• Two proposed loss term: 2. Pairwise a
ffi
nity loss term

- supervise the mask in a pairwise way
Approach
ye
(i, j) (k, l)
m̃i,j m̃k,l
an undirected graph ; : the set of pixels, : the set of edges

the label for the edge 

if the two pixels have the same G.T. label or if not



G = (V, E) V E
ye ∈ {0,1}
ye = 1 ye = 0
P(ye = 1) = m̃i,j ⋅ m̃k,l + (1 − m̃i,j) ⋅ (1 − m̃k,l)
• Two proposed loss term: 2. Pairwise a
ffi
nity loss term

- supervise the mask in a pairwise way
Approach
ye
(i, j) (k, l)
m̃i,j m̃k,l






P(ye = 1) = m̃i,j ⋅ m̃k,l + (1 − m̃i,j) ⋅ (1 − m̃k,l)
P(ye = 0) = 1 − P(ye = 1)
Lpairwise = −
1
N ∑
e∈Ein
ye log P(ye = 1) + (1 − ye)log P(ye = 0)
• Two proposed loss term: 2. Pairwise a
ffi
nity loss term

- supervise the mask in a pairwise way
Approach
ye
(i, j) (k, l)
m̃i,j m̃k,l






P(ye = 1) = m̃i,j ⋅ m̃k,l + (1 − m̃i,j) ⋅ (1 − m̃k,l)
P(ye = 0) = 1 − P(ye = 1)
Lpairwise = −
1
N ∑
e∈Ein
ye log P(ye = 1) + (1 − ye)log P(ye = 0)
However, we only have bounding box annotations…
• Two proposed loss term: 2. Pairwise a
ffi
nity loss term

- Learning without mask annotations: color similarity

- idea: if two pixels have similar colors, they are likely to have the
same labels as well
Approach


assign if (a color similarity threshold)
Se = S(ci,j, cl,k) = exp(−
||ci,j − cj,k ||
θ
)
ye = 1 Se > τ
: color similarity of the edge 

: the color vector of the pixel (LAB color space)

: hyper-parameter
Se e
ci,j (i, j)
θ = 2
Lpairwise = −
1
N ∑
e∈En
1{Se>τ} log P(ye = 1)
only for the positive edges
• Two proposed loss term

- Projection loss term + Pairwise a
ffi
nity loss term
Approach
Lpairwise = −
1
N ∑
e∈En
1{Se>τ} log P(ye = 1)
Lproj = L(Projx(m̃), Projx(b)) + L(Projy(m̃), Projy(b))
Lmask = Lproj + Lpairwise
• Projection and Pairwise A
ffi
nity Loss for Mask Learning

- the redesigned mask loss can have similar performance to the
original pixel mask loss (e.g., Dice loss)
Experiments
• Box-supervised Instance Segmentation

- Varying the threshold of color similarity
Experiments
proportion of true positive edges
only with box annotations
• Box-supervised Instance Segmentation

- Varying the neighborhood of the pixels
Experiments
• Box-supervised Instance Segmentation

- The contribution of each loss term
Experiments
• Box-supervised Instance Segmentation

- Failure cases
Experiments
• Comparison with State-of-the-art on COCO
Experiments
• Comparison with State-of-the-art on COCO
Experiments
BoxInst outperforms the previous S.O.T.A BBTP with a large margin
• Comparison with State-of-the-art on COCO
Experiments
BoxInst even outperforms the fully supervised YOLACT and PolarMask
• Comparison with State-of-the-art on COCO
Experiments
BoxInst vs. Mask R-CNN and CondInst
• Experiments on PASCAL VOC
Experiments
• Semi-supervised Instance Segmentation
Experiments
Abstract
• BoxInst

- achieved high-quality instance segmentation w/ only box annotation

- core idea: the projection and pair-wise a
ffi
nity mask loss

- excellent instance segmentation performance w/o mask on COCO
and PASCAL VOC
Thank you

[Review] BoxInst: High-Performance Instance Segmentation with Box Annotations (CVPR2021)

  • 1.
    BoxInst: High-Performance Instance Segmentationwith Box Annotations Zhi Tian et al (CVPR 2021) Choi Dongmin
  • 2.
    Abstract • A high-performancemask-level instance segmentation
 - with only bounding-box annotations for training
 - dramatic improvement; AP 21.1% → 31.6 on COCO
 - redesign the loss of learning masks
 - can supervise the mask training without relying on mask annotations
 - narrows the performance gap between weakly and fully supervision
  • 3.
  • 4.
    • Weakly-Supervised InstanceSegmentation (WSIS)
 - Image-level annotation (e.g., CAM)
 - Bounding-box annotation Introduction Chang et al. Weakly-Supervised Semantic Segmentation via Sub-category Exploration. CVPR 2020
  • 5.
    • BBTB: PreviousS.O.T.A of WSIS
 - still low performance (21.1 AP on COCO benchmark) Introduction Hsu et al. Weakly Supervised Instance Segmentation using the Bounding Box Tightness Prior. NIPS 2019
  • 6.
    • BoxInst
 - onlywith bounding-box annotation
 - COCO AP: 21.1 (previous S.O.T.A BBTA) → 31.6
 - with aug. and 3x scheduling, achieved 32.1 AP Introduction NOTE: Mask R-CNN
 achieved 37.5 AP
  • 7.
    • BoxInst
 - onlywith bounding-box annotation
 - COCO AP: 21.1 (previous S.O.T.A BBTA) → 31.6
 - with aug. and 3x scheduling, achieved 32.1 AP Introduction NOTE: Mask R-CNN
 achieved 37.5 AP
  • 8.
    • FCOS
 - anchor-freeone-stage object detector
 - FPN + Detection head Related Work Tian et al. FCOS: Fully Convolutional One-Stage Object Detection. ICCV 2019
  • 9.
    • CondInst
 - solveinstance segmentation in an RoI-free fully conv. way
 - employ dynamic fi lters instead of a fi xed mask head
 - FCOS + Mask head Related Work Tian et al. Conditional Convolutions for Instance Segmentation. ECCV 2020
  • 10.
    • CondInst
 - FCOS+ Mask head Related Work Tian et al. Conditional Convolutions for Instance Segmentation. ECCV 2020
  • 11.
    • CondInst
 - FCOS+ Mask head Related Work Tian et al. Conditional Convolutions for Instance Segmentation. ECCV 2020 predict the parameters of the mask head
  • 12.
    • Two proposedloss term: 1. Projection loss term
 - projections of the mask and box should be the same Approach Lproj = L(Projx(m̃), Projx(b)) + L(Projy(m̃), Projy(b)) : Dice Loss : the network predictions for the instance mask : ground-truth bounding box L( ⋅ , ⋅ ) m̃ ∈ (0,1)H×W b
  • 13.
    • Two proposedloss term: 2. Pairwise a ffi nity loss term
 - supervise the mask in a pairwise way Approach an undirected graph ; : the set of pixels, : the set of edges the label for the edge if the two pixels have the same G.T. label or if not
 
 G = (V, E) V E ye ∈ {0,1} ye = 1 ye = 0 P(ye = 1) = m̃i,j ⋅ m̃k,l + (1 − m̃i,j) ⋅ (1 − m̃k,l)
  • 14.
    • Two proposedloss term: 2. Pairwise a ffi nity loss term
 - supervise the mask in a pairwise way Approach ye (i, j) (k, l) m̃i,j m̃k,l an undirected graph ; : the set of pixels, : the set of edges the label for the edge if the two pixels have the same G.T. label or if not
 
 G = (V, E) V E ye ∈ {0,1} ye = 1 ye = 0 P(ye = 1) = m̃i,j ⋅ m̃k,l + (1 − m̃i,j) ⋅ (1 − m̃k,l)
  • 15.
    • Two proposedloss term: 2. Pairwise a ffi nity loss term
 - supervise the mask in a pairwise way Approach ye (i, j) (k, l) m̃i,j m̃k,l P(ye = 1) = m̃i,j ⋅ m̃k,l + (1 − m̃i,j) ⋅ (1 − m̃k,l) P(ye = 0) = 1 − P(ye = 1) Lpairwise = − 1 N ∑ e∈Ein ye log P(ye = 1) + (1 − ye)log P(ye = 0)
  • 16.
    • Two proposedloss term: 2. Pairwise a ffi nity loss term
 - supervise the mask in a pairwise way Approach ye (i, j) (k, l) m̃i,j m̃k,l P(ye = 1) = m̃i,j ⋅ m̃k,l + (1 − m̃i,j) ⋅ (1 − m̃k,l) P(ye = 0) = 1 − P(ye = 1) Lpairwise = − 1 N ∑ e∈Ein ye log P(ye = 1) + (1 − ye)log P(ye = 0) However, we only have bounding box annotations…
  • 17.
    • Two proposedloss term: 2. Pairwise a ffi nity loss term
 - Learning without mask annotations: color similarity
 - idea: if two pixels have similar colors, they are likely to have the same labels as well Approach assign if (a color similarity threshold) Se = S(ci,j, cl,k) = exp(− ||ci,j − cj,k || θ ) ye = 1 Se > τ : color similarity of the edge : the color vector of the pixel (LAB color space) : hyper-parameter Se e ci,j (i, j) θ = 2 Lpairwise = − 1 N ∑ e∈En 1{Se>τ} log P(ye = 1) only for the positive edges
  • 18.
    • Two proposedloss term
 - Projection loss term + Pairwise a ffi nity loss term Approach Lpairwise = − 1 N ∑ e∈En 1{Se>τ} log P(ye = 1) Lproj = L(Projx(m̃), Projx(b)) + L(Projy(m̃), Projy(b)) Lmask = Lproj + Lpairwise
  • 19.
    • Projection andPairwise A ffi nity Loss for Mask Learning
 - the redesigned mask loss can have similar performance to the original pixel mask loss (e.g., Dice loss) Experiments
  • 20.
    • Box-supervised InstanceSegmentation
 - Varying the threshold of color similarity Experiments proportion of true positive edges only with box annotations
  • 21.
    • Box-supervised InstanceSegmentation
 - Varying the neighborhood of the pixels Experiments
  • 22.
    • Box-supervised InstanceSegmentation
 - The contribution of each loss term Experiments
  • 23.
    • Box-supervised InstanceSegmentation
 - Failure cases Experiments
  • 24.
    • Comparison withState-of-the-art on COCO Experiments
  • 25.
    • Comparison withState-of-the-art on COCO Experiments BoxInst outperforms the previous S.O.T.A BBTP with a large margin
  • 26.
    • Comparison withState-of-the-art on COCO Experiments BoxInst even outperforms the fully supervised YOLACT and PolarMask
  • 27.
    • Comparison withState-of-the-art on COCO Experiments BoxInst vs. Mask R-CNN and CondInst
  • 28.
    • Experiments onPASCAL VOC Experiments
  • 29.
    • Semi-supervised InstanceSegmentation Experiments
  • 30.
    Abstract • BoxInst
 - achievedhigh-quality instance segmentation w/ only box annotation
 - core idea: the projection and pair-wise a ffi nity mask loss
 - excellent instance segmentation performance w/o mask on COCO and PASCAL VOC
  • 31.