Mask-RCNN for Instance Segmentation
Visual perception tasks
(Figure: examples of image classification, object detection, semantic segmentation, and instance segmentation.)
Agenda
• Visual perception tasks
• Mask-RCNN
• Mask-RCNN architecture
• Feature Pyramid Network
• Region Proposal Network
• RoIAlign
• Mask-RCNN head network
• Results
• Summary
Introduction to MaskRCNN
• Mask-RCNN stands for Mask-Region Convolutional Neural Network
• State-of-the-art algorithm for Instance Segmentation
• Evolved through 4 main versions:
• RCNN → Fast-RCNN → Faster-RCNN → Mask-RCNN
• The first 3 versions are for Object Detection
• Improvements over Faster RCNN: use RoIAlign instead of RoIPool
• Employ Fully Convolutional Network (FCN) for mask prediction. Predict mask for each
class independently
• 2 main stages:
• 1st stage: use Region Proposal Network (RPN) to propose candidate object bounding boxes
• 2nd stage: classify the candidate boxes, refine the boxes and predict masks
Terms
• Bounding box: a rectangle identifying the location of an object
• Mask: set of pixels which belong to an object
• Anchor: a bounding box generated independently of image content
• RoI: Region of Interest, a bounding box which may contain an object
• Non-Maximum Suppression (NMS): a method to eliminate duplicate bounding boxes using scores and an IoU threshold
• IoU: Intersection over Union, a metric measuring how much two areas overlap
• RoIAlign: a method to extract features for RoIs from feature maps
• Feature Pyramid Network (FPN): a neural network to extract feature maps at different scales
• Region Proposal Network (RPN): a neural network to propose RoIs for an image
• Fully Convolutional Network (FCN): a convolution-based neural network to extract masks
MaskRCNN architecture
(Architecture diagram: a backbone with FPN extracts multi-scale feature maps; the RPN proposes RoIs; RoIAlign extracts a fixed-size feature map per RoI; head networks predict the class (background + number of classes), box refinement, and mask.)
Multi-scale problem
Approaches for objects at multiple scales (figure from [6]):
• Featurized image pyramid: compute features on each of multiple image scales (accurate but slow)
• Single feature map: predict from the last feature map only (fast but weak for small objects)
• Pyramidal feature hierarchy: reuse the feature maps of different layers
• Feature Pyramid Network: add a top-down pathway with lateral connections to get strong features at all scales
Feature Pyramid Network (FPN)
(Diagram from [6]: a bottom-up backbone pathway and a top-down pathway with lateral connections produce feature maps P2–P5 at multiple scales.)
To detect boundaries of objects, the head network predicts bounding-box refinements via regression.
Bounding box regression
• Boxes are parameterized by center (y, x) and size (h, w)
• Given an anchor (or RoI) box (y_a, x_a, h_a, w_a) and a ground-truth box (y, x, h, w), the regression targets are:
dy = (y − y_a) / h_a, dx = (x − x_a) / w_a
dh = log(h / h_a), dw = log(w / w_a)
• Conversely, predicted values (dy, dx, dh, dw) are applied to a box by:
y = y_a + dy · h_a, x = x_a + dx · w_a
h = h_a · exp(dh), w = w_a · exp(dw)
(From [2])
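The parameterization above can be sketched in a few lines of Python (the helper names `compute_deltas` and `apply_deltas` are illustrative, not from the slides):

```python
import math

# Boxes are (y, x, h, w): center coordinates plus height and width.

def compute_deltas(anchor, gt):
    """Regression targets (dy, dx, dh, dw) that map `anchor` onto `gt`."""
    ya, xa, ha, wa = anchor
    y, x, h, w = gt
    return ((y - ya) / ha,
            (x - xa) / wa,
            math.log(h / ha),
            math.log(w / wa))

def apply_deltas(anchor, deltas):
    """Inverse operation: apply predicted (dy, dx, dh, dw) to `anchor`."""
    ya, xa, ha, wa = anchor
    dy, dx, dh, dw = deltas
    return (ya + dy * ha,
            xa + dx * wa,
            ha * math.exp(dh),
            wa * math.exp(dw))

anchor = (50.0, 60.0, 100.0, 200.0)
gt = (55.0, 70.0, 120.0, 180.0)
deltas = compute_deltas(anchor, gt)
recovered = apply_deltas(anchor, deltas)  # round-trips back to gt
```

`apply_deltas` inverts `compute_deltas`, which is exactly how predicted refinements are turned back into boxes at inference time.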
Region Proposal Network
Anchor generator
Proposal layer
Filter out negative anchors with
Rpn_probs and Non-Max Suppression
RPN head network
(Diagram: a shared convolution slides over each feature map; two sibling 1×1 convolutions then output, for every anchor at every location, 2 objectness scores (rpn_probs, object vs. background) and 4 box-refinement values (dy, dx, dh, dw).)
Anchor generator
(Anchors are generated at every position of each feature map, with several scales and aspect ratios, independently of the image content; each anchor maps back to a box on the input image.)
IoU – Intersection over Union
• IoU measures how well two boxes overlap:
IoU = area of intersection / area of union
• IoU = 0: no overlap; IoU = 1: the two boxes coincide
(Figure: example box pairs with low, medium, and high IoU values.)
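The IoU definition above is a few lines of code (the corner-coordinate box format `(y1, x1, y2, x2)` is an assumption of this sketch):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (y1, x1, y2, x2)."""
    y1 = max(box_a[0], box_b[0])
    x1 = max(box_a[1], box_b[1])
    y2 = min(box_a[2], box_b[2])
    x2 = min(box_a[3], box_b[3])
    # Empty intersection when the boxes do not overlap.
    inter = max(0.0, y2 - y1) * max(0.0, x2 - x1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```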
Proposal layer
• Sort all anchors by rpn_probs (how likely an anchor contains an object)
• Choose the top N anchors and discard the rest (e.g., N ≈ 6000)
• Apply Non-Maximum Suppression (NMS) to eliminate duplicate boxes.
Keep up to M anchors (e.g., M ≈ 2000).
Non-Maximum Suppression (NMS)
• Sort the boxes by score in descending order
• Repeat while boxes remain:
• Take the box with the highest score and keep it
• Discard every remaining box whose IoU with the kept box is above a threshold (e.g., 0.5), since it likely covers the same object
• The kept boxes are the final, de-duplicated proposals
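The greedy NMS procedure can be sketched as follows (function name and the `(y1, x1, y2, x2)` box format are illustrative):

```python
def nms(boxes, scores, iou_threshold=0.5, max_keep=2000):
    """Greedy NMS. boxes: list of (y1, x1, y2, x2); returns kept indices."""
    def iou(a, b):
        y1, x1 = max(a[0], b[0]), max(a[1], b[1])
        y2, x2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, y2 - y1) * max(0.0, x2 - x1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    # Indices sorted by score, best first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order and len(keep) < max_keep:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```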
Train the RPN
• Positive boxes: IoU >= 0.7 with any GT box
• Negative boxes: IoU < 0.3 with all GT boxes
• Ratio of positive boxes: 1/3
• Fixed number of anchors sampled per image for training: 256
Loss function
• i is the index of an anchor in a mini-batch
• p_i is the predicted probability of anchor i being an object
• The ground-truth label p_i* is 1 if the anchor is positive, and 0 if the anchor is negative
• t_i is a vector representing the 4 parameterized coordinates (dy, dx, dh, dw) of the predicted bounding box; t_i* is that of the ground-truth box associated with a positive anchor
• The RPN loss [2] is:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)
• Classification loss L_cls is log loss over two classes (object vs. not object)
• For regression loss L_reg, use L_reg(t_i, t_i*) = R(t_i − t_i*), where R is smoothL1 defined as:
smoothL1(x) = 0.5 x² if |x| < 1, and |x| − 0.5 otherwise
• While both positive and negative anchors contribute to the classification loss, only positive anchors contribute to the regression loss (the p_i* factor).
• The cls term is normalized by the mini-batch size (N_cls = 256), the reg term by the number of anchor locations (N_reg ≈ 2400); λ is set to 10.
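The smoothL1 term can be written out directly (helper names are illustrative, not from the slides):

```python
def smooth_l1(x):
    """smoothL1(x) = 0.5*x**2 if |x| < 1, else |x| - 0.5."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def l_reg(t, t_star):
    """Regression loss for one anchor: smoothL1 summed over the 4 deltas."""
    return sum(smooth_l1(a - b) for a, b in zip(t, t_star))
```

Near zero the loss is quadratic (small, stable gradients); beyond |x| = 1 it grows linearly, so outlier anchors do not dominate training.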
RoIAlign
• Example: a 1024×1024 input image contains an object RoI of size 540×540
• The feature map is 16× smaller than the image: 1024/16 = 64, but 540/16 = 33.75 — the RoI maps to non-integer coordinates
• RoIAlign keeps these exact coordinates: no quantization, unlike RoIPool
• The RoI is divided into 7×7 bins (33.75 / 7 = 4.82 per bin) to produce a small 7×7 feature map for each RoI
• Bilinear interpolation is used to calculate the exact feature value at each sampling point within each bin
(From [1])
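The bilinear sampling step that RoIAlign relies on can be sketched as follows (a minimal version that assumes the sample point is in the interior of the map and ignores edge clamping):

```python
import math

def bilinear(feature_map, y, x):
    """Sample feature_map (a 2-D list) at real-valued (y, x)
    by bilinear interpolation over its 4 nearest grid cells."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0  # fractional offsets within the cell
    return ((1 - wy) * (1 - wx) * feature_map[y0][x0] +
            (1 - wy) * wx * feature_map[y0][x1] +
            wy * (1 - wx) * feature_map[y1][x0] +
            wy * wx * feature_map[y1][x1])

fm = [[0.0, 1.0],
      [2.0, 3.0]]
sample = bilinear(fm, 0.5, 0.5)  # average of the 4 cells
```

Because the weights vary smoothly with (y, x), gradients flow back to the feature map, which is what lets RoIAlign avoid the quantization of RoIPool.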
Identify Feature Pyramid level for RoIs
• Each RoI is cropped from one of the FPN feature maps P2–P5, chosen by its size
• w, h: width and height of a RoI
• 224: canonical ImageNet pre-training size
• k0: target level of the RoI whose w × h = 224² (here, k0 = 5)
• Target level k of a RoI is identified by:
k = ⌊k0 + log2(√(w·h) / 224)⌋
• Intuitions:
• Features of large RoIs come from a smaller feature map (high semantic level)
• Features of small RoIs come from a larger feature map (low semantic level)
(From [6])
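The level-assignment formula can be checked with a short sketch (the function name `roi_level` and the clamping of k to the available levels 2–5 are assumptions of this sketch):

```python
import math

def roi_level(w, h, k0=5, k_min=2, k_max=5):
    """Target FPN level for a RoI of width w and height h:
    k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to the available levels."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))
```

A 224×224 RoI lands on the canonical level k0; halving the RoI side drops it one level, toward the larger, lower-semantic feature maps.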
Mask-RCNN head network
• A classifier to identify the class for each RoI: K classes + background
• A regressor to predict the 4 values dy, dx, dh, dw for each RoI
• Fully Convolutional Network (FCN) [5] to predict mask per class
• Represent a mask as m x m matrix
• For each RoI, try to predict mask for each class
• Use a sigmoid to predict the probability of each pixel belonging to the mask
• Use binary cross-entropy loss to train the network
Mask-RCNN head network architecture
Classification and box-regression branch (for each RoI):
• Input: 7×7×256 small feature map
• Conv1: 7×7 (1024 filters) → Conv2: 1×1 (1024 filters) → 1024 features
(fully connected layers implemented by CNN; weights shared over multiple RoIs)
• Dense → K+1 with softmax: class scores (BG + num classes)
• Dense → (K+1) × 4: box regression values dy, dx, dh, dw per class
Mask branch (for each RoI):
• Input: 14×14×256 feature map
• Conv1 … Conv4: 3×3 (256 filters) → 14×14×256 (×4 conv layers)
• Conv transpose (up-sampling): 2×2 (256 filters, stride 2) → 28×28×256
• Conv: 1×1 (K+1 filters) with sigmoid activation → 28×28×(K+1)
• Predicts a mask per class: BG vs. K classes
Loss functions
• For each sampled RoI, a multi-task loss is applied:
L = L_cls + L_loc + L_mask
where
• Lcls is classification loss
• Lloc is bounding-box regression loss
• Lmask is mask loss
• The final loss is calculated as mean of loss over samples
Classification loss Lcls
• For a RoI, denote:
• u: the true class of the RoI
• p = (p_0, …, p_K): the predicted probability distribution over K+1 classes
• The classification loss L_cls for a RoI is a log-loss calculated as:
L_cls = −log p_u
Bounding-box regression loss Lloc
• For a RoI, denote:
• u: the true class of the RoI
• v = (v_y, v_x, v_h, v_w): the true bounding-box regression targets of the RoI
• t^u = (t^u_y, t^u_x, t^u_h, t^u_w): the predicted bounding-box regression for the class u
• The bounding-box regression loss L_loc for the RoI is calculated as:
L_loc(t^u, v) = Σ_{i ∈ {y, x, h, w}} smoothL1(t^u_i − v_i)
where smoothL1(x) = 0.5 x² if |x| < 1, and |x| − 0.5 otherwise
Mask loss Lmask
• For a RoI, denote:
• u: the true class of the RoI
• y, ŷ: the true mask and the predicted mask for the class u of the RoI, respectively (each an m × m matrix)
• The mask loss L_mask for the RoI is the average binary cross-entropy loss, calculated as:
L_mask = −(1/m²) Σ_{i,j} [ y_ij log ŷ_ij + (1 − y_ij) log(1 − ŷ_ij) ]
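The average binary cross-entropy can be sketched in pure Python (function name illustrative; `true_mask` holds 0/1 labels, `pred_mask` the sigmoid outputs for the RoI's true class):

```python
import math

def mask_loss(true_mask, pred_mask):
    """Average binary cross-entropy between an m*m binary true mask
    and the predicted per-pixel probabilities."""
    m = len(true_mask)
    total = 0.0
    for i in range(m):
        for j in range(m):
            y, p = true_mask[i][j], pred_mask[i][j]
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / (m * m)
```

Because each class has its own sigmoid mask, classes do not compete per pixel; the classification branch alone decides which of the K+1 masks is used.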
Mask-RCNN on COCO data
(From [1])
Evolution of R-CNN
• R-CNN [4]: region proposals on the input image, each processed by a ConvNet
• Fast R-CNN [3] = R-CNN + run the ConvNet on the whole input image first, then apply a RoIPooling layer
• Faster R-CNN [2] = Fast R-CNN + Region Proposal Network for proposals
• Mask R-CNN [1] = Faster R-CNN + Fully Convolutional Network [5], replacing RoIPool with RoIAlign and per-pixel softmax with per-pixel sigmoid
Summary
• Introduced MaskRCNN, an algorithm for Instance Segmentation
• Detect both bounding boxes and masks of objects in an end-to-end
neural network
• Replaces RoIPool from Faster-RCNN with RoIAlign
• Employs a Fully Convolutional Network for mask prediction
References
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. IEEE
International Conference on Computer Vision (ICCV), 2017.
[2] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection
with region proposal networks. In NIPS, 2015.
[3] R. Girshick. Fast R-CNN. In ICCV, 2015.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object
detection and semantic segmentation. In CVPR, 2014.
[5] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic
segmentation. In CVPR, 2015.
[6] T.Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid
networks for object detection. In CVPR, 2017.
by Nguyen Phuoc Tat Dat
Appendix: Some popular DL-based algorithms for visual perception tasks
Visual perception tasks Algorithms
Image Classification
AlexNet
Inception
GoogLeNet/Inception v1
ResNet
VGGNet
Object Detection
Fast/Faster R-CNN
SSD
YOLO
Semantic Segmentation
Fully Convolutional Network (FCN)
U-Net
Instance Segmentation Mask R-CNN
Thank you for listening!