
Mask R-CNN

Mask R-CNN, Kaiming He, Facebook, ICCV2017


  1. Mask R-CNN, ICCV 2017 (Oral). Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick, Facebook AI Research (FAIR). Presented by Chanuk Lim, KEPRI, 2017.08.10.
  2. 1. Abstract: Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
  3. 2. Review - Instance Segmentation (http://blog.naver.com/sogangori/221012300995): Instance segmentation combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances. Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance.
  4. 2. Review - Fast R-CNN & Faster R-CNN
     • R-CNN: ~2k region proposals from an independent algorithm; each proposal is warped and passed through a CNN for feature extraction, followed by SVM classification and class-specific bounding-box regression. It is based on proposed Regions of Interest (RoI), requires region warping for fixed-size features, and is a very inefficient pipeline.
     • Fast R-CNN: a convolutional backbone computes a shared feature map; an RoIPool layer extracts a fixed-size feature map per RoI (RoIs still come from an independent method), followed by fully connected layers for classification and box regression.
     • Faster R-CNN: adds a Region Proposal Network (RPN) on the shared feature map to replace the external proposal method; the rest follows Fast R-CNN.
     • Mask R-CNN (overview): same layout as Faster R-CNN, but with an RoIAlign layer instead of RoIPool and an additional mask branch next to the box regression and classification head (a minimal sketch follows below).
     Source: Curtis Kim, kakao, https://www.slideshare.net/IldooKim/deep-object-detectors-1-20166
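To make the structural comparison concrete, here is a minimal sketch of the Mask R-CNN forward path (backbone, proposals, RoIAlign, then parallel box/class and mask heads). It assumes PyTorch and torchvision are available; the backbone, head sizes, and the single fixed "proposal" are hypothetical stand-ins, and a real RPN is omitted.

```python
# Minimal structural sketch of the Mask R-CNN pipeline (illustrative only).
import torch
import torch.nn as nn
from torchvision.ops import roi_align

backbone = nn.Sequential(                          # stand-in convolutional backbone, stride 16
    nn.Conv2d(3, 256, 3, stride=16, padding=1), nn.ReLU())
box_head = nn.Sequential(nn.Flatten(), nn.Linear(256 * 7 * 7, 1024), nn.ReLU())
cls_score, box_reg = nn.Linear(1024, 81), nn.Linear(1024, 81 * 4)
mask_head = nn.Sequential(                         # small FCN mask branch, one mask per class
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
    nn.Conv2d(256, 80, 1))

image = torch.rand(1, 3, 800, 800)
features = backbone(image)                                     # (1, 256, 50, 50)
proposals = torch.tensor([[0, 100.0, 120.0, 300.0, 360.0]])    # stub "RPN" output: (batch_idx, x1, y1, x2, y2)

# RoIAlign extracts fixed-size features for both heads
box_feat = roi_align(features, proposals, output_size=(7, 7), spatial_scale=1 / 16, sampling_ratio=2)
mask_feat = roi_align(features, proposals, output_size=(14, 14), spatial_scale=1 / 16, sampling_ratio=2)

fc = box_head(box_feat)
scores, deltas, masks = cls_score(fc), box_reg(fc), mask_head(mask_feat)
print(scores.shape, deltas.shape, masks.shape)     # (1, 81) (1, 324) (1, 80, 28, 28)
```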
  5. 4. Contribution
     1) To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations.
     2) Adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression.
  6. 4. Contribution (figure; source: http://blog.naver.com/sogangori/221012300995)
  7. 4. Contribution 1) RoIAlign
     • Previous work - RoIPool: a 112×112 region is max-pooled down to 7×7, and the input coordinate is rounded off. We propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. RoIAlign improves mask accuracy by a relative 10% to 50%, showing bigger gains under stricter localization metrics.
     • Quantization example (stride 16, [ · ] denotes rounding; see the sketch below):
       [32/16] = 2, 32/16 = 2
       [33/16] = 2, 33/16 = 2.06
       [34/16] = 2, 34/16 = 2.12
       [35/16] = 2, 35/16 = 2.18
       [36/16] = 2, 36/16 = 2.25
       [37/16] = 2, 37/16 = 2.31
       [38/16] = 2, 38/16 = 2.37
       [39/16] = 2, 39/16 = 2.43
       [40/16] = 3, 40/16 = 2.5
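A tiny sketch reproducing the rounding arithmetic above, assuming a stride-16 feature map and round-half-up quantization (the coordinates 32 to 40 are simply the slide's example values).

```python
# Quantization error of RoIPool-style rounding vs. RoIAlign's exact coordinates.
import math

stride = 16
for x in range(32, 41):
    exact = x / stride                      # continuous coordinate kept by RoIAlign
    quantized = math.floor(exact + 0.5)     # rounded-off coordinate used by RoIPool
    print(f"[{x}/{stride}] = {quantized},  {x}/{stride} = {exact:.2f},  misalignment = {quantized - exact:+.2f}")
```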
  8. 4. Contribution 1) RoIAlign
     We use bilinear interpolation (Spatial Transformer Networks; Jaderberg et al., NIPS 2015) to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average).
     Spatial transformer background (Jaderberg et al., Figure 2): a localisation network regresses transformation parameters $\theta$ from the input feature map $U$, a grid generator builds a sampling grid $T_\theta(G)$ over the output $V$, and a sampler applies it to $U$, producing the warped output feature map $V$. For a 2D affine transformation $A_\theta$, the pointwise transformation is
     $$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix},$$
     where $(x_i^t, y_i^t)$ are the target coordinates of the regular grid in the output feature map and $(x_i^s, y_i^s)$ are the source coordinates in the input feature map that define the sample points. A more constrained attention transform $A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}$ allows cropping, translation, and isotropic scaling.
     Differentiable image sampling: each $(x_i^s, y_i^s)$ coordinate in $T_\theta(G)$ defines the spatial location in the input where a sampling kernel is applied to get the value of a particular output pixel,
     $$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m;\, \Phi_x)\, k(y_i^s - n;\, \Phi_y),$$
     where $k(\cdot)$ is a generic sampling kernel with parameters $\Phi$, $U_{nm}^c$ is the input feature value at location $(n, m)$ in channel $c$, and $V_i^c$ is the target feature value at location $i$ in channel $c$. An integer sampling kernel just copies the value at the nearest pixel to $(x_i^s, y_i^s)$ to the output location:
     $$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, \delta(\lfloor x_i^s + 0.5 \rfloor - m)\, \delta(\lfloor y_i^s + 0.5 \rfloor - n).$$
     The bilinear sampling kernel used here instead gives
     $$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|).$$
     Any kernel can be used as long as (sub-)gradients can be defined with respect to $x_i^s$ and $y_i^s$, so the sampling is differentiable and gradients can be backpropagated to the localisation network output $\theta$.
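To make the bilinear kernel above concrete, here is a minimal NumPy sketch (an illustrative re-implementation, not the authors' code; the feature map and sample points are made up). It samples one RoI bin at four regularly spaced locations and averages them, as RoIAlign does.

```python
# Bilinear sampling kernel: V = sum_nm U[n, m] * max(0, 1-|x-m|) * max(0, 1-|y-n|)
import numpy as np

def bilinear_sample(U, xs, ys):
    """Sample channel-first feature map U (C, H, W) at continuous coords (xs, ys)."""
    C, H, W = U.shape
    x0, y0 = int(np.floor(xs)), int(np.floor(ys))
    value = np.zeros(C)
    for n in (y0, y0 + 1):          # only the four neighbouring pixels have
        for m in (x0, x0 + 1):      # non-zero bilinear weights
            if 0 <= n < H and 0 <= m < W:
                w = max(0.0, 1 - abs(xs - m)) * max(0.0, 1 - abs(ys - n))
                value += w * U[:, n, m]
    return value

# RoIAlign-style use: average four regularly sampled points inside one bin.
U = np.random.rand(256, 14, 14)
bin_samples = [(2.25, 3.25), (2.75, 3.25), (2.25, 3.75), (2.75, 3.75)]
bin_value = np.mean([bilinear_sample(U, x, y) for x, y in bin_samples], axis=0)
print(bin_value.shape)  # (256,)
```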
  9. 4. Contribution 1) RoIAlign - RoIPool in Faster R-CNN: input activation, region projection with quantized pooling sections, then max-pooling output. Source: https://deepsense.io/region-of-interest-pooling-explained/
  10. 4. Contribution 1) RoIAlign - RoIAlign in Mask R-CNN: input activation, region projection and pooling sections, sampling locations, bilinear interpolated values, then max-pooling output (see the comparison sketch below). Source: Silvio Galesso, https://lmb.informatik.uni-freiburg.de/lectures/seminar_brox/seminar_ss17/maskrcnn_slides.pdf
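For a side-by-side feel of the two RoI layers, torchvision exposes both operators. The sketch below is an illustrative usage example with arbitrary tensor sizes, not the authors' code.

```python
# RoIPool (quantized bins) vs. RoIAlign (bilinearly sampled points per bin).
import torch
from torchvision.ops import roi_pool, roi_align

feat = torch.randn(1, 256, 50, 50)                       # stride-16 feature map
rois = torch.tensor([[0, 33.0, 35.0, 113.0, 119.0]])     # (batch_idx, x1, y1, x2, y2) in image coords

pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)   # quantized bins
aligned = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1 / 16,
                    sampling_ratio=2)   # 2x2 = four regularly sampled, bilinearly interpolated points per bin
print(pooled.shape, aligned.shape)      # both torch.Size([1, 256, 7, 7])
```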
  11. 4. Contribution 2) Network architecture
     Mask R-CNN = Faster R-CNN + an instance segmentation branch, with a ResNet or ResNeXt backbone, optionally with FPN (Feature Pyramid Network).
     Head architecture (paper Figure 3): two existing Faster R-CNN heads are extended. With the ResNet C4 backbone, the head runs res5 on each RoI for class and box prediction and adds a mask branch that outputs 14×14×80 masks; with the FPN backbone, the head applies 1024-d fully connected layers to a 7×7×256 RoI feature for class and box, and a small FCN to a 14×14×256 RoI feature, outputting 28×28×80 masks.
     Faster R-CNN background: RPN + convolutional backbone + RoIPool layer (fixed-size feature map) + fully connected layers for box regression and classification.
  12. 4. Contribution 2) Network architecture
     • Backbone - ResNet (He et al., CVPR 2016; Table 1, architectures for ImageNet):
       - conv1: 7×7, 64, stride 2 (output 112×112); conv2_x starts with a 3×3 max pool, stride 2 (output 56×56).
       - 18/34-layer nets stack [3×3, 3×3] basic blocks; 50/101/152-layer nets stack [1×1, 3×3, 1×1] bottleneck blocks. Block counts for conv2_x through conv5_x: 18-layer (2, 2, 2, 2), 34-layer (3, 4, 6, 3), 50-layer (3, 4, 6, 3), 101-layer (3, 4, 23, 3), 152-layer (3, 8, 36, 3).
       - Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2; the network ends with global average pooling and a 1000-d fc with softmax. FLOPs: 1.8×10^9 (18-layer), 3.6×10^9 (34), 3.8×10^9 (50), 7.6×10^9 (101), 11.3×10^9 (152).
       - (Figure 4 of the ResNet paper: ImageNet training and validation error curves for plain vs. residual 18/34-layer networks.)
     • Backbone - FPN: FPN exploits the inherent hierarchy of CNNs to compute multi-scale features; the single-scale feature map used by the RPN (ResNet-50-C4 / ResNet-101-C4) is replaced with an FPN, with single-scale anchors assigned to each pyramid level. Source: Lin et al., Feature Pyramid Networks for Object Detection, CVPR 2017.
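As an illustrative note (not from the slides): torchvision provides a ready-made Mask R-CNN with a ResNet-50-FPN backbone whose layout mirrors the backbone/head description above. The snippet only shows the expected input/output format, assuming torchvision is installed.

```python
# Mask R-CNN with a ResNet-50-FPN backbone, randomly initialized (no pretrained weights).
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=91)
model.eval()

with torch.no_grad():
    images = [torch.rand(3, 480, 640)]    # list of CHW images with values in [0, 1]
    outputs = model(images)               # one dict per image

print(outputs[0].keys())                  # dict_keys(['boxes', 'labels', 'scores', 'masks'])
```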
  13. 4. Contribution 2) Network architecture - Loss
     • Fast R-CNN multi-task loss (review): the loss is defined per RoI as class loss + localization loss. The class loss is a log loss; the localization loss is a smooth L1 loss on the box coordinates, which is less sensitive than L2 (whose gradient can become very large) and is excluded for the background class (u = 0).
     • Mask R-CNN: during training, we define a multi-task loss on each sampled RoI as L = Lcls + Lbox + Lmask. Lcls and Lbox are identical to those defined in Fast R-CNN. The mask branch is fully convolutional and has a K·m² dimensional output for each RoI, encoding K binary masks of resolution m×m, one for each of the K classes. We apply a per-pixel sigmoid and define Lmask as the mean binary cross-entropy loss; for an RoI associated with ground-truth class k, Lmask is only defined on the k-th mask (other mask outputs do not contribute to the loss).
     • This definition allows the network to generate masks for every class without competition among classes; the dedicated classification branch predicts the class label used to select the output mask, decoupling mask and class prediction. This differs from the common practice of applying FCNs to semantic segmentation with a per-pixel softmax and a multinomial cross-entropy loss, where masks across classes compete.
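A minimal sketch of the per-RoI mask loss described above, assuming PyTorch; the tensor shapes (80 classes, 28×28 masks, 8 RoIs) are illustrative, and Lcls/Lbox are omitted.

```python
# Per-pixel sigmoid + mean binary cross-entropy on the ground-truth class mask only.
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    """mask_logits: (N, K, m, m) raw outputs of the mask branch for N RoIs.
    gt_masks: (N, m, m) binary ground-truth masks resampled to m x m.
    gt_classes: (N,) ground-truth class index k of each RoI."""
    n = mask_logits.shape[0]
    # select only the k-th mask per RoI; the other K-1 masks get no gradient
    selected = mask_logits[torch.arange(n), gt_classes]          # (N, m, m)
    return F.binary_cross_entropy_with_logits(selected, gt_masks)

mask_logits = torch.randn(8, 80, 28, 28)
gt_masks = torch.randint(0, 2, (8, 28, 28)).float()
gt_classes = torch.randint(0, 80, (8,))
l_mask = mask_loss(mask_logits, gt_masks, gt_classes)
# the total head loss would be L = L_cls + L_box + l_mask
print(l_mask.item())
```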
  14. 4. Contribution 2) Network architecture
     Our definition of Lmask allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask (see the sketch below). This decouples mask and class prediction and differs from common practice when applying FCNs to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss: in that case masks across classes compete, whereas with a per-pixel sigmoid and a binary loss they do not. Experiments show this formulation is key for good instance segmentation results: per-class binary masks (sigmoid, 30.3 mask AP) give large gains over multinomial masks (softmax, 24.8 mask AP) with ResNet-50-C4 (paper Table 2b), and RoIAlign improves AP by about 3 points over RoIPool, with much of the gain coming at high IoU (AP75). The full ablations and the FCIS+++ comparison are given in the Experiments slides.
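The decoupling also shows up at inference: the classification branch picks the label, and that label selects one of the K class-specific masks, each thresholded through an independent sigmoid. A small illustrative sketch with assumed shapes and random tensors:

```python
# Selecting the predicted class's mask and binarizing it independently per pixel.
import torch

class_logits = torch.randn(1, 80)            # classification branch output for one RoI
mask_logits = torch.randn(1, 80, 28, 28)     # mask branch output: one m x m mask per class

k = class_logits.argmax(dim=1)                          # predicted class label
selected_mask = mask_logits[torch.arange(1), k]         # (1, 28, 28) mask for class k
binary_mask = torch.sigmoid(selected_mask) > 0.5        # per-pixel binary decision, no cross-class softmax
print(k.item(), binary_mask.float().mean().item())
```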
  15. 5. Experiments
     • Main dataset: MS COCO (80 classes)
     • 80k train images, plus a 35k subset of the val images (trainval35k) used for training
     • 5k val images (minival) held out for ablation experiments
     • Metric: mask AP, evaluated using mask IoU (AP averaged over IoU thresholds, plus AP50, AP75, and APS/APM/APL by object size); see the sketch below
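For reference, a minimal sketch of the mask IoU underlying mask AP, assuming NumPy; the binary masks here are made up, not COCO annotations.

```python
# Mask IoU between a predicted and a ground-truth binary mask.
import numpy as np

def mask_iou(pred, gt):
    """pred, gt: boolean arrays of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

pred = np.zeros((28, 28), dtype=bool); pred[4:20, 4:20] = True
gt = np.zeros((28, 28), dtype=bool); gt[8:24, 8:24] = True
print(mask_iou(pred, gt))  # a detection counts as correct when IoU exceeds a threshold (e.g. 0.5)
```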
  16. 5. Experiments
     [Figure 4: More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1). The images are overlaid with per-instance masks, class labels, and confidence scores (person, car, truck, donut, zebra, kite, chair, umbrella, etc.).]
  17. 5. Experiments - Ablations
     Figure 5: FCIS+++ [26] (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects.
     Table 2 (ablations for Mask R-CNN; trained on trainval35k, tested on minival, mask AP unless otherwise noted):
     (a) Backbone architecture (AP / AP50 / AP75): ResNet-50-C4 30.3/51.2/31.5; ResNet-101-C4 32.7/54.2/34.3; ResNet-50-FPN 33.6/55.2/35.3; ResNet-101-FPN 35.4/57.3/37.5; ResNeXt-101-FPN 36.7/59.5/38.9. Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.
     (b) Multinomial vs. independent masks (ResNet-50-C4): softmax 24.8/44.1/25.1 vs. sigmoid 30.3/51.2/31.5 (+5.5/+7.1/+6.4). Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).
     (c) RoIAlign (ResNet-50-C4), with columns align? / bilinear? / aggregation (AP / AP50 / AP75): RoIPool [12] (max) 26.9/48.8/26.4; RoIWarp [10] (bilinear, max) 27.2/49.2/27.1; RoIWarp (bilinear, ave) 27.1/48.9/27.1; RoIAlign (align, bilinear, max) 30.2/51.0/31.8; RoIAlign (align, bilinear, ave) 30.3/51.2/31.5. RoIAlign improves AP by ~3 points and AP75 by ~5 points; using proper alignment is the only factor that contributes to the large gap between RoI layers.
     (d) RoIAlign (ResNet-50-C5, stride 32), mask and box AP (AP / AP50 / AP75 / APbb / APbb50 / APbb75): RoIPool 23.6/46.5/21.6/28.2/52.7/26.9; RoIAlign 30.9/51.8/32.1/34.0/55.3/36.4 (+7.3/+5.3/+10.5/+5.8/+2.6/+9.5). Misalignments are more severe with large-stride features than with stride-16 features (Table 2c), resulting in massive accuracy gaps.
     (e) Mask branch (ResNet-50-FPN): MLP fc 1024→1024→80·28² gives 31.5/53.7/32.8; MLP fc 1024→1024→1024→80·28² gives 31.5/54.0/32.6; FCN conv 256→256→256→256→256→80 gives 33.6/55.2/35.3. FCNs improve results as they take advantage of explicitly encoding spatial layout.
     Discussion: Mask R-CNN decouples mask and class prediction; as the existing box branch predicts the class label, we generate a mask for each class without competition among classes (per-pixel sigmoid and a binary loss). Coupling the tasks with a per-pixel softmax and a multinomial loss (as commonly used in FCN [29]) results in a severe loss of accuracy. RoIAlign is insensitive to max vs. average pooling; average is used in the rest of the paper. RoIWarp from MNC [10] also adopts bilinear sampling but still quantizes the RoI, losing alignment with the input; it performs on par with RoIPool and much worse than RoIAlign (Table 2c), demonstrating that proper alignment is crucial.
  18. 5. Experiments - Instance segmentation
     Table 1 (instance segmentation mask AP on COCO test-dev):
       method                          backbone                 AP    AP50  AP75  APS   APM   APL
       MNC [10]                        ResNet-101-C4            24.6  44.3  24.8  4.7   25.9  43.6
       FCIS [26] +OHEM                 ResNet-101-C5-dilated    29.2  49.5  -     7.1   31.3  50.0
       FCIS+++ [26] +OHEM              ResNet-101-C5-dilated    33.6  54.5  -     -     -     -
       Mask R-CNN                      ResNet-101-C4            33.1  54.9  34.8  12.1  35.6  51.1
       Mask R-CNN                      ResNet-101-FPN           35.7  58.0  37.8  15.5  38.1  52.4
       Mask R-CNN                      ResNeXt-101-FPN          37.1  60.0  39.4  16.9  39.9  53.5
     MNC [10] and FCIS [26] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM [35]. All entries are single-model results; AP is evaluated using mask IoU.
     Figure 4 shows more Mask R-CNN results on COCO test images (ResNet-101-FPN, 5 fps, 35.7 mask AP), and Figure 5 contrasts FCIS+++ [26] (top) with Mask R-CNN (bottom, ResNet-101-FPN); FCIS exhibits systematic artifacts on overlapping objects.
  19. 5. Experiments - Object detection
     Table 3 (object detection single-model results, bounding-box AP, vs. state of the art on test-dev):
       method                          backbone                   APbb  APbb50 APbb75 APbbS APbbM APbbL
       Faster R-CNN+++ [19]            ResNet-101-C4              34.9  55.7   37.4   15.6  38.7  50.9
       Faster R-CNN w FPN [27]         ResNet-101-FPN             36.2  59.1   39.0   18.2  39.0  48.2
       Faster R-CNN by G-RMI [21]      Inception-ResNet-v2 [37]   34.7  55.5   36.7   13.5  38.1  52.0
       Faster R-CNN w TDM [36]         Inception-ResNet-v2-TDM    36.8  57.7   39.2   16.2  39.8  52.1
       Faster R-CNN, RoIAlign          ResNet-101-FPN             37.3  59.6   40.3   19.8  40.2  48.8
       Mask R-CNN                      ResNet-101-FPN             38.2  60.3   41.7   20.1  41.1  50.2
       Mask R-CNN                      ResNeXt-101-FPN            39.8  62.3   43.4   22.1  43.2  51.2
     For these results the full Mask R-CNN model is trained, but only the classification and box outputs are used at inference (the mask output is ignored). Mask R-CNN with ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models, including the single-model variant of G-RMI [21], the winner of the COCO 2016 Detection Challenge. The gains over [27] come from using RoIAlign (+1.1 APbb), multitask training (+0.9 APbb), and ResNeXt-101 (+1.6 APbb).
     Mask branch: segmentation is a pixel-to-pixel task, and the spatial layout of masks is exploited by using an FCN. In Table 2e, FCNs give a 2.1 mask AP gain over MLPs with a ResNet-50-FPN backbone (chosen so that the conv layers of the FCN head are not pre-trained, for a fair comparison with MLP).
     Timing: the C4 variant takes ~400 ms as it has a heavier box head (Figure 3), so it is not recommended in practice. Although Mask R-CNN is fast, the design is not optimized for speed; better speed/accuracy trade-offs could be achieved, e.g. by varying image sizes and proposal numbers. Training with ResNet-50-FPN on COCO trainval35k takes 32 hours in a synchronized 8-GPU implementation (0.72 s per 16-image mini-batch), and 44 hours with ResNet-101-FPN. The paper also extends Mask R-CNN to human pose estimation (its Section 5).
  20. Reference paper
     • R. Girshick. Fast R-CNN. In ICCV, 2015.
     • S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
     • M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
     • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
     • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
     • T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
     • J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
     • Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
     • R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.
