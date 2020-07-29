
  1. 1. REVIEW: IMAGE SEGMENTATION BY DEEP LEARNING Trong-An Bui trongan93@gmail.com buitrongan.com ViP Lab – National Chi Nan University Advisor: Prof. Pei-Jun Lee pjlee@ncnu.edu.tw
  2. 2. DEFINITION COMPARISON
     • Image Classification: Recognize the class of the object within an image.
     • Object Detection: Classify and locate the object(s) within an image, drawing a bounding box around each one. That means we also need to know the class, position, and size of each object.
     • Semantic Segmentation: Classify every pixel within an image by object class. That means there is a label for each pixel.
  3. 3. OVERVIEW Main target: Image segmentation by a convolutional neural network
  4. 4. ABSTRACT
     DeepLab: In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed “DeepLab” system sets …
     SegNet: We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN and also with the well known DeepLab-LargeFOV and DeconvNet architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and …
     Fully Convolutional Networks: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional …
  5. 5. OUTLINE 1. Main architecture comparison 2. Fully Convolutional Networks (Pub: April 2017) 3. SegNet (Pub: December 2017) 4. DeepLab (Pub: April 2018) 5. Results Comparison: FCN, SegNet, DeepLab
  6. 6. MAIN ARCHITECTURE COMPARISON
     Fully Convolutional Networks: To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation, as shown in Fig. 1. In-network upsampling layers enable pixelwise prediction and …
     SegNet: The encoder network in SegNet is topologically identical to the convolutional layers in VGG16. Each decoder uses the max-pooling indices received from its corresponding encoder to perform non-linear upsampling of its input feature maps.
     DeepLab: The figure on this slide shows the DeepLab model architecture. First, the input image goes through the network, which uses atrous convolution and ASPP. Then the network output is bilinearly interpolated and passed through the fully connected CRF to refine the result and produce the final output.
  7. 7. OUTLINE 1. Main architecture comparison 2. Fully Convolutional Networks (Pub: April 2017) 3. SegNet (Pub: December 2017) 4. DeepLab (Pub: April 2018) 5. Results Comparison: FCN, SegNet, DeepLab
  8. 8. FULLY CONVOLUTIONAL NETWORKS 1. Image classification (Review)
     CNN Architecture: Types of Layers
     Convolutional Neural Networks have several types of layers:
     • Convolutional layer: a “filter” passes over the image, scanning a few pixels at a time, and creates a feature map that records how strongly the filter's feature responds at each location.
     • Pooling layer (downsampling): reduces the amount of information in each feature map obtained from the convolutional layer while keeping the most important information (there are usually several rounds of convolution and pooling).
     • Fully connected input layer (flatten): takes the output of the previous layers, “flattens” it, and turns it into a single vector that can be the input for the next stage.
     • The first fully connected layer: takes the inputs from the feature analysis and applies weights to predict the correct label.
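The layer types listed above can be sketched in a few lines of plain Python. This is only an illustrative toy, not code from any of the reviewed papers: the function names `conv2d_valid`, `max_pool_2x2`, and `flatten`, the 5×5 image, and the edge filter are all assumptions made for the example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool_2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]) - 1, 2)]
        for i in range(0, len(fmap) - 1, 2)
    ]


def flatten(fmap):
    """Turn a 2-D feature map into a single vector for a fully connected layer."""
    return [value for row in fmap for value in row]


# Hypothetical 5x5 image containing a bright vertical stripe.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
edge_kernel = [[-1, 1], [-1, 1]]  # responds to a dark-to-bright vertical edge

features = conv2d_valid(image, edge_kernel)  # 4x4 feature map
pooled = max_pool_2x2(features)              # 2x2 after downsampling
vector = flatten(pooled)                     # input to a fully connected layer
```

The feature map is positive where the stripe begins and negative where it ends, and pooling keeps the strongest response in each window.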
  9. 9. FULLY CONVOLUTIONAL NETWORKS 1. From Image Classification to Semantic Segmentation
     In conventional classification, the input image is downsized, passes through the convolutional layers and the fully connected (FC) layers, and yields one predicted label for the whole image.
     [Figure: Classification]
     Now imagine the FC layers are turned into 1×1 convolutional layers.
     [Figure: All layers are convolutional layers]
     If the image is then not downsized, the output is no longer a single label. Instead, the output is a map that is smaller than the input image (because of the max pooling).
     [Figure: All layers are convolutional layers]
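To make the FC-to-1×1-convolution step concrete, here is a minimal sketch in plain Python. The helper names `fully_connected` and `conv1x1`, and the tiny weights, are hypothetical and chosen only for illustration. The point is that applying the same FC weights at every spatial position is exactly a 1×1 convolution, so a larger input yields a map of scores instead of a single score vector.

```python
def fully_connected(vector, weights, bias):
    """weights[k] holds the weights of output unit (class) k."""
    return [sum(w * v for w, v in zip(weights[k], vector)) + bias[k]
            for k in range(len(weights))]


def conv1x1(feature_maps, weights, bias):
    """Apply the FC weights independently at every position.

    feature_maps[c][i][j] is channel c at row i, column j.
    Returns scores[k][i][j], one score map per class k.
    """
    channels = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [[sum(weights[k][c] * feature_maps[c][i][j] for c in range(channels)) + bias[k]
          for j in range(width)]
         for i in range(height)]
        for k in range(len(weights))
    ]


weights = [[1, -1], [0.5, 2]]  # 2 classes, 2 input channels (illustrative values)
bias = [0, 1]

# A 1x1 input reproduces the FC layer: one score per class.
single = conv1x1([[[3]], [[1]]], weights, bias)

# A wider input (1x3 here) gives a score map instead of a single label.
feature_maps = [[[3, 0, 1]], [[1, 2, 1]]]
score_map = conv1x1(feature_maps, weights, bias)
```

Because no layer now depends on a fixed input size, the same network accepts images of arbitrary size.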
  10. 10. FULLY CONVOLUTIONAL NETWORKS 1. From Image Classification to Semantic Segmentation
      If we upsample the output above, we obtain the pixelwise output (label map).
      [Figure: Upsampling at the last step]
      [Figure: Feature Map / Filter Number Along Layers]
  11. 11. FULLY CONVOLUTIONAL NETWORKS 2. Upsampling via Deconvolution
      Convolution usually makes the output smaller than the input. The name “deconvolution” comes from wanting the opposite: upsampling to make the output larger. (The name is often misread as the inverse process of convolution, which it is not.) The operation is also called up-convolution or transposed convolution, and fractionally strided convolution when a fractional stride is used.
      Another way to connect coarse outputs to dense pixels is interpolation. For instance, simple bilinear interpolation computes each output yij from the nearest four inputs by a linear map that depends only on the relative positions of the input and output cells.
      In a sense, upsampling with factor f is convolution with a fractional input stride of 1/f. So long as f is integral, it’s natural to implement upsampling through “backward convolution” by reversing the forward and backward passes of more typical input-strided convolution. Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss.
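The link between interpolation and transposed convolution can be seen in a 1-D sketch. This is an illustrative toy under stated assumptions: the function name `transposed_conv1d`, the fixed kernel `[0.5, 1.0, 0.5]`, and the input values are not from the slides. In an FCN the kernel would be a learnable parameter, possibly initialized to interpolation weights like these.

```python
def transposed_conv1d(x, kernel, stride):
    """Each input value scatters a scaled copy of `kernel` into the output,
    with consecutive copies placed `stride` samples apart."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for k, tap in enumerate(kernel):
            out[i * stride + k] += value * tap
    return out


linear_kernel = [0.5, 1.0, 0.5]  # fixed linear-interpolation weights for factor 2
coarse = [1.0, 2.0, 4.0]

upsampled = transposed_conv1d(coarse, linear_kernel, stride=2)
cropped = upsampled[1:-1]  # drop the two border samples that see only one input
```

After cropping, the original samples 1, 2, 4 reappear at every second position and the values between them are their midpoints, which is linear interpolation by a factor of 2.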
  12. 12. FULLY CONVOLUTIONAL NETWORKS 2. Upsampling via Deconvolution
      Visualization with a Deconvnet (https://arxiv.org/pdf/1311.2901.pdf)
      To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1 (top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.
      Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1 (bottom) for an illustration of the procedure.
      Rectification: The convnet uses relu non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.
      Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.
  13. 13. FULLY CONVOLUTIONAL NETWORKS 3. Fusing the Output
  14. 14. FULLY CONVOLUTIONAL NETWORKS 3. Fusing the Output
      After conv7 the output is small, so 32× upsampling is applied to give the output the same size as the input image. This also makes the output label map coarse. This variant is called FCN-32s.
      The reason is that deeper layers yield deeper features but lose spatial location information, while the outputs of shallower layers retain more location information. Combining both improves the result. To combine them, we fuse the outputs by element-wise addition.
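A minimal sketch of the fusion step, assuming the FCN-16s arrangement from the FCN paper (the coarse conv7 scores are upsampled 2× and added to scores predicted from pool4). Nearest-neighbour upsampling is used here only as a simple stand-in for the in-network upsampling layer, and the names `upsample2x_nearest`, `fuse`, and the score values are hypothetical.

```python
def upsample2x_nearest(score_map):
    """Repeat every value twice along both axes (stand-in for learned 2x upsampling)."""
    out = []
    for row in score_map:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(a, b):
    """Element-wise addition of two score maps with identical shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Scores for one class (illustrative values).
conv7_scores = [[1.0, -1.0]]              # deep, coarse: 1x2
pool4_scores = [[0.5, 0.0, 0.0, -0.5],
                [0.0, 0.5, -0.5, 0.0]]    # shallower, finer: 2x4

fused = fuse(upsample2x_nearest(conv7_scores), pool4_scores)
```

The fused map keeps the coarse decision of the deep layer (left positive, right negative) while the shallower scores adjust it pixel by pixel. The fused map would then be upsampled the rest of the way to the input resolution.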
  15. 15. OUTLINE 1. Main architecture comparison 2. Fully Convolutional Networks (FCN) (Pub: April 2017) 3. SegNet (Pub: December 2017) 4. DeepLab (Pub: April 2018) 5. Results Comparison: FCN, SegNet, DeepLab
  16. 16. SEGNET Main topics: 1. Encoder-Decoder Architecture 2. Differences from DeconvNet and U-Net 3. Results
  17. 17. SEGNET 1. Encoder Decoder Architecture SegNet has an encoder network and a corresponding decoder network, followed by a final pixelwise classification layer.
  18. 18. SEGNET 1. Encoder Decoder Architecture
      1.1. Encoder
      • At the encoder, convolutions and max pooling are performed.
      • There are 13 convolutional layers from VGG-16. (The original fully connected layers are discarded.)
      • During 2×2 max pooling, the corresponding max-pooling indices (locations) are stored.
      1.2. Decoder
      • At the decoder, upsampling and convolutions are performed.
      • During upsampling, the max-pooling indices from the corresponding encoder layer are recalled to place the values.
      • Finally, a K-class softmax classifier predicts the class of each pixel.
      [Figure: Upsampling Using Max-Pooling Indices]
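The index-based upsampling described above can be sketched in plain Python. The function names `max_pool_with_indices` and `unpool` and the 4×4 feature map are assumptions made for this example; the sketch also assumes even height and width.

```python
def max_pool_with_indices(fmap):
    """2x2 max pooling that also records where each maximum came from."""
    pooled, indices = [], []
    for i in range(0, len(fmap), 2):
        pooled_row, index_row = [], []
        for j in range(0, len(fmap[0]), 2):
            window = [(fmap[i + a][j + b], (i + a, j + b))
                      for a in range(2) for b in range(2)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def unpool(pooled, indices, height, width):
    """Place each pooled value back at its recorded position; all else stays zero."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (i, j) in zip(pooled_row, index_row):
            out[i][j] = value
    return out


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [6, 2, 3, 1]]

pooled, indices = max_pool_with_indices(fmap)   # encoder side
sparse = unpool(pooled, indices, 4, 4)          # decoder side
```

The unpooled map is sparse, which is why the decoder follows it with trainable convolutions to produce dense feature maps. Only the indices need to be stored, not the full encoder feature maps.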
  19. 19. SEGNET 2. Comparison with DeconvNet and U-Net
      DeconvNet and U-Net have structures similar to SegNet.
      2.1. Differences from DeconvNet (https://arxiv.org/abs/1505.04366)
      • A similar upsampling approach, called unpooling, is used.
      • However, DeconvNet has fully connected layers, which make the model larger.
      2.2. Differences from U-Net (https://arxiv.org/abs/1505.04597)
      • U-Net is used for biomedical image segmentation.
      • Instead of pooling indices, entire feature maps are transferred from the encoder to the decoder and concatenated there before convolution.
      • This makes the model larger and requires more memory.
  20. 20. OUTLINE 1. Main architecture comparison 2. Fully Convolutional Networks (Pub: April 2017) 3. SegNet (Pub: December 2017) 4. DeepLab (Pub: April 2018) 5. Results Comparison: FCN, SegNet, DeepLab
  21. 21. DEEPLAB
      The figure on this slide shows the DeepLab model architecture. First, the input image goes through the network, which uses atrous convolution and ASPP. Then the network output is bilinearly interpolated and passed through the fully connected CRF to refine the result and produce the final output.
      1. Atrous Convolution
      2. Atrous Spatial Pyramid Pooling (ASPP)
      3. Fully Connected Conditional Random Field (CRF)
  22. 22. DEEPLAB 1. Atrous Convolution
      The term “atrous” comes from the French “à trous”, meaning “with holes”. The technique is therefore also called “algorithme à trous” or the “hole algorithm”, and some papers call it “dilated convolution”. It is commonly used in the wavelet transform and is now applied to convolutions in deep learning.
      • When r = 1, it is the standard convolution.
      • When r > 1, it is atrous convolution, where the rate r is the stride with which the input signal is sampled during convolution.
      [Figure: Illustration of atrous convolution in 1-D. (a) Sparse feature extraction with standard convolution on a low resolution input feature map. (b) Dense feature extraction with atrous convolution with rate r = 2, applied on a high resolution input feature map.]
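The role of the rate r can be sketched with a short 1-D example, using the usual definition y[i] = Σk x[i + r·k] · w[k]. The function name `atrous_conv1d`, the signal, and the filter are illustrative assumptions, and outputs are computed only where the filter fits entirely inside the input.

```python
def atrous_conv1d(x, w, rate=1):
    """y[i] = sum_k x[i + rate * k] * w[k], for positions where the filter fits."""
    span = rate * (len(w) - 1)  # distance between the first and last sampled input
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


x = [0, 1, 4, 9, 16, 25, 36]
w = [1, 2, 1]

standard = atrous_conv1d(x, w, rate=1)  # taps touch x[i], x[i+1], x[i+2]
atrous = atrous_conv1d(x, w, rate=2)    # taps touch x[i], x[i+2], x[i+4]
```

With r = 2 the same three weights cover five input samples, so the field of view grows while the number of parameters stays at three.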
  23. 23. DEEPLAB 1. Atrous Convolution
      We illustrate the algorithm’s operation in 2-D through a simple example in Fig. 3: Given an image, we assume that we first have a downsampling operation that reduces the resolution by a factor of 2, and then perform a convolution with a kernel (here, the vertical Gaussian derivative). If one implants the resulting feature map in the original image coordinates, we realize that we have obtained responses at only 1/4 of the image positions. Instead, we can compute responses at all image positions if we convolve the full resolution image with a filter ‘with holes’, in which we upsample the original filter by a factor of 2 and introduce zeros in between filter values. Although the effective filter size increases, we only need to take into account the non-zero filter values, hence both the number of filter parameters and the number of operations per position stay constant. The resulting scheme allows us to easily and explicitly control the spatial resolution of neural network feature responses.
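The same argument can be checked numerically in 1-D, where downsampling by 2 leaves responses at 1/2 of the positions (rather than 1/4 as in 2-D). This is a sketch under assumptions: the helper names `conv1d` and `insert_holes`, the signal, and the filter are chosen for illustration only.

```python
def conv1d(x, w):
    """Plain sliding-window filtering without padding."""
    return [sum(x[i + k] * w[k] for k in range(len(w)))
            for i in range(len(x) - len(w) + 1)]


def insert_holes(w, rate):
    """Upsample the filter by `rate`, inserting rate - 1 zeros between taps."""
    holed = []
    for k, tap in enumerate(w):
        holed.append(tap)
        if k < len(w) - 1:
            holed.extend([0] * (rate - 1))
    return holed


x = [0, 1, 4, 9, 16, 25, 36, 49]
w = [1, 2, 1]

coarse = conv1d(x[::2], w)             # downsample by 2, then filter
holed_filter = insert_holes(w, 2)      # [1, 0, 2, 0, 1]
dense = conv1d(x, holed_filter)        # full resolution, filter 'with holes'
nonzero_taps = sum(1 for tap in holed_filter if tap != 0)
```

Every second value of `dense` equals the corresponding value of `coarse`, and the values in between are the extra responses that the downsampled path never computes. The holed filter is longer, but its number of non-zero taps is unchanged.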
  24. 24. DEEPLAB 2. Spatial Pyramid Pooling (SPP)
      Conventionally, at the transition between the conv layers and the FC layers there is a single pooling layer or no pooling layer at all. SPPNet instead proposes multiple pooling layers at different scales. The figure on this slide uses a 3-level SPP. Suppose the conv5 layer has 256 feature maps. Then at the SPP layer:
      1. First, each feature map is pooled to one value (grey), forming a 256-d vector.
      2. Then, each feature map is pooled to 4 values (green), forming a 4×256-d vector.
      3. Similarly, each feature map is pooled to 16 values (blue), forming a 16×256-d vector.
      4. The three vectors are concatenated into a single 1-D vector of (1 + 4 + 16) × 256 = 5376 values.
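The four steps above can be sketched as follows. This is a simplified illustration, not the SPPNet implementation: max pooling is assumed, the map size is assumed to be divisible by every grid size, and the names `pool_to_grid` and `spatial_pyramid_pool` as well as the synthetic 8×8 feature maps are hypothetical.

```python
def pool_to_grid(fmap, bins):
    """Max-pool one 2-D feature map into a bins x bins grid, returned as a flat list.
    Assumes the height and width are divisible by `bins`."""
    step_h, step_w = len(fmap) // bins, len(fmap[0]) // bins
    return [
        max(fmap[i][j]
            for i in range(bi * step_h, (bi + 1) * step_h)
            for j in range(bj * step_w, (bj + 1) * step_w))
        for bi in range(bins) for bj in range(bins)
    ]


def spatial_pyramid_pool(feature_maps, levels=(1, 2, 4)):
    """Pool every feature map at every pyramid level and concatenate the results."""
    vector = []
    for bins in levels:
        for fmap in feature_maps:
            vector.extend(pool_to_grid(fmap, bins))
    return vector


small = [[1, 3, 2, 0],
         [4, 2, 1, 5],
         [0, 1, 7, 2],
         [6, 2, 3, 1]]
level_1 = pool_to_grid(small, 1)  # one value per map
level_2 = pool_to_grid(small, 2)  # four values per map

# 256 synthetic 8x8 feature maps standing in for conv5.
conv5 = [[[c + 8 * i + j for j in range(8)] for i in range(8)] for c in range(256)]
spp_vector = spatial_pyramid_pool(conv5)
```

The length of `spp_vector` depends only on the number of feature maps and pyramid levels, not on the spatial size of the input, which is what lets a fixed-size FC layer follow it.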
  25. 25. DEEPLAB 2. Atrous Spatial Pyramid Pooling (ASPP)
      ASPP is an atrous version of SPP, the concept introduced in SPPNet. In ASPP, parallel atrous convolutions with different rates are applied to the input feature map, and their outputs are fused together. Because objects of the same class can appear at different scales in the image, ASPP helps account for different object scales, which can improve accuracy.
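A 1-D sketch of the idea follows. It rests on several assumptions for illustration: the branches are fused by summation, each branch uses zero padding so all outputs have the input's length, and the names `atrous_conv1d_same` and `aspp`, the rates, and the filters are hypothetical rather than the values used in DeepLab.

```python
def atrous_conv1d_same(x, w, rate):
    """Atrous convolution centred on each position, with zero padding at the borders,
    so the output has the same length as the input."""
    n, half = len(x), len(w) // 2
    out = []
    for i in range(n):
        total = 0
        for k in range(len(w)):
            index = i + rate * (k - half)
            if 0 <= index < n:  # samples outside the signal count as zero
                total += x[index] * w[k]
        out.append(total)
    return out


def aspp(x, branches):
    """branches: list of (filter, rate). Branch responses are fused by summation."""
    responses = [atrous_conv1d_same(x, w, rate) for w, rate in branches]
    return [sum(values) for values in zip(*responses)]


impulse = [0, 0, 0, 1, 0, 0, 0]
narrow = atrous_conv1d_same(impulse, [1, 1, 1], rate=1)
wide = atrous_conv1d_same(impulse, [1, 1, 1], rate=2)
fused = aspp(impulse, [([1, 1, 1], 1), ([1, 1, 1], 2)])
```

The branch with the larger rate reacts to the impulse from further away, so the fused output combines a small and a large field of view at every position.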
  26. 26. DEEPLAB 3. Fully Connected Conditional Random Field (CRF)
      A fully connected CRF is applied to the network output after bilinear interpolation.
      • x is the label assignment for the pixels, and P(xi) is the label assignment probability at pixel i. The first term, θi, is therefore the negative log probability, θi(xi) = −log P(xi).
      • The second term, θij, acts as a filter over pairs of pixels, with µ = 1 when xi ≠ xj and µ = 0 when xi = xj.
      • The bracket contains a weighted sum of two kernels. The first kernel depends on both the pixel value difference and the pixel position difference, which makes it a kind of bilateral filter; bilateral filters preserve edges. The second kernel depends only on the pixel position difference, which makes it a Gaussian filter.
      • The parameters σ and w are found by cross-validation. The number of iterations is 10.
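For reference, the energy that the terms above refer to can be written out as follows. This is a reconstruction in the notation of the DeepLab paper, since the equation itself is not part of the slide text; the symbols p_i (pixel position), I_i (pixel colour), and σ_α, σ_β, σ_γ follow that paper.

```latex
E(\mathbf{x}) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j),
\qquad \theta_i(x_i) = -\log P(x_i)

\theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \left[
    w_1 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2}
                     -\frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2} \right)
  + w_2 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2} \right)
\right]
```

The first exponential is the bilateral (position and colour) kernel and the second is the position-only Gaussian kernel; µ switches the pairwise penalty on only when the two pixels carry different labels.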
  27. 27. COMPARE DEEPLAB WITH CNN ARCHITECTURE
  28. 28. SOFTMAX (OUTPUT) COMPARISON
      Top: score map (input to the softmax function). Bottom: belief map (output of the softmax function).
      After 10 iterations of the CRF, the small areas with different colors around the aeroplane are smoothed out successfully.
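The relation between the score map and the belief map can be sketched for a single pixel. The function name `softmax` and the three class scores are illustrative assumptions; in practice the function is applied to the score vector of every pixel.

```python
import math


def softmax(scores):
    """Turn one pixel's class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


pixel_scores = [2.0, 1.0, 0.1]          # hypothetical scores for three classes
pixel_beliefs = softmax(pixel_scores)
predicted_class = pixel_beliefs.index(max(pixel_beliefs))
```

Softmax preserves the ranking of the classes, so the predicted label is the same before and after; what changes is that the belief map can be read as per-pixel probabilities, which is the form the CRF unary term uses.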
  29. 29. VERIFICATION RESULTS
      The IoU is calculated for each class at the pixel level as IoU = TP / (TP + FP + FN), expressed as a percentage.
      [Figure: The different sets of pixels that make up the Intersection over Union. From left to right: the original image; the expected pixels for the dog class; the predicted pixels for the dog class; the difference image illustrating the three Intersection over Union sets. Yellow: True Positives, Green: False Negatives, Red: False Positives.]
      The IoU is a value between zero and 100, where a larger value indicates a more accurate segmentation. The mIoU is then the mean value across all the classes in the dataset. The state of the art on the VOC2012 dataset achieves an mIoU of 86.9; however, the 21 classes in the VOC2012 dataset are not limited to furniture items and as such are unsuitable for the use here at DigitalBridge. Subsequently we wish to adapt a state-of-the-art model to our needs.
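A minimal sketch of the per-class IoU and the mIoU on a toy label map, using IoU = TP / (TP + FP + FN) scaled to the 0 to 100 range. The function names `class_iou` and `mean_iou` and the 2×3 label maps are assumptions for illustration, and the mean assumes every listed class occurs in the expected or predicted map.

```python
def class_iou(expected, predicted, cls):
    """IoU (0-100) for one class, counted over all pixels."""
    tp = fp = fn = 0
    for expected_row, predicted_row in zip(expected, predicted):
        for e, p in zip(expected_row, predicted_row):
            if p == cls and e == cls:
                tp += 1      # predicted and expected
            elif p == cls:
                fp += 1      # predicted but not expected
            elif e == cls:
                fn += 1      # expected but missed
    union = tp + fp + fn
    return 100.0 * tp / union if union else float("nan")


def mean_iou(expected, predicted, classes):
    """Mean of the per-class IoU values."""
    values = [class_iou(expected, predicted, cls) for cls in classes]
    return sum(values) / len(values)


expected = [[0, 1, 1],
            [0, 1, 1]]
predicted = [[0, 0, 1],
             [1, 1, 1]]

iou_class_1 = class_iou(expected, predicted, 1)  # TP=3, FP=1, FN=1
iou_class_0 = class_iou(expected, predicted, 0)  # TP=1, FP=1, FN=1
miou = mean_iou(expected, predicted, classes=[0, 1])
```

Note that a false positive for one class is a false negative for another, so a single mislabelled pixel lowers the IoU of two classes.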
  30. 30. SIMULATION RESULTS
      DeepLab-LargeFOV (left, i.e. only a single atrous convolution) and DeepLab-ASPP (right, i.e. ASPP)
      • The simplest ResNet-101: 68.72%
      • MSC: Multi-scale input
      • COCO: Models pretrained on the COCO dataset
      • Aug: Data augmentation by randomly scaling the input images (from 0.5 to 1.5)
      • LargeFOV: DeepLab using single-pass atrous convolution
      • ASPP: DeepLab using parallel atrous convolutions
      • CRF: Fully connected CRF as post-processing
      With all of these, the final result is 77.69%. MSC, COCO, and Aug together account for the improvement from 68.72% to 74.87%, so they are as essential as LargeFOV, ASPP, and CRF.
  31. 31. SIMULATION RESULTS PASCAL VOC 2012 Test Set (leftmost), PASCAL-Context (2nd from left), PASCAL-Person-Part (2nd from right), Cityscapes (rightmost)
  32. 32. QUALITATIVE RESULTS
  33. 33. QUALITATIVE RESULTS
  34. 34. FAIL EXAMPLE DeepLab also has failure cases, for example on bikes and chairs, which consist of multiple thin parts such as bike frames and chair legs.
  35. 35. OUTLINE 1. Main architecture comparison 2. Fully Convolutional Networks (Pub: April 2017) 3. SegNet (Pub: December 2017) 4. DeepLab (Pub: April 2018) 5. Results Comparison: FCN, SegNet, DeepLab
  36. 36. RESULTS COMPARISON
  37. 37. OPEN SOURCE
      Suggested open-source implementations with TensorFlow:
      - FCN: https://github.com/shekkizh/FCN.tensorflow
      - SegNet: https://github.com/toimcio/SegNet-tensorflow
      - DeepLab: https://github.com/tensorflow/models/tree/master/research/deeplab
      Thank you for your attention!

