DeconvNet, DecoupledNet,
TransferNet in Image Segmentation
NamHyuk Ahn @ Ajou Univ.
2016. 05. 11
Contents
- Semantic Segmentation
- Deconvolution Network for Supervised Learning
- Decoupled Network for Semi-Supervised Learning
- Transfer Learning in Semantic Segmentation
Semantic Segmentation
Semantic Segmentation
- Predict a class label for every pixel in the image
[Shotton et al. 2007]
Datasets
- PASCAL VOC
• 20 classes
• 12K training / 1K test images
- MS COCO
• 91 classes
• 120K training / 40K test images
Deconvolution Network for
Supervised Learning
Problems of FCN
- FCN handles only a single semantic scale, since its receptive field has a fixed size
- The label map is so small that detailed object structures tend to be lost
DeconvNet
- To address these issues, they use “deconvolution”
- The convolution network extracts features (VGG-16 net)
- The deconvolution network generates a probability map (same size as the input image)
- The probability map gives the probability that each pixel belongs to each class
Deconvolution Network
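- Unpooling
• Reconstruct the structure of the original activation map by placing each activation back at its recorded max-pooling location
• The original activation size is restored, but the map is still sparse
- Deconvolution
• Densify the sparse, enlarged activation map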
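The unpooling step above can be sketched in a few lines. This is a minimal illustration in plain Python lists, assuming 2x2 max pooling with stride 2. The function names `max_pool_2x2` and `unpool` are hypothetical and not taken from the DeconvNet code.

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2; also records where each max came from."""
    pooled, switches = [], []
    for i in range(0, len(x), 2):
        pooled_row, switch_row = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in range(2) for dj in range(2)]
            value, position = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            switch_row.append(position)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, height, width):
    """Place each pooled value back at its recorded location; the rest stays zero."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (i, j) in zip(pooled_row, switch_row):
            out[i][j] = value
    return out


x = [[1, 2, 0, 3],
     [5, 4, 1, 2],
     [0, 1, 6, 7],
     [2, 3, 9, 8]]
pooled, switches = max_pool_2x2(x)
restored = unpool(pooled, switches, 4, 4)
```

Here `restored` has the original 4x4 size but only four non-zero entries, which is the sparsity that the deconvolution layers are then meant to densify.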
Analysis of DeconvNet
- DeconvNet is better at segmentation since it produces a dense, enlarged pixel-wise map
- Shallow layers tend to capture the overall structure of an object (shape, region, position), while deep layers capture more complicated patterns
- Unpooling captures example-specific structure, so it can reconstruct object details at higher resolution
- Deconvolution captures class-specific shapes, so activations closely related to the target class are amplified and noisy activations are suppressed
Analysis of DeconvNet
More details of DeconvNet
- Instance-wise segmentation
- Use batch normalization in both networks
- Two-stage training
- Ensemble with FCN
• FCN and DeconvNet are complementary
• The ensemble gives the best result
Instance-wise Segmentation
- Feed object proposals (not the entire image) to the network
- Generate the proposals with the EdgeBox algorithm
- Proposals at multiple scales help identify finer object details
- Reduces the search space, which reduces memory use during training
Two-stage Training
- DeconvNet has many parameters, but segmentation data is scarce (10K in PASCAL VOC)
• Use two-stage training to address this issue
• First stage: input center-cropped images
• Second stage: input proposal sub-images
- As a result, the network generalizes better
Result
- 2nd best among methods trained only on PASCAL VOC
- Note: the paper reports a mean IoU of 72.5, but the presentation slides report 74.8
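For readers unfamiliar with the metric, mean IoU can be sketched as below. This is a simplified version computed on a single pair of label maps; the benchmark itself accumulates counts over the whole test set, and the function name `mean_iou` is only illustrative.

```python
def mean_iou(pred, target, num_classes):
    """Average over classes of |pred ∩ target| / |pred ∪ target|."""
    ious = []
    for c in range(num_classes):
        intersection = union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                intersection += int(p == c and t == c)
                union += int(p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious)


pred = [[0, 0, 1],
        [0, 1, 1]]
target = [[0, 1, 1],
          [0, 1, 1]]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case class 0 has IoU 2/3 and class 1 has IoU 3/4, so `score` is their average.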
Qualitative Example
Recap
- Produces dense, precise segmentation masks through coarse-to-fine reconstruction
- With instance-wise segmentation, it can handle object scale variation
- But it has many parameters (almost 2x VGG-16), so an additional training stage is needed
Decoupled Network for
Semi-Supervised Learning
Motivation
- Creating segmentation ground truth is expensive, so use semi-supervised learning
- Utilize many image-level annotations and few pixel-level annotations
- Modify DeconvNet
- With few pixel-level annotations (25 per class), achieves a good result (62.5 mean IoU)
Main idea
- Semantic segmentation can be decomposed into multi-label classification and binary segmentation
[Figure: semantic segmentation = multi-label classification (Person, Bottle) + binary segmentation per label]
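The decomposition can be illustrated on a toy label map. This is only a sketch of the idea, not the network itself: the class ids 1 and 2 and the helper name `decompose` are hypothetical.

```python
def decompose(label_map, background=0):
    """Split a semantic label map into (present labels, one binary mask per label)."""
    labels = sorted({v for row in label_map for v in row} - {background})
    masks = {label: [[int(v == label) for v in row] for row in label_map]
             for label in labels}
    return labels, masks


# Toy 3x3 label map: 0 = background, 1 and 2 are two hypothetical object classes.
label_map = [[0, 1, 1],
             [0, 1, 2],
             [0, 0, 2]]
labels, masks = decompose(label_map)
```

`labels` plays the role of the multi-label classification output, and each entry of `masks` is the binary segmentation target for one label.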
Overview
- Classification network for multi-label classification
- Segmentation network for binary segmentation
- Bridging layers for delivering class-specific
information to segmentation network
Architecture
- Classification Network (same as VGG-16)
- Segmentation Network
• Takes the class-specific activation map from the bridging layers and performs binary segmentation (main difference from DeconvNet)
• Binary segmentation reduces the number of parameters, so the network can be trained with few pixel-wise annotations
Architecture
- Bridging Layers
• The segmentation network needs class-specific and spatial information to produce a class-specific segmentation mask
• Get spatial information from pool5 in the classification network
• The pool5 output has useful information for shape generation, but it mixes information from all relevant labels → need to identify class-specific activations
• Build a saliency map to identify class-specific activations
Architecture
- Saliency Map
1. Compute the score vector, then set its gradient (dscore) to 0 everywhere except 1 at the index of the label to track
2. Backpropagate to an arbitrary layer (pool5 in this paper)
- The saliency map gives class-specific information for each label (class)
Qualitative example of saliency map [Simonyan et al. 2014]
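The two steps of the saliency computation can be sketched on a tiny stand-in network. The sketch assumes one ReLU layer followed by a linear score layer on top of a feature vector `f` that stands in for pool5. The weights `W1`, `W2` and all function names are made up for illustration and do not come from the paper.

```python
def matvec(W, x):
    """Compute W x for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matTvec(W, g):
    """Compute W^T g, i.e. backpropagate g through a linear layer."""
    return [sum(W[r][c] * g[r] for r in range(len(W))) for c in range(len(W[0]))]


def class_saliency(f, W1, W2, label):
    # Forward pass: features -> ReLU hidden layer -> class scores.
    pre = matvec(W1, f)
    hidden = [max(0.0, p) for p in pre]
    scores = matvec(W2, hidden)
    # Step 1: gradient on the scores is one-hot at the label to track.
    dscore = [1.0 if k == label else 0.0 for k in range(len(scores))]
    # Step 2: backpropagate down to the feature layer.
    dhidden = matTvec(W2, dscore)
    dpre = [g if p > 0 else 0.0 for g, p in zip(dhidden, pre)]  # ReLU gate
    return matTvec(W1, dpre)


f = [1.0, 2.0, 0.5]
W1 = [[1, 0, -1], [-1, 1, 0]]
W2 = [[1, 0], [0, 2]]
saliency_0 = class_saliency(f, W1, W2, label=0)
saliency_1 = class_saliency(f, W1, W2, label=1)
```

The two labels give different gradients on the same features, which is the class-specific signal the bridging layers rely on.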
Architecture
- Bridging Layers
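• Combine the spatial activation map (pool5) and the class-specific saliency map to produce the class-specific activation map g
• The combination passes through fc layers and is fed to the segmentation network
• g has both spatial and class-specific information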
Inference
- Compute a segmentation map M for each identified label
- Aggregate the per-label segmentation maps pixel-wise
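One plausible way to do the pixel-wise aggregation is sketched below. It assumes each identified label comes with a foreground probability map and that a pixel takes the label with the highest foreground probability. The background threshold of 0.5 and the name `aggregate` are assumptions for this sketch, not details given in the slides.

```python
def aggregate(fg_maps, threshold=0.5, background=0):
    """Per pixel, pick the label with the highest foreground probability.

    fg_maps: dict mapping label id -> 2D list of foreground probabilities.
    Pixels where no label exceeds `threshold` are set to `background`
    (an assumption made for this sketch).
    """
    labels = list(fg_maps)
    height = len(fg_maps[labels[0]])
    width = len(fg_maps[labels[0]][0])
    result = []
    for i in range(height):
        row = []
        for j in range(width):
            best = max(labels, key=lambda label: fg_maps[label][i][j])
            row.append(best if fg_maps[best][i][j] > threshold else background)
        result.append(row)
    return result


# Two hypothetical identified labels (ids 5 and 15) on a 2x2 image.
fg_maps = {15: [[0.9, 0.2], [0.6, 0.1]],
           5: [[0.3, 0.8], [0.7, 0.2]]}
segmentation = aggregate(fg_maps)
```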
Training
- Train the classification network with many image-level annotations
- Train the segmentation network and bridging layers with few pixel-level annotations
Result
Qualitative Example
Recap
- Utilize many image-level annotations and few pixel-level annotations
- Add bridging layers to DeconvNet and use binary segmentation to reduce parameters
- Bridging layers output both spatial and class-specific information for each class (label)
- Train the two networks separately (decoupled)
• Worse performance under full supervision, where joint optimization is preferable
- With few strongly annotated examples (25 per class), achieves a good result (62.5 mean IoU)
Transfer Learning in Semantic
Segmentation
Motivation
- Pre-train a network and run inference on a new dataset (e.g. train on MS COCO, run inference on PASCAL VOC)
- This idea doesn’t work well with DecoupledNet
• DecoupledNet is trained with class-specific input, so it cannot generalize to new classes
• Train the network with class-independent input!
Overview
- The attention model identifies the salient region of each class associated with the input image
• The output of the attention model carries location information for each class on a coarse feature map
- The encoder extracts features; the decoder generates a dense foreground segmentation mask for each attended region
- Training stage
• Fix the (pre-trained) encoder and train the decoder and attention model using pixel-level annotations from the source domain
• Train the attention model using image-level annotations from both domains
- After training, the decoder has been trained on the source domain and the attention model on both domains, so the attention is adapted to the target domain
Overview
- Decoupling the encoder and decoder makes it possible to share shape-generation information among different classes
- The attention model provides
• Predictions for localization
• Class-specific information → enables adapting the decoder to the target domain
- With the attention model, the network obtains information that is transferable across domains and provides a useful segmentation prior
Architecture
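- Encoder
• Extract the feature descriptor as $A \in \mathbb{R}^{M \times D}$; A is obtained from the last conv layer to retain spatial information
• M and D are the number of hidden units per channel (20x20) and the number of channels, respectively
- Attention model
• Train a weight vector $\alpha^l \in \mathbb{R}^{M}$, where $\alpha^l_i$ represents the relevance of location i to class l
• Formally, $\alpha^l_i = \exp(v^l_i) / \sum_{j} \exp(v^l_j)$, where $v^l$ is the attention score computed from A and label l
• An extra factorization technique [R. Memisevic 2013] is used to reduce the number of parameters
Architecture
- Attention model
• To apply attention to this model, it has to be trainable in both domains
• Add additional layers on top of the attention model, and train both the attention model and the added layers under a classification objective
• Finally, $z^l = A^\top \alpha^l$, where z represents the class-specific feature
• z can be optimized using weak annotations from both domains
• Example of attention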
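The attention step can be sketched as a softmax over locations followed by a weighted sum of encoder features. This is an illustrative reading of the slide: the scores `v` are simply given as input here (how they are computed from A and the label is not shown), and the names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(v):
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def attend(A, v):
    """A: M x D encoder features (list of M rows); v: M attention scores.

    Returns (alpha, z): alpha is the attention over locations,
    z is the attention-weighted sum of the rows of A (length D).
    """
    alpha = softmax(v)
    D = len(A[0])
    z = [sum(alpha[i] * A[i][d] for i in range(len(A))) for d in range(D)]
    return alpha, z


# Toy example with M = 3 locations and D = 2 channels.
A = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]
alpha_uniform, z_uniform = attend(A, [0.0, 0.0, 0.0])
alpha_peaked, z_peaked = attend(A, [10.0, 0.0, 0.0])
```

With equal scores, z is the mean feature. With one dominant score, the softmax concentrates almost all weight on a single location and z approaches that location's feature.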
Architecture
- Decoder
• The output of the attention model is sparse due to the softmax, so it may lose information needed for shape generation
• Feed A as an additional input and multiply it with z → densified attention
• With the densified attention, optimize the segmentation loss; the procedure is the same as DecoupledNet, but the decoder is optimized only on the source domain
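A small sketch of why multiplying A with z densifies the attention follows. It assumes z is the attention-weighted sum of the rows of A and that the densified map is the product of A with z. The toy values and the names `weighted_feature` and `densify` are made up for illustration.

```python
def weighted_feature(A, alpha):
    """z = sum_i alpha_i * A[i], a length-D class-specific feature."""
    D = len(A[0])
    return [sum(alpha[i] * A[i][d] for i in range(len(A))) for d in range(D)]


def densify(A, z):
    """One score per location: the dot product of that location's feature with z."""
    return [sum(a * b for a, b in zip(row, z)) for row in A]


# Three locations; locations 0 and 1 have similar features, location 2 differs.
A = [[1.0, 0.0],
     [0.8, 0.1],
     [0.0, 1.0]]
alpha = [1.0, 0.0, 0.0]  # extreme case: attention on a single location
z = weighted_feature(A, alpha)
dense = densify(A, z)
```

Although `alpha` is non-zero at only one location, `dense` also responds at location 1 because its feature resembles the attended one, which gives the decoder more shape information to work with.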
Analysis of TransferNet
- The decoder generates a foreground segmentation from the attention for each label
- By decoupling classification (a domain-specific task), it can capture class-independent information for shape generation and apply it to unseen classes
- The attention model is trained with image-level annotations in addition to pixel-level ones, so it can handle unseen classes
• In DecoupledNet, the bridging layers are trained only on pixel-level data

Train / Inference
- During training, optimize a joint objective: a classification loss on both domains plus a segmentation loss on the source domain
• Training with class labels alone works well, but jointly training with segmentation labels regularizes noise
• After training, remove the classification layers, since they are required only during training to learn attention from the target domain
- Inference
1. Iteratively obtain the attention and segmentation mask for each label
2. Aggregate the masks (same as DecoupledNet)
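The joint training objective can be sketched as a weighted sum of two losses. The use of binary cross-entropy for both terms, the weight `lam`, and the function names are assumptions made for this illustration; the slides do not spell out the exact form.

```python
import math


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and target y in {0, 1}."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def joint_loss(cls_examples, seg_examples, lam=1.0):
    """Classification loss on both domains plus lam * segmentation loss on source.

    cls_examples: list of (predicted class probabilities, 0/1 class labels),
                  drawn from source and target images.
    seg_examples: list of (predicted foreground probabilities, 0/1 mask pixels),
                  drawn from source images only.
    """
    cls_loss = sum(bce(p, y)
                   for probs, labels in cls_examples
                   for p, y in zip(probs, labels))
    seg_loss = sum(bce(p, y)
                   for probs, mask in seg_examples
                   for p, y in zip(probs, mask))
    return cls_loss + lam * seg_loss


cls_examples = [([0.5, 0.5], [1, 0])]   # one image, two classes
seg_examples = [([0.5, 0.5], [1, 0])]   # one source mask with two pixels
total = joint_loss(cls_examples, seg_examples, lam=1.0)
cls_only = joint_loss(cls_examples, seg_examples, lam=0.0)
```

Setting `lam=0.0` corresponds to training from class labels alone, the setting that the first sub-bullet says benefits from adding the segmentation term.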
Result
Qualitative Example
Reference
- Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. “Learning
deconvolution network for semantic segmentation.” Proceedings of the
IEEE International Conference on Computer Vision. 2015.
- Seunghoon Hong, Hyeonwoo Noh, and Bohyung Han. “Decoupled deep
neural network for semi-supervised semantic segmentation.” Advances in
Neural Information Processing Systems. 2015.
- Seunghoon Hong, et al. “Learning Transferrable Knowledge for Semantic
Segmentation with Deep Convolutional Neural Network.” arXiv preprint
arXiv:1512.07928 (2015).
- Hyeonwoo Noh. “Semantic Segmentation and Visual Question Answering”
(https://drive.google.com/file/d/0B5xl2L77gZfVRXZxQWNmSGlBemc/view)
