Deep Learning for Computer Vision – Techniques
for Semantic Segmentation
Saurabh Jha
Agenda
• Deep Learning in medical imaging, opportunities and types of data
• Semantic Segmentation – Evolution
• Architecture – FCN, U-Net
• Exploring Medical Images
• Skin Lesion detection
• Challenges in Medical Images
Deep Learning in medical imaging:
There is a lot of hype.
"AI in medicine: Rise of the machines" (Forbes, 2017)
MIT Technology Review
"They should stop training radiologists now"
– Geoffrey Hinton (godfather of deep learning), 2017
"To the question, will AI replace radiologists, I say the answer is no…"
"…but radiologists who do AI will replace radiologists who don't"
– Curtis Langlotz, 2017
Opportunities – Deep Learning for Medical Imaging
Value proposition by level of diagnostic support:
• Image Acquisition and Reconstruction – automatic scan planning, accelerated imaging
• Image Enhancement – super-resolution
• Semantic Image Segmentation – organ segmentation
• Quantification of Imaging Biomarkers
• Computer Aided Interpretation
• Computer Aided Diagnosis – tumour quantification, screening
What is Medical Imaging?
MR
CT
X-ray
Ultrasound
Semantic Segmentation
• Semantic segmentation is understanding an image at the pixel level, i.e., we want to assign each pixel in the image an object class.
• There are many different approaches for estimating the semantic segmentation of an image. The most common methods are based on autoencoder (AE)-style architectures such as FCN and U-Net.
Evolution: Semantic Segmentation
Pre-CNN era: segmentation as clustering; segmentation as a graph problem
Pre-FCN era: patch-based methods were used to overcome the small-data problem, with limited success (Ciresan, Dan, et al., 2012)
Deep Convolutional Nets for Segmentation
Need to reason about individual pixels!
Success factors?
• Wide receptive field – great
• Spatial invariance – not good: need to preserve spatial information!
Want both a wide receptive field and high spatial resolution.
[Figure: a 4 × 4 feature response map reduced by 2 × 2 window-based max pooling to a 2 × 2 "code with magnitude" that keeps only the strongest activation in each window.]
• CNNs use filters (also known as kernels) to detect which features, such as edges, are present throughout an image. A filter is simply a matrix of values, called weights, that are trained to detect specific features. The filter slides over each part of the image to check whether the feature it is meant to detect is present. To produce a value representing how strongly a specific feature is present, the filter carries out a convolution operation: an element-wise product and sum between two matrices.
Introducing Convolution and Max Pooling Operation
Convolution
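A minimal NumPy sketch of these two operations; the helper names and the toy edge-detection example are illustrative, not taken from the slides:

import numpy as np

def conv2d_valid(image, kernel):
    # "Valid" 2-D convolution (cross-correlation, as in most CNN libraries):
    # slide the kernel over the image and take the element-wise product and sum.
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool2x2(feature_map):
    # 2 x 2 window-based max pooling: keep only the strongest activation per window.
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Toy example: a vertical-edge filter responds strongly where the dark/bright edge is.
image = np.array([[0., 0., 9., 9.],
                  [0., 0., 9., 9.],
                  [0., 0., 9., 9.],
                  [0., 0., 9., 9.]])
edge_filter = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])
response = conv2d_valid(image, edge_filter)   # 4 x 4 -> 2 x 2 feature response map
code = maxpool2x2(response)                   # 2 x 2 -> 1 x 1 pooled code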
Convolutional Auto-Encoder
Encoder-Decoder architecture
Ranzato et al., CVPR 2007
3 major innovations in network architecture:
• Removal of fully connected layers
• Deconvolution (transposed convolution)
• Skip paths
• Downsampling path: captures semantic/contextual information
• Upsampling path: recovers spatial information
Fully Convolutional Networks for Semantic Segmentation
Removal of fully connected layers:
• Dense output whose size scales with the input
• Replace fully connected layers with 1 × 1 convolutions to transform feature maps into class-wise predictions
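A hedged PyTorch sketch of this idea; the channel count (512) and number of classes (21) are illustrative assumptions:

import torch
import torch.nn as nn

features = torch.randn(1, 512, 16, 16)         # feature maps from a convolutional backbone

# Classification head: flattening + a fully connected layer fixes the input size
# and collapses all spatial information into one prediction per image.
fc_head = nn.Linear(512 * 16 * 16, 21)
image_logits = fc_head(features.flatten(1))     # shape (1, 21)

# Fully convolutional head: a 1 x 1 convolution keeps the spatial grid, giving
# class-wise predictions whose size scales with the input.
conv_head = nn.Conv2d(512, 21, kernel_size=1)
pixel_logits = conv_head(features)              # shape (1, 21, 16, 16)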
Upsampling – Unpooling
• Nearest-neighbour unpooling: input 2 × 2 → output 4 × 4, each value copied into a 2 × 2 block
• Max pooling (input 4 × 4 → output 2 × 2) paired with max unpooling (input 2 × 2 → output 4 × 4)
• Corresponding pairs of downsampling and upsampling layers: unpooling places each value back at the position remembered from its paired max-pooling layer
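A small PyTorch sketch of both unpooling variants; the toy feature-map values are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Nearest-neighbour upsampling: each value is simply copied into a 2 x 2 block.
small = torch.tensor([[[[1., 2.],
                        [3., 4.]]]])
nn_up = F.interpolate(small, scale_factor=2, mode="nearest")    # 2 x 2 -> 4 x 4

# Max pooling / max unpooling as a corresponding pair: pooling records *where* each
# maximum came from, and unpooling puts the values back at those positions.
x = torch.tensor([[[[2., 3., 0., 0.],
                    [5., 0., 0., 0.],
                    [0., 0., 0., 8.],
                    [0., 0., 0., 0.]]]])
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
pooled, indices = pool(x)                       # 4 x 4 -> 2 x 2
restored = unpool(pooled, indices)              # 2 x 2 -> 4 x 4, sparse but position-correct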
Transposed Convolution
Stride 1 and no padding: to upsample again, just pad the original input (blue entries in the figure) with zeros (white entries) and convolve with a learned "deconv" filter.

Downsampling (convolution, stride 1, no padding):
Data (4 × 4):
 1  2  3  4
 6  7  8  9
11 12 13 14
16 17 18 19
Filter (3 × 3):
 0.1  0.2  0.3
 0.2  0.5  0.4
-0.1  0.3  0.1
Result (2 × 2):
13.1 15.1
23.1 25.1

Upsampling – Transposed Convolution:
Padded Result (6 × 6):
   0    0    0    0    0    0
   0    0    0    0    0    0
   0    0 13.1 15.1    0    0
   0    0 23.1 25.1    0    0
   0    0    0    0    0    0
   0    0    0    0    0    0
With the deconv filter initialised to all zeros, the upsampled Result (4 × 4) is all zeros, so the reconstruction Error equals the original Data:
 1  2  3  4
 6  7  8  9
11 12 13 14
16 17 18 19
Learn the deconv filter after a few epochs of SGD:
Deconv Filter (3 × 3):
0.7687  0.00  0.678
0.0953  0.029 0.092
0.2948 -0.02  0.208
Result (4 × 4):
 2.73  2.89  3.57  4.45
 6.01  6.53  8     8.839
11    13.2  13    14
15.7  17    17.8  19.3
Error (4 × 4):
-1.73  -0.89  -0.57  -0.45
-0.01   0.467  0.002  0.161
 0     -1.2    0.009  0
 0.348 -0.01   0.238 -0.3
Total error: 6.373
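A hedged NumPy sketch that reproduces the numbers above; conv2d_valid is a hypothetical helper (the same one used in the earlier convolution sketch), and the deconv-filter values are simply the ones shown on the slide:

import numpy as np

def conv2d_valid(image, kernel):
    # Plain "valid" convolution: element-wise product and sum at every position.
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

data = np.array([[1., 2., 3., 4.],
                 [6., 7., 8., 9.],
                 [11., 12., 13., 14.],
                 [16., 17., 18., 19.]])
filt = np.array([[0.1, 0.2, 0.3],
                 [0.2, 0.5, 0.4],
                 [-0.1, 0.3, 0.1]])

# Downsampling: 4 x 4 -> 2 x 2.
result = conv2d_valid(data, filt)                  # [[13.1, 15.1], [23.1, 25.1]]

# Upsampling (transposed convolution, stride 1): zero-pad the 2 x 2 result back to
# 6 x 6 and convolve with the learned deconv filter to get a 4 x 4 reconstruction.
deconv_filter = np.array([[0.7687, 0.00, 0.678],
                          [0.0953, 0.029, 0.092],
                          [0.2948, -0.02, 0.208]])
reconstruction = conv2d_valid(np.pad(result, 2), deconv_filter)
total_error = np.abs(data - reconstruction).sum()  # roughly the 6.37 reported above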
Skip Layer model in detail
Skip Path:
• Concatenate low-level features with high-level features to handle multi-scale objects
• Provides options for different output sizes
• With skip connections we obtain much finer pixel-level output
• Upsampling is used to resolve the size mismatch between different layers, and the feature maps are combined by a simple element-wise sum
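A hedged PyTorch sketch of an FCN-style skip path; the spatial sizes and class count are illustrative assumptions:

import torch
import torch.nn.functional as F

num_classes = 21
coarse_scores = torch.randn(1, num_classes, 8, 8)    # class scores from a deep, coarse layer
fine_scores = torch.randn(1, num_classes, 16, 16)    # class scores from a shallower, finer layer

# Upsample the coarse prediction to resolve the size mismatch, then combine the two
# score maps with a simple element-wise sum to get a finer, multi-scale-aware output.
upsampled = F.interpolate(coarse_scores, size=fine_scores.shape[-2:],
                          mode="bilinear", align_corners=False)
fused_scores = upsampled + fine_scores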
• The U-Net architecture is essentially an encoder-decoder architecture: a deep-learning framework based on FCNs that comprises two parts:
• A contracting path, similar to an encoder, which captures context in a compact feature map.
• A symmetric expanding path, similar to a decoder, which allows precise localisation. This step is done to retain boundary (spatial) information despite the downsampling and max pooling performed in the encoder stage.
Advantages of Using U-Net
1. Computationally efficient
2. Trainable with a small data set
3. Trained end-to-end
4. Preferable for biomedical applications
U-Net Architecture
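A minimal U-Net-style sketch in PyTorch; the depth, channel widths, and input size are illustrative assumptions rather than the original configuration:

import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, num_classes=2):
        super().__init__()
        # Contracting path (encoder): capture context in compact feature maps.
        self.enc1 = double_conv(in_ch, 32)
        self.enc2 = double_conv(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(64, 128)
        # Expanding path (decoder): transposed convolutions recover spatial resolution.
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = double_conv(128, 64)      # 128 = 64 (upsampled) + 64 (skip)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = double_conv(64, 32)       # 64 = 32 (upsampled) + 32 (skip)
        self.head = nn.Conv2d(32, num_classes, 1)   # 1 x 1 conv -> class-wise predictions

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        # Skip connections concatenate encoder features to retain boundary information.
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                  # per-pixel class scores

logits = TinyUNet()(torch.randn(1, 1, 64, 64))   # -> shape (1, 2, 64, 64)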
Axial view
Exploring Medical Images
Skin Lesion Detection
Challenges in Medical Imaging
• Access to large datasets requires partnering with clinical institutions
• Annotations are very expensive (medical experts required)
• Transfer Learning
- Natural images are extremely different from medical images
- What to do in the case of 3D data? Harder to train; requires more data
• Data Augmentation
• Noisy labels – agreement between radiologists is low in many cases
• Data variability – different machine vendors, different scanning protocols, demographic factors
• GPU memory limitations – forces the use of small batch sizes
• Imbalanced data – the class balance is severely skewed towards the non-object class
- The majority of non-object samples are easy to discriminate; lesions are challenging to discriminate
- Use a dedicated loss function such as the Dice loss (sketched below)
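A hedged PyTorch sketch of a soft Dice loss for binary segmentation; the function and toy tensors are illustrative, not the exact loss used in the talk:

import torch

def dice_loss(logits, target, eps=1e-6):
    # Soft Dice loss (binary case): directly optimises overlap between prediction and
    # ground truth, so it is far less sensitive to a heavily skewed class balance
    # than plain per-pixel cross-entropy.
    probs = torch.sigmoid(logits)                        # per-pixel foreground probability
    intersection = (probs * target).sum(dim=(-2, -1))
    union = probs.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    dice = (2 * intersection + eps) / (union + eps)      # 1.0 means perfect overlap
    return 1 - dice.mean()

# Toy usage: random logits against a sparse lesion mask (~5% foreground pixels).
logits = torch.randn(2, 1, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.95).float()
loss = dice_loss(logits, mask)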
