U-Net is a convolutional neural network (CNN) architecture designed for semantic segmentation tasks, especially in the field of medical image analysis. It was introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015. The name "U-Net" comes from its U-shaped architecture.
Key features of the U-Net architecture:
U-Shaped Design: U-Net consists of a contracting path (downsampling) and an expansive path (upsampling). The architecture resembles the letter "U" when visualized.
Contracting Path (Encoder):
The contracting path involves a series of convolutional and pooling layers.
Each convolutional layer is followed by a rectified linear unit (ReLU) activation; some variants also insert normalization layers such as batch normalization.
Pooling layers (usually max pooling) reduce spatial dimensions, capturing high-level features.
Expansive Path (Decoder):
The expansive path involves a series of upsampling and convolutional layers.
Upsampling is achieved using transposed convolution (also known as deconvolution or convolutional transpose).
Skip connections are established between corresponding layers in the contracting and expansive paths. These connections help retain fine-grained spatial information during the upsampling process.
Skip Connections:
Skip connections concatenate feature maps from the contracting path to the corresponding layers in the expansive path.
These connections facilitate the fusion of low-level and high-level features, aiding in precise localization.
Final Layer:
The final layer typically uses a convolutional layer with a softmax activation function for multi-class segmentation tasks, providing probability scores for each class.
U-Net's architecture and skip connections help address the challenge of segmenting objects with varying sizes and shapes, which is often encountered in medical image analysis. Its success in this domain has led to its application in other areas of computer vision as well.
The U-Net architecture has also been extended and modified in various ways, leading to improvements like the U-Net++ architecture and variations with attention mechanisms, which further enhance the segmentation performance.
U-Net's intuitive design and effectiveness in semantic segmentation tasks have made it a cornerstone in the field of medical image analysis and an influential architecture for researchers working on segmentation challenges.
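The contracting/expansive interplay described above can be sketched as shape bookkeeping with NumPy arrays standing in for feature maps of shape (channels, height, width). The channel counts and spatial sizes are illustrative choices, strided slicing stands in for max pooling, and nearest-neighbour repetition stands in for transposed convolution:

```python
import numpy as np

# One U-Net level, with arrays as stand-in feature maps of shape (C, H, W).
encoder_features = np.zeros((64, 128, 128))

# contracting path: 2x2 max pooling halves the spatial dimensions
# (strided slicing is a stand-in for the pooling op)
pooled = encoder_features[:, ::2, ::2]
assert pooled.shape == (64, 64, 64)

# ...deeper layers compute richer features at this lower resolution...
decoder_features = np.zeros((64, 64, 64))

# expansive path: upsample back (nearest-neighbour repeat stands in for
# transposed convolution), then concatenate the skip connection so that
# fine-grained encoder detail is fused with coarse decoder context
upsampled = decoder_features.repeat(2, axis=1).repeat(2, axis=2)
fused = np.concatenate([encoder_features, upsampled], axis=0)
assert fused.shape == (128, 128, 128)  # channels doubled by the skip concat
```

The concatenation along the channel axis is exactly the skip connection: subsequent convolutions see both the upsampled context and the original high-resolution features.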
1. Ben-Gurion University of the Negev, Deep Learning Image Processing 2018
Eliya Ben Avraham & Laialy Darwesh
U-Net: Convolutional Networks for Biomedical Image Segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox
University of Freiburg, Germany
https://arxiv.org/pdf/1505.04597.pdf
2. Topics
Introduction
Motivation
Previous work
U-NET architecture
U-NET Training
Data Augmentation
Experiments
Extending U-NET
Conclusion
3. Convolutional Neural Networks (CNN)
Introduction
https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/convolutional_neural_networks.html
The smaller number of connections and weights makes convolutional layers relatively cheap (vs. fully connected layers) in terms of the memory and compute power needed.
Convolutional networks make the assumption of locality, and hence are more effective on image data.
5. Introduction
https://www.saagie.com/blog/object-detection-part1
The typical use of convolutional networks is on classification tasks, where the output for an image is a single class label.
In many visual tasks, however, especially in biomedical image processing, the desired output should include localization: a class label is supposed to be assigned to each pixel.
9. First Task
Ciresan, D.C., Gambardella, L.M., Giusti, A., Schmidhuber, J.: Deep neural networks segment neuronal membranes in electron microscopy images. In: NIPS, pp. 2852-2860 (2012)
Predict the class label of each pixel
Stacks of electron microscopy (EM) images
EM segmentation challenge at ISBI 2012
30 training images
Training stack and ground truth: black = neuron membranes, white = cells
10. Second Task
ISBI 2015: separation of touching objects of the same class
Light microscopic images (recorded by phase contrast microscopy)
Part of the ISBI cell tracking challenge 2014 and 2015
Raw image (HeLa cells), generated segmentation mask (white: foreground, black: background), and ground truth segmentation
12. Previous work
Ciresan, D.C., Gambardella, L.M., Giusti, A., Schmidhuber, J.: Deep neural networks segment neuronal membranes in electron microscopy images. In: NIPS, pp. 2852-2860 (2012)
The winner of ISBI 2012 (Ciresan et al.)
Trained a network in a sliding-window setup, predicting each pixel's label from a local region (patch) around it
This network can localize
The training data in terms of patches is much larger than the number of training images
x Slow, because the network must be run separately for each patch
x There is a lot of redundancy due to overlapping patches
13. Previous work
Ciresan, D.C., Gambardella, L.M., Giusti, A., Schmidhuber, J.: Deep neural networks segment neuronal membranes in electron microscopy images. In: NIPS, pp. 2852-2860 (2012)
The winner (ISBI 2012)
Trade-off between localization accuracy and the use of context:
Larger patches require more max-pooling layers → reduce the localization accuracy
Small patches allow the network to see only little context
We want good localization and the use of context at the same time
15. U-NET Architecture
http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html
Input: image tile
Output: segmentation map (here 2 classes: background and foreground)
Contracting path: increase the "what", reduce the "where"
Expansive path: create a high-resolution segmentation map
Convolution output size = (W − F + 2P)/S + 1, where
W - input volume size
F - receptive field size (filter size)
P - zero padding used on the border
S - stride
Output size (first conv) = (572 − 3 + 2·0)/1 + 1 = 570 → 570 × 570
Output size (second conv) = (570 − 3 + 2·0)/1 + 1 = 568 → 568 × 568
Concatenation with high-resolution features from the contracting path
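The output-size arithmetic above is easy to check in code; a minimal sketch (the helper name `conv_output_size` is ours):

```python
def conv_output_size(W, F, P=0, S=1):
    """Spatial output size of a convolution: (W - F + 2P)/S + 1."""
    return (W - F + 2 * P) // S + 1

# the first two unpadded 3x3 convolutions of the contracting path
print(conv_output_size(572, 3))            # 570
print(conv_output_size(570, 3))            # 568
# a 2x2 max pooling with stride 2 then halves the spatial size
print(conv_output_size(568, 2, P=0, S=2))  # 284
```

The same formula applied level by level explains why U-Net's output map (388 × 388) is smaller than its input tile (572 × 572): every unpadded convolution trims two pixels per dimension.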
16. U-NET Strategy
Overlap-tile strategy for seamless segmentation of arbitrarily large images
Segmentation of the yellow area uses input data of the blue area
Missing input data is extrapolated by mirroring the raw image
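The mirroring step can be reproduced with NumPy's reflect padding; the 4×4 image and 2-pixel margin below are toy values for illustration (the paper mirrors a much larger border, since its network consumes 92 pixels of context per side):

```python
import numpy as np

image = np.arange(16).reshape(4, 4)
margin = 2  # extra border context the network consumes (toy value)
tile = np.pad(image, margin, mode="reflect")  # extrapolate by mirroring

# the tile is larger than the image, and its border rows/columns are
# mirror images of the rows/columns just inside the image
assert tile.shape == (8, 8)
assert (tile[0, margin:-margin] == image[2]).all()
```

Mirroring gives the border pixels plausible surrounding context, which matters for cell images where structures continue past the tile edge.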
17. U-net Training
Soft-max: p_k(x) = exp(a_k(x)) / Σ_{k'=1}^{K} exp(a_{k'}(x))
Cross-entropy loss function: E = − Σ_{x∈Ω} w(x) log(p_{ℓ(x)}(x))
k - feature channel
a_k(x) - the activation in feature channel k at pixel position x
ℓ(x) - the true label of pixel x
w(x) - a weight map assigning each pixel an importance in training
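A direct NumPy transcription of the soft-max and weighted cross-entropy above (the function names are ours; a tiny 2-class, 2×2 image is used as input):

```python
import numpy as np

def pixel_softmax(a):
    """p_k(x): soft-max over the channel axis; a has shape (K, H, W)."""
    e = np.exp(a - a.max(axis=0, keepdims=True))  # shift for stability
    return e / e.sum(axis=0, keepdims=True)

def weighted_cross_entropy(a, labels, weights):
    """E = -sum_x w(x) * log(p_{l(x)}(x)) over all pixel positions x."""
    p = pixel_softmax(a)
    h, w = labels.shape
    rows, cols = np.arange(h)[:, None], np.arange(w)[None, :]
    p_true = p[labels, rows, cols]  # pick p_{l(x)}(x) at every pixel
    return float(-(weights * np.log(p_true)).sum())

# toy example: 2 classes, 2x2 image, channel 1 is the true class everywhere
a = np.zeros((2, 2, 2)); a[1] = 3.0
labels = np.ones((2, 2), dtype=int)
weights = np.ones((2, 2))
loss = weighted_cross_entropy(a, labels, weights)
# per pixel p_1 = e^3 / (1 + e^3) ~ 0.953, so E ~ -4 * log(0.953) ~ 0.194
```

With uniform weights this is ordinary pixel-wise cross-entropy; the weight map of the next slide is what makes pixels near cell borders count more.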
18. U-net Training
Pixel-wise loss weight: forces the network to learn the small separation borders between touching cells.
w(x) = w_c(x) + w_0 · exp(−(d_1(x) + d_2(x))² / (2σ²))
w_c(x) - weight map to balance the class frequencies
w_0 = 10, σ ≈ 5 pixels
d_1 / d_2 - distance to the border of the nearest cell / second nearest cell
Colors: different instances
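A toy NumPy version of this weight map (the function name is ours, brute-force nearest-pixel distances stand in for distances to cell borders, and w_c is simplified to a constant 1, since the paper does not give its exact form):

```python
import numpy as np

def separation_weight(cells, w0=10.0, sigma=5.0):
    """w(x) = w_c(x) + w0 * exp(-(d1(x) + d2(x))**2 / (2 * sigma**2)).

    cells: list of boolean HxW masks, one per cell instance.
    d1/d2: distance to the nearest / second-nearest cell (brute force,
    fine for toy sizes). w_c is simplified to a constant 1 here.
    """
    h, w = cells[0].shape
    ys, xs = np.nonzero(np.ones((h, w), dtype=bool))  # every pixel coordinate
    dists = []
    for m in cells:
        cy, cx = np.nonzero(m)  # pixels belonging to this cell
        d = np.sqrt((ys[:, None] - cy) ** 2 + (xs[:, None] - cx) ** 2).min(axis=1)
        dists.append(d.reshape(h, w))
    d1, d2 = np.sort(np.stack(dists), axis=0)[:2]
    wc = 1.0
    wmap = wc + w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))
    background = ~np.any(np.stack(cells), axis=0)
    return np.where(background, wmap, wc)  # the emphasis lies between cells

# two cells separated by a one-pixel background gap (column 3)
a = np.zeros((5, 7), dtype=bool); a[:, :3] = True
b = np.zeros((5, 7), dtype=bool); b[:, 4:] = True
w = separation_weight([a, b])
# gap pixels get a weight near 1 + 10*exp(-4/50) ~ 10.2; cell pixels get 1
```

The gap between the two cells receives roughly ten times the loss weight of any other pixel, which is exactly the pressure that teaches the network to draw thin separation borders.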
20. U-net Training
Weight initialization
A good initialization of the weights is extremely important.
Ideally the initial weights should be adapted such that each feature map in the network has approximately unit variance.
Achieved by drawing the initial weights from a Gaussian distribution with standard deviation σ_w:
requiring 1 = Var(Σ_{i=1}^{N} X_i W_i) gives σ_w = √(1/N)
For ReLU layers (a ReLU unit is zero for non-positive inputs, which halves the variance): σ_w = √(2/N)
Example: 3×3 convolution and 64 feature channels in the previous layer → N = 3 · 3 · 64 = 576
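A quick NumPy check of the √(2/N) rule for the slide's example layer (the sampling sizes below are arbitrary; the point is that ReLU-fed pre-activations keep a second moment near 1):

```python
import numpy as np

# He initialization: sigma_w = sqrt(2/N), N = incoming connections per unit.
# Slide example: 3x3 convolution with 64 feature channels in the previous layer.
N = 3 * 3 * 64                      # = 576
sigma_w = np.sqrt(2.0 / N)          # ~ 0.059

# Empirical check: feed ReLU'd unit-variance activations through weights
# drawn with this sigma; the output's second moment stays near 1, so the
# signal neither explodes nor vanishes as depth grows.
rng = np.random.default_rng(0)
second_moments = []
for _ in range(200):
    W = rng.normal(0.0, sigma_w, size=N)
    X = np.maximum(rng.normal(size=(500, N)), 0.0)  # ReLU activations
    second_moments.append(np.mean((X @ W) ** 2))
print(np.mean(second_moments))  # close to 1.0
```

With σ_w = √(1/N) instead, each ReLU layer would roughly halve the signal's variance, which compounds badly in a network as deep as U-Net.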
21. Experiments: First task
EM segmentation challenge (since ISBI 2012)
The results of U-Net are better than those of the sliding-window convolutional network, which had been the best method from 2012 until 2015.
Raw image and ground truth shown side by side.
23. Extending U-NET Architecture
Application scenarios for volumetric segmentation with the 3D U-Net
https://arxiv.org/abs/1606.06650
Semi-automated segmentation: the user annotates some slices of each volume to be segmented; the network predicts the dense segmentation
Fully automated segmentation: trained with annotated slices, run on non-annotated volumes
24. Extending U-NET Architecture
Application scenarios for volumetric segmentation with the 3D U-Net (Jun 2016)
https://arxiv.org/abs/1606.06650
Voxel size of 1.76 × 1.76 × 2.04 µm³
3 × 3 × 3 convolutions, 2 × 2 × 2 max pooling, upconvolution of 2 × 2 × 2
Batch normalization ("BN") before each ReLU
Input: 132 × 132 × 116 voxel tile
Output: 44 × 44 × 28 voxels
25. Extending the previous U-Net
Unsupervised Pre-training for Fully Convolutional Neural Networks (2016)
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7813160
Additional reconstruction layer (shifted sigmoid)
L_S is the softmax loss (standard cross-entropy loss averaged over all pixels)
L_R is the reconstruction loss (standard mean squared error)
K = 50 was found to be sufficient to ensure pre-training convergence
26. Summary and Conclusion
U-net advantages
Flexible and can be used for almost any reasonable image masking task
High accuracy (given proper training, dataset, and training time)
Doesn't contain any fully connected layers
Faster than the sliding-window approach (about 1 second per image)
Proven to be a very powerful segmentation tool in scenarios with limited data
Succeeds in achieving very good performance on different biomedical segmentation applications
U-net disadvantages
Larger images need more GPU memory
Takes a significant amount of time to train (relatively many layers)
Pre-trained models are not widely available (it's too task-specific)
In the last layer there are 2 channels (one for background and one for foreground)
Left: the training stack (one slice shown). Right: corresponding ground truth; black lines denote neuron membranes. Note the complexity of the image appearance.
Fig. 3. HeLa cells on glass recorded with DIC (differential interference contrast) microscopy. (a) raw image. (b) overlay with ground truth segmentation. Different colors indicate different instances of the HeLa cells. (c) generated segmentation mask (white: foreground, black: background). (d) map with a pixel-wise loss weight to force the network to learn the border pixels.
IEEE International Symposium on Biomedical Imaging (ISBI)