@DocXavi
Deep Learning for Computer Vision
Image Analytics
5 May 2016
Xavier Giró-i-Nieto
Master en
Creació Multimedia
2
Densely linked slides
3
Introduction
Xavier Giro-i-Nieto
• Web: https://imatge.upc.edu/web/people/xavier-giro
Associate Professor at Universitat Politecnica de Catalunya (UPC)
4
Acknowledgments
5
Acknowledgments
One lecture organized in three parts
6
Images (global) Objects (local)
Deep ConvNets for Recognition for...
Video (2D+T)
One lecture organized in three parts
7
Images (global) Objects (local)
Deep ConvNets for Recognition for...
Video (2D+T)
Previously, before deep learning...
8
Slide credit: Jose M Àlvarez
Dog
9
Slide credit: Jose M Àlvarez
Dog
Learned
Representation
Previously, before deep learning...
Outline for Part I: Image Analytics...
10
Dog
Learned
Representation
Part I: End-to-end learning (E2E)
11
Learned
Representation
Part I: End-to-end learning (E2E)
Task A
(e.g. image classification)
Outline for Part I: Image Analytics...
12
Task A
(e.g. image classification)
Learned
Representation
Part I: End-to-end learning (E2E)
Task B
(e.g. image retrieval)
Part II: Off-the-shelf features
Outline for Part I: Image Analytics...
13
Task A
(e.g. image classification)
Learned
Representation
Part I: End-to-end learning (E2E)
Task B
(e.g. image retrieval)
Part II: Off-the-shelf features
Outline for Part I: Image Analytics...
E2E: Classification: Supervised learning
14
Manual Annotations
Model
New Image
Automatic
classification
Training
Test
Anchor
E2E: Classification: LeNet-5
15
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-
2324.
E2E: Classification: LeNet-5
16
Demo: 3D Visualization of a Convolutional Neural Network
Harley, Adam W. "An Interactive Node-Link Visualization of Convolutional Neural Networks." In Advances in Visual Computing, pp. 867-877. Springer
International Publishing, 2015.
E2E: Classification: Similar to LeNet-5
17
Demo: Classify MNIST digits with a Convolutional Neural Network
“ConvNetJS is a Javascript library for
training Deep Learning models (mainly
Neural Networks) entirely in your browser.
Open a tab and you're training. No software
requirements, no compilers, no installations,
no GPUs, no sweat.”
E2E: Classification: Databases
18
Li Fei-Fei, “How we’re teaching computers to understand
pictures” TEDTalks 2014.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). Imagenet large scale visual
recognition challenge. arXiv preprint arXiv:1409.0575. [web]
19
E2E: Classification: Databases
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). Imagenet large scale visual
recognition challenge. arXiv preprint arXiv:1409.0575. [web]
20
Zhou, Bolei, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. "Learning deep features for scene recognition
using places database." In Advances in neural information processing systems, pp. 487-495. 2014. [web]
E2E: Classification: Databases
● 205 scene classes (categories).
● Images:
○ 2.5M train
○ 20.5k validation
○ 41k test
21
E2E: Classification: ImageNet ILSVRC
● 1000 object classes (categories).
● Images:
○ 1.2 M train
○ 100k test.
E2E: Classification: ImageNet ILSVRC
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). Imagenet large scale visual
recognition challenge. arXiv preprint arXiv:1409.0575. [web]
● Predict 5 classes per image (top-5 evaluation).
Slide credit:
Rob Fergus (NYU)
Image Classification 2012
-9.8%
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). Imagenet large scale visual
recognition challenge. arXiv preprint arXiv:1409.0575. [web]
23
E2E: Classification: ILSVRC
E2E: Classification: AlexNet (Supervision)
24
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)
Orange
A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks” Part of: Advances in
Neural Information Processing Systems 25 (NIPS 2012)
25
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)
E2E: Classification: AlexNet (Supervision)
26
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)
E2E: Classification: AlexNet (Supervision)
27
Image credit: Deep learning Tutorial (Stanford University)
E2E: Classification: AlexNet (Supervision)
28
Image credit: Deep learning Tutorial (Stanford University)
E2E: Classification: AlexNet (Supervision)
29
Image credit: Deep learning Tutorial (Stanford University)
E2E: Classification: AlexNet (Supervision)
30
Rectified
Linear
Unit
(non-linearity)
f(x) = max(0,x)
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)
E2E: Classification: AlexNet (Supervision)
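In code, the ReLU non-linearity is a one-liner; a minimal numpy sketch of the element-wise operation f(x) = max(0, x):

import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    # Rectified Linear Unit: keep positive values, zero out the rest
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]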
31
Dot Product
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)
E2E: Classification: AlexNet (Supervision)
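Each output value of a convolution is a dot product between a filter and a local patch of the input. A minimal numpy sketch (single channel, stride 1, no padding; the vertical-edge filter is only an illustrative example):

import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # 2D convolution (cross-correlation, as in CNN libraries) as repeated dot products
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)  # dot product
    return out

edge = np.array([[1.0, 0.0, -1.0]] * 3)                # simple vertical-edge filter
print(conv2d_valid(np.random.rand(5, 5), edge).shape)  # (3, 3)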
ImageNet Classification 2013
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. arXiv
preprint arXiv:1409.0575. [web]
Slide credit:
Rob Fergus (NYU)
32
E2E: Classification: ImageNet ILSVRC
The development of better convnets is reduced to trial-and-error.
33
E2E: Classification: Visualize: ZF
Visualization can help in
proposing better architectures.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833). Springer
International Publishing.
“A convnet model that uses the same
components (filtering, pooling) but in
reverse, so instead of mapping pixels
to features does the opposite.”
Zeiler, Matthew D., Graham W. Taylor, and Rob Fergus. "Adaptive deconvolutional networks for mid and high level feature learning." Computer Vision
(ICCV), 2011 IEEE International Conference on. IEEE, 2011.
34
E2E: Classification: Visualize: ZF
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833). Springer
International Publishing.
DeconvNet
ConvNet
35
E2E: Classification: Visualize: ZF
36
E2E: Classification: Visualize: ZF
37
E2E: Classification: Visualize: ZF
38
E2E: Classification: Visualize: ZF
“To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature
maps as input to the attached deconvnet layer.”
39
E2E: Classification: Visualize: ZF
40
E2E: Classification: Visualize: ZF
“(i) Unpool: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate
inverse by recording the locations of the maxima within each pooling region in a set of switch variables.”
41
E2E: Classification: Visualize: ZF
“(ii) Rectification: The convnet uses ReLU non-linearities, which rectify the feature maps thus ensuring the
feature maps are always positive.”
42
E2E: Classification: Visualize: ZF
“(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To
approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder
models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this
means flipping each filter vertically and horizontally.”
43
E2E: Classification: Visualize: ZF
44
E2E: Classification: Visualize: ZF
45
Top 9 activations in a random subset of feature maps
across the validation data, projected down to pixel space
using our deconvolutional network approach.
Corresponding image patches.
E2E: Classification: Visualize: ZF
46
E2E: Classification: Visualize: ZF
47
E2E: Classification: Visualize: ZF
48
E2E: Classification: Visualize: ZF
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833). Springer
International Publishing.
49
E2E: Classification: Visualize: ZF
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833). Springer
International Publishing.
50
E2E: Classification: Visualize: ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) result in more distinctive features and fewer “dead”
features.
AlexNet (Layer 1) Clarifai (Layer 1)
E2E: Classification: Visualize: ZF
52
Cleaner features in Clarifai, without the aliasing artifacts caused by the stride 4 used in AlexNet.
AlexNet (Layer 2) Clarifai (Layer 2)
E2E: Classification: Visualize: ZF
53
Regularization with dropout:
overfitting is reduced by setting
to zero the output of a random
portion (typically 50%) of the
intermediate neurons.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of
feature detectors. arXiv preprint arXiv:1207.0580.
E2E: Classification: Dropout: ZF
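A minimal numpy sketch of the idea, using the common “inverted dropout” formulation (rescaling the surviving activations at training time, rather than scaling the weights at test time as in the original paper):

import numpy as np

def dropout(activations: np.ndarray, p_drop: float = 0.5, training: bool = True) -> np.ndarray:
    # Zero out a random fraction p_drop of the units and rescale the rest,
    # so the expected activation is unchanged; do nothing at test time.
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)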
54
E2E: Classification: Visualize: ZF
55
E2E: Classification: Ensembles: ZF
ImageNet Classification 2013
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. arXiv
preprint arXiv:1409.0575. [web]
-5%
Slide credit:
Rob Fergus (NYU)
56
E2E: Classification: ImageNet ILSVRC
57
E2E: Classification: ImageNet ILSVRC
E2E: Classification
58
E2E: Classification: GoogLeNet
59
Movie: Inception (2010)
E2E: Classification: GoogLeNet
60
● 22 layers, but 12 times fewer parameters than AlexNet.
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." CVPR 2015.
E2E: Classification: GoogLeNet
61
● Challenges of going deeper:
○ Overfitting, due to the increased number of parameters.
○ Inefficient computation if most weights end up close to zero.
Solution: sparsity. How? Inception modules.
E2E: Classification: GoogLeNet
62
E2E: Classification: GoogLeNet
63
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014.
E2E: Classification: GoogLeNet
64
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014.
E2E: Classification: GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with
different scales.
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014. [Slides]
66
1x1 convolutions perform dimensionality reduction (c3 < c2) and
are followed by rectified linear units (ReLU).
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014. [Slides]
E2E: Classification: GoogLeNet (NiN)
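Since a 1x1 convolution is just a per-pixel matrix multiply across channels, the reduction is easy to sketch in numpy (the channel sizes c2 = 256 and c3 = 64 are arbitrary illustrations):

import numpy as np

def conv1x1(feature_map: np.ndarray, weights: np.ndarray) -> np.ndarray:
    # feature_map: (H, W, c2); weights: (c2, c3) with c3 < c2.
    # Every spatial position is projected independently, then rectified (ReLU).
    return np.maximum(0.0, feature_map @ weights)

x = np.random.rand(28, 28, 256)      # c2 = 256 input channels
w = np.random.randn(256, 64) * 0.01  # reduce to c3 = 64 channels
print(conv1x1(x, w).shape)           # (28, 28, 64)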
67
In NiN, the Cascaded 1x1 Convolutions compute reductions after the
convolutions.
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014. [Slides]
E2E: Classification: GoogLeNet (NiN)
E2E: Classification: GoogLeNet
68
In GoogLeNet, the Cascaded 1x1 Convolutions compute reductions before the
expensive 3x3 and 5x5 convolutions.
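A minimal PyTorch sketch of such an Inception module, with 1x1 reductions placed before the 3x3 and 5x5 paths and a 1x1 projection after the pooling path; the channel counts are free parameters here, not the exact GoogLeNet configuration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Inception(nn.Module):
    def __init__(self, c_in, c1, c3r, c3, c5r, c5, c_pool):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c1, kernel_size=1)            # 1x1 path
        self.b3r = nn.Conv2d(c_in, c3r, kernel_size=1)          # reduction before 3x3
        self.b3 = nn.Conv2d(c3r, c3, kernel_size=3, padding=1)
        self.b5r = nn.Conv2d(c_in, c5r, kernel_size=1)          # reduction before 5x5
        self.b5 = nn.Conv2d(c5r, c5, kernel_size=5, padding=2)
        self.bp = nn.Conv2d(c_in, c_pool, kernel_size=1)        # projection after pooling

    def forward(self, x):
        p1 = F.relu(self.b1(x))
        p3 = F.relu(self.b3(F.relu(self.b3r(x))))
        p5 = F.relu(self.b5(F.relu(self.b5r(x))))
        pp = F.relu(self.bp(F.max_pool2d(x, 3, stride=1, padding=1)))
        return torch.cat([p1, p3, p5, pp], dim=1)               # concatenate channels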
E2E: Classification: GoogLeNet
69
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014.
E2E: Classification: GoogLeNet
70
3x3 max pooling introduces some spatial invariance, and adding it
as an alternative parallel path has proven beneficial.
E2E: Classification: GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while
providing regularization at training time.
...and no fully connected layers are needed!
E2E: Classification: GoogLeNet
72
E2E: Classification: GoogLeNet
73
NVIDIA, “NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge” (2015)
E2E: Classification: GoogLeNet
74
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." CVPR 2015. [video] [slides] [poster]
E2E: Classification: VGG
75
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition."
International Conference on Learning Representations (2015). [video] [slides] [project]
E2E: Classification: VGG
76
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition."
International Conference on Learning Representations (2015). [video] [slides] [project]
E2E: Classification: VGG: 3x3 Stacks
77
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image
recognition." International Conference on Learning Representations (2015). [video] [slides] [project]
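The motivation behind the 3x3 stacks: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, but with an extra non-linearity in between and fewer weights. A quick parameter count for C input and C output channels (biases omitted):

C = 256
single_5x5 = 5 * 5 * C * C           # 1,638,400 weights
stack_two_3x3 = 2 * (3 * 3 * C * C)  # 1,179,648 weights (28% fewer)
print(single_5x5, stack_two_3x3)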
E2E: Classification: VGG
78
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image
recognition." International Conference on Learning Representations (2015). [video] [slides] [project]
● No pooling layers between some convolutional layers.
● Convolution strides of 1 (no skipping).
E2E: Classification
79
3.6% top-5 error…
with 152 layers!
E2E: Classification: ResNet
80
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385
(2015). [slides]
E2E: Classification: ResNet
81
● Deeper networks (34 layers vs. 18) are more difficult to train.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385
(2015). [slides]
Thin curves: training error
Bold curves: validation error
E2E: Classification: ResNet
82
● Residual learning: reformulate the layers as learning residual functions with
reference to the layer inputs, instead of learning unreferenced functions.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385
(2015). [slides]
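A minimal PyTorch sketch of a basic residual block for the same-dimension case: the stacked layers learn the residual F(x), and the identity shortcut adds the input back, so the block outputs F(x) + x:

import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    # y = F(x) + x, where F is two 3x3 convolutions over the same number of channels
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + x)  # identity shortcut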
E2E: Classification: ResNet
83
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385
(2015). [slides]
84
E2E: Classification: Humans
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). Imagenet large scale visual
recognition challenge. arXiv preprint arXiv:1409.0575. [web]
85
E2E: Classification: Humans
“Is this a Border terrier?”
Crowdsourcing: Yes / No
● Binary ground truth annotation from the crowd.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). Imagenet large scale visual
recognition challenge. arXiv preprint arXiv:1409.0575. [web]
86
● Annotation Problems:
Carlier, Axel, Amaia Salvador, Ferran Cabezas, Xavier Giro-i-Nieto, Vincent Charvillat, and Oge Marques. "Assessment of
crowdsourcing and gamification loss in user-assisted object segmentation." Multimedia Tools and Applications (2015): 1-28.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). Imagenet large scale visual
recognition challenge. arXiv preprint arXiv:1409.0575. [web]
Crowdsourcing loss (0.3%)
More than 5 object classes
E2E: Classification: Humans
87
Andrej Karpathy, “What I learned from competing against a computer on ImageNet” (2014)
● Test data collection from one human. [interface]
E2E: Classification: Humans
88
Andrej Karpathy, “What I learned from competing against a computer on ImageNet” (2014)
● Test data collection from one human. [interface]
“Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120
breeds of dog to guess what species it is?”
E2E: Classification: Humans
E2E: Classification: Humans
89
NVIDIA, “Mocha.jl: Deep Learning for Julia” (2015)
ResNet
90
Let’s play a
game!
91
92
What have you seen?
93
Tower
94
Tower
House
95
Tower
House
Rocks
96
E2E: Saliency
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB 2015)
E2E: Saliency
98
E2E: Saliency: JuntingNet
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
99
E2E: Saliency: JuntingNet
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
100
TRAIN VALIDATION TEST
SALICON [Jiang’15] (large scale) 10,000 5,000 5,000
iSUN [Xu’15] (large scale) 6,000 926 2,000
CAT2000 [Borji’15] 2,000 - 2,000
MIT300 [Judd’12] 300 - -
E2E: Saliency: JuntingNet
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
101
E2E: Saliency: JuntingNet
102
Upsample
+ filter
2D
map
96x96 → 2304 = 48x48
IMAGE
INPUT
(RGB)
E2E: Saliency: JuntingNet
103
Upsample
+ filter
2D
map
96x96 → 2304 = 48x48
3 CONV LAYERS
E2E: Saliency: JuntingNet
104
Upsample
+ filter
2D
map
96x96 → 2304 = 48x48
2 DENSE
LAYERS
E2E: Saliency: JuntingNet
105
Upsample
+ filter
2D
map
96x96 → 2304 = 48x48
E2E: Saliency: JuntingNet
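A minimal PyTorch sketch of a shallow network in this spirit. The filter counts and kernel sizes below are assumptions for illustration, not the exact published configuration, but the geometry matches the slide: a 96x96 RGB input, 3 convolutional layers, 2 dense layers, and 2304 = 48x48 outputs reshaped into the 2D saliency map:

import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    # 96x96 RGB in -> 2304 outputs reshaped to a 48x48 saliency map
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(  # 3 conv layers (assumed sizes)
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),    # -> 48x48
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 24x24
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 12x12
        )
        self.regressor = nn.Sequential(  # 2 dense layers
            nn.Flatten(),
            nn.Linear(128 * 12 * 12, 4608), nn.ReLU(),
            nn.Linear(4608, 2304),       # 2304 = 48x48
        )

    def forward(self, x):
        return self.regressor(self.features(x)).view(-1, 1, 48, 48)

print(ShallowSaliencyNet()(torch.rand(1, 3, 96, 96)).shape)  # (1, 1, 48, 48)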
106
E2E: Saliency: JuntingNet
107
Loss function: Mean Square Error (MSE)
Weight initialization: Gaussian distribution
Learning rate: 0.03 to 0.0001
Mini-batch size: 128
Training time: 7h (SALICON) / 4h (iSUN)
Acceleration: SGD + Nesterov momentum (0.9)
Regularisation: max norm
GPU: NVIDIA GTX 980
E2E: Saliency: JuntingNet
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
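A sketch of one training step with these settings (MSE loss, mini-batch of 128, SGD with Nesterov momentum 0.9; the learning rate shown is just the starting value of the reported 0.03 to 0.0001 schedule), reusing the ShallowSaliencyNet sketched above:

import torch

net = ShallowSaliencyNet()  # from the sketch above
opt = torch.optim.SGD(net.parameters(), lr=0.03, momentum=0.9, nesterov=True)
loss_fn = torch.nn.MSELoss()  # mean square error

images = torch.rand(128, 3, 96, 96)   # mini-batch of 128 inputs
targets = torch.rand(128, 1, 48, 48)  # ground-truth saliency maps

opt.zero_grad()
loss = loss_fn(net(images), targets)  # Euclidean/MSE objective
loss.backward()                       # back-propagation
opt.step()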
108
Number of iterations
(Training time)
● Back-propagation with the Euclidean distance.
● Training curve for the SALICON database.
E2E: Saliency: JuntingNet
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
109
Pixels / Ground Truth / JuntingNet
E2E: Saliency: JuntingNet: iSUN
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
110
Pixels / Ground Truth / JuntingNet
E2E: Saliency: JuntingNet: iSUN
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E: Saliency: JuntingNet: iSUN
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
112
Pixels / Ground Truth / JuntingNet
E2E: Saliency: JuntingNet: SALICON
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
113
Pixels / Ground Truth / JuntingNet
E2E: Saliency: JuntingNet: SALICON
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E: Saliency: JuntingNet: SALICON
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
115
E2E: Saliency: JuntingNet
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
116
Part I: End-to-end learning (E2E)
Domain A: Learned Representation
Transfer
Part I’: End-to-End Fine-Tuning (FT)
Domain B: Fine-tuned Learned Representation
Outline for Part I: Image Analytics
117
E2E: Fine-tuning
Fine-tuning a pre-trained network
Slide credit: Victor Campos, “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)
118
E2E: Fine-tuning
Slide credit: Victor Campos, “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)
Fine-tuning a pre-trained network
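A minimal sketch of the usual recipe: load a model pre-trained on the source task, freeze the early convolutional layers, replace the final classifier for the new domain, and train what remains with a small learning rate. The torchvision AlexNet and the 2-class head are illustrative assumptions, not the exact setup of the cited work:

import torch
import torch.nn as nn
from torchvision import models

net = models.alexnet(weights="IMAGENET1K_V1")  # pre-trained on the source task

for param in net.features.parameters():        # freeze the convolutional layers
    param.requires_grad = False

net.classifier[6] = nn.Linear(4096, 2)         # new head, e.g. 2 sentiment classes

# fine-tune only the classifier, with a small learning rate
opt = torch.optim.SGD(net.classifier.parameters(), lr=1e-3, momentum=0.9)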
119
E2E: Fine-tuning: Sentiments
CNN
Campos, Victor, Amaia Salvador, Xavier Giro-i-Nieto, and Brendan Jou. "Diving Deep into Sentiment: Understanding
Fine-tuned CNNs for Visual Sentiment Prediction." In Proceedings of the 1st International Workshop on Affect &
Sentiment in Multimedia, pp. 57-62. ACM, 2015.
120
Campos, Victor, Xavier Giro-i-Nieto, and Brendan Jou. “From pixels to sentiments” (Submitted)
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic
Segmentation." CVPR 2015
E2E: Fine-tuning: Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks.
121
E2E: Fine-tuning: Cultural events
ChaLearn Workshop
A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A., and Giró-i-Nieto, X., “Cultural
Event Recognition with Visual ConvNets and Temporal Models”, in CVPR ChaLearn Looking at People
Workshop 2015, 2015. [slides]
122
VGG +
Fine tuned
E2E: Fine-tuning: Saliency prediction
123
E2E: Fine-tuning: Saliency prediction
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
From
scratch
VGG +
Fine tuned
124
E2E: Fine-tuning: Saliency prediction
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional
Networks for Saliency Prediction." CVPR 2016
125
Task A
(e.g. image classification)
Learned
Representation
Part I: End-to-end learning (E2E)
Task B
(e.g. image retrieval)
Part II: Off-The-Shelf features (OTS)
Outline for Part I: Image Analytics...
126
Razavian, Ali, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. "CNN features off-the-shelf:
an astounding baseline for recognition." CVPRW 2014
Off-The-Shelf (OTS) Features
● Intermediate features can be used as regular visual descriptors for any task.
127
Off-The-Shelf (OTS) Features
Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014
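A minimal sketch of extracting such an off-the-shelf descriptor, here the 4096-dimensional activations of the penultimate (fc7-style) layer of a pre-trained AlexNet, followed by L2-normalization; torchvision is used purely for illustration:

import torch
from torchvision import models

net = models.alexnet(weights="IMAGENET1K_V1").eval()
# keep everything up to fc7: drop the final 1000-way classification layer
extractor = torch.nn.Sequential(net.features, net.avgpool, torch.nn.Flatten(),
                                *list(net.classifier.children())[:-1])

with torch.no_grad():
    descriptor = extractor(torch.rand(1, 3, 224, 224))  # shape (1, 4096)
descriptor = descriptor / descriptor.norm()             # L2-normalize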
128
OTS: Classification: Razavian
Razavian, Ali, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. "CNN features off-the-shelf:
an astounding baseline for recognition." CVPRW 2014
Pascal VOC 2007
129
OTS: Classification: Return of devil
Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A.. Return of the devil in the details: Delving
deep into convolutional nets. BMVC 2014
Classifier
L2-normalization
Accuracy
+5%
Three representative architectures considered: AlexNet, ZF, OverFeat.
Training time: from 5 days (fast) to 3 weeks (slow) on an NVIDIA GTX Titan GPU.
130
OTS: Classification: Return of devil
Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A.. Return of the devil in the details: Delving
deep into convolutional nets. BMVC 2014
131
OTS: Classification: Return of devil
Data augmentation
Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A.. Return of the devil in the details: Delving
deep into convolutional nets. BMVC 2014
132
OTS: Classification: Return of devil
Fisher Kernels (FK) vs. ConvNets (CNN)
Color vs. Gray Scale (GS): accuracy -2.5%
133
OTS: Classification: Return of devil
Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A.. Return of the devil in the details: Delving
deep into convolutional nets. BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes:
accuracy -2% for a size reduction of x32.
134
OTS: Classification: Return of devil
Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A.. Return of the devil in the details: Delving
deep into convolutional nets. BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch.
Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014. Springer
International Publishing, 2014. 584-599.
135
OTS: Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS: Retrieval
Features pooled from the network of Krizhevsky et al., pre-trained with images from
ImageNet.
137
OTS: Retrieval: FC layers
Summary of the paper by Amaia Salvador on Bitsearch.
Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014.
Off-the-shelf CNN descriptors from fully connected layers prove useful but not
superior to classic encodings (FV, VLAD, sparse coding, ...).
138
Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014.
OTS: Retrieval: FC layers
Razavian et al, A baseline for visual instance retrieval with deep convolutional networks, ICLR 2015.
139
Convolutional layers have shown better performance than fully connected ones.
OTS: Retrieval: Conv layers
OTS: Retrieval: Conv layers
Razavian et al, A baseline for visual instance retrieval with deep convolutional networks, ICLR 2015.
140
Spatial search (extracting N local descriptors from predefined locations) increases
performance at a computational cost.
Medium memory footprints
Razavian et al, A baseline for visual instance retrieval with deep convolutional networks, ICLR 2015.
141
OTS: Retrieval: Conv layers
142
OTS: Summarization
Bolaños M, Mestre R, Talavera E, Giró-i-Nieto X, Radeva P. Visual Summary of Egocentric Photostreams by Representative
Keyframes. In: IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015. Turin, Italy:
2015
Clustering based on Euclidean distance over FC7 features from AlexNet.
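As a rough sketch of that pipeline, assuming a matrix of FC7 descriptors (one row per candidate frame) has already been extracted:

import numpy as np
from sklearn.cluster import KMeans

features = np.random.rand(500, 4096)  # FC7 descriptors, one per frame (placeholder)
kmeans = KMeans(n_clusters=10, n_init=10).fit(features)  # Euclidean k-means

# pick the frame closest to each cluster centroid as a representative keyframe
keyframes = [int(np.argmin(np.linalg.norm(features - c, axis=1)))
             for c in kmeans.cluster_centers_]
print(keyframes)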
143
Thank you !
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
