Deep Learning
for Computer Vision
Executive-ML 2017/09/21
Neither Proprietary nor Confidential – Please Distribute ;)
Alex Conway
alex @ numberboost.com
@alxcnwy
PyConDE’17
Hands up!
Deep Learning is Sexy (for a reason!)
Image Classification
http://yann.lecun.com/exdb/mnist/
https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py
(99.25% test accuracy in 192 seconds and 46 lines of code)
Image Classification
https://blog.prototypr.io/behind-the-magic-how-we-built-the-arkit-sudoku-solver-e586e5b685b0
Image Classification
ImageNet Classification with Deep Convolutional Neural Networks, Krizhevsky et al., Advances in Neural Information Processing Systems 25 (NIPS 2012)
https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html
Object detection
https://www.youtube.com/watch?v=VOC3huqHrss
Image Captioning & Visual Attention
https://einstein.ai/research/knowing-when-to-look-adaptive-attention-via-a-visual-sentinel-for-image-captioning
Image Q&A
https://arxiv.org/pdf/1612.00837.pdf
Video Q&A
https://www.youtube.com/watch?v=UeheTiBJ0Io
Pix2Pix
https://affinelayer.com/pix2pix/
https://github.com/affinelayer/pix2pix-tensorflow
Pix2Pix
https://medium.com/towards-data-science/face2face-a-pix2pix-demo-that-mimics-the-facial-expression-of-the-german-chancellor-b6771d65bf66
Original input: Rear Window (1954)
Pix2pix output: fully automated
Remastered: painstakingly by hand
https://hackernoon.com/remastering-classic-films-in-tensorflow-with-pix2pix-f4d551fa0503
Style Transfer
https://arxiv.org/abs/1508.06576
Style Season Transfer
https://github.com/junyanz/CycleGAN
I WISH!1! ;)
Style Transfer SORCERY
https://github.com/junyanz/CycleGAN
Real Fake News
https://www.youtube.com/watch?v=MVBe6_o4cMI
Deep learning is Magic… Deep learning is EASY!
1. What is a neural network?
2. What is a convolutional neural network?
3. How to use a convolutional neural network
4. More advanced methods
5. Case studies & applications
1. What is a neural network?
What is a neuron?
• 3 inputs [x1, x2, x3]
• 3 weights [w1, w2, w3]
• 1 bias term b
• Element-wise multiply and sum the inputs & weights
• Add the bias term
• Apply the activation function f
• The output of the neuron is just a number (like the inputs) – see the sketch below
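A minimal numpy sketch of this computation (the input, weight and bias values are made up for illustration):

# Minimal sketch of a single neuron (illustrative values, not from the talk's repo).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, f=sigmoid):
    """Element-wise multiply inputs & weights, sum, add the bias, apply activation f."""
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # 3 inputs
w = np.array([0.1, 0.4, -0.2])   # 3 weights
b = 0.3                          # bias term
print(neuron(x, w, b))           # a single number in (0, 1) because of the sigmoid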
What is an Activation Function?
Sigmoid, Tanh, ReLU
Nonlinearities … “squashing functions” … transform the neuron’s output
NB: sigmoid output in [0,1]
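A quick numpy sketch of these three squashing functions (the input values are illustrative):

# Sketch of the three activation functions on the slide.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # clips negatives to 0

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")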
So what is a Neural Network?

What is a (Deep) Neural Network?
[diagram: inputs → hidden layer 1 → hidden layer 2 → hidden layer 3 → outputs]
Note: the outputs of one layer are the inputs into the next layer
This (non-convolutional) architecture is called a “multi-layer perceptron”
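A minimal Keras sketch of such a multi-layer perceptron on MNIST (imports assume the standalone Keras used in the deck's examples; layer sizes and hyperparameters are illustrative, not the talk's exact model):

# Minimal Keras sketch of a multi-layer perceptron on MNIST.
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    Dense(512, activation="relu", input_shape=(784,)),  # hidden layer 1
    Dense(512, activation="relu"),                      # hidden layer 2
    Dense(10, activation="softmax"),                    # output layer: 10 digits
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128, validation_data=(x_test, y_test))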
1 Layer Neural Network
https://ml4a.github.io/ml4a/looking_inside_neural_nets/
[figure: learned weights visualized for each output digit 0–9]
2 Layer Neural Network
https://ml4a.github.io/ml4a/looking_inside_neural_nets/
[figure: learned weights visualized for each output digit 0–9]
https://www.youtube.com/watch?v=aircAruvnKk
3Blue1Brown is AWESOME!
How does a neural network learn?
• We need labelled examples: “training data”
• We initialize the network weights randomly and initially get random predictions
• For each labelled training data point, we calculate the error between the network’s predictions and the ground-truth labels (cross-entropy error for classification, mean-squared error for regression)
• We use ‘backpropagation’ (the chain rule) to update the network parameters (weights + convolutional filters) in the opposite direction to the error gradient: “how much the error increases when we increase this weight”
How does a neural network learn?
New weight = Old weight − Learning rate × (Gradient of Error with respect to Weight)

w ← w − η · ∂E/∂w
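A toy numpy sketch of this update rule for a single training example (the one-layer linear “network”, the data and the learning rate are made-up assumptions):

# Sketch of one gradient-descent weight update.
import numpy as np

learning_rate = 0.1
w = np.random.randn(3)            # old weights
x = np.array([0.5, -1.0, 2.0])    # one training input
y_true = 1.0                      # ground-truth label

y_pred = np.dot(w, x)                     # toy linear "network"
error = 0.5 * (y_pred - y_true) ** 2      # squared error for one example
grad_w = (y_pred - y_true) * x            # dError/dw via the chain rule

w = w - learning_rate * grad_w    # new weight = old weight - learning rate * gradient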
Gradient Descent Interpretation
http://scs.ryerson.ca/~aharley/neural-networks/
http://playground.tensorflow.org
What is a Neural Network?
For much more detail, see:
1. Michael Nielsen’s Neural Networks & Deep Learning free online book
http://neuralnetworksanddeeplearning.com/chap1.html
2. Andrej Karpathy’s CS231n notes
http://cs231n.github.io/
2. What is a convolutional neural network?
What is a Convolutional Neural Network?
“like an ordinary neural network but with special types of layers that work well on images”
(math works on numbers)
• A pixel has 3 colour channels (R, G, B)
• Pixel intensity ∈ [0, 255]
• An image has width w and height h
• Therefore an image is w × h × 3 numbers
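A small sketch showing that an image really is just w × h × 3 numbers (uses Pillow + numpy; the file name cat.jpg is a placeholder):

# An image as a numpy array: (height, width, 3) intensities in [0, 255].
import numpy as np
from PIL import Image

img = Image.open("cat.jpg").convert("RGB")  # any RGB image
pixels = np.asarray(img)

print(pixels.shape)         # (height, width, 3) - one number per channel per pixel
print(pixels.dtype)         # uint8, i.e. intensities in [0, 255]
print(pixels[0, 0])         # the (R, G, B) values of the top-left pixel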
First of all, why do we need special layers?
1. Locality – objects tend to have local spatial support
2. Translation invariance – objects can appear anywhere in the image
3. Parameter counts explode in fully-connected nets – recall the 3-layer 28×28-pixel MNIST net had 13,002 parameters… if we instead had 500×500-pixel images, we’d need 4,000,458
4. We can get better results ;) – Keras MLP (98.4% accuracy = 1.6% error) vs CNN (99.25% accuracy = 0.75% error), so the error rate is more than halved
https://www.youtube.com/watch?v=t_TY_5bG9J8
Example Architecture
This is VGGNet – don’t panic, we’ll break it down piece by piece
Convolutions
http://setosa.io/ev/image-kernels/
Convolutions
http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html
New Layer Type: Convolutional Layer
• A 2-D weighted average computed by multiplying a kernel over patches of pixels
• We slide the kernel over all pixels of the image (handling the borders)
• The kernel starts off with “random” values and the network updates (learns) the kernel values (using backpropagation) to try to minimize the loss
• Kernels are shared across the whole image (parameter sharing) – see the sketch below
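A plain-numpy sketch of sliding a 3×3 kernel over a grayscale image (“valid” border handling; the edge-detection kernel values are illustrative, whereas a CNN would learn them):

# Sliding a 3x3 kernel over a grayscale image to produce one activation map.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply and sum
    return out

image = np.random.rand(28, 28)
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)  # in a CNN these values are learned
print(conv2d(image, edge_kernel).shape)              # (26, 26) activation map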
Many Kernels = Many “Activation Maps” = Volume
http://cs231n.github.io/convolutional-networks/
Convolutions
https://github.com/fchollet/keras/blob/master/examples/conv_filter_visualization.py
Convolutions Learn Hierarchical Features
New Layer Type: Max Pooling
• Reduces dimensionality from one layer to the next
• …by replacing an N×N sub-area with its max value
• Makes the network less sensitive to local perturbations
• Makes the network “look” at larger areas of the image at a time, e.g. instead of identifying fur, identify a cat
• Reduces overfitting, since losing information helps the network generalize
(see the sketch below)
http://cs231n.github.io/convolutional-networks/
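A plain-numpy sketch of 2×2 max pooling on a single activation map (assumes even height and width):

# 2x2 max pooling: each 2x2 sub-area is replaced by its max value.
import numpy as np

def max_pool_2x2(activation_map):
    h, w = activation_map.shape
    return activation_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(a))   # shape (2, 2): dimensionality reduced by 4x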
Stack Conv + Pooling Layers and Go Deep
Convolution + max pooling + fully connected + softmax

Flatten the Final “Bottleneck” Layer
Flatten the final 7 × 7 × 512 max-pooling layer
Add a fully-connected dense layer on top

Softmax
Convert scores ∈ ℝ to probabilities ∈ [0, 1]
Final output prediction = the highest-probability class

Bringing it all together: convolution + max pooling + fully connected + softmax (see the sketch below)
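A minimal Keras sketch of this conv + max-pool + flatten + dense + softmax recipe. This is a tiny VGG-style toy network, not the real 16-layer VGGNet, and every layer size here is illustrative:

# Tiny VGG-style network: conv + max pooling + flatten + dense + softmax.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),                 # shrink spatial dimensions
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),                            # the flattened "bottleneck"
    Dense(256, activation="relu"),        # fully-connected layer on top
    Dense(1000, activation="softmax"),    # scores -> probabilities in [0, 1]
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()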
We need labelled training data!
ImageNet
http://image-net.org/explore
1000 object categories
1.2 million training images
ImageNet Top-5 Error Rate
[chart: top-5 error by year and winning model]
• Traditional image-processing methods
• AlexNet (8 layers)
• ZFNet (8 layers)
• GoogLeNet (22 layers)
• ResNet (152 layers)
• SENet (ensemble)
• TSNet (ensemble)
3. How to use a convolutional neural network
Using a Pre-Trained ImageNet-Winning CNN
• We’ve been looking at “VGGNet”
• Oxford Visual Geometry Group (VGG)
• ImageNet 2014 runner-up
• The network is 16 layers (deep!)
• Trained on 4 GPUs for 2–3 weeks (data split across the GPUs)
• Easy to fine-tune (see the sketch below)
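A sketch of loading the pre-trained VGG16 ImageNet model in Keras and classifying a single image (the image path is a placeholder):

# Load pre-trained VGG16 and classify one image against the 1000 ImageNet classes.
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image

model = VGG16(weights="imagenet")            # 16-layer network, ImageNet weights

img = image.load_img("cat.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=5)[0])   # top-5 ImageNet classes with probabilities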
Example: Classifying Product Images
https://github.com/alexcnwy/DeepLearning4ComputerVision
Classifying products into 8 categories
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
Start with a Pre-Trained ImageNet Model
Consumers vs Producers of Machine Learning
…but we don’t want to predict ImageNet categories…
we want to predict something new!
“Transfer Learning” to the rescue!1!
Fine-tuning A CNN To Solve A New Problem
• Cut off the last layer of a pre-trained ImageNet-winning CNN
• Keep the learned feature detectors (convolutions)
• …BUT replace the dense and softmax layers with new ones!
• The network can then learn to predict new (completely different) classes
• Fine-tuning is re-training the new final layer(s)
Fine-tuning A CNN To Solve A New Problem
Before Fine-Tuning
…
After Fine-Tuning
Replace Final Layers of the Pre-trained Model
Fine-tuning A CNN To Solve A New Problem
• Load the pre-trained VGG network
• Remove the final dense & softmax layers that predict the 1000 ImageNet classes
• Replace them with a new dense & softmax layer that predicts 8 categories (see the sketch below)
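A hedged sketch of this fine-tuning recipe in Keras; it is not the talk's exact notebook, and the data directory, optimizer and epoch count are assumptions:

# Fine-tuning sketch: keep VGG16's convolutions, train a new dense + softmax head.
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False              # keep the learned convolutional feature detectors

x = Flatten()(base.output)               # the 7 x 7 x 512 bottleneck, flattened
x = Dense(256, activation="relu")(x)     # new dense layer
out = Dense(8, activation="softmax")(x)  # new softmax over our 8 product categories

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

train = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=32, class_mode="categorical")
model.fit_generator(train, steps_per_epoch=train.samples // 32, epochs=5)  # re-train only the new head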
96.3% accuracy in under 2 minutes for
classifying products into categories
(WITH ONLY 3467 TRAINING IMAGES!!1!)
Fine-tuning is awesome!
Insert obligatory brain analogy
Brief Error Analysis
“Truth” = tee, Prediction = shirt
“Truth” = jacket, Prediction = knitwear
Visual Similarity
• Chop off the VGG dense layers and keep the bottleneck features
• Add a single dense layer with 256 activations
• “Predict” to get the activation vector for each image
• Compute nearest neighbours in the space of these activations (see the sketch below)
https://memeburn.com/2017/06/spree-image-search/
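A hedged sketch of the visual-similarity recipe: a headless VGG16 as feature extractor, then nearest neighbours on the feature vectors. For simplicity this version uses average-pooled bottleneck features and skips the extra 256-unit dense layer from the slide; all file paths are placeholders:

# Visual similarity: VGG16 bottleneck features + nearest neighbours.
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from sklearn.neighbors import NearestNeighbors

extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")  # dense layers chopped off

def featurize(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]          # one feature vector per image

paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]   # your product images
features = np.stack([featurize(p) for p in paths])

nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(features)
dist, idx = nn.kneighbors(featurize("query.jpg").reshape(1, -1))
print([paths[i] for i in idx[0]])           # most "visually similar" images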
https://github.com/alexcnwy/DeepLearning4ComputerVision
Results
Input image (not seen by the model) → top 10 most “visually similar” images
Final Convolutional Layer = Semantic Vector
The final convolutional layer encodes everything the network needs to make predictions

Let’s Visualize the Data with PCA (see the sketch below)
http://setosa.io/ev/principal-component-analysis/
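A small sketch of that PCA visualization with scikit-learn and matplotlib, reusing the `features` array from the visual-similarity sketch above:

# Project high-dimensional bottleneck features down to 2-D with PCA for plotting.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(features)   # (n_images, 2)

plt.scatter(coords[:, 0], coords[:, 1])
plt.title("Bottleneck features projected to 2-D with PCA")
plt.show()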
4. More Advanced Methods
Use a Better Architecture (or all of them!)
“Ensembles win” – learn a weighted average of many models’ predictions (see the sketch below)
cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
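A sketch of a weighted-average ensemble over several Keras models' predicted probabilities (the model names and weights are placeholders):

# Weighted-average ensemble of predicted class probabilities.
import numpy as np

def ensemble_predict(models, weights, x):
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                     # normalize the weights
    preds = [w * m.predict(x) for m, w in zip(models, weights)]
    return np.sum(preds, axis=0)                          # weighted average of probabilities

# probs = ensemble_predict([vgg_model, resnet_model], weights=[0.4, 0.6], x=batch)
# final_class = probs.argmax(axis=-1)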
There are MANY Computer Vision Tasks
Long et al. “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al. “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Semantic Segmentation
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
Object detection
http://blog.romanofoti.com/style_transfer/
Johnson et al. Perceptual losses for real-time style transfer and super-resolution. 2016
Style Transfer
[figure: style transfer as a function, f(content image, style image) = stylized output]
Super-Resolution
https://www.youtube.com/watch?v=LhF_56SxrGk
[figure: pixelated input vs. super-resolution output vs. original]
This image is 3.8 kB
https://arstechnica.com/information-technology/2017/02/google-brain-super-resolution-zoom-enhance/
https://github.com/tdeboissiere/VGG16CAM-keras
Visual Attention
Image Captioning
https://einstein.ai/research/knowing-when-to-look-adaptive-attention-via-a-visual-sentinel-for-image-captioning
Karpathy & Li. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), pp. 664–676
Image Q&A
http://iamaaditya.github.io/2016/04/visual_question_answering_demo_notebook
Video Q&A
https://www.youtube.com/watch?v=UeheTiBJ0Io
king + woman – man ≈ queen
Frome et al. (2013) ‘DeViSE: A Deep Visual-Semantic Embedding Model’, Advances in Neural Information Processing Systems, pp. 2121–2129
CNN + Word2Vec = AWESOME
DeViSE: A Deep Visual-Semantic Embedding Model
Before: encode labels as a 1-hot vector
After: encode labels as 300-d word2vec vectors (FROM A SEPARATE MODEL)
These can be looked up for all the nouns in ImageNet (see the sketch below)
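A hedged Keras sketch of the DeViSE idea: swap the softmax for a 300-d linear layer that regresses onto each label's word2vec vector. The cosine-proximity loss here is a simplification of my own; the DeViSE paper itself uses a ranking loss:

# DeViSE-style head: predict a 300-d word vector instead of a 1000-way softmax.
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
x = Flatten()(base.output)
word_vector = Dense(300, activation="linear")(x)   # predict a 300-d word2vec vector

devise = Model(inputs=base.input, outputs=word_vector)
devise.compile(optimizer="adam", loss="cosine_proximity")
# y_train would be the word2vec vector of each image's label (from a separate word2vec model)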
DeViSE: A Deep Visual-Semantic Embedding Model
wv* = ( wv(fish) + wv(net) ) / 2
…then get the nearest neighbours to wv* (see the sketch below)
https://www.youtube.com/watch?v=uv0gmrXSXVg
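A sketch of this vector arithmetic with gensim word vectors (the pre-trained vector file name is an assumption):

# Word-vector arithmetic: wv* = (wv(fish) + wv(net)) / 2, then nearest neighbours.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

wv_star = (wv["fish"] + wv["net"]) / 2
print(wv.similar_by_vector(wv_star, topn=5))      # nearest neighbours to wv*
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))  # ~ queen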
Multi-Modal Modeling – Mix & Match!
• Structured data, e.g. #beds, #garages, m², etc.
• Features extracted from the description, e.g. with an LSTM
• Features extracted from the photos using a convolutional network
→ Price (see the sketch below)
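A hedged sketch of this mix-and-match architecture with the Keras functional API; every input shape and layer size is illustrative:

# Multi-modal model: structured + text + image features, concatenated, regressed onto price.
from keras.models import Model
from keras.layers import Input, Dense, LSTM, Embedding, Flatten, Conv2D, MaxPooling2D, concatenate

structured_in = Input(shape=(10,))                       # #beds, #garages, m2, ...
text_in = Input(shape=(100,), dtype="int32")             # description as word indices
image_in = Input(shape=(64, 64, 3))                      # property photo

text_feat = LSTM(64)(Embedding(input_dim=20000, output_dim=50)(text_in))

img = Conv2D(32, (3, 3), activation="relu")(image_in)
img = MaxPooling2D((2, 2))(img)
image_feat = Flatten()(img)

merged = concatenate([structured_in, text_feat, image_feat])
price = Dense(1, activation="linear")(Dense(64, activation="relu")(merged))

model = Model(inputs=[structured_in, text_in, image_in], outputs=price)
model.compile(optimizer="adam", loss="mse")              # mean-squared error for regression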
5. Case Studies
Estimating Accident Repair Cost from Photos
• Prototype for a large S.A. insurer
• Detect the car make & model from the license sticker on the windshield
• Predict the repair cost using a learned model
Image & Video Moderation
Large international gay dating app with tens of millions of users uploading hundreds of thousands of photos per day
Segmenting Medical Images
Counting People
Count shoppers, segment on age & gender; facial-recognition loyalty is next
Counting Cars
“Using the final model that has been trained on survey data, we can estimate per capita consumption expenditure for any location where we have daytime satellite imagery.”
“…features visible in daytime satellite imagery, such as roofing material and distance to urban areas, vary roughly linearly with expenditure.”
http://sustain.stanford.edu/predicting-poverty/
Detecting Potholes
Extracting Data from Product Catalogues
Real-time ATM Video Classification
Optical Sorting
https://www.youtube.com/watch?v=Xf7jaxwnyso
We are hiring!
(and available for hire ;)
alex @ numberboost.com
GET IN TOUCH!
Alex Conway
alex @ numberboost.com
@alxcnwy

Deep Learning for Computer Vision - PyconDE 2017

Editor's Notes

  • #74 SENet = Squeeze and Excitation Net. TSNet = Trimps-Soushen
  • #87 Gets 88% accuracy in classifying products into categories