Deep Learning
for Computer Vision
Executive-ML 2017/09/21
Neither Proprietary nor Confidential – Please Distribute ;)
Alex Conway
alex @ numberboost.com
@alxcnwy
PyConDE’17
Hands up!
Deep Learning is Sexy (for a reason!)
Image Classification
http://yann.lecun.com/exdb/mnist/
https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py
(99.25% test accuracy in 192 seconds and 46 lines of code)
Image Classification
https://blog.prototypr.io/behind-the-magic-how-we-built-the-arkit-sudoku-solver-e586e5b685b0
Image Classification
ImageNet Classification with Deep Convolutional Neural Networks, Krizhevsky et al., Advances in Neural Information Processing Systems 25 (NIPS 2012)
https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html
Object detection
https://www.youtube.com/watch?v=VOC3huqHrss
Image Captioning & Visual Attention
https://einstein.ai/research/knowing-when-to-look-adaptive-attention-via-a-visual-sentinel-for-image-captioning
Image Q&A
https://arxiv.org/pdf/1612.00837.pdf
Video Q&A
https://www.youtube.com/watch?v=UeheTiBJ0Io
Pix2Pix
https://affinelayer.com/pix2pix/
https://github.com/affinelayer/pix2pix-tensorflow
Pix2Pix
https://medium.com/towards-data-science/face2face-a-pix2pix-demo-that-mimics-the-facial-expression-of-the-german-chancellor-b6771d65bf66
Original input: Rear Window (1954)
Pix2pix output: fully automated
Remastered: painstakingly by hand
https://hackernoon.com/remastering-classic-films-in-tensorflow-with-pix2pix-f4d551fa0503
Style Transfer
https://arxiv.org/abs/1508.06576
Style Season Transfer
https://github.com/junyanz/CycleGAN
I WISH!1! ;)
Style Transfer SORCERY
https://github.com/junyanz/CycleGAN
Real Fake News
https://www.youtube.com/watch?v=MVBe6_o4cMI
Deep learning is Magic… Deep learning is EASY!
1. What is a neural network?
2. What is a convolutional neural network?
3. How to use a convolutional neural network
4. More advanced methods
5. Case studies & applications
1. What is a neural network?
What is a neuron?
• 3 inputs [x1, x2, x3]
• 3 weights [w1, w2, w3]
• 1 bias term b
• Element-wise multiply and sum the inputs & weights
• Add the bias term
• Apply the activation function f
• The output of the neuron is just a number (like the inputs) – see the sketch below
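A minimal numpy sketch of this computation (the input, weight and bias values are made up for illustration):

# Minimal sketch of a single neuron (illustrative values, not from the talk's repo).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, f=sigmoid):
    """Element-wise multiply inputs & weights, sum, add the bias, apply activation f."""
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # 3 inputs
w = np.array([0.1, 0.4, -0.2])   # 3 weights
b = 0.3                          # bias term
print(neuron(x, w, b))           # a single number in (0, 1) because of the sigmoid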
What is an Activation Function?
Sigmoid, Tanh, ReLU
Nonlinearities … “squashing functions” … transform the neuron’s output
NB: sigmoid output in [0,1]
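A quick numpy sketch of these three squashing functions (the input values are illustrative):

# Sketch of the three activation functions on the slide.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # clips negatives to 0

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")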
So what is a Neural Network?

What is a (Deep) Neural Network?
[diagram: inputs → hidden layer 1 → hidden layer 2 → hidden layer 3 → outputs]
Note: the outputs of one layer are the inputs into the next layer
This (non-convolutional) architecture is called a “multi-layer perceptron”
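A minimal Keras sketch of such a multi-layer perceptron on MNIST (imports assume the standalone Keras used in the deck's examples; layer sizes and hyperparameters are illustrative, not the talk's exact model):

# Minimal Keras sketch of a multi-layer perceptron on MNIST.
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    Dense(512, activation="relu", input_shape=(784,)),  # hidden layer 1
    Dense(512, activation="relu"),                      # hidden layer 2
    Dense(10, activation="softmax"),                    # output layer: 10 digits
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128, validation_data=(x_test, y_test))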
1 Layer Neural Network
https://ml4a.github.io/ml4a/looking_inside_neural_nets/
[figure: learned weights visualized for each output digit 0–9]
2 Layer Neural Network
https://ml4a.github.io/ml4a/looking_inside_neural_nets/
[figure: learned weights visualized for each output digit 0–9]
https://www.youtube.com/watch?v=aircAruvnKk
3Blue1Brown is AWESOME!
How does a neural network learn?
• We need labelled examples: “training data”
• We initialize the network weights randomly and initially get random predictions
• For each labelled training data point, we calculate the error between the network’s predictions and the ground-truth labels (cross-entropy error for classification, mean-squared error for regression)
• We use ‘backpropagation’ (the chain rule) to update the network parameters (weights + convolutional filters) in the opposite direction to the error gradient: “how much the error increases when we increase this weight”
How does a neural network learn?
New weight = Old weight − Learning rate × (Gradient of Error with respect to Weight)

w ← w − η · ∂E/∂w
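A toy numpy sketch of this update rule for a single training example (the one-layer linear “network”, the data and the learning rate are made-up assumptions):

# Sketch of one gradient-descent weight update.
import numpy as np

learning_rate = 0.1
w = np.random.randn(3)            # old weights
x = np.array([0.5, -1.0, 2.0])    # one training input
y_true = 1.0                      # ground-truth label

y_pred = np.dot(w, x)                     # toy linear "network"
error = 0.5 * (y_pred - y_true) ** 2      # squared error for one example
grad_w = (y_pred - y_true) * x            # dError/dw via the chain rule

w = w - learning_rate * grad_w    # new weight = old weight - learning rate * gradient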
Gradient Descent Interpretation
http://scs.ryerson.ca/~aharley/neural-networks/
http://playground.tensorflow.org
What is a Neural Network?
For much more detail, see:
1. Michael Nielsen’s Neural Networks & Deep Learning free online book
http://neuralnetworksanddeeplearning.com/chap1.html
2. Andrej Karpathy’s CS231n notes
http://cs231n.github.io/
2. What is a convolutional neural network?
What is a Convolutional Neural Network?
“like an ordinary neural network but with special types of layers that work well on images”
(math works on numbers)
• A pixel has 3 colour channels (R, G, B)
• Pixel intensity ∈ [0, 255]
• An image has width w and height h
• Therefore an image is w × h × 3 numbers
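A small sketch showing that an image really is just w × h × 3 numbers (uses Pillow + numpy; the file name cat.jpg is a placeholder):

# An image as a numpy array: (height, width, 3) intensities in [0, 255].
import numpy as np
from PIL import Image

img = Image.open("cat.jpg").convert("RGB")  # any RGB image
pixels = np.asarray(img)

print(pixels.shape)         # (height, width, 3) - one number per channel per pixel
print(pixels.dtype)         # uint8, i.e. intensities in [0, 255]
print(pixels[0, 0])         # the (R, G, B) values of the top-left pixel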
First of all, why do we need special layers?
1. Locality – objects tend to have local spatial support
2. Translation invariance – objects can appear anywhere in the image
3. Parameter counts explode in fully-connected nets – recall the 3-layer 28×28-pixel MNIST net had 13,002 parameters… if we instead had 500×500-pixel images, we’d need 4,000,458
4. We can get better results ;) – Keras MLP (98.4% accuracy = 1.6% error) vs CNN (99.25% accuracy = 0.75% error), so the error rate is more than halved
https://www.youtube.com/watch?v=t_TY_5bG9J8
Example Architecture
This is VGGNet – don’t panic, we’ll break it down piece by piece
Convolutions
http://setosa.io/ev/image-kernels/
Convolutions
http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html
New Layer Type: Convolutional Layer
• A 2-D weighted average computed by multiplying a kernel over patches of pixels
• We slide the kernel over all pixels of the image (handling the borders)
• The kernel starts off with “random” values and the network updates (learns) the kernel values (using backpropagation) to try to minimize the loss
• Kernels are shared across the whole image (parameter sharing) – see the sketch below
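A plain-numpy sketch of sliding a 3×3 kernel over a grayscale image (“valid” border handling; the edge-detection kernel values are illustrative, whereas a CNN would learn them):

# Sliding a 3x3 kernel over a grayscale image to produce one activation map.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply and sum
    return out

image = np.random.rand(28, 28)
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)  # in a CNN these values are learned
print(conv2d(image, edge_kernel).shape)              # (26, 26) activation map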
Many Kernels = Many “Activation Maps” = Volume
http://cs231n.github.io/convolutional-networks/
Convolutions
https://github.com/fchollet/keras/blob/master/examples/conv_filter_visualization.py
Convolutions Learn Hierarchical Features
New Layer Type: Max Pooling
• Reduces dimensionality from one layer to the next
• …by replacing an N×N sub-area with its max value
• Makes the network less sensitive to local perturbations
• Makes the network “look” at larger areas of the image at a time, e.g. instead of identifying fur, identify a cat
• Reduces overfitting, since losing information helps the network generalize
(see the sketch below)
http://cs231n.github.io/convolutional-networks/
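A plain-numpy sketch of 2×2 max pooling on a single activation map (assumes even height and width):

# 2x2 max pooling: each 2x2 sub-area is replaced by its max value.
import numpy as np

def max_pool_2x2(activation_map):
    h, w = activation_map.shape
    return activation_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(a))   # shape (2, 2): dimensionality reduced by 4x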
Stack Conv + Pooling Layers and Go Deep
Convolution + max pooling + fully connected + softmax

Flatten the Final “Bottleneck” Layer
Flatten the final 7 × 7 × 512 max-pooling layer
Add a fully-connected dense layer on top

Softmax
Convert scores ∈ ℝ to probabilities ∈ [0, 1]
Final output prediction = the highest-probability class

Bringing it all together: convolution + max pooling + fully connected + softmax (see the sketch below)
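A minimal Keras sketch of this conv + max-pool + flatten + dense + softmax recipe. This is a tiny VGG-style toy network, not the real 16-layer VGGNet, and every layer size here is illustrative:

# Tiny VGG-style network: conv + max pooling + flatten + dense + softmax.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),                 # shrink spatial dimensions
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),                            # the flattened "bottleneck"
    Dense(256, activation="relu"),        # fully-connected layer on top
    Dense(1000, activation="softmax"),    # scores -> probabilities in [0, 1]
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()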
We need labelled training data!
ImageNet
http://image-net.org/explore
1000 object categories
1.2 million training images
ImageNet Top-5 Error Rate
[chart: top-5 error by year and winning model]
• Traditional image-processing methods
• AlexNet (8 layers)
• ZFNet (8 layers)
• GoogLeNet (22 layers)
• ResNet (152 layers)
• SENet (ensemble)
• TSNet (ensemble)
3. How to use a convolutional neural network
Using a Pre-Trained ImageNet-Winning CNN
• We’ve been looking at “VGGNet”
• Oxford Visual Geometry Group (VGG)
• ImageNet 2014 runner-up
• The network is 16 layers (deep!)
• Trained on 4 GPUs for 2–3 weeks (data split across the GPUs)
• Easy to fine-tune (see the sketch below)
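A sketch of loading the pre-trained VGG16 ImageNet model in Keras and classifying a single image (the image path is a placeholder):

# Load pre-trained VGG16 and classify one image against the 1000 ImageNet classes.
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image

model = VGG16(weights="imagenet")            # 16-layer network, ImageNet weights

img = image.load_img("cat.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=5)[0])   # top-5 ImageNet classes with probabilities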
Example: Classifying Product Images
https://github.com/alexcnwy/DeepLearning4ComputerVision
Classifying products into 8 categories
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
Start with a Pre-Trained ImageNet Model
Consumers vs Producers of Machine Learning
…but we don’t want to predict ImageNet categories…
we want to predict something new!
“Transfer Learning” to the rescue!1!
Fine-tuning A CNN To Solve A New Problem
• Cut off the last layer of a pre-trained ImageNet-winning CNN
• Keep the learned feature detectors (convolutions)
• …BUT replace the dense and softmax layers with new ones!
• The network can then learn to predict new (completely different) classes
• Fine-tuning is re-training the new final layer(s)
Fine-tuning A CNN To Solve A New Problem
Before Fine-Tuning
…
After Fine-Tuning
Replace Final Layers of the Pre-trained Model
Fine-tuning A CNN To Solve A New Problem
• Load the pre-trained VGG network
• Remove the final dense & softmax layers that predict the 1000 ImageNet classes
• Replace them with a new dense & softmax layer that predicts 8 categories (see the sketch below)
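A hedged sketch of this fine-tuning recipe in Keras; it is not the talk's exact notebook, and the data directory, optimizer and epoch count are assumptions:

# Fine-tuning sketch: keep VGG16's convolutions, train a new dense + softmax head.
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False              # keep the learned convolutional feature detectors

x = Flatten()(base.output)               # the 7 x 7 x 512 bottleneck, flattened
x = Dense(256, activation="relu")(x)     # new dense layer
out = Dense(8, activation="softmax")(x)  # new softmax over our 8 product categories

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

train = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=32, class_mode="categorical")
model.fit_generator(train, steps_per_epoch=train.samples // 32, epochs=5)  # re-train only the new head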
96.3% accuracy in under 2 minutes for
classifying products into categories
(WITH ONLY 3467 TRAINING IMAGES!!1!)
Fine-tuning is awesome!
Insert obligatory brain analogy
Brief Error Analysis
“Truth” = tee, Prediction = shirt
“Truth” = jacket, Prediction = knitwear
Visual Similarity
• Chop off the VGG dense layers and keep the bottleneck features
• Add a single dense layer with 256 activations
• “Predict” to get the activation vector for each image
• Compute nearest neighbours in the space of these activations (see the sketch below)
https://memeburn.com/2017/06/spree-image-search/
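A hedged sketch of the visual-similarity recipe: a headless VGG16 as feature extractor, then nearest neighbours on the feature vectors. For simplicity this version uses average-pooled bottleneck features and skips the extra 256-unit dense layer from the slide; all file paths are placeholders:

# Visual similarity: VGG16 bottleneck features + nearest neighbours.
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from sklearn.neighbors import NearestNeighbors

extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")  # dense layers chopped off

def featurize(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]          # one feature vector per image

paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]   # your product images
features = np.stack([featurize(p) for p in paths])

nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(features)
dist, idx = nn.kneighbors(featurize("query.jpg").reshape(1, -1))
print([paths[i] for i in idx[0]])           # most "visually similar" images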
https://github.com/alexcnwy/DeepLearning4ComputerVision
Results
Input image (not seen by the model) → top 10 most “visually similar” images
Final Convolutional Layer = Semantic Vector
The final convolutional layer encodes everything the network needs to make predictions

Let’s Visualize the Data with PCA (see the sketch below)
http://setosa.io/ev/principal-component-analysis/
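A small sketch of that PCA visualization with scikit-learn and matplotlib, reusing the `features` array from the visual-similarity sketch above:

# Project high-dimensional bottleneck features down to 2-D with PCA for plotting.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(features)   # (n_images, 2)

plt.scatter(coords[:, 0], coords[:, 1])
plt.title("Bottleneck features projected to 2-D with PCA")
plt.show()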
4. More Advanced Methods
Use a Better Architecture (or all of them!)
“Ensembles win” – learn a weighted average of many models’ predictions (see the sketch below)
cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
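A sketch of a weighted-average ensemble over several Keras models' predicted probabilities (the model names and weights are placeholders):

# Weighted-average ensemble of predicted class probabilities.
import numpy as np

def ensemble_predict(models, weights, x):
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                     # normalize the weights
    preds = [w * m.predict(x) for m, w in zip(models, weights)]
    return np.sum(preds, axis=0)                          # weighted average of probabilities

# probs = ensemble_predict([vgg_model, resnet_model], weights=[0.4, 0.6], x=batch)
# final_class = probs.argmax(axis=-1)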
There are MANY Computer Vision Tasks
Long et al. “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al. “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Semantic Segmentation
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
Object detection
http://blog.romanofoti.com/style_transfer/
Johnson et al. Perceptual losses for real-time style transfer and super-resolution. 2016
Style Transfer
[figure: style transfer as a function, f(content image, style image) = stylized output]
Super-Resolution
https://www.youtube.com/watch?v=LhF_56SxrGk
[figure: pixelated input vs. super-resolution output vs. original]
This image is 3.8 kB
https://arstechnica.com/information-technology/2017/02/google-brain-super-resolution-zoom-enhance/
https://github.com/tdeboissiere/VGG16CAM-keras
Visual Attention
Image Captioning
https://einstein.ai/research/knowing-when-to-look-adaptive-attention-via-a-visual-sentinel-for-image-captioning
Karpathy & Li. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), pp. 664–676
Image Q&A
http://iamaaditya.github.io/2016/04/visual_question_answering_demo_notebook
Video Q&A
https://www.youtube.com/watch?v=UeheTiBJ0Io
king + woman – man ≈ queen
Frome et al. (2013) ‘DeViSE: A Deep Visual-Semantic Embedding Model’, Advances in Neural Information Processing Systems, pp. 2121–2129
CNN + Word2Vec = AWESOME
DeViSE: A Deep Visual-Semantic Embedding Model
Before: encode labels as a 1-hot vector
After: encode labels as 300-d word2vec vectors (FROM A SEPARATE MODEL)
These can be looked up for all the nouns in ImageNet (see the sketch below)
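A hedged Keras sketch of the DeViSE idea: swap the softmax for a 300-d linear layer that regresses onto each label's word2vec vector. The cosine-proximity loss here is a simplification of my own; the DeViSE paper itself uses a ranking loss:

# DeViSE-style head: predict a 300-d word vector instead of a 1000-way softmax.
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
x = Flatten()(base.output)
word_vector = Dense(300, activation="linear")(x)   # predict a 300-d word2vec vector

devise = Model(inputs=base.input, outputs=word_vector)
devise.compile(optimizer="adam", loss="cosine_proximity")
# y_train would be the word2vec vector of each image's label (from a separate word2vec model)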
DeViSE: A Deep Visual-Semantic Embedding Model
wv* = ( wv(fish) + wv(net) ) / 2
…then get the nearest neighbours to wv* (see the sketch below)
https://www.youtube.com/watch?v=uv0gmrXSXVg
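A sketch of this vector arithmetic with gensim word vectors (the pre-trained vector file name is an assumption):

# Word-vector arithmetic: wv* = (wv(fish) + wv(net)) / 2, then nearest neighbours.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

wv_star = (wv["fish"] + wv["net"]) / 2
print(wv.similar_by_vector(wv_star, topn=5))      # nearest neighbours to wv*
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))  # ~ queen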
Multi-Modal Modeling – Mix & Match!
• Structured data, e.g. #beds, #garages, m², etc.
• Features extracted from the description, e.g. with an LSTM
• Features extracted from the photos using a convolutional network
→ Price (see the sketch below)
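A hedged sketch of this mix-and-match architecture with the Keras functional API; every input shape and layer size is illustrative:

# Multi-modal model: structured + text + image features, concatenated, regressed onto price.
from keras.models import Model
from keras.layers import Input, Dense, LSTM, Embedding, Flatten, Conv2D, MaxPooling2D, concatenate

structured_in = Input(shape=(10,))                       # #beds, #garages, m2, ...
text_in = Input(shape=(100,), dtype="int32")             # description as word indices
image_in = Input(shape=(64, 64, 3))                      # property photo

text_feat = LSTM(64)(Embedding(input_dim=20000, output_dim=50)(text_in))

img = Conv2D(32, (3, 3), activation="relu")(image_in)
img = MaxPooling2D((2, 2))(img)
image_feat = Flatten()(img)

merged = concatenate([structured_in, text_feat, image_feat])
price = Dense(1, activation="linear")(Dense(64, activation="relu")(merged))

model = Model(inputs=[structured_in, text_in, image_in], outputs=price)
model.compile(optimizer="adam", loss="mse")              # mean-squared error for regression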
5. Case Studies
Estimating Accident Repair Cost from Photos
• Prototype for a large S.A. insurer
• Detect the car make & model from the license sticker on the windshield
• Predict the repair cost using a learned model
Image & Video Moderation
Large international gay dating app with tens of millions of users uploading hundreds of thousands of photos per day
Segmenting Medical Images
Counting People
Count shoppers, segment on age & gender; facial-recognition loyalty is next
Counting Cars
“Using the final model that has been trained on survey data, we can estimate per capita consumption expenditure for any location where we have daytime satellite imagery.”
“…features visible in daytime satellite imagery, such as roofing material and distance to urban areas, vary roughly linearly with expenditure.”
http://sustain.stanford.edu/predicting-poverty/
Detecting Potholes
Extracting Data from Product Catalogues
Real-time ATM Video Classification
Optical Sorting
https://www.youtube.com/watch?v=Xf7jaxwnyso
We are hiring!
(and available for hire ;)
alex @ numberboost.com
GET IN TOUCH!
Alex Conway
alex @ numberboost.com
@alxcnwy

Deep Learning for Computer Vision - PyconDE 2017

Editor's Notes

  • #74 SENet = Squeeze and Excitation Net. TSNet = Trimps-Soushen
  • #87 Gets 88% accuracy in classifying products into categories