Slides from my talk on deep learning for computer vision at PyConZA on 2017/10/06.
Description:
The state-of-the-art in image classification has skyrocketed thanks to the development of deep convolutional neural networks and increases in the amount of data and computing power available to train them. The top-5 error rate in the ImageNet competition to predict which of 1000 classes an image belongs to has plummeted from 28% error in 2010 to just 2.25% in 2017 (human-level error is around 5%).
In addition to being able to classify objects in images (including not hotdogs), deep learning can be used to automatically generate captions for images, convert photos into paintings, detect cancer in pathology slide images, and help self-driving cars ‘see’.
The talk will give an overview of the cutting edge and some of the core mathematical concepts, and will also include a short code-first tutorial showing how easy it is to get started using deep learning for computer vision in Python.
1. Deep Learning for Computer Vision
Executive-ML 2017/09/21
Neither Proprietary nor Confidential – Please Distribute ;)
Alex Conway
alex @ numberboost.com
@alxcnwy
PyConZA’17
3. Check out the Deep Learning Indaba videos & practicals!
http://www.deeplearningindaba.com/videos.html
http://www.deeplearningindaba.com/practicals.html
20. Deep learning is Magic… Deep learning is Magic… Deep learning is EASY!
21. 1. What is a neural network?
2. What is a convolutional neural network?
3. How to use a convolutional neural network
4. More advanced methods
5. Case studies & applications
23. What is a neuron?
• 3 inputs [x1, x2, x3]
• 3 weights [w1, w2, w3]
• Element-wise multiply and sum
• Apply activation function f
• Often add a bias too (weight of 1) – not shown
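The neuron computation above fits in a few lines of Python (a minimal illustration; the ReLU activation and the example input/weight values are my own, not from the slides):

```python
import numpy as np

def neuron(x, w, b=0.0, f=lambda z: max(0.0, z)):
    """Element-wise multiply inputs by weights, sum, add bias,
    then apply the activation function f (ReLU by default)."""
    z = float(np.dot(x, w)) + b
    return f(z)

# 3 inputs, 3 weights: relu(0.5 - 2.0 + 0.75) = relu(-0.75) = 0.0
print(neuron([1.0, 2.0, 3.0], [0.5, -1.0, 0.25]))
```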
24. What is an Activation Function?
Sigmoid, Tanh, ReLU
Nonlinearities … “squashing functions” … transform a neuron’s output
NB: sigmoid output is in [0, 1]
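The three squashing functions named above are one-liners in NumPy (a minimal sketch):

```python
import numpy as np

def sigmoid(z):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes any real number into (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, z)
```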
25. What is a (Deep) Neural Network?
[Diagram: inputs → hidden layer 1 → hidden layer 2 → hidden layer 3 → outputs]
Outputs of one layer are inputs into the next layer
26. How does a neural network learn?
• We need labelled examples: “training data”
• We initialize the network weights randomly, so we initially get random predictions
• For each labelled training data point, we calculate the error between the network’s predictions and the ground-truth labels
• Use ‘backpropagation’ (the chain rule) to update the network parameters (weights + convolutional filters) in the opposite direction to the gradient of the error
27. How does a neural network learn?
new weight = old weight – learning rate × (gradient of error with respect to weight)
“How much the error increases when we increase this weight”
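The update rule above is a one-line gradient-descent step (a minimal sketch; names and example values are illustrative):

```python
def sgd_step(weight, grad, learning_rate=0.01):
    """One gradient-descent update: move the weight in the
    opposite direction to dError/dweight."""
    return weight - learning_rate * grad

# if increasing the weight increases the error (grad > 0), the weight shrinks
print(sgd_step(1.0, 2.0, learning_rate=0.1))
```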
30. What is a Neural Network?
For much more detail, see:
1. Michael Nielsen’s Neural Networks & Deep Learning free online book
http://neuralnetworksanddeeplearning.com/chap1.html
2. Andrej Karpathy’s CS231n notes
http://cs231n.github.io/
32. What is a Convolutional Neural Network?
“Like an ordinary neural network but with special types of layers that work well on images”
(math works on numbers)
• Pixel = 3 colour channels (R, G, B)
• Pixel intensity ∈ [0, 255]
• Image has width w and height h
• Therefore an image is w × h × 3 numbers
33. Example Architecture
This is VGGNet – don’t panic, we’ll break it down piece by piece
37. New Layer Type: Convolutional Layer
• 2-D weighted sum as the kernel is multiplied over patches of pixels
• We slide the kernel over all pixels of the image (handling borders)
• The kernel starts off with “random” values and the network updates (learns) the kernel values (using backpropagation) to try to minimize the loss
• Kernels are shared across the whole image (parameter sharing)
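A naive version of this sliding-kernel operation (a sketch ignoring border padding, i.e. “valid” mode; real frameworks use much faster implementations):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over every pixel patch of a 2-D image:
    element-wise multiply the patch by the kernel and sum."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```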
38. Many Kernels = Many “Activation Maps” = Volume
http://cs231n.github.io/convolutional-networks/
46. New Layer Type: Max Pooling
• Reduces dimensionality from one layer to the next
• …by replacing each N×N sub-area with its max value
• Makes the network “look” at larger areas of the image at a time, e.g. instead of identifying fur, identify a cat
• Reduces overfitting, since losing information helps the network generalize
http://cs231n.github.io/convolutional-networks/
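The N×N-max idea above can be sketched with a reshape trick (a minimal illustration for 2-D inputs; deep learning frameworks provide this as a built-in layer):

```python
import numpy as np

def max_pool(x, n=2):
    """Replace each non-overlapping n x n patch with its maximum value."""
    h, w = x.shape
    x = x[:h - h % n, :w - w % n]  # crop so both dims divide evenly by n
    return x.reshape(h // n, n, w // n, n).max(axis=(1, 3))
```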
48. Stack Conv + Pooling Layers and Go Deep
Convolution + max pooling + fully connected + softmax
52. Flatten the Final “Bottleneck” Layer
Flatten the final 7 x 7 x 512 max pooling layer
Add a fully-connected dense layer on top
53. Bringing it all together
Convolution + max pooling + fully connected + softmax
54. Softmax
Convert scores ∈ ℝ to probabilities ∈ [0, 1]
Final output prediction = highest-probability class
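The score-to-probability conversion is a few lines of NumPy (a minimal sketch; the max-subtraction is a standard numerical-stability trick, and the example scores are made up):

```python
import numpy as np

def softmax(scores):
    """Convert real-valued scores into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

probs = softmax([2.0, 1.0, 0.1])
print(probs.argmax())  # index of the highest-probability class
```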
62. Using a Pre-Trained ImageNet-Winning CNN
• We’ve been looking at “VGGNet”
• Oxford Visual Geometry Group (VGG)
• ImageNet 2014 runner-up
• Network is 16 layers (deep!)
• Easy to fine-tune
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
67. Fine-tuning a CNN to Solve a New Problem
• Cut off the last layer of a pre-trained ImageNet-winning CNN
• Keep the learned network (convolutions) but replace the final layer
• Can learn to predict new (completely different) classes
• Fine-tuning is re-training the new final layer to learn the new task
71. Fine-tuning a CNN to Solve a New Problem
• Fix the weights in the convolutional layers (set trainable=False)
• Remove the final dense layer that predicts the 1000 ImageNet classes
• Replace it with a new dense layer to predict 9 categories
88% accuracy in under 2 minutes for classifying products into categories
Fine-tuning is awesome!
Insert obligatory brain analogy
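The three steps above map directly onto Keras (a sketch in the spirit of the Keras blog post linked earlier, not the exact code from the talk; the 224×224 input size, Adam optimizer, and function name are my assumptions, while the 9 categories come from the slide):

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

def build_finetune_model(num_classes, weights="imagenet"):
    """Frozen VGG16 convolutional base + new trainable softmax head."""
    base = VGG16(weights=weights, include_top=False,
                 input_shape=(224, 224, 3))
    for layer in base.layers:          # 1. fix the convolutional weights
        layer.trainable = False
    x = Flatten()(base.output)         # 2. flatten the 7 x 7 x 512 bottleneck
    out = Dense(num_classes,           # 3. new final layer for our classes
                activation="softmax")(x)
    model = Model(base.input, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_finetune_model(9)  # downloads ImageNet weights on first use
# model.fit(train_images, train_labels, ...)
```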
72.
73. Visual Similarity
• Chop off the last 2 VGG layers
• Use the dense layer with 4096 activations
• Compute nearest neighbours in the space of these activations
https://memeburn.com/2017/06/spree-image-search/
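The nearest-neighbour step can be sketched as follows (a minimal illustration; the slides don’t specify a distance metric, so cosine similarity is my assumption, and the 4096-d vectors would come from the chopped VGG model):

```python
import numpy as np

def nearest_neighbours(query_vec, catalogue_vecs, k=5):
    """Return indices of the k catalogue embeddings most similar
    (by cosine similarity) to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    c = catalogue_vecs / np.linalg.norm(catalogue_vecs, axis=1, keepdims=True)
    sims = c @ q                     # cosine similarity to each catalogue item
    return np.argsort(-sims)[:k]     # highest similarity first
```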
76. Final Convolutional Layer = Semantic Vector
• The final convolutional layer encodes everything the network needs to make predictions
• The dense layer added on top and the softmax layer both have lower dimensionality
80. Semantic Segmentation
Long et al., ‘Fully Convolutional Networks for Semantic Segmentation’, CVPR 2015
Noh et al., ‘Learning Deconvolution Network for Semantic Segmentation’, ICCV 2015
91. CNN + Word2Vec = AWESOME
king + woman – man ≈ queen
Frome et al. (2013) ‘DeViSE: A Deep Visual-Semantic Embedding Model’, Advances in Neural Information Processing Systems, pp. 2121–2129
92. DeViSE: A Deep Visual-Semantic Embedding Model
Before: encode labels as one-hot vectors
93. DeViSE: A Deep Visual-Semantic Embedding Model
After: encode labels as word2vec vectors (from a separate model)
Can look these up for all the nouns in ImageNet: 300-d word2vec vectors
94. DeViSE: A Deep Visual-Semantic Embedding Model
wv* = ( wv(fish) + wv(net) ) / 2
…then get nearest neighbours to wv*
https://www.youtube.com/watch?v=uv0gmrXSXVg
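The averaging-and-lookup trick above can be sketched with toy vectors (the 3-d “word vectors” below are entirely made up for illustration; real word2vec vectors are 300-d and come from a trained model):

```python
import numpy as np

# toy 3-d "word vectors" — values invented for this example only
wv = {
    "fish":    np.array([0.9, 0.1, 0.0]),
    "net":     np.array([0.1, 0.9, 0.0]),
    "fishnet": np.array([0.5, 0.5, 0.1]),
    "car":     np.array([0.0, 0.0, 1.0]),
}

wv_star = (wv["fish"] + wv["net"]) / 2   # average the two word vectors

def nearest(vec, vocab):
    """Return the word whose vector has the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda word: cos(vec, vocab[word]))

print(nearest(wv_star, wv))  # with these toy vectors: "fishnet"
```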
96. Estimating Accident Repair Cost from Photos
Prototype for a large SA insurer
• Detect car make & model from the registration disk
• Predict repair cost using a learned model
97. Image & Video Moderation
Large international gay dating app with tens of millions of users uploading hundreds of thousands of photos per day