4. The neuron
x = [x1, x2, …, xI] (inputs)
w = [w1, w2, …, wI] (weights)
u = x · w
The neuron's output is an activation function applied to u.
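The weighted sum and activation above can be sketched in plain NumPy. A minimal sketch: the sigmoid activation, the optional bias, and the example values are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def neuron(x, w, b=0.0):
    """Single neuron: weighted sum of the inputs, then an activation.
    Sigmoid is used here as one possible activation function."""
    u = np.dot(x, w) + b             # u = x . w (plus an optional bias)
    return 1.0 / (1.0 + np.exp(-u))  # sigmoid squashes u into (0, 1)

x = np.array([1.0, 0.5, -0.2])   # I = 3 input features
w = np.array([0.4, -0.3, 0.8])   # one weight per input
y = neuron(x, w)                 # a single number in (0, 1)
```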
5. The neural network
x =
x11, x12, …, x1I
x21, x22, …, x2I
…
xm1, xm2, …, xmI
(a matrix of m samples, each with I features)
y = [y1, y2, …, ym]
(m labels, one per sample)
Accuracy = Number of correct predictions / Total number of predictions
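The accuracy ratio can be checked in a few lines of NumPy; the sample predictions below are made up for illustration.

```python
import numpy as np

# Accuracy = number of correct predictions / total number of predictions
predicted = np.array([1, 0, 1, 1, 0, 1])
actual    = np.array([1, 0, 0, 1, 0, 1])   # the ground truth
accuracy = np.mean(predicted == actual)    # fraction of matching entries
```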
6. The training process
• The difference between the predicted output and the actual output (aka
ground truth) is called the prediction loss.
• There are different ways to compute it (loss functions).
• The purpose of training is to iteratively minimize loss and maximize
accuracy for a given data set.
• We need a way to adjust weights (aka parameters) in order to gradually
minimize loss: backpropagation combined with an optimization algorithm.
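Two of the loss functions alluded to above, sketched in NumPy. The choice of mean squared error and binary cross-entropy, and the example values, are illustrative; the slides do not single out specific loss functions.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])   # ground truth
y_pred = np.array([0.9, 0.2, 0.7])   # network predictions

# Mean squared error: average squared difference, common for regression.
mse = np.mean((y_true - y_pred) ** 2)

# Binary cross-entropy: heavily penalizes confident wrong predictions,
# common for binary classification.
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```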
7. Paul Werbos
Artificial Intelligence pioneer
IEEE Neural Network Pioneer Award
1974 – Backpropagation
The backpropagation algorithm acts as an error-correcting
mechanism at the level of each neuron,
helping the network to learn effectively.
8. Stochastic Gradient Descent (SGD)
Imagine you stand on top of a mountain with
skis strapped to your feet. You want to get down
to the valley as quickly as possible, but there is
fog and you can only see your immediate
surroundings. How can you get down the
mountain as quickly as possible? You look
around and identify the steepest path down, go
down that path for a bit, again look around and
find the new steepest path, go down that path,
and repeat—this is exactly what gradient
descent does.
Tim Dettmers
University of Lugano
2015
Source: https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/
The « step size » is called the learning rate.
(figure: gradient descent on a surface z = f(x, y))
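The ski-slope analogy can be sketched directly: repeatedly step against the gradient, scaled by the learning rate. The surface z = f(x, y) = x² + y², the starting point, and the learning rate below are illustrative choices.

```python
import numpy as np

def grad(x, y):
    """Analytical gradient of f(x, y) = x^2 + y^2."""
    return np.array([2 * x, 2 * y])

point = np.array([4.0, -3.0])   # starting position on the "mountain"
learning_rate = 0.1             # the "step size"

for _ in range(100):
    # Look around, find the steepest way down, take a small step.
    point = point - learning_rate * grad(*point)
# point is now very close to the minimum at (0, 0)
```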
9. There is such a thing as « learning too well »
• If a network is large enough and given enough time, it will perfectly
learn a data set (universal approximation theorem).
• But what about new samples? Can it also predict them correctly?
• In other words, does the network generalize well or not?
• To prevent overfitting, we need to know when to stop training.
• The training data set is not enough.
10. Data sets
Full data set = Training data set + Validation data set + Test data set
Training set: this data set is used to
adjust the weights on the neural
network.
Validation set: this data set is used
to minimize overfitting. You're not
adjusting the weights of the network
with this data set; you're just
verifying that any increase in
accuracy over the training data set
actually yields an increase in
accuracy over a data set that has not
been shown to the network before.
Test set: this data set is used
only for testing the final weights in
order to benchmark the actual
predictive power of the network.
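A minimal sketch of splitting a full data set three ways. The 80/10/10 ratio, the shuffle, and the stand-in data are illustrative choices; the slides do not prescribe specific proportions.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = np.arange(1000)   # stand-in for 1000 samples
rng.shuffle(data)        # shuffle before splitting, so each subset is representative

# 80% training, 10% validation, 10% test
train, validation, test = data[:800], data[800:900], data[900:]
```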
11. Training
Training data set → Training → Trained neural network
Hyperparameters: number of epochs, batch size, learning rate
12. Validation
Validation data set → Trained neural network → Validation accuracy
(prediction at the end of each epoch)
Stop training when validation accuracy stops increasing.
Saving parameters at the end of each epoch is a good idea.
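The stopping rule above can be sketched as a small early-stopping loop. `train_one_epoch` and `validate` are hypothetical stand-ins for a framework's training and evaluation calls, and the simulated accuracy history is made up.

```python
def train_with_early_stopping(train_one_epoch, validate, patience=3, max_epochs=100):
    """Stop when validation accuracy has not improved for `patience` epochs,
    and keep the best parameters seen so far."""
    best_accuracy, best_params, bad_epochs = 0.0, None, 0
    for epoch in range(max_epochs):
        params = train_one_epoch()    # adjust weights on the training set
        accuracy = validate(params)   # measure accuracy on the validation set
        if accuracy > best_accuracy:
            best_accuracy, best_params = accuracy, params  # save this checkpoint
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                 # validation accuracy stopped increasing
    return best_params, best_accuracy

# Simulated run: accuracy plateaus after epoch 3, so training stops early.
history = iter([0.50, 0.60, 0.70, 0.70, 0.69, 0.68, 0.90])
params, best = train_with_early_stopping(lambda: "weights", lambda p: next(history))
```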
13. Convolutional Neural Networks
LeCun, 1998: handwritten digit recognition, 32x32 pixels
Feature extraction and downsampling allow smaller networks
https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/
14. Long Short Term Memory (LSTM) Networks
• An LSTM neuron computes its output
based on the current input and a
previous state.
• LSTM networks have memory.
• They're great at predicting
sequences, e.g. machine
translation.
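A minimal NumPy sketch of one LSTM step, showing how the output depends on both the current input and the previous state (h, c), which is what gives the network its memory. The weights are random stand-ins, not trained values, and the gate layout is one common convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    # One matrix multiply produces pre-activations for all four gates.
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)                    # input, forget, output gates + candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update the cell state (the "memory")
    h = sigmoid(o) * np.tanh(c)                    # new hidden state / output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(size=(4 * n_hidden, n_in + n_hidden))
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
# Feed a two-step sequence; h and c carry information across steps.
for x in (np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])):
    h, c = lstm_step(x, h, c, W)
```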
16. Apache MXNet: layers and operations
• A single neuron with inputs and weights can learn a simple truth table, e.g.:
Input (1, 1) → Output 1
Input (1, 0) → Output 1
Input (0, 0) → Output 0
• Fully connected layer:
mx.sym.FullyConnected(data, num_hidden=128)
• Convolution (20 filters of size 5x5):
mx.sym.Convolution(data, kernel=(5,5), num_filter=20)
• Pooling (2x2 window, stride 2) downsamples feature maps.
Max pooling: [[4, 2], [2, 0]] → 4. Average pooling: [[4, 2], [2, 0]] → 2.
mx.sym.Pooling(data, pool_type="max", kernel=(2,2), stride=(2,2))
• Activation, with act_type one of "relu", "tanh", "sigmoid", "softrelu":
mx.sym.Activation(data, act_type="xxxx")
• Embedding maps a word index (e.g. 1, 3, …, 4) to a dense vector
(e.g. "Queen" → 0.2, -0.1, …, 0.7), capturing relationships such as
cos(w, queen) = cos(w, king) - cos(w, man) + cos(w, woman):
mx.symbol.Embedding(data, input_dim, output_dim=k)
• Unrolled LSTM layer:
lstm.lstm_unroll(num_lstm_layer, seq_len, len, num_hidden, num_embed)
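The max/average pooling values quoted above can be checked in NumPy; this applies each operation to a single 2x2 window for illustration.

```python
import numpy as np

# A single 2x2 pooling window over the feature map values [[4, 2], [2, 0]].
window = np.array([[4, 2],
                   [2, 0]])
max_pooled = window.max()    # max pooling keeps the strongest activation
avg_pooled = window.mean()   # average pooling keeps the mean activation
```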
17. Applications
• Inputs: Image, Video, Speech, Text
• Image → Image Labels: "Bicycle", "People", "Road", "Sport"
• Image → Image Caption: "People Riding Bikes"
• Image → Neural Art, Face Search, Image Segmentation
• Video → Events
• Text → Machine Translation: "People Riding Bikes" → "Οι άνθρωποι ιππασίας ποδήλατα" (Greek)
• Output and training: mx.sym.SoftmaxOutput, mx.model.FeedForward, model.fit
18. Demos
1. Image classification with pre-trained models (CNN)
2. Image classification from scratch (MLP, CNN)
3. Machine Translation with Sockeye (LSTM)
4. AI, IoT and robots
20. Amazon Rekognition
• Object and Scene Detection
• Facial Analysis
• Face Comparison
• Facial Recognition
AI, IoT and robots
Services: Amazon Polly, Amazon Rekognition, AWS IoT
Domains: device, IoT, robot, Internet
AWS IoT topics: JohnnyPi/move, JohnnyPi/scan, JohnnyPi/speak, JohnnyPi/see
Amazon Echo: skill interface and skill service
Pre-trained model for image classification
More information at
medium.com/@julsimon