Building Deep Learning Applications with Apache MXNet and Gluon
Cyrus Vahid <cyrusmv@amazon.com>
Principal Evangelist, AI Labs – MXNet
Aug 2018
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Background
Deductive Reasoning
P  Q  |  P ∧ Q  |  P ∨ Q  |  P → Q
T  T  |    T    |    T    |    T
T  F  |    F    |    T    |    F
F  T  |    F    |    T    |    T
F  F  |    F    |    F    |    T
• 𝑃 = 𝑇 ∧ 𝑄 = 𝑇 ∴ 𝑃 ∧ 𝑄 = 𝑇
• 𝑃 ∧ 𝑄 ∴ 𝑃 → 𝑄; ∼ 𝑃 ∴ 𝑃 → 𝑄
• P → Q
P
_________
∴ Q
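A quick sanity check of the table and of modus ponens, enumerating every truth assignment (an illustrative snippet, not from the original deck):

from itertools import product

for p, q in product([True, False], repeat=2):
    implies = (not p) or q                      # material implication P -> Q
    print(p, q, p and q, p or q, implies)       # reproduces the table above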
Rule-Based Programming
Plausible Reasoning
Programming with Data
1. Understand your data
2. Algorithmically discover hidden patterns
3. Generalize the solution into an algorithm
4. Apply the solution to unseen patterns
5. Make predictions
Fundamentals
Biological & Artificial Neuron
Source: http://cs231n.github.io/neural-networks-1/
Perceptron
[Diagram: inputs I1, I2 and bias B connect to output O through weights w1, w2, w3]

$f(x_i, w_i) = \Phi\left(b + \sum_i w_i x_i\right)$

$\Phi(x) = \begin{cases} 1, & \text{if } x \ge 0.5 \\ 0, & \text{if } x < 0.5 \end{cases}$
Perceptron
[Diagram: the same perceptron with weights w1 = 1, w2 = 1 and bias weight w3 = -1.5]

With $I_1 = I_2 = B_1 = 1$: $O_1 = 1 \cdot 1 + 1 \cdot 1 + (-1.5) \cdot 1 = 0.5 \;\therefore\; \Phi(O_1) = 1$

With $I_2 = 0$, $I_1 = B_1 = 1$: $O_1 = 1 \cdot 1 + 0 \cdot 1 + (-1.5) \cdot 1 = -0.5 \;\therefore\; \Phi(O_1) = 0$

With these weights the unit computes the logical AND of I1 and I2 (a worked sketch follows below).
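A minimal sketch of this unit in plain Python, using the slide's weights (1, 1, -1.5) and the step activation thresholded at 0.5:

def phi(x):
    return 1 if x >= 0.5 else 0                 # step activation from the previous slide

def perceptron(i1, i2, w1=1.0, w2=1.0, wb=-1.5, b=1):
    return phi(w1 * i1 + w2 * i2 + wb * b)

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, '->', perceptron(i1, i2)) # 1 only when both inputs are 1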
Non-Linearity
P  Q  |  P ∧ Q  |  P ⊕ Q
T  T  |    T    |    F
T  F  |    F    |    T
F  T  |    F    |    T
F  F  |    F    |    F
[Plots: P ∧ Q is linearly separable – a single line divides its true and false points; P ⊕ Q is not – no single line can separate them]
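No single threshold unit can compute XOR, but a two-layer composition can; the hand-picked weights below are illustrative, not from the deck:

def phi(x):
    return 1 if x >= 0.5 else 0

def unit(i1, i2, w1, w2, b):
    return phi(w1 * i1 + w2 * i2 + b)

def xor(p, q):
    h1 = unit(p, q, 1, 1, -0.5)                 # OR
    h2 = unit(p, q, -1, -1, 1.5)                # NAND
    return unit(h1, h2, 1, 1, -1.5)             # AND(OR, NAND) = XOR

for p in (0, 1):
    for q in (0, 1):
        print(p, q, '->', xor(p, q))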
Deep Learning
[Diagram: input layer → hidden layers → output]
A non-linearity is applied to the output of each hidden layer to transform the output into a continuous range.
The “Learning” in Deep Learning
[Diagram: an input X and its label pass forward through weights (e.g., 0.4, 0.3, 0.2, 0.9, ...) to produce a prediction X1; when X1 != X, backpropagation (gradient descent) adjusts each weight by a small step, e.g., 0.4 ± δ, producing new weights for the next pass]
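A minimal Gluon sketch of that loop, with a toy model and made-up data (not the deck's own code):

import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(1)                          # toy one-layer model
net.initialize()
loss_fn = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

X = mx.nd.random.uniform(shape=(10, 2))          # toy inputs
y = X.sum(axis=1, keepdims=True)                 # toy labels

for epoch in range(5):
    with autograd.record():                      # record the forward pass
        loss = loss_fn(net(X), y)
    loss.backward()                              # backpropagate the error
    trainer.step(X.shape[0])                     # gradient-descent update -> new weights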
Activation Function (Φ)
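The activation plots did not survive the export; for reference, common choices for Φ computed with MXNet's ndarray API (an illustrative snippet):

import mxnet as mx

x = mx.nd.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(mx.nd.sigmoid(x))                          # squashes into (0, 1)
print(mx.nd.tanh(x))                             # squashes into (-1, 1)
print(mx.nd.relu(x))                             # zeroes negative inputs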
Inputs: Preprocessing, Batches, Epochs
Preprocessing
 Random separation of data into training, validation, and test sets
 Necessary for measuring the accuracy of the model
Batch
 The amount of data propagated through the network at each iteration
 Enables faster optimization through shorter iteration cycles
Epoch
 A complete pass through all of the training data
 Optimization runs for multiple epochs to reduce the error rate (see the sketch below)
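A hedged sketch of how batches and epochs map onto Gluon's data utilities (shapes and batch size are made up for illustration):

import mxnet as mx
from mxnet import gluon

X = mx.nd.random.uniform(shape=(1000, 28 * 28))
y = mx.nd.random.randint(0, 10, shape=(1000,))
dataset = gluon.data.ArrayDataset(X, y)
loader = gluon.data.DataLoader(dataset, batch_size=64, shuffle=True)

for epoch in range(3):                           # one epoch = one full pass
    for data, label in loader:                   # one iteration = one batch of 64
        pass                                     # forward/backward/update goes here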
Inputs: Encoding MNIST data
https://www.tensorflow.org/get_started/mnist/beginners
Inputs: Encoding Pictures into Data
A 7 × 7 × 3 matrix (width × height × RGB channels)
Classification with the Softmax Function
Softmax converts the output layer into probabilities – necessary for classification
Softmax Function
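The formula image did not survive the export; for reference, softmax maps logits $z$ to $\sigma(z)_j = e^{z_j} / \sum_k e^{z_k}$. A small sketch, shifted by the max for numerical stability:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))                    # shift so exp() cannot overflow
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))        # probabilities summing to 1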
Loss Function
• An objective function that quantifies how successful the model was in its predictions
• A measure of the difference between a neural net's prediction and the actual value – that is, the error
• Typically we use cross-entropy loss, which adjusts the plain loss calculation to mitigate learning slowdown
• Backpropagation is performed to calculate the error contribution of each neuron after processing one batch (see the sketch below)
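A minimal Gluon sketch of the loss-plus-backpropagation step described above; the model and batch are placeholders:

import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(10)                         # placeholder classifier head
net.initialize()
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

data = mx.nd.random.uniform(shape=(32, 64))      # one toy batch
label = mx.nd.random.randint(0, 10, shape=(32,)).astype('float32')

with autograd.record():
    loss = loss_fn(net(data), label)             # cross-entropy per example
loss.backward()                                  # error contribution of each parameter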
Gradient Descent
Iteratively update parameters to find the values that optimize the objective function
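Each iteration applies the update

$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$

where $J$ is the objective (loss) function and $\eta$ is the learning rate discussed a few slides on.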
Weight Initialization
https://stats.stackexchange.com/questions/47590/what-are-good-initial-weights-in-a-neural-network
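In Gluon the scheme is chosen at initialization time; a one-line example with Xavier initialization, which keeps activation variance stable across layers:

import mxnet as mx
from mxnet import gluon

net = gluon.nn.Dense(64)
net.initialize(mx.init.Xavier())                 # Xavier-initialized weights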
Stochastic Gradient Descent
Gradient Descent
A single iteration of the parameter update runs through ALL of the training data.

Stochastic Gradient Descent
A single iteration of the parameter update runs through a BATCH of the training data.
Optimizers
http://imgur.com/a/Hqolp
Learning Rates
• Learning Rate: a scalar that determines how far to step in the direction of steepest descent
• Online Learning: weights are updated after every example (slow to learn)
• Batch Learning: weights are updated after all training data is processed (hard to optimize)
• Mini-Batch: a combination of both – we break the training set into smaller batches and update the weights after each mini-batch
Training and Validation Data
[Plot: training vs. validation error over the course of training; the best model is where validation error is lowest]
When accuracy is evaluated only on the training set, we run into overfitting.
Dropout
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting", JMLR 2014
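In Gluon, dropout is just a layer; a minimal sketch with an illustrative rate of 0.5:

from mxnet import gluon

net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(256, activation='relu'))
    net.add(gluon.nn.Dropout(0.5))               # randomly zero half the activations while training
    net.add(gluon.nn.Dense(10))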
MXNet
Computational Dependency/Graph
• 𝑧 = 𝑥 ⋅ 𝑦
• 𝑘 = 𝑎 ⋅ 𝑏
• 𝑡 = 𝜆𝑧 + 𝑘
[Graph: multiply nodes compute z = x · y and k = a · b independently; u = λ · z feeds an add node together with k to produce t, so the two products can execute in parallel]
Computational Dependency/Graph
import mxnet as mx

net = mx.sym.Variable('data')
net = mx.sym.FullyConnected(net, name='fc1', num_hidden=64)
net = mx.sym.Activation(net, name='relu1', act_type="relu")
net = mx.sym.FullyConnected(net, name='fc2', num_hidden=26)
net = mx.sym.SoftmaxOutput(net, name='softmax')
mx.viz.plot_network(net)
Scaling with MXNet
[Plot: throughput scaling vs. the ideal across 1–256 GPUs for Inception v3, ResNet, and AlexNet, reaching roughly 88% efficiency]
Imperative vs Symbolic Programming
Imperative
• Execution flow is the same as the flow of the code
• Flexible but inefficient
• Memory: 4 * 10 * 8 = 320 bytes
• Interim values are available
• No operation folding
• Familiar coding paradigm

Symbolic
• Abstract functions are defined and compiled first; data binding happens next
• Efficient
• Memory: 2 * 10 * 8 = 160 bytes
• Interim values are not available
• Operation folding: multiple operations are folded into one, so we run one op on the GPU instead of many – possible because the whole computation graph is available ahead of time (contrast sketch below)
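The contrast in one toy example (illustrative, not from the deck):

import mxnet as mx

# Imperative: every line executes immediately on real data
a = mx.nd.ones((10,)) * 2
b = a + 1                                        # interim value available right away
print(b.asnumpy())

# Symbolic: build an abstract graph first, bind data to it later
x = mx.sym.Variable('x')
y = x * 2 + 1                                    # nothing executes yet
ex = y.bind(ctx=mx.cpu(), args={'x': mx.nd.ones((10,))})
print(ex.forward()[0].asnumpy())                 # graph runs only now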
Gluon
Evolution of DL Frameworks
Advantages of the Gluon API
Simple, Easy-to-Understand Code
 Neural networks can be defined using simple, clear, concise code
 Plug-and-play neural network building blocks – including predefined layers, optimizers, and initializers

Flexible, Imperative Structure
 Eliminates the rigidity of neural network model definition and brings together the model with the training algorithm
 Intuitive, easy-to-debug, familiar code

Dynamic Graphs
 Neural networks can change in shape or size during the training process to address advanced use cases where the size of the data fed is variable
 An important area of innovation in Natural Language Processing (NLP)

High Performance
 No sacrifice with respect to training speed
 When it is time to move from prototyping to production, easily cache neural networks for high performance and a reduced memory footprint (a minimal example follows below)
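A minimal Gluon sketch illustrating the first two points (a toy MLP; shapes are inferred on the first forward pass):

import mxnet as mx
from mxnet import gluon

net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(64, activation='relu'))
    net.add(gluon.nn.Dense(10))
net.initialize(mx.init.Xavier())
out = net(mx.nd.random.uniform(shape=(32, 784)))  # imperative call, like any function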
Code
https://github.com/cyrusmvahid/GluonBootcamp/tree/master/labs/fancy_mnist
What’s New
• GluonCV, a Deep Learning Toolkit for Computer Vision
• Features:
• Training scripts that reproduce SOTA results reported in the latest papers
• A large set of pre-trained models
• Carefully designed APIs and easy-to-understand implementations
• Community support
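An illustrative one-liner from the model zoo (the model name is one of many available):

from gluoncv import model_zoo

net = model_zoo.get_model('resnet18_v1', pretrained=True)   # downloads pre-trained weights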
What’s New
• GluonNLP, a Deep Learning Toolkit for Natural
Language Processing
• Features:
• Training scripts to reproduce SOTA results reported in research
papers.
• Pre-trained models for common NLP tasks.
• Carefully designed APIs that greatly reduce the implementation
complexity.
• Community support.
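An illustrative example of the pre-trained embeddings (the GloVe source name follows the GluonNLP docs):

import gluonnlp as nlp

glove = nlp.embedding.create('glove', source='glove.6B.50d')  # pre-trained GloVe vectors
print(glove['beautiful'].shape)                               # (50,)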
What’s New
• MXNet backend for Keras: Keras is a high-level neural networks API, written in Python and capable of running on top of Apache MXNet, TensorFlow, CNTK, and Theano.
• Performance: the MXNet backend is scalable and fast for both new projects and existing code, so it can improve the performance of existing models with minimal effort. For benchmarks, see:
https://github.com/awslabs/keras-apache-mxnet/tree/master/benchmark
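Per the keras-apache-mxnet README, switching is a matter of installing the keras-mxnet package and pointing ~/.keras/keras.json at the new backend (channels_first is the layout recommended there for MXNet performance):

{
    "backend": "mxnet",
    "image_data_format": "channels_first"
}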
References
• MXNet: http://mxnet.incubator.apache.org/
• Gluon 60-min crash course: https://gluon-crash-course.mxnet.io/
• Deep learning book based on Gluon: https://gluon.mxnet.io/
• GluonCV: https://gluon-cv.mxnet.io/
• GluonNLP: https://gluon-nlp.mxnet.io/
• Keras-mxnet: https://github.com/awslabs/keras-apache-mxnet
Thank you!
cyrusmv@amazon.com