Introduction to Deep Learning and neon at Galvanize

Proprietary and confidential. Do not distribute.
Introduction to Deep Learning and Neon
MAKING MACHINES SMARTER.™
Kyle H. Ambert, PhD 
Senior Data Scientist
May 25th, 2017
@TheKyleAmbert

Nervana Systems Proprietary
About me & Intel’s Artificial Intelligence Products Group (AIPG)
Together, we create production deep learning solutions in multiple
domains, while advancing the field of applied analytics and optimization.

Intel’s Interest in Analytics
• To provide the infrastructure for the fastest time-to-insight
• To create tools that enable scientists to think about their research, rather than their process
• To enable users to ask bigger questions
Bigger Data
• Image: 1000 KB / picture
• Audio: 5000 KB / song
• Video: 5,000,000 KB / movie
Better Hardware
• Transistor density doubles every 18 months
• Cost / GB in 1995: $1000.00
• Cost / GB in 2015: $0.03
Smarter Algorithms
• Advances in neural networks leading to better accuracy in training models
Great solutions require great hardware!

[Figure: Intel AI portfolio]
SOLUTIONS
PLATFORMS/TOOLS: BigDL, Intel® Nervana™ Deep Learning Platform, Intel® Nervana™ Cloud, Intel® Nervana™ Graph
FRAMEWORKS
LIBRARIES: Intel® MKL, Intel® MKL-DNN, Intel® DAAL, Intel Distribution
HARDWARE: Compute, Memory/Storage, Fabric

This Evening
1. Machine Learning and Data Science
2. Introduction to Deep Learning
3. Nervana!
4. Neon
5. Deep Learning Use Cases

AI? Machine Learning? Deep Learning?

Machine learning is the development and application of algorithms that can
learn from data in an automated, semi-automated, or supervised setting.
Statistical Learning
Algorithms which leverage statistical methods for estimating functions from examples
Naïve Bayes, SVM, GLM, Tree-based, kNN
Deep Learning
Algorithms where multiple layers of neurons learn successively complex representations of input data
CNN, RNN, DFF, RBM, LSTM
Training: building a mathematical model based on input data
Classification (scoring): using a trained model to make predictions about new data

[Figure: workflow stages] Ingest Data, Engineer Features, Structure Model, Clean Data, Visualize, Query/Analyze, Train Model, Deploy

A Quite Brief History of Deep Learning
• 1949: The Organization of Behavior is published (Hebb!)
• 1960s: Neural networks used for binary classification
• 1970s: Neural networks' popularity dries up after not delivering on the hype (Minsky)
• 1980s: Backpropagation is used to train deep networks
• 1990s: Neural networks take the back seat to support vector machines due to their nice
theoretical properties and guaranteed bounds (Vapnik)
• 2010s: Access to large datasets and more computation allowed deep networks to return and
achieve state-of-the-art results in speech, vision, and natural language processing (Hinton)
Today: Deep Learning is a fast-moving area of academic and applied analytics!
There are many opportunities for new discoveries!

ML v. DL: Practical Differences

SVM
Random Forest
Naïve Bayes
Decision Trees
Logistic Regression
Ensemble methods

Harrison

End-to-End Deep learning
~60 million parameters
Harrison

Workflows in Machine Learning
⟹ The same rules apply for deep learning!
➝ Preprocessing data
➝ Feature extraction
➝ Parsimony in model selection
⟹ How we go about some of this does change…

End-to-End Deep learning: Data Considerations

Labels: Harrison?
Transformations!
More data is always better!

Deep Learning: Networks of Artificial Neurons

Output of unit
Activation Function
Linear weights
Bias unit
Input from unit j
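
The labels above describe one artificial neuron. A minimal sketch, assuming the usual form (weighted sum of inputs plus a bias, passed through an activation function); the logistic activation and the numeric values are assumptions chosen only for illustration:

```python
import math

def logistic(z):
    """Logistic activation, used here only as an example choice."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical inputs and weights; the weighted sum is exactly 0 here.
y = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0,
                  activation=logistic)
print(y)  # 0.5
```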

⟹ With an explosion of moving parts,
being able to understand and keep
track of what sort of model is being
built becomes even more important!

Practical example: recognition of handwritten digits
MNIST dataset
70,000 images (28x28 pixels)
Goal: classify images into a digit 0-9
Input: N = 28 x 28 pixels = 784 input units
Hidden: N = 100 hidden units (user-defined parameter)
Output: N = 10 output units (one for each digit)
Each output unit i encodes the probability that the input image is the digit i
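
To get a feel for the size of this network, the sketch below counts its trainable parameters. It assumes fully connected layers with one bias per unit, which the slides do not state explicitly:

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + one bias per unit
    return total

mnist_mlp = [28 * 28, 100, 10]  # input, hidden, output units
print(count_parameters(mnist_mlp))  # 79510
```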

Training procedure
1. Randomly seed weights
2. Forward-pass
3. Cost
4. Backward-pass
5. Update weights
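
The five steps can be sketched on a deliberately tiny problem. This is an illustration only: a single weight, a squared-error cost, and toy data following y = 2x are all assumptions, not the MNIST network from the slides:

```python
import random

random.seed(0)

# Toy data following y = 2x (hypothetical, for illustration).
data = [(x, 2.0 * x) for x in [0.0, 1.0, 2.0, 3.0]]

w = random.uniform(-0.1, 0.1)            # 1. Randomly seed weights
learning_rate = 0.05

for epoch in range(200):
    for x, target in data:
        y = w * x                        # 2. Forward-pass
        cost = 0.5 * (y - target) ** 2   # 3. Cost (squared error)
        grad = (y - target) * x          # 4. Backward-pass: d(cost)/dw
        w -= learning_rate * grad        # 5. Update weights

print(round(w, 3))  # 2.0
```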

Forward pass
Input (28x28) ➝ Hidden ➝ Output (10x1)
Output: [0.0, 0.1, 0.0, 0.3, 0.1, 0.1, 0.0, 0.0, 0.4, 0.0]
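
A forward pass through the 784-100-10 network might look like the sketch below. The ReLU hidden activation, the softmax output, the Gaussian initialization, and the random stand-in image are assumptions made for illustration:

```python
import math
import random

def affine(inputs, weights, biases):
    """One fully connected layer: weights * inputs + biases."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(v, 0.0) for v in values]

def softmax(values):
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(n_in, n_out, rng):
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
w_hidden, b_hidden = random_layer(784, 100, rng)
w_out, b_out = random_layer(100, 10, rng)

image = [rng.random() for _ in range(784)]  # stand-in for a flattened 28x28 image
hidden = relu(affine(image, w_hidden, b_hidden))
output = softmax(affine(hidden, w_out, b_out))
print(len(output), round(sum(output), 6))  # 10 1.0
```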

Cost
Input (28x28) ➝ Hidden ➝ Output (10x1)
Output: [0.0, 0.1, 0.0, 0.3, 0.1, 0.1, 0.0, 0.0, 0.4, 0.0]
Ground Truth: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Cost function
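
As a worked example, the sketch below scores the output vector against the ground truth shown on this slide. The slides do not name the cost function; multiclass cross-entropy is assumed here:

```python
import math

output = [0.0, 0.1, 0.0, 0.3, 0.1, 0.1, 0.0, 0.0, 0.4, 0.0]
ground_truth = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

def cross_entropy(predicted, target, eps=1e-12):
    """Multiclass cross-entropy; eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for p, t in zip(predicted, target))

cost = cross_entropy(output, ground_truth)
print(round(cost, 3))  # 1.204
```

Only the probability assigned to the true class (0.3 for digit 3) contributes, so the cost is -ln(0.3). The network's most likely digit here is 8 (0.4), which is why the cost is fairly high.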

Backward pass
Input ➝ Hidden ➝ Output (10x1)
Output: [0.0, 0.1, 0.0, 0.3, 0.1, 0.1, 0.0, 0.0, 0.4, 0.0]
Ground Truth: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Cost function
∆Wi→j

Back-propagation
Input ➝ Hidden ➝ Output
Activation function: f(x) = max(x, 0)
Derivative: f′(x)
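
A small sketch of the rectified linear activation and the derivative that back-propagation needs. Treating the derivative at exactly 0 as 0 is a convention assumed here:

```python
def relu(x):
    """Rectified linear activation: max(x, 0)."""
    return max(x, 0.0)

def relu_derivative(x):
    """Slope of relu: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in [-2.0, 0.5]:
    print(x, relu(x), relu_derivative(x))
```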

Training
fprop ➝ cost ➝ bprop (repeated for each training example)

Gradient descent
fprop ➝ cost ➝ bprop for every training example
Update weights via:

Learning rate
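
The update equation itself did not survive on this slide. A common form, assumed here, is w ← w − (learning rate) × (gradient of the cost with respect to w). The sketch applies it to a hypothetical one-dimensional cost (w − 3)², whose gradient is 2(w − 3):

```python
def gradient_descent(gradient, w, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

# Hypothetical cost (w - 3)^2 with its minimum at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0),
                           w=0.0, learning_rate=0.1, steps=100)
print(round(w_final, 4))  # 3.0
```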

Stochastic (minibatch) Gradient descent
minibatch #1: fprop ➝ cost ➝ bprop for each example, then weight update
minibatch #2: fprop ➝ cost ➝ bprop for each example, then weight update

Stochastic (minibatch) Gradient descent
Epoch 0
Epoch 1
Sample numbers:
• Learning rate ~0.001
• Batch sizes of 32-128
• 50-90 epochs
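
The sketch below shows how epochs and minibatches relate: each epoch is one shuffled pass over the data, and each minibatch triggers one weight update. The dataset size of 1000 and the helper name `minibatches` are hypothetical:

```python
import random

def minibatches(examples, batch_size, rng):
    """Yield shuffled minibatches covering every example once."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

rng = random.Random(0)
examples = list(range(1000))  # stand-in for 1000 training examples
batch_size = 128
num_epochs = 2

updates = 0
for epoch in range(num_epochs):
    for batch in minibatches(examples, batch_size, rng):
        # fprop, cost, and bprop would run on `batch` here,
        # followed by one weight update.
        updates += 1

print(updates)  # 16
```

With 1000 examples and a batch size of 128 there are 8 minibatches per epoch (the last one holds the remaining 104 examples), so two epochs produce 16 weight updates.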

Why Does This Work at All?
Krizhevsky, 2012: 60 million parameters
Taigman, 2014: 120 million parameters

Nervana in 30 seconds. Possibly less.
neon deep learning framework: train, deploy, explore
nervana engine: 2-3x speedup on Titan X GPUs
cloud

neon framework

nervana cloud
Web Interface, Command Line

Ge(i)t Neon!
1. git clone https://github.com/NervanaSystems/neon.git
2. pip install h5py pyaml virtualenv
3. brew install {opencv|opencv3}
4. make {python2|python3}
5. . .venv/bin/activate
6. examples/mnist_mlp.py
7. deactivate
⟹ https://goo.gl/jZgfNg
Documentation!

Deep learning ingredients
Dataset, Model/Layers, Activation, Optimizer, Cost

neon overview
Backend: NervanaGPU, NervanaCPU, NervanaMGPU
Datasets: MNIST, CIFAR-10, Imagenet 1K, PASCAL VOC, Mini-Places2, IMDB, Penn Treebank, Shakespeare Text, bAbI, Hutter-prize, UCF101, flickr8k, flickr30k, COCO
Initializers: Constant, Uniform, Gaussian, Glorot Uniform, Xavier, Kaiming, IdentityInit, Orthonormal
Optimizers: Gradient Descent with Momentum, RMSProp, AdaDelta, Adam, Adagrad, MultiOptimizer
Activations: Rectified Linear, Softmax, Tanh, Logistic, Identity, ExpLin
Layers: Linear, Convolution, Pooling, Deconvolution, Dropout, Recurrent, Long Short-Term Memory, Gated Recurrent Unit, BatchNorm, LookupTable, Local Response Normalization, Bidirectional-RNN, Bidirectional-LSTM
Costs: Binary Cross Entropy, Multiclass Cross Entropy, Sum of Squares Error
Metrics: Misclassification (Top1, TopK), LogLoss, Accuracy, PrecisionRecall, ObjectDetection

Curated Models
• https://github.com/NervanaSystems/ModelZoo
• Pre-trained weights and models
SegNet
Deep Speech 2
Skip-thought
Autoencoders
Deep Dream

Neon workflow
1. Generate backend
2. Load data
3. Specify model architecture
4. Define training parameters
5. Train model
6. Evaluate

Interacting with Neon
1. Via command line
2. In a virtual environment
3. In an ipython/jupyter notebook
4. ncloud

Nervana Cloud

•Layers: convolution, rectified linear units, pooling, dropout, softmax
•Popular with 2D + depth (+ time) inputs
•Gray or RGB images
•Videos
•Synthetic aperture radar
•Spectrogram (speech)

•Layers: convolution, rectified linear units, pooling, dropout,
softmax
•Use multiple copies of the same feature on the input
(correlation)
•Use several features (aka kernels, filters)
•Reduces number of weights compared to fully connected
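
A minimal sketch of applying one small feature (kernel) at every position of an input, which is what the correlation bullet describes. The 3x3 image and 2x2 kernel are hypothetical values; no padding or stride is used:

```python
def correlate2d(image, kernel):
    """Slide the kernel over the image and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(correlate2d(image, kernel))  # [[-4, -4], [-4, -4]]
```

The same 4 kernel weights produce all 4 outputs. A fully connected layer mapping the 9 inputs to 4 outputs would need 36 weights.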

•Layers: convolution, rectified linear units (ReLU),
pooling, dropout, softmax
•It is fast – no normalization or exponential computations
•Induces sparsity in the hidden units

•Downsampling
•Reduces the number of parameters
•Provides some translation invariance
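
The downsampling bullets describe pooling. A sketch of 2x2 max pooling with non-overlapping windows, using a hypothetical 4x4 feature map:

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]]
print(max_pool2d(feature_map))  # [[4, 2], [2, 8]]
```

Moving a large value to a different cell inside the same window leaves the pooled output unchanged, which is the source of the (limited) translation invariance.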

•Reduces overfitting – Prevents co-adaptation on training data
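
This bullet describes dropout. The sketch below randomly zeroes activations during training; rescaling the surviving activations by 1/keep_prob ("inverted dropout") is one common variant and is an assumption here, not something stated on the slide:

```python
import random

def dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob; rescale the rest."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10  # hypothetical hidden-layer activations
dropped = dropout(hidden, keep_prob=0.5, rng=rng)
print(dropped)
```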

•aka “normalized exponential function”
•Normalizes vector to a probability distribution
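
A short sketch of the softmax (normalized exponential) function. Subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the result; the input values are hypothetical:

```python
import math

def softmax(values):
    """Exponentiate and normalize so the outputs sum to 1."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print([round(p, 3) for p in probs])  # [0.09, 0.245, 0.665]
```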

Code!

DEEP LEARNING USE CASES!
Long Short-Term Memory (LSTM)

Why Recurrent Neural Networks?
Feed-forward network (Input ➝ Hidden ➝ Output):
• Independence
• Fixed Length
Recurrent network:
• Temporal dependencies
• Variable sequence length

Recurrent neuron
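
The figure for this slide is missing. As a rough sketch, a recurrent neuron combines the current input with its own previous output. The tanh activation, the scalar weights, and the input sequence below are all assumptions for illustration:

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """New hidden state from the current input and the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

h = 0.0
states = []
for x in [1.0, 0.0, 0.0]:  # hypothetical input sequence
    h = rnn_step(x, h, w_x=1.0, w_h=0.5, b=0.0)
    states.append(h)

print([round(s, 3) for s in states])
```

The hidden state stays non-zero after the input returns to 0, which is how earlier inputs influence later time steps.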

RNN: what is it good for?
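Per time step: one-hot input, hidden state, output probabilities, target
“h”: input [1, 0, 0, 0], hidden [0.1, -0.4, 0.6], output [0.1, 0.7, 0.1, 0.1], target “e”
“e”: input [1, 0, 0, 0], hidden [-0.3, 0.6, 1.6], output [0.1, 0.3, 0.4, 0.2], target “l”
“l”: input [1, 0, 0, 0], hidden [0.7, -0.4, -0.4], output [0.3, 0.0, 0.6, 0.1], target “l”
“l”: input [1, 0, 0, 0], hidden [0.1, -0.8, 0.1], output [0.0, 0.0, 0.2, 0.8], target “o”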
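
The character-level setup can be sketched as follows: each character becomes a one-hot vector, and the target at each step is the next character. The vocabulary order ["h", "e", "l", "o"] is an assumption, chosen because it is consistent with the output probabilities in the figure:

```python
vocab = ["h", "e", "l", "o"]  # assumed ordering

def one_hot(char):
    """Vector with a 1 at the character's vocabulary index."""
    return [1 if char == v else 0 for v in vocab]

def most_likely(probabilities):
    """Character with the highest predicted probability."""
    return vocab[probabilities.index(max(probabilities))]

text = "hello"
pairs = [(text[i], text[i + 1]) for i in range(len(text) - 1)]
print(pairs)                              # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
print(one_hot("l"))                       # [0, 0, 1, 0]
print(most_likely([0.0, 0.0, 0.2, 0.8]))  # o
```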

Learned a language model!

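Per time step: one-hot input, hidden state, output probabilities, target
“cash”: input [1, 0, 0, 0], hidden [0.1, -0.4, 0.6], output [0.1, 0.7, 0.1, 0.1], target “flow”
“flow”: input [1, 0, 0, 0], hidden [-0.3, 0.6, 1.6], output [0.1, 0.3, 0.4, 0.2], target “is”
“is”: input [1, 0, 0, 0], hidden [0.7, -0.4, -0.4], output [0.4, 0.0, 0.5, 0.1], target “high”
“high”: input [1, 0, 0, 0], hidden [0.1, -0.8, 0.1], output [0.0, 0.0, 0.2, 0.8], target “today”

Learned a language model!
“low” / “high”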

[Figure: RNN reading “this” “movie” “was” “bad” “and” “long” <eos> one word at a time and emitting a single output [0.2, 0.8]: NEGATIVE]

[Figure: RNN mapping the input sequence “neon” “is” “amazing” to the output sequence “neon” “est” “incroyable” “!”]

Long Short-Term Memory (LSTM)
Manipulate memory cell:
1. “forget” (flush the memory)
2. “input” (add to memory)
3. “output” (get from memory)
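
The three memory operations correspond to the gates of an LSTM cell. Below is a scalar sketch using the standard LSTM equations; the parameter layout (`params` as a dict of (w_x, w_h, b) tuples) and the extreme bias values are hypothetical, chosen so each gate is almost fully open or closed:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; params maps gate name to (w_x, w_h, b)."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    forget = gate("forget", sigmoid)          # 1. how much old memory to keep
    write = gate("input", sigmoid)            # 2. how much new content to add
    candidate = gate("candidate", math.tanh)  # the new content itself
    read = gate("output", sigmoid)            # 3. how much memory to expose

    c = forget * c_prev + write * candidate   # updated memory cell
    h = read * math.tanh(c)                   # output of the unit
    return h, c

keep = {"forget": (0.0, 0.0, 100.0), "input": (0.0, 0.0, -100.0),
        "candidate": (0.0, 0.0, 0.0), "output": (0.0, 0.0, 100.0)}
flush = dict(keep, forget=(0.0, 0.0, -100.0))

_, c_kept = lstm_step(1.0, 0.0, 0.7, keep)
_, c_flushed = lstm_step(1.0, 0.0, 0.7, flush)
print(round(c_kept, 3), round(c_flushed, 3))  # 0.7 0.0
```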

Example – Sentiment analysis with LSTM
“Okay, sorry, but I loved this movie. I just
love the whole 80’s genre of these kind
of movies, because you don’t see many
like this...” -~CupidGrl~
POSITIVE
The plot/writing is completely unrealistic and just dumb at
times. Bond is dressed up in a white tux on an overnight
train ride? eh, OK. But then they just show up at the
villain’s compound like nothing bad is going to happen to
them. How stupid is this Bond?
NEGATIVE

Preprocessing
“Okay, sorry, but I loved this movie. I just
love the whole 80’s genre of these kind
of movies, because you don’t see many
like this...” -~CupidGrl~
[5, 4, 940, 107, 14, 672, 1790,
333, 47, 11, 7890, …,1]
Out-of-Vocab
(e.g. CupidGrl)
• Limit vocab size to 20,000 words
• Truncate each example to 128 words [from the left]
• Pad shorter examples up to 128 words with whitespace
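
These preprocessing steps can be sketched as below. The helper names, the whitespace tokenization, the reserved ids (0 for padding, 1 for out-of-vocabulary), and padding on the left are assumptions for illustration; the tiny vocabulary and length limits stand in for 20,000 and 128:

```python
from collections import Counter

PAD, OOV = 0, 1  # hypothetical reserved ids

def build_vocab(texts, vocab_size):
    """Map the vocab_size most frequent words to integer ids starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(vocab_size)]
    return {word: idx + 2 for idx, word in enumerate(most_common)}

def encode(text, vocab, max_len):
    """Convert words to ids, truncate from the left, then pad to max_len."""
    ids = [vocab.get(word, OOV) for word in text.lower().split()]
    ids = ids[-max_len:]                       # drop words from the left
    return [PAD] * (max_len - len(ids)) + ids  # pad shorter examples

texts = ["I loved this movie", "this movie was bad"]
vocab = build_vocab(texts, vocab_size=3)
print(encode("I loved this movie", vocab, max_len=6))  # [0, 0, 4, 1, 2, 3]
print(encode("I loved this movie", vocab, max_len=2))  # [2, 3]
```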

Model
[5, 4, 940, 107, 14, 672, 1790, 333, 47, 11, 7890, …, 1] ➝ embedding layer (d=128) ➝ LSTM (N=64) ➝ RecurrentSum ➝ Affine (N=2) ➝ POS / NEG

Data flow
embedding layer (d=128) ➝ LSTM (n=64) ➝ RecurrentSum ➝ Affine ➝ (2, 1): POS / NEG

Data flow in batches with neon
[5, 4, 940, 107, 14, 672, 1790, 333, 47, 11, 7890, …, 1] ➝ embedding layer (d=128) ➝ LSTM (n=64) ➝ RecurrentSum ➝ Affine ➝ (2, bsz): POS / NEG
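
The tail of this pipeline can be sketched in plain Python: the per-time-step LSTM outputs are summed over time (the role of RecurrentSum), and an affine layer maps the result to two class scores. The 3-unit hidden states, the weights, and the softmax at the end are hypothetical stand-ins for the 64-unit model:

```python
import math

def recurrent_sum(hidden_states):
    """Sum the per-time-step hidden vectors into a single vector."""
    width = len(hidden_states[0])
    return [sum(step[k] for step in hidden_states) for k in range(width)]

def affine(vector, weights, biases):
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]

def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical LSTM outputs for three time steps (3 units instead of 64).
hidden_states = [[0.1, -0.4, 0.6],
                 [-0.3, 0.6, 1.6],
                 [0.7, -0.4, -0.4]]
summed = recurrent_sum(hidden_states)  # one vector per example

weights = [[0.5, 0.0, -0.5],           # hypothetical weights for POS
           [-0.5, 0.0, 0.5]]           # hypothetical weights for NEG
biases = [0.0, 0.0]
pos, neg = softmax(affine(summed, weights, biases))
print(round(pos, 3), round(neg, 3))
```

With a batch of bsz examples, the same computation runs once per example, giving the (2, bsz) output shown above.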

Code!
LSTM

More Code!
LSTM

In Summary…
1. Deep learning methods are powerful and versatile
2. It’s important to understand how DL relates to
traditional ML methods
3. The barrier to entry for using DL in practice is
lowered by the neon framework in the Nervana
ecosystem
kyle.h.ambert@intel.com
@TheKyleAmbert
