This document summarizes a presentation on developing deep learning models with the neon deep learning framework. It introduces deep learning concepts and the Nervana platform, then walks through hands-on exercises, including a sentiment analysis model that applies LSTMs to the IMDB movie review dataset. Along the way it demonstrates key aspects of neon: model architecture, initialization, datasets, backends, and training. It closes with a demo of training and running a deep reinforcement learning model that learns to play video games, and with instructions for running models locally and in the nervana cloud.
2. Outline
• Intro to Deep Learning
• Nervana platform
• Neon
• Building a sentiment analysis model (hands-on)
• Building a model that learns to play video games (demo)
• Nervana Cloud
4. What is deep learning?
More than an algorithm: a fundamentally distinct compute paradigm. A method for extracting features at multiple levels of abstraction.
• Features are discovered from data
• Unsupervised learning can find structure in unlabeled datasets
• Supervised learning optimizes solutions for a particular application
• Performance improves with more training data
• The network can express complex transformations, with a high degree of representational power
5. Convolutional neural networks
[Figure: stacked stages of filter + non-linearity and pooling feed into fully connected layers; across the stack the network learns low-level features, then mid-level features, then object parts or phonemes, and finally objects or words, e.g. an image classified as "cat" or audio transcribed as "how can I help you?". Hinton et al., LeCun, Zeiler, Fergus]
A minimal sketch of one such stage follows.
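To make one stage of the pipeline concrete, here is a minimal NumPy sketch of filter + non-linearity + pooling (purely illustrative; neon provides optimized convolution and pooling layers):

import numpy as np

def conv_relu_pool(image, kernel, pool=2):
    """One CNN stage: valid 2-D convolution, ReLU non-linearity, max pooling."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    fmap = np.empty((oh, ow))
    for r in range(oh):                  # slide the filter over the image
        for c in range(ow):
            fmap[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    fmap = np.maximum(fmap, 0.0)         # non-linearity (ReLU)
    ph, pw = oh // pool, ow // pool      # max-pool over non-overlapping windows
    return fmap[:ph * pool, :pw * pool].reshape(ph, pool, pw, pool).max(axis=(1, 3))

Stacking such stages, then flattening into fully connected layers, yields the hierarchy of features described above.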
16-19. Long short term memory
Network activations determine the states of the input, forget, and output gates (a NumPy sketch of the cell update follows below):
• Open input, open output, closed forget: the LSTM network acts like a standard RNN
• Closed input, open forget: the memory cell recalls its previous state, and the new input is ignored
• Closed output: the internal state is stored for the next time step without producing any output
[Figure: LSTM cell diagram with gate pre-activations f, g, i, o, a φ non-linearity, elementwise products and a sum, and the recurrent paths c_{t-1} → c_t and h_{t-1} → h_t; one gate configuration is shown per bullet]
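The bullets above follow from the standard LSTM cell update. Here is a minimal NumPy sketch of one time step (illustrative naming and shapes, not neon's internal implementation); the sigmoid gates f, i, o scale the previous cell state, the candidate input g = φ(·), and the exposed output:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W projects [x_t, h_prev] onto the four gate pre-activations f, g, i, o
    z = W @ np.concatenate([x_t, h_prev]) + b
    f, g, i, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates open (~1) or close (~0)
    g = np.tanh(g)                                # candidate cell input, phi
    c_t = f * c_prev + i * g                      # forget old state, admit new input
    h_t = o * np.tanh(c_t)                        # output gate exposes internal state
    return h_t, c_t

With (f, i, o) = (0, 1, 1) the cell behaves like a standard RNN; with (1, 0, 1) it recalls its previous state; with (1, 0, 0) it stores state without emitting any output.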
20. LSTM networks
[Figure: LSTM weights, with labeled memory, cell input, input gate, and forget gate paths]
• Requires less tuning than a standard RNN, with the same or better performance
• The neon implementation hides the internal complexity from the user
• LSTMs deliver state-of-the-art results on sequence and time-series data:
  • machine translation
  • video recognition
  • speech recognition
  • caption generation
22. Scalable deep learning is hard and expensive
The workflow: pre-process training data → augment data → design model → perform hyperparameter search. Doing this at scale requires:
• A team of data scientists with deep learning expertise
• Enormous compute (CPUs / GPUs) and engineering resources
http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf
23-25. nervana platform for deep learning
The platform pairs the neon deep learning framework with the nervana cloud, covering the full explore → train → deploy cycle.
[Figure: the cloud runs on AWS, with a web front end, worker VMs, and S3 storage; supported backends include CPUs, GPUs, and the nervana engine]
26. Deep learning as a core technology
Deep learning on the Nervana Platform serves as a single core technology across applications:
• Image classification
• Image localization
• Speech recognition
• Video indexing
• Sentiment analysis
• Machine translation
33. neon: nervana python deep learning library
• User-friendly, extensible; abstracts parallelism & data caching
• Support for many deep learning models
• Interface to nervana cloud
• Supports multiple backends: CPU cluster, GPU cluster, nervana engine
• Currently optimized for Maxwell GPUs at the assembler level
• Basic automatic differentiation
• Open source (Apache 2.0); see github for details
35. Using neon
Start with a basic model, a multilayer perceptron (mlp.py):
# imports (assuming the neon v1 module layout)
from neon.backends import gen_backend
from neon.data import DataIterator
from neon.initializers import Gaussian
from neon.layers import Affine, GeneralizedCost
from neon.models import Model
from neon.optimizers import GradientDescentMomentum
from neon.transforms import Rectlin, Logistic, CrossEntropyBinary

# set up the compute backend; X, y are user-supplied numpy arrays
be = gen_backend(backend='cpu', batch_size=128)

# create training set
train_set = DataIterator(X, y)

# define model: two fully connected layers with Gaussian-initialized weights
init_norm = Gaussian(loc=0.0, scale=0.01)
layers = [
    Affine(nout=100, init=init_norm, activation=Rectlin()),
    Affine(nout=10, init=init_norm, activation=Logistic(shortcut=True))
]
model = Model(layers=layers)
cost = GeneralizedCost(CrossEntropyBinary())
optimizer = GradientDescentMomentum(0.1, momentum_coef=0.9)

# fit model
model.fit(train_set, optimizer=optimizer, cost=cost)
[Figure: multilayer perceptron mapping input x to output y]
36. Using neon
Define the data and model for a recurrent network (rnn.py):
# imports (assuming the neon v1 module layout)
from neon.initializers import Uniform
from neon.layers import LSTM, Dropout, Affine, GeneralizedCost
from neon.models import Model
from neon.optimizers import RMSProp
from neon.transforms import Logistic, Tanh, Identity, SumSquared

# create training set (DataIteratorSequence: a sequence-windowing iterator,
# as used in neon's time-series example)
train_set = DataIteratorSequence(X, y)

# define model: LSTM, dropout regularization, affine readout
hidden = 32     # number of LSTM hidden units (illustrative)
features = 1    # output dimension (illustrative)
init = Uniform(low=-0.08, high=0.08)
layers = [
    LSTM(hidden, init, Logistic(), Tanh()),
    Dropout(keep=0.5),
    Affine(features, init, bias=init, activation=Identity())
]
model = Model(layers=layers)
cost = GeneralizedCost(SumSquared())
optimizer = RMSProp()

# fit model
model.fit(train_set, optimizer=optimizer, cost=cost)
[Figure: recurrent neural net unrolled through time, mapping inputs x_t1 … x_tk to outputs y_t1 … y_tk]
37. Speed is important
iteration = innovation
VGG-B ImageNet training time (hours); smaller is better:

  CPU                25,000*
  Single GPU          1,000
  NervanaGPU            450
  Multi NervanaGPU       64

*estimate
38. Benchmarks for convnets1
Benchmarks compiled by Facebook. Smaller is better.
1 Soumith Chintala, github.com/soumith/convnet-benchmarks
39. Benchmarks for convnets (updated1)
Benchmarks compiled by Facebook. Smaller is better.
1 Soumith Chintala, github.com/soumith/convnet-benchmarks
40. VGG-D speed comparison
Runtimes for VGG-D:

                              NEON [NervanaGPU]   Caffe [cuDNN v3]   NEON speedup
  fprop                              363 ms             581 ms           1.6x
  bprop                              762 ms            1472 ms           1.9x
  full forward/backward pass        1125 ms            2053 ms           1.8x
41. Benchmarks for RNNs1
GEMM benchmarks compiled by Baidu. Bigger is better.
1 Erich Elsen, http://svail.github.io/
42. Optimized data loading
• Goal: ensure neon never blocks waiting for data
• C++ multithreaded implementation
• Double buffered, pooled resources (see the sketch after the figure)
[Figure: sequence diagram of the library wrapper and DataLoader. On start, the DataLoader creates IO and decode thread pools; on each next call, IO threads read macrobatch files into raw file buffers, decode threads decode them into macrobatch buffers, and minibatches are staged in pinned minibatch buffers; on stop, the thread pools are destroyed]
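The same producer/consumer idea in a toy Python sketch (the real loader is multithreaded C++; PrefetchLoader and its names are illustrative): a background thread keeps a small, bounded queue of decoded minibatches full so the training loop's next() call almost never waits.

import queue
import threading

class PrefetchLoader:
    """Toy double-buffered loader: a background thread decodes ahead."""

    def __init__(self, read_minibatch, depth=2):
        self.read_minibatch = read_minibatch       # callable producing one minibatch
        self.buffers = queue.Queue(maxsize=depth)  # bounded queue = double buffering
        threading.Thread(target=self._fill, daemon=True).start()

    def _fill(self):
        while True:                                # producer: blocks when queue is full
            self.buffers.put(self.read_minibatch())

    def next(self):
        return self.buffers.get()                  # consumer: data is usually ready

With depth=2 the producer decodes the next minibatch while the consumer trains on the current one, which is the double buffering described above.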
44. Sentiment analysis using LSTMs
• Analyze text and map it to a numerical rating (1-5)
• Movie reviews (IMDB)
• Product reviews (Amazon, coming soon)
45. Data preprocessing
• Convert words to one-hot indices over the top 50,000 words
• Special PAD, OOV, and START tags
• Ids assigned by word frequency
• Pre-defined sentence length (pad or truncate)
• Targets binarized: positive (rating >= 7), negative (rating < 7)
A sketch of these steps appears below.
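A minimal sketch of this preprocessing, assuming illustrative tag ids (0 = PAD, 1 = START, 2 = OOV) and a fixed sentence length; the actual neon IMDB example may differ in details:

from collections import Counter

VOCAB_SIZE, SENT_LEN = 50000, 128
PAD, START, OOV = 0, 1, 2  # illustrative special ids

def build_vocab(texts):
    # ids assigned by frequency: most common word gets the smallest id after the tags
    counts = Counter(w for t in texts for w in t.split())
    return {w: i + 3 for i, (w, _) in enumerate(counts.most_common(VOCAB_SIZE))}

def encode(text, vocab):
    ids = [START] + [vocab.get(w, OOV) for w in text.split()]
    ids = ids[:SENT_LEN]                         # truncate long reviews
    return ids + [PAD] * (SENT_LEN - len(ids))   # pad short ones to fixed length

def binarize(rating):
    return 1 if rating >= 7 else 0  # positive (>=7) vs negative (<7)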
46. Embedding
• Learning to embed words from a sparse one-hot representation into a dense vector space (Mikolov et al. 2013a), where linear structure captures word relationships*:
W(woman) − W(man) ≃ W(aunt) − W(uncle)
W(woman) − W(man) ≃ W(queen) − W(king)
*http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
A sketch of this analogy arithmetic follows.
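The analogies can be tested directly on a learned embedding matrix. A generic NumPy sketch (W is an assumed vocab-by-dimension embedding matrix and vocab an assumed word-to-row mapping; neither name comes from the slides):

import numpy as np

def analogy(W, vocab, a, b, c):
    """Nearest word to W(b) - W(a) + W(c); e.g. ('man', 'woman', 'king') -> 'queen'."""
    target = W[vocab[b]] - W[vocab[a]] + W[vocab[c]]
    # cosine similarity of the target against every row of the embedding matrix
    sims = (W @ target) / (np.linalg.norm(W, axis=1) * np.linalg.norm(target) + 1e-8)
    exclude = {vocab[a], vocab[b], vocab[c]}
    best = max((i for i in range(len(W)) if i not in exclude), key=lambda i: sims[i])
    return next(w for w, i in vocab.items() if i == best)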
57-58. Deep Reinforcement Learning*
• Learning video games from raw pixels and scores
• Developer contribution: Tambet Matiisen, University of Tartu, Estonia
• https://github.com/tambetm/simple_dqn
*Mnih et al., Nature (2015)
59. Deep Reinforcement Learning
• A convnet computes the Q score for state, action pairs
• Replay memory (to remove correlations in the observation sequence)
• A periodically frozen target network (to reduce correlation with the target)
• Rewards (score changes) clipped to [-1, +1] (so the same learning rate works across games)
• The same network can play a range of games
Mnih et al., Nature (2015)
A sketch of the resulting update target follows.
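Combining these pieces, the Q-learning target takes the standard DQN form. This is a hedged NumPy sketch of the textbook formulation from Mnih et al., not code from simple_dqn:

import numpy as np

def dqn_targets(rewards, q_next_frozen, terminals, gamma=0.99):
    """y = clip(r) + gamma * max_a' Q_frozen(s', a'), zeroed at terminal states."""
    r = np.clip(rewards, -1.0, 1.0)         # reward clipping: one learning rate fits all games
    bootstrap = q_next_frozen.max(axis=1)   # greedy value from the frozen target network
    return r + gamma * (1.0 - terminals) * bootstrap

Minibatches of transitions are drawn uniformly from the replay memory, and q_next_frozen comes from the frozen copy of the network, which is refreshed only periodically.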
64. Other parts of the code
• main.py: executable
• agent.py: Agent class (learning and playing)
• environment.py: wrapper for Arcade Learning Environment (ALE)
• replay_memory.py: replay memory class
65. Demo
• Training
  • ./train.sh --minimal_action_set roms/breakout.bin
  • ./train.sh --minimal_action_set roms/pong.bin
• Plot results
  • ./plot.sh results/breakout.csv
• Play (observe the network's learning progress)
  • ./play.sh --minimal_action_set roms/pong.bin --load_weights snapshots/pong_<epoch>.pkl
• Record
  • ./record.sh --minimal_action_set roms/pong.bin --load_weights snapshots/pong_<epoch>.pkl
67. Using neon and nervana cloud
Running locally:
% python rnn.py # or: neon rnn.yaml
Running in nervana cloud:
% ncloud submit rnn.py # or rnn.yaml
% ncloud show <model_id>
% ncloud list
% ncloud deploy <model_id>
% ncloud predict <model_id> <data> # or use the REST API
68. Contact
arjun@nervanasys.com
@coffeephoenix
github.com/NervanaSystems/neon