Deep Learning Implementations and Frameworks
(Very short version of PAKDD 2016 tutorial)
Kenta Oono oono@preferred.jp
Preferred Networks Inc.
25th Jun. 2016
Tokyo Webmining @FreakOut
Overview
•1st session (8:30 ‒ 10:00)
•Introduction (AK)
•Basics of neural networks (AK)
•Common design of neural network implementations (KO)
•2nd session (10:30 ‒ 12:30)
•Differences of deep learning frameworks (ST)
•Coding examples of frameworks (KO & ST)
•Conclusion (ST)
Full contents
• Session 1
• Basics of Neural Networks
• http://www.slideshare.net/atsu-kan/pakdd2016-tutorial-dlif-introduction-and-basics-63030841
• Common design of neural network implementations
• http://www.slideshare.net/KentaOono/common-design-of-deep-learning-frameworks
• Session 2
• Differences of deep learning frameworks
• http://www.slideshare.net/beam2d/differences-of-deep-learning-frameworks
• Coding examples of frameworks
• will be available soon.
Basics of Neural Networks
Atsunori Kanemura
AIST, Japan
Mathematical model for a neuron
• Compare the product of the inputs and the weights (the parameters) with a threshold
• Plasticity of the neuron = the change of the parameters

f : nonlinear transform

y = f\left( \sum_{d=1}^{D} w_d x_d - b \right) = f(w^\top x - b)

[Figure: a neuron with inputs x_1, ..., x_D, weights w, threshold b, and output y]
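To make this concrete, here is a minimal NumPy sketch of the neuron model (added for illustration; the sigmoid is an assumed choice of f, not fixed by the slides):

```python
import numpy as np

def sigmoid(u):
    # One common choice for the nonlinear transform f.
    return 1.0 / (1.0 + np.exp(-u))

def neuron(x, w, b):
    # y = f(w^T x - b): compare the weighted sum of inputs with threshold b.
    return sigmoid(w @ x - b)

x = np.array([0.5, -1.0, 2.0])  # inputs x_1, ..., x_D
w = np.array([0.3, 0.8, -0.2])  # weights (the parameters)
b = 0.1                         # threshold
print(neuron(x, w, b))
```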
Parameter update
• Gradient Descent (GD)
• Stochastic Gradient Descent (SGD)
• Take several samples (say, 128) from the dataset (a mini-batch) and estimate the gradient from them
• Theoretically motivated as the Robbins-Monro algorithm
• From SGD to general gradient-based algorithms
• Adam, AdaGrad, etc.
• Use momentum and other techniques

w \leftarrow w - r \nabla_w J(w) = w - r \sum_{n=1}^{N} h(x_n, w)\, x_n, \quad h(x_n, w) \stackrel{\mathrm{def}}{=} \left( f(w^\top x_n) - y^*_n \right) f(w^\top x_n) \left( 1 - f(w^\top x_n) \right)
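A hedged sketch of the mini-batch SGD loop described above (the batch size, epoch count, and gradient function are illustrative placeholders, not part of the slides):

```python
import numpy as np

def sgd(w, X, Y, grad_loss, r=0.01, batch_size=128, n_epochs=10):
    # Mini-batch SGD: estimate the gradient from a small sample per step.
    N = X.shape[0]
    for _ in range(n_epochs):
        perm = np.random.permutation(N)   # reshuffle the dataset each epoch
        for i in range(0, N, batch_size):
            idx = perm[i:i + batch_size]  # indices of the next mini-batch
            w = w - r * grad_loss(w, X[idx], Y[idx])
    return w

# Toy usage with a squared-error gradient for a linear model.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
Y = X @ np.array([1.0, -2.0, 0.5])
g = lambda w, Xb, Yb: Xb.T @ (Xb @ w - Yb) / len(Yb)
print(sgd(np.zeros(3), X, Y, g, r=0.1))
```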
Gradient descent
• The gradient of the loss for 1-layer model is
• The update rule
\nabla_w J(w) = \frac{1}{2} \sum_{n=1}^{N} \nabla_w \left( f(w^\top x_n) - y^*_n \right)^2
= \sum_{n=1}^{N} \left( f(w^\top x_n) - y^*_n \right) \nabla_w f(w^\top x_n)
= \sum_{n=1}^{N} \left( f(w^\top x_n) - y^*_n \right) f(w^\top x_n) \left( 1 - f(w^\top x_n) \right) x_n

(the last step uses the logistic sigmoid f, for which f' = f(1 - f))

w \leftarrow w - r \nabla_w J(w) = w - r \sum_{n=1}^{N} h(x_n, w)\, x_n, \quad h(x_n, w) \stackrel{\mathrm{def}}{=} \left( f(w^\top x_n) - y^*_n \right) f(w^\top x_n) \left( 1 - f(w^\top x_n) \right)

(r is a constant learning rate)
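The gradient and update rule translate almost line by line into NumPy; an illustrative sketch, again assuming the logistic sigmoid f:

```python
import numpy as np

def f(u):
    # logistic sigmoid, so f'(u) = f(u) (1 - f(u))
    return 1.0 / (1.0 + np.exp(-u))

def grad_J(w, X, Y):
    # X: (N, D) inputs, Y: (N,) targets. Implements
    # grad_J(w) = sum_n h(x_n, w) x_n with
    # h(x_n, w) = (f(w.x_n) - y*_n) f(w.x_n) (1 - f(w.x_n)).
    p = f(X @ w)                  # f(w^T x_n) for all n at once
    h = (p - Y) * p * (1.0 - p)   # h(x_n, w)
    return X.T @ h                # sum_n h(x_n, w) x_n

def gd_step(w, X, Y, r=0.1):
    # One gradient-descent update: w <- w - r grad_J(w).
    return w - r * grad_J(w, X, Y)
```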
Neural networks
• Multi-layered
• Minimize the loss to learn the parameters
(※ f works element-wise)

y^1 = f_1(W^{10} x)
y^2 = f_2(W^{21} y^1)
y^3 = f_3(W^{32} y^2)
...
y^L = f_L(W^{(L)(L-1)} y^{L-1})

J(\{W\}) = \frac{1}{2} \sum_{n=1}^{N} \left( y^L(x_n) - y^*_n \right)^2
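The forward computation of such a stack is a short loop; a minimal sketch (the layer sizes and the tanh nonlinearity are illustrative choices):

```python
import numpy as np

def forward(x, Ws, fs):
    # y^l = f_l(W^{l,l-1} y^{l-1}), starting from y^0 = x;
    # each f is applied element-wise.
    y = x
    for W, f in zip(Ws, fs):
        y = f(W @ y)
    return y

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]   # input dim -> hidden -> hidden -> output dim
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
fs = [np.tanh] * len(Ws)
print(forward(rng.normal(size=4), Ws, fs))
```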
Backprop
• Use the chain rule to derive the gradient
• E.g. 2-layer case
• Calculate gradient recursively from top to bottom layers
• Cf. Gradient vanishing, ReLU
y^1_n = f(W^{10} x_n), \quad y^2_n = f(w^{21} \cdot y^1_n)

J(W^{10}, w^{21}) = \frac{1}{2} \sum_n \left( y^2_n - y^*_n \right)^2

\frac{\partial J}{\partial W^{10}_{kl}} = \sum_{n,i} \frac{\partial J}{\partial y^1_{ni}} \, \frac{\partial y^1_{ni}}{\partial W^{10}_{kl}}
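For this 2-layer case the chain rule can be carried out by hand; a hedged NumPy sketch (sigmoid f, returning the gradients w.r.t. W^{10} and w^{21}):

```python
import numpy as np

def f(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_2layer(W10, w21, X, Y):
    # Forward: y1_n = f(W10 x_n), y2_n = f(w21 . y1_n)
    Y1 = f(X @ W10.T)               # (N, H) hidden activations
    y2 = f(Y1 @ w21)                # (N,) outputs
    # Backward: propagate dJ/dy2 down through the layers (chain rule).
    d2 = (y2 - Y) * y2 * (1 - y2)   # dJ/du2 with u2 = w21 . y1
    g_w21 = Y1.T @ d2               # dJ/dw21
    d1 = np.outer(d2, w21) * Y1 * (1 - Y1)   # dJ/du1, per hidden unit
    g_W10 = d1.T @ X                # dJ/dW10
    return g_W10, g_w21

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 4)), rng.uniform(size=5)
W10, w21 = rng.normal(size=(3, 4)), rng.normal(size=3)
print(backprop_2layer(W10, w21, X, Y))
```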
Common Design of
Deep Learning Frameworks
Kenta Oono <oono@preferred.jp>
Preferred Networks Inc.
Steps for training neural networks
Prepare the training dataset
Initialize the Neural Network (NN) parameters
Repeat until meeting some criterion:
  Prepare the next (mini) batch
  Define how to compute the loss of this batch
  Compute the loss (forward prop)
  Compute the gradient (backprop)
  Update the NN parameters
Save the NN parameters
Technology stack of DL framework
| name | functions | examples |
|---|---|---|
| Graphical interface | — | DIGITS, TensorBoard |
| Machine learning workflow management | Dataset management, training loop | Keras, Lasagne, Blocks, TF Learn |
| Computational graph management | Build computational graph, forward prop/backprop | Theano, TensorFlow, Torch.nn |
| Multi-dimensional array library | Linear algebra | NumPy, CuPy, Eigen, torch (core) |
| Numerical computation package | Matrix operation, convolution | BLAS, cuBLAS, cuDNN |
| Hardware | — | CPU, GPU |
Technology stack of Chainer
[Figure: Chainer's technology stack. Chainer itself covers the machine learning workflow management and computational graph management layers; it sits on NumPy (CPU) and CuPy (GPU) as multi-dimensional array libraries, which in turn use BLAS (CPU) and cuBLAS/cuRAND/cuDNN (GPU) as numerical computation packages, running on CPU and GPU hardware.]
Neural Network as a Computational Graph
• In its simplest form, an NN is represented as a computational graph (CG): a stack of bipartite DAGs (directed acyclic graphs) consisting of data nodes and operator nodes.

y = x1 * x2
z = y - x3

[Figure: data nodes x1, x2, x3, y, z connected by operator nodes mul and sub]
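A toy illustration of data nodes and operator nodes for this graph (deliberately simplified; this is not any framework's actual API):

```python
class Data:
    # A data node: holds a value (and, during backprop, a gradient).
    def __init__(self, value):
        self.value, self.grad = value, 0.0

class Mul:
    # An operator node connecting input data nodes to an output data node.
    def forward(self, x1, x2):
        return Data(x1.value * x2.value)

class Sub:
    def forward(self, a, b):
        return Data(a.value - b.value)

x1, x2, x3 = Data(2.0), Data(3.0), Data(1.0)
y = Mul().forward(x1, x2)   # y = x1 * x2
z = Sub().forward(y, x3)    # z = y - x3
print(z.value)              # 5.0
```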
Example: Multi-layer Perceptron (MLP)
[Figure: MLP as a CG — x → Affine(W1, b1) → h1 → ReLU → a1 → Affine(W2, b2) → h2 → ReLU → a2 → Softmax → y → cross-entropy loss against the target t]

• It is a choice of implementation whether the CG includes the weights and biases.
Automatic Differentiation
• Computes the gradient of some specified data node (e.g. the loss) with respect to each data node.
• Each operator node must have a backward operation that calculates the gradients w.r.t. its inputs from the gradients w.r.t. its outputs (a realization of the chain rule).
• e.g. the Function class of Chainer has a backward method.
• e.g. each layer class of Caffe has Backward_cpu and Backward_gpu methods.
• e.g. Autograd wraps most NumPy functions with thin wrappers that record the gradient computations as closures.
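A minimal sketch of an operator node carrying its own backward operation, in the spirit of (but not identical to) Chainer's Function:

```python
class Mul:
    # Operator node: forward computes the output; backward maps the gradient
    # w.r.t. the output to gradients w.r.t. the inputs (chain rule).
    def forward(self, x1, x2):
        self.x1, self.x2 = x1, x2   # retain the inputs for backward
        return x1 * x2

    def backward(self, gy):
        # d(x1*x2)/dx1 = x2 and d(x1*x2)/dx2 = x1
        return gy * self.x2, gy * self.x1

class Sub:
    def forward(self, a, b):
        return a - b

    def backward(self, gy):
        # d(a-b)/da = 1 and d(a-b)/db = -1
        return gy, -gy
```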
Backprop through CG
• Starting from ∇_z z = 1, the gradients ∇_y z, ∇_{x1} z, … are obtained by backtracking the graph from z.

y = x1 * x2
z = y - x3

[Figure: the same graph as before, traversed backwards from z]
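Continuing the Mul/Sub sketch above: backprop is a walk over the operators in reverse order, seeded with ∇_z z = 1:

```python
mul, sub = Mul(), Sub()
x1, x2, x3 = 2.0, 3.0, 1.0
y = mul.forward(x1, x2)      # the forward pass records the inputs
z = sub.forward(y, x3)

gz = 1.0                     # gradient of z w.r.t. itself
gy, gx3 = sub.backward(gz)   # visit the operators in reverse order
gx1, gx2 = mul.backward(gy)
print(gx1, gx2, gx3)         # 3.0 2.0 -1.0
```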
Backprop as extended graphs
y = x1 * x2
z = y - x3

[Figure: the forward graph (x1, x2 → mul → y; y, x3 → sub → z) extended with a backward subgraph: dz = 1 (id), dy = dz, dx3 = -dz (neg), dx1 = dy * x2 (mul), dx2 = dy * x1 (mul). Forward propagation flows through the original nodes, backward propagation through the added ones.]
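By contrast, a hedged sketch of the extended-graph style: rather than calling backward at run time, we append new operator nodes (id, neg, mul) that compute the gradients, so backprop itself becomes part of the graph:

```python
# Graph as a list of (op, inputs, output) triples; strings are node IDs.
forward = [("mul", ["x1", "x2"], "y"),
           ("sub", ["y", "x3"], "z")]

def extend_with_gradients(graph, wrt="z"):
    # Emit gradient nodes for d(wrt)/d(node); "id" seeds dz = 1.
    ext = [("id", [], "d" + wrt)]
    for op, ins, out in reversed(graph):
        if op == "sub":              # z = a - b: da = dz, db = -dz
            a, b = ins
            ext.append(("id", ["d" + out], "d" + a))
            ext.append(("neg", ["d" + out], "d" + b))
        elif op == "mul":            # y = a * b: da = dy*b, db = dy*a
            a, b = ins
            ext.append(("mul", ["d" + out, b], "d" + a))
            ext.append(("mul", ["d" + out, a], "d" + b))
    return graph + ext

for node in extend_with_gradients(forward):
    print(node)
```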
Differences of
Deep Learning Frameworks
Seiya Tokui
Preferred Networks, Inc.
Training of Neural Networks
Prepare the training dataset
Initialize the NN parameters
Repeat until meeting some criterion:
  Prepare the next (mini) batch
  Define how to compute the loss of this batch
  Compute the loss (forward prop)
  Compute the gradient (backprop) ← automated by the framework
  Update the NN parameters
Save the NN parameters
Framework Design Choices
• The most crucial parts of NN frameworks are:
• How to define the parameters
• How to define the loss function of the parameters (= how to write computational graphs)
• These also influence the APIs for forward prop, backprop, and parameter updates (i.e., numerical optimization)
• And all of these are determined by how computational graphs are implemented
• Other parts are also important, but they are mostly common to implementations of other types of machine learning methods
Framework Comparison: Basic information*
| Viewpoint | Torch.nn** | Theano*** | Caffe | autograd (NumPy, Torch) | Chainer | MXNet | TensorFlow |
|---|---|---|---|---|---|---|---|
| GitHub stars | 4,719 | 3,457 | 9,590 | N: 654 / T: 554 | 1,295 | 3,316 | 20,981 |
| Started from | 2002 | 2008 | 2013 | 2015 | 2015 | 2015 | 2015 |
| Open issues/PRs | 97/26 | 525/105 | 407/204 | N: 9/0, T: 3/1 | 95/25 | 271/18 | 330/33 |
| Main developers | Facebook, Twitter, Google, etc. | Université de Montréal | BVLC (U.C. Berkeley) | N: HIPS (Harvard Univ.), T: Twitter | Preferred Networks | DMLC | Google |
| Core languages | C/Lua | C/Python | C++ | Python/Lua | Python | C++ | C++/Python |
| Supported languages | Lua | Python | C++/Python/MATLAB | Python/Lua | Python | C++/Python/R/Julia/Go etc. | C++/Python |

* Data was taken on Apr. 12, 2016
** Includes statistics of Torch7
*** There are many frameworks on top of Theano, though we omit them due to space constraints
List of Important Design Choices
Programming paradigms
1. How to write NNs in text format
2. How to build computational graphs
3. How to compute backprop
4. How to represent parameters
5. How to update parameters
Performance improvements
6. How to achieve the computational performance
7. How to scale the computations
Framework Comparison: Design Choices
| Design choice | Torch.nn | Theano-based | Caffe | autograd (NumPy, Torch) | Chainer | MXNet | TensorFlow |
|---|---|---|---|---|---|---|---|
| 1. NN definition | Script (Lua) | Script* (Python) | Data (protobuf) | Script (Python, Lua) | Script (Python) | Script (many) | Script (Python) |
| 2. Graph construction | Prebuild | Prebuild | Prebuild | Dynamic | Dynamic | Prebuild** | Prebuild |
| 3. Backprop | Through graph | Extended graph | Through graph | Extended graph | Through graph | Through graph | Extended graph |
| 4. Parameters | Hidden in operators | Separate nodes | Hidden in operators | Separate nodes | Separate nodes | Separate nodes | Separate nodes |
| 5. Update formula | Outside of graphs | Part of graphs | Outside of graphs | Outside of graphs | Outside of graphs | Outside of graphs** | Part of graphs |
| 6. Optimization | - | Advanced optimization | - | - | - | - | Simple optimization |
| 7. Parallel computation | Multi GPU | Multi GPU (libgpuarray) | Multi GPU | Multi GPU (Torch) | Multi GPU | Multi node, Multi GPU | Multi node, Multi GPU |

* Some Theano-based frameworks use data (e.g. YAML)
** Dynamic dependency analysis and optimization are supported (no autodiff support)
1. How to write NNs in text format

Write NNs in declarative configuration files: the framework builds the layers of the NN as written in the files (e.g. prototxt, YAML). E.g.: Caffe (prototxt), Pylearn2 (YAML).

Write NNs by procedural scripting: the framework provides APIs of scripting languages to build NNs. E.g.: most other frameworks.
2. How to build computational graphs
Both styles run the same training procedure as before; they differ in where "Define how to compute the loss" sits.

Build once, run several times: the loss computation is defined once, after initializing the NN parameters and before entering the training loop; every iteration then reuses the same graph.

Build one at every iteration: "Define how to compute the loss of this batch" is done inside the loop, once per (mini) batch, so the graph may differ between iterations. The sketch below caricatures the two styles.
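The contrast in user code, caricatured in plain NumPy (illustrative only; neither snippet is a real framework API):

```python
import numpy as np

# Build once, run several times: the computation is described once, before
# the loop, and the fixed "graph" (caricatured here as a closure) is then
# executed for every batch.
def build_loss_graph():
    return lambda w, x, y: 0.5 * np.sum((x @ w - y) ** 2)

loss_graph = build_loss_graph()              # built once
w = np.zeros(3)
for x, y in [(np.eye(3), np.ones(3))] * 2:   # toy "batches"
    print(loss_graph(w, x, y))

# Build one at every iteration: the forward computation itself defines the
# graph, so ordinary Python control flow can change its shape per batch.
for x, y in [(np.eye(3), np.ones(3))] * 2:
    h = x @ w
    if h.mean() > 0:                         # data-dependent structure
        h = np.tanh(h)
    print(0.5 * np.sum((h - y) ** 2))
```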
3. How to compute backprop
Backprop through graphs: the framework builds only the graph for forward prop and performs backprop by backtracking that graph. E.g.: Torch.nn, Caffe, MXNet, Chainer.

Backprop as extended graphs: the framework builds graphs for backprop as well as those for forward prop. E.g.: Theano, TensorFlow.

[Figure: the graph of z = (a * b) - c shown twice — once backtracked from z with ∇_z z = 1, and once extended with gradient nodes (id, neg, mul) producing dz, dy, da, db, dc]
4. How to represent parameters
Parameters as part of operator nodes: parameters are owned by the operator nodes (e.g., convolution layers) and do not appear directly in the graphs. E.g.: Torch.nn, Caffe, MXNet.

Parameters as separate nodes in the graphs: parameters are represented as separate variable nodes. E.g.: Theano, Chainer, TensorFlow.

[Figure: x → Affine (owning W and b) → y, versus x, W, b → Affine → y]
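The two conventions, sketched as toy Affine operators (illustrative, not any framework's API):

```python
import numpy as np

class AffineOwning:
    # Parameters hidden inside the operator node (Torch.nn/Caffe style):
    # W and b never appear as graph nodes of their own.
    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, x):
        return self.W @ x + self.b

def affine(x, W, b):
    # Parameters as separate data nodes passed in explicitly
    # (Theano/Chainer/TensorFlow style).
    return W @ x + b

W, b, x = np.eye(2), np.zeros(2), np.ones(2)
print(AffineOwning(W, b).forward(x), affine(x, W, b))
```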
5. How to update parameters
Update parameters by dedicated routines outside of the graphs: the update formulae are implemented directly on top of the backend array libraries. E.g.: Torch.nn, Caffe, MXNet, Chainer.

Represent update formulae as a part of the graphs: the update formulae are built as part of the computational graphs. E.g.: Theano, TensorFlow.
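A sketch of the "outside the graphs" style: a small optimizer object that mutates the parameter arrays directly with the backend array library (in the "part of the graphs" style, the same assignment would instead be an operator node executed with the rest of the graph):

```python
import numpy as np

class SGD:
    # Update routine living outside the computational graph: it mutates
    # the parameter arrays after backprop has filled in the gradients.
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for p, g in zip(params, grads):
            p -= self.lr * g   # in-place update via the array library

params = [np.ones(3)]
grads = [np.array([0.1, 0.2, 0.3])]
SGD(lr=0.5).update(params, grads)
print(params[0])               # [0.95 0.9  0.85]
```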
6. How to achieve the computational performance

Transform the graphs to optimize the computations: there are many ways to optimize the computations; Theano supports various advanced optimizations, and TensorFlow does simple ones.

Provide easy ways to write custom operator nodes: users can write their own operator nodes optimized for their purposes. Torch, MXNet, and Chainer provide ways to write one piece of code that runs on both CPU and GPU. Chainer also provides ways to write custom CUDA kernels without manual compilation steps.
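For instance, the NumPy/CuPy pair makes the single-code-path approach concrete: `cupy.get_array_module` returns the module a given array belongs to, so one routine can serve both devices (a sketch, with a NumPy-only fallback for machines without CuPy):

```python
import numpy as np
try:
    import cupy
    get_array_module = cupy.get_array_module
except ImportError:
    def get_array_module(*args):
        # Fallback when CuPy (and hence a GPU) is not available.
        return np

def softplus(x):
    # One implementation that runs on CPU (NumPy) or GPU (CuPy) arrays:
    # xp is bound to whichever module the input array belongs to.
    xp = get_array_module(x)
    return xp.log1p(xp.exp(x))

print(softplus(np.array([-1.0, 0.0, 1.0])))
```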
7. How to scale the computations
Multi-GPU parallelization: most popular frameworks now support multi-GPU computation, and multi-GPU on one machine is enough for most use cases today.

Distributed computation (i.e., multi-node parallelization): some frameworks also support distributed computation to scale the learning further. MXNet uses a simple distributed key-value store, TensorFlow uses gRPC (and will also support easy-to-use cloud environments), and CNTK uses simple MPI.
Conclusion
• We introduced the basics of NNs, typical designs of their implementations, and the pros/cons of various design choices.
• Deep learning is an emerging field developing at increasing speed, so quick trial and error is crucial for research and development in this field.
• In that sense, it is important to use frameworks as sources of highly reusable NN parts.
• There is a growing number of frameworks, each with different characteristics, so it is also important to choose one appropriate for your purpose.
