Slides for Part One of "Deep learning implementations and frameworks", presented as a tutorial at PAKDD 2016 (Pacific-Asia Conference on Knowledge Discovery and Data Mining).
The presentation took place on April 19, 2016 in Auckland, New Zealand.
http://pakdd16.wordpress.fos.auckland.ac.nz/technical-program/tutorials/
3. Objective
• Get started with deep learning research and practice
• 1) Learn the building blocks that are common to
most deep learning frameworks
– Review key technologies.
• 2) Understand the differences between the various
implementations
– How specific DL frameworks differ
– Useful to decide which framework to start with
• Not about coding know-how (although coding
examples will be given).
4. Intended audience
• Want to use neural networks
• Want to apply neural network architectures to
practical problems
• Expected background:
– Basics of computer science and numerical
computation
– General machine learning terminology (in
particular around supervised learning)
– Basic knowledge or practices of neural networks
(recommended)
– Basic knowledge of Python programming
language (recommended)
5. Overview
• 1st session (8:30 – 10:00)
– Introduction (AK)
– Basics of neural networks (AK)
– Common design of neural network
implementations (KO)
• 2nd session (10:30 – 12:30)
– Differences between deep learning frameworks
(ST)
– Coding examples of frameworks (KO & ST)
– Conclusion (ST)
6. Frameworks to be (and not to be) explained
• Deeply explained with coding examples
– Chainer – Python
– Keras – Python
– TensorFlow – Python
• Also compared
– Torch.nn – Lua
– Theano – Python
– Caffe – C++ & Python & MATLAB
– MXNet – many languages
– autograd – Python & Lua
• Others not explained
– Cloud computing, MATLAB toolboxes, DL4J, H2O, CNTK
– Wrappers: Lasagne, Blocks, skflow
– TensorBoard, DIGITS (only mentioned by name)
7. Basics of Neural Networks
Atsunori Kanemura
AIST, Japan
8. Artificial neural networks
• Biologically inspired
– A biological neuron is a nonlinear unit
connected with synapses at
the dendrites (input) and
the axon (output)
• A building block for pattern recognition
systems (and more)
9. Why neural networks?
• Superior performance
– Image recognition
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC) – exceeds human
performance
– Playing games
• AlphaGo – has defeated human experts
• Extended to other problems
– Images and text
• Show & Tell – generates text from images via
intermediate representations (“embeddings”)
– Learning artistic styles
– Many others (translation, speech recognition, …)
10. Technical internals of NNs
• Layered processing:
linear transformation (a.k.a. matrix multiplication,
affine transformation)
+ nonlinear operation (a.k.a. activation function)
• Adapt to data
11. Mathematical model for a neuron
• Compare the product of the inputs and weights (parameters) with a threshold
– Plasticity of the neuron = the change of the parameters (w and b)
• f: nonlinear transform
[Diagram: inputs x_1, …, x_D are weighted by w, summed (∑), shifted by the threshold b, and passed through f to give the output y]

$y = f\Big(\sum_{d=1}^{D} w_d x_d - b\Big) = f(\mathbf{w}^\top \mathbf{x} - b)$
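As a rough illustration of the formula above (not from the original slides), a single neuron can be computed with a few lines of NumPy; the logistic choice of f and the example values are assumptions of this sketch.

```python
import numpy as np

def sigmoid(a):
    # Logistic nonlinearity f; any other activation could be substituted here.
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b):
    # y = f(w^T x - b): weighted sum of the inputs compared with a threshold b.
    return sigmoid(np.dot(w, x) - b)

x = np.array([0.5, -1.0, 2.0])   # D = 3 inputs
w = np.array([0.1,  0.4, 0.3])   # weights (parameters)
b = 0.2                          # threshold
print(neuron(x, w, b))
```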
12. Generalized linear discriminant
• Generalized linear discriminant

$\hat{y} = f\Big(\sum_{d=1}^{D} w_d x_d - b\Big) = f(\mathbf{w}^\top \mathbf{x} - b)$

– f(·): nonlinear transformation
– ⇒ logistic (classical), probit, etc.
• Binary targets:

$y_n = \begin{cases} 1 & (x_n \text{ is positive}) \\ 0 & (x_n \text{ is negative}) \end{cases}$
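A minimal sketch of turning the discriminant into a binary decision, assuming the logistic choice of f noted above; thresholding ŷ at 0.5 corresponds to checking the sign of wᵀx − b.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def classify(x, w, b):
    # Generalized linear discriminant: predict 1 (positive) if f(w^T x - b) >= 0.5.
    y_hat = sigmoid(np.dot(w, x) - b)
    return 1 if y_hat >= 0.5 else 0
```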
13. Learning with loss minimization
• Learn from many samples $\{\mathbf{x}_n, y^*_n\}_{n=1}^{N}$
• Binary output:

$y^*_n = \begin{cases} 1 & (x_n \text{ is positive}) \\ 0 & (x_n \text{ is negative}) \end{cases}$

• Define the loss function (squared error):

$J(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)^2$

• Minimize J to learn (estimate) the parameters:

$\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})$
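A hedged NumPy sketch of the squared-error loss J(w) above; the array shapes (X of shape N×D, y_star of length N) are assumptions of this sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(w, X, y_star):
    # J(w) = 1/2 * sum_n (f(w^T x_n) - y*_n)^2, with X of shape (N, D).
    preds = sigmoid(X.dot(w))
    return 0.5 * np.sum((preds - y_star) ** 2)
```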
14. Neural networks
• Multi-layered:

$\mathbf{y}^1 = f_1(W^{10}\mathbf{x})$
$\mathbf{y}^2 = f_2(W^{21}\mathbf{y}^1)$
$\mathbf{y}^3 = f_3(W^{32}\mathbf{y}^2)$
$\cdots$
$\mathbf{y}^L = f_L(W^{(L)(L-1)}\mathbf{y}^{L-1})$

※ f works element-wise
• Minimize the loss to learn the parameters:

$J(\{W\}) = \frac{1}{2} \sum_{n=1}^{N} \big(\mathbf{y}^L(\mathbf{x}_n) - y^*_n\big)^2$
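A rough sketch of the layered forward computation above, assuming every f_l is the same element-wise sigmoid; the layer sizes are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights):
    # y^l = f_l(W^{l,l-1} y^{l-1}); here every f_l is the element-wise sigmoid.
    y = x
    for W in weights:
        y = sigmoid(W.dot(y))
    return y

# Hypothetical 3-layer network with sizes 4 -> 5 -> 3 -> 1.
rng = np.random.RandomState(0)
weights = [rng.randn(5, 4), rng.randn(3, 5), rng.randn(1, 3)]
print(forward(rng.randn(4), weights))
```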
15. Gradient descent
• The gradient of the loss for the 1-layer model (with logistic f, whose derivative is f(1 − f)) is

$\nabla_{\mathbf{w}} J(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \nabla_{\mathbf{w}} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)^2$
$= \sum_{n=1}^{N} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)\, \nabla_{\mathbf{w}} f(\mathbf{w}^\top \mathbf{x}_n)$
$= \sum_{n=1}^{N} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)\, f(\mathbf{w}^\top \mathbf{x}_n)\big(1 - f(\mathbf{w}^\top \mathbf{x}_n)\big)\, \mathbf{x}_n$

• The update rule (r is a constant learning rate):

$\mathbf{w} \leftarrow \mathbf{w} - r \nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{w} - r \sum_{n=1}^{N} h(\mathbf{x}_n, \mathbf{w})\, \mathbf{x}_n$
$h(\mathbf{x}_n, \mathbf{w}) \stackrel{\mathrm{def}}{=} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)\, f(\mathbf{w}^\top \mathbf{x}_n)\big(1 - f(\mathbf{w}^\top \mathbf{x}_n)\big)$
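A minimal NumPy sketch of this gradient and update rule for the 1-layer logistic model; the learning rate and step count are arbitrary example values.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_J(w, X, y_star):
    # grad = sum_n (f(w^T x_n) - y*_n) f(w^T x_n)(1 - f(w^T x_n)) x_n
    p = sigmoid(X.dot(w))
    h = (p - y_star) * p * (1.0 - p)
    return X.T.dot(h)

def gradient_descent(w, X, y_star, r=0.1, n_steps=100):
    # w <- w - r * grad_w J(w), repeated for a fixed number of steps.
    for _ in range(n_steps):
        w = w - r * grad_J(w, X, y_star)
    return w
```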
16. Backprop
• Use the chain rule to derive the gradient
• E.g. the 2-layer case

$\mathbf{y}^1_n = f(W^{10}\mathbf{x}_n), \qquad y^2_n = f(\mathbf{w}^{21}\cdot\mathbf{y}^1_n)$
$J(W^{10}, \mathbf{w}^{21}) = \frac{1}{2}\sum_{n} (y^2_n - y^*_n)^2$
$\dfrac{\partial J}{\partial W^{10}_{kl}} = \sum_{n,i} \dfrac{\partial J}{\partial y^1_{ni}} \dfrac{\partial y^1_{ni}}{\partial W^{10}_{kl}}$

– ⇒ Calculate the gradients recursively from the top layer down to the bottom layer
• Cf. vanishing gradients, ReLU
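A hedged sketch of backprop for the 2-layer case above, written by hand in NumPy; the variable names (W10, w21) follow the slide's notation, and the array shapes are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_2layer(W10, w21, X, y_star):
    # Forward pass: y^1_n = f(W^{10} x_n), y^2_n = f(w^{21} . y^1_n)
    Y1 = sigmoid(X.dot(W10.T))                        # shape (N, H)
    y2 = sigmoid(Y1.dot(w21))                         # shape (N,)
    # Backward pass: apply the chain rule from the top layer down.
    delta2 = (y2 - y_star) * y2 * (1.0 - y2)          # dJ/da^2_n
    grad_w21 = Y1.T.dot(delta2)                       # dJ/dw^{21}
    delta1 = np.outer(delta2, w21) * Y1 * (1.0 - Y1)  # dJ/da^1_{ni}
    grad_W10 = delta1.T.dot(X)                        # dJ/dW^{10}_{kl}
    return grad_W10, grad_w21
```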
17. Automatic Differentiation
• The math for backprop is obvious (but
tedious) once the NN architecture has been
defined
• The gradients can be calculated automatically
after defining the NN model
• This is called automatic differentiation
(a general concept that makes use
of the chain rule)
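As a minimal illustration, the autograd package listed among the compared frameworks can produce such gradients from a plain loss function; the toy loss and data below are assumptions of this sketch.

```python
# Requires the `autograd` package (pip install autograd).
import autograd.numpy as np
from autograd import grad

def loss(w, X, y_star):
    preds = 1.0 / (1.0 + np.exp(-np.dot(X, w)))
    return 0.5 * np.sum((preds - y_star) ** 2)

# grad() builds the derivative of `loss` w.r.t. its first argument
# automatically, using the chain rule under the hood.
loss_grad = grad(loss)

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_star = np.array([0.0, 1.0, 1.0])
w = np.zeros(2)
print(loss_grad(w, X, y_star))   # same result as a hand-derived gradient
```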
18. Parameter update
• Gradient Descent (GD)
• Stochastic Gradient Descent (SGD)
– Take several samples (say, 128) from the
dataset (a mini-batch) and estimate the gradient from them (see the sketch below)
– Theoretically motivated as the Robbins-Monro
algorithm
• From SGD to general gradient-based algorithms
– Adam, AdaGrad, etc.
– Use momentum and other techniques
$\mathbf{w} \leftarrow \mathbf{w} - r \nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{w} - r \sum_{n=1}^{N} h(\mathbf{x}_n, \mathbf{w})\, \mathbf{x}_n$
$h(\mathbf{x}_n, \mathbf{w}) \stackrel{\mathrm{def}}{=} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)\, f(\mathbf{w}^\top \mathbf{x}_n)\big(1 - f(\mathbf{w}^\top \mathbf{x}_n)\big)$
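A rough mini-batch SGD sketch, assuming a gradient function such as the grad_J example given earlier; the batch size of 128 follows the slide, the other settings are arbitrary.

```python
import numpy as np

def sgd(w, X, y_star, grad_fn, r=0.1, batch_size=128, n_epochs=10, seed=0):
    # Mini-batch SGD: estimate the gradient from a small random subset of
    # the data at every step instead of the full sum over n = 1..N.
    rng = np.random.RandomState(seed)
    N = X.shape[0]
    for _ in range(n_epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            w = w - r * grad_fn(w, X[idx], y_star[idx])
    return w
```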
19. Overfitting and generalization error
• The goal of learning is to decrease the
generalization error, which is the error for
previously unseen data
• Having a low error on the data at hand is
not enough (or even harmful)
– We can achieve 0% error by memorizing all the
examples in the training data
– Complicated models (i.e., NNs with many
parameters and layers) can achieve this (if the
learning algorithm is clever enough).
20. Training procedure
• Avoid overfitting
• Split the data into two parts (a code sketch of the split follows below)
– Training dataset
• We optimize the parameters using this training dataset
– Validation dataset
• We evaluate the performance of the learned NN with
this validation dataset
• Optional: test error
– If you want to estimate the generalization error,
use a three-way split of the data and measure
the error on the last part, the test dataset
[Figure: the available data is split into a training part and a validation part]
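A minimal sketch of the two-way split described above; the validation ratio and shuffling scheme are assumptions, not the tutorial's prescription.

```python
import numpy as np

def train_val_split(X, y, val_ratio=0.2, seed=0):
    # Shuffle the available data, then hold out a fraction for validation.
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    n_val = int(len(X) * val_ratio)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]
```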
21. Extra topics implemented
by most of the frameworks
• Weight initialization
– Random
– Pretraining
– Transfer from another trained network
• Techniques for avoiding overfitting
– Dropout (a rough sketch follows after this list)
– Batch normalization
– ResNet
• Convolution
• Visualization
– Deconvolution
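As one concrete instance of the techniques listed above, a rough sketch of (inverted) dropout applied to a layer's activations; the keep probability is just an example value.

```python
import numpy as np

def dropout(y, keep_prob=0.5, train=True, rng=np.random):
    # Randomly zero out units during training and rescale the survivors,
    # so no rescaling is needed at test time (inverted dropout).
    if not train:
        return y
    mask = rng.binomial(1, keep_prob, size=y.shape)
    return y * mask / keep_prob
```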
22. Summary of this Part
• Neural networks are computational models
that stack neurons, i.e., nonlinear
computational units
• The gradients of the loss w.r.t. the
parameters are recursively calculated from
the top layer to the bottom layer by backprop
• Care must be taken to avoid overfitting by
following validation procedures