Slides for Part One of "Deep learning implementations and frameworks", presented as a tutorial at PAKDD 2016 (Pacific-Asia Conference on Knowledge Discovery and Data Mining).
The presentation took place on April 19, 2016 in Auckland, New Zealand.
http://pakdd16.wordpress.fos.auckland.ac.nz/technical-program/tutorials/
3. Objective
• Get started with deep learning research and practice
• 1) Learn the building blocks that are common to
most deep learning frameworks
– Review key technologies.
• 2) Understand the differences between the various
implementations
– How specific DL frameworks differ
– Useful to decide which framework to start with
• Not about coding know-how (although coding
examples will be given).
4. Intended audience
• Want to use neural networks
• Want to apply neural network architectures to
practical problems
• Expected background:
– Basics of computer science and numerical
computation
– General machine learning terminology (in
particular around supervised learning)
– Basic knowledge or practices of neural networks
(recommended)
– Basic knowledge of Python programming
language (recommended)
5. Overview
• 1st session (8:30 – 10:00)
– Introduction (AK)
– Basics of neural networks (AK)
– Common design of neural network
implementations (KO)
• 2nd session (10:30 – 12:30)
– Differences between deep learning frameworks
(ST)
– Coding examples of frameworks (KO & ST)
– Conclusion (ST)
6. Frameworks to be (and not to be) explained
• Deeply explained with coding examples
– Chainer – Python
– Keras – Python
– TensorFlow – Python
• Also compared
– Torch.nn – Lua
– Theano – Python
– Caffe – C++ & Python & MATLAB
– MXNet – many languages
– autograd – Python & Lua
• Others not explained
– Cloud computing, MATLAB toolboxes, DL4J, H2O, CNTK
– Wrappers: Lasagne, Blocks, skflow
– TensorBoard, DIGITS (only mentioned by name)
7. Basics of Neural Networks
Atsunori Kanemura
AIST, Japan
8. Artificial neural networks
• Biologically inspired
– A biological neuron is a nonlinear unit
connected with synapses at
the dendrites (input) and
the axon (output)
• A building block for pattern recognition
systems (and more)
9. Why neural networks?
• Superior performance
– Image recognition
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC) – exceeds human
performance
– Playing games
• AlphaGo – has defeated human experts
• Extended to other problems
– Images and text
• Show & Tell – generates text from images via
intermediate representations (“embeddings”)
– Learning artistic styles
– Many others (translation, speech recognition, …)
10. Technical internals of NNs
• Layered processing:
linear transformation (a.k.a. matrix multiplication,
affine transformation)
+ nonlinear operation (a.k.a. activation function)
• Adapt to data
11. Mathematical model for a neuron
• Compare the product of the inputs and weights (parameters) with a threshold
– Plasticity of the neuron = the change of the parameters (w and b)
• f: nonlinear transform
[Diagram: inputs x_1, …, x_D are weighted by w, summed (∑), shifted by the threshold b, and passed through f to give the output y]

$y = f\Big(\sum_{d=1}^{D} w_d x_d - b\Big) = f(\mathbf{w}^\top \mathbf{x} - b)$
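As a rough illustration of the formula above (not from the original slides), a single neuron can be computed with a few lines of NumPy; the logistic choice of f and the example values are assumptions of this sketch.

```python
import numpy as np

def sigmoid(a):
    # Logistic nonlinearity f; any other activation could be substituted here.
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b):
    # y = f(w^T x - b): weighted sum of the inputs compared with a threshold b.
    return sigmoid(np.dot(w, x) - b)

x = np.array([0.5, -1.0, 2.0])   # D = 3 inputs
w = np.array([0.1,  0.4, 0.3])   # weights (parameters)
b = 0.2                          # threshold
print(neuron(x, w, b))
```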
12. Generalized linear discriminant
• Generalized linear discriminant

$\hat{y} = f\Big(\sum_{d=1}^{D} w_d x_d - b\Big) = f(\mathbf{w}^\top \mathbf{x} - b)$

– f(·): nonlinear transformation
– ⇒ logistic (classical), probit, etc.
• Binary targets:

$y_n = \begin{cases} 1 & (x_n \text{ is positive}) \\ 0 & (x_n \text{ is negative}) \end{cases}$
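A minimal sketch of turning the discriminant into a binary decision, assuming the logistic choice of f noted above; thresholding ŷ at 0.5 corresponds to checking the sign of wᵀx − b.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def classify(x, w, b):
    # Generalized linear discriminant: predict 1 (positive) if f(w^T x - b) >= 0.5.
    y_hat = sigmoid(np.dot(w, x) - b)
    return 1 if y_hat >= 0.5 else 0
```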
13. Learning with loss minimization
• Learn from many samples $\{\mathbf{x}_n, y^*_n\}_{n=1}^{N}$
• Binary output:

$y^*_n = \begin{cases} 1 & (x_n \text{ is positive}) \\ 0 & (x_n \text{ is negative}) \end{cases}$

• Define the loss function (squared error):

$J(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)^2$

• Minimize J to learn (estimate) the parameters:

$\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})$
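A hedged NumPy sketch of the squared-error loss J(w) above; the array shapes (X of shape N×D, y_star of length N) are assumptions of this sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(w, X, y_star):
    # J(w) = 1/2 * sum_n (f(w^T x_n) - y*_n)^2, with X of shape (N, D).
    preds = sigmoid(X.dot(w))
    return 0.5 * np.sum((preds - y_star) ** 2)
```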
14. Neural networks
• Multi-layered:

$\mathbf{y}^1 = f_1(W^{10}\mathbf{x})$
$\mathbf{y}^2 = f_2(W^{21}\mathbf{y}^1)$
$\mathbf{y}^3 = f_3(W^{32}\mathbf{y}^2)$
$\cdots$
$\mathbf{y}^L = f_L(W^{(L)(L-1)}\mathbf{y}^{L-1})$

※ f works element-wise
• Minimize the loss to learn the parameters:

$J(\{W\}) = \frac{1}{2} \sum_{n=1}^{N} \big(\mathbf{y}^L(\mathbf{x}_n) - y^*_n\big)^2$
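A rough sketch of the layered forward computation above, assuming every f_l is the same element-wise sigmoid; the layer sizes are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights):
    # y^l = f_l(W^{l,l-1} y^{l-1}); here every f_l is the element-wise sigmoid.
    y = x
    for W in weights:
        y = sigmoid(W.dot(y))
    return y

# Hypothetical 3-layer network with sizes 4 -> 5 -> 3 -> 1.
rng = np.random.RandomState(0)
weights = [rng.randn(5, 4), rng.randn(3, 5), rng.randn(1, 3)]
print(forward(rng.randn(4), weights))
```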
15. Gradient descent
• The gradient of the loss for the 1-layer model (with logistic f, whose derivative is f(1 − f)) is

$\nabla_{\mathbf{w}} J(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \nabla_{\mathbf{w}} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)^2$
$= \sum_{n=1}^{N} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)\, \nabla_{\mathbf{w}} f(\mathbf{w}^\top \mathbf{x}_n)$
$= \sum_{n=1}^{N} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)\, f(\mathbf{w}^\top \mathbf{x}_n)\big(1 - f(\mathbf{w}^\top \mathbf{x}_n)\big)\, \mathbf{x}_n$

• The update rule (r is a constant learning rate):

$\mathbf{w} \leftarrow \mathbf{w} - r \nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{w} - r \sum_{n=1}^{N} h(\mathbf{x}_n, \mathbf{w})\, \mathbf{x}_n$
$h(\mathbf{x}_n, \mathbf{w}) \stackrel{\mathrm{def}}{=} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)\, f(\mathbf{w}^\top \mathbf{x}_n)\big(1 - f(\mathbf{w}^\top \mathbf{x}_n)\big)$
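A minimal NumPy sketch of this gradient and update rule for the 1-layer logistic model; the learning rate and step count are arbitrary example values.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_J(w, X, y_star):
    # grad = sum_n (f(w^T x_n) - y*_n) f(w^T x_n)(1 - f(w^T x_n)) x_n
    p = sigmoid(X.dot(w))
    h = (p - y_star) * p * (1.0 - p)
    return X.T.dot(h)

def gradient_descent(w, X, y_star, r=0.1, n_steps=100):
    # w <- w - r * grad_w J(w), repeated for a fixed number of steps.
    for _ in range(n_steps):
        w = w - r * grad_J(w, X, y_star)
    return w
```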
16. Backprop
• Use the chain rule to derive the gradient
• E.g. the 2-layer case

$\mathbf{y}^1_n = f(W^{10}\mathbf{x}_n), \qquad y^2_n = f(\mathbf{w}^{21}\cdot\mathbf{y}^1_n)$
$J(W^{10}, \mathbf{w}^{21}) = \frac{1}{2}\sum_{n} (y^2_n - y^*_n)^2$
$\dfrac{\partial J}{\partial W^{10}_{kl}} = \sum_{n,i} \dfrac{\partial J}{\partial y^1_{ni}} \dfrac{\partial y^1_{ni}}{\partial W^{10}_{kl}}$

– ⇒ Calculate the gradients recursively from the top layer down to the bottom layer
• Cf. vanishing gradients, ReLU
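A hedged sketch of backprop for the 2-layer case above, written by hand in NumPy; the variable names (W10, w21) follow the slide's notation, and the array shapes are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_2layer(W10, w21, X, y_star):
    # Forward pass: y^1_n = f(W^{10} x_n), y^2_n = f(w^{21} . y^1_n)
    Y1 = sigmoid(X.dot(W10.T))                        # shape (N, H)
    y2 = sigmoid(Y1.dot(w21))                         # shape (N,)
    # Backward pass: apply the chain rule from the top layer down.
    delta2 = (y2 - y_star) * y2 * (1.0 - y2)          # dJ/da^2_n
    grad_w21 = Y1.T.dot(delta2)                       # dJ/dw^{21}
    delta1 = np.outer(delta2, w21) * Y1 * (1.0 - Y1)  # dJ/da^1_{ni}
    grad_W10 = delta1.T.dot(X)                        # dJ/dW^{10}_{kl}
    return grad_W10, grad_w21
```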
17. Automatic Differentiation
• The math for backprop is obvious (but
tedious) once the NN architecture has been
defined
• The gradients can be calculated automatically
after defining the NN model
• This is called automatic differentiation
(a general concept that makes use
of the chain rule)
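As a minimal illustration, the autograd package listed among the compared frameworks can produce such gradients from a plain loss function; the toy loss and data below are assumptions of this sketch.

```python
# Requires the `autograd` package (pip install autograd).
import autograd.numpy as np
from autograd import grad

def loss(w, X, y_star):
    preds = 1.0 / (1.0 + np.exp(-np.dot(X, w)))
    return 0.5 * np.sum((preds - y_star) ** 2)

# grad() builds the derivative of `loss` w.r.t. its first argument
# automatically, using the chain rule under the hood.
loss_grad = grad(loss)

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_star = np.array([0.0, 1.0, 1.0])
w = np.zeros(2)
print(loss_grad(w, X, y_star))   # same result as a hand-derived gradient
```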
18. Parameter update
• Gradient Descent (GD)
• Stochastic Gradient Descent (SGD)
– Take several samples (say, 128) from the
dataset (a mini-batch) and estimate the gradient from them (see the sketch below)
– Theoretically motivated as the Robbins-Monro
algorithm
• From SGD to general gradient-based algorithms
– Adam, AdaGrad, etc.
– Use momentum and other techniques
$\mathbf{w} \leftarrow \mathbf{w} - r \nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{w} - r \sum_{n=1}^{N} h(\mathbf{x}_n, \mathbf{w})\, \mathbf{x}_n$
$h(\mathbf{x}_n, \mathbf{w}) \stackrel{\mathrm{def}}{=} \big(f(\mathbf{w}^\top \mathbf{x}_n) - y^*_n\big)\, f(\mathbf{w}^\top \mathbf{x}_n)\big(1 - f(\mathbf{w}^\top \mathbf{x}_n)\big)$
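A rough mini-batch SGD sketch, assuming a gradient function such as the grad_J example given earlier; the batch size of 128 follows the slide, the other settings are arbitrary.

```python
import numpy as np

def sgd(w, X, y_star, grad_fn, r=0.1, batch_size=128, n_epochs=10, seed=0):
    # Mini-batch SGD: estimate the gradient from a small random subset of
    # the data at every step instead of the full sum over n = 1..N.
    rng = np.random.RandomState(seed)
    N = X.shape[0]
    for _ in range(n_epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            w = w - r * grad_fn(w, X[idx], y_star[idx])
    return w
```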
19. Overfitting and generalization error
• The goal of learning is to decrease the
generalization error, which is the error for
previously unseen data
• Having a low error on the data at hand is
not enough (or even harmful)
– We can achieve 0% error by memorizing all the
examples in the training data
– Complicated models (i.e., NNs with many
parameters and layers) can achieve this (if the
learning algorithm is clever enough).
20. Training procedure
• Avoid overfitting
• Split the data into two parts (a code sketch of the split follows below)
– Training dataset
• We optimize the parameters using this training dataset
– Validation dataset
• We evaluate the performance of the learned NN with
this validation dataset
• Optional: test error
– If you want to estimate the generalization error,
use a three-way split of the data and measure
the error on the last part, the test dataset
[Figure: the available data is split into a training part and a validation part]
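A minimal sketch of the two-way split described above; the validation ratio and shuffling scheme are assumptions, not the tutorial's prescription.

```python
import numpy as np

def train_val_split(X, y, val_ratio=0.2, seed=0):
    # Shuffle the available data, then hold out a fraction for validation.
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    n_val = int(len(X) * val_ratio)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]
```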
21. Extra topics implemented
by most of the frameworks
• Weight initialization
– Random
– Pretraining
– Transfer from another trained network
• Techniques for avoiding overfitting
– Dropout (a rough sketch follows after this list)
– Batch normalization
– ResNet
• Convolution
• Visualization
– Deconvolution
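As one concrete instance of the techniques listed above, a rough sketch of (inverted) dropout applied to a layer's activations; the keep probability is just an example value.

```python
import numpy as np

def dropout(y, keep_prob=0.5, train=True, rng=np.random):
    # Randomly zero out units during training and rescale the survivors,
    # so no rescaling is needed at test time (inverted dropout).
    if not train:
        return y
    mask = rng.binomial(1, keep_prob, size=y.shape)
    return y * mask / keep_prob
```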
22. Summary of this Part
• Neural networks are computational models
that stack neurons, i.e., nonlinear
computational units
• The gradients of the loss w.r.t. the
parameters are recursively calculated from
the top layer to the bottom layer by backprop
• Care must be taken to avoid overfitting by
following validation procedures