Deep learning - a primer

Deep learning
A primer for the curious developer

Uwe Friedrichsen & Dr. Shirin Glander – codecentric AG – 2018

Uwe Friedrichsen, @ufried, uwe.friedrichsen@codecentric.de

Dr. Shirin Glander, @ShirinGlander, shirin.glander@codecentric.de

Why should I care about Deep Learning?

Deep learning has the potential to affect white collar workers (including IT)
in a similar way as robots affected blue collar workers.

What is Deep Learning?

Some success stories

What is Deep Learning?

A rough classification

From the broadest to the most specific field:
•  AI: Artificial Intelligence
•  ML: Machine Learning
•  RL: Representation Learning
•  DL: Deep Learning

Traditional AI

Focus on problems that are ...
•  ... hard for humans
•  ... straightforward for computers
•  ... can be formally described

Deep Learning

Focus on problems that are ...
•  ... intuitive for humans
•  ... difficult for computers
(hard to describe formally)
•  ... best learnt from experience

Where does Deep Learning come from?

General evolution

•  Two opposed forces
•  Recreation of biological neural processing
•  Abstract mathematical models (mostly linear algebra)
•  Results in different models and algorithms
•  No clear winner yet

Cybernetics (ca. 1940 - 1960)

•  ADALINE, Perceptron
•  Linear models, typically no hidden layers
•  Stochastic Gradient Descent (SGD)
•  Limited applicability
•  E.g., ADALINE could not learn XOR
•  Resulted in “First winter of ANN” (Artificial Neural Networks)

Connectionism (ca. 1980 - 1990)

•  Neocognitron
•  Non-linear models, distributed feature representation
•  Backpropagation
•  Typically 1, rarely more hidden layers
•  First approaches of sequence modeling
•  LSTM (Long short-term memory) in 1997
•  Unrealistic expectations nurtured by ventures
•  Resulted in “Second winter of ANN”

Deep Learning (ca. 2006 -)

•  Improved algorithms, advanced computing power
•  Enabled training much larger and deeper networks
•  Enabled training on much larger data sets
•  Typically several to many hidden layers
•  Overcame the “feature extraction dilemma”

What is Deep Learning used for?

Deep Learning application areas

•  Classification (incl. missing inputs)
•  Regression (value prediction)
•  Function prediction
•  Density estimation
•  Structured output (e.g., translation)

•  Anomaly detection
•  Synthesis and sampling
•  Denoising
•  Compression (dimension reduction)
•  ...

How does Deep Learning work?

A first (scientifically inspired) approach

“A computer program is said to learn
•  from experience E
•  with respect to some class of tasks T
•  and performance measure P
if its performance at tasks in T,
as measured by P,
improves with experience E.”

-- T. Mitchell, Machine Learning, p. 2, McGraw Hill (1997)

•  Experience E: supervised learning, unsupervised learning, reinforcement learning, ...
•  Tasks T: too difficult to solve with fixed programs designed by humans
•  Performance measure P: accuracy vs. error rate, training vs. test set, ...

Err ...
Hmm ...
Well ...
I don’t get it!

How does Deep Learning work?

A second (more down-to-earth) approach

Deep Learning, broken down into three topics:
•  Operating principle (structure and behavior): Neuron, Layer, Connection, Weight, Operation, Activation function, Hyperparameter
•  Training: Quality measure, Cost function, Optimization procedure, Stochastic gradient descent, Backpropagation, Data (Training set, Validation/Test set), Types (Supervised, Unsupervised, Reinforcement), Challenges (Under-/Overfitting, Regularization, Transfer learning)
•  Network types: MLP, CNN, RNN, LSTM, Autoencoder, GAN

Operating principle

Neuron

•  Design inspired by biological neurons
•  One or more inputs
•  Processing (and state storage) unit
•  One or more outputs
•  In practice often implemented as tensor transformations
•  Relevance of internal state depends on network type
•  Usually negligible for feed-forward networks
•  Usually relevant for recurrent networks
Figure: Neuron with input(s), a processing (+ state) unit, and output(s)

Layer

•  Neurons typically organized in layers
•  Input and output layer as default
•  Optionally one or more hidden layers
•  Layer layout can have 1-n dimensions
•  Neurons in different layers can have different properties
•  Different layers responsible for different (sub-)tasks
Figure: Input layer, hidden layer(s) 1, 2, ..., N, output layer

Connection

•  Usually connect input and output tensor in a 1:1 manner
•  Connect between layers (output layer N-1 → input layer N)
•  Layers can be fully or partially (sparsely) connected
•  RNNs also have backward and/or self connections
•  Some networks have connections between neurons
of the same layer (e.g., Hopfield nets, Boltzmann machines)

Weight

•  (Logically) augments a connection
•  Used to amplify or dampen a signal sent over a connection
•  The actual “memory” of the network
•  The “right” values of the weights are learned during training
•  Can also be used to introduce a bias for a neuron
•  By connecting it to an extra neuron that constantly emits 1

Operation

Step 1

•  For each neuron of input layer
•  Copy the corresponding input tensor value to the neuron’s input
•  Calculate state/output using activation function
(typically a linear function that passes the value through)

Step 2-N

•  For each hidden layer and output layer in their order
•  For each neuron of the layer
•  Calculate weighted sum of inputs
•  Calculate state/output using activation function
(see examples later)

Final step

•  For each neuron of output layer
•  Copy the neuron’s output to the corresponding output tensor value
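The three steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `forward`, the layer format `(weights, biases, activation)`, and the example weights are made up for this sketch, and the bias is passed explicitly instead of being modeled as an extra neuron that constantly emits 1.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, layers):
    """Feed-forward pass through a list of layers.

    Each layer is a hypothetical tuple (weights, biases, activation),
    where weights[j][i] is the weight from input i to neuron j.
    """
    # Step 1: the input layer passes the input tensor's values through.
    values = list(inputs)
    # Steps 2-N: per layer, weighted sum of inputs, then activation function.
    for weights, biases, activation in layers:
        values = [
            activation(sum(w * v for w, v in zip(row, values)) + b)
            for row, b in zip(weights, biases)
        ]
    # Final step: the values of the last layer form the output tensor.
    return values


# Made-up example: 2 inputs, 2 hidden sigmoid neurons, 1 linear output neuron.
hidden = ([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0], sigmoid)
output = ([[1.0, -1.0]], [0.0], lambda z: z)
result = forward([1.0, 1.0], [hidden, output])
print(result)
```

All neurons of a layer are computed from the previous layer's values only, which matches the default update procedure of updating all neurons per layer in parallel.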

•  Default update procedure (most widespread)
•  All neurons per layer in parallel
•  Different update procedures exist
•  E.g., some Hopfield net implementations
randomly pick neurons for update

Activation functions

Linear function

•  Easy to handle
•  Cannot handle non-linear problems

Logistic sigmoid function

•  Very widespread
•  Limits output to the range (0, 1)
•  Vanishing gradient problem

Hyperbolic tangent

•  Very widespread
•  Limits output to the range (-1, 1)
•  Vanishing gradient problem

Rectified linear unit (ReLU)

•  Easy to handle
•  Not differentiable at 0
•  Dying ReLU problem
•  Can be mitigated, e.g., by using leaky ReLU

Softplus

•  Smooth approximation of ReLU
•  ReLU usually performs better
•  Thus, use of softplus usually discouraged
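For reference, the activation functions listed above can be written down as a minimal Python sketch. The function names and the leaky ReLU slope of 0.01 are assumptions chosen for illustration, not values taken from the slides.

```python
import math


def linear(z):
    """Passes the value through unchanged."""
    return z


def sigmoid(z):
    """Logistic sigmoid, output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Hyperbolic tangent, output between -1 and 1."""
    return math.tanh(z)


def relu(z):
    """Rectified linear unit: zero for negative input, identity otherwise."""
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    """ReLU variant with a small slope for negative input (slope is an assumed default)."""
    return z if z > 0 else slope * z


def softplus(z):
    """Smooth approximation of ReLU: log(1 + e^z)."""
    return math.log1p(math.exp(z))


for name, fn in [("linear", linear), ("sigmoid", sigmoid), ("tanh", tanh),
                 ("relu", relu), ("leaky_relu", leaky_relu), ("softplus", softplus)]:
    print(name, [round(fn(z), 4) for z in (-2.0, 0.0, 2.0)])
```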

Hyperparameter

•  Influence network and algorithm behavior
•  Often influence model capacity
•  Not learned, but usually manually optimized
•  Currently quite some research interest in
automatic hyperparameter optimization
Examples

•  Number of hidden layers
•  Number of hidden units
•  Learning rate
•  Number of clusters
•  Weight decay coefficient
•  Convolution kernel width
•  ...

Training

Cost function (a.k.a. loss function)

•  Determines distance from optimal performance
•  Mean squared error as simple (and widespread) example
•  Often augmented with regularization term
for better generalization (see challenges)
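As a small worked example, mean squared error with an added regularization term might look as follows. This is a sketch under assumptions: the function names, the choice of an L2 penalty on the weights, and the parameter name `weight_decay` are illustrative and not prescribed by the slides.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def regularized_cost(predictions, targets, weights, weight_decay=0.01):
    """MSE plus an (assumed) L2 penalty on the weights."""
    penalty = weight_decay * sum(w * w for w in weights)
    return mean_squared_error(predictions, targets) + penalty


predictions = [1.0, 2.0, 4.0]
targets = [1.0, 2.0, 2.0]
mse = mean_squared_error(predictions, targets)  # (0 + 0 + 4) / 3
cost = regularized_cost(predictions, targets, weights=[1.0, -2.0], weight_decay=0.1)
print(mse, cost)
```

The penalty grows with the size of the weights, so minimizing the combined cost trades a slightly worse fit on the training data for smaller weights.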

Stochastic gradient descent

•  Direct calculation of minimum often not feasible
•  Instead stepwise “descent” using the gradient
→ Gradient descent
•  Evaluating the gradient on the full training set is not feasible for large training sets
•  Use (small) random sample of training set per iteration
→ Stochastic gradient descent (SGD)

Figure: The gradient at x gives direction and steepness; one step moves x to x’ = x − ε * gradient, with learning rate ε
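A minimal sketch of stochastic gradient descent, assuming a toy model y = w * x with a single weight and a mean squared error cost. The function name, the noise-free data set, and the default values for learning rate, batch size, and step count are made up for this illustration.

```python
import random


def sgd_fit_slope(data, learning_rate=0.05, batch_size=4, steps=500, seed=0):
    """Fit the single weight w of the model y = w * x with mini-batch SGD."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # Use a small random sample of the training set per iteration.
        batch = rng.sample(data, batch_size)
        # Gradient of the mean squared error with respect to w on this batch.
        gradient = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
        # Step against the gradient, scaled by the learning rate.
        w = w - learning_rate * gradient
    return w


# Toy training set generated from y = 3 * x (no noise).
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(-10, 11)]
w = sgd_fit_slope(data)
print(w)  # should end up close to 3
```

With a learning rate that is too large the steps can overshoot the minimum, with one that is too small the descent takes many more iterations.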

Backpropagation

•  Procedure to calculate new weights based on loss function
•  Usually “back-propagated” layer-wise
•  Most widespread optimization procedure
Figure: The weight update formula consists of terms that depend on the cost function, the activation function, and the input calculation
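The three dependencies can be made concrete for the simplest possible case. The sketch below assumes a single neuron with one weight, a sigmoid activation, and a squared error cost (all chosen for illustration), computes the gradient with the chain rule, and compares it with a numerical estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, target):
    """Squared error of a single sigmoid neuron with one weight."""
    return (sigmoid(w * x) - target) ** 2


def backprop_gradient(w, x, target):
    """Gradient of the loss with respect to w via the chain rule."""
    z = w * x                            # input calculation (weighted sum)
    out = sigmoid(z)                     # activation function
    dloss_dout = 2.0 * (out - target)    # depends on the cost function
    dout_dz = out * (1.0 - out)          # depends on the activation function
    dz_dw = x                            # depends on the input calculation
    return dloss_dout * dout_dz * dz_dw


def numerical_gradient(w, x, target, h=1e-6):
    """Central difference estimate, used only as a sanity check."""
    return (loss(w + h, x, target) - loss(w - h, x, target)) / (2.0 * h)


analytic = backprop_gradient(0.5, 2.0, 1.0)
numeric = numerical_gradient(0.5, 2.0, 1.0)
print(analytic, numeric)
```

In a network with several layers the same product is extended layer by layer from the output back to the input, which is where the name comes from.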

Data set

•  Consists of examples (a.k.a. data points)
•  Example always contains input tensor
•  Sometimes also contains expected output tensor
(depending on training type)
•  Data set usually split up in several parts
•  Training set – optimize accuracy (always used)
•  Test set – test generalization (often used)
•  Validation set – tune hyperparameters (sometimes used)
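Splitting a data set into the three parts could look like this. The function name, the 70/15/15 split, and the fixed random seed are assumptions for the sake of the example; suitable fractions depend on the amount of data available.

```python
import random


def split_data_set(examples, train_fraction=0.7, validation_fraction=0.15, seed=0):
    """Shuffle the examples and split them into training, validation, and test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return training, validation, test


training, validation, test = split_data_set(range(100))
print(len(training), len(validation), len(test))
```

Shuffling before splitting avoids that an ordering in the original data (e.g., by date or by class) ends up in only one of the parts.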

Supervised learning

•  Typically learns from a large, yet finite set of examples
•  Examples consist of input and output tensor
•  Output tensor describes desired output
•  Output tensor also called label or target
•  Typical application areas
•  Classification
•  Regression and function prediction
•  Structured output problems

Unsupervised learning

•  Typically learns from a large, yet finite set of examples
•  Examples consist of input tensor only
•  Learning algorithm tries to learn useful properties of the data
•  Requires different type of cost functions
•  Typical application areas
•  Clustering, density estimation
•  Denoising, compression (dimension reduction)

Reinforcement learning

•  Continuously optimizes interaction with an environment
based on reward-based learning
•  Goal is selection of action with highest expected reward
•  Takes (discounted) expected future rewards into account
•  Labeling of examples replaced by reward function
•  Can continuously learn → data set can be infinite
•  Typically used to solve complex tasks in (increasingly)
complex environments with (very) limited feedback

Figure: Agent and environment interact in a loop. Given state t and reward t, the agent chooses action t; the environment answers with state t+1 and reward t+1.
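How discounting affects the choice of action can be shown with a tiny, made-up example. The function names, the two actions, and their reward sequences are hypothetical; a real agent would have to estimate the expected rewards from interaction with the environment.

```python
def discounted_return(rewards, discount=0.9):
    """Sum of rewards where the reward k steps ahead is weighted by discount**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


def best_action(action_rewards, discount):
    """Pick the action whose (assumed known) reward sequence has the highest discounted return."""
    return max(action_rewards, key=lambda a: discounted_return(action_rewards[a], discount))


# Hypothetical reward sequences: steady small rewards vs. one late large reward.
action_rewards = {"safe": [1.0, 1.0, 1.0], "risky": [0.0, 0.0, 5.0]}
print(best_action(action_rewards, discount=0.5))  # future counts little
print(best_action(action_rewards, discount=0.9))  # future counts a lot
```

A low discount factor makes the agent short-sighted, a factor close to 1 makes late rewards almost as valuable as immediate ones.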

Underfitting and Overfitting

•  Training error describes how well the training data has been learnt
•  Test error is an indicator for generalization capability
•  Core challenge for all machine learning algorithms
1.  Make training error small
2.  Make gap between training and test error small
•  Underfitting is the violation of #1
•  Overfitting is the violation of #2

Figure: Underfitting, good fit, and overfitting, shown against training data and test data

Underfitting and Overfitting

•  Under- and overfitting influenced by model capacity
•  Too low capacity usually leads to underfitting
•  Too high capacity usually leads to overfitting
•  Finding the right capacity is a challenge

Regularization

•  Regularization is any modification applied to a learning algorithm that is intended to reduce the generalization error but not the training error
•  Weight decay is a typical regularization measure

Transfer learning

•  How to transfer insights between related tasks
•  E.g., is it possible to transfer knowledge gained while training
to recognize cars to the problem of recognizing trucks?
•  General machine learning problem
•  Subject of many research activities

Network types

Multilayer perceptron (MLP)

•  Multilayer feed-forward networks
•  “Vanilla” neural networks
•  Typically used for
•  Function approximation
•  Regression
•  Classification
Image source: https://deeplearning4j.org

Convolutional neural network (CNN)

•  Special type of MLP for image processing
•  Connects each convolutional neuron only to its receptive field
•  Advantages
•  Less computing power required
•  Often even better recognition rates
•  Inspired by organization of visual cortex
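The idea of a receptive field can be illustrated in one dimension, which keeps the example short; image processing uses the same principle in two dimensions. The function name, the signal, and the kernel are assumptions for this sketch. As in most deep learning libraries, the kernel is applied without flipping (strictly speaking a cross-correlation).

```python
def convolve_1d(signal, kernel):
    """Slide the kernel over the signal; each output sees only `len(kernel)` inputs."""
    width = len(kernel)
    return [
        # The receptive field of output `start` is signal[start:start + width].
        sum(k * s for k, s in zip(kernel, signal[start:start + width]))
        for start in range(len(signal) - width + 1)
    ]


# A made-up kernel that responds to changes between neighboring values.
edges = convolve_1d([0, 0, 1, 1, 1, 0], kernel=[-1, 1])
print(edges)  # [0, 1, 0, 0, -1]
```

The same two kernel weights are reused at every position, so the layer needs far fewer weights than a fully connected layer over the same input.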

Recurrent neural network (RNN)

•  Implements internal feedback loops
•  Provides a temporal memory
•  Typically used for
•  Speech recognition
•  Text recognition
•  Time series processing
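The feedback loop can be sketched with a single recurrent neuron. The function names, the tanh activation, and the weight values are assumptions for illustration; real recurrent networks use weight matrices and learn them during training.

```python
import math


def rnn_step(x, state, w_input, w_state, bias=0.0):
    """New state from the current input and the previous state (self connection)."""
    return math.tanh(w_input * x + w_state * state + bias)


def run_rnn(sequence, w_input=1.0, w_state=0.5):
    """Feed a sequence through the neuron, carrying the state from step to step."""
    state = 0.0
    states = []
    for x in sequence:
        state = rnn_step(x, state, w_input, w_state)
        states.append(state)
    return states


# One impulse followed by zeros: the state keeps a fading trace of the impulse.
states = run_rnn([1.0, 0.0, 0.0])
print([round(s, 4) for s in states])
```

The trace of the first input fades with every step, which hints at why plain recurrent networks struggle with very long-term dependencies.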

Long short-term memory (LSTM)

•  Special type of RNN
•  Uses special LSTM units
•  Can implement very long-term memory
while avoiding the vanishing/exploding
gradient problem
•  Same application areas as RNN

Autoencoder
•  Special type of MLP
•  Reproduces input at output layer
•  Consists of encoder and decoder
•  Usually configured undercomplete (code layer smaller than input layer)
•  Learns efficient feature codings
•  Dimension reduction (incl. compression)
•  Denoising
•  Usually needs pre-training to avoid merely reconstructing
the average of the training set

Generative adversarial networks (GAN)
•  Consists of two (adversarial) networks
•  Generator creating fake images
•  Discriminator trying to identify fake images
•  Typically used for
•  Image synthesis (e.g., textures in games)
•  Structured output with variance (e.g., variations of a design or voice generation)
•  Probably best known for creating fake celebrity images

How does Deep Learning feel in practice?

What issues might I face if diving deeper?

Issues you might face

•  Very fast moving research domain
•  You need the math. Really!
•  How much data do you have?
•  GDPR: Can you explain the decision of your network?
•  Meta-Learning as the next step
•  Monopolization of research and knowledge

Wrap-up

•  Broad, diverse topic
•  Very good library support and more
•  Very active research topic
•  No free lunch
•  You need the math!

→ Exciting and important topic – become a part of it!

References

•  I. Goodfellow, Y. Bengio, A. Courville, “Deep Learning”,
MIT Press, 2016, also available via https://www.deeplearningbook.org
•  C. Perez, “The Deep Learning AI Playbook”,
Intuition Machine Inc., 2017
•  F. Chollet, “Deep Learning with Python”,
Manning Publications, 2017
•  OpenAI, https://openai.com
•  Keras, https://keras.io
•  Deep Learning for Java, https://deeplearning4j.org/index.html
•  Deep Learning (Resource site), http://deeplearning.net
