DATANOMIQ GmbH | Franklinstr. 11 | 10587 Berlin
Densely Connected Layers
Terminology
 You’re going to learn about “feedforward neural networks,” in which activations flow only forward, never backward.
 And this lecture is about “densely connected layers,” also called “fully connected layers.”
 We can also say that a neural network is a combination of perceptrons, which is why it is also called a “multilayer perceptron.”
 In this lecture, we’d like to take another approach to examining this structure, an approach that often gives people a misleading picture of machine learning.
Something Like Biology : The Structure of Neurons
 When the electrical potential of a neuron reaches a certain level, it emits an electrical pulse.
 Each neuron receives electrical pulses from other neurons.
 The sensitivity of each neuron is determined by its synapses.
Structure of a Unit of a Neural Network : Mimicking the Brain
 When the electrical potential of the neuron reaches a certain level, it emits the next pulse.
 This is like the on/off of a switch.
 The sigmoid function behaves similarly.
Overview of the Architecture of Densely Connected Layers
Just repeat it
Overview of the Architecture of Densely Connected Layers
And repeat it
That’s all
Classifying MNIST Dataset with Densely Connected
Layers : “Hello World” of Machine Learning
Black and white images
of 28*28 = 784 pixels
 Some people say this is the “Hello, world!” of machine learning.
 You can classify the MNIST dataset with densely connected layers.
“Hello World!” of Machine Learning
[Figure: a 28*28 image of the handwritten digit ‘5’ is flattened into a 784-d vector of pixel values (e.g. 1.0, 0.2, 0.3, …), passed through a 16-d hidden vector, and mapped to a 10-d output vector of class probabilities (e.g. 3%, …, 83%, …, 5%).]
Naive Image Classification with Densely Connected Layers
[Figure: a plot of the errors during training.]
You can achieve about 90% accuracy with densely connected layers.
(I used Keras, one of the major deep learning libraries. A minimal sketch of such a model follows below.)
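The slides don’t show the code itself, so here is a minimal Keras sketch of the 784 → 16 → 10 network described above; the hidden size follows the “16-d vector” in the figure, while the activations, optimizer, batch size, and epoch count are my own assumptions:

```python
# Minimal sketch of a dense 784 -> 16 -> 10 MNIST classifier (assumed setup).
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0   # flattening
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

model = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(16, activation="sigmoid"),   # 16-d hidden vector
    keras.layers.Dense(10, activation="softmax"),   # 10-d output vector
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=32)
print(model.evaluate(x_test, y_test))   # roughly 90% accuracy is typical
```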
Please open your browser and search for “machine learning” with an image search engine.
We’ve looked at analogies between neural networks and brain neurons.
[Figure: image search results on Google and Bing, and the DATANOMIQ official website.]
 This, seemingly, is the image of machine learning in the media.
 But please keep in mind that neural networks are NOT models of the brain.
 A neural network is nothing but a mapping from input vectors or tensors to output vectors or tensors.
Let’s Go More Mathematically
[Figure: a densely connected network. Each unit computes a weighted sum Σ followed by an activation function ℎ(⋅). The input layer is the No. 0 layer, the hidden layers are the No. 1 ~ L-1 layers, and the output layer is the No. L layer, whose output is compared with a supervising vector.]
Let’s Go More Mathematically :
Neural Network is just a mapping
[Figure: the same network of weighted sums Σ and activations ℎ(⋅), viewed as a single mapping from the input vector to the output vector.]
Calculations of Neural Networks Are Divided into Two Parts
 Forward propagation : calculating from the input layer to the output layer, activating each neuron.
 Back propagation : calculating from the output layer back to the input layer, updating the parameters.
In short, forward propagation computes the output of the mapping, and back propagation computes how to update its parameters.
Forward propagation
[Figure: a unit receiving inputs from the units of the previous layer, plus a bias input labeled 1, through a weighted sum Σ.]
You can generalize the relation between any pair of units in adjacent layers this way.
Forward propagation : Let’s calculate concretely.
[Figure: the No. j unit of the (l+1)th layer and its connections to the units of the lth layer.]
Please pay attention to the No. j neuron in the (l+1)th layer.
Forward propagation : Let’s calculate concretely.
[Figure: the same pair of layers, with the weights, bias, and activation of the No. j unit labeled.]
Assume that the No. j unit receives the weighted sum of the previous layer’s activations plus a bias; then its own activation is the activation function ℎ(⋅) applied to that sum, as reconstructed below.
*Keep in mind that the input labeled 1 in the figure lets the bias be treated as just another weight.
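The forward-propagation equations on this slide were images; a reconstruction under assumed notation (z for activations, w for weights, b for biases, h for the activation function, superscripts for layer indices):

\[
a_j^{(l+1)} = \sum_{i} w_{ji}^{(l+1)} z_i^{(l)} + b_j^{(l+1)}, \qquad
z_j^{(l+1)} = h\!\left(a_j^{(l+1)}\right)
\]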
Forward Propagation : Activation Functions in Hidden Layers
Let’s take a brief look at some activation functions.
 Sigmoid function
 Hyperbolic tangent
 ReLU function
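For reference, the standard definitions of the three functions listed above (the formulas are the usual textbook definitions, not taken from the slides):

\[
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad
\mathrm{ReLU}(x) = \max(0, x)
\]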
Forward Propagation at the Last Layer : Regression
 In the case of a regression problem, the activation function in the last layer is usually the identity mapping.
 I mean, you do nothing.
Forward Propagation at the Last Layer : Classification
 In the case of a multiclass classification problem, the activation function in the last layer is usually a softmax function. A softmax function is defined as below.
*Note that each output lies between 0 and 1, and the sum in the denominator runs over the number of classes.
 The outputs of the last layer sum to 1, so the softmax function is useful for turning the output into probabilities.
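The definition itself was lost in the export; the standard softmax, with a_k the k-th pre-activation and K the number of classes (symbols are mine), is:

\[
y_k = \frac{\exp(a_k)}{\sum_{k'=1}^{K} \exp(a_{k'})}, \qquad
0 \le y_k \le 1, \qquad \sum_{k=1}^{K} y_k = 1
\]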
Forward Propagation
[Figure: the same MNIST diagram — a flattened 784-d vector of pixel values is mapped through a 16-d hidden vector to a 10-d vector of class probabilities for the digit ‘5’.]
In the case of the handwritten digit classification problem, the densely connected layers map a flattened image to class probabilities.
Then, how can we get the parameters of such a useful function?
Mathematical General Outline of Supervised Learning
(When You Use Normal Gradient Descent)
 Set an error function whose variables are the parameters.
 Optimize the parameters so that they minimize the error function.
 That means applying gradient descent to the loss function with respect to the parameters.
Most importantly, calculating the parameters is what supervised learning is all about. This is an outline of supervised learning using gradient descent.
What are Error Functions in Supervised Learning?
 Assume that you have an output vector and a supervising vector.
 We want the output vector to be close to the supervising vector.
 In short, we want to set a loss function that gets smaller as the output gets closer to the supervising vector.
*Be careful that, for now, we’re going to consider only one data point.
*Note that ‘n’ is the index of the data sample.
Error Function : Square Error
 We use a square error as the loss function for regression problems.
 In a regression problem, we simply want the output to be close to the target.
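The formula was an image in the original slide; a standard reconstruction, with \(\mathbf{y}_n\) the output vector and \(\mathbf{t}_n\) the supervising vector of the n-th sample (the factor 1/2 is a common convention, not necessarily the author’s), is:

\[
E_n = \frac{1}{2}\left\lVert \mathbf{y}_n - \mathbf{t}_n \right\rVert^2
    = \frac{1}{2}\sum_{k} \left( y_{nk} - t_{nk} \right)^2
\]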
Error Function : Cross Entropy
 We use cross entropy as the loss function of a neural network for classification problems.
 The logarithm of each output is always negative (outputs lie between 0 and 1), and each target value is an element of a one-hot encoding.
*Note that ‘n’ is the index of the data sample.
Question : What will the cross entropy be like if the classification is more correct?
The cross entropy will be smaller.
Hint :
[Figure: a graph of y = log(x).]
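The cross entropy formula itself was also lost; the standard definition for a one-hot target, in the same assumed notation, is:

\[
E_n = - \sum_{k=1}^{K} t_{nk} \log y_{nk}
\]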
Gradient Descent : Gradient
 Let the error function be a function of D variables (the parameters).
 Its gradient, denoted with the nabla symbol, is defined as the vector of partial derivatives shown on the right side.
 The gradient of a function is “the direction that maximizes the positive change of the function.”
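A reconstruction of the definition referred to above, for an error function E of D parameters \(w_1, \dots, w_D\) (notation assumed):

\[
\nabla E(\mathbf{w}) =
\left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \dots, \frac{\partial E}{\partial w_D} \right)^{\mathsf T}
\]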
Gradient Descent : In Case of 2 Variables
The most efficient way to descend a slide (maybe).
Let’s look at an example of gradient descent in a 3-dimensional space.
If you calculate the partial derivatives along each axis, you can maximize this change.
Gradient Descent : In Case of 2 Variables
 So much for stupid jokes. Let’s use gradient descent for more practical stuff.
 Kids are all right without gradient descent.
Supervised Learning : Simple Example
A simple example is linear regression. In this case the error function J is the mean square error.
Suppose that we have a dataset, and we want to fit a straight line to the data.
The error function is then defined as below.
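The dataset, the fitted line, and the error function appeared as images; a standard reconstruction (the symbols a, b and the factor 1/2N are my own choices):

\[
\{(x_n, y_n)\}_{n=1}^{N}, \qquad f(x) = a x + b, \qquad
J(a, b) = \frac{1}{2N} \sum_{n=1}^{N} \bigl( y_n - f(x_n) \bigr)^2
\]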
Back Propagation
 We want to calculate the slope and intercept that minimize the error function J.
 The graph of the error function is a simple bowl, and it’s easy to find the minimum point in this case.
 Also, there are formulas to calculate them directly.
Back Propagation
 In this lecture, let’s see how to apply gradient descent to this linear regression problem.
 In short, you just need to repeat the update below.
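The update referred to above, reconstructed with a learning rate \(\eta\) (a small positive constant; the symbol is mine):

\[
a \leftarrow a - \eta \frac{\partial J}{\partial a}, \qquad
b \leftarrow b - \eta \frac{\partial J}{\partial b}
\]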
If You Use Gradient Descent
An image of gradient descent : a ball rolling down the surface of the error function from a start point.
Gradient Descent
 If you calculate the gradient at a point on the error surface and move slightly in that direction, you reach a higher point on the surface.
 In reverse, if you move against the gradient at each step, you reach a lower point on the surface. This is gradient descent.
 For supervised learning, you apply gradient descent to an error function, which is a function of the parameters.
 And you can apply gradient descent to functions of more than 2 variables. But of course you can’t visualize it anymore.
Gradient Descent for Neural Network
 In the case of simple linear regression, we considered only 2 parameters.
 Question : How many parameters does the “Hello world!” neural network on the right side have?
[Figure: the 784-d input vector, 16-d hidden vector, and 10-d output vector of the MNIST network — 12730 parameters in total.]
 Answer : 784*16 + 16 + 16*10 + 10 = 12730
 …..At least it’s more than 2 parameters, seemingly.
Super Simple Example of Densely Connected Layers
[Figure: a small network — a No. 0 (input) layer, a No. 1 (hidden) layer with a sigmoid activation, and a No. 2 (output) layer with a softmax activation, whose output is compared with a supervising vector.]
 Before the digit classification problem, we’ll consider this simple neural network and its toy implementation.
 And we’ll think about the following simple classification problem.
Super Simple Example of Densely Connected Layers
 Data points are generated as three clusters.
 Data points within each cluster share the same mean and the same variance.
[Figure: Training — 100 data points used for training the neural network; Classifying — 100 data points used for testing the neural network.]
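The slides don’t show how the toy data was generated; a minimal sketch under assumed cluster centers and spread (all the specific numbers below are my own choices, not the author’s):

```python
# Hypothetical three-cluster toy data: roughly 100 training and 100 test points,
# drawn from three Gaussians with equal per-cluster variance.
import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])  # assumed centers

def make_clusters(n_per_class, std=0.5):
    xs, ys = [], []
    for label, c in enumerate(centers):
        xs.append(c + std * rng.standard_normal((n_per_class, 2)))
        ys.append(np.full(n_per_class, label))
    return np.concatenate(xs), np.concatenate(ys)

x_train, y_train = make_clusters(34)   # ~100 training points
x_test, y_test = make_clusters(34)     # ~100 test points
```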
Naively Calculating Approximations of Partial Derivatives
Suppose that the error function is a function of N parameters, and we want to calculate the partial derivative at a given point with respect to each parameter (i = 1, 2, … , N).
*Note that the coordinates of that point are not variables. They are constants.
Naively Calculating Approximations of Partial Derivatives
….Sorry, I’ve written it in a slightly snobbish way. Let’s look at a simple example.
You can get an approximation of the partial derivative of a function at a point as below.
Naively Calculating Approximations of Partial Derivatives : A Simple Example
Suppose a concrete function of two variables and a concrete point.
You can calculate an approximation of each partial derivative at that point by nudging one variable by a small amount while keeping the others fixed, as sketched below.
*The coordinates of the point are constants.
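A sketch of the naive finite-difference approximation described on these slides; the example function, the step size eps, and the use of central (rather than one-sided) differences are my own choices:

```python
# Naive approximation of each partial derivative by a finite difference:
# df/dw_i  ~  (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps)
import numpy as np

def numerical_gradient(f, w, eps=1e-4):
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)  # two evaluations per parameter
    return grad

# Example with a simple two-variable function:
f = lambda w: w[0] ** 2 + 3.0 * w[0] * w[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))  # approx. [8., 3.]
```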
The number of parameters
 In the case of this simple neural network, it has 2*2 + 2 + 2*3 + 3 = 15 parameters.
 To train this neural network on 100 training data points by naively calculating approximations of the partial derivatives, it needed around 250 sec.
Back Propagation : More Sophisticated Partial Derivatives
 Again, in the “Hello World!” densely connected layers, you have to calculate partial derivatives with respect to 12730 parameters.
[Figure: the 784-d input vector, 16-d hidden vector, and 10-d output vector — 12730 parameters.]
 We usually use a method called back propagation to calculate them.
Chain Rule : Warming Up for Back Propagation
 Chain rules are essential for back propagation algorithms.
*In these slides I avoid a mathematically precise discussion.
 Let one quantity be a function of a variable, and let that variable be a function of another variable.
 Then the derivative of the quantity with respect to the last variable is calculated as below.
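A reconstruction of the single-variable chain rule the slide shows, writing the quantities as z(y) and y(x) (the names are mine):

\[
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
\]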
Chain Rule : 2 Variables
 Let a quantity be a function of two intermediate variables, and let each of those intermediate variables be a function of two further variables.
 Then the derivative of the quantity with respect to either of the further variables is calculated the same way, summing over the intermediate variables.
Chain Rule : In General
 Let a quantity be a function of n variables, and let those n variables be functions of m further variables.
 Then the derivative of the quantity with respect to each of the m variables is calculated as below.
Brief Review of the Chain Rule for Back Propagation
 This generalized chain rule is super important for back propagation.
 For simplicity, let’s write the composed function compactly.
 Again, the partial derivative of the quantity with respect to each of those variables is given by the generalized chain rule (reconstructed below).
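A reconstruction of the generalized chain rule these slides refer to, for \(z = z(y_1, \dots, y_n)\) with each \(y_i\) a function of \(x_1, \dots, x_m\) (the names are mine):

\[
\frac{\partial z}{\partial x_j} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x_j}
\qquad (j = 1, \dots, m)
\]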
Back Propagation
[Figure: the same multilayer network of weighted sums Σ and activations ℎ(⋅).]
Back propagation is an efficient way to calculate the partial derivatives of an error function with respect to each parameter.
 You need the error of each neuron to calculate those partial derivatives.
*The error of a layer is defined as the partial derivative of the error function with respect to that layer’s weighted inputs.
Back Propagation
[Figure: the same network, with errors propagating backward from the output layer.]
To calculate the error of the No. 1 layer, you need the error of the No. 2 layer.
To calculate the error of the No. 2 layer, you need the error of the No. 3 layer.
…
To calculate the error of the No. L-1 layer, you need the error of the last layer.
Back Propagation
[Figure: one unit and its incoming weights between two adjacent layers, plus a bias input labeled 1.]
 Let’s think about updating the parameters so that they decrease an error function.
 We want to update the parameters using gradient descent.
*Be careful that this error function is calculated with only one data point.
Back Propagation
[Figure: the same pair of layers, focusing on one weight.]
 Using the chain rule, we can split the partial derivative of the error function with respect to a weight into two factors.
 Defining the error of a unit as the partial derivative of the error function with respect to that unit’s weighted input, the derivative with respect to the weight is that error times the activation flowing into the weight.
* Pay attention to which quantities the error function can be regarded as a function of.
Back Propagation
[Figure: the same pair of layers.]
 You can calculate the derivatives with respect to the other parameters, such as the bias, in the same way.
 Next, let’s calculate the error of a unit itself.
* Pay attention to the difference from the last time we applied the chain rule. You have to consider which quantity is a function of what.
Back Propagation
[Figure: the same pair of layers.]
 Applying the chain rule once more gives a recursion for the errors, as reconstructed below.
 This equation shows that to calculate an error in a layer, you need all the errors in the higher layer.
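The equations on these derivation slides did not survive the export. A standard reconstruction of what they show, with δ the per-unit error and a, z, w, b the weighted inputs, activations, weights, and biases from the forward-propagation sketch (the notation beyond that is mine):

\[
\delta_j^{(l)} \equiv \frac{\partial E_n}{\partial a_j^{(l)}}, \qquad
\frac{\partial E_n}{\partial w_{ji}^{(l)}} = \delta_j^{(l)}\, z_i^{(l-1)}, \qquad
\frac{\partial E_n}{\partial b_j^{(l)}} = \delta_j^{(l)},
\]
\[
\delta_j^{(l)} = h'\!\left(a_j^{(l)}\right) \sum_{k} w_{kj}^{(l+1)}\, \delta_k^{(l+1)}
\]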
Back Propagation :
Let’s Make the Most of Linear Algebra
Collecting the weights of a layer into a matrix and the errors and activations into vectors, the error recursion and the parameter derivatives can be written as matrix-vector products, so a whole layer can be handled at once.
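In matrix-vector form (again a reconstruction; ⊙ denotes the element-wise product):

\[
\boldsymbol{\delta}^{(l)} = \left( W^{(l+1)} \right)^{\mathsf T} \boldsymbol{\delta}^{(l+1)} \odot h'\!\left( \mathbf{a}^{(l)} \right), \qquad
\frac{\partial E_n}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} \left( \mathbf{z}^{(l-1)} \right)^{\mathsf T}
\]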
The number of parameters
 To train this neural network on the same 100 training data points, using back propagation, it needed around 31 sec. (versus around 250 sec. for the naive numerical approximation).
Super Simple Example of Densely Connected Layers
Accuracy : 70.8 %
There Are Many Other Things to Think About
 Other activation functions
 Batch learning vs. mini-batch learning
 Applying other types of optimization
 How to initialize the weights
 Regularization of the data
 Dropout
….You’ll soon realize that machine learning is largely a matter of choosing those hyperparameters : the parameters you don’t train through back propagation.
Super Simple Example of Densely Connected Layers
Sigmoid function → ReLU function
Accuracy : 73.2 %
Training a Neural Network : Stochastic Gradient Descent
 We have been thinking about a loss function calculated with only one data point.
 It is easy to imagine that the total sum of these per-sample losses is more useful for evaluating how well the neural network fits all the data. This is batch learning.
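In symbols (a reconstruction; \(E_n\) is the per-sample loss from the earlier slides and N the number of data points):

\[
E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w})
\]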
Training a Neural Network : Stochastic Gradient Descent
 Question : In practice, you don’t use the total loss over the whole dataset as the loss function for training a neural network. Why?
 Computationally expensive : usually you have a huge number of data points.
 Data is redundant : many data points are similar, so even if you reduce the data at random, the reduced dataset is still useful to some extent.
 The gradient calculated with all the data is NOT noisy.
[Figure: a scatter plot of x against y, before and after random data reduction.]
Training a Neural Network : Stochastic Gradient Descent
 Let’s think about the third reason on the last slide: “The gradient calculated with all the data is NOT noisy.”
 Assume that the graph on the right side is a loss function calculated with the whole dataset.
 If you apply gradient descent from a start point using this total loss, the point will probably shift smoothly along the surface of the graph.
[Figure: a loss surface with a start point and a minimum point.]
Training a Neural Network : Stochastic Gradient Descent
[Figure: the same loss surface, with a trajectory from the start point getting stuck before the minimum point.]
 In fact, a smooth and exact track of gradient descent, like a rolling ball, is not necessarily good.
 Because, depending on how you set the start point, it can get stuck at a local minimum.
Training a Neural Network : Stochastic Gradient Descent
 Question : What would happen if you calculated the partial derivatives using a loss computed from only one data point?
 The track of gradient descent would be a zigzag path, but it would still head in the direction of the minimum point.
[Figure: a noisy, zigzag trajectory from the start point towards the minimum point.]
Gradient Descent : Batch Learning
[Figure: in each epoch, the (shuffled) dataset is used as a whole to calculate one error function, and the parameters are updated once per epoch.]
Gradient Descent : Stochastic Gradient Descent (SGD)
[Figure: the dataset is shuffled; then, for each data point in turn, an error function is calculated and the parameters are updated. One pass over all the data points is one epoch.]
Training a Neural Network : A Pseudo Code of SGD
You need the partial derivatives with respect to each parameter to apply gradient descent, and the inner part of the pseudo code calculates those derivatives using back propagation.
This algorithm updates the parameters for each data point; a sketch follows below.
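The pseudo code itself did not survive the export; a minimal Python sketch of the per-sample SGD loop it describes (the backprop() function, learning rate, and epoch count are hypothetical placeholders):

```python
# Hypothetical per-sample SGD loop: shuffle the data, then update the
# parameters once per data point using gradients from back propagation.
import numpy as np

def sgd(params, x_train, y_train, backprop, lr=0.1, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x_train)
    for _ in range(epochs):
        for i in rng.permutation(n):                 # shuffle the dataset
            grads = backprop(params, x_train[i], y_train[i])  # partial derivatives
            for key in params:                       # update every parameter
                params[key] -= lr * grads[key]
    return params
```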
Gradient Descent : Mini-Batch Learning
[Figure: in each epoch, the dataset is shuffled and divided into mini-batches; for each mini-batch, the error functions in the batch are calculated and the parameters are updated.]
Training a Neural Network : Mini-Batch Gradient Descent
Just as with plain SGD, you calculate the partial derivatives with back propagation, but you apply gradient descent using the average of the partial derivatives over each batch.
This algorithm updates the parameters once per mini-batch; a sketch follows below.
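A matching sketch of the mini-batch variant, averaging the per-sample gradients over each batch (backprop() and the batch size are again hypothetical):

```python
# Hypothetical mini-batch gradient descent: update once per batch, using the
# average of the per-sample gradients within the batch.
import numpy as np

def minibatch_gd(params, x_train, y_train, backprop,
                 lr=0.1, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x_train)
    for _ in range(epochs):
        order = rng.permutation(n)                    # shuffle the dataset
        for start in range(0, n, batch_size):         # divide into mini-batches
            batch = order[start:start + batch_size]
            grads = {k: np.zeros_like(v) for k, v in params.items()}
            for i in batch:                           # per-sample back propagation
                g = backprop(params, x_train[i], y_train[i])
                for k in params:
                    grads[k] += g[k]
            for k in params:                          # average, then update
                params[k] -= lr * grads[k] / len(batch)
    return params
```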