DATANOMIQ GmbH | Franklinstr. 11 | 10587 Berlin
Densely Connected Layers
Terminology
 You’re going to learn about “feedforward neural networks,” in which activations flow only forward, never backward.
 And this lecture is about “densely connected layers,” also called “fully connected layers.”
 We can also say that a neural network is a combination of perceptrons, which is why it is also called a “multilayer perceptron.”
 In this lecture, we’d like to take another approach to examining this structure, an approach that often gives people a misleading picture of machine learning.
Something Like Biology : The Structure of Neurons
 When the electrical potential of a neuron reaches a certain level, it emits an electrical pulse.
 Each neuron receives electrical pulses from other neurons.
 The sensitivity of each neuron is determined by its synapses.
Structure of a Unit of a Neural Network : Mimicking the Brain
 When the electrical potential of the neuron reaches a certain level, it emits the next pulse.
 This is like the on/off of a switch.
 The sigmoid function behaves similarly.
Overview of the Architecture of Densely Connected Layers
Just repeat it
Overview of the Architecture of Densely Connected Layers
And repeat it
That’s all
Classifying MNIST Dataset with Densely Connected
Layers : “Hello World” of Machine Learning
Black and white images
of 28*28 = 784 pixels
 Some people say this is the “Hello, world!” of machine learning.
 You can classify the MNIST dataset with densely connected layers.
“Hello World!” of Machine Learning
[Figure: a 28*28 image of the handwritten digit ‘5’ is flattened into a 784-d vector of pixel values (e.g. 1.0, 0.2, 0.3, …), passed through a 16-d hidden vector, and mapped to a 10-d output vector of class probabilities (e.g. 3%, …, 83%, …, 5%).]
Naive Image Classification with Densely Connected Layers
[Figure: a plot of the errors during training.]
You can achieve about 90% accuracy with densely connected layers.
(I used Keras, one of the major deep learning libraries. A minimal sketch of such a model follows below.)
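The slides don’t show the code itself, so here is a minimal Keras sketch of the 784 → 16 → 10 network described above; the hidden size follows the “16-d vector” in the figure, while the activations, optimizer, batch size, and epoch count are my own assumptions:

```python
# Minimal sketch of a dense 784 -> 16 -> 10 MNIST classifier (assumed setup).
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0   # flattening
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

model = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(16, activation="sigmoid"),   # 16-d hidden vector
    keras.layers.Dense(10, activation="softmax"),   # 10-d output vector
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=32)
print(model.evaluate(x_test, y_test))   # roughly 90% accuracy is typical
```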
Please open your browser and search for “machine learning” with an image search engine.
We’ve looked at analogies between neural networks and brain neurons.
[Figure: image search results on Google and Bing, and the DATANOMIQ official website.]
 This, seemingly, is the image of machine learning in the media.
 But please keep in mind that neural networks are NOT models of the brain.
 A neural network is nothing but a mapping from input vectors or tensors to output vectors or tensors.
Let’s Go More Mathematically
[Figure: a densely connected network. Each unit computes a weighted sum Σ followed by an activation function ℎ(⋅). The input layer is the No. 0 layer, the hidden layers are the No. 1 ~ L-1 layers, and the output layer is the No. L layer, whose output is compared with a supervising vector.]
Let’s Go More Mathematically :
Neural Network is just a mapping
[Figure: the same network of weighted sums Σ and activations ℎ(⋅), viewed as a single mapping from the input vector to the output vector.]
Calculations of Neural Networks Are Divided into Two Parts
 Forward propagation : calculating from the input layer to the output layer, activating each neuron.
 Back propagation : calculating from the output layer back to the input layer, updating the parameters.
In short, forward propagation computes the output of the mapping, and back propagation computes how to update its parameters.
Forward propagation
[Figure: a unit receiving inputs from the units of the previous layer, plus a bias input labeled 1, through a weighted sum Σ.]
You can generalize the relation between any pair of units in adjacent layers this way.
Forward propagation : Let’s calculate concretely.
[Figure: the No. j unit of the (l+1)th layer and its connections to the units of the lth layer.]
Please pay attention to the No. j neuron in the (l+1)th layer.
Forward propagation : Let’s calculate concretely.
[Figure: the same pair of layers, with the weights, bias, and activation of the No. j unit labeled.]
Assume that the No. j unit receives the weighted sum of the previous layer’s activations plus a bias; then its own activation is the activation function ℎ(⋅) applied to that sum, as reconstructed below.
*Keep in mind that the input labeled 1 in the figure lets the bias be treated as just another weight.
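The forward-propagation equations on this slide were images; a reconstruction under assumed notation (z for activations, w for weights, b for biases, h for the activation function, superscripts for layer indices):

\[
a_j^{(l+1)} = \sum_{i} w_{ji}^{(l+1)} z_i^{(l)} + b_j^{(l+1)}, \qquad
z_j^{(l+1)} = h\!\left(a_j^{(l+1)}\right)
\]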
Forward Propagation : Activation Functions in Hidden Layers
Let’s take a brief look at some activation functions.
 Sigmoid function
 Hyperbolic tangent
 ReLU function
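For reference, the standard definitions of the three functions listed above (the formulas are the usual textbook definitions, not taken from the slides):

\[
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad
\mathrm{ReLU}(x) = \max(0, x)
\]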
Forward Propagation at the Last Layer : Regression
 In the case of a regression problem, the activation function in the last layer is usually the identity mapping.
 I mean, you do nothing.
Forward Propagation at the Last Layer : Classification
 In the case of a multiclass classification problem, the activation function in the last layer is usually a softmax function. A softmax function is defined as below.
*Note that each output lies between 0 and 1, and the sum in the denominator runs over the number of classes.
 The outputs of the last layer sum to 1, so the softmax function is useful for turning the output into probabilities.
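The definition itself was lost in the export; the standard softmax, with a_k the k-th pre-activation and K the number of classes (symbols are mine), is:

\[
y_k = \frac{\exp(a_k)}{\sum_{k'=1}^{K} \exp(a_{k'})}, \qquad
0 \le y_k \le 1, \qquad \sum_{k=1}^{K} y_k = 1
\]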
Forward Propagation
[Figure: the same MNIST diagram — a flattened 784-d vector of pixel values is mapped through a 16-d hidden vector to a 10-d vector of class probabilities for the digit ‘5’.]
In the case of the handwritten digit classification problem, the densely connected layers map a flattened image to class probabilities.
Then, how can we get the parameters of such a useful function?
Mathematical General Outline of Supervised Learning
(When You Use Normal Gradient Descent)
 Set an error function whose variables are the parameters.
 Optimize the parameters so that they minimize the error function.
 That means applying gradient descent to the loss function with respect to the parameters.
Most importantly, calculating the parameters is what supervised learning is all about. This is an outline of supervised learning using gradient descent.
What are Error Functions in Supervised Learning?
 Assume that you have an output vector and a supervising vector.
 We want the output vector to be close to the supervising vector.
 In short, we want to set a loss function that gets smaller as the output gets closer to the supervising vector.
*Be careful that, for now, we’re going to consider only one data point.
*Note that ‘n’ is the index of the data sample.
Error Function : Square Error
 We use a square error as the loss function for regression problems.
 In a regression problem, we simply want the output to be close to the target.
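The formula was an image in the original slide; a standard reconstruction, with \(\mathbf{y}_n\) the output vector and \(\mathbf{t}_n\) the supervising vector of the n-th sample (the factor 1/2 is a common convention, not necessarily the author’s), is:

\[
E_n = \frac{1}{2}\left\lVert \mathbf{y}_n - \mathbf{t}_n \right\rVert^2
    = \frac{1}{2}\sum_{k} \left( y_{nk} - t_{nk} \right)^2
\]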
Error Function : Cross Entropy
 We use cross entropy as the loss function of a neural network for classification problems.
 The logarithm of each output is always negative (outputs lie between 0 and 1), and each target value is an element of a one-hot encoding.
*Note that ‘n’ is the index of the data sample.
Question : What will the cross entropy be like if the classification is more correct?
The cross entropy will be smaller.
Hint :
[Figure: a graph of y = log(x).]
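The cross entropy formula itself was also lost; the standard definition for a one-hot target, in the same assumed notation, is:

\[
E_n = - \sum_{k=1}^{K} t_{nk} \log y_{nk}
\]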
Gradient Descent : Gradient
 Let the error function be a function of D variables (the parameters).
 Its gradient, denoted with the nabla symbol, is defined as the vector of partial derivatives shown on the right side.
 The gradient of a function is “the direction that maximizes the positive change of the function.”
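A reconstruction of the definition referred to above, for an error function E of D parameters \(w_1, \dots, w_D\) (notation assumed):

\[
\nabla E(\mathbf{w}) =
\left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \dots, \frac{\partial E}{\partial w_D} \right)^{\mathsf T}
\]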
Gradient Descent : In Case of 2 Variables
The most efficient way to descend a slide (maybe).
Let’s look at an example of gradient descent in a 3-dimensional space.
If you calculate the partial derivatives along each axis, you can maximize this change.
Gradient Descent : In Case of 2 Variables
 So much for stupid jokes. Let’s use gradient descent for more practical stuff.
 Kids are all right without gradient descent.
Supervised Learning : Simple Example
A simple example is linear regression. In this case the error function J is the mean square error.
Suppose that we have a dataset, and we want to fit a straight line to the data.
The error function is then defined as below.
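The dataset, the fitted line, and the error function appeared as images; a standard reconstruction (the symbols a, b and the factor 1/2N are my own choices):

\[
\{(x_n, y_n)\}_{n=1}^{N}, \qquad f(x) = a x + b, \qquad
J(a, b) = \frac{1}{2N} \sum_{n=1}^{N} \bigl( y_n - f(x_n) \bigr)^2
\]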
Back Propagation
 We want to calculate the slope and intercept that minimize the error function J.
 The graph of the error function is a simple bowl, and it’s easy to find the minimum point in this case.
 Also, there are formulas to calculate them directly.
Back Propagation
 In this lecture, let’s see how to apply gradient descent to this linear regression problem.
 In short, you just need to repeat the update below.
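The update referred to above, reconstructed with a learning rate \(\eta\) (a small positive constant; the symbol is mine):

\[
a \leftarrow a - \eta \frac{\partial J}{\partial a}, \qquad
b \leftarrow b - \eta \frac{\partial J}{\partial b}
\]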
If You Use Gradient Descent
An image of gradient descent : a ball rolling down the surface of the error function from a start point.
Gradient Descent
 If you calculate the gradient at a point on the error surface and move slightly in that direction, you reach a higher point on the surface.
 In reverse, if you move against the gradient at each step, you reach a lower point on the surface. This is gradient descent.
 For supervised learning, you apply gradient descent to an error function, which is a function of the parameters.
 And you can apply gradient descent to functions of more than 2 variables. But of course you can’t visualize it anymore.
Gradient Descent for Neural Network
 In the case of simple linear regression, we considered only 2 parameters.
 Question : How many parameters does the “Hello world!” neural network on the right side have?
[Figure: the 784-d input vector, 16-d hidden vector, and 10-d output vector of the MNIST network — 12730 parameters in total.]
 Answer : 784*16 + 16 + 16*10 + 10 = 12730
 …..At least it’s more than 2 parameters, seemingly.
Super Simple Example of Densely Connected Layers
[Figure: a small network — a No. 0 (input) layer, a No. 1 (hidden) layer with a sigmoid activation, and a No. 2 (output) layer with a softmax activation, whose output is compared with a supervising vector.]
 Before the digit classification problem, we’ll consider this simple neural network and its toy implementation.
 And we’ll think about the following simple classification problem.
Super Simple Example of Densely Connected Layers
 Data points are generated as three clusters.
 Data points within each cluster share the same mean and the same variance.
[Figure: Training — 100 data points used for training the neural network; Classifying — 100 data points used for testing the neural network.]
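The slides don’t show how the toy data was generated; a minimal sketch under assumed cluster centers and spread (all the specific numbers below are my own choices, not the author’s):

```python
# Hypothetical three-cluster toy data: roughly 100 training and 100 test points,
# drawn from three Gaussians with equal per-cluster variance.
import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])  # assumed centers

def make_clusters(n_per_class, std=0.5):
    xs, ys = [], []
    for label, c in enumerate(centers):
        xs.append(c + std * rng.standard_normal((n_per_class, 2)))
        ys.append(np.full(n_per_class, label))
    return np.concatenate(xs), np.concatenate(ys)

x_train, y_train = make_clusters(34)   # ~100 training points
x_test, y_test = make_clusters(34)     # ~100 test points
```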
Naively Calculating Approximations of Partial Derivatives
Suppose that the error function is a function of N parameters, and we want to calculate the partial derivative at a given point with respect to each parameter (i = 1, 2, … , N).
*Note that the coordinates of that point are not variables. They are constants.
Naively Calculating Approximations of Partial Derivatives
….Sorry, I’ve written it in a slightly snobbish way. Let’s look at a simple example.
You can get an approximation of the partial derivative of a function at a point as below.
Naively Calculating Approximations of Partial Derivatives : A Simple Example
Suppose a concrete function of two variables and a concrete point.
You can calculate an approximation of each partial derivative at that point by nudging one variable by a small amount while keeping the others fixed, as sketched below.
*The coordinates of the point are constants.
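A sketch of the naive finite-difference approximation described on these slides; the example function, the step size eps, and the use of central (rather than one-sided) differences are my own choices:

```python
# Naive approximation of each partial derivative by a finite difference:
# df/dw_i  ~  (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps)
import numpy as np

def numerical_gradient(f, w, eps=1e-4):
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)  # two evaluations per parameter
    return grad

# Example with a simple two-variable function:
f = lambda w: w[0] ** 2 + 3.0 * w[0] * w[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))  # approx. [8., 3.]
```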
The number of parameters
 In the case of this simple neural network, it has 2*2 + 2 + 2*3 + 3 = 15 parameters.
 To train this neural network on 100 training data points by naively calculating approximations of the partial derivatives, it needed around 250 sec.
Back Propagation : More Sophisticated Partial Derivatives
 Again, in the “Hello World!” densely connected layers, you have to calculate partial derivatives with respect to 12730 parameters.
[Figure: the 784-d input vector, 16-d hidden vector, and 10-d output vector — 12730 parameters.]
 We usually use a method called back propagation to calculate them.
Chain Rule : Warming Up for Back Propagation
 Chain rules are essential for back propagation algorithms.
*In these slides I avoid a mathematically precise discussion.
 Let one quantity be a function of a variable, and let that variable be a function of another variable.
 Then the derivative of the quantity with respect to the last variable is calculated as below.
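A reconstruction of the single-variable chain rule the slide shows, writing the quantities as z(y) and y(x) (the names are mine):

\[
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
\]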
Chain Rule : 2 Variables
 Let a quantity be a function of two intermediate variables, and let each of those intermediate variables be a function of two further variables.
 Then the derivative of the quantity with respect to either of the further variables is calculated the same way, summing over the intermediate variables.
Chain Rule : In General
 Let a quantity be a function of n variables, and let those n variables be functions of m further variables.
 Then the derivative of the quantity with respect to each of the m variables is calculated as below.
Brief Review of the Chain Rule for Back Propagation
 This generalized chain rule is super important for back propagation.
 For simplicity, let’s write the composed function compactly.
 Again, the partial derivative of the quantity with respect to each of those variables is given by the generalized chain rule (reconstructed below).
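A reconstruction of the generalized chain rule these slides refer to, for \(z = z(y_1, \dots, y_n)\) with each \(y_i\) a function of \(x_1, \dots, x_m\) (the names are mine):

\[
\frac{\partial z}{\partial x_j} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x_j}
\qquad (j = 1, \dots, m)
\]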
Back Propagation
[Figure: the same multilayer network of weighted sums Σ and activations ℎ(⋅).]
Back propagation is an efficient way to calculate the partial derivatives of an error function with respect to each parameter.
 You need the error of each neuron to calculate those partial derivatives.
*The error of a layer is defined as the partial derivative of the error function with respect to that layer’s weighted inputs.
Back Propagation
[Figure: the same network, with errors propagating backward from the output layer.]
To calculate the error of the No. 1 layer, you need the error of the No. 2 layer.
To calculate the error of the No. 2 layer, you need the error of the No. 3 layer.
…
To calculate the error of the No. L-1 layer, you need the error of the last layer.
Back Propagation
[Figure: one unit and its incoming weights between two adjacent layers, plus a bias input labeled 1.]
 Let’s think about updating the parameters so that they decrease an error function.
 We want to update the parameters using gradient descent.
*Be careful that this error function is calculated with only one data point.
Back Propagation
[Figure: the same pair of layers, focusing on one weight.]
 Using the chain rule, we can split the partial derivative of the error function with respect to a weight into two factors.
 Defining the error of a unit as the partial derivative of the error function with respect to that unit’s weighted input, the derivative with respect to the weight is that error times the activation flowing into the weight.
* Pay attention to which quantities the error function can be regarded as a function of.
Back Propagation
[Figure: the same pair of layers.]
 You can calculate the derivatives with respect to the other parameters, such as the bias, in the same way.
 Next, let’s calculate the error of a unit itself.
* Pay attention to the difference from the last time we applied the chain rule. You have to consider which quantity is a function of what.
Back Propagation
[Figure: the same pair of layers.]
 Applying the chain rule once more gives a recursion for the errors, as reconstructed below.
 This equation shows that to calculate an error in a layer, you need all the errors in the higher layer.
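The equations on these derivation slides did not survive the export. A standard reconstruction of what they show, with δ the per-unit error and a, z, w, b the weighted inputs, activations, weights, and biases from the forward-propagation sketch (the notation beyond that is mine):

\[
\delta_j^{(l)} \equiv \frac{\partial E_n}{\partial a_j^{(l)}}, \qquad
\frac{\partial E_n}{\partial w_{ji}^{(l)}} = \delta_j^{(l)}\, z_i^{(l-1)}, \qquad
\frac{\partial E_n}{\partial b_j^{(l)}} = \delta_j^{(l)},
\]
\[
\delta_j^{(l)} = h'\!\left(a_j^{(l)}\right) \sum_{k} w_{kj}^{(l+1)}\, \delta_k^{(l+1)}
\]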
Back Propagation :
Let’s Make the Most of Linear Algebra
Collecting the weights of a layer into a matrix and the errors and activations into vectors, the error recursion and the parameter derivatives can be written as matrix-vector products, so a whole layer can be handled at once.
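In matrix-vector form (again a reconstruction; ⊙ denotes the element-wise product):

\[
\boldsymbol{\delta}^{(l)} = \left( W^{(l+1)} \right)^{\mathsf T} \boldsymbol{\delta}^{(l+1)} \odot h'\!\left( \mathbf{a}^{(l)} \right), \qquad
\frac{\partial E_n}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} \left( \mathbf{z}^{(l-1)} \right)^{\mathsf T}
\]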
The number of parameters
 To train this neural network on the same 100 training data points, using back propagation, it needed around 31 sec. (versus around 250 sec. for the naive numerical approximation).
Super Simple Example of Densely Connected Layers
Accuracy : 70.8 %
There Are Many Other Things to Think About
 Other activation functions
 Batch learning vs. mini-batch learning
 Applying other types of optimization
 How to initialize the weights
 Regularization of the data
 Dropout
….You’ll soon realize that machine learning is largely a matter of choosing those hyperparameters : the parameters you don’t train through back propagation.
Super Simple Example of Densely Connected Layers
Sigmoid function → ReLU function
Accuracy : 73.2 %
Training a Neural Network : Stochastic Gradient Descent
 We have been thinking about a loss function calculated with only one data point.
 It is easy to imagine that the total sum of these per-sample losses is more useful for evaluating how well the neural network fits all the data. This is batch learning.
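In symbols (a reconstruction; \(E_n\) is the per-sample loss from the earlier slides and N the number of data points):

\[
E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w})
\]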
Training a Neural Network : Stochastic Gradient Descent
 Question : In practice, you don’t use the total loss over the whole dataset as the loss function for training a neural network. Why?
 Computationally expensive : usually you have a huge number of data points.
 Data is redundant : many data points are similar, so even if you reduce the data at random, the reduced dataset is still useful to some extent.
 The gradient calculated with all the data is NOT noisy.
[Figure: a scatter plot of x against y, before and after random data reduction.]
Training a Neural Network : Stochastic Gradient Descent
 Let’s think about the third reason on the last slide: “The gradient calculated with all the data is NOT noisy.”
 Assume that the graph on the right side is a loss function calculated with the whole dataset.
 If you apply gradient descent from a start point using this total loss, the point will probably shift smoothly along the surface of the graph.
[Figure: a loss surface with a start point and a minimum point.]
Training a Neural Network : Stochastic Gradient Descent
[Figure: the same loss surface, with a trajectory from the start point getting stuck before the minimum point.]
 In fact, a smooth and exact track of gradient descent, like a rolling ball, is not necessarily good.
 Because, depending on how you set the start point, it can get stuck at a local minimum.
Training a Neural Network : Stochastic Gradient Descent
 Question : What would happen if you calculated the partial derivatives using a loss computed from only one data point?
 The track of gradient descent would be a zigzag path, but it would still head in the direction of the minimum point.
[Figure: a noisy, zigzag trajectory from the start point towards the minimum point.]
Gradient Descent : Batch Learning
[Figure: in each epoch, the (shuffled) dataset is used as a whole to calculate one error function, and the parameters are updated once per epoch.]
Gradient Descent : Stochastic Gradient Descent (SGD)
[Figure: the dataset is shuffled; then, for each data point in turn, an error function is calculated and the parameters are updated. One pass over all the data points is one epoch.]
Training a Neural Network : A Pseudo Code of SGD
You need the partial derivatives with respect to each parameter to apply gradient descent, and the inner part of the pseudo code calculates those derivatives using back propagation.
This algorithm updates the parameters for each data point; a sketch follows below.
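The pseudo code itself did not survive the export; a minimal Python sketch of the per-sample SGD loop it describes (the backprop() function, learning rate, and epoch count are hypothetical placeholders):

```python
# Hypothetical per-sample SGD loop: shuffle the data, then update the
# parameters once per data point using gradients from back propagation.
import numpy as np

def sgd(params, x_train, y_train, backprop, lr=0.1, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x_train)
    for _ in range(epochs):
        for i in rng.permutation(n):                 # shuffle the dataset
            grads = backprop(params, x_train[i], y_train[i])  # partial derivatives
            for key in params:                       # update every parameter
                params[key] -= lr * grads[key]
    return params
```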
Gradient Descent : Mini-Batch Learning
[Figure: in each epoch, the dataset is shuffled and divided into mini-batches; for each mini-batch, the error functions in the batch are calculated and the parameters are updated.]
Training a Neural Network : Mini-Batch Gradient Descent
Just as with plain SGD, you calculate the partial derivatives with back propagation, but you apply gradient descent using the average of the partial derivatives over each batch.
This algorithm updates the parameters once per mini-batch; a sketch follows below.
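A matching sketch of the mini-batch variant, averaging the per-sample gradients over each batch (backprop() and the batch size are again hypothetical):

```python
# Hypothetical mini-batch gradient descent: update once per batch, using the
# average of the per-sample gradients within the batch.
import numpy as np

def minibatch_gd(params, x_train, y_train, backprop,
                 lr=0.1, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x_train)
    for _ in range(epochs):
        order = rng.permutation(n)                    # shuffle the dataset
        for start in range(0, n, batch_size):         # divide into mini-batches
            batch = order[start:start + batch_size]
            grads = {k: np.zeros_like(v) for k, v in params.items()}
            for i in batch:                           # per-sample back propagation
                g = backprop(params, x_train[i], y_train[i])
                for k in params:
                    grads[k] += g[k]
            for k in params:                          # average, then update
                params[k] -= lr * grads[k] / len(batch)
    return params
```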