An introductory, illustrative, yet precise slide deck on the mathematics of neural networks (densely connected layers).
Please download it and view the animations in PowerPoint.
*This slide deck is not finished yet. If you like it, please give me some feedback to motivate me.
I made this slide deck as an intern at DATANOMIQ GmbH.
URL: https://www.datanomiq.de/
2. Terminology
You're going to learn about "feedforward neural networks," which don't go backward while activating.
This lecture is about "densely connected layers," also called "fully connected layers."
We can also say that a neural network is a combination of perceptrons, so it's also called a "multilayer perceptron."
[Figure: a densely connected network with input neurons 1, 2, …, N]
In this lecture, we'd like to take another approach to examining this structure, because the usual presentation tends to give people misconceptions about machine learning.
3. Something Like Biology: The Structure of Neurons
When the electrical potential of a neuron reaches a certain level, it emits an electrical pulse.
Each neuron receives electrical pulses from other neurons.
The sensitivity of each neuron is determined by its synapses.
4. Structure of a Unit of a Neural Network: Mimicking the Brain
[Figure: a unit taking inputs 1, 2, …, N, computing a weighted sum, and applying an activation h(⋅)]
When the electrical potential of the neuron reaches a certain level, it emits the next pulse.
This is like the on/off of a switch, and the sigmoid function behaves similarly.
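To make this concrete, here is a minimal sketch of one unit in Python with NumPy (the input, weight, and bias values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # A smooth on/off switch: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def unit(x, w, b):
    # One unit: a weighted sum of the incoming "pulses" (the weights play
    # the role of synapse sensitivities), followed by the activation h(.).
    return sigmoid(np.dot(w, x) + b)

# Example with N = 3 inputs.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, -0.5])
print(unit(x, w, b=0.1))  # a value between 0 and 1
```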
6. Overview of the Architecture of Densely Connected Layers
Stack such units into layers, and repeat the same computation layer by layer. That's all.
7. Classifying the MNIST Dataset with Densely Connected Layers: the "Hello World" of Machine Learning
Black-and-white images of 28 × 28 = 784 pixels.
Some people say this is the "Hello, world." of machine learning.
You can classify the MNIST dataset with densely connected layers.
9. Naive Image Classification with Densely Connected Layers
[Figure: examples of classification errors]
You can achieve about 90% accuracy with densely connected layers.
(I used Keras, one of the deep learning libraries.)
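The slides don't show the code, but a dense MNIST classifier in Keras might look roughly like the sketch below; the 784-16-10 layer sizes follow the "Hello world" network discussed later in this deck, while the optimizer, epochs, and batch size are my own assumptions.

```python
# A minimal sketch using tf.keras; not the author's actual script.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="sigmoid", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=32)
print(model.evaluate(x_test, y_test))
```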
10. Please open your browser and search for "machine learning" in an image search.
We've looked at analogies between neural networks and brain neurons.
[Figure: image-search results on Google and Bing, and the DATANOMIQ official website]
11. Seemingly, this is the image of machine learning in the media.
But please keep in mind that neural networks are NOT models of the brain.
A neural network is nothing but a mapping from input vectors or tensors to output vectors or tensors.
12. Let’s Go More Mathematically
Input layer Hidden layer Output layer
Σ
Σ
Σ
ℎ(⋅)
ℎ(⋅)
ℎ(⋅)
⋮ ⋮ ⋮
Σ
Σ
Σ
ℎ(⋅)
ℎ(⋅)
ℎ(⋅)
⋮ ⋮
⋯
⋯
⋯
Activation function
No. 0 layer No. 1 ~ L-1 layer No. L layer
Supervising
vector
Activation function
13. Let’s Go More Mathematically :
Neural Network is just a mapping
Σ
Σ
Σ
ℎ(⋅)
ℎ(⋅)
ℎ(⋅)
⋮ ⋮ ⋮
Σ
Σ
Σ
⋮
⋯
⋯
⋯
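As a sketch of that "nothing but a mapping" view, the whole forward computation fits in a few lines of NumPy (the layer sizes and random parameters here are arbitrary, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # The network is a composition of affine maps and activations:
    # a(l) = h(W(l) @ a(l-1) + b(l)), with a(0) = x.
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3))]  # a tiny 2-3-2 net
biases = [np.zeros(3), np.zeros(2)]
print(forward(np.array([1.0, -1.0]), weights, biases))
```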
14. The Calculation of a Neural Network Is Divided into Two Parts
Forward propagation: calculating from the input layer to the output layer, activating each neuron.
Back propagation: calculating from the output layer to the input layer, updating the parameters.
In short: calculating the outputs, then the gradients used to update the parameters.
18. Forward Propagation: Activation Functions in Hidden Layers
Let's take a brief look at some activation functions:
Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
Hyperbolic tangent: $\tanh(x)$
ReLU function: $\max(0, x)$
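The plots are not reproduced here, but as a small sketch the three functions can be written directly in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # output in (0, 1)

def tanh(z):
    return np.tanh(z)                # output in (-1, 1), zero-centered

def relu(z):
    return np.maximum(0.0, z)        # passes positives, zeroes out negatives

z = np.linspace(-4.0, 4.0, 9)
for h in (sigmoid, tanh, relu):
    print(h.__name__, np.round(h(z), 3))
```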
19. Forward Propagation at the Last Layer: Regression
In the case of a regression problem, the activation function in the last layer is usually the identity mapping.
That is, you do nothing.
20. Forward Propagation at the Last Layer: Classification
In the case of a multiclass classification problem, the activation function in the last layer is usually a softmax function. A softmax function is defined as $y_k = \frac{\exp(a_k)}{\sum_{k'=1}^{K} \exp(a_{k'})}$.
*Note that $0 < y_k < 1$, $\sum_{k=1}^{K} y_k = 1$, and $K$ is the number of classes.
The sum of the neurons in the last layer is 1, so the softmax function is useful for turning the output into probabilities.
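A minimal NumPy sketch of the softmax; subtracting the maximum is a common numerical-stability trick I've added (softmax is shift-invariant, so it doesn't change the result):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))  # shift for numerical stability
    return e / np.sum(e)

y = softmax(np.array([2.0, 1.0, 0.1]))
print(y, y.sum())  # each component in (0, 1); the sum is 1.0
```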
22. Mathematical General Outline of Supervised Learning (When You Use Normal Gradient Descent)
Set an error function whose variables are the parameters.
Optimize the parameters so that they minimize the error function.
That means applying gradient descent to the loss function with respect to the parameters.
Most importantly, calculating the parameters is what supervised learning is all about. This is an outline of supervised learning using gradient descent.
23. What Are Error Functions in Supervised Learning?
Assume that you have an output vector $\hat{\mathbf{y}}_n$ and a supervising vector $\mathbf{y}_n$.
*Note that $n$ is the index of the data sample.
We want $\hat{\mathbf{y}}_n$ to be close to $\mathbf{y}_n$.
In short, we want to set a loss function $J_n$ that gets smaller when $\hat{\mathbf{y}}_n$ is closer to $\mathbf{y}_n$.
*Be careful: for now we're going to consider only one data point.
24. Error Function: Square Error
We use the square error as a loss function for regression problems: $J_n = \frac{1}{2} \lVert \hat{\mathbf{y}}_n - \mathbf{y}_n \rVert^2$.
In a regression problem, we simply want $\hat{\mathbf{y}}_n$ to be close to $\mathbf{y}_n$.
25. Error Function: Cross Entropy
We use cross entropy as the loss function of a neural network for classification problems: $J_n = -\sum_{k=1}^{K} y_{nk} \log \hat{y}_{nk}$.
*Note that $n$ is the index of the data sample, $\log \hat{y}_{nk}$ is always negative (since $0 < \hat{y}_{nk} < 1$), and $y_{nk}$ is an element of a one-hot encoding.
Question: What will the cross entropy be like if the classification is more correct?
The cross entropy will be smaller.
Hint: look at a graph of $y = \log(x)$.
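A small sketch to confirm the answer numerically (the probability vectors are made up; the small eps guards against log(0) and is an implementation detail, not part of the definition):

```python
import numpy as np

def cross_entropy(y_hat, y, eps=1e-12):
    # y is one-hot, so only the term for the correct class survives the sum.
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([0.0, 1.0, 0.0])                        # the true class is 1
print(cross_entropy(np.array([0.1, 0.8, 0.1]), y))   # more correct -> smaller
print(cross_entropy(np.array([0.7, 0.2, 0.1]), y))   # more wrong   -> larger
```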
26. Gradient Descent: the Gradient
Let $J$ be a function of $D$ variables $\theta_1, \dots, \theta_D$.
The gradient of $J$, denoted $\nabla J$, is defined as $\nabla J = \left( \frac{\partial J}{\partial \theta_1}, \dots, \frac{\partial J}{\partial \theta_D} \right)^{\top}$.
The gradient of a function points in the direction that maximizes the positive change of $J$.
27. Gradient Descent: the Case of 2 Variables
Let's look at an example of gradient descent in a 3-dimensional space.
If you calculate the partial derivatives along each axis, you can maximize this change.
[Figure: the most efficient way to descend a slide (maybe)]
28. Gradient Descent: the Case of 2 Variables
So much for stupid jokes; kids are all right without gradient descent.
Let's use gradient descent for more practical stuff.
29. Supervised Learning: a Simple Example
A simple example is linear regression. In this case the error function $J$ is the mean square error.
Suppose that we have a dataset $\{(x_n, y_n)\}_{n=1}^{N}$, and we want to fit the function $f(x) = \theta_0 + \theta_1 x$ to the data.
The error function is defined as $J(\theta_0, \theta_1) = \frac{1}{2N} \sum_{n=1}^{N} \big( f(x_n) - y_n \big)^2$.
30. Back Propagation
We want to calculate the $\theta_0, \theta_1$ that minimize the error function $J$.
The graph of the error function $J(\theta_0, \theta_1)$ is a bowl-shaped surface, and it's easy to find the minimum point in this case.
Also, there are closed-form formulas to calculate them.
31. Back Propagation
In this lecture, let's see how to apply gradient descent to this linear regression problem.
In short, you just need to repeat the update $\theta_i \leftarrow \theta_i - \eta \frac{\partial J}{\partial \theta_i}$, as in the sketch below.
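A minimal sketch of that loop in NumPy, on synthetic data of my own making (the true line y = 2x + 0.5 plus noise), so you can watch the two parameters converge:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=100)  # synthetic data

theta0, theta1, eta = 0.0, 0.0, 0.1
for _ in range(500):
    y_hat = theta0 + theta1 * x
    # Partial derivatives of J = (1/2N) sum (f(x_n) - y_n)^2
    grad0 = np.mean(y_hat - y)
    grad1 = np.mean((y_hat - y) * x)
    theta0 -= eta * grad0  # theta <- theta - eta * dJ/dtheta
    theta1 -= eta * grad1
print(theta0, theta1)  # should approach 0.5 and 2.0
```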
32. If You Use Gradient Descent
An image of gradient descent: a ball rolling down the surface of the error function from a start point.
33. Gradient Descent
If you calculate $\nabla J$ at a point and move slightly in that direction, you approach a higher point on $J$.
In reverse, if you move by $-\eta \nabla J$ at each step, you approach a lower point on $J$. This is gradient descent.
For supervised learning, you apply gradient descent to an error function $J$, which is a function of the parameters.
You can apply gradient descent to more than 2 variables as well, but of course you can't visualize it anymore.
34. Gradient Descent for a Neural Network
In the case of simple linear regression, we considered only 2 parameters.
Question: How many parameters does the "Hello world!" neural network on the right side have?
[Figure: a dense network mapping a 784-d vector to a 16-d vector to a 10-d vector]
Answer: 784·16 + 16 + 16·10 + 10 = 12,730 parameters.
…At least it's more than 2 parameters, seemingly.
35. A Super Simple Example of Densely Connected Layers
[Figure: a tiny network with a No. 0 layer (2 inputs), a No. 1 layer (2 units with sigmoid activations), and a No. 2 layer (3 units with a softmax), compared against a supervising vector]
Before the digit classification problem, we'll consider this simple neural network and a toy implementation of it, applied to the following simple classification problem.
36. A Super Simple Example of Densely Connected Layers
The data points are generated as three clusters; the data in each cluster share the same mean and the same variance.
Training: 100 data points are used for training the neural network.
Classifying: another 100 data points are used for classification, i.e. testing the neural network.
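A sketch of how such a toy dataset could be generated (the cluster means and variance here are illustrative choices, not the values used in the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_clusters(n_per_cluster):
    # Three 2-d Gaussian clusters, one label per cluster.
    means = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])
    xs, labels = [], []
    for label, mu in enumerate(means):
        xs.append(rng.normal(loc=mu, scale=0.5, size=(n_per_cluster, 2)))
        labels.append(np.full(n_per_cluster, label))
    return np.vstack(xs), np.concatenate(labels)

x_train, t_train = make_clusters(33)  # ~100 points for training
x_test, t_test = make_clusters(33)    # ~100 points for testing
print(x_train.shape, t_train.shape)
```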
37. Naively Calculating Approximations of Partial Derivatives
Suppose that $J(\theta_1, \dots, \theta_N)$ is a function whose variables are $N$ parameters.
We want to calculate the partial derivative at a point $(\hat{\theta}_1, \dots, \hat{\theta}_N)$ with respect to $\theta_i$ $(i = 1, 2, \dots, N)$.
*Note that $\hat{\theta}_1, \dots, \hat{\theta}_N$ are not variables; they are constants.
38. Naively Calculating Approximations of Partial Derivatives
…Sorry, I've written it in a slightly snobbish way. Let's look at a simple example.
You can get an approximation of the partial derivative of $J$ at $(\hat{\theta}_1, \dots, \hat{\theta}_N)$ as
$\frac{\partial J}{\partial \theta_i} \approx \frac{J(\hat{\theta}_1, \dots, \hat{\theta}_i + \varepsilon, \dots, \hat{\theta}_N) - J(\hat{\theta}_1, \dots, \hat{\theta}_i, \dots, \hat{\theta}_N)}{\varepsilon}$ for a small $\varepsilon > 0$.
39. Naively Calculating Approximations of Partial Derivatives: a Simple Example
Suppose that $f(x, y)$ is a function of two variables, evaluated at a point $(\hat{x}, \hat{y})$, where $\hat{x}, \hat{y}$ are constants.
You can calculate an approximation of the partial derivative of $f$ with respect to $x$ at $(\hat{x}, \hat{y})$ as $\frac{\partial f}{\partial x} \approx \frac{f(\hat{x} + \varepsilon, \hat{y}) - f(\hat{x}, \hat{y})}{\varepsilon}$.
40. Naively Calculating Approximations of Partial Derivatives: a Simple Example
In the same way, you can calculate an approximation of the partial derivative of $f$ with respect to $y$ at $(\hat{x}, \hat{y})$ as $\frac{\partial f}{\partial y} \approx \frac{f(\hat{x}, \hat{y} + \varepsilon) - f(\hat{x}, \hat{y})}{\varepsilon}$.
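A sketch of this naive scheme in NumPy; note that one gradient needs N + 1 evaluations of the function, which is why it becomes slow for many parameters (the example function is mine, chosen so the exact answer is easy to check):

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-4):
    # Nudge one parameter at a time:
    # dJ/dtheta_i ~ (J(theta + eps * e_i) - J(theta)) / eps
    base = f(theta)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        nudged = theta.copy()
        nudged[i] += eps
        grad[i] = (f(nudged) - base) / eps
    return grad

# f(x, y) = x^2 + x*y at (1, 2); the exact gradient is (2x + y, x) = (4, 1).
f = lambda t: t[0] ** 2 + t[0] * t[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))  # approximately [4, 1]
```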
41. The Number of Parameters
In the case of this simple neural network, it has 2·2 + 2 + 2·3 + 3 = 15 parameters.
Training this neural network on the 100 training data points, naively calculating approximations of the partial derivatives, took around 250 seconds.
42. Back Propagation: More Sophisticated Partial Derivatives
Again, in the "Hello World!" densely connected layers, you have to calculate the partial derivatives with respect to 12,730 parameters.
[Figure: the 784-d → 16-d → 10-d network with 12,730 parameters]
We usually use a method called back propagation to calculate them.
43. Chain Rule: Warming Up for Back Propagation
Chain rules are essential for back propagation algorithms.
*In this slide I avoid a mathematically precise discussion.
Let $z$ be a function of a variable $y$, and let the variable $y$ be a function of $x$.
Then the derivative of $z$ with respect to $x$ is calculated as $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$.
44. Chain Rule: 2 Variables
Let $z$ be a function of two variables $y_1, y_2$, and let the variables $y_1, y_2$ be functions of $x_1, x_2$.
Then the partial derivative of $z$ with respect to $x_1$ or $x_2$ is calculated as $\frac{\partial z}{\partial x_i} = \frac{\partial z}{\partial y_1} \frac{\partial y_1}{\partial x_i} + \frac{\partial z}{\partial y_2} \frac{\partial y_2}{\partial x_i}$.
45. Chain Rule: in General
Let $z$ be a function of $n$ variables $y_1, \dots, y_n$, and let the variables $y_j$ be functions of $m$ variables $x_1, \dots, x_m$.
Then the partial derivative of $z$ with respect to $x_i$ is calculated as $\frac{\partial z}{\partial x_i} = \sum_{j=1}^{n} \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$.
46. Brief Review of Chain Rules
This generalized chain rule is super important for back propagation.
For simplicity, let's denote the composed function as $z = z\big(y_1(x_1, \dots, x_m), \dots, y_n(x_1, \dots, x_m)\big)$.
Again, the partial derivative of $z$ with respect to $x_i$ is $\frac{\partial z}{\partial x_i} = \sum_{j=1}^{n} \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$.
47. Back Propagation
[Figure: the dense network again, layers of weighted sums Σ and activations h(⋅)]
Back propagation is an efficient way to calculate the partial derivatives of an error function with respect to each parameter.
You need the error of each neuron to calculate those partial derivatives.
*The error of layer $i$ is defined as $\boldsymbol{\delta}^{(i)} = \frac{\partial J_n}{\partial \mathbf{z}^{(i)}}$, where $\mathbf{z}^{(i)}$ is the vector of weighted sums in layer $i$.
48. Back Propagation
[Figure: the same network, with errors flowing backward]
To calculate the error of the No. 1 layer, you need the error of the No. 2 layer.
To calculate the error of the No. 2 layer, you need the error of the No. 3 layer.
…
To calculate the error of the No. L−1 layer, you need the error of the last layer.
49. Back Propagation
[Figure: one unit in a layer, connected to the layer before it]
Let's think about updating the parameters so that they decrease an error function $J_n$.
We want to update the parameters using gradient descent, so we need $\frac{\partial J_n}{\partial w}$ for each weight $w$.
*Be careful: $J_n$ is calculated with only one data point.
51. Back Propagation
[Figure: one unit in a layer, connected to the layer before it]
You can calculate the derivatives with respect to the biases as well.
Next, let's calculate the error of a hidden layer.
*Pay attention to the difference from the last equation where we applied the chain rule: you have to consider carefully which quantity is a variable of what.
52. Back Propagation
[Figure: one unit in layer l, connected to layer l+1]
Then, applying the chain rule to the error of layer $l$ (because each weighted sum in layer $l+1$ depends on the activations of layer $l$), we get
$\delta_j^{(l)} = h'\big(z_j^{(l)}\big) \sum_k w_{kj}^{(l+1)} \delta_k^{(l+1)}$.
This equation shows that to calculate an error in a layer, you need all the errors in the layer above.
53. Back Propagation: Let's Make the Most of Linear Algebra
Assume that $\boldsymbol{\delta}^{(l+1)}$ is the error vector of layer $l+1$. Then the element-wise equation above can be written as
$\boldsymbol{\delta}^{(l)} = h'\big(\mathbf{z}^{(l)}\big) \odot \big( W^{(l+1)\top} \boldsymbol{\delta}^{(l+1)} \big)$.
And keep in mind that $\frac{\partial J_n}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} \, \mathbf{a}^{(l-1)\top}$ and $\frac{\partial J_n}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$.
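Putting the vectorized equations together, here is a minimal NumPy sketch of one backward pass through the toy 2-2-3 network (sigmoid hidden layer, softmax output; with cross entropy, the output-layer error simplifies to ŷ − y, a standard fact):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def backward(x, y, W1, b1, W2, b2):
    # Forward pass, keeping intermediate values for the backward pass.
    a1 = sigmoid(W1 @ x + b1)
    y_hat = softmax(W2 @ a1 + b2)
    # Errors: delta(L) = y_hat - y for softmax + cross entropy;
    # delta(l) = h'(z(l)) * (W(l+1)^T delta(l+1)), and sigmoid' = a(1 - a).
    delta2 = y_hat - y
    delta1 = a1 * (1.0 - a1) * (W2.T @ delta2)
    # Gradients: dJn/dW(l) = delta(l) a(l-1)^T, dJn/db(l) = delta(l).
    return np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(3)
grads = backward(np.array([0.5, -0.5]), np.array([1.0, 0.0, 0.0]),
                 W1, b1, W2, b2)
print([g.shape for g in grads])
```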
54. The Number of Parameters
Training this same neural network on the 100 training data points, using back propagation, took around 31 seconds (versus around 250 seconds for the naive approximations).
56. There Are Many Other Things to Think About
Other activation functions
Batch learning vs. mini-batch learning
Applying other types of optimization
How to initialize the weights
Regularization of data
Dropout
…You'll soon realize that machine learning is largely a matter of considering those hyperparameters: the parameters you don't train through back propagation.
57. A Super Simple Example of Densely Connected Layers
[Figure: results of the toy network with a sigmoid function vs. a ReLU function as the hidden activation]
Accuracy: 73.2%
58. Training a Neural Network: Stochastic Gradient Descent
We have been thinking about a loss function $J_n$, which is calculated with only one data point.
It is easy to imagine that $J = \sum_{n=1}^{N} J_n$, the total sum of the $J_n$, is more useful for evaluating how well the neural network fits all the data.
Training with this total sum is called batch learning.
59. Training a Neural Network: Stochastic Gradient Descent
Question: In practice, you don't use $J$ as the loss function for training a neural network. Why?
Computationally expensive: usually you have a huge number of data points.
The data is redundant: many data points are similar, so even if you reduce the data at random, the reduced dataset is still useful to some extent.
The gradient calculated with all the data is NOT noisy.
[Figure: two scatter plots of (x, y) illustrating data reduction]
60. Training a Neural Network: Stochastic Gradient Descent
Let's think about the third reason on the last slide: "The gradient calculated with all the data is NOT noisy."
Assume that the graph on the right side is a loss function $J$, calculated with the whole dataset.
If you apply gradient descent from a start point using $J$ as the loss function, the point will probably shift smoothly along the surface of the graph toward the minimum point.
61. Training a Neural Network: Stochastic Gradient Descent
In fact, a smooth and exact track of gradient descent, like a rolling ball, is not necessarily good: depending on where the start point is set, it can get stuck in a local minimum.
62. Training a Neural Network: Stochastic Gradient Descent
Question: What would happen if you calculated the partial derivatives using $J_n$, which is calculated with only one data point?
The track of gradient descent would be a zigzag path, still heading in the direction of the minimum point.
64. Gradient Descent: Stochastic Gradient Descent (SGD)
Shuffle the dataset. Then, for each data point in turn: calculate the error function and update the parameters. One pass through all the data points is 1 epoch. Shuffle again and repeat for the next epoch.
65. Training a Neural Network: a Pseudo Code of SGD
You need the partial derivatives with respect to each parameter to apply gradient descent, and that part of the process is computed using back propagation.
This algorithm updates the parameters for each data point, as in the sketch below.
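The pseudo code itself is not reproduced in this transcript; a minimal sketch of the loop as described, where grad_fn stands for the back-propagation step (a placeholder name of mine), could look like this:

```python
import numpy as np

def sgd(params, data, targets, grad_fn, eta=0.1, epochs=10):
    # params: list of NumPy arrays; grad_fn returns one gradient per array.
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):              # shuffle the dataset
            grads = grad_fn(params, data[i], targets[i])  # back propagation
            for p, g in zip(params, grads):
                p -= eta * g   # update after every single data point
    return params
```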
66. Gradient Descent: Mini-Batch Learning
Shuffle the dataset and divide it into mini-batches. Then, for each mini-batch in turn: calculate the error functions in the batch and update the parameters. One pass through all the mini-batches is 1 epoch. Shuffle, divide again, and repeat for the next epoch.
67. Training a Neural Network: Mini-Batch Gradient Descent
Just as with plain SGD, you calculate the partial derivatives with back propagation, but you apply gradient descent using the average of the partial derivatives over each batch.
This algorithm updates the parameters once per batch, for the data points in that batch, as in the sketch below.
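And a matching sketch of the mini-batch variant, again with grad_fn as a placeholder for the back-propagation step; the only change from plain SGD is that each update uses the average gradient over a batch:

```python
import numpy as np

def minibatch_gd(params, data, targets, grad_fn,
                 eta=0.1, batch_size=32, epochs=10):
    rng = np.random.default_rng(0)
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)                 # shuffle the dataset
        for start in range(0, n, batch_size):      # divide into mini-batches
            batch = order[start:start + batch_size]
            sums = [np.zeros_like(p) for p in params]
            for i in batch:                        # accumulate per-point grads
                for s, g in zip(sums, grad_fn(params, data[i], targets[i])):
                    s += g
            for p, s in zip(params, sums):         # one update per batch,
                p -= eta * s / len(batch)          # using the average gradient
    return params
```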