Artificial Neural Networks , Recurrent networks , Perceptron's

ARTIFICIAL NEURAL
NETWORKS
UNIT - III
Department: AIML-A,D
Staff Name: Ms.L.Sasikala
SRM INSTITUTE OF SCIENCE
AND TECHNOLOGY, CHENNAI

Unit - III
Fundamentals on Learning and training samples
 Paradigms of Learning
 Using training samples
 Gradient Optimization Procedure
 Hebbian learning rule
Supervised learning network paradigms: The perceptron, back propagation and its
Variants
 The single-layer perceptron
 Linear Separability
 The multilayer perceptron
 Backpropagation of error
 Selecting learning rate
 Resilient Backpropagation
 Adaption of weights
 Further variations and extensions to Backpropagation
 Initial configuration of a multilayer perceptron

Fundamentals on Learning and training samples
Introduction to Learning:
 The primary significance for a neural network is the ability of the
network to learn from its environment and to improve its
performance through learning.
 The network becomes more knowledgeable about its environment
after each iteration of the learning process.
 The type of learning is determined by the way the parameter changes
take place.
 The learning process implies the following sequence of events:
 The neural network is stimulated by an environment.
 The neural network undergoes changes in its parameters as a result
of this stimulation.
 The neural network responds in a new way to the environment
because of the changes that have occurred in its internal structure.

Paradigms of learning:
 The most interesting characteristic of neural networks is their capability
to familiarize with problems by means of training and, after sufficient
training, to be able to solve unknown problems of the same class.
 This approach is referred to as generalization.
 Learning is a comprehensive term and learning system changes itself in
order to adapt to environmental changes.
 A neural network could learn by
• Developing new connections,
• Deleting existing connections,
• Changing connecting weights,
• Changing the threshold values of neurons,
• Varying one or more of the three neuron functions (activation
function, propagation function and output function),
• Developing new neurons,
• Deleting existing neurons.

 Our task is to find the weights that most accurately map our input
data to the correct output class.
 This mapping is what the network must learn.
 After passing all of our data through our model, we're going to
continue passing the same data over and over again.
 This process of repeatedly sending the same data through the
network is considered training.

Epochs:
 One Epoch is when an ENTIRE dataset is passed forward and backward
through the neural network only ONCE.
 Since one epoch is too big to feed to the computer at once we divide it
in several smaller batches.
 We need to pass the full dataset multiple times to the same neural
network.
 So, updating the weights with single pass or one epoch is not enough.
Batch Size:
 Total number of training examples present in a single batch.
 Batch size and number of batches are two different things.
 We can’t pass the entire dataset into the neural net at once.
 So, we divide dataset into Number of Batches or sets or parts.

Iterations:
 Iterations is the number of batches needed to complete one epoch.
 The number of batches is equal to number of iterations for one epoch.
 Let’s say we have 2000 training examples that we are going to use .
 We can divide the dataset of 2000 examples into batches of 500 then it
will take 4 iterations to complete 1 epoch.
 Where Batch Size is 500 and Iterations is 4, for 1 complete epoch.

Training set, Training patterns and Teaching input:
 A learning procedure is always an algorithm that can easily be
implemented by means of a programming language.
 A training set consisting of set of training patterns.
 A training pattern is a set of pairs of input patterns with corresponding
output pattern used to begin training a neural network.
 The teaching input tj is the desired and correct value j should output
after the input of a certain training pattern.
 For a neuron j with the incorrect output oj ,tj is the teaching input, which
means it is the correct or desired output for a training pattern p.

Summary:
 There is the
 input vector x, which can be entered into the neural network.
Depending on the type of network being used the neural network will
output an
 output vector y. Basically, the
 training sample p is nothing more than an input vector. We only use it
for training purposes because we know the corresponding
 teaching input t which is nothing more than the desired output vector
to the training sample. The
 error vector / difference vector Ep is the difference between the
teaching input t and the actual output y under a training input p

Offline Learning:
 The weight vector adjustment and threshold adjustment depend overall
(training) dataset, defining a global cost.
 The learning algorithm updates its parameters after consuming the
whole batch.
 It is also called Batch Learning.
Online Learning:
 The adjustment of the weight and threshold is made after presenting each
training sample to the network.
 The learning algorithm updates its parameters after learning from 1
training instance.
 It is also called Incremental learning.

Epoch:
 Suppose that we need to train a machine learning model with some data, that
data you call training data.
 Huge sets of training data cannot feed the whole bunch to the model at once
due to limitations in computer memory.
 So, we break up the whole training data set into sizeable batches which can fit
into computer’s memory at once.
 We then feed these batches one by one to the model for training.
 One forward pass and one backward pass of all batches exactly once , we call
it has an epoch.
 Basically, it is equivalent to showing the model, the whole training data bunch
once.
 Now, we must carry this one multiple times for successful training, hence,
multiple epochs.
 For example, if there are 20,000 images of data (training set) into 500 batch
size and number of batches is 4, so the iteration is 4 to complete 1 epoch.

Underfit, Overfit and Best fit:

 Overfitting is the situation where any given model is performing too
well on the training data but the performance drops significantly over
the test set is called an overfitting model.
 On the other hand, underfitting is the situation if the model is
performing poorly over the test and the train set, then we call that an
underfitting model.

Visualization of training results of the same training set on networks
with a capacities

Using training samples:
 Following successful learning, it's especially interesting to see if the
network has just memorized.
 If it can utilize our training examples to create the correct output but,
give incorrect answers for all other problems in the same class.
 Suppose that we want the network to train a mapping
 The network has sufficient storage capacity to concentrate on the six
training samples with the output 1 and exactly mark the areas
around the training samples (Image on top).
 On the other hand, a network could have insufficient capacity this
rough presentation of input data does not correspond to the good
generalization performance we desire (Image on bottom).
 Thus, we must find the balance (Image on middle).

 An often-proposed solution for these problems is to divide,
 the training set (70% for training data) into one training set really used to
train and
 the verification set (30% for verification data) to test our progress if there are
enough training samples.
 We can finish the training when the network provides good results on the
training data as well as on the verification data.
 If the verification data is poor, don't change the network structure until the
verification data is positive; otherwise, you risk customizing the network to the
verification data.
 The solution is a third set of validation data used only for validation after a
supposably successful training.
Order of pattern representation:
 There is no guarantee that model will be learned equally well if patterns are
presented in a random order.
 When employing recurrent networks, on the other hand, the same sequence of
patterns causes the patterns to be memorized.
 A random permutation would solve both concerns, however calculating such a
permutation takes a long time.

Learning Data Sets in Artificial Neural Networks:

Training Data Set:
 It's the set of data used to train the model.
 During each epoch, our model will be trained over and over again on this same data
in our training set, and it will continue to learn about the features of this data.
 A set of data used for learning is to fit the parameters [i.e., weights] of the network.
Validation Set:
 The validation set is a set of data, separate from the training set, that is used to
validate our model during training.
 This validation process helps give information that may assist us with adjusting our
hyperparameters.
 A set of data used to tune the parameters [i.e., architecture, number of hidden
units, Storage capacity] of the network.
Test set:
 The test set is a set of data that is used to test the model after the model has
already been trained.
 The test set is separate from both the training set and validation set.
 A set of data is used only to assess the performance [generalization] of a fully
specified network or apply successfully to predict output whose input is known.

Supervised learning:
 Supervised learning is a process of providing labelled input data as
well as correct output data to the machine learning model.
 The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y).
 In supervised learning, the training set consists of input patterns as
well as their correct results in the form of the precise activation of all
output neurons.
 Thus, for each training set that is fed into the network the output, for
instance, can directly be compared with the correct solution and the
network weights can be changed according to their difference.
 The training data provided to the machines work as the supervisor
that teaches the machines to predict the output correctly.
 It applies the same concept as a student learns in the supervision of
the teacher.

How Supervised Learning Works?
 In supervised learning, models are trained using labelled dataset, where the
model learns about each type of data.
 Once the training process is completed, the model is tested on the basis of test
data (a subset of the training set), and then it predicts the output.
 Suppose we have a dataset of different types of shapes which includes square,
rectangle, triangle, and Polygon.
 We need to train the model for each shape:
o If the given shape has four sides, and all the sides are equal, then it will be
labelled as a square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as hexagon.
 Now, after training, we test our model using the test set, and the task of the
model is to identify the shape.
 The machine is already trained on all types of shapes, and when it finds a new
shape, it classifies the shape on the basis of a number of sides, and predicts the
output.

Types of supervised machine learning algorithms:
 Supervised learning can be further divided into two types of problems:

Regression:
 Regression algorithms are used if there is a relationship between the
input variable and the output variable.
 Below are some popular Regression algorithms which come under
supervised learning:
• Linear Regression
• Logistic Regression
• Ridge Regression
• Lasso Regression
• Polynomial Regression

Classification:
 Classification algorithms are used when the output variable is
categorical, which means there are two classes such as Yes-No, True-
false, etc.
 Below are some popular Classification algorithms which come under
supervised learning:
• K-nearest neighbor (KNN)
• Decision Trees
• Naive Bayes
• Support vector Machines (SVM)

Applications of supervised learning:
• Text categorization
• Face Detection
• Signature recognition
• Customer discovery
• Spam detection
• Weather forecasting
• Predicting housing prices based on the prevailing market price

Advantages of Supervised learning:
• Supervised learning in Machine Learning allows you to collect data or
produce a data output from the previous experience.
• Supervised machine learning helps you to solve various types of real-
world computation problems.
Disadvantages of supervised learning:
• Supervised learning models are not suitable for handling the complex
tasks.
• Supervised learning cannot predict the correct output if the test data
is different from the training dataset.
• Training required lots of computation times.

Unsupervised learning:
 Unsupervised learning is a machine learning technique in which models
are not supervised using training dataset.
 Instead, models itself find the hidden patterns and insights from the
given data.
 Unsupervised learning works on unlabelled and uncategorized data
which make unsupervised learning more important.
 The training set only consists of input patterns, the network tries by
itself to detect similarities and to generate pattern classes.
 The goal of unsupervised learning is to find the underlying structure of
dataset, group that data according to similarities, and represent that
dataset in a compressed format.
 In real-world, we do not always have input data with the corresponding
output so to solve such cases, we need unsupervised learning.
 It can be compared to learning which takes place in the human brain
while learning new things.

How Unsupervised Learning works:
 We have taken an unlabelled input data, which means it is not
categorized and corresponding outputs are also not given.
 Now, this unlabelled input data is fed to the machine learning model
in order to train it.
 Firstly, it will interpret the raw data to find the hidden patterns from
the data and then will apply suitable algorithms.
 The algorithm is never trained upon the given dataset, divides the
data objects into groups according to the similarities and difference
between the objects.
 Suppose the unsupervised learning algorithm is given an input
dataset containing images of different types of cats and dogs.
 The task of the unsupervised learning algorithm is to identify the
image features on their own.

Types of Unsupervised machine learning algorithm:
 The unsupervised learning algorithm can be further categorized into
two types of problems:

Clustering:
• Clustering is a method of grouping the objects into clusters such that
objects with most similarities remains into a group and has less or no
similarities with the objects of another group.
• Cluster analysis finds the commonalities between the data objects
and categorizes them as per the presence and absence of those
commonalities.
• Below are some popular Clustering algorithms which come under
unsupervised learning:
• Centroid-based Clustering
• Density-based Clustering
• Distribution-based Clustering
• Hierarchical Clustering

Association:
• An association rule is an unsupervised learning method which is used
for finding the relationships between variables in the large database.
• It determines the set of items that occurs together in the dataset.
• People who buy X item (suppose a bread) are also tend to purchase Y
(Butter/Jam) item.
• Below is the popular Association algorithm which come under
unsupervised learning:
• Apriori algorithm

Applications of unsupervised learning algorithms:
• Fraud detection
• Malware detection
• Identification of human errors during data entry
• Conducting accurate basket analysis, etc.

Advantages of Unsupervised Learning:
• Unsupervised learning solves the problem by learning the data and
classifying it without any labels.
• This type of learning is like human intelligence in some way as the
model learns slowly and then calculates the result.
Disadvantages of Unsupervised Learning:
• The result of the unsupervised learning algorithm might be less
accurate as input data is not labelled, and algorithms do not know
the exact output in advance.
• The more the features, the more the complexity increases.
• The learning phase of the algorithm might take a lot of time, as it
analyses and calculates all possibilities.

Reinforcement learning:
 In reinforcement learning, the network receives a logical or a real
value after completion of a sequence, which defines whether the
result is right or wrong.
 It is clear that this procedure should be more effective than
unsupervised learning since the network receives specific criteria for
problem-solving.
 The training set consists of input patterns, after completion a value
is returned to the network indicating whether the result was right or
wrong.
 Reinforcement Learning is defined as a Machine Learning method
that is concerned with how software agents should take actions in an
environment.

Some important terms used in Reinforcement AI:
• Agent(): An entity that can perceive/explore the environment and act
upon it.
• Environment(): A situation in which an agent is present or surrounded
by. In RL, we assume the stochastic environment, which means it is
random in nature.
• Action(): Actions are the moves taken by an agent within the
environment.
• State(): State is a situation returned by the environment after each
action taken by the agent.
• Reward(): A feedback returned to the agent from the environment to
evaluate the action of the agent.
• Policy(): Policy is a strategy applied by the agent for the next action
based on the current state.
• Value(): It is expected long-term retuned with the discount factor and
opposite to the short-term reward.

How Reinforcement Learning works?
 Consider the scenario of teaching new tricks to your cat
 As cat doesn’t understand English or any other human language, we can’t
tell her directly what to do.
 Your cat is an agent that is exposed to the environment, it is your house.
 An example of a state could be your cat sitting, and you use a specific
word in for cat to walk.
 Our agent reacts by performing an action transition from one “state” to
another “state.”
 For example, your cat goes from sitting to walking.
 The reaction of an agent is an action, and the policy is a method of
selecting an action given a state in expectation of better outcomes.
 After the transition, they may get a reward (fish) or penalty in return.

Types of Reinforcement machine learning algorithms:
 Two kinds of reinforcement learning methods are:
Positive:
 It is defined as an event, that occurs because of specific behaviour.
 It increases the strength and the frequency of the behaviour and
impacts positively on the action taken by the agent.
 This type of Reinforcement helps you to maximize performance and
sustain change for a more extended period.
Negative:
 Negative Reinforcement is defined as behaviour that occurs because
of a negative condition which should have stopped or avoided.

Reinforcement Learning Algorithms:
 The main used algorithms are:
 Q-Learning
 State-Action-Reward-State-Action algorithm (SARSA)
 Monte carlo
 Deep Q network

Applications of Reinforcement learning:
 Traffic Light Control
 Robotics
 Games
 Healthcare
 Finance
 Image processing
 Marketing

Advantages of Reinforcement learning:
 Reinforcement learning can be used to solve very complex problems
that cannot be solved by conventional techniques.
 Once an error is corrected by the model, the chances of occurring
the same error are very less.
 In the absence of a training dataset, it is bound to learn from its
experience.
Disadvantages of Reinforcement learning:
• Too much reinforcement learning can lead to an overload of states,
which can diminish the results.
• Reinforcement learning is not preferable to use for solving simple
problems.
 Reinforcement learning needs a lot of data and a lot of computation.

Gradient optimization procedures:
• Gradient descent is an optimization algorithm which is commonly-used
to train machine learning models and neural networks, for finding a
local minimum of a differentiable function.
• Until the function is close to or equal to zero, the model will continue
to adjust its parameters to yield the smallest possible error or cost
function.

How does gradient descent work?
 The starting point is just an arbitrary point for us to evaluate the
performance.
 From that starting point, we will find the derivative (or slope), and
from there, we can use a tangent line to observe the steepness of the
slope.
 The slope will inform the updates to the parameters—i.e. the weights
and bias.
 The slope at the starting point will be steeper, but as new parameters
are generated, the steepness should gradually reduce until it reaches
the lowest point on the curve, known as the point of convergence.
 The goal of gradient descent is to minimize the cost function, or the
error between predicted and actual value.

Learning rate or step size - is the size of the steps that are taken to reach
the minimum. This is typically a small value, and it is evaluated and
updated based on the behaviour of the cost function.
 High learning rates result in larger steps but risks overshooting the
minimum.
 A low learning rate has small step sizes. While it has the advantage
of more precision, the number of iterations compromises overall
efficiency as this takes more time and computations to reach the
minimum.
The cost function or loss function - measures the difference or error,
between actual value and predicted value at its current position. This
improves the machine learning model's efficacy by providing feedback to
the model so that it can adjust the parameters to minimize the error and
find the local or global minimum.

 The gradient is a vector g that is defined for any differentiable point of a
function, steepest ascent is g.
 The gradient is a generalization of the derivative for multi- dimensional
functions.
 The negative gradient -g exactly points towards the steepest descent.
 The gradient is referred to as nabla operator ∇
 The overall notation of the gradient g of the point (x, y) of a two- dimensional
function f being g(x, y) =∇f (x, y).
 Let g be a gradient, then g is a vector with n components that is defined for any
point ofa differential n-dimensionalfunction f(x1, x2, . . . , xn).
 The gradient operator notation is defined as
g(x1, x2, ... ,xn)=∇f(x1,x2,. ..,xn)

How to calculate Gradient Descent?
 Here are the steps of finding minimum of the function using gradient
descent:
 Calculate the gradient by taking the derivative of the function with
respect to the specific parameter.
 In case, there are multiple parameters, take the partial derivatives
with respect to different parameters.
 Calculate the descent value for different parameters by multiplying
the value of derivatives with learning rate (step size) and -1.
 Update the value of parameter by adding up the existing value of
parameter and the descent value.

 The below represents the updating of parameter θ with the value of
gradient in the opposite direction while taking small steps.

 The gradient of a scalar-valued multivariable function g(x,y,…)
denoted by ∇f(x,y,…)

 As the name suggests minimum is the lowest value in a set and
maximum is the highest value.
 Global means it is true for the entire set and local means it is true in
some vicinity.
 A function can have multiple local maxima and minima. However,
there can be only one global maximum as well as minimum.

Possible errors during gradient descent:
 (a) - Every gradient descent procedure can get stuck within local
minimum.
 This problem is increasing proportionally to the size of the error
surface, and there is no universal solution.
 (b) - Flat plateau on the error surface has less slope may cause
training slowness.
 (c) – Steep canyons in the error surface may cause oscillation.
 A sudden alternation from very strong negative gradient to a very
positive one, result in oscillation. Such error does not occur often.
 (d) – The gradient is very large at a steep slope, so that the large
steps can be made, and a good minimum can be possibly be missed.

Hebbian learning rule:
 In 1949, Donald O. Hebb formulated the Hebbian rule, which is the
basis for most of the complicated learning rules.
 Hebbian rule "If neuron j receives an input from neuron i and if both
neurons are strongly active at the same time, then increase the
weight wi,j”.
 The rule is:
 with being the change in weight from i to j, which is
proportional to the following factors:
 The output oi of the predecessor neuron i, as well as,
 The activation aj of the successor neuron j
 A constant η, i.e. the learning rate

 The changes in weight are simply added to the weight wi,j
 Hebb proposed that
 If two interconnected neurons on either side of a synapse are both on
or fired or activated are at the same time(synchronously), then the
synaptic weight between them should be increased.
 If two interconnected neurons on either side of a synapse are both on
or fired or activated are at the different time(asynchronously), then
the synaptic weight between them should be decreased.
 Such a synapse is called Hebbian Synapse.
 The generalized form of the Hebbian Rule specifies the proportionality
of the change in weight to the product of two undefined functions, but
with defined input values. Thus, the product of the functions.
Changes in weight = Learning rate . Pre synaptic signal . Post synaptic
signal

The perceptron, backpropagation and its variants
Introduction:
 Perceptron was described by Frank Rosenblatt in 1958.
 Rosenblatt defined that the weighted sum and a non-linear activation
function as components of the perceptron.

Architecture of perceptron with one layer of variable connections

 An input neuron
 is an identity neuron.
 It exactly forwards the information received.
 The input neuron is represented by the symbol
 Information processing neuron
 processes the input information, i.e., do not represent the identity
function.
 A binary neuron sums up all inputs by using the weighted sum as
propagation function, which we want to illustrate by the sign Σ.
 Then the activation function of the neuron is the binary threshold
function, which can be illustrated by
 Other neurons that use the weighted sum as propagation function
but the activation functions hyperbolic tangent or Fermi function,
or with a separately defined activation function fact, are similarly
represented by

 The perceptron
 is a feed forward network containing a retina that is used only for data
acquisition and which has fixed-weighted connections with the first neuron
layer (input layer).
 The fixed-weight layer is followed by at least one trainable weight layer.
 One neuron layer is completely linked with the following layer.
 The first layer of the perceptron consists of the input neurons.
 The first neuron layer is often understood as input layer, because this layer
only forwards the input values.
 The retina itself and the static weights behind it are no longer mentioned
or displayed, since they do not process information in any case.
 So, the depiction of a perceptron starts with the input neurons

A single layer perceptron
Introduction:
 A single layer perceptron (SLP) is a perceptron having only one layer
of variable weights and one layer of output neurons Ω.
 Connections with trainable weights go from input layer to an output
neuron Ω, which returns the information whether the pattern
entered at the input neurons was recognized or not.
 Certainly, the existence of several output neurons Ω 1, Ω2. . . Ω n does
not considerably change the concept of the perceptron.
 A perceptron with several output neurons can also be regarded as
several different perceptron with the same input.

Perceptron learning algorithm:
 The original perceptron learning algorithm with binary neuron
activation function is described in algorithm.
 It has been proven that the algorithm converges in finite time, so in
finite time the perceptron can learn anything.
 Suppose that we have a single layer perceptron with randomly set
weights which we want to teach a function by means of training
samples.
 The set of these training samples is called P.
 It contains, as already defined, the pairs (p, t) of the training samples p
and the associated teaching input t.

 x is the input vector
 y is the output vector of a neural network
 Output neurons are referred to as
 i is the input value of a neuron
 o is the output value of a neuron
 The error vector Ep represents the difference (t−y) under a certain
training sample p.
 O be the set of output neurons
 I be the set of input neurons.
 Our learning target will be certainly be, that for all training samples the
output y of the network is approximately the desired output t,

Learning in neural network:
 Learn values of weights from I/O pairs
 Start with random weights
 Load training example’s input
 Observe computed input
 Modify weights to reduce difference
 Iterate over all training examples
 Terminate when weights stop changing OR when error is very small

 The error function
 regards the set of weights W as a vector and maps the values onto the
normalized output error.
 It is obvious that a specific error function can analogously be
generated for a single pattern p.
 Err(W) is defined on the set of all weights which we here regard as the
vector W.
 Change in all weights is referred to as ΔW.
 ΔW is calculated by the gradient ∇Err(W) of the error function Err(W):

 We derive the error function according to a weight wi,Ω and obtain the
value ∆wi,Ω of how to change this weight.
 The squared distance between the output vector y and the teaching
input t appears adequate to our needs. It provides the error Errp that is
specific for a training sample p over the output of all outputneurons Ω
 Thus, we calculate the squared difference of the components of the
vectors t and y, given the pattern p, and sum up these squares.

 The summation of the specific errors Errp(W ) of all patterns p then
yields the definition of the error Err and there fore the definition of
the error functionErr(W ):
 We tweak the individual weights wi,Ω a bit and see how the error
Err(W ) is changing – which corresponds to the derivative of the error
function Err(W ) according to the very same weight wi,Ω.
 This derivative corresponds to the sum of the derivativesof all
specific errors Errp according to this weight (since the total error
Err(W ) results from the sum of the specific errors):

 Basically, the data is only transferred through a function, the result of
the function is sent through another one, and so on.
 The path of the neuron outputs oi1 and oi2 , which the neurons i1 and i2
entered into aneuron Ω, initially is the propagation function (here
weighted sum), from which thenetwork input is going to be received.
 This is then sent through the activation function of the neuron Ω so
that we receive the output of this neuron which is at the same time a
component of the output vector y:

Propagation function & Network input

 Let I = i1, i2, . . . , in be the set of neurons, such that
 Then the network input of j, called netj, is calculated by the propagation
function fprop as follows:
 The multiplication of the output of each neuron i by wi,j, and the
summation of the results represents netj

netΩ → fact
= fact(netΩ)
= oΩ
= yΩ
 As we can see, this output results from many nested
functions:
oΩ = fact(netΩ)
= fact(oi1 · wi1,Ω + oi2 ·wi2,Ω)
 We want to calculate the derivatives ofequation and due to the nested
functions we can apply the chain rule to factorize the derivative

 The examination of Errp clearly shows that this change is exactly the
difference between teaching input and output (tp,Ω - op,Ω)
 Since Ω is anoutput neuron, op,Ω = yp,Ω.
 The closer the output is to the teaching input, the smaller is the specific
error.
 This difference is also called δp,Ω .
 We know,
op,Ω = yp,Ω

 The second multiplicative factor of equation and of the following one is
the derivative of the output specific to the pattern p of the neuron Ω
according to the weight wi,Ω.
 Due to the requirement at the beginning of the derivation, we only
have a linear activation function fact, therefore we can just as well look
at the change of the network input when wi,Ω is changing:

 We insert this in equation, which results in our modification rule for
a weight wi,Ω

 From the very beginning the derivation has been intended as an
“offline rule” by means of the question of how to add the errors of
all patterns and how to learn them after all patterns have been
represented.
 Although this approach is mathematically correct, the
implementation is far more time-consuming.
 The "online-learning version" of the delta rule simply omits the
summation and learning is realized immediately after the
presentation of each pattern.

Delta rule:
 If we determine, analogously to the aforementioned derivation, that
the function h of the Hebbian theory only provides the output oi of the
predecessor neuron i
 and if the function g is the difference between the desired activation tΩ
and the actual activation aΩ or oΩ, we will receive the delta rule, also
known as Widrow-Hoff rule:
 Apparently the delta rule only applies for SLPs, since the
formula is always related to the teaching input, and there is
no teaching input for the inner processing layers of neurons.

Linear Separability:
 Let f be the XOR function which expects two binary inputs and
generates a binary output.
 Let us try to represent the XOR function by means of an SLP with two
input neurons i1, i2 and one output neuron.
 We use the weighted sum as propagation function, a binary activation
function with the threshold value and the identity as output function.
 Depending on i1 and i2has to output the value 1 if the following holds:

 With a constant threshold value ƟΩ, the right part of in equation is a
straight line through a coordinate system defined by the possible
outputs oi1 und oi2 of the input neurons i1 and i2
 For a positive wi2,Ω the output neuron fires for input combinations lying
above the generated straight line.
 For a negative wi2,Ω it would fire for all input combinations lying below
the straight line.
 A SLP is only capable of representing linearly separable data.
 Only sets that can be separated by a hyperplane, i.e. which are linearly
separable, can be classified by an SLP.
 Thus, for more difficult tasks with more inputs we need something
more powerful than SLP.
 The XOR problem itself is one of these tasks, since a perceptron that is
supposed to represent the XOR function already needs a hidden layer.

 Assume a simple SLP model with 3 neurons and inputs= 2, 2 and 2.
The weights to the input neurons are 4, 4 and 4 respectively. Assume
the activation function is a linear constant value of 3. What will be the
output?

 It is given that inputs of 3 neurons are 2, 2 and 2 corresponding
weights are 4, 4 and 4.
activation function (2 * 4 + 2 * 4 + 2 * 4)
 activation function is a linear constant value of 3.
3 (2 * 4 + 2 * 4 + 2 * 4)
3 (24)
 The output will be 72

A multilayer perceptron:
 A perceptron with two or more trainable weight layers is called
multilayer perceptron or MLP.
 It is more powerful than an SLP.
 A single layer perceptron can divide the input space by means of a
hyper plane (in a two-dimensional input space by means of a straight
line).

 A two stage perceptron (two trainable weight layers, three neuron
layers) can classify convex polygons by further processing these
straight lines.
 A multilayer perceptron represents an universal function
approximator.
 Perceptron with more than one layer of variably weighted
connections are referred to as multilayer perceptron (MLP).
 An n-layer or n-stage perceptron has thereby exactly n variable
weight layers and n +1 neuron layers with neuron layer 1 being the
input layer.

3 weights layer or 3-stage perceptron and 4 neuron layers

Backpropagation of error:
 Backpropagation of error generalizes the delta rule to allow for MLP
training.
 Backpropagation is a gradient descent procedure with the error
function Err(W) receiving all n weights as arguments and assigning
them to the output error, i.e. being n-dimensional.
 On Err(W) a point of small error or even a point of the smallest error
is sought by means of the gradient descent.
 Thus, in analogy to the delta rule, backpropagation trains the weights
of the neural network.
 And it is exactly the delta rule or its variable i for a neuron i which is
expanded from one trainable weight layer to several ones by
backpropagation.

Selecting learning rate:
 The selection of the learning rate has heavy influence on the learning
process.
 The change in weight is proportional to the learning rate.
 Speed and accuracy of a learning procedure can always be controlled
by and are always proportional to a learning rate which is written as
η.
 If the value of the learning rate is too large, the jumps on the error
surface are also too large.
 Additionally, the movements across the error surface would be very
uncontrolled.
 A small η is the desired input, which, however, can cost a huge, often
unacceptable amount of time.
 Experience shows that good learning rate values are in the range of

Variation of the learning rate over time:
 The selection of η significantly depends on the problem, the network and
the training data.
 But for instance, it is popular to start with a relatively large , e.g. 0.9 and to
slowly decrease it down to 0.1
 For simpler problems can often be kept constant.
Variable learning rate:
 In the beginning, a large learning rate leads to good results, but later it
results in inaccurate learning.
 A smaller learning rate is more time-consuming, but the result is more
precise.
 Thus, during the learning process the learning rate needs to be decreased by
one order of magnitude once or repeatedly.

Different layers – Different learning rates:
 The farer we move away from the output layer during the learning
process, the slower backpropagation is learning.
 Thus, it is a good idea to select a larger learning rate for the weight
layers close to the input layer than for the weight layers close to the
output layer.

Resilient backpropagation:
 Resilient backpropagation is an extension to backpropagation of error.
 We have two backpropagation specific properties that can occasionally
be a problem:
1. Users of backpropagation can choose a bad learning rate η.
2. The further the weights are from the output layer; the slower
backpropagation learns.
 MARTIN RIEDMILLER et al. enhanced backpropagation and called their
version resilient backpropagation (short Rprop).

Learning rates:
 Backpropagation uses default learning rate η, which is selected by the
user, and applies to the entire network.
 It remains static until it is manually changed.
 Here, Rprop pursues a completely different approach:
 Thereis no global learning rate.
 First, eachweight wi,j has its own learning rate ηi,j
 second, these learning ratesare not chosen by the user, but are
automatically set by Rprop itself.
 Third,the weight changes are not static but are adapted for each
time step of Rprop.
 To account for the temporal change, we must correctly call it ηi,j(t).
 This not only enables more focused learning, also the problem of an
increasingly slowed down learning throughout the layers is solved in
anelegant way.

Weight change:
 When using backpropagation, weights are changed proportionally to the
gradient of the error function.
 Here, Rprop takes other waysas well:
 The amount of weight change ∆wi,j directly corresponds to the
automatically adjusted learning rate ηi,j.
 Thus, the change in weight is not proportional to the gradient,
it is only influenced by the sign of the gradient.
 The weight specific learning rates directly serve as absolute values for the
changes of the respective weights.
 As with the derivation of backpropagation, we derive the error function
Err(W ) by the individual weights wi,j and obtain gradients

 We shorten the gradient to :
 If the sign of the gradient is positive, we must decrease the weight
wi,j, so the weight is reduced by ηi,j.
 If the sign of the gradient is negative, the weightneeds to be increased,
so ηi,j is added to it.
 If the gradient is exactly 0, nothing happens at all.
 The corresponding terms are affixed with a (t)to show that everything
happens at the same time step.

Variations in Backpropagation:
Backpropagation has often been extended:
 Backpropagation has often been extended and altered besides
Rprop.
 Many of these extensions can simply be implemented as optional
features of backpropagation in order to have a larger scope for
testing.
Adding momentum to learning:
 Let us assume to descent a steep slope on skis, what prevents us
from immediately stopping at the edge of the slopeto the plateau?
 Exactly - our momentum.
 With backpropagation the momentum term is responsible for the
fact that a kind of moment of inertia (momentum) is added to every
step size by always adding a fraction of the previous change to every
new change in weight:

 The concept of time - when referring to the current cycle as (t), then
the previous cycle is identified by (t - 1),
 The variation of backpropagation by means of the momentum term is
defined as:
 We accelerate on plateaus (avoiding standstill on plateaus) and slow
down on craggy surfaces (preventing oscillations).
 Moreover, the effect of inertia can be varied via the pre-factor α,
common values are between 0.6 and 0.9.
 The momentum enables the positive effect that our skier swings back
and forth several times in a minimum, and finally lands in the
minimum.

Flat spot elimination prevents neurons from getting stuck:
 It must be pointed out that with the hyperbolic tangent as well
as with the Fermi function the derivative outside of the close
proximity of Θ is nearly 0.
f(x) =
 This results in the fact that it becomes very difficult to move
neurons away from the limits of the activation (flat spots), which
could extremely extend the learning time.
f'(x) = sigmoid(x)*(1-sigmoid(x))
 This problem can be dealt with by modifying the derivative, for
example by adding a constant (e.g. 0.1), which is called flat spot
elimination or fudging.

Fermi activation function and hyperbolic tangent activation function

The second derivative can beused, too:
 According to DAVID PARKER, Second order backpropagation also uses the second
gradient, i.e. the second multi-dimensional derivative of the error function, to
obtain more precise estimates of the correct ∆wi,j.
 Even higher derivatives rarely improve the estimations.
 Thus, less training cycles are needed but those require much more computational
effort.
 In general, we use further derivatives for higher order methods.
 As expected, the procedures reduce the number of learning epochs, but
significantly increase the computational effort of the individual epochs.
 So in the end theseprocedures often need more learning time than
backpropagation.
 If you have “n” weights in the neural network, one iteration of a second-order
optimization algorithm will reduce the loss function at approximately the same
rate as “n” iterations of a standard first-order optimization algorithm.

Weight decay - Punishment oflarge weights:
 The weight decay according to PAUL WERBOS is a modification that extends the
error by a term punishing large weights.
 So, the error under weight decay does not only increase proportionally to the
actual error but also proportionally tothe square of the weights.
 As a result, the network is keeping the weights small during learning.
 Additionally, due to these small weights, the error function often shows weaker
fluctuations, allowing easier and more controlled learning.
 The pre-factor again resulted from simple pragmatics.
 The factor strength of punishment:
 Values from 0.001 to 0.02 are often used here.

Cutting networks down: Pruning and Optimal BrainDamage:
 If we have executed the weight decay longenough and notice that for a neuron in
the input layer all successor weights are 0 or close to 0, we can remove the
neuron, hence losing this neuron and some weightsand thereby reduce the
possibility that the network will memorize. This procedure is called pruning.
 A method to detect and delete unnecessary weights and neurons is referred to as
optimal brain damage (reducing the size of a learning network by selectively
deleting weights)
 Two competing terms make up the mean error per output neuron:
 The first term is, if a weight is required to minimize the error, as we evaluate
the difference between output and teaching input which is a customary.
 If this does not happen, the second term attempts to "push" a weight towards
0 (Weight decay).
 Neurons which only have zero weights can be pruned again in the end.

Getting started – Initial configuration of a multilayer perceptron:
Number of layers:
 A network should have one layer of input neurons and one layer of
output neurons, which results in at least two layers.
 If our problem is not linearly separable, then we need at least one
hidden layer ofneurons.
 MLP with one hidden neuron layer is already capable of approximating
arbitrary functions withany accuracy.
 Representability means that a perceptron can theoretically realize a
mapping.
 Learnability indicates that we can train a perceptron to realize a
mapping.
 Experience shows that two hidden neuron layers (or three trainable
weight layers) can be very useful to solve a problem.

The number of neurons hasto be tested:
 The number of neurons principally corresponds to the number of free
parameters (Neuron) of the problem to be represented.
 Since we have already discussed the network capacity with respect to
memorizing, it is clear that our goal is to have as few free parameters as
possible but as many as necessary.
 But we also know that there is no standard solution for the question of
how many neurons should be used.
 Thus, the most useful approach is to initially train with only a few neurons
and to repeatedly train new networks with more neurons until the result
significantly improves and, particularly, the generalization performance is
not affected.

Selecting an activationfunction:
 Another very important parameter for the way of information
processing of a neural network is the selection of an activation
function.
 The activation function for input neurons is fixed to the identity
function, since they do not process information.
 The first question to be asked is whether we actually want to use the
same activation function in the hidden layer and in the output layer –
no one prevents us from choosing different functions.
 Generally, the activation function is the same for all hidden neurons
as well as for the output neurons respectively.
 For tasks of function approximation it has been found reasonable to
use the hyperbolic tangent as activation function of the hidden
neurons, while a linear activation function is used in the output.
 However, linear activation functions in the output can also cause huge
learning steps and jumping over good minima in the error surface.
 This can be avoided by setting the learning rate to very small values in the
output layer.

Weights should be initializedwith small, randomly chosenvalues:
 The initialization of weights with 0, there will be no changein weights
at all.
 If they are all initialized by the same value, they will all change equally
during training.
 The simple solution of this problem is called symmetry breaking,
which is the initialization of weights with small random values.
 The range of random values could be the interval [-0.5 to 0.5] not
including 0 or values very close to 0.
 This random initializationhas a nice effect.

Dropout:
 Deep neural nets with a large number of parameters are very powerful machine
learning systems.
 However, overfitting is a serious problem in such networks.
 Large networks are also slow to use, making it difficult to deal with overfitting by
combining the predictions of many different large neural nets at test time.
 Dropout is a technique for addressing this problem.
 The key idea is to randomly drop units (along with their connections) from the
neural network during training.
 This prevents units from co-adapting too much.
 We show that dropout improves the performance of neural networks on supervised
learning tasks in vision, speech recognition, document classification and
computational biology.

Artificial Neural Networks , Recurrent networks , Perceptron's

Recommended

Recommended

More Related Content

Similar to Artificial Neural Networks , Recurrent networks , Perceptron's

Similar to Artificial Neural Networks , Recurrent networks , Perceptron's (20)

Recently uploaded

Recently uploaded (20)

Artificial Neural Networks , Recurrent networks , Perceptron's