Learning Processes
Relevant learning technologies
– Artificial Intelligence
– Expert Systems
–Case-Based Reasoning
–Data Warehousing
–Intelligent agents
–Neural Networks
Learning of NN
• Learning of NN is a process by which the
free parameters of a neural network are
adapted through a process of stimulation by
the environment in which the network is
embedded.
• The type of learning is determined by
the manner in which the parameter changes
take place. (Mendel & McClaren 1970)
Supervised Learning
• In supervised learning, the network is presented with
inputs together with the target (teacher signal)
outputs.
• Then, the neural network tries to produce an output
as close as possible to the target signal by adjusting
the values of internal weights.
• The most common supervised learning method is the
“error correction method”, using methods such as
– Least Mean Square (LMS)
– Back Propagation
Unsupervised Learning
• In unsupervised learning, there is no teacher (target signal) from
outside and the network adjusts its weights in response to only
the input patterns.
• Typical examples of unsupervised learning are:
– Unsupervised Hebbian learning
(e.g., principal component analysis)
– Unsupervised competitive learning
(e.g., clustering, data compression)
Unsupervised Competitive Learning
• In unsupervised competitive learning the neurons
take part in a competition for each input. The
winner of the competition, and sometimes some
other neurons, are allowed to change their weights.
• In simple competitive learning only the winner is
allowed to learn (change its weights).
• In self-organizing maps other neurons in the
neighborhood of the winner may also learn.
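To make the competition concrete, here is a minimal winner-take-all sketch in Python (the toy data and all names are illustrative, not from the slides): the winner is the unit whose weight vector is closest to the input, and only the winner moves toward the input; a SOM would also update the winner's neighbors with a smaller step.

import numpy as np

def competitive_step(W, x, lr=0.1):
    """One step of simple competitive learning.
    W: (units, dims) weight matrix, x: (dims,) input.
    Only the winner (closest weight vector) learns."""
    winner = np.argmin(np.linalg.norm(W - x, axis=1))  # the competition
    W[winner] += lr * (x - W[winner])                  # winner moves toward x
    # A SOM would also nudge W[neighbors of winner] with a smaller step.
    return winner

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))              # 3 competing units, 2-D inputs
for x in rng.normal(size=(100, 2)):      # a stream of toy inputs
    competitive_step(W, x)
print(W)                                 # units have spread over the input cloud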
Learning Tasks
Supervised
– Data: labeled examples (input, desired output)
– Tasks: classification, pattern recognition, regression
– NN models: perceptron, adaline, feed-forward NN, radial basis
function, support vector machines

Unsupervised
– Data: unlabeled examples (different realizations of the input)
– Tasks: clustering
– NN models: self-organizing maps (SOM), Hopfield networks
Supervised & Unsupervised Learning
An un-supervised learning task
Organise these blocks into classes
[Figure: blocks labeled A and B, to be organized into classes]
Two Possible Groupings…
[Figure: two possible groupings of the same A/B blocks]
No guidance on which to choose
Supervised Learning
[Figure: the same A/B blocks, each now assigned a class label]
To which class does
this belong?
Class categories are given as
side information.
From the biological perspective
un-supervised learning is more
important.
Reinforcement learning: a generalization of supervised
learning; the reward is given later, and is not necessarily tied
to a specific action.
Learning Laws
• Hebb’s Rule
• Hopfield Law
• Delta Rule (Least Mean Square Rule)
• The Gradient Descent Rule
• Kohonen’s Learning Law
– Compete to learn
– Winner survives
Learning Laws
Hebb’s Rule
If a neuron receives an output from another
neuron, and if both are highly active (both have
the same sign), the weight between the neurons
should be strengthened.
• It means that if two interconnected neurons are both
“on” at the same time, then the weight between them
should be increased
Learning Laws
Hopfield Law
If the desired output and the input are both
active or both inactive, increment the
connection weight by the learning rate,
otherwise decrement the weight by the
learning rate.
Similar to Hebb's rule, but it also specifies the magnitude of the change.
Learning Laws
The Delta Rule
Continuously modify the strengths of the
input connections to reduce the difference (the
delta) between the desired output value and the
actual output of a processing element. The derivative
of the transfer function is used.
This is the most commonly used learning law.
Learning Laws
The Gradient Descent Rule
The derivative of the transfer function is
still used to modify the delta error, and an
additional proportionality constant tied
to the learning rate is used to adjust the
weights.
An extension of the Delta rule; very commonly used.
Learning Laws
Kohonen’s Learning Law
The processing elements compete for the
opportunity to learn, i.e., to update their weights.
The processing element with the largest output
wins and is able to inhibit its
competitors and excite its neighbors.
Inspired by learning in biological systems.
With Hebbian learning (1949), two
learning methods are possible
• With supervised learning an input is
associated with an output
– If the input and output are the same, we speak
of auto-associative learning
– If they are different it is called hetero-
associative learning
• With unsupervised learning there is no
teacher: the network tries to discern
regularities in the input patterns
Associative memory
Hetero-associative memory: A → α, B → β
Auto-associative memory: A → A, B → B
Auto-association recalls the same pattern; hetero-association recalls
a different pattern (e.g., an image of the Niagara Waterfall from a cue).
[Figure: the two kinds of associative memory, with same-pattern and
different-pattern examples]
Hebb’s Algorithm
• Step 0: initialize all weights to 0
• Step 1: Take a training input xi with target output t
• Step 2: Adjust the weights: wi(new) = wi(old) + xit
• Step 3: Adjust the bias (just like the weights):
• b(new) = b(old) + t
Example
• PROBLEM:
Construct a Hebb Net
which performs like an
AND function, that is,
only when both features
are “active” will the
data be in the target
class.
• TRAINING SET (with the bias input always at 1):

x1   x2   bias   Target
 1    1    1      1
 1   -1    1     -1
-1    1    1     -1
-1   -1    1     -1

[Figure: a single neuron with inputs x1 and x2 (weights w1, w2) and a
constant bias input 1 (weight b)]
Training – First Input
• Initialize the weights to 0: w1 = 0, w2 = 0, b = 0
• Present the first input (1 1 1) with a target of 1
Update the weights:
w1(new) = w1(old) + x1t = 0 + 1 = 1
w2(new) = w2(old) + x2t = 0 + 1 = 1
b(new) = b(old) + t = 0 + 1 = 1
Training – Second Input
• Present the second input (1 -1 1) with a target of -1
Update the weights:
w1(new) = w1(old) + x1t = 1 + 1(-1) = 0
w2(new) = w2(old) + x2t = 1 + (-1)(-1) = 2
b(new) = b(old) + t = 1 + (-1) = 0
Training – Third Input
• Present the third input (-1 1 1) with a target of -1
Update the weights:
w1(new) = w1(old) + x1t = 0 + (-1)(-1) = 1
w2(new) = w2(old) + x2t = 2 + 1(-1) = 1
b(new) = b(old) + t = 0 + (-1) = -1
Training – Fourth Input
• Present the fourth input (-1 -1 1) with a target of -1
Update the weights:
w1(new) = w1(old) + x1t = 1 + (-1)(-1) = 2
w2(new) = w2(old) + x2t = 1 + (-1)(-1) = 2
b(new) = b(old) + t = -1 + (-1) = -2
Final Neuron
• This neuron (w1 = 2, w2 = 2, b = -2) works on all four training inputs:

x1   x2   bias   Target   Check
 1    1    1      1       1*2 + 1*2 + 1*(-2) = 2 > 0
 1   -1    1     -1       1*2 + (-1)*2 + 1*(-2) = -2 < 0
-1    1    1     -1       (-1)*2 + 1*2 + 1*(-2) = -2 < 0
-1   -1    1     -1       (-1)*2 + (-1)*2 + 1*(-2) = -6 < 0
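The whole worked example can be reproduced in a few lines of Python; this is a minimal sketch of Hebb's algorithm (names are illustrative), and it ends with exactly the weights found above: w = (2, 2), b = -2.

import numpy as np

# Bipolar AND training set: (x1, x2) -> target
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
T = np.array([1, -1, -1, -1])

w = np.zeros(2)        # Step 0: initialize all weights to 0
b = 0.0
for x, t in zip(X, T):
    w += x * t         # Step 2: wi(new) = wi(old) + xi*t
    b += t             # Step 3: b(new) = b(old) + t

print(w, b)                   # -> [2. 2.] -2.0, as in the slides
print(np.sign(X @ w + b))     # thresholded net inputs: [ 1. -1. -1. -1.]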
Hebbian Learning (again)
Alternative View
• Goal: Associate an input vector with a specific output
vector in a neural net
• In this case, Hebb's Rule is the same as taking the outer
product of the two vectors
s = (s1, ..., si, ..., sn) and t = (t1, ..., ti, ..., tm):

        [s1]                [s1t1 ... s1tm]
s^T t = [ :] [t1 ... tm] =  [  :         :]   <- the weight matrix
        [sn]                [snt1 ... sntm]
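In NumPy this weight matrix is literally an outer product. A small illustrative sketch (toy vectors, not from the slides) that also previews the next two slides: several associations are stored by summing their outer products, which gives perfect recall when the input vectors are orthogonal.

import numpy as np

s = np.array([1, -1, -1, -1])    # input pattern (n = 4)
t = np.array([1, -1, -1])        # output pattern (m = 3)

W = np.outer(s, t)               # the n x m weight matrix s^T t
print(W.shape)                   # (4, 3)

# Storing a second association: add the individual weight matrices.
s2 = np.array([-1, 1, -1, -1])
t2 = np.array([1, -1, 1])
W_total = np.outer(s, t) + np.outer(s2, t2)

# Perfect recall requires the stored inputs to be orthogonal:
print(np.dot(s, s2))             # 0 here, so both pairs can be recalled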
Weight Matrix
• To store more than one association in a neural net
using Hebb’s Rule
– Add the individual weight matrices
• This method works only if the input vectors of the
stored associations are orthogonal (uncorrelated)
– That is, if their dot product is 0: two input vectors
s = (s1, ..., sn) and s' = (s'1, ..., s'n) are orthogonal when

s · s' = s1s'1 + ... + sns'n = 0
Heteroassociative Architecture
• There are n input units and m output units with each
input connected to each output unit.
[Figure: for m = 3, each of the three output units receives all n inputs
x1 ... xn through its own weights w1j ... wnj]

Activation functions:

For bipolar targets:
yi = 1 if y_in_i > 0
     0 if y_in_i = 0
    -1 if y_in_i < 0

For binary targets:
yi = 1 if y_in_i > 0
     0 if y_in_i <= 0
Example
• GOAL: build a neural network which will associate the
following two sets of patterns using Hebb’s Rule:
s1 = ( 1 -1 -1 -1) t1 = ( 1 -1 -1)
s2 = (-1 1 -1 -1) t2 = ( 1 -1 1)
s3 = (-1 -1 1 -1) t3 = (-1 1 -1)
s4 = (-1 -1 -1 1) t4 = (-1 1 1)
The process will involve 4 input
neurons and 3 output neurons
The algorithm involves finding
the four outer products and
adding them
[Figure: 4 input neurons fully connected to 3 output neurons n1, n2, n3]
Algorithm
Pattern pair 1:
[ 1]              [ 1 -1 -1]
[-1] [ 1 -1 -1] = [-1  1  1]
[-1]              [-1  1  1]
[-1]              [-1  1  1]

Pattern pair 2:
[-1]              [-1  1 -1]
[ 1] [ 1 -1  1] = [ 1 -1  1]
[-1]              [-1  1 -1]
[-1]              [-1  1 -1]

Pattern pair 3:
[-1]              [ 1 -1  1]
[-1] [-1  1 -1] = [ 1 -1  1]
[ 1]              [-1  1 -1]
[-1]              [ 1 -1  1]

Pattern pair 4:
[-1]              [ 1 -1 -1]
[-1] [-1  1  1] = [ 1 -1 -1]
[-1]              [ 1 -1 -1]
[ 1]              [-1  1  1]
Weight Matrix
• Add all four individual weight matrices to
produce the final weight matrix:

[ 1 -1 -1]   [-1  1 -1]   [ 1 -1  1]   [ 1 -1 -1]   [ 2 -2 -2]
[-1  1  1] + [ 1 -1  1] + [ 1 -1  1] + [ 1 -1 -1] = [ 2 -2  2]
[-1  1  1]   [-1  1 -1]   [-1  1 -1]   [ 1 -1 -1]   [-2  2 -2]
[-1  1  1]   [-1  1 -1]   [ 1 -1  1]   [-1  1  1]   [-2  2  2]

Each column defines the weights for one output neuron (n1, n2, n3).
Example Architecture
• General structure: inputs x1, x2, x3, x4 fully connected to output
units n1, n2, n3; the incoming weights of each unit are the columns of

[ 2 -2 -2]
[ 2 -2  2]
[-2  2 -2]
[-2  2  2]

yi = 1 if y_in_i > 0
     0 if y_in_i = 0
    -1 if y_in_i < 0
Example Run 1
• Try the first input pattern: s1 = ( 1 -1 -1 -1), t1 = ( 1 -1 -1)

y_in1 = 2 - 2 + 2 + 2 = 4, so y1 = 1
y_in2 = -2 + 2 - 2 - 2 = -4, so y2 = -1
y_in3 = -2 - 2 + 2 - 2 = -4, so y3 = -1

The output (1 -1 -1) matches t1. Where is this information stored?
Example Run II
• Try the second input pattern: s2 = (-1 1 -1 -1), t2 = ( 1 -1 1)

y_in1 = -2 + 2 + 2 + 2 = 4, so y1 = 1
y_in2 = 2 - 2 - 2 - 2 = -4, so y2 = -1
y_in3 = 2 + 2 + 2 - 2 = 4, so y3 = 1
Example Run III
• Try the third input pattern: s3 = (-1 -1 1 -1), t3 = (-1 1 -1)

y_in1 = -2 - 2 - 2 + 2 = -4, so y1 = -1
y_in2 = 2 + 2 + 2 - 2 = 4, so y2 = 1
y_in3 = 2 - 2 - 2 - 2 = -4, so y3 = -1
Example Run IV
• Try the fourth input pattern: s4 = (-1 -1 -1 1), t4 = (-1 1 1)

y_in1 = -2 - 2 + 2 - 2 = -4, so y1 = -1
y_in2 = 2 + 2 - 2 + 2 = 4, so y2 = 1
y_in3 = 2 - 2 + 2 + 2 = 4, so y3 = 1
Example Run V
• Try a non-training pattern: (-1 -1 -1 -1)

y_in1 = -2 - 2 + 2 + 2 = 0, so y1 = 0
y_in2 = 2 + 2 - 2 - 2 = 0, so y2 = 0
y_in3 = 2 - 2 + 2 - 2 = 0, so y3 = 0
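All five runs can be checked with a short NumPy sketch (illustrative names): build W by summing the four outer products, recall with the bipolar activation, and note that the non-training pattern maps to (0, 0, 0).

import numpy as np

S = np.array([[ 1, -1, -1, -1],
              [-1,  1, -1, -1],
              [-1, -1,  1, -1],
              [-1, -1, -1,  1]])
T = np.array([[ 1, -1, -1],
              [ 1, -1,  1],
              [-1,  1, -1],
              [-1,  1,  1]])

# Hebbian storage: the sum of the four outer products (equivalently S.T @ T).
W = sum(np.outer(s, t) for s, t in zip(S, T))

def recall(x):
    """Bipolar activation: 1 if y_in > 0, 0 if y_in == 0, -1 if y_in < 0."""
    return np.sign(x @ W).astype(int)

for s, t in zip(S, T):
    assert np.array_equal(recall(s), t)    # all four pairs recalled exactly

print(recall(np.array([-1, -1, -1, -1])))  # non-training pattern -> [0 0 0]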
The Perceptron
by Frank Rosenblatt (1958, 1962)
• Two layers
• binary nodes (McCulloch-Pitts nodes) that
take values 0 or 1
• continuous weights, initially chosen
randomly
• Today, used as a synonym for a single-layer,
feed-forward network
The Master of the Perceptron (Rosenblatt, 1958)
Hyperplane, again
A perceptron can learn only examples that are called
“linearly separable”. These are examples that can be perfectly
separated by a hyperplane.
[Figure: '+' and '-' examples in the plane; on the left a line separates
them perfectly (linearly separable), on the right no line can
(non-linearly separable)]
Learning problem to be solved
• Suppose we have an input pattern
(0 1)
• We have a single output pattern (1)
• We have a net input of -0.1, which
gives an output pattern of (0)
• How could we adjust the weights,
so that this situation is remedied
and the spontaneous output
matches our target output pattern of
(1)?
[Figure: a single output unit with inputs (0, 1) and weights 0.4 and -0.1]

net input = 0.4 * 0 + (-0.1) * 1 = -0.1
Answer
• Increase the weights, so that the net input
exceeds 0.0
• E.g., add 0.2 to all weights
• Observation: Weight from input node with
activation 0 does not have any effect on the
net input
• So we will leave it alone
The Perceptron Learning Rule
• Note: sometimes the bias is written as a weight w0 on a constant
input x0 = 1.
• Another representation: compare the Hebbian rule,
wi(new) = wi(old) + xi * t,
with the perceptron rule, which scales the update by the error:
wi(new) = wi(old) + η(t - o)xi
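A hedged sketch of this rule in Python (the task, binary AND, and all names are illustrative): binary 0/1 outputs, continuous weights chosen randomly, and the error-scaled update above.

import numpy as np

# Binary AND with a bias input fixed at 1 (the x0 = 1 convention above).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])  # [bias, x1, x2]
T = np.array([0, 0, 0, 1])

rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.5, size=3)   # continuous weights, chosen randomly
eta = 0.2                            # learning rate

for epoch in range(100):
    errors = 0
    for x, t in zip(X, T):
        o = 1 if w @ x > 0 else 0    # binary (McCulloch-Pitts) output
        w += eta * (t - o) * x       # perceptron rule: wi += eta*(t - o)*xi
        errors += int(t != o)
    if errors == 0:                  # stop once every example is classified
        break

print(w, [1 if w @ x > 0 else 0 for x in X])   # outputs -> [0, 0, 0, 1]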
End of the perceptron
There is no doubt that Minsky and Papert's book was a block to the funding
of research in neural networks for more than ten years. The book was widely
interpreted as showing that neural networks are basically limited.
Adaline Network
(ADAptive LInear NEuron)
Adaline: Change from the Perceptron
• By Widrow and Hoff (~1960)
– Adaptive linear elements for signal processing
– The same architecture of perceptrons
– Learning method: delta rule (another way of error driven),
also called Widrow-Hoff learning rule
Adaline tries to reduce the mean squared error (MSE) between the
net input and the desired output.
The activation function during training is the identity function;
after training, the activation is a threshold function.
Adaline Algorithm
• Step 0: initialize the weights to small random values and select a
learning rate η
• Step 1: for each input vector x, with target output t:
• Step 2: compute the neuron input: y_in = b + Σ xi wi
• Step 3: use the delta rule to update the bias and weights:
b(new) = b(old) + η(t - y_in)
wi(new) = wi(old) + η(t - y_in)xi
• Step 4: stop if the largest weight change across all the training
samples is less than a specified tolerance; otherwise cycle through
the training set again

Remember: the Hebbian rule was wi(new) = wi(old) + xi*t, and the
perceptron rule was wi(new) = wi(old) + η(t - o)xi
The Learning Rate
• The performance of an ADALINE neuron depends
heavily on the choice of the learning rate
– if it is too large the system will not converge
– if it is too small the convergence will take too long
• Typically, η is selected by trial and error
– typical range: 0.01 < η < 10.0
– often start at 0.1
– sometimes it is suggested that:
0.1 < n·η < 1.0
where n is the number of inputs
Running Adaline
• One unique feature of ADALINE is that its
activation function is different for training and
running
• When running ADALINE use the following:
– initialize the weights to those found during training
– compute the net input: y_in = b + Σ xi wi
– apply the activation function:
y = 1 if y_in >= 0
   -1 if y_in < 0
Example – AND Function
• Construct an AND function for an ADALINE
neuron; let η = 0.1
• Training set (with the bias input always at 1):

x1   x2   bias   Target
 1    1    1      1
 1   -1    1     -1
-1    1    1     -1
-1   -1    1     -1

Initial conditions: set the weights to small random values:
w1 = 0.2, w2 = 0.3, b = 0.1
First Training Run
• Apply the first input (1, 1) with target 1

Neuron input:
y_in = b + Σ xi wi = 0.1 + 0.2*1 + 0.3*1 = 0.6

Delta rule:
b(new) = b(old) + η(t - y_in) = 0.1 + 0.1(1 - 0.6) = 0.14
w1(new) = w1(old) + η(t - y_in)x1 = 0.2 + 0.1(1 - 0.6)*1 = 0.24
w2(new) = 0.3 + 0.1(1 - 0.6)*1 = 0.34

The largest weight change is 0.04.
Second Training Run
• Apply the second training pair (1, -1) with target -1

The net input is:
y_in = 0.14 + 0.24*1 + 0.34*(-1) = 0.04

The new weights are:
b = 0.14 + 0.1(-1 - 0.04) = 0.04
w1 = 0.24 + 0.1(-1 - 0.04)*1 = 0.14
w2 = 0.34 + 0.1(-1 - 0.04)*(-1) = 0.44

The largest weight change is 0.1.
Third Training Run
• Apply the third training pair (-1, 1) with target -1

The net input is:
y_in = 0.04 + 0.14*(-1) + 0.44*1 = 0.34

The new weights are:
b = 0.04 + 0.1(-1 - 0.34) = -0.09
w1 = 0.14 + 0.1(-1 - 0.34)*(-1) = 0.27
w2 = 0.44 + 0.1(-1 - 0.34)*1 = 0.31

The largest weight change is 0.13.
Fourth Training Run
• Apply the fourth training pair (-1, -1) with target -1

The net input is:
y_in = -0.09 + 0.27*(-1) + 0.31*(-1) = -0.67

The new weights are (note y_in is negative, so t - y_in = -0.33):
b = -0.09 + 0.1(-1 - (-0.67)) = -0.12
w1 = 0.27 + 0.1(-1 - (-0.67))*(-1) = 0.30
w2 = 0.31 + 0.1(-1 - (-0.67))*(-1) = 0.34

The largest weight change is 0.03.
The final solution
• Continue to cycle through the four training
inputs until the largest change in the
weights over a complete cycle is less than
some small number (say 0.01)
• In this case, the solution becomes
– b = -0.5
– w1 = 0.5
– w2 = 0.5
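The full Adaline loop fits in a short Python sketch (illustrative names; the stopping test uses the 0.01 tolerance suggested above, measured over a complete cycle) and settles near the stated solution w1 = w2 = 0.5, b = -0.5.

import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])   # bipolar AND inputs
T = np.array([1, -1, -1, -1])

rng = np.random.default_rng(0)
w = rng.uniform(0, 0.5, size=2)          # small random initial weights
b, eta = 0.1, 0.1

for epoch in range(1000):
    w_old, b_old = w.copy(), b
    for x, t in zip(X, T):
        y_in = b + x @ w                 # identity activation while training
        b += eta * (t - y_in)            # delta (Widrow-Hoff / LMS) rule
        w += eta * (t - y_in) * x
    # stop when the weights barely moved over a complete cycle
    if max(abs(b - b_old), *np.abs(w - w_old)) < 0.01:
        break

print(np.round(w, 2), round(b, 2))       # close to [0.5 0.5] and -0.5
print(np.where(X @ w + b >= 0, 1, -1))   # running mode: threshold -> [ 1 -1 -1 -1]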
Comparison of Perceptron and Adaline
                     Perceptron               Adaline
Architecture         Single-layer             Single-layer
Neuron model         Non-linear               Linear
Learning algorithm   Minimize the number of   Minimize the total
                     misclassified examples   error
Application          Linear classification    Linear classification
                                              and regression
Gradient Descent Algorithm
1. Choose some (random) initial values for the weights w.
2. Calculate the gradient of the error function E with respect to
each weight w.
3. Change the weights so that we move a short distance in the
direction of the greatest rate of decrease of the error, i.e., in
the direction of the negative gradient.
4. Repeat steps 2 and 3 until the gradient gets close to zero.
To descend the hill, reverse the derivative:
Δw = -η ∂E/∂w
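For intuition, here is a minimal sketch of these four steps on a one-dimensional toy error surface E(w) = (w - 3)², chosen only for illustration:

def grad_E(w):
    return 2 * (w - 3)        # derivative of E(w) = (w - 3)**2

w, eta = 0.0, 0.1             # step 1: initial weight, learning rate
while abs(grad_E(w)) > 1e-6:  # step 4: repeat until the gradient is ~ 0
    w -= eta * grad_E(w)      # steps 2-3: move against the gradient

print(round(w, 4))            # -> 3.0, the minimum of E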
Moving Downhill: move in the direction of the negative derivative
[Figure: error curve E(w) against w1, with the slope dE(w)/dw1 at the
current weight]
If dE(w)/dw1 > 0 (E is increasing in w1):
w1 <- w1 - η dE(w)/dw1
i.e., the rule decreases w1.
If dE(w)/dw1 < 0 (E is decreasing in w1):
w1 <- w1 - η dE(w)/dw1
i.e., the rule increases w1.
Illustration of Gradient Descent
[Figure: error surface E(w) over weights w0 and w1; one step moves from
the original point in weight space to a new point downhill]
Gradient Descent: Graphical
[Figure: contour view; a step moves from (w1, w2) to (w1+Δw1, w2+Δw2)]
The problem of local minima
– We may have multiple local minima (note: for the
perceptron, E(w) has only a single global minimum, so
this is not a problem)
• It can be hard to know when to stop:
– gradient descent goes to the closest local minimum
– to find the global minimum: random restarts from multiple
places in weight space
Gradient Descent Learning Rule
• Consider a linear unit without threshold and
continuous output o (not just -1, 1):
o = -w0 + w1x1 + ... + wnxn
• Train the wi's such that they minimize the squared
error
E[w1,...,wn] = ½ Σ_{j∈D} (tj - oj)²
where D is the set of training examples
Neuron with Sigmoid Function
Inputs x1, ..., xn with weights w1, ..., wn feed a single unit:
a = Σ_{i=1..n} wi xi              (neuron input)
O = σ(a) = 1/(1 + e^-a)           (output; σ is the activation function)
Remember: in the learning rules you may see both the neuron input
(y_in, or a) and the neuron output O. What is the difference?
Also remember: for a linear unit, O = a.
Sigmoid Unit
As above, but with an extra bias input x0 = -1 and weight w0:
a = Σ_{i=0..n} wi xi
O = σ(a) = 1/(1 + e^-a)
Note: if σ(x) is the sigmoid function 1/(1 + e^-x), then
dσ(x)/dx = σ(x)(1 - σ(x)); prove this.
Derive the gradient descent rule to train one sigmoid unit:
Δwi = -η ∂E/∂wi = η O(1 - O)(t - O) xi   (try to prove this),
since E[w1,...,wn] = ½ Σ_{j∈D} (tj - oj)².
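As a hint for the two "prove this" exercises, here is a sketch of both derivations in LaTeX (standard calculus, using only the definitions above):

% Derivative of the sigmoid \sigma(x) = (1+e^{-x})^{-1}:
\frac{d\sigma}{dx} = \frac{e^{-x}}{(1+e^{-x})^2}
                   = \sigma(x)\,\frac{e^{-x}}{1+e^{-x}}
                   = \sigma(x)\bigl(1-\sigma(x)\bigr).

% Gradient of E = \tfrac{1}{2}(t-O)^2 with O = \sigma(a),\; a = \sum_i w_i x_i:
\frac{\partial E}{\partial w_i}
  = (t-O)\left(-\frac{\partial O}{\partial w_i}\right)
  = -(t-O)\,\sigma'(a)\,x_i
  = -(t-O)\,O(1-O)\,x_i,
\qquad
\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i}
           = \eta\,O(1-O)(t-O)\,x_i.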
Nice Property of Sigmoids
[Figure: the sigmoid function and its bell-shaped derivative σ(x)(1 - σ(x))]
Explanation: Gradient Descent Learning Rule
Δwji = η · Oj(1 - Oj) · (tj - Oj) · xi
where
– xi is the activation of the pre-synaptic neuron,
– (tj - Oj) is the error of the post-synaptic neuron j,
– Oj(1 - Oj) is the derivative of the activation function, and
– η is the learning rate.
Gradient Descent Rule
• Gradient descent rule: update the weight
value using the equation
wi = wi + η Oj(1 - Oj)(tj - Oj) xi
derived from minimization of the error function
E[w1,...,wn] = ½ Σj (tj - yj)²
by means of gradient descent.
Gradient Descent for linear networks
I. Initialize all weights to small random values.
II. Repeat until ∂E/∂wij gets close to its minimum value (zero):
1. For each weight wij set Δwij = 0.
2. For each data point (x, t):
   a. compute the output values;
   b. update each weight using the gradient descent equation
      wij = wij + η Oj(1 - Oj)(tj - Oj) xi.
Note that this single rule always moves the weight downhill:
if dE(w)/dwij < 0 it increases wij, and if dE(w)/dwij > 0 it
decreases wij.
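Putting the loop together, a hedged Python sketch (the task, learning OR with one sigmoid unit, and all names are illustrative choices):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy task: a single sigmoid unit learning OR (x0 = -1 carries the bias w0).
X = np.array([[-1, 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]])
T = np.array([0, 1, 1, 1])

rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.5, size=3)   # small random initial weights
eta = 0.5

for epoch in range(5000):
    for x, t in zip(X, T):
        O = sigmoid(w @ x)                    # forward pass
        w += eta * O * (1 - O) * (t - O) * x  # Δwi = η O(1-O)(t-O) xi

print(np.round(sigmoid(X @ w)))               # approximately [0. 1. 1. 1.]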
Effect of the Learning Rate on the Gradient Algorithm
[Figure: effect of the learning rate on gradient descent]
More details on the Delta Rule, for those who are interested:
Delta rule
• The delta rule is a rule for updating the weights of the
neurons in a single-layer perceptron. For a neuron j
with activation function g(x), a.k.a. f(a), the delta rule
for j's ith weight wji is given by
Δwji = η(tj − xj)g'(hj)xi
• where η is a small constant, g(x) is the neuron's
activation function, tj is the target output, xj is the
actual output (o), and xi is the ith input. The delta rule is
commonly stated in simplified form, for a perceptron
with a linear activation function, as
Δwji = η(tj − xj)xi
Derivation of the delta rule
• The delta rule is derived by attempting to
minimize the error in the output of the perceptron
through gradient descent. The error for a
perceptron with j outputs can be measured as
E = ½ Σj (tj − xj)²
– In this case, we wish to move through the "weight space"
of the neuron (the space of all possible values of all of
the neuron's weights) in proportion to the gradient of
the error function with respect to each weight. In
order to do that, we calculate the partial derivative of
the error with respect to each weight. For the ith
weight, this derivative can be written as ∂E/∂wji.
• Because we are only concerning ourselves
with the jth neuron, we can substitute the
error formula above while omitting the
summation:
∂E/∂wji = ∂[½(tj − xj)²]/∂wji
• Next we use the chain rule to split this into
two derivatives:
∂E/∂wji = ∂[½(tj − xj)²]/∂xj · ∂xj/∂wji
• To find the left derivative, we simply apply
the general power rule:
∂E/∂wji = −(tj − xj) · ∂xj/∂wji
• To find the right derivative, we again apply
the chain rule, this time differentiating with
respect to the total input to j, hj:
∂E/∂wji = −(tj − xj) · ∂xj/∂hj · ∂hj/∂wji
• Note that the output of the neuron xj is just the
neuron's activation function g() applied to the
neuron's input hj. We can therefore write the
derivative of xj with respect to hj simply as g()'s
first derivative:
∂E/∂wji = −(tj − xj) · g'(hj) · ∂hj/∂wji
• Next we rewrite hj in the last term as the sum over
all k weights of each weight wjk times its
corresponding input xk:
∂E/∂wji = −(tj − xj) · g'(hj) · ∂(Σk xk wjk)/∂wji
• Because we are only concerned with the ith
weight, the only term of the summation that is
relevant is xi wji. Clearly, ∂(xi wji)/∂wji = xi,
• giving us our final equation for the gradient:
∂E/∂wji = −(tj − xj) · g'(hj) · xi
• As noted above, gradient descent tells us that our
change for each weight should be proportional to
the gradient. Choosing a proportionality constant η
and eliminating the minus sign, so that we
move the weight in the negative direction of the
gradient to minimize error, we arrive at our target
equation:
• Δwji = η(tj − xj)g'(hj)xi
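A quick numerical sanity check of this result (an illustrative sketch with toy values): the analytic gradient −(tj − xj)g'(hj)xi should match a finite-difference estimate of ∂E/∂wji.

import numpy as np

def g(h):                       # activation function: the sigmoid
    return 1.0 / (1.0 + np.exp(-h))

def g_prime(h):
    return g(h) * (1.0 - g(h))

def E(w, x, t):                 # single-output error E = 1/2 (t - xj)^2
    return 0.5 * (t - g(w @ x)) ** 2

rng = np.random.default_rng(0)
w, x, t = rng.normal(size=3), rng.normal(size=3), 1.0

h = w @ x
analytic = -(t - g(h)) * g_prime(h) * x     # the gradient derived above

eps = 1e-6                                  # central finite differences
numeric = np.array([
    (E(w + eps * np.eye(3)[i], x, t) - E(w - eps * np.eye(3)[i], x, t))
    / (2 * eps)
    for i in range(3)])

print(np.allclose(analytic, numeric))       # -> True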
• Gradient descent is an optimization algorithm
that approaches a local minimum of a function by
taking steps proportional to the negative of the
gradient (or the approximate gradient) of the
function at the current point. If instead one takes
steps proportional to the gradient, one approaches
a local maximum of that function; the procedure is
then known as gradient ascent.
• Gradient descent is also known as steepest
descent, or the method of steepest descent, not to
be confused with the method for approximating
integrals with the same name, see method of
steepest descent.
End of slides