6. Training ANNs
• Training set S of examples {(x, t)}
– x is an input vector and
– t the desired target vector
– Example: logical AND
S = {((0,0),0), ((0,1),0), ((1,0),0), ((1,1),1)}
• Iterative process
– Present a training example x, compute the network
output y, compare y with the target t, and adjust the
weights and thresholds
• Learning rule
– Specifies how to change the weights w and
thresholds θ of the network as a function of the inputs
x, output y, and target t.
7. Perceptron Learning Rule
• w' = w + α (t − y) x
Or in components
• w'i = wi + Δwi = wi + α (t − y) xi (i = 1..n+1)
with wn+1 = θ and xn+1 = −1
• The parameter α is called the learning rate. It
determines the magnitude of the weight updates Δwi.
• If the output is correct (t = y) the weights are
not changed (Δwi = 0).
• If the output is incorrect (t ≠ y) the weights wi
are changed so that the weight vector moves closer to
or further from the input x, bringing the TLU output
for the new weights w'i closer to the target t.
8. Perceptron Training Algorithm
Repeat
for each training vector pair (x,t)
evaluate the output y when x is the input
if y ≠ t then
form a new weight vector w' according
to w' = w + α (t − y) x
else
do nothing
end if
end for
Until y=t for all training vector pairs
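The training loop above can be sketched in Python and run on the logical-AND set S from slide 6. Names like `train_perceptron` are illustrative, and the threshold θ is folded in as an extra weight with fixed input −1, as on slide 7:

```python
import numpy as np

def train_perceptron(S, alpha=0.1, max_epochs=100):
    n = len(S[0][0])                          # number of real inputs
    w = np.zeros(n + 1)                       # weights plus threshold w[n] = theta
    for _ in range(max_epochs):
        errors = 0
        for x, t in S:
            xe = np.append(x, -1.0)           # augmented input (x, -1)
            y = 1 if np.dot(w, xe) > 0 else 0 # TLU output
            if y != t:                        # update only on mistakes
                w += alpha * (t - y) * xe
                errors += 1
        if errors == 0:                       # y = t for all pairs: converged
            return w
    return w

# Logical AND training set from slide 6
S = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(S)
```

Because AND is linearly separable, the loop terminates with a separating weight vector after a handful of epochs.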
9. Perceptron Convergence Theorem
The algorithm converges to the correct
classification
– if the training data is linearly separable
– and the learning rate α is sufficiently small
• If two classes of vectors X1 and X2 are linearly
separable, the application of the perceptron
training algorithm will eventually result in a
weight vector w0, such that w0 defines a TLU
whose decision hyper-plane separates X1 and X2
(Rosenblatt 1962).
• The solution w0 is not unique, since if w0 · x = 0 defines
a hyper-plane, so does w'0 = k w0 (for any k > 0).
11. Gradient Descent Learning Rule
• Consider a linear unit without a threshold and with
continuous output o (not just −1, 1)
–o=w0 + w1 x1 + … + wn xn
• Train the wi’s such that they minimize the
squared error
– E[w1,…,wn] = ½ Σd∈D (td − od)²
where D is the set of training examples
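The squared error and its gradient components ∂E/∂wi = −Σd (td − od) xi,d can be sketched as follows; the data and function names are illustrative, with a constant-1 column standing in for the bias weight w0:

```python
import numpy as np

def squared_error(w, X, t):
    o = X @ w                      # linear outputs o_d for every example d in D
    return 0.5 * np.sum((t - o) ** 2)

def error_gradient(w, X, t):
    o = X @ w
    return -(t - o) @ X            # components dE/dwi = -sum_d (td - od) * x_{i,d}

# Illustrative data generated by o = 1 + 2*x1 (first column is constant 1 for w0)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 3.0, 5.0])
```

At the generating weights (1, 2) the error and gradient both vanish, which is the minimum gradient descent seeks.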
13. Incremental Stochastic Gradient
Descent
• Batch mode: gradient descent
w = w − α ∇ED[w] over the entire data D
ED[w] = ½ Σd∈D (td − od)²
• Incremental mode: gradient descent
w = w − α ∇Ed[w] over individual training examples d
Ed[w] = ½ (td − od)²
Incremental gradient descent can approximate batch
gradient descent arbitrarily closely if the learning rate α
is small enough
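The two modes can be contrasted in a short sketch for a linear unit (names and data are illustrative): batch mode sums the gradient over all of D before one update, while incremental mode updates after each example d.

```python
import numpy as np

def batch_step(w, X, t, alpha):
    o = X @ w                               # outputs over the entire data D
    return w + alpha * (t - o) @ X          # one step: w = w - alpha * grad E_D[w]

def incremental_epoch(w, X, t, alpha):
    for x_d, t_d in zip(X, t):              # one update per example d
        o_d = x_d @ w
        w = w + alpha * (t_d - o_d) * x_d   # w = w - alpha * grad E_d[w]
    return w

# Illustrative data from o = 1 + 2*x1 (first column is the bias input)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 3.0, 5.0])
w_batch = batch_step(np.zeros(2), X, t, alpha=0.1)
w_incr = incremental_epoch(np.zeros(2), X, t, alpha=0.1)
```

With a small α, one incremental epoch and one batch step both reduce the squared error and end up near each other.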
14. Perceptron vs. Gradient Descent Rule
• perceptron rule
w'i = wi + α Σp (tp − yp) xi^p
derived from manipulation of the decision
surface.
• gradient descent rule
w'i = wi + α Σp (tp − yp) xi^p
derived from minimization of the error function
E[w1,…,wn] = ½ Σp (tp − yp)²
by means of gradient descent.
Where is the big difference?
15. Perceptron vs. Gradient Descent Rule
Perceptron learning rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate
Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with
minimum squared error
• Given sufficiently small learning rate
• Even when training data contains noise
• Even when training data not separable by H
16. Presentation of Training Examples
• Presenting all training examples once to
the ANN is called an epoch.
• In incremental stochastic gradient
descent training examples can be
presented in
–Fixed order (1,2,3…,M)
–Randomly permuted order
(5,2,7,…,3)
–Completely random (4,1,7,1,5,4,……)
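The three presentation orders can be sketched for one epoch over M training examples (the index lists are illustrative):

```python
import random

M = 7
examples = list(range(M))                    # indices of the M training examples
fixed_order = examples                       # same order (0, 1, ..., M-1) every epoch
permuted_order = random.sample(examples, M)  # a fresh random permutation per epoch
completely_random = [random.randrange(M) for _ in range(M)]  # may repeat examples
```

Random permutation visits every example exactly once per epoch; completely random sampling may repeat some examples and skip others.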
18. Sigmoid Unit
[Diagram: a sigmoid unit with inputs x1 … xn weighted by w1 … wn, plus a bias input x0 = −1 with weight w0.]
a = Σi=0..n wi xi
y = σ(a) = 1/(1 + e^(−a))
σ(x) is the sigmoid function: 1/(1 + e^(−x))
dσ(x)/dx = σ(x) (1 − σ(x))
Derive gradient descent rules to train:
• one sigmoid unit
∂E/∂wi = −Σp (tp − yp) yp (1 − yp) xi^p
• multilayer networks of sigmoid units
→ backpropagation
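Incremental gradient descent for a single sigmoid unit can be sketched as follows, using the derivative σ'(a) = σ(a)(1 − σ(a)) so that each update is Δwi = α (t − y) y (1 − y) xi. The function names, learning rate, and epoch count are illustrative; the threshold is again folded in as an input fixed at −1:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_sigmoid_unit(X, t, alpha=0.5, epochs=5000):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_p, t_p in zip(X, t):
            y = sigmoid(x_p @ w)
            # incremental gradient step: dwi = alpha * (t - y) * y * (1 - y) * xi
            w += alpha * (t_p - y) * y * (1 - y) * x_p
    return w

# Logical AND with the threshold folded in as input x0 = -1
X = np.array([[0, 0, -1], [0, 1, -1], [1, 0, -1], [1, 1, -1]], dtype=float)
t = np.array([0.0, 0.0, 0.0, 1.0])
w = train_sigmoid_unit(X, t)
```

Unlike the perceptron, the sigmoid output never reaches exactly 0 or 1; after enough epochs the outputs are merely pushed to the correct side of 0.5.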
19. Applications of neural networks
• Alvinn (the neural network that learns to drive a van from camera
inputs).
• NETtalk: a network that learns to pronounce English text.
• Recognizing hand-written zip codes.
• Lots of applications in financial time series analysis.
20. Generalization
How can we control number of effective weights?
• Manually or automatically select optimum number of
hidden nodes and connections
• Prevent over-fitting = over-training
• Add a weight-cost term to the backpropagation error function
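A minimal sketch of the weight-cost idea, assuming a penalty of the common form (λ/2) Σ wi² added to the error (the name `lam` and the constants are illustrative, not from the slides): its gradient contributes λ·w to each step, so every update also shrinks the weights toward zero.

```python
import numpy as np

def weight_cost_step(w, grad_E, alpha=0.1, lam=0.01):
    # total error: E_total = E + (lam/2) * sum(w**2), whose gradient
    # adds lam * w; each step therefore decays every weight toward 0
    return w - alpha * (grad_E + lam * w)
```

With a zero error gradient the update reduces to w ← w (1 − αλ), which is why this penalty limits the number of effective weights and counters over-fitting.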