CS532 ANN
Weight updates for linear threshold
units
Why the perceptron learning procedure cannot be
generalised to hidden layers
• The perceptron convergence procedure works by ensuring that every time
the weights change, they get closer to every “generously feasible” set of
weights.
– This type of guarantee cannot be extended to more complex networks
in which the average of two good solutions may be a bad solution.
• So “multi-layer” neural networks do not use the perceptron learning
procedure.
– They should never have been called multi-layer perceptrons.
A different way to show that
a learning procedure makes progress
• Instead of showing the weights get closer to a good set of weights, show
that the actual output values get closer to the target values.
– This can be true even for non-convex problems in which there are many
quite different sets of weights that work well and averaging two good
sets of weights may give a bad set of weights.
– It is not true for perceptron learning.
• The simplest example is a linear neuron with a squared error measure.
Linear neurons (also called linear filters)
• The neuron has a real-valued
output which is a weighted
sum of its inputs
• The aim of learning is to
minimize the error summed
over all training cases.
– The error is the squared
difference between the
desired output and the
actual output.
$y = \sum_i w_i x_i = \mathbf{w}^{\top}\mathbf{x}$
where $y$ is the neuron's estimate of the desired output, $\mathbf{x}$ is the input vector, and $\mathbf{w}$ is the weight vector.
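As a minimal sketch of the two definitions above (this code is mine, not from the slides; NumPy and the function names are assumptions), a linear neuron and its summed squared error could be written as:

    import numpy as np

    def linear_neuron(w, x):
        # y = sum_i w_i * x_i = w^T x
        return np.dot(w, x)

    def summed_squared_error(w, inputs, targets):
        # error summed over all training cases:
        # squared difference between desired and actual output
        return sum((t - linear_neuron(w, x)) ** 2 for x, t in zip(inputs, targets))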
Why don’t we solve it analytically?
• It is straightforward to write down a set of equations, one per training case,
and to solve for the best set of weights.
– This is the standard engineering approach so why don’t we use it?
• Scientific answer: We want a method that real neurons could use.
• Engineering answer: We want a method that can be generalized to multi-
layer, non-linear neural networks.
– The analytic solution relies on it being linear and having a squared error
measure.
– Iterative methods are usually less efficient but they are much easier to
generalize.
A toy example to illustrate the iterative method
• Each day you get lunch at the cafeteria.
– Your diet consists of fish, chips, and ketchup.
– You get several portions of each.
• The cashier only tells you the total price of the meal.
– After several days, you should be able to figure out the price of each
portion.
• The iterative approach: Start with random guesses for the prices and then
adjust them to get a better fit to the observed prices of whole meals.
Solving the equations iteratively
• Each meal price gives a linear constraint on the prices of the portions:
  $\text{price} = x_{\text{fish}} w_{\text{fish}} + x_{\text{chips}} w_{\text{chips}} + x_{\text{ketchup}} w_{\text{ketchup}}$
• The prices of the portions are like the weights of a linear neuron, $\mathbf{w} = (w_{\text{fish}}, w_{\text{chips}}, w_{\text{ketchup}})$.
• We will start with guesses for the weights and then adjust the guesses
slightly to give a better fit to the prices given by the cashier.
The true weights used by the cashier
• The cashier's true prices per portion are 150 for fish, 50 for chips, and 100 for ketchup.
• For a meal with 2 portions of fish, 5 portions of chips, and 3 portions of ketchup, the price of the meal is $2(150) + 5(50) + 3(100) = 850$ = target.
[Figure: a linear neuron whose inputs are the portion counts and whose weights are the prices per portion.]
A model of the cashier with arbitrary initial weights
• Start with arbitrary initial weights of 50, 50, 50. For the same meal of 2 portions of fish, 5 of chips, and 3 of ketchup, the predicted price of the meal is $2(50) + 5(50) + 3(50) = 500$.
• Residual error = $850 - 500 = 350$.
• The "delta-rule" for learning is: $\Delta w_i = \varepsilon\, x_i (t - y)$
• With a learning rate $\varepsilon$ of 1/35, the weight changes are +20, +50, +30.
• This gives new weights of 70, 100, 80.
  – Notice that the weight for chips got worse!
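One step of this worked example as a Python sketch (the numbers come from the slide; the variable names are mine):

    import numpy as np

    x = np.array([2.0, 5.0, 3.0])      # portions of fish, chips, ketchup
    t = 850.0                          # price charged by the cashier (target)
    w = np.array([50.0, 50.0, 50.0])   # arbitrary initial guesses for the prices

    y = np.dot(w, x)                   # predicted price of the meal = 500
    epsilon = 1.0 / 35.0               # learning rate
    w = w + epsilon * x * (t - y)      # delta-rule: Dw_i = eps * x_i * (t - y)
    print(w)                           # [ 70. 100.  80.]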
Behaviour of the iterative learning procedure
• Does the learning procedure eventually get the right answer?
– There may be no perfect answer.
– By making the learning rate small enough we can get as close as we desire to
the best answer.
• How quickly do the weights converge to their correct values?
– It can be very slow if two input dimensions are highly correlated. If you almost
always have the same number of portions of ketchup and chips, it is hard to
decide how to divide the price between ketchup and chips.
The relationship between the online delta-rule and the
learning rule for perceptrons
• In perceptron learning, we increment or decrement the weight vector by
the input vector.
– But we only change the weights when we make an error.
• In the online version of the delta-rule we increment or decrement the
weight vector by the input vector scaled by the residual error and the
learning rate.
– So we have to choose a learning rate. This is annoying.
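A sketch of the two update rules side by side for a single training case. It assumes a binary (0/1) perceptron target with a threshold at zero and a real-valued target for the delta-rule; the function names are mine:

    import numpy as np

    def perceptron_update(w, x, t):
        # Only change the weights when the prediction is wrong,
        # by adding or subtracting the input vector.
        y = 1.0 if np.dot(w, x) >= 0.0 else 0.0
        if y != t:
            w = w + x if t == 1.0 else w - x
        return w

    def online_delta_update(w, x, t, epsilon):
        # Always move by the input vector scaled by the residual
        # error and the learning rate.
        y = np.dot(w, x)
        return w + epsilon * x * (t - y)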
THE ERROR SURFACE FOR A LINEAR
NEURON
The error surface in extended weight space
• The error surface lies in a space with a
horizontal axis for each weight and one
vertical axis for the error.
– For a linear neuron with a squared error,
it is a quadratic bowl.
– Vertical cross-sections are parabolas.
– Horizontal cross-sections are ellipses.
• For multi-layer, non-linear nets the error
surface is much more complicated.
[Figure: the quadratic-bowl error surface of a linear neuron, with the error $E$ on the vertical axis and the weights $w_1$ and $w_2$ on the horizontal axes.]
Online versus batch learning
• The simplest kind of batch learning does steepest descent on the error surface.
  – This travels perpendicular to the contour lines.
• The simplest kind of online learning zig-zags around the direction of steepest descent.
[Figure: left, batch learning moving perpendicular to the elliptical contours in $(w_1, w_2)$ space; right, online learning zig-zagging between the constraint lines from training case 1 and training case 2.]
Why learning can be slow
• If the ellipse is very elongated, the
direction of steepest descent is almost
perpendicular to the direction towards the
minimum!
– The red gradient vector has a large
component along the short axis of the
ellipse and a small component along
the long axis of the ellipse.
– This is just the opposite of what we
want.
[Figure: a very elongated set of elliptical error contours in $(w_1, w_2)$ space, with the gradient vector shown in red.]
LEARNING THE WEIGHTS OF A LOGISTIC
OUTPUT NEURON
Logistic neurons
• These give a real-valued
output that is a smooth and
bounded function of their
total input.
– They have nice
derivatives which make
learning easy.
$y = \dfrac{1}{1+e^{-z}}, \qquad z = b + \sum_i x_i w_i$
[Figure: the logistic function, a smooth S-shaped curve of $y$ against $z$ that rises from 0 to 1 and passes through 0.5 at $z = 0$.]
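A minimal Python sketch of a logistic neuron (the names are mine):

    import numpy as np

    def logistic_neuron(w, x, b):
        z = b + np.dot(x, w)              # total input (the logit)
        return 1.0 / (1.0 + np.exp(-z))   # smooth, bounded output between 0 and 1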
The derivatives of a logistic neuron
• The derivatives of the logit, z,
with respect to the inputs and the
weights are very simple:
• The derivative of the output with
respect to the logit is simple if you
express it in terms of the output:
$z = b + \sum_i x_i w_i, \qquad y = \dfrac{1}{1+e^{-z}}$
$\dfrac{\partial z}{\partial w_i} = x_i, \qquad \dfrac{\partial z}{\partial x_i} = w_i, \qquad \dfrac{dy}{dz} = y(1-y)$
The derivatives of a logistic neuron
$y = \dfrac{1}{1+e^{-z}} = (1+e^{-z})^{-1}$
$\dfrac{dy}{dz} = \dfrac{-1(-e^{-z})}{(1+e^{-z})^{2}} = \left(\dfrac{1}{1+e^{-z}}\right)\left(\dfrac{e^{-z}}{1+e^{-z}}\right) = y(1-y)$
because $\dfrac{e^{-z}}{1+e^{-z}} = \dfrac{(1+e^{-z})-1}{1+e^{-z}} = \dfrac{1+e^{-z}}{1+e^{-z}} - \dfrac{1}{1+e^{-z}} = 1-y$
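A quick numerical check of the identity dy/dz = y(1 - y) using a central finite difference (this check is mine, not from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z, h = 0.7, 1e-6
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
    analytic = sigmoid(z) * (1.0 - sigmoid(z))
    print(abs(numeric - analytic) < 1e-8)   # True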
Using the chain rule to get the derivatives needed for
learning the weights of a logistic unit
• To learn the weights we need the derivative of the output with respect
to each weight:
$\dfrac{\partial y}{\partial w_i} = \dfrac{\partial z}{\partial w_i}\,\dfrac{dy}{dz} = x_i\, y(1-y)$
$\dfrac{\partial E}{\partial w_i} = \sum_n \dfrac{\partial y^n}{\partial w_i}\,\dfrac{\partial E}{\partial y^n} = -\sum_n x_i^{\,n}\, y^n(1-y^n)\,(t^n - y^n)$
The factor $x_i^{\,n}(t^n - y^n)$ is the delta-rule; the extra term $y^n(1-y^n)$ is the slope of the logistic.
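The same gradient as a short sketch that sums over training cases, in the slide's notation (NumPy and the function name are assumptions):

    import numpy as np

    def logistic_output_gradient(w, b, X, t):
        # X holds one training case per row; t holds one target per case.
        z = b + X @ w
        y = 1.0 / (1.0 + np.exp(-z))
        # dE/dw_i = -sum_n x_i^n * y^n * (1 - y^n) * (t^n - y^n)
        return -(X * (y * (1.0 - y) * (t - y))[:, None]).sum(axis=0)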
THE BACKPROPAGATION
ALGORITHM
Learning with hidden units (again)
• Networks without hidden units are very limited in the input-output
mappings they can model.
• Adding a layer of hand-coded features (as in a perceptron) makes them
much more powerful but the hard bit is designing the features.
– We would like to find good features without requiring insights into the task or
repeated trial and error where we guess some features and see how well they
work.
• We need to automate the loop of designing features for a particular task
and seeing how well they work.
Learning by perturbing weights
(this idea occurs to everyone who knows about evolution)
• Randomly perturb one weight and see if it
improves performance. If so, save the
change.
– This is a form of reinforcement learning.
– Very inefficient. We need to do multiple
forward passes on a representative set of
training cases just to change one weight.
Backpropagation is much better.
– Towards the end of learning, large weight
perturbations will nearly always make things
worse, because the weights need to have
the right relative values.
[Figure: a network with a layer of input units, a layer of hidden units, and a layer of output units.]
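A sketch of the single-weight perturbation procedure described above. It assumes the weights live in a flat NumPy array and that error(weights) runs forward passes over a representative set of training cases; both names are hypothetical:

    import numpy as np

    def perturb_one_weight(weights, error, scale=0.01):
        # Randomly perturb one weight; keep the change only if performance improves.
        i = np.random.randint(len(weights))
        trial = weights.copy()
        trial[i] += np.random.randn() * scale
        return trial if error(trial) < error(weights) else weights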
Learning by using perturbations
• We could randomly perturb all the weights in parallel and
correlate the performance gain with the weight changes.
– Not any better because we need lots of trials on each
training case to “see” the effect of changing one weight
through the noise created by all the changes to other
weights.
• A better idea: Randomly perturb the activities of the
hidden units.
– Once we know how we want a hidden activity to change on
a given training case, we can compute how to change the
weights.
– There are fewer activities than weights, but
backpropagation still wins by a factor of the number of
neurons.
The idea behind backpropagation
• We don’t know what the hidden units ought to do, but we can compute
how fast the error changes as we change a hidden activity.
– Instead of using desired activities to train the hidden units, use error
derivatives w.r.t. hidden activities.
– Each hidden activity can affect many output units and can therefore
have many separate effects on the error. These effects must be
combined.
• We can compute error derivatives for all the hidden units efficiently at the
same time.
– Once we have the error derivatives for the hidden activities, it's easy to
get the error derivatives for the weights going into a hidden unit.
Sketch of the backpropagation algorithm on a single case
• First convert the discrepancy between
each output and its target value into an
error derivative.
• Then compute error derivatives in each
hidden layer from error derivatives in
the layer above.
• Then use error derivatives w.r.t.
activities to get error derivatives w.r.t.
the incoming weights.
$E = \tfrac{1}{2}\sum_{j \in \text{output}} (t_j - y_j)^2$
$\dfrac{\partial E}{\partial y_j} = -(t_j - y_j)$
[Figure: the error derivatives $\partial E / \partial y_j$ at the output units are backpropagated to give $\partial E / \partial y_i$ at the hidden units.]
Backpropagating dE/dy
Consider a unit $i$ in one layer connected to a logistic unit $j$ in the layer above, with logit $z_j$ and output $y_j$:
$\dfrac{\partial E}{\partial z_j} = \dfrac{dy_j}{dz_j}\,\dfrac{\partial E}{\partial y_j} = y_j(1-y_j)\,\dfrac{\partial E}{\partial y_j}$
$\dfrac{\partial E}{\partial y_i} = \sum_j \dfrac{dz_j}{dy_i}\,\dfrac{\partial E}{\partial z_j} = \sum_j w_{ij}\,\dfrac{\partial E}{\partial z_j}$
$\dfrac{\partial E}{\partial w_{ij}} = \dfrac{\partial z_j}{\partial w_{ij}}\,\dfrac{\partial E}{\partial z_j} = y_i\,\dfrac{\partial E}{\partial z_j}$
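Putting the three formulas together, a minimal single-case backpropagation step through one logistic layer (the layout and names are mine, not from the slides):

    import numpy as np

    def backprop_layer(y_i, W, dE_dy_j):
        # y_i: activities of the layer below; W[i, j]: weight from unit i to unit j.
        z_j = y_i @ W
        y_j = 1.0 / (1.0 + np.exp(-z_j))
        dE_dz_j = y_j * (1.0 - y_j) * dE_dy_j   # dE/dz_j = y_j (1 - y_j) dE/dy_j
        dE_dy_i = W @ dE_dz_j                   # dE/dy_i = sum_j w_ij dE/dz_j
        dE_dW = np.outer(y_i, dE_dz_j)          # dE/dw_ij = y_i dE/dz_j
        return dE_dy_i, dE_dW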
HOW TO USE THE DERIVATIVES
COMPUTED BY THE BACKPROPAGATION
ALGORITHM
Converting error derivatives into a learning procedure
• The backpropagation algorithm is an efficient way of computing the error
derivative dE/dw for every weight on a single training case.
• To get a fully specified learning procedure, we still need to make a lot of other
decisions about how to use these error derivatives:
– Optimization issues: How do we use the error derivatives on individual
cases to discover a good set of weights?
– Generalization issues: How do we ensure that the learned weights work
well for cases we did not see during training?
• We now give a very brief overview of these two sets of issues.
Optimization issues in using the weight derivatives
• How often to update the weights
– Online: after each training case.
– Full batch: after a full sweep through the training data.
– Mini-batch: after a small sample of training cases.
• How much to update
– Use a fixed learning rate?
– Adapt the global learning rate?
– Adapt the learning rate on each connection separately?
– Don’t use steepest descent?
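As an illustration of the first set of choices (how often to update the weights), a mini-batch training loop sketch; online corresponds to a batch size of 1 and full batch to the whole training set. The grad helper and its interface are assumptions:

    import numpy as np

    def train(w, X, T, grad, epsilon=0.1, batch_size=10, epochs=5):
        # grad(w, X_batch, T_batch) returns dE/dw summed over the mini-batch.
        for _ in range(epochs):
            order = np.random.permutation(len(X))
            for start in range(0, len(X), batch_size):
                idx = order[start:start + batch_size]
                w = w - epsilon * grad(w, X[idx], T[idx])
        return w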
Overfitting: The downside of using powerful models
• The training data contains information about the regularities in the
mapping from input to output. But it also contains two types of noise.
– The target values may be unreliable (usually only a minor worry).
– There is sampling error. There will be accidental regularities just
because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which
are caused by sampling error.
– So it fits both kinds of regularity.
– If the model is very flexible it can model the sampling error really well.
This is a disaster.
A simple example of overfitting
• Which model do you trust?
– The complicated model fits the
data better.
– But it is not economical.
• A model is convincing when it fits a
lot of data surprisingly well.
– It is not surprising that a
complicated model can fit a small
amount of data well.
Which output value should you predict for this test input?
[Figure: training data plotted with input $x$ on the horizontal axis and output $y$ on the vertical axis, fit by both a simple model and a more complicated model, with a marked test input.]
Ways to reduce overfitting
• A large number of different methods have been developed.
– Weight-decay
– Early stopping
– Model averaging
– Dropout (and its offshoots)
– Generative pre-training
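As one concrete example from this list, weight-decay adds a penalty that shrinks the weights towards zero at every update (a minimal sketch; the decay coefficient and names are mine):

    def update_with_weight_decay(w, dE_dw, epsilon=0.1, decay=0.001):
        # Follow the error gradient while decaying the weights towards zero.
        return w - epsilon * (dE_dw + decay * w)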