This document provides an overview of neural networks and machine learning concepts. It discusses how neural networks mimic the brain and simulate networks of neurons. It then covers perceptrons and their limitations in solving XOR problems. Next, it introduces multi-layer neural networks, backpropagation for training networks, and regularization to address overfitting. Key concepts are explained through examples, including computing gradients, error minimization, and determining optimal hidden unit numbers.
Machine Learning 1
1. Advanced Topics in Systems
Machine Learning
Neural Networks
Inas A. Yassine
Systems and Biomedical Engineering Department,
Faculty of Engineering - Cairo University
iyassine@eng.cu.edu.eg
2. Neurons and Brain
§ Neural networks mimic the brain: they learn to hear,
visualize, find geometric relations, and so on
through a single learning algorithm
§ Simulating a network of neurons
Machine Learning Fall 2018 Inas A.Yassine 2
4. The Perceptron
§ A number of McCulloch-Pitts neurons can be connected
together in any way.
§ An arrangement of one input layer of McCulloch-Pitts neurons
feeding forward to one output layer of McCulloch-Pitts neurons
is known as a Perceptron.
§ A Perceptron is a powerful computational device.
5. Logic Gate Implementation
§ McCulloch-Pitts neurons can be used to
implement the basic logic gates.
§ Find the appropriate connection weights and neuron
thresholds to produce the right outputs for each set of
inputs.
§ Construct simple networks that perform NOT, AND, and
OR.
§ We can construct any logical function from these three
operations, but the resulting networks have a much more complex architecture.
§ Try to avoid decomposing complex problems into simple
logic gates by finding the weights and thresholds that
work directly in a Perceptron architecture.
8. Solve analytically for the AND gate's weights
§ We have two weights w1 and w2 and the threshold θ, and for each
training pattern we need to satisfy
$$out = \mathrm{sgn}(w_1 in_1 + w_2 in_2 - \theta)$$
§ The training data lead to four inequalities:
$$-\theta < 0, \quad w_2 - \theta < 0, \quad w_1 - \theta < 0, \quad w_1 + w_2 - \theta \ge 0$$
§ It is easy to see that there are an infinite number of
solutions. Similarly, there are an infinite number of
solutions for the NOT and OR networks.
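As a quick check of the analytic approach, the sketch below tests candidate weights against the AND truth table. It assumes the usual convention that the unit outputs 1 when w1*in1 + w2*in2 - θ is non-negative; the helper names and the numeric values are illustrative choices, not taken from the slides.

```python
def step(x):
    """Threshold activation: 1 when the net input is non-negative, else 0."""
    return 1 if x >= 0 else 0


def perceptron(in1, in2, w1, w2, theta):
    """Single McCulloch-Pitts unit with two inputs and threshold theta."""
    return step(w1 * in1 + w2 * in2 - theta)


AND_TABLE = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}


def implements(table, w1, w2, theta):
    """True when the unit reproduces every row of the truth table."""
    return all(perceptron(a, b, w1, w2, theta) == t
               for (a, b), t in table.items())


# Two of the infinitely many valid choices (illustrative values).
print(implements(AND_TABLE, 1.0, 1.0, 1.5))  # True
print(implements(AND_TABLE, 0.6, 0.6, 1.0))  # True
# A threshold that is too low turns the unit into an OR gate instead.
print(implements(AND_TABLE, 1.0, 1.0, 0.5))  # False
```

Any (w1, w2, θ) with θ > 0, w1 < θ, w2 < θ and w1 + w2 ≥ θ passes the same check.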
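10. Limitations of a single Perceptron
§ For the XOR network, the four inequalities are:
$$-\theta < 0, \quad w_2 - \theta \ge 0, \quad w_1 - \theta \ge 0, \quad w_1 + w_2 - \theta < 0$$
§ The second and third inequalities are incompatible with the
fourth: adding them gives $w_1 + w_2 \ge 2\theta$, and since $\theta > 0$ this exceeds $\theta$.
§ No solution exists.
§ More complex networks are needed, e.g. ones that combine
many simple networks, or that use different activation/thresholding/transfer functions.
§ It then becomes much more difficult to determine all the
weights and thresholds by hand.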
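The absence of a single-unit XOR solution can also be illustrated numerically. The sketch below searches a coarse grid of weights and thresholds, again assuming the unit fires when w1*in1 + w2*in2 - θ ≥ 0. The grid range and step are arbitrary, and a grid search is only an illustration, not a proof.

```python
import itertools


def step(x):
    """Threshold activation: 1 when the net input is non-negative, else 0."""
    return 1 if x >= 0 else 0


XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def solves_xor(w1, w2, theta):
    """True when a single threshold unit reproduces the XOR truth table."""
    return all(step(w1 * a + w2 * b - theta) == t
               for (a, b), t in XOR_TABLE.items())


# Candidate values from -5.0 to 5.0 in steps of 0.25 (illustrative grid).
grid = [k / 4 for k in range(-20, 21)]
solutions = [c for c in itertools.product(grid, repeat=3) if solves_xor(*c)]
print(len(solutions))  # 0
```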
12. Threshold as a weight component
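§ To simplify the mathematical description, treat the threshold as an extra weight
§ Assume $w_{0j} = -\theta_j$ and $out_0 = 1$
§ The perceptron equation becomes
$$out_j = \mathrm{sgn}\left(\sum_{i=0}^{n} out_i \, w_{ij}\right)$$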
13. Perceptron Learning
§ If the network weights at time t are wij(t) , then
the shifting process corresponds to moving them
by an amount Δwij(t) so that at time t+1 we have
weights
wij(t+1) = wij(t) + Δwij(t)
Δwij(t) = η(tj − oj)xi
§ It is convenient to treat the thresholds as
weights, as discussed previously, so we don’t
need separate equations for them.
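The update rule Δwij = η(tj − oj)xi can be sketched for a single output unit as follows. This is a minimal illustration rather than the course's implementation: the function names, the learning rate of 0.1, the zero initial weights, and the choice of the AND gate as training data are all assumptions.

```python
def step(x):
    """Threshold activation: 1 when the net input is non-negative, else 0."""
    return 1 if x >= 0 else 0


def predict(w, inputs):
    """Output of the unit; w[0] acts as the threshold weight (input fixed at 1)."""
    return step(w[0] + sum(wi * xi for wi, xi in zip(w[1:], inputs)))


def train_perceptron(patterns, eta=0.1, max_epochs=100):
    """Apply delta_w_i = eta * (t - o) * x_i until an epoch makes no changes."""
    w = [0.0] * (len(patterns[0][0]) + 1)
    for epoch in range(1, max_epochs + 1):
        changed = False
        for inputs, target in patterns:
            out = predict(w, inputs)
            if out != target:
                x = [1.0] + list(inputs)  # constant input for the threshold
                w = [wi + eta * (target - out) * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:
            return w, epoch  # every weight change was zero in this epoch
    return w, None  # did not converge within max_epochs


AND_PATTERNS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, epochs = train_perceptron(AND_PATTERNS)
print(weights, epochs)
```

Because the threshold is stored as `w[0]`, the same update line adjusts it along with the other weights.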
14. Convergence of Perceptron Learning
§ The weight changes Δwij need to be applied repeatedly: for each
weight wij in the network, and for each training pattern in the
training set. One pass through all the weights for the whole
training set is called one epoch of training.
§ Eventually, usually after many epochs, when all the network
outputs match the targets for all the training patterns, all the
Δwij will be zero and the process of training will cease. We then say
that the training process has converged to a solution.
§ The weights can be found in a finite number of iterations if:
§ the problem is linearly separable,
§ the problem is correctly defined,
§ the step size is sufficiently small.
16. Learning by Error Minimization
§ minimize the difference between the actual
outputs outj and the desired outputs targj.
§ Error function to quantify this difference
using the sum squared error:
$$E(w_{ij}) = \sum_p \sum_j (targ_j - out_j)^2$$
§ A systematic procedure for doing this
requires the knowledge of how the error
E(wij) varies as we change the weights wij, i.e.
the gradient of E with respect to wij.
17. Computing Gradient and Derivatives
§ The gradient, or rate of change, of f(x) at a particular
value of x, as we change x, can be approximated by Δy/Δx.
In the limit as Δx goes to zero, this is known as the partial
derivative of f(x) with respect to x, ∂f/∂x.
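The finite-difference approximation Δy/Δx can be tried directly. The sketch below uses an illustrative function f(x) = x², whose exact derivative is 2x; the step size `dx` is an arbitrary small value.

```python
def numerical_derivative(f, x, dx=1e-6):
    """Approximate the gradient of f at x by the ratio delta_y / delta_x."""
    return (f(x + dx) - f(x)) / dx


def square(x):
    return x * x


approx = numerical_derivative(square, 3.0)
print(approx)  # close to the exact derivative 2 * 3 = 6
```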
18. Gradient Descent Minimization
§ Suppose we have a function f(x) and we want to change the value of x to minimize f(x).
§ What we need to do depends on the gradient of f(x). There are three cases to consider:
§ If $\partial f/\partial x > 0$ then f(x) increases as x increases, so we should decrease x
§ If $\partial f/\partial x < 0$ then f(x) decreases as x increases, so we should increase x
§ If $\partial f/\partial x = 0$ then f(x) is at a maximum or minimum, so we should not change x
§ In summary, we can decrease f(x) by changing x by the amount:
$$\Delta x = x_{new} - x_{old} = -\eta \frac{\partial f}{\partial x}$$
§ where η is a small positive constant specifying how much we change x by, and the derivative
∂f/∂x tells us which direction to go in. If we repeatedly use this equation, f(x) will (assuming η is
sufficiently small) keep descending towards its minimum, and hence this procedure is known as
gradient descent minimization.
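A minimal sketch of gradient descent in one dimension, using the update x ← x − η ∂f/∂x. The example function f(x) = (x − 3)², the starting point, the value of η, and the number of steps are illustrative assumptions.

```python
def gradient_descent(df, x0, eta=0.1, steps=100):
    """Repeatedly move x against the sign of the derivative df(x)."""
    x = x0
    for _ in range(steps):
        x = x - eta * df(x)  # delta_x = -eta * df/dx
    return x


def df(x):
    """Derivative of f(x) = (x - 3)^2, which has its minimum at x = 3."""
    return 2 * (x - 3)


x_min = gradient_descent(df, x0=0.0)
print(round(x_min, 4))  # 3.0
```

With η = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; a value of η above 1.0 would make this particular example diverge, which is why η must be sufficiently small.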
21. Gradient Descent Error Minimization
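§ Remember that we want to train our neural networks by adjusting
their weights wij in order to minimize the error function:
$$E(w_{ij}) = \sum_p \sum_j (targ_j - out_j)^2$$
§ We now see it makes sense to do this by a series of gradient descent
weight updates:
$$\Delta w_{kl} = -\eta \frac{\partial E(w_{ij})}{\partial w_{kl}}$$
§ If the transfer function for the output neurons is f(x), and the activations
of the previous layer of neurons are $in_i$, then the outputs are
$$out_j = f\left(\sum_i in_i \, w_{ij}\right)$$
and
$$\Delta w_{kl} = -\eta \frac{\partial}{\partial w_{kl}} \left[ \sum_p \sum_j \left( targ_j - f\left(\sum_i in_i \, w_{ij}\right) \right)^2 \right]$$
§ Dealing with equations like this is easy if we use the chain rule for
derivatives.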
24. Delta Rule
§ The delta rule is the basic gradient descent learning algorithm for
single-layer networks.
§ It involves the derivative of the transfer function f(x).
§ This is problematic for the simple Perceptron that uses the step function
sgn(x) as its threshold function, because this has zero derivative
everywhere except at x = 0, where it is infinite.
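The form of the single-layer update can be worked out with the chain rule. The derivation below is a sketch that assumes the sum squared error $E = \sum_p \sum_j (targ_j - out_j)^2$ and outputs $out_j = f(\sum_i in_i w_{ij})$; the index names $k$, $l$, $p$ are notational assumptions. With this definition of $E$ a factor of 2 appears, which is commonly absorbed into η or removed by defining $E$ with a leading factor of 1/2.

```latex
\begin{align*}
\frac{\partial E}{\partial w_{kl}}
  &= \sum_p 2\,(targ_l - out_l)\cdot\left(-\frac{\partial out_l}{\partial w_{kl}}\right) \\
  &= -2 \sum_p (targ_l - out_l)\, f'\!\left(\sum_i in_i\, w_{il}\right) in_k \\[4pt]
\Delta w_{kl}
  &= -\eta\,\frac{\partial E}{\partial w_{kl}}
   = 2\eta \sum_p (targ_l - out_l)\, f'\!\left(\sum_i in_i\, w_{il}\right) in_k
\end{align*}
```

Only the terms with $j = l$ survive the differentiation, because $w_{kl}$ feeds output unit $l$ alone. The factor $f'$ is what makes the step function unsuitable here.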
25. Multi Layer Neural Network
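$a_i^{(j)}$ = activation of unit $i$ in layer $j$
$w^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$
$$a_1^{(2)} = \mathrm{sig}(w_{10}^{(1)} x_0 + w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{13}^{(1)} x_3)$$
$$a_2^{(2)} = \mathrm{sig}(w_{20}^{(1)} x_0 + w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{23}^{(1)} x_3)$$
$$a_3^{(2)} = \mathrm{sig}(w_{30}^{(1)} x_0 + w_{31}^{(1)} x_1 + w_{32}^{(1)} x_2 + w_{33}^{(1)} x_3)$$
$$h_w(x) = \mathrm{sig}(w_{10}^{(2)} a_0^{(2)} + w_{11}^{(2)} a_1^{(2)} + w_{12}^{(2)} a_2^{(2)} + w_{13}^{(2)} a_3^{(2)})$$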
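The forward computation for a network with three inputs, three hidden units, and one output can be sketched as below. It assumes that x_0 and the layer-2 unit a_0 are bias units fixed at 1 and that sig is the logistic sigmoid; the weight values are placeholders chosen so the result is easy to check.

```python
import math


def sig(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights, inputs):
    """Prepend the bias unit (value 1), then apply sig to each weighted sum."""
    a = [1.0] + list(inputs)
    return [sig(sum(w * v for w, v in zip(row, a))) for row in weights]


def forward(w1, w2, x):
    """Compute the network output for input x = [x1, x2, x3]."""
    hidden = layer(w1, x)        # activations of the three hidden units
    return layer(w2, hidden)[0]  # single output unit


# Placeholder weights: w1 is 3x4 (bias + 3 inputs), w2 is 1x4 (bias + 3 hidden).
W1 = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
W2 = [[0.0, 0.0, 0.0, 0.0]]
print(forward(W1, W2, [1.0, 2.0, 3.0]))  # 0.5, since sig(0) = 0.5
```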
30. Derivation of the Backpropagation algorithm
For output units
So:
Source: http://www.speech.sri.com/people/anand/771/html/node37.html
[Figure: network with output, hidden, and input layers]
31. Derivation of the Backpropagation algorithm
For hidden units
Also:
So:
Source: http://www.speech.sri.com/people/anand/771/html/node37.html
[Figure: network with output, hidden, and input layers, with units labeled j and k]
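One common way to carry out this derivation is sketched below. It is not necessarily the form used in the cited source: it assumes sigmoid units $o_j = \sigma(net_j)$ with $net_j = \sum_i w_{ji} o_i$, a per-example error $E = \frac{1}{2}\sum_k (t_k - o_k)^2$, and the definition $\delta_j = -\partial E / \partial net_j$.

```latex
\begin{align*}
\text{Output unit } j:\quad
\frac{\partial E}{\partial w_{ji}}
  &= \frac{\partial E}{\partial o_j}\,
     \frac{\partial o_j}{\partial net_j}\,
     \frac{\partial net_j}{\partial w_{ji}}
   = -(t_j - o_j)\; o_j(1 - o_j)\; o_i \\
\delta_j &= o_j (1 - o_j)(t_j - o_j),
\qquad \Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = \eta\, \delta_j\, o_i \\[6pt]
\text{Hidden unit } j:\quad
\frac{\partial E}{\partial net_j}
  &= \sum_k \frac{\partial E}{\partial net_k}\,
            \frac{\partial net_k}{\partial o_j}\,
            \frac{\partial o_j}{\partial net_j}
   = \sum_k (-\delta_k)\, w_{kj}\; o_j (1 - o_j) \\
\delta_j &= o_j (1 - o_j) \sum_k \delta_k\, w_{kj}
\end{align*}
```

Here $k$ ranges over the units that receive input from hidden unit $j$, so each hidden error is a weighted sum of the errors one layer closer to the output.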
32. Backpropagation - example
§ First calculate error of output units and use this
to change the output layer of weights.
Current output: oj = 0.2
Correct output: tj = 1.0
Error δj = oj(1 − oj)(tj − oj)
0.2(1 − 0.2)(1 − 0.2) = 0.128
[Figure: network with output, hidden, and input layers]
Update weights into j
$$\Delta w_{ji} = \eta \, \delta_j \, o_i$$
Source: Raymond J. Mooney, University of Texas at Austin, CS 391L: Machine Learning Neural Networks
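The output-unit calculation can be reproduced in a few lines. The output and target values come from the example; the learning rate `eta` and the hidden activation `o_i` are hypothetical values added to show the weight update.

```python
def output_delta(o, t):
    """Error term for a sigmoid output unit: o * (1 - o) * (t - o)."""
    return o * (1 - o) * (t - o)


delta_j = output_delta(o=0.2, t=1.0)
print(round(delta_j, 3))  # 0.128

# Hypothetical learning rate and activation of hidden unit i feeding unit j.
eta, o_i = 0.5, 0.6
delta_w_ji = eta * delta_j * o_i
print(round(delta_w_ji, 4))  # 0.0384
```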
33. Backpropagation - example
§ Next calculate error for hidden units based on
errors on the output units it feeds into.
$$\delta_j = o_j (1 - o_j) \sum_k \delta_k \, w_{kj}$$
[Figure: network with output, hidden, and input layers]
Source: Raymond J. Mooney, University of Texas at Austin, CS 391L: Machine Learning Neural Networks
34. Backpropagation - example
§ Finally update bottom layer of weights based on
errors calculated for hidden units.
$$\delta_j = o_j (1 - o_j) \sum_k \delta_k \, w_{kj}$$
[Figure: network with output, hidden, and input layers]
Update weights into j
$$\Delta w_{ji} = \eta \, \delta_j \, x_{ji}$$
Source: Raymond J. Mooney, University of Texas at Austin, CS 391L: Machine Learning Neural Networks
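The hidden-unit step can be sketched the same way, using δj = oj(1 − oj) Σk δk wkj followed by Δwji = η δj xji. All numeric values below (hidden activation, downstream errors and weights, learning rate, input) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def hidden_delta(o_j, downstream):
    """Error term for a sigmoid hidden unit.

    downstream: list of (delta_k, w_kj) pairs, one per unit k that
    hidden unit j feeds into.
    """
    return o_j * (1 - o_j) * sum(d_k * w_kj for d_k, w_kj in downstream)


# Hypothetical values: hidden output 0.6 feeding two output units.
delta_j = hidden_delta(0.6, [(0.128, 0.5), (-0.05, 1.0)])
# 0.6 * 0.4 = 0.24 and 0.128 * 0.5 - 0.05 * 1.0 = 0.014
print(round(delta_j, 5))  # 0.00336

# Weight update for the connection from input i into hidden unit j.
eta, x_ji = 0.5, 1.0
delta_w_ji = eta * delta_j * x_ji
print(round(delta_w_ji, 5))  # 0.00168
```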
35. Error Backpropagation
§ Next calculate error for hidden units based on
errors on the output units it feeds into.
[Figure: network with output, hidden, and input layers]
$$\delta_j = o_j (1 - o_j) \sum_k \delta_k \, w_{kj}$$
36. Error Backpropagation
§ Finally update bottom layer of weights based on
errors calculated for hidden units.
[Figure: network with output, hidden, and input layers]
$$\delta_j = o_j (1 - o_j) \sum_k \delta_k \, w_{kj}$$
Update weights into j
$$\Delta w_{ji} = \eta \, \delta_j \, o_i$$
37. Notes on Backpropagation Algorithm
§ Gradient descent over the entire network weight
vector
§ Easily generalized to arbitrary directed graphs
§ Will find a local, not necessarily global, error
minimum
§ In practice, often works well (can run multiple times)
§ Often includes a weight momentum term
§ Minimizes error over training examples
§ Will it generalize well to subsequent examples?
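38. Sample Learned XOR Network
[Figure: learned network with inputs X and Y, hidden units A and B, and output O; the weight values shown in the figure are 3.11, -7.38, 6.96, -5.24, -3.6, -3.58, -5.57, -5.74, -2.03]
Hidden Unit A represents: ¬(X ∧ Y)
Hidden Unit B represents: ¬(X ∨ Y)
Output O represents: A ∧ ¬B = ¬(X ∧ Y) ∧ (X ∨ Y) = X ⊕ Y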
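The logical interpretation of the hidden units can be checked against the XOR truth table. The sketch below works only with the Boolean expressions, not with the learned weights, since the figure's weight-to-connection assignment is not available here; the function name is illustrative.

```python
from itertools import product


def xor_via_hidden_units(x, y):
    """Compose XOR from the two hidden-unit concepts."""
    a = not (x and y)   # hidden unit A: NOT (X AND Y)
    b = not (x or y)    # hidden unit B: NOT (X OR Y)
    return a and not b  # output O: A AND NOT B


for x, y in product([False, True], repeat=2):
    print(x, y, xor_via_hidden_units(x, y))
```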
39. Hidden Unit Representations
§ Trained hidden units can be seen as newly
constructed features that make the target concept
linearly separable in the transformed space.
§ They can sometimes be interpreted as representing meaningful
features such as vowel detectors or edge
detectors.
§ They can also become a distributed representation of the input
in which each individual unit is not easily
interpretable as a meaningful feature.
41. Convergence of Backpropagation
§ Gradient descent to some local minimum
§ Perhaps not the global minimum
§ Add momentum
§ Stochastic gradient descent
§ Use multiple initial weights
§ Nature of convergence
§ Initialize weights near zero
§ Initial network is then near-linear; it can become increasingly nonlinear as training progresses
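The momentum idea can be sketched on a one-dimensional problem: each step adds a fraction α of the previous step to the plain gradient step. The function names, the example derivative, and the values of η and α below are illustrative assumptions.

```python
def descend_with_momentum(df, x0, eta=0.1, alpha=0.5, steps=200):
    """Gradient descent where each step keeps part of the previous step."""
    x, previous_step = x0, 0.0
    for _ in range(steps):
        # delta_x(t) = -eta * df/dx + alpha * delta_x(t - 1)
        current_step = -eta * df(x) + alpha * previous_step
        x += current_step
        previous_step = current_step
    return x


def df(x):
    """Derivative of f(x) = (x - 3)^2, which has its minimum at x = 3."""
    return 2 * (x - 3)


x_min = descend_with_momentum(df, x0=0.0)
print(round(x_min, 4))  # 3.0
```

On error surfaces with flat regions or shallow local minima, the carried-over step can help the search keep moving, which is the motivation for adding momentum to backpropagation.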
42. Expressive Capabilities of ANN
§ Boolean Functions
§ Every Boolean function can be expressed by a
network with a single hidden layer.
§ How many hidden units?
§ Continuous functions
§ Every bounded continuous function can be
approximated with arbitrarily small error.
§ Any function can be approximated to arbitrary
accuracy by a network with 2 hidden layers.
43. Overfitting in ANN
§ If we have too many
features, the learned
hypothesis may fit the
training set very well,
but fail to generalize to
new examples…
44. How to address overfitting
§ Plot the hypothesis
§ Lots of features? Lots of classes?
§ Reduce the number of features:
§ Manually select which features to keep
§ Model selection algorithm (feature reduction, which throws
away some information)
§ Regularization
§ Keep all features but reduce the magnitude/values of the
parameters theta
§ Works well in the case of lots of features, where each
contributes a bit to predicting y
45. Determining the Best Number of Hidden Units
§ Too few hidden units prevent the network from
adequately fitting the data.
§ Too many hidden units can result in over-fitting.
§ Use internal cross-validation to empirically determine an
optimal number of hidden units.
[Figure: error on training data and on test data plotted against the number of hidden units]
47. Regularization
§ Small values for the parameters (thetas):
§ Simpler hypothesis
§ Less prone to overfitting
§ We do not know in advance which features to shrink, so we
add a regularization term that shrinks every single
parameter
$$J(w) = \frac{1}{2m} \left( \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n} w_j^2 \right)$$
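The regularized cost, a squared-error term plus λ times the sum of squared weights, all divided by 2m, can be computed as in the sketch below. The linear hypothesis `linear_h` and the tiny dataset are hypothetical stand-ins for illustration; the penalty skips w_0 because the sum over j starts at 1.

```python
def linear_h(w, x):
    """Hypothetical hypothesis: w_0 + w_1 * x_1 + ... + w_n * x_n."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def regularized_cost(w, xs, ys, lam, h=linear_h):
    """J(w) = (sum of squared errors + lam * sum_{j>=1} w_j^2) / (2m)."""
    m = len(xs)
    fit = sum((h(w, x) - y) ** 2 for x, y in zip(xs, ys))
    penalty = lam * sum(wj ** 2 for wj in w[1:])  # w_0 is not penalized
    return (fit + penalty) / (2 * m)


xs, ys = [[1.0], [2.0]], [1.0, 2.0]
w = [0.0, 1.0]  # fits the two points exactly
print(regularized_cost(w, xs, ys, lam=0.0))  # 0.0
print(regularized_cost(w, xs, ys, lam=2.0))  # 0.5
```

With a perfect fit, the whole cost at λ = 2 comes from the penalty term: (2 × 1²) / (2 × 2) = 0.5.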
48. Regularization
§ λ controls the trade-off between fitting the training data and
keeping the parameters small, to reduce the
overfitting problem:
§ the result is a much smoother and much
simpler curve
§ How to choose lambda:
§ if it is too high, all weights except $w_0$ are driven toward zero, and the hypothesis
reduces to almost $h_w(x) = w_0$.
49. Regularized Gradient Descent
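§ Gradient descent
§ Repeat {
§ $w_0 := w_0 - \eta \frac{1}{m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right) x_0^{(i)}$
§ $w_j := w_j - \eta \left[ \frac{1}{m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]$
}
§ Equivalently, $w_j := w_j \left(1 - \eta \frac{\lambda}{m}\right) - \eta \frac{1}{m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
where $\left(1 - \eta \frac{\lambda}{m}\right) < 1$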
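One iteration of the regularized update can be sketched as follows: w_0 takes a plain gradient step, while every other w_j is first multiplied by the shrink factor (1 − ηλ/m). The linear hypothesis, the dataset, and the values of η and λ are hypothetical choices for illustration.

```python
def linear_h(w, x):
    """Hypothetical hypothesis; x includes the constant input x_0 = 1."""
    return sum(wj * xj for wj, xj in zip(w, x))


def regularized_step(w, xs, ys, eta, lam):
    """One simultaneous update of all weights."""
    m = len(xs)
    errors = [linear_h(w, x) - y for x, y in zip(xs, ys)]
    new_w = []
    for j, wj in enumerate(w):
        grad = sum(e * x[j] for e, x in zip(errors, xs)) / m
        # w_0 is not regularized; every other weight shrinks slightly.
        shrink = 1.0 if j == 0 else 1.0 - eta * lam / m
        new_w.append(wj * shrink - eta * grad)
    return new_w


xs = [[1.0, 1.0], [1.0, 2.0]]  # first component is x_0 = 1
ys = [1.0, 2.0]
w = [0.0, 1.0]                 # fits the data exactly, so the gradient is zero
print(regularized_step(w, xs, ys, eta=0.1, lam=1.0))  # [0.0, 0.95]
```

Because the data term vanishes in this example, the only change is the shrinkage of w_1 by the factor 1 − 0.1 × 1/2 = 0.95.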