Machine Learning 1

Advanced Topics in Systems
Machine Learning
Neural Networks
Inas A. Yassine
Systems and Biomedical Engineering Department,
Faculty of Engineering - Cairo University
iyassine@eng.cu.edu.eg

Neurons and Brain
§ Neural networks mimic the brain: they can learn to hear,
visualize, find geometric relations, and so on,
through a single learning algorithm
§ Simulating a network of neurons
Machine Learning Fall 2018 Inas A.Yassine 2

The Perceptron
§ A number of McCulloch-Pitts neurons can be connected
together in any way.
§ An arrangement of one input layer of McCulloch-Pitts neurons
feeding forward to one output layer of McCulloch-Pitts neurons
is known as a Perceptron.
§ A Perceptron is a powerful computational device.

Logic Gate Implementation
§ McCulloch-Pitts neurons can be used to
implement the basic logic gates.
§ The task is to find the appropriate connection weights and neuron
thresholds that produce the right outputs for each set of
inputs.
§ We can construct simple networks that perform NOT, AND, and
OR.
§ Any logical function can be constructed from these three
operations, but the resulting network has a much more complex architecture.
§ Rather than decomposing complex problems into simple
logic gates, we try to find weights and thresholds that
work directly in a Perceptron architecture.
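The gates above can be sketched with a single threshold unit each. This is a minimal illustration, not the lecture's own code: the function name `mp_neuron` and the particular weights and thresholds are assumptions chosen by hand, and many other values work equally well.

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: output 1 if the weighted sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Hand-picked (assumed) weights and thresholds; other choices also work.
def NOT(a):
    return mp_neuron([a], [-1.0], -0.5)

def AND(a, b):
    return mp_neuron([a, b], [1.0, 1.0], 1.5)

def OR(a, b):
    return mp_neuron([a, b], [1.0, 1.0], 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
print("NOT 0:", NOT(0), "NOT 1:", NOT(1))
```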

Logic Gates Implementation
§ We need to determine the weights and
thresholds.

Decision Boundaries for Logic Circuits

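Solve analytically for AND gate's weights
§ There are two weights $w_1$ and $w_2$ and the threshold $\theta$, and for each
training pattern we need to satisfy
$out = \mathrm{sgn}(w_1 in_1 + w_2 in_2 - \theta)$
(here sgn denotes the step function: 1 if its argument is $\ge 0$, else 0)
§ The training data lead to four inequalities, one per input pattern (0,0), (0,1), (1,0), (1,1):
$-\theta < 0$, $w_2 - \theta < 0$, $w_1 - \theta < 0$, $w_1 + w_2 - \theta \ge 0$
§ It is easy to see that there are an infinite number of
solutions. Similarly, there are an infinite number of
solutions for the NOT and OR networks.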

Limitations of single Perceptron
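§ For the XOR network, the same procedure gives:
$-\theta < 0$, $w_2 - \theta \ge 0$, $w_1 - \theta \ge 0$, $w_1 + w_2 - \theta < 0$
§ The second and third inequalities are incompatible with the
fourth: adding them gives $w_1 + w_2 \ge 2\theta$, which exceeds $\theta$ because the first requires $\theta > 0$.
§ There is no solution.
§ More complex networks are needed, e.g. networks that combine
many simple networks, or that use different activation/
thresholding/transfer functions.
§ It then becomes much more difficult to determine all the
weights and thresholds by hand.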
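The contrast between AND and XOR can be checked by brute force. The sketch below searches a small grid of candidate values for $w_1$, $w_2$ and $\theta$; the grid range and the helper names `fires` and `count_solutions` are assumptions for illustration. A grid search cannot prove that XOR has no solution (the inequalities do that), but it shows the difference concretely.

```python
from itertools import product

def fires(w1, w2, theta, x1, x2):
    """Single threshold unit: 1 if w1*x1 + w2*x2 - theta >= 0, else 0."""
    return 1 if w1 * x1 + w2 * x2 - theta >= 0 else 0

def count_solutions(truth_table, grid):
    """Count (w1, w2, theta) triples on the grid that reproduce the truth table."""
    count = 0
    for w1, w2, theta in product(grid, repeat=3):
        if all(fires(w1, w2, theta, x1, x2) == target
               for (x1, x2), target in truth_table.items()):
            count += 1
    return count

AND_TABLE = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Assumed search range: -2.0 to 2.0 in steps of 0.25
grid = [k / 4 for k in range(-8, 9)]
and_count = count_solutions(AND_TABLE, grid)
xor_count = count_solutions(XOR_TABLE, grid)
print("AND solutions on grid:", and_count)
print("XOR solutions on grid:", xor_count)
```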

Activation / Transfer Functions
§ Step Function
§ Sigmoid Function
§ Hyperbolic Tangent
§ Piecewise Linear
Threshold as a weight component
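§ To simplify the mathematical description, treat the threshold as an extra weight
§ Assume $w_{0j} = -\theta_j$ and $out_0 = 1$
§ The perceptron equation becomes
$out_j = \mathrm{sgn}\left(\sum_{i=0}^{n} out_i \, w_{ij}\right)$
where the $out_i$ on the right-hand side are the outputs of the previous layer.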

Perceptron Learning
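§ If the network weights at time t are $w_{ij}(t)$, then
the shifting process corresponds to moving them
by an amount $\Delta w_{ij}(t)$, so that at time t+1 we have
weights
$w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t)$
$\Delta w_{ij}(t) = \eta \, (t_j - o_j) \, x_i$
where $\eta$ is the learning rate, $t_j$ the target output and $o_j$ the actual output of unit j, and $x_i$ the i-th input.
§ It is convenient to treat the thresholds as
weights, as discussed previously, so we don’t
need separate equations for them.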
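The update rule above can be sketched as a short training loop on the AND gate. The names `predict` and `train_perceptron`, the learning rate 0.1, the zero initial weights, and the epoch cap are all assumptions for illustration. The threshold is folded into the weights by giving every pattern a constant first input of 1.

```python
def predict(weights, x):
    # x[0] is the constant input 1, so weights[0] plays the role of -threshold
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else 0

def train_perceptron(patterns, eta=0.1, max_epochs=100):
    """Apply dw_i = eta * (target - output) * x_i until an epoch makes no change."""
    weights = [0.0] * len(patterns[0][0])
    for epoch in range(1, max_epochs + 1):
        changed = False
        for x, target in patterns:
            out = predict(weights, x)
            if out != target:
                weights = [w + eta * (target - out) * xi
                           for w, xi in zip(weights, x)]
                changed = True
        if not changed:
            return weights, epoch      # converged: every pattern is correct
    return weights, None               # no convergence within max_epochs

# Each pattern is ((1, x1, x2), target) with the leading 1 as the bias input
AND_PATTERNS = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
weights, epochs = train_perceptron(AND_PATTERNS)
print("weights:", weights, "epochs:", epochs)
```

Running the same loop on the XOR patterns returns `None` for the epoch count, since no weight vector classifies all four patterns correctly.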

Convergence of Perceptron Learning
§ The weight changes $\Delta w_{ij}$ need to be applied repeatedly, for each
weight $w_{ij}$ in the network and for each training pattern in the
training set. One pass through all the weights for the whole
training set is called one epoch of training.
§ Eventually, usually after many epochs, when all the network
outputs match the targets for all the training patterns, all the $\Delta w_{ij}$
will be zero and the process of training will cease. We then say
that the training process has converged to a solution.
§ The weights can be found in a finite number of iterations if:
§ the problem is linearly separable,
§ the problem is correctly defined,
§ the step size is sufficiently small.

Learning by Error Minimization
§ Minimize the difference between the actual
outputs $out_j$ and the desired outputs $targ_j$.
§ Use an error function to quantify this difference,
here the sum squared error:
$E(w_{ij}) = \sum_p \sum_j (targ_j - out_j)^2$
(summed over training patterns p and output units j)
§ A systematic procedure for doing this
requires the knowledge of how the error
$E(w_{ij})$ varies as we change the weights $w_{ij}$, i.e.
the gradient of E with respect to $w_{ij}$.
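As a small illustration of the sum squared error, the sketch below evaluates it for two training patterns with two output units each. The function name and the numbers are made up for the example.

```python
def sum_squared_error(targets, outputs):
    """E = sum over patterns p and output units j of (targ_j - out_j)^2."""
    return sum((t - o) ** 2
               for targ_p, out_p in zip(targets, outputs)
               for t, o in zip(targ_p, out_p))

# Illustrative values: 2 patterns, 2 output units each
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.8, 0.1], [0.3, 0.6]]
E = sum_squared_error(targets, outputs)   # 0.04 + 0.01 + 0.09 + 0.16
print(E)
```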

Computing Gradient and Derivatives
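§ The gradient, or rate of change, of f(x) at a particular
value of x can be approximated by $\Delta y/\Delta x$, the change in y = f(x) divided by a small change in x.
In the limit $\Delta x \to 0$ this is known as the partial derivative of f(x) with
respect to x, written $\partial f/\partial x$.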
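The approximation can be tried numerically. The following sketch uses a forward difference with an assumed step of 1e-6 on the example function f(x) = x^2, whose exact derivative at x = 3 is 6.

```python
def numerical_derivative(f, x, dx=1e-6):
    """Approximate the derivative of f at x by the ratio (change in f) / (change in x)."""
    return (f(x + dx) - f(x)) / dx

slope = numerical_derivative(lambda x: x ** 2, 3.0)
print(slope)   # close to the exact value 6
```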

Gradient Descent Minimization
§ Suppose we have a function f(x) and we want to change the value of x to minimize f(x).
§ What we need to do depends on the gradient of f(x). There are three cases to consider:
§ If $\partial f/\partial x > 0$ then f(x) increases as x increases so we should decrease x
§ If $\partial f/\partial x < 0$ then f(x) decreases as x increases so we should increase x
§ If $\partial f/\partial x = 0$ then f(x) is at a maximum or minimum so we should not change x
§ In summary, we can decrease f(x) by changing x by the amount:
$\Delta x = x_{new} - x_{old} = -\eta \, \partial f/\partial x$
§ where $\eta$ is a small positive constant specifying how much we change x by, and the derivative
$\partial f/\partial x$ tells us which direction to go in. If we repeatedly use this equation, f(x) will (assuming $\eta$ is
sufficiently small) keep descending towards its minimum, and hence this procedure is known as
gradient descent minimization.
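A one-dimensional sketch of this procedure follows. The example function f(x) = (x - 3)^2, the starting point, the step size 0.1 and the number of steps are assumptions chosen so that the behaviour is easy to follow: each step moves x against the sign of the derivative.

```python
def gradient_descent(df_dx, x0, eta=0.1, steps=100):
    """Repeatedly apply x := x - eta * df/dx."""
    x = x0
    for _ in range(steps):
        x = x - eta * df_dx(x)
    return x

# f(x) = (x - 3)^2 has derivative 2 (x - 3) and its minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

With these values the distance to the minimum shrinks by a factor of 0.8 per step. A step size above 1.0 would make this particular example diverge, which is why $\eta$ has to be sufficiently small.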

Gradients in more than one Direction

Gradient Descent Error Minimization
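§ Remember that we want to train our neural networks by adjusting
their weights $w_{ij}$ in order to minimize the error function:
$E(w_{ij}) = \sum_p \sum_j (targ_j - out_j)^2$
§ We now see it makes sense to do this by a series of gradient descent
weight updates:
$\Delta w_{kl} = -\eta \, \partial E(w_{ij}) / \partial w_{kl}$
§ If the transfer function for the output neurons is f(x), and the activations
of the previous layer of neurons are $in_i$, then the outputs are
$out_j = f\left(\sum_i in_i \, w_{ij}\right)$
and
$\Delta w_{kl} = -\eta \, \frac{\partial}{\partial w_{kl}} \sum_p \sum_j \left(targ_j - f\left(\sum_i in_i \, w_{ij}\right)\right)^2$
§ Dealing with equations like this is easy if we use the chain rule for
derivatives.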

Weights Derivatives Calculation
§ Chain Rule
§ Calculating the derivative of

Weights Derivative Calculation
Kronecker delta symbol $\delta_{ij}$, defined such that $\delta_{ij} = 1$ when $i = j$ and $\delta_{ij} = 0$ when $i \neq j$

Delta Rule
§ The Delta Rule is the basic gradient descent learning algorithm for single layer
networks:
§ It involves the derivative of the transfer function f(x).
§ This is problematic for the simple Perceptron that uses the step function
sgn(x) as its threshold function, because this has zero derivative
everywhere except at x = 0 where it is infinite.
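A differentiable transfer function avoids the problem with the step function. The sketch below trains a single sigmoid unit on the OR gate by batch gradient descent on the sum squared error. It is an illustration under assumptions: the update is written as dw_i = eta * sum_p (targ - out) * f'(net) * in_i, with the constant factor 2 from differentiating the square absorbed into eta, and the learning rate, epoch count and zero initial weights are arbitrary choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def outputs(weights, patterns):
    """Output of the sigmoid unit for every pattern."""
    return [sigmoid(sum(w * xi for w, xi in zip(weights, x))) for x, _ in patterns]

def sse(weights, patterns):
    """Sum squared error over all patterns."""
    return sum((t - o) ** 2
               for (_, t), o in zip(patterns, outputs(weights, patterns)))

def delta_rule_epoch(weights, patterns, eta):
    """One batch update: dw_i = eta * sum_p (targ - out) * f'(net) * in_i."""
    grads = [0.0] * len(weights)
    for (x, t), o in zip(patterns, outputs(weights, patterns)):
        slope = o * (1.0 - o)          # derivative of the sigmoid at this net input
        for i, xi in enumerate(x):
            grads[i] += (t - o) * slope * xi
    return [w + eta * g for w, g in zip(weights, grads)]

# ((1, x1, x2), target): the leading 1 is the bias input
OR_PATTERNS = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
weights = [0.0, 0.0, 0.0]
initial_error = sse(weights, OR_PATTERNS)
for _ in range(5000):
    weights = delta_rule_epoch(weights, OR_PATTERNS, eta=0.5)
final_error = sse(weights, OR_PATTERNS)
print("error before:", initial_error, "after:", final_error)
print("outputs:", [round(o, 3) for o in outputs(weights, OR_PATTERNS)])
```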

Multi Layer Neural Network
$a_i^{(j)}$ = activation of unit i in layer j
$w^{(j)}$ = matrix of weights controlling the function mapping from layer j to layer j + 1
$a_1^{(2)} = sig(w_{10}^{(1)} x_0 + w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{13}^{(1)} x_3)$
$a_2^{(2)} = sig(w_{20}^{(1)} x_0 + w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{23}^{(1)} x_3)$
$a_3^{(2)} = sig(w_{30}^{(1)} x_0 + w_{31}^{(1)} x_1 + w_{32}^{(1)} x_2 + w_{33}^{(1)} x_3)$
$h_w(x) = sig(w_{10}^{(2)} a_0^{(2)} + w_{11}^{(2)} a_1^{(2)} + w_{12}^{(2)} a_2^{(2)} + w_{13}^{(2)} a_3^{(2)})$
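The layer equations amount to a forward pass through the network. The sketch below assumes a network with three inputs, three hidden units and one output, with a bias unit fixed at 1 prepended to each layer. The helper names and the example weight values are placeholders, not values from the lecture.

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, inputs):
    """weights[i][k] is the weight from input k (k = 0 is the bias unit) to unit i."""
    return [sig(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(w1, w2, x):
    a1 = [1.0] + list(x)          # input layer with bias x_0 = 1
    a2 = [1.0] + layer(w1, a1)    # hidden layer with bias unit a_0 = 1
    return layer(w2, a2)[0]       # single output unit h_w(x)

# Placeholder weights: 3 hidden units x 4 inputs, then 1 output x 4 hidden values
w1 = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
w2 = [[0.0, 1.0, 1.0, 1.0]]
h = forward(w1, w2, (0.2, -0.4, 0.9))
print(h)   # hidden units all output 0.5 here, so h = sig(1.5)
```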

XNOR Using a Multilayer ANN
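One way to build XNOR from two layers of sigmoid units is sketched below. The weights are an assumption (a commonly used textbook choice, not taken from this slide): one hidden unit computes AND, the other computes NOR, and the output unit computes the OR of the two.

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def xnor(x1, x2):
    a1 = sig(-30 + 20 * x1 + 20 * x2)    # hidden unit 1: x1 AND x2
    a2 = sig(10 - 20 * x1 - 20 * x2)     # hidden unit 2: (NOT x1) AND (NOT x2)
    return sig(-10 + 20 * a1 + 20 * a2)  # output: a1 OR a2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```

The large weight magnitudes push the sigmoid close to 0 or 1, so each unit behaves almost like a logic gate.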

Multi Output Multi Layer
$h_w(x) \in \mathbb{R}^4$
$h_w(x) \approx [1, 0, 0, 0]^T$, $h_w(x) \approx [0, 1, 0, 0]^T$, $h_w(x) \approx [0, 0, 1, 0]^T$, etc.

Derivation of the Backpropagation algorithm
For output units
So:
Source: http://www.speech.sri.com/people/anand/771/html/node37.html
[Figure: network with input, hidden, and output layers]

Derivation of the Backpropagation algorithm
For Hidden units
Also:
So:
Source: http://www.speech.sri.com/people/anand/771/html/node37.html
[Figure: network with input, hidden, and output layers; hidden unit j feeds output units k]

Backpropagation - example
§ First calculate error of output units and use this
to change the output layer of weights.
Current output: $o_j = 0.2$
Correct output: $t_j = 1.0$
Error: $\delta_j = o_j (1 - o_j)(t_j - o_j) = 0.2 (1 - 0.2)(1 - 0.2) = 0.128$
Update weights into j:
$\Delta w_{ji} = \eta \, \delta_j \, o_i$
[Figure: network with input, hidden, and output layers]
Source: Raymond J. Mooney, University of Texas at Austin, CS 391L: Machine Learning Neural Networks
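The arithmetic of this step can be reproduced directly. In the sketch below the output and target are the values from the example, while the learning rate 0.5 and the incoming activation 0.6 are assumed values used only to show the weight update.

```python
def output_delta(o, t):
    """Error term for a sigmoid output unit: o (1 - o) (t - o)."""
    return o * (1.0 - o) * (t - o)

delta_j = output_delta(0.2, 1.0)      # 0.2 * 0.8 * 0.8

# Assumed values for illustration only
eta = 0.5     # learning rate
o_i = 0.6     # activation of a hidden unit i feeding into j
dw_ji = eta * delta_j * o_i
print(delta_j, dw_ji)
```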

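§ Next calculate error for hidden units based on
errors on the output units each one feeds into.
$\delta_j = o_j (1 - o_j) \sum_k \delta_k \, w_{kj}$
[Figure: network with input, hidden, and output layers]

§ Finally update bottom layer of weights based on
errors calculated for hidden units.
$\delta_j = o_j (1 - o_j) \sum_k \delta_k \, w_{kj}$
$\Delta w_{ji} = \eta \, \delta_j \, x_i$
[Figure: network with input, hidden, and output layers]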
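The three steps (output error, hidden error, weight updates) can be put together for a small network and checked against a numerical gradient. This is a sketch under assumptions: a 2-2-1 sigmoid network, the per-pattern error E = 0.5 (t - o)^2 (which gives the output error term o (1 - o)(t - o) used in the example), and arbitrary weight values. All function names are hypothetical.

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w_hidden, w_out, x):
    """2-2-1 network; index 0 of each weight row is the bias weight."""
    h = [sig(row[0] + row[1] * x[0] + row[2] * x[1]) for row in w_hidden]
    o = sig(w_out[0] + w_out[1] * h[0] + w_out[2] * h[1])
    return h, o

def backprop_updates(w_hidden, w_out, x, t, eta):
    h, o = forward(w_hidden, w_out, x)
    delta_o = o * (1 - o) * (t - o)                        # output unit error
    delta_h = [h[j] * (1 - h[j]) * delta_o * w_out[j + 1]  # hidden unit errors
               for j in range(2)]
    dw_out = [eta * delta_o * a for a in [1.0] + h]
    dw_hidden = [[eta * delta_h[j] * a for a in [1.0] + list(x)]
                 for j in range(2)]
    return dw_hidden, dw_out

def error(w_hidden, w_out, x, t):
    _, o = forward(w_hidden, w_out, x)
    return 0.5 * (t - o) ** 2

# Arbitrary weights and one training pattern (assumed values)
w_hidden = [[0.1, 0.4, -0.3], [-0.2, 0.25, 0.5]]
w_out = [0.05, 0.6, -0.7]
x, t, eta = (1.0, 0.0), 1.0, 1.0
dw_hidden, dw_out = backprop_updates(w_hidden, w_out, x, t, eta)

# Numerical check of one hidden weight: the update should equal -eta * dE/dw
eps = 1e-6
w_plus = [row[:] for row in w_hidden]
w_minus = [row[:] for row in w_hidden]
w_plus[0][1] += eps
w_minus[0][1] -= eps
numeric = -eta * (error(w_plus, w_out, x, t) - error(w_minus, w_out, x, t)) / (2 * eps)
print(dw_hidden[0][1], numeric)
```

The two printed numbers should agree to several decimal places, which confirms that the hidden unit error formula yields the gradient of the error with respect to that weight.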

Notes on Backpropagation Algorithm
§ Gradient descent over the entire network weight
vector
§ Easily generalized to arbitrary directed graphs
§ Will find a local, not necessarily global, error
minimum
§ In practice, often works well (can be run multiple times)
§ Often includes a weight momentum term
§ Minimizes error over training examples
§ Will it generalize well to subsequent examples?

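Sample Learned XOR Network
[Figure: learned XOR network with inputs X and Y, hidden units A and B, and output O. Weights shown in the figure: 3.11, -7.38, 6.96, -5.24, -3.6, -3.58, -5.57, -5.74, -2.03]
Hidden Unit A represents: ¬(X ∧ Y)
Hidden Unit B represents: ¬(X ∨ Y)
Output O represents: A ∧ ¬B = ¬(X ∧ Y) ∧ (X ∨ Y)
= X ⊕ Y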

Hidden Unit Representations
§ Trained hidden units can be seen as newly
constructed features that make the target concept
linearly separable in the transformed space.
§ Hidden units can sometimes be interpreted as representing meaningful
features such as vowel detectors or edge
detectors.
§ However, the hidden layer can also become a distributed representation of the input
in which each individual unit is not easily
interpretable as a meaningful feature.

Learning Hidden Layer Representations

Convergence of Backpropagation
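§ Gradient descent converges to some local minimum
§ Perhaps not the global minimum
§ Add momentum
§ Use stochastic gradient descent
§ Train multiple networks with different initial weights
§ Nature of convergence
§ Initialize weights near zero
§ The initial network is therefore nearly linear; increasingly nonlinear functions become possible as training progresses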

Expressive Capabilities of ANN
§ Boolean functions
§ Every Boolean function can be expressed by a
network with a single hidden layer.
§ How many hidden units? Possibly exponentially many in the number of inputs.
§ Continuous functions
§ Every bounded continuous function can be
approximated with arbitrarily small error by a network with one hidden layer.
§ Any function can be approximated to arbitrary
accuracy by a network with 2 hidden layers.

Overfitting in ANN
§ If we have too many
features, the learned
hypothesis may fit the
training set very well,
but fail to generalize to
new examples…

How to address overfitting
§ Plot the hypothesis
§ Lots of features? Lots of classes?
§ Reduce the number of features:
§ Manually select which features to keep
§ Use a model selection algorithm (feature reduction, at the cost of throwing
away some information)
§ Regularization
§ Keep all features but reduce the magnitude/values of the
parameters w
§ Works well in the case of lots of features, where each
contributes a bit to predicting y

Determining the Best Number of Hidden Units
§ Too few hidden units prevents the network from
adequately fitting the data.
§ Too many hidden units can result in over-fitting.
§ Use internal cross-validation to empirically determine an
optimal number of hidden units.
[Figure: error versus number of hidden units, with one curve for training data and one for test data]

Penalize …
§ $\min_w \sum_{i=1}^{m} (h_w(x^{(i)}) - y^{(i)})^2$
§ $\min_w \sum_{i=1}^{m} (h_w(x^{(i)}) - y^{(i)})^2 + 100 w_3^2 + 100 w_4^2$

Regularization
§ Small values for the parameters w:
§ Simpler hypothesis
§ Less prone to overfitting
§ It is hard to know in advance which parameters to shrink, so
add a regularization term that shrinks every single
parameter
$J(w) = \frac{1}{2m} \left( \sum_{i=1}^{m} (h_w(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} w_j^2 \right)$
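The regularized cost can be evaluated with a few lines of code. The sketch below assumes a linear hypothesis h_w(x) = sum_j w_j x_j with x_0 = 1 (the slides do not fix the form of h_w), and follows the formula in leaving w_0 out of the penalty. The data values are made up for the example.

```python
def regularized_cost(w, X, y, lam):
    """J(w) = (1 / 2m) * (sum of squared errors + lam * sum_{j>=1} w_j^2)."""
    m = len(X)
    squared_errors = sum(
        (sum(wj * xj for wj, xj in zip(w, x)) - yi) ** 2
        for x, yi in zip(X, y))
    penalty = lam * sum(wj ** 2 for wj in w[1:])   # w_0 is not penalized
    return (squared_errors + penalty) / (2 * m)

# Illustrative data: each row of X starts with the bias input x_0 = 1
X = [(1.0, 1.0), (1.0, 2.0)]
y = [1.0, 2.0]
w = [0.0, 1.0]                                   # fits the data exactly
print(regularized_cost(w, X, y, lam=0.0))        # no error, no penalty
print(regularized_cost(w, X, y, lam=2.0))        # penalty only: 2 * 1 / (2 * 2)
```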

Regularization
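§ $\lambda$ controls the trade-off between fitting the data and
keeping the parameters small to reduce the
overfitting problem:
§ the result is a much smoother and much
simpler curve
§ How to choose $\lambda$:
§ if $\lambda$ is too high, all weights except $w_0$ are driven towards zero and the hypothesis
reduces to approximately $h_w(x) = w_0$.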

Regularized Gradient Descent
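§ Gradient descent
§ Repeat {
§ $w_0 := w_0 - \eta \frac{1}{m} \sum_{i=1}^{m} (h_w(x^{(i)}) - y^{(i)}) \, x_0^{(i)}$
§ $w_j := w_j - \eta \left[ \frac{1}{m} \sum_{i=1}^{m} (h_w(x^{(i)}) - y^{(i)}) \, x_j^{(i)} + \frac{\lambda}{m} w_j \right]$
}
§ Equivalently: $w_j := w_j \left(1 - \eta \frac{\lambda}{m}\right) - \eta \frac{1}{m} \sum_{i=1}^{m} (h_w(x^{(i)}) - y^{(i)}) \, x_j^{(i)}$,
where $\left(1 - \eta \frac{\lambda}{m}\right) < 1$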
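One step of these updates can be sketched as follows, again assuming a linear hypothesis with x_0 = 1. Two hypothetical functions are shown: one applies the update with the penalty added to the gradient, the other applies the equivalent form that first shrinks w_j by the factor (1 - eta * lam / m). The data, learning rate and lambda are made-up values.

```python
def predict(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def data_gradient(w, X, y, j):
    """(1/m) * sum_i (h_w(x_i) - y_i) * x_ij"""
    m = len(X)
    return sum((predict(w, x) - yi) * x[j] for x, yi in zip(X, y)) / m

def step_penalty_form(w, X, y, eta, lam):
    m = len(X)
    new_w = [w[0] - eta * data_gradient(w, X, y, 0)]       # w_0 is not regularized
    for j in range(1, len(w)):
        grad = data_gradient(w, X, y, j) + lam / m * w[j]
        new_w.append(w[j] - eta * grad)
    return new_w

def step_shrink_form(w, X, y, eta, lam):
    m = len(X)
    new_w = [w[0] - eta * data_gradient(w, X, y, 0)]
    for j in range(1, len(w)):
        shrunk = w[j] * (1 - eta * lam / m)                # weight decay factor < 1
        new_w.append(shrunk - eta * data_gradient(w, X, y, j))
    return new_w

# Illustrative values
X = [(1.0, 1.0), (1.0, 2.0)]
y = [1.0, 3.0]
w = [0.5, 0.5]
a = step_penalty_form(w, X, y, eta=0.1, lam=1.0)
b = step_shrink_form(w, X, y, eta=0.1, lam=1.0)
print(a, b)
```

Both forms give the same new weights, which shows that the regularization term acts as a multiplicative shrinkage of each weight before the usual gradient step.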
