1. Introduction to Machine Learning
Lecture 11: Neural Networks
Albert Orriols i Puig
aorriols@salle.url.edu
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
2. Recap of Lectures 5-10
Data classification
Decision trees (C4.5)
Instance-based learners (kNN and CBR)
3. Recap of Lectures 5-10
Data classification
Probabilistic-based learners
$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$
Linear/polynomial classifier
4. Today’s Agenda
Why Neural Networks?
Looking into a Brain
Neural Networks
Starting from the Beginning: Perceptrons
Multi-layer perceptrons
5. Why Neural Networks?
Brains vs. machines
Machines are tremendously faster than brains at well-defined problems:
Inverting matrices, solving differential equations, etc.
Brains are tremendously faster and more accurate than machines at ill-defined problems or problems that require a lot of processing:
Recognizing characters or objects on TV
Let’s simulate our brains with artificial neural networks!
Massive parallelism
Neurons exchanging signals
6. Looking into a Brain
$10^{11}$ neurons of more than 20 different types
Neuron switching time: ~0.001 seconds
$10^4$–$10^5$ connections per neuron
Scene recognition time: ~0.1 seconds
7. Artificial Neural Networks
Artificial neural networks borrow some ideas from the nervous systems of animals
$$a_i = g(in_i) = g\Big(\sum_j W_{j,i}\, a_j\Big)$$
THE PERCEPTRON (McCulloch & Pitts)
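A minimal sketch of this unit in Python (the hard-threshold activation $g$ and the example activations and weights are assumptions for illustration, not values from the slides):

```python
import numpy as np

def g(s):
    """Hard-threshold activation: the unit fires when its weighted input is positive."""
    return 1 if s > 0 else 0

def unit_activation(a, W_col):
    """a_i = g(in_i) = g(sum_j W_ji * a_j): threshold the weighted sum of inputs."""
    return g(np.dot(W_col, a))

# Hypothetical incoming activations a_j and weights W_ji for one unit i
a = np.array([1.0, 0.0, 1.0])
W_col = np.array([0.5, -0.3, 0.2])
print(unit_activation(a, W_col))  # -> 1, since 0.5 + 0.2 > 0
```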
8. Adaline
Adaptive Linear Element
An adaptive linear combiner cascaded with a hard-limiting quantizer
The linear output is transformed to binary by means of a threshold device
Training = adjusting the weights
Activation functions
9. Adaline
Note that Adaline implements a function
$$f(\vec{x}, \vec{w}) = w_0 + \sum_{i=1}^{n} x_i w_i$$
This defines a threshold when the output is zero:
$$f(\vec{x}, \vec{w}) = w_0 + \sum_{i=1}^{n} x_i w_i = 0$$
10. Adaline
Let's assume that we have two variables:
$$f(\vec{x}, \vec{w}) = w_0 + x_1 w_1 + x_2 w_2 = 0$$
Therefore
$$x_2 = -\frac{w_1}{w_2}\, x_1 - \frac{w_0}{w_2}$$
So, Adaline is drawing a linear discriminant that divides the space into two regions
Linear classifier
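A quick sketch of this discriminant in Python (the weight values are made-up assumptions for illustration):

```python
import numpy as np

# Hypothetical Adaline weights: f(x, w) = w0 + x1*w1 + x2*w2
w0, w1, w2 = -1.0, 2.0, 1.0

def adaline_class(x1, x2):
    """Classify a point by the sign of the linear discriminant."""
    return 1 if w0 + x1 * w1 + x2 * w2 > 0 else 0

def boundary_x2(x1):
    """The line f(x, w) = 0 solved for x2: x2 = -(w1/w2)*x1 - w0/w2."""
    return -(w1 / w2) * x1 - w0 / w2

print(boundary_x2(0.0))         # -> 1.0: where the boundary crosses the x2-axis
print(adaline_class(1.0, 1.0))  # -> 1: above the line
print(adaline_class(0.0, 0.0))  # -> 0: below the line
```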
11. Adaline
So, we've got a cool way to create linear classifiers
But are linear classifiers enough to tackle our problems?
Can you draw a line that separates the examples of the white class from those of the black class in the last example?
12. Moving to more Flexible NN
So, we want to classify problems such as XOR. Any idea?
Polynomial discriminant functions
In this system:
$$f(\vec{x}, \vec{w}) = w_0 + x_1 w_1 + x_1^2 w_{11} + x_1 x_2 w_{12} + x_2^2 w_{22} + x_2 w_2 = 0$$
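As a hedged sketch (the weight values are hand-picked assumptions, not from the slides), this polynomial discriminant can already separate XOR:

```python
# Hand-picked weights (an assumption for illustration): with them the
# polynomial discriminant is positive exactly on the XOR-true points.
w0, w1, w11, w12, w22, w2 = -0.5, 1.0, 0.0, -2.0, 0.0, 1.0

def poly_discriminant(x1, x2):
    """f(x, w) = w0 + x1*w1 + x1^2*w11 + x1*x2*w12 + x2^2*w22 + x2*w2"""
    return w0 + x1*w1 + x1**2*w11 + x1*x2*w12 + x2**2*w22 + x2*w2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), int(poly_discriminant(x1, x2) > 0))
# -> (0,0):0  (0,1):1  (1,0):1  (1,1):0  — the XOR labels
```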
13. Moving to more Flexible NN
With appropriate values of w, I can fit data that is not
linearly separable
14. Even more Flexible: Multi-layer NN
So, we want to classify problems such as XOR. Any other idea?
Madaline: multiple Adalines connected together
This also enables the network to solve non-linearly-separable problems (a sketch follows below)
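A minimal sketch (the fixed weights are hand-picked assumptions, not trained values): two Adalines computing OR and NAND, whose outputs a third Adaline combines with AND, implement XOR:

```python
import numpy as np

def adaline(x, w, w0):
    """A hard-limiting Adaline: threshold the linear combination to {0, 1}."""
    return 1 if w0 + np.dot(w, x) > 0 else 0

def madaline_xor(x1, x2):
    """XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)) built from three Adalines."""
    x = np.array([x1, x2])
    h_or   = adaline(x, np.array([1.0, 1.0]),  -0.5)  # fires if x1 or x2 fires
    h_nand = adaline(x, np.array([-1.0, -1.0]), 1.5)  # fires unless both fire
    return adaline(np.array([h_or, h_nand]), np.array([1.0, 1.0]), -1.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), madaline_xor(x1, x2))  # -> 0, 1, 1, 0
```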
15. But Step Down… How Do I Learn w?
We have seen that different structures enable us to define different functions
But the key is to get a proper estimation of w
There are many algorithms:
Perceptron rule
α-LMS
α-perceptron
May's algorithm
Backpropagation
We are going to see two examples: α-LMS and backpropagation
16. Weight Learning in Adaline
Recall that we want to adjust w
17. Weight Learning in Adaline
Weight learning with the α-LMS algorithm
Incrementally update the weights as
$$W_{k+1} = W_k + \alpha\, \frac{\varepsilon_k X_k}{\|X_k\|^2}$$
The error is the difference between the desired and the actual output:
$$\varepsilon_k = d_k - W_k^T X_k$$
A change in the weights affects the error:
$$\Delta\varepsilon_k = \Delta(d_k - W_k^T X_k) = -X_k^T\, \Delta W_k$$
And the weight change is
$$\Delta W_k = W_{k+1} - W_k = \alpha\, \frac{\varepsilon_k X_k}{\|X_k\|^2}$$
Therefore
$$\Delta\varepsilon_k = -\alpha\, \frac{\varepsilon_k X_k^T X_k}{\|X_k\|^2} = -\alpha\, \varepsilon_k$$
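A minimal sketch of this rule in Python (the AND training data, learning rate, and epoch count are assumptions for illustration; the bias enters as a constant input x0 = 1):

```python
import numpy as np

def alpha_lms(X, d, alpha=0.1, epochs=100):
    """Train Adaline weights with the alpha-LMS rule:
    W_{k+1} = W_k + alpha * eps_k * X_k / ||X_k||^2, with eps_k = d_k - W_k^T X_k."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        for X_k, d_k in zip(X, d):
            eps_k = d_k - W @ X_k                   # error before the update
            W += alpha * eps_k * X_k / (X_k @ X_k)  # shrinks eps_k by a factor (1 - alpha)
    return W

# Hypothetical data: the AND function with +-1 targets;
# the first column is the constant input x0 = 1 that carries the bias weight w0.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
d = np.array([-1.0, -1.0, -1.0, 1.0])
W = alpha_lms(X, d)
print(np.sign(X @ W))  # -> [-1. -1. -1.  1.]
```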
18. Weight Learning in Adaline
$$\Delta\varepsilon_k = -X_k^T\, \Delta W_k \qquad\quad \Delta W_k = \alpha\, \frac{\varepsilon_k X_k}{\|X_k\|^2}$$
19. Backpropagation
α-LMS works for networks with a single layer. But what
happens in networks with multiple layers?
Backpropagation (Rumelhart, 1986)
The most influential development of NN in the 1980s
Here, we present the method conceptually (the math details are
in the papers)
Let’s assume a network with
Three neurons in the input layer
Two neurons in the output layer
20. Backpropagation
Strategy
Compute the gradient of the squared error:
$$\hat{\nabla}_k = \frac{\partial \varepsilon_k^2}{\partial W_k}$$
Adjust the weights in the direction opposite to the instantaneous error gradient
Now, $W_k$ is a vector that contains all the weights of the net
21. Backpropagation
Algorithm
1. Insert a new example $X_k$ into the network and sweep it forward until getting the output $y$
2. Compute the squared error of this example:
$$\varepsilon_k^2 = \sum_{i=1}^{N_y} \varepsilon_{ik}^2 = \sum_{i=1}^{N_y} (d_{ik} - y_{ik})^2$$
For example, for two outputs (disregarding $k$):
$$\varepsilon^2 = (d_1 - y_1)^2 + (d_2 - y_2)^2$$
3. Propagate the error to the previous layer (back-propagation). How? Steepest descent: compute the derivative of the squared error, $\delta$, for each Adaline (a full sketch follows below)
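A minimal end-to-end sketch of these three steps in Python (the 3-2-2 layer sizes follow the slide-19 setup; the sigmoid activation, learning rate, initial weights, and training example are assumptions; biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgm(s):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-s))

def sgm_prime(s):
    """Its derivative: sgm'(s) = sgm(s) * (1 - sgm(s))."""
    y = sgm(s)
    return y * (1.0 - y)

# Hypothetical 3-input, 2-hidden, 2-output network
W1 = rng.normal(scale=0.5, size=(2, 3))  # hidden-layer weights
W2 = rng.normal(scale=0.5, size=(2, 2))  # output-layer weights

def backprop_step(x, d, alpha=0.5):
    global W1, W2
    # Step 1: sweep the example forward to the output y
    s1 = W1 @ x;  a1 = sgm(s1)
    s2 = W2 @ a1; y = sgm(s2)
    # Step 2: squared error eps^2 = sum_i (d_i - y_i)^2
    eps2 = float(np.sum((d - y) ** 2))
    # Step 3: back-propagate; delta_j = eps_j * sgm'(s_j) at the output, and
    # the hidden errors are the output deltas weighted by the outgoing weights
    delta2 = (d - y) * sgm_prime(s2)
    delta1 = (W2.T @ delta2) * sgm_prime(s1)
    W2 += alpha * np.outer(delta2, a1)  # Delta w = alpha * delta * input
    W1 += alpha * np.outer(delta1, x)
    return eps2

x, d = np.array([1.0, 0.0, 1.0]), np.array([1.0, 0.0])
for _ in range(200):
    eps2 = backprop_step(x, d)
print(round(eps2, 4))  # the squared error shrinks toward 0
```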
22. Backpropagation Example
Example borrowed from: http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
28. Backpropagation for a Two-Layer Net.
That is, the algorithm is:
1. Find the instantaneous squared error derivative
$$\delta_j^{(l)} = -\frac{1}{2}\, \frac{\partial \varepsilon^2}{\partial s_j^{(l)}}$$
This tells us how sensitive the squared output error of the network is to changes in the linear output $s$ of the associated Adaline
2. Expanding the error term we get
$$\delta_1^{(2)} = -\frac{1}{2}\, \frac{\partial \big[(d_1 - y_1)^2 + (d_2 - y_2)^2\big]}{\partial s_1^{(2)}} = -\frac{1}{2}\, \frac{\partial \big[d_1 - \mathrm{sgm}(s_1^{(2)})\big]^2}{\partial s_1^{(2)}}$$
3. And recognizing that $d_1$ is independent of $s_1^{(2)}$:
$$\delta_1^{(2)} = \big[d_1 - \mathrm{sgm}(s_1^{(2)})\big]\, \mathrm{sgm}'(s_1^{(2)}) = \varepsilon_1^{(2)}\, \mathrm{sgm}'(s_1^{(2)})$$
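A one-line step the slides use implicitly: for the logistic sigmoid, the derivative $\mathrm{sgm}'$ can be computed from the forward-pass output itself:

```latex
\mathrm{sgm}(s) = \frac{1}{1 + e^{-s}}
\quad\Longrightarrow\quad
\mathrm{sgm}'(s) = \frac{e^{-s}}{(1 + e^{-s})^2} = \mathrm{sgm}(s)\,\big(1 - \mathrm{sgm}(s)\big)
```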
30. Backpropagation for a Two-Layer Net.
The hidden Adaline's error is built from the output-layer deltas, weighted by its outgoing connections:
$$\varepsilon_1^{(1)} \triangleq \delta_1^{(2)}\, w_{11}^{(2)} + \delta_2^{(2)}\, w_{21}^{(2)}$$
Defining
$$\delta_1^{(1)} = \varepsilon_1^{(1)}\, \mathrm{sgm}'(s_1^{(1)})$$
we obtain the implementation details of each Adaline (summarized below)
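Collecting the pieces, a hedged summary of the full per-Adaline recursion (the constant factors folded into the learning rate $\alpha$ vary across presentations; $x_i^{(l)}$ denotes the $i$-th input of the Adaline):

```latex
\delta_j^{(l)} = \varepsilon_j^{(l)}\,\mathrm{sgm}'\big(s_j^{(l)}\big),
\qquad
\varepsilon_j^{(l)} = \sum_i \delta_i^{(l+1)}\, w_{ji}^{(l+1)},
\qquad
\Delta w_{ij}^{(l)} = \alpha\, \delta_j^{(l)}\, x_i^{(l)}
```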
31. Next Class
Support Vector Machines