Connectionist Models
Consider humans
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4 to 10^5
• Scene recognition time ~0.1 second
• 100 inference steps does not seem like enough
→ must use lots of parallel computation!
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw
sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
ALVINN drives 70 mph on highways
[Figure: ALVINN's network; a 30x32 sensor input retina feeds 4 hidden units, whose outputs span the steering range from Sharp Left through Straight Ahead to Sharp Right]
Perceptron
[Figure: perceptron unit; inputs x_1, ..., x_n (plus constant x_0 = 1) with weights w_0, w_1, ..., w_n feed a summation \sum_{i=0}^{n} w_i x_i followed by a threshold]

o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}

Sometimes we will use simpler vector notation:

o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}
Decision Surface of Perceptron
[Figure: two decision surfaces in the (x_1, x_2) plane: a linearly separable set of examples, and a set that no single line can separate]
Represents some useful functions
• What weights represent g(x1,x2) = AND(x1,x2)?
But some functions not representable
• e.g., not linearly separable
• therefore, we will want networks of these ...
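As a concrete answer to the AND question, here is a minimal Python sketch (the weight values are one valid choice of mine, not prescribed by the slides):

# A perceptron computing AND(x1, x2): the net input exceeds 0 only when both inputs are 1.
def perceptron(x, w):
    # Threshold unit: 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1

w_and = [-0.8, 0.5, 0.5]  # one valid choice: w0 (bias), w1, w2
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w_and))  # -1, -1, -1, 1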
Perceptron Training Rule
w_i \leftarrow w_i + \Delta w_i
where \Delta w_i = \eta (t - o) x_i

Where:
• t = c(\vec{x}) is the target value
• o is the perceptron output
• \eta is a small constant (e.g., 0.1) called the learning rate

Can prove it will converge
• If training data is linearly separable
• and \eta is sufficiently small
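A minimal Python sketch of this rule (variable names, eta, and the epoch count are my choices):

# Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
# Each example is (x, t) with x including the constant x0 = 1 and t in {-1, +1}.
def train_perceptron(examples, eta=0.1, epochs=100):
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, t in examples:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            for i in range(len(w)):
                w[i] += eta * (t - o) * x[i]  # zero change when o == t
    return w

# AND, with x0 = 1 prepended to each input vector
print(train_perceptron([((1, 0, 0), -1), ((1, 0, 1), -1),
                        ((1, 1, 0), -1), ((1, 1, 1), 1)]))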
Gradient Descent
To understand, consider a simpler linear unit, where

o = w_0 + w_1 x_1 + \cdots + w_n x_n

Idea: learn w_i's that minimize the squared error

E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

Where D is the set of training examples.
Gradient Descent
Gradient:

\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]

Training rule:

\Delta \vec{w} = -\eta \nabla E[\vec{w}]

i.e.,

\Delta w_i = -\eta \frac{\partial E}{\partial w_i}
Gradient Descent
\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2
= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)

\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)(-x_{i,d})
Gradient Descent
GRADIENT-DESCENT(training_examples, \eta)

Each training example is a pair of the form \langle \vec{x}, t \rangle, where \vec{x} is the vector of input values and t is the target output value. \eta is the learning rate (e.g., 0.05).

- Initialize each w_i to some small random value
- Until the termination condition is met, do
  - Initialize each \Delta w_i to zero
  - For each \langle \vec{x}, t \rangle in training_examples, do
    * Input the instance \vec{x} to the unit and compute the output o
    * For each linear unit weight w_i, do
      \Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i
  - For each linear unit weight w_i, do
    w_i \leftarrow w_i + \Delta w_i
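The procedure translates almost line for line into Python (a sketch; the termination condition here is simply a fixed iteration count, my choice):

import random

def gradient_descent(training_examples, eta=0.05, iterations=1000):
    # Batch gradient descent for a linear unit o = w . x.
    # Each example is (x, t) with x including the constant x0 = 1.
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]  # small random values
    for _ in range(iterations):
        dw = [0.0] * n                                   # each delta-w_i to zero
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))     # linear unit output
            for i in range(n):
                dw[i] += eta * (t - o) * x[i]
        for i in range(n):
            w[i] += dw[i]                                # apply accumulated update
    return w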
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with
minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
Incremental (Stochastic) Gradient Descent
Batch mode Gradient Descent:
Do until satisfied:
1. Compute the gradient \nabla E_D[\vec{w}]
2. \vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]

where E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

Incremental mode Gradient Descent:
Do until satisfied:
- For each training example d in D
  1. Compute the gradient \nabla E_d[\vec{w}]
  2. \vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]

where E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2
Incremental Gradient Descent can approximate Batch Gradient
Descent arbitrarily closely if η made small enough
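In code, the incremental version just moves the weight update inside the per-example loop (a sketch, same conventions as the batch version above):

def incremental_gradient_descent(training_examples, eta=0.05, iterations=1000):
    # Stochastic variant: step on the gradient of E_d after every example,
    # rather than summing gradients over all of D first.
    n = len(training_examples[0][0])
    w = [0.0] * n
    for _ in range(iterations):
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                w[i] += eta * (t - o) * x[i]  # immediate per-example update
    return w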
Sigmoid Unit
[Figure: sigmoid unit; inputs x_1, ..., x_n (plus constant x_0 = 1) with weights w_0, w_1, ..., w_n feed a summation, followed by a sigmoid]

net = \sum_{i=0}^{n} w_i x_i

o = \sigma(net) = \frac{1}{1 + e^{-net}}

\sigma(x) is the sigmoid function: \frac{1}{1 + e^{-x}}

Nice property: \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))

We can derive gradient descent rules to train:
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
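The "nice property" is easy to confirm numerically (a quick sketch; the test point is arbitrary):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Check d(sigma)/dx = sigma(x) * (1 - sigma(x)) with a central finite difference.
x, h = 0.7, 1e-6
analytic = sigmoid(x) * (1 - sigmoid(x))
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(analytic, numeric)  # the two values agree to many decimal places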
The Sigmoid Function
[Plot: \sigma(x) = \frac{1}{1 + e^{-x}}; output (0 to 1) versus net input (-6 to 6)]
Sort of a rounded step function
Unlike the step function, it is differentiable (which makes gradient-based learning possible)
Error Gradient for a Sigmoid Unit
\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)
= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}

But we know:

\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)

\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}

So:

\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}
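The final formula can be checked against a finite-difference gradient (a sketch with made-up weights and data):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def error(w, data):  # E = 1/2 * sum_d (t_d - o_d)^2
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in data)

data = [((1.0, 0.5, -1.2), 1.0), ((1.0, -0.3, 0.8), 0.0)]  # (x, t), x0 = 1
w = [0.1, -0.2, 0.4]
i, h = 1, 1e-6

# Analytic: dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}
analytic = -sum((t - output(w, x)) * output(w, x) * (1 - output(w, x)) * x[i]
                for x, t in data)
w_hi = [wj + (h if j == i else 0.0) for j, wj in enumerate(w)]
w_lo = [wj - (h if j == i else 0.0) for j, wj in enumerate(w)]
numeric = (error(w_hi, data) - error(w_lo, data)) / (2 * h)
print(analytic, numeric)  # should agree closely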
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, do
  For each training example, do
    1. Input the training example to the network and compute the outputs
    2. For each output unit k:
       \delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)
    3. For each hidden unit h:
       \delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k} \delta_k
    4. Update each network weight w_{i,j}:
       w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}
       where \Delta w_{i,j} = \eta \delta_j x_{i,j}
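A compact Python sketch of steps 1-4 for a single hidden layer (variable names and the two-layer restriction are mine; the slide's algorithm applies to layered feedforward networks generally):

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop(examples, n_in, n_hidden, n_out, eta=0.1, epochs=1000):
    # Each example is (x, t); a bias input of 1 is prepended here.
    # w_hid[j][i]: weight into hidden unit j from input i (i = 0 is the bias).
    w_hid = [[random.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
             for _ in range(n_hidden)]
    w_out = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]
             for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            # 1. forward pass: compute hidden and output activations
            xb = [1.0] + list(x)
            h = [sigmoid(sum(wi * xi for wi, xi in zip(wj, xb))) for wj in w_hid]
            hb = [1.0] + h
            o = [sigmoid(sum(wi * hi for wi, hi in zip(wk, hb))) for wk in w_out]
            # 2. output-unit errors: delta_k = o_k(1 - o_k)(t_k - o_k)
            d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # 3. hidden-unit errors: delta_h = o_h(1 - o_h) * sum_k w_hk delta_k
            d_hid = [h[j] * (1 - h[j]) *
                     sum(w_out[k][j + 1] * d_out[k] for k in range(n_out))
                     for j in range(n_hidden)]
            # 4. weight updates: w_ij <- w_ij + eta * delta_j * x_ij
            for k in range(n_out):
                for i in range(n_hidden + 1):
                    w_out[k][i] += eta * d_out[k] * hb[i]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hid[j][i] += eta * d_hid[j] * xb[i]
    return w_hid, w_out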
More on Backpropagation
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global error minimum
– In practice, often works well (can run multiple times)
• Often include weight momentum α
• Minimizes error over training examples
• Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
– Using network after training is fast
\Delta w_{i,j}(n) = \eta \delta_j x_{i,j} + \alpha \Delta w_{i,j}(n-1)
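In code, momentum just means remembering each weight's previous update (a sketch):

def momentum_update(eta, delta_j, x_ij, alpha, prev_dw):
    # delta_w(n) = eta * delta_j * x_ij + alpha * delta_w(n - 1)
    dw = eta * delta_j * x_ij + alpha * prev_dw
    return dw  # caller adds dw to the weight and stores it for the next step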
Output Unit Error during Training
[Plot: sum of squared errors for each output unit versus training iterations (0 to 2500)]
Hidden Unit Encoding
[Plot: hidden unit encoding for one input versus training iterations (0 to 2500)]
Input to Hidden Weights
[Plot: weights from the inputs to one hidden unit versus training iterations (0 to 2500)]
Convergence of Backpropagation
Gradient descent to some local minimum
• Perhaps not global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster
convergence
• Can train multiple networks and get different results (using
different initial weights)
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks near-linear
• Increasingly non-linear functions as training progresses
Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by network
with a single hidden layer
• But that might require a number of hidden units exponential in the
number of inputs
Continuous functions:
• Every bounded continuous function can be approximated
with arbitrarily small error by a network with one hidden
layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by
a network with two hidden layers [Cybenko 1988]
Overfitting in ANNs
[Plot: error versus number of weight updates (example 1); training-set and validation-set error over 0 to 20000 updates]
Overfitting in ANNs
[Plot: error versus number of weight updates (example 2); training-set and validation-set error over 0 to 6000 updates]
Neural Nets for Face Recognition
[Figure: face-recognition network with 30x32 inputs and four outputs (left, straight, right, up); typical input images shown]
90% accurate at learning head pose, and recognizing 1-of-20 faces
Alternative Error Functions
Penalize large weights:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2

Train on target slopes as well as values:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]

Tie together weights:
• e.g., in phoneme recognition
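For the weight penalty, the only change to training is an extra shrinkage term in each weight's gradient, since the derivative of \gamma \sum w_{ji}^2 with respect to w_{ji} is 2 \gamma w_{ji} (a sketch; gamma is the coefficient from the formula above):

def decayed_update(w_ji, dE_dw, eta=0.1, gamma=0.001):
    # Gradient step on the penalized error: the 2*gamma*w_ji term
    # pushes every weight toward zero on each update.
    return w_ji - eta * (dE_dw + 2.0 * gamma * w_ji)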