Connectionist Models
Consider humans
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4 to 10^5
• Scene recognition time ~0.1 second
• 100 inference steps does not seem like enough
→ must use lots of parallel computation!
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw
sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
ALVINN drives 70 mph on highways
[Figure: ALVINN's network; a 30x32 sensor input retina feeds 4 hidden units, whose outputs span the steering range from Sharp Left through Straight Ahead to Sharp Right]
Perceptron
[Figure: perceptron unit; inputs x_1, ..., x_n (plus constant x_0 = 1) with weights w_0, w_1, ..., w_n feed a summation \sum_{i=0}^{n} w_i x_i followed by a threshold]

o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}

Sometimes we will use simpler vector notation:

o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}
Decision Surface of Perceptron
[Figure: two decision surfaces in the (x_1, x_2) plane: a linearly separable set of examples, and a set that no single line can separate]
Represents some useful functions
• What weights represent g(x1,x2) = AND(x1,x2)?
But some functions not representable
• e.g., not linearly separable
• therefore, we will want networks of these ...
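As a concrete answer to the AND question, here is a minimal Python sketch (the weight values are one valid choice of mine, not prescribed by the slides):

# A perceptron computing AND(x1, x2): the net input exceeds 0 only when both inputs are 1.
def perceptron(x, w):
    # Threshold unit: 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1

w_and = [-0.8, 0.5, 0.5]  # one valid choice: w0 (bias), w1, w2
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w_and))  # -1, -1, -1, 1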
Perceptron Training Rule
w_i \leftarrow w_i + \Delta w_i
where \Delta w_i = \eta (t - o) x_i

Where:
• t = c(\vec{x}) is the target value
• o is the perceptron output
• \eta is a small constant (e.g., 0.1) called the learning rate

Can prove it will converge
• If training data is linearly separable
• and \eta is sufficiently small
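A minimal Python sketch of this rule (variable names, eta, and the epoch count are my choices):

# Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
# Each example is (x, t) with x including the constant x0 = 1 and t in {-1, +1}.
def train_perceptron(examples, eta=0.1, epochs=100):
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, t in examples:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            for i in range(len(w)):
                w[i] += eta * (t - o) * x[i]  # zero change when o == t
    return w

# AND, with x0 = 1 prepended to each input vector
print(train_perceptron([((1, 0, 0), -1), ((1, 0, 1), -1),
                        ((1, 1, 0), -1), ((1, 1, 1), 1)]))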
Gradient Descent
To understand, consider a simpler linear unit, where

o = w_0 + w_1 x_1 + \cdots + w_n x_n

Idea: learn w_i's that minimize the squared error

E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

Where D is the set of training examples.
Gradient Descent
Gradient:

\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]

Training rule:

\Delta \vec{w} = -\eta \nabla E[\vec{w}]

i.e.,

\Delta w_i = -\eta \frac{\partial E}{\partial w_i}
Gradient Descent
\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2
= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)

\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)(-x_{i,d})
Gradient Descent
GRADIENT-DESCENT(training_examples, \eta)

Each training example is a pair of the form \langle \vec{x}, t \rangle, where \vec{x} is the vector of input values and t is the target output value. \eta is the learning rate (e.g., 0.05).

- Initialize each w_i to some small random value
- Until the termination condition is met, do
  - Initialize each \Delta w_i to zero
  - For each \langle \vec{x}, t \rangle in training_examples, do
    * Input the instance \vec{x} to the unit and compute the output o
    * For each linear unit weight w_i, do
      \Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i
  - For each linear unit weight w_i, do
    w_i \leftarrow w_i + \Delta w_i
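The procedure translates almost line for line into Python (a sketch; the termination condition here is simply a fixed iteration count, my choice):

import random

def gradient_descent(training_examples, eta=0.05, iterations=1000):
    # Batch gradient descent for a linear unit o = w . x.
    # Each example is (x, t) with x including the constant x0 = 1.
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]  # small random values
    for _ in range(iterations):
        dw = [0.0] * n                                   # each delta-w_i to zero
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))     # linear unit output
            for i in range(n):
                dw[i] += eta * (t - o) * x[i]
        for i in range(n):
            w[i] += dw[i]                                # apply accumulated update
    return w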
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with
minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
Incremental (Stochastic) Gradient Descent
Batch mode Gradient Descent:
Do until satisfied:
1. Compute the gradient \nabla E_D[\vec{w}]
2. \vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]

where E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

Incremental mode Gradient Descent:
Do until satisfied:
- For each training example d in D
  1. Compute the gradient \nabla E_d[\vec{w}]
  2. \vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]

where E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2
Incremental Gradient Descent can approximate Batch Gradient
Descent arbitrarily closely if η made small enough
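In code, the incremental version just moves the weight update inside the per-example loop (a sketch, same conventions as the batch version above):

def incremental_gradient_descent(training_examples, eta=0.05, iterations=1000):
    # Stochastic variant: step on the gradient of E_d after every example,
    # rather than summing gradients over all of D first.
    n = len(training_examples[0][0])
    w = [0.0] * n
    for _ in range(iterations):
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                w[i] += eta * (t - o) * x[i]  # immediate per-example update
    return w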
Sigmoid Unit
[Figure: sigmoid unit; inputs x_1, ..., x_n (plus constant x_0 = 1) with weights w_0, w_1, ..., w_n feed a summation, followed by a sigmoid]

net = \sum_{i=0}^{n} w_i x_i

o = \sigma(net) = \frac{1}{1 + e^{-net}}

\sigma(x) is the sigmoid function: \frac{1}{1 + e^{-x}}

Nice property: \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))

We can derive gradient descent rules to train:
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
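The "nice property" is easy to confirm numerically (a quick sketch; the test point is arbitrary):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Check d(sigma)/dx = sigma(x) * (1 - sigma(x)) with a central finite difference.
x, h = 0.7, 1e-6
analytic = sigmoid(x) * (1 - sigmoid(x))
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(analytic, numeric)  # the two values agree to many decimal places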
The Sigmoid Function
[Plot: \sigma(x) = \frac{1}{1 + e^{-x}}; output (0 to 1) versus net input (-6 to 6)]
Sort of a rounded step function
Unlike the step function, it is differentiable (which makes gradient-based learning possible)
Error Gradient for a Sigmoid Unit
\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)
= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}

But we know:

\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)

\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}

So:

\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}
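The final formula can be checked against a finite-difference gradient (a sketch with made-up weights and data):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def error(w, data):  # E = 1/2 * sum_d (t_d - o_d)^2
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in data)

data = [((1.0, 0.5, -1.2), 1.0), ((1.0, -0.3, 0.8), 0.0)]  # (x, t), x0 = 1
w = [0.1, -0.2, 0.4]
i, h = 1, 1e-6

# Analytic: dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}
analytic = -sum((t - output(w, x)) * output(w, x) * (1 - output(w, x)) * x[i]
                for x, t in data)
w_hi = [wj + (h if j == i else 0.0) for j, wj in enumerate(w)]
w_lo = [wj - (h if j == i else 0.0) for j, wj in enumerate(w)]
numeric = (error(w_hi, data) - error(w_lo, data)) / (2 * h)
print(analytic, numeric)  # should agree closely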
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, do
  For each training example, do
    1. Input the training example to the network and compute the outputs
    2. For each output unit k:
       \delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)
    3. For each hidden unit h:
       \delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k} \delta_k
    4. Update each network weight w_{i,j}:
       w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}
       where \Delta w_{i,j} = \eta \delta_j x_{i,j}
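A compact Python sketch of steps 1-4 for a single hidden layer (variable names and the two-layer restriction are mine; the slide's algorithm applies to layered feedforward networks generally):

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop(examples, n_in, n_hidden, n_out, eta=0.1, epochs=1000):
    # Each example is (x, t); a bias input of 1 is prepended here.
    # w_hid[j][i]: weight into hidden unit j from input i (i = 0 is the bias).
    w_hid = [[random.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
             for _ in range(n_hidden)]
    w_out = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]
             for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            # 1. forward pass: compute hidden and output activations
            xb = [1.0] + list(x)
            h = [sigmoid(sum(wi * xi for wi, xi in zip(wj, xb))) for wj in w_hid]
            hb = [1.0] + h
            o = [sigmoid(sum(wi * hi for wi, hi in zip(wk, hb))) for wk in w_out]
            # 2. output-unit errors: delta_k = o_k(1 - o_k)(t_k - o_k)
            d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # 3. hidden-unit errors: delta_h = o_h(1 - o_h) * sum_k w_hk delta_k
            d_hid = [h[j] * (1 - h[j]) *
                     sum(w_out[k][j + 1] * d_out[k] for k in range(n_out))
                     for j in range(n_hidden)]
            # 4. weight updates: w_ij <- w_ij + eta * delta_j * x_ij
            for k in range(n_out):
                for i in range(n_hidden + 1):
                    w_out[k][i] += eta * d_out[k] * hb[i]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hid[j][i] += eta * d_hid[j] * xb[i]
    return w_hid, w_out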
More on Backpropagation
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global error minimum
– In practice, often works well (can run multiple times)
• Often include weight momentum α
• Minimizes error over training examples
• Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
– Using network after training is fast
\Delta w_{i,j}(n) = \eta \delta_j x_{i,j} + \alpha \Delta w_{i,j}(n-1)
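In code, momentum just means remembering each weight's previous update (a sketch):

def momentum_update(eta, delta_j, x_ij, alpha, prev_dw):
    # delta_w(n) = eta * delta_j * x_ij + alpha * delta_w(n - 1)
    dw = eta * delta_j * x_ij + alpha * prev_dw
    return dw  # caller adds dw to the weight and stores it for the next step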
Output Unit Error during Training
[Plot: sum of squared errors for each output unit versus training iterations (0 to 2500)]
Hidden Unit Encoding
[Plot: hidden unit encoding for one input versus training iterations (0 to 2500)]
Input to Hidden Weights
[Plot: weights from the inputs to one hidden unit versus training iterations (0 to 2500)]
Convergence of Backpropagation
Gradient descent to some local minimum
• Perhaps not global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster
convergence
• Can train multiple networks and get different results (using
different initial weights)
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks near-linear
• Increasingly non-linear functions as training progresses
Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by network
with a single hidden layer
• But that might require a number of hidden units exponential in the
number of inputs
Continuous functions:
• Every bounded continuous function can be approximated
with arbitrarily small error by a network with one hidden
layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by
a network with two hidden layers [Cybenko 1988]
Overfitting in ANNs
[Plot: error versus number of weight updates (example 1); training-set and validation-set error over 0 to 20000 updates]
Overfitting in ANNs
[Plot: error versus number of weight updates (example 2); training-set and validation-set error over 0 to 6000 updates]
Neural Nets for Face Recognition
[Figure: face-recognition network with 30x32 inputs and four outputs (left, straight, right, up); typical input images shown]
90% accurate at learning head pose, and recognizing 1-of-20 faces
Alternative Error Functions
Penalize large weights:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2

Train on target slopes as well as values:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]

Tie together weights:
• e.g., in phoneme recognition
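For the weight penalty, the only change to training is an extra shrinkage term in each weight's gradient, since the derivative of \gamma \sum w_{ji}^2 with respect to w_{ji} is 2 \gamma w_{ji} (a sketch; gamma is the coefficient from the formula above):

def decayed_update(w_ji, dE_dw, eta=0.1, gamma=0.001):
    # Gradient step on the penalized error: the 2*gamma*w_ji term
    # pushes every weight toward zero on each update.
    return w_ji - eta * (dE_dw + 2.0 * gamma * w_ji)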