4. Feedforward- vs. recurrent NN
[Diagrams: a feedforward network and a recurrent network, each with input and output units]
• connections only "from
left to right", no
connection cycle
• activation is fed forward
from input to output
through "hidden layers"
• no memory
• at least one connection
cycle
• activation can
"reverberate", persist
even with no input
• system with memory
5. recurrent NNs, main properties
• input time series → output time series
• can approximate any dynamical system
(universal approximation property)
• mathematical analysis difficult
• learning algorithms computationally
expensive and difficult to master
• few application-oriented publications, little
research
6. Supervised training of RNNs
A. Training
A. Training
[Diagram: teacher provides paired input ("in") and output ("out") sequences; the model is trained to reproduce the output from the input]
B. Exploitation
[Diagram: a new input is given; the correct output is unknown; the trained model produces its own output]
7. Backpropagation through time (BPTT)
• Most widely used general-
purpose supervised training
algorithm
• Idea: 1. stack network
copies, 2. interpret as
feedforward network, 3. use
backprop algorithm.
[Diagram: the original RNN unfolded into a stack of copies, one per time step]
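As a concrete illustration of the stacking idea, here is a minimal BPTT sketch for a plain tanh RNN with a linear readout and squared error; the network form and the function name are assumptions for illustration, not the networks discussed on the later slides.

```python
import numpy as np

def bptt_gradients(W_in, W, W_out, u, d):
    """Gradients of 0.5*sum_n ||y(n)-d(n)||^2 for the simple RNN
    x(n+1) = tanh(W_in u(n) + W x(n)),  y(n) = W_out x(n+1).
    u, d are sequences of input / teacher vectors."""
    N = W.shape[0]
    xs, ys = [np.zeros(N)], []
    for n in range(len(u)):                       # forward pass: the "stack of copies"
        xs.append(np.tanh(W_in @ u[n] + W @ xs[-1]))
        ys.append(W_out @ xs[-1])
    dW_in, dW, dW_out = np.zeros_like(W_in), np.zeros_like(W), np.zeros_like(W_out)
    dx_next = np.zeros(N)
    for n in reversed(range(len(u))):             # backward pass through the stack
        e = ys[n] - d[n]                          # output error at step n
        dW_out += np.outer(e, xs[n + 1])
        dz = (W_out.T @ e + dx_next) * (1 - xs[n + 1] ** 2)   # through the tanh
        dW_in += np.outer(dz, u[n])
        dW += np.outer(dz, xs[n])
        dx_next = W.T @ dz                        # error flows one copy further back in time
    return dW_in, dW, dW_out
```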
8. What are ESNs?
• training method for
recurrent neural
networks
• black-box modelling of
nonlinear dynamical
systems
• supervised training,
offline and online
• exploits linear methods
for nonlinear modeling
[Diagrams contrasting previous RNN training with ESN training]
9. Introductory example: a tone generator
Goal: train a network to work as a tuneable tone
generator
input: frequency
setting
output: sines of
desired frequency
[Plots: stepwise frequency-setting input and the desired sine output]
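A minimal sketch of how training data of this kind could be generated (the ranges, hold time, and amplitude below are assumptions, not values from the slides):

```python
import numpy as np

steps, hold = 300, 100
# stepwise-constant frequency setting (the input), held for `hold` steps at a time
setting = np.repeat(np.random.uniform(0.1, 0.4, steps // hold), hold)
# desired output: a sine whose instantaneous frequency (cycles/step) follows the setting
target = 0.5 * np.sin(2 * np.pi * np.cumsum(setting))
```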
12. Tone generator: exploitation
• With new output weights in place, drive trained network with input.
• Observation: network continues to function as in training.
– internal states reflect input and output
– output is reconstituted from internal states
• internal states and output create each other
[Plots: input, output, and some internal states during exploitation; the internal states "echo" the driving signals and the output is "reconstituted" from them]
13. Tone generator: generalization
The trained generator network also works with input different from training input
[Plots: A. step input, B. teacher and learned output, C. some internal states]
14. Dynamical reservoir
• large recurrent network (100+ units)
• works as "dynamical reservoir", "echo chamber"
• units in DR respond differently to excitation
• output units combine different internal dynamics into desired dynamics
[Diagram: input units → recurrent "dynamical reservoir" → output units]
16. Notation and Update Rules
• K input units with signal $u(n) = (u_1(n), \ldots, u_K(n))'$
• N internal (DR) units with state $x(n) = (x_1(n), \ldots, x_N(n))'$
• L output units with signal $y(n) = (y_1(n), \ldots, y_L(n))'$
• weight matrices: input $W^{in} = (w^{in}_{ij})$, internal $W = (w_{ij})$, output feedback $W^{back} = (w^{back}_{ij})$, output $W^{out} = (w^{out}_{ij})$

Update rules:
$x(n+1) = f\,(W^{in} u(n+1) + W x(n) + W^{back} y(n))$
$y(n+1) = f^{out}(W^{out}(u(n+1), x(n+1), y(n)))$
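Transcribed directly into NumPy, one update step might look like this (the sizes K, N, L and the choices f = tanh, f^out = identity are assumptions for the sketch):

```python
import numpy as np

K, N, L = 1, 100, 1                       # input, reservoir, output sizes (assumed)
rng = np.random.default_rng(0)
W_in   = rng.uniform(-1, 1, (N, K))
W      = rng.uniform(-1, 1, (N, N))       # note: should also be scaled to spectral radius < 1 (slide 20)
W_back = rng.uniform(-1, 1, (N, L))
W_out  = np.zeros((L, K + N + L))         # the only weights that will be learned

def esn_step(x, y, u_next, f=np.tanh, f_out=lambda z: z):
    """One update:
    x(n+1) = f(W_in u(n+1) + W x(n) + W_back y(n))
    y(n+1) = f_out(W_out (u(n+1), x(n+1), y(n)))"""
    x_next = f(W_in @ u_next + W @ x + W_back @ y)
    y_next = f_out(W_out @ np.concatenate([u_next, x_next, y]))
    return x_next, y_next
```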
17. Learning: basic idea
Every stationary deterministic dynamical system can be defined
by an equation like
$d(t) = h(u(t), u(t-1), \ldots, d(t-1), d(t-2), \ldots),$
where the system function h might be a monster.
Combine h from the I/O echo functions by selecting
suitable DR-to-output weights $w_i$:
$d(t) \approx y(t) = \sum_i w_i x_i(t) = \sum_i w_i h_i(u(t), u(t-1), \ldots, y(t-1), \ldots)$
[Diagram: DR states $x_i(t)$, driven by input $u(t)$, are combined through the output weights $w_i$ into $y(t)$]
18. Offline training: task definition
Recall $y(t) = \sum_i w_i x_i(t)$. Let $d(t)$ be the teacher output.
Compute weights $w_i$ such that the mean square error
$E[(d(t) - y(t))^2] = E[(d(t) - \sum_i w_i x_i(t))^2]$
is minimized.
19. Offline training: how it works
1. Let the network run with the training signal $d(t)$ teacher-forced.
2. During this run, collect the network states $x_i(t)$ in a matrix M.
3. Compute weights $w_i$ such that $E[(d(t) - \sum_i w_i x_i(t))^2]$ is minimized.
The MSE-minimizing weight computation (step 3) is a standard operation:
$w = M^{-1} T$ (where $M^{-1}$ denotes the pseudoinverse of the state-collection matrix and T the teacher outputs).
Many efficient implementations are available, offline/constructive and online/adaptive.
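A sketch of steps 1-3 under the update rule of slide 16, with teacher forcing and a pseudoinverse solution (function names are illustrative; an initial washout period would normally be discarded but is omitted here):

```python
import numpy as np

def collect_states(u_seq, d_seq, W_in, W, W_back, f=np.tanh):
    """Steps 1-2: run the reservoir teacher-forced (feed the teacher d(n) back
    instead of the model output) and collect the states x(n) in a matrix M."""
    x = np.zeros(W.shape[0])
    y = np.atleast_1d(d_seq[0]) * 0.0
    M = []
    for u, d in zip(u_seq, d_seq):
        x = f(W_in @ u + W @ x + W_back @ y)
        M.append(x.copy())
        y = np.atleast_1d(d)                  # teacher forcing
    return np.array(M)                        # shape: time steps x N

def train_output_weights(M, teacher):
    """Step 3: w = M^{-1} T with M^{-1} the pseudoinverse,
    minimizing E[(d(t) - sum_i w_i x_i(t))^2]."""
    return np.linalg.pinv(M) @ np.asarray(teacher)
```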
20. Practical Considerations
• $W^{in}$, $W$, $W^{back}$ are chosen randomly
• Spectral radius of W < 1
• W should be sparse
• Input and feedback weights have to be scaled
“appropriately”
• Adding noise in the update rule can increase
generalization performance
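A sketch of a reservoir set up along these lines (the density, scaling factors, and noise level below are illustrative assumptions):

```python
import numpy as np

def make_reservoir(N=100, K=1, L=1, density=0.1, spectral_radius=0.8,
                   input_scale=1.0, feedback_scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, (N, N)) * (rng.random((N, N)) < density)   # sparse random W
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))             # enforce spectral radius < 1
    W_in = input_scale * rng.uniform(-1, 1, (N, K))                   # input weights, scaled "appropriately"
    W_back = feedback_scale * rng.uniform(-1, 1, (N, L))              # feedback weights, scaled "appropriately"
    return W_in, W, W_back

# During training, a small noise term can be added in the state update, e.g.:
# x = np.tanh(W_in @ u + W @ x + W_back @ y) + noise_level * (np.random.rand(N) - 0.5)
```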
21. Echo state network training, summary
• use large recurrent network as "excitable
dynamical reservoir (DR)"
• DR is not modified through learning
• adapt only DR output weights
• thereby combine desired system function from I/O
history echo functions
• use any offline or online linear regression algorithm
to minimize error
$E[(d(t) - y(t))^2]$
29. Identifying higher-order nonlinear systems
A tenth-order system:
$y(n+1) = 0.3\, y(n) + 0.05\, y(n) \sum_{k=0}^{9} y(n-k) + 1.5\, u(n-9)\, u(n) + 0.1$
[Plots: excerpts of the input u(n) and the system output y(n)]
Training setup: the network is driven by the input u(n); y(n) is the teacher output.
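A sketch for generating data from this system (the input distribution, uniform on [0, 0.5], is an assumption, not stated on the slide):

```python
import numpy as np

def tenth_order_system(T, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 0.5, T)                          # assumed input distribution
    y = np.zeros(T)
    for n in range(9, T - 1):
        y[n + 1] = (0.3 * y[n]
                    + 0.05 * y[n] * y[n - 9:n + 1].sum()  # sum_{k=0..9} y(n-k)
                    + 1.5 * u[n - 9] * u[n]
                    + 0.1)
    return u, y
```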
33. Results for τ = 17
Error for 84-step prediction:
NRMSE = 10^-4.2
(averaged over 100 training runs on independently created data)
With refined training method: NRMSE = 10^-5.1
Previous best: NRMSE = 10^-1.7
[Attractor plots: original system vs. learnt model]
37. Dynamic pattern detection1)
Training signal:
output jumps to 1 after occurrence of pattern instance in input
[Plot: input u(n) and teacher output y(n)]
1) see GMD Report Nr 152 for detailed coverage
38. Single-instance patterns, training setup
1. A single-instance, 10-
step pattern is randomly
fixed
[Plot: the fixed 10-step pattern]
2. It is inserted into 500-
step random signal at
positions
200 (for training)
350, 400, 450, 500 (for
testing)
3. 100-unit ESN trained on first 300 steps (single positive instance! "single-shot learning"), tested on remaining 200 steps
[Plot of test data: 200 steps with 4 occurrences of the pattern on random background; desired output: red impulses]
39. Single-instance patterns, results
[Plots: trained network response on the 200-step test data for the four cases below]
1. trained network response on test data: DR = 12.4
2. network response after training on 800 more pattern-free steps ("negative examples"): DR = 12.1
3. like 2., but 5 positive examples in training data: DR = 6.4
4. comparison: optimal linear filter: DR = 3.5
discrimination ratio DR: ratio of $E[d(n)^2]$ at pattern positions to $E[d(n)^2]$ at non-pattern positions
40. Event detection for robots
(joint work with J.Hertzberg & F. Schönherr)
Robot runs through office environment, experiences
data streams (27 channels) like...
[Example traces (10 sec window): infrared distance sensor, left motor speed, activation of "goThruDoor", external teacher signal marking the event category]
41. Learning setup
[Diagram: 27 (raw) data channels → 100-unit RNN → unlimited number of event detector channels]
• simulated robot (rich
simulation)
• training run spans 15
simulated minutes
• event categories like
• pass through door
• pass by 90° corner
• pass by smooth corner
42. Results
• easy to train event hypothesis signals
• "boolean" categories possible
• single-shot learning possible
43. Network setup in training
[Diagram: 29 input channels coding the symbols a, ..., z, _ → 400-unit DR → 29 output channels for next-symbol hypotheses]
44. Trained network in "text" generation
[Diagram: the output hypotheses feed a decision mechanism, e.g. winner-take-all; the winning symbol is fed back as the next input]
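The two selection schemes compared on the next slide can be sketched as follows (the exact symbol set and the clipping of negative activations are assumptions):

```python
import numpy as np

SYMBOLS = list("abcdefghijklmnopqrstuvwxyz_,.")      # 29 code symbols (assumed set)

def next_symbol(outputs, mode="winner"):
    """Pick the next symbol from the 29 output activations."""
    a = np.clip(np.asarray(outputs), 0.0, None)      # assume negative activations are clipped
    if mode == "winner":                             # winner-take-all
        return SYMBOLS[int(np.argmax(a))]
    p = a / a.sum()                                  # random draw according to output
    return SYMBOLS[np.random.choice(len(SYMBOLS), p=p)]
```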
45. Results
Selection by random draw according to output
yth_upsghteyshhfakeofw_io,l_yodoinglle_d_upeiuttytyr_hsymua_doey_sa
mmusos_trll,t.krpuflvek_hwiblhooslolyoe,_wtheble_ft_a_gimllveteud_ ...
Winner-take-all selection
sdear_oh,_grandmamma,_who_will_go_and_the_wolf_said_the_wolf_said
_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf ...
48. Multiple time scales
This is hard to learn (Laser benchmark time series):
[Plot: laser benchmark time series]
Reason: 2 widely separated time scales
Approach for future research: ESNs with different
time constants in their units
49. Additive dynamics
This proved impossible to learn:
Reason: requires 2 independent oscillators; but in
ESN all dynamics are mutually coupled.
Approach for future research: modular ESNs and
unsupervised multiple expert learning
$y(n) = \sin(0.2\, n) + \sin(0.311\, n)$
[Plot: the resulting additive two-sine signal]
50. "Switching" memory
This FSA has long memory "switches":
Generating such sequences not possible with monotonic, area-
bounded forgetting curves!
[Diagram: finite-state automaton with transitions labeled a, b, c]
baaa....aaacaaa...aaabaaa...aaacaaa...aaa...
[Sketch: forgetting curve with bounded area vs. the required unbounded width]
An ESN simply is not a model for long-term memory!
51. High-dimensional dynamics
High-dimensional dynamics would require very large ESN.
Example: 6-DOF nonstationary time series one-step prediction
200-unit ESN: RMS = 0.2; 400-unit network: RMS = 0.1; best other
training technique1): RMS = 0.02
Approach for future research: task-specific optimization of ESN
[Plot: excerpt of the time series]
$y(n) = a\,u_1^2 + b\,u_2^2 + c\,u_1 u_2 + d\,u_1 + e\,u_2 + f$
1) Prokhorov et al., extended Kalman filtering BPTT. Network size 40, 1400 trained links, training time 3 weeks.
52. Spreading trouble...
• Signals $x_i(n)$ of the reservoir can be interpreted as vectors in (infinite-dimensional) signal space
• Correlation E[xy] yields an inner product < x, y > on this space
• Output signal y(n) is a linear combination of these $x_i(n)$
• The more orthogonal the $x_i(n)$, the smaller the output weights:
[Figure: nearly parallel $x_1$, $x_2$ require large output weights (e.g. $y = 30 x_1 - 28 x_2$); nearly orthogonal $x_1$, $x_2$ require small ones (e.g. $y = 0.5 x_1 + 0.7 x_2$)]
53. • Eigenvectors $v_k$ of the correlation matrix R = ($E[x_i x_j]$) are orthogonal signals
• Eigenvalues $\lambda_k$ indicate what "mass" of the reservoir signals $x_i$ (all together) is aligned with $v_k$
• Eigenvalue spread $\lambda_{max}/\lambda_{min}$ indicates the overall "non-orthogonality" of the reservoir signals
[Figure: $v_{max}$, $v_{min}$ for correlated signals ($\lambda_{max}/\lambda_{min} \approx 20$) vs. nearly orthogonal signals ($\lambda_{max}/\lambda_{min} \approx 1$)]
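A sketch of measuring this spread from a matrix of collected reservoir states (one row per time step):

```python
import numpy as np

def eigenvalue_spread(states):
    """lambda_max / lambda_min of the state correlation matrix R = E[x_i x_j]."""
    X = np.asarray(states)
    R = X.T @ X / len(X)             # empirical correlation matrix of the reservoir signals
    eig = np.linalg.eigvalsh(R)      # R is symmetric, eigenvalues in ascending order
    return eig[-1] / eig[0]          # may blow up if the smallest eigenvalue is near zero
```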
54. Large eigenvalue spread
⇒ large output weights ...
• harmful for generalization, because slight changes in reservoir signals will induce large changes in output
• harmful for model accuracy, because estimation error contained in the reservoir signals is magnified (does not apply to deterministic systems)
• renders LMS online adaptive learning useless
[Figure: correlated reservoir signals with $\lambda_{max}/\lambda_{min} \approx 20$]
55. Summary
• Basic idea: dynamical reservoir of echo states +
supervised teaching of output connections.
• Seemed difficult: in nonlinear coupled systems,
every variable interacts with every other. BUT
seen the other way round, every variable rules and
echoes every other. Exploit this for local learning
and local system analysis.
• Echo states shape the tool for the solution from
the task.
57. References
• H. Jaeger (2002): Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the "echo state network" approach. GMD Report 159, German National Research Center for Information Technology, 2002.
• Slides used by Herbert Jaeger at IK2002.