1. Continual Reinforcement Learning with Complex Synapses
Christos Kaplanis, Murray Shanahan, Claudia Clopath
presentation by Jia-Qi Yang
LAMDA Group
2. Idea
Catastrophic forgetting is a common problem in reinforcement
learning (non-stationary and correlated experiences, neural network
function approximation).
Replay buffers: do not scale well.
A possible solution: store the parameters in units that themselves
have a memory function.
A biologically plausible synaptic model: the Benna-Fusi model (2016,
Nature Neuroscience).
3. The Benna-Fusi Model
Maximise the expected signal-to-noise ratio (SNR) of memories over
time in a population of synapses undergoing continual plasticity in
the form of random, uncorrelated modifications:
$$w(t) = \sum_{t' < t} \Delta w(t')\, r(t - t')$$

The maximum SNR is achieved when $r(t) \sim t^{-1/2}$
(power-law decay).
Impractical to implement directly: computing $w(t)$ requires storing
every past modification.
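To see why, here is a minimal sketch of the naive implementation (function name and the update distribution are illustrative assumptions): both memory and compute grow linearly with the history of the synapse.

```python
import numpy as np

def weight_with_power_law_memory(deltas, t):
    """Naive power-law memory: w(t) = sum over t' < t of dw(t') * r(t - t'),
    with r(tau) ~ tau^(-1/2). Every past modification must be kept."""
    ages = t - np.arange(len(deltas))            # tau = t - t' for each update
    return np.sum(deltas * ages ** -0.5)

rng = np.random.default_rng(0)
deltas = rng.choice([-1.0, 1.0], size=10_000)    # random, uncorrelated updates
w = weight_with_power_law_memory(deltas, t=10_000)
```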
4. The Benna-Fusi Model
Power-law decay can be approximated by a synaptic model
consisting of a finite chain of N communicating dynamic variables,
whose dynamics are defined as:
$$C_k \frac{du_k}{dt} = g_{k-1,k}(u_{k-1} - u_k) + g_{k,k+1}(u_{k+1} - u_k)$$
This is an ordinary differential equation (ODE) and can be integrated
with the Euler method.
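A minimal Euler-integration sketch of one such chain. Here u[0] is the visible weight; the constants (capacities C_k doubling and conductances g_{k,k+1} halving along the chain, plus a leak at the end, i.e. u_{N+1} = 0) are illustrative assumptions, not necessarily the paper's exact choices.

```python
import numpy as np

class BennaFusiSynapse:
    """Chain of N communicating variables for a single weight (sketch).
    u[0] is the visible weight; deeper variables remember longer timescales."""

    def __init__(self, n_vars=8, g12=0.01, dt=1.0):
        self.u = np.zeros(n_vars)
        self.C = 2.0 ** np.arange(n_vars)        # assumed: C_k doubles along chain
        self.g = g12 / 2.0 ** np.arange(n_vars)  # assumed: g_{k,k+1} halves
        self.dt = dt

    def step(self, external_input=0.0):
        """One Euler step; plasticity updates enter as external_input to u_1."""
        u, g, C = self.u, self.g, self.C
        du = np.zeros_like(u)
        du[0] = external_input + g[0] * (u[1] - u[0])
        for k in range(1, len(u) - 1):
            du[k] = g[k - 1] * (u[k - 1] - u[k]) + g[k] * (u[k + 1] - u[k])
        du[-1] = g[-2] * (u[-2] - u[-1]) + g[-1] * (0.0 - u[-1])  # leak: u_{N+1} = 0
        self.u = u + self.dt * du / C
        return self.u[0]                         # current visible weight
```

With no further input, u[0] then decays approximately as a power law, which is the behaviour the chain is designed to approximate.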
5. The Benna-Fusi Model: Visualization
Figure 1: liquid flowing between a series of beakers of increasing size and decreasing tube widths.
6. Reinforcement Learning
Q-learning:

$$Q(s, a) = \mathbb{E}_\pi\!\left[\,\sum_{i=t}^{\infty} \gamma^{\,i-t} r_i \,\middle|\, s_t = s,\ a_t = a\right]$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta\left[r_t + \gamma V(s_{t+1}) - Q(s_t, a_t)\right]$$
Deep Q-learning (DQN): fit V(s) and Q(s, a) with a neural network,
i.e. use V(s; θ) and Q(s, a; θ).
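For concreteness, a minimal sketch of the tabular update above, taking V(s_{t+1}) as max over a' of Q(s_{t+1}, a') (standard Q-learning); all names and defaults are illustrative.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, eta=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + eta * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += eta * (td_target - Q[s, a])
    return Q
```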
7. Some details
Eligibility traces: used only in the tabular case.
Deep Q-learning: target network, replay buffer, soft Q-learning,
task-specific gains and biases.
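Soft Q-learning replaces the hard max with a log-sum-exp value and a Boltzmann policy. A minimal sketch (the inverse temperature beta=5.0 is an arbitrary illustration value):

```python
import numpy as np

def soft_value(q_row, beta=5.0):
    """V(s) = (1/beta) * log sum_a exp(beta * Q(s,a));
    approaches max_a Q(s,a) as beta -> infinity."""
    z = beta * np.asarray(q_row)
    z_max = z.max()
    return (z_max + np.log(np.exp(z - z_max).sum())) / beta

def soft_policy(q_row, beta=5.0):
    """pi(a|s) proportional to exp(beta * Q(s,a))."""
    z = beta * np.asarray(q_row)
    p = np.exp(z - z.max())
    return p / p.sum()
```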
8. Experiments
• Continual Q-learning (tabular Q-values)
• Continual Multi-task Deep RL (DQN, unrelated tasks)
• Continual Learning within a Single Task (without a replay buffer)
9. Continual Q-learning
A 10x10 grid map with 5 actions.
Two tasks:
1. the reward is located at the upper-right corner.
2. the reward is located at the bottom-left corner.
Tasks alternate every 10,000 episodes.
Q-values are memorized directly in a table (tabular); a minimal sketch
of the protocol follows.
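A sketch of the alternating-goal protocol with plain tabular Q-learning (grid layout, reward of 1 at the goal, epsilon-greedy exploration, and all constants are illustrative assumptions; in the paper each Q-value would additionally be consolidated through a Benna-Fusi chain, omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
N, ACTIONS = 10, 5                          # 10x10 grid, 5 actions
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]  # up, down, left, right, stay
GOALS = [(0, N - 1), (N - 1, 0)]            # task 1: upper right; task 2: bottom left
Q = np.zeros((N, N, ACTIONS))

def run_episode(goal, eta=0.1, gamma=0.9, eps=0.1, max_steps=200):
    s = (int(rng.integers(N)), int(rng.integers(N)))
    for _ in range(max_steps):
        a = int(rng.integers(ACTIONS)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2 = (min(max(s[0] + MOVES[a][0], 0), N - 1),
              min(max(s[1] + MOVES[a][1], 0), N - 1))
        r = 1.0 if s2 == goal else 0.0
        Q[s][a] += eta * (r + gamma * np.max(Q[s2]) - Q[s][a])
        s = s2
        if r > 0:
            break

for episode in range(40_000):
    task = (episode // 10_000) % 2          # switch reward location every 10,000 episodes
    run_episode(GOALS[task])
```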
11. Continual Multi-task Deep RL
Two completely different tasks:
1. Cart-Pole.
2. Catcher.
Continuous observation spaces and discrete action spaces → DQN.
The Benna-Fusi model memorizes the parameters of the DQN (sketch
below).
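A minimal PyTorch sketch of attaching a Benna-Fusi chain to every DQN parameter (class name, constants, and the leak boundary mirror the earlier chain sketch and are illustrative assumptions): gradient updates move u_1, and the chain dynamics consolidate it between steps.

```python
import torch

class BennaFusiConsolidator:
    """One chain of slow variables per network parameter (sketch)."""

    def __init__(self, params, n_vars=8, g12=0.001, dt=1.0):
        self.params = list(params)
        self.chains = [[torch.zeros_like(p) for _ in range(n_vars)]
                       for p in self.params]
        self.g = [g12 / 2.0 ** k for k in range(n_vars)]  # assumed halving
        self.C = [2.0 ** k for k in range(n_vars)]        # assumed doubling
        self.dt = dt

    @torch.no_grad()
    def step(self):
        for p, u in zip(self.params, self.chains):
            u[0].copy_(p)                     # gradient updates enter at u_1
            du = [torch.zeros_like(v) for v in u]
            du[0] = self.g[0] * (u[1] - u[0])
            for k in range(1, len(u) - 1):
                du[k] = (self.g[k - 1] * (u[k - 1] - u[k])
                         + self.g[k] * (u[k + 1] - u[k]))
            du[-1] = self.g[-2] * (u[-2] - u[-1]) - self.g[-1] * u[-1]  # leak
            for k, v in enumerate(u):
                v.add_(self.dt / self.C[k] * du[k])       # Euler step
            p.copy_(u[0])                     # write consolidated value back
```

Hypothetical usage: construct `BennaFusiConsolidator(q_network.parameters())` once, then call its `step()` after every optimizer step.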
14. Continual Learning within a Single Task
Targets move during the learning process and experiences are strongly
correlated; a replay buffer is normally used to alleviate this.
Here: remove the replay buffer and learn a single task.
16. Conclusion
Works well on simple tasks.
Did not work on more complex tasks (from the Arcade Learning
Environment) → the approach is still limited to simple settings.
Low computational overhead: only 1.5-2 times slower than plain
Q-learning.