2. Contents
I. Concepts of Continual Learning
A. Problem Definition
B. Catastrophic Forgetting
C. Desiderata
II. Various Approaches
A. Regularization (EWC, SI*, SSL)
B. Dynamic Architectures (PN, DEN*)
C. Dual-Memory (PCN, GR*)
D. Others (IMM*, VCL*, GEM*)
III. Research Direction
* Not a future reading item: I have already read these papers, but I will update the slides before the presentation.
To be added (as of 02/08):
1. Experimental results of previous work
2. Limitations of previous work
3. Paper review slides (*)
4. Benchmark of continual learning
3. I. Concepts of Continual Learning
A. Problem Definition
Humans and animals have the ability to continually acquire and fine-tune knowledge
throughout their lifespans.
4. I. Concepts of Continual Learning
A. Problem Definition
Continual learning capabilities are crucial for artificial agents interacting in the real
world and processing continuous streams of information.
[Figure: a stream of additional tasks arriving over time]
5. I. Concepts of Continual Learning
B. Catastrophic Forgetting
Catastrophic Forgetting
When a neural network is used to learn a sequence of tasks, the learning of the later
tasks may degrade the performance of the models learned for the earlier tasks.
[Figure: catastrophic forgetting over time]
6. I. Concepts of Continual Learning
B. Catastrophic Forgetting
Catastrophic Forgetting
It occurs as the network parameters shift toward the optimal state for performing the
second of two successive tasks, overwriting the configuration that allowed them to
perform the first.
7. I. Concepts of Continual Learning
C. Desiderata
Desiderata of Continual Learning
● Online learning*
○ Learning occurs at every moment, with no fixed tasks or data sets and no
clear boundaries between tasks.
● Presence of transfer (forward/backward)*
○ Learning agents should be able to transfer and adapt what they have learned
from previous experience, as well as make use of more recent experience to
improve performance on capabilities learned earlier.
* Not an absolute criterion
8. I. Concepts of Continual Learning
C. Desiderata
Desiderata of Continual Learning
● Resistance to catastrophic forgetting
○ New learning should not destroy performance on previously seen data.
● Bounded model size*
○ Capacity should be fixed, forcing the model to use its resources
intelligently, gracefully forgetting what it has learned.
● No direct access to previous experience
○ The model cannot access all of its past experience (data).
* Not an absolute criterion
9. Contents
I. Concepts of Continual Learning
A. Problem Definition (data, metric)
B. Catastrophic Forgetting
C. Desiderata
II. Various Approaches
A. Regularization (EWC, SI*, SSL)
B. Dynamic Architectures (PN, DEN)
C. Dual-Memory (PCN, GR)
D. Others (IMM, VCL, GEM)
III. Research Direction
10. II. Various Approaches
A. Regularization
Regularization Approaches
Regularization approaches alleviate catastrophic forgetting by imposing constraints
on the updates of the network weights.
● Paper list
○ Overcoming catastrophic forgetting in neural networks (EWC), PNAS 2017
○ Continual Learning Through Synaptic Intelligence (SI), ICML 2017
○ Selfless Sequential Learning (SSL), ICLR 2019
11. II. Various Approaches
A. Regularization
Elastic Weight Consolidation (EWC)
● Motivation
EWC is inspired by the human brain, in which synaptic consolidation enables
continual learning by reducing the plasticity of synapses relevant to previously
learned tasks. This is achieved by constraining important parameters to stay close
to their old values.
12. II. Various Approaches
A. Regularization
Elastic Weight Consolidation (EWC)
Computationally, EWC adds a quadratic penalty on the difference between the
parameters for the old and the new tasks, which slows down learning for the
task-relevant weights that encode previously learned knowledge.
How do we choose the important parameters for the earlier tasks?
13. II. Various Approaches
A. Regularization
Elastic Weight Consolidation (EWC)
The Bayesian approach is used to measure the importance of parameters.
The relevance of the parameters θ with respect to a task's training data D is
modeled as the posterior distribution p(θ | D).
The log probability of the data given the parameters is simply the negative of the
loss function for the problem at hand: log p(D | θ) = −L(θ).
14. II. Various Approaches
A. Regularization
Elastic Weight Consolidation (EWC)
The data D is split into two independent parts, one defining task A (D_A) and the
other task B (D_B). Applying Bayes' rule:

log p(θ | D) = log p(D_B | θ) + log p(θ | D_A) − log p(D_B)

where log p(θ | D) is the posterior probability of the parameters given the entire
dataset, log p(D_B | θ) is the (negative) loss function for task B, and
log p(θ | D_A) is the posterior probability of the parameters given task A.
15. II. Various Approaches
A. Regularization
Elastic Weight Consolidation (EWC)
The posterior probability of the parameters given task A, p(θ | D_A), must contain
information about which parameters were important to task A and is the key to
implementing EWC. This posterior, however, is intractable and must be
approximated.
16. II. Various Approaches
A. Regularization
Elastic Weight Consolidation (EWC)
EWC approximates the posterior as a Gaussian distribution (Laplace's
approximation) with mean given by the task-A parameters θ*_A and a diagonal
precision given by the diagonal of the Fisher information matrix F.
Given this approximation, the objective in EWC is:

L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_{A,i})²

where L_B is the loss for task B only, λ sets the importance of the old task, and
i is the parameter index.
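A minimal PyTorch sketch of this objective, assuming the Fisher diagonal and task-A parameters were stored beforehand (names like fisher_diag, theta_star, and lam are illustrative, not from the paper's code):

```python
import torch

def ewc_loss(model, task_b_loss, fisher_diag, theta_star, lam):
    """Task-B loss plus a quadratic penalty anchoring each parameter
    to its task-A value, weighted by the Fisher diagonal."""
    penalty = sum((fisher_diag[n] * (p - theta_star[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return task_b_loss + (lam / 2.0) * penalty
```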
17. II. Various Approaches
A. Regularization
Elastic Weight Consolidation (EWC)
● Fisher information matrix
It measures the sensitivity of the model's output distribution with respect to the
parameters. The diagonal entry is:

F_i = E_{x∼D, y∼p_θ(y|x)} [ (∂ log p(y | x, θ) / ∂θ_i)² ]
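A sketch of how this diagonal could be estimated in PyTorch, using per-sample gradients of the log-likelihood with labels sampled from the model's own predictive distribution (a common choice; the loop structure and names are assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def fisher_diagonal(model, data_loader, n_samples=200):
    """Estimate F_i as the average squared per-sample gradient of
    log p(y | x, theta), with y drawn from the model's own predictions."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    seen = 0
    for x, _ in data_loader:
        for i in range(x.size(0)):                         # per-sample gradients
            model.zero_grad()
            log_probs = F.log_softmax(model(x[i:i + 1]), dim=1)
            y = log_probs.exp().multinomial(1).squeeze(1)  # y ~ p_theta(y|x)
            F.nll_loss(log_probs, y).backward()
            for n, p in model.named_parameters():
                fisher[n] += p.grad.detach() ** 2
            seen += 1
            if seen >= n_samples:
                return {n: f / seen for n, f in fisher.items()}
    return {n: f / max(seen, 1) for n, f in fisher.items()}
```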
20. II. Various Approaches
A. Regularization
Selfless Sequential Learning (SSL)
● Core motivation
In neural biology, lateral inhibition describes the process where an activated neuron reduces the
activity of its weaker neighbors. It creates a powerful decorrelated and compact representation with
minimum interference between different input patterns in the brain (Yu et al., 2014).
1. Decorrelated representation
○ A decorrelated representation is less vulnerable to catastrophic
forgetting.
2. Sparse representation
○ It leaves enough capacity for future tasks. (Selfless)
21. II. Various Approaches
A. Regularization
Selfless Sequential Learning (SSL)
● Core motivation
Orange indicates changed neuron activations as a result of the second task.
Such interference is largely reduced when imposing sparsity on the representation.
22. II. Various Approaches
A. Regularization
Selfless Sequential Learning (SSL)
● Core Idea

L(θ) + λ_Ω Ω(θ) + λ_SSL R_SSL(H_l)

where L(θ) is the loss for the current task, Ω(θ) is an additional regularizer that
penalizes changes to parameters important for earlier tasks (e.g. EWC), and
R_SSL(H_l) encourages sparsity in the activations of each layer H_l.
23. II. Various Approaches
A. Regularization
Selfless Sequential Learning (SSL)
● Sparse Coding through Neural Inhibition (SNI)
R_SNI(H_l) = (1/M) Σ_m Σ_{i≠j} h_i^m h_j^m

where H_l is a hidden layer with activations h_i^m for a set of M inputs, and i, j
run over all N neurons.
24. II. Various Approaches
A. Regularization
Selfless Sequential Learning (SSL)
● Sparse Coding through Neural Inhibition (SNI)
1. By assuming a close-to-zero mean of the activations (ReLU), E[h_i] ≈ 0, it
minimizes the correlation between any two active neurons.
=> Decorrelated representation
25. II. Various Approaches
A. Regularization
Selfless Sequential Learning (SSL)
● Sparse Coding through Neural Inhibition (SNI)
2. Each active neuron receives a penalty from every other active neuron,
proportional to that other neuron's activation magnitude.
=> Sparse representation
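A short sketch of this regularizer for one layer's activations, following the two properties above (the exact normalization in the paper may differ; h is assumed to be a (batch, neurons) tensor of post-ReLU activations):

```python
import torch

def sni_penalty(h):
    """Sum over pairs of distinct neurons of E[h_i * h_j], estimated over
    the mini-batch: decorrelates neurons and inhibits co-activation."""
    gram = h.t() @ h / h.size(0)               # entry (i, j) = mean_m h_i^m h_j^m
    return gram.sum() - gram.diagonal().sum()  # drop the i == j terms
```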
26. II. Various Approaches
A. Regularization
Selfless Sequential Learning (SSL)
● Sparse Coding through Local Neural Inhibition (SLNI)
SNI is too harsh for complex tasks that need a richer representation.
SLNI relaxes the objective by imposing a spatial weighting on the correlation
penalty: an active neuron mostly penalizes its close neighbours, and this effect
vanishes for neurons further away.
27. II. Various Approaches
A. Regularization
Selfless Sequential Learning (SSL)
● Neuron Importance for Discounting Inhibition (SLNID)
When a new task is similar to a previous task (shared patterns), neurons used for
the previous tasks will be active.
SLNI would then discourage other neurons from being active and encourage the new
task to adapt the already active neurons. => Catastrophic forgetting
To avoid such interference:
28. II. Various Approaches
A. Regularization
Selfless Sequential Learning (SSL)
● Neuron Importance for Discounting Inhibition (SLNID)
Add a weight factor that takes the importance of the neurons w.r.t. the previous
tasks into account. The importance α_i of a neuron is the accumulated sensitivity
of the loss w.r.t. the neuron's output; each pairwise inhibition term is discounted
by a factor that decays exponentially in (α_i + α_j).
29. II. Various Approaches
A. Regularization
Selfless Sequential Learning (SSL)
● Neuron Importance for Discounting Inhibition (SLNID)
An important neuron will then neither suppress other neurons from being active
nor be affected by other active neurons.
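Putting the locality weighting and the importance discount together, a sketch of the SLNID penalty might look as follows (the Gaussian locality and exponential importance discount follow the paper's description, but the exact scaling constants are assumptions):

```python
import torch

def slnid_penalty(h, importance, sigma):
    """Pairwise activation products, weighted by (1) a Gaussian in the
    distance between neuron indices (local inhibition) and (2) an
    exponential discount in accumulated neuron importance, so neurons
    important for earlier tasks neither inhibit nor get inhibited.
    h: (batch, n) activations; importance: (n,) accumulated sensitivity."""
    n = h.size(1)
    idx = torch.arange(n, dtype=h.dtype)
    locality = torch.exp(-(idx[:, None] - idx[None, :]) ** 2 / (2 * sigma ** 2))
    discount = torch.exp(-(importance[:, None] + importance[None, :]))
    weight = (locality * discount).fill_diagonal_(0.0)  # only pairs i != j
    gram = h.t() @ h / h.size(0)                        # E[h_i h_j] over batch
    return (weight * gram).sum()
```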
30. II. Various Approaches
A. Regularization
Selfless Sequential Learning (SSL)
[Figure: SNID vs. SLNID]
First layer neuron importance after learning the first task. More active neurons are
tolerated in SLNID.
31. II. Various Approaches
A. Regularization
Selfless Sequential Learning (SSL)
[Figure: SNID vs. SLNID]
SLNID allows neurons from previous tasks to be re-used for the third task, and it
avoids changing previously important neurons by recruiting new neurons instead.
33. Contents
I. Concepts of Continual Learning
A. Problem Definition (data, metric)
B. Catastrophic Forgetting
C. Desiderata
II. Various Approaches
A. Regularization (EWC, SI, SSL)
B. Dynamic Architectures (PN, DEN*)
C. Dual-Memory (PCN, GR)
D. Others (IMM, VCL, GEM)
III. Research Direction
34. II. Various Approaches
B. Dynamic Architectures
Dynamic Architecture Approaches
These approaches change architectural properties in response to new tasks by
dynamically allocating new neural resources (e.g., re-training with an increased
number of neurons or layers).
● Paper list
○ Progressive Neural Networks (PN), NIPS 2016
○ Lifelong Learning with Dynamically Expandable Networks (DEN), ICLR 2018
35. II. Various Approaches
B. Dynamic Architectures
Progressive Neural Network (PN)
● Core Idea
Catastrophic forgetting is prevented by instantiating a new neural network for each
task being solved, while transfer is enabled via lateral connections to features of
previously learned columns.
36. II. Various Approaches
B. Dynamic Architectures
Progressive Neural Network (PN)
PN starts with a single column: a neural network having L layers with hidden
activations h_i^{(1)} ∈ R^{n_i}, where n_i is the number of units at layer i ≤ L,
and parameters Θ^{(1)} trained to convergence.
37. II. Various Approaches
B. Dynamic Architectures
Progressive Neural Network (PN)
When switching to a second task, the parameters Θ^{(1)} are "frozen" and a new
column with parameters Θ^{(2)} is instantiated, where layer h_i^{(2)} receives input
from both h_{i−1}^{(2)} and h_{i−1}^{(1)} via lateral connections.
38. II. Various Approaches
B. Dynamic Architectures
Progressive Neural Network (PN)
h_i^{(k)} = f( W_i^{(k)} h_{i−1}^{(k)} + Σ_{j<k} U_i^{(k:j)} h_{i−1}^{(j)} )

where W_i^{(k)} is the weight matrix of layer i of column k, and U_i^{(k:j)} are the
lateral connections from layer i−1 of column j to layer i of column k.
39. II. Various Approaches
B. Dynamic Architectures
Progressive Neural Network (PN)
● Adapters
Non-linear lateral connections. They serve both to improve initial conditioning
and to perform dimensionality reduction. The anterior feature vector h_{i−1}^{(<k)}
(with dimensionality n_{i−1}^{(<k)}) is multiplied by a learned scale factor α and
projected by a matrix V_i^{(k:j)} before the lateral connection:

h_i^{(k)} = σ( W_i^{(k)} h_{i−1}^{(k)} + U_i^{(k:j)} σ( V_i^{(k:j)} α_{i−1}^{(<k)} h_{i−1}^{(<k)} ) )
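A minimal PyTorch sketch of one such layer in a second column (class and attribute names are illustrative). The first column's activations h_prev_col come from the frozen network and pass through the scaled, projected adapter before the lateral connection:

```python
import torch
import torch.nn as nn

class ProgressiveLayer(nn.Module):
    """Layer i of column 2: combines the column's own previous-layer
    activations with an adapter over the frozen first column's activations."""
    def __init__(self, n_in, n_out, n_lateral):
        super().__init__()
        self.W = nn.Linear(n_in, n_out)              # within-column weights
        self.alpha = nn.Parameter(torch.ones(1))     # learned scale factor
        self.V = nn.Linear(n_lateral, n_in)          # projection (dim. reduction)
        self.U = nn.Linear(n_in, n_out, bias=False)  # lateral connection

    def forward(self, h_own, h_prev_col):
        adapter = torch.relu(self.V(self.alpha * h_prev_col))
        return torch.relu(self.W(h_own) + self.U(adapter))
```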
40. Contents
I. Concepts of Continual Learning
A. Problem Definition (data, metric)
B. Catastrophic Forgetting
C. Desiderata
II. Various Approaches
A. Regularization (EWC, SI, SSL)
B. Dynamic Architectures (PN, DEN)
C. Dual-Memory (PCN, GR*)
D. Others (IMM, VCL, GEM)
III. Research Direction
41. II. Various Approaches
C. Dual Memory
Dual-Memory Approaches
The CLS* theory provides the basis for these approaches.
● Paper list
○ Progress & Compress: A scalable framework for continual learning (PCN), ICML 2018
○ Continual Learning with Deep Generative Replay (GR), NIPS 2017
*Complementary Learning Systems:
Rapid learning: initial learning of arbitrary new information
Gradual learning of structured knowledge
Bidirectional connections: storage, retrieval and replay
42. II. Various Approaches
C. Dual-Memory
Progress & Compress Network (PCN)
The PCN implements two neural networks, a knowledge base and an active column,
which are trained in two distinct, alternating phases(Progress & Compress).
Knowledge
base
(Compress)
Active
column
(Progress)
43. II. Various Approaches
C. Dual-Memory
Progress & Compress Network (PCN)
● Progress phase
The knowledge base is fixed, while parameters in the active column are optimised
without constraints or regularisation.
PCN enables the reuse of past information through layerwise connections
between the knowledge base and the active column (similar to PN).
44. II. Various Approaches
C. Dual-Memory
Progress & Compress Network (PCN)
● Compress phase
Newly learnt parameters are consolidated into the knowledge base.
The consolidation is done via a distillation process.
Objective: minimize the KL divergence between the active-column and
knowledge-base predictions,

E[ KL( π_AC(·|x) || π_KB(·|x) ) ]

where π_AC and π_KB are the predictions of the active column and the knowledge
base respectively.
45. II. Various Approaches
C. Dual-Memory
Progress & Compress Network (PCN)
● Compress phase
Since the active column's prediction is fixed during the compress phase,
minimizing the KL divergence is the same as minimizing the cross-entropy:
KL(π_AC || π_KB) = H(π_AC, π_KB) − H(π_AC), with H(π_AC) a constant.
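A sketch of the compress-phase loss in PyTorch (the teacher/student framing is mine; the active column's logits are detached since its prediction is fixed during compression):

```python
import torch
import torch.nn.functional as F

def compress_loss(logits_active, logits_kb):
    """Cross-entropy between the fixed active-column prediction (teacher)
    and the knowledge base (student); equals the KL up to a constant."""
    p_active = F.softmax(logits_active.detach(), dim=1)  # fixed during compress
    log_q_kb = F.log_softmax(logits_kb, dim=1)
    return -(p_active * log_q_kb).sum(dim=1).mean()
```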
46. II. Various Approaches
C. Dual-Memory
Progress & Compress Network (PCN)
● Compress phase
The original EWC makes the computational cost linear in the number of tasks.
Thus, online EWC is used to avoid catastrophic forgetting.
Final objective:

E[ KL( π_AC(·|x) || π_KB(·|x) ) ] + (λ/2) || θ_KB − θ*_KB ||²_{F*}

where the second term is the online EWC penalty around the previous
knowledge-base parameters θ*_KB.
47. II. Various Approaches
C. Dual-Memory
Progress & Compress Network (PCN)
● Online EWC
Apply Laplace's approximation to the whole posterior, rather than to the individual
likelihood terms. The Gaussian approximations of previous task likelihoods are
"re-centered" at the latest MAP parameters.

EWC (Mahalanobis norm): Σ_{i<k} (λ/2) || θ − θ*_i ||²_{F_i}
A mean and a Fisher need to be kept for each task, which makes the
computational cost linear in the number of tasks.

Online EWC: (λ/2) || θ − θ*_{k−1} ||²_{F*}, with F* = γ F*_{prev} + F_{k−1}
A single mean and running Fisher are kept, so the cost is constant in the
number of tasks.
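A sketch of the online EWC bookkeeping under these definitions (the dictionary-of-tensors layout and names are assumptions):

```python
import torch

def online_ewc_penalty(model, theta_star, fisher_star, lam):
    """(lam/2) * ||theta - theta*||^2 under the running Fisher metric;
    only one anchor and one Fisher are kept, regardless of task count."""
    penalty = sum((fisher_star[n] * (p - theta_star[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return (lam / 2.0) * penalty

def update_running_fisher(fisher_star, fisher_new, gamma):
    """After each task: decay the old Fisher by gamma and add the new one;
    the anchor theta* is re-centered at the latest MAP parameters."""
    return {n: gamma * fisher_star[n] + fisher_new[n] for n in fisher_star}
```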
48. Contents
I. Concepts of Continual Learning
A. Problem Definition (data, metric)
B. Catastrophic Forgetting
C. Desiderata
II. Various Approaches
A. Regularization (EWC, SI, SSL)
B. Dynamic Architectures (PN, DEN)
C. Dual-Memory (PCN, GR)
D. Others (IMM*, VCL*, GEM*)
III. Research Direction
IV. Appendix
49. Contents
I. Concepts of Continual Learning
A. Problem Definition (data, metric)
B. Catastrophic Forgetting
C. Desiderata
II. Various Approaches
A. Regularization (EWC, SI, SSL)
B. Dynamic Architectures (PN, DEN)
C. Dual-Memory (PCN, GR)
D. Others (IMM, VCL, GEM)
III. Research Direction
50. III. Research Direction
Plan (through early March)
1. Sparse & decorrelated representation
a. Selfless Sequential Learning: implementation and additional experiments
2. Knowledge distillation
a. Progress & Compress: implementation and additional experiments
3. Variational Bayes
a. Variational Continual Learning: implementation and additional experiments
4. Task-agnostic continual learning (final research goal)
a. Online version of variational Bayes (paper for task-agnostic CL)
b. Whether 1, 2, and 3 above can be applied to task-agnostic continual learning
51. III. Research Direction
A. Motivation
● Task-agnostic Continual Learning
What if the task changes gradually?
So far, most papers assume that the model knows task boundaries or labels
(task-aware). In practice, the data distribution shifts gradually, without hard
task boundaries or task labels (task-agnostic).
[Figure: task-aware setting (discrete Task 1 / Task 2 / Task 3 boundaries) vs.
task-agnostic setting (gradually shifting tasks)]
52. III. Research Direction
A. Motivation
● Limitations of the previous work
EWC: task boundaries are needed to compute the Fisher of previous tasks.
SSL: task boundaries are needed to compute the Fisher of previous tasks and the importance of neurons.
PN: task boundaries and labels are needed to add new columns and for inference.
PCN: task boundaries are needed to switch from the progress mode to the compress mode.
GEM: task boundaries and labels are needed to build the episodic memory and compute gradients on it.
53. III. Research Direction
A. Motivation
General training scheme of continual learning (task-aware)
[Figure: learn T1 → stop training & consolidate (compute Fisher / add a new
column / distillation / ...) → learn T2 while consolidating T1 → learn T3 while
consolidating T1, T2]
54. III. Research Direction
A. Motivation
Q.
1. When and how should we consolidate knowledge of earlier tasks?
2. How do we make a model aware that the task is changing?
3. Online version of variational Bayes (Paper for task-agnostic CL)
55. III. Research Direction
B. Research plan
Reproducing & extra experiments
● Selfless Sequential Learning
● Progress & Compress
● Variational Continual Learning
Paper Reading
● Online Structured Laplace Approximations for Overcoming Catastrophic Forgetting
● Task-Agnostic Continual Learning Using Online Variational Bayes
● A Bayesian Approach to On-line Learning
● Overcoming Catastrophic Interference Using Conceptor-Aided Backpropagation
● iCaRL: Incremental Classifier and Representation Learning
● Overcoming Catastrophic Forgetting with Hard Attention to the Task
● Lifelong Learning with a Network of Experts
● Distilling the Knowledge in a Neural Network
● etc...