This document summarizes recent work on continual learning in deep neural networks. It begins by defining the problem of catastrophic forgetting and the desiderata for a continual learner, such as resistance to forgetting and bounded model size. It then reviews three categories of approaches: regularization methods such as EWC, which constrain weight changes to protect knowledge from earlier tasks; dynamic architectures such as PN, which add a new column of parameters for each task; and dual-memory methods such as PCN, which use two networks, one for learning new tasks (progress) and one for consolidating knowledge (compression). Finally, it discusses directions for future research, such as developing truly task-agnostic continual learning that requires neither task boundaries nor task labels.
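To make the regularization category concrete, below is a minimal sketch of an EWC-style quadratic penalty, assuming a PyTorch model and a precomputed diagonal Fisher estimate from the previous task. The names `ewc_penalty`, `old_params`, `fisher`, and `lam` are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn as nn


def ewc_penalty(model: nn.Module,
                old_params: dict,
                fisher: dict,
                lam: float = 1000.0) -> torch.Tensor:
    """EWC-style penalty: discourage moving weights that were important
    for earlier tasks (sketch; `old_params` and `fisher` are assumed to be
    dicts of tensors saved after training on the previous task)."""
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        if name in fisher:
            # Squared drift from the old solution, weighted by the
            # diagonal Fisher information estimated on the previous task.
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty


# Usage (sketch): total loss on the new task's data plus the penalty.
# loss = criterion(model(x), y) + ewc_penalty(model, old_params, fisher)
```

The key design point is that the penalty is anisotropic: parameters with high estimated Fisher information are held close to their old values, while unimportant parameters remain free to adapt to the new task.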