Optimization as a Model
for Few-shot Learning
ICLR 2017
Katy@Datalab 2017.06.22
Motivation
• Deep learning successes have required a lot of
labeled training data
• Labeling such data requires significant human
labor 

• Few-shot learning is an important topic
• Why does gradient-based optimization fail for few-shot learning?
• Standard optimizers weren't designed to perform well under the constraint of a fixed, small number of updates
• A pretrained network doesn't work well if the task it was trained on diverges from the target task
Background
Related Work
Santoro, Adam, et al. "One-shot learning with memory-augmented neural networks." International Conference on Machine Learning. 2016.
Andrychowicz, Marcin, et al. "Learning to learn by gradient
descent by gradient descent." Advances in Neural Information
Processing Systems. 2016.
Learning to learn with RNN
• In this work, they propose to replace hand-designed update rules with a learned update rule: an optimizer m (an LSTM) with its own parameters φ
• This results in updates to the optimizee f of the form θ_{t+1} = θ_t + g_t, where g_t is the output of the LSTM
How to train the optimizer
• To train the optimizer, we use an objective that depends on the trajectory of optimizee parameters over a time horizon T
• θ: the optimizee parameters
• φ: the optimizer parameters
• f: the function being optimized
• m: the LSTM optimizer
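Written out (as in Andrychowicz et al., where the weights w_t ≥ 0 are fixed hyper-parameters, with w_t = 1 used in that paper), the meta-training objective over the trajectory is:

```latex
\mathcal{L}(\phi) = \mathbb{E}_f\!\left[\sum_{t=1}^{T} w_t\, f(\theta_t)\right],
\qquad
\theta_{t+1} = \theta_t + g_t,
\qquad
\begin{bmatrix} g_t \\ h_{t+1} \end{bmatrix}
= m\!\left(\nabla_\theta f(\theta_t),\, h_t,\, \phi\right)
```

Minimizing L(φ) by gradient descent on φ requires backpropagating through the whole unrolled trajectory of optimizee updates.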
Intuition of Trajectory
(figure: comparison of the old trajectory with the trajectory under the new φ)
Challenge
• a naive LSTM optimizer would need too many parameters (it scales with the number of optimizee coordinates)
• solution: share one small LSTM across all coordinates
Coordinatewise LSTM
Optimizer
(figure: the shared LSTM emits an update g_t for each coordinate)
Problem Formulation
• We want to design a learning algorithm A that outputs good parameters θ of a model M when fed a small dataset D_train = {(X_t, Y_t)}, t = 1…T
Meta-learning
• D_meta-train: used to train the learning procedure
• D_meta-validation: used for hyper-parameter selection of the meta-learner
• D_meta-test: used to evaluate generalization performance
• The gradient descent update rule resembles the update of an LSTM cell state
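Side by side, this is the analogy the meta-learner LSTM exploits:

```latex
\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t
\qquad\leftrightarrow\qquad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
```

Setting f_t = 1, c_{t-1} = θ_{t-1}, i_t = α_t, and c̃_t = −∇L recovers plain gradient descent; letting the meta-learner output i_t and f_t lets it learn a learning rate and a weight-decay-like forgetting behavior.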
Parameter Sharing
• Share parameters across the coordinates of the learner's gradient
• This means each coordinate has its own hidden
and cell state values but the LSTM parameters are
the same across all coordinates.
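A minimal numpy sketch of this sharing scheme (a hypothetical toy, not the paper's implementation): the LSTM weights exist once, while each coordinate carries its own hidden and cell state, so the parameter count is independent of the size of the optimizee.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CoordinatewiseLSTM:
    """One LSTM cell whose weights are shared across every coordinate
    of the optimizee; each coordinate keeps its own (h, c) state.
    Hypothetical minimal sketch with a tiny hidden size."""
    def __init__(self, hidden=2, seed=0):
        rng = np.random.default_rng(seed)
        # input per coordinate is its scalar gradient, so in_dim = 1
        self.W = rng.normal(0.0, 0.1, (4 * hidden, 1 + hidden))  # shared weights
        self.b = np.zeros(4 * hidden)
        self.w_out = rng.normal(0.0, 0.1, hidden)  # maps h -> scalar update g_t
        self.hidden = hidden

    def step(self, grad, h, c):
        """grad: (n,) per-coordinate gradients; h, c: (n, hidden) states."""
        x = np.concatenate([grad[:, None], h], axis=1)  # (n, 1 + hidden)
        z = x @ self.W.T + self.b                       # (n, 4 * hidden)
        i, f, o, g = np.split(z, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)    # per-coordinate cell state
        h = sigmoid(o) * np.tanh(c)
        update = h @ self.w_out                         # one g_t per coordinate
        return update, h, c
```

Note that `self.W`, `self.b`, and `self.w_out` are the only parameters, regardless of how many coordinates n the optimizee has; in training, those weights would be learned by backpropagating the trajectory objective.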
Experiment
• Mini-ImageNet
• a random subset of 100 classes from ImageNet (64 for meta-training, 16 for meta-validation, 20 for meta-testing)
• episodes D_train are generated by randomly picking 5 classes from the class subset assigned to each meta-set
• the model M is a small 4-layer CNN; the meta-learner LSTM has 2 layers
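The episode construction described above might be sketched as follows (a hypothetical helper; `class_pool` maps each class of a meta-set to its example ids, and the k_shot/n_query defaults are illustrative):

```python
import random

def sample_episode(class_pool, n_way=5, k_shot=1, n_query=15, seed=None):
    """Sample one few-shot episode from a meta-set.

    class_pool: dict mapping class label -> list of example ids.
    Returns (d_train, d_test) as lists of (example, label) pairs:
    k_shot support examples and n_query query examples per class.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(class_pool), n_way)   # pick the episode's classes
    d_train, d_test = [], []
    for label in classes:
        # draw disjoint support and query examples for this class
        examples = rng.sample(class_pool[label], k_shot + n_query)
        d_train += [(x, label) for x in examples[:k_shot]]
        d_test += [(x, label) for x in examples[k_shot:]]
    return d_train, d_test
```

During meta-training, many such episodes are drawn from D_meta-train; the meta-learner is updated based on how well M, trained on each episode's D_train, performs on that episode's D_test.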
Visualizing the learned gates
• 1. Forget gate: the meta-learner seems to adopt a simple weight-decay strategy that is consistent across different layers
• 2. Input gate: there seems to be a lot of variability between different datasets, indicating that the meta-learner isn't simply learning a fixed optimization strategy
• 3. There also seem to be differences between the two tasks (1-shot and 5-shot), suggesting that the meta-learner adapts its method to the conditions of each
Experiment
Vinyals, Oriol, et al. "Matching networks for one shot learning."
Advances in Neural Information Processing Systems. 2016.
Task: k-shot learning
slides from https://www.slideshare.net/KazukiFujikawa/matching-networks-for-one-shot-learning-71257100
Embedding functions f and g are parameterized as a simple CNN (e.g. VGG or Inception)
Fully Conditional Embedding
• the embedding reads out from memory over K processing steps
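The matching-networks readout itself can be sketched as attention over the support-set embeddings, cosine similarity followed by a softmax (a minimal hypothetical numpy version, assuming the embeddings were already computed by f and g):

```python
import numpy as np

def matching_predict(query_emb, support_embs, support_labels, n_classes):
    """Classify a query as an attention-weighted vote over support labels.

    query_emb: (d,) embedding of the query example.
    support_embs: (k, d) embeddings of the support set.
    support_labels: (k,) integer class labels of the support set.
    """
    # cosine similarity between the query and each support example
    q = query_emb / np.linalg.norm(query_emb)
    s = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    sims = s @ q
    # softmax attention weights over the support set
    a = np.exp(sims - sims.max())
    a /= a.sum()
    # predicted distribution: attention mass accumulated per class
    probs = np.zeros(n_classes)
    for weight, label in zip(a, support_labels):
        probs[label] += weight
    return probs
```

In the full model, the fully conditional embedding refines both query and support embeddings before this attention step.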
Conclusion
• Uses a meta-learning LSTM to model parameter dynamics during training
• One-shot learning is much easier if you train the network to do one-shot learning
• Achieves reasonable results without external memory