Optimization as a Model
for Few-shot Learning
ICLR 2017
Katy@Datalab 2017.06.22
Motivation
• Deep learning successes have required a lot of
labeled training data
• Labeling such data requires significant human
labor 

• Few-shot learning is an important topic
• Why does gradient-based optimization fail for few-shot learning?
• Standard optimizers weren't designed to perform well under the constraint of a fixed, small number of updates
• A pretrained network doesn't work well if the task it was trained on diverges from the target task
Background
Related Work
Santoro, Adam, et al. "One-shot learning with memory-augmented neural networks." International Conference on Machine Learning. 2016.
Andrychowicz, Marcin, et al. "Learning to learn by gradient
descent by gradient descent." Advances in Neural Information
Processing Systems. 2016.
Learning to learn with RNN
• In this work, they propose to replace hand-designed update rules with a learned update rule: an optimizer m (an LSTM) with its own parameters φ
• This results in updates to the optimizee f of the form θ_{t+1} = θ_t + g_t, where g_t is the output of the LSTM
How to train the optimizer
• To train the optimizer, we use an objective that depends on the trajectory of optimizee parameters over a time horizon T
• θ: the optimizee parameters
• φ: the optimizer parameters
• f: the function being optimized
• m: the LSTM optimizer
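Written out (as in Andrychowicz et al., where the weights w_t ≥ 0 are fixed hyper-parameters, with w_t = 1 used in that paper), the meta-training objective over the trajectory is:

```latex
\mathcal{L}(\phi) = \mathbb{E}_f\!\left[\sum_{t=1}^{T} w_t\, f(\theta_t)\right],
\qquad
\theta_{t+1} = \theta_t + g_t,
\qquad
\begin{bmatrix} g_t \\ h_{t+1} \end{bmatrix}
= m\!\left(\nabla_\theta f(\theta_t),\, h_t,\, \phi\right)
```

Minimizing L(φ) by gradient descent on φ requires backpropagating through the whole unrolled trajectory of optimizee updates.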
Intuition of Trajectory
(figure: comparison of the old trajectory with the trajectory under the new φ)
Challenge
• a naive LSTM optimizer would need too many parameters (it scales with the number of optimizee coordinates)
• solution: share one small LSTM across all coordinates
Coordinatewise LSTM
Optimizer
(figure: the shared LSTM emits an update g_t for each coordinate)
Problem Formulation
• We want to design a learning algorithm A that outputs good parameters θ of a model M when fed a small dataset D_train = {(X_t, Y_t)}, t = 1…T
Meta-learning
• D_meta-train: used to train the learning procedure
• D_meta-validation: used for hyper-parameter selection of the meta-learner
• D_meta-test: used to evaluate generalization performance
• The gradient descent update rule resembles the update of an LSTM cell state
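Side by side, this is the analogy the meta-learner LSTM exploits:

```latex
\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t
\qquad\leftrightarrow\qquad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
```

Setting f_t = 1, c_{t-1} = θ_{t-1}, i_t = α_t, and c̃_t = −∇L recovers plain gradient descent; letting the meta-learner output i_t and f_t lets it learn a learning rate and a weight-decay-like forgetting behavior.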
Parameter Sharing
• Share parameters across the coordinates of the learner's gradient
• This means each coordinate has its own hidden
and cell state values but the LSTM parameters are
the same across all coordinates.
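A minimal numpy sketch of this sharing scheme (a hypothetical toy, not the paper's implementation): the LSTM weights exist once, while each coordinate carries its own hidden and cell state, so the parameter count is independent of the size of the optimizee.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CoordinatewiseLSTM:
    """One LSTM cell whose weights are shared across every coordinate
    of the optimizee; each coordinate keeps its own (h, c) state.
    Hypothetical minimal sketch with a tiny hidden size."""
    def __init__(self, hidden=2, seed=0):
        rng = np.random.default_rng(seed)
        # input per coordinate is its scalar gradient, so in_dim = 1
        self.W = rng.normal(0.0, 0.1, (4 * hidden, 1 + hidden))  # shared weights
        self.b = np.zeros(4 * hidden)
        self.w_out = rng.normal(0.0, 0.1, hidden)  # maps h -> scalar update g_t
        self.hidden = hidden

    def step(self, grad, h, c):
        """grad: (n,) per-coordinate gradients; h, c: (n, hidden) states."""
        x = np.concatenate([grad[:, None], h], axis=1)  # (n, 1 + hidden)
        z = x @ self.W.T + self.b                       # (n, 4 * hidden)
        i, f, o, g = np.split(z, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)    # per-coordinate cell state
        h = sigmoid(o) * np.tanh(c)
        update = h @ self.w_out                         # one g_t per coordinate
        return update, h, c
```

Note that `self.W`, `self.b`, and `self.w_out` are the only parameters, regardless of how many coordinates n the optimizee has; in training, those weights would be learned by backpropagating the trajectory objective.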
Experiment
• Mini-ImageNet
• a random subset of 100 classes from ImageNet (64 for meta-training, 16 for meta-validation, 20 for meta-testing)
• episodes D_train are generated by randomly picking 5 classes from the class subset assigned to each meta-set
• the model M is a small 4-layer CNN; the meta-learner LSTM has 2 layers
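The episode construction described above might be sketched as follows (a hypothetical helper; `class_pool` maps each class of a meta-set to its example ids, and the k_shot/n_query defaults are illustrative):

```python
import random

def sample_episode(class_pool, n_way=5, k_shot=1, n_query=15, seed=None):
    """Sample one few-shot episode from a meta-set.

    class_pool: dict mapping class label -> list of example ids.
    Returns (d_train, d_test) as lists of (example, label) pairs:
    k_shot support examples and n_query query examples per class.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(class_pool), n_way)   # pick the episode's classes
    d_train, d_test = [], []
    for label in classes:
        # draw disjoint support and query examples for this class
        examples = rng.sample(class_pool[label], k_shot + n_query)
        d_train += [(x, label) for x in examples[:k_shot]]
        d_test += [(x, label) for x in examples[k_shot:]]
    return d_train, d_test
```

During meta-training, many such episodes are drawn from D_meta-train; the meta-learner is updated based on how well M, trained on each episode's D_train, performs on that episode's D_test.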
Visualizing the learned gates
• 1. Forget gate: the meta-learner seems to adopt a simple weight-decay strategy that is consistent across different layers
• 2. Input gate: there seems to be a lot of variability between different datasets, indicating that the meta-learner isn't simply learning a fixed optimization strategy
• 3. There also seem to be differences between the two tasks (1-shot and 5-shot), suggesting that the meta-learner adapts its method to the conditions of each
Experiment
Vinyals, Oriol, et al. "Matching networks for one shot learning."
Advances in Neural Information Processing Systems. 2016.
Task: k-shot learning
slides from https://www.slideshare.net/KazukiFujikawa/matching-networks-for-one-shot-learning-71257100
Embedding functions f and g are parameterized as a simple CNN (e.g. VGG or Inception)
Fully Conditional Embedding
• the embedding reads out from memory over K processing steps
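The matching-networks readout itself can be sketched as attention over the support-set embeddings, cosine similarity followed by a softmax (a minimal hypothetical numpy version, assuming the embeddings were already computed by f and g):

```python
import numpy as np

def matching_predict(query_emb, support_embs, support_labels, n_classes):
    """Classify a query as an attention-weighted vote over support labels.

    query_emb: (d,) embedding of the query example.
    support_embs: (k, d) embeddings of the support set.
    support_labels: (k,) integer class labels of the support set.
    """
    # cosine similarity between the query and each support example
    q = query_emb / np.linalg.norm(query_emb)
    s = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    sims = s @ q
    # softmax attention weights over the support set
    a = np.exp(sims - sims.max())
    a /= a.sum()
    # predicted distribution: attention mass accumulated per class
    probs = np.zeros(n_classes)
    for weight, label in zip(a, support_labels):
        probs[label] += weight
    return probs
```

In the full model, the fully conditional embedding refines both query and support embeddings before this attention step.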
Conclusion
• Uses a meta-learning LSTM to model parameter dynamics during training
• One-shot learning is much easier if you train the network to do one-shot learning
• Achieves reasonable results without external memory