Optimization as a Model for Few-Shot Learning - ICLR 2017 reading seminar
@ DeNA, Shibuya Hikarie
Optimization as a Model
for Few-Shot Learning
• Better inference for few-shot/one-shot learning
• LSTM-based meta-learning for training deep neural networks
• Competitive with deep metric-learning techniques
• Why has deep learning succeeded?
• compute power
• amount of data
• Large datasets
• ImageNet (Image)
• Microsoft COCO Captions (Image & Caption)
• YouTube 8M (Video)
• WikiText (Text)
• However, in many fields, collecting a large number of training samples is costly:
• e.g., fine-grained recognition (cars, birds, food, ...)
• scraping, crawling, annotating, ...
• In contrast, human beings can generalize from only a few samples of a target.
Problem & Purpose (1)
• How can we acquire a well-generalized model using only a few samples and a fixed number of updates? (see the episode sketch below)
• Existing gradient-based training algorithms (SGD, Adam, AdaGrad, ...) are not designed to work within a fixed, small number of parameter updates.
• Put simply, the authors want to find good initial parameters for the NN.
• cf. review comment: it would be much better to also be able to find architectural parameters of the NN.
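The problem is usually framed as a series of small "episodes": an N-way, k-shot task gives the learner k labelled examples for each of N classes, allows a fixed number of parameter updates, and then evaluates it on held-out examples of the same classes. A minimal sketch of episode sampling (illustrative only; `sample_episode` and `data_by_class` are assumed names, not code from the paper):

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=15):
    """Sample one N-way, k-shot episode (illustrative sketch, not the paper's code).

    data_by_class: dict mapping class label -> list of examples.
    Returns a small labelled training set (support) and an evaluation set (query).
    """
    classes = random.sample(list(data_by_class), n_way)
    support, query = [], []
    for new_label, cls in enumerate(classes):
        examples = random.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, new_label) for x in examples[:k_shot]]
        query += [(x, new_label) for x in examples[k_shot:]]
    # the learner is trained on `support` with a fixed number of updates
    # and then evaluated on `query`
    return support, query
```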
Problem & Purpose (2)
• Meta learning
• Learning to learn: training the learner itself.
• There are various kinds of meta-learning:
• Transfer learning
• uses experience from a different domain
• Popular in the field of image classification, especially for
fine-grained visual classification
• Ensemble classifiers
• combine multiple classifiers
- This article is a good reference for understanding meta-learning
Proposed Method (5)
• What is improved gradually?
• First: the LSTM parameters (i.e., the meta-learner)
• that is, "how should we update the target model?"
• Second: the LSTM cell state (i.e., the learner's parameters θ)
• The learned initial θ is shared across episodes (batches), so learning proceeds rapidly thanks to the good initialization (a rough sketch of the update follows at the end of this slide)
• coordinate-wise LSTM (the same LSTM weights are shared across all parameter coordinates)
• Preprocessing of LSTM inputs (see the sketch below)
• for both topics (the coordinate-wise LSTM and the preprocessing), see [Andrychowicz, NIPS 2016] (preprocessing is in the appendix)
• adjusts the scaling of gradients and losses
• separates magnitude and sign information
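A minimal sketch of this preprocessing rule, as described in the appendix of [Andrychowicz, NIPS 2016] (the NumPy implementation and the function name are illustrative choices, not from the paper):

```python
import numpy as np

def preprocess(x, p=10.0):
    """Rescale gradients/losses before feeding them to the LSTM
    (sketch of the rule from the appendix of [Andrychowicz, NIPS 2016]).

    Each scalar is mapped to a pair that separates magnitude and sign:
        (log|x| / p, sign(x))    if |x| >= exp(-p)
        (-1,         exp(p) * x) otherwise
    so that inputs of very different scales become comparable.
    """
    x = np.asarray(x, dtype=np.float64)
    large = np.abs(x) >= np.exp(-p)
    first = np.where(large, np.log(np.abs(x) + 1e-12) / p, -1.0)
    second = np.where(large, np.sign(x), np.exp(p) * x)
    return np.stack([first, second], axis=-1)
```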
• Batch normalization
• needs care to avoid "dataset"-level (episode-level) leakage of batch statistics
• Related work: metric learning
• e.g., Siamese networks
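Putting the bullets above together, a rough sketch of one coordinate-wise meta-learner update (illustrative NumPy code with assumed variable names, not the authors' implementation). The learner's parameters θ play the role of the LSTM cell state, θ_t = f_t ⊙ θ_{t-1} + i_t ⊙ (−∇_{θ_{t-1}} L_t), which reduces to plain SGD when f_t = 1 and i_t is the learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def meta_learner_step(theta, grad, loss, i_prev, f_prev, W_I, b_I, W_F, b_F):
    """One coordinate-wise meta-learner update (illustrative sketch).

    theta, grad, i_prev, f_prev: arrays of shape (D,), one entry per parameter.
    loss: scalar loss of the learner on the current batch.
    W_I, W_F: arrays of shape (4,); b_I, b_F: scalars -- the same tiny set of
    meta-learner weights is shared across all D coordinates ("coordinate-wise").
    In the real model the gradient/loss inputs are preprocessed as sketched above.
    """
    loss_vec = np.full_like(theta, loss)
    # gate inputs: gradient, loss, current parameter value, previous gate value
    z_i = np.stack([grad, loss_vec, theta, i_prev], axis=-1)   # (D, 4)
    z_f = np.stack([grad, loss_vec, theta, f_prev], axis=-1)   # (D, 4)
    i_t = sigmoid(z_i @ W_I + b_I)   # input gate ~ per-coordinate learning rate
    f_t = sigmoid(z_f @ W_F + b_F)   # forget gate ~ per-coordinate shrinkage of theta
    # cell-state update: theta_t = f_t * theta_{t-1} + i_t * (-grad); SGD when f_t = 1
    theta_new = f_t * theta + i_t * (-grad)
    return theta_new, i_t, f_t
```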
Visualization and Insight
• input gates
• 1. differ among datasets
• = the meta-learner is not simply learning a fixed optimization routine
• 2. differ among tasks
• = the meta-learner uses different update strategies for different tasks
• forget gates
• simple decay
• in the end, almost constant
• Proposed an LSTM-based model that learns a learner, inspired by the metaphor between SGD updates and the LSTM cell-state update.
• The meta-learner is trained to discover:
• 1. a good initialization of the learner
• 2. a good mechanism for updating the learner's parameters
• Experimental results are competitive with SOTA metric-learning methods.
• few samples / lots of classes
• more challenging scenarios
• from a review comment
• it would be much better to also be able to find architectural parameters of the NN.