
Optimization as a Model for Few-Shot Learning - ICLR 2017 reading seminar


These slides were presented at the ICLR 2017 reading seminar @ Shibuya Hikarie, Tokyo, Japan



  1. 1. 2016/06/17 @ DeNA, Shibuya Hikarie Hokuto Kagaya (@_hokkun_) Optimization as a Model for Few-Shot Learning 1
  2. 2. TL;DR • Purpose • Better inference for the few-shot/one-shot learning problem • Method • LSTM-based meta-learning for deep neural networks • Result • Competitive with deep metric-learning techniques 2
  3. 3. Background (1) • Why did deep learning succeed? • machine power • amount of data • Large datasets • ImageNet (images) • Microsoft COCO Captions (images & captions) • YouTube 8M (video) • WikiText (text) 3
  4. 4. Background (2) • However, in many fields, collecting a large amount of training samples is: • difficult • Ex: fine-grained recognition (cars, birds, food...) • time-consuming • scraping, crawling, annotating... • In contrast, human beings can generalize from just a few samples of a target. 4
  5. 5. Problem & Purpose (1) • How can we acquire a generalized model using few samples and a set number of updates? • Existing gradient-based training algorithms (SGD, Adam, AdaGrad...) do not fit a problem with a set number of parameter updates. • Put more simply, the authors want to find good initial parameters for a NN. • cf) review comments: it would be even better to be able to find architectural parameters of the NN. 5
  6. 6. Problem & Purpose (2) • How? • Meta-learning • Learning to learn: train the learner itself. • A variety of meta-learning approaches • Transfer learning • use the experience of a different domain • popular in image classification, especially fine-grained visual classification • Ensemble classifiers • combine multiple classifiers 6 - This article is very good for understanding meta-learning - http://www.scholarpedia.org/article/Metalearning
  7. 7. Proposed method • LSTM-based meta learning 7
  8. 8. * Prerequisites • What is LSTM? • Long Short-Term Memory • We want to handle sequences, but the error signal explodes/vanishes • Solution: fix the weight on past data to 1 so it is not forgotten, and gate the input/output selectively ('97) • However, this could not cope with abrupt changes of context, so a forget gate was added to allow selective erasure of past memory ('99) • Reference (Japanese) • http://qiita.com/t_Signull/items/21b82be280b46f467d1b 8
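The cell update described above (weight-1 memory plus input/output gates from '97, forget gate from '99) can be sketched in a few lines of NumPy. The weight layout and names here are illustrative, not from any particular implementation:

```python
# A minimal sketch of one LSTM step with input, forget, and output gates.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x; h_prev] to the 4 gate pre-activations."""
    z = W @ np.concatenate([x, h_prev]) + b
    n = h_prev.size
    i = sigmoid(z[0 * n:1 * n])   # input gate: how much new info to write
    f = sigmoid(z[1 * n:2 * n])   # forget gate ('99): selectively erase memory
    o = sigmoid(z[2 * n:3 * n])   # output gate: how much of the cell to expose
    g = np.tanh(z[3 * n:4 * n])   # candidate cell state
    c = f * c_prev + i * g        # cell update: gated mix of old and new
    h = o * np.tanh(c)            # hidden state
    return h, c
```

Without the forget gate, f would effectively be fixed at 1 (the '97 design), which is exactly the "weight past data by 1 so it is not forgotten" behavior on the slide.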
  9. 9. * Data Separation • meta-train dataset • meta-test dataset • meta sample • (figure: target training samples + target testing samples = one meta sample, a.k.a. episode) 9
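One way the episode structure above could be drawn in code: sample N classes, then split each class's examples into the episode's training and testing sets. `dataset` (a mapping from class to examples) and the N-way/k-shot parameter names are assumptions for illustration:

```python
# A sketch of sampling one "meta sample" (episode) from a class-indexed dataset.
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15, rng=random):
    classes = rng.sample(sorted(dataset), n_way)
    train, test = [], []
    for label, cls in enumerate(classes):
        items = rng.sample(dataset[cls], k_shot + q_queries)
        train += [(x, label) for x in items[:k_shot]]   # target training samples
        test += [(x, label) for x in items[k_shot:]]    # target testing samples
    return train, test
```

Episodes drawn from the meta-train classes train the meta-learner; episodes drawn from disjoint meta-test classes evaluate it.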
  10. 10. Proposed Method (2)

  Normal SGD update:
  θ_t = θ_{t-1} − α_t ∇_{θ_{t-1}} ℒ_t

  LSTM cell update (the metaphor):
  c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

  where i_t = σ(W_I · [∇_{θ_{t-1}} ℒ_t, ℒ_t, θ_{t-1}, i_{t-1}] + b_I)
        f_t = σ(W_F · [∇_{θ_{t-1}} ℒ_t, ℒ_t, θ_{t-1}, f_{t-1}] + b_F)

  i.e. each gate sees the current gradient, the current loss, the previous θ, and its own previous value. The forget gate f_t is not constant 1, so the learner can escape from bad local optima. 10
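A minimal per-coordinate sketch of this update rule, assuming the candidate cell state c̃_t is the negative gradient and treating W_i, W_f, b_i, b_f as the meta-learner's weights. This illustrates the analogy only; the paper's full method also preprocesses the inputs and shares one LSTM across coordinates:

```python
# Sketch: the learner's parameters theta play the role of the LSTM cell state;
# i_t acts like a learning rate and f_t like a (not fixed to 1) decay term.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def meta_update(theta_prev, grad, loss, i_prev, f_prev, W_i, b_i, W_f, b_f):
    # Gate inputs: current gradient, current loss, previous theta,
    # and the gate's own previous value.
    x_i = np.stack([grad, np.full_like(grad, loss), theta_prev, i_prev])
    x_f = np.stack([grad, np.full_like(grad, loss), theta_prev, f_prev])
    i_t = sigmoid(np.tensordot(W_i, x_i, axes=([0], [0])) + b_i)  # ~ learning rate
    f_t = sigmoid(np.tensordot(W_f, x_f, axes=([0], [0])) + b_f)  # ~ decay, can escape bad optima
    # c_t = f ⊙ c_{t-1} + i ⊙ c̃_t, with c̃_t = -grad:
    theta_t = f_t * theta_prev + i_t * (-grad)
    return theta_t, i_t, f_t
```

With f_t ≡ 1 and i_t ≡ α this reduces exactly to the SGD update on the slide.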
  11. 11. Proposed Method (3) 11 ← meta-learner's iteration ← learner's iteration • The (meta) loss value is computed from the final state of the LSTM (= the parameters of the target model) together with the data and labels of D_test.
  12. 12. Proposed Method (4) 12 • From the author's slides
  13. 13. Proposed Method (5) • What will be improved gradually? • First: the LSTM parameters (a.k.a. meta-learner parameters) • that is, "how should we update target models?" • Second: the LSTM states (outputs?) • The final θ_T is shared across batches, so learning proceeds rapidly thanks to the good initialization 13
  14. 14. Other Topics • Coordinate-wise LSTM • Preprocessing of LSTM inputs • for both topics, see [Andrychowicz, NIPS 2016] (preprocessing is in the appendix) • adjusts the scaling of gradients and losses • separates magnitude and sign information • Batch normalization • avoid "dataset" (episode)-level leakage of information • Related work: metric learning • ex: Siamese networks 14
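The scaling/sign-separation preprocessing mentioned above, as given in the appendix of [Andrychowicz, NIPS 2016], could be sketched like this. The threshold e^{-p} with p = 10 follows that paper; the tiny epsilon inside the log is an implementation assumption to keep the unused branch of `np.where` finite:

```python
# Map each gradient/loss value x to a 2-vector: a scaled log-magnitude and a
# sign (or, for tiny x, a rescaled raw value), so LSTM inputs are well scaled.
import numpy as np

def preprocess(x, p=10.0):
    x = np.asarray(x, dtype=float)
    big = np.abs(x) >= np.exp(-p)
    mag = np.where(big, np.log(np.abs(x) + 1e-300) / p, -1.0)  # scaled log-magnitude
    sgn = np.where(big, np.sign(x), np.exp(p) * x)             # sign, or rescaled value
    return np.stack([mag, sgn], axis=-1)
```

Without this, gradients spanning many orders of magnitude would make the LSTM's inputs badly conditioned.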
  15. 15. Evaluation Method • Baseline 1: nearest neighbor • meta-train: train a neural network using all samples • meta-test: feed the training samples through the network and compare the results with those of the testing samples • Baseline 2: fine-tune • meta-train: in addition to Baseline 1, use the meta-validation dataset for hyperparameter search and fine-tune the network from Baseline 1 • Baseline 3: Matching network • the state of the art in metric learning 15
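Baseline 1 could be sketched as follows, assuming the episode's samples have already been mapped to feature vectors by the pre-trained network (that embedding step is the hypothetical part here):

```python
# Nearest-neighbor baseline: classify each test feature by the label of the
# closest training feature (squared Euclidean distance in embedding space).
import numpy as np

def nn_classify(train_feats, train_labels, test_feats):
    # train_feats: (n_train, d), test_feats: (n_test, d)
    d2 = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)               # index of the closest training sample
    return np.asarray(train_labels)[nearest]  # inherit its label
```

Matching networks (Baseline 3) can be seen as a learned, soft version of this rule, with a trained embedding and attention over the training samples instead of a hard argmin.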
  16. 16. Evaluation Result 16
  17. 17. Visualization and Insight • input gates • 1. differ among datasets • = the meta-learner isn't simply learning a fixed optimization strategy • 2. differ among tasks • = the meta-learner uses different strategies to solve each setting • forget gates • simple decay • in the end, almost constant 17
  18. 18. Visualization and Insight 18
  19. 19. Conclusion • Proposed an LSTM-based model that learns a learner, inspired by the metaphor between SGD updates and the LSTM cell update. • Trains the meta-learner to discover: • 1. a good initialization of the learner • 2. a good mechanism for updating the learner's parameters. • Experimental results competitive with SOTA metric-learning methods. 19
  20. 20. Future work • few samples / many classes • more challenging scenarios • from the review comments • it would be even better to be able to find architectural parameters of the NN. 20
  21. 21. Impressions • Viewing transfer learning's "use the experience of a different domain" as a kind of sequence learning and modeling it with an LSTM felt natural to me • Has this idea appeared before? I did not have time to read through the related work... • As the review comment said, it would be great if the architecture itself could also be optimized • Though there is also the argument that simply stacking many simple filters works well... • It brought back the struggle of trying out many hyperparameters with cuda-convnet in my undergraduate days 21
  22. 22. Things I am probably still unclear on • In the end, what is genuinely new in this paper? Using an LSTM for learning-to-learn is probably not new. • For example, Andrychowicz+'16 trains an LSTM that takes gradients as input and outputs the target learner's parameter updates • Is the novelty that this paper outputs the parameters themselves directly? 22
