
NIPS読み会2013: One-shot learning by inverting a compositional causal process


1. One-shot learning by inverting a compositional causal process
Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum
能地宏 @NII
Note: the figures in these slides are quoted from the paper.
2. One-shot classification
[Paper intro fragment: people can classify new images of a foreign handwritten character from just one example, while classifiers are generally trained on hundreds of images per class using benchmark datasets such as ImageNet [4] and CIFAR-10/100 [14].]
[Figure 1 of the paper: can you learn a new concept from just one example? (a & b) Where are the other examples of the concept shown in red? Answers for b) are row 4 column 3 (left) and row 2 column 4 (right).]
3. One-shot generation
[Figure: grids of new examples of a character, drawn by People and generated by HBPL and an affine model.]
4. One-shot generation
[Figure caption fragment: "Visual Turing test. To compare ..." — human and machine productions shown side by side.]
5. One-shot generation
[Same figure as slide 3, now with "The model" (HBPL) panels identified.]
6. One-shot generation
[Same figure as slide 4, with the model's productions identified.]
7. Overview
‣ Humans can extract the essential features of a symbol from just a single example
- Classification: they can pick out similar instances
- Generation: they can produce new samples
‣ Machine learning typically needs large amounts of data per label
- e.g. MNIST: 6,000 training examples / class
‣ Task and contributions
- Can machine learning imitate this human ability?
- With a carefully constructed generative model, the authors obtain results comparable to humans
- This suggests humans may extract features through a similar mechanism
8. Data and learning
Omniglot dataset:
‣ 50 alphabets, 1,600 characters, 20 examples / character (105x105 grayscale images, paired with motor data)
‣ learn hyperparameters: 30 alphabets
‣ learn the posterior from only one example: 20 alphabets
From the paper: the Omniglot dataset was randomly split into a 30-alphabet "background" set and a 20-alphabet "evaluation" set, constrained such that the background set included the six most common alphabets as determined by Google hits. Background images, paired with their motor data, were used to learn the hyperparameters of the HBPL model, including a set of 1,000 primitive motor elements (Figure 4a) and position models for a drawing's first, second, and third stroke, etc. (Figure 4c). Where possible, cross-validation within the background set was used to decide issues of model complexity within the conditional probability distributions of HBPL.
[Figure 2 of the paper: four alphabets from Omniglot, each with five characters drawn by four different people.]
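As an aside, a minimal Python sketch of the split just described (not the authors' code): `must_include` stands in for the constraint that the six most common alphabets land in the background set, and alphabet identifiers are placeholders.

```python
import random

def split_omniglot(alphabets, n_background=30, must_include=()):
    """Split alphabets into a 'background' set (for learning hyperparameters)
    and an 'evaluation' set (for the one-shot tasks)."""
    rest = [a for a in alphabets if a not in must_include]
    random.shuffle(rest)
    background = list(must_include) + rest[:n_background - len(must_include)]
    evaluation = [a for a in alphabets if a not in background]
    return background, evaluation

# e.g. background, evaluation = split_omniglot([f"alphabet_{i}" for i in range(50)])
```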
9. Results first
One-shot classification (error rate):
‣ Human: 4.5%
‣ HBPL: 4.8%
‣ Affine: 18.2%
‣ HD: 34.8%
‣ DBM: 38%
‣ Better performance than the deep learning baselines
‣ Error rate almost the same as humans
One-shot generation
‣ Visual Turing test: judges view 9 samples of the same symbol and guess which set was drawn by a human
‣ Judges were correct only 56% of the time (chance = 50%)
10. Model
[Figure 3 of the paper: an illustration of the HBPL model generating two character types (left and right), where the dotted line separates the type-level from the token-level variables. Legend: number of strokes κ, relations R, primitive id z (color-coded to highlight sharing), control points x (open circles), scale y, start locations L, trajectories T, transformation A, noise parameters ε and σ_b, and image I.]
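To make the type/token distinction concrete, here is a schematic Python sketch of the generative flow (not the paper's actual parameterization; `n_primitives`, `p_kappa`, the "independent" relations, and the noise constants are simplified placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_type(n_primitives, p_kappa):
    """Type level: number of strokes kappa, primitive ids z, relations R."""
    kappa = 1 + rng.choice(len(p_kappa), p=p_kappa)
    z = rng.integers(n_primitives, size=kappa)   # primitive ids, shared across tokens
    relations = ["independent"] * int(kappa)     # placeholder attachment relations
    return {"kappa": int(kappa), "z": z, "R": relations}

def sample_token(char_type, jitter=0.1):
    """Token level: perturb the type, then draw transformation A and noise."""
    A = np.array([1.0, 1.0, 0.0, 0.0]) + jitter * rng.standard_normal(4)
    return {"type": char_type, "A": A, "eps": 0.01, "sigma_b": 0.5}

token = sample_token(sample_type(n_primitives=1000, p_kappa=[0.4, 0.3, 0.2, 0.1]))
```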
11. Learning the hyperparameters
‣ The model learns "common sense" about how symbols are drawn
‣ Uses the motor data (recorded pen trajectories) that accompany the images
[Figure 4 of the paper: learned hyperparameters — a) a library of motor primitives, b) the distribution over the number of strokes, c) stroke start positions for the first, second, and later strokes.]
From the paper: an image transformation A^(m) ∈ R^4 is sampled from P(A^(m)) = N([1, 1, 0, 0], Σ_A), where the first two elements control a global re-scaling and the second two control a global translation of the center of mass of T^(m). The transformed trajectories are rendered as a grayscale image using an ink model adapted from [10] (see Section SI-2), then perturbed by two noise processes, which make the gradient more robust during optimization and encourage partial solutions during classification: a Gaussian filter with standard deviation σ_b^(m) and pixel flipping with probability ε^(m).
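A small sketch of applying that A ∈ R^4 transformation (re-scaling about the trajectory's center of mass, then translating it); representing a trajectory as an (n, 2) array is an assumption for illustration:

```python
import numpy as np

def apply_transform(traj, A):
    """traj: (n, 2) pen trajectory; A[0:2] re-scale, A[2:4] translate."""
    center = traj.mean(axis=0)
    scaled = (traj - center) * A[:2] + center  # global re-scaling about the center of mass
    return scaled + A[2:]                      # global translation

traj = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]])
print(apply_transform(traj, np.array([1.1, 0.9, 5.0, -3.0])))
```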
12. One-shot classification
‣ For each image, estimate the posterior over strokes and character type ψ
‣ Then compute the probability of generating the target image from the estimated type
From the paper: to classify a test image I^(T) against training images I^(c), c = 1, ..., 20, HBPL uses a Bayesian classification rule for which an approximate solution is computed:

    c* = argmax_c log P(I^(T) | I^(c))

Intuitively, the approximation uses the HBPL search algorithm to get K high-probability parses of each training image, with weights constrained such that Σ_i w_i = 1, re-optimizes the token-level variables θ^(T) to fit the test image, and scores

    log P(I^(T) | I^(c)) ≈ log ∫ P(I^(T) | θ^(T)) P(θ^(T) | ψ) Q(θ^(c), ψ, I^(c)) dψ dθ^(c) dθ^(T)

The approximation can be improved by incorporating some of the local type-level variance around each parse; drawing conditional samples from the type level P(ψ | θ^(c)[i], I^(c)) is inexpensive and does not require evaluating the likelihood of the image (see Section SI-7).
(Forty human participants in the USA were tested on the same one-shot classification trials: as in Figure 1b, each trial showed an image of a new character and asked which of 20 candidate images showed the same character, with characters never repeated across trials, after two practice trials with the Latin and Greek alphabets.)
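A hedged Python sketch of this classification rule; `parses_of`, `log_weight`, and `refit_log_lik` are hypothetical helpers standing in for the paper's parse search, parse scoring, and token-level re-fitting:

```python
import numpy as np
from scipy.special import logsumexp

def classify(I_test, train_images, parses_of, log_weight, refit_log_lik):
    """Return argmax_c log P(I_test | I_c), approximated over K parses."""
    scores = []
    for I_c in train_images:
        # ~ log sum_i w_i P(I_test | theta^(T) re-fit from parse i of I_c)
        terms = [log_weight(p) + refit_log_lik(I_test, p) for p in parses_of(I_c)]
        scores.append(logsumexp(terms))
    return int(np.argmax(scores))
```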
13. Posterior inference
From the paper: an algorithm finds K high-probability parses, ψ^[1], θ^(m)[1], ..., ψ^[K], θ^(m)[K], the most promising candidates proposed by a fast, bottom-up image analysis (detailed in Section SI-5). These parses approximate the posterior with a discrete distribution

    P(ψ, θ^(m) | I^(m)) ≈ Σ_{i=1}^{K} w_i δ(θ^(m) − θ^(m)[i]) δ(ψ − ψ^[i])

where each weight w_i is proportional to the parse score, marginalizing over shape variables:

    w_i ∝ w̃_i = P(ψ^[i], θ^(m)[i], I^(m))    ← score under the prior, then normalized

constrained such that Σ_i w_i = 1. Rather than using just a point estimate for each parse, the approximation is improved by incorporating some of the local variance around the type-level variables: with the token level fixed, Metropolis-Hastings is run to produce N samples for each parse θ^(m)[i], denoted ψ^[i1], ..., ψ^[iN].
Approximation by selecting candidate parses:
1. Run random walks over the symbol to obtain sampled strokes (150 candidates)
2. Score those strokes under the prior and keep the top K
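A minimal sketch of the top-K weight normalization above, done in log space for numerical stability; the example inputs reuse the parse log scores visible in the slide's figure (−59.6, −88.9, −159, −168):

```python
import numpy as np

def normalize_parse_weights(log_scores, K=5):
    """Keep the K best parses by log score and normalize so sum_i w_i = 1."""
    top = np.sort(np.asarray(log_scores, dtype=float))[::-1][:K]
    w = np.exp(top - top.max())  # shift by the max to avoid underflow/overflow
    return w / w.sum()

print(normalize_parse_weights([-59.6, -88.9, -159.0, -168.0], K=3))
```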
14. Computing the score
‣ Using the estimated type variables, re-estimate the target image's token variables by MCMC
[Figure 5 of the paper: parsing a raw image. a) The raw image (i) is processed by a thinning algorithm [18] (ii) and analyzed as an undirected graph [20] (iii), where parses are guided random walks (Section SI-5). b) The best parses found for an image (top row) are shown with their log w_i (Eq. 5), where numbers inside the circles denote stroke order and starting position, and smaller open circles denote sub-stroke breaks. These parses are re-fit to three different raw images of characters (left in each image triplet), with the best parse (top right) and its associated image reconstruction (bottom right) shown above its score (Eq. 9).]
Given an approximate posterior for a particular image, the model can evaluate the posterior predictive score of a new image by re-fitting the token-level variables (Figure 5b).
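A schematic random-walk Metropolis-Hastings loop for the token-level re-fit mentioned above; the flat parameter vector and Gaussian proposal are simplifying assumptions, not the paper's actual proposal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_refit(theta0, log_post, n_samples=200, step=0.05):
    """Sample token-level variables theta given a fixed type (schematic)."""
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    samples = []
    for _ in range(n_samples):
        prop = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:  # accept with prob min(1, ratio)
            theta, lp = prop, lp_prop
        samples.append(theta.copy())
    return samples
```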
15. Summary
‣ The model is heavily hand-engineered and rather ad hoc
‣ Still, it is interesting in that it shows machine learning, too, can classify and generate from a single example, just as humans do
‣ It may contribute to understanding how humans extract features
