
(DL Reading Group) Matching Networks for One Shot Learning




  1. 1. Matching Networks for One Shot Learning
  2. 2. ¤ DeepMind ¤ Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra ¤ arXiv, 2016/06/13 ¤ One-shot learning ¤ Matching Nets: state-of-the-art ¤ one-shot learning
  3. 3. One-shot learning
  4. 4. One-shot learning ¤ One-shot learning ¤ 1 ¤ deep learning ¤ Deep learning ¤ AI ¤ One-shot learning ¤ Li Fei-Fei, Brenden Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum
      [Lake+ 2011] One shot learning of simple visual concepts. Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology.
      Abstract: People can learn visual concepts from just one encounter, but it remains a mystery how this is accomplished. Many authors have proposed that transferred knowledge from more familiar concepts is a route to one shot learning, but what is the form of this abstract knowledge? One hypothesis is that the sharing of parts is core to one shot learning, but there have been few attempts to test this hypothesis on a large scale. This paper works in the domain of handwritten characters, which contain a rich component structure of strokes. We introduce a generative model of how characters are composed from strokes, and how knowledge from previous characters helps to infer the latent strokes in novel characters. After comparing several models and humans on one shot character learning, we find that our stroke model outperforms a state-of-the-art character model by a large margin, and it provides a closer fit to human perceptual data.
      Keywords: category learning; transfer learning; Bayesian modeling; neural networks
      A hallmark of human cognition is learning from just a few examples. For instance, a person only needs to see one Segway to acquire the concept and be able to discriminate future Segways from other vehicles like scooters and unicycles (Fig. 1 left). Similarly, children can acquire a new word from one encounter (Carey & Bartlett, 1978). How is one shot learning possible? New concepts are almost never learned in a vacuum. Experience with other, more familiar concepts in a do[...]
      Figure 1: Test yourself on one shot learning ("Where are the others?"). From the example boxed in red, can you find the others in the grid below? On the left is a Segway and on the right is the first character of the Bengali alphabet. Answer for the Bengali character: Row 2, Column 2; Row 3, Column 4.
  5. 5. Zero-shot learning ¤ Zero-shot learning 1 ¤ side information ¤ One-shot learning [Socher+ 2013] Zero-Shot Learning Through Cross-Modal Transfer
  6. 6. ¤ ¤ One-shot learning ¤ ¤ [Pan+ 2010] A Survey on Transfer Learning
  7. 7. ¤ ¤ [Pan+2010] [ +2010] ¤ ¤ ¤ ¤ ¤ ¤ ¤ ¤ ¤ or
  8. 8. One-shot learning
  9. 9. One-shot learning ¤ Fei-Fei [Fei-Fei+ 2006] ¤ Zero-shot learning [Larochelle+ 2008] ¤ Hierarchical Bayesian Program Learning (HBPL) [Lake+ 2011; 2012; 2013; 2015] ¤ 1
      Excerpt from [Lake+ 2015] (Science research article): To construct a new character type, first the model samples the number of parts $k$ and the number of subparts $n_i$, for each part $i = 1, ..., k$, from their empirical distributions as measured from the background set. Second, a template for a part $S_i$ is constructed by sampling subparts from a set of discrete primitive actions learned from the background set (Fig. 3A, i), such that the probability of the next action depends on the previous. Third, parts are then grounded as parameterized curves (splines) by sampling the control points and scale parameters for each subpart. Last, parts are roughly positioned to begin either independently, at the beginning, at the end, or along previous parts, as defined by relation $R_i$ (Fig. 3A, iv).
      Character tokens $\theta^{(m)}$ are produced by executing the parts and the relations and modeling how ink flows from the pen to the page. First, motor noise is added to the control points and the scale of the subparts to create token-level stroke trajectories $S^{(m)}$. Second, the trajectory's precise start location $L^{(m)}$ is sampled from the schematic provided by its relation $R_i$ to previous strokes. Third, global transformations are sampled, including an affine warp $A^{(m)}$ and adaptive noise parameters that ease probabilistic inference (30). Last, a binary image $I^{(m)}$ is created by a stochastic rendering function, lining the stroke trajectories with grayscale ink and interpreting the pixel values as independent Bernoulli probabilities.
      Posterior inference requires searching the large combinatorial space of programs that could have generated a raw image $I^{(m)}$. Our strategy uses fast bottom-up methods (31) to propose a range of candidate parses. [...]
      Fig. 3. A generative model of handwritten characters. (A) New types are generated by choosing primitive actions (color coded) from a library (i), combining these subparts (ii) to make parts (iii), and combining parts with relations to define simple programs (iv). New tokens are generated by running these programs (v), which are then rendered as raw data (vi). (B) Pseudocode for generating new types $\psi$ and new token images $I^{(m)}$ for $m = 1, ..., M$. The function $f(\cdot, \cdot)$ transforms a subpart sequence and start location into a trajectory.
      [Figure panels: "Human or Machine?"; human parses, machine parses and human drawings; training item with the model's five best parses]
  10. 10. one-shot learning ¤ 1. 2. 3.
  11. 11. one-shot learning ¤ 2 ¤ One-shot learning ¤ One-shot same or different ¤ One-shot
      Excerpt from "Siamese Neural Networks for One-shot Image Recognition": The verification model learns to identify input pairs according to the probability that they belong to the same class or different classes. This model can then be used to evaluate new images, exactly one per novel class, in a pairwise manner against the test image. The pairing with the highest score according to the verification network is then awarded the highest probability for the one-shot task. [...]
      Figure 2. Our general strategy. 1) Train a model to discriminate [...]
  12. 12. Siamese Network ¤ Siamese Network one-shot learning [Koch+ 2015] ¤ Siamese Network [Bromley+ 1993]
      Excerpts from "Siamese Neural Networks for One-shot Image Recognition":
      Figure 3. A simple 2 hidden layer siamese network for binary classification with logistic prediction $p$. The structure of the network is replicated across the top and bottom sections to form twin networks, with shared weight matrices at each layer.
      [...] This final layer induces a metric on the learned feature space of the $(L-1)$th hidden layer and scores the similarity between the two feature vectors. The $\alpha_j$ are additional parameters that are learned by the model during training, weighting the importance of the component-wise distance. This defines a final $L$th fully-connected layer for the network which joins the two siamese twins. We depict one example above (Figure 4), which shows the largest version of our model that we considered. This network also gave the best result for any network on the verification task.
      3.2. Learning
      Loss function. Let $M$ represent the minibatch size, where $i$ indexes the $i$th minibatch. Now let $\mathbf{y}(x_1^{(i)}, x_2^{(i)})$ be a length-$M$ vector which contains the labels for the minibatch, where we assume $y(x_1^{(i)}, x_2^{(i)}) = 1$ whenever $x_1$ and $x_2$ are from the same character class and $y(x_1^{(i)}, x_2^{(i)}) = 0$ otherwise. We impose a regularized cross-entropy objective on our binary classifier of the following form:
      $$\mathcal{L}(x_1^{(i)}, x_2^{(i)}) = \mathbf{y}(x_1^{(i)}, x_2^{(i)}) \log \mathbf{p}(x_1^{(i)}, x_2^{(i)}) + (1 - \mathbf{y}(x_1^{(i)}, x_2^{(i)})) \log (1 - \mathbf{p}(x_1^{(i)}, x_2^{(i)})) + \boldsymbol{\lambda}^T |\mathbf{w}|^2$$
      Optimization. This objective is combined with the standard backpropagation algorithm, where the gradient is additive across the twin networks due to the tied weights. [...]
      Learning schedule. [...] $\eta_j^{(T)} = 0.99\,\eta_j^{(T-1)}$ [...]
      [The remaining paragraphs of the excerpt (convolutional layers, weight initialization, hyperparameter optimization) are cut off at the column edge in the slide image.]
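The excerpt above describes a final layer that weights the component-wise distance between the twin feature vectors by learned parameters $\alpha_j$, followed by a regularized cross-entropy objective. A minimal sketch of both pieces follows, under stated assumptions: the function names are hypothetical, the sigmoid-of-weighted-L1 form of the score is inferred from the description rather than quoted from the slide, the loss is written in the conventional "lower is better" sign, and a single scalar `lam` stands in for the per-layer vector $\boldsymbol{\lambda}$.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def siamese_score(h1, h2, alpha):
    """Similarity score for two twin feature vectors.

    Assumed form: sigmoid of the alpha-weighted component-wise L1 distance.
    """
    return sigmoid(sum(a * abs(u - v) for a, u, v in zip(alpha, h1, h2)))


def regularized_cross_entropy(y, p, weights, lam, eps=1e-12):
    """Binary cross-entropy plus an L2 penalty (sign chosen so lower is better).

    y is 1 for a same-class pair and 0 otherwise; p is the predicted score.
    lam is a scalar here, a simplification of the per-layer penalties.
    """
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    data_term = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    penalty = lam * sum(w * w for w in weights)
    return data_term + penalty


if __name__ == "__main__":
    p = siamese_score([0.2, 0.9], [0.1, 0.4], alpha=[-1.0, -2.0])
    print(p, regularized_cross_entropy(0, p, weights=[0.5, -0.3], lam=0.01))
```

With negative `alpha` values, larger feature distances push the score toward 0 ("different"), which is one way the learned weights can encode similarity.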
  13. 13. ¤ … ¤ 1 one-shot Neural Turing Machine ¤ Neural Turing Machine (NTM) [Graves+ 2014] ¤ Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller network receives inputs from an external environment and emits outputs in response. It also reads to and writes from a memory matrix via a set of parallel read and write heads. The dashed
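  14. 14. Memory Augmented Neural Network ¤ one-shot learning
      Task setup: we want the network to learn the task of "becoming able to recognize characters right after seeing only a few examples of them".
      • This whole sequence of steps is called an episode (the process continues for 50 rounds...).
      • At the start of an episode, the class number can only be guessed at random.
      • Toward the later part of the episode, the accuracy goes up.
      • Accuracy rising quickly = one-shot learning is working well.
      [Figure labels: "Correct!", "Memory"]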
  15. 15. Matching Networks for One Shot Learning
  16. 16. one-shot learning one-shot learning ¤ One-shot learning ¤ N ¤ N k 1 5
  17. 17. ¤ one-shot learning 1. N k 1 5 L 2. L S B ¤ One-shot learning: $(\hat{x}, \hat{y})$, $S = \{(x_i, y_i)\}_{i=1}^{k}$, $S: \hat{x} \to \hat{y}$ ¤ $P(\hat{y} \mid \hat{x}, S)$ ¤ $S \to S'$ ¤ $P(\hat{y} \mid \hat{x}, S)$
      Excerpt from the paper:
      2.2 Training Strategy
      In the previous subsection we described Matching Networks which map a support set to a classification function, $S \to c(\hat{x})$. We achieve this via a modification of the set-to-set paradigm augmented with attention, with the resulting mapping being of the form $P_\theta(\cdot \mid \hat{x}, S)$, noting that $\theta$ are the parameters of the model (i.e. of the embedding functions $f$ and $g$ described previously).
      The training procedure has to be chosen carefully so as to match inference at test time. Our model has to perform well with support sets $S'$ which contain classes never seen during training.
      More specifically, let us define a task $T$ as distribution over possible label sets $L$. Typically we consider $T$ to uniformly weight all data sets of up to a few unique classes (e.g., 5), with a few examples per class (e.g., up to 5). In this case, a label set $L$ sampled from a task $T$, $L \sim T$, will typically have 5 to 25 examples.
      To form an "episode" to compute gradients and update our model, we first sample $L$ from $T$ (e.g., $L$ could be the label set {cats, dogs}). We then use $L$ to sample the support set $S$ and a batch $B$ (i.e., both $S$ and $B$ are labelled examples of cats and dogs). The Matching Net is then trained to minimise the error predicting the labels in the batch $B$ conditioned on the support set $S$. This is a form of meta-learning since the training procedure explicitly learns to learn from a given support set to minimise a loss over a batch. More precisely, the Matching Nets training objective is as follows:
      $$\theta = \arg\max_\theta \; E_{L \sim T} \left[ E_{S \sim L,\, B \sim L} \left[ \sum_{(x, y) \in B} \log P_\theta(y \mid x, S) \right] \right] \quad (2)$$
      Training $\theta$ with eq. 2 yields a model which works well when sampling $S' \sim T'$ from a different distribution of novel labels. Crucially, our model does not need any fine tuning on the classes it has [...]
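The episode construction described in the training strategy (sample a label set $L$, then a support set $S$ and a batch $B$ from $L$) can be sketched as follows. This is an illustrative sketch, not the authors' code: `sample_episode`, `data_by_class`, and the parameter names are hypothetical, and keeping $S$ and $B$ disjoint is an assumption that the excerpt does not state.

```python
import random


def sample_episode(data_by_class, n_way=5, k_shot=1, batch_per_class=2, rng=random):
    """Sample one training episode: a label set L, a support set S and a batch B.

    data_by_class maps a class label to a list of examples (hypothetical layout).
    S and B are drawn without overlap, which is an assumption of this sketch.
    """
    labels = rng.sample(sorted(data_by_class), n_way)  # L ~ T
    support, batch = [], []
    for label in labels:
        examples = rng.sample(data_by_class[label], k_shot + batch_per_class)
        support += [(x, label) for x in examples[:k_shot]]  # S ~ L
        batch += [(x, label) for x in examples[k_shot:]]    # B ~ L
    return labels, support, batch


if __name__ == "__main__":
    toy = {c: [f"{c}_{i}" for i in range(10)] for c in "abcdefg"}
    L, S, B = sample_episode(toy, n_way=3, k_shot=1, batch_per_class=2,
                             rng=random.Random(0))
    print(L, S, B)
```

A training loop would then evaluate $\log P_\theta(y \mid x, S)$ for every $(x, y)$ in `batch` and take a gradient step, switching to a fresh episode for each minibatch.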
  18. 18. Matching Networks ¤ $P(\hat{y} \mid \hat{x}, S)$ Matching networks ¤ one-shot learning end-to-end
      Figure 1: Matching Networks architecture (figure labels: $S$, $\hat{x}$, $\hat{y}$)
      Excerpt from the paper: [...] train it by showing only a few examples per class, switching the task from minibatch to minibatch, much like how it will be tested when presented with a few examples of a new task. Besides our contributions in defining a model and training criterion amenable for one-shot learning, we contribute by the definition of tasks that can be used to benchmark other approaches on both [...]
  19. 19. Matching Networks ¤ Matching network ¤ a ¤ nearest-neighbor ¤ neural machine translation alignment model ¤ [Bahdanau+ 2016] ¤ a y
      Excerpt from the paper:
      [...] new support set of examples $S'$ from which to one-shot learn, we simply use the parametric neural network defined by $P$ to make predictions about the appropriate label $\hat{y}$ for each test example $\hat{x}$: $P(\hat{y} \mid \hat{x}, S')$. In general, our predicted output class for a given input unseen example $\hat{x}$ and a support set $S$ becomes $\arg\max_y P(y \mid \hat{x}, S)$.
      Our model in its simplest form computes $\hat{y}$ as follows:
      $$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i \quad (1)$$
      where $x_i, y_i$ are the samples and labels from the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$, and $a$ is an attention mechanism which we discuss below. Note that eq. 1 essentially describes the output for a new class as a linear combination of the labels in the support set. Where the attention mechanism $a$ is a kernel on $X \times X$, then (1) is akin to a kernel density estimator. Where the attention mechanism is zero for the $b$ furthest $x_i$ from $\hat{x}$ according to some distance metric and an appropriate constant otherwise, then (1) is equivalent to '$k - b$'-nearest neighbours (although this requires an extension to the attention mechanism that we describe in Section 2.1.2). Thus (1) subsumes both KDE and kNN methods.
      Another view of (1) is where $a$ acts as an attention mechanism and the $y_i$ act as memories bound to the corresponding $x_i$. In this case we can understand this as a particular kind of associative memory where, given an input, we "point" to the corresponding example in the support set, retrieving its label. However, unlike other attentional memory mechanisms [2], (1) is non-parametric in nature: as the support set size grows, so does the memory used. Hence the functional form defined by the classifier $c_S(\hat{x})$ is very flexible and can adapt easily to any new support set.
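Equation (1) can be read as accumulating attention mass per class. The sketch below illustrates that reading only: `predict` is a hypothetical helper, labels are assumed to be integer class indices (equivalent to one-hot $y_i$), and the attention weights are assumed to be already computed and to sum to 1.

```python
def predict(attention, support_labels, num_classes):
    """Eq. (1): y_hat = sum_i a(x_hat, x_i) * y_i with one-hot labels y_i.

    attention[i] is a(x_hat, x_i); support_labels[i] is the class index of x_i.
    Returns the per-class probabilities and the arg-max class.
    """
    probs = [0.0] * num_classes
    for a, y in zip(attention, support_labels):
        probs[y] += a  # adding a * one_hot(y) only touches entry y
    best = max(range(num_classes), key=probs.__getitem__)
    return probs, best


if __name__ == "__main__":
    # Three support examples: two of class 2 and one of class 0.
    print(predict([0.7, 0.2, 0.1], [2, 0, 2], num_classes=3))
```

Because the prediction is a lookup over the support set, swapping in a new support set changes the classifier without touching any learned parameters, which is the non-parametric property the excerpt emphasises.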
  20. 20. ¤ a c softmax ¤ g bidirectional RNN ¤ f LSTM ¤ VGG Inception
      Figure 1: Matching Networks architecture
      Excerpts from the paper:
      [...] train it by showing only a few examples per class, switching the task from minibatch to minibatch, much like how it will be tested when presented with a few examples of a new task. Besides our contributions in defining a model and training criterion amenable for one-shot learning, we contribute by the definition of tasks that can be used to benchmark other approaches on both ImageNet and small scale language modeling. We hope that our results will encourage others to work on this challenging problem.
      We organized the paper by first defining and explaining our model whilst linking its several components to related work. Then in the following section we briefly elaborate on some of the related work to the task and our model. In Section 4 we describe both our general setup and the experiments we performed, demonstrating strong results on one-shot learning on a variety of tasks and setups.
      2 Model
      Our non-parametric approach to solving one-shot learning is based on two components which we describe in the following subsections. First, our model architecture follows recent advances in neural networks augmented with memory (as discussed in Section 3). Given a (small) support set $S$, our model defines a function $c_S$ (or classifier) for each $S$, i.e. a mapping $S \to c_S(\cdot)$. Second, we employ a training strategy which is tailored for one-shot learning from the support set $S$.
      2.1 Model Architecture
      In recent years, many groups have investigated ways to augment neural network architectures with external memories and other components that make them more "computer-like". We draw inspiration from models such as sequence to sequence (seq2seq) with attention [2], memory networks [29] and pointer networks [27]. In all these models, a neural attention mechanism, often fully differentiable, is defined to access (or read) a memory matrix which stores useful information to solve the task at hand. Typical uses of this include machine translation, speech recognition, or question answering. More generally, these architectures model $P(B \mid A)$ where $A$ and/or $B$ can be a sequence (like in seq2seq models), or, more interestingly for us, a set [26]. Our contribution is to cast the problem of one-shot learning within the set-to-set framework [26].
      2.1.1 The Attention Kernel
      Equation 1 relies on choosing $a(\cdot, \cdot)$, the attention mechanism, which fully specifies [...] classifier. The simplest form that this takes (and which has very tight relationships with [...] attention models and kernel functions) is to use the softmax over the cosine distance $c$:
      $$a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}),\, g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}),\, g(x_j))}}$$
      with embedding functions $f$ and $g$ being appropriate neural networks (potentially with $f = g$) to embed $\hat{x}$ and $x_i$. In our experiments we [...] examples where $f$ and $g$ are parameterised variously as deep convolutional networks [...] tasks (as in VGG [22] or Inception [24]) or a simple form word embedding for language [...] Section 4).
      We note that, though related to metric learning, the classifier defined by Equation 1 is discriminative. For a given support set $S$ and sample to classify $\hat{x}$, it is enough for $\hat{x}$ to be sufficiently aligned [...] pairs $(x', y') \in S$ such that $y' = y$ and misaligned with the rest. This kind of loss is also [...] methods such as Neighborhood Component Analysis (NCA) [18], triplet loss [9] or large [...] nearest neighbor [28].
      However, the objective that we are trying to optimize is precisely aligned with multi-way classification, and thus we expect it to perform better than its counterparts. Additionally, [...] simple and differentiable so that one can find the optimal parameters in an "end-to-end" fashion.
      2.1.2 Full Context Embeddings
      The main novelty of our model lies in reinterpreting a well studied framework (neural networks [...] external memories) to do one-shot learning. Closely related to metric learning, the embedding functions $f$ and $g$ act as a lift to feature space $X$ to achieve maximum accuracy through the [...]
      Appendix A Model Description
      In this section we fully specify the models which condition the embedding functions $f$ and $g$ on the whole support set $S$. Much previous work has fully described similar mechanisms, which is why we left the precise details for this appendix.
      A.1 The Fully Conditional Embedding f
      As described in section 2.1.2, the embedding function for an example $\hat{x}$ in the batch $B$ is as follows:
      $$f(\hat{x}, S) = \mathrm{attLSTM}(f'(\hat{x}), g(S), K)$$
      where $f'$ is a neural network (e.g., VGG or Inception, as described in the main text). We define $K$ to be the number of "processing" steps following work from [26] from their "Process" block. $g(S)$ represents the embedding function $g$ applied to each element $x_i$ from the set $S$. Thus, the state after $k$ processing steps is as follows: [...] noting that $\mathrm{LSTM}(x, h, c)$ follows the same LSTM implementation defined in [23] with $x$ the input, $h$ the output (i.e., cell after the output gate), and $c$ the cell. $a$ is commonly referred to as "content" based attention, and the softmax in eq. 6 normalizes w.r.t. $g(x_i)$. The read-out $r_{k-1}$ from $g(S)$ is concatenated to $h_{k-1}$. Since we do $K$ steps of "reads", $\mathrm{attLSTM}(f'(\hat{x}), g(S), K) = h_K$ where $h_k$ is as described in eq. 3.
      A.2 The Fully Conditional Embedding g
      In section 2.1.2 we described the encoding function for the elements in the support set $S$, $g(x_i, S)$, as a bidirectional LSTM. More precisely, let $g'(x_i)$ be a neural network (similar to $f'$ above, e.g. a VGG or Inception model). Then we define $g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)$ with:
      $$\overrightarrow{h}_i, \overrightarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})$$
      $$\overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})$$
      where, as in above, $\mathrm{LSTM}(x, h, c)$ follows the same LSTM implementation defined in [23] with $x$ the input, $h$ the output (i.e., cell after the output gate), and $c$ the cell. Note that the recursion for $\overleftarrow{h}$ starts from $i = |S|$. As in eq. 3, we add a skip connection between input and outputs.
      B ImageNet Class Splits
      Here we define the two class splits used in our full ImageNet experiments; these classes were excluded for training during our one-shot experiments described in section 4.1.2.
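The attention kernel of section 2.1.1, a softmax over cosine similarities between the embedded query and the embedded support examples, can be sketched in a few lines. This sketch makes several assumptions: the inputs are already-embedded vectors (the networks $f$ and $g$ are not modelled), all vectors are non-zero, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity c(u, v); assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attention_kernel(f_x, g_support):
    """a(x_hat, x_i) = softmax_i of c(f(x_hat), g(x_i)).

    f_x is the embedded query; g_support is a list of embedded support examples.
    """
    sims = [cosine(f_x, g_i) for g_i in g_support]
    shift = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # The query points along the first axis, so the first support item wins.
    print(attention_kernel([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]))
```

Since cosine similarity is bounded in [-1, 1], the resulting weights are never very peaked; a learned or fixed temperature would sharpen them, though that is a design choice the excerpt does not mention.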
  21. 21. Set-to-set ¤ seq2seq ¤ Matching network ¤ Order Matters: Sequence to sequence for sets [Vinyals+ 2015] ¤ Seq2seq
      Excerpt from "Order Matters: Sequence to sequence for sets" (published as a conference paper at ICLR 2016):
      All these empirical findings point to the same story: often for optimization purposes, the order in which input data is shown to the model has an impact on the learning performance.
      Note that we can define an ordering which is independent of the input sequence or set $X$ (e.g., always reversing the words in a translation task), but also an ordering which is input dependent (e.g., sorting the input points in the convex hull case). This distinction also applies in the discussion about output sequences and sets in Section 5.1.
      Recent approaches which pushed the seq2seq paradigm further by adding memory and computation to these models allowed us to define a model which makes no assumptions about input ordering, whilst preserving the right properties which we just discussed: a memory that increases with the size of the set, and which is order invariant. In the next sections, we explain such a modification, which could also be seen as a special case of a Memory Network (Weston et al., 2015) or Neural Turing Machine (Graves et al., 2014), with a computation flow as depicted in Figure 1.
      4.2 ATTENTION MECHANISMS
      Neural models with memories coupled to differentiable addressing mechanism have been successfully applied to handwriting generation and recognition (Graves, 2012), machine translation (Bahdanau et al., 2015a), and more general computation machines (Graves et al., 2014; Weston et al., 2015). Since we are interested in associative memories we employed a "content" based attention. This has the property that the vector retrieved from our memory would not change if we randomly shuffled the memory. This is crucial for proper treatment of the input set $X$ as such. In particular, our process block based on an attention mechanism uses the following:
      $$q_t = \mathrm{LSTM}(q^*_{t-1}) \quad (3)$$
      $$e_{i,t} = f(m_i, q_t) \quad (4)$$
      $$a_{i,t} = \frac{\exp(e_{i,t})}{\sum_j \exp(e_{j,t})} \quad (5)$$
      $$r_t = \sum_i a_{i,t}\, m_i \quad (6)$$
      $$q^*_t = [q_t \; r_t] \quad (7)$$
      Figure 1: The Read-Process-and-Write model.
      where $i$ indexes through each memory vector $m_i$ (typically equal to the cardinality of $X$), $q_t$ is a query vector which allows us to read $r_t$ from the memories, $f$ is a function that computes a single scalar from $m_i$ and $q_t$ (e.g., a dot product), and LSTM is an LSTM which computes a recurrent state but which takes no inputs. $q^*_t$ is the state which this LSTM evolves, and is formed by concatenating the query $q_t$ with the resulting attention readout $r_t$. $t$ is the index which indicates [...]
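The read step of the process block (eqs. 4 to 6) can be sketched to illustrate the order-invariance property the excerpt stresses. This covers only the content-based read: the LSTM update of eq. 3 and the concatenation of eq. 7 are left out, $f$ is taken to be a dot product (the example the excerpt gives), and `attention_read` is a hypothetical name.

```python
import math


def attention_read(memories, query):
    """One content-based read over a set of memory vectors.

    e_i = <m_i, q>            (eq. 4, with f chosen as a dot product)
    a_i = softmax_i(e_i)      (eq. 5)
    r   = sum_i a_i * m_i     (eq. 6)
    """
    scores = [sum(m_d * q_d for m_d, q_d in zip(m, query)) for m in memories]
    shift = max(scores)  # numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memories[0])
    read = [sum(w * m[d] for w, m in zip(weights, memories)) for d in range(dim)]
    return weights, read


if __name__ == "__main__":
    mem = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    q = [0.5, -0.2]
    _, r1 = attention_read(mem, q)
    _, r2 = attention_read(list(reversed(mem)), q)
    # The read vector matches (up to rounding) regardless of memory order.
    print(r1, r2)
```

The weights are permuted along with the memories, but the weighted sum is unchanged, which is why the read treats its input as a set rather than a sequence.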
  22. 22. ¤ N-way k-shot learning ¤ One-shot ¤ N k ¤ N N 1/N ¤ fine-tuning N
  23. 23. 1: ¤ Omniglot ¤ 1623 20 ¤ ¤ Pixels nearest neighbor ¤ Baseline CNN nearest neighbor ¤ N ¤ MANN ¤ Siamese network
  24. 24. ¤ ¤ Fine-tuning ¤ ¤ Lake (by Karpathy ) ¤ 1-shot 20-way 95.2% [Lake+ 2011]
  25. 25. 2 ¤
  26. 26. ¤ One-shot generalization [Rezende+ 2016] ¤ VAE ¤ One-shot generation
      Excerpts from "One-shot Generalization in Deep Generative Models":
      Figure 2. Stochastic computational graph showing conditional probabilities and computational steps for sequential generative models. (a) Unconditional generative model. (b) One-step of the conditional generative model. A represents an attentional mechanism that uses function $f_w$ for writings and function $f_r$ for reading.
      [...] and our transition is specified as a long short-term memory network (LSTM, Hochreiter & Schmidhuber (1997)). We explicitly represent the creation of a set of hidden variables $c_t$ that is a hidden canvas of the model (equation (6)). The canvas function $f_c$ allows for many different transformations, and it is here where generative (writing) attention is used; we describe a number of choices for this function in section 3.2.3. The generated image (7) is sampled using an observation function $f_o(c; \theta_o)$ that maps the last hidden canvas $c_T$ to the parameters of the observation model. The set of all parameters of the generative model is $\theta = \{\theta_h, \theta_c, \theta_o\}$.
      3.2.2. FREE ENERGY OBJECTIVE
      Given the probabilistic model (3)-(7) we can obtain an [...]
      [...] smaller in size and can have any number of channels (four in this paper). We consider two ways with which to update the hidden canvas:
      Additive Canvas. As the name implies, an additive canvas updates the canvas by simply adding a transformation of the hidden state $f_w(h_t; \theta_c)$ to the previous canvas state $c_{t-1}$. This is a simple, yet effective (see results) update rule:
      $$f_c(c_{t-1}, h_t; \theta_c) = c_{t-1} + f_w(h_t; \theta_c) \quad (9)$$
      Gated Recurrent Canvas. The canvas function can be updated using a convolutional gated recurrent unit (CGRU) architecture (Kaiser & Sutskever, 2015), which provides a non-linear and recursive updating mechanism for the canvas and are simplified versions of convolutional LSTMs (further details of the CGRU are given in appendix B). [...]
      Figure 8. Unconditional samples for 52 × 52 omniglot (task 1). For a video of the generation process, see watch?v=HQEI2xfTgm4
      Figure 9. Generating new exemplars of a given character for the weak generalization test (task 2a). The first row shows the test images and the next 10 are one-shot samples from the model.
      Figure 10. Generating new exemplars of a given character for the strong generalization test (task 2b,c), with models trained with different amounts of data. Left: samples from model trained on 30-20 train-test split; Middle: 40-10 split; Right: 45-5 split.
  27. 27. ¤ One-shot learning ¤ Zero-shot learning ¤ ¤ ¤ 3 1. 2. 3. ¤ 3 ¤ Matching Networks end-to-end ¤ ¤
  28. 28. ¤ ¤ deep-learning# ¤ learning-and-transfer-learning ¤ Karpathy notes/blob/master/ ¤ ¤