Copyright (C) DeNA Co., Ltd. All Rights Reserved.
AI System Dept.
System Management Unit
Kazuki Fujikawa
Matching Networks for One Shot Learning
https://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning
Paper introduction
NIPS2016 reading group @ Preferred Networks
2017/01/19
Abstract
■ One-shot learning with attention and memory
  ⁃ Learn a concept from one or only a few training examples
  ⁃ Train a fully end-to-end nearest neighbor classifier, incorporating the best characteristics of both parametric and non-parametric models
  ⁃ Improved one-shot accuracy on Omniglot from 88.0% to 93.8% (and on ImageNet from 87.6% to 93.2%) compared to competing approaches
Figure 1: Matching Networks architecture
AGENDA
■ Introduction
■ Related work
  ⁃ One-shot learning
  ⁃ Attention mechanisms
■ Matching Networks
■ Experiments
  ⁃ Omniglot
  ⁃ ImageNet
  ⁃ Penn Treebank
Supervised Learning
■ Learn a correspondence between training data and labels
  ⁃ Requires a large labeled dataset for training (e.g. CIFAR-10 [Krizhevsky+, 2009]: 6,000 images per class)
  ⁃ It is hard for a classifier to learn new concepts from little data
[Figure: a classifier in the training phase (examples and labels for airplane, automobile, bird, cat, deer) and in the predicting phase (one-hot output, e.g. automobile = 1 and all other classes = 0). Image source: https://www.cs.toronto.edu/~kriz/cifar.html]
One-shot Learning
■ Learn a concept from one or only a few training examples
  ⁃ The classifier can be (pre-)trained on datasets whose labels are not used in the predicting phase
[Figure: a classifier in the (pre-)training phase and in the predicting phase (one-shot learning phase); the two phases use different CIFAR-10 classes, {airplane, automobile, bird, cat, deer} and {dog, frog, horse, ship, truck}. Image source: https://www.cs.toronto.edu/~kriz/cifar.html]
One-shot Learning
■ Task: N-way k-shot learning
T: Training task, T': Testing task
  • Separate the labels used for training from those used for testing
  • None of the labels used in the testing phase (one-shot learning phase) are used in the training phase
[Figure: the CIFAR-10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) divided between T and T'. Image source: https://www.cs.toronto.edu/~kriz/cifar.html]
One-shot Learning
■ Task: N-way k-shot learning
T: Training task, T': Testing task
  • T' is used for one-shot learning
  • T can be used freely for training (e.g. multiclass classification)
[Figure: the CIFAR-10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) divided between T and T'. Image source: https://www.cs.toronto.edu/~kriz/cifar.html]
One-shot Learning
■ Task: N-way k-shot learning
T: Training task, T': Testing task
L': Label set, obtained by sampling N labels from T'
  • In this figure, L' has 3 classes (automobile, cat, deer), thus "3-way k-shot learning"
[Figure: the CIFAR-10 classes divided between T and T', with the label set L' = {automobile, cat, deer} drawn from T'. Image source: https://www.cs.toronto.edu/~kriz/cifar.html]
One-shot Learning
■ Task: N-way k-shot learning
T: Training task, T': Testing task
L': Label set, obtained by sampling N labels from T' (here: automobile, cat, deer)
S': Support set, obtained by sampling k examples per label in L'
x̂: Query, obtained by sampling 1 example from L'
  • Task: classify x̂ into 3 classes, {automobile, cat, deer}, using the support set
[Figure: the CIFAR-10 classes divided between T and T', with the label set L', the support set S' and the query x̂ drawn from T'. Image source: https://www.cs.toronto.edu/~kriz/cifar.html]
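The sampling procedure on the N-way k-shot slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_episode`, the dictionary layout `data_by_label`, and the toy image ids are hypothetical and do not come from the slides or the paper.

```python
import random


def sample_episode(data_by_label, n_way, k_shot, rng):
    """Sample one N-way k-shot episode: (label set L', support set S', query)."""
    # L': N distinct labels drawn from the testing task T'.
    label_set = rng.sample(sorted(data_by_label), n_way)
    support, held_out = [], []
    for label in label_set:
        # Draw k + 1 distinct examples so the query never duplicates a support example.
        picked = rng.sample(data_by_label[label], k_shot + 1)
        support.extend((x, label) for x in picked[:k_shot])
        held_out.append((picked[k_shot], label))
    # The query is one held-out example from one of the N classes.
    query = rng.choice(held_out)
    return label_set, support, query


# Hypothetical toy data: five image ids per class of the testing task T'.
testing_task = {
    label: [f"{label}_{i}" for i in range(5)]
    for label in ["airplane", "automobile", "bird", "cat", "deer"]
}
label_set, support, query = sample_episode(
    testing_task, n_way=3, k_shot=1, rng=random.Random(0)
)
```

With `n_way=3` and `k_shot=1` this corresponds to the 3-way 1-shot setting: the support set holds one labelled example for each of the three sampled classes, and the classifier has to assign the query to one of them.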
Related Work (One-shot Learning)
■ Convolutional Siamese Network [Koch+, 2015]
  ⁃ Learn image representation with a siamese neural network
  ⁃ Reuse features from the network for one-shot learning
[Figure: two CNNs embed a pair of images and the network predicts whether the pair belongs to the same class ("Same?")]
Related Work (One-shot Learning)
■ Memory-Augmented Neural Networks (MANN) [Santoro+, 2016]
  ⁃ Quickly encode and retrieve new information using external memory, inspired by the idea of Neural Turing Machines
Related Work (One-shot Learning)
■ Siamese Learnet [Bertinetto+, NIPS2016]
  ⁃ Learn the parameters of a network to incorporate domain-specific information from a few examples
Figure 1 of [Bertinetto+, NIPS2016] (panels: siamese, siamese learnet, learnet): the proposed architectures predict the parameters of a network from a single example, replacing static convolutions (green) with dynamic convolutions (red). The siamese learnet predicts the parameters of an embedding function that is applied to both inputs, whereas the single-stream learnet predicts the parameters of a function that is applied to the other input. Dashed connections represent parameter sharing.
Excerpt from the paper: the parameters W of the predictor are learned from a single exemplar z using a meta-prediction process, i.e. a non-iterative feed-forward function ω that maps (z; W') to W. Since in practice this function is implemented using a deep neural network, it is called a learnet. The learnet depends on the exemplar z, which is a single representative of the class of interest, and contains parameters W' of its own. Learning to learn can then be posed as the problem of optimizing the learnet meta-parameters W'. The feed-forward learnet evaluation is much faster than solving an optimization problem for each new exemplar.
Related Work (Attention Mechanism)
■ Sequence to Sequence with Attention [Bahdanau+, 2014]
  ⁃ Attend to the words in the source sentence that are relevant to generating the next target word
Figure 1 of [Bahdanau+, 2014] (published as a conference paper at ICLR 2015): the graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T).
Excerpt from the paper: the decoder state is s_i = f(s_{i-1}, y_{i-1}, c_i). Unlike the existing encoder-decoder approach, the probability is conditioned on a distinct context vector c_i for each target word y_i. The context vector c_i depends on a sequence of annotations (h_1, ..., h_{T_x}) to which an encoder maps the input sentence. Each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. The context vector c_i is computed as a weighted sum of these annotations:
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j \qquad (5)$$
The weight \alpha_{ij} of each annotation h_j is computed by
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \qquad (6)$$
where e_{ij} = a(s_{i-1}, h_j) is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i-1} (just before emitting y_i) and the j-th annotation h_j of the input sentence. The alignment model a is parametrized as a feedforward neural network which is jointly trained with all the other components of the proposed system.
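The context-vector computation of Eqs. (5) and (6) can be illustrated with a small pure-Python sketch. It assumes the alignment scores e_ij are already given (in the paper they come from a learned feedforward alignment model, which is not implemented here); the function names and the toy numbers are hypothetical.

```python
import math


def softmax(scores):
    """Eq. (6): normalise alignment scores into attention weights."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores, annotations):
    """Eq. (5): weighted sum of the annotations h_j with weights alpha_ij."""
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)
    ]
    return weights, context


# Toy input: three source positions with equal alignment scores (assumed values).
scores = [0.0, 0.0, 0.0]
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = context_vector(scores, annotations)
```

Because the three scores are equal, each annotation receives weight 1/3 and the context vector is simply the mean of the annotations; unequal scores would shift the context towards the better-aligned source positions.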
Related Work (Attention Mechanism)
■ Pointer Networks [Vinyals+, 2015]
  ⁃ Generate the output sequence using a distribution over the dictionary of inputs
Figure 1 of [Vinyals+, 2015]: (a) Sequence-to-Sequence: an RNN (blue) processes the input sequence to create a code vector that is used to generate the output sequence (purple) using the probability chain rule and another RNN. The output dimensionality is fixed by the dimensionality of the problem and it is the same during training and inference. (b) Ptr-Net: an encoding RNN converts the input sequence to a code (blue) that is fed to the generating network (purple). At each step, the generating network produces a vector that modulates a content-based attention mechanism over inputs. The output of the attention mechanism is a softmax distribution with dictionary size equal to the length of the input.
Excerpt from the paper (Section 2.3, Ptr-Net): the sequence-to-sequence model uses a softmax distribution over a fixed-size output dictionary to compute p(C_i | C_1, ..., C_{i-1}, P), so it cannot be used for problems where the size of the output dictionary is equal to the length of the input sequence. Ptr-Net instead models this probability with the attention mechanism:
$$u^i_j = v^T \tanh(W_1 e_j + W_2 d_i), \qquad j \in (1, \ldots, n)$$
$$p(C_i \mid C_1, \ldots, C_{i-1}, \mathcal{P}) = \mathrm{softmax}(u^i)$$
where softmax normalizes the vector u^i (of length n) to be an output distribution over the dictionary of inputs, and v, W_1, and W_2 are learnable parameters of the output model.
Excerpt from the paper (Section 3): in the training data, the inputs are planar point sets P = {P_1, ..., P_n}, where P_j = (x_j, y_j) are the Cartesian coordinates of the points, and the outputs C^P are sequences representing the solution associated with the point set P (convex hull, Delaunay triangulation, or Travelling Salesman Problem).
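The Ptr-Net output distribution can be sketched as follows. The sketch assumes the encoder states e_j and the decoder state d_i are already available as plain lists of floats; the parameter values (identity matrices for W_1 and W_2, a vector of ones for v) are placeholders chosen for illustration, not learned values from the paper.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def pointer_distribution(encoder_states, decoder_state, w1, w2, v):
    """softmax over u_j = v^T tanh(W1 e_j + W2 d_i), one entry per input element."""
    projected_decoder = matvec(w2, decoder_state)
    scores = []
    for e_j in encoder_states:
        projected_encoder = matvec(w1, e_j)
        hidden = [math.tanh(a + b) for a, b in zip(projected_encoder, projected_decoder)]
        scores.append(sum(v_k * h_k for v_k, h_k in zip(v, hidden)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Placeholder parameters (assumptions for illustration only).
identity = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
decoder_state = [0.5, 0.5]

three_inputs = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
four_inputs = three_inputs + [[0.2, 0.3]]
p3 = pointer_distribution(three_inputs, decoder_state, identity, identity, v)
p4 = pointer_distribution(four_inputs, decoder_state, identity, identity, v)
```

The point of the example is that the size of the output distribution follows the input: `p3` has three entries and `p4` has four, with no change to the parameters.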
Related Work (Attention Mechanism)
■ Sequence to Sequence for Sets [Vinyals+, ICLR2016]
  ⁃ Handle input sets using an extension of the seq2seq framework: the Read-Process-and-Write model
Figure 1 of [Vinyals+, ICLR2016]: the Read-Process-and-Write model (blocks: Read, Process, Write).
Excerpt from the paper: neural models with memories coupled to differentiable addressing mechanisms have been successfully applied to handwriting generation and recognition (Graves, 2012), machine translation (Bahdanau et al., 2015a), and more general computation machines (Graves et al., 2014; Weston et al., 2015). Since we are interested in associative memories we employed a "content" based attention. This has the property that the vector retrieved from our memory would not change if we randomly shuffled the memory. This is crucial for proper treatment of the input set X as such. In particular, our process block based on an attention mechanism uses the following:
$$q_t = \mathrm{LSTM}(q^*_{t-1}) \qquad (3)$$
$$e_{i,t} = f(m_i, q_t) \qquad (4)$$
$$a_{i,t} = \frac{\exp(e_{i,t})}{\sum_j \exp(e_{j,t})} \qquad (5)$$
$$r_t = \sum_i a_{i,t} m_i \qquad (6)$$
$$q^*_t = [q_t \; r_t] \qquad (7)$$
where i indexes through each memory vector m_i (typically equal to the cardinality of X), q_t is a query vector which allows us to read r_t from the memories, f is a function that computes a single scalar from m_i and q_t (e.g., a dot product), and LSTM is an LSTM which computes a recurrent state but which takes no inputs. q^*_t is the state which this LSTM evolves, and is formed by concatenating the query q_t with the resulting attention readout r_t.
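The read step of the process block (Eqs. (4) to (7)) can be sketched without the LSTM by treating the query q_t as given. The sketch below uses a dot product for f, as the excerpt suggests, and checks the property the excerpt highlights: the readout does not depend on the order of the memory. The function name `read_memory` and the toy vectors are assumptions made for illustration.

```python
import math


def read_memory(query, memories):
    """One content-based read: returns the readout r_t and the state [q_t r_t]."""
    # Eq. (4): a scalar score per memory vector, here a dot product.
    scores = [sum(q * m for q, m in zip(query, mem)) for mem in memories]
    # Eq. (5): softmax over the scores (maximum subtracted for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Eq. (6): attention-weighted sum of the memory vectors.
    dim = len(memories[0])
    readout = [sum(w * mem[d] for w, mem in zip(weights, memories)) for d in range(dim)]
    # Eq. (7): concatenate the query with the readout.
    return readout, list(query) + readout


query = [1.0, 2.0]
memories = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
readout, next_state = read_memory(query, memories)
# Reading from the same memory in a different order gives the same readout.
readout_shuffled, _ = read_memory(query, list(reversed(memories)))
```

Since the weights are attached to the memory vectors rather than to their positions, any permutation of `memories` leaves the readout unchanged up to floating-point rounding, which is what makes the mechanism suitable for sets.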
Matching Networks [Vinyals+, NIPS2016]
■ Motivation
  ⁃ One-shot learning requires learning rapidly from new examples while retaining the ability to handle common examples
    • Simple parametric models such as deep classifiers need to be re-optimized to handle new examples
    • Non-parametric models such as k-nearest neighbor don't require optimization, but their performance depends on the chosen metric
  ⁃ It could be efficient to train an end-to-end nearest-neighbor-based classifier
Matching Networks [Vinyals+, NIPS2016]
■ Train a classifier through one-shot learning
T: Training task, T': Testing task
L: Label set, obtained by sampling N labels from T (here: dog, horse, ship)
S: Support set, obtained by sampling k examples from L
B: Batch, obtained by sampling b examples from L
[Figure: the CIFAR-10 classes divided between T and T', with the label set L, the support set S and the batch B drawn from T. Image source: https://www.cs.toronto.edu/~kriz/cifar.html]
Matching Networks [Vinyals+, NIPS2016]
■ System Overview
  ⁃ The embedding functions f and g are parameterized as a simple CNN (e.g. VGG or Inception) or as the fully conditional embedding functions described later
Figure 1: Matching Networks architecture. The support set S = {(x_i, y_i)} is embedded by g into g(x_i), the query x̂ is embedded by f into f(x̂, S), and the attention a over the support set weights the labels y_i, which are summed (Σ) to give P(ŷ | x̂, S).
Excerpt from the paper (Section 2, Model): the model in its simplest form computes a probability over ŷ as follows:
$$P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i \qquad (1)$$
where x_i, y_i are the inputs and corresponding label distributions from the support set S = {(x_i, y_i)}_{i=1}^{k}, and a is an attention mechanism. Eq. (1) essentially describes the output for a new class as a linear combination of the labels in the support set. Where the attention mechanism a is a kernel on X × X, (1) is akin to a kernel density estimator (KDE). Where the attention mechanism is zero for the b furthest x_i from x̂ according to some distance metric and an appropriate constant otherwise, (1) is equivalent to "k - b"-nearest neighbours. Thus (1) subsumes both KDE and kNN methods. Another view of (1) is where a acts as an attention mechanism and the y_i act as values bound to the corresponding keys x_i, much like a hash table: given an input, we "point" to the corresponding example in the support set, retrieving its label.
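A small worked instance of Eq. (1) may help make the "linear combination of labels" reading concrete. The numbers are made up for illustration: a support set of k = 3 examples over two classes with one-hot labels, and attention weights that are assumed to be given.

```latex
\begin{align*}
y_1 &= (1, 0), \quad y_2 = (0, 1), \quad y_3 = (1, 0), \\
a(\hat{x}, x_1) &= 0.6, \quad a(\hat{x}, x_2) = 0.3, \quad a(\hat{x}, x_3) = 0.1, \\
P(\hat{y} \mid \hat{x}, S)
  &= 0.6\,(1, 0) + 0.3\,(0, 1) + 0.1\,(1, 0) = (0.7,\; 0.3).
\end{align*}
```

With one-hot labels, the probability of a class is the total attention mass placed on the support examples of that class, so here the first class wins with probability 0.7.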
Matching Networks [Vinyals+, NIPS2016]
■ The Attention Kernel
  ⁃ Calculate the softmax over the cosine distance c between f(x̂, S) and g(x_i)
    • Similar to a nearest neighbor calculation
  ⁃ Train the network using the cross-entropy loss
Figure 1: Matching Networks architecture (query x̂ embedded by f, support set S = {(x_i, y_i)} embedded by g).
Excerpt from the paper (Section 2.1.1, The Attention Kernel): Equation 1 relies on choosing a(., .), the attention mechanism, which fully specifies the classifier. The simplest form is to use the softmax over the cosine distance c:
$$a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}), g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}}$$
with embedding functions f and g being appropriate neural networks (potentially with f = g) to embed x̂ and x_i. In the experiments f and g are parameterised variously as deep convolutional networks for image tasks (as in VGG or Inception) or a simple form of word embedding for language tasks.
Excerpt from the paper (the fully conditional embedding f):
$$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1})$$
$$h_k = \hat{h}_k + f'(\hat{x})$$
$$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i)$$
$$a(h_{k-1}, g(x_i)) = \frac{e^{h_{k-1}^T g(x_i)}}{\sum_{j=1}^{|S|} e^{h_{k-1}^T g(x_j)}}$$
where LSTM(x, h, c) takes x as the input, h as the output (i.e., the cell after the output gate), and c as the cell; a is commonly referred to as content-based attention. K steps of "reads" are performed, so f(x̂, S) = h_K.
Excerpt from the paper (Section 2.2, Training Strategy): the training procedure has to be chosen carefully so as to match inference at test time: the model has to perform well with support sets S' which contain classes never seen during training. A task T is defined as a distribution over possible label sets L. To form an "episode" to compute gradients and update the model, first sample L from T (e.g., L could be the label set {cats, dogs}), then use L to sample the support set S and a batch B (i.e., both S and B are labelled examples of cats and dogs). The Matching Net is then trained to minimise the error predicting the labels in the batch B conditioned on the support set S. More precisely, the training objective is:
$$\theta = \arg\max_{\theta} \; E_{L \sim T}\left[ E_{S \sim L,\, B \sim L}\left[ \sum_{(x, y) \in B} \log P_{\theta}(y \mid x, S) \right] \right]$$
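The attention kernel and the cross-entropy loss on this slide can be sketched together. The sketch makes several assumptions that should be read as such: the embeddings f(x̂) and g(x_i) are given directly as small vectors instead of being produced by a CNN, c is taken to be cosine similarity (so that closer examples get larger weights), and all names and numbers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def attention(query_emb, support_embs):
    """Softmax over the cosine similarities between the query and the support set."""
    sims = [cosine(query_emb, g_x) for g_x in support_embs]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def predict(query_emb, support_embs, support_labels, classes):
    """P(y | x, S): sum the attention weights of the support examples per class."""
    probs = {c: 0.0 for c in classes}
    for weight, label in zip(attention(query_emb, support_embs), support_labels):
        probs[label] += weight
    return probs


def cross_entropy(probs, true_label):
    """Negative log-likelihood of the true label for one query."""
    return -math.log(probs[true_label])


# Assumed toy embeddings for a 3-way 1-shot episode.
classes = ["automobile", "cat", "deer"]
support_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
support_labels = ["automobile", "cat", "deer"]
query_emb = [0.9, 0.1]  # lies closest in angle to the "automobile" embedding

probs = predict(query_emb, support_embs, support_labels, classes)
loss = cross_entropy(probs, "automobile")
```

Summing `cross_entropy` over the queries of a batch B and minimising it with respect to the embedding parameters corresponds to maximising the log-likelihood in the training objective; in this sketch there are no trainable parameters, so only the forward computation is shown.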
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding g
  ⁃ Embed x_i taking the whole support set S into account
[Figure: Matching Networks architecture in which each support example x_i (with label y_i) is first embedded by g' and then passed through a bidirectional LSTM over the support set S; the LSTM outputs are added (+) to g'(x_i)]
Excerpt from the paper (Appendix A.1): LSTM(x, h, c) takes x as the input, h as the output (i.e., the cell after the output gate), and c as the cell; a is commonly referred to as content-based attention, and its softmax normalizes w.r.t. g(x_i). The read-out r_{k-1} from g(S) is concatenated to h_{k-1}. Since K steps of "reads" are performed, attLSTM(f'(x̂), g(S), K) = h_K.
Excerpt from the paper (Appendix A.2, The Fully Conditional Embedding g): the encoding function for the elements in the support set S, g(x_i, S), is a bidirectional LSTM. More precisely, let g'(x_i) be a neural network (similar to f' above, e.g. a VGG or Inception model). Then g(x_i, S) = $\vec{h}_i + \overleftarrow{h}_i + g'(x_i)$ with:
$\vec{h}_i, \vec{c}_i = \mathrm{LSTM}(g'(x_i), \vec{h}_{i-1}, \vec{c}_{i-1})$
~hi, ~ci = LSTM(g0
(xi), ~hi+1, ~ci+1)
where, as in above, LSTM(x, h, c) follows the same LSTM implementation de
the input, h the output (i.e., cell after the output gate), and c the cell. Note tha
starts from i = |S|. As in eq. 3, we add a skip connection between input and ou
B ImageNet Class Splits
Here we deļ¬ne the two class splits used in our full ImageNet experiments ā€“
excluded for training during our one-shot experiments described in section 4.1.
Lrand =
n01498041, n01537544, n01580077, n01592084, n01632777, n01644373, n01665541, n01675722, n016882
n01818515, n01843383, n01883070, n01950731, n02002724, n02013706, n02092339, n02093256, n020953
10
gā€™: neural network (e.g., VGG or Inception)
a(hk 1, g(xi)) = softmax(hk 1g(xi))
noting that LSTM(x, h, c) follows the same LSTM implementation deļ¬ned in [23] with x th
h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as ā€œc
based attention, and the softmax in eq. 6 normalizes w.r.t. g(xi). The read-out rk 1 from
concatenated to hk 1. Since we do K steps of ā€œreadsā€, attLSTM(f0
(Ė†x), g(S), K) = hK w
is as described in eq. 3.
A.2 The Fully Conditional Embedding g
In section 2.1.2 we described the encoding function for the elements in the support set S, g
as a bidirectional LSTM. More precisely, let g0
(xi) be a neural network (similar to f0
above
VGG or Inception model). Then we deļ¬ne g(xi, S) = ~hi + ~hi + g0
(xi) with:
~hi,~ci = LSTM(g0
(xi),~hi 1,~ci 1)
~hi, ~ci = LSTM(g0
(xi), ~hi+1, ~ci+1)
where, as in above, LSTM(x, h, c) follows the same LSTM implementation deļ¬ned in [23]
the input, h the output (i.e., cell after the output gate), and c the cell. Note that the recursio
starts from i = |S|. As in eq. 3, we add a skip connection between input and outputs.
B ImageNet Class Splits
Here we deļ¬ne the two class splits used in our full ImageNet experiments ā€“ these classe
excluded for training during our one-shot experiments described in section 4.1.2.
Lrand =
n01498041, n01537544, n01580077, n01592084, n01632777, n01644373, n01665541, n01675722, n01688243, n01729977, n
n01818515, n01843383, n01883070, n01950731, n02002724, n02013706, n02092339, n02093256, n02095314, n02097130, n
xi
g(xi,S)
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding g
  ⁃ Embed in consideration of S
[Figure 1: Matching Networks architecture. Diagram labels: Support Set (S) with examples $x_i$ and labels $y_i$; $g'$]
Embed $x_i$ into a vector using g′
(g′: neural network such as VGG or Inception)
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding g
  ⁃ Embed in consideration of S
[Figure 1: Matching Networks architecture. Diagram labels: Support Set (S) with examples $x_i$ and labels $y_i$; $g'$, LSTM, LSTM]
Feed $g'(x_i)$ into the Bi-LSTM
(g′: neural network such as VGG or Inception)
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding g
  ⁃ Embed in consideration of S
[Figure 1: Matching Networks architecture. Diagram labels: Support Set (S) with examples $x_i$ and labels $y_i$; $g'$, LSTM, LSTM, "+", $g(x_i, S)$]
Let $g(x_i, S)$ be the sum of $g'(x_i)$ and the outputs of the Bi-LSTM
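The three build steps for g (embed with g′, run a Bi-LSTM over the support set, add the skip connection) can be sketched as follows. This is an illustrative sketch under stated assumptions: the LSTM cell is a simplified, bias-free variant with random weights, `make_lstm`, `lstm_step`, `run_lstm` and `fully_conditional_g` are hypothetical helper names, and the inputs are taken to be already-computed $g'(x_i)$ vectors rather than outputs of a VGG or Inception network.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_lstm(dim, rng):
    """Random weights for four gates; each gate maps [x, h] (2*dim) to dim."""
    return [[[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)]
             for _ in range(dim)] for _ in range(4)]

def lstm_step(weights, x, h, c):
    """One simplified (bias-free) LSTM step: returns new output h and cell c."""
    xh = x + h                                           # concatenate input and state
    pre = [[sum(w * v for w, v in zip(row, xh)) for row in gate]
           for gate in weights]
    i = [sigmoid(z) for z in pre[0]]                     # input gate
    f = [sigmoid(z) for z in pre[1]]                     # forget gate
    o = [sigmoid(z) for z in pre[2]]                     # output gate
    g = [math.tanh(z) for z in pre[3]]                   # candidate cell
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new

def run_lstm(weights, inputs):
    """Run the LSTM over a sequence from a zero state; return h at each position."""
    dim = len(inputs[0])
    h, c = [0.0] * dim, [0.0] * dim
    outputs = []
    for x in inputs:
        h, c = lstm_step(weights, x, h, c)
        outputs.append(h)
    return outputs

def fully_conditional_g(g_prime, fwd_weights, bwd_weights):
    """g(x_i, S) = forward h_i + backward h_i + g'(x_i) for every x_i in S."""
    forward = run_lstm(fwd_weights, g_prime)
    # The backward recursion starts from i = |S|, so reverse in and out.
    backward = run_lstm(bwd_weights, g_prime[::-1])[::-1]
    return [[a + b + x for a, b, x in zip(hf, hb, gp)]
            for hf, hb, gp in zip(forward, backward, g_prime)]

# Hypothetical g'(x_i) vectors for a support set of three examples.
rng = random.Random(0)
fwd, bwd = make_lstm(3, rng), make_lstm(3, rng)
g_prime = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.3, 0.3, 0.3]]
embedded = fully_conditional_g(g_prime, fwd, bwd)
```

Because the backward LSTM has already seen the later support examples by the time it reaches $x_1$, changing any other element of S changes $g(x_1, S)$, which is what "embed in consideration of S" refers to.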
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding f
  ⁃ Embed in consideration of S
[Figure 1: Matching Networks architecture. Diagram labels: Support Set (S) with examples $x_i$ and labels $y_i$; $g$; query $\hat{x}$; $f'$, LSTM, $\hat{h}_{k-1}$, $h_{k-1}$, $\hat{h}_k$, "+"; read-out $r_{k-1}$ as the weighted sum of $a(h_{k-1}, g(x_i))\,g(x_i)$; output $f(\hat{x}, S) = h_K$]
so, we define the following recurrence over “processing” steps k, following work from [26]:
$$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1}) \tag{2}$$
$$h_k = \hat{h}_k + f'(\hat{x}) \tag{3}$$
$$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i) \tag{4}$$
$$a(h_{k-1}, g(x_i)) = e^{h_{k-1}^T g(x_i)} \Big/ \sum_{j=1}^{|S|} e^{h_{k-1}^T g(x_j)} \tag{5}$$
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding f
  ⁃ Embed in consideration of S
[Figure 1: Matching Networks architecture. Diagram labels: Support Set (S) with examples $x_i$ and labels $y_i$; $g$, $g(x_i)$; query $\hat{x}$; $f'$, LSTM, $\hat{h}_1$, "+", $h_1$]
$h_1$ is calculated without using S:
$$h_1 = \mathrm{LSTM}(f'(\hat{x}), [h_0, r_0], c_0) + f'(\hat{x})$$
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding f
  ⁃ Embed in consideration of S
[Figure 1: Matching Networks architecture. Diagram labels: Support Set (S) with examples $x_i$ and labels $y_i$; $g$, $g(x_i)$; query $\hat{x}$; $f'$, LSTM, $\hat{h}_1$, "+", $h_1$; $a(h_1, g(x_i))$]
Calculate the relevance between $g(x_i)$ and $h_1$:
$$a(h_1, g(x_i)) = \mathrm{softmax}(h_1^T g(x_i))$$
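As a small numeric illustration of this relevance step, the sketch below computes the softmax of eq. 5 over dot products. The function name `attention_weights` and the 2-D vectors are made up for the example; in the model, $h_1$ and $g(x_i)$ would come from the LSTM and the support-set embedding.

```python
import math

def attention_weights(h, support_embeddings):
    """a(h, g(x_i)): softmax over the dot products h^T g(x_i), as in eq. 5."""
    scores = [sum(a * b for a, b in zip(h, g_i)) for g_i in support_embeddings]
    top = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical query state h_1 and three support embeddings g(x_i).
h1 = [1.0, 0.0]
g_S = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights = attention_weights(h1, g_S)
```

The dot products here are 2, 0 and 1, so the first support example receives the largest weight and the second the smallest; the weights sum to 1 because the softmax normalizes over the whole support set.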
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding f
  ⁃ Embed in consideration of S
[Figure 1: Matching Networks architecture. Diagram labels: Support Set (S) with examples $x_i$ and labels $y_i$; $g$, $g(x_i)$; query $\hat{x}$; $f'$, LSTM, $\hat{h}_1$, "+", $h_1$; $a(h_1, g(x_i))$; weighted sum $r_1$]
$r_1$ is a sum of $g(x_i)$ weighted according to the relevance to $h_1$:
$$r_1 = \sum_{i=1}^{|S|} a(h_1, g(x_i))\, g(x_i)$$
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding f
  ⁃ Embed x̂ in consideration of S
(Figure 1: Matching Networks architecture, annotated with f′, LSTM, g(x_i), the weights a(h_1, g(x_i)), the weighted sum r_1, ĥ_1, "+", h_1, the query x̂, and the support set S = {(x_i, y_i)}.)
Recurrence over "processing" steps k, following work from [26]:
$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1})$ (2)
$h_k = \hat{h}_k + f'(\hat{x})$ (3)
h_1 is calculated using S
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding f
  ⁃ Embed x̂ in consideration of S
(Figure 1: Matching Networks architecture, annotated for a general step k with f′, LSTM, the weighted sum r_{k−1} of a(h_{k−1}, g(x_i)) g(x_i), ĥ_{k−1}, h_{k−1}, ĥ_k, the query x̂, and the support set S = {(x_i, y_i)}.)
Recurrence over "processing" steps k, following work from [26]:
$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1})$ (2)
$h_k = \hat{h}_k + f'(\hat{x})$ (3)
$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i)$ (4)
$a(h_{k-1}, g(x_i)) = e^{h_{k-1}^T g(x_i)} / \sum_{j=1}^{|S|} e^{h_{k-1}^T g(x_j)}$ (5)
Let f(x̂, S) be the output after K steps:
$f(\hat{x}, S) = h_K$
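The recurrence in Eqs. (2)-(5) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_cell` is a hypothetical stand-in for a trained LSTM cell, the zero initial states h_0 and c_0 are assumed, and every function name here is invented for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def read_out(h, support_emb):
    """Eqs. (4)-(5): attention-weighted sum of the support embeddings g(x_i)."""
    weights = softmax([dot(h, g) for g in support_emb])
    dim = len(support_emb[0])
    return [sum(w * g[d] for w, g in zip(weights, support_emb)) for d in range(dim)]

def toy_cell(x, hidden, c):
    """Hypothetical placeholder for a trained LSTM cell (NOT a real LSTM).

    It only mimics the interface of Eq. (2): input f'(x_hat), hidden state
    [h_{k-1}, r_{k-1}], cell state c_{k-1}; returns (h_hat_k, c_k).
    """
    h_prev, r_prev = hidden
    c_new = [0.5 * ci + 0.5 * math.tanh(xi + hi + ri)
             for ci, xi, hi, ri in zip(c, x, h_prev, r_prev)]
    h_hat = [math.tanh(ci) for ci in c_new]
    return h_hat, c_new

def fce_embed(f_prime_x, support_emb, cell=toy_cell, K=3):
    """Run K processing steps and return f(x_hat, S) = h_K."""
    dim = len(f_prime_x)
    h = [0.0] * dim  # assumption: h_0 = 0
    c = [0.0] * dim  # assumption: c_0 = 0
    for _ in range(K):
        r = read_out(h, support_emb)                   # r_{k-1}, Eqs. (4)-(5)
        h_hat, c = cell(f_prime_x, (h, r), c)          # Eq. (2)
        h = [a + b for a, b in zip(h_hat, f_prime_x)]  # Eq. (3), skip connection
    return h

# Usage: two support embeddings g(x_i) and one query embedding f'(x_hat).
support = [[1.0, 0.0], [0.0, 1.0]]
embedding = fce_embed([0.1, -0.2], support, K=2)
```

With h_0 = 0 the first read-out is a uniform average of the support embeddings; later steps shift the attention toward support items that align with the current h.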
Experimental Settings
■ Datasets
  ⁃ Image classification sets
    • Omniglot [Lake+, 2011]
    • ImageNet [Deng+, 2009]
      ref. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
  ⁃ Language modeling
    • Penn Treebank [Marcus+, 1993]
Experimental Settings (Omniglot)
■ Baseline
  ⁃ Matching on raw pixels
  ⁃ Matching on discriminative features from VGG (Baseline classifier)
  ⁃ MANN
  ⁃ Convolutional Siamese Network
■ Datasets
  ⁃ training: 1200 characters
  ⁃ testing: 423 characters
Experimental Results (Omniglot)
■ Fully Conditional Embedding (FCE) did not seem to help much
■ Baseline and Siamese Net were improved with fine-tuning
Paper excerpt: [...] took this network and used the features from the last layer (before the softmax) for nearest neighbour matching, a strategy commonly used in computer vision [3] which has achieved excellent results across many tasks. Following [11], the convolutional siamese nets were trained on a same-or-different task of the original training data set and then the last layer was used for nearest neighbour matching.
Table 1: Results on the Omniglot dataset.
Model                            Matching Fn  Fine Tune  5-way 1-shot  5-way 5-shot  20-way 1-shot  20-way 5-shot
PIXELS                           Cosine       N          41.7%         63.2%         26.7%          42.6%
BASELINE CLASSIFIER              Cosine       N          80.0%         95.0%         69.5%          89.1%
BASELINE CLASSIFIER              Cosine       Y          82.3%         98.4%         70.6%          92.0%
BASELINE CLASSIFIER              Softmax      Y          86.0%         97.6%         72.9%          92.3%
MANN (NO CONV) [21]              Cosine       N          82.8%         94.9%         –              –
CONVOLUTIONAL SIAMESE NET [11]   Cosine       N          96.7%         98.4%         88.0%          96.5%
CONVOLUTIONAL SIAMESE NET [11]   Cosine       Y          97.3%         98.4%         88.1%          97.0%
MATCHING NETS (OURS)             Cosine       N          98.1%         98.9%         93.8%          98.5%
MATCHING NETS (OURS)             Cosine       Y          97.9%         98.7%         93.5%          98.7%
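The "N-way, k-shot" columns refer to episodes in which a small support set is drawn from N classes with k examples each, and the task changes from minibatch to minibatch. A minimal sketch of such episode sampling is given below; the function name `sample_episode`, the `data_by_class` dictionary layout, and the `n_query` parameter are assumptions made for illustration and do not come from the paper or the slides.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=1, rng=random):
    """Sample one N-way, k-shot episode.

    data_by_class: dict mapping a class name to a list of examples.
    Returns (support, queries), each a list of (example, episode_label)
    pairs, where episode_label is an index in range(n_way).
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, queries = [], []
    for label, cls in enumerate(classes):
        # Draw support and query examples without replacement so they never overlap.
        examples = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        queries += [(x, label) for x in examples[k_shot:]]
    return support, queries

# Usage with toy data: 10 classes, 4 examples each.
toy_data = {f"char{i}": [f"char{i}_img{j}" for j in range(4)] for i in range(10)}
support_set, query_batch = sample_episode(toy_data, n_way=5, k_shot=1, n_query=2,
                                          rng=random.Random(0))
```

Labels are re-indexed per episode, so a class has no fixed identity across episodes; at test time the same routine would be run over classes held out from training.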
Experimental Settings (ImageNet)
■ Baseline
  ⁃ Matching on raw pixels
  ⁃ Matching on discriminative features from InceptionV3 (Baseline classifier)
■ Datasets
  ⁃ miniImageNet (size: 84x84)
    • training: 80 classes
    • testing: 20 classes
  ⁃ randImageNet
    • training: randomly selected classes (882 classes)
    • testing: remaining classes (118 classes)
  ⁃ dogsImageNet
    • training: all non-dog classes (882 classes)
    • testing: dog classes (118 classes)
Experimental Results (miniImageNet)
■ Matching Networks outperformed the baselines
■ Fully Conditional Embedding (FCE) was shown to be effective in improving performance on this task
Figure 2: Example of two 5-way problem instance on ImageNet. The images in the set S′ contain classes never seen during training. Our model makes far less mistakes than the Inception baseline.
Table 2: Results on miniImageNet.
Model                  Matching Fn   Fine Tune  5-way 1-shot  5-way 5-shot
PIXELS                 Cosine        N          23.0%         26.6%
BASELINE CLASSIFIER    Cosine        N          36.6%         46.0%
BASELINE CLASSIFIER    Cosine        Y          36.2%         52.2%
BASELINE CLASSIFIER    Softmax       Y          38.4%         51.2%
MATCHING NETS (OURS)   Cosine        N          41.2%         56.2%
MATCHING NETS (OURS)   Cosine        Y          42.4%         58.0%
MATCHING NETS (OURS)   Cosine (FCE)  N          44.2%         57.0%
MATCHING NETS (OURS)   Cosine (FCE)  Y          46.6%         60.0%
Experimental Results (randImageNet, dogsImageNet)
■ Matching Networks outperformed the Inception Classifier on Lrand, but degraded on Ldogs
■ The decrease in performance on Ldogs might be caused by the different distributions of labels between training and testing
  ⁃ The training support set comes from a random distribution of labels, whereas the testing support set comes from similar classes
Table 3: Results on full ImageNet on rand and dogs one-shot tasks. Note that ≠Lrand and ≠Ldogs are sets of classes which are seen during training, but are provided for completeness.
ImageNet 5-way 1-shot Acc
Model                  Matching Fn     Fine Tune  Lrand   ≠Lrand  Ldogs   ≠Ldogs
PIXELS                 Cosine          N          42.0%   42.8%   41.4%   43.0%
INCEPTION CLASSIFIER   Cosine          N          87.6%   92.6%   59.8%   90.0%
MATCHING NETS (OURS)   Cosine (FCE)    N          93.2%   97.0%   58.8%   96.4%
INCEPTION ORACLE       Softmax (Full)  Y (Full)   ≈ 99%   ≈ 99%   ≈ 99%   ≈ 99%
Paper excerpt (text cropped in the slide is marked [...]): [...] 1-shot tasks from the training data set, incorporating Full Context Embeddings and our Matching Networks and training strategy.
The results of the randImageNet and dogsImageNet experiments are shown in Table 3. The Inception Oracle (trained on all classes) performs almost perfectly when restricted to 5 classes only, which is not too surprising given its impressive top-1 accuracy. When trained solely on ≠Lrand, Matching Nets improve upon Inception by almost 6% when tested on Lrand, halving the errors. Figure 2 shows two instances of 5-way one-shot learning, where Inception fails. Looking at all the errors, Inception appears to sometimes prefer an image above all others (these images tend to be cluttered like the example in the second column, or more constant in color). Matching Nets, on the other hand, manage to recover from these outliers that sometimes appear in the support set S′.
Matching Nets manage to improve upon Inception on the complementary subset ≠Ldogs (although this setup is not one-shot, as the feature extraction has been trained on these labels). [...] much more challenging Ldogs subset, our model degrades by 1%. We hypothesize [...] that the sampled set during training, S, comes from a random distribution of labels [...] whereas the testing support set S′ from Ldogs contains similar classes, more akin [...] classification. Thus, we believe that if we adapted our training strategy to samples S from fine grained sets of labels instead of sampling uniformly from the leafs of the ImageNet class tree, improvements could be attained. We leave this as future work.
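As a quick check of the "almost 6%" and "halving the errors" statements, the arithmetic below uses only the Lrand and Ldogs columns of Table 3; the symbols $\mathrm{err}$ and $\Delta$ are shorthand introduced here, not notation from the paper.

```latex
\begin{align*}
\mathrm{err}_{\text{Inception}}(L_{rand}) &= 100\% - 87.6\% = 12.4\% \\
\mathrm{err}_{\text{MatchingNets}}(L_{rand}) &= 100\% - 93.2\% = 6.8\% \\
\frac{\mathrm{err}_{\text{MatchingNets}}}{\mathrm{err}_{\text{Inception}}} &= \frac{6.8}{12.4} \approx 0.55 \\
\Delta_{acc}(L_{rand}) &= 93.2\% - 87.6\% = 5.6 \text{ points} \\
\Delta_{acc}(L_{dogs}) &= 58.8\% - 59.8\% = -1.0 \text{ points}
\end{align*}
```

So the error rate on Lrand drops to roughly 55% of the Inception baseline's, while accuracy on Ldogs falls by one point.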
Experimental Settings (Penn Treebank)
(Figure: each sentence x_i in the support set S and the query sentence x̂ is read by an LSTM, giving g(x_i) and f(x̂, S); the attention a between them weights the labels y_i.)
Paper excerpt (text cropped in the slide is marked [...]): Our model in its simplest form computes a probability over ŷ as follows:
$P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$ (1)
where x_i, y_i are the inputs and corresponding label distributions from the support set [...] $\{(x_i, y_i)\}_{i=1}^{k}$, and a is an attention mechanism which we discuss below. Note that eq. 1 essentially describes the output for a new class as a linear combination of the labels in the support set. Where the attention mechanism a is a kernel on X × X, then (1) is akin to a kernel density estimator. Where the attention mechanism is zero for the b furthest x_i from x̂ according to some distance metric and an appropriate constant otherwise, then (1) is equivalent to 'k − b'-nearest neighbours (although this requires an extension to the attention mechanism that we describe in Section 2[...]). Thus (1) subsumes both KDE and kNN methods. Another view of (1) is where a acts as an attention mechanism and the y_i act as values bound to the corresponding keys x_i, much like a hash table. [...] this case we can understand this as a particular kind of associative memory where, given [...] we "point" to the corresponding example in the support set, retrieving its label. Hence the functional form defined by the classifier c_S(x̂) is very flexible and can adapt easily to any new support set.
2.1.1 The Attention Kernel
Equation 1 relies on choosing a(., .), the attention mechanism, which fully specifies the classifier. The simplest form that this takes (and which has very tight relationships with [...] attention models and kernel functions) is to use the softmax over the cosine distance c [...]
$a(\hat{x}, x_i) = e^{c(f(\hat{x}), g(x_i))} / \sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}$
with embedding functions f and g being appropriate neural networks (potentially with f = g) to embed x̂ and x_i. In our experiments we [...] examples where f and g are parameterised variously as deep convolutional networks for image tasks (as in VGG[22] or Inception[24]) or a simple form word embedding for language tasks ([...] Section 4).
We note that, though related to metric learning, the classifier defined by Equation 1 is discriminative [...]
c: cosine distance
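Equation (1) together with the softmax-over-cosine attention kernel can be sketched in a few lines of Python. This is an illustrative sketch only: the embeddings are passed in as precomputed vectors (standing in for f(x̂) and g(x_i)), c is implemented as cosine similarity, labels are assumed one-hot and given as class indices, and the names `cosine`, `attention` and `predict` are hypothetical.

```python
import math

def cosine(u, v):
    """c(u, v): cosine similarity between two vectors (small epsilon avoids 0/0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def attention(query_emb, support_embs):
    """a(x_hat, x_i): softmax over c(f(x_hat), g(x_i)) across the support set."""
    scores = [cosine(query_emb, g) for g in support_embs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(query_emb, support_embs, support_labels, n_classes):
    """Eq. (1): P(y | x_hat, S) = sum_i a(x_hat, x_i) y_i with one-hot y_i."""
    weights = attention(query_emb, support_embs)
    probs = [0.0] * n_classes
    for w, label in zip(weights, support_labels):
        probs[label] += w  # adding w * one_hot(label)
    return probs

# Usage: a 3-way, 1-shot support set with 2-dimensional embeddings.
support_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
support_labels = [0, 1, 2]
probs = predict([0.9, 0.1], support_embs, support_labels, n_classes=3)
predicted_class = probs.index(max(probs))
```

Because the output is a weighted vote over the support labels, swapping in a new support set changes the classifier without changing any parameters, which is the non-parametric property the slides emphasise.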
Paper excerpt:
4.1.3 One-Shot Language Modeling
We also introduce a new one-shot language task which is analogous to those examined for images. The task is as follows: given a query sentence with a missing word in it, and a support set of sentences which each have a missing word and a corresponding 1-hot label, choose the label from the support set that best matches the query sentence. Here we show a single example, though note that the words on the right are not provided and the labels for the set are given as 1-hot-of-5 vectors.
1. an experimental vaccine can alter the immune response of people infected with the aids virus a <blank_token> u.s. scientist said.    prominent
2. the show one of five new nbc <blank_token> is the second casualty of the three networks so far this fall.    series
3. however since eastern first filed for chapter N protection march N it has consistently promised to pay creditors N cents on the <blank_token>.    dollar
4. we had a lot of people who threw in the <blank_token> today said <unk> ellis a partner in benjamin jacobson & sons a specialist in trading ual stock on the big board.    towel
5. it's not easy to roll out something that <blank_token> and make it pay mr. jacob says.    comprehensive
Query: in late new york trading yesterday the <blank_token> was quoted at N marks down from N marks late friday and at N yen down from N yen late friday.    dollar
Sentences were taken from the Penn Treebank dataset [15]. On each trial, we make sure that the set and batch are populated with sentences that are non-overlapping. This means that we do not use words with very low frequency counts; e.g. if there is only a single sentence for a given word we do not use this data since the sentence would need to be in both the set and the batch. As with the image tasks, each trial consisted of a 5 way choice between the classes available in the set. We used a batch size of 20 throughout the sentence matching task and varied the set size across k=1,2,3. We ensured that the same number of sentences were available for each class in the set. We split the words into a randomly sampled 9000 for training and 1000 for testing, and we used the standard test set to report results. Thus, neither the words nor the sentences used during test time had been seen during training.
We compared our one-shot matching model to an oracle LSTM language model (LSTM-LM) [30] trained on all the words. In this setup, the LSTM has an unfair advantage as it is not doing one-shot learning but seeing all the data – thus, this should be taken as an upper bound. To do so, we examined a similar setup wherein a sentence was presented to the model with a single word filled in with 5 different possible words (including the correct answer). For each of these 5 sentences the model gave a log-likelihood and the max of these was taken to be the choice of the model.
■ Fill in a blank in a query sentence with a label from the support set
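The construction of one such fill-in-the-blank trial can be sketched as follows. The sketch assumes one support sentence per word (k=1) and whitespace tokenisation; the helper names `blank_out` and `make_lm_episode` and the shortened toy sentences are hypothetical and only illustrate the task format.

```python
BLANK = "<blank_token>"

def blank_out(sentence, word):
    """Replace the first occurrence of `word` in `sentence` with the blank token."""
    tokens = sentence.split()
    idx = tokens.index(word)  # raises ValueError if the word is absent
    return " ".join(tokens[:idx] + [BLANK] + tokens[idx + 1:])

def make_lm_episode(support_pairs, query_pair):
    """Build one trial from (sentence, missing_word) pairs.

    Returns the blanked support sentences, their one-hot labels, the blanked
    query sentence, and the index of the correct label for the query.
    """
    words = [w for _, w in support_pairs]
    support = [blank_out(s, w) for s, w in support_pairs]
    one_hot = [[1 if i == j else 0 for j in range(len(words))]
               for i in range(len(words))]
    q_sentence, q_word = query_pair
    return support, one_hot, blank_out(q_sentence, q_word), words.index(q_word)

# Usage with shortened toy sentences (a 2-way trial for brevity).
support_pairs = [
    ("the show is one of five new nbc series", "series"),
    ("creditors get N cents on the dollar", "dollar"),
]
query_pair = ("the dollar was quoted at N marks", "dollar")
support, labels, query, target = make_lm_episode(support_pairs, query_pair)
```

The model only ever sees the blanked sentences and the one-hot labels; the words themselves serve here just to generate the trial and its answer index.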
Experimental Settings and Results (Penn Treebank)
■ Baseline
  ⁃ Oracle LSTM-LM
    • Trained on all the words (not one-shot)
    • Consider this model as an upper bound
■ Datasets
  ⁃ training: 9000 words
  ⁃ testing: 1000 words
■ Results (5-way accuracy)
  Model            1-shot   2-shot  3-shot
  Matching Nets    32.4%    36.1%   38.2%
  Oracle LSTM-LM   (72.8%)  -       -
Conclusion
■ They proposed Matching Networks: a nearest-neighbor-based approach trained fully end-to-end
■ Key points
  ⁃ "One-shot learning is much easier if you train the network to do one-shot learning" [Vinyals+, 2016]
  ⁃ Matching Networks have a non-parametric structure and can therefore incorporate new examples rapidly
■ Findings
  ⁃ Matching Networks improved performance on Omniglot, miniImageNet, and randImageNet, but degraded on dogsImageNet
  ⁃ One-shot learning with fine-grained sets of labels is difficult to solve and could therefore be an exciting challenge in this area
References
■ Matching Networks
  ⁃ Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural Information Processing Systems. 2016.
■ One-shot Learning
  ⁃ Koch, Gregory. Siamese neural networks for one-shot image recognition. Diss. University of Toronto, 2015.
  ⁃ Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." Proceedings of The 33rd International Conference on Machine Learning. 2016.
  ⁃ Bertinetto, Luca, et al. "Learning feed-forward one-shot learners." Advances in Neural Information Processing Systems. 2016.
■ Attention Mechanisms
  ⁃ Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
  ⁃ Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." Advances in Neural Information Processing Systems. 2015.
  ⁃ Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. "Order matters: Sequence to sequence for sets." ICLR. 2016.
References
■ Datasets
⁃ Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." (2009).
⁃ Deng, Jia, et al. "Imagenet: A large-scale hierarchical image database." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
⁃ Lake, Brenden M., et al. "One shot learning of simple visual concepts." Proceedings of the 33rd Annual Conference of the Cognitive Science Society. Vol. 172. 2011.
⁃ Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. "Building a large annotated corpus of English: The Penn Treebank." Computational linguistics 19.2 (1993): 313-330.

Matching networks for one shot learning

  • 1. Matching Networks for One Shot Learning (paper introduction)
      - Kazuki Fujikawa, AI System Dept., System Management Unit, DeNA Co., Ltd.
      - Paper: https://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning
      - NIPS2016 reading group @ Preferred Networks, 2017/01/19
      - Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  • 2. Abstract
      - One-shot learning with attention and memory
          - Learn a concept from one or only a few training examples
          - Train a fully end-to-end nearest neighbor classifier, incorporating the best characteristics of both parametric and non-parametric models
          - Improved one-shot accuracy on Omniglot from 88.0% to 93.2% compared to competing approaches
      - Figure 1: Matching Networks architecture
  • 3. Agenda
      - Introduction
      - Related work
          - One-shot learning
          - Attention mechanisms
      - Matching Networks
      - Experiments
          - Omniglot
          - ImageNet
          - Penn Treebank
  • 4. Supervised Learning
      - Learn a correspondence between training data and labels
          - Requires a large labeled dataset for training (e.g. CIFAR10 [Krizhevsky+, 2009]: 6,000 images per class)
          - It is hard to make classifiers learn new concepts from little data
      - [Figure: in the training phase, a classifier is fit to examples and labels for airplane, automobile, bird, cat, deer; in the predicting phase, it outputs a one-hot prediction over the same labels (0 airplane, 1 automobile, 0 bird, 0 cat, 0 deer). Images: https://www.cs.toronto.edu/~kriz/cifar.html]
  • 5. One-shot Learning
      - Learn a concept from one or only a few training examples
          - The classifier can be pre-trained on datasets whose labels are not used in the predicting phase
      - [Figure: a classifier in the (pre-)training phase and in the predicting phase (one-shot learning phase), with two disjoint label groups: airplane, automobile, bird, cat, deer and dog, frog, horse, ship, truck. Images: https://www.cs.toronto.edu/~kriz/cifar.html]
  • 6. One-shot Learning
      - Task: N-way k-shot learning
      - [Figure: T, the training task (dog, frog, horse, ship, truck), and T', the testing task (airplane, automobile, bird, cat, deer). Images: https://www.cs.toronto.edu/~kriz/cifar.html]
          - The labels for training and testing are kept separate
          - None of the labels used in the testing phase (one-shot learning phase) are used in the training phase
  • 7. One-shot Learning
      - Task: N-way k-shot learning
      - [Figure: T, the training task, and T', the testing task, as on the previous slide]
          - T' is used for one-shot learning
          - T can be used freely for training (e.g. multiclass classification)
  • 8. One-shot Learning
      - Task: N-way k-shot learning
      - [Figure: T and T' as above, plus L', the label set, obtained by sampling N labels from T']
          - In this figure, L' has 3 classes (automobile, cat, deer), thus "3-way k-shot learning"
  • 9. One-shot Learning
      - Task: N-way k-shot learning
      - [Figure: T and T' as above, plus the following]
          - L': label set, obtained by sampling N labels from T'
          - S': support set, obtained by sampling k examples from L'
          - x̂: query, obtained by sampling 1 example from L'
          - Task: classify x̂ into 3 classes, {automobile, cat, deer}, using the support set S'
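The sampling steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `data_by_label` layout, the function name `sample_episode`, and the toy task are hypothetical, and "k examples from L'" is read here as k examples per sampled label.

```python
import random

def sample_episode(data_by_label, n_way, k_shot, rng):
    """Sample a label set L', a support set S', and one query from a task.

    data_by_label is a hypothetical layout: a dict mapping each label
    to a list of (hashable) examples.
    """
    # L': sample N labels from the task.
    labels = rng.sample(sorted(data_by_label), n_way)
    support = []
    for label in labels:
        # S': sample k examples for each label in L' (assumed per-label reading).
        support += [(x, label) for x in rng.sample(data_by_label[label], k_shot)]
    # Query: one example whose label is in L' and which is not in the support set.
    query_label = rng.choice(labels)
    used = {x for x, y in support if y == query_label}
    candidates = [x for x in data_by_label[query_label] if x not in used]
    query = (rng.choice(candidates), query_label)
    return labels, support, query

# Toy testing task T' with five classes and four placeholder examples per class.
task = {name: [f"{name}_{i}" for i in range(4)]
        for name in ["airplane", "automobile", "bird", "cat", "deer"]}
labels, support, query = sample_episode(task, n_way=3, k_shot=1, rng=random.Random(0))
```

With `n_way=3` and `k_shot=1` this produces a 3-way 1-shot episode: three labels, one support example per label, and a single held-out query.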
  • 10. Related Work (One-shot Learning)
      - Convolutional Siamese Network [Koch+, 2015]
          - Learn an image representation with a siamese neural network
          - Reuse features from the network for one-shot learning
      - [Figure: two CNN branches feeding a "Same?" decision]
  • 11. Related Work (One-shot Learning)
      - Memory-Augmented Neural Networks (MANN) [Santoro+, 2016]
          - Quickly encode and retrieve new information using external memory, inspired by the idea of the Neural Turing Machine
  • 12. Related Work (One-shot Learning)
      - Siamese Learnet [Bertinetto+, NIPS2016]
          - Learn the parameters of a network to incorporate domain-specific information from a few examples
      - Figure 1 (from the paper): Our proposed architectures predict the parameters of a network from a single example, replacing static convolutions (green) with dynamic convolutions (red). The siamese learnet predicts the parameters of an embedding function that is applied to both inputs, whereas the single-stream learnet predicts the parameters of a function that is applied to the other input. Linear layers are denoted by * and nonlinear layers by σ. Dashed connections represent parameter sharing.
      - Excerpt from the paper: "... discriminative one-shot learning is to find a mechanism to incorporate domain-specific information in the learner, i.e. learning to learn. Another challenge, which is of practical importance in applications of one-shot learning, is to avoid a lengthy optimization process such as eq. (1). We propose to address both challenges by learning the parameters W of the predictor from a single exemplar z using a meta-prediction process, i.e. a non-iterative feed-forward function ω that maps (z; W') to W. Since in practice this function will be implemented using a deep neural network, we call it a learnet. The learnet depends on the exemplar z, which is a single representative of the class of interest, and contains parameters W' of its own. Learning to learn can now be posed as the problem of optimizing the learnet meta-parameters W' using an objective function defined below. Furthermore, the feed-forward learnet evaluation is much faster than solving the optimization problem (1). In order to train the learnet, we require the latter to produce good predictors given any possible exemplar z, which is empirically evaluated as an average over n training samples z_i: ..."
  • 13. Related Work (Attention Mechanism)
      - Sequence to Sequence with Attention [Bahdanau+, 2014] (published as a conference paper at ICLR 2015)
          - Attend to the source-sentence words that are relevant to generating the next target word
      - Figure 1 (from the paper): The graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T).
      - Excerpt from the paper: the decoder state is $s_i = f(s_{i-1}, y_{i-1}, c_i)$. Unlike the existing encoder–decoder approach, the probability is conditioned on a distinct context vector $c_i$ for each target word $y_i$. The context vector depends on a sequence of annotations $(h_1, \cdots, h_{T_x})$ to which an encoder maps the input sentence. Each annotation $h_i$ contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. The context vector is computed as a weighted sum of these annotations:
          - $c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$  (5)
          - $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$  (6)
          - $e_{ij} = a(s_{i-1}, h_j)$ is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state $s_{i-1}$ (just before emitting $y_i$) and the j-th annotation $h_j$ of the input sentence.
          - The alignment model $a$ is parametrized as a feedforward neural network which is jointly trained with all the other components of the proposed system.
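As a small numerical illustration of equations (5) and (6), the sketch below turns alignment scores into attention weights and a context vector. It assumes the scores e_ij are already given as plain numbers; the learned alignment network a is not modeled, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1 (equation (6))."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(scores, annotations):
    """Weighted sum of annotations h_j with weights alpha_ij (equation (5))."""
    alphas = softmax(scores)
    dim = len(annotations[0])
    context = [sum(a * h[d] for a, h in zip(alphas, annotations))
               for d in range(dim)]
    return alphas, context

# Three source positions with 2-dimensional annotations and made-up scores.
alphas, context = context_vector([2.0, 0.0, -1.0],
                                 [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The position with the highest score receives the largest weight, so its annotation dominates the context vector.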
  • 14. Related Work (Attention Mechanism)
      - Pointer Networks [Vinyals+, 2015]
          - Generate the output sequence using a distribution over the dictionary of inputs
      - Figure 1 (from the paper): (a) Sequence-to-Sequence: an RNN (blue) processes the input sequence to create a code vector that is used to generate the output sequence (purple) using the probability chain rule and another RNN. The output dimensionality is fixed by the dimensionality of the problem and it is the same during training and inference. (b) Ptr-Net: an encoding RNN converts the input sequence to a code (blue) that is fed to the generating network (purple). At each step, the generating network produces a vector that modulates a content-based attention mechanism over inputs. The output of the attention mechanism is a softmax distribution with dictionary size equal to the length of the input.
      - Excerpt from the paper (Section 2.3, Ptr-Net): the sequence-to-sequence model uses a softmax distribution over a fixed-size output dictionary to compute $p(C_i \mid C_1, \ldots, C_{i-1}, P)$, so it cannot be used for problems where the size of the output dictionary is equal to the length of the input sequence. Ptr-Net instead models this probability with the attention mechanism:
          - $u^i_j = v^T \tanh(W_1 e_j + W_2 d_i), \quad j \in (1, \ldots, n)$
          - $p(C_i \mid C_1, \ldots, C_{i-1}, P) = \mathrm{softmax}(u^i)$
          - Here softmax normalizes the vector $u^i$ (of length n) to be an output distribution over the dictionary of inputs, and $v$, $W_1$, and $W_2$ are learnable parameters of the output model.
          - The approach specifically targets problems whose outputs are discrete and correspond to positions in the input.
          - In the training data, the inputs are planar point sets $P = \{P_1, \ldots, P_n\}$ with $P_j = (x_j, y_j)$, over which the convex hull, the Delaunay triangulation, or the solution to the corresponding Travelling Salesman Problem is found.
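The pointer distribution above can be sketched with plain Python lists. This is a toy illustration, not the paper's implementation: the encoder states e_j, the decoder state d_i, and the parameters W1, W2, v are hand-picked numbers here, whereas in Ptr-Net they come from trained RNNs and learned weights. The helper names are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def pointer_distribution(encoder_states, decoder_state, W1, W2, v):
    """Return softmax(u^i), where u^i_j = v^T tanh(W1 e_j + W2 d_i)."""
    w2d = matvec(W2, decoder_state)
    scores = []
    for e_j in encoder_states:
        w1e = matvec(W1, e_j)
        hidden = [math.tanh(a + b) for a, b in zip(w1e, w2d)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

# Three input elements, so the output "dictionary" has exactly three entries.
encoder_states = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
probs = pointer_distribution(encoder_states, [0.0, 0.0], identity, zeros, [1.0, 0.0])
```

The length of `probs` always equals the number of input elements, which is the property that lets the output dictionary size vary with the input.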
  • 15. Related Work (Attention Mechanism)
      - Sequence to Sequence for Sets [Vinyals+, ICLR2016]
          - Handle input sets using an extension of the seq2seq framework: the Read-Process-and-Write model
      - Figure 1 (from the paper): The Read-Process-and-Write model.
      - Excerpt from the paper: since the authors are interested in associative memories, they employ "content"-based attention. This has the property that the vector retrieved from memory would not change if the memory were randomly shuffled, which is crucial for proper treatment of the input set X as such. The process block based on an attention mechanism uses the following:
          - $q_t = \mathrm{LSTM}(q^*_{t-1})$  (3)
          - $e_{i,t} = f(m_i, q_t)$  (4)
          - $a_{i,t} = \frac{\exp(e_{i,t})}{\sum_j \exp(e_{j,t})}$  (5)
          - $r_t = \sum_i a_{i,t} m_i$  (6)
          - $q^*_t = [q_t \; r_t]$  (7)
          - Here i indexes through each memory vector $m_i$ (typically equal to the cardinality of X), $q_t$ is a query vector which allows us to read $r_t$ from the memories, $f$ is a function that computes a single scalar from $m_i$ and $q_t$ (e.g., a dot product), and LSTM is an LSTM which computes a recurrent state but which takes no inputs. $q^*_t$ is the state which this LSTM evolves, and is formed by concatenating the query $q_t$ with the resulting attention readout $r_t$.
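The shuffle-invariance property of the content-based readout (equations (4) to (6)) can be checked with a short sketch. Only the readout step is shown, with f taken to be a dot product as the excerpt suggests; the LSTM that produces the query is omitted and the query is simply given, which is an assumption of this example.

```python
import math

def attention_readout(memories, query):
    """Compute r = sum_i a_i m_i with a = softmax over dot-product scores."""
    scores = [sum(m_d * q_d for m_d, q_d in zip(m, query)) for m in memories]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memories[0])
    return [sum(w * m[d] for w, m in zip(weights, memories)) for d in range(dim)]

memories = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
query = [1.0, 0.5]
readout = attention_readout(memories, query)
# Reading from the same memories in a different order gives the same vector
# (up to floating-point rounding), so the input is treated as a set.
readout_shuffled = attention_readout(list(reversed(memories)), query)
```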
  • 16. Matching Networks [Vinyals+, NIPS2016]
      - Motivation
          - One-shot learning needs to learn rapidly from new examples while retaining its ability on common examples
              - Simple parametric models such as deep classifiers need to be optimized to handle new examples
              - Non-parametric models such as k-nearest neighbor don't require optimization, but their performance depends on the chosen metric
          - It could be efficient to train an end-to-end nearest-neighbor-based classifier
  • 17. Matching Networks [Vinyals+, NIPS2016]
      - Train a classifier through one-shot learning
      - [Figure: T, the training task (dog, frog, horse, ship, truck), and T', the testing task (airplane, automobile, bird, cat, deer). Images: https://www.cs.toronto.edu/~kriz/cifar.html]
          - L: label set (here dog, horse, ship), obtained by sampling N labels from T
          - S: support set, obtained by sampling k examples from L
          - B: batch, obtained by sampling b examples from L
  • 18. Matching Networks [Vinyals+, NIPS2016]
      - System Overview
          - The embedding functions f, g are parameterized as a simple CNN (e.g. VGG or Inception) or as a fully conditional embedding function, described later
      - Figure 1: Matching Networks architecture. [Diagram labels: the query x̂ is embedded by f (or f(x̂, S)); each support set (S) example x_i is embedded by g(x_i); the attention a weights the labels y_i, which are summed to give P(ŷ | x̂, S)]
      - Excerpt from the paper (Section 2, Model): the model in its simplest form computes a probability over ŷ as follows:
          - $P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i) \, y_i$  (1)
          - Here $x_i, y_i$ are the inputs and corresponding label distributions from the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$, and $a$ is an attention mechanism. Equation (1) essentially describes the output for a new class as a linear combination of the labels in the support set.
          - Where the attention mechanism $a$ is a kernel on $X \times X$, (1) is akin to a kernel density estimator. Where the attention mechanism is zero for the b furthest $x_i$ from $\hat{x}$ according to some distance metric and an appropriate constant otherwise, (1) is equivalent to "k − b"-nearest neighbours. Thus (1) subsumes both KDE and kNN methods.
          - Another view of (1) is that $a$ acts as an attention mechanism and the $y_i$ act as values bound to the corresponding keys $x_i$, which can be understood as a particular kind of associative memory.
          - The attention kernel (Section 2.1.1) in its simplest form is a softmax over the cosine distance c: $a(\hat{x}, x_i) = e^{c(f(\hat{x}), g(x_i))} / \sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}$, where f and g are appropriate neural networks (potentially with f = g) that embed $\hat{x}$ and $x_i$.
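Equation (1) and the softmax-over-cosine attention kernel can be put together in a minimal sketch. Several things here are assumptions for illustration: the embeddings f(x̂) and g(x_i) are taken as precomputed vectors rather than CNN outputs, labels are integer class indices standing in for one-hot label distributions, c is computed as cosine similarity, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def matching_predict(query_emb, support_embs, support_labels, n_classes):
    """P(y | x_hat, S) = sum_i a(x_hat, x_i) y_i with a softmax-over-cosine kernel."""
    sims = [cosine(query_emb, g) for g in support_embs]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    attention = [e / total for e in exps]
    probs = [0.0] * n_classes
    for a, y in zip(attention, support_labels):
        probs[y] += a  # adding a * one_hot(y) only touches entry y
    return probs

def cross_entropy(probs, target):
    """Negative log-probability of the true class, the per-query training loss."""
    return -math.log(probs[target])

# 3-way 1-shot toy episode with 2-dimensional embeddings.
support_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
support_labels = [0, 1, 2]
probs = matching_predict([0.9, 0.1], support_embs, support_labels, n_classes=3)
loss = cross_entropy(probs, target=0)
```

Because the prediction is a differentiable function of the embeddings, a loss like `cross_entropy` could be backpropagated through f and g in an autodiff framework; that training loop is not shown here.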
  • 19. Matching Networks [Vinyals+, NIPS2016]
    - The Attention Kernel
      ⁃ Calculate the softmax over the cosine distance $c$ between $f(\hat{x}, S)$ and $g(x_i)$
        • Similar to a nearest neighbor calculation
      ⁃ Train the network using cross entropy loss

    [Figure 1: Matching Networks architecture. Labels: query $\hat{x}$, $f(\hat{x}, S)$, support set $S$ with $(x_i, y_i)$, $g(x_i)$; $c$: cosine distance.]

    Paper excerpt (Section 2.1.1, The Attention Kernel):

    Equation 1 relies on choosing $a(\cdot, \cdot)$, the attention mechanism, which fully specifies the classifier. The simplest form that this takes (and which has very tight relationships with common attention models and kernel functions) is to use the softmax over the cosine distance $c$, i.e.,

    $$a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}),\, g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}),\, g(x_j))}}$$

    with embedding functions $f$ and $g$ being appropriate neural networks (potentially with $f = g$) to embed $\hat{x}$ and $x_i$. In our experiments we shall see examples where $f$ and $g$ are parameterised variously as deep convolutional networks for image tasks (as in VGG[22] or Inception[24]) or a simple form word embedding for language tasks (see Section 4).

    We note that, though related to metric learning, the classifier defined by Equation 1 is discriminative. For a given support set $S$ and sample to classify $\hat{x}$, it is enough for $\hat{x}$ to be sufficiently aligned with pairs $(x', y') \in S$ such that $y' = y$ and misaligned with the rest. This kind of loss is also related to methods such as Neighborhood Component Analysis (NCA), triplet loss or large margin nearest neighbor [28]. However, the objective that we are trying to optimize is precisely aligned with multi-way, one-shot classification, and thus we expect it to perform better than its counterparts.

    Paper excerpt (Section 2.2, Training Strategy):

    In the previous subsection we described Matching Networks which map a support set to a classification function, $S \to c(\hat{x})$. We achieve this via a modification of the set-to-set paradigm augmented with attention, with the resulting mapping being of the form $P_\theta(\cdot \mid \hat{x}, S)$, noting that $\theta$ are the parameters of the model (i.e. of the embedding functions $f$ and $g$ described previously). The training procedure has to be chosen carefully so as to match inference at test time. Our model has to perform well with support sets $S'$ which contain classes never seen during training.

    More specifically, let us define a task $T$ as distribution over possible label sets $L$. Typically we consider $T$ to uniformly weight all data sets of up to a few unique classes (e.g., 5), with a few examples per class (e.g., up to 5). In this case, a label set $L$ sampled from a task $T$, $L \sim T$, will typically have 5 to 25 examples.

    To form an "episode" to compute gradients and update our model, we first sample $L$ from $T$ (e.g., $L$ could be the label set {cats, dogs}). We then use $L$ to sample the support set $S$ and a batch $B$ (i.e., both $S$ and $B$ are labelled examples of cats and dogs). The Matching Net is then trained to minimise the error predicting the labels in the batch $B$ conditioned on the support set $S$. This is a form of meta-learning since the training procedure explicitly learns to learn from a given support set to minimise a loss over a batch. More precisely, the Matching Nets training objective is as follows:

    $$\theta = \arg\max_{\theta}\; E_{L \sim T}\left[ E_{S \sim L,\, B \sim L}\left[ \sum_{(x, y) \in B} \log P_\theta(y \mid x, S) \right] \right] \qquad (6)$$

    Training $\theta$ with eq. 6 yields a model which works well when sampling $S' \sim T'$ from a different distribution of novel labels.
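The attention kernel and the cross entropy loss on this slide can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the embeddings are taken as given vectors (standing in for the outputs of $f$ and $g$), the score $c$ is implemented as cosine similarity (larger means closer, which is how the softmax uses it), and the names `cosine_scores`, `attention_kernel` and `episode_loss` are hypothetical.

```python
import numpy as np

def cosine_scores(query, support, eps=1e-8):
    """c(f(x_hat), g(x_i)) for one query vector against k support vectors."""
    q = query / (np.linalg.norm(query) + eps)
    s = support / (np.linalg.norm(support, axis=1, keepdims=True) + eps)
    return s @ q                                    # shape (k,)

def attention_kernel(query, support):
    """Softmax over the cosine scores, so the k weights sum to 1."""
    scores = cosine_scores(query, support)
    weights = np.exp(scores - scores.max())         # shift for numerical stability
    return weights / weights.sum()

def episode_loss(query, support, support_labels, target, num_classes):
    """Cross entropy -log P(target | query, S) using eq. (1) for P."""
    a = attention_kernel(query, support)
    probs = a @ np.eye(num_classes)[support_labels]
    return -np.log(probs[target] + 1e-12)

# Hypothetical 2-D embeddings for a 3-way, 1-shot support set.
support = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
support_labels = np.array([0, 1, 2])
query = np.array([0.9, 0.1])                        # closest to the class-0 example

weights = attention_kernel(query, support)
print(weights, episode_loss(query, support, support_labels, 0, 3))
```

In training, a loss of this form would be averaged over the batch $B$ of each sampled episode and minimised with respect to the embedding parameters; here the embeddings are fixed, so only the forward computation is shown.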
  • 20. Matching Networks [Vinyals+, NIPS2016]
    - The Fully Conditional Embedding g
      ⁃ Embed $x_i$ in consideration of $S$

    [Figure 1: Matching Networks architecture, annotated: each support example $x_i$ passes through $g'$ and then a bidirectional LSTM; the LSTM outputs and $g'(x_i)$ are summed (+) to give $g(x_i, S)$.]

    $g'$: neural network (e.g., VGG or Inception)

    Paper excerpt (Appendix A):

    $$a(h_{k-1}, g(x_i)) = \mathrm{softmax}\left(h_{k-1}^{T}\, g(x_i)\right)$$

    noting that LSTM$(x, h, c)$ follows the same LSTM implementation defined in [23] with $x$ the input, $h$ the output (i.e., cell after the output gate), and $c$ the cell. $a$ is commonly referred to as "content" based attention, and the softmax in eq. 6 normalizes w.r.t. $g(x_i)$. The read-out $r_{k-1}$ from $g(S)$ is concatenated to $h_{k-1}$. Since we do $K$ steps of "reads", attLSTM$(f'(\hat{x}), g(S), K) = h_K$ where $h_k$ is as described in eq. 3.

    A.2 The Fully Conditional Embedding g

    In section 2.1.2 we described the encoding function for the elements in the support set $S$, $g(x_i, S)$, as a bidirectional LSTM. More precisely, let $g'(x_i)$ be a neural network (similar to $f'$ above, e.g. a VGG or Inception model). Then we define $g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)$ with:

    $$\overrightarrow{h}_i, \overrightarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})$$
    $$\overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})$$

    where, as in above, LSTM$(x, h, c)$ follows the same LSTM implementation defined in [23] with $x$ the input, $h$ the output (i.e., cell after the output gate), and $c$ the cell. Note that the recursion for $\overleftarrow{h}$ starts from $i = |S|$. As in eq. 3, we add a skip connection between input and outputs.
  • 21. Matching Networks [Vinyals+, NIPS2016]
    - The Fully Conditional Embedding g
      ⁃ Embed $x_i$ in consideration of $S$

    [Figure 1: Matching Networks architecture, annotated: each support example $x_i$ passes through $g'$.]

    Step 1: Embed $x_i$ into a vector using $g'$ ($g'$: neural network such as VGG or Inception), giving $g'(x_i)$.
  • 22. Matching Networks [Vinyals+, NIPS2016]
    - The Fully Conditional Embedding g
      ⁃ Embed $x_i$ in consideration of $S$

    [Figure 1: Matching Networks architecture, annotated: each $g'(x_i)$ is fed into a forward and a backward LSTM.]

    Step 2: Feed $g'(x_i)$ into a Bi-LSTM ($g'$: neural network such as VGG or Inception):

    $$\overrightarrow{h}_i, \overrightarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})$$
    $$\overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})$$
  • 23. Matching Networks [Vinyals+, NIPS2016]
    - The Fully Conditional Embedding g
      ⁃ Embed $x_i$ in consideration of $S$

    [Figure 1: Matching Networks architecture, annotated: the Bi-LSTM outputs and $g'(x_i)$ are summed (+) to give $g(x_i, S)$.]

    Step 3: Let $g(x_i, S)$ be the sum of $g'(x_i)$ and the outputs of the Bi-LSTM:

    $$g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)$$
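The three steps for $g$ (embed with $g'$, run a Bi-LSTM over the support set, add the skip connection) can be sketched as follows. This is a rough NumPy illustration, not the paper's code: the LSTM cell is a plain textbook cell with hypothetical weight matrices `W_fwd` and `W_bwd`, the initial states are assumed to be zero, and `g_prime` is assumed to already hold the outputs of $g'$ for each support example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step; W maps the concatenation [x, h] to four gate pre-activations."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def fully_conditional_g(g_prime, W_fwd, b_fwd, W_bwd, b_bwd):
    """g(x_i, S) = h_fwd_i + h_bwd_i + g'(x_i) for every support item i."""
    n, d = g_prime.shape
    h_fwd, h_bwd = np.zeros((n, d)), np.zeros((n, d))
    h, c = np.zeros(d), np.zeros(d)            # assumed zero initial state
    for i in range(n):                         # forward pass over the support set
        h, c = lstm_step(g_prime[i], h, c, W_fwd, b_fwd)
        h_fwd[i] = h
    h, c = np.zeros(d), np.zeros(d)
    for i in reversed(range(n)):               # backward pass starts from i = |S|
        h, c = lstm_step(g_prime[i], h, c, W_bwd, b_bwd)
        h_bwd[i] = h
    return h_fwd + h_bwd + g_prime             # skip connection adds g'(x_i)

# Hypothetical sizes: 5 support examples with 4-dimensional g' embeddings.
rng = np.random.default_rng(0)
n, d = 5, 4
g_prime = rng.normal(size=(n, d))
W_fwd, W_bwd = (0.5 * rng.normal(size=(4 * d, 2 * d)) for _ in range(2))
b_fwd, b_bwd = np.zeros(4 * d), np.zeros(4 * d)

g_S = fully_conditional_g(g_prime, W_fwd, b_fwd, W_bwd, b_bwd)
print(g_S.shape)
```

The point of the construction is that the embedding of one support example depends on the others: changing any $x_j$ changes $g(x_i, S)$ through the forward or backward LSTM state, while the skip connection keeps the original $g'(x_i)$ in the sum.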
  • 24. Matching Networks [Vinyals+, NIPS2016]
    - The Fully Conditional Embedding f
      ⁃ Embed $\hat{x}$ in consideration of $S$

    [Figure 1: Matching Networks architecture, annotated: the query $\hat{x}$ passes through $f'$ and an LSTM; at each step the read-out $r_{k-1}$ is the weighted sum of $a(h_{k-1}, g(x_i))\, g(x_i)$ over the support set; $h_k = \hat{h}_k + f'(\hat{x})$; the output is $f(\hat{x}, S) = h_K$.]

    Paper excerpt (Section 2.1.2):

    ... so, we define the following recurrence over "processing" steps $k$, following work from [26]:

    $$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1}) \qquad (2)$$
    $$h_k = \hat{h}_k + f'(\hat{x}) \qquad (3)$$
    $$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i) \qquad (4)$$
    $$a(h_{k-1}, g(x_i)) = \frac{e^{h_{k-1}^{T} g(x_i)}}{\sum_{j=1}^{|S|} e^{h_{k-1}^{T} g(x_j)}} \qquad (5)$$

    noting that LSTM$(x, h, c)$ follows the same LSTM implementation defined in [23] with $x$ the input, $h$ the output (i.e., cell after the output gate), and $c$ the cell. We do $K$ steps of "reads", so $f(\hat{x}, S) = h_K$ where $h_k$ is as described in eq. 3.
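The recurrence in eqs. (2) to (5) can be sketched as a short loop. The following NumPy code is an illustration only: it assumes $h_0$, $r_0$ and $c_0$ are all zero (the excerpt does not state the initialisation), uses a plain textbook LSTM cell with a hypothetical weight matrix `W`, and takes `f_prime` and `g_support` as precomputed embeddings. If $r_0$ were instead computed from eq. (4) with $h_0 = 0$, the first read would be a uniform average of the support embeddings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_step(x, hidden, c, W, b):
    """One LSTM step; `hidden` is the concatenation [h_{k-1}, r_{k-1}]."""
    z = W @ np.concatenate([x, hidden]) + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    return sigmoid(o) * np.tanh(c_new), c_new

def fully_conditional_f(f_prime, g_support, W, b, K):
    """Run K read steps and return h_K = f(x_hat, S)."""
    d = f_prime.shape[0]
    h, r, c = np.zeros(d), np.zeros(d), np.zeros(d)    # assumed zero initial state
    for _ in range(K):
        h_hat, c = lstm_step(f_prime, np.concatenate([h, r]), c, W, b)  # eq. (2)
        h = h_hat + f_prime                                              # eq. (3)
        a = softmax(g_support @ h)                                       # eq. (5)
        r = a @ g_support                                                # eq. (4)
    return h

# Hypothetical sizes: 4-dimensional embeddings, 5 support examples, K = 3 reads.
rng = np.random.default_rng(1)
d, n = 4, 5
f_prime = rng.normal(size=d)
g_support = rng.normal(size=(n, d))
W, b = 0.5 * rng.normal(size=(4 * d, 3 * d)), np.zeros(4 * d)

f_S = fully_conditional_f(f_prime, g_support, W, b, K=3)
print(f_S.shape)
```

Each read lets the query embedding attend to the support set and fold what it reads back into the next LSTM step, so after more than one step the same query is embedded differently for different support sets.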
  • 25. Matching Networks [Vinyals+, NIPS2016]
    - The Fully Conditional Embedding f
      ⁃ Embed $\hat{x}$ in consideration of $S$

    [Figure 1: Matching Networks architecture, annotated: the query $\hat{x}$ passes through $f'$ and the LSTM, producing $\hat{h}_1$ and then $h_1$ via the skip connection (+); the support embeddings $g(x_i)$ are not yet read.]

    $h_1$ is calculated without using $S$, following eqs. (2) and (3) of the paper's recurrence with $k = 1$:

    $$h_1 = \mathrm{LSTM}(f'(\hat{x}), [h_0, r_0], c_0) + f'(\hat{x})$$
  • 26. Matching Networks [Vinyals+, NIPS2016]
    - The Fully Conditional Embedding f
      - Embed $\hat{x}$ in consideration of S
    - Calculate the relevance between $h_1$ and each $g(x_i)$ with a softmax:
      $a(h_1, g(x_i)) = \mathrm{softmax}(h_1^T g(x_i)) = e^{h_1^T g(x_i)} / \sum_{j=1}^{|S|} e^{h_1^T g(x_j)}$
  • 27. Matching Networks [Vinyals+, NIPS2016]
    - The Fully Conditional Embedding f
      - Embed $\hat{x}$ in consideration of S
    - $r_1$ is a sum of the $g(x_i)$, weighted according to their relevance to $h_1$:
      $r_1 = \sum_{i=1}^{|S|} a(h_1, g(x_i))\, g(x_i)$
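The softmax relevance and the weighted sum $r_1$ from slides 26 and 27 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the names `attention_read`, `G` (a matrix whose rows are the support embeddings $g(x_i)$) and the random toy data are hypothetical and do not come from the paper or the slides.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attention_read(h, G):
    """Content-based read over the support set.

    h: (d,) current query state, e.g. h_1.
    G: (|S|, d) matrix whose i-th row is g(x_i) (assumed precomputed).
    Returns the attention weights a and the read vector r.
    """
    a = softmax(G @ h)  # Eq. (5): softmax over the dot products h^T g(x_i)
    r = a @ G           # Eq. (4): sum_i a_i * g(x_i)
    return a, r

# Toy data (assumption): 5 support embeddings of dimension 8.
rng = np.random.default_rng(0)
G = rng.normal(size=(5, 8))
h1 = rng.normal(size=8)
a, r1 = attention_read(h1, G)
```

The weights `a` sum to one, so `r1` is a convex combination of the support embeddings and always has the same dimension as a single $g(x_i)$.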
  • 28. Matching Networks [Vinyals+, NIPS2016]
    - The Fully Conditional Embedding f
      - Embed $\hat{x}$ in consideration of S
    - The weighted sum $r_1$ is fed back into the LSTM together with $h_1$
    - $h_2$ is calculated using S (through $r_1$):
      $\hat{h}_2, c_2 = \mathrm{LSTM}(f'(\hat{x}), [h_1, r_1], c_1)$, $h_2 = \hat{h}_2 + f'(\hat{x})$
  • 29. Matching Networks [Vinyals+, NIPS2016]
    - The Fully Conditional Embedding f
      - Embed $\hat{x}$ in consideration of S
    - General step k: the weighted sum $r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i)$ is fed to the LSTM together with $h_{k-1}$, giving $\hat{h}_k$ and $h_k = \hat{h}_k + f'(\hat{x})$
    - Let $f(\hat{x}, S)$ be the output after K steps: $f(\hat{x}, S) = h_K$
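The K-step recurrence of Eqs. (2) to (5) can be written as a short loop. The sketch below is a hedged illustration, not the authors' implementation: the hand-written `lstm_cell`, the weight layout `W` of shape (4d, 3d), and the zero initial values for $h_0$, $r_0$, $c_0$ are all assumptions chosen so that $h_1$ does not depend on S, as the slides describe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def lstm_cell(x, hidden, c, W, b):
    """Minimal LSTM step (assumed layout: gates stacked as i, f, o, g).

    x: (d,) input f'(x_hat); hidden: (2d,) concatenation [h, r]; c: (d,).
    W: (4d, 3d), b: (4d,).
    """
    d = c.shape[0]
    z = W @ np.concatenate([x, hidden]) + b
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
    g = np.tanh(z[3 * d:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def fce_embed(fx, G, W, b, K):
    """Return h_K, i.e. f(x_hat, S) after K processing steps.

    fx: (d,) embedding f'(x_hat); G: (|S|, d) rows g(x_i).
    """
    d = fx.shape[0]
    h, r, c = np.zeros(d), np.zeros(d), np.zeros(d)  # assumed h_0, r_0, c_0
    for _ in range(K):
        h_hat, c = lstm_cell(fx, np.concatenate([h, r]), c, W, b)  # Eq. (2)
        h = h_hat + fx                                             # Eq. (3)
        r = softmax(G @ h) @ G                                     # Eqs. (4), (5)
    return h

# Toy dimensions and random weights (assumptions for illustration only).
rng = np.random.default_rng(0)
d = 4
W = 0.1 * rng.normal(size=(4 * d, 3 * d))
b = np.zeros(4 * d)
fx = rng.normal(size=d)
G_a = rng.normal(size=(3, d))
G_b = rng.normal(size=(3, d))
```

With these initial values, `fce_embed(fx, G_a, W, b, 1)` and `fce_embed(fx, G_b, W, b, 1)` coincide, while for `K=2` the result changes with the support set, mirroring the step-by-step walk-through on slides 25 to 28.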
  • 30. Experimental Settings
    - Datasets
      - Image classification sets
        - Omniglot [Lake+, 2011]
        - ImageNet [Deng+, 2009] (ref. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/)
      - Language modeling
        - Penn Treebank [Marcus+, 1993]
    - Excerpt from the paper, Section 4.1.3 "One-Shot Language Modeling":
      We also introduce a new one-shot language task which is analogous to those examined for images. The task is as follows: given a query sentence with a missing word in it, and a support set of sentences which each have a missing word and a corresponding 1-hot label, choose the label from the support set that best matches the query sentence. Here we show a single example, though note that the words on the right are not provided and the labels for the set are given as 1-hot-of-5 vectors.
      1. an experimental vaccine can alter the immune response of people infected with the aids virus a <blank_token> u.s. scientist said.    prominent
      2. the show one of five new nbc <blank_token> is the second casualty of the three networks so far this fall.    series
      3. however since eastern first filed for chapter N protection march N it has consistently promised to pay creditors N cents on the <blank_token>.    dollar
      4. we had a lot of people who threw in the <blank_token> today said <unk> ellis a partner in benjamin jacobson & sons a specialist in trading ual stock on the big board.    towel
      5. it's not easy to roll out something that <blank_token> and make it pay mr. jacob says.    comprehensive
      Query: in late new york trading yesterday the <blank_token> was quoted at N marks down from N marks late friday and at N yen down from N yen late friday.    dollar
      Sentences were taken from the Penn Treebank dataset [15]. On each trial, we make sure that the set and batch are populated with sentences that are non-overlapping. This means that we do not use words with very low frequency counts; e.g. if there is only a single sentence for a given word we do not use this data since the sentence would need to be in both the set and the batch. As with the image tasks, each trial consisted of a 5 way choice between the classes available in the set. We used a batch size of 20 throughout the sentence matching task and varied the set size across k=1,2,3. We ensured that the same number of sentences were available for each class in the set. We split the words into a randomly sampled 9000 for training and 1000 for testing, and we used the standard test set to report results. Thus, neither the words nor the sentences used during test time had been seen during training. We compared our one-shot matching model to an oracle LSTM language model (LSTM-LM) [30]
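The construction of one language-modeling trial described above (mask one word per sentence, give the support labels as 1-hot vectors) might look like the following sketch. It covers only the k=1 case with one sentence per class, and the helper names `mask` and `build_episode` as well as the whitespace tokenization are assumptions made for illustration; the paper does not specify its preprocessing code.

```python
import numpy as np

BLANK = "<blank_token>"

def mask(sentence, word):
    """Replace the first occurrence of `word` with the blank token."""
    tokens = sentence.split()
    i = tokens.index(word)  # raises ValueError if the word is absent
    return " ".join(tokens[:i] + [BLANK] + tokens[i + 1:])

def build_episode(support, query):
    """Build one k=1 trial.

    support: list of (sentence, word) pairs with distinct words (the classes).
    query:   (sentence, word) whose word is one of the support words.
    Returns masked support sentences, their 1-hot labels,
    the masked query, and the index of the correct class.
    """
    words = [w for _, w in support]
    xs = [mask(s, w) for s, w in support]
    ys = np.eye(len(words), dtype=int)  # row i is the 1-hot label of class i
    q_sentence, q_word = query
    return xs, ys, mask(q_sentence, q_word), words.index(q_word)

# Shortened sentences in the spirit of the slide's example (illustrative only).
support = [
    ("we had a lot of people who threw in the towel today", "towel"),
    ("it has promised to pay creditors N cents on the dollar", "dollar"),
]
query = ("in late trading yesterday the dollar was quoted at N marks", "dollar")
xs, ys, q, target = build_episode(support, query)
```

A full 5-way trial would pass five support pairs, giving the 1-hot-of-5 labels mentioned in the excerpt.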
  • 31. Experimental Settings (Omniglot)
    - Baseline
      - Matching on raw pixels
      - Matching on discriminative features from VGG (Baseline classifier)
      - MANN
      - Convolutional Siamese Network
    - Datasets
      - training: 1200 characters
      - testing: 423 characters
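The split into disjoint training and testing characters, and the sampling of an N-way k-shot trial from one side of that split, can be sketched as follows. The function names, the seed handling, and the toy dataset (20 placeholder items per class) are assumptions for illustration; only the 1200/423 class counts come from the slide.

```python
import random

def split_classes(classes, n_train, seed=0):
    """Shuffle the class list and split it into disjoint train/test classes."""
    rng = random.Random(seed)
    shuffled = list(classes)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def sample_episode(data, classes, n_way, k_shot, n_query, rng):
    """Sample one N-way k-shot trial.

    data: dict mapping class -> list of examples.
    Returns (support, queries) as lists of (example, episode_label) pairs,
    where episode_label is in 0..n_way-1 and is local to this trial.
    """
    chosen = rng.sample(classes, n_way)
    support, queries = [], []
    for label, cls in enumerate(chosen):
        items = rng.sample(data[cls], k_shot + n_query)  # no overlap between set and batch
        support += [(x, label) for x in items[:k_shot]]
        queries += [(x, label) for x in items[k_shot:]]
    return support, queries

# 1200 + 423 = 1623 classes, as on the slide; the examples are placeholders.
classes = list(range(1623))
train_classes, test_classes = split_classes(classes, 1200)
data = {c: [f"char{c}_img{j}" for j in range(20)] for c in classes}
rng = random.Random(1)
support, queries = sample_episode(data, test_classes, n_way=5, k_shot=1, n_query=1, rng=rng)
```

Sampling test trials only from `test_classes` ensures that the characters used at test time were never seen during training.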
  • 32. Experimental Results (Omniglot)
    - Fully Conditional Embedding (FCE) did not seem to help much
    - Baseline and Siamese Net were improved with fine-tuning
    - Excerpt from the paper: [...] took this network and used the features from the last layer (before the softmax) for nearest neighbour matching, a strategy commonly used in computer vision [3] which has achieved excellent results across many tasks. Following [11], the convolutional siamese nets were trained on a same-or-different task of the original training data set and then the last layer was used for nearest neighbour matching.
    - Table 1: Results on the Omniglot dataset.
      Model                          | Matching Fn | Fine Tune | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot
      PIXELS                         | Cosine      | N         | 41.7%        | 63.2%        | 26.7%         | 42.6%
      BASELINE CLASSIFIER            | Cosine      | N         | 80.0%        | 95.0%        | 69.5%         | 89.1%
      BASELINE CLASSIFIER            | Cosine      | Y         | 82.3%        | 98.4%        | 70.6%         | 92.0%
      BASELINE CLASSIFIER            | Softmax     | Y         | 86.0%        | 97.6%        | 72.9%         | 92.3%
      MANN (NO CONV) [21]            | Cosine      | N         | 82.8%        | 94.9%        | -             | -
      CONVOLUTIONAL SIAMESE NET [11] | Cosine      | N         | 96.7%        | 98.4%        | 88.0%         | 96.5%
      CONVOLUTIONAL SIAMESE NET [11] | Cosine      | Y         | 97.3%        | 98.4%        | 88.1%         | 97.0%
      MATCHING NETS (OURS)           | Cosine      | N         | 98.1%        | 98.9%        | 93.8%         | 98.5%
      MATCHING NETS (OURS)           | Cosine      | Y         | 97.9%        | 98.7%        | 93.5%         | 98.7%
  • 33. Experimental Settings (ImageNet)
    - Baseline
      - Matching on raw pixels
      - Matching on discriminative features from InceptionV3 (Baseline classifier)
    - Datasets
      - miniImageNet (size: 84x84)
        - training: 80 classes
        - testing: 20 classes
      - randImageNet
        - training: randomly picked classes (882 classes)
        - testing: remaining classes (118 classes)
      - dogsImageNet
        - training: all non-dog classes (882 classes)
        - testing: dog classes (118 classes)
  • 34. Experimental Results (miniImageNet)
    - Matching Networks outperformed the baselines
    - Fully Conditional Embedding (FCE, called Full Context Embeddings in the paper) proved effective at improving performance on this task
    - Figure 2: Example of two 5-way problem instance on ImageNet. The images in the set S' contain classes never seen during training. Our model makes far less mistakes than the Inception baseline.
    - Table 2: Results on miniImageNet.
      Model                | Matching Fn  | Fine Tune | 5-way 1-shot | 5-way 5-shot
      PIXELS               | Cosine       | N         | 23.0%        | 26.6%
      BASELINE CLASSIFIER  | Cosine       | N         | 36.6%        | 46.0%
      BASELINE CLASSIFIER  | Cosine       | Y         | 36.2%        | 52.2%
      BASELINE CLASSIFIER  | Softmax      | Y         | 38.4%        | 51.2%
      MATCHING NETS (OURS) | Cosine       | N         | 41.2%        | 56.2%
      MATCHING NETS (OURS) | Cosine       | Y         | 42.4%        | 58.0%
      MATCHING NETS (OURS) | Cosine (FCE) | N         | 44.2%        | 57.0%
      MATCHING NETS (OURS) | Cosine (FCE) | Y         | 46.6%        | 60.0%
    - Excerpt from the paper: [...] 1-shot tasks from the training data set, incorporating Full Context Embeddings and our Matching Networks and training strategy. The results of the randImageNet and dogsImageNet experiments are shown in Table 3. The Inception Oracle (trained on all classes) performs almost perfectly when restricted to 5 classes only, which is not too surprising given its impressive top-1 accuracy. When trained solely on $\neq L_{rand}$, Matching Nets improve upon Inception by almost 6% when tested on $L_{rand}$, halving the errors. Figure 2 shows two instances of 5-way one-shot learning, where Inception fails. Looking at all the errors, Inception appears to sometimes prefer an image above all others (these images tend to be cluttered like the example in the second column, or more constant in color). Matching Nets, on the other hand, manage to recover from these outliers that sometimes appear in the support set S'.
  • 35. Experimental Results (randImageNet, dogsImageNet)
    - Matching Networks outperformed the Inception Classifier on $L_{rand}$, but degraded on $L_{dogs}$
    - The decrease in performance on $L_{dogs}$ might be caused by the different distributions of labels between training and testing
      - The training support set comes from a random distribution of labels, whereas the testing one comes from similar classes
    - Table 3: Results on full ImageNet on rand and dogs one-shot tasks (ImageNet 5-way 1-shot accuracy). Note that $\neq L_{rand}$ and $\neq L_{dogs}$ are sets of classes which are seen during training, but are provided for completeness.
      Model                | Matching Fn    | Fine Tune | L_rand | ≠L_rand | L_dogs | ≠L_dogs
      PIXELS               | Cosine         | N         | 42.0%  | 42.8%   | 41.4%  | 43.0%
      INCEPTION CLASSIFIER | Cosine         | N         | 87.6%  | 92.6%   | 59.8%  | 90.0%
      MATCHING NETS (OURS) | Cosine (FCE)   | N         | 93.2%  | 97.0%   | 58.8%  | 96.4%
      INCEPTION ORACLE     | Softmax (Full) | Y (Full)  | ≈ 99%  | ≈ 99%   | ≈ 99%  | ≈ 99%
    - Excerpt from the paper: Matching Nets manage to improve upon Inception on the complementary subset $\neq L_{dogs}$ (although this setup is not one-shot, as the feature extraction has been trained on these labels). [...] much more challenging $L_{dogs}$ subset, our model degrades by 1%. We hypothesize [...] that the sampled set during training, S, comes from a random distribution of labels [...] whereas the testing support set S' from $L_{dogs}$ contains similar classes, more akin [...] classification. Thus, we believe that if we adapted our training strategy to samples S from fine grained sets of labels instead of sampling uniformly from the leafs of the ImageNet class tree, improvements could be attained. We leave this as future work.
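As a worked check of the "almost 6%" and "halving the errors" remarks, the Table 3 accuracies on $L_{rand}$ (87.6% for the Inception classifier, 93.2% for Matching Nets) and on $L_{dogs}$ (59.8% and 58.8%) can be converted into error rates. The helper name `error_reduction` is hypothetical; the numbers are taken from the table on this slide.

```python
def error_reduction(acc_before, acc_after):
    """Return (error before, error after, relative error reduction).

    Accuracies are given in percent; errors are 100 - accuracy.
    """
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return err_before, err_after, 1.0 - err_after / err_before

# L_rand: Inception classifier vs. Matching Nets (Table 3).
err_inception, err_matching, rel = error_reduction(87.6, 93.2)
gain_rand = 93.2 - 87.6   # absolute gain in percentage points
gap_dogs = 58.8 - 59.8    # Matching Nets minus Inception on L_dogs

print(f"L_rand error: {err_inception:.1f}% -> {err_matching:.1f}% "
      f"(relative reduction {rel:.1%}, absolute gain {gain_rand:.1f} points)")
print(f"L_dogs gap: {gap_dogs:.1f} points")
```

The error on $L_{rand}$ drops from 12.4% to 6.8%, a relative reduction of roughly 45%, which is what the excerpt rounds to "halving the errors"; on $L_{dogs}$ the model is 1.0 point below the Inception classifier.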
  • 36. Experimental Settings (Penn Treebank)
    - Model diagram: each support sentence $x_i$ in the support set S and the query $\hat{x}$ are read by LSTMs over the words around the blank, giving $g(x_i)$ and $f(\hat{x}, S)$; the attention a between them weights the labels $y_i$ (c: cosine distance)
    - Setup (from Section 4.1.3 of the paper)
      - Each trial is a 5-way choice between the classes available in the set
      - Batch size of 20; set size varied across k = 1, 2, 3, with the same number of sentences for each class
      - Words split into a randomly sampled 9000 for training and 1000 for testing
    - Excerpt from the paper, Section 2.1:
      Our model in its simplest form computes a probability over $\hat{y}$ as follows:
      $P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$ (1)
      where $x_i, y_i$ are the inputs and corresponding label distributions from the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$, and a is an attention mechanism which we discuss below. Note that eq. 1 essentially describes the output for a new class as a linear combination of the labels in the support set. Where the attention mechanism a is a kernel on $X \times X$, then (1) is akin to a kernel density estimator. Where the attention mechanism is zero for the b furthest $x_i$ from $\hat{x}$ according to some distance metric and an appropriate constant otherwise, then (1) is equivalent to 'k - b'-nearest neighbours (although this requires an extension to the attention mechanism that we describe in Section 2[...]). Thus (1) subsumes both KDE and kNN methods. Another view of (1) is where a acts as an attention mechanism and the $y_i$ act as values bound to the corresponding keys $x_i$, much like a hash table. In this case we can understand this as a particular kind of associative memory where, given [...] we "point" to the corresponding example in the support set, retrieving its label. Hence the [...] form defined by the classifier $c_S(\hat{x})$ is very flexible and can adapt easily to any new support set.
    - Excerpt from the paper, Section 2.1.1 "The Attention Kernel":
      Equation 1 relies on choosing a(., .), the attention mechanism, which fully specifies the classifier. The simplest form that this takes (and which has very tight relationships with [...] attention models and kernel functions) is to use the softmax over the cosine distance c, i.e.,
      $a(\hat{x}, x_i) = e^{c(f(\hat{x}), g(x_i))} / \sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}$
      with embedding functions f and g being appropriate neural networks (potentially with f = g) to embed $\hat{x}$ and $x_i$. In our experiments we [...] examples where f and g are parameterised variously as deep convolutional networks for image tasks (as in VGG[22] or Inception[24]) or a simple form word embedding for language tasks (see Section 4). We note that, though related to metric learning, the classifier defined by Equation 1 is discriminative [...]
    - Excerpt from the paper, Section 4.1.3 (baseline):
      We compared our one-shot matching model to an oracle LSTM language model (LSTM-LM) [30] trained on all the words. In this setup, the LSTM has an unfair advantage as it is not doing one-shot learning but seeing all the data; thus, this should be taken as an upper bound. To do so, we examined a similar setup wherein a sentence was presented to the model with a single word filled in with 5 different possible words (including the correct answer). For each of these 5 sentences the model gave [...]
In this setup, the LSTM has an unfair advantage as it is not doing one-shot learning but seeing all the data ā€“ thus, this should be taken as an upper bound. To do so, we examined a similar setup wherein a sentence was presented to the model with a single word ļ¬lled in with 5 different possible words (including the correct answer). For each of these 5 sentences the model gave a log-likelihood and the max of these was taken to be the choice of the model. nā€Æ Fill in a brank in a query sentence by a label in a support set
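As a rough illustration of Equation (1) with the softmax attention kernel from the excerpt above, the following sketch computes attention weights over a small support set and returns the weighted sum of one-hot labels. It is not the authors' implementation: the 2-d vectors are hypothetical stand-ins for the learned embeddings f(\hat{x}) and g(x_i), the function names are made up for this example, and c is computed as cosine similarity (the quantity the slides call "cosine distance").

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def attention(query_emb, support_embs):
    """Softmax over cosine similarities: a(x_hat, x_i) for each support item."""
    scores = [cosine(query_emb, s) for s in support_embs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(query_emb, support_embs, support_labels):
    """Equation (1): attention-weighted sum of the one-hot support labels."""
    weights = attention(query_emb, support_embs)
    n_classes = len(support_labels[0])
    return [sum(w * y[c] for w, y in zip(weights, support_labels))
            for c in range(n_classes)]

# Hypothetical embeddings; in the paper these come from trained networks f and g.
support_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
support_labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
query_emb = [0.9, 0.1]

probs = predict(query_emb, support_embs, support_labels)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

Because the labels are one-hot and the attention weights sum to one, the output is itself a distribution over the classes present in the support set, which is why swapping in a new support set needs no retraining of the classifier head.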
  • 37. Experimental Settings and Results (Penn Treebank)
    ■ Baseline
      ⁃ Oracle LSTM-LM
        • Trained on all the words (not one-shot)
        • Consider this model as an upper bound
    ■ Datasets
      ⁃ Training: 9000 words
      ⁃ Testing: 1000 words
    ■ Results (5-way accuracy)
      Model             1-shot     2-shot   3-shot
      Matching Nets     32.4%      36.1%    38.2%
      Oracle LSTM-LM    (72.8%)    -        -
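The trial construction described for the Penn Treebank task (a 5-way choice, k sentences per class in the set, and set and batch sentences that never overlap) can be sketched as below. This is only an assumed reading of the procedure: `sample_episode`, the toy corpus, and its sentence strings are hypothetical, and a single query is drawn per trial rather than a batch of 20.

```python
import random

def sample_episode(sentences_by_word, n_way=5, k_shot=1, rng=random):
    """Draw one N-way, k-shot trial with disjoint support and query sentences."""
    # A word needs more than k_shot sentences, otherwise the query sentence
    # would have to be reused in the support set.
    eligible = [w for w, sents in sentences_by_word.items() if len(sents) > k_shot]
    classes = rng.sample(eligible, n_way)
    target = rng.choice(classes)
    support, query = [], None
    for label, word in enumerate(classes):
        # Sampling without replacement keeps support and query sentences distinct.
        picked = rng.sample(sentences_by_word[word], k_shot + 1)
        support.extend((sentence, label) for sentence in picked[:k_shot])
        if word == target:
            query = (picked[k_shot], label)
    return support, query

# Toy corpus: the word in parentheses only identifies the class for this demo.
words = ["dollar", "series", "towel", "prominent", "comprehensive", "market"]
corpus = {w: [f"sentence {i} with a <blank_token> ({w})" for i in range(3)]
          for w in words}
corpus["rare"] = ["the only sentence with a <blank_token> (rare)"]

support, query = sample_episode(corpus, n_way=5, k_shot=2, rng=random.Random(0))
print(len(support), query)
```

The word "rare" has a single sentence, so it is never selected, mirroring the statement that words with very low frequency counts are not used.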
  • 38. Conclusion
    ■ The authors proposed Matching Networks: a nearest-neighbor-based approach trained fully end-to-end
    ■ Key points
      ⁃ "One-shot learning is much easier if you train the network to do one-shot learning" [Vinyals+, 2016]
      ⁃ Matching Networks have a non-parametric structure, so they can assimilate new examples rapidly
    ■ Findings
      ⁃ Matching Networks improved performance on Omniglot, miniImageNet, and randImageNet, but performance degraded on dogsImageNet
      ⁃ One-shot learning with fine-grained sets of labels is difficult to solve, and could therefore be an exciting challenge in this area
  • 39. References
    ■ Matching Networks
      ⁃ Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural Information Processing Systems. 2016.
    ■ One-shot Learning
      ⁃ Koch, Gregory. Siamese neural networks for one-shot image recognition. Diss. University of Toronto, 2015.
      ⁃ Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." Proceedings of The 33rd International Conference on Machine Learning. 2016.
      ⁃ Bertinetto, Luca, et al. "Learning feed-forward one-shot learners." Advances in Neural Information Processing Systems. 2016.
    ■ Attention Mechanisms
      ⁃ Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
      ⁃ Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." Advances in Neural Information Processing Systems. 2015.
      ⁃ Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. "Order matters: Sequence to sequence for sets." ICLR. 2016.
  • 40. References
    ■ Datasets
      ⁃ Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." (2009).
      ⁃ Deng, Jia, et al. "Imagenet: A large-scale hierarchical image database." IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2009.
      ⁃ Lake, Brenden M., et al. "One shot learning of simple visual concepts." Proceedings of the 33rd Annual Conference of the Cognitive Science Society. Vol. 172. 2011.
      ⁃ Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. "Building a large annotated corpus of English: The Penn Treebank." Computational Linguistics 19.2 (1993): 313-330.