Matching networks for one shot learning

Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
AI System Dept.
System Management Unit
Kazuki Fujikawa
Matching Networks for One Shot Learning
https://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning
Paper introduction
NIPS2016 Reading Group @Preferred Networks
2017/01/19

Abstract
n  One-shot learning with attention and memory
⁃  Learn a concept from one or only a few training examples
⁃  Train a fully end-to-end nearest neighbor classifier, incorporating the best characteristics of both parametric and non-parametric models
⁃  Improved one-shot accuracy on ImageNet from 87.6% to 93.2% and on Omniglot from 88.0% to 93.8% compared to competing approaches
Figure 1: Matching Networks architecture

AGENDA
n  Introduction
n  Related work
⁃  One-shot learning
⁃  Attention mechanisms
n  Matching Networks
n  Experiments
⁃  Omniglot
⁃  ImageNet
⁃  Penn Treebank

Supervised Learning
n  Learn a correspondence between training data and labels
⁃  Require a large labeled dataset for training
(e.g., CIFAR-10 [Krizhevsky+, 2009]: 6,000 images per class)
⁃  It is hard to let classifiers learn new concepts from little data
[Figure: in the training phase, a classifier learns from examples and their labels (airplane, automobile, bird, cat, deer); in the predicting phase, it outputs a label for a new example (here: automobile). Image source: https://www.cs.toronto.edu/~kriz/cifar.html]

One-shot Learning
n  Learn a concept from one or only a few training examples
⁃  The classifier can be pre-trained on datasets whose labels are not used in the predicting phase
[Figure: in the (pre-)training phase, a classifier is trained on examples and labels of one group of classes (airplane, automobile, bird, cat, deer); in the predicting phase (one-shot learning phase), it is given examples and labels of a different group of classes (dog, frog, horse, ship, truck)]

One-shot Learning
n  Task: N-way k-shot learning
[Figure: the classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) are split into two disjoint groups, one for each task]
T: Training task, T': Testing task
•  Separate labels for training and testing
•  None of the labels used in the testing phase (one-shot learning phase) are used in the training phase

One-shot Learning
[Figure: the same split of the classes into the training task T and the testing task T']
•  T' is used for one-shot learning
•  T can be used freely for training (e.g., multiclass classification)

One-shot Learning
[Figure: a label set L' = {automobile, cat, deer} is sampled from the testing task T']
L': Label set, obtained by sampling N labels from T'
•  In this figure, L' has 3 classes, thus "3-way k-shot learning"

One-shot Learning
[Figure: from the label set L' = {automobile, cat, deer}, a support set S' and a query $\hat{x}$ are sampled]
L': Label set, obtained by sampling N labels from T'
S': Support set, obtained by sampling k examples from L'
$\hat{x}$: Query, obtained by sampling 1 example from L'
•  Task: classify $\hat{x}$ into 3 classes, {automobile, cat, deer}, using the support set
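The sampling steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `sample_episode`, its parameters, and the reading of "k examples" as k examples per sampled label are assumptions made for this sketch.

```python
import random
from collections import defaultdict


def sample_episode(dataset, n_way, k_shot, n_query=1, rng=None):
    """Sample one N-way k-shot episode (hypothetical helper).

    dataset: list of (example, label) pairs drawn from the testing task T'.
    Returns (label_set, support_set, queries).
    """
    rng = rng or random.Random()

    # Group the examples by label.
    by_label = defaultdict(list)
    for x, y in dataset:
        by_label[y].append(x)

    # L': sample N labels from T'.
    label_set = rng.sample(sorted(by_label), n_way)

    support, candidates = [], []
    for y in label_set:
        xs = rng.sample(by_label[y], len(by_label[y]))  # shuffled copy
        # S': k examples for each sampled label (assumption: k per label).
        support += [(x, y) for x in xs[:k_shot]]
        # Everything else may serve as a query, so queries never overlap S'.
        candidates += [(x, y) for x in xs[k_shot:]]

    # Query: sample from the remaining examples of the same labels.
    queries = rng.sample(candidates, n_query)
    return label_set, support, queries
```

Calling `sample_episode(data, n_way=3, k_shot=1)` would give a 3-way 1-shot episode like the one in the figure, with one support example per class and one query.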

Related Work (One-shot Learning)
n  Convolutional Siamese Network [Koch+, 2015]
⁃  Learn image representation with a siamese neural network
⁃  Reuse features from the network for one-shot learning
[Figure: two CNNs embed a pair of images, and the network predicts whether the two images belong to the same class ("Same?")]

n  Memory-Augmented Neural Networks (MANN) [Santoro+, 2016]
⁃  Quickly encode and retrieve new information using external
memory, inspired by the idea of Neural Turing Machine

n  Siamese Learnet [Bertinetto+, NIPS2016]
⁃  Learn the parameters of a network to incorporate domain
specific information from a few examples
[Figure 1 of Bertinetto+ (panels: siamese, siamese learnet, learnet): Our proposed architectures predict the parameters of a network from a single example, replacing static convolutions (green) with dynamic convolutions (red). The siamese learnet predicts the parameters of an embedding function that is applied to both inputs, whereas the single-stream learnet predicts the parameters of a function that is applied to the other input. [...] Dashed connections represent parameter sharing.]
Excerpt from the paper:
[...] discriminative one-shot learning is to find a mechanism to incorporate domain-specific information in the learner, i.e. learning to learn. Another challenge, which is of practical importance in applications of one-shot learning, is to avoid a lengthy optimization process such as eq. (1).
We propose to address both challenges by learning the parameters $W$ of the predictor from a single exemplar $z$ using a meta-prediction process, i.e. a non-iterative feed-forward function $\omega$ that maps $(z; W')$ to $W$. Since in practice this function will be implemented using a deep neural network, we call it a learnet. The learnet depends on the exemplar $z$, which is a single representative of the class of interest, and contains parameters $W'$ of its own. Learning to learn can now be posed as the problem of optimizing the learnet meta-parameters $W'$ using an objective function defined below. Furthermore, the feed-forward learnet evaluation is much faster than solving the optimization problem (1).
In order to train the learnet, we require the latter to produce good predictors given any possible exemplar $z$, which is empirically evaluated as an average over n training samples $z_i$ [...]

Related Work (Attention Mechanism)
n  Sequence to Sequence with Attention [Bahdanau+, 2014]
⁃  Attend to the words in the source sentence that are relevant to generating the next target word
[Figure 1 of Bahdanau+, published as a conference paper at ICLR 2015: the graphical illustration of the proposed model trying to generate the t-th target word $y_t$ given a source sentence $(x_1, x_2, \ldots, x_T)$]
Excerpt from the paper:
$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$
It should be noted that unlike the existing encoder–decoder approach (see Eq. (2)), here the probability is conditioned on a distinct context vector $c_i$ for each target word $y_i$.
The context vector $c_i$ depends on a sequence of annotations $(h_1, \cdots, h_{T_x})$ to which an encoder maps the input sentence. Each annotation $h_i$ contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. We explain in detail how the annotations are computed in the next section.
The context vector $c_i$ is, then, computed as a weighted sum of these annotations $h_i$:
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j \qquad (5)$$
The weight $\alpha_{ij}$ of each annotation $h_j$ is computed by
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \qquad (6)$$
where
$$e_{ij} = a(s_{i-1}, h_j)$$
is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state $s_{i-1}$ (just before emitting $y_i$, Eq. (4)) and the j-th annotation $h_j$ of the input sentence.
We parametrize the alignment model $a$ as a feedforward neural network which is jointly trained with all the other components of the proposed system. [...]
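As a rough illustration of equations (5) and (6), the following NumPy sketch computes the attention weights and the context vector for one decoding step. The additive scoring function with parameters `W_s`, `W_h`, and `v` is one common way to write the feedforward alignment model; the names and shapes are assumptions for this sketch, not taken from the slides.

```python
import numpy as np


def softmax(e):
    """Numerically stable softmax over a 1-D array."""
    w = np.exp(e - np.max(e))
    return w / w.sum()


def context_vector(s_prev, H, W_s, W_h, v):
    """Compute c_i and alpha_i for one target position.

    s_prev: (d_s,)      previous decoder state s_{i-1}
    H:      (T_x, d_h)  encoder annotations h_1 .. h_{T_x}
    W_s:    (d_s, d_a), W_h: (d_h, d_a), v: (d_a,)  alignment parameters
    """
    # e_ij = a(s_{i-1}, h_j): a small feedforward alignment model (assumed form).
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v  # shape (T_x,)
    alpha = softmax(scores)                       # eq. (6)
    c = alpha @ H                                 # eq. (5): weighted sum of annotations
    return c, alpha
```

The weights `alpha` sum to one over the source positions, so `c` is a convex combination of the annotations.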

n  Pointer Networks [Vinyals+, 2015]
⁃  Generate output sequence using a distribution over the
dictionary of inputs
[Figure 1 of Vinyals+: (a) Sequence-to-Sequence - An RNN (blue) processes the input sequence to create a code vector that is used to generate the output sequence (purple) using the probability chain rule and another RNN. The output dimensionality is fixed by the dimensionality of the problem and it is the same during training and inference [1]. (b) Ptr-Net - An encoding RNN converts the input sequence to a code (blue) that is fed to the generating network (purple). At each step, the generating network produces a vector that modulates a content-based attention mechanism over inputs ([5, 2]). The output of the attention mechanism is a softmax distribution with dictionary size equal to the length of the input.]
Excerpt from the paper (Section 2.3, Ptr-Net):
The sequence-to-sequence model of Section 2.1 uses a softmax distribution over a fixed [...] dictionary to compute $p(C_i \mid C_1, \ldots, C_{i-1}, P)$ in Equation 1. Thus it cannot be used [...] where the size of the output dictionary is equal to the length of the input sequence. [...]
$$u^i_j = v^T \tanh(W_1 e_j + W_2 d_i), \quad j \in (1, \ldots, n)$$
$$p(C_i \mid C_1, \ldots, C_{i-1}, P) = \mathrm{softmax}(u^i)$$
where softmax normalizes the vector $u^i$ (of length n) to be an output distribution over the dictionary of inputs, and $v$, $W_1$, and $W_2$ are learnable parameters of the output model. [...]
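The key property of the pointer mechanism, that the output distribution has one entry per input element, can be seen in a small NumPy sketch of the two equations above. The function name and the array shapes are assumptions for illustration; the encoder and decoder states are taken as given rather than produced by RNNs.

```python
import numpy as np


def pointer_distribution(E, d, W1, W2, v):
    """Distribution over input positions for one decoding step.

    E:  (n, d_e)  encoder states e_1 .. e_n
    d:  (d_d,)    current decoder state d_i
    W1: (d_a, d_e), W2: (d_a, d_d), v: (d_a,)  learnable parameters
    """
    # u^i_j = v^T tanh(W1 e_j + W2 d_i) for every input position j.
    u = np.tanh(E @ W1.T + d @ W2.T) @ v  # shape (n,)
    # softmax(u^i): the attention weights themselves are the output.
    p = np.exp(u - u.max())
    return p / p.sum()
```

Because the result has length n, the same parameters can be applied to inputs of different lengths without changing the output layer.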

n  Sequence to Sequence for Sets [Vinyals+, ICLR2016]
⁃  Handle input sets using an extension of the seq2seq framework: the Read-Process-and-Write model
[Figure 1 of Vinyals+: The Read-Process-and-Write model (blocks: Read, Process, Write)]
Excerpt from the paper:
Neural models with memories coupled to differentiable addressing mechanism have been successfully applied to handwriting generation and recognition (Graves, 2012), machine translation (Bahdanau et al., 2015a), and more general computation machines (Graves et al., 2014; Weston et al., 2015). Since we are interested in associative memories we employed a "content" based attention. This has the property that the vector retrieved from our memory would not change if we randomly shuffled the memory. This is crucial for proper treatment of the input set X as such. In particular, our process block based on an attention mechanism uses the following:
$$q_t = \mathrm{LSTM}(q^*_{t-1}) \qquad (3)$$
$$e_{i,t} = f(m_i, q_t) \qquad (4)$$
$$a_{i,t} = \frac{\exp(e_{i,t})}{\sum_j \exp(e_{j,t})} \qquad (5)$$
$$r_t = \sum_i a_{i,t} m_i \qquad (6)$$
$$q^*_t = [q_t \; r_t] \qquad (7)$$
where i indexes through each memory vector $m_i$ (typically equal to the cardinality of X), $q_t$ is a query vector which allows us to read $r_t$ from the memories, f is a function that computes a single scalar from $m_i$ and $q_t$ (e.g., a dot product), and LSTM is an LSTM which computes a recurrent state but which takes no inputs. $q^*_t$ is the state which this LSTM evolves, and is formed by concatenating the query $q_t$ with the resulting attention readout $r_t$. [...]
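The claim that a content-based readout does not change when the memory is shuffled can be checked directly. The sketch below implements equations (4) to (6) with f chosen as a dot product (one of the options the excerpt mentions); the function name `read` and the random test data are assumptions for illustration.

```python
import numpy as np


def read(memories, q):
    """Content-based attention readout r = sum_i a_i m_i.

    memories: (n, d) memory vectors m_i
    q:        (d,)   query vector q_t
    """
    e = memories @ q         # eq. (4) with f = dot product
    a = np.exp(e - e.max())  # eq. (5), numerically stable softmax
    a /= a.sum()
    return a @ memories      # eq. (6)


rng = np.random.default_rng(0)
M = rng.normal(size=(5, 4))
q = rng.normal(size=4)
perm = rng.permutation(5)

# Shuffling the rows of the memory leaves the readout unchanged.
assert np.allclose(read(M, q), read(M[perm], q))
```

The invariance follows from the readout being a sum over memory slots whose weights depend only on each slot's content, not on its position.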

Matching Networks [Vinyals+, NIPS2016]
n  Motivation
⁃  One-shot learning must learn rapidly from new examples while retaining the ability to handle common examples
•  Simple parametric models such as deep classifiers need to be optimized again to handle new examples
•  Non-parametric models such as k-nearest neighbor don't require optimization, but their performance depends on the chosen metric
⁃  It could be effective to train an end-to-end nearest-neighbor-based classifier

n  Train a classifier through one-shot learning
[Figure: from the training task T, a label set L = {dog, horse, ship} is sampled; a support set S and a batch B are then sampled from L]
L: Label set, obtained by sampling N labels from T
S: Support set, obtained by sampling k examples from L
B: Batch, obtained by sampling b examples from L

n  System Overview
⁃  Embedding functions f, g are parameterized as a simple CNN (e.g., VGG or Inception) or as the fully conditional embedding functions described later
[Figure: the query $\hat{x}$ is embedded as $f(\hat{x}, S)$ and each example $x_i$ of the support set S as $g(x_i)$; the attention $a$ between them weights the labels $y_i$, whose sum gives $P(\hat{y} \mid \hat{x}, S)$]
Excerpt from the paper (Section 2, Model):
Our model in its simplest form computes a probability over $\hat{y}$ as follows:
$$P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i \qquad (1)$$
where $x_i, y_i$ are the inputs and corresponding label distributions from the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$, and $a$ is an attention mechanism which we discuss below. Note that eq. 1 essentially describes the output for a new class as a linear combination of the labels in the support set. Where the attention mechanism $a$ is a kernel on $X \times X$, then (1) is akin to a kernel density estimator. Where the attention mechanism is zero for the $b$ furthest $x_i$ from $\hat{x}$ according to some distance metric and an appropriate constant otherwise, then (1) is equivalent to '$k - b$'-nearest neighbours [...]. Thus (1) subsumes both KDE and kNN methods.

n  The Attention Kernel
⁃  Calculate the softmax over the cosine distance between $g(x_i)$ and $f(\hat{x}, S)$
•  Similar to nearest neighbor calculation
⁃  Train the network using cross entropy loss
[Figure: the query $\hat{x}$ embedded as $f(\hat{x}, S)$ is compared with each support set example $x_i$ embedded as $g(x_i)$; the attention $a$ weights the labels $y_i$]
Excerpt from the paper (Section 2.1.1, The Attention Kernel):
Equation 1 relies on choosing $a(\cdot, \cdot)$, the attention mechanism, which fully specifies the classifier. The simplest form that this takes [...] is to use the softmax over the cosine distance $c$:
$$a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}), g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}}$$
with embedding functions $f$ and $g$ being appropriate neural networks (potentially with $f = g$) to embed $\hat{x}$ and $x_i$.
c: cosine distance
Training objective (Section 2.2, Training Strategy): in each episode, sample a label set L from the task T, then sample a support set S and a batch B from L, and maximise the log-likelihood of the labels in B conditioned on S:
$$\theta = \arg\max_{\theta} E_{L \sim T}\left[ E_{S \sim L,\, B \sim L}\left[ \sum_{(x, y) \in B} \log P_{\theta}(y \mid x, S) \right] \right]$$
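To make the attention kernel and the cross entropy training signal concrete, here is a small NumPy sketch. It assumes the embeddings $f(\hat{x})$ and $g(x_i)$ have already been computed by some network, and it evaluates $c$ as cosine similarity; the function names and shapes are hypothetical, chosen only for this illustration.

```python
import numpy as np


def cosine(a, B, eps=1e-8):
    """Cosine similarity between vector a (d,) and each row of B (k, d)."""
    a_n = a / (np.linalg.norm(a) + eps)
    B_n = B / (np.linalg.norm(B, axis=1, keepdims=True) + eps)
    return B_n @ a_n


def matching_predict(f_x, g_S, Y):
    """P(y | x, S) = sum_i a(x, x_i) y_i.

    f_x: (d,)            embedded query
    g_S: (k, d)          embedded support set examples
    Y:   (k, n_classes)  one-hot labels of the support set
    """
    c = cosine(f_x, g_S)
    a = np.exp(c - c.max())  # softmax over the support set
    a /= a.sum()
    return a @ Y             # linear combination of the labels


def episode_loss(f_B, labels_B, g_S, Y):
    """Cross entropy over a batch B, conditioned on the support set S."""
    loss = 0.0
    for f_x, y in zip(f_B, labels_B):
        p = matching_predict(f_x, g_S, Y)
        loss -= np.log(p[y] + 1e-12)
    return loss
```

Minimising `episode_loss` over many sampled episodes corresponds to maximising the log-likelihood term inside the expectation; in practice the gradient would flow back into the embedding networks, which this sketch leaves out.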

n  The Fully Conditional Embedding g
⁃  Embed $x_i$ in consideration of S
[Figure: each support set example $x_i$ is embedded by g', the embeddings are fed through a bidirectional LSTM over the support set, and the LSTM outputs are added to $g'(x_i)$ to give $g(x_i, S)$]
g': neural network (e.g., VGG or Inception)
Excerpt from the paper (Appendix A.2, The Fully Conditional Embedding g):
[...] let $g'(x_i)$ be a neural network (similar to $f'$ above, [...] VGG or Inception model). Then we define $g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)$ with:
$$\overrightarrow{h}_i, \overrightarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})$$
$$\overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})$$
where, as above, LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. Note that the recursion [...] starts from $i = |S|$. As in eq. 3, we add a skip connection between input and outputs.

[Figure: step 1 of computing $g(x_i, S)$; each support set example $x_i$ passes through g']
Embed $x_i$ into a vector using g'
(g': neural network such as VGG or Inception)

[Figure: step 2 of computing $g(x_i, S)$; the embeddings $g'(x_i)$ pass through a bidirectional LSTM over the support set]
Feed $g'(x_i)$ into the Bi-LSTM
(g': neural network such as VGG or Inception)

[Figure: step 3 of computing $g(x_i, S)$; the Bi-LSTM outputs are added to $g'(x_i)$]
Let $g(x_i, S)$ be the sum of $g'(x_i)$ and the outputs of the Bi-LSTM
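The three steps above (embed with g', run a bidirectional LSTM over the support set, add the skip connection) can be sketched in NumPy as follows. This is only an illustration under several assumptions: the embeddings $g'(x_i)$ are given as input, the LSTM is a plain hand-written cell with a hidden size equal to the embedding size, and all names (`lstm_cell`, `fce_g`, the weight arguments) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_cell(x, h, c, W, b):
    """One step of a basic LSTM. W: (4*d, len(x) + len(h)), b: (4*d,)."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)  # input, forget, output gates and candidate
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new


def fce_g(G_prime, W_fwd, b_fwd, W_bwd, b_bwd):
    """g(x_i, S) = forward h_i + backward h_i + g'(x_i).

    G_prime: (|S|, d) array whose rows are g'(x_i).
    """
    n, d = G_prime.shape
    fwd = np.zeros((n, d))
    bwd = np.zeros((n, d))

    # Forward pass over the support set, i = 1 .. |S|.
    h, c = np.zeros(d), np.zeros(d)
    for i in range(n):
        h, c = lstm_cell(G_prime[i], h, c, W_fwd, b_fwd)
        fwd[i] = h

    # Backward pass, starting from i = |S|.
    h, c = np.zeros(d), np.zeros(d)
    for i in reversed(range(n)):
        h, c = lstm_cell(G_prime[i], h, c, W_bwd, b_bwd)
        bwd[i] = h

    # Skip connection between input and outputs.
    return fwd + bwd + G_prime
```

Because both passes carry state across the whole support set, the embedding of one example changes when any other example in S changes, which is what "in consideration of S" refers to.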

n  The Fully Conditional Embedding f
[Figure: the query $\hat{x}$ is embedded by f' and processed by an LSTM; at each step the previous output $h_{k-1}$ attends over the support set embeddings $g(x_i)$, and their weighted sum $r_{k-1}$ is fed back into the LSTM together with $h_{k-1}$; the result is $f(\hat{x}, S) = h_K$]
Excerpt from the paper (recurrence over "processing" steps k, following work from [26]):
$$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1}) \qquad (2)$$
$$h_k = \hat{h}_k + f'(\hat{x}) \qquad (3)$$
$$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i) \qquad (4)$$
$$a(h_{k-1}, g(x_i)) = \frac{e^{h_{k-1}^T g(x_i)}}{\sum_{j=1}^{|S|} e^{h_{k-1}^T g(x_j)}} \qquad (5)$$

Figure: first processing step. The Query $\hat{x}$ is embedded by f' and passed through the LSTM to give $\hat{h}_1$, and $h_1 = \hat{h}_1 + f'(\hat{x})$; the Support Set (S) embeddings $g(x_i)$ are not used yet.
25
$h_1$ is calculated without using S:
$h_1 = \mathrm{LSTM}(f'(\hat{x}), [h_0, r_0], c_0) + f'(\hat{x})$

Figure: $h_1$ is compared with each Support Set (S) embedding $g(x_i)$ to give the attention weights $a(h_1, g(x_i))$.
26
Calculate the relevance between $g(x_i)$ and $h_1$:
$a(h_1, g(x_i)) = \mathrm{softmax}(h_1^T g(x_i))$

Figure: the embeddings $g(x_i)$ are combined into $r_1$ as a weighted sum with weights $a(h_1, g(x_i))$.
27
$r_1$ is a sum of $g(x_i)$ weighted according to the relevance to $h_1$:
$r_1 = \sum_{i=1}^{|S|} a(h_1, g(x_i)) \, g(x_i)$
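The attention and read-out steps (softmax over the dot products between the query state and the support embeddings, followed by a weighted sum) can be sketched in a few lines of NumPy. The function names `attention` and `readout` and the toy vectors are assumptions for illustration only.

```python
import numpy as np

def attention(h, g_S):
    """a(h, g(x_i)): softmax over i of the dot products h^T g(x_i)."""
    scores = g_S @ h
    scores = scores - scores.max()          # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def readout(h, g_S):
    """r = sum_i a(h, g(x_i)) g(x_i): attention-weighted sum of support embeddings."""
    return attention(h, g_S) @ g_S

# Toy support embeddings g(x_i) and a query state h_1 (hypothetical values).
g_S = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
h1 = np.array([2.0, 0.0])
a = attention(h1, g_S)
r1 = readout(h1, g_S)
print(a.round(3), r1.round(3))
```

The support item most aligned with `h1` (the first row here) receives the largest weight, so `r1` is pulled towards it.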

Figure: the weighted sum $r_1$ is fed back into the LSTM together with $h_1$; the LSTM output is added (+) to $f'(\hat{x})$ to give the next state.
28
From the second step on, the state is calculated using S: by Eq. (2) and (3), $h_2$ depends on $r_1$, which is built from the support embeddings $g(x_i)$.

Figure: the full recurrence. At every step k, the weighted sum $r_{k-1}$ of the Support Set (S) embeddings, with weights $a(h_{k-1}, g(x_i))$, is fed into the LSTM together with $h_{k-1}$.
29
Let $f(\hat{x}, S)$ be the output after K steps:
$f(\hat{x}, S) = h_K$
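Putting the steps together, a minimal NumPy sketch of the K-step recurrence of Eq. (2) to (5) might look as follows. The helper names, the random untrained weights, and the zero initialisation of h_0, r_0 and c_0 are assumptions; the zero r_0 is chosen so that the first step does not use S, as the slides describe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_in, c, W, U, b):
    """LSTM step whose recurrent input h_in may be wider than the cell state."""
    d = c.shape[0]
    z = W @ x + U @ h_in + b
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
    c_new = f * c + i * np.tanh(z[3 * d:])
    return o * np.tanh(c_new), c_new

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def embed_query(f_prime, g_S, params, K):
    """f(x_hat, S) = h_K after K processing steps."""
    d = f_prime.shape[0]
    h, r, c = np.zeros(d), np.zeros(d), np.zeros(d)   # h_0, r_0, c_0 (assumed zero)
    for _ in range(K):
        h_hat, c = lstm_step(f_prime, np.concatenate([h, r]), c, *params)  # Eq. (2)
        h = h_hat + f_prime                                                # Eq. (3)
        r = softmax(g_S @ h) @ g_S                                         # Eq. (4), (5)
    return h

rng = np.random.default_rng(0)
d = 4
params = (rng.normal(scale=0.5, size=(4 * d, d)),      # input weights
          rng.normal(scale=0.5, size=(4 * d, 2 * d)),  # weights for [h, r]
          np.zeros(4 * d))
f_prime = rng.normal(size=d)                            # stand-in for f'(x_hat)
g_S = rng.normal(size=(5, d))                           # stand-in for g(x_i)
f_xS = embed_query(f_prime, g_S, params, K=3)
print(f_xS.shape)                                       # (4,)
```

With `K=1` the result does not depend on `g_S` at all; the support set only enters from the second step on, through the read-out `r`.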

Experimental Settings
n  Datasets
⁃  Image classification sets
•  Omniglot [Lake+, 2011]
•  ImageNet [Deng+, 2009]
ref. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
⁃  Language modeling
•  Penn Treebank [Marcus+, 1993]
30
4.1.3 One-Shot Language Modeling
We also introduce a new one-shot language task which is analogous to those examined for images.
The task is as follows: given a query sentence with a missing word in it, and a support set of sentences
which each have a missing word and a corresponding 1-hot label, choose the label from the support
set that best matches the query sentence. Here we show a single example, though note that the words
on the right are not provided and the labels for the set are given as 1-hot-of-5 vectors.
1. an experimental vaccine can alter the immune response of people infected with the aids virus a <blank_token> u.s. scientist said. | prominent
2. the show one of five new nbc <blank_token> is the second casualty of the three networks so far this fall. | series
3. however since eastern first filed for chapter N protection march N it has consistently promised to pay creditors N cents on the <blank_token>. | dollar
4. we had a lot of people who threw in the <blank_token> today said <unk> ellis a partner in benjamin jacobson & sons a specialist in trading ual stock on the big board. | towel
5. it's not easy to roll out something that <blank_token> and make it pay mr. jacob says. | comprehensive
Query: in late new york trading yesterday the <blank_token> was quoted at N marks down from N marks late friday and at N yen down from N yen late friday. | dollar
Sentences were taken from the Penn Treebank dataset [15]. On each trial, we make sure that the set
and batch are populated with sentences that are non-overlapping. This means that we do not use
words with very low frequency counts; e.g. if there is only a single sentence for a given word we do
not use this data since the sentence would need to be in both the set and the batch. As with the image
tasks, each trial consisted of a 5 way choice between the classes available in the set. We used a batch
size of 20 throughout the sentence matching task and varied the set size across k=1,2,3. We ensured
that the same number of sentences were available for each class in the set. We split the words into a
randomly sampled 9000 for training and 1000 for testing, and we used the standard test set to report
results. Thus, neither the words nor the sentences used during test time had been seen during training.
We compared our one-shot matching model to an oracle LSTM language model (LSTM-LM) [30]

Experimental Settings (Omniglot)
n  Baseline
⁃  Matching on raw pixels
⁃  Matching on discriminative features from VGG
(Baseline classifier)
⁃  MANN
⁃  Convolutional Siamese Network
n  Datasets
⁃  training: 1200 characters
⁃  testing: 423 characters
31
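The N-way k-shot setting with disjoint training and testing characters can be sketched as below. This is only an illustration: the function `sample_episode`, the integer ids standing in for characters and drawings, and the random split are assumptions; only the 1200 / 423 split sizes come from the slide.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way k-shot episode: a support set plus held-out queries."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        items = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query

rng = random.Random(0)
# Toy stand-in for Omniglot: 1623 character ids with 20 drawing ids each.
characters = list(range(1623))
rng.shuffle(characters)
train_chars, test_chars = characters[:1200], characters[1200:]
data = {c: [(c, j) for j in range(20)] for c in characters}
train_data = {c: data[c] for c in train_chars}
test_data = {c: data[c] for c in test_chars}

# A 5-way 1-shot test episode uses only characters never seen in training.
support, query = sample_episode(test_data, n_way=5, k_shot=1, n_query=2, rng=rng)
print(len(test_chars), len(support), len(query))   # 423 5 10
```

Training episodes would be drawn the same way from `train_data`, so that training mimics the conditions of the test episodes.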

Experimental Results (Omniglot)
32
n  Fully Conditional Embedding (FCE) did not seem to help much
n  Baseline and Siamese Net were improved with ﬁne-tuning
took this network and used the features from the last layer (before the softmax) for nearest neighbour
matching, a strategy commonly used in computer vision [3] which has achieved excellent results
across many tasks. Following [11], the convolutional siamese nets were trained on a same-or-different
task of the original training data set and then the last layer was used for nearest neighbour matching.
Table 1: Results on the Omniglot dataset.
Model                          | Matching Fn | Fine Tune | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot
PIXELS                         | Cosine      | N         | 41.7%        | 63.2%        | 26.7%         | 42.6%
BASELINE CLASSIFIER            | Cosine      | N         | 80.0%        | 95.0%        | 69.5%         | 89.1%
BASELINE CLASSIFIER            | Cosine      | Y         | 82.3%        | 98.4%        | 70.6%         | 92.0%
BASELINE CLASSIFIER            | Softmax     | Y         | 86.0%        | 97.6%        | 72.9%         | 92.3%
MANN (NO CONV) [21]            | Cosine      | N         | 82.8%        | 94.9%        | –             | –
CONVOLUTIONAL SIAMESE NET [11] | Cosine      | N         | 96.7%        | 98.4%        | 88.0%         | 96.5%
CONVOLUTIONAL SIAMESE NET [11] | Cosine      | Y         | 97.3%        | 98.4%        | 88.1%         | 97.0%
MATCHING NETS (OURS)           | Cosine      | N         | 98.1%        | 98.9%        | 93.8%         | 98.5%
MATCHING NETS (OURS)           | Cosine      | Y         | 97.9%        | 98.7%        | 93.5%         | 98.7%

Experimental Settings (ImageNet)
n  Baseline
⁃  Matching on raw pixels
⁃  Matching on discriminative features from InceptionV3
(Baseline classifier)
n  Datasets
⁃  miniImageNet (size: 84x84)
•  training: (80 classes)
•  testing: (20 classes)
⁃  randImageNet
•  training: randomly picked classes (882 classes)
•  testing: remaining classes (118 classes)
⁃  dogsImageNet
•  training: all non-dog classes (882 classes)
•  testing: dog classes (118 classes)
33
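The two full-ImageNet splits differ only in how the 118 test classes are chosen. A small sketch under stated assumptions: the class ids are hypothetical placeholders, and the block of ids standing in for the dog classes is an arbitrary choice for the example, not the actual list used in the paper.

```python
import random

def split_labels(labels, test_labels):
    """Return (train, test) label lists with the test labels held out."""
    test = set(test_labels)
    train = [label for label in labels if label not in test]
    return train, sorted(test)

labels = [f"class_{i:04d}" for i in range(1000)]    # hypothetical class ids
rng = random.Random(0)

# randImageNet: hold out 118 randomly chosen classes.
rand_train, rand_test = split_labels(labels, rng.sample(labels, 118))

# dogsImageNet: hold out the dog classes (here an arbitrary block of 118 ids).
dog_like = labels[151:269]
dogs_train, dogs_test = split_labels(labels, dog_like)

print(len(rand_train), len(rand_test), len(dogs_train), len(dogs_test))  # 882 118 882 118
```

In the random split the test classes are spread over the whole label tree, whereas in the dogs split they are all closely related, which makes the test episodes fine-grained.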

Experimental Results (miniImageNet)
34
Figure 2: Example of two 5-way problem instance on ImageNet. The images in the set S' contain classes never seen during training. Our model makes far less mistakes than the Inception baseline.
Table 2: Results on miniImageNet.
Model                | Matching Fn  | Fine Tune | 5-way 1-shot | 5-way 5-shot
PIXELS               | Cosine       | N         | 23.0%        | 26.6%
BASELINE CLASSIFIER  | Cosine       | N         | 36.6%        | 46.0%
BASELINE CLASSIFIER  | Cosine       | Y         | 36.2%        | 52.2%
BASELINE CLASSIFIER  | Softmax      | Y         | 38.4%        | 51.2%
MATCHING NETS (OURS) | Cosine       | N         | 41.2%        | 56.2%
MATCHING NETS (OURS) | Cosine       | Y         | 42.4%        | 58.0%
MATCHING NETS (OURS) | Cosine (FCE) | N         | 44.2%        | 57.0%
MATCHING NETS (OURS) | Cosine (FCE) | Y         | 46.6%        | 60.0%
[…] 1-shot tasks from the training data set, incorporating Full Context Embeddings and our Matching Networks and training strategy.
The results of the randImageNet and dogsImageNet experiments are shown in Table 3. The Inception Oracle (trained on all classes) performs almost perfectly when restricted to 5 classes only, which is not too surprising given its impressive top-1 accuracy. When trained solely on ≠L_rand, Matching Nets improve upon Inception by almost 6% when tested on L_rand, halving the errors. Figure 2 shows two instances of 5-way one-shot learning, where Inception fails. Looking at all the errors, Inception appears to sometimes prefer an image above all others (these images tend to be cluttered like the example in the second column, or more constant in color). Matching Nets, on the other hand, manage to recover from these outliers that sometimes appear in the support set S'.
Matching Nets manage to improve upon Inception on the complementary subset ≠L_dogs (although […]
n  Matching Networks outperformed the baselines
n  Fully Conditional Embedding (FCE) was shown to be effective at
improving the performance in this task

Experimental Results (randImageNet, dogsImageNet)
35
n  Matching Networks outperformed Inception Classifier in L_rand,
but degraded in L_dogs
n  The decrease of the performance in L_dogs might be caused by the
different distributions of labels between training and testing
⁃  Training support set comes from a random distribution
whereas testing one comes from similar classes
Table 3: Results on full ImageNet on rand and dogs one-shot tasks (ImageNet 5-way 1-shot Acc). Note that ≠L_rand and ≠L_dogs are sets of classes which are seen during training, but are provided for completeness.
Model                | Matching Fn    | Fine Tune | L_rand | ≠L_rand | L_dogs | ≠L_dogs
PIXELS               | Cosine         | N         | 42.0%  | 42.8%   | 41.4%  | 43.0%
INCEPTION CLASSIFIER | Cosine         | N         | 87.6%  | 92.6%   | 59.8%  | 90.0%
MATCHING NETS (OURS) | Cosine (FCE)   | N         | 93.2%  | 97.0%   | 58.8%  | 96.4%
INCEPTION ORACLE     | Softmax (Full) | Y (Full)  | ≈ 99%  | ≈ 99%   | ≈ 99%  | ≈ 99%
[…] this setup is not one-shot, as the feature extraction has been trained on these labels). […] much more challenging L_dogs subset, our model degrades by 1%. We hypothesize […] that the sampled set during training, S, comes from a random distribution of labels […] whereas the testing support set S' from L_dogs contains similar classes, more akin […] classification. Thus, we believe that if we adapted our training strategy to samples S from fine grained sets of labels instead of sampling uniformly from the leafs of the ImageNet class tree, improvements could be attained. We leave this as future work.
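The "almost 6%" gain and the "halving the errors" remark can be checked directly from the Table 3 accuracies. The short script below is only a worked arithmetic example; the dictionary layout and the function name `compare` are choices made for the illustration.

```python
# 5-way 1-shot accuracies (%) copied from Table 3.
acc = {
    "Inception": {"L_rand": 87.6, "L_dogs": 59.8},
    "Matching Nets": {"L_rand": 93.2, "L_dogs": 58.8},
}

def compare(split):
    """Absolute accuracy gain and relative error reduction on one class subset."""
    base, ours = acc["Inception"][split], acc["Matching Nets"][split]
    gain = ours - base                               # change in accuracy points
    err_base, err_ours = 100 - base, 100 - ours      # error rates in %
    rel_err_drop = (err_base - err_ours) / err_base  # fraction of errors removed
    return gain, err_base, err_ours, rel_err_drop

for split in ("L_rand", "L_dogs"):
    gain, err_base, err_ours, rel = compare(split)
    print(f"{split}: {gain:+.1f} points, error {err_base:.1f}% -> {err_ours:.1f}%, "
          f"errors removed: {rel:+.1%}")
```

On L_rand the error falls from 12.4% to 6.8%, which removes roughly 45% of the errors; on L_dogs the error rises from 40.2% to 41.2%, the 1-point degradation noted on the slide.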

Experimental Settings (Penn Treebank)
36
n  Fill in a blank in a query sentence by a label in a support set
Figure: each Support Set (S) sentence x_i (e.g. "… virus a", "… new nbc", "… on the") and the Query $\hat{x}$ (e.g. "… yesterday the") is encoded word by word with LSTMs into $g(x_i)$ and $f(\hat{x}, S)$; the attention a (c: cosine distance) weights the labels y_i (prominent, series, dollar, towel, …) to choose the missing word (dollar).
$P(\hat{y} | \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i) \, y_i$ (1)
where x_i, y_i are the inputs and corresponding label distributions from the support set S = {(x_i, y_i)}_{i=1}^{k}, and a is an attention mechanism which we discuss below. Note that eq. 1 essentially describes the output for a new class as a linear combination of the labels in the support […] Where the attention mechanism a is a kernel on X × X, then (1) is akin to a kernel density esti[…] Where the attention mechanism is zero for the b furthest x_i from $\hat{x}$ according to some dis[…] metric and an appropriate constant otherwise, then (1) is equivalent to 'k − b'-nearest neigh[…] (although this requires an extension to the attention mechanism that we describe in Section 2[…] Thus (1) subsumes both KDE and kNN methods. Another view of (1) is where a acts as an attention mechanism and the y_i act as values bound to the corresponding keys x_i, much like a hash tab[…] this case we can understand this as a particular kind of associative memory where, given […] we "point" to the corresponding example in the support set, retrieving its label. Hence the f[…] form defined by the classifier c_S($\hat{x}$) is very flexible and can adapt easily to any new support […]
Equation 1 relies on choosing a(., .), the attention mechanism, which fully specifies the classifier. The simplest form that this takes (and which has very tight relationships with […] attention models and kernel functions) is to use the softmax over the cosine distance c,
$a(\hat{x}, x_i) = e^{c(f(\hat{x}), g(x_i))} / \sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}$
with embedding functions f and g being appropriate neural networks (potentially with f = g) to embed $\hat{x}$ and x_i. In our experiments we […] examples where f and g are parameterised variously as deep convolutional networks for image tasks (as in VGG[22] or Inception[24]) or a simple form word embedding for language tasks ([…] Section 4).
We note that, though related to metric learning, the classifier defined by Equation 1 is discri[…] For a given support set S and sample to classify […] pairs (x', y') ∈ S such that y' = y and misalign[…] methods such as Neighborhood Component An[…] However, the objective that we are trying to opti[…] classification, and thus we expect it to perform b[…]
4.1.3 One-Shot Language Modeling (continued)
[…] trained on all the words. In this setup, the LSTM has an unfair advantage as it is not doing one-shot learning but seeing all the data – thus, this should be taken as an upper bound. To do so, we examined a similar setup wherein a sentence was presented to the model with a single word filled in with 5 different possible words (including the correct answer). For each of these 5 sentences the model gave a log-likelihood and the max of these was taken to be the choice of the model.
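The prediction rule of Eq. (1), with the attention taken as a softmax over cosine similarities between the embedded query and the embedded support items, can be sketched as follows. The function names and the toy embeddings are assumptions for illustration; in the experiments the embeddings would come from the LSTM or CNN encoders.

```python
import numpy as np

def cosine(u, V):
    """Cosine similarity between vector u and every row of V."""
    return (V @ u) / (np.linalg.norm(V, axis=1) * np.linalg.norm(u) + 1e-12)

def predict(f_x, g_S, y_S):
    """P(y_hat | x_hat, S) = sum_i a(x_hat, x_i) y_i with softmax-over-cosine attention."""
    c = cosine(f_x, g_S)
    e = np.exp(c - c.max())
    a = e / e.sum()
    return a @ y_S

# Toy 3-way 1-shot problem (hypothetical embeddings, one-hot labels).
g_S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # g(x_i)
y_S = np.eye(3)                                        # y_i as one-hot rows
f_x = np.array([0.9, 0.1])                             # f(x_hat)
p = predict(f_x, g_S, y_S)
print(p.round(3), int(p.argmax()))
```

The output is a distribution over the labels present in the support set, so a new support set with new classes can be used without changing any parameters.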

Experimental Settings and Results (Penn Treebank)
37
n  Baseline
⁃  Oracle LSTM-LM
•  Trained on all the words (not one-shot)
•  Consider this model as an upper bound
n  Datasets
⁃  training: 9000 words
⁃  testing: 1000 words
n  Results
5-way accuracy
Model          | 1-shot  | 2-shot | 3-shot
Matching Nets  | 32.4%   | 36.1%  | 38.2%
Oracle LSTM-LM | (72.8%) | -      | -
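The way the oracle baseline picks a word (fill the blank with each of the 5 candidates, score every filled sentence, take the maximum) can be sketched as below. The scorer here is a deliberately trivial stand-in, not an LSTM language model: `toy_log_likelihood` and its tiny set of known bigrams are hypothetical and exist only to make the example runnable.

```python
def choose_word(sentence, candidates, log_likelihood):
    """Fill the blank with each candidate and return the best-scoring one."""
    filled = [sentence.replace("<blank_token>", word) for word in candidates]
    scores = [log_likelihood(s) for s in filled]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]

# Hypothetical scorer: counts bigrams from a tiny hand-made list.
KNOWN_BIGRAMS = {("the", "dollar"), ("dollar", "was")}

def toy_log_likelihood(sentence):
    words = sentence.split()
    return sum(1 for pair in zip(words, words[1:]) if pair in KNOWN_BIGRAMS)

query = "in late new york trading yesterday the <blank_token> was quoted at N marks"
candidates = ["prominent", "series", "dollar", "towel", "comprehensive"]
print(choose_word(query, candidates, toy_log_likelihood))   # dollar
```

With 5 candidates, random guessing gives 20% accuracy, which is a useful reference point when reading the results table.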

Conclusion
n  The authors proposed Matching Networks: a nearest-neighbor-based
approach trained fully end-to-end
n  Key points
⁃  "One-shot learning is much easier if you train the network to
do one-shot learning" [Vinyals+, 2016]
⁃  Matching Networks have a non-parametric structure, and can
therefore take in new examples rapidly
n  Findings
⁃  Matching Networks improved the performance on Omniglot,
miniImageNet and randImageNet, but degraded on dogsImageNet
⁃  One-shot learning with fine-grained sets of labels is difficult
to solve, and thus could be an exciting challenge in this area
38

References
n  Matching Networks
⁃  Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural
Information Processing Systems. 2016.
n  One-shot Learning
⁃  Koch, Gregory. Siamese neural networks for one-shot image recognition. Diss.
University of Toronto, 2015.
⁃  Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks."
Proceedings of The 33rd International Conference on Machine Learning. 2016.
⁃  Bertinetto, Luca, et al. "Learning feed-forward one-shot learners." Advances in Neural
Information Processing Systems. 2016.
n  Attention Mechanisms
⁃  Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation
by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
⁃  Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." Advances in
Neural Information Processing Systems. 2015.
⁃  Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. "Order matters: Sequence to
sequence for sets." ICLR. 2016.
39

References
n  Datasets
⁃  Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny
images." (2009).
⁃  Deng, Jia, et al. "Imagenet: A large-scale hierarchical image database." Computer
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
⁃  Lake, Brenden M., et al. "One shot learning of simple visual concepts." Proceedings of
the 33rd Annual Conference of the Cognitive Science Society. Vol. 172. 2011.
⁃  Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. "Building a large
annotated corpus of English: The Penn Treebank." Computational linguistics 19.2
(1993): 313-330.
40
