This document summarizes the paper "Matching Networks for One Shot Learning". It discusses one-shot learning, in which a classifier must learn new concepts from only one or a few examples, and introduces matching networks, an approach that trains an end-to-end nearest-neighbor classifier for one-shot learning tasks. The matching networks architecture uses an attention mechanism to compare a test example against a small support set, and achieves state-of-the-art one-shot accuracy on Omniglot and other datasets. The document provides background on one-shot learning challenges and related work on siamese networks, memory-augmented neural networks, and attention mechanisms.
2. Abstract
- One-shot learning with attention and memory
  - Learn a concept from one or only a few training examples
  - Train a fully end-to-end nearest neighbor classifier, incorporating the best characteristics of both parametric and non-parametric models
  - Improved one-shot accuracy on Omniglot from 88.0% to 93.2% compared to competing approaches
[Figure 1: Matching Networks architecture]
4. Supervised Learning
- Learn a correspondence between training data and labels
  - Requires a large labeled dataset for training (e.g. CIFAR-10 [Krizhevsky+, 2009]: 6,000 images per class)
  - It is hard for classifiers to learn new concepts from little data
[Figure: training phase - a classifier learns from many labeled examples of {airplane, automobile, bird, cat, deer}; predicting phase - the classifier outputs labels for new examples. Images: https://www.cs.toronto.edu/~kriz/cifar.html]
5. One-shot Learning
- Learn a concept from one or only a few training examples
  - A classifier can be (pre-)trained on a dataset whose labels are not used in the predicting phase
[Figure: (pre-)training phase - a classifier is trained on labels {airplane, automobile, bird, cat, deer}; predicting phase (one-shot learning phase) - a classifier handles the disjoint labels {dog, frog, horse, ship, truck}. Images: https://www.cs.toronto.edu/~kriz/cifar.html]
6. One-shot Learning
- Task: N-way k-shot learning
  - Separate label sets for training and testing
  - None of the labels used in the testing phase (the one-shot learning phase) appear in the training phase
[Figure: training task T covers labels {dog, frog, horse, ship, truck}; testing task T' covers labels {airplane, automobile, bird, cat, deer}. Images: https://www.cs.toronto.edu/~kriz/cifar.html]
7. One-shot Learning
- Task: N-way k-shot learning
  - T' is used for one-shot learning
  - T can be used freely for training (e.g. multiclass classification)
[Figure: the same training/testing task split as the previous slide. Images: https://www.cs.toronto.edu/~kriz/cifar.html]
8. One-shot Learning
- Task: N-way k-shot learning
  - L': label set, sampled as N labels from T'
  - In this figure L' has 3 classes ({automobile, cat, deer}), hence "3-way k-shot learning"
[Figure: the same task split as above, with L' = {automobile, cat, deer} sampled from T'. Images: https://www.cs.toronto.edu/~kriz/cifar.html]
9. One-shot Learning
- Task: N-way k-shot learning
  - L': label set, sampled as N labels from T'
  - S': support set, k examples sampled from L'
  - x̂: query, one further example sampled from L'
  - Task: classify x̂ into the 3 classes {automobile, cat, deer} using the support set
[Figure: the same task split as above, showing the support set S' and query x̂ drawn from L' = {automobile, cat, deer}. Images: https://www.cs.toronto.edu/~kriz/cifar.html]
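To make the episode construction above concrete, here is a minimal Python/NumPy sketch of N-way k-shot episode sampling. The dataset layout (a dict mapping label to an array of examples) and all function/parameter names are illustrative assumptions, not from the paper.

```python
import numpy as np

def sample_episode(data_by_label, n_way=3, k_shot=1, rng=None):
    """Sample one N-way k-shot episode: a label set L', a support set S',
    and a single query with its ground-truth label.

    data_by_label: dict mapping each label of the task to an array of examples.
    """
    rng = rng or np.random.default_rng()
    # L': sample N labels from the task's label pool
    label_set = rng.choice(list(data_by_label), size=n_way, replace=False)
    query_label = rng.choice(label_set)  # the class the query is drawn from

    support_x, support_y, query = [], [], None
    for label in label_set:
        pool = data_by_label[label]
        # k support examples per class, plus one extra draw for the query class
        n_draw = k_shot + (1 if label == query_label else 0)
        idx = rng.choice(len(pool), size=n_draw, replace=False)
        picked = pool[idx]
        if label == query_label:
            query, picked = picked[0], picked[1:]
        support_x.extend(picked)
        support_y.extend([label] * k_shot)
    return np.array(support_x), np.array(support_y), query, query_label
```

With CIFAR-10-style data this would be called as, e.g., sample_episode(test_split, n_way=3, k_shot=1) to reproduce the {automobile, cat, deer} episode in the figure.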
10. Related Work (One-shot Learning)
- Convolutional Siamese Network [Koch+, 2015]
  - Learn an image representation with a siamese neural network
  - Reuse features from the network for one-shot learning
[Figure: two CNNs with shared weights embed a pair of images and predict whether they show the same class ("CNN / CNN / Same?")]
12. Related Work (One-shot Learning)
- Siamese Learnet [Bertinetto+, NIPS2016]
  - Learn the parameters of a network so as to incorporate domain-specific information from a few examples
[Figure 1 of the paper: the proposed architectures predict the parameters of a network from a single example, replacing static convolutions with dynamic convolutions. The siamese learnet predicts the parameters of an embedding function applied to both inputs, whereas the single-stream learnet predicts the parameters of a function applied to the other input.]

Paper excerpt: a challenge of discriminative one-shot learning is to find a mechanism to incorporate domain-specific information in the learner, i.e. learning to learn; another is to avoid a lengthy optimization process. Both are addressed by learning the parameters W of the predictor from a single exemplar z using a meta-prediction process: a non-iterative feed-forward function ω that maps (z; W') to W, implemented as a deep neural network called a learnet with meta-parameters W' of its own. Learning to learn becomes the problem of optimizing W', and the feed-forward learnet evaluation is much faster than solving an optimization problem per class; training requires the learnet to produce good predictors for any exemplar z, evaluated as an average over n training samples z_i.
13. Related Work (Attention Mechanism)
- Sequence to Sequence with Attention [Bahdanau+, 2014]
  - Attend to the word in the source sentence most relevant to generating the next target word
[Figure 1 of the paper: graphical illustration of the model generating the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T), with annotations h_j and attention weights α_{t,j}.]

Paper excerpt: the decoder state is s_i = f(s_{i-1}, y_{i-1}, c_i). Unlike the existing encoder-decoder approach, the probability is conditioned on a distinct context vector c_i for each target word y_i. The context vector depends on a sequence of annotations (h_1, ..., h_{T_x}) to which an encoder maps the input sentence; each annotation h_j contains information about the whole input sequence with a strong focus on the parts surrounding the j-th word. The context vector is a weighted sum of these annotations:

    c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j        (5)

    \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}        (6)

    e_{ij} = a(s_{i-1}, h_j)

where the alignment model a scores how well the inputs around position j and the output at position i match, based on the RNN hidden state s_{i-1} (just before emitting y_i) and the j-th annotation h_j. The alignment model is parameterized as a feedforward neural network trained jointly with all other components of the system.
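A minimal NumPy sketch of one step of this additive (Bahdanau-style) attention, under assumed dimensions; the parameter names (W1, W2, v) mirror the equations above rather than any released code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(s_prev, H, W1, W2, v):
    """One attention step following eqs. (5)-(6): score every annotation h_j
    against the previous decoder state s_{i-1}, then take the weighted sum.

    s_prev: (d_s,) previous decoder state s_{i-1}
    H:      (T_x, d_h) encoder annotations h_1..h_{T_x}
    W1: (d_a, d_s), W2: (d_a, d_h), v: (d_a,)
    """
    # e_{ij} = v^T tanh(W1 s_{i-1} + W2 h_j), computed for all j at once
    scores = np.tanh(s_prev @ W1.T + H @ W2.T) @ v   # (T_x,)
    alpha = softmax(scores)                           # attention weights, eq. (6)
    return alpha @ H, alpha                           # context c_i, eq. (5)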
14. Related Work (Attention Mechanism)
- Pointer Networks [Vinyals+, 2015]
  - Generate the output sequence using a distribution over the dictionary of inputs
[Figure 1 of the paper: (a) Sequence-to-Sequence - an RNN processes the input sequence into a code vector, from which another RNN generates the output sequence; the output dictionary is fixed by the problem. (b) Ptr-Net - at each step the generating network modulates a content-based attention mechanism over the inputs, and the attention softmax itself is the output distribution, with dictionary size equal to the input length.]

Paper excerpt: the sequence-to-sequence model computes p(C_i | C_1, ..., C_{i-1}, P) with a softmax over a fixed-size output dictionary, so it cannot be used when the output dictionary size equals the length of the input sequence. Ptr-Net instead models this probability using the attention mechanism:

    u^i_j = v^T \tanh(W_1 e_j + W_2 d_i),   j \in (1, ..., n)
    p(C_i | C_1, ..., C_{i-1}, P) = \mathrm{softmax}(u^i)

where the softmax normalizes the vector u^i (of length n) into an output distribution over the dictionary of inputs, and v, W_1, W_2 are learnable parameters. The encoder states e_j are not used to propagate extra information to the decoder; instead u^i_j act as pointers to the input elements. The approach specifically targets problems whose outputs are discrete and correspond to positions in the input, such as convex hull, Delaunay triangulation, and the Travelling Salesman Problem on planar point sets.
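Since the scoring is the same additive form as in the previous sketch, a Ptr-Net sketch only needs to show that the softmax itself becomes the output distribution over input positions (assumed shapes; hypothetical names):

```python
import numpy as np

def pointer_step(d_i, E, W1, W2, v):
    """One Ptr-Net decoding step: a distribution over the n input positions
    rather than over a fixed vocabulary.

    d_i: (d_d,) decoder state; E: (n, d_e) encoder states e_1..e_n.
    W1: (d_a, d_e), W2: (d_a, d_d), v: (d_a,)
    """
    scores = np.tanh(E @ W1.T + d_i @ W2.T) @ v  # u^i_j for all j
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()          # p(C_i | ., P) = softmax(u^i)
    return probs                  # its argmax points at an input element
```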
15. Related Work (Attention Mechanism)
- Sequence to Sequence for Sets [Vinyals+, ICLR2016]
  - Handle input sets using an extension of the seq2seq framework: the Read-Process-and-Write model
[Figure 1 of the paper: the Read-Process-and-Write model.]

Paper excerpt: since the authors are interested in associative memories, the process block employs "content"-based attention, which has the property that the vector retrieved from memory would not change if the memory were randomly shuffled - crucial for proper treatment of the input set X as such:

    q_t = \mathrm{LSTM}(q^*_{t-1})                                  (3)
    e_{i,t} = f(m_i, q_t)                                           (4)
    a_{i,t} = \frac{\exp(e_{i,t})}{\sum_j \exp(e_{j,t})}            (5)
    r_t = \sum_i a_{i,t} m_i                                        (6)
    q^*_t = [q_t, r_t]                                              (7)

where i indexes the memory vectors m_i (typically one per element of X), q_t is a query vector used to read r_t from the memories, f computes a single scalar from m_i and q_t (e.g. a dot product), and the LSTM computes a recurrent state but takes no inputs; q^*_t is the evolving state, formed by concatenating the query q_t with the attention readout r_t.
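A compact NumPy sketch of the process block's permutation-invariant read (eqs. (3)-(7)); the input-less LSTM is stubbed as a generic update function since only the attention read matters here - an illustrative simplification, not the paper's implementation.

```python
import numpy as np

def process_block(M, q_star0, lstm_step, n_steps):
    """Repeatedly read from a memory set M with content-based attention.

    M: (n, d) memory vectors m_i, one per set element.
    q_star0: (2 * d,) initial concatenated state q*_0.
    lstm_step: function mapping q*_{t-1} -> q_t of shape (d,)
               (stand-in for the input-less LSTM of eq. (3)).
    """
    q_star = q_star0
    for _ in range(n_steps):
        q = lstm_step(q_star)                  # (3) evolve the query
        e = M @ q                              # (4) dot-product scores f(m_i, q_t)
        a = np.exp(e - e.max()); a /= a.sum()  # (5) attention weights
        r = a @ M                              # (6) readout r_t
        q_star = np.concatenate([q, r])        # (7) q*_t = [q_t, r_t]
    return q_star  # unchanged under any permutation of M's rows
```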
16. Matching Networks [Vinyals+, NIPS2016]
- Motivation
  - For one-shot learning it is important to attain rapid learning from new examples while retaining ability on common examples
    - Simple parametric models such as deep classifiers need further optimization to handle new examples
    - Non-parametric models such as k-nearest neighbors require no optimization, but their performance depends on the chosen metric
  - It could therefore be effective to train an end-to-end nearest-neighbor-based classifier
17. Matching Networks [Vinyals+, NIPS2016]
- Train a classifier through one-shot learning tasks
  - L: label set, sampled as N labels from T (e.g. {dog, horse, ship})
  - S: support set, k examples sampled from L
  - B: batch, b examples sampled from L
[Figure: the same training/testing task split as before, with L, S, and B drawn from the training task T. Images: https://www.cs.toronto.edu/~kriz/cifar.html]
18. Matching Networks [Vinyals+, NIPS2016]
- System Overview
  - Embedding functions f, g are parameterized as a simple CNN (e.g. VGG or Inception) or as the fully conditional embedding functions described later
[Figure 1: Matching Networks architecture. Support set examples x_i with labels y_i are embedded by g, the query x̂ by f, and an attention mechanism a combines g(x_i) and f(x̂, S) into the prediction.]

Paper excerpt: the model is trained by showing only a few examples per class, switching the task from minibatch to minibatch, much like how it will be tested when presented with a few examples of a new task. The non-parametric approach is based on two components: the architecture follows recent advances in neural networks augmented with memory, and, given a (small) support set S, the model defines a classifier c_S for each S, i.e. a mapping S -> c_S(.). In its simplest form the model computes a probability over ŷ as

    P(ŷ | x̂, S) = \sum_{i=1}^{k} a(x̂, x_i) \, y_i        (1)

where x_i, y_i are the inputs and corresponding label distributions from the support set S = {(x_i, y_i)}_{i=1}^{k}, and a is an attention mechanism. Eq. (1) essentially describes the output for a new class as a linear combination of the labels in the support set. Where a is a kernel on X × X, (1) is akin to a kernel density estimator; where a is zero for the b furthest x_i from x̂ according to some distance metric and an appropriate constant otherwise, (1) is equivalent to "k − b"-nearest neighbours. Thus (1) subsumes both KDE and kNN methods. Another view of (1) is that a acts as an attention mechanism and the y_i as values bound to the corresponding keys x_i: a particular kind of associative memory that, given a query, "points" to the corresponding example in the support set and retrieves its label. Hence the functional form defined by the classifier c_S(x̂) is very flexible and can adapt easily to any new support set.
19. Matching Networks [Vinyals+, NIPS2016]
- The Attention Kernel
  - Compute a softmax over the cosine distance c between the embeddings f(x̂) and g(x_i)
    - Similar to a nearest-neighbor computation
  - Train the network with a cross-entropy loss

Paper excerpt: eq. (1) relies on choosing a(., .), the attention mechanism, which fully specifies the classifier. The simplest form (which has very tight relationships with common attention models and kernel functions) is the softmax over the cosine distance c:

    a(x̂, x_i) = \frac{e^{c(f(x̂), g(x_i))}}{\sum_{j=1}^{k} e^{c(f(x̂), g(x_j))}}

with embedding functions f and g being appropriate neural networks (potentially with f = g), parameterized variously as deep convolutional networks (as in VGG [22] or Inception [24]) or as a simple word embedding for language tasks.

The fully conditional embedding f (from the paper's appendix) embeds the query with an attention LSTM over K "read" steps, f(x̂, S) = h_K:

    ĥ_k, c_k = \mathrm{LSTM}(f'(x̂), [h_{k-1}, r_{k-1}], c_{k-1})
    h_k = ĥ_k + f'(x̂)
    r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i)) \, g(x_i)
    a(h_{k-1}, g(x_i)) = \frac{e^{h_{k-1}^T g(x_i)}}{\sum_{j=1}^{|S|} e^{h_{k-1}^T g(x_j)}}

where LSTM(x, h, c) takes input x, output h (the cell after the output gate), and cell c, and a is commonly referred to as content-based attention.

Training strategy: Matching Networks map a support set to a classification function, S -> c(x̂), via a modification of the set-to-set paradigm with attention, giving a mapping P_θ(.|x̂, S), where θ are the parameters of the embedding functions f and g. The training procedure has to be chosen carefully to match inference at test time: the model must perform well on support sets S' containing classes never seen during training. Define a task T as a distribution over possible label sets L, e.g. uniformly weighting data sets of up to a few unique classes with a few examples per class (so a sampled L typically yields 5 to 25 examples). To form an "episode" and compute gradients, first sample L from T (e.g. L = {cats, dogs}), then use L to sample the support set S and a batch B (both labelled examples of the classes in L). The Matching Net is trained to minimise the error predicting the labels in B conditioned on S - a form of meta-learning, since the training procedure explicitly learns to learn from a given support set. The objective is

    \theta = \arg\max_{\theta} \; E_{L \sim T} \Big[ E_{S \sim L,\, B \sim L} \Big[ \sum_{(x,y) \in B} \log P_\theta(y \mid x, S) \Big] \Big]

Training θ with this objective yields a model which works well when sampling support sets S' from an unseen task T'.
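Putting eq. (1), the cosine-softmax kernel, and the episodic objective together, here is a minimal NumPy sketch of the prediction and the per-episode loss. The embedding networks f and g are stood in by plain functions, and all names are illustrative rather than from the authors' code.

```python
import numpy as np

def cosine(u, V):
    """Cosine similarity (the paper's c) between vector u and each row of V."""
    return (V @ u) / (np.linalg.norm(V, axis=1) * np.linalg.norm(u) + 1e-8)

def matching_net_predict(x_hat, S_x, S_y_onehot, f, g):
    """Eq. (1): P(ŷ | x̂, S) = sum_i a(x̂, x_i) y_i with a cosine-softmax kernel.

    S_x: (k, ...) support inputs; S_y_onehot: (k, n_classes) support labels.
    f, g: embedding functions for the query and the support examples (may be equal).
    """
    sims = cosine(f(x_hat), np.stack([g(x) for x in S_x]))
    a = np.exp(sims - sims.max()); a /= a.sum()    # attention weights a(x̂, x_i)
    return a @ S_y_onehot                          # predicted label distribution

def episode_loss(batch_x, batch_y_onehot, S_x, S_y_onehot, f, g):
    """Inner term of the training objective: negative log-likelihood of B given S."""
    nll = 0.0
    for x, y in zip(batch_x, batch_y_onehot):
        p = matching_net_predict(x, S_x, S_y_onehot, f, g)
        nll -= np.log(p @ y + 1e-8)   # cross-entropy against the true label
    return nll / len(batch_x)
```

Minimising this loss over episodes sampled as on slide 17 corresponds to maximising the objective above.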
20. Matching Networks [Vinyals+, NIPS2016]
- The Fully Conditional Embedding g
  - Embed each support example x_i in consideration of the whole support set S
[Figure: each support example x_i passes through g' and a bidirectional LSTM over the support set, with a skip connection summing the two; g' is a neural network (e.g. VGG or Inception)]

Paper excerpt (appendix A.2): the encoding function for the elements of the support set S is a bidirectional LSTM. More precisely, let g'(x_i) be a neural network (similar to f' above, e.g. a VGG or Inception model). Then g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i) with

    \overrightarrow{h}_i, \overrightarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})
    \overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})

where, as above, LSTM(x, h, c) takes input x, output h (the cell after the output gate), and cell c; the backward recursion starts from i = |S|. As with f, a skip connection is added between input and outputs. Appendix B of the paper further defines the ImageNet class splits (e.g. L_rand) excluded from training in the one-shot experiments.
21. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding g
– Embed each support example x_i taking the whole support set S into account
Step 1: embed each x_i into a vector using g'
(g': a neural network such as VGG or Inception)

[Figure: support examples x_i from the Support Set (S) are mapped to feature vectors g'(x_i)]
22. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding g
– Embed each support example x_i taking the whole support set S into account
Step 2: feed the embeddings g'(x_i) into a bidirectional LSTM
(g': a neural network such as VGG or Inception)

[Figure: the sequence of g'(x_i) vectors is processed by forward and backward LSTMs]
23. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding g
– Embed each support example x_i taking the whole support set S into account
Step 3: let g(x_i, S) be the sum of g'(x_i) and the two Bi-LSTM outputs:

$$g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)$$

[Figure: the forward output, the backward output, and g'(x_i) are added to give g(x_i, S)]
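Since slides 21–23 walk through this embedding one step at a time, here is a minimal PyTorch sketch of the whole thing. This is a hedged illustration, not the authors' code: g' is assumed to be a CNN whose features are precomputed, and the class and variable names are invented for this sketch; only the Bi-LSTM plus skip connection from the equations above is implemented.

```python
import torch
import torch.nn as nn

class FullyConditionalG(nn.Module):
    """g(x_i, S) = forward LSTM output + backward LSTM output + g'(x_i)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # Bidirectional LSTM over the sequence of support-set embeddings.
        self.bilstm = nn.LSTM(feat_dim, feat_dim, bidirectional=True, batch_first=True)

    def forward(self, g_prime: torch.Tensor) -> torch.Tensor:
        # g_prime: (|S|, feat_dim), precomputed CNN features g'(x_i).
        out, _ = self.bilstm(g_prime.unsqueeze(0))  # (1, |S|, 2 * feat_dim)
        fwd, bwd = out[0].chunk(2, dim=-1)          # split into the two directions
        return fwd + bwd + g_prime                  # skip connection to the input

# Toy usage: a 5-way 1-shot support set with 64-dim features.
g_prime = torch.randn(5, 64)   # stand-in for VGG/Inception features g'(x_i)
print(FullyConditionalG(64)(g_prime).shape)  # torch.Size([5, 64])
```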
24. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding f
– Embed the query x̂ taking the support set S into account
[Figure: the query x̂ is embedded by f'(x̂) and refined for K steps by an attention LSTM ("attLSTM") that reads a weighted sum of the support embeddings g(x_i)]

From the paper, f(x̂, S) = attLSTM(f'(x̂), g(S), K) = h_K is defined by the following recurrence over "processing" steps k:

$$\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1}) \tag{2}$$
$$h_k = \hat{h}_k + f'(\hat{x}) \tag{3}$$
$$r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i) \tag{4}$$
$$a(h_{k-1}, g(x_i)) = e^{h_{k-1}^{\top} g(x_i)} \Big/ \sum_{j=1}^{|S|} e^{h_{k-1}^{\top} g(x_j)} \tag{5}$$

where LSTM(x, h, c) takes input x, output h (the cell after the output gate), and cell c; a is commonly referred to as content-based attention, and the read-out r_{k-1} from g(S) is concatenated to h_{k-1}.
25. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding f
– Embed the query x̂ taking the support set S into account
Step 1: the initial state h_1 is calculated without using S (h_0, r_0, c_0 are initial states, e.g. zeros, so no information from the support set enters yet):

$$h_1 = \mathrm{LSTM}(f'(\hat{x}), [h_0, r_0], c_0) + f'(\hat{x})$$

[Figure: the query x̂ passes through f' and one LSTM step, with a skip connection, to give h_1]
26. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding f
– Embed the query x̂ taking the support set S into account
Step 2: calculate the relevance between each g(x_i) and h_1 with content-based attention:

$$a(h_1, g(x_i)) = \mathrm{softmax}(h_1^{\top} g(x_i))$$

[Figure: attention weights a(h_1, g(x_i)) between the query state h_1 and each support embedding g(x_i)]
27. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding f
– Embed the query x̂ taking the support set S into account
Step 3: the read-out r_1 is a sum of the g(x_i), weighted by their relevance to h_1:

$$r_1 = \sum_{i=1}^{|S|} a(h_1, g(x_i))\, g(x_i)$$

[Figure: the support embeddings g(x_i) are combined with the attention weights into the read-out r_1]
28. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding f
– Embed the query x̂ taking the support set S into account
Step 4: the next state is calculated using S, through the read-out r_1:

$$h_2 = \mathrm{LSTM}(f'(\hat{x}), [h_1, r_1], c_1) + f'(\hat{x})$$

[Figure: r_1 is concatenated with h_1 and fed into the next LSTM step, again with a skip connection from f'(x̂)]
29. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding f
– Embed the query x̂ taking the support set S into account
Repeating these steps for K "reads", let the final state be the query embedding:

$$f(\hat{x}, S) = h_K$$

[Figure: after K attention-and-update steps over the support set, h_K is the fully conditional embedding of the query x̂]
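The whole f walkthrough above fits in a few lines of PyTorch. A hedged sketch, not the reference implementation: the paper concatenates r_{k-1} into the recurrent state, while here the read-out is appended to the cell input, a detail on which reimplementations differ; names, dimensions, and the default K are illustrative.

```python
import torch
import torch.nn as nn

class FullyConditionalF(nn.Module):
    """attLSTM sketch (eqs. 2-5): K attention 'reads' over the support
    embeddings g(x_i) refine the query embedding f'(x_hat)."""

    def __init__(self, feat_dim: int, K: int = 5):
        super().__init__()
        self.K = K
        # The read-out r_{k-1} enters the recurrence; here it is appended
        # to the input (an assumption of this sketch).
        self.cell = nn.LSTMCell(2 * feat_dim, feat_dim)

    def forward(self, f_prime: torch.Tensor, g_S: torch.Tensor) -> torch.Tensor:
        # f_prime: (1, d) query features; g_S: (|S|, d) support embeddings.
        d = f_prime.shape[1]
        h = torch.zeros(1, d)          # h_0
        c = torch.zeros(1, d)          # c_0
        r = torch.zeros(1, d)          # r_0, initial read-out
        for _ in range(self.K):
            # Eqs. 2-3: LSTM step on [f'(x_hat), r_{k-1}] plus skip connection.
            h_hat, c = self.cell(torch.cat([f_prime, r], dim=1), (h, c))
            h = h_hat + f_prime
            # Eq. 5: content-based attention between h_k and each g(x_i).
            a = torch.softmax(g_S @ h.squeeze(0), dim=0)          # (|S|,)
            # Eq. 4: read-out as the attention-weighted sum of g(x_i).
            r = (a.unsqueeze(1) * g_S).sum(dim=0, keepdim=True)   # (1, d)
        return h  # f(x_hat, S) = h_K

# Toy usage: a 5-example support set, 64-dim features.
f_prime = torch.randn(1, 64)
g_S = torch.randn(5, 64)
print(FullyConditionalF(64, K=5)(f_prime, g_S).shape)  # torch.Size([1, 64])
```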
30. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Experimental Settings
■ Datasets
– Image classification sets
• Omniglot [Lake+, 2011]
• ImageNet [Deng+, 2009]
(ref. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/)
– Language modeling
• Penn Treebank [Marcus+, 1993]
32. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Experimental Results (Omniglot)
■ Fully Conditional Embedding (FCE) did not seem to help much
■ The baseline classifier and the Siamese Net improved with fine-tuning
(From the paper: the baseline classifier's last-layer features, before the softmax, were used for nearest-neighbour matching, a strategy commonly used in computer vision [3] which has achieved excellent results across many tasks; following [11], the convolutional siamese nets were trained on a same-or-different task on the original training set, and the last layer was then used for nearest-neighbour matching.)
Table 1: Results on the Omniglot dataset.

Model                            Matching Fn  Fine Tune  5-way 1-shot  5-way 5-shot  20-way 1-shot  20-way 5-shot
PIXELS                           Cosine       N          41.7%         63.2%         26.7%          42.6%
BASELINE CLASSIFIER              Cosine       N          80.0%         95.0%         69.5%          89.1%
BASELINE CLASSIFIER              Cosine       Y          82.3%         98.4%         70.6%          92.0%
BASELINE CLASSIFIER              Softmax      Y          86.0%         97.6%         72.9%          92.3%
MANN (NO CONV) [21]              Cosine       N          82.8%         94.9%         –              –
CONVOLUTIONAL SIAMESE NET [11]   Cosine       N          96.7%         98.4%         88.0%          96.5%
CONVOLUTIONAL SIAMESE NET [11]   Cosine       Y          97.3%         98.4%         88.1%          97.0%
MATCHING NETS (OURS)             Cosine       N          98.1%         98.9%         93.8%          98.5%
MATCHING NETS (OURS)             Cosine       Y          97.9%         98.7%         93.5%          98.7%
33. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Experimental Settings (ImageNet)
■ Baseline
– Matching on raw pixels
– Matching on discriminative features from InceptionV3 (baseline classifier)
■ Datasets
– miniImageNet (size: 84x84)
• training: 80 classes
• testing: 20 classes
– randImageNet
• training: randomly picked classes (882 classes)
• testing: remaining classes (118 classes)
– dogsImageNet
• training: all non-dog classes (882 classes)
• testing: dog classes (118 classes)
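These splits are turned into training episodes by repeatedly sampling a small support set and a batch of queries from the training classes. A minimal sketch of one such sampler, assuming an `images_by_class` dict from class label to image identifiers (an invented data layout for illustration; the 5-way 1-shot numbers are just the common setting):

```python
import random

def sample_episode(images_by_class, n_way=5, k_shot=1, query_per_class=1):
    """Sample one N-way k-shot episode: a support set S plus query examples."""
    classes = random.sample(sorted(images_by_class), n_way)
    support, queries = [], []
    for label in classes:
        picks = random.sample(images_by_class[label], k_shot + query_per_class)
        support += [(img, label) for img in picks[:k_shot]]
        queries += [(img, label) for img in picks[k_shot:]]
    return support, queries

# Toy usage: 6 classes with 3 "images" each.
data = {c: [f"{c}_{i}" for i in range(3)] for c in "abcdef"}
S, B = sample_episode(data, n_way=5, k_shot=1)
print(len(S), len(B))  # 5 support examples, 5 queries
```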
34. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Experimental Results (miniImageNet)
[Figure 2: Examples of two 5-way problem instances on ImageNet. The images in the set S' contain classes never seen during training; the model makes far fewer mistakes than the Inception baseline.]
Table 2: Results on miniImageNet.

Model                  Matching Fn   Fine Tune  5-way 1-shot  5-way 5-shot
PIXELS                 Cosine        N          23.0%         26.6%
BASELINE CLASSIFIER    Cosine        N          36.6%         46.0%
BASELINE CLASSIFIER    Cosine        Y          36.2%         52.2%
BASELINE CLASSIFIER    Softmax       Y          38.4%         51.2%
MATCHING NETS (OURS)   Cosine        N          41.2%         56.2%
MATCHING NETS (OURS)   Cosine        Y          42.4%         58.0%
MATCHING NETS (OURS)   Cosine (FCE)  N          44.2%         57.0%
MATCHING NETS (OURS)   Cosine (FCE)  Y          46.6%         60.0%
■ Matching Networks outperformed the baselines
■ Fully Conditional Embedding (FCE) proved effective at improving performance on this task
35. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Experimental Results (randImageNet, dogsImageNet)
Table 3: Results on full ImageNet on the rand and dogs one-shot tasks (ImageNet 5-way 1-shot accuracy). Note that ≠Lrand and ≠Ldogs are sets of classes which are seen during training, provided for completeness.

Model                  Matching Fn     Fine Tune  Lrand   ≠Lrand  Ldogs   ≠Ldogs
PIXELS                 Cosine          N          42.0%   42.8%   41.4%   43.0%
INCEPTION CLASSIFIER   Cosine          N          87.6%   92.6%   59.8%   90.0%
MATCHING NETS (OURS)   Cosine (FCE)    N          93.2%   97.0%   58.8%   96.4%
INCEPTION ORACLE       Softmax (Full)  Y (Full)   ≈99%    ≈99%    ≈99%    ≈99%
■ Matching Networks outperformed the Inception classifier on Lrand, but degraded on Ldogs
■ The performance drop on Ldogs might be caused by the different label distributions between training and testing
– the training support sets come from a random distribution of labels, whereas the testing ones contain similar (fine-grained) classes
From the paper: the results of the randImageNet and dogsImageNet experiments are shown in Table 3. The Inception Oracle (trained on all classes) performs almost perfectly when restricted to 5 classes only, which is not too surprising given its impressive top-1 accuracy. When trained solely on ≠Lrand, Matching Nets improve upon Inception by almost 6% when tested on Lrand, halving the errors. Matching Nets also manage to improve upon Inception on the complementary subset ≠Ldogs (although this setup is not one-shot, as the feature extraction has been trained on these labels). However, on the much more challenging Ldogs subset, the model degrades by 1%. The authors hypothesize that the sampled set during training, S, comes from a random distribution of labels, whereas the testing support set S' from Ldogs contains similar classes, more akin to fine grained classification; they believe that adapting the training strategy to sample S from fine grained sets of labels, instead of sampling uniformly from the leafs of the ImageNet class tree, could attain improvements, and leave this as future work.
36. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Experimental Settings (Penn Treebank)
[Figure: each support sentence x_i in the Support Set (S) is encoded word by word with an LSTM into g(x_i); the query sentence x̂ is encoded into f(x̂, S); attention a matches the query against the support set]

From the paper, the model in its simplest form computes a probability over the label ŷ as

$$P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i \tag{1}$$

where x_i, y_i are the inputs and corresponding label distributions from the support set S = {(x_i, y_i)}_{i=1}^{k}, and a is an attention mechanism: eq. 1 describes the output for a new class as a linear combination of the labels in the support set. The simplest form of a is the softmax over the cosine distance c:

$$a(\hat{x}, x_i) = e^{c(f(\hat{x}), g(x_i))} \Big/ \sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}$$

with the embedding functions f and g being appropriate neural networks (potentially with f = g), e.g., deep convolutional networks for image tasks (as in VGG [22] or Inception [24]) or a simple form of word embedding for language tasks.
From the paper (Section 4.1.3, One-Shot Language Modeling): given a query sentence with a missing word, and a support set of sentences which each have a missing word and a corresponding 1-hot label, choose the label from the support set that best matches the query sentence. A single example (the words on the right are not provided; the labels are given as 1-hot-of-5 vectors):

1. an experimental vaccine can alter the immune response of people infected with the aids virus a <blank_token> u.s. scientist said. → prominent
2. the show one of five new nbc <blank_token> is the second casualty of the three networks so far this fall. → series
3. however since eastern first filed for chapter N protection march N it has consistently promised to pay creditors N cents on the <blank_token>. → dollar
4. we had a lot of people who threw in the <blank_token> today said <unk> ellis a partner in benjamin jacobson & sons a specialist in trading ual stock on the big board. → towel
5. it's not easy to roll out something that <blank_token> and make it pay mr. jacob says. → comprehensive

Query: in late new york trading yesterday the <blank_token> was quoted at N marks down from N marks late friday and at N yen down from N yen late friday. → dollar

Sentences were taken from the Penn Treebank dataset [15]. On each trial the set and batch are populated with non-overlapping sentences, so words with very low frequency counts are not used (if there is only a single sentence for a given word, it would need to be in both the set and the batch). As with the image tasks, each trial is a 5-way choice between the classes available in the set, with a batch size of 20, set sizes k = 1, 2, 3, and the same number of sentences per class in the set. The words were split into a randomly sampled 9000 for training and 1000 for testing, with the standard test set used to report results; thus neither the words nor the sentences used at test time had been seen during training. The one-shot matching model was compared to an oracle LSTM language model (LSTM-LM) [30] trained on all the words; the LSTM has an unfair advantage, as it is not doing one-shot learning but sees all the data, so it should be taken as an upper bound. The LSTM was given the sentence with the blank filled by each of the 5 candidate words (including the correct answer), and the candidate with the maximum log-likelihood was taken as its choice.
■ Fill in a blank in a query sentence with a label from the support set
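The matching step of eq. (1) with the cosine-softmax kernel is easy to state in code. A hedged sketch, assuming precomputed embeddings f(x̂) and g(x_i) and 1-hot support labels; the function name and toy dimensions are invented for this illustration:

```python
import torch
import torch.nn.functional as F

def matching_probs(f_query: torch.Tensor, g_support: torch.Tensor,
                   y_support: torch.Tensor) -> torch.Tensor:
    """Eq. (1): P(y_hat | x_hat, S) = sum_i a(x_hat, x_i) * y_i,
    with a = softmax over cosine similarity c(f(x_hat), g(x_i))."""
    # Cosine similarity between the query embedding and each support embedding.
    c = F.cosine_similarity(f_query.unsqueeze(0), g_support, dim=1)  # (k,)
    a = torch.softmax(c, dim=0)            # attention weights over the support set
    return a @ y_support                   # linear combination of support labels

# Toy usage: 5-way 1-shot, 64-dim embeddings, 1-hot-of-5 support labels.
f_q = torch.randn(64)
g_S = torch.randn(5, 64)
y_S = torch.eye(5)
print(matching_probs(f_q, g_S, y_S).sum())  # tensor(1.) up to float error
```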
37. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Experimental Settings and Results (Penn Treebank)
■ Baseline
– Oracle LSTM-LM
• Trained on all the words (not one-shot)
• Consider this model as an upper bound
■ Datasets
– training: 9000 words
– testing: 1000 words
■ Results

Model            5-way 1-shot  5-way 2-shot  5-way 3-shot
Matching Nets    32.4%         36.1%         38.2%
Oracle LSTM-LM   (72.8%)       –             –
38. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Conclusion
■ They proposed Matching Networks: a nearest-neighbour-based approach trained fully end-to-end
■ Key points
– "One-shot learning is much easier if you train the network to do one-shot learning" [Vinyals+, 2016]
– Matching Networks have a non-parametric structure, and can thus rapidly incorporate new examples
■ Findings
– Matching Networks effectively improved performance on Omniglot, miniImageNet, and randImageNet, but degraded on dogsImageNet
– One-shot learning over fine-grained sets of labels is difficult, and could be an exciting challenge in this area
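"Trained fully end-to-end" means the episode sampling, the embedding networks, and the eq. (1) classifier are optimized jointly against the query labels. A speculative sketch of one such training step (FCE omitted for brevity; `encoder`, the optimizer, and all other names are assumptions of this illustration, not the authors' code):

```python
import torch
import torch.nn.functional as F

def training_step(encoder, optimizer, support_x, support_y_onehot, query_x, query_y):
    """One episodic step: classify queries by cosine-softmax attention over
    the support set (eq. 1) and backprop the NLL loss through the encoder."""
    optimizer.zero_grad()
    g_S = encoder(support_x)                                     # (k, d)
    f_Q = encoder(query_x)                                       # (q, d)
    # Cosine-softmax attention of every query against every support example.
    sims = F.normalize(f_Q, dim=1) @ F.normalize(g_S, dim=1).T  # (q, k)
    probs = torch.softmax(sims, dim=1) @ support_y_onehot        # (q, n_way)
    loss = F.nll_loss(torch.log(probs + 1e-8), query_y)
    loss.backward()
    optimizer.step()
    return loss.item()
```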
39. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
References
■ Matching Networks
– Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural Information Processing Systems. 2016.
■ One-shot Learning
– Koch, Gregory. Siamese neural networks for one-shot image recognition. Diss. University of Toronto, 2015.
– Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." Proceedings of The 33rd International Conference on Machine Learning. 2016.
– Bertinetto, Luca, et al. "Learning feed-forward one-shot learners." Advances in Neural Information Processing Systems. 2016.
■ Attention Mechanisms
– Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
– Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." Advances in Neural Information Processing Systems. 2015.
– Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. "Order matters: Sequence to sequence for sets." In ICLR 2016.
40. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
References
■ Datasets
– Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." (2009).
– Deng, Jia, et al. "ImageNet: A large-scale hierarchical image database." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
– Lake, Brenden M., et al. "One shot learning of simple visual concepts." Proceedings of the 33rd Annual Conference of the Cognitive Science Society. Vol. 172. 2011.
– Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. "Building a large annotated corpus of English: The Penn Treebank." Computational Linguistics 19.2 (1993): 313-330.