Deep Learning: a Next Step?
Kyunghyun Cho
New York University
Center for Data Science, and
Courant Institute of Mathematical Sciences
Naver Labs
Awesomeness everywhere!
(Diagram: Awesome ConvNet, Awesome LM, Awesome ASR, Awesome RoboArm Controller, Awesome Q&A, Awesome Auto-Driver, Awesome Program Interpreter, Awesome Meta-Learner, Awesome Atari Player)
What we want is…
(Diagram: one system built from Awesome ConvNet, Awesome LM, Awesome ASR, Awesome RoboArm Controller, Awesome Q&A, Awesome Auto-Driver, and Awesome Memory modules)
• One system with
many modules
• Modules interact with
each other to solve a task
• Knowledge sharing across tasks via
shared modules
• Some trainable, others fixed
Paradigm shift
• One neural network per task
• One neural network per function
• Multiple networks cooperate to
solve many higher-level tasks
• Mixture of trainable networks
and fixed modules
(Same module diagram as above.)
Examples
• Q&A system
1. Receives a question via
awesome LM+ASR
2. Retrieves relevant info from
awesome memory
3. Generates a response via
awesome LM
• Autonomous driving
1. Senses the environment with
awesome ConvNet+ASR
2. Plans a route with
awesome memory
3. Controls a car via awesome
robot arm controller
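As a rough illustration of the kind of composition meant here, a minimal sketch in Python; the module classes and their methods are hypothetical stand-ins, not an actual system from the talk:

```python
# Hypothetical stand-ins for the "awesome" modules; none of this refers to a real library.
class AwesomeASR:
    def transcribe(self, audio):            # speech -> text
        return "what is the capital of finland"

class AwesomeMemory:
    def retrieve(self, query):              # text -> relevant facts
        return ["Helsinki is the capital of Finland."]

class AwesomeLM:
    def generate(self, question, facts):    # (question, facts) -> answer text
        return "It is Helsinki, because: " + facts[0]

def qa_pipeline(audio):
    question = AwesomeASR().transcribe(audio)     # 1. receive the question
    facts = AwesomeMemory().retrieve(question)    # 2. retrieve relevant info
    return AwesomeLM().generate(question, facts)  # 3. generate a response

print(qa_pipeline(audio=None))
```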
But, simple composition of neural networks may not work! Why Not?
(Same module diagram as above.)
Learning to use an NN module
(Same module diagram as above.)
• Why not?
• Target tasks are often unknown at
training time
• Input/output cannot be defined
well a priori
• The amount of learning signal
differs vastly across tasks
• Rich information captured by the NN
module must be passed along
• The internals of the NN module must allow external manipulation
Good news: NN’s are transparent!
(Figure: hidden activations of a recurrent language model)
• NN’s are not black boxes.
• We can observe every single bit
inside a neural net.
Bad news: NN’s are not easy to understand!
• Humans are not good with high-dimensional
vectors
• Distributed representation
• exponential combinations of hidden units
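A minimal PyTorch sketch of this point, using a toy recurrent model rather than the one on the slide: every activation is readable, but no single unit is meaningful on its own.

```python
import torch
import torch.nn as nn

# Toy recurrent "language model": every hidden activation is directly observable.
vocab_size, hidden_size = 100, 32
embed = nn.Embedding(vocab_size, hidden_size)
rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 10))   # a dummy input sequence of 10 symbols
outputs, _ = rnn(embed(tokens))                  # outputs: (1, 10, 32), one state per step

# Transparent: the full 32-dimensional state after the 4th token, every bit of it.
print(outputs[0, 3])
# Hard to understand: each unit takes part in exponentially many distributed
# patterns, so the trajectory of a single unit rarely means anything by itself.
print(outputs[0, :, 7])
```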
Learning to use an NN module
(Same module diagram as above.)
• Neural nets are good at interpreting
high-dimensional input
• Neural nets are also good at
predicting high-dimensional output
• Internal representation learned by a
neural network is well structured
• Neural nets can be trained with an
arbitrary objective
(My Rejected NSF Proposal, 2016)
Learning to use an NN module
(Same module diagram as above.)
1. Query-Efficient Imitation Learning
2. Trainable Decoding
• Real-time Neural Machine Translation
• Trainable Greedy Decoding
3. Neural Query Reformulation
4. Non-Parametric Neural Machine Translation
Query-Efficient Imitation Learning
Jiakai Zhang & K Cho. Query-Efficient Imitation Learning for End-to-End
Autonomous Driving. AAAI 2017.
Imitation Learning
• A learner directly interacts with the world
• A supervisor augments the reward signal from the world
• Advantages over supervised learning and reinforcement learning
• Match between training and test
• Strong learning signal
• Disadvantages
• Where do we get the supervisor???
(Ross et al., 2011; Daume III et al., 2007; and more…)
• Supervisors are expensive
• As the learner gets better, less
intervention from the supervisor
• Learner learns from difficult examples
• Questions:
1. Where do we get the safety net?
2. What is the impact on the
learner’s performance?
SafeDAgger: Query-Efficient Imitation Learning
(Zhang&Cho, AAAI 2017; Laskey et al., ICRA 2016)
SafeDAgger: Query-Efficient Imitation Learning
1. Learner observes the world
2. SafetyNet observes the learner
3. SafetyNet predicts whether the
learner will fail
4. If no, the learner continues
5. If yes,
1. the supervisor intervenes
2. the learner imitates the supervisor’s behaviour
Reminds us of the value function from RL!
SafeDAgger: Learning
1. Initial labelled data sets: [...]
2. Train the policy using [...]
3. Train the safety net using [...]
  1. Target for the safety net given [...]
4. Collect additional data
  1. Let the learner drive, but the expert intervenes when [...]
  2. Collect data: [...]
5. Data aggregation: [...]
6. Go to 2
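Since the slide’s equations did not survive extraction, here is a hedged pseudocode sketch of the loop described above; every callable (train_policy, train_safety_net, expert, env) and the deviation threshold tau are placeholders, not the paper’s actual interface.

```python
def safedagger(initial_data, expert, env, train_policy, train_safety_net,
               tau=0.1, n_iterations=5, horizon=1000):
    """Sketch of the SafeDAgger loop above; all components are supplied by the caller."""
    data = list(initial_data)                       # 1. initial labelled (obs, expert_action) pairs
    for _ in range(n_iterations):
        policy = train_policy(data)                 # 2. supervised training of the learner
        # 3. safety-net target: does the learner deviate from the expert by more than tau?
        labels = [abs(policy(obs) - act) > tau for obs, act in data]
        safety_net = train_safety_net([obs for obs, _ in data], labels)

        new_data = []                               # 4. the learner drives; the expert intervenes
        obs = env.reset()
        for _ in range(horizon):
            if safety_net(obs):                     # predicted failure: supervisor takes over
                action = expert(obs)
                new_data.append((obs, action))      # the learner will imitate this later
            else:
                action = policy(obs)
            obs = env.step(action)

        data += new_data                            # 5. data aggregation
    return policy                                   # 6. (loop) go back to step 2
```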
After 1st iteration
SafeDAgger in Action
Trainable Decoding of
Neural Machine Translation
Jiatao Gu, Graham Neubig, K Cho and Victor Li. Learning to Translate in Real-time
with Neural Machine Translation. EACL 2017.
Jiatao Gu, K Cho and Victor Li. Trainable Greedy Decoding for Neural Machine
Translation. EMNLP 2017.
Trainable Decoding
Motivation
• Many decoding objectives are unknown at training time
• Lack of target training examples
• Arbitrary (non-differentiable) decoding objectives
• Sample-inefficiency of RL algorithms
Our Approach
• Train NMT with supervised learning
• Train a decoding module on top
(1) Real-Time Translation
Decoding
1. Start with a pretrained NMT
2. A simultaneous decoder intercepts and
interprets the incoming signal
3. The simultaneous decoder forces the
pretrained model to either
1. output a target symbol, or
2. wait for the next source symbol
Learning
1. Trade-off between delay and quality
2. Stochastic policy gradient (REINFORCE)
(Gu, Neubig, Cho & Li, EACL 2017)
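A sketch of the wait/emit control loop this describes; the pretrained NMT model and the stochastic policy are hypothetical objects, and the REINFORCE update that trades delay against quality is omitted.

```python
WAIT, EMIT = 0, 1

def simultaneous_decode(nmt, policy, source_stream):
    """Interleave reading source symbols with emitting target symbols."""
    read, written = [], []
    source = iter(source_stream)
    while True:
        state = nmt.state(read, written)      # hidden state of the pretrained model so far
        action = policy.sample(state)         # stochastic policy, trained with REINFORCE
        if action == WAIT:
            try:
                read.append(next(source))     # wait for the next source symbol
            except StopIteration:
                action = EMIT                 # nothing left to read, so emit
        if action == EMIT:
            symbol = nmt.next_symbol(read, written)   # greedy output from the pretrained model
            if symbol == "</s>":
                return written
            written.append(symbol)
```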
(1) Real-Time Translation
(2) Trainable Greedy Decoding
Decoding
1. Start with a pretrained NMT
2. A trainable decoder intercepts and
interprets the incoming signal
3. The trainable decoder sends an
altering signal back to the
pretrained model
Learning
1. Deterministic policy gradient
2. Maximize any arbitrary objective
(Gu, Cho & Li, 2017)
(2) Trainable Greedy Decoding
Models
1. Actor
• Input: previous hidden state, previous symbol, and
the context from the attention model
• Output: an additive bias for the hidden state
2. Critic
• Input: a sequence of the hidden states from the decoder
• Output: a predicted return
• In our case, the critic estimates the full return rather than
Q at each time step
(Gu, Cho & Li, 2017)
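A PyTorch sketch of the two modules’ shapes as described above; the dimensions and layer sizes are made up, and this is not the authors’ code.

```python
import torch
import torch.nn as nn

hidden_size, embed_size, ctx_size = 512, 256, 512   # made-up dimensions

class Actor(nn.Module):
    """Maps (previous hidden state, previous symbol embedding, attention context)
    to an additive bias on the pretrained decoder's hidden state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size + embed_size + ctx_size, 256),
            nn.Tanh(),
            nn.Linear(256, hidden_size),
        )

    def forward(self, h_prev, y_prev_emb, context):
        bias = self.net(torch.cat([h_prev, y_prev_emb, context], dim=-1))
        return h_prev + bias   # altered state is fed back into the pretrained NMT model

class Critic(nn.Module):
    """Reads the whole sequence of decoder hidden states and predicts the full
    return (e.g. a sentence-level score), not a per-step Q value."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(hidden_size, 256, batch_first=True)
        self.head = nn.Linear(256, 1)

    def forward(self, hidden_states):            # (batch, time, hidden_size)
        _, last = self.rnn(hidden_states)
        return self.head(last[-1]).squeeze(-1)   # one predicted return per sentence

# At inference the critic is discarded: only the actor perturbs the decoder.
```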
(2) Trainable Greedy Decoding
(Gu, Cho & Li, EMNLP 2017)
Learning
1) Generate a translation given a source sentence, with noise [...]
2) Train the critic to minimize [...]
3) Generate multiple translations with noise
4) Critic-aware actor learning (newly proposed): [...]
Inference: simply throw away the critic and use the actor
(2) Trainable Greedy Decoding
• The trainable decoder does improve the target decoding objective
• Training is quite unstable without the critic-aware actor learning algorithm
• More work is definitely needed for further improvement
Toward End-to-End Q&A
Rodrigo Nogueira & K Cho. Task-Oriented Query Reformulation with Reinforcement
Learning. EMNLP 2017.
Dunn et al. SearchQA: A New Q&A Dataset Augmented with Context from a Search
Engine. arXiv 2017.
End-to-End Question-Answering
(Pipeline diagram: Neural Query Reformulator and Machine Comprehension modules; some components are trainable, others fixed black boxes)
Neural Query Reformulator
1. Reads an original query q0
2. Augments/reformulates q0
Learning
1. Hard RL problem: partial observability
due to the black box search engine
2. Policy gradient to maximize recall@K
(Nogueira & Cho, 2017)
Code and data available at https://github.com/nyu-dl/QueryReformulator
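A sketch of the training signal, with the search engine treated as a black box; reformulator.sample/update and search_engine.search are hypothetical interfaces, not the released code.

```python
def recall_at_k(retrieved, relevant, k=40):
    """Fraction of the relevant documents found among the top-k retrieved ones."""
    return len(set(retrieved[:k]) & set(relevant)) / max(len(relevant), 1)

def reinforce_step(reformulator, search_engine, query0, relevant_docs, baseline=0.0):
    new_query, log_prob = reformulator.sample(query0)   # 1. sample a reformulation of q0
    retrieved = search_engine.search(new_query)          # 2. black box: no gradient through it
    reward = recall_at_k(retrieved, relevant_docs)       # 3. reward = recall@K
    loss = -(reward - baseline) * log_prob               # 4. policy-gradient surrogate loss
    reformulator.update(loss)                            #    push up reformulations that help retrieval
    return reward
```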
SearchQA: new dataset
for machine comprehension
(Dunn et al., 2017)
Data available at https://github.com/nyu-dl/SearchQA
(Pipeline: (Q, A) → search / crawl / retrieve → (Q, A, { S1, S2, . . . , SN }))
SearchQA
1. Realistic, noisy context from Google
2. Multiple snippets per question
3. Large-scale data (140k q-a-c tuples)
And, Google did it!
• A pretrained, black-box Q&A
model
• Query reformulation with RL
• Tested on SearchQA
(Buck et al., 2017)
https://arxiv.org/abs/…
Few more relevant research directions
• Communicating neural networks
• Neural nets talk to each other to solve a problem
• Sukhbaatar & Fergus (2015), Foerster et al. (2016), Evtimova et al. (2017), Lewis et al. (2017),
…
• Multimodal processing
• Image captioning, zero-shot retrieval, …
• Cho et al. (2015, review paper)
• Planning, program synthesis
• How do the modules compose with each other to solve a task?
• Neural programmer interpreter [Reed et al., 2016; Cai et al., 2017]
• Forward modelling [Henaff et al., 2017; Sutton, 1991 Dyna; optimal control…]
• Mixture of experts [Google], progressive networks [Google DeepMind]
Paradigm Shift: modular, life-long learning
(Diagram: multiple Neural Networks interacting with the Environment, Users/Experts, a Search Engine, and a Database)
Neural Machine Translation
Multilingual, Character-Level, Non-parametric
Machine Translation
• [Allen 1987 IEEE 1st ICNN]
• 3310 En-Es pairs constructed on 31
En, 40 Es words, max 10/11 word
sentence; 33 used as test set
• Binary encoding of words – 50
inputs, 66 outputs; 1 or 3 hidden
150-unit layers. Ave WER: 1.3
words
• [Chrisman 1992 Connection Science]
• Dual-ported RAAM architecture
[Pollack 1990 Artificial Intelligence]
applied to corpus of 216 parallel pairs
of simple En-Es sentences:
• Split 50/50 as train/test, 75% of
sentences correctly translated!
Brief resurrection in 1997: Spain
Modern neural machine translation
(Diagrams: from neural components within statistical MT to fully neural MT)
• Source Sentence → [Neural Net + SMT] → Target Sentence (Schwenk et al., 2006)
• Source Sentence → [SMT + Neural Net] → Target Sentence (Devlin et al., 2014)
• Neural MT: Source Sentence → [Neural Network] → Target Sentence
WMT 2017: news translation task
A better single-pair translation system has
never been “the” goal of neural MT
Continuous representation:
Interlingua 2.0?
What if we can project sentences in multiple languages into a single vector space?
What does NMT do?
Encoder
• Project a source sentence into a
set of continuous vectors
Decoder+Attention
• Decode a target sentence from a
set of “source” continuous
vectors
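A toy PyTorch sketch of that split: the encoder turns the source sentence into a set of continuous vectors, and one decoding step attends over them. The dimensions and the dot-product scoring are illustrative, not a particular published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 64                               # toy sizes
emb = nn.Embedding(vocab, dim)
encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
decoder_cell = nn.GRUCell(dim + 2 * dim, dim)

src = torch.randint(0, vocab, (1, 12))              # a source sentence of 12 symbols
annotations, _ = encoder(emb(src))                  # "set of continuous vectors": (1, 12, 128)

h = torch.zeros(1, dim)                             # decoder state
y_prev = emb(torch.zeros(1, dtype=torch.long))      # embedding of the previous target symbol

# One decoding step: attend over the source vectors, then update the decoder state.
scores = torch.einsum('bd,btd->bt', h.repeat(1, 2), annotations)  # toy dot-product scoring
alpha = F.softmax(scores, dim=-1)                   # attention weights over source positions
context = torch.einsum('bt,btd->bd', alpha, annotations)
h = decoder_cell(torch.cat([y_prev, context], dim=-1), h)
print(h.shape)                                      # next decoder state: (1, 64)
```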
What is this “continuous vector space”?
• Similar sentences are near each other
in this vector space
• Multiple dimensions of similarity are
encoded simultaneously
(Sutskever et al., 2014)
What is this “continuous vector space”?
• Similar sentences are near each other
in this vector space
• Multiple dimensions of similarity are
encoded simultaneously
• (Trainable) near-bijective mapping
between the continuous vector space
and the sentence space
• Stripped of hard linguistic symbols
What is this “continuous vector space”?
(Firat et al., 2016; Luong et al., 2015; Dong et al., 2015)
• Can this continuous vector space be shared across multiple languages?
Multi-way, multilingual machine translation (1)
Language-agnostic
Continuous Vector
Space
• One encoder per source language
• One decoder per target language
• Attention/alignment shared across
all the language pairs
• Only bilingual parallel
corpora necessary
• No multi-way parallel corpus needed
(Firat et al., 2016)
Multi-way, multilingual machine translation (2)
• Neural nets are like lego
• Build one encoder per source
• Build one decoder per target
• Build one attention mechanism
• Given a sentence pair, pick the corresponding
encoder and decoder, and train them together
with the shared attention mechanism
(Firat et al., 2016)
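A sketch of the “lego” wiring just described; build_encoder, build_decoder and attend are placeholders for whatever networks one actually plugs in.

```python
def build_multiway_nmt(languages, build_encoder, build_decoder, attend):
    """One encoder per source language, one decoder per target language,
    and a single attention mechanism shared by every language pair."""
    encoders = {lang: build_encoder(lang) for lang in languages}
    decoders = {lang: build_decoder(lang) for lang in languages}

    def translate(sentence, src_lang, tgt_lang):
        annotations = encoders[src_lang](sentence)        # language-specific continuous vectors
        return decoders[tgt_lang](annotations, attend)    # the shared attention bridges them

    return translate

# Training needs only bilingual corpora: each (src, tgt) pair updates its own encoder
# and decoder plus the shared attention; no multi-way parallel corpus is required.
```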
Multi-way, multilingual machine translation (3)
Language-
agnostic
Continuous
Vector Space
• Sentence-level positive language transfer
• Helps low-resource language pairs
• Why?
1. Better structural constraint on the
continuous vector space
2. Regularization
• Real-valued vector-based interlingua?
(Firat et al., 2016)
Beyond languages: multimodal translation
• Does the source have to be “sentence”?
(Figure: attention-based image captioning from Xu et al., 2015. A Convolutional Neural Network produces annotation vectors h_j; an attention mechanism assigns weights a_j with Σ_j a_j = 1; a recurrent decoder state z_i samples the caption word by word, e.g. “a man is jumping into a lake .”)
Beyond languages: multimodal translation
(Caglayan et al., 2016; Elliott & Kadar, 2017)
What is a sentence?
Is a sentence a sequence of phrases, words, morphemes or characters?
What is a sentence to a neural net?
• Each word/symbol: one-hot vector
• Prior-less encoding
• Permutation invariant
• Sentence
• To us: a sequence of words
• To NN: a sequence of one-hot vectors
• What does it mean?
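A tiny numerical illustration of why one-hot encoding is prior-less and permutation invariant: every pair of distinct symbols is exactly as far apart as every other pair, so nothing in the encoding itself reflects similarity.

```python
import numpy as np

vocab = ["cat", "bed", "lies"]        # any three symbols
one_hot = np.eye(len(vocab))          # one-hot vector per symbol

# All pairwise distances are identical (sqrt(2)), regardless of meaning,
# and relabelling/permuting the vocabulary changes nothing.
print(np.linalg.norm(one_hot[0] - one_hot[1]))
print(np.linalg.norm(one_hot[0] - one_hot[2]))
print(np.linalg.norm(one_hot[1] - one_hot[2]))
```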
Why not words?
• Inefficient handling of various morphological variants
• Sub-optimal segmentation/tokenization
• “Etxaberria”, “Etxazarra”, “Etxaguren”, “Etxarren”: four independent vectors
• Lack of generalization to novel/rare morphological variants
• For instance, a single word in Arabic => “and to his vehicle”
• One vector for compound words?
• “kolmi/vaihe/kilo/watti/tunti/mittari” => one vector?
• “kolme” => one vector?
• Spelling issues
• See Workshop on Processing Historical Language or Universal Dependencies
• Good segmentation/tokenization needed for each language
• So, no, words don’t look like the units we want to work with…
Then, what should we do…?
• Original: 고양이가 침대 위에 누워있습니다
• Word-level modelling:
(고양이가, 침대, 위에, 누워있습니다)
• Subword-level modelling (Sennrich et al., 2015; Wu et al., 2016)
(고양이, 가, 침대, 위, 에, 누워, 있습니, 다)
• Character-level modelling with segmentation
(Wang et al., 2015; Luong & Manning, 2016; Costa-Jussa & Fonollosa, 2016)
((ㄱ,ㅗ,ㅇ,ㅑ,ㅇ,ㅣ,ㄱ,ㅏ), (ㅊ,ㅣ,ㅁ,ㄷ,ㅐ), (ㅇ,ㅟ,ㅇ,ㅔ),
(ㄴ,ㅜ,ㅇ,ㅝ,ㅇ,ㅣ,ㅆ,ㅅ,ㅡ,ㅂ,ㄴ,ㅣ,ㄷ,ㅏ))
• Fully character-level modelling (Chung et al., 2016; Lee et al., 2017)
(ㄱ,ㅗ,ㅇ,ㅑ,ㅇ,ㅣ,ㄱ,ㅏ,_,ㅊ,ㅣ,ㅁ,ㄷ,ㅐ,_,ㅇ,ㅟ,ㅇ,ㅔ,_,ㄴ,ㅜ,ㅇ,ㅝ,ㅇ,ㅣ,ㅆ,ㅅ,ㅡ,ㅂ,ㄴ,ㅣ,ㄷ,ㅏ)
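For the word- and character-level cases above, the decomposition can be reproduced with standard Unicode normalization (NFD splits each Hangul syllable into conjoining jamo, which are close to, though not codepoint-identical with, the compatibility jamo printed on the slide); subword units, by contrast, come from a learned segmentation model such as BPE and are not shown here.

```python
import unicodedata

sentence = "고양이가 침대 위에 누워있습니다"

words = sentence.split()                                                     # word-level units
char_with_segmentation = [list(unicodedata.normalize("NFD", w)) for w in words]
fully_char = list(unicodedata.normalize("NFD", sentence.replace(" ", "_")))  # "_" marks word boundaries

print(words)
print(char_with_segmentation[0])   # jamo of the first word
print(fully_char[:12])             # beginning of the single flat character stream
```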
Character-level translation
• Source: subword-level representation
• Target: character-level representation
• The decoder implicitly learned word-like units automatically!
(Chung et al., 2017)
Fully Character-level translation
• Source: character-level representation
• Target: character-level representation
• Efficient modelling with
a convolutional-recurrent encoder
• Works as well as, or better than,
subword-level translation
(Lee et al., 2017)
(Lee et al., 2017)
• More robust to errors
• Better handles rare tokens
• Rare tokens are not necessarily rare!
Character-level Multilingual Translation
• When symbols are shared across multiple languages, why not share a
single encoder/decoder for them?
1. Language transfer at all levels: letters, words, phrases, sentences, …
2. Intra-sentence code-switching without any specific data
(Lee et al., 2017; Johnson et al., 2016; Ha et al., 2016)
Non-parametric
neural machine translation
Bridging question-answering, information retrieval and machine translation
Parametric ML: Learning as Compression
• What does learning do?
• Parametric machine learning: data compression + pattern matching
(Diagram: Training Data → learning → Neural Network; only the Neural Network is kept for inference)
Non-Parametric NMT (1)
• Bring the whole training corpus together with a model
• Retrieve a small subset of examples using a fast search engine
• Let NMT figure out how to fuse
1. the current sentence, and
2. the retrieved translation pairs
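A sketch of the retrieve-and-fuse idea, with a toy token-overlap retriever standing in for a real fuzzy-match engine such as Lucene, and fuse_and_translate standing in for the NMT model with its memory of retrieved pairs.

```python
def overlap(a, b):
    """Toy similarity between two source sentences: Jaccard overlap of their tokens."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / max(len(a | b), 1)

def retrieve(source, corpus, k=4):
    """corpus: list of (src, tgt) training pairs; return the k most similar pairs."""
    return sorted(corpus, key=lambda pair: overlap(source, pair[0]), reverse=True)[:k]

def nonparametric_translate(source, corpus, fuse_and_translate, k=4):
    neighbours = retrieve(source, corpus, k)        # 1. a small subset of the training corpus
    return fuse_and_translate(source, neighbours)   # 2. NMT fuses the input with the retrieved pairs
```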
Non-Parametric NMT (2)
• Apache Lucene: search engine
• A key-value memory network
[Gulcehre et al., 2017; Miller et al., 2016]
for storing retrieved pairs
• Similar to larger-context NMT
• [Wang et al., 2017;
Jean et al., 2017]
• Similar to NMT with external
knowledge
• [Ahn et al., 2016;
Bahdanau et al., 2017]
Non-Parametric NMT (3)
• When retrieved pairs are similar, huge
improvement!
• Otherwise, revert to normal NMT
• More consistency in style and vocabulary choice
Other advances in neural machine translation
• Discourse-level machine translation
• [Jean et al., 2017; DCU, 2017]
• Better decoding strategies
• Learning-to-search [Wiseman & Rush, 2016]
• Reinforcement learning [MRT, 2016; Ranzato et al., 2015; Bahdanau et al., 2015]
• Trainable decoding [Gu et al., 2017]
• Alternative decoding cost [Li et al., 2016; Li et al., 2017]
• Linguistics-guided neural machine translation
• Learning to parse and translate [Eriguchi et al., 2017; Rohee & Goldberg, 2017; Luong
et al., 2016]
• Syntax-aware neural machine translation [Nadejde et al., 2017]
Paradigm Shift: modular, life-long learning
(Diagram: Neural Networks interacting with a Search Engine and a Database, as before)
Acknowledgements
• Tencent, eBay, Google, NVIDIA, Facebook and NYU for generously supporting my research and lab!
• Some of the works were sponsored through industrial projects with Samsung and NVIDIA!