Deep Learning: a Next Step?
Kyunghyun Cho
New York University
Center for Data Science, and
Courant Institute of Mathematical Sciences
Naver Labs
Awesomeness everywhere!
(Diagram: Awesome ConvNet, Awesome LM, Awesome ASR, Awesome RoboArm Controller, Awesome Q&A, Awesome Auto-Driver, Awesome Program Interpreter, Awesome Meta-Learner, Awesome Atari Player)
What we want is…
(Diagram: one system built from Awesome ConvNet, Awesome LM, Awesome ASR, Awesome RoboArm Controller, Awesome Q&A, Awesome Auto-Driver, and Awesome Memory modules)
• One system with
many modules
• Modules interact with
each other to solve a task
• Knowledge sharing across tasks via
shared modules
• Some trainable, others fixed
Paradigm shift
• One neural network per task
• One neural network per function
• Multiple networks cooperate to
solve many higher-level tasks
• Mixture of trainable networks
and fixed modules
(Same module diagram as above.)
Examples
• Q&A system
1. Receives a question via
awesome LM+ASR
2. Retrieves relevant info from
awesome memory
3. Generates a response via
awesome LM
• Autonomous driving
1. Senses the environment with
awesome ConvNet+ASR
2. Plans a route with
awesome memory
3. Controls a car via awesome
robot arm controller
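As a rough illustration of the kind of composition meant here, a minimal sketch in Python; the module classes and their methods are hypothetical stand-ins, not an actual system from the talk:

```python
# Hypothetical stand-ins for the "awesome" modules; none of this refers to a real library.
class AwesomeASR:
    def transcribe(self, audio):            # speech -> text
        return "what is the capital of finland"

class AwesomeMemory:
    def retrieve(self, query):              # text -> relevant facts
        return ["Helsinki is the capital of Finland."]

class AwesomeLM:
    def generate(self, question, facts):    # (question, facts) -> answer text
        return "It is Helsinki, because: " + facts[0]

def qa_pipeline(audio):
    question = AwesomeASR().transcribe(audio)     # 1. receive the question
    facts = AwesomeMemory().retrieve(question)    # 2. retrieve relevant info
    return AwesomeLM().generate(question, facts)  # 3. generate a response

print(qa_pipeline(audio=None))
```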
But, simple composition of neural networks may not work! Why Not?
(Same module diagram as above.)
Learning to use an NN module
(Same module diagram as above.)
• Why not?
• Target tasks are often unknown at
training time
• Input/output cannot be defined
well a priori
• The amount of learning signal
differs vastly across tasks
• Rich information captured by the NN
module must be passed along
• The internals of the NN module must allow external manipulation
Good news: NN’s are transparent!
(Figure: hidden activations of a recurrent language model)
• NN’s are not black boxes.
• We can observe every single bit
inside a neural net.
Bad news: NN’s are not easy to understand!
• Humans are not good with high-dimensional
vectors
• Distributed representation
• exponential combinations of hidden units
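A minimal PyTorch sketch of this point, using a toy recurrent model rather than the one on the slide: every activation is readable, but no single unit is meaningful on its own.

```python
import torch
import torch.nn as nn

# Toy recurrent "language model": every hidden activation is directly observable.
vocab_size, hidden_size = 100, 32
embed = nn.Embedding(vocab_size, hidden_size)
rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 10))   # a dummy input sequence of 10 symbols
outputs, _ = rnn(embed(tokens))                  # outputs: (1, 10, 32), one state per step

# Transparent: the full 32-dimensional state after the 4th token, every bit of it.
print(outputs[0, 3])
# Hard to understand: each unit takes part in exponentially many distributed
# patterns, so the trajectory of a single unit rarely means anything by itself.
print(outputs[0, :, 7])
```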
Learning to use an NN module
(Same module diagram as above.)
• Neural nets are good at interpreting
high-dimensional input
• Neural nets are also good at
predicting high-dimensional output
• Internal representation learned by a
neural network is well structured
• Neural nets can be trained with an
arbitrary objective
(My Rejected NSF Proposal, 2016)
Learning to use an NN module
(Same module diagram as above.)
1. Query-Efficient Imitation Learning
2. Trainable Decoding
• Real-time Neural Machine Translation
• Trainable Greedy Decoding
3. Neural Query Reformulation
4. Non-Parametric Neural Machine Translation
Query-Efficient Imitation Learning
Jiakai Zhang & K Cho. Query-Efficient Imitation Learning for End-to-End
Autonomous Driving. AAAI 2017.
Imitation Learning
• A learner directly interacts with the world
• A supervisor augments the reward signal from the world
• Advantages over supervised learning and reinforcement learning
• Match between training and test
• Strong learning signal
• Disadvantages
• Where do we get the supervisor???
(Ross et al., 2011; Daume III et al., 2007; and more…)
• Supervisors are expensive
• As the learner gets better, less
intervention from the supervisor
• Learner learns from difficult examples
• Questions:
1. Where do we get the safety net?
2. What is the impact on the
learner’s performance?
SafeDAgger: Query-Efficient Imitation Learning
(Zhang&Cho, AAAI 2017; Laskey et al., ICRA 2016)
SafeDAgger: Query-Efficient Imitation Learning
1. Learner observes the world
2. SafetyNet observes the learner
3. SafetyNet predicts whether the
learner will fail
4. If no, the learner continues
5. If yes,
1. the supervisor intervenes
2. the learner imitates the supervisor’s behaviour
Reminds us of the value function from RL!
SafeDAgger: Learning
1. Initial labelled data sets: [...]
2. Train the policy using [...]
3. Train the safety net using [...]
  1. Target for the safety net given [...]
4. Collect additional data
  1. Let the learner drive, but the expert intervenes when [...]
  2. Collect data: [...]
5. Data aggregation: [...]
6. Go to 2
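Since the slide’s equations did not survive extraction, here is a hedged pseudocode sketch of the loop described above; every callable (train_policy, train_safety_net, expert, env) and the deviation threshold tau are placeholders, not the paper’s actual interface.

```python
def safedagger(initial_data, expert, env, train_policy, train_safety_net,
               tau=0.1, n_iterations=5, horizon=1000):
    """Sketch of the SafeDAgger loop above; all components are supplied by the caller."""
    data = list(initial_data)                       # 1. initial labelled (obs, expert_action) pairs
    for _ in range(n_iterations):
        policy = train_policy(data)                 # 2. supervised training of the learner
        # 3. safety-net target: does the learner deviate from the expert by more than tau?
        labels = [abs(policy(obs) - act) > tau for obs, act in data]
        safety_net = train_safety_net([obs for obs, _ in data], labels)

        new_data = []                               # 4. the learner drives; the expert intervenes
        obs = env.reset()
        for _ in range(horizon):
            if safety_net(obs):                     # predicted failure: supervisor takes over
                action = expert(obs)
                new_data.append((obs, action))      # the learner will imitate this later
            else:
                action = policy(obs)
            obs = env.step(action)

        data += new_data                            # 5. data aggregation
    return policy                                   # 6. (loop) go back to step 2
```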
After 1st iteration
SafeDAgger in Action
Trainable Decoding of
Neural Machine Translation
Jiatao Gu, Graham Neubig, K Cho and Victor Li. Learning to Translate in Real-time
with Neural Machine Translation. EACL 2017.
Jiatao Gu, K Cho and Victor Li. Trainable Greedy Decoding for Neural Machine
Translation. EMNLP 2017.
Trainable Decoding
Motivation
• Many decoding objectives are unknown at training time
• Lack of target training examples
• Arbitrary (non-differentiable) decoding objectives
• Sample-inefficiency of RL algorithms
Our Approach
• Train NMT with supervised learning
• Train a decoding module on top
(1) Real-Time Translation
Decoding
1. Start with a pretrained NMT
2. A simultaneous decoder intercepts and
interprets the incoming signal
3. The simultaneous decoder forces the
pretrained model to either
1. output a target symbol, or
2. wait for the next source symbol
Learning
1. Trade-off between delay and quality
2. Stochastic policy gradient (REINFORCE)
(Gu, Neubig, Cho & Li, EACL 2017)
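A sketch of the wait/emit control loop this describes; the pretrained NMT model and the stochastic policy are hypothetical objects, and the REINFORCE update that trades delay against quality is omitted.

```python
WAIT, EMIT = 0, 1

def simultaneous_decode(nmt, policy, source_stream):
    """Interleave reading source symbols with emitting target symbols."""
    read, written = [], []
    source = iter(source_stream)
    while True:
        state = nmt.state(read, written)      # hidden state of the pretrained model so far
        action = policy.sample(state)         # stochastic policy, trained with REINFORCE
        if action == WAIT:
            try:
                read.append(next(source))     # wait for the next source symbol
            except StopIteration:
                action = EMIT                 # nothing left to read, so emit
        if action == EMIT:
            symbol = nmt.next_symbol(read, written)   # greedy output from the pretrained model
            if symbol == "</s>":
                return written
            written.append(symbol)
```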
(1) Real-Time Translation
(2) Trainable Greedy Decoding
Decoding
1. Start with a pretrained NMT
2. A trainable decoder intercepts and
interprets the incoming signal
3. The trainable decoder sends an
altering signal back to the
pretrained model
Learning
1. Deterministic policy gradient
2. Maximize any arbitrary objective
(Gu, Cho & Li, 2017)
(2) Trainable Greedy Decoding
Models
1. Actor
• Input: previous hidden state, previous symbol, and
the context from the attention model
• Output: an additive bias for the hidden state
2. Critic
• Input: a sequence of the hidden states from the decoder
• Output: a predicted return
• In our case, the critic estimates the full return rather than
Q at each time step
(Gu, Cho & Li, 2017)
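A PyTorch sketch of the two modules’ shapes as described above; the dimensions and layer sizes are made up, and this is not the authors’ code.

```python
import torch
import torch.nn as nn

hidden_size, embed_size, ctx_size = 512, 256, 512   # made-up dimensions

class Actor(nn.Module):
    """Maps (previous hidden state, previous symbol embedding, attention context)
    to an additive bias on the pretrained decoder's hidden state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size + embed_size + ctx_size, 256),
            nn.Tanh(),
            nn.Linear(256, hidden_size),
        )

    def forward(self, h_prev, y_prev_emb, context):
        bias = self.net(torch.cat([h_prev, y_prev_emb, context], dim=-1))
        return h_prev + bias   # altered state is fed back into the pretrained NMT model

class Critic(nn.Module):
    """Reads the whole sequence of decoder hidden states and predicts the full
    return (e.g. a sentence-level score), not a per-step Q value."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(hidden_size, 256, batch_first=True)
        self.head = nn.Linear(256, 1)

    def forward(self, hidden_states):            # (batch, time, hidden_size)
        _, last = self.rnn(hidden_states)
        return self.head(last[-1]).squeeze(-1)   # one predicted return per sentence

# At inference the critic is discarded: only the actor perturbs the decoder.
```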
(2) Trainable Greedy Decoding
(Gu, Cho & Li, EMNLP 2017)
Learning
1) Generate a translation given a source sentence, with noise [...]
2) Train the critic to minimize [...]
3) Generate multiple translations with noise
4) Critic-aware actor learning (newly proposed): [...]
Inference: simply throw away the critic and use the actor
(2) Trainable Greedy Decoding
• The trainable decoder does improve the target decoding objective
• Training is quite unstable without the critic-aware actor learning algorithm
• More work is definitely needed for further improvement
Toward End-to-End Q&A
Rodrigo Nogueira & K Cho. Task-Oriented Query Reformulation with Reinforcement
Learning. EMNLP 2017.
Dunn et al. SearchQA: A New Q&A Dataset Augmented with Context from a Search
Engine. arXiv 2017.
End-to-End Question-Answering
(Pipeline diagram: Neural Query Reformulator and Machine Comprehension modules; some components are trainable, others fixed black boxes)
Neural Query Reformulator
1. Reads an original query q0
2. Augments/reformulates q0
Learning
1. Hard RL problem: partial observability
due to the black box search engine
2. Policy gradient to maximize recall@K
(Nogueira & Cho, 2017)
Code and data available at https://github.com/nyu-dl/QueryReformulator
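A sketch of the training signal, with the search engine treated as a black box; reformulator.sample/update and search_engine.search are hypothetical interfaces, not the released code.

```python
def recall_at_k(retrieved, relevant, k=40):
    """Fraction of the relevant documents found among the top-k retrieved ones."""
    return len(set(retrieved[:k]) & set(relevant)) / max(len(relevant), 1)

def reinforce_step(reformulator, search_engine, query0, relevant_docs, baseline=0.0):
    new_query, log_prob = reformulator.sample(query0)   # 1. sample a reformulation of q0
    retrieved = search_engine.search(new_query)          # 2. black box: no gradient through it
    reward = recall_at_k(retrieved, relevant_docs)       # 3. reward = recall@K
    loss = -(reward - baseline) * log_prob               # 4. policy-gradient surrogate loss
    reformulator.update(loss)                            #    push up reformulations that help retrieval
    return reward
```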
SearchQA: new dataset
for machine comprehension
(Dunn et al., 2017)
Data available at https://github.com/nyu-dl/SearchQA
(Pipeline: (Q, A) → search / crawl / retrieve → (Q, A, { S1, S2, . . . , SN }))
SearchQA
1. Realistic, noisy context from Google
2. Multiple snippets per question
3. Large-scale data (140k q-a-c tuples)
And, Google did it!
• A pretrained, black-box Q&A
model
• Query reformulation with RL
• Tested on SearchQA
(Buck et al., 2017)
https://arxiv.org/abs/…
Few more relevant research directions
• Communicating neural networks
• Neural nets talk to each other to solve a problem
• Sukhbaatar & Fergus (2015), Foerster et al. (2016), Evtimova et al. (2017), Lewis et al. (2017),
…
• Multimodal processing
• Image captioning, zero-shot retrieval, …
• Cho et al. (2015, review paper)
• Planning, program synthesis
• How do the modules compose with each other to solve a task?
• Neural programmer interpreter [Reed et al., 2016; Cai et al., 2017]
• Forward modelling [Henaff et al., 2017; Sutton, 1991 Dyna; optimal control…]
• Mixture of experts [Google], progressive networks [Google DeepMind]
Paradigm Shift: modular, life-long learning
(Diagram: multiple Neural Networks interacting with the Environment, Users/Experts, a Search Engine, and a Database)
Neural Machine Translation
Multilingual, Character-Level, Non-parametric
Machine Translation
• [Allen 1987 IEEE 1st ICNN]
• 3310 En-Es pairs constructed on 31
En, 40 Es words, max 10/11 word
sentence; 33 used as test set
• Binary encoding of words – 50
inputs, 66 outputs; 1 or 3 hidden
150-unit layers. Ave WER: 1.3
words
• [Chrisman 1992 Connection Science]
• Dual-ported RAAM architecture
[Pollack 1990 Artificial Intelligence]
applied to corpus of 216 parallel pairs
of simple En-Es sentences:
• Split 50/50 as train/test, 75% of
sentences correctly translated!
Brief resurrection in 1997: Spain
Modern neural machine translation
(Diagrams: from neural components within statistical MT to fully neural MT)
• Source Sentence → [Neural Net + SMT] → Target Sentence (Schwenk et al., 2006)
• Source Sentence → [SMT + Neural Net] → Target Sentence (Devlin et al., 2014)
• Neural MT: Source Sentence → [Neural Network] → Target Sentence
WMT 2017: news translation task
A better single-pair translation system has
never been “the” goal of neural MT
Continuous representation:
Interlingua 2.0?
What if we can project sentences in multiple languages into a single vector space?
What does NMT do?
Encoder
• Project a source sentence into a
set of continuous vectors
Decoder+Attention
• Decode a target sentence from a
set of “source” continuous
vectors
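A toy PyTorch sketch of that split: the encoder turns the source sentence into a set of continuous vectors, and one decoding step attends over them. The dimensions and the dot-product scoring are illustrative, not a particular published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 64                               # toy sizes
emb = nn.Embedding(vocab, dim)
encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
decoder_cell = nn.GRUCell(dim + 2 * dim, dim)

src = torch.randint(0, vocab, (1, 12))              # a source sentence of 12 symbols
annotations, _ = encoder(emb(src))                  # "set of continuous vectors": (1, 12, 128)

h = torch.zeros(1, dim)                             # decoder state
y_prev = emb(torch.zeros(1, dtype=torch.long))      # embedding of the previous target symbol

# One decoding step: attend over the source vectors, then update the decoder state.
scores = torch.einsum('bd,btd->bt', h.repeat(1, 2), annotations)  # toy dot-product scoring
alpha = F.softmax(scores, dim=-1)                   # attention weights over source positions
context = torch.einsum('bt,btd->bd', alpha, annotations)
h = decoder_cell(torch.cat([y_prev, context], dim=-1), h)
print(h.shape)                                      # next decoder state: (1, 64)
```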
What is this “continuous vector space”?
• Similar sentences are near each other
in this vector space
• Multiple dimensions of similarity are
encoded simultaneously
(Sutskever et al., 2014)
What is this “continuous vector space”?
• Similar sentences are near each other
in this vector space
• Multiple dimensions of similarity are
encoded simultaneously
• (Trainable) near-bijective mapping
between the continuous vector space
and the sentence space
• Stripped of hard linguistic symbols
What is this “continuous vector space”?
(Firat et al., 2016; Luong et al., 2015; Dong et al., 2015)
• Can this continuous vector space be shared across multiple languages?
Multi-way, multilingual machine translation (1)
Language-agnostic
Continuous Vector
Space
• One encoder per source language
• One decoder per target language
• Attention/alignment shared across
all the language pairs
• Only bilingual parallel
corpora necessary
• No multi-way parallel corpus needed
(Firat et al., 2016)
Multi-way, multilingual machine translation (2)
• Neural nets are like lego
• Build one encoder per source
• Build one decoder per target
• Build one attention mechanism
• Given a sentence pair, pick the corresponding
encoder and decoder, and train them together
with the shared attention mechanism
(Firat et al., 2016)
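A sketch of the “lego” wiring just described; build_encoder, build_decoder and attend are placeholders for whatever networks one actually plugs in.

```python
def build_multiway_nmt(languages, build_encoder, build_decoder, attend):
    """One encoder per source language, one decoder per target language,
    and a single attention mechanism shared by every language pair."""
    encoders = {lang: build_encoder(lang) for lang in languages}
    decoders = {lang: build_decoder(lang) for lang in languages}

    def translate(sentence, src_lang, tgt_lang):
        annotations = encoders[src_lang](sentence)        # language-specific continuous vectors
        return decoders[tgt_lang](annotations, attend)    # the shared attention bridges them

    return translate

# Training needs only bilingual corpora: each (src, tgt) pair updates its own encoder
# and decoder plus the shared attention; no multi-way parallel corpus is required.
```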
Multi-way, multilingual machine translation (3)
Language-
agnostic
Continuous
Vector Space
• Sentence-level positive language transfer
• Helps low-resource language pairs
• Why?
1. Better structural constraint on the
continuous vector space
2. Regularization
• Real-valued vector-based interlingua?
(Firat et al., 2016)
Beyond languages: multimodal translation
• Does the source have to be “sentence”?
(Figure: attention-based image captioning from Xu et al., 2015. A Convolutional Neural Network produces annotation vectors h_j; an attention mechanism assigns weights a_j with Σ_j a_j = 1; a recurrent decoder state z_i samples the caption word by word, e.g. “a man is jumping into a lake .”)
Beyond languages: multimodal translation
(Caglayan et al., 2016; Elliott & Kadar, 2017)
What is a sentence?
Is a sentence a sequence of phrases, words, morphemes or characters?
What is a sentence to a neural net?
• Each word/symbol: one-hot vector
• Prior-less encoding
• Permutation invariant
• Sentence
• To us: a sequence of words
• To NN: a sequence of one-hot vectors
• What does it mean?
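A tiny numerical illustration of why one-hot encoding is prior-less and permutation invariant: every pair of distinct symbols is exactly as far apart as every other pair, so nothing in the encoding itself reflects similarity.

```python
import numpy as np

vocab = ["cat", "bed", "lies"]        # any three symbols
one_hot = np.eye(len(vocab))          # one-hot vector per symbol

# All pairwise distances are identical (sqrt(2)), regardless of meaning,
# and relabelling/permuting the vocabulary changes nothing.
print(np.linalg.norm(one_hot[0] - one_hot[1]))
print(np.linalg.norm(one_hot[0] - one_hot[2]))
print(np.linalg.norm(one_hot[1] - one_hot[2]))
```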
Why not words?
• Inefficient handling of various morphological variants
• Sub-optimal segmentation/tokenization
• “Etxaberria”, “Etxazarra”, “Etxaguren”, “Etxarren”: four independent vectors
• Lack of generalization to novel/rare morphological variants
• For instance, a single word in Arabic => “and to his vehicle”
• One vector for compound words?
• “kolmi/vaihe/kilo/watti/tunti/mittari” => one vector?
• “kolme” => one vector?
• Spelling issues
• See Workshop on Processing Historical Language or Universal Dependencies
• Good segmentation/tokenization needed for each language
• So, no, words don’t look like the units we want to work with…
Then, what should we do…?
• Original: 고양이가 침대 위에 누워있습니다
• Word-level modelling:
(고양이가, 침대, 위에, 누워있습니다)
• Subword-level modelling (Sennrich et al., 2015; Wu et al., 2016)
(고양이, 가, 침대, 위, 에, 누워, 있습니, 다)
• Character-level modelling with segmentation
(Wang et al., 2015; Luong & Manning, 2016; Costa-Jussa & Fonollosa, 2016)
((ㄱ,ㅗ,ㅇ,ㅑ,ㅇ,ㅣ,ㄱ,ㅏ), (ㅊ,ㅣ,ㅁ,ㄷ,ㅐ), (ㅇ,ㅟ,ㅇ,ㅔ),
(ㄴ,ㅜ,ㅇ,ㅝ,ㅇ,ㅣ,ㅆ,ㅅ,ㅡ,ㅂ,ㄴ,ㅣ,ㄷ,ㅏ))
• Fully character-level modelling (Chung et al., 2016; Lee et al., 2017)
(ㄱ,ㅗ,ㅇ,ㅑ,ㅇ,ㅣ,ㄱ,ㅏ,_,ㅊ,ㅣ,ㅁ,ㄷ,ㅐ,_,ㅇ,ㅟ,ㅇ,ㅔ,_,ㄴ,ㅜ,ㅇ,ㅝ,ㅇ,ㅣ,ㅆ,ㅅ,ㅡ,ㅂ,ㄴ,ㅣ,ㄷ,ㅏ)
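For the word- and character-level cases above, the decomposition can be reproduced with standard Unicode normalization (NFD splits each Hangul syllable into conjoining jamo, which are close to, though not codepoint-identical with, the compatibility jamo printed on the slide); subword units, by contrast, come from a learned segmentation model such as BPE and are not shown here.

```python
import unicodedata

sentence = "고양이가 침대 위에 누워있습니다"

words = sentence.split()                                                     # word-level units
char_with_segmentation = [list(unicodedata.normalize("NFD", w)) for w in words]
fully_char = list(unicodedata.normalize("NFD", sentence.replace(" ", "_")))  # "_" marks word boundaries

print(words)
print(char_with_segmentation[0])   # jamo of the first word
print(fully_char[:12])             # beginning of the single flat character stream
```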
Character-level translation
• Source: subword-level representation
• Target: character-level representation
• The decoder implicitly learned word-like units automatically!
(Chung et al., 2017)
Fully Character-level translation
• Source: character-level representation
• Target: character-level representation
• Efficient modelling with
a convolutional-recurrent encoder
• Works as well as, or better than,
subword-level translation
(Lee et al., 2017)
(Lee et al., 2017)
• More robust to errors
• Better handles rare tokens
• Rare tokens are not necessarily rare!
Character-level Multilingual Translation
• When symbols are shared across multiple languages, why not share a
single encoder/decoder for them?
1. Language transfer at all levels: letters, words, phrases, sentences, …
2. Intra-sentence code-switching without any specific data
(Lee et al., 2017; Johnson et al., 2016; Ha et al., 2016)
Non-parametric
neural machine translation
Bridging question-answering, information retrieval and machine translation
Parametric ML: Learning as Compression
• What does learning do?
• Parametric machine learning: data compression + pattern matching
(Diagram: Training Data → learning → Neural Network; only the Neural Network is kept for inference)
Non-Parametric NMT (1)
• Bring the whole training corpus together with a model
• Retrieve a small subset of examples using a fast search engine
• Let NMT figure out how to fuse
1. the current sentence, and
2. the retrieved translation pairs
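A sketch of the retrieve-and-fuse idea, with a toy token-overlap retriever standing in for a real fuzzy-match engine such as Lucene, and fuse_and_translate standing in for the NMT model with its memory of retrieved pairs.

```python
def overlap(a, b):
    """Toy similarity between two source sentences: Jaccard overlap of their tokens."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / max(len(a | b), 1)

def retrieve(source, corpus, k=4):
    """corpus: list of (src, tgt) training pairs; return the k most similar pairs."""
    return sorted(corpus, key=lambda pair: overlap(source, pair[0]), reverse=True)[:k]

def nonparametric_translate(source, corpus, fuse_and_translate, k=4):
    neighbours = retrieve(source, corpus, k)        # 1. a small subset of the training corpus
    return fuse_and_translate(source, neighbours)   # 2. NMT fuses the input with the retrieved pairs
```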
Non-Parametric NMT (2)
• Apache Lucene: search engine
• A key-value memory network
[Gulcehre et al., 2017; Miller et al., 2016]
for storing retrieved pairs
• Similar to larger-context NMT
• [Wang et al., 2017;
Jean et al., 2017]
• Similar to NMT with external
knowledge
• [Ahn et al., 2016;
Bahdanau et al., 2017]
Non-Parametric NMT (3)
• When retrieved pairs are similar, huge
improvement!
• Otherwise, revert to normal NMT
• More consistency in style and vocabulary choice
Other advances in neural machine translation
• Discourse-level machine translation
• [Jean et al., 2017; DCU, 2017]
• Better decoding strategies
• Learning-to-search [Wiseman & Rush, 2016]
• Reinforcement learning [MRT, 2016; Ranzato et al., 2015; Bahdanau et al., 2015]
• Trainable decoding [Gu et al., 2017]
• Alternative decoding cost [Li et al., 2016; Li et al., 2017]
• Linguistics-guided neural machine translation
• Learning to parse and translate [Eriguchi et al., 2017; Rohee & Goldberg, 2017; Luong
et al., 2016]
• Syntax-aware neural machine translation [Nadejde et al., 2017]
Paradigm Shift: modular, life-long learning
(Diagram: Neural Networks interacting with a Search Engine and a Database, as before)
Acknowledgements
• Tencent, eBay, Google, NVIDIA, Facebook and NYU for generously supporting my research and lab!
• Some of the works were sponsored through industrial projects with Samsung and NVIDIA!