QANet: Towards Efficient and Human-Level Reading Comprehension on SQuAD
Adams Wei Yu
Deview 2018, Seoul
Collaborators: Quoc Le, Thang Luong, Rui Zhao, Mohammad Norouzi, Kai Chen, David Dohan
Bio
Adams Wei Yu
● Ph.D. Candidate @ MLD, CMU
○ Advisors: Jaime Carbonell, Alex Smola
○ Large-scale optimization
○ Machine reading comprehension
Question Answering
[Examples: questions with a concrete answer vs. questions with no clear answer]
Early Success
http://www.aaai.org/Magazine/Watson/watson.php
Watson: complex multi-stage system
Moving towards end-to-end systems
● Translation
● Question Answering
Lots of Datasets Available
TriviaQA
Narrative QA
MS Marco
Stanford Question Answering Dataset (SQuAD)
Passage: In education, teachers facilitate student learning, often in a school or academy or perhaps in another environment such as outdoors. A teacher who teaches on an individual basis may be described as a tutor.
Question: What is the role of teachers in education?
Groundtruth: facilitate student learning
Prediction 1: facilitate student learning → EM = 1, F1 = 1
Prediction 2: student learning → EM = 0, F1 = 0.8
Prediction 3: teachers facilitate student learning → EM = 0, F1 = 0.86
Data: Crowdsourced 100k question-answer pairs on 500 Wikipedia articles.
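For reference, a minimal sketch of how these two metrics can be computed (the official SQuAD evaluation script additionally lowercases and strips articles and punctuation before comparing):

```python
from collections import Counter

def exact_match(prediction, groundtruth):
    # EM = 1 only if the answer strings match exactly.
    return float(prediction == groundtruth)

def f1_score(prediction, groundtruth):
    # Token-level F1: harmonic mean of precision and recall
    # over the tokens shared by prediction and groundtruth.
    pred_tokens = prediction.split()
    gt_tokens = groundtruth.split()
    num_same = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

# "student learning": precision = 2/2, recall = 2/3 -> F1 = 0.8
print(f1_score("student learning", "facilitate student learning"))
```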
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ Data augmentation via back-translation
○ Transfer learning from unsupervised tasks
Bag of words
[Figure: "That movie was awful ." → embed each token → sum → h_out]
Continuous bag-of-words and skip-gram architectures (Mikolov et al., 2013a; 2013b)
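A minimal sketch of the bag-of-words encoder above, assuming a vocabulary-indexed embedding matrix (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"that": 0, "movie": 1, "was": 2, "awful": 3, ".": 4}
embed = rng.normal(size=(len(vocab), 8))  # one 8-dim vector per word

def bag_of_words(tokens):
    # Embed each token, then sum: word order is ignored entirely.
    return sum(embed[vocab[t]] for t in tokens)

h_out = bag_of_words(["that", "movie", "was", "awful", "."])
print(h_out.shape)  # (8,)
```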
Bag of N-Grams
[Figure: "That movie was awful ." → embed each token → conv over neighboring tokens → sum → h_out]
Recurrent Neural Networks
[Figure: "That movie was awful ." → embed each token → starting from h_init, apply f(x, h) left to right → h_out]
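The recurrence f(x, h) can be as simple as a single tanh layer; a minimal sketch of an Elman-style cell (shapes illustrative):

```python
import numpy as np

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_in, d_h))
W_h = rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)

def f(x, h):
    # One recurrent step: mix the current input with the previous state.
    return np.tanh(x @ W_x + h @ W_h + b)

h = np.zeros(d_h)                        # h_init
for x in rng.normal(size=(5, d_in)):     # 5 token embeddings
    h = f(x, h)                          # strictly sequential
h_out = h
```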
The quick brown fox jumped over the lazy doo
The quick brown fox jumped over the lazy dog
A feed-forward neural network language model (Bengio et al., 2001; 2003)
Language Models
[Figure: "<s> The quick brown fox" → embed → f(x, h) from h_init → project → predict "The quick brown fox jumped"]
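A language model reuses the RNN above and adds a projection from the hidden state to vocabulary logits at every step; a minimal sketch of the prediction step (vocab size and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_h = 1000, 16                  # hypothetical vocab size and state size
W_proj = rng.normal(size=(d_h, V))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

h = rng.normal(size=d_h)           # state after reading "<s> The quick brown"
p = softmax(h @ W_proj)            # "project": distribution over the next word
# Training maximizes log p[target] at every position
# ("The", then "quick", ..., then "jumped").
```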
Language Models → Seq2Seq
[Figure: one RNN reads "Yes please <de> ja bitte" and is trained to emit "Ja bitte </s>": translation as language modeling over the concatenated source and target]
Seq2Seq + Attention
[Figure: an encoder RNN reads "Yes please"; a decoder RNN starts from h_init and emits "Ja bitte </s>". But is a single h_init enough to carry the whole source?]
https://distill.pub/2016/augmented-rnns/#
Attention: a weighted average
[Figure: each word in "The cat stuck out its tongue and licked its owner" attends to every other word]
Convolution: different linear transformations by relative position.
[Figure: each word combines only a fixed window of neighboring words]
Multi-head Attention: parallel attention layers with different linear transformations on input and output.
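A minimal numpy sketch of multi-head attention, assuming one (query, key, value) projection triple per head; the per-head output transform is folded into the final concatenation here for brevity:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads):
    # X: (N, d) token vectors; heads: list of (W_q, W_k, W_v) per head.
    outs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (N, N) weights
        outs.append(A @ V)                           # weighted averages
    return np.concatenate(outs, axis=-1)             # merge the heads

rng = np.random.default_rng(0)
N, d, h = 10, 16, 4                                  # 10 words, 4 heads
heads = [tuple(rng.normal(size=(d, d // h)) for _ in range(3))
         for _ in range(h)]
Y = multi_head_attention(rng.normal(size=(N, d)), heads)
print(Y.shape)  # (10, 16)
```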
Seq2Seq + Attention
[Figure: at each decoding step, the decoder attends back to the encoder states with weights w1, w2, ... instead of relying on h_init alone]
Language Models with attention
[Figure: the same next-word predictor, where each step can also attend over the earlier hidden states]
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ Data augmentation via back-translation
○ Transfer learning from unsupervised tasks
General (Doc, Question) → Answer Model
General framework for neural QA systems:
Bi-directional Attention Flow (BiDAF) [Seo et al., ICLR'17]
Base Model (BiDAF)
Similar general architectures:
● R-Net [Wang et al, ACL’17]
● DCN [Xiong et al., ICLR’17]
Base Model (BiDAF): every major block is an RNN.
Two challenges with RNNs remain...
First challenge: hard to capture long-range dependencies.
[Figure: information must flow through every intermediate state h1 → h2 → ... → h6 to connect distant words]
For example, in a long review like this one:
Being a long-time fan of Japanese film, I expected more than this. I can't really be
bothered to write too much, as this movie is just so poor. The story might be the cutest
romantic little something ever, pity I couldn't stand the awful acting, the mess they called
pacing, and the standard "quirky" Japanese story. If you've noticed how many Japanese
movies use characters, plots and twists that seem too "different", forcedly so, then steer
clear of this movie. Seriously, a 12-year old could have told you how this movie was
going to move along, and that's not a good thing in my book. Fans of "Beat" Takeshi: his
part in this movie is not really more than a cameo, and unless you're a rabid fan, you
don't need to suffer through this waste of film.
Second challenge: hard to compute in parallel. The recurrence is strictly sequential!
What do RNNs Capture?
1. Local context (each hidden state summarizes nearby inputs)
2. Global interaction (information can flow across the whole sequence)
3. Temporal info (word order is encoded implicitly)
Substitution: what other building blocks offer the same three properties?
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ Data augmentation via back-translation
○ Transfer learning from unsupervised tasks
Convolution: Capturing Local Context
[Figure: "The weather is nice today", each word represented by a d = 3-dim embedding column]
[Figure: a filter of width k = 2 slides over the zero-padded d = 3 embeddings and emits one feature per position]
[Figure: filters of width k = 2 and k = 3 each emit one feature map over the sequence]
k-gram features
Fully parallel!
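A minimal sketch of this 1-D convolution over token embeddings (shapes illustrative; real systems use many filters per width):

```python
import numpy as np

def conv1d(X, W):
    # X: (N, d) token embeddings; W: (k, d) one filter of width k.
    # Zero-pad so the output keeps length N, then slide the filter.
    N, d = X.shape
    k = W.shape[0]
    X_pad = np.vstack([X, np.zeros((k - 1, d))])
    return np.array([np.sum(X_pad[i:i + k] * W) for i in range(N)])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                       # 5 words, d = 3
feats_2gram = conv1d(X, rng.normal(size=(2, 3)))  # k = 2
feats_3gram = conv1d(X, rng.normal(size=(3, 3)))  # k = 3
# Every output position is independent of the others: fully parallel.
```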
How about Global Interaction?
[Figure: stacking conv layers 1, 2, 3 over "The weather is nice today"]
1. May need O(log_k N) layers for distant words to interact
2. Interaction may become weaker with depth
N: seq length. k: filter size.
Self-attention [Vaswani et al., NIPS'17]: the output for "The" is a weighted average of every embedding in "The weather is nice today":

h_The = w1·x_The + w2·x_weather + w3·x_is + w4·x_nice + w5·x_today,
where (w1, ..., w5) = softmax(x_The·x_The, x_The·x_weather, ..., x_The·x_today)

Self-attention is fully parallel & all-to-all!
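The same computation, for all positions at once; a minimal sketch of unprojected dot-product self-attention (real Transformer layers add the query/key/value projections shown in the multi-head sketch earlier):

```python
import numpy as np

def self_attention(X):
    # X: (N, d). Every position attends to every position.
    scores = X @ X.T                          # (N, N) dot products
    scores = scores - scores.max(axis=1, keepdims=True)
    W = np.exp(scores)
    W = W / W.sum(axis=1, keepdims=True)      # row-wise softmax weights
    return W @ X                              # (N, d) weighted averages

X = np.random.default_rng(0).normal(size=(5, 3))
H = self_attention(X)   # one pair of matmuls: fully parallel, all-to-all
```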
Complexity
             Per Unit    Total Per Layer    Sequential Ops (Path Memory)
Self-Attn    O(N·d)      O(N²·d)            O(1)
Conv         O(k·d²)     O(k·N·d²)          O(1)
RNN          O(d²)       O(N·d²)            O(N)
N: seq length. d: dim (N > d). k: filter size.
Explicitly Encode Temporal Info
RNN: position 1, 2, 3, 4, 5 is encoded implicitly, by processing tokens in order.
Position embedding: position is encoded explicitly, by adding an embedding of position i to the i-th input.
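A minimal sketch of the sinusoidal position embeddings of Vaswani et al., one common way to encode position explicitly (learned position embeddings are an alternative):

```python
import numpy as np

def position_embeddings(N, d):
    # pe[i, 2j]   = sin(i / 10000^(2j/d))
    # pe[i, 2j+1] = cos(i / 10000^(2j/d))     (assumes even d)
    pos = np.arange(N)[:, None]
    j = np.arange(0, d, 2)[None, :]
    angles = pos / np.power(10000.0, j / d)
    pe = np.zeros((N, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

X = np.random.default_rng(0).normal(size=(5, 16))  # token embeddings
X = X + position_embeddings(5, 16)  # the "+" in the figure: add, not concat
```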
QANet Encoder [Yu et al., ICLR'18]
One encoder block, bottom to top, with a residual connection (+) around each sublayer:
Position Emb → [Layer Norm → Convolution] (repeat) → [Layer Norm → Self Attention] → [Layer Norm → Feedforward]
Repeat the whole block if you want to go deeper.
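A minimal sketch of one such block; self_attention and position_embeddings are from the earlier sketches, while convolution and feedforward stand in for the real parameterized sublayers with an (N, d) → (N, d) signature:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def residual(sublayer, x):
    # Layer norm first, then the sublayer, then the skip connection (+).
    return x + sublayer(layer_norm(x))

def qanet_encoder_block(x, n_convs=2):
    x = x + position_embeddings(*x.shape)
    for _ in range(n_convs):                 # "repeat" the conv sublayer
        x = residual(convolution, x)
    x = residual(self_attention, x)
    x = residual(feedforward, x)
    return x

# Going deeper = stacking blocks:
#   for _ in range(num_blocks):
#       x = qanet_encoder_block(x)
```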
Base Model (BiDAF) → QANet: replace every RNN with a QANet Encoder.
QANet – 130+ layers (Deepest NLP NN)
QANet – First QA system with No Recurrence
● Very fast!
○ Training: 3x - 13x speedup
○ Inference: 4x - 9x speedup
QANet – 130+ layers (Deepest NLP NN)
● Layer normalization
● Residual connections
● L2 regularization
● Stochastic Depth
● Squeeze and Excitation
● ...
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ Data augmentation via back-translation
○ Transfer learning from unsupervised tasks
Data augmentation: popular in vision & speech
More data with NMT back-translation
Input --(English → French)--> Translation --(French → English)--> Paraphrase

Input: Previously, tea had been used primarily for Buddhist monks to stay awake during meditation.
Translation: Autrefois, le thé avait été utilisé surtout pour les moines bouddhistes pour rester éveillé pendant la méditation.
Paraphrase: In the past, tea was used mostly for Buddhist monks to stay awake during the meditation.
● More data
○ (Input, label)
○ (Paraphrase, label)
Applicable to virtually any NLP task!
QANet augmentation
Back-translate with 2 language pairs (English-French, English-German) → 3x data.
Improvement: +1.1 F1
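A minimal sketch of the augmentation loop; translate_out and translate_back stand in for whatever NMT systems are available (hypothetical helpers, not an actual API):

```python
def back_translate(text, translate_out, translate_back):
    # English -> pivot language -> English yields a paraphrase.
    return translate_back(translate_out(text))

def augment(dataset, pivots):
    # dataset: list of (passage, question, answer) triples.
    # pivots: e.g. [(en_to_fr, fr_to_en), (en_to_de, de_to_en)].
    augmented = list(dataset)
    for out, back in pivots:
        for passage, question, answer in dataset:
            paraphrase = back_translate(passage, out, back)
            # The answer span must then be re-located inside the
            # paraphrased passage (QANet uses a character-overlap
            # heuristic for this step).
            augmented.append((paraphrase, question, answer))
    return augmented  # 2 pivot languages -> 3x data
```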
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ Data augmentation via back-translation
○ Transfer learning from unsupervised tasks
Transfer learning for richer representation
Language Models
[Figure: the same next-word-prediction diagram as earlier; slide credit: Sebastian Ruder @ Indaba 2018]
Transfer learning for richer representation
● Pretrained language model (ELMo [Peters et al., NAACL'18])
○ +4.0 F1
● Pretrained machine translation model (CoVe [McCann et al., NIPS'17])
○ +0.3 F1
QANet – 3 key ideas
● Deep architecture without RNNs
○ 130+ layers (deepest in NLP)
● Transfer learning
○ Leverage unlabeled data
● Data augmentation
○ With back-translation
#1 on SQuAD (Mar-Aug 2018)
QA is not Solved!!
Thank you!
