QANet: Towards Efficient and Human-Level Reading Comprehension on SQuAD
Adams Wei Yu
Deview 2018, Seoul
Collaborators: Quoc Le, Thang Luong, Rui Zhao, Mohammad Norouzi, Kai Chen, David Dohan
Bio
Adams Wei Yu
● Ph.D. candidate @ MLD, CMU
○ Advisors: Jaime Carbonell, Alex Smola
○ Large-scale optimization
○ Machine reading comprehension
Question Answering
[Examples: questions with a concrete answer vs. questions with no clear answer]
Early Success
http://www.aaai.org/Magazine/Watson/watson.php
Watson: complex multi-stage system
Moving towards end-to-end systems
● Translation
● Question Answering
Lots of Datasets Available
TriviaQA
Narrative QA
MS Marco
Stanford Question Answer Dataset (SQuAD)
Passage: In education, teachers facilitate student learning, often in a school or academy or perhaps in another environment such as outdoors. A teacher who teaches on an individual basis may be described as a tutor.
Question: What is the role of teachers in education?
Ground truth: facilitate student learning
Prediction 1: facilitate student learning (EM = 1, F1 = 1)
Prediction 2: student learning (EM = 0, F1 = 0.8)
Prediction 3: teachers facilitate student learning (EM = 0, F1 = 0.86)
Data: Crowdsourced 100k question-answer pairs on 500 Wikipedia articles.
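To make the EM and F1 numbers above concrete, here is a minimal sketch of token-level exact match and F1 (not the official SQuAD evaluation script, which also lowercases and strips articles and punctuation before comparing):

```python
from collections import Counter

def exact_match(prediction: str, truth: str) -> int:
    # 1 only if the answers are identical after whitespace tokenization.
    return int(prediction.split() == truth.split())

def f1(prediction: str, truth: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over shared tokens.
    pred_tokens, truth_tokens = prediction.split(), truth.split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

truth = "facilitate student learning"
print(exact_match("facilitate student learning", truth), round(f1("facilitate student learning", truth), 2))  # 1 1.0
print(exact_match("student learning", truth), round(f1("student learning", truth), 2))                        # 0 0.8
print(exact_match("teachers facilitate student learning", truth),
      round(f1("teachers facilitate student learning", truth), 2))                                            # 0 0.86
```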
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ data augmentation via back-translation
○ transfer learning from unsupervised tasks
Bag of words
[Diagram: each token of "That movie was awful ." is embedded; the embeddings are summed into h_out]
Continuous bag-of-words and skip-gram architectures (Mikolov et al., 2013a; 2013b)
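A minimal numpy sketch of the bag-of-words encoder in the diagram: look up one embedding per token and sum them. Vocabulary, dimension, and weights are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"that": 0, "movie": 1, "was": 2, "awful": 3, ".": 4}
d = 8
E = rng.normal(size=(len(vocab), d))   # embedding table, one row per word

def bag_of_words(tokens):
    # h_out is simply the sum of the token embeddings; word order is ignored.
    return E[[vocab[t] for t in tokens]].sum(axis=0)

h_out = bag_of_words(["that", "movie", "was", "awful", "."])
print(h_out.shape)  # (8,)
```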
Bag of N-grams
[Diagram: each token of "That movie was awful ." is embedded, passed through convolutions, and summed into h_out]
Recurrent Neural Networks
[Diagram: each embedded token of "That movie was awful ." is fed through a recurrent cell f(x, h), starting from h_init and ending in h_out]
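A minimal sketch of the recurrence f(x, h) in the diagram: each step mixes the current token embedding with the previous hidden state. The weights are random; only the data flow matters here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, seq_len = 8, 16, 5
W_x = rng.normal(size=(d_in, d_h))
W_h = rng.normal(size=(d_h, d_h))

def rnn(xs, h_init):
    h = h_init
    for x in xs:                        # strictly sequential: step t needs step t-1
        h = np.tanh(x @ W_x + h @ W_h)  # f(x, h)
    return h                            # h_out

xs = rng.normal(size=(seq_len, d_in))   # embedded tokens
h_out = rnn(xs, np.zeros(d_h))
print(h_out.shape)  # (16,)
```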
The quick brown fox jumped over the lazy doo
The quick brown fox jumped over the lazy dog
A feed-forward neural network language
model (Bengio et al., 2001; 2003)
Language Models
[Diagram: "<s> The quick brown fox" is embedded, fed through recurrent cells f(x, h) from h_init, and projected to predict the next words "The quick brown fox jumped"]
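On top of the recurrent state, a language model adds the "project" step from the diagram: a linear map to vocabulary logits followed by a softmax over possible next words. A minimal sketch with a toy vocabulary and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "the", "quick", "brown", "fox", "jumped"]
d_h = 16
W_proj = rng.normal(size=(d_h, len(vocab)))   # the "project" step

def next_word_distribution(h):
    logits = h @ W_proj
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                # softmax over the vocabulary

h = rng.normal(size=d_h)                      # hidden state after reading "<s> The quick brown fox"
p = next_word_distribution(h)
print(vocab[int(p.argmax())], p.sum())        # most likely next word; probabilities sum to 1
```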
Language Models → Seq2Seq
[Diagram: the source "Yes please <de>" and the target prefix "ja bitte" are fed through the same recurrent stack; the projection predicts "Ja bitte </s>"]
Seq2Seq + Attention
[Diagram: an Encoder reads "Yes please"; a Decoder starts from h_init with the prefix "<s> ja bitte" and predicts "Ja bitte </s>"; the "?" asks how the decoder can look back at the encoder states]
https://distill.pub/2016/augmented-rnns/#
Attention: a weighted average
[Diagram: each word of "The cat stuck out its tongue and licked its owner" attends to every word of the same sentence]
Convolution:
Different linear transformations by relative position.
[Diagram: each word of "The cat stuck out its tongue and licked its owner" is combined with its neighbors, with a different transformation per relative position]
Attention: a weighted average
[Diagram: the same sentence again; every output position is a weighted average over all input positions]
Multi-head Attention
Parallel attention layers with different linear transformations on input and output.
[Diagram: multiple attention heads over "The cat stuck out its tongue and licked its owner"]
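A compact numpy sketch of multi-head attention as described above: each head applies its own linear transformations to queries, keys, and values, attends, and the head outputs are concatenated and projected back. Shapes and weights are illustrative, not trained Transformer parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, d_k, rng):
    N, d = X.shape
    outputs = []
    for _ in range(heads):
        W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_k))   # N x N attention weights
        outputs.append(weights @ V)                 # weighted average of the values
    W_o = rng.normal(size=(heads * d_k, d))
    return np.concatenate(outputs, axis=-1) @ W_o   # project concatenated heads back to d

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 32))                       # 10 tokens, dimension 32
print(multi_head_attention(X, heads=4, d_k=8, rng=rng).shape)  # (10, 32)
```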
Seq2Seq + Attention
[Diagram: an Encoder reads "Yes please"; the Decoder reads "<s> Ja bitte ..." and predicts "Ja bitte </s>"; attention weights w1, w2, ... connect each decoder step back to the encoder states]
Language Models with attention
[Diagram: "<s> The quick brown fox" is embedded and fed through recurrent cells f(x, h); each step also attends over earlier states before the projection that predicts "The quick brown fox jumped"]
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ data augmentation via back-translation
○ transfer learning from unsupervised tasks
General (Doc, Question) → Answer Model
General framework for neural QA systems
Bi-directional Attention Flow (BiDAF)
[Seo et al., ICLR’17]
Base Model (BiDAF)
Similar general architectures:
● R-Net [Wang et al., ACL'17]
● DCN [Xiong et al., ICLR'17]
[Diagram: the BiDAF architecture, whose encoding layers are RNNs]

Two Challenges with RNNs Remain...
First challenge: hard to capture long dependency
[Diagram: RNN hidden states h1 through h6, chained one after another]
Being a long-time fan of Japanese film, I expected more than this. I can't really be
bothered to write too much, as this movie is just so poor. The story might be the cutest
romantic little something ever, pity I couldn't stand the awful acting, the mess they called
pacing, and the standard "quirky" Japanese story. If you've noticed how many Japanese
movies use characters, plots and twists that seem too "different", forcedly so, then steer
clear of this movie. Seriously, a 12-year old could have told you how this movie was
going to move along, and that's not a good thing in my book. Fans of "Beat" Takeshi: his
part in this movie is not really more than a cameo, and unless you're a rabid fan, you
don't need to suffer through this waste of film.
Second challenge: hard to compute in parallel
Strictly Sequential!
What do RNNs Capture? Substitution?
1. Local context
2. Global interaction
3. Temporal info
[Diagram: input sequence feeding hidden states h1 through h6]
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ data augmentation via back-translation
○ transfer learning from unsupervised tasks
Convolution: Capturing Local Context
[Diagram: the tokens "The weather is nice today" as a d = 3 matrix of feature values; a width-k filter (k = 2, then k = 3) slides over the zero-padded sequence and produces one output per position]
k-gram features
Fully parallel!
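A sketch of the sliding-window computation above: with d = 3 features per word and a filter of width k = 2, each output is a linear function of the current word and its left neighbor (zero-padded at the start). Every position can be computed independently, hence "fully parallel". The numbers and weights are arbitrary.

```python
import numpy as np

d, k = 3, 2
# 5 tokens ("The weather is nice today"), each a d-dimensional column (illustrative values).
X = np.array([[0.6, 0.4, 0.1, 0.2, 0.1],
              [0.2, 0.1, 0.8, 0.3, 0.6],
              [0.8, 0.4, 0.2, 0.1, 0.9]])            # shape (d, N)
W = np.ones((k, d))                                   # one width-k filter

X_pad = np.concatenate([np.zeros((d, k - 1)), X], axis=1)    # zero-pad on the left
out = np.array([(W * X_pad[:, i:i + k].T).sum()              # each output sees only k neighboring words
                for i in range(X.shape[1])])
print(out)   # one k-gram feature per position, all positions computable in parallel
```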
How about Global Interaction?
[Diagram: stacking convolution layers 1, 2, 3 over "The weather is nice today" so that distant words eventually interact]
1. May need O(log_k N) layers
2. Interaction may become weaker
N: Seq length. k: Filter size.
Self-Attention [Vaswani et al., NIPS'17]
[Diagram: for the token "The" in "The weather is nice today", the output is the weighted sum w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 over all token vectors, where (w1, ..., w5) = softmax of the similarity between "The" and every token]
Self-attention is fully parallel & all-to-all!
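The same weighted-average idea from the slide, as a numpy sketch: the output for "The" is w1·x1 + ... + w5·x5, where the weights come from a softmax over the similarity of "The" with every token. Every position is computed the same way, so the whole layer is a couple of matrix products and needs no sequential loop. Vectors are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Vectors for "The weather is nice today" (d = 3, illustrative numbers), one row per token.
X = np.array([[0.6, 0.2, 0.8],
              [0.4, 0.1, 0.4],
              [0.1, 0.8, 0.2],
              [0.2, 0.3, 0.1],
              [0.1, 0.6, 0.9]])          # shape (N, d)

query = X[0]                             # the vector for "The"
w = softmax(X @ query)                   # (w1, ..., w5): similarity of "The" to every token
output_the = w @ X                       # w1*x1 + w2*x2 + ... + w5*x5
print(w.round(2), output_the.round(2))   # all-to-all: every token contributes to the output
```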
Complexity

            Per Unit    Total Per Layer    Sequential Ops (Path Memory)
Self-Attn   O(Nd)       O(N^2 d)           O(1)
Conv        O(kd^2)     O(kNd^2)           O(1)
RNN         O(d^2)      O(Nd^2)            O(N)

N: Seq length. d: Dim. (N > d). k: Filter size.
Explicitly Encode Temporal Info
[Diagram: an RNN encodes positions 1-5 implicitly through its sequential states; adding a position embedding to each input encodes position explicitly]
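Since convolution and self-attention are order-agnostic on their own, temporal information is added explicitly. One standard choice is the Transformer's fixed sinusoidal encoding (QANet uses a position encoding of this kind); a minimal sketch:

```python
import numpy as np

def position_embedding(n_positions: int, d: int) -> np.ndarray:
    # Sinusoids of different frequencies: even dimensions get sin, odd dimensions get cos.
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))   # shape (n_positions, d)

X = np.zeros((5, 8))                               # 5 token vectors of dimension 8
X_with_position = X + position_embedding(5, 8)     # the "+" in the diagram: explicit temporal info
print(X_with_position.shape)
```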
QANet Encoder [Yu et al., ICLR'18]
[Diagram: Position Emb → (Layer Norm → Convolution → +, repeated if you want to go deeper) → Layer Norm → Self Attention → + → Layer Norm → Feedforward → +, where each "+" is a residual connection]
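A forward-pass sketch of one encoder block in the diagram: position-encoded inputs go through [layer norm → convolution → residual] (repeatable), then [layer norm → self-attention → residual], then [layer norm → feedforward → residual]. This is a simplification with random weights, a single attention head, and plain rather than depthwise-separable convolutions, not the actual QANet implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 6, 16, 3                                   # tokens, dimension, conv filter width

def layer_norm(x, eps=1e-6):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv(x, W):                                      # width-k convolution, zero-padded on the left
    xp = np.concatenate([np.zeros((k - 1, d)), x])
    return np.stack([(xp[i:i + k, :, None] * W).sum((0, 1)) for i in range(len(x))])

def self_attention(x, Wq, Wk, Wv):
    q, kk, v = x @ Wq, x @ Wk, x @ Wv
    a = np.exp(q @ kk.T / np.sqrt(d))
    a /= a.sum(-1, keepdims=True)
    return a @ v

def encoder_block(x):
    Wc = rng.normal(size=(k, d, d)) / np.sqrt(k * d)
    x = x + conv(layer_norm(x), Wc)                          # layer norm -> convolution -> residual
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    x = x + self_attention(layer_norm(x), Wq, Wk, Wv)        # layer norm -> self-attention -> residual
    W1 = rng.normal(size=(d, 4 * d)) / np.sqrt(d)
    W2 = rng.normal(size=(4 * d, d)) / np.sqrt(4 * d)
    return x + np.maximum(layer_norm(x) @ W1, 0) @ W2        # layer norm -> feedforward -> residual

x = rng.normal(size=(N, d))                          # assume position embeddings are already added
print(encoder_block(x).shape)                        # (6, 16)
```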
Base Model (BiDAF) → QANet
[Diagram: every RNN block in BiDAF is replaced by a QANet Encoder block]
QANet – 130+ layers (Deepest NLP NN)
QANet – First QA system with No Recurrence
● Very fast!
○ Training: 3x - 13x faster
○ Inference: 4x - 9x faster
QANet – 130+ layers (Deepest NLP NN)
● Layer normalization
● Residual connections
● L2 regularization
● Stochastic Depth (sketched below)
● Squeeze and Excitation
● ...
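Among the tricks that make a 130+-layer stack trainable, stochastic depth randomly drops whole residual branches during training and rescales them at test time. A hedged sketch of the idea (not QANet's exact drop schedule):

```python
import numpy as np

def stochastic_depth_residual(x, sublayer, survival_prob, training, rng):
    """Residual connection whose sublayer is randomly skipped during training."""
    if training:
        if rng.random() < survival_prob:
            return x + sublayer(x)               # keep the branch
        return x                                 # drop the whole branch for this pass
    return x + survival_prob * sublayer(x)       # at test time, scale by the survival probability

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
out = stochastic_depth_residual(x, lambda h: 0.1 * h, survival_prob=0.9, training=True, rng=rng)
print(out.shape)
```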
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ data augmentation via back-translation
○ transfer learning from unsupervised tasks
Data augmentation: popular in vision & speech
More data with NMT back-translation
[Pipeline: Input → (English → French) → Translation → (French → English) → Paraphrase]
Input: Previously, tea had been used primarily for Buddhist monks to stay awake during meditation.
Translation (French): Autrefois, le thé avait été utilisé surtout pour les moines bouddhistes pour rester éveillé pendant la méditation.
Paraphrase: In the past, tea was used mostly for Buddhist monks to stay awake during the meditation.
● More data
○ (Input, label)
○ (Paraphrase, label)
Applicable to virtually any NLP task!
QANet augmentation
[Pipeline: Input → (English → French) → Translation → (French → English) → Paraphrase]
Use 2 language pairs (English-French, English-German) → 3x data.
Improvement: +1.1 F1
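A sketch of the augmentation pipeline: translate each passage into a pivot language and back, and reuse the original label for the paraphrase. The `translate` function is a hypothetical stand-in for whatever NMT system is available, and the real QANet pipeline also has to re-locate the answer span inside the paraphrased passage, which is omitted here.

```python
def translate(text: str, src: str, tgt: str) -> str:
    # Hypothetical stand-in: a real pipeline would call an NMT model or service here.
    # Returning the input unchanged keeps this sketch runnable without one.
    return text

def back_translate(text: str, pivot: str) -> str:
    return translate(translate(text, src="en", tgt=pivot), src=pivot, tgt="en")

def augment(dataset, pivots=("fr", "de")):
    # Keep every (input, label) pair and add one (paraphrase, label) pair per pivot
    # language; with English-French and English-German this roughly triples the data.
    augmented = list(dataset)
    for text, label in dataset:
        for pivot in pivots:
            augmented.append((back_translate(text, pivot), label))
    return augmented

examples = [("Previously, tea had been used primarily for Buddhist monks to stay awake during meditation.", "label")]
print(len(augment(examples)))   # 3: the original plus one paraphrase per pivot language
```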
Roadmap
● Models for text
● General neural structures for QA
● Building blocks for QANet
○ Fully parallel (CNN + Self-attention)
○ data augmentation via back-translation
○ transfer learning from unsupervised tasks
Transfer learning for richer representation
Language Models (recap)
[Diagram: "<s> The quick brown fox" embedded, fed through recurrent cells f(x, h) from h_init, and projected to predict "The quick brown fox jumped"]
Sebastian Ruder @ Indaba 2018
Transfer learning for richer representation
● Pretrained language model (ELMo, [Peters et al., NAACL'18])
○ + 4.0 F1
● Pretrained machine translation model (CoVe [McCann, NIPS'17])
○ + 0.3 F1
(A sketch of feeding such pretrained features into the QA model follows below.)
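One common way to use such pretrained models is to run them over the passage and question and concatenate their contextual vectors with the QA model's own word embeddings. A hedged sketch with a hypothetical `pretrained_lm` featurizer (ELMo-style); this is not the exact integration used in the talk.

```python
import numpy as np

def pretrained_lm(tokens):
    # Hypothetical stand-in for a pretrained contextual encoder such as ELMo:
    # it would return one vector per token, learned from unlabeled text.
    rng = np.random.default_rng(len(tokens))
    return rng.normal(size=(len(tokens), 1024))

def embed_with_transfer(tokens, word_embeddings):
    # Concatenate task-specific word embeddings with frozen pretrained features,
    # then feed the result into the QA model's encoder.
    contextual = pretrained_lm(tokens)
    return np.concatenate([word_embeddings, contextual], axis=-1)

tokens = "what is the role of teachers".split()
word_emb = np.zeros((len(tokens), 300))
print(embed_with_transfer(tokens, word_emb).shape)   # (6, 1324)
```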
QANet – 3 key ideas
● Deep Architecture without RNN
○ 130-layer (Deepest in NLP)
● Transfer Learning
○ leverage unlabeled data
● Data Augmentation
○ with back-translation
#1 on SQuAD (Mar-Aug 2018)
QA is not Solved!!
Thank you!
