End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding
Yun-Nung (Vivian) Chen
http://vivianchen.idv.tw
with Hakkani-Tur, Tur, Gao, Deng
Outline
Introduction
Spoken Dialogue System
Spoken/Natural Language Understanding (SLU/NLU)
Contextual Spoken Language Understanding
Model Architecture
End-to-End Training
Experiments
Conclusion & Future Work
End-to-End Memory Networks for Multi-Turn Spoken Language Understanding, Yun-Nung (Vivian) Chen
Spoken Dialogue System (SDS)
• Spoken dialogue systems are intelligent agents that help users finish tasks more efficiently via spoken interactions.
• Spoken dialogue systems are being incorporated into various devices (smartphones, smart TVs, in-car navigation systems, etc.).
• Good intelligent assistants help users organize and access information conveniently, e.g., JARVIS (Iron Man's personal assistant) or Baymax (a personal healthcare companion).
Dialogue System Pipeline
1. ASR: speech signal → hypothesis, e.g., "are there any action movies to see this weekend" (typed text input may skip this step)
2. Language Understanding (LU): user intent detection and slot filling → semantic frame (intents, slots), e.g., request_movie, genre=action, date=this weekend
3. Dialogue Management (DM): dialogue state tracking and policy decision → system action, e.g., request_location
4. Output Generation → text response ("Where are you located?") or screen display ("location?")
LU Importance
[Figure: learning curves of success rate vs. simulation epoch (0-495) for a reinforcement learning (DQN) agent and a rule-based agent, each trained with 0% and 5% simulated LU error rates, against an upper bound. The agents with 5% LU errors show a >5% drop in success rate.]
The system performance is sensitive to LU errors, for both rule-based and reinforcement learning agents.
Dialogue System Pipeline
In this pipeline, LU is the current bottleneck: its errors propagate to all downstream components.
SLU usually focuses on understanding single-turn utterances, yet the understanding result is influenced by 1) local observations and 2) global knowledge.
Spoken Language Understanding
SLU proceeds as Domain Identification → Intent Prediction → Slot Filling.

Single-turn example:
  S: "just sent email to bob about fishing this weekend"
  Domain: communication    Intent: send_email
  Slots (IOB): just/O sent/O email/O to/O bob/B-contact_name about/O fishing/B-subject this/I-subject weekend/I-subject
  → send_email(contact_name="bob", subject="fishing this weekend")

Multi-turn example:
  U1: "send email to bob" (bob/B-contact_name)
  → send_email(contact_name="bob")
  U2: "are we going to fish this weekend" (tokens tagged B-message, I-message, ...)
  → send_email(message="are we going to fish this weekend")
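The IOB slot tags in the examples above can be collected into a semantic frame with a small decoder. This is an illustrative sketch (the helper name `decode_iob` is hypothetical, not code from the paper):

```python
def decode_iob(tokens, tags):
    """Collect an aligned IOB tag sequence into slot/value pairs.

    tokens: list of words
    tags:   aligned IOB labels such as "O", "B-contact_name", "I-subject"
    """
    slots = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a new slot starts here
            slots.append([tag[2:], [token]])
        elif tag.startswith("I-") and slots and slots[-1][0] == tag[2:]:
            # continuation of the current slot
            slots[-1][1].append(token)
    return {name: " ".join(words) for name, words in slots}

tokens = "just sent email to bob about fishing this weekend".split()
tags = ["O", "O", "O", "O", "B-contact_name", "O",
        "B-subject", "I-subject", "I-subject"]
frame = decode_iob(tokens, tags)
# → {"contact_name": "bob", "subject": "fishing this weekend"}
```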
MODEL ARCHITECTURE
Idea: additionally incorporate contextual knowledge during slot tagging (Chen et al., "End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding," in Interspeech, 2016).
1. Sentence encoding: the current utterance c is encoded into a vector u by a sentence encoder (RNN_in); each history utterance x_i is encoded into a memory representation m_i by a contextual sentence encoder (RNN_mem).
2. Knowledge attention: the attention distribution p_i over the memories is computed from the inner product of u with each m_i.
3. Knowledge encoding: the weighted sum of the memories, h = Σ_i p_i m_i, is combined with u through a weight matrix W_kg to produce the knowledge encoding representation o.
Finally, an RNN tagger conditions on o together with the current utterance's word sequence w_t to output the slot tagging sequence y.
Variant: both sentence encoders (for the current utterance and for the history memories) can be CNNs instead of RNNs.
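The knowledge attention and encoding steps (2-3) can be sketched in a few lines of numpy. This is a minimal illustration under the assumption that o = W_kg(h + u), following the slide's Σ/W_kg node; it is not the authors' implementation:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def knowledge_attention(u, memories, W_kg):
    """Attend over history-utterance memories and encode knowledge.

    u        -- current-utterance encoding, shape (d,)
    memories -- history encodings m_i stacked as rows, shape (n, d)
    W_kg     -- learned projection, shape (d, d); random here
    """
    p = softmax(memories @ u)   # inner products -> attention p_i
    h = p @ memories            # weighted sum of the memories
    o = W_kg @ (h + u)          # knowledge encoding representation
    return p, o

rng = np.random.default_rng(0)
d, n = 4, 3
p, o = knowledge_attention(rng.normal(size=d),
                           rng.normal(size=(n, d)),
                           rng.normal(size=(d, d)))
# p is a probability distribution over the n history utterances
```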
END-TO-END TRAINING
• Tagging objective: maximize the likelihood of the slot tag sequence y given the contextual (history) utterances and the current utterance; the memory network and the RNN tagger are trained jointly.
• The attention distribution over the memories is figured out automatically during training, without explicit supervision.
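The point about implicit supervision can be made concrete: the only labeled signal is the tag sequence, so gradients of a single sequence-tagging loss flow back through the attention weights, which is how the distribution is learned without attention labels. A hypothetical sketch of that loss, not the authors' code:

```python
import numpy as np

def tagging_nll(tag_probs, gold_tags):
    """Negative log-likelihood of the gold slot tag sequence.

    tag_probs -- (T, K) per-token tag distributions from the tagger,
                 already conditioned on the knowledge encoding o
    gold_tags -- length-T sequence of gold tag indices
    """
    return -sum(np.log(tag_probs[t, g]) for t, g in enumerate(gold_tags))

# uniform distributions over K=4 tags: the loss is T * log(4)
loss = tagging_nll(np.full((2, 4), 0.25), [0, 3])
```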
EXPERIMENTS
• Dataset: Cortana communication session data
– GRU for all RNNs
– Adam optimizer
– embedding dim = 150
– hidden units = 100
– dropout = 0.5

| Model | Training Set | Knowledge Encoding | Sentence Encoder | First Turn | Other | Overall |
|---|---|---|---|---|---|---|
| RNN Tagger | single-turn | x | x | 60.6 | 16.2 | 25.5 |
| RNN Tagger | multi-turn | x | x | 55.9 | 45.7 | 47.4 |
| Encoder-Tagger | multi-turn | current utt (c) | RNN | 57.6 | 56.0 | 56.3 |
| Encoder-Tagger | multi-turn | history + current (x, c) | RNN | 69.9 | 60.8 | 62.5 |
| Proposed | multi-turn | history + current (x, c) | RNN | 73.2 | 65.7 | 67.1 |
| Proposed | multi-turn | history + current (x, c) | CNN | 73.8 | 66.5 | 68.0 |

Findings:
• The model trained on single-turn data performs worse on non-first turns due to mismatched training data.
• Treating multi-turn data as single-turn for training performs reasonably.
• Encoding both current and history utterances improves performance but increases training time.
• Applying memory networks (the proposed model) significantly outperforms all the above approaches, with much less training time.
• A CNN sentence encoder produces comparable results with even shorter training time (new result, not in the paper).
Conclusion
• The proposed end-to-end memory network stores contextual knowledge, which an attention model exploits dynamically to carry knowledge over for multi-turn understanding.
• The end-to-end model performs a tagging task rather than classification.
• The experiments show the feasibility and robustness of modeling knowledge carryover through memory networks.
Future Work
• Leverage not only local observations but also global knowledge for better language understanding.
– Syntax or semantics can serve as global knowledge to guide the understanding model.
– "Knowledge as a Teacher: Knowledge-Guided Structural Attention Networks," arXiv preprint arXiv:1609.03286.
Q & A
Thanks for your attention!
The code will be available at https://github.com/yvchen/ContextualSLU
