
End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding

Interspeech 2016 Oral Presentation


  1. Yun-Nung (Vivian) Chen (http://vivianchen.idv.tw), with Hakkani-Tur, Tur, Gao, Deng
  2. Outline: Introduction (Spoken Dialogue System; Spoken/Natural Language Understanding, SLU/NLU); Contextual Spoken Language Understanding (Model Architecture; End-to-End Training); Experiments; Conclusion & Future Work
  4. Spoken Dialogue System (SDS) • Spoken dialogue systems are intelligent agents that help users finish tasks more efficiently via spoken interactions. • They are being incorporated into various devices (smartphones, smart TVs, in-car navigation systems, etc.). • Good intelligent assistants help users organize and access information conveniently; pop-culture examples include JARVIS (Iron Man's personal assistant) and Baymax (a personal healthcare companion).
  5. Dialogue System Pipeline: speech signal / text input ("Are there any action movies to see this weekend?") → ASR → hypothesis: "are there any action movies to see this weekend" → Language Understanding (LU: user intent detection, slot filling) → semantic frame (intents, slots): request_movie, genre=action, date=this weekend → Dialogue Management (DM: dialogue state tracking, policy decision) → system action: request_location → Output Generation → text response "Where are you located?" / screen display: location?
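To make the semantic-frame notation concrete, here is a minimal sketch of how the LU output for the example above might be represented in code; the structure and field names are illustrative assumptions, not an API from the paper.

```python
# Hypothetical representation of the LU output for the example utterance;
# field names are illustrative only.
semantic_frame = {
    "intent": "request_movie",
    "slots": {
        "genre": "action",
        "date": "this weekend",
    },
}
```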
  6. LU Importance. [Figure: learning curve of system performance, success rate vs. simulation epoch, showing an upper bound plus the RL (DQN) agent and the rule-based agent without LU errors.]
  7. LU Importance. [Figure: the same learning curves, adding the RL and rule-based agents with 5% LU errors.] A >5% performance drop is observed: the system performance is sensitive to LU errors, for both rule-based and reinforcement-learning agents.
  8. Dialogue System Pipeline (revisited; same pipeline as slide 5). LU is the current bottleneck: errors there propagate through the rest of the pipeline. SLU usually focuses on understanding single-turn utterances, yet the understanding result is usually influenced by 1) local observations and 2) global knowledge.
  9. Spoken Language Understanding: domain identification → intent prediction → slot filling.
     Single-turn example: U = "just sent email to bob about fishing this weekend"; slot tags S = O O O O B-contact_name O B-subject I-subject I-subject; domain D = communication; intent I = send_email; result: send_email(contact_name="bob", subject="fishing this weekend").
     Multi-turn example: U1 = "send email to bob", S1 tags "bob" as B-contact_name, giving send_email(contact_name="bob"); U2 = "are we going to fish this weekend", S2 = B-message I-message I-message I-message I-message I-message I-message, giving send_email(message="are we going to fish this weekend").
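The IOB tags above map mechanically to slot values. A minimal sketch of that mapping (the function name and tag handling are illustrative, not from the paper):

```python
# Convert an IOB slot-tag sequence into a dict of slot values,
# as in the send_email example above.
def iob_to_slots(tokens, tags):
    slots, name, value = {}, None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if name:                          # close any open slot
                slots[name] = " ".join(value)
            name, value = tag[2:], [token]
        elif tag.startswith("I-") and name == tag[2:]:
            value.append(token)               # continue the current slot
        else:                                 # "O" (or mismatched tag) ends any open slot
            if name:
                slots[name] = " ".join(value)
            name, value = None, []
    if name:
        slots[name] = " ".join(value)
    return slots

tokens = "just sent email to bob about fishing this weekend".split()
tags = ["O", "O", "O", "O", "B-contact_name", "O", "B-subject", "I-subject", "I-subject"]
print(iob_to_slots(tokens, tags))  # {'contact_name': 'bob', 'subject': 'fishing this weekend'}
```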
  11. Model Architecture. Idea: additionally incorporate contextual knowledge during slot tagging (Chen et al., "End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding," Interspeech 2016).
      1. Sentence encoding: the current utterance c is encoded into a vector u by a sentence encoder (RNN_in); each history utterance x_i is encoded into a memory representation m_i by a contextual sentence encoder (RNN_mem).
      2. Knowledge attention: the attention distribution p_i over the memories is computed from the inner product of u with each m_i, and the memories are summarized by the weighted sum h = Σ_i p_i m_i.
      3. Knowledge encoding: h and u are combined through the weight matrix W_kg into the knowledge encoding representation o, which is carried (through a matrix M) into an RNN tagger with parameter matrices U, V, W over inputs w_t, hidden states h_t, and outputs y_t, producing the slot tagging sequence y.
  12. Model Architecture (cont.): the same architecture with CNNs substituted for the RNN sentence encoders (both the current-utterance encoder and the memory encoder); a minimal code sketch of the attention and knowledge-encoding steps follows.
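This is a minimal sketch assuming PyTorch, not the authors' released code (linked on the last slide); the vocabulary size, single-layer GRU encoders, and the exact combination o = W_kg(h + u) are assumptions based on the slide's notation and the standard end-to-end memory network formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAttention(nn.Module):
    """Sketch of steps 1-3 above: encode, attend over memories, encode knowledge."""
    def __init__(self, embed_dim=150, hidden_dim=100, vocab_size=5000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn_in = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # encodes current utterance c
        self.rnn_mem = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # encodes history utterances {x_i}
        self.w_kg = nn.Linear(hidden_dim, hidden_dim)                   # W_kg

    def forward(self, current, history):
        # current: (1, T) token ids; history: (n_turns, T) token ids
        _, u = self.rnn_in(self.embed(current))    # final hidden state: (1, 1, hidden)
        u = u.squeeze(0)                           # u: (1, hidden)
        _, m = self.rnn_mem(self.embed(history))   # (1, n_turns, hidden)
        m = m.squeeze(0)                           # m_i stacked: (n_turns, hidden)
        p = F.softmax(m @ u.t(), dim=0)            # attention p_i from inner product u . m_i
        h = (p * m).sum(dim=0, keepdim=True)       # weighted sum h = sum_i p_i m_i
        o = self.w_kg(h + u)                       # knowledge encoding o (assumed W_kg(h + u))
        return o                                   # fed to the RNN tagger
```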
  13. End-to-End Training. Tagging objective: maximize the probability of the correct slot tag sequence y given the contextual (history) utterances and the current utterance. The RNN tagger is trained jointly with the encoders and the memory attention, so the model automatically figures out the attention distribution without explicit supervision.
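The objective itself is ordinary per-token cross-entropy over the tag sequence; because the attention weights sit on the path from the history utterances to the tagger, their gradients come for free. A minimal sketch under the same PyTorch assumption:

```python
import torch
import torch.nn.functional as F

def tagging_loss(logits, gold_tags):
    # logits: (T, n_tags) unnormalized tagger scores for each of T tokens;
    # gold_tags: (T,) gold slot-tag ids. Minimizing this cross-entropy
    # maximizes sum_t log p(y_t | history, current utterance), end to end.
    return F.cross_entropy(logits, gold_tags)
```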
  15. Experiments. Dataset: Cortana communication session data. Setup: GRU for all RNNs, Adam optimizer, embedding dim = 150, hidden units = 100, dropout = 0.5. Results ("x" = not used):
      Model      | Training Set | Knowledge Encoding | Sentence Encoder | First Turn | Other | Overall
      RNN Tagger | single-turn  | x                  | x                | 60.6       | 16.2  | 25.5
      The model trained on single-turn data performs far worse on non-first turns due to mismatched training data.
  16. Experiments (same setup), adding a row for training on multi-turn data:
      RNN Tagger | multi-turn | x | x | 55.9 | 45.7 | 47.4
      Treating multi-turn data as single-turn for training performs reasonably.
  17. Experiments (same setup), adding encoder-tagger baselines:
      Encoder-Tagger | multi-turn | current utt (c)           | RNN | 57.6 | 56.0 | 56.3
      Encoder-Tagger | multi-turn | history + current (x, c)  | RNN | 69.9 | 60.8 | 62.5
      Encoding both the history and the current utterance improves performance but increases training time.
  18. Experiments (same setup), adding the proposed model:
      Proposed | multi-turn | history + current (x, c) | RNN | 73.2 | 65.7 | 67.1
      Applying memory networks significantly outperforms all the approaches above, with much less training time.
  19. Experiments (same setup), adding a CNN sentence encoder (new result, not in the paper):
      Proposed | multi-turn | history + current (x, c) | CNN | 73.8 | 66.5 | 68.0
      The CNN produces comparable results for sentence encoding with shorter training time.
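For concreteness, a hypothetical training step tying the earlier sketches to the listed hyperparameters (GRU encoders, Adam, embedding dim 150, hidden units 100); the tag-set size and token ids are made up, and dropout (0.5 above) is omitted for brevity.

```python
import torch

# Reuses the hypothetical KnowledgeAttention and tagging_loss sketches above.
model = KnowledgeAttention(embed_dim=150, hidden_dim=100)
tagger_head = torch.nn.Linear(100, 20)      # 20 = hypothetical tag-set size
optimizer = torch.optim.Adam(list(model.parameters()) + list(tagger_head.parameters()))

current = torch.randint(0, 5000, (1, 6))    # fake token ids, current utterance (6 tokens)
history = torch.randint(0, 5000, (3, 6))    # fake token ids, three history turns
gold = torch.randint(0, 20, (6,))           # fake gold slot tags, one per token

optimizer.zero_grad()
o = model(current, history)                 # knowledge encoding, shape (1, 100)
logits = tagger_head(o).expand(6, -1)       # stand-in for a real per-token RNN tagger
loss = tagging_loss(logits, gold)
loss.backward()
optimizer.step()
```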
  21. Conclusion • The proposed end-to-end memory network stores contextual knowledge that can be exploited dynamically, via an attention model, to carry knowledge over for multi-turn understanding. • The end-to-end model performs the tagging task rather than classification. • The experiments show the feasibility and robustness of modeling knowledge carryover through memory networks.
  22. Future Work • Leverage not only local observations but also global knowledge for better language understanding: syntax or semantics can serve as global knowledge to guide the understanding model ("Knowledge as a Teacher: Knowledge-Guided Structural Attention Networks," arXiv preprint arXiv:1609.03286).
  23. Q & A. Thanks for your attention! The code will be available at https://github.com/yvchen/ContextualSLU
