
Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access


Oral presentation at the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017).


1. Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access
   Bhuwan Dhingra (Carnegie Mellon University), Lihong Li (Microsoft Research), Xiujun Li (Microsoft Research), Jianfeng Gao (Microsoft Research), Yun-Nung (Vivian) Chen (National Taiwan University), Faisal Ahmed (Microsoft Research), Li Deng (Citadel)
2. KB-InfoBot: An interactive search engine
   • Setting
     – User is looking for a piece of information from one or more tables/KBs
     – System must iteratively ask for user constraints ("slots") to retrieve the answer
   • Interactive search is more natural
     – Users are used to issuing queries of fewer than 5 words (Spink et al., 2001)
     – Users may not know the structure of the database being queried
   • Example dialogue (user goal: Movie=?, given Actor=Bill Murray, Release Year=1993):
     User: Find me a Bill Murray movie.
     KB-InfoBot: When was it released?
     User: I think it came out in 1993.
     KB-InfoBot: Groundhog Day is a Bill Murray movie which came out in 1993.
   • Entity-Centric Knowledge Base (X = missing value):
     Movie                Actor           Release Year
     Groundhog Day        Bill Murray     1993
     Australia            Nicole Kidman   X
     Mad Max: Fury Road   X               2015
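Such an entity-centric KB is just a table with possibly missing cells. A minimal Python sketch of one way to represent it (the dict-per-row layout and the None-for-missing convention are illustrative assumptions, not the repository's actual format):

# Entity-centric KB: one row per entity, one field ("slot") per column.
# None marks a missing value (shown as "X" on the slide).
kb = [
    {"movie": "Groundhog Day",      "actor": "Bill Murray",   "release_year": "1993"},
    {"movie": "Australia",          "actor": "Nicole Kidman", "release_year": None},
    {"movie": "Mad Max: Fury Road", "actor": None,            "release_year": "2015"},
]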
3. Goal-Oriented Dialogue System (Young et al., 2013)
   [Diagram: user utterance → Natural Language Understanding (NLU) → acts/entities → State Tracker / Belief Tracker → dialogue state → Dialogue Policy → dialogue act → Natural Language Generator (NLG) → system response; the Dialogue Policy issues a query to a Database/KB and receives query results]
   Query example: SELECT Movie WHERE Actor==Bill Murray AND Genre==Comedy
4. KB-InfoBot: a simple rule-based approach
   – Use heuristics to maintain a belief state over slots
   – Ask for the slot with maximum uncertainty, until some "inform" criterion is met (see the sketch below)
   Limitations:
   – Has no notion of what the user is likely to be looking for or likely to know
   – Symbolic queries lose the notion of uncertainty in upstream modules
   – Cannot improve online with user feedback
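As a concrete illustration of "ask for the slot with maximum uncertainty", here is a minimal Python sketch; the function names and the entropy threshold are hypothetical, and the actual rule-based agent may differ:

import math

def slot_entropy(probs):
    # Shannon entropy (in nats) of one slot's belief distribution
    return -sum(p * math.log(p) for p in probs if p > 0)

def next_action(beliefs, inform_threshold=0.1):
    # beliefs: dict mapping slot name -> list of probabilities over its values
    entropies = {slot: slot_entropy(p) for slot, p in beliefs.items()}
    slot, h = max(entropies.items(), key=lambda kv: kv[1])
    if h < inform_threshold:      # every slot near-certain: "inform" criterion met
        return ("inform", None)
    return ("request", slot)      # otherwise ask about the most uncertain slot

# Example: the release year is far less certain than the actor
beliefs = {"actor": [0.9, 0.1], "release_year": [0.4, 0.35, 0.25]}
print(next_action(beliefs))       # -> ('request', 'release_year')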
5. KB-InfoBot: supervised / reinforcement learning-based approach
   – Use neural networks to model the LU, belief tracker, and policy
   Pros:
   – Learns user behaviors (e.g. which slots the user is likely to know)
   – End-to-end and online learning possible
   Cons:
   – Symbolic queries lose the notion of uncertainty in upstream modules
   – Cannot backpropagate gradients through the symbolic query
6. Network-Based Dialogue System (Wen et al., 2017)
   [Diagram: the same user–agent pipeline, trained with supervised learning / reinforcement learning; loss/reward is backpropagated through the agent, but the symbolic query to the Database/KB is not differentiable]
   Query example: SELECT Movie WHERE Actor==Bill Murray AND Genre==Comedy
   → Truly "end-to-end" learning is not possible.
7. Piecewise Training (Wen et al., 2017)
   [Diagram: the policy is trained with supervised learning / reinforcement learning on loss/reward, while the modules before the query are trained separately with supervised learning on labeled data]
   – Labeling is expensive
   – Cannot learn online
8. Our Approach: Soft-KB Lookup via Attention
   • Replace the symbolic query with an attention distribution
     – Compose slot-wise belief states into one posterior distribution over the entire database
     – The KB structure is encoded in the computation of attention
   Pros:
   – Uncertainty over database entries is propagated to the policy network (rule-based + RL)
   – Differentiable operations allow backpropagation of gradients (RL)
   Cons:
   – Computationally expensive for large databases
9. Our Approach: Soft-KB Lookup via Attention
   [Diagram: the agent computes a soft attention, i.e. a full distribution over DB entities, instead of a symbolic query; uncertainty is propagated forward into the dialogue state and policy, and gradients are propagated backward through the whole pipeline during supervised / reinforcement learning]
10. Entity-Centric KB → Soft-KB Lookup
    – Agent beliefs: distribution over slots (or fields) in the KB
    – KB posterior: posterior distribution over entities in the KB
    Example KB (? = missing value):
      Entity   Slot1   Slot2
      A        x1      y1
      B        x2      ?
      C        ?       y2
11. State Tracker
    For each slot j, the tracker maintains:
    1. A multinomial distribution over the slot's values, e.g.:
         Slot values:    x1    x2
         Probabilities:  0.3   0.7
    2. A binomial probability that the user knows the value of the slot, e.g. 0.8
12. KB Posterior
      Entity   Slot1   Slot2
      A        x1      y1
      B        x2      ?
      C        ?       y2
    Assumption: slot values are independently distributed, so the posterior over entities factorizes across slots (reconstructed formulas below).
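The transcript omits the actual formulas. Reconstructed from the paper's description of the Soft-KB lookup, and hedged accordingly (notation: N entities; N_j(v) entities with value v in slot j; M_j entities whose slot-j value is missing; p_j^t the belief over slot-j values; q_j^t the probability that the user knows slot j; consult the paper for the exact form), the per-slot posterior is roughly:

\Pr(G_j = i \mid \Phi_j = 1) =
  \begin{cases}
    \dfrac{p_j^t(v_{i,j})}{N_j(v_{i,j})}\left(1 - \dfrac{M_j}{N}\right) & \text{if } v_{i,j} \text{ is known} \\[1ex]
    \dfrac{1}{N} & \text{if } v_{i,j} \text{ is missing}
  \end{cases}

\Pr(G_j = i) = q_j^t \,\Pr(G_j = i \mid \Phi_j = 1) + (1 - q_j^t)\,\dfrac{1}{N}

and, by the independence assumption, the posterior over entities is \Pr(G = i) \propto \prod_j \Pr(G_j = i).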
13. KB Posterior
    [Slide shows the posterior computation for the running example, using the "user knows" probability q = 0.8 from slide 11]
14. KB Posterior
    Worked example (single slot; ? = missing value):
      Entity   Slot1
      A        x1
      B        x2
      C        ?
    with belief p(x1) = 0.3, p(x2) = 0.7 over the slot's values.
15. KB Posterior
    • A distribution over all entities in the database
    • The posterior reflects uncertainty in the LU + state tracking
    • All operations are differentiable, so gradients can pass through during the backward pass (a worked numerical sketch follows)
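A minimal numpy sketch of the Soft-KB lookup on the single-slot example from slide 14, using the reconstructed formulas above; the function name and KB layout are illustrative, not the repository's API, and a real implementation would use differentiable tensor ops so gradients can flow:

import numpy as np

def soft_kb_posterior(column, value_probs, q_knows):
    # column:      slot value per entity (None = missing)
    # value_probs: belief p_j(v) over the slot's values
    # q_knows:     probability q_j that the user knows this slot
    N = len(column)
    M = sum(v is None for v in column)                   # entities with a missing value
    counts = {v: column.count(v) for v in column if v is not None}
    post = np.empty(N)
    for i, v in enumerate(column):
        if v is None:
            p_known = 1.0 / N                            # goal uniform among missing rows
        else:
            p_known = value_probs[v] / counts[v] * (1.0 - M / N)
        # mix the "user knows the slot" and "user doesn't know" cases
        post[i] = q_knows * p_known + (1.0 - q_knows) / N
    return post

# Running example: entities A, B, C; p(x1)=0.3, p(x2)=0.7; q=0.8
post = soft_kb_posterior(["x1", "x2", None], {"x1": 0.3, "x2": 0.7}, 0.8)
print(post.round(3))   # -> [0.227 0.44  0.333], a proper distribution over entities

With multiple slots, the per-slot posteriors are multiplied elementwise and renormalized, per the independence assumption on slide 12.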
16. Evaluation – Three Questions
    1. Does the Soft-KB lookup lead to better dialogue policies?
    2. Does reinforcement learning improve over the rule-based approach?
    3. Does end-to-end learning lead to higher rewards?
17. KB-InfoBot Versions
    Belief trackers:
      A. Hand-crafted (Bayesian updates)
      B. Neural (GRU)
    Policy network:
      C. Hand-crafted (entropy minimization)
      D. Neural (GRU)
    KB lookup:
      1. No KB lookup (policy unaware of KB)
      2. Hard-KB lookup (SQL-type lookup)
      3. Soft-KB lookup (KB posterior)
    Agents:
      Rule-based agents: A + C + (1, 2, 3)
      RL-based agents: A + D + (1, 2, 3)
      E2E agent: B + D + (3)
18. Training
    • All agents are trained against a publicly available user simulator (Li et al., 2017)*
    • Optimize the future discounted reward R = Σ_t γ^t r_t
    • RL agent: gradients flow through the policy network only
    • E2E agent: gradients flow through both the policy network and the KB posterior (belief tracker)
    • Credit assignment:
      – The E2E agent always fails with random initialization
      – Imitation learning at the beginning to mimic the rule-based policy
    * https://github.com/MiuLab/TC-Bot
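The update rule itself is not on the slide; the paper trains the agents with the REINFORCE policy gradient, which in its generic form looks like the following sketch (function names and the discount factor are illustrative assumptions):

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over one dialogue
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

def reinforce_gradient(logprob_grads, rewards, gamma=0.99):
    # REINFORCE estimator: sum_t grad log pi(a_t | s_t) * G_t.
    # For the RL agent this gradient reaches only the policy network;
    # for the E2E agent it also flows through the belief tracker and
    # the (differentiable) Soft-KB posterior.
    G = discounted_returns(rewards, gamma)
    return sum(g * G[t] for t, g in enumerate(logprob_grads))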
19. Simulation Results
    • Evaluated on movie-centric KBs: small, medium, large, X-large
    • Metrics:
      – Number of dialogue turns (T)
      – Success rate, i.e. correct movie returned (S)
      – Average reward (R)
    • All agents tuned to maximize average reward
    Findings: Soft-KB > Hard-KB > No-KB; RL > rule-based; E2E performs best.
20. Human Evaluation
    • Setting:
      – Typed interactions
      – Users are given (1) a goal entity and (2) a subset of slot values, with multiple values per slot to model noise
      – Users are free to frame their inputs
    Findings:
    – Soft-KB lookup > Hard-KB lookup (success rate)
    – RL agent > rule-based agent (number of turns)
    – However, the full E2E agent performed worse than the RL-Soft and Rule-Soft agents.
21. Discussion
    • Soft-KB lookup → better dialogue policies
    • E2E agent:
      – Strong performance in simulations
      – Does not transfer to real interactions
      – Overfits to the limited natural language of the simulator
    • Future research: personalized dialogue assistants?
      – Deploy using the RL-Soft agent
      – Collect interactions to train the E2E agent
      – Gradually switch to the E2E agent
22. Thanks for Your Attention!
    Code available: https://github.com/MiuLab/KB-InfoBot
