
[AAAI 2019 tutorial] End-to-end goal-oriented question answering systems


End-to-end goal-oriented question answering systems
- End-to-End Workflow
- Basic version: Single-Turn Question Answering
- Advanced version: Multi-Turn Question Answering
- LinkedIn real scenarios: LinkedIn Help Center, Analytics Bot


  1. 1. LinkedIn: Deepak Agarwal, Bee-Chung Chen, Qi He, Jaewon Yang, Liang Zhang
  2. 2. 1:30 Introduction 1:40 End-to-End Workflow 1:50 Basic version: Single-Turn Question Answering 2:40 Advanced version: Multi-Turn Question Answering 3:30 break 4:00 Overview of Our Approach at LinkedIn 4:15 LinkedIn Help Center 4:30 Analytics Bot 5:15 Conclusion and Q & A
  4. 4. Knowledge Card
  5. 5. User: Where is AAAI 2019? Agent: Hilton Hawaiian Village Waikiki Beach Resort
  6. 6. User: Where can I have Japanese food there? Agent: Tomi Sushi is a nice Japanese restaurant
  7. 7. User: Please help me make a reservation at Tomi Sushi. Agent: Great. When and how many people?
  8. 8. User: I forgot my password for the registration site. Can you help? Agent: Sure. What is your account id?
  11. 11. Article/paragraph with multiple Question/Answer pairs
  12. 12. Many datasets are available!
  13. 13. Who founded LinkedIn? Reid Hoffman, Allen Blue, ... Knowledge Base (KB) KB Query: (?, IsFounderOf, LinkedIn)
  14. 14. LinkedIn Knowledge Base. Recruiter Assistant: "How should I hire for this AI engineer position?" ... urgency ... skills ... job market ... other considerations ... "These are good candidates ..." Job Seeker Assistant: "Find good jobs for me." ... career goals ... skill gaps ... job markets ... other considerations ... "These are good job positions for you ..."
  18. 18. Where can I have japanese food in the downtown? Natural Language Understanding (NLU) Dialogue State Tracking (DST) Action Generation (Dialogue Policy) Natural Language Generation (NLG) KB, DB, Index Tomi Sushi is a nice japanese restaurant
  19. 19. Natural Language Understanding (NLU) Input: User Utterance Where can I have japanese food in the downtown? (Speech-to-Text) Output: Interpretation Intent: Find_Restaurant Type = Japanese Area = Downtown KB, DB, Index Restaurant types: Japanese, Chinese, … Location: Country: USA State: Hawaii City: Honolulu Area: Downtown
  20. 20. Dialogue State Tracking (DST) Input: Current Interpretation Intent: Find_Restaurant Type = Japanese Area = Downtown Past State Intent: Find_Flight FromCity = San Jose FromState = California FromCountry = US ToCity = Honolulu ToState = Hawaii ToCountry = US Output: State Intent: Find_Restaurant Type = Japanese Area = Downtown City = Honolulu State = Hawaii Country = US KB, DB, Index Restaurant types: Japanese, Chinese, … Location: Country: USA State: Hawaii City: Honolulu Area: Downtown Or, an embedding vector
  21. 21. Action Generation (Dialogue Policy) Output: Action Action: Suggest_Restaurant Type = Japanese Name = Tomi Sushi KB, DB, Index Restaurant Search API Input: (Type, Location) Output: A list of restaurants ranked by their ratings Input: State Intent: Find_Restaurant Type = Japanese Area = Downtown City = Honolulu State = Hawaii Country = US
  22. 22. Natural Language Generation (NLG) Output: System Utterance Tomi Sushi is a nice Japanese restaurant (Text-to-Speech) Or, other UI elements KB, DB, Index Input: Action Action: Suggest_Restaurant Type = Japanese Name = Tomi Sushi Knowledge cards of restaurants Address, phone number, hours Menu, price range, reviews
  23. 23. Where can I have japanese food in the downtown? Sequence-to-Sequence Model with Memory KB, DB, Index Tomi Sushi is a nice Japanese restaurant
  24. 24. Modular approach: Practical End-to-end learning approach: Research, reading comprehension Basic version: Single-turn question answering NLU: Main Focus No DST Action Generation: Rule-Based NLG: Template-Based KB, DB, Index
  25. 25. Modular approach: Practical End-to-end learning approach: Research, reading comprehension Advanced version: Multi-turn question answering NLU: Neural Net DST: Neural Net Dialogue Policy: Neural Net NLG: Neural Net KB, DB, Index
  26. 26. NLU: Main Focus No DST Dialogue Policy: Rule-Based NLG: Template-Based KB, DB, Index Pipelined or Joint Learning Logic Form KB, DB execute rule/template-based action generation Sequence to Sequence Domain Detection Intent Detection Slot Filling ASR Text Audio Text Direct Search
  27. 27. Pipelined or Joint Learning Logic Form KB, DB execute rule/template-based action generation Domain Detection Intent Detection Slot Filling ASR Text Audio Motivations ● Template / slot filling: simple. Practical in a domain-specific goal-oriented Q&A
  28. 28. Domain/Intent detection is a semantic text classification problem. domain/intent detection Select Flight From Airline_Travel_Table ... fill in arguments semantic frame/template
  29. 29. Comparison of classification settings:
  - Classic Sentence Classification: Input: written language sentences. Training data: rich (news articles, reviews, tweets, TREC). State of the art: CNN [Kalchbrenner et al., ACL 2014] [Kim, EMNLP 2014]
  - Query Classification in Search: Input: keywords. Training data: rich (click-through). State of the art: CLSM, LSTM-DSSM [Shen et al., CIKM 2014] [Palangi et al., TASLP 2016]
  - Domain/Intent Detection (Text Classification) in Q&A: Input: spoken language sentences with significant utterance variations. Training data: few (human labels). State of the art: DCN, RNN, BERT (LinkedIn) [Tur et al., ICASSP 2012] [Ravuri and Stolcke, Interspeech 2015]
  30. 30. • 2 questions with different intents - “How was the Mexican restaurant” - “Tell me about Mexican restaurants” • Q1: Can we automatically generate contextual features for entities? Temporal scope of entity Utterance Variations • 2 questions with the same intent - “Show me weekend flights between JFK and SFO” - “I want to fly from San Francisco to New York next Sunday” • Q2: Can they generate the same answer? • Q3: Which question will generate the better answer? • Significant unknown words, unknown syntactic structures for the same semantics • Q4: Can we efficiently expand the training data? • Q5: Can we significantly reduce the need for training data? Lack of training data Deep Neural Networks - Paraphrase - Active Q&A (RL) - Bots Simulator, Domain-independent Grammar - Paraphrase, Character-level Modeling, Transfer Learning
  31. 31. Slot Filling is to extract slot/concept values from the question for a set of predefined slots/concepts. More specifically, it is often modeled as a sequence labeling task with explicit alignment. extract semantic concepts Select Flight From Airline_Travel_Table Where dept_city = “Boston” and arr_city = “New York” and date = today fill in arguments/slots semantic frame/template
  32. 32. Semantic frame • Pre-defined slots/concepts by goal-oriented dialog systems - vs. open-domain dialog systems • Q6: Can we leverage domain knowledge? • Long entity phrases have strong slot dependencies • Q7: Is the model sensitive to slot position? • Q8: Shall we globally assign labels together? Slot dependency Sentence watch star war episode IV a new hope Slots O B-mov I-mov I-mov I-mov I-mov I-mov I-mov • Input and output are of the same length - vs. other sequence labeling tasks (machine translation and speech recognition): the output is of variable length • Q9: Can the slot filling model leverage the “explicit alignment”? Explicit alignment Knowledge-based Model Bidirectional RNN, Slot Language Model, RNN-CRF Attention-Based RNN
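A minimal PyTorch sketch of slot filling as sequence labeling with BIO tags, using a bidirectional LSTM tagger (the vocabulary size, dimensions, and toy token ids are illustrative assumptions, not the tutorial's setup); note that the output has the same length as the input, reflecting the explicit alignment:

```python
import torch
import torch.nn as nn

class BiLSTMSlotTagger(nn.Module):
    """Tag every token with a BIO slot label (e.g., O, B-mov, I-mov)."""
    def __init__(self, vocab_size, num_labels, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids):                     # (batch, seq_len)
        states, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, 2*hidden)
        return self.out(states)                       # per-token label logits

# Toy run for "watch star war episode IV a new hope" (8 tokens, 3 labels)
tagger = BiLSTMSlotTagger(vocab_size=1000, num_labels=3)
token_ids = torch.randint(0, 1000, (1, 8))           # placeholder token ids
gold = torch.tensor([[0, 1, 2, 2, 2, 2, 2, 2]])      # O B-mov I-mov I-mov I-mov I-mov I-mov I-mov
logits = tagger(token_ids)                           # (1, 8, 3): same length as the input
loss = nn.CrossEntropyLoss()(logits.view(-1, 3), gold.view(-1))
```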
  33. 33. Domain Detection Intent Detection Slot Filling Benefits 1. Only 1 model needs to be trained, fine-tuned for multiple tasks, and deployed 2. Tasks enhance each other. For example, if the intent of a sentence is to find a flight, the sentence likely contains the departure and arrival cities, and vice versa 3. Outperform separate models for each task Joint Learning Q10: What is the most effective learning structure of this multi-task learning? Q11: Should we jointly optimize the loss function or not? append intent to the beginning/end of slots 2 different tasks …...
  34. 34. Logic Form KB, DB execute Sequence to Sequence Text Motivations ● Simplify the workflow: directly generate the final action ● Theoretically, can model very complex Q&A Challenges ● Grammar mismatch between question and logic form ● Practically, suffers from the lack of training data for complex Q&A Question ● Q12: How to collect domain-specific training data? Grammar-compliant Methods
  35. 35. KB, DB Direct Search Text Motivations ● Can we retrieve the answer from the KB/DB like querying a search engine? Yes, but: ● Limited to simple Q&A (retrieve a single fact) Challenge ● It is hard to find the answer if the answer is supported by multiple knowledge facts Question ● Q13: Can we build a scalable framework for extracting the answer from multiple knowledge facts? Memory Network
  36. 36. Pipelined or Joint Learning Logic Form KB, DB execute rule/template-based action generation Sequence to Sequence Domain Detection Intent Detection Slot Filling ASR Text Audio Text Direct Search Covered ● Deep Neural Networks for template-based Q&A ● Memory Networks for direct search Q&A ● Reduce the need for training data Not covered ● Sequence to Sequence: comply with the grammar of the target ● Active Q&A (RL)
  38. 38. Comparison of architectures:
  - RNN/LSTM: Advantages: good for variable-length representations such as sequences and long-range info propagation. Problems: the sequentiality prohibits parallelization, sequence-aligned states are wasteful (O(n)) for long-range dependencies, hard to model hierarchy.
  - Seq2Seq: RNN/Gated RNN (LSTM/GRU) is the core of Seq2Seq. Advantages: good for variable-length output. Problems: the same as RNN.
  - Attention: Attention + RNN-based Seq2Seq. Advantages: O(1) for long-range dependency, yields more interpretable models. Problems: the same as RNN.
  - Transformer: Multi-head attention + non-recurrent Seq2Seq. Advantages: O(1) for long-range dependency, yields more interpretable models, easy to parallelize. So far, the state of the art.
  - CNN: Advantages: fits the intuition that most dependencies are local, easy to parallelize, easy to model hierarchy. Problems: path length between positions can be logarithmic for long-range dependency (O(log n)), needs a lot of tricks to model the position relationships (not natural).
  39. 39. [Kalchbrenner et al., ACL 2014, Kim, EMNLP 2014, Shen et al., CIKM 2014] ● 1-d convolution → k-d, 1 CNN layer → multiple CNN layers ● multiple filters: capture various lengths of local contexts for each word ● max pooling → k-Max pooling: retain salient features from a few keywords in global feature vector ● Convolution: produce n-gram features
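A minimal PyTorch sketch of such a CNN sentence classifier for domain/intent detection, with multiple filter widths and max-over-time pooling (filter widths, dimensions, and the toy batch are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Kim-style CNN classifier: multiple filter widths + max-over-time pooling."""
    def __init__(self, vocab_size, num_classes, emb_dim=128, widths=(2, 3, 4), filters=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Each Conv1d with window width w produces w-gram features at every position.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, filters, kernel_size=w) for w in widths])
        self.fc = nn.Linear(filters * len(widths), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)      # (batch, emb_dim, seq_len)
        # Max pooling over time keeps the most salient n-gram feature per filter.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))       # domain/intent logits

model = TextCNN(vocab_size=5000, num_classes=10)
logits = model(torch.randint(0, 5000, (2, 20)))        # 2 sentences of 20 tokens
```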
  40. 40. [Kalchbrenner et al., ACL 2014, Kim, EMNLP 2014, Shen et al., CIKM 2014] ● TREC question classification: CNNs are close to “SVM with careful feature engineering” ● Large window width: long-term dependency ● k-Max pooling maintains relative positions of the most relevant n-grams ● Web query: ○ DSSM < C-DSSM (CLSM) ○ Short text: CNN is slightly better than unigram model Learned 3-gram features: keywords win at the 5 active neurons in max pooling
  41. 41. RNN vs. LSTM (figures from github: 2015-08-Understanding-LSTMs): “The clouds are in the sky” (short-range dependency) vs. “I grew up in France … I speak fluent French” (long-range dependency)
  42. 42. github: 2015-08-Understanding-LSTMs
  43. 43. [Palangi et al., TASLP 2016] ● Memory: become richer (more info) over x-axis ● Input gates: do not update words 3, 7, 9 ● Peephole, forget updates are not too helpful when text is short and memory is initialized as 0 (just do not update) ● Web query: ○ DSSM < C-DSSM (CLSM) < LSTM-DSSM Case: match “hotels in shanghai” with “shanghai hotels accommodation (3) hotel in shanghai discount (7) and reservation (9)” Input gates cell gates (memory)
  44. 44. [Ravuri and Stolcke, Interspeech 2015] When sentence is long: Basic RNN < LSTM When sentence is short: Basic RNN > LSTM
  45. 45. [Sutskever, et al., NIPS 2014] use the last state of the encoder; the model forgets the first part by the time it finishes the whole input → motivates Attention
  46. 46. Decoder focuses on different information from the encoder at every step (weighted sum). [Bahdanau et al., ICLR 2015] Model dependencies without regard to their distance in the input or output sequences. Still sequential processing: cannot parallelize → motivates the Transformer
  47. 47. http://jalammar.github.io/illustrated-transformer/ Novelty: eliminate recurrence
  48. 48. [Vaswani, et al., NIPS 2017] ● Encoded representation of the input as a set of key-value pairs, (K,V) ● In the decoder, ○ The previous output is compressed into a query Q, ○ The next output is the weighted sum of the values, where the weight assigned to each value is determined by the dot-product of the query with all the keys:
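The weighting referred to above is the scaled dot-product attention of [Vaswani, et al., NIPS 2017]; in LaTeX notation:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```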
  49. 49. - positional encoding - residuals http://jalammar.github.io/illustrated-transformer/ - masked self-attention: only attend to earlier positions in the output sequence - Q: from the layer below it - K/V: from the output of the encoder stack
  50. 50. [Vaswani, et al., NIPS 2017] ● Multi-head attention: run the scaled dot-product attention multiple times in parallel ● Why does this work? similar to ensembling The application of the Transformer to classification (domain/intent detection) will be discussed in the LinkedIn Anabot scenario.
  52. 52. Sentence show flights from Boston to New York today Slots O O O B-dept O B-arr I-arr B-date Sentence is today’s New York arrival flight schedule available to see Slots O B-date B-arr I-arr O O O O O O Forward RNN is better: Backward RNN is better: [Mesnil et al., Interspeech 2013]
  53. 53. [Mesnil et al., TASLP 2015] Consider global sequence optimization
  55. 55. [Liu and Lane, Interspeech 2016] attention alignment attention + alignment attention: normalized weighted sum of encoder states, conditioned on the previous decoder state. Carries additional longer-term dependencies (vs. h, which already encodes the whole sentence info) alignment: do not learn alignment from training data for the slot filling task -- a waste, given the explicit alignment; same encoder for 2 decoders
  56. 56. [Liu and Lane, Interspeech 2016] the attention-based bidirectional RNN and the attention-based encoder-decoder performed similarly; the bidirectional RNN is faster
  57. 57. Related work and ideas:
  - [Xu and Sarikaya, ASRU 2013]: CNN features for a CRF optimization framework
  - [Zhang and Wang, IJCAI 2016]: similar to the attention-based RNN, but 1) no attention, 2) CNN contextual layer on top of the input, 3) global label assignment, 4) replace LSTM by GRU
  - [Hakkani-Tür Interspeech 2016]: append the intent to the end of the slots, bidirectional LSTM
  - [Wen et al., CCF 2017]: modeling slot filling at a lower layer and intent detection at a higher layer is slightly better than other variations
  - [Goo et al., NAACL-HLT 2018]: attention gets a higher weight if the slot attention and the intent attention pay more attention to the same part of the input sequence (indicates the 2 tasks have higher correlation)
  - [Wang et al., NAACL-HLT 2018]: optimize the loss functions separately, alternately update the hidden layers of the 2 tasks
  58. 58. Results 1. Most of the models achieved similar results 2. Attention-based RNN (2nd) beats the majority 3. Optimizing the loss functions separately is slightly better, partially because the weights on the joint loss function need fine-tuning [Wang et al., NAACL-HLT 2018]
  60. 60. [Yih et al., ACL 2014] KB, DB. Match the (Question - Entity) pattern against the Relation and the Entity mention against KB entities with CNN similarity models; combine the similarity scores to rank candidate answers
  62. 62. [Weston, ICML 2016] 1) Input module: input KB and questions to memory 2) Generalization module: add new KB/question to next available slot in memory
  63. 63. [Weston, ICML 2016] 3) Output module: return supporting facts based on iterative memory lookups
  64. 64. [Weston, ICML 2016] 4) Response module: score and return objects of the support facts OR all words in the dictionary
  65. 65. [Bordes et al., 2015; Weston, ICML 2016]
  67. 67. [Duboue and Chu-Carroll, HLTC 2006] lexical paraphrase syntactical paraphrase 1. QA is sensitive to small variations in question 2. QA returns different answers for questions that are semantically equivalent 3. Lack of training data to cover all paraphrases Problem 1. Replace user question by the paraphrase canonical form 2. Use MT to generate paraphrases candidates 3. Multiple MTs to enhance diversity 4. Feature-based paraphrase selection 1. Oracle of paraphrase selection: +35% (high reward) 2. Random paraphrase selection: -55% (high risk) 3. A feature-based selection: +0.6% Solution Impact
  68. 68. Definition: paraphrase generation is a sequence-to-sequence modeling problem. Characteristics: 1. Monolingual parallel data is not readily available (vs. bilingual parallel data in MT). Use a pivot language (pairs of MT systems, especially with different methods) 2. Not all of the words or phrases need to be replaced (vs. MT) 3. Hard evaluation (vs. MT) a. MT uses BLEU: translations are scored based on their similarity to the human references b. More difficult to provide human references (canonical forms) in paraphrase generation [Androutsopoulos and Malakasiotis, JAIR 2010]
  69. 69. [Zhao et al., ACL 2009] Utility example: sentence compression 1. Adequacy: {evidently not, generally, completely} preserved meaning 2. Fluency: {incomprehensible, comprehensible, flawless} paraphrase 3. Usability: {opposite to, does not achieve, achieve} the application 1. Jointly likelihood of Paraphrase Tables 2. Trigram language model 3. Application dependent utility score (e.g., similarity to canonical form in “paraphrase generation”) Human evaluation Model 1. Prefer paraphrases which are a part of the canonical form 2. Better than pure MT-based methods 3. Utility score is crucial Analysis
  70. 70. [Cho et al., EMNLP 2014] ● Semantically similar (most are about duration of time in the left figure) ● Syntactically similar (those phrases that are syntactically similar are clustered together)
  72. 72. [Devlin et al., 2018] Needs Reduce the need of a large amount of training data Parallelization OpenAI GPT, a Transformer Decoder Stack (predict next word → LM) BERT, a Transformer Encoder Stack Solutions Transfer learning: pre-train unsupervised LM before fine-tuning it for a supervised task [OpenAI GPT, BERT] Context from both directions is crucial to sentence-level and (esp.) token-level tasks (e.g., slot filling) Bidirectional representation [ELMo, BERT]
  73. 73. http://jalammar.github.io/illustrated-bert/ - Thousands of books - Wikipedia ... Transformer Encoder Stack
  74. 74. [Devlin et al., 2018] IsNext NotNext Next Sentence Prediction does not need human-labeled data either
  75. 75. Paraphrase generation Intent detection Answer Generation
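A minimal sketch of fine-tuning a pre-trained BERT encoder for intent detection with the Hugging Face transformers library (the checkpoint name, label ids, and the two-example batch are assumptions for illustration, not the tutorial's actual setup):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# e.g., a handful of in-scope intents plus one out-of-scope class
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)

batch = tokenizer(["Who founded LinkedIn?", "How is the weather?"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([0, 4])            # hypothetical ids: 0 = Founder, 4 = Out_of_scope

outputs = model(**batch, labels=labels)  # forward pass returns the classification loss
outputs.loss.backward()                  # fine-tune all layers with a small learning rate
```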
  76. 76. Pipelined or Joint Learning Logic Form KB, DB execute rule/template-based action generation Sequence to Sequence Domain Detection Intent Detection Slot Filling ASR Text Audio Text Direct Search Covered ● Deep NNs for template-based Q&A, direct search Q&A, and reducing the need for training data Single-turn Q&A ● Technology is relatively mature but still evolving very fast ● Lays the foundation for complex Q&A infrastructure
  77. 77. Where can I have japanese food in the downtown? Natural Language Understanding (NLU) Dialogue State Tracking (DST) Action Generation (Dialogue Policy) Natural Language Generation (NLG) KB, DB, Index San Jose downtown or Honolulu downtown?
  78. 78. Much more difficult than evaluating a single-turn system ● In a single-turn system, it is easier to label the correct answer to a question ○ Once we have labels, we can do offline evaluation ● In a multi-turn system, it is unclear what is “the correct response” ○ There can be many “successful paths” to achieve the goal ■ Example: Which action is better when the question is ambiguous? ● Ask for clarification ● Answer it based on the best guess and address misunderstanding later ○ It is difficult to label sufficiently many successful paths for offline evaluation
  79. 79. Evaluation by hired human evaluators ● (also used for single-turn systems) ● Expensive ● Difficult to cover all possible scenarios Experiments with end users ● (also used for single-turn systems) ● Sometimes difficult to assess whether a user is satisfied (and need to predict user satisfaction) ● Only available after we launch the product Treating it as a ranking problem ● (also used for single-turn systems) ● For each turn, use the model to rank a set of predetermined possible system utterances and compute precision ● Far from a realistic setting Evaluation based on a user simulator ● Limited by the capability of the simulator ● However, can be used to provide unlimited training and test data
  80. 80. 1. Define the problem space and collect annotated conversations 2. Build a user simulator ● Use annotated conversations to fit the parameters of the simulator ● Collect more annotated conversations based on simulator + crowdsourcing 3. Train a model using supervised learning on annotated conversations 4. Improve the model using reinforcement learning (RL) based on the simulator 5. Test the model with friends and/or hired human evaluators and apply RL 6. Test the model with end users and apply RL
  81. 81. User Goals ● Request & Constraints ● Example: Make a reservation at a Japanese restaurant in San Jose, … and let me know the address Intents / Dialogue Acts ● inform ● request ● confirm_question ● confirm_answer ● greeting ● closing ● multiple_choice ● thanks ● welcome ● deny ● not_sure Slot & Values ● Cuisine Japanese Chinese … ● Rating 5 4 … ● City, State San Jose, CA New York, NY ... request: address, reservation constraints: cuisine = “Japanese”, city = “San Jose”, state = “CA”, rating = “5”, date = “today”, time = “7pm”, number_of_people = “4” Example: https://github.com/xiul-msr/e2e_dialog_challenge
  82. 82. Example annotated conversation (Role | Utterance | Annotation as logical form / semantic frame):
  - User: “Hello, i was wondering if you can book a restaurant for me? Pizza would be good.” | greeting(greeting=hello), request(reservation), inform(food=pizza)
  - Agent: “Sure! How many people are in your party?” | request(number_of_people)
  - User: “Please book a table for 4” | inform(number_of_people=4)
  - Agent: “Great! What city are you dining in?” | request(city)
  - User: “Portland” | inform(city=Portland)
  - Agent: “Ciao Pizza or Neapolitan Pizzeria?” | multiple_choice(restaurant_name=...)
  - User: “What is the price range of Ciao Pizza” | inform(restaurant_name=Ciao Pizza), request(pricing)
  83. 83. 1. Define the problem space and collect annotated conversations 2. Build a user simulator ● Use annotated conversations to fit the parameters of the simulator ● Collect more annotated conversations based on simulator + crowdsourcing 3. Train a model using supervised learning on annotated conversations 4. Improve the model using reinforcement learning (RL) based on the simulator 5. Test the model with friends and/or hired human evaluators and apply RL 6. Test the model with end users and apply RL
  84. 84. User Goal ● Request Rt : e.g., name, address, phone ● Constraint Ct : e.g., { type=bar, drinks=beer, area=central } ● Rt and Ct can change over time t Agenda ● A stack of user actions to be performed ● Generated by a set of probabilistic rules ● Pop to perform a user action ● Push to add future actions in response to the agent’s actions (At = the user’s agenda at turn t)
  85. 85. Probabilistic user model ● Take n user actions ○ Pr(#user actions at this turn | user state) ○ Pop n actions from the user’s agenda ● Receive agent actions ● Update the user’s goal ○ Pr(add constraint S=V | user state, agent actions) ○ Pr(satisfy request X | user state, agent actions) ● Update the user’s agenda ○ Pr(push user action A to the agenda | user state, agent actions)
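A minimal sketch of the agenda-based user simulator described above (the action structures, rules, and probabilities are simplified assumptions):

```python
import random

class AgendaUserSimulator:
    """User goal = constraints + requests; agenda = stack of pending user actions."""
    def __init__(self, constraints, requests):
        self.constraints = dict(constraints)
        # Seed the agenda with the user's request/inform actions (end of list = top of stack).
        self.agenda = [("request", r) for r in requests]
        self.agenda += [("inform", slot, value) for slot, value in constraints.items()]

    def step(self, agent_actions):
        # Update the agenda in response to the agent's actions (simplified rules).
        for act in agent_actions:
            if act[0] == "request":                       # agent asked for a slot value
                slot = act[1]
                self.agenda.append(("inform", slot, self.constraints.get(slot, "dont_care")))
        # Pop a random number of user actions to perform this turn.
        n = random.choice([1, 2])
        return [self.agenda.pop() for _ in range(min(n, len(self.agenda)))]

user = AgendaUserSimulator({"cuisine": "Japanese", "city": "San Jose"}, ["address"])
print(user.step([("request", "cuisine")]))   # e.g., [("inform", "cuisine", "Japanese")]
```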
  86. 86. Rule-Based Agent User Simulator Simulated Conversations Contextual Paraphrasing Crowdsourcing Task #1 Make conversation more natural with coreferences, linguistic variations and shortened sentences (because of the context) Validation Crowdsourcing Task #2 Verify the created paraphrases have the same meaning by consensus of n workers Generate both utterances and annotations Annotated Conversation Paraphrasing and validation tasks are much easier than annotation tasks
  87. 87. After we collect more annotated data, we can improve the user simulator with more advanced models. Example: sequence-to-sequence models learned from annotated conversations. vt : feature vector of the user state and the agent’s action at turn t; a feed-forward neural net over turns 1-3 predicts the user utterance of turn 4
  88. 88. 1. Define the problem space and collect annotated conversations 2. Build a user simulator ● Use annotated conversations to fit the parameters of the simulator ● Collect more annotated conversations based on simulator + crowdsourcing 3. Train a model using supervised learning on annotated conversations 4. Improve the model using reinforcement learning (RL) based on the simulator 5. Test the model with friends and/or hired human evaluators and apply RL 6. Test the model with end users and apply RL
  89. 89. Where can I have japanese food in the downtown? Natural Language Understanding (NLU) Dialogue State Tracking (DST) Action Generation (Dialogue Policy) Natural Language Generation (NLG) KB, DB, Index San Jose downtown or London downtown?
  90. 90. State-of-the-art: A neural network model combining NLU & DST Input: Previous state & conversation State: Example - Belief state of the user’s goal Request: Pr(request: reservation) = 0.2 Pr(request: phone) = 0.1 …. Constraints: Pr(price=cheap) = 0.1 Pr(city=Honolulu) = 0.5 …. Agent: What kind of food at what price? request(food, price) User: I just want something cheap. Can you book a table for me? Output: New state, e.g., Pr(request: reservation) updated to 0.7 and Pr(price=cheap) updated to 0.9
  91. 91. With annotated data, this is a supervised learning problem Input: Previous state & conversation Agent: What kind of food at what price? request(food, price) User: I want something cheap. What are the available food types? Output: Request: Pr(request: reservation) = 0.7 Pr(request: phone) = 0.1 …. Constraints: Pr(price=cheap) = 0.9 Pr(city=Honolulu) = 0.5 …. Label 1 0 Label 1 1
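A minimal sketch of that formulation: score each candidate (slot, value) pair independently from encoded representations of the conversation turn (the encoders are replaced by placeholder vectors, and the dimensions are assumptions):

```python
import torch
import torch.nn as nn

class SlotValueScorer(nn.Module):
    """Pr(slot=value | user utterance, previous agent action), one sigmoid per candidate."""
    def __init__(self, turn_dim, cand_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(turn_dim + cand_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, turn_vec, cand_vec):
        return torch.sigmoid(self.mlp(torch.cat([turn_vec, cand_vec], dim=-1)))

scorer = SlotValueScorer(turn_dim=128, cand_dim=32)
turn_vec = torch.randn(1, 128)   # encoding of "I want something cheap..." + request(food, price)
cand_vec = torch.randn(1, 32)    # encoding of the candidate pair "price=cheap"
belief = {"price=cheap": scorer(turn_vec, cand_vec).item()}   # one entry of the new state
```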
  92. 92. For each (slot, value) pair, e.g. (price, cheap), a CNN over the user utterance and the agent action request(food, price) predicts Pr(price=cheap)
  93. 93. Slot-specific utterance representations (figure)
  94. 94. For each candidate slot=value Predict Pr(price_range=cheap | inform) using global-locally self-attentive encoders
  95. 95. Where can I have japanese food in the downtown? Natural Language Understanding (NLU) Dialogue State Tracking (DST) Action Generation (Dialogue Policy) Natural Language Generation (NLG) KB, DB, Index San Jose downtown or London downtown?
  96. 96. Input: State Request: Pr(request: reservation) = 0.7 Pr(request: phone) = 0.1 …. Constraints: Pr(price=cheap) = 0.9 Pr(city=Honolulu) = 0.5 …. Output: Action confirm(city=Honolulu) Methods: - Rules - Supervised Learning model(state) => intent(slot=value, ...) - Reinforcement Learning
  97. 97. Policy: 𝜋(st ) → at Training Data: State Correct Action s0 greeting(hello) s1 request(city) s2 confirm(city=...) ... ... Example state: st : Pr(request: reserv..) = 0.7 Pr(request: phone) = 0.1 ... Pr(price=cheap) = 0.9 Pr(city=Honolulu) = 0.5 … Or, an embedding vector feature vector greeting(hello) request(food) confirm(city=...) ...
  98. 98. Action template (summary action) 𝜋(st ) confirm(city=San Jose) confirm(city=London) confirm(food=Chinese) confirm(food=Korean) ... ... 𝜋(st ) confirm(city=GET_CITY) confirm(food=GET_FOOD) Action template San Jose London Chinese Korean ... ... API call or rules Action mask - Remove invalid candidate actions based on rules (domain knowledge, common sense) - Example: Don’t recommend a restaurant if location is unknown Don’t make a reservation if the user has not yet selected a restaurant
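A minimal sketch of the action-mask step: filter out action templates that rules declare invalid in the current state before the policy scores them (the state fields and rules are illustrative assumptions):

```python
def mask_actions(state, action_templates):
    """Drop candidate action templates that violate domain rules or common sense."""
    valid = []
    for action in action_templates:
        if action == "recommend_restaurant(GET_NAME)" and not state.get("city"):
            continue   # don't recommend a restaurant if the location is unknown
        if action == "make_reservation(GET_NAME)" and not state.get("selected_restaurant"):
            continue   # don't make a reservation before the user picks a restaurant
        valid.append(action)
    return valid

templates = ["request(city)", "recommend_restaurant(GET_NAME)", "make_reservation(GET_NAME)"]
print(mask_actions({"city": None, "selected_restaurant": None}, templates))
# -> ["request(city)"]
```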
  99. 99. State Tracker: RNN. Input the feature vector of each turn to the RNN, carrying the hidden state over from turn t-1
  100. 100. Where can I have japanese food in the downtown? Natural Language Understanding (NLU) Dialogue State Tracking (DST) Action Generation (Dialogue Policy) Natural Language Generation (NLG) KB, DB, Index San Jose downtown or London downtown?
  101. 101. Natural Language Generation (NLG) Output: System Utterance Tomi Sushi is a nice Japanese restaurant (Text-to-Speech) Input: Action Action: Suggest_Restaurant Type = Japanese Name = Tomi Sushi Basic version: - Template-based approach [Name] is a nice [Type] restaurant How about a [Type] restaurant like [Name] I would recommend a [Type] restaurant like [Name] - Retrieval-based approach Retrieve the most relevant utterance from a large corpus
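A minimal sketch of the template-based approach for this action (the surface templates are taken from the slide; the random selection among them is an assumption):

```python
import random

TEMPLATES = {
    "Suggest_Restaurant": [
        "{Name} is a nice {Type} restaurant",
        "How about a {Type} restaurant like {Name}",
        "I would recommend a {Type} restaurant like {Name}",
    ],
}

def generate(action, **slots):
    """Pick a surface template for the action and fill in the slot values."""
    return random.choice(TEMPLATES[action]).format(**slots)

print(generate("Suggest_Restaurant", Name="Tomi Sushi", Type="Japanese"))
# e.g., "Tomi Sushi is a nice Japanese restaurant"
```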
  102. 102. Natural Language Generation (NLG) Output: System Utterance Tomi Sushi is a nice Japanese restaurant (Text-to-Speech) Input: Action Action: Suggest_Restaurant Type = Japanese Name = Tomi Sushi Advanced version: RNN Decoder (e.g., LSTM) ● Add the action as additional input to each RNN cell ● Use multiple layers of RNN to improve performance ● Use a backward RNN to further improve performance
  103. 103. LSTM cell conditioned on the dialogue act (figure): input gate, output gate, forget gate, plus a reading gate over the dialogue-act vector d0; the reading gate captures the likelihood that a dialogue act has been used in a previous step
  104. 104. 1. Define the problem space and collect annotated conversations 2. Build a user simulator ● Use annotated conversations to fit the parameters of the simulator ● Collect more annotated conversations based on simulator + crowdsourcing 3. Train a model using supervised learning on annotated conversations 4. Improve the model using reinforcement learning (RL) based on the simulator 5. Test the model with friends and/or hired human evaluators and apply RL 6. Test the model with end users and apply RL
  105. 105. Agent Turn t=1 Agent State s0 input x0 action a0 State s1 input x1 update Turn t=0 Agent State s2 input x2 action a1 update Turn t=2 action a2 End USER DST: 𝛿(st-1 , at-1 , xt ) → st Policy: 𝜋(st ) → at Example: at-1 = request(food, price) Example: st-1 : Pr(request: reserv..) = 0.7 Pr(request: phone) = 0.1 ... Pr(price=cheap) = 0.9 Pr(city=Honolulu) = 0.1 ... xt = “I want something cheap. What are the available food types?” at = inform(food, GET...)
  106. 106. Agent State s0 input x0 Agent State s1 input x1 action a0 reward r0 update Time t=0 Agent State s2 input x2 action a1 reward r1 update Time t=1 Time t=2 action a2 reward r2 End USER Supervised Learning: Annotate correct actions Policy: 𝜋(st ) → at -1 -1 20 Example Reward - Each step: -1 - Success: 20 - Failure: 0 Reinforcement Learning: Define the reward
  107. 107. Basic methods - Q-learning: Deep Q-Network - Policy gradient: REINFORCE (Monte-Carlo gradient ascent) Advanced methods - Actor-Critic policy gradient method with experience replay - Deep Dyna-Q & BBQ-Networks - Multi-level reinforcement learning
  108. 108. Value Q𝜋(s, a) = E[ total reward | we start from state s, take action a, and then follow 𝜋 ] Policy: 𝜋(st) → at Q*(s, a) = Q𝜋(s, a) when 𝜋 is the optimal policy Optimal policy 𝜋*(st) = argmaxa Q*(st, a) Bellman equation: Q*(st, at) = E[ rt + 𝛾 maxa Q*(st+1, a) | st, at ]
  109. 109. Goal: Learn Q*(st, at) = E[ rt + 𝛾 maxa Q*(st+1, a) | st, at ] Optimal policy 𝜋*(st) = argmaxa Q*(st, a) Neural Net: Q(s, a | w) → Value, w = model parameters (weights) Q-Learning: Find w that minimizes E[ ( Q(s, a | w) - Q*(s, a) )^2 ] SGD: Use a single sample (st, at, rt, st+1) to compute Q*(st, at); compute the gradient using the sample and do gradient descent
  110. 110. Q*(st, at) = E[ rt + 𝛾 maxa Q*(st+1, a) | st, at ] Deep Neural Net: Q(s, a | w), w = argminw E[ ( Q(s, a | w) - Q*(s, a) )^2 ] While(current state st): take action at = argmaxa Q(st, a | wt) with probability (1 - 𝜀), random otherwise (𝜀-greedy); receive (rt, st+1) and save (st, at, rt, st+1) in replay memory D; sample a mini-batch B from buffer D; update wt based on B
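A minimal sketch of that ε-greedy Deep Q-Network loop (the state/action sizes, the environment interface env.step, and the hyperparameters are assumptions):

```python
import random
from collections import deque
import torch
import torch.nn as nn

NUM_ACTIONS, GAMMA, EPSILON = 10, 0.9, 0.1
q_net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))  # Q(s, . | w)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10000)                        # replay memory D

def dqn_step(state, env):
    # epsilon-greedy action selection
    if random.random() < EPSILON:
        action = random.randrange(NUM_ACTIONS)
    else:
        action = q_net(state).argmax().item()
    reward, next_state, done = env.step(action)     # assumed environment/user-simulator API
    replay.append((state, action, reward, next_state, done))

    # regress Q(s, a) toward r + gamma * max_a' Q(s', a') on a sampled mini-batch
    batch = random.sample(replay, min(32, len(replay)))
    loss = 0.0
    for s, a, r, s2, d in batch:
        target = r if d else r + GAMMA * q_net(s2).max().item()
        loss = loss + (q_net(s)[a] - target) ** 2
    optimizer.zero_grad()
    (loss / len(batch)).backward()
    optimizer.step()
    return next_state, done
```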
  111. 111. Basic methods - Q-learning: Deep Q-Network - Policy gradient: REINFORCE (Monte-Carlo gradient ascent) Advanced methods - Actor-Critic policy gradient method with experience replay - Deep Dyna-Q & BBQ-Networks - Multi-level reinforcement learning
  112. 112. Policy 𝜋(s, a | 𝜃) = Pr(take action a | state s, model parameter 𝜃) Reward
  113. 113. (No reward discounting over time)
  114. 114. Recursion (No reward discounting over time)
  115. 115. Supervised (imitation) learning vs. reinforcement learning (figure): supervised learning maximizes the likelihood of the correct actions; reinforcement learning weights the agent’s own actions by the value of action at
  116. 116. REINFORCE: While( run policy 𝜋(⋅| 𝜃) to generate s0, a0, …, sT, aT ): compute vt = total reward starting from step t (based on this sample run); update 𝜃 based on SGD. That is: do a sample run of the policy s0, a0, …, sT, aT; use this run to compute the sample Q value; compute the gradient using this sample and do gradient ascent
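A minimal sketch of that REINFORCE update for one sampled dialogue (no reward discounting, as on the slide; the network shape and episode format are assumptions):

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))  # logits over 10 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(episode):
    """episode = [(state_tensor, action_id, reward), ...] from one run of the policy."""
    rewards = [r for _, _, r in episode]
    loss = 0.0
    for t, (s, a, _) in enumerate(episode):
        v_t = sum(rewards[t:])                           # total reward starting from step t
        log_prob = torch.log_softmax(policy(s), dim=-1)[a]
        loss = loss - v_t * log_prob                     # gradient ascent on v_t * log pi(a | s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```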
  117. 117. Basic methods - Q-learning: Deep Q-Network - Policy gradient: REINFORCE (Monte-Carlo gradient ascent) Advanced methods - Actor-Critic policy gradient method with experience replay - Deep Dyna-Q & BBQ-Networks - Multi-level reinforcement learning
  118. 118. Can we learn from past data? Importance sampling: weight each past example by Pr(past example | new policy) / Pr(past example | old policy), capped to prevent high variance. Collect data D = { (s0, a0, p0, v0, …, sT, aT, pT, vT) } - pt = Pr(take action at at step t), recorded during data collection. While( sample (s0, a0, p0, v0, …, sT, aT, pT, vT) from D ): update 𝜃 using the importance-weighted gradient
  119. 119. Problem: vt has high variance - Predict vt by a model Q(st, at | w) - Q(st, at | w) also has high variance - Replace Q(st, at | w) by the advantage A(st, at | w) = Q(st, at | w1) - V(st | w2). Actor-Critic loop: While( run policy 𝜋(⋅| 𝜃) to generate s0, a0, …, sT, aT ): save (s0, a0, p0, v0, …, sT, aT, pT, vT) in replay memory D; train w1 and w2 using experience replay (with importance sampling weighting); update 𝜃 using the advantage
  120. 120. Basic methods - Q-learning: Deep Q-Network - Policy gradient: REINFORCE (Monte-Carlo gradient ascent) Advanced methods - Actor-Critic policy gradient method with experience replay - Deep Dyna-Q & BBQ-Networks - Multi-level reinforcement learning
  121. 121. Train a “world model” to predict rewards and user actions M(s, a | wM ) → (reward, user action, terminate or not) Use the world model to generate simulated data Apply Q-learning to simulated data => Planning (The agent thinks about and “plans” for hypothetical scenarios)
  122. 122. For each step: - Serve user based on Q(s, a | wQ ) using 𝜺-greedy - Save experience in D - Update wQ by Q-learning based on a sample from D - Update wM by learning from a sample from D - Update wQ by Q-learning based on simulation (a.k.a. planning) using M(s, a | wM ) and Q(s, a | wQ ) Q(s, a | wQ ) → value M(s, a | wM ) → (reward, user action, terminate or not)
  123. 123. Model the uncertainty of the Deep Q-Network: Q(s, a | w) Bayes-by-Backprop [Blundell et al., 2015] Assume prior w ~ N( 𝛍0, diag( 𝛔0^2 ) ) D = { (si, ai, vi) }, where vi is the observed Q(si, ai) Approximate p(w | D) by q(w | 𝛍, 𝛒) = N( 𝛍, diag( 𝛔^2 ) ) by minimizing the variational objective, where 𝛔 = log(1 + exp( 𝛒 )) Thompson sampling: Draw wt ~ q(w | 𝛍t, 𝛒t) and take action argmaxa Q(st, a | wt)
  124. 124. Draw 𝜼 ~ N(0, 1) for L times Take a minibatch of size M from D Compute the gradient and perform one step of SGD
  125. 125. Basic methods - Q-learning: Deep Q-Network - Policy gradient: REINFORCE (Monte-Carlo gradient ascent) Advanced methods - Actor-Critic policy gradient method with experience replay - Deep Dyna-Q & BBQ-Networks - Multi-level reinforcement learning
  126. 126. Action hierarchy - Action group → individual action Learn K+1 policies - Master policy: 𝜋(state) → action group g - K sub-policies: For each group g, 𝜋g (state) → action a Q(s, g | w) Q(s, a | wg ) Deep Q-Network share some parameters across different groups
  127. 127. E.g., recommend a restaurant E.g., make a reservation
  128. 128. 1. Define the problem space and collect annotated conversations 2. Build a user simulator ● Use annotated conversations to fit the parameters of the simulator ● Collect more annotated conversations based on simulator + crowdsourcing 3. Train a model using supervised learning on annotated conversations 4. Improve the model using reinforcement learning (RL) based on the simulator 5. Test the model with friends and/or hired human evaluators and apply RL 6. Test the model with end users and apply RL
  129. 129. Multi-turn question answering is an active research area How to obtain training data is a key challenge - Simulation + crowdsourcing is a promising direction - Continuously improve the simulator to generate better data Reinforcement learning is promising - It is important to pretrain an RL model on reasonable sample data - Otherwise, it will have a hard time succeeding and will learn to end early - Interesting directions: reduce variance in training, model uncertainty better, leverage hierarchical structure
  130. 130. Our Use Cases of Goal-Oriented Question Answering
  131. 131. Cold-Start
  132. 132. Supervised ML Problems (Feature X, Label Y, Model) vs. QA Problems
  134. 134. NLU: Main Focus No DST Dialogue Policy: Rule-Based NLG: Template-Based KB, DB, Index Pipelined or Joint Learning Logic Form KB, DB execute rule/template-based action generation Sequence to Sequence Domain Detection Intent Detection Slot Filling ASR Text Audio Text Direct Search
  135. 135. “Who founded LinkedIn?” Entity Recognition: “LinkedIn” Intent Detection: “Founder” Slot Filling & Rule/Template-based Action Generation Output: “Reid Hoffman, Allen Blue, ...” Model DB Query: (_, Founder, LinkedIn)
  136. 136. A Set of Seed Questions & Answers Define Problem & Scope “Who founded LinkedIn?” “What jobs do you have?” …... “Who is the founder of LinkedIn?” “Founder of LinkedIn?” “LinkedIn founders” …... Expansion of similar questions that lead to the same answer Sometimes challenging! ● Out-of-scope questions collected from public databases with similar domains ● Transfer learning techniques (e.g. pre-trained embeddings from BERT) help reduce the data volume requirement
  137. 137. ● Entity Recognition: ● Intent detection ● Slot Filling & Rule/Template-based Action generation “Who founded LinkedIn?” Entity Annotation: “LinkedIn” => Company “Who founded LinkedIn?” Intent Annotation: “Founder” “Who founded LinkedIn?” Result Annotation: DB Query: (_, Founder, LinkedIn)
  138. 138. How many software engineers know Java in United States? Title Skill Country ● Semi-CRF Model (Sarawagi and Cohen 2005) ● Deep Neural Network (Lample et al. 2016) Lample et al. 2016
  139. 139. ● Define a set of intents ○ Depend heavily on product design! ● Multi-class classification model ○ Problem: Question => Intent ○ Model: Logistic regression, CNN, RNN, LSTM, … ○ Features: Bag of words, word embeddings, tagged entities, ... ● Out-of-scope intent a must!
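A minimal scikit-learn sketch of such a multi-class intent classifier with an explicit out-of-scope class (the tiny training set is a made-up illustration built from the slides' example questions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

questions = ["Who founded LinkedIn?", "Who is the founder of LinkedIn?",
             "What jobs do you have?", "Good jobs at Google?",
             "How is the weather?"]                      # last one is out of scope
intents = ["Founder", "Founder", "Jobs", "Jobs", "Out_of_scope"]

# Bag-of-words (unigram + bigram) features with a logistic regression classifier.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(questions, intents)
print(clf.predict(["LinkedIn founders"]))                # expected: ["Founder"]
```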
  140. 140. ● “Good jobs at Google?” ○ Call LinkedIn job recommendation engine with company == “google” ● “How many members are in active community?” ○ Convert to SQL query with slot filling and query the database ● “Change my password” ○ Show the article section that contains “how to change the password” step by step ● ...
  141. 141. * Special thanks to Weiwei Guo for providing the material
  144. 144. • Wide Component (keyword matches)
  145. 145. Lift % in Precision@1 vs Control (IR-based):
  - Embedding Similarity (GloVe): -35%
  - Embedding Similarity (LinkedIn data): -8%
  - Deep Model (CNN): +55%
  - Deep + Wide Model: +57%
  146. 146. Lift % vs Control (IR-based):
  - Happy Path Rate: +14%
  - Undesired Path Rate: -18%
  - Search Sessions w/ Clicks on Results: +7%
  ● Happy Path: users who clicked only one search result and then left the help center without creating a case
  ● Undesired Path: users who did the search and went to “contact us” directly without reading any articles from the search results
  147. 147. http://tanerakcok.com/data-driven-product-management-taner-akcok/
  149. 149. • Intent: Definition, Query, Out-of-scope (OOS) • Definition: What is “contributor”? • Query: How many contributors 2 days ago? • OOS: How is the weather? Intent Detection Question2Definition Question2Query • Find the right definition for the definition question • In: What is a contributor? • Out: A contributor is a user who initiates or continues a conversation at LinkedIn • Create a database query from a given question • In: How many contributors yesterday? • Out: SELECT COUNT(DISTINCT mem_id) FROM contributors WHERE date = ‘08-18-2018’;
  150. 150. UI Intent Detection Intent-specific NLU: Question2Definition Presto UI Intent-specific NLU: Question2Query Knowledge Base
  151. 151. Example: “How many contributors messaged last week?” → SELECT COUNT(DISTINCT mem_id) FROM contributors WHERE contribution_type == ‘message’ AND activity_time > 2018-08-13
  152. 152. Challenges: • Expensive to get large training data • Need to onboard new metrics • SQL is hard to canonicalize • “Almost success” does not count. Our Approaches: • Leverage models trained on public data sets • Leverage metadata available in the company • Formulate slot filling for SQL • Show our interpretation and allow users to make minor changes.
  153. 153. • Survey “What question would you ask to Ana?” • 60 seed questions from 20 domain experts • Discovered “Definition” intent from the seed Qs • Selected 20 target metrics (tables and columns) • Defined slot filling problem for SQL query generation Initial User Study Training Data Generation • 60 Seed questions -> 3k (question, answer) pairs • Initial annotation by data scientists • Annotators generate paraphrases • Multiple reviews to get consistent annotation
  154. 154. Do slot filling to map “How many contributors messaged last week?” to SELECT COUNT(DISTINCT mem_id) FROM contributors WHERE contribution_type == ‘message’ AND activity_time > 2018-08-13
  155. 155. Slots for SQL generation (each expressed in English): Metric, Time, Filter, Breakdown. Example: “How many contributors messaged last week?” → Metric: unique_contributors, Filter: contribution_type == ‘message’, Time: last 7 days → SELECT COUNT(DISTINCT mem_id) FROM contributors WHERE contribution_type == ‘message’ AND activity_time > 2018-08-13
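A minimal sketch of assembling the SQL query from the filled slots, mirroring the example above (the metric-to-SQL mapping, table name, and date handling are assumptions, not LinkedIn's actual schema):

```python
from datetime import date, timedelta

# Hypothetical mapping from the Metric slot to a SQL aggregate and source table.
METRICS = {"unique_contributors": ("COUNT(DISTINCT mem_id)", "contributors")}

def slots_to_sql(metric, filters=(), last_n_days=None):
    select_expr, table = METRICS[metric]
    conditions = list(filters)
    if last_n_days is not None:                      # the Time slot, e.g. "last week"
        cutoff = date.today() - timedelta(days=last_n_days)
        conditions.append(f"activity_time > '{cutoff.isoformat()}'")
    sql = f"SELECT {select_expr} FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

# "How many contributors messaged last week?"
print(slots_to_sql("unique_contributors",
                   filters=["contribution_type = 'message'"], last_n_days=7))
```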
  156. 156. Transfer learning
  158. 158. Improvement in accuracy over baseline:
  - Unsupervised (Finding Nearest Neighbor in Training Data): baseline (--)
  - Logistic Regression model with All features: +21%
  - Prod Model: Logistic Regression with Pairwise model training: +26%
  - Prod Model with Top 3 Accuracy: +44%
  160. 160. BERT-style pairwise input (figure): [CLS] How many users contributed ? [SEP] number of contributors. Token embeddings (EHow, Emany, …), positional embeddings (E0 … E9), and segment embeddings (EA for the question, EB for the metric name) are summed into the final embeddings and fed through Transformers to produce the label
  161. 161. Relative Improvement over Production Model:
  - Top1 Accuracy: +11%
  - Top3 Accuracy: +3%
  162. 162. Learning curve (figure): accuracy vs. number of training questions
  167. 167. ● Literature review of Question Answering Systems for ○ Basic Version: Single-Turn Question Answering ○ Advanced Version: Multi-Turn Question Answering ● Our practical lessons learned through 3 LinkedIn use cases ● An area with a lot of potential and challenges, e.g. ○ How to handle cold-start scalably? ○ How to make the model work more reliably? ○ How to have seamless human-like interactions with humans? ○ How to make the model generic enough so that it works for different domains with little effort? ○ …...
  168. 168. ● [Androutsopoulos and Malakasiotis, JAIR 2010] A Survey of Paraphrasing and Textual Entailment Methods ● [Berant and Liang, ACL 2014] Semantic Parsing via Paraphrasing ● [Bordes et al., 2015] Large-scale Simple Question Answering with Memory Networks ● [Buck et al., ICLR 2018] Ask The Right Questions: Active Question Reformulation With Reinforcement Learning ● [Casanueva et al., 2018] Casanueva, Iñigo, et al. "Feudal Reinforcement Learning for Dialogue Management in Large Domains." arXiv preprint arXiv:1803.03232 (2018). ● [Chen et al., KDD Explorations 2017] A Survey on Dialogue Systems: Recent Advances and New Frontiers ● [Chen et al., WWW 2012 – CQA'12 Workshop] Understanding User Intent in Community Question Answering ● [Cho et al., EMNLP 2014] Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation ● [Crook & Marin, 2017] Crook, Paul, and Alex Marin. "Sequence to sequence modeling for user simulation in dialog systems." Proceedings of the 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017). 2017. ● [Dauphin et al., ICLR 2014] Zero-Shot Learning for Semantic Utterance Classification ● [Deng and Yu, Interspeech 2011] Deep Convex Net: A Scalable Architecture for Speech Pattern Classification
  169. 169. ● [Deng et al., SLT 2012] Use Of Kernel Deep Convex Networks And End-to-end Learning For Spoken Language Understanding ● [Duboue and Chu-Carroll, HLTC 2006] Answering the Question You Wish They Had Asked: The Impact of Paraphrasing for Question Answering ● [Goo et al., NAACL-HLT 2018] Slot-Gated Modeling for Joint Slot Filling and Intent Prediction ● [Hakkani-Tür Interspeech 2016] Multi-Domain Joint Semantic Frame Parsing using Bi-directional RNN-LSTM ● [Hashemi et al., QRUMS 2016] Query Intent Detection using Convolutional Neural Networks ● [Hu et al., NIPS 2014] Convolutional Neural Network Architectures for Matching Natural Language Sentences ● [Kalchbrenner et al., ACL 2014] A convolutional neural network for modelling sentences ● [Kavosh & Williams, 2016] Asadi, Kavosh, and Jason D. Williams. "Sample-efficient deep reinforcement learning for dialog control." arXiv preprint arXiv:1612.06000 (2016). ● [Kim, EMNLP 2014] Convolutional Neural Networks for Sentence Classification ● [Kingma & Welling, 2014] Kingma, Diederik P., and Max Welling. "Stochastic gradient VB and the variational auto-encoder." Second International Conference on Learning Representations, ICLR. 2014. ● [Kreyssig et al., 2018] Kreyssig, Florian, et al. "Neural User Simulation for Corpus-based Policy Optimisation for Spoken Dialogue Systems." arXiv preprint arXiv:1805.06966 (2018).
  170. 170. ● [Kwiatkowski et al., EMNLP 2013] Scaling Semantic Parsers with On-the-fly Ontology Matching ● [Lample et al. 2016] Lample, Guillaume, et al. "Neural architectures for named entity recognition." arXiv preprint arXiv:1603.01360 (2016). ● [Lee and Dernoncourt, NAACL 2016] Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks ● [Lipton et al., 2017] Lipton, Zachary, et al. "BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems." arXiv preprint arXiv:1711.05715 (2017). ● [Liu and Lane, Interspeech 2016] Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling ● [Mesnil et al., Interspeech 2013] Investigation of Recurrent-Neural-Network Architectures and Learning Methods for Spoken Language Understanding ● [Mesnil et al., TASLP 2015] Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding ● [Mnih et al., 2015] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529. ● [Mrksic et al., 2018] Mrkšić, Nikola, and Ivan Vulić. "Fully statistical neural belief tracking." arXiv preprint arXiv:1805.11350 (2018). ● [Palangi et al., TASLP 2016] Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval
  171. 171. ● [Peng et al., 2017] Peng, Baolin, et al. "Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning." arXiv preprint arXiv:1704.03084 (2017). ● [Peng et al., 2018] Peng, Baolin, et al. "Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2018. ● [Rajpurkar et al. 2016] Rajpurkar, Pranav, et al. "Squad: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016). ● [Ravuri and Stolcke, Interspeech 2015] Recurrent Neural Network and LSTM Models for Lexical Utterance Classification ● [Reddy et al., TACL 2014] Large-scale Semantic Parsing without Question-Answer Pairs ● [Sarawagi and Cohen 2005] Sarawagi, Sunita, and William W. Cohen. "Semi-markov conditional random fields for information extraction." Advances in neural information processing systems. 2005. ● [Sarikaya et al., ICASSP 2011] Deep Belief Nets For Natural Language Call–routing ● [Schatzmann & Young, 2009] Schatzmann, Jost, and Steve Young. "The hidden agenda user simulation model." IEEE transactions on audio, speech, and language processing 17.4 (2009): 733-747. ● [Serdyuk et al., 2018] Towards End-to-end Spoken Language Understanding ● [Shah et al., 2018] Shah, Pararth, et al. "Building a Conversational Agent Overnight with Dialogue Self-Play." arXiv preprint arXiv:1801.04871 (2018).
  172. 172. ● [Shen et al., CIKM 2014] A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval ● [Shen et al., WWW 2014] Learning Semantic Representations Using Convolutional Neural Networks for Web Search ● [Sutton et al., 2000] Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000. ● [Tur et al., ICASSP 2012] Towards deeper understanding: Deep convex networks for semantic utterance classification ● [Wang et al., ACL 2015] Building a Semantic Parser Overnight ● [Wang et al., NAACL-HLT 2018] A Bi-model based RNN Semantic Frame Parsing Model for Intent Detection and Slot Filling ● [Weisz et al., 2018] Weisz, Gellért, et al. "Sample efficient deep reinforcement learning for dialogue systems with large action spaces." arXiv preprint arXiv:1802.03753 (2018). ● [Wen et al., 2015] Wen, Tsung-Hsien, et al. "Semantically conditioned lstm-based natural language generation for spoken dialogue systems." arXiv preprint arXiv:1508.01745 (2015). ● [Wen et al., CCF 2017] Jointly Modeling Intent Identification and Slot Filling with Contextual and Hierarchical Information ● [Weston, ICML 2016] Memory Networks for Language Understanding, ICML Tutorial 2016
  173. 173. ● [Williams et al., 1988] Williams, R. J. Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, Northeastern University, College of Computer Science. ● [Williams et al., 2017] Williams, Jason D., Kavosh Asadi, and Geoffrey Zweig. "Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning." arXiv preprint arXiv:1702.03274 (2017). ● [Xiao et al., ACL 2016] Sequence-based Structured Prediction for Semantic Parsing ● [Xu and Sarikaya, ASRU 2013] Convolutional Neural Network Based Triangular CRF For Joint Intent Detection And Slot Filling ● [Yan et al., AAAI 2017] Building Task-Oriented Dialogue Systems for Online Shopping ● [Yao et al., Interspeech 2013] Recurrent Neural Networks for Language Understanding ● [Yih et al., ACL 2014] Semantic Parsing for Single-Relation Question Answering ● [Zhang and Wang, IJCAI 2016] A Joint Model of Intent Determination and Slot Filling for Spoken Language Understanding ● [Zhong et al., 2017] Zhong, Victor, Caiming Xiong, and Richard Socher. "Global-Locally Self-Attentive Encoder for Dialogue State Tracking." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2018. ● [Zhong et al., 2017b] Zhong, Victor, Caiming Xiong, and Richard Socher. “Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning.” arXiv preprint arXiv:1709.00103 (2017).
