Analysis of a State Machine-Based
Interactive Dialogue Management System
Ruixiang Qi
Content
´ Research Background
´ Methods
´ Analysis and Evaluation
´ Conclusion
What is a spoken dialogue system?
The Alexa Prize
Evaluated by user ratings
´ Random Users
´ Completely Open Domain
´ Return Users
Dialogue System Architecture
Sports Topic Handler
´ Conversation flow controlling (State Machine)
´ Real-time information-based dialogues (Database)
´ Open Domain Question Answering
Conversation Flow Controlling
´ Content
´ Topic Switching
´ Logic (Multi-turn Conversation)
Data Driven Approach
Pros:
1. Does not require expert knowledge
Cons:
1. Requires a lot of data
2. Not logical (not suitable for multi-turn conversation)
Example:
Scraped 1.3 million conversations from Twitter to generate responses on social media
State Machine Approach
Pros:
1. Can generate multi-turn dialogues (with a database)
Cons:
1. Based on fixed data, structures, and ontologies
2. Not real-time
Example:
A static database within the London restaurant domain and the travel domain.
Related Works
Static Data (Not User-Specific)
Monotone Conversation (Binary)
Does Not Support Returning Users
Emora: Do you like sports?
User: Yes
Emora: What is your favorite sport?
User: Basketball
Emora: Do you like the NBA?
User: Yes
…
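A conversation flow like the one above can be sketched as a small state machine; the states, prompts, and transitions below are illustrative, not the actual Emora implementation:

```python
# Minimal state-machine dialogue sketch (hypothetical states; not the
# actual Emora implementation). Each state maps a user answer to the
# next state; unexpected answers fall back to the "*" transition.
STATES = {
    "ask_sports": {
        "prompt": "Do you like sports?",
        "yes": "ask_favorite",
        "no": "exit",
    },
    "ask_favorite": {
        "prompt": "What is your favorite sport?",
        "basketball": "ask_nba",
        "*": "exit",          # unexpected answer -> leave the topic
    },
    "ask_nba": {
        "prompt": "Do you like the NBA?",
        "yes": "talk_nba",
        "no": "exit",
    },
}

def step(state, user_answer):
    """Return the next state for a user answer, falling back to '*'."""
    transitions = STATES[state]
    return transitions.get(user_answer.lower(), transitions.get("*", "exit"))

state = "ask_sports"
for answer in ["yes", "basketball", "yes"]:
    state = step(state, answer)
print(state)  # -> talk_nba
```

The `"*"` fallback is exactly where such flows break: any answer outside the enumerated transitions abandons the topic, which is the "monotone conversation" limitation the slides point out.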
Real-time information-based dialogues
(a daily-updated database)
´ Talk about real-time information (most recent games)
´ Recommend user-specific topics
Summary of Tables
Connect Database to State Machine
´ Direct look-up:
´ Example: Michael Jordan plays for the Chicago Bulls
´ User attribute table
´ Example: already covered topics (recent games);
return user
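The two look-up patterns can be sketched against a toy SQLite database (the table and field names below are assumptions for illustration, not the actual schema):

```python
# Sketch of wiring a daily-updated database into the state machine.
# Table and field names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT)")
conn.execute("INSERT INTO players VALUES ('Michael Jordan', 'Chicago Bulls')")
conn.execute("CREATE TABLE user_attributes (user_id TEXT, covered_topic TEXT)")

def direct_lookup(player):
    """Direct look-up: fill a slot in the response from the database."""
    row = conn.execute(
        "SELECT team FROM players WHERE name = ?", (player,)).fetchone()
    return f"{player} plays for {row[0]}." if row else None

def already_covered(user_id, topic):
    """User-attribute table: skip topics a returning user has already seen."""
    return conn.execute(
        "SELECT 1 FROM user_attributes WHERE user_id = ? AND covered_topic = ?",
        (user_id, topic)).fetchone() is not None

def mark_covered(user_id, topic):
    conn.execute("INSERT INTO user_attributes VALUES (?, ?)", (user_id, topic))

print(direct_lookup("Michael Jordan"))  # Michael Jordan plays for Chicago Bulls.
mark_covered("u1", "recent_games")
print(already_covered("u1", "recent_games"))  # True
```

Refreshing the `players` table daily is what turns the otherwise static state machine into a source of real-time, user-specific dialogue.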
Static Data (Not User-Specific)
Monotone Conversation (Binary)
Does Not Support Returning Users
Convert open-domain conversation to
closed domain
Yes/No Questions:
Expect positive/negative answers
´ Example:
´ Do you want to know the best defensive player of that game?
´ Do you like NBA?
´ Do you like sports?
´ Pros: Easy to handle
´ Cons: Boring and fixed conversation
Questions with finite possible answers:
Expect many possible answers, but the number of possible answers is finite
´ Example:
´ Who is your favorite NBA player?
´ Which is your favorite NBA team?
´ Pros: Easy to handle
´ Cons: Requires manual work (creating states); the conversation breaks easily
(when the user answers in an unexpected way)
Questions with infinite possible answers
Completely open-domain questions: expect infinitely many possible answers
´ Example:
´ Why do you like Michael Jordan?
´ What do you think about the NBA shutdown due to coronavirus?
In theory, impossible to handle with a state machine
What do you think about the NBA shutdown
due to coronavirus?
´ Disappointed, sad
´ Understandable, it’s for the best
´ Doesn’t care
´ Happy
´ Shocked
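One workaround, implied by the reaction list above, is to fold open-ended answers into a finite set of sentiment buckets so the state machine can branch on them. A minimal keyword-matching sketch (the cue words are illustrative assumptions, not the actual Emora classifier):

```python
# Illustrative keyword matcher that folds open-ended answers into the
# finite reaction buckets above (a workaround, not a full NLU model).
BUCKETS = {
    "disappointed": ["disappointed", "sad", "upset"],
    "understandable": ["understandable", "for the best", "makes sense"],
    "doesn't care": ["don't care", "whatever"],
    "happy": ["happy", "glad", "great"],
    "shocked": ["shocked", "can't believe", "surprised"],
}

def classify(utterance):
    """Return the first bucket whose cue words match, else None."""
    text = utterance.lower()
    for bucket, cues in BUCKETS.items():
        if any(cue in text for cue in cues):
            return bucket
    return None  # unmatched -> fall back to a generic transition

print(classify("I was really sad about it"))    # disappointed
print(classify("It's for the best, honestly"))  # understandable
```

The `None` branch matters: anything the cue lists miss still needs a generic transition, which is why this only approximates, rather than solves, the infinite-answer case.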
Trending News
´ Death of Kobe Bryant
´ Super Bowl
´ Coronavirus outbreak
Better Conversation Flow
´ Disadvantages:
´ Needs expert knowledge in the domain
´ Requires manual work
(developing and maintaining the database, adding trending news)
Static Data (Not User-Specific)
Monotone Conversation (Binary)
Does Not Support Returning Users
´ Answer any fact-based question
´ Does not need to support multi-turn questions
Open Domain Question Answering
Related works
´ Using a knowledge base (Freebase, DBpedia)
´ “Answer Selection Task”
´ “Reading Comprehension Task”
´ “End-to-end” question answering
Knowledge base
´ An encyclopedia designed for machines to read
´ Needs manual work to construct
´ Does not cover all questions (the structure of the database is fixed)
Answer Selection Task
´ Given a question and a set of candidate answers, answer selection is the
task of identifying which of the candidates answers the question correctly.
´ Example: WikiQA : given a question and a set of candidate sentences, the
task is to choose the correct sentence.
Reading Comprehension Task
´ Retrieving the answer to a query from a given paragraph
´ Example: SQuAD (Stanford Question Answering Dataset)
“End-to-end” question answering
´ Retrieving the answer to a query from a very large corpus (the entire
Wikipedia)
´ DrQA (Danqi Chen, 2017)
´ Document Reader
´ Document Retriever
SQuAD goal: Answer a query, given a paragraph
DrQA
QA-on-Wikipedia goal: Retrieve the paragraph from the whole of Wikipedia that is most relevant to the given query
QA-on-a-single-article goal: Retrieve the paragraph from a single Wikipedia article that is most relevant to the given query
´ State of the art (according to the SQuAD website leaderboard):
´ ELECTRA: F1: 91.4, EM: 88.7
´ BERT: F1: 83.1, EM: 80.4
´ Human: F1: 89.5, EM: 86.8
Document Reader (SQuAD)
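The F1 and EM metrics quoted on the leaderboard above can be sketched in a few lines; this is a simplified version of the SQuAD-style scoring (the official scorer additionally strips punctuation and articles before comparing):

```python
# Simplified SQuAD-style metrics: exact match (EM) and token-level F1.
# The official evaluation script also normalizes punctuation/articles.
from collections import Counter

def exact_match(pred, gold):
    """EM: 1 if the lowercased strings are identical, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """F1: harmonic mean of token-level precision and recall."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Chicago Bulls", "chicago bulls"))   # 1
print(token_f1("Chicago Bulls", "the Chicago Bulls"))  # 0.8
```

F1 rewards partial overlaps that EM counts as complete misses, which is why the two leaderboard numbers always differ.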
´ Given an article (10-20 paragraphs), find the paragraph most relevant to the query
´ Dataset: a reformatted version of SQuAD
Document Retriever
´ Old task: given a question, find the best-matching text span in a paragraph
´ (paragraph, question, text span)
´ New task: given a question, find the best-matching paragraph in an article
´ (article, question, paragraph)
´ Possible problem with this dataset: for each query there might be multiple
paragraphs that contain the correct answer, yet only one of them is regarded
as correct.
Term Frequency–Inverse Document Frequency (TF-IDF)
´ The combined score of term frequency and inverse document frequency is
used to calculate the relevance of each query word to each candidate
paragraph
´ Term frequency: the number of occurrences of a term in the paragraph
´ Inverse document frequency: how much information the word provides
(rarer words provide more)
Example
´ D: {Emory is a good university} {Emily is a good college student} {Apple is a
good type of fruit}
´ Query: {which university is good}
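Scoring the toy corpus above can be sketched as follows, using plain counts and a smoothed IDF (DrQA's actual retriever uses a more elaborate weighting with hashed bigram features):

```python
# TF-IDF retrieval sketch for the toy corpus above.
# Plain counts with a smoothed IDF; illustrative, not DrQA's exact weighting.
import math
from collections import Counter

docs = [
    "emory is a good university",
    "emily is a good college student",
    "apple is a good type of fruit",
]
query = "which university is good"

tokenized = [d.split() for d in docs]
N = len(tokenized)

def idf(term):
    """Inverse document frequency: rare terms carry more information."""
    df = sum(term in doc for doc in tokenized)
    return math.log((N + 1) / (df + 1)) + 1   # smoothed to avoid div-by-zero

def score(doc_tokens, query_str):
    """Sum tf * idf over the query terms that appear in the document."""
    tf = Counter(doc_tokens)
    return sum(tf[t] * idf(t) for t in query_str.split() if t in tf)

best = max(range(N), key=lambda i: score(tokenized[i], query))
print(docs[best])  # emory is a good university
```

The common words "is" and "good" appear in every document, so their IDF is low and they barely separate the candidates; the rare word "university" is what pulls the first document to the top.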
Problems with Document retrieval
´ Difficult to find a reasonable threshold for the number of paragraphs to retrieve.
Problems with Document reader
´ High Latency
Conclusion on Open-Domain Question
Answering
´ The method does not fit the context of a chatbot, which has a strict
low-latency requirement
Experiment and Analysis on Sports
Component
´ Consistent positive correlation between module updates and user ratings
´ Consistent negative correlation between bugs and user ratings
Methods
´ Analyzing the logged information
Unweighted user ratings
• The average rating of conversations not
influenced by the sports topic handler at all
and the average rating of those influenced
by it are calculated separately.
• Possible bias: some conversations' ratings
are only partially affected by the sports
component.
Weighted user ratings
• When calculating the user ratings of the
sports topic handler, the proportion of the
conversation handled by the sports topic
handler is used as a weight.
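The two rating schemes can be sketched with toy numbers (the ratings and sports-handled fractions below are hypothetical, not actual log data):

```python
# Sketch of the unweighted vs. weighted rating computation.
# Each conversation: (user rating, fraction of turns handled by sports).
# Toy numbers; the real field names and values come from the logs.
conversations = [
    (5.0, 1.0),   # fully handled by the sports topic handler
    (3.0, 0.5),   # half sports, half other components
    (1.0, 0.1),   # barely touched the sports topic handler
]

# Unweighted: every conversation that touched sports counts equally.
unweighted = sum(r for r, _ in conversations) / len(conversations)

# Weighted: each rating counts in proportion to its sports share,
# reducing the bias from partially-affected conversations.
weighted = (sum(r * w for r, w in conversations)
            / sum(w for _, w in conversations))

print(unweighted)  # 3.0
print(weighted)    # 4.125
```

In this toy case the low rating from the barely-sports conversation drags the unweighted average down, while the weighted average mostly reflects the fully-sports conversation, which is exactly the bias the weighting is meant to correct.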
Sports Rating vs. Non-sports Rating (Jan)
Database update:
extracted team rankings and top team
players from the database
Conversation flow update:
added variations in conversations; bug fixes
Trending news update:
added small talk about Kobe Bryant's death
Unusually low traffic to the sports topic handler
Sports Rating vs. Non-sports Rating (Feb)
Trending news:
support for small talk about the Super Bowl;
conversation about each NBA game
Database-related talk:
new conversation about player age;
fixed wording at the end of the conversation
Support for return-user conversation
Bug in return-user conversation
Better transition to other components
Limited topics covered
Low traffic to the sports topic handler
Unweighted user-ratings vs. Weighted
user-ratings (Feb)
Sports Topic Handler Traffic
Jan 15: Sports added to the Emora suggestion list
Error Analysis
´ 5.0: 2055
´ 4.0-5.0: 1087
´ 3.0-4.0: 1057
´ 2.0-3.0: 1057
´ 1.0-2.0: 1227
Qualitative Error Analysis
´ Situation 1: Uncovered topics.
User: let's talk about sports
Emora: I love sports. My favorite sport is basketball.
What is your favorite sport?
User: volleyball
Emora: That is a great sport. I am sorry, I only watch
basketball. Do you play or watch basketball?
User: No
Qualitative Error Analysis
´ Situation 2: Bugs and unexpected utterances.
User: I watch n. b. a.
Emora: oh sorry, i did not pay attention. could you
clarify which sport you just mentioned?
Qualitative Error Analysis
´ Situation 3: Ignored user response
´ Chatbot: Do you watch nba or nfl ?
´ User: i watch both.
´ Chatbot: oh sorry, I just started to watch sport recently. That is why I
currently could only talk about basketball or football in the sports domain
for now.
Conclusion
´ Associating a state machine with a database can generate real-time,
user-specific, multi-turn dialogues, which boosts the perceived quality
(user ratings) of a chatbot, but it requires manual work and expert
knowledge.
´ End-to-end question answering does not fit the context of a chatbot
because of its high latency and low accuracy.
References
´ See
https://docs.google.com/document/d/1Rh_6AiaJ5vBRyTfZAN3SwbDQHHvY
WMwrVtCb50aaui8/edit?usp=sharing
Thanks!
