Towards Automatic Evaluation 
of Multi‐Turn Dialogues:
A Task Design that Leverages 
Inherently Subjective Annotations
Tetsuya Sakai
tetsuyasakai@acm.org
Waseda University, Japan
December 5, 2017@EVIA 2017, NII
TALK OUTLINE
1. Motivation and background
2. A new multi‐turn dialogue task
3. Comparing estimated and gold distributions: 
introducing SNOD
4. EVIA reviewers’ comments
5. Conclusions and future work
Motivation
• You cannot improve what you cannot measure.
• ⇒ To build good task‐oriented, multi‐turn, textual 
dialogue systems, we need good ways to evaluate 
them.
Online evaluation is important but
• Costly and does not scale
• Difficult to compare different systems
• Not repeatable even for the same system
Evaluator
Your overall score is 20%.
NTCIR‐12 STC/NTCIR‐13 STC‐2 
[Shang+16,Shang+17]
• Not task‐oriented
• Single‐turn dialogues only
• Systems treated as response ranking systems, 
evaluated with information retrieval measures
First time in Tokyo!
Make sure you visit the 
Tokyo Skytree
Dialogue Breakdown Detection Challenge 
[Higashinaka+17]
• Not task‐oriented
• DBDC3 is part of DSTC6 (Dialog System Technology 
Challenges 6): December 10, 2017, right after NTCIR‐13!
INPUT: a user‐system multi‐turn dialogue (alternating user and system utterances)
OUTPUT: for a system utterance, an estimated distribution over NB / PB / B
(Did this utterance cause a breakdown (B)? No = NB, Possibly = PB, Yes = B)
EVALUATION: Mean Squared Error and JS‐divergence against the gold distribution
Gold distribution based on 30 annotators, reflecting subjective views
DCH‐1 Chinese dialogue test collection 
[Zeng+17] http://waseda.box.com/DCH‐0‐1
• 3700 Chinese customer‐helpdesk dialogues mined 
from Weibo, with annotations
• English translation available for 10% of DCH‐1 
(more will be available soon)
Example: a customer‐helpdesk dialogue (a sequence of customer posts and helpdesk posts) with two kinds of annotations
Dialogue quality annotations (task accomplishment, customer satisfaction etc.)
Nugget annotations: customer trigger nugget (CNUG0), helpdesk regular nugget (HNUG), helpdesk goal nugget (HNUG*), customer goal nugget (CNUG*)
Nuggets for dialogue evaluation 
[Zeng+17]
TALK OUTLINE
1. Motivation and background
2. A new multi‐turn dialogue task
3. Comparing estimated and gold distributions: 
introducing SNOD
4. EVIA reviewers’ comments
5. Conclusions and future work
NTCIR‐14 STC‐3 (Chinese and English) 
Dialogue Quality subtask
INPUT: a customer‐helpdesk dialogue d ∈ D (a sequence of customer posts and helpdesk posts)
OUTPUT: an estimated probability distribution p of dialogue quality score (e.g. customer satisfaction)
Gold distribution p* based on N annotators reflecting subjective views
M(d): how p differs from p*
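One way the gold distribution p* might be derived from the N annotators is sketched below in Python. The function name `gold_distribution`, the 1-based integer scores, and the five example ratings are illustrative assumptions, not part of the task definition.

```python
from collections import Counter

def gold_distribution(scores, num_bins):
    """Turn N annotators' scores (integers 1..num_bins) into a probability distribution."""
    counts = Counter(scores)
    n = len(scores)
    # One probability per bin: the fraction of annotators who chose that bin.
    return [counts.get(b, 0) / n for b in range(1, num_bins + 1)]

# Hypothetical example: five annotators rate one dialogue on a 3-level scale.
p_star = gold_distribution([1, 1, 2, 3, 1], num_bins=3)
print(p_star)  # [0.6, 0.2, 0.2]
```

Keeping the full distribution, rather than a majority label, is what lets the measure M(d) reward systems that reproduce annotator disagreement.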
NTCIR‐14 STC‐3 (Chinese and English) 
Nugget Detection subtask
INPUT: d ∈ D
OUTPUT: estimated p’s over helpdesk nugget types, for each helpdesk utterance block bH
OUTPUT: estimated p’s over customer nugget types, for each customer utterance block bC
EVALUATION: weighted average of M(bH) and M(bC)
Examples of nugget types [Zeng+17]
TALK OUTLINE
1. Motivation and background
2. A new multi‐turn dialogue task
3. Comparing estimated and gold distributions: 
introducing SNOD
4. EVIA reviewers’ comments
5. Conclusions and future work
How should we compare the 
estimated and gold distributions?
M(d): how p differs from p*
Dialogue quality: ordinal bins
Nugget detection: nominal bins
Variational Distance [Lin91]
(Mean Absolute Error): not suitable
These two systems (X and Y) are equally effective according to Variational Distance, 
but clearly X is better than Y!
(Root) Normalised Sum of Squares
RNSS = 0.1414
RNSS = 0.1732
X is better than Y!
Jensen‐Shannon Divergence (JSD)
JSD = 0.0390
JSD = 0.0490
X is better than Y!
Problems with (R)NSS for ordinal bins
SS = 1^2 + 0^2 + 1^2 = 2
SS = 1^2 + 1^2 + 0^2 = 2
These two are considered equally effective!
Sum across bins: no distance between bins
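The two sums above can be reproduced with a short Python sketch. The three-bin distributions are read off the slide's arithmetic (gold mass on bin 1, estimated mass on bin 3 or on bin 2). Dividing SS by 2 before taking the root is an assumption about how RNSS is normalised, based on 2 being the largest possible SS.

```python
import math

def sum_of_squares(p, p_star):
    """Bin-by-bin squared differences; bin positions play no role."""
    return sum((a - b) ** 2 for a, b in zip(p, p_star))

def rnss(p, p_star):
    """Root normalised sum of squares (assumed normalisation: SS / 2)."""
    return math.sqrt(sum_of_squares(p, p_star) / 2)

gold = [1.0, 0.0, 0.0]   # all gold mass on bin 1
far = [0.0, 0.0, 1.0]    # estimate two bins away
near = [0.0, 1.0, 0.0]   # estimate one bin away

print(sum_of_squares(far, gold), sum_of_squares(near, gold))  # 2.0 2.0
print(rnss(far, gold), rnss(near, gold))                      # 1.0 1.0
```

Both estimates receive the same score even though `near` is closer to the gold bin on an ordinal scale.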
Problems with JSD for ordinal bins
KLD = 1*log2(1/0.5) = 1
KLD = 1*log2(1/0.5) = 1
(pM: the average of the estimated and gold distributions, against which each KLD is computed)
These two are considered equally effective!
Sum across bins: no distance between bins
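A minimal Python sketch of the same point for JSD follows, using base-2 logarithms as on the slide. The three-bin distributions are illustrative and the function names are assumptions.

```python
import math

def kld(p, q):
    """Kullback-Leibler divergence in bits; bins with p(i) = 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def jsd(p, p_star):
    """Jensen-Shannon divergence: mean KLD of each distribution from their average pM."""
    p_m = [(a + b) / 2 for a, b in zip(p, p_star)]
    return (kld(p, p_m) + kld(p_star, p_m)) / 2

gold = [1.0, 0.0, 0.0]   # all gold mass on bin 1
far = [0.0, 0.0, 1.0]    # estimate two bins away
near = [0.0, 1.0, 0.0]   # estimate one bin away

print(jsd(far, gold), jsd(near, gold))  # 1.0 1.0
```

Because each bin is compared only with the same bin in pM, moving the estimated mass from bin 3 to bin 2 leaves the score unchanged.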
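B* = {i | p*(i) > 0}: bins covered by the gold distribution
B = {i | p(i) > 0}: bins covered by the estimated distribution
A: all bins; L = |A|
Examples: B* = {1} with B = {3}; B* = {1} with B = {2}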
Proposal: Order‐aware Divergence (OD) (1)
(a) B* = {1}: |1‐3| = 2, 1^2 = 1, OD = 2
(d) B* = {1}: |1‐2| = 1, 1^2 = 1, OD = 1
(d) better than (a)!
Proposal: Order‐aware Divergence (OD) (2)
(b) bin 2: |1‐2| = 1, (1/3)^2 = 1/9, contribution 1/9
(c) bin 2: |1‐2| = 1, (2/3)^2 = 4/9, contribution 4/9
Proposal: Order‐aware Divergence (OD) (3)
(b) bin 3: |1‐3| = 2, (2/3)^2 = 4/9, contribution 8/9
(c) bin 3: |1‐3| = 2, (1/3)^2 = 1/9, contribution 2/9
Proposal: Order‐aware Divergence (OD) (4)
(b) OD = (1+8)/9 = 1
(c) OD = (4+2)/9 = 2/3
(c) better than (b)!
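The four worked examples can be checked with a small Python sketch. It assumes OD weights each squared difference by its distance from a gold bin and averages over the gold bins B*. The averaging step is an assumption, since every example here has a single gold bin. The function name and the 0-based indexing are illustrative.

```python
def order_aware_divergence(p, p_star):
    """OD of estimate p from gold p_star (sketch; bins indexed from 0)."""
    bins = range(len(p_star))
    gold_bins = [i for i in bins if p_star[i] > 0]  # B*

    def distance_weighted(i):
        # Squared differences, each weighted by its distance from gold bin i.
        return sum(abs(i - j) * (p[j] - p_star[j]) ** 2 for j in bins)

    return sum(distance_weighted(i) for i in gold_bins) / len(gold_bins)

gold = [1.0, 0.0, 0.0]
est_a = [0.0, 0.0, 1.0]       # (a): all mass on bin 3
est_d = [0.0, 1.0, 0.0]       # (d): all mass on bin 2
est_b = [0.0, 1 / 3, 2 / 3]   # (b)
est_c = [0.0, 2 / 3, 1 / 3]   # (c)

print(order_aware_divergence(est_a, gold), order_aware_divergence(est_d, gold))  # 2.0 1.0
print(round(order_aware_divergence(est_b, gold), 4),
      round(order_aware_divergence(est_c, gold), 4))  # 1.0 0.6667
```

Unlike SS and JSD, the score now falls as the estimated mass moves closer to the gold bin.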
Proposal: Order‐aware Divergence (OD) (5)
OD is generally not symmetric but
• Symmetric if B* = B (i.e., estimated and gold 
distributions cover exactly the same bins) 
• Symmetric if |B*| = |B| = 1 (i.e. estimated and gold 
distributions each cover exactly one bin)
• Symmetric OD (SOD) can easily be defined as
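A natural symmetrisation, given here as an assumption rather than as the slide's exact formula, averages OD taken in both directions. The notation OD(p ‖ p*) for "OD of the estimate p from the gold p*" is introduced only for this sketch.

```latex
\mathrm{SOD}(p, p^{*}) = \frac{\mathrm{OD}(p \,\|\, p^{*}) + \mathrm{OD}(p^{*} \,\|\, p)}{2}
```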
(Symmetric) Normalised OD 
(NOD, SNOD)
OD is largest when e.g.
p*(1)=1, p(L)=1
(the two bins are as far apart 
as possible), in which case
OD = (L‐1) * 1^2 = L‐1.
So…
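Dividing by this maximum gives a normalised measure, sketched below in Python. The sketch assumes NOD = OD/(L-1) and SNOD = SOD/(L-1), with SOD taken as the mean of OD in both directions and OD averaged over the bins covered by the reference distribution. Function names and 0-based bin indices are illustrative.

```python
def od(p, ref):
    """OD of p from the reference distribution ref (sketch; bins indexed from 0)."""
    bins = range(len(ref))
    ref_bins = [i for i in bins if ref[i] > 0]
    total = sum(abs(i - j) * (p[j] - ref[j]) ** 2 for i in ref_bins for j in bins)
    return total / len(ref_bins)

def snod(p, p_star):
    """Symmetric normalised OD: mean of both directions, divided by L - 1."""
    num_bins = len(p_star)  # L
    sod = (od(p, p_star) + od(p_star, p)) / 2
    return sod / (num_bins - 1)

gold = [1.0, 0.0, 0.0]
far = [0.0, 0.0, 1.0]    # bins as far apart as possible
near = [0.0, 1.0, 0.0]

print(snod(far, gold), snod(near, gold))  # 1.0 0.5
```

With L = 3, the most distant pair of single-bin distributions scores 1 and an adjacent pair scores 0.5.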
More examples of (S)NOD
In summary, for STC‐3@NTCIR‐14...
• The Dialogue Quality subtask (ordinal bins) can use 
(R)NSS, JSD and SNOD as M(d), the measure that 
compares the estimated and the gold distributions.
• The Nugget Detection subtask (nominal bins) can 
use (R)NSS and JSD as M(b) for each utterance 
block.
TALK OUTLINE
1. Motivation and background
2. A new multi‐turn dialogue task
3. Comparing estimated and gold distributions: 
introducing SNOD
4. EVIA reviewers’ comments
5. Conclusions and future work
Reviewer 1
“The proposed measures assume that the categories 
are unordered. This means that, given an item let us 
say of Category 1 according to the gold, a 
misclassification of such an item into Category 2 or 
Category 3 "weight" the same, i.e., the two errors are 
of equal gravity. This might be true for the Nugget 
Detection Subtask, but I think it is false for the 
Dialogue Quality Subtask, where the L levels are of 
course totally ordered.”
⇒ Thank you for the brilliant comment! This led to 
my design of SNOD.
Reviewer 2
“Since for every dialogue/nugget it is necessary to have a 
relatively big number of annotators, it could make sense 
to talk also a bit about the cost of building such a test 
collection with respect to a more traditional way of 
collecting judgments where only one assessment is 
required.”
⇒ Future work: discuss cost 
“Would not be better to ask to the participants to 
provide probability of an annotator to belong to a 
particular level rather than the number of annotators 
that belong to that level?”
⇒ Yes that’s actually what we are going to do
Reviewer 3
“To me, this seems almost akin to solving the whole 
task of building a good dialog system. If we can 
automatically detect what is a good and bad dialog, 
wouldn't we have the ability to simply create good 
dialogs?”
⇒ That is correct, and hence my proposal.
Evaluating dialogue systems online is costly, does not 
scale, makes cross‐system comparisons difficult, and is not 
repeatable even for the same system. Simple offline 
evaluation offers some advantages.
TALK OUTLINE
1. Motivation and background
2. A new multi‐turn dialogue task
3. Comparing estimated and gold distributions: 
introducing SNOD
4. EVIA reviewers’ comments
5. Conclusions and future work
Conclusions
• Proposed the Dialogue Quality and Nugget 
Detection subtasks to help the progress of 
customer‐helpdesk dialogue systems through 
offline evaluation.
• Designed SNOD (Symmetric Normalised Order‐aware 
Divergence) for comparing probability 
distributions with ordinal bins. It has clear 
advantages over (R)NSS and JSD for ordinal bins.
Future work: STC‐3@NTCIR‐14
• Investigate the properties of SNOD with real data
• Construct the Chinese and English test collections for STC‐3, 
analyse cost
• Run STC‐3 successfully:
Oct‐Dec, 2017 Training data Chinese‐English translation
Jan, 2018 Test data crawling
May‐Jun, 2018 Test data Chinese‐English translation
Jul‐Aug, 2018 Training and test data annotation
Aug 31, 2018 CEMD registrations due
Sep 1, 2018 Training data released
Nov 1, 2018 Test data released
Nov 30, 2018 Run submissions due
Dec 20, 2018 Evaluation results and draft overview released
Now translating the DCH‐1 
Chinese test collection [Zeng+17] 
into English
References
[Higashinaka+17] Overview of Dialogue Breakdown 
Detection Challenge 3, DSTC6, 2017.
[Lin91] Divergence Measures Based on the Shannon 
Entropy. IEEE Transactions on Information Theory 37(1), 
1991.
[Shang+16] Overview of the NTCIR‐12 Short Text 
Conversation Task, NTCIR‐12, 2016.
[Shang+17] Overview of the NTCIR‐13 Short Text 
Conversation Task, NTCIR‐13, 2017.
[Zeng+17] Test Collections and Measures for Evaluating 
Customer‐Helpdesk Dialogues, EVIA 2017.
