Evia2017dialogues

Towards Automatic Evaluation
of Multi‐Turn Dialogues:
A Task Design that Leverages
Inherently Subjective Annotations
Tetsuya Sakai
tetsuyasakai@acm.org
Waseda University, Japan
December 5, 2017@EVIA 2017, NII

TALK OUTLINE
1. Motivation and background
2. A new multi‐turn dialogue task
3. Comparing estimated and gold distributions:
introducing SNOD
4. EVIA reviewers’ comments
5. Conclusions and future work

Motivation
• You cannot improve what you cannot measure.
• ⇒ To build good task‐oriented, multi‐turn, textual
dialogue systems, we need good ways to evaluate
them.

Online evaluation is important but
• Costly and does not scale
• Difficult to compare different systems
• Not repeatable even for the same system
[Figure: an Evaluator saying "Your overall score is 20%."]

NTCIR‐12 STC/NTCIR‐13 STC‐2
[Shang+16,Shang+17]
• Not task‐oriented
• Single‐turn dialogues only
• Systems treated as response ranking systems,
evaluated with information retrieval measures
[Example single‐turn exchange: "First time in Tokyo!" / "Make sure you visit the Tokyo Skytree"]

Dialogue Breakdown Detection Challenge
[Higashinaka+17]
• Not task‐oriented
• DBDC3 is part of DSTC6 (Dialog System Technology
Challenges 6): December 10, 2017, right after NTCIR‐13!
INPUT: a user‐system multi‐turn dialogue (alternating user and system utterances)
OUTPUT: an estimated distribution over NB / PB / B
(Did this utterance cause a breakdown (B)? No / Possibly / Yes)
EVALUATION: Mean Squared Error and JS‐divergence against the gold distribution over NB / PB / B
Gold distribution based on 30 annotators, reflecting subjective views

DCH‐1 Chinese dialogue test collection
[Zeng+17] http://waseda.box.com/DCH‐0‐1
• 3700 Chinese customer‐helpdesk dialogues mined
from Weibo, with annotations
• English translation available for 10% of DCH‐1
(more will be available soon)
[Figure: a customer‐helpdesk dialogue (a sequence of customer posts and helpdesk posts) with two kinds of annotations]
Dialogue quality annotations
(task accomplishment, customer satisfaction etc.)
Nugget annotations:
Customer trigger nugget (CNUG0)
Helpdesk regular nugget (HNUG)
Helpdesk goal nugget (HNUG*)
Customer goal nugget (CNUG*)

Nuggets for dialogue evaluation
[Zeng+17]

NTCIR‐14 STC‐3 (Chinese and English)
Dialogue Quality subtask
INPUT: a customer‐helpdesk dialogue d ∈ D (a sequence of customer posts and helpdesk posts)
OUTPUT: an estimated probability distribution p of dialogue quality score
(e.g. customer satisfaction)
Gold distribution p* based on N annotators reflecting subjective views
M(d): how p differs from p*
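The gold distribution p* described above can be sketched as a simple normalised vote count. This is a minimal illustration, not the task's official procedure: the function name `gold_distribution`, the five‐bin scale from -2 to 2, and the five example ratings are all assumptions made for the example.

```python
from collections import Counter

def gold_distribution(labels, bins):
    """Turn the labels given by N annotators to one dialogue into a
    probability distribution over the given bins (hypothetical helper)."""
    if not labels:
        raise ValueError("at least one annotator label is required")
    counts = Counter(labels)
    n = len(labels)
    # Bins that no annotator chose get probability 0.
    return [counts[b] / n for b in bins]

# Assumed example: N = 5 annotators rate customer satisfaction on bins -2..2.
bins = [-2, -1, 0, 1, 2]
p_star = gold_distribution([0, 1, 1, 2, 1], bins)
print(p_star)  # [0.0, 0.0, 0.2, 0.6, 0.2]
```

A system's estimated distribution p would be a list of the same length, so that M(d) can compare p and p* bin by bin.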

NTCIR‐14 STC‐3 (Chinese and English)
Nugget Detection subtask
INPUT: d ∈ D (a sequence of customer posts and helpdesk posts)
OUTPUT: estimated p’s over helpdesk nugget types, compared with gold by M(bH) for each helpdesk utterance block bH
OUTPUT: estimated p’s over customer nugget types, compared with gold by M(bC) for each customer utterance block bC
Final score: weighted average of the M(bC) and M(bH) values

Examples of nugget types [Zeng+17]

How should we compare the
estimated and gold distributions?
M(d): how p differs from p*
Dialogue quality: ordinal bins
Nugget detection: nominal bins

Variational Distance [Lin91]
(Mean Absolute Error): not suitable
These two systems are equally effective
according to variational distance,
but clearly X is better than Y!

(Root) Normalised Sum of Squares
RNSS = 0.1414
RNSS = 0.1732
X is better than Y!
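The contrast between mean absolute error and RNSS can be sketched as follows. Two things here are assumptions rather than content from the slides: RNSS is taken to be the square root of the sum of squares divided by 2 (its maximum value), and the distributions `gold`, `x` and `y` are hypothetical ones chosen only because they reproduce the two RNSS values shown above. They are not the slide's actual distributions.

```python
import math

def mean_absolute_error(p, p_star):
    """Variational distance (sum of absolute differences) averaged over bins."""
    return sum(abs(a - b) for a, b in zip(p, p_star)) / len(p)

def rnss(p, p_star):
    """Root normalised sum of squares, assuming NSS = SS / 2."""
    ss = sum((a - b) ** 2 for a, b in zip(p, p_star))
    return math.sqrt(ss / 2)

# Hypothetical distributions over five bins (assumed for illustration).
gold = [0.2, 0.2, 0.2, 0.2, 0.2]
x = [0.3, 0.1, 0.3, 0.1, 0.2]  # four small errors of 0.1
y = [0.4, 0.1, 0.1, 0.2, 0.2]  # one larger error of 0.2, two of 0.1

print(round(mean_absolute_error(x, gold), 4), round(mean_absolute_error(y, gold), 4))
print(round(rnss(x, gold), 4), round(rnss(y, gold), 4))  # 0.1414 0.1732
```

Both systems have the same total absolute error, so mean absolute error ties them, while squaring the differences penalises Y's single larger error more heavily.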

Jensen‐Shannon Divergence (JSD)
JSD = 0.0390
JSD = 0.0490
X is better than Y!

Problems with (R)NSS for ordinal bins
SS = 1^2 + 0^2 + 1^2 = 2
SS = 1^2 + 1^2 + 0^2 = 2
These two considered equally effective!
Sum across bins: no distance between bins

Problems with JSD for ordinal bins
KLD = 1*log2(1/0.5) = 1 (against the midpoint distribution pM)
KLD = 1*log2(1/0.5) = 1 (against the midpoint distribution pM)
These two considered equally effective!
Sum across bins: no distance between bins
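The order‐blindness of both measures can be checked with a short sketch. It assumes the standard definition of JSD as the average of the two KL divergences against the midpoint distribution pM, with base‐2 logarithms as in the arithmetic above. The three‐bin distributions `gold`, `far` and `near` are hypothetical one‐hot examples.

```python
import math

def kld(p, q):
    """Kullback-Leibler divergence in bits; terms with p(i) = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def jsd(p, p_star):
    """Jensen-Shannon divergence via the midpoint distribution p_m."""
    p_m = [(a + b) / 2 for a, b in zip(p, p_star)]
    return (kld(p, p_m) + kld(p_star, p_m)) / 2

def sum_of_squares(p, p_star):
    return sum((a - b) ** 2 for a, b in zip(p, p_star))

gold = [1.0, 0.0, 0.0]  # all gold probability mass in bin 1
far = [0.0, 0.0, 1.0]   # estimate puts everything in bin 3
near = [0.0, 1.0, 0.0]  # estimate puts everything in bin 2

print(sum_of_squares(far, gold), sum_of_squares(near, gold))  # 2.0 2.0
print(jsd(far, gold), jsd(near, gold))                        # 1.0 1.0
```

Neither measure can tell that `near` missed by one bin while `far` missed by two, because each sums per‐bin terms without looking at how far apart the bins are.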

Proposal: Order‐aware Divergence (OD) (1)
A: all bins
L = |A|
B* = {i|p*(i)>0}
B = {i|p(i)>0}
Examples: B*={1} with B={3}; B*={1} with B={2}

(a): B*={1}, |1‐3| = 2, 1^2 = 1, OD = 2
(d): B*={1}, |1‐2| = 1, 1^2 = 1, OD = 1
(d) better than (a)!

(b): |1‐2| * (1/3)^2 = 1/9, |1‐3| * (2/3)^2 = 8/9, OD = (1+8)/9 = 1
(c): |1‐2| * (2/3)^2 = 4/9, |1‐3| * (1/3)^2 = 2/9, OD = (4+2)/9 = 2/3
(c) better than (b)!
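The worked OD values above can be reproduced with a small sketch. The formula in `od` is a reconstruction from the arithmetic on these slides, not a definition quoted from them: for each gold bin i in B*, it sums |i - j| * (p(j) - p*(j))^2 over all bins j. Averaging over B* is an additional assumption that these examples do not exercise, since |B*| = 1 in all of them. Bins are indexed from 0 in the code, which leaves the distances |i - j| unchanged.

```python
from fractions import Fraction

def od(p, p_star):
    """Order-aware divergence of p from p_star (reconstructed, see lead-in)."""
    bins = range(len(p))
    gold_bins = [i for i in bins if p_star[i] > 0]  # B*
    if not gold_bins:
        raise ValueError("p_star must have at least one non-empty bin")

    def distance_weighted_ss(i):
        # Squared errors weighted by how far bin j is from gold bin i.
        return sum(abs(i - j) * (p[j] - p_star[j]) ** 2 for j in bins)

    return sum(distance_weighted_ss(i) for i in gold_bins) / len(gold_bins)

def dist(*values):
    """Build an exact distribution from ints, Fractions or strings like '1/3'."""
    return [Fraction(v) for v in values]

gold = dist(1, 0, 0)         # B* = {1}
a = dist(0, 0, 1)            # everything in bin 3
d = dist(0, 1, 0)            # everything in bin 2
b = dist(0, "1/3", "2/3")
c = dist(0, "2/3", "1/3")

print(od(a, gold), od(d, gold))  # 2 1
print(od(b, gold), od(c, gold))  # 1 2/3
```

Exact fractions are used so that the results match the slide arithmetic without floating‐point rounding.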

OD is generally not symmetric but
• Symmetric if B* = B (i.e., estimated and gold
distributions cover exactly the same bins)
• Symmetric if |B*| = |B| = 1 (i.e. estimated and gold
distributions each cover exactly one bin)
• Symmetric OD (SOD) can easily be defined as
SOD(p, p*) = (OD(p || p*) + OD(p* || p)) / 2

(Symmetric) Normalised OD
(NOD, SNOD)
OD is largest when e.g.
p*(1)=1, p(L)=1
(the two bins are as far apart
as possible), in which case
OD = (L‐1) * 1^2 = L‐1.
So…
NOD(p || p*) = OD(p || p*) / (L‐1)
SNOD(p, p*) = SOD(p, p*) / (L‐1)
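A minimal sketch of the symmetric, normalised measure follows. It rests on three assumptions that should be checked against the paper: OD is computed as a distance‐weighted sum of squares averaged over the non‐empty bins of the reference distribution, SOD is the mean of the two directions of OD, and normalisation divides by the maximum value L‐1. The example distributions are hypothetical.

```python
from fractions import Fraction

def od(p, ref):
    """Order-aware divergence of p from ref (assumed form, see lead-in)."""
    bins = range(len(p))
    ref_bins = [i for i in bins if ref[i] > 0]
    total = sum(abs(i - j) * (p[j] - ref[j]) ** 2 for i in ref_bins for j in bins)
    return total / len(ref_bins)

def snod(p, p_star):
    """Symmetric normalised OD: average both directions, divide by L - 1."""
    sod = (od(p, p_star) + od(p_star, p)) / 2
    return sod / (len(p) - 1)

def dist(*values):
    return [Fraction(v) for v in values]

gold = dist(1, 0, 0)
far = dist(0, 0, 1)
near = dist(0, 1, 0)
b = dist(0, "1/3", "2/3")

print(snod(far, gold))   # 1
print(snod(near, gold))  # 1/2
print(od(b, gold), od(gold, b))  # 1 16/9 (OD alone is not symmetric here)
print(snod(b, gold))     # 25/36
```

Under these assumptions SNOD lies between 0 and 1 for the examples shown, reaches 1 when the two distributions sit in the two extreme bins, and gives the same value whichever distribution is passed first.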

In summary, for STC‐3@NTCIR‐14...
• The Dialogue Quality subtask (ordinal bins) can use
(R)NSS, JSD and SNOD as M(d), the measure that
compares the estimated and the gold distributions.
• The Nugget Detection subtask (nominal bins) can
use (R)NSS and JSD as M(b) for each utterance
block.

Reviewer 1
“The proposed measures assume that the categories
are unordered. This means that, given an item let us
say of Category 1 according to the gold, a
misclassification of such an item into Category 2 or
Category 3 "weight" the same, i.e., the two errors are
of equal gravity. This might be true for the Nugget
Detection Subtask, but I think it is false for the
Dialogue Quality Subtask, where the L levels are of
course totally ordered.”
⇒ Thank you for the brilliant comment! This led to
my design of SNOD.

Reviewer 2
“Since for every dialogue/nugget it is necessary to have a
relatively big number of annotators, it could make sense
to talk also a bit about the cost of building such a test
collection with respect to a more traditional way of
collecting judgments where only one assessment is
required.”
⇒ Future work: discuss cost
“Would not be better to ask to the participants to
provide probability of an annotator to belong to a
particular level rather than the number of annotators
that belong to that level?”
⇒ Yes that’s actually what we are going to do

Reviewer 3
“To me, this seems almost akin to solving the whole
task of building a good dialog system. If we can
automatically detect what is a good and bad dialog,
wouldn't we have the ability to simply create good
dialogs?”
⇒ That is correct, and hence my proposal.
Evaluating dialogue systems online is costly; does not
scale; makes cross‐system comparisons difficult; not
repeatable even for the same system. Simple offline
evaluation offers some advantages.

Conclusions
• Proposed the Dialogue Quality and Nugget
Detection subtasks to help the progress of
customer‐helpdesk dialogue systems through
offline evaluation.
• Designed SNOD (Symmetric Normalised Order‐aware
Divergence) for comparing probability
distributions with ordinal bins. It has clear
advantages over (R)NSS and JSD for ordinal bins.

Future work: STC‐3@NTCIR‐14
• Investigate the properties of SNOD with real data
• Construct the Chinese and English test collections for STC‐3,
analyse cost
• Run STC‐3 successfully:
Oct‐Dec, 2017: Training data Chinese‐English translation
(now translating the DCH‐1 Chinese test collection [Zeng+17] into English)
Jan, 2018: Test data crawling
May‐Jun, 2018: Test data Chinese‐English translation
Jul‐Aug, 2018: Training and test data annotation
Aug 31, 2018: CEMD registrations due
Sep 1, 2018: Training data released
Nov 1, 2018: Test data released
Nov 30, 2018: Run submissions due
Dec 20, 2018: Evaluation results and draft overview released

References
[Higashinaka+17] Overview of Dialogue Breakdown
Detection Challenge 3, DSTC6, 2017.
[Lin91] Divergence Measures Based on the Shannon
Entropy. IEEE Transactions on Information Theory 37(1),
1991.
[Shang+16] Overview of the NTCIR‐12 Short Text
Conversation Task, NTCIR‐12, 2016.
[Shang+17] Overview of the NTCIR‐13 Short Text
Conversation Task, NTCIR‐13, 2017.
[Zeng+17] Test Collections and Measures for Evaluating
Customer‐Helpdesk Dialogues, EVIA 2017.
