1. Evaluating Helpdesk Dialogues:
Initial Considerations from An
Information Access Perspective
Tetsuya Sakai (Waseda University)
Zhaohao Zeng (Waseda University)
Cheng Luo (Tsinghua University/Waseda University)
tetsuyasakai@acm.org
September 29, 2016
@IPSJ SIGNL (unrefereed), Osaka.
2. TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
5. Motivation (3)
• We cannot conduct a subjective evaluation for every dialogue that we
want to evaluate. We want an automatic evaluation method that
approximates subjective evaluation.
• Build a human-human helpdesk dialogue test collection with both
subjective annotations (target variables) and clues for automatic
evaluation (explanatory variables).
• Using the test collection, design and verify automatic evaluation
measures that approximate subjective evaluation.
• One step beyond: human-system dialogue evaluation based on the
human-human dialogue test collection.
6. TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
7. Evaluating non-task-oriented dialogues (1)
• Evaluating conversational responses
Discriminative BLEU [Galley+15] extends the machine-translation
measure BLEU to incorporate +/- weights for human references (=gold
responses) to reflect different subjective views.
• Dialogue Breakdown Detection Challenge [Higashinaka+16]
Find a point in dialogue where it becomes impossible to continue due
to system’s inappropriate utterances.
System’s output: a probability distribution over NB (not a breakdown),
PB (possible breakdown), or B (breakdown), which is compared against
a gold distribution.
8. Evaluating non-task-oriented dialogues (2)
• Evaluating the Short Text Conversation Task [Sakai+15AIRS,Shang+16]
Human-system single-turn dialogues by searching a repository of past
tweets. Ranked lists evaluated with information retrieval measures.
[Diagram: a repository of old post / old comment pairs, plus training data and test data of new posts. For each new post, retrieve and rank old comments! Graded label (L0-L2) for each comment.]
9. Evaluating task-oriented dialogues (1)
• PARADISE [Walker+97]
Task: Train timetable lookup
User satisfaction = f(task success, cost)
Attribute-value matrix (depart-city=?, arrival-city=?, depart-time=?...)
• Spoken Dialogue Challenge [Black+09]
Task: Bus timetable lookup
Live evaluation by calling systems on the phone
• Dialogue State Tracking Challenge [Williams+13,Kim+16]
Task: Bus timetable lookup
Evaluation: at each time t, the system outputs a probability distribution over
possible dialogue states (e.g. different bus routes), which is compared with a gold
label.
Closed-domain, slot filling tasks
10. Evaluating task-oriented dialogues (2)
• Subjective Assessment of Speech System Interfaces (SASSI) [Hone+00]
Task: In-car speech interface
Factor analysis of questionnaires revealed the following as key factors for subjective
assessment:
- system response accuracy
- likeability
- cognitive demand
- annoyance
- habitability
- speed
• SERVQUAL [Hartikainen+04]
Task: Phone-based email application
Closed-domain, slot filling tasks
11. Evaluating task-oriented dialogues (3)
• Response Selection [Lowe+15]
Ubuntu corpus containing “artificial” dyadic dialogues.
Task: Ubuntu Q&A: most similar to ours, with no pre-defined slot filling schemes
Response selection task:
[Diagram: candidate pairs of (previous dialogue context, response): the correct response from the original dialogue, plus incorrect responses taken from other dialogues, ...]
Given the context, can the system choose the correct response from 10 choices?
12. Evaluating textual information access (1)
[Sakai15book]
• ROUGE for summarisation evaluation [Lin04]
Recall and F-measure based on n-grams and skip bigrams.
Requires multiple reference summaries.
• Nugget pyramids and POURPRE for QA [Lin+06]
• Nugget definition at TREC QA: “a fact for which the assessor could
make a binary decision as to whether a response contained that
nugget.”
Nugget recall, allowance-based nugget precision, nugget F-measure.
POURPRE: replaces manual nugget matching with automatic nugget
matching based on unigrams.
Text is regarded as a set of small textual units
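The nugget scores above can be sketched as follows. This is a simplified illustration only (a single nugget class, a 100-character allowance per matched nugget, and β=3 as in early TREC QA), not the exact TREC definition, which distinguishes vital from okay nuggets.

```python
# Simplified sketch of TREC-style nugget scores (after [Lin+06]):
# nugget recall, allowance-based nugget precision, and nugget F-measure.
def nugget_scores(num_matched, num_total, response_length, beta=3.0,
                  allowance_per_nugget=100):
    recall = num_matched / num_total
    # The response is not penalised until it exceeds the length allowance.
    allowance = allowance_per_nugget * num_matched
    if response_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (response_length - allowance) / response_length
    if precision + recall > 0:
        f = ((beta**2 + 1) * precision * recall) / (beta**2 * precision + recall)
    else:
        f = 0.0
    return recall, precision, f
```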
13. Evaluating textual information access (2)
[Sakai15book]
• S-measure [Sakai+11CIKM]
A measure for query-focussed summaries, introduces a decay function
over text, just as nDCG uses a decay function over ranks.
• T-measure [Sakai+12AIRS]
Nugget-precision that can handle different allowances for different
nuggets.
• U-measure [Sakai+13SIGIR]
A generalisation of S, which works for any textual information access
tasks, including web search, summaries, sessions etc.
14. Building trailtexts for U-measure (1)
[Diagram: nonlinear traversal of a ranked result list. Trailtext: <Sentence A> <Sentence Z>; e.g. <Rank 1 snippet> <Rank 2 snippet> <Rank 2 full text> <Rank 1 full text>.]
15. Building trailtexts for U-measure (2)
[Diagram: Trailtext: <News 1> <Ad 2> <Blog 1>; e.g. <Rank 1 snippet> <Rank 2 snippet> <Rank 1’ snippet> <Rank 1’ full text>.]
17. Advertisement:
http://sigir.org/sigir2017/
Jan 17: full paper abstracts due
Jan 24: full papers due
Feb 28: short papers and demo proposals due
Aug 7: tutorials and doctoral consortium
Aug 8-10: main conference
Aug 11: workshops
The first ever SIGIR in Japan!
18. TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
19. Project overview
(1) Construct a pilot Chinese human-human dialogue test collection
with subjective labels, nuggets and English translations.
(2) Design nugget-based evaluation measures and investigate the
correlation with subjective measures.
(3) Revise criteria for subjective and nugget annotations, as well as the
measures
(4) Construct a larger test collection with subjective labels, nuggets and
English translations. Re-investigate the correlations.
(5) Release the finalised test collection with code for computing the
measures.
[Diagram labels: φ1, φ2]
20. Subjective labels = target variables
Possible axes for subjective annotation:
- Is the task clearly stated, and is it actually accomplished?
- How efficiently is the task accomplished through the dialogue?
- Is Customer satisfied with the dialogue, and to what degree?
Interlocutor viewpoints:
Customer’s viewpoint: Solve my problem efficiently, but I’m giving you
minimal information about it.
Helpdesk’s viewpoint: Solve Customer’s problem efficiently, as time is
money for the company.
The two viewpoints may be weighted depending on practical needs.
21. Why nuggets?
• Subjective labels tell us about the quality of the entire dialogue, but not about why.
• Helpdesk dialogues lack pre-defined slot filling schemes.
• Subjective scores (gold standard) = f(nuggets) ?
• Parts-Make-The-Whole Hypothesis: The overall quality of a helpdesk dialogue is governed by the quality of its parts.
[Diagram: a Customer (C) / Helpdesk (H) dialogue; overall quality (subjective) = f(nuggets).]
22. Nugget annotation vs subjective annotation
• Consistency Hypothesis: Nugget annotation achieves higher inter-annotator consistency. (Smaller units = reduces subjectivity and variations in annotation procedure)
• Sensitivity Hypothesis: Nugget annotation enables finer distinctions among different dialogues. (Nuggets = details)
• Reusability Hypothesis: Nugget annotation enables us to predict the quality of unannotated dialogues more accurately.
[Diagram: two different dialogues for the same task, one WITH and one WITHOUT annotations; nuggets from the annotated dialogue are reused.]
23. Unique features of nuggets for dialogue
evaluation
• A dialogue involves Customer and Helpdesk (not one search engine
user) – two types of nuggets
• Within each nugget type, nuggets are not homogeneous
- Special nuggets that identify the task (trigger nuggets)
- Special nuggets that accomplish the task (goal nuggets)
- Regular nuggets
25. Possible requirements for nugget-based evaluation measures (1)
(a) Highly correlated with subjective labels. (Validates the Parts-Make-The-Whole Hypothesis)
(b) Easy to compute and to interpret.
(c) Accommodate Customer’s and Helpdesk’s viewpoints and change the balance if required.
(d) Accommodate nugget weights (i.e., importance).
(e) For a given task, prefer a dialogue that accomplishes it over one that does not.
(f) Given two dialogues containing the same set of nuggets for the same task, prefer the shorter one.
(g) Given two dialogues that accomplish the same task, prefer the one that reaches task accomplishment more quickly.
[Diagram, illustrating (e): two dialogues for the same task; the one containing a goal nugget is preferred (>) over the one with no goal nuggets.]
26. Possible requirements for nugget-based evaluation measures (2)
(a) Highly correlated with subjective labels. (Validates the Parts-Make-The-Whole Hypothesis)
(b) Easy to compute and to interpret.
(c) Accommodate Customer’s and Helpdesk’s viewpoints and change the balance if required.
(d) Accommodate nugget weights (i.e., importance).
(e) For a given task, prefer a dialogue that accomplishes it over one that does not.
(f) Given two dialogues containing the same set of nuggets for the same task, prefer the shorter one.
(g) Given two dialogues that accomplish the same task, prefer the one that reaches task accomplishment more quickly.
[Diagram, illustrating (f): two dialogues for the same task; the shorter one is preferred (>).]
27. Possible requirements for nugget-based evaluation measures (3)
(a) Highly correlated with subjective labels. (Validates the Parts-Make-The-Whole Hypothesis)
(b) Easy to compute and to interpret.
(c) Accommodate Customer’s and Helpdesk’s viewpoints and change the balance if required.
(d) Accommodate nugget weights (i.e., importance).
(e) For a given task, prefer a dialogue that accomplishes it over one that does not.
(f) Given two dialogues containing the same set of nuggets for the same task, prefer the shorter one.
(g) Given two dialogues that accomplish the same task, prefer the one that reaches task accomplishment more quickly.
[Diagram, illustrating (g): two dialogues for the same task, each containing a goal nugget; the one reaching its goal nugget earlier is preferred (>).]
28. Possible requirements for nugget-based evaluation measures (4)
(a) Highly correlated with subjective labels. (Validates the Parts-Make-The-Whole Hypothesis)
(b) Easy to compute and to interpret.
(c) Accommodate Customer’s and Helpdesk’s viewpoints and change the balance if required.
(d) Accommodate nugget weights (i.e., importance).
(e) For a given task, prefer a dialogue that accomplishes it over one that does not.
(f) Given two dialogues containing the same set of nuggets for the same task, prefer the shorter one.
(g) Given two dialogues that accomplish the same task, prefer the one that reaches task accomplishment more quickly.
29. After completing the project...
Evaluating human-system dialogues
[Diagram: the human-human dialogue test collection (with subjective and nugget annotations) is utilised as an unstructured knowledge base. A dialogue with subjective and nugget annotations is sampled; given its task, Participant initiates a human-system dialogue for the same task, using Participant’s own expressions. Participant terminates the dialogue as soon as he receives an incoherent or a breakdown-causing utterance from System.]
Can System still provide the goal nuggets?
How does human-system UCH compare with human-human UCH?
30. TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
31. Dialogue mining (100% done)
Pilot data containing 234 Customer-Helpdesk dialogues obtained as
follows:
1. Collect an initial set of Weibo accounts A0 by searching account
names with keywords such as assistant and helper (in Chinese).
2. For each account in A0, crawl 200 most recent posts that mention
that account using “@”. Filter accounts that did not respond to
more than 50% of the posts. Let the set of “active” accounts be A.
3. For each account in A, crawl 2000 most recent posts that mention
that account, and then extract those with at least 5 Customer posts
AND at least 5 Helpdesk posts.
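The account-filtering step (2) above can be sketched as follows; `crawl_posts` and `responded` are hypothetical stand-ins for the actual Weibo crawling and reply-matching code, and the "responded to at least 50% of the posts" reading of the filter is an assumption.

```python
# Sketch of step 2: keep "active" accounts, i.e. those that responded to
# at least 50% of the 200 most recent posts mentioning them via "@".
# crawl_posts and responded are hypothetical stand-ins for real crawling code.
def active_accounts(accounts, crawl_posts, responded):
    active = []
    for acc in accounts:
        posts = crawl_posts(acc, limit=200)  # posts mentioning acc
        if posts and sum(responded(acc, p) for p in posts) / len(posts) >= 0.5:
            active.append(acc)
    return active
```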
34. Nugget definition (for annotators)
• A post: a piece of text input by Customer/Helpdesk who presses
ENTER to upload it on Weibo.
• A nugget:
(I) is a post, or a sequence of consecutive posts by the same
interlocutor.
(II) can neither partially nor wholly overlap with another nugget.
(III) should be minimal: it should not contain irrelevant posts at
start/end/middle.
(IV) helps Customer transition from Current State towards Target State.
35. Nugget types (for annotators)
CNUG0: Customer trigger nuggets. Define Customer’s initial problem.
CNUG: Customer regular nuggets.
HNUG: Helpdesk regular nuggets.
CNUG*: Customer goal nuggets. Customer tells Helpdesk that the
problem has been solved.
HNUG*: Helpdesk goal nuggets. Helpdesk provides customer with a
solution to the problem.
Nuggets annotated for 40/234=17% of the dialogues
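The annotation scheme above can be sketched as a small data structure: a nugget is a span of consecutive posts by one interlocutor, carries one of the five types, and may not overlap another nugget (rule II). The `Nugget` class and `valid()` checker are illustrative, not the actual annotation tooling.

```python
# Minimal sketch of nugget annotations as typed, non-overlapping spans
# of consecutive posts (rule II of the nugget definition).
from dataclasses import dataclass

NUGGET_TYPES = {"CNUG0", "CNUG", "HNUG", "CNUG*", "HNUG*"}

@dataclass(frozen=True)
class Nugget:
    start: int   # index of the first post in the span
    end: int     # index of the last post (inclusive)
    ntype: str   # one of NUGGET_TYPES

def valid(nuggets):
    """True iff no two nuggets overlap, partially or wholly."""
    spans = sorted((n.start, n.end) for n in nuggets)
    return all(prev_end < nxt_start
               for (_, prev_end), (nxt_start, _) in zip(spans, spans[1:]))
```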
36. TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
37. Pilot measures for dialogue evaluation
• U-measure [Sakai+13SIGIR]
Trailtext = concatenation of all texts that the search engine user has read
• UCH (U computed based on Customer’s and Helpdesk’s nuggets)
Trailtext = dyadic dialogue
- UC = U computed based on Customer’s nuggets (Helpdesk’s viewpoint)
- UH = U computed based on Helpdesk’s nuggets (Customer’s viewpoint)
UCH = (1-α) UC + α UH
40. UCH = (1-α) UC + α UH
• When α=0.5, UCH is the U-measure that places the two graphs on top of each other.
• The weight of the goal nugget is set higher than the sum of the others.
• Normalisation? Unnecessary if score standardisation is applied [Sakai16ICTIR,Sakai16AIRS].
[Diagram label: maximum tolerable dialogue length]
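UCH can be sketched in a few lines, assuming the U-measure form of [Sakai+13SIGIR]: each nugget contributes its weight discounted by a linear decay over the trailtext, D(pos) = max(0, 1 - pos/L), where L is the maximum tolerable dialogue (trailtext) length. Nugget positions and weights are assumed given by the annotations.

```python
# Sketch of UCH under U-measure's linear position decay over the trailtext:
# U = sum over nuggets of weight * max(0, 1 - pos/L).
def u_measure(nuggets, L):
    """nuggets: iterable of (position_in_trailtext, weight) pairs."""
    return sum(w * max(0.0, 1.0 - pos / L) for pos, w in nuggets)

def uch(customer_nuggets, helpdesk_nuggets, L, alpha=0.5):
    """UCH = (1-alpha) * UC + alpha * UH."""
    return ((1 - alpha) * u_measure(customer_nuggets, L)
            + alpha * u_measure(helpdesk_nuggets, L))
```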
41. Possible variants
• Use different decay functions for Customer and Helpdesk
• Use time rather than trailtext as the basis for discounting, as in Time-Biased Gain [Smucker+12]
+: the gap between the timestamps of two posts can be quantified
-/+: cannot quantify the amount of information conveyed in each post expressed in a particular language / language independence
But remember Requirement (b): measures should be easy to compute and to interpret.
[Diagram label: maximum tolerable dialogue duration]
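The time-based variant could be sketched analogously; the linear decay over elapsed time, capped by a maximum tolerable duration T, is an assumption in the spirit of Time-Biased Gain, not a definition from the talk.

```python
# Sketch of a time-based variant: discount each nugget by the elapsed time
# of the post carrying it, rather than by its trailtext position.
def u_time(nuggets, T):
    """nuggets: iterable of (seconds_since_dialogue_start, weight) pairs;
    T: maximum tolerable dialogue duration in seconds."""
    return sum(w * max(0.0, 1.0 - t / T) for t, w in nuggets)
```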
43. TALK OUTLINE
1. Motivation
2. Related Work
3. Project Overview
4. Pilot Test Collection: Progress Report
5. Pilot Evaluation Measures
6. Summary and Future Work
44. Conclusions and future work (1)
(1) Construct a pilot Chinese human-human dialogue test collection
with subjective labels, nuggets and English translations.
(2) Design nugget-based evaluation measures and investigate the
correlation with subjective measures.
(3) Revise criteria for subjective and nugget annotations, as well as the
measures
(4) Construct a larger test collection with subjective labels, nuggets and
English translations. Re-investigate the correlations.
(5) Release the finalised test collection with code for computing the
measures.
[Diagram labels: φ1 (done), φ2]
45. Conclusions and future work (2)
After φ2: evaluating human-system dialogues.
[Diagram (as on slide 29): the human-human dialogue test collection with subjective and nugget annotations is utilised as an unstructured knowledge base. A sampled annotated dialogue provides the task; Participant initiates a human-system dialogue for the same task, using Participant’s own expressions, and terminates the dialogue as soon as he receives an incoherent or a breakdown-causing utterance from System.]
Can System still provide the goal nuggets?
How does human-system UCH compare with human-human UCH?
47. Selected References (1)
[Black+09] The Spoken Dialogue Challenge, Proceedings of SIGDIAL 2009
[Galley+15] ΔBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets, Proceedings of ACL 2015.
[Higashinaka+16] The Dialogue Breakdown Detection Challenge: Task Description, Datasets, and Evaluation Metrics,
Proceedings of LREC 2016.
[Hartikainen+04] Subjective Evaluation of Spoken Dialogue Systems Using the SERVQUAL Method, Proceedings of INTERSPEECH 2004-ICSLP.
[Hone+00] Towards a Tool for the Subjective Assessment of Speech System Interfaces (SASSI), Natural Language Engineering,
6(3-4), 2000.
[Kim+16] The Fourth Dialog State Tracking Challenge, Proceedings of IWSDS 2016.
[Lin04] ROUGE: A Package for Automatic Evaluation of Summaries, Proceedings of the Workshop on Text Summarization
Branches Out, 2004.
[Lin+06] Will Pyramids Built of Nuggets Topple Over? Proceedings of HLT/NAACL 2006.
[Lowe+15] The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems,
Proceedings of SIGDIAL 2015.
48. Selected References (2)
[Sakai+11CIKM] Click the Search Button and Be Happy: Evaluating Direct and Immediate Information Access, Proceedings of
ACM CIKM 2011.
[Sakai+12AIRS] One Click One Revisited: Enhancing Evaluation based on Information Units, Proceedings of AIRS 2012.
[Sakai+13SIGIR] Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation,
Proceedings of ACM SIGIR 2013.
[Sakai+15AIRS] Topic Set Size Design with the Evaluation Measures for Short Text Conversation, Proceedings of AIRS 2015.
[Sakai15book] 情報アクセス評価方法論: 検索エンジンの進歩のために (Information Access Evaluation Methodology: For the Progress of Search Engines; in Japanese), コロナ社 (Corona Publishing), 2015.
[Sakai16AIRS] The Effect of Score Standardisation on Topic Set Size Design, Proceedings of AIRS 2016, to appear.
[Sakai16ICTIR] A Simple and Effective Approach to Score Standardisation, Proceedings of ACM ICTIR 2016.
[Sakai16SIGIR] Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015,
Proceedings of ACM SIGIR 2016.
49. Selected References (3)
[Shang+16] Overview of the NTCIR-12 Short Text Conversation Task, Proceedings of NTCIR-12, 2016.
[Smucker+12] Time-Based Calibration of Effectiveness Measures, Proceedings of ACM SIGIR 2012.
[Walker+97] PARADISE: A Framework for Evaluating Spoken Dialogue Agents, Proceedings of ACL 1997.
[Williams+13] The Dialog State Tracking Challenge, Proceedings of SIGDIAL 2013.
50. Acknowledgements
• We thank Hang Li and Lifeng Shang (Huawei Noah's Ark Lab) for
helpful discussions and continued support; and Guan Jun, Lingtao Li
and Yimeng Fan (Waseda University) for helping us construct the pilot
test collection.
• We also thank Ryuichiro Higashinaka (NTT Media Intelligence
Laboratories) for providing us with valuable information related to the
evaluation of non-task-oriented dialogues.