Towards Filling the Gap in Conversational Search: From Passage Retrieval to Conversational Response Generation
Weronika Łajewska and Krisztian Balog
University of Stavanger, Norway
CIKM’23, Birmingham
This study
● Problem setting: Conversational response generation
○ It extends beyond passage retrieval + summarization
● Goal: snippet-level annotations of relevant passages, to enable
1. the training of response generation models that are able to ground answers
in actual statements
2. the automatic evaluation of the generated responses in terms of
completeness
● Main contributions:
1. Crowdsourcing task design and protocol to collect high-quality annotations
2. A dataset of 1.8k annotated query-passage pairs from the TREC 2020 and
2022 Conversational Assistance Track (CAsT)
CAsT-snippets sample
The seemingly straightforward task of highlighting relevant
snippets turns out to be not that simple.
Preliminary study
A comparison of different task designs, platforms, and worker pools
● Task designs: paragraph-based vs. sentence-based annotation
● Platforms and workers:
○ Amazon MTurk (regular vs. master workers)
○ Prolific
○ Expert annotators (PhD students)
Evaluation measures
Traditional measures of inter-annotator
agreement are insufficient
● Fleiss’ Kappa and Krippendorff’s Alpha are
measures for categorical annotations that
rely on a binary notion of agreement
● Here: we need to measure the degree to
which snippets selected by different workers
overlap
○ Inter-annotator agreement: Jaccard
similarity (also a less strict variant,
k-Jaccard)
○ Similarity against expert annotators:
“ROUGE-like” variant of precision and recall
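The overlap-based agreement measures above can be illustrated with a minimal sketch. The slides do not define k-Jaccard, so the definition used here is an assumption: the share of all selected tokens that at least k annotators highlighted, with strict Jaccard as the special case where k equals the number of annotators. Representing a selection as a set of token positions, and the names `jaccard`, `jaccard_k`, and `workers`, are likewise hypothetical.

```python
from collections import Counter


def jaccard_k(selections, k):
    """Share of selected tokens that at least k annotators agree on.

    selections: one set of token positions per annotator.
    Assumed definition for illustration; not taken from the slides.
    """
    # Count, for every token position, how many annotators selected it.
    counts = Counter(pos for sel in selections for pos in set(sel))
    if not counts:
        # Assumption: nobody highlighting anything counts as full agreement.
        return 1.0
    agreed = sum(1 for c in counts.values() if c >= k)
    return agreed / len(counts)


def jaccard(selections):
    """Strict Jaccard: intersection over union across all annotators."""
    return jaccard_k(selections, k=len(selections))


# Hypothetical example: three workers highlight token positions in one passage.
workers = [{0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3}]
strict = jaccard(workers)          # positions 2 and 3 out of 5 selected positions
relaxed = jaccard_k(workers, k=2)  # positions 1, 2 and 3 out of 5 selected positions
```

Under this assumed definition, lowering k can only keep or raise the score, which matches the direction of the Jaccard_k columns reported in the results.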
Results
Inter-annotator agreement
Task variant     Annotators           Jaccard  Jaccard_k (k=4)  Jaccard_k (k=3)  Jaccard_k (k=2)
Paragraph-based  MTurk regular (n=5)  0.02     0.08             0.21             0.48
Paragraph-based  MTurk master (n=5)   0.18     0.35             0.53             0.73
Paragraph-based  Prolific (n=5)       0.14     0.27             0.44             0.65
Paragraph-based  Expert (m=3)         0.25     -                -                0.54
Sentence-based   MTurk regular (n=3)  0.35     -                -                0.71
Sentence-based   MTurk master (n=3)   0.47     -                -                0.76
Comparison to expert annotations
Task variant     Annotators     F1
Paragraph-based  MTurk regular  0.36
Paragraph-based  MTurk master   0.54
Paragraph-based  Prolific       0.50
Sentence-based   MTurk regular  0.31
Sentence-based   MTurk master   0.41
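The F1 scores for the comparison to expert annotations come from a "ROUGE-like" precision and recall whose exact form is not given in the slides. As a rough sketch, one plausible reading is unigram overlap in the spirit of ROUGE-1: precision is the fraction of a worker's highlighted tokens that the expert also highlighted, and recall is the fraction of the expert's tokens that the worker covered. The function name `overlap_prf` and the example strings are hypothetical.

```python
from collections import Counter


def overlap_prf(worker_tokens, expert_tokens):
    """Unigram-overlap precision, recall and F1 of a worker against an expert.

    Assumed ROUGE-1-style formulation for illustration only.
    """
    worker, expert = Counter(worker_tokens), Counter(expert_tokens)
    # Multiset intersection: each token counts at most as often as in both.
    overlap = sum((worker & expert).values())
    precision = overlap / sum(worker.values()) if worker else 0.0
    recall = overlap / sum(expert.values()) if expert else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical snippets, tokenized by whitespace.
worker_snippet = "the gap in conversational search".split()
expert_snippet = "gap in conversational search and response generation".split()
precision, recall, f1 = overlap_prf(worker_snippet, expert_snippet)
```

Here the worker shares four tokens with the expert, giving a precision of 4/5 and a recall of 4/7.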
Main findings
● Relative ordering: MTurk masters > Prolific > MTurk regular
● Paragraph-level > sentence-level (w.r.t. similarity with expert annotations)
⇒ use MTurk and paragraph-based design for the large-scale data collection
Data collection
Setup
Employ a small group of trained crowd workers, selected through a qualification task, and create an extended set of guidelines with the help of the annotators
Data collection
● Performed in daily batches (1 topic per batch ≈ 46 HITs)
● Individual feedback after each submitted batch
● General comments/suggestions on a common Slack channel
● $0.30 per HIT + $2 bonus for completing within 24h
Qualification task
● Task consisted of a detailed description of the problem, examples of correct annotations, a quiz, and 10 query-passage pairs to be annotated
● 20 workers completed the task; 15 passed
Initial guidelines
Discussion
Feedback on qualification task
Extended guidelines
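The payment scheme for the daily batches can be worked through as a small calculation. This is a sketch under two assumptions that the slides do not state explicitly: the $2 bonus is paid once per batch, and amounts are per worker. Working in cents avoids floating-point rounding; the function name `batch_payment` is hypothetical.

```python
PAY_PER_HIT_CENTS = 30   # $0.30 per HIT (from the slides)
BONUS_CENTS = 200        # $2 bonus for completing within 24h (from the slides)
HITS_PER_BATCH = 46      # roughly one topic per batch (from the slides)


def batch_payment(n_hits=HITS_PER_BATCH, within_24h=True):
    """Payment in dollars to one worker for one batch.

    Assumes the bonus is granted once per batch, which is an interpretation.
    """
    cents = n_hits * PAY_PER_HIT_CENTS
    if within_24h:
        cents += BONUS_CENTS
    return cents / 100


on_time = batch_payment()                # 46 HITs plus the bonus
late = batch_payment(within_24h=False)   # 46 HITs without the bonus
```

For a typical batch of 46 HITs, this gives $15.80 when the batch is completed within 24 hours and $13.80 otherwise.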
Resulting dataset: CAsT-snippets
371 queries, top 5 passages per query ⇒ 1855 query-passage pairs
(each annotated by 3 crowd workers)
● Data quality
○ Inter-annotator agreement exceeds even that of expert annotators
○ Similarity with expert annotations is on par with MTurk master workers
● Comparison against other datasets
○ More snippets annotated per input text; also, snippets are longer
Dataset        Input text         Avg. snippet length (tokens)  # snippets per annotation
CAsT-snippets  Paragraph          39.6                          2.3
SaaC           Top 10 passages    23.8                          1.5
QuAC           Wikipedia article  14.6                          1
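The dataset size quoted for CAsT-snippets follows directly from the collection setup; a short check, with variable names chosen here for illustration:

```python
N_QUERIES = 371          # queries from the slides
PASSAGES_PER_QUERY = 5   # top 5 passages per query
WORKERS_PER_PAIR = 3     # each pair annotated by 3 crowd workers

# Number of query-passage pairs, reported as "1.8k" on the overview slide.
pairs = N_QUERIES * PASSAGES_PER_QUERY
# Total number of individual annotations collected across workers.
annotations = pairs * WORKERS_PER_PAIR
```

This yields 1855 query-passage pairs and 5565 individual annotations.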
Challenges identified
Challenges pointed out by the crowd workers that need to be addressed in
conversational response generation:
● Only a partial answer is present
● Temporal considerations
○ Spans may need to be excluded given the time constraints in the query
○ Assessing temporal validity can be challenging based on the paragraph alone
(without larger context)
● Subjectivity of the passages originating from blogs or comments
● Indirect answers that require reasoning and background knowledge
● Determining the appropriate amount of context to include in each span
○ Balancing between being concise and being self-contained
● Determining whether the evidence or additional information is needed or an
entity alone is sufficient as an answer
Summary
● Snippet-level annotations for conversational response generation
(information-seeking queries)
● Several measures to ensure high data quality
○ Preliminary study to compare task variants and crowdsourcing platforms
○ Providing feedback and training to annotators throughout the data
collection process
○ Incentive structure to engage crowd workers over a period of time and avoid
worker fatigue
● Communication with workers also led to various insights regarding
challenges in conversational response generation
Questions?
Extended version on arXiv: https://arxiv.org/abs/2308.08911
Dataset: https://github.com/iai-group/CAsT-snippets
Preliminary study
Dataset: TREC CAsT’20 and ’22 (top 5 passages according to relevance score for each query)
Input: query + passage/sentence
Output: snippet-level annotations in passage
Task variant  Annotator      Time  # workers  Acceptance rate  Cost
Paragraph     MTurk regular  182s  5          50%              $0.36
Paragraph     MTurk master   63s   5          90%              $0.38
Paragraph     Prolific       154s  5          79%              $0.51
Paragraph     Expert         96s   3          -                -
Sentence      MTurk regular  977s  3          72%              $0.43
Sentence      MTurk master   305s  3          87%              $0.56
Results (large-scale data collection)
Inter-annotator agreement
Task variant     Annotator                       Jaccard  Jaccard_2
Paragraph-based  MTurk regular (n=5)             0.02     0.48
Paragraph-based  MTurk master (n=5)              0.18     0.73
Paragraph-based  Prolific (n=5)                  0.14     0.65
Paragraph-based  Expert (m=3)                    0.25     0.54
Paragraph-based  Large-scale (topics 1,2) (m=3)  0.38     0.62
Paragraph-based  Large-scale (all data) (m=3)    0.33     0.61
Sentence-based   MTurk regular (n=3)             0.35     0.71
Sentence-based   MTurk master (n=3)              0.47     0.76
Comparison to expert annotations
Task variant     Annotator                       F1
Paragraph-based  MTurk regular                   0.36
Paragraph-based  MTurk master                    0.54
Paragraph-based  Prolific                        0.50
Paragraph-based  Large-scale (topics 1,2) (m=3)  0.54
Sentence-based   MTurk regular                   0.31
Sentence-based   MTurk master                    0.41
Amazon MTurk - paragraph-based design
Amazon MTurk - sentence-based design
Prolific - paragraph-based design

Towards Filling the Gap in Conversational Search: From Passage Retrieval to Conversational Response Generation

  • 1. Weronika Łajewska and Krisztian Balog University of Stavanger, Norway Towards Filling the Gap in Conversational Search: From Passage Retrieval to Conversational Response Generation CIKM’23, Birmingham
  • 2. This study
      ● Problem setting: Conversational response generation
          ○ It extends beyond passage retrieval + summarization
      ● Goal: snippet-level annotations of relevant passages, to enable
          1. the training of response generation models that are able to ground answers in actual statements
          2. the automatic evaluation of the generated responses in terms of completeness
      ● Main contributions:
          1. Crowdsourcing task design and protocol to collect high-quality annotations
          2. A dataset of 1.8k query-passage pairs annotated from the TREC 2020 and 2022 Conversational Assistance track
  • 4. CAsT-snippets sample The seemingly straightforward task of highlighting relevant snippets turns out to be not that simple.
  • 5. Preliminary study
      A comparison of different task designs, platforms, and worker pools
      ● Task designs: paragraph-based vs. sentence-based annotation
      ● Platforms and workers:
          ○ Amazon MTurk (regular vs. master workers)
          ○ Prolific
          ○ Expert annotators (PhD students)
  • 6. Evaluation measures
      Traditional measures of inter-annotator agreement are insufficient
      ● Fleiss’ Kappa and Krippendorff’s Alpha are measures for categorical annotations that rely on a binary notion of agreement
      ● Here: we need to measure the degree to which snippets selected by different workers overlap
          ○ Inter-annotator agreement: Jaccard similarity (also a less strict variant, k-Jaccard)
          ○ Similarity against expert annotators: “ROUGE-like” variant of precision and recall
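The overlap-based agreement measures named on this slide can be sketched as follows. This is a minimal illustration, assuming each annotator's selection is represented as a set of token positions in the passage. The function names and the "selected by at least k annotators" reading of k-Jaccard are assumptions made for illustration, not definitions taken from the slides.

```python
def jaccard(selections):
    """Strict Jaccard: tokens picked by ALL annotators / tokens picked by ANY."""
    union = set().union(*selections)
    if not union:
        # Assumption: annotators who all select nothing are in full agreement.
        return 1.0
    common = set.intersection(*selections)
    return len(common) / len(union)


def jaccard_k(selections, k):
    """Relaxed variant (assumed definition): a token counts as agreed upon
    if at least k annotators selected it."""
    union = set().union(*selections)
    if not union:
        return 1.0
    agreed = {t for t in union if sum(t in s for s in selections) >= k}
    return len(agreed) / len(union)


# Hypothetical token positions highlighted by three annotators.
annotators = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4}]
strict = jaccard(annotators)        # only tokens 3 and 4 are shared by all
relaxed = jaccard_k(annotators, 2)  # tokens 2, 3, 4 have at least two votes
```

Under this reading, setting k to the number of annotators recovers the strict measure, and lowering k can only raise the score.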
  • 7. Results
      Inter-annotator agreement
          Task variant      Annotators             Jaccard   Jaccard_k
                                                             k = 4   k = 3   k = 2
          Paragraph-based   MTurk regular (n=5)    0.02      0.08    0.21    0.48
                            MTurk master (n=5)     0.18      0.35    0.53    0.73
                            Prolific (n=5)         0.14      0.27    0.44    0.65
                            Expert (m=3)           0.25      -       -       0.54
          Sentence-based    MTurk regular (n=3)    0.35      -       -       0.71
                            MTurk master (n=3)     0.47      -       -       0.76
      Comparison to expert annotations
          Task variant      Annotators       F1
          Paragraph-based   MTurk regular    0.36
                            MTurk master     0.54
                            Prolific         0.50
          Sentence-based    MTurk regular    0.31
                            MTurk master     0.41
      Main findings
      ● Relative ordering: MTurk masters > Prolific > MTurk regular
      ● Paragraph-level > sentence-level (w.r.t. similarity with expert annotations)
      ⇒ use MTurk and paragraph-based design for the large-scale data collection
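The F1 scores in the comparison to expert annotations combine the “ROUGE-like” precision and recall mentioned on the evaluation measures slide. A minimal sketch of such a token-overlap score, assuming selections are sets of token positions and a single expert reference. The helper name `overlap_prf` is hypothetical, and the slides do not state how scores are aggregated over multiple experts or workers.

```python
def overlap_prf(worker_tokens, expert_tokens):
    """Token-overlap precision, recall, and F1 of a worker's selection
    against an expert's selection (both given as sets of token positions)."""
    overlap = len(worker_tokens & expert_tokens)
    # Precision: share of the worker's tokens that the expert also selected.
    precision = overlap / len(worker_tokens) if worker_tokens else 0.0
    # Recall: share of the expert's tokens that the worker recovered.
    recall = overlap / len(expert_tokens) if expert_tokens else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical selections: the worker highlights 4 tokens, the expert 6,
# and 3 of them coincide.
worker = {2, 3, 4, 5}
expert = {3, 4, 5, 6, 7, 8}
p, r, f1 = overlap_prf(worker, expert)
```

Here the worker is fairly precise (3 of 4 tokens match) but recovers only half of what the expert marked, which the F1 score balances.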
  • 9. Setup
      Employ a small group of trained crowd workers, selected through a qualification task, and create an extended set of guidelines with the help of the annotators
      Data collection
          Performed in daily batches (1 topic/batch ≈ 46 HITs)
          Individual feedback after each submitted batch
          General comments/suggestions on a common Slack channel
          $0.3 per HIT + $2 bonus for completing within 24h
      Qualification task
          Task consisted of: a detailed description of the problem, examples of correct annotations, a quiz, and 10 query-passage pairs to be annotated
          20 workers completed / 15 passed
      Diagram labels: Initial guidelines; Discussion; Feedback on qualification task; Extended guidelines
  • 10. Resulting dataset: CAsT-snippets
      371 queries, top 5 passages per query ⇒ 1855 query-passage pairs (each annotated by 3 crowd workers)
      ● Data quality
          ○ Inter-annotator agreement exceeds even that of expert annotators
          ○ Similarity with expert annotations is on par with MTurk master workers
      ● Comparison against other datasets
          ○ More snippets annotated per input text; also, snippets are longer
          Dataset          Input text          Avg. snippet length (tokens)   # snippets per annotation
          CAsT-snippets    Paragraph           39.6                           2.3
          SaaC             Top 10 passages     23.8                           1.5
          QuAC             Wikipedia article   14.6                           1
  • 11. Challenges identified
      Challenges pointed out by the crowd workers that need to be addressed in conversational response generation:
      ● Only a partial answer is present
      ● Temporal considerations
          ○ Spans may need to be excluded given the time constraints in the query
          ○ Assessing temporal validity can be challenging based on the paragraph alone (without larger context)
      ● Subjectivity of passages originating from blogs or comments
      ● Indirect answers that require reasoning and background knowledge
      ● Determining the appropriate amount of context to include in each span
          ○ Balancing between being concise and being self-contained
      ● Determining whether evidence or additional information is needed, or whether an entity alone is sufficient as an answer
  • 12. Summary
      ● Snippet-level annotations for conversational response generation (information-seeking queries)
      ● Several measures to ensure high data quality
          ○ Preliminary study to compare task variants and crowdsourcing platforms
          ○ Providing feedback and training to annotators throughout the data collection process
          ○ Incentive structure to engage crowd workers over a period of time and avoid worker fatigue
      ● Communication with workers also led to various insights regarding challenges in conversational response generation
  • 13. Questions? Extended version on arXiv: https://arxiv.org/abs/2308.08911 Dataset: https://github.com/iai-group/CAsT-snippets
  • 15. Preliminary study
      Dataset: TREC CAsT’20 and ’22 (top 5 passages according to relevance score for each query)
      Input: query + passage/sentence
      Output: snippet-level annotations in passage
          Task variant   Annotator       Time   # workers   Acceptance rate   Cost
          Paragraph      MTurk regular   182s   5           50%               $0.36
                         MTurk master    63s    5           90%               $0.38
                         Prolific        154s   5           79%               $0.51
                         Expert          96s    3           -                 -
          Sentence       MTurk regular   977s   3           72%               $0.43
                         MTurk master    305s   3           87%               $0.56
  • 16. Results (large-scale data collection)
      Inter-annotator agreement
          Task variant      Annotator                          Jaccard   Jaccard_2
          Paragraph-based   MTurk regular (n=5)                0.02      0.48
                            MTurk master (n=5)                 0.18      0.73
                            Prolific (n=5)                     0.14      0.65
                            Expert (m=3)                       0.25      0.54
                            Large-scale (topics 1,2) (m=3)     0.38      0.62
                            Large-scale (all data) (m=3)       0.33      0.61
          Sentence-based    MTurk regular (n=3)                0.35      0.71
                            MTurk master (n=3)                 0.47      0.76
      Comparison to expert annotations
          Task variant      Annotator                          F1
          Paragraph-based   MTurk regular                      0.36
                            MTurk master                       0.54
                            Prolific                           0.50
                            Large-scale (topics 1,2) (m=3)     0.54
          Sentence-based    MTurk regular                      0.31
                            MTurk master                       0.41
  • 17. Amazon MTurk - paragraph-based design
  • 18. Amazon MTurk - sentence-based design