SlideShare a Scribd company logo
1 of 31
Studying Topical Relevance
with Evidence-based Crowdsourcing
Oana Inel, Giannis Haralabopoulos, Dan Li, Christophe Van Gysel, Zoltán Szlávik,
Elena Simperl, Evangelos Kanoulas and Lora Aroyo
CIKM 2018
@oana_inel
https://oana-inel.github.io
TOPICAL RELEVANCE
ASSESSMENT PRACTICES Typically one NIST assessor per topic
Use an asymmetrical 3 point-relevance scale
[ highly relevant, relevant, not relevant ]
Follow strict annotation guidelines to
ensure a uniform understanding of the
annotation task
NIST employed assessors
● Topic originators
● Subject experts
2https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
TOPICAL RELEVANCE
ASSESSMENT PRACTICES Typically one NIST assessor per topic
NIST employed assessors
● Topic originators
● Subject experts
Use an asymmetrical 3 point-relevance scale
[ highly relevant, relevant, not relevant ]
What’s wrong with this picture?
Follow strict annotation guidelines to
ensure a uniform understanding of the
annotation task
3https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
TOPICAL RELEVANCE
ASSESSMENT PRACTICES Typically one NIST assessor per topic
Follow strict annotation guidelines to
ensure a uniform understanding of the
annotation task
Use an asymmetrical 3 point-relevance scale
[ highly relevant, relevant, not relevant ]
Is a singular viewpoint enough?
Do experts have bias?
What if the task is subjective or ambiguous?
Are experts always consistent?
NIST employed assessors
● Topic originators
● Subject experts
4https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
What’s wrong with this picture?
Jack Straw, left, Britain's new foreign secretary, said the country's euro policy
remained unchanged, producing a brief recovery in the pound from a 15-year
low set last week.
A common currency, administered by an independent European central bank,
would take many of these decisions out of national hands, the main reason
why so many Conservatives in Britain oppose introducing the euro there.
Greece never had a prayer of joining the first round of countries eligible for
Europe's common currency, so its exclusion from the list of 11 countries ready
to adopt the euro in 1999 was not an issue here.
Highly
relevant
Highly
relevant
Highly
relevant
What do experts say?
TOPIC: Euro Opposition
Identify documents that discuss opposition to the use of the euro, the European currency.
5https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
What does the crowd say?
0.11
relevant
0.46
relevant
0.67
relevant
Jack Straw, left, Britain's new foreign secretary, said the country's euro policy
remained unchanged, producing a brief recovery in the pound from a 15-year
low set last week.
A common currency, administered by an independent European central bank,
would take many of these decisions out of national hands, the main reason
why so many Conservatives in Britain oppose introducing the euro there.
Greece never had a prayer of joining the first round of countries eligible for
Europe's common currency, so its exclusion from the list of 11 countries ready
to adopt the euro in 1999 was not an issue here.
6https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
TOPIC: Euro Opposition
Identify documents that discuss opposition to the use of the euro, the European currency.
Who do you think is more accurate?
Jack Straw, left, Britain's new foreign secretary, said the country's euro
policy remained unchanged, producing a brief recovery in the pound
from a 15-year low set last week.
A common currency, administered by an independent European central
bank, would take many of these decisions out of national hands, the
main reason why so many Conservatives in Britain oppose introducing
the euro there.
Greece never had a prayer of joining the first round of countries eligible
for Europe's common currency, so its exclusion from the list of 11
countries ready to adopt the euro in 1999 was not an issue here.
7https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
TOPIC: Euro Opposition
Identify documents that discuss opposition to the use of the euro, the European currency.
Highly
relevant
Highly
relevant
Highly
relevant
0.11
relevant
0.46
relevant
0.67
relevant
TOPICAL RELEVANCE
ASSESSMENT PRACTICES Typically one NIST assessor per topic
Follow strict annotation guidelines to
ensure a uniform understanding of the
annotation task
Use an asymmetrical 3 point-relevance scale
[ highly relevant, relevant, not relevant ]
All these may contribute to lower accuracy & reliability
Is a singular viewpoint enough?
Do experts have bias? What if the task is subjective or ambiguous?
Are experts always consistent?
NIST employed assessors
● Topic originators
● Subject experts
8https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Research Questions
RQ1: Can we improve the accuracy of topical relevance
assessment by adapting and ?
RQ2: Can we improve the reliability of topical relevance
assessment by optimizing and ?
Relevance
Scale
Annotation
Guidelines
Document
Granularity
Number of
Annotators
9https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Hypotheses
text highlight as relevance choice motivation increases the
accuracy of the results
2-point relevance annotation scale is more reliable
topical relevance assessment at the level of document
paragraphs increases the accuracy of the results
large amounts of crowd workers perform as well as or better
than NIST assessors
Annotation
Guidelines
Document
Granularity
Relevance
Scale
Number of
Annotators
10https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
RELEVANCE ASSESSMENT METHODOLOGY
Pilot Experiments
to determine
Optimal
Crowdsourcing
Setting
Evaluate Pilot
Experiments
Main Experiment
with
Optimal
Crowdsourcing
Setting
Relevance Rank
& Labels
Documents
Annotate
Documents
Relevance
Scale
Annotation
Guidelines
Document
Granularity
Paragraph
Order
Number of
Annotators
11https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
DATASET
NYTimes Corpus
○ 250 TREC Robust Track topics
○ 23,554 short & medium documents (less than 1000 words)
NIST annotations
○ 50 topics
○ 5,946 documents
○ ternary scale (929 highly relevant, 1,421 relevant, 3,596 not
relevant)
12https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
8 PILOT EXPERIMENTS
with 120 topic-document pairs / 10 topics
Independent Variables
Relevance
Scale
Binary
Ternary
Annotation
Guidelines
Relevance Value
Relevance Value +
Text Highlight
Document
Paragraphs
Full Document
Document
Granularity
Paragraph
Order
Random Order
Original Order
Controlled Variables
Annotators
15 workers
Figure Eight
English
Speaking
Which is the optimal setting?
13https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
OVERVIEW OF ALL PILOT EXPERIMENTS SETTINGS
N/A
Annotators
Figure Eight
15 workers
English Speaking
14https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
N/A
N/A
N/A
Relevance
Scale
Binary
Ternary
Binary
Binary
Binary
Binary
Binary
Ternary
Annotation
Guidelines
Relevance Value
Relevance Value +
Text Highlight
Relevance Value
Relevance Value +
Text Highlight
Relevance Value
Relevance Value +
Text Highlight
Relevance Value
Relevance Value +
Text Highlight
Document
Paragraphs
Full Document
Document
Granularity
Full Document
Full Document
Full Document
Document
Paragraphs
Document
Paragraphs
Document
Paragraphs
Paragraph
Order
Random Order
Original Order
Original Order
Random Order
METHODOLOGY FOR
RESULTS ANALYSIS
2. Crowd Judgments Aggregation
● CrowdTruth metrics: Model and
aggregate the topical relevance
judgments of the crowd by harnessing
disagreement
1. Ground Truth Reviewing
● Inspection of annotations by NIST
● NIST assessors vs. reviewers
● Reviewers vs. crowd workers
Lora Aroyo, Anca Dumitrache, Oana Inel, Chris Welty: CrowdTruth Tutorial,
http://crowdtruth.org/tutorial/
15https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, Chris Welty:
CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement,
https://arxiv.org/pdf/1808.06080.pdf
GROUND TRUTH
REVIEWING
Independent Reviewing
● Three independent reviewers
● Familiar with NIST annotation guidelines
● 120 topic-document pairs used in pilots
Consensus Reaching
● Reviewers inter-rater reliability:
○ Fleiss’ k = 0.67 on ternary relevance
○ Fleiss’ k = 0.79 on binary relevance
● Reviewers and NIST inter-rater reliability:
○ Cohen’ k = 0.44 on ternary relevance
○ Cohen’ k = 0.60 on binary relevance
16https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
CROWDTRUTH METRICS
For paragraph-level assessments:
- we first measure the degree of relevance of each document paragraph with regard to the
topic
- by Paragraph-Topic-Document-Pair Relevance Score (ParTDP-RelScore)
- finally, we calculate the overall topic document relevance score, by taking the MAX of the
ParTDP-RelScore
17https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Lora Aroyo, Anca Dumitrache, Oana Inel, Chris Welty: CrowdTruth Tutorial, http://crowdtruth.org/tutorial/
Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, Chris Welty: CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement,
https://arxiv.org/pdf/1808.06080.pdf
For full-document-level assessments:
- we measure the degree with which a relevance label is expressed by the topic-document pair,
- Relevance-Label-Topic-Document-Pair Score (TDP-RelVal)
In both cases all worker judgements are weighted by their worker quality
Ternary Relevance Value
Relevance Value
+ Text Highlight
Full
Document
Ternary
Full
Document
Do annotation guidelines influence the results?
0.57 0.45 0.79
0.57 0.63 0.86
Highly
RelevantAssessors
NIST
Crowd
Relevant
Not
Relevant
0.57 0.45 0.79
0.69 0.64 0.92
Highly
RelevantAssessors
NIST
Crowd
Relevant
Not
Relevant
accuracy is better when answers are motivated
18https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Binary
Relevance
Value
Full
Document
Binary
Full Document
Relevance Value
+ Text Highlight
Relevance
Value
Full
Document
Full Document
Relevance Value
+ Text Highlight
Ternary Ternary
Does relevance annotation scale influence the results?
0.57 0.45 0.79
0.57 0.63 0.86
Highly
Relevant
Assessors
NIST
Crowd
Relevant
Not
Relevant
0.57 0.45 0.79
0.69 0.64 0.92
Highly
Relevant
Assessors
NIST
Crowd
Relevant
Not
Relevant
0.80 0.79
0.85 0.81
RelevantAssessors
NIST
Crowd
Not
Relevant
0.80 0.79
0.91 0.90
RelevantAssessors
NIST
Crowd
Not
Relevant
accuracy is better when answers are merged
19https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Binary
Relevance
Value
Full
Document
Binary
Full Document
Relevance Value
+ Text Highlight
Relevance
Value
Full
Document
Full Document
Relevance Value
+ Text Highlight
Ternary Ternary
Does relevance annotation scale influence the results?
0.57 0.45 0.79
0.57 0.63 0.86
Highly
Relevant
Assessors
NIST
Crowd
Relevant
Not
Relevant
0.57 0.45 0.79
0.69 0.64 0.92
Highly
Relevant
Assessors
NIST
Crowd
Relevant
Not
Relevant
0.80 0.79
0.85 0.81
RelevantAssessors
NIST
Crowd
Not
Relevant
0.80 0.79
0.91 0.90
RelevantAssessors
NIST
Crowd
Not
Relevant
accuracy is better when answers are merged
not relevant docs - high agreement
20https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Binary
Relevance
Value
Full
Document
Binary
Full Document
Relevance Value
+ Text Highlight
Relevance
Value
Full
Document
Full Document
Relevance Value
+ Text Highlight
Ternary Ternary
Does relevance annotation scale influence the results?
0.57 0.45 0.79
0.57 0.63 0.86
Highly
Relevant
Assessors
NIST
Crowd
Relevant
Not
Relevant
0.57 0.45 0.79
0.69 0.64 0.92
Highly
Relevant
Assessors
NIST
Crowd
Relevant
Not
Relevant
0.80 0.79
0.85 0.81
RelevantAssessors
NIST
Crowd
Not
Relevant
0.80 0.79
0.91 0.90
RelevantAssessors
NIST
Crowd
Not
Relevant
accuracy is better when answers are merged
not relevant docs - high agreement
relevant & highly relevant docs - often confused
21https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Document
Paragraphs
Binary
Original
Order
Binary
Relevance
Value
Relevance Value
+ Text Highlight
Document
Paragraphs
Original
Order
Does the document granularity influence the results?
accuracy is better with relevance value at paragraph level
22https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Does the paragraphs order influence the results?
Random
Order
Binary
Relevance
Value
Document
Paragraphs
Binary
Relevance
Value
Document
Paragraphs
Random Order
Binary
Document
Paragraphs
Binary
Document
Paragraphs
Original
Order
Original Order
Relevance Value
+ Text Highlight
Relevance Value
+ Text Highlight
crowd assesses better at paragraph level & in random order
23https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Random
Order
Binary
Document
Paragraphs
What amount of workers gives you reliable results?
binary relevance: 5 workers perform comparable with NIST
ternary relevance: 7 workers perform as well as the NIST experts
on highly relevant & not relevant docs only
24https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Relevance Value +
Text Highlight
DERIVED NEW TOPICAL
RELEVANCE
METHODOLOGY
Ask many crowd annotators
Diverse crowd of people Follow light-weighted
annotation guidelines
Apply 2-point relevance scale
25https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
MAIN EXPERIMENT: with the optimal settings
with 23,554 topic-document pairs / 250 topics
Binary
Relevance Value +
Text Highlight
Document
Paragraphs
Random Order
Figure Eight
7 workers
English Speaking
26https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
RESULTS MAIN EXPERIMENT
Kendall's 𝝉 and 𝝉AP rank correlation
coefficient between
● the official TREC systems' ranking
using NIST test collection
● the systems' ranking using the
crowdsourced test collection
0.6317 0.5485 0.5650 0.4879
0.5679 0.4576 0.4637 0.3728
0.5703 0.4849 0.4658 0.3751
𝝉 𝝉AP 𝝉 𝝉AP
Binary Ternary
Binary
nDGC
MAP
R-Prec
27https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
RESULTS MAIN EXPERIMENT
Kendall's 𝝉 and 𝝉AP rank correlation
coefficient between
● the official TREC systems' ranking
using NIST test collection
● the systems' ranking using the
crowdsourced test collection
0.6317 0.5485 0.5650 0.4879
0.5679 0.4576 0.4637 0.3728
0.5703 0.4849 0.4658 0.3751
𝝉 𝝉AP 𝝉 𝝉AP
Binary Ternary
Binary
nDGC
MAP
R-Prec
28https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Binary Ternary
RESULTS MAIN EXPERIMENT
Kendall's 𝝉 and 𝝉AP rank correlation
coefficient between
● the official TREC systems' ranking
using NIST test collection
● the systems' ranking using the
crowdsourced test collection
0.6317 0.5485 0.5650 0.4879
0.5679 0.4576 0.4637 0.3728
0.5703 0.4849 0.4658 0.3751
𝝉 𝝉AP 𝝉 𝝉AP
Binary Ternary
Binary
nDGC
MAP
R-Prec
29https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Binary TernaryCrowd & NIST agreement:
● binary scale - 63%
● ternary scale - 54%
crowd workers & NIST assessors do agree
RESULTS MAIN EXPERIMENT
Kendall's 𝝉 and 𝝉AP rank correlation
coefficient between
● the official TREC systems' ranking
using NIST test collection
● the systems' ranking using the
crowdsourced test collection
0.6317 0.5485 0.5650 0.4879
0.5679 0.4576 0.4637 0.3728
0.5703 0.4849 0.4658 0.3751
𝝉 𝝉AP 𝝉 𝝉AP
Binary Ternary
Binary
nDGC
MAP
R-Prec
30https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
Binary Ternary
Fair to moderate correlation between the two rankings
The correlation drops as we focus more on the top-ranked systems
CONCLUSIONS & FUTURE WORK
Contributions
● new methodology for topical relevance assessment
● new relevance metrics (CrowdTruth metrics) harnessing diverse opinions &
disagreement among annotators
● annotated test collection for topical relevance (23,554 English topic-doc pairs)
Future Work
● extend the test collection with the longer documents from the NYT Corpus
● investigate the influence of the topic difficulty & ambiguity
Code & Resources: https://github.com/CrowdTruth/NYT-Crowdsourcing-Topical-Relevance
CrowdTruth metrics: https://github.com/CrowdTruth/CrowdTruth-core
CrowdTruth tutorial: http://crowdtruth.org/tutorial/
31https://oana-inel.github.io/ @oana_inel http://crowdtruth.org

More Related Content

Recently uploaded

Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一F sss
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excelysmaelreyes
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 

Recently uploaded (20)

Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excel
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 

Featured

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

CIKM 2018: Studying Topical Relevance with Evidence-based Crowdsourcing

  • 1. Studying Topical Relevance with Evidence-based Crowdsourcing Oana Inel, Giannis Haralabopoulos, Dan Li, Christophe Van Gysel, Zoltán Szlávik, Elena Simperl, Evangelos Kanoulas and Lora Aroyo CIKM 2018 @oana_inel https://oana-inel.github.io
  • 2. TOPICAL RELEVANCE ASSESSMENT PRACTICES Typically one NIST assessor per topic Use an asymmetrical 3 point-relevance scale [ highly relevant, relevant, not relevant ] Follow strict annotation guidelines to ensure a uniform understanding of the annotation task NIST employed assessors ● Topic originators ● Subject experts 2https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 3. TOPICAL RELEVANCE ASSESSMENT PRACTICES Typically one NIST assessor per topic NIST employed assessors ● Topic originators ● Subject experts Use an asymmetrical 3 point-relevance scale [ highly relevant, relevant, not relevant ] What’s wrong with this picture? Follow strict annotation guidelines to ensure a uniform understanding of the annotation task 3https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 4. TOPICAL RELEVANCE ASSESSMENT PRACTICES Typically one NIST assessor per topic Follow strict annotation guidelines to ensure a uniform understanding of the annotation task Use an asymmetrical 3 point-relevance scale [ highly relevant, relevant, not relevant ] Is a singular viewpoint enough? Do experts have bias? What if the task is subjective or ambiguous? Are experts always consistent? NIST employed assessors ● Topic originators ● Subject experts 4https://oana-inel.github.io/ @oana_inel http://crowdtruth.org What’s wrong with this picture?
  • 5. Jack Straw, left, Britain's new foreign secretary, said the country's euro policy remained unchanged, producing a brief recovery in the pound from a 15-year low set last week. A common currency, administered by an independent European central bank, would take many of these decisions out of national hands, the main reason why so many Conservatives in Britain oppose introducing the euro there. Greece never had a prayer of joining the first round of countries eligible for Europe's common currency, so its exclusion from the list of 11 countries ready to adopt the euro in 1999 was not an issue here. Highly relevant Highly relevant Highly relevant What do experts say? TOPIC: Euro Opposition Identify documents that discuss opposition to the use of the euro, the European currency. 5https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 6. What does the crowd say? 0.11 relevant 0.46 relevant 0.67 relevant Jack Straw, left, Britain's new foreign secretary, said the country's euro policy remained unchanged, producing a brief recovery in the pound from a 15-year low set last week. A common currency, administered by an independent European central bank, would take many of these decisions out of national hands, the main reason why so many Conservatives in Britain oppose introducing the euro there. Greece never had a prayer of joining the first round of countries eligible for Europe's common currency, so its exclusion from the list of 11 countries ready to adopt the euro in 1999 was not an issue here. 6https://oana-inel.github.io/ @oana_inel http://crowdtruth.org TOPIC: Euro Opposition Identify documents that discuss opposition to the use of the euro, the European currency.
  • 7. Who do you think is more accurate? Jack Straw, left, Britain's new foreign secretary, said the country's euro policy remained unchanged, producing a brief recovery in the pound from a 15-year low set last week. A common currency, administered by an independent European central bank, would take many of these decisions out of national hands, the main reason why so many Conservatives in Britain oppose introducing the euro there. Greece never had a prayer of joining the first round of countries eligible for Europe's common currency, so its exclusion from the list of 11 countries ready to adopt the euro in 1999 was not an issue here. 7https://oana-inel.github.io/ @oana_inel http://crowdtruth.org TOPIC: Euro Opposition Identify documents that discuss opposition to the use of the euro, the European currency. Highly relevant Highly relevant Highly relevant 0.11 relevant 0.46 relevant 0.67 relevant
  • 8. TOPICAL RELEVANCE ASSESSMENT PRACTICES Typically one NIST assessor per topic Follow strict annotation guidelines to ensure a uniform understanding of the annotation task Use an asymmetrical 3 point-relevance scale [ highly relevant, relevant, not relevant ] All these may contribute to lower accuracy & reliability Is a singular viewpoint enough? Do experts have bias? What if the task is subjective or ambiguous? Are experts always consistent? NIST employed assessors ● Topic originators ● Subject experts 8https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 9. Research Questions RQ1: Can we improve the accuracy of topical relevance assessment by adapting and ? RQ2: Can we improve the reliability of topical relevance assessment by optimizing and ? Relevance Scale Annotation Guidelines Document Granularity Number of Annotators 9https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 10. Hypotheses text highlight as relevance choice motivation increases the accuracy of the results 2-point relevance annotation scale is more reliable topical relevance assessment at the level of document paragraphs increases the accuracy of the results large amounts of crowd workers perform as well as or better than NIST assessors Annotation Guidelines Document Granularity Relevance Scale Number of Annotators 10https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 11. RELEVANCE ASSESSMENT METHODOLOGY Pilot Experiments to determine Optimal Crowdsourcing Setting Evaluate Pilot Experiments Main Experiment with Optimal Crowdsourcing Setting Relevance Rank & Labels Documents Annotate Documents Relevance Scale Annotation Guidelines Document Granularity Paragraph Order Number of Annotators 11https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 12. DATASET NYTimes Corpus ○ 250 TREC Robust Track topics ○ 23,554 short & medium documents (less than 1000 words) NIST annotations ○ 50 topics ○ 5,946 documents ○ ternary scale (929 highly relevant, 1,421 relevant, 3,596 not relevant) 12https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 13. 8 PILOT EXPERIMENTS with 120 topic-document pairs / 10 topics Independent Variables Relevance Scale Binary Ternary Annotation Guidelines Relevance Value Relevance Value + Text Highlight Document Paragraphs Full Document Document Granularity Paragraph Order Random Order Original Order Controlled Variables Annotators 15 workers Figure Eight English Speaking Which is the optimal setting? 13https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 14. OVERVIEW OF ALL PILOT EXPERIMENTS SETTINGS N/A Annotators Figure Eight 15 workers English Speaking 14https://oana-inel.github.io/ @oana_inel http://crowdtruth.org N/A N/A N/A Relevance Scale Binary Ternary Binary Binary Binary Binary Binary Ternary Annotation Guidelines Relevance Value Relevance Value + Text Highlight Relevance Value Relevance Value + Text Highlight Relevance Value Relevance Value + Text Highlight Relevance Value Relevance Value + Text Highlight Document Paragraphs Full Document Document Granularity Full Document Full Document Full Document Document Paragraphs Document Paragraphs Document Paragraphs Paragraph Order Random Order Original Order Original Order Random Order
  • 15. METHODOLOGY FOR RESULTS ANALYSIS 2. Crowd Judgments Aggregation ● CrowdTruth metrics: Model and aggregate the topical relevance judgments of the crowd by harnessing disagreement 1. Ground Truth Reviewing ● Inspection of annotations by NIST ● NIST assessors vs. reviewers ● Reviewers vs. crowd workers Lora Aroyo, Anca Dumitrache, Oana Inel, Chris Welty: CrowdTruth Tutorial, http://crowdtruth.org/tutorial/ 15https://oana-inel.github.io/ @oana_inel http://crowdtruth.org Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, Chris Welty: CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement, https://arxiv.org/pdf/1808.06080.pdf
  • 16. GROUND TRUTH REVIEWING Independent Reviewing ● Three independent reviewers ● Familiar with NIST annotation guidelines ● 120 topic-document pairs used in pilots Consensus Reaching ● Reviewers inter-rater reliability: ○ Fleiss’ k = 0.67 on ternary relevance ○ Fleiss’ k = 0.79 on binary relevance ● Reviewers and NIST inter-rater reliability: ○ Cohen’ k = 0.44 on ternary relevance ○ Cohen’ k = 0.60 on binary relevance 16https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 17. CROWDTRUTH METRICS For paragraph-level assessments: - we first measure the degree of relevance of each document paragraph with regard to the topic - by Paragraph-Topic-Document-Pair Relevance Score (ParTDP-RelScore) - finally, we calculate the overall topic document relevance score, by taking the MAX of the ParTDP-RelScore 17https://oana-inel.github.io/ @oana_inel http://crowdtruth.org Lora Aroyo, Anca Dumitrache, Oana Inel, Chris Welty: CrowdTruth Tutorial, http://crowdtruth.org/tutorial/ Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, Chris Welty: CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement, https://arxiv.org/pdf/1808.06080.pdf For full-document-level assessments: - we measure the degree with which a relevance label is expressed by the topic-document pair, - Relevance-Label-Topic-Document-Pair Score (TDP-RelVal) In both cases all worker judgements are weighted by their worker quality
  • 18. Ternary Relevance Value Relevance Value + Text Highlight Full Document Ternary Full Document Do annotation guidelines influence the results? 0.57 0.45 0.79 0.57 0.63 0.86 Highly RelevantAssessors NIST Crowd Relevant Not Relevant 0.57 0.45 0.79 0.69 0.64 0.92 Highly RelevantAssessors NIST Crowd Relevant Not Relevant accuracy is better when answers are motivated 18https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 19. Binary Relevance Value Full Document Binary Full Document Relevance Value + Text Highlight Relevance Value Full Document Full Document Relevance Value + Text Highlight Ternary Ternary Does relevance annotation scale influence the results? 0.57 0.45 0.79 0.57 0.63 0.86 Highly Relevant Assessors NIST Crowd Relevant Not Relevant 0.57 0.45 0.79 0.69 0.64 0.92 Highly Relevant Assessors NIST Crowd Relevant Not Relevant 0.80 0.79 0.85 0.81 RelevantAssessors NIST Crowd Not Relevant 0.80 0.79 0.91 0.90 RelevantAssessors NIST Crowd Not Relevant accuracy is better when answers are merged 19https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 20. Binary Relevance Value Full Document Binary Full Document Relevance Value + Text Highlight Relevance Value Full Document Full Document Relevance Value + Text Highlight Ternary Ternary Does relevance annotation scale influence the results? 0.57 0.45 0.79 0.57 0.63 0.86 Highly Relevant Assessors NIST Crowd Relevant Not Relevant 0.57 0.45 0.79 0.69 0.64 0.92 Highly Relevant Assessors NIST Crowd Relevant Not Relevant 0.80 0.79 0.85 0.81 RelevantAssessors NIST Crowd Not Relevant 0.80 0.79 0.91 0.90 RelevantAssessors NIST Crowd Not Relevant accuracy is better when answers are merged not relevant docs - high agreement 20https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 21. Binary Relevance Value Full Document Binary Full Document Relevance Value + Text Highlight Relevance Value Full Document Full Document Relevance Value + Text Highlight Ternary Ternary Does relevance annotation scale influence the results? 0.57 0.45 0.79 0.57 0.63 0.86 Highly Relevant Assessors NIST Crowd Relevant Not Relevant 0.57 0.45 0.79 0.69 0.64 0.92 Highly Relevant Assessors NIST Crowd Relevant Not Relevant 0.80 0.79 0.85 0.81 RelevantAssessors NIST Crowd Not Relevant 0.80 0.79 0.91 0.90 RelevantAssessors NIST Crowd Not Relevant accuracy is better when answers are merged not relevant docs - high agreement relevant & highly relevant docs - often confused 21https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 22. Document Paragraphs Binary Original Order Binary Relevance Value Relevance Value + Text Highlight Document Paragraphs Original Order Does the document granularity influence the results? accuracy is better with relevance value at paragraph level 22https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 23. Does the paragraphs order influence the results? Random Order Binary Relevance Value Document Paragraphs Binary Relevance Value Document Paragraphs Random Order Binary Document Paragraphs Binary Document Paragraphs Original Order Original Order Relevance Value + Text Highlight Relevance Value + Text Highlight crowd assesses better at paragraph level & in random order 23https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 24. Random Order Binary Document Paragraphs What amount of workers gives you reliable results? binary relevance: 5 workers perform comparable with NIST ternary relevance: 7 workers perform as well as the NIST experts on highly relevant & not relevant docs only 24https://oana-inel.github.io/ @oana_inel http://crowdtruth.org Relevance Value + Text Highlight
  • 25. DERIVED NEW TOPICAL RELEVANCE METHODOLOGY Ask many crowd annotators Diverse crowd of people Follow light-weighted annotation guidelines Apply 2-point relevance scale 25https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 26. MAIN EXPERIMENT: with the optimal settings with 23,554 topic-document pairs / 250 topics Binary Relevance Value + Text Highlight Document Paragraphs Random Order Figure Eight 7 workers English Speaking 26https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 27. RESULTS MAIN EXPERIMENT Kendall's 𝝉 and 𝝉AP rank correlation coefficient between ● the official TREC systems' ranking using NIST test collection ● the systems' ranking using the crowdsourced test collection 0.6317 0.5485 0.5650 0.4879 0.5679 0.4576 0.4637 0.3728 0.5703 0.4849 0.4658 0.3751 𝝉 𝝉AP 𝝉 𝝉AP Binary Ternary Binary nDGC MAP R-Prec 27https://oana-inel.github.io/ @oana_inel http://crowdtruth.org
  • 28. RESULTS MAIN EXPERIMENT Kendall's 𝝉 and 𝝉AP rank correlation coefficient between ● the official TREC systems' ranking using NIST test collection ● the systems' ranking using the crowdsourced test collection 0.6317 0.5485 0.5650 0.4879 0.5679 0.4576 0.4637 0.3728 0.5703 0.4849 0.4658 0.3751 𝝉 𝝉AP 𝝉 𝝉AP Binary Ternary Binary nDGC MAP R-Prec 28https://oana-inel.github.io/ @oana_inel http://crowdtruth.org Binary Ternary
  • 29. RESULTS MAIN EXPERIMENT Kendall's 𝝉 and 𝝉AP rank correlation coefficient between ● the official TREC systems' ranking using NIST test collection ● the systems' ranking using the crowdsourced test collection 0.6317 0.5485 0.5650 0.4879 0.5679 0.4576 0.4637 0.3728 0.5703 0.4849 0.4658 0.3751 𝝉 𝝉AP 𝝉 𝝉AP Binary Ternary Binary nDGC MAP R-Prec 29https://oana-inel.github.io/ @oana_inel http://crowdtruth.org Binary TernaryCrowd & NIST agreement: ● binary scale - 63% ● ternary scale - 54% crowd workers & NIST assessors do agree
  • 30. RESULTS MAIN EXPERIMENT Kendall's 𝝉 and 𝝉AP rank correlation coefficient between ● the official TREC systems' ranking using NIST test collection ● the systems' ranking using the crowdsourced test collection 0.6317 0.5485 0.5650 0.4879 0.5679 0.4576 0.4637 0.3728 0.5703 0.4849 0.4658 0.3751 𝝉 𝝉AP 𝝉 𝝉AP Binary Ternary Binary nDGC MAP R-Prec 30https://oana-inel.github.io/ @oana_inel http://crowdtruth.org Binary Ternary Fair to moderate correlation between the two rankings The correlation drops as we focus more on the top-ranked systems
  • 31. CONCLUSIONS & FUTURE WORK Contributions ● new methodology for topical relevance assessment ● new relevance metrics (CrowdTruth metrics) harnessing diverse opinions & disagreement among annotators ● annotated test collection for topical relevance (23,554 English topic-doc pairs) Future Work ● extend the test collection with the longer documents from the NYT Corpus ● investigate the influence of the topic difficulty & ambiguity Code & Resources: https://github.com/CrowdTruth/NYT-Crowdsourcing-Topical-Relevance CrowdTruth metrics: https://github.com/CrowdTruth/CrowdTruth-core CrowdTruth tutorial: http://crowdtruth.org/tutorial/ 31https://oana-inel.github.io/ @oana_inel http://crowdtruth.org

Editor's Notes

  1. Hello everyone I am Oana Inel and today I will present a collaborative work between my CrowdTruth team, with myself and Lora Aroyo and researchers from UvA and Southampton, and IBM
  2. Today I’ll talk about Assessment Practices, and specifically about Topical Relevance assessment practices. We have collected a couple of observations from reviewing current NIST practices for topical relevance assessment and some of them stood out. As probably many of you already know, NIST assessors are topic originators and subject experts. One thing that really stands out is that just one NIST assessor is typically assigned per topic and they use an asymmetrical 3-point scale to label the topic-document pairs. Finally, to ensure a uniform understanding of the annotation task, they need to follow strict annotations guidelines.
  3. So, does something seem wrong in this picture? What do you think?
  4. Do you think that a single viewpoint for every topic-document pair and even for every topic is enough to gather a reliable test collection? Do you think that experts can not have bias or they can always be consistent when annotating? And furthermore, what happens when the task is subjective or ambiguous?
  5. Now, let’s look at some NIST annotations from the corpus that we used in our experiments. We have the following topic - euro opposition: identify documents that discuss … Now, let’s see what NIST assessors said about these documents:
  6. Now let’s see what the crowd said, we gave them the same topic and the same documents, and they said that the first document is about 70% relevant, the second one is about 50% percent relevant and the last one is hardly relevant.
  7. Who do you think produced more accurate annotations here? The crowd that gave us a ranking of the documents or the experts that said all three docs are highly relevant?
  8. Now coming back to my introductory slide, do you now see why there is a problem in current annotations practices? Even though they are performed by highly trained experts, one person is sometimes not enough to capture the entire truth, people can have bias, people can make mistakes and it is very difficult to be always consistent. Scales such as highly relevant, relevant and not relevant are usually difficult and confusing and annotation guidelines can not help much if the task is subjective or ambiguous. And all these may actually contribute to lower accuracy and reliability.
  9. We did a set of experiments to see whether we can empirically confirm those observations and we formulated two main RQ to guide our analysis. We wanted to see if we can improve the accuracy of the topical relevance assessment by adapting, more exactly by relaxing the annotation guidelines rather than making them more strict and we also wanted to see if we change the document granularity, this will also positively influence the accuracy And through the second RQ we wanted to see whether we can improve the reliability of topical relevance by choosing an optimal relevance scale that can decrease as much as possible the intrinsic inconsistencies of annotators and by choosing the optimal number of crowd annotators to gather annotations from.
  10. As a result of these RQ, we defined a set of hypotheses related to the annotation guidelines, the document granularity, the relevance scale and the number of annotators
  11. And we defined a methodology in which we performed multiple pilot experiments in which we experimented with various values of these variables in order to identify the most optimal crowdsourcing settings to rank and label documents with topical relevance.
  12. We worked with the short and medium documents in the NYTimes Corpus, covering 250 TREC Robust Track topics.
  13. Let’s have a look how we organized our pilot experiments. We used 120 topic documents pair, covering 10 topics that were previously annotated by NIST. In all the pilot experiments we fixed the crowdsourcing platform, we used the Figure Eight platform, and we asked for 15 annotations from english speaking countries workers. And the goal of these pilot experiments was to find out which combination of these variables produces an optimal setting.
  14. Here are all the variations that we tested. So, we experimented with binary and ternary annotation relevance scale, we experimented with different annotation guidelines in which we asked the workers to provide the relevance value or label or to provide the relevance label and highlight in the document the part that motivates the relevance. Further, we experimented with the document granularity by asking workers to annotate the full document or document paragraphs and with the order in which the paragraphs are shown.
  15. We consider the way we analyzed the results as a contribution of the research. We performed a two-step analysis. First, in order to make sure that we understand the annotations provided by NIST, we reviewed and analyzed them to accurately compare them with the crowd annotations. Second, we didn’t apply the traditional majority vote approaches to evaluate the crowd annotations, but we applied the CrowdTruth quality metrics which harness the disagreement among annotators.
  16. As I mentioned earlier, we first reviewed the ground truth labels. We used three reviewers, authors of this paper that were familiar with the NIST annotation guidelines. The reviewers annotated all the 120 topic-doc pairs used in the pilots. We computed the inter-rater reliability among reviewers and between reviewers and NIST and we observe that on a binary scale the agreement is overall much higher than on a ternary scale.