Information Retrieval systems rely on large test collections to measure their effectiveness in retrieving relevant documents. While the demand is high, creating such test collections is laborious due to the large amounts of data that need to be annotated and the intrinsic subjectivity of the task itself. In this paper we study topical relevance from a user perspective by addressing the problems of subjectivity and ambiguity. We compare our approach and results with the established TREC annotation guidelines and results. The comparison is based on a series of crowdsourcing pilots experimenting with variables such as relevance scale, document granularity, annotation template, and number of workers. Our results show a correlation between relevance assessment accuracy and smaller document granularity: aggregating relevance at the paragraph level yields better relevance accuracy than assessment done at the level of the full document. As expected, our results also show that collecting binary relevance judgments results in higher accuracy than the ternary scale used in the TREC annotation guidelines. Finally, the crowdsourced annotation tasks provided a more accurate document relevance ranking than a single assessor's relevance label. This work resulted in a reliable test collection around the TREC Common Core track.
CIKM 2018: Studying Topical Relevance with Evidence-based Crowdsourcing
1. Studying Topical Relevance with Evidence-based Crowdsourcing
Oana Inel, Giannis Haralabopoulos, Dan Li, Christophe Van Gysel, Zoltán Szlávik,
Elena Simperl, Evangelos Kanoulas and Lora Aroyo
CIKM 2018
@oana_inel
https://oana-inel.github.io
2. TOPICAL RELEVANCE ASSESSMENT PRACTICES
● Typically one NIST assessor per topic
● NIST-employed assessors:
○ Topic originators
○ Subject experts
● Use an asymmetrical 3-point relevance scale: [ highly relevant, relevant, not relevant ]
● Follow strict annotation guidelines to ensure a uniform understanding of the annotation task
What's wrong with this picture?
● Is a singular viewpoint enough?
● Do experts have bias?
● What if the task is subjective or ambiguous?
● Are experts always consistent?
5. What do experts say?
TOPIC: Euro Opposition
Identify documents that discuss opposition to the use of the euro, the European currency.
● "Jack Straw, left, Britain's new foreign secretary, said the country's euro policy remained unchanged, producing a brief recovery in the pound from a 15-year low set last week." (NIST: Highly relevant)
● "A common currency, administered by an independent European central bank, would take many of these decisions out of national hands, the main reason why so many Conservatives in Britain oppose introducing the euro there." (NIST: Highly relevant)
● "Greece never had a prayer of joining the first round of countries eligible for Europe's common currency, so its exclusion from the list of 11 countries ready to adopt the euro in 1999 was not an issue here." (NIST: Highly relevant)
6. What does the crowd say?
TOPIC: Euro Opposition
Identify documents that discuss opposition to the use of the euro, the European currency.
For the same three document excerpts, the crowd's aggregated relevance scores were 0.11, 0.46, and 0.67: instead of three identical labels, the crowd produced a graded relevance ranking.
7. Who do you think is more accurate?
TOPIC: Euro Opposition
Identify documents that discuss opposition to the use of the euro, the European currency.
● NIST assessors: all three documents labeled highly relevant
● Crowd: relevance scores of 0.11, 0.46, and 0.67, i.e., a relevance ranking of the documents
8. TOPICAL RELEVANCE ASSESSMENT PRACTICES
The same practices raise the same questions:
● One NIST assessor per topic: is a singular viewpoint enough? Do experts have bias? Are experts always consistent?
● Strict annotation guidelines and an asymmetrical 3-point relevance scale: what if the task is subjective or ambiguous?
All these may contribute to lower accuracy & reliability
9. Research Questions
RQ1: Can we improve the accuracy of topical relevance assessment by adapting the Annotation Guidelines and the Document Granularity?
RQ2: Can we improve the reliability of topical relevance assessment by optimizing the Relevance Scale and the Number of Annotators?
10. Hypotheses
● Annotation Guidelines: text highlight as relevance choice motivation increases the accuracy of the results
● Relevance Scale: a 2-point relevance annotation scale is more reliable
● Document Granularity: topical relevance assessment at the level of document paragraphs increases the accuracy of the results
● Number of Annotators: large numbers of crowd workers perform as well as or better than NIST assessors
11. RELEVANCE ASSESSMENT METHODOLOGY
Documents → Pilot Experiments to determine the Optimal Crowdsourcing Setting → Evaluate Pilot Experiments → Main Experiment with the Optimal Crowdsourcing Setting → Annotate Documents → Relevance Rank & Labels
Variables explored in the pilots: Relevance Scale, Annotation Guidelines, Document Granularity, Paragraph Order, Number of Annotators
12. DATASET
NYTimes Corpus
○ 250 TREC Robust Track topics
○ 23,554 short & medium documents (less than 1,000 words)
NIST annotations
○ 50 topics
○ 5,946 documents
○ ternary scale (929 highly relevant, 1,421 relevant, 3,596 not relevant)
13. 8 PILOT EXPERIMENTS
with 120 topic-document pairs / 10 topics
Independent Variables
● Relevance Scale: Binary | Ternary
● Annotation Guidelines: Relevance Value | Relevance Value + Text Highlight
● Document Granularity: Full Document | Document Paragraphs
● Paragraph Order: Original Order | Random Order
Controlled Variables
● Annotators: Figure Eight, 15 workers, English speaking
Which is the optimal setting?
14. OVERVIEW OF ALL PILOT EXPERIMENT SETTINGS
Controlled across all pilots: Annotators (Figure Eight, 15 workers, English speaking)

Pilot | Relevance Scale | Annotation Guidelines             | Document Granularity | Paragraph Order
1     | Ternary         | Relevance Value                   | Full Document        | N/A
2     | Ternary         | Relevance Value + Text Highlight  | Full Document        | N/A
3     | Binary          | Relevance Value                   | Full Document        | N/A
4     | Binary          | Relevance Value + Text Highlight  | Full Document        | N/A
5     | Binary          | Relevance Value                   | Document Paragraphs  | Original Order
6     | Binary          | Relevance Value + Text Highlight  | Document Paragraphs  | Original Order
7     | Binary          | Relevance Value                   | Document Paragraphs  | Random Order
8     | Binary          | Relevance Value + Text Highlight  | Document Paragraphs  | Random Order
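The eight settings can be read as a partial factorial design: the full-document pilots cross scale with guidelines, while the paragraph-level pilots are all binary and cross guidelines with paragraph order. A minimal sketch that enumerates them (the variable names are mine, not from the paper's code):

```python
# Illustrative enumeration of the eight pilot settings reconstructed
# above; a sketch, not the authors' experiment code.
from itertools import product

GUIDELINES = ["Relevance Value", "Relevance Value + Text Highlight"]

# Full-document pilots: {Ternary, Binary} x guidelines, no paragraph order
pilots = [("Full Document", scale, guide, "N/A")
          for scale, guide in product(["Ternary", "Binary"], GUIDELINES)]
# Paragraph-level pilots: binary only, guidelines x paragraph order
pilots += [("Document Paragraphs", "Binary", guide, order)
           for order, guide in product(["Original Order", "Random Order"],
                                       GUIDELINES)]

for i, (granularity, scale, guide, order) in enumerate(pilots, 1):
    print(f"Pilot {i}: {scale}, {guide}, {granularity}, {order}")
```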
15. METHODOLOGY FOR RESULTS ANALYSIS
1. Ground Truth Reviewing
● Inspection of annotations by NIST
● NIST assessors vs. reviewers
● Reviewers vs. crowd workers
2. Crowd Judgments Aggregation
● CrowdTruth metrics: model and aggregate the topical relevance judgments of the crowd by harnessing disagreement
References:
Lora Aroyo, Anca Dumitrache, Oana Inel, Chris Welty: CrowdTruth Tutorial, http://crowdtruth.org/tutorial/
Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, Chris Welty: CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement, https://arxiv.org/pdf/1808.06080.pdf
16. GROUND TRUTH REVIEWING
Independent Reviewing
● Three independent reviewers
● Familiar with NIST annotation guidelines
● 120 topic-document pairs used in pilots
Consensus Reaching
● Reviewers' inter-rater reliability:
○ Fleiss' κ = 0.67 on ternary relevance
○ Fleiss' κ = 0.79 on binary relevance
● Reviewers vs. NIST inter-rater reliability:
○ Cohen's κ = 0.44 on ternary relevance
○ Cohen's κ = 0.60 on binary relevance
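To make the agreement figures concrete, here is a minimal sketch of how such kappas can be computed with standard library implementations; the toy labels below are invented for illustration, not taken from the paper:

```python
# Inter-rater reliability sketch, assuming ternary labels encoded as
# 0 = not relevant, 1 = relevant, 2 = highly relevant.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = topic-document pairs, columns = the three independent reviewers
reviewer_labels = np.array([
    [2, 2, 1],
    [0, 0, 0],
    [1, 2, 1],
    [0, 1, 0],
])

# Fleiss' kappa among the reviewers on the ternary scale
counts, _ = aggregate_raters(reviewer_labels)  # items x categories tallies
print("Fleiss' kappa (ternary):", fleiss_kappa(counts))

# Collapsing to a binary scale (relevant vs. not relevant) before
# recomputing is how the binary-scale figures would be obtained
binary_counts, _ = aggregate_raters((reviewer_labels > 0).astype(int))
print("Fleiss' kappa (binary):", fleiss_kappa(binary_counts))

# Cohen's kappa between the reviewers' consensus label and the NIST label
consensus = np.array([2, 0, 1, 0])
nist = np.array([2, 0, 2, 1])
print("Cohen's kappa:", cohen_kappa_score(consensus, nist))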
17. CROWDTRUTH METRICS
For full-document-level assessments:
● we measure the degree to which a relevance label is expressed by the topic-document pair: the Relevance-Label-Topic-Document-Pair Score (TDP-RelVal)
For paragraph-level assessments:
● we first measure the degree of relevance of each document paragraph with regard to the topic: the Paragraph-Topic-Document-Pair Relevance Score (ParTDP-RelScore)
● we then calculate the overall topic-document relevance score by taking the MAX of the ParTDP-RelScores
In both cases, all worker judgments are weighted by their worker quality.
References:
Lora Aroyo, Anca Dumitrache, Oana Inel, Chris Welty: CrowdTruth Tutorial, http://crowdtruth.org/tutorial/
Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, Chris Welty: CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement, https://arxiv.org/pdf/1808.06080.pdf
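A minimal sketch of the paragraph-to-document aggregation described above, assuming binary votes and precomputed worker-quality weights; the full metrics live in CrowdTruth-core (https://github.com/CrowdTruth/CrowdTruth-core), and the names and toy values here are illustrative:

```python
# Worker-quality-weighted paragraph scores, aggregated to a document
# score via MAX, as on this slide. A sketch, not the library code.

def par_tdp_rel_score(votes, worker_quality):
    """Weighted relevance of one paragraph.

    votes: list of (worker_id, vote) pairs with vote in {0, 1}.
    worker_quality: dict mapping worker_id to a quality weight in [0, 1].
    """
    weighted = sum(worker_quality[w] * v for w, v in votes)
    total = sum(worker_quality[w] for w, _ in votes)
    return weighted / total if total else 0.0

def document_rel_score(paragraph_votes, worker_quality):
    """Overall topic-document relevance: MAX over its paragraph scores."""
    return max(par_tdp_rel_score(v, worker_quality) for v in paragraph_votes)

worker_quality = {"w1": 0.9, "w2": 0.4, "w3": 0.7}  # toy quality weights
paragraph_votes = [
    [("w1", 0), ("w2", 1), ("w3", 0)],  # paragraph 1 -> score 0.2
    [("w1", 1), ("w2", 1), ("w3", 1)],  # paragraph 2 -> score 1.0
]
print(document_rel_score(paragraph_votes, worker_quality))  # 1.0
```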
18. Do annotation guidelines influence the results?
Setting: Ternary scale, Full Document

Accuracy with Relevance Value guidelines:
Assessors | Highly Relevant | Relevant | Not Relevant
NIST      | 0.57            | 0.45     | 0.79
Crowd     | 0.57            | 0.63     | 0.86

Accuracy with Relevance Value + Text Highlight guidelines:
Assessors | Highly Relevant | Relevant | Not Relevant
NIST      | 0.57            | 0.45     | 0.79
Crowd     | 0.69            | 0.64     | 0.92

→ accuracy is better when answers are motivated
19. Does the relevance annotation scale influence the results?
Setting: Full Document

Ternary scale, Relevance Value:
Assessors | Highly Relevant | Relevant | Not Relevant
NIST      | 0.57            | 0.45     | 0.79
Crowd     | 0.57            | 0.63     | 0.86

Ternary scale, Relevance Value + Text Highlight:
Assessors | Highly Relevant | Relevant | Not Relevant
NIST      | 0.57            | 0.45     | 0.79
Crowd     | 0.69            | 0.64     | 0.92

Binary scale, Relevance Value:
Assessors | Relevant | Not Relevant
NIST      | 0.80     | 0.79
Crowd     | 0.85     | 0.81

Binary scale, Relevance Value + Text Highlight:
Assessors | Relevant | Not Relevant
NIST      | 0.80     | 0.79
Crowd     | 0.91     | 0.90

→ accuracy is better when answers are merged
→ not relevant docs: high agreement
→ relevant & highly relevant docs: often confused
23. Does the paragraph order influence the results?
Setting: Binary scale, Document Paragraphs
Compared: Relevance Value vs. Relevance Value + Text Highlight guidelines, each with paragraphs shown in Random Order vs. Original Order
→ the crowd assesses better at paragraph level & in random order
24. How many workers give you reliable results?
Setting: Binary scale, Relevance Value + Text Highlight, Document Paragraphs, Random Order
● binary relevance: 5 workers perform comparably with NIST
● ternary relevance: 7 workers perform as well as the NIST experts, on highly relevant & not relevant docs only
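One way to reproduce this kind of analysis is to repeatedly subsample k of the 15 judgments per unit and score the aggregate against the reviewed ground truth. A hedged sketch, using simple majority vote in place of the CrowdTruth aggregation and invented data structures:

```python
# Worker-count analysis sketch: accuracy as a function of how many of
# the 15 judgments per unit are used. Majority vote stands in for the
# CrowdTruth metrics here; judgments/gold are assumed inputs.
import random
from collections import Counter

def accuracy_at_k(judgments, gold, k, trials=100, seed=42):
    """judgments: {unit_id: [label, ...]} with 15 crowd labels per unit.
    gold: {unit_id: label} from the reviewed ground truth."""
    rng = random.Random(seed)
    hits, total = 0, 0
    for _ in range(trials):
        for unit, labels in judgments.items():
            sample = rng.sample(labels, k)
            majority = Counter(sample).most_common(1)[0][0]
            hits += (majority == gold[unit])
            total += 1
    return hits / total

# Scan k = 1..15 and look for the smallest k reaching NIST-level accuracy:
# for k in range(1, 16):
#     print(k, accuracy_at_k(judgments, gold, k))
```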
25. DERIVED NEW TOPICAL RELEVANCE METHODOLOGY
● Ask many crowd annotators
● Use a diverse crowd of people
● Follow lightweight annotation guidelines
● Apply a 2-point relevance scale
26. MAIN EXPERIMENT with the optimal settings
with 23,554 topic-document pairs / 250 topics
● Relevance Scale: Binary
● Annotation Guidelines: Relevance Value + Text Highlight
● Document Granularity: Document Paragraphs
● Paragraph Order: Random Order
● Annotators: Figure Eight, 7 workers, English speaking
27. RESULTS MAIN EXPERIMENT
Kendall's 𝝉 and 𝝉AP rank correlation coefficients between:
● the official TREC systems' ranking using the NIST test collection
● the systems' ranking using the crowdsourced test collection

Metric | Binary 𝝉 | Binary 𝝉AP | Ternary 𝝉 | Ternary 𝝉AP
nDCG   | 0.6317   | 0.5485     | 0.5650    | 0.4879
MAP    | 0.5679   | 0.4576     | 0.4637    | 0.3728
R-Prec | 0.5703   | 0.4849     | 0.4658    | 0.3751

Crowd & NIST agreement: 63% on the binary scale, 54% on the ternary scale
→ crowd workers & NIST assessors do agree
→ fair to moderate correlation between the two rankings
→ the correlation drops as we focus more on the top-ranked systems
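For reference, a sketch of how the two coefficients can be computed from two system-to-score mappings. The 𝝉AP implementation follows the definition of Yilmaz, Aslam & Robertson (2008); the system names and scores below are invented for illustration:

```python
# Rank-correlation sketch, assuming dicts mapping each TREC system to
# its effectiveness score (e.g., MAP) under the NIST and crowd qrels.
from scipy.stats import kendalltau

def tau_ap(ref_scores, est_scores):
    """Top-weighted AP rank correlation of est_scores vs. ref_scores."""
    systems = sorted(est_scores, key=est_scores.get, reverse=True)
    n = len(systems)
    total = 0.0
    for i in range(1, n):
        # systems ranked above position i that the reference ranking
        # also places above systems[i]
        correct = sum(ref_scores[s] > ref_scores[systems[i]]
                      for s in systems[:i])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0

nist = {"sysA": 0.31, "sysB": 0.28, "sysC": 0.25, "sysD": 0.19}   # toy MAPs
crowd = {"sysA": 0.29, "sysB": 0.30, "sysC": 0.22, "sysD": 0.18}  # toy MAPs

order = sorted(nist)
tau, _ = kendalltau([nist[s] for s in order], [crowd[s] for s in order])
print("Kendall tau:", tau, "tau_AP:", tau_ap(nist, crowd))
```

Because each error is weighted by its position near the top of the reference ranking, 𝝉AP is lower than 𝝉 whenever the disagreements concentrate among the top-ranked systems, which is exactly the pattern reported on this slide.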
31. CONCLUSIONS & FUTURE WORK
Contributions
● new methodology for topical relevance assessment
● new relevance metrics (CrowdTruth metrics) harnessing diverse opinions &
disagreement among annotators
● annotated test collection for topical relevance (23,554 English topic-doc pairs)
Future Work
● extend the test collection with the longer documents from the NYT Corpus
● investigate the influence of the topic difficulty & ambiguity
Code & Resources: https://github.com/CrowdTruth/NYT-Crowdsourcing-Topical-Relevance
CrowdTruth metrics: https://github.com/CrowdTruth/CrowdTruth-core
CrowdTruth tutorial: http://crowdtruth.org/tutorial/
Editor's Notes
Hello everyone, I am Oana Inel, and today I will present collaborative work between the CrowdTruth team (myself and Lora Aroyo), researchers from UvA and Southampton, and IBM.
Today I'll talk about assessment practices, and specifically about topical relevance assessment practices. We collected a number of observations from reviewing current NIST practices for topical relevance assessment, and some of them stood out.
As many of you probably already know, NIST assessors are topic originators and subject experts. One thing that really stands out is that just one NIST assessor is typically assigned per topic, and they use an asymmetrical 3-point scale to label the topic-document pairs.
Finally, to ensure a uniform understanding of the annotation task, they need to follow strict annotation guidelines.
So, does something seem wrong in this picture? What do you think?
Do you think that a single viewpoint for every topic-document pair, and even for every topic, is enough to gather a reliable test collection? Do you think that experts cannot have bias, or that they can always be consistent when annotating? And furthermore, what happens when the task is subjective or ambiguous?
Now, let's look at some NIST annotations from the corpus that we used in our experiments. We have the following topic, euro opposition: identify documents that discuss opposition to the use of the euro, the European currency.
Now, let's see what the NIST assessors said about these documents.
Next, let's see what the crowd said. We gave them the same topic and the same documents, and they said that the first document is about 70% relevant, the second one is about 50% relevant, and the last one is hardly relevant.
Who do you think produced more accurate annotations here? The crowd, which gave us a ranking of the documents, or the experts, who said all three docs are highly relevant?
Coming back to my introductory slide, do you now see why there is a problem with current annotation practices? Even though they are performed by highly trained experts, one person is sometimes not enough to capture the entire truth; people can have bias, people can make mistakes, and it is very difficult to always be consistent.
Scales such as highly relevant, relevant, and not relevant are usually difficult and confusing, and annotation guidelines cannot help much if the task is subjective or ambiguous.
And all of these may actually contribute to lower accuracy and reliability.
We ran a set of experiments to see whether we could empirically confirm those observations, and we formulated two main research questions to guide our analysis.
We wanted to see whether we can improve the accuracy of topical relevance assessment by adapting, more precisely by relaxing, the annotation guidelines rather than making them stricter, and whether changing the document granularity would also positively influence the accuracy.
Through the second research question we wanted to see whether we can improve the reliability of topical relevance assessment by choosing an optimal relevance scale, one that reduces as much as possible the intrinsic inconsistencies of annotators, and by choosing the optimal number of crowd annotators to gather annotations from.
Based on these research questions, we defined a set of hypotheses related to the annotation guidelines, the document granularity, the relevance scale, and the number of annotators.
We then defined a methodology in which we performed multiple pilot experiments, varying the values of these variables in order to identify the optimal crowdsourcing settings for ranking and labeling documents with topical relevance.
We worked with the short and medium documents in the NYTimes Corpus, covering 250 TREC Robust Track topics.
Let's have a look at how we organized our pilot experiments. We used 120 topic-document pairs, covering 10 topics that were previously annotated by NIST.
In all the pilot experiments we fixed the crowdsourcing platform, Figure Eight, and we asked for 15 annotations from workers in English-speaking countries.
The goal of these pilot experiments was to find out which combination of these variables produces an optimal setting.
Here are all the variations that we tested. We experimented with binary and ternary relevance scales; with different annotation guidelines, asking workers either to provide the relevance value (label) alone or to also highlight the part of the document that motivates their choice; with document granularity, asking workers to annotate the full document or individual paragraphs; and with the order in which the paragraphs are shown.
We consider the way we analyzed the results a contribution of this research. We performed a two-step analysis. First, in order to make sure that we understood the annotations provided by NIST, we reviewed and analyzed them so that we could accurately compare them with the crowd annotations. Second, we did not apply traditional majority-vote approaches to evaluate the crowd annotations; instead, we applied the CrowdTruth quality metrics, which harness the disagreement among annotators.
As I mentioned earlier, we first reviewed the ground-truth labels. We used three reviewers, authors of this paper who were familiar with the NIST annotation guidelines. The reviewers annotated all 120 topic-document pairs used in the pilots.
We computed the inter-rater reliability among reviewers and between reviewers and NIST, and we observed that on a binary scale the agreement is overall much higher than on a ternary scale.