Information Retrieval systems rely on large test collections to measure their effectiveness in retrieving relevant documents. While the demand is high, creating such test collections is laborious due to the large amounts of data that need to be annotated and the intrinsic subjectivity of the task itself. In this paper we study topical relevance from a user perspective by addressing the problems of subjectivity and ambiguity. We compare our approach and results with the established TREC annotation guidelines and results. The comparison is based on a series of crowdsourcing pilots experimenting with variables such as relevance scale, document granularity, annotation template, and number of workers. Our results show a correlation between relevance assessment accuracy and smaller document granularity: aggregating relevance at the paragraph level yields better relevance accuracy than assessment done at the level of the full document. As expected, our results also show that collecting binary relevance judgments results in higher accuracy than the ternary scale used in the TREC annotation guidelines. Finally, the crowdsourced annotation tasks provided a more accurate document relevance ranking than a single assessor's relevance label. This work resulted in a reliable test collection around the TREC Common Core track.
CIKM 2018: Studying Topical Relevance with Evidence-based Crowdsourcing
1. Studying Topical Relevance with Evidence-based Crowdsourcing
Oana Inel, Giannis Haralabopoulos, Dan Li, Christophe Van Gysel, Zoltán Szlávik,
Elena Simperl, Evangelos Kanoulas and Lora Aroyo
CIKM 2018
@oana_inel
https://oana-inel.github.io
2. TOPICAL RELEVANCE ASSESSMENT PRACTICES
● Typically one NIST assessor per topic
● NIST-employed assessors:
○ Topic originators
○ Subject experts
● Use an asymmetrical 3-point relevance scale: [ highly relevant, relevant, not relevant ]
● Follow strict annotation guidelines to ensure a uniform understanding of the annotation task
What's wrong with this picture?
● Is a singular viewpoint enough?
● Do experts have bias?
● What if the task is subjective or ambiguous?
● Are experts always consistent?
5. What do experts say?
TOPIC: Euro Opposition
Identify documents that discuss opposition to the use of the euro, the European currency.
● "Jack Straw, left, Britain's new foreign secretary, said the country's euro policy remained unchanged, producing a brief recovery in the pound from a 15-year low set last week." (NIST: Highly relevant)
● "A common currency, administered by an independent European central bank, would take many of these decisions out of national hands, the main reason why so many Conservatives in Britain oppose introducing the euro there." (NIST: Highly relevant)
● "Greece never had a prayer of joining the first round of countries eligible for Europe's common currency, so its exclusion from the list of 11 countries ready to adopt the euro in 1999 was not an issue here." (NIST: Highly relevant)
6. What does the crowd say?
TOPIC: Euro Opposition
Identify documents that discuss opposition to the use of the euro, the European currency.
For the same three document excerpts, the crowd's aggregated relevance scores were 0.11, 0.46, and 0.67: instead of three identical labels, the crowd produced a graded relevance ranking.
7. Who do you think is more accurate?
TOPIC: Euro Opposition
Identify documents that discuss opposition to the use of the euro, the European currency.
● NIST assessors: all three documents labeled highly relevant
● Crowd: relevance scores of 0.11, 0.46, and 0.67, i.e., a relevance ranking of the documents
8. TOPICAL RELEVANCE ASSESSMENT PRACTICES
The same practices raise the same questions:
● One NIST assessor per topic: is a singular viewpoint enough? Do experts have bias? Are experts always consistent?
● Strict annotation guidelines and an asymmetrical 3-point relevance scale: what if the task is subjective or ambiguous?
All these may contribute to lower accuracy & reliability
9. Research Questions
RQ1: Can we improve the accuracy of topical relevance assessment by adapting the Annotation Guidelines and the Document Granularity?
RQ2: Can we improve the reliability of topical relevance assessment by optimizing the Relevance Scale and the Number of Annotators?
10. Hypotheses
● Annotation Guidelines: text highlight as relevance choice motivation increases the accuracy of the results
● Relevance Scale: a 2-point relevance annotation scale is more reliable
● Document Granularity: topical relevance assessment at the level of document paragraphs increases the accuracy of the results
● Number of Annotators: large numbers of crowd workers perform as well as or better than NIST assessors
11. RELEVANCE ASSESSMENT METHODOLOGY
Documents → Pilot Experiments to determine the Optimal Crowdsourcing Setting → Evaluate Pilot Experiments → Main Experiment with the Optimal Crowdsourcing Setting → Annotate Documents → Relevance Rank & Labels
Variables explored in the pilots: Relevance Scale, Annotation Guidelines, Document Granularity, Paragraph Order, Number of Annotators
12. DATASET
NYTimes Corpus
○ 250 TREC Robust Track topics
○ 23,554 short & medium documents (less than 1,000 words)
NIST annotations
○ 50 topics
○ 5,946 documents
○ ternary scale (929 highly relevant, 1,421 relevant, 3,596 not relevant)
13. 8 PILOT EXPERIMENTS
with 120 topic-document pairs / 10 topics
Independent Variables
● Relevance Scale: Binary | Ternary
● Annotation Guidelines: Relevance Value | Relevance Value + Text Highlight
● Document Granularity: Full Document | Document Paragraphs
● Paragraph Order: Original Order | Random Order
Controlled Variables
● Annotators: Figure Eight, 15 workers, English speaking
Which is the optimal setting?
14. OVERVIEW OF ALL PILOT EXPERIMENT SETTINGS
Controlled across all pilots: Annotators (Figure Eight, 15 workers, English speaking)

Pilot | Relevance Scale | Annotation Guidelines             | Document Granularity | Paragraph Order
1     | Ternary         | Relevance Value                   | Full Document        | N/A
2     | Ternary         | Relevance Value + Text Highlight  | Full Document        | N/A
3     | Binary          | Relevance Value                   | Full Document        | N/A
4     | Binary          | Relevance Value + Text Highlight  | Full Document        | N/A
5     | Binary          | Relevance Value                   | Document Paragraphs  | Original Order
6     | Binary          | Relevance Value + Text Highlight  | Document Paragraphs  | Original Order
7     | Binary          | Relevance Value                   | Document Paragraphs  | Random Order
8     | Binary          | Relevance Value + Text Highlight  | Document Paragraphs  | Random Order
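The eight settings can be read as a partial factorial design: the full-document pilots cross scale with guidelines, while the paragraph-level pilots are all binary and cross guidelines with paragraph order. A minimal sketch that enumerates them (the variable names are mine, not from the paper's code):

```python
# Illustrative enumeration of the eight pilot settings reconstructed
# above; a sketch, not the authors' experiment code.
from itertools import product

GUIDELINES = ["Relevance Value", "Relevance Value + Text Highlight"]

# Full-document pilots: {Ternary, Binary} x guidelines, no paragraph order
pilots = [("Full Document", scale, guide, "N/A")
          for scale, guide in product(["Ternary", "Binary"], GUIDELINES)]
# Paragraph-level pilots: binary only, guidelines x paragraph order
pilots += [("Document Paragraphs", "Binary", guide, order)
           for order, guide in product(["Original Order", "Random Order"],
                                       GUIDELINES)]

for i, (granularity, scale, guide, order) in enumerate(pilots, 1):
    print(f"Pilot {i}: {scale}, {guide}, {granularity}, {order}")
```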
15. METHODOLOGY FOR RESULTS ANALYSIS
1. Ground Truth Reviewing
● Inspection of annotations by NIST
● NIST assessors vs. reviewers
● Reviewers vs. crowd workers
2. Crowd Judgments Aggregation
● CrowdTruth metrics: model and aggregate the topical relevance judgments of the crowd by harnessing disagreement
References:
Lora Aroyo, Anca Dumitrache, Oana Inel, Chris Welty: CrowdTruth Tutorial, http://crowdtruth.org/tutorial/
Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, Chris Welty: CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement, https://arxiv.org/pdf/1808.06080.pdf
16. GROUND TRUTH REVIEWING
Independent Reviewing
● Three independent reviewers
● Familiar with NIST annotation guidelines
● 120 topic-document pairs used in pilots
Consensus Reaching
● Reviewers' inter-rater reliability:
○ Fleiss' κ = 0.67 on ternary relevance
○ Fleiss' κ = 0.79 on binary relevance
● Reviewers vs. NIST inter-rater reliability:
○ Cohen's κ = 0.44 on ternary relevance
○ Cohen's κ = 0.60 on binary relevance
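To make the agreement figures concrete, here is a minimal sketch of how such kappas can be computed with standard library implementations; the toy labels below are invented for illustration, not taken from the paper:

```python
# Inter-rater reliability sketch, assuming ternary labels encoded as
# 0 = not relevant, 1 = relevant, 2 = highly relevant.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = topic-document pairs, columns = the three independent reviewers
reviewer_labels = np.array([
    [2, 2, 1],
    [0, 0, 0],
    [1, 2, 1],
    [0, 1, 0],
])

# Fleiss' kappa among the reviewers on the ternary scale
counts, _ = aggregate_raters(reviewer_labels)  # items x categories tallies
print("Fleiss' kappa (ternary):", fleiss_kappa(counts))

# Collapsing to a binary scale (relevant vs. not relevant) before
# recomputing is how the binary-scale figures would be obtained
binary_counts, _ = aggregate_raters((reviewer_labels > 0).astype(int))
print("Fleiss' kappa (binary):", fleiss_kappa(binary_counts))

# Cohen's kappa between the reviewers' consensus label and the NIST label
consensus = np.array([2, 0, 1, 0])
nist = np.array([2, 0, 2, 1])
print("Cohen's kappa:", cohen_kappa_score(consensus, nist))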
17. CROWDTRUTH METRICS
For full-document-level assessments:
● we measure the degree to which a relevance label is expressed by the topic-document pair: the Relevance-Label-Topic-Document-Pair Score (TDP-RelVal)
For paragraph-level assessments:
● we first measure the degree of relevance of each document paragraph with regard to the topic: the Paragraph-Topic-Document-Pair Relevance Score (ParTDP-RelScore)
● we then calculate the overall topic-document relevance score by taking the MAX of the ParTDP-RelScores
In both cases, all worker judgments are weighted by their worker quality.
References:
Lora Aroyo, Anca Dumitrache, Oana Inel, Chris Welty: CrowdTruth Tutorial, http://crowdtruth.org/tutorial/
Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, Chris Welty: CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement, https://arxiv.org/pdf/1808.06080.pdf
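A minimal sketch of the paragraph-to-document aggregation described above, assuming binary votes and precomputed worker-quality weights; the full metrics live in CrowdTruth-core (https://github.com/CrowdTruth/CrowdTruth-core), and the names and toy values here are illustrative:

```python
# Worker-quality-weighted paragraph scores, aggregated to a document
# score via MAX, as on this slide. A sketch, not the library code.

def par_tdp_rel_score(votes, worker_quality):
    """Weighted relevance of one paragraph.

    votes: list of (worker_id, vote) pairs with vote in {0, 1}.
    worker_quality: dict mapping worker_id to a quality weight in [0, 1].
    """
    weighted = sum(worker_quality[w] * v for w, v in votes)
    total = sum(worker_quality[w] for w, _ in votes)
    return weighted / total if total else 0.0

def document_rel_score(paragraph_votes, worker_quality):
    """Overall topic-document relevance: MAX over its paragraph scores."""
    return max(par_tdp_rel_score(v, worker_quality) for v in paragraph_votes)

worker_quality = {"w1": 0.9, "w2": 0.4, "w3": 0.7}  # toy quality weights
paragraph_votes = [
    [("w1", 0), ("w2", 1), ("w3", 0)],  # paragraph 1 -> score 0.2
    [("w1", 1), ("w2", 1), ("w3", 1)],  # paragraph 2 -> score 1.0
]
print(document_rel_score(paragraph_votes, worker_quality))  # 1.0
```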
18. Do annotation guidelines influence the results?
Setting: Ternary scale, Full Document

Accuracy with Relevance Value guidelines:
Assessors | Highly Relevant | Relevant | Not Relevant
NIST      | 0.57            | 0.45     | 0.79
Crowd     | 0.57            | 0.63     | 0.86

Accuracy with Relevance Value + Text Highlight guidelines:
Assessors | Highly Relevant | Relevant | Not Relevant
NIST      | 0.57            | 0.45     | 0.79
Crowd     | 0.69            | 0.64     | 0.92

→ accuracy is better when answers are motivated
19. Does the relevance annotation scale influence the results?
Setting: Full Document

Ternary scale, Relevance Value:
Assessors | Highly Relevant | Relevant | Not Relevant
NIST      | 0.57            | 0.45     | 0.79
Crowd     | 0.57            | 0.63     | 0.86

Ternary scale, Relevance Value + Text Highlight:
Assessors | Highly Relevant | Relevant | Not Relevant
NIST      | 0.57            | 0.45     | 0.79
Crowd     | 0.69            | 0.64     | 0.92

Binary scale, Relevance Value:
Assessors | Relevant | Not Relevant
NIST      | 0.80     | 0.79
Crowd     | 0.85     | 0.81

Binary scale, Relevance Value + Text Highlight:
Assessors | Relevant | Not Relevant
NIST      | 0.80     | 0.79
Crowd     | 0.91     | 0.90

→ accuracy is better when answers are merged
→ not relevant docs: high agreement
→ relevant & highly relevant docs: often confused
23. Does the paragraph order influence the results?
Setting: Binary scale, Document Paragraphs
Compared: Relevance Value vs. Relevance Value + Text Highlight guidelines, each with paragraphs shown in Random Order vs. Original Order
→ the crowd assesses better at paragraph level & in random order
24. How many workers give you reliable results?
Setting: Binary scale, Relevance Value + Text Highlight, Document Paragraphs, Random Order
● binary relevance: 5 workers perform comparably with NIST
● ternary relevance: 7 workers perform as well as the NIST experts, on highly relevant & not relevant docs only
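One way to reproduce this kind of analysis is to repeatedly subsample k of the 15 judgments per unit and score the aggregate against the reviewed ground truth. A hedged sketch, using simple majority vote in place of the CrowdTruth aggregation and invented data structures:

```python
# Worker-count analysis sketch: accuracy as a function of how many of
# the 15 judgments per unit are used. Majority vote stands in for the
# CrowdTruth metrics here; judgments/gold are assumed inputs.
import random
from collections import Counter

def accuracy_at_k(judgments, gold, k, trials=100, seed=42):
    """judgments: {unit_id: [label, ...]} with 15 crowd labels per unit.
    gold: {unit_id: label} from the reviewed ground truth."""
    rng = random.Random(seed)
    hits, total = 0, 0
    for _ in range(trials):
        for unit, labels in judgments.items():
            sample = rng.sample(labels, k)
            majority = Counter(sample).most_common(1)[0][0]
            hits += (majority == gold[unit])
            total += 1
    return hits / total

# Scan k = 1..15 and look for the smallest k reaching NIST-level accuracy:
# for k in range(1, 16):
#     print(k, accuracy_at_k(judgments, gold, k))
```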
25. DERIVED NEW TOPICAL RELEVANCE METHODOLOGY
● Ask many crowd annotators
● Use a diverse crowd of people
● Follow lightweight annotation guidelines
● Apply a 2-point relevance scale
26. MAIN EXPERIMENT with the optimal settings
with 23,554 topic-document pairs / 250 topics
● Relevance Scale: Binary
● Annotation Guidelines: Relevance Value + Text Highlight
● Document Granularity: Document Paragraphs
● Paragraph Order: Random Order
● Annotators: Figure Eight, 7 workers, English speaking
27. RESULTS MAIN EXPERIMENT
Kendall's 𝝉 and 𝝉AP rank correlation coefficients between:
● the official TREC systems' ranking using the NIST test collection
● the systems' ranking using the crowdsourced test collection

Metric | Binary 𝝉 | Binary 𝝉AP | Ternary 𝝉 | Ternary 𝝉AP
nDCG   | 0.6317   | 0.5485     | 0.5650    | 0.4879
MAP    | 0.5679   | 0.4576     | 0.4637    | 0.3728
R-Prec | 0.5703   | 0.4849     | 0.4658    | 0.3751

Crowd & NIST agreement: 63% on the binary scale, 54% on the ternary scale
→ crowd workers & NIST assessors do agree
→ fair to moderate correlation between the two rankings
→ the correlation drops as we focus more on the top-ranked systems
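For reference, a sketch of how the two coefficients can be computed from two system-to-score mappings. The 𝝉AP implementation follows the definition of Yilmaz, Aslam & Robertson (2008); the system names and scores below are invented for illustration:

```python
# Rank-correlation sketch, assuming dicts mapping each TREC system to
# its effectiveness score (e.g., MAP) under the NIST and crowd qrels.
from scipy.stats import kendalltau

def tau_ap(ref_scores, est_scores):
    """Top-weighted AP rank correlation of est_scores vs. ref_scores."""
    systems = sorted(est_scores, key=est_scores.get, reverse=True)
    n = len(systems)
    total = 0.0
    for i in range(1, n):
        # systems ranked above position i that the reference ranking
        # also places above systems[i]
        correct = sum(ref_scores[s] > ref_scores[systems[i]]
                      for s in systems[:i])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0

nist = {"sysA": 0.31, "sysB": 0.28, "sysC": 0.25, "sysD": 0.19}   # toy MAPs
crowd = {"sysA": 0.29, "sysB": 0.30, "sysC": 0.22, "sysD": 0.18}  # toy MAPs

order = sorted(nist)
tau, _ = kendalltau([nist[s] for s in order], [crowd[s] for s in order])
print("Kendall tau:", tau, "tau_AP:", tau_ap(nist, crowd))
```

Because each error is weighted by its position near the top of the reference ranking, 𝝉AP is lower than 𝝉 whenever the disagreements concentrate among the top-ranked systems, which is exactly the pattern reported on this slide.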
31. CONCLUSIONS & FUTURE WORK
Contributions
● new methodology for topical relevance assessment
● new relevance metrics (CrowdTruth metrics) harnessing diverse opinions &
disagreement among annotators
● annotated test collection for topical relevance (23,554 English topic-doc pairs)
Future Work
● extend the test collection with the longer documents from the NYT Corpus
● investigate the influence of the topic difficulty & ambiguity
Code & Resources: https://github.com/CrowdTruth/NYT-Crowdsourcing-Topical-Relevance
CrowdTruth metrics: https://github.com/CrowdTruth/CrowdTruth-core
CrowdTruth tutorial: http://crowdtruth.org/tutorial/
Editor's Notes
Hello everyone, I am Oana Inel, and today I will present collaborative work between the CrowdTruth team (myself and Lora Aroyo), researchers from UvA and Southampton, and IBM.
Today I'll talk about assessment practices, and specifically about topical relevance assessment practices. We collected a number of observations from reviewing current NIST practices for topical relevance assessment, and some of them stood out.
As many of you probably already know, NIST assessors are topic originators and subject experts. One thing that really stands out is that just one NIST assessor is typically assigned per topic, and they use an asymmetrical 3-point scale to label the topic-document pairs.
Finally, to ensure a uniform understanding of the annotation task, they need to follow strict annotation guidelines.
So, does something seem wrong in this picture? What do you think?
Do you think that a single viewpoint for every topic-document pair, and even for every topic, is enough to gather a reliable test collection? Do you think that experts cannot have bias, or that they can always be consistent when annotating? And furthermore, what happens when the task is subjective or ambiguous?
Now, let's look at some NIST annotations from the corpus that we used in our experiments. We have the following topic, euro opposition: identify documents that discuss opposition to the use of the euro, the European currency.
Now, let's see what the NIST assessors said about these documents.
Next, let's see what the crowd said. We gave them the same topic and the same documents, and they said that the first document is about 70% relevant, the second one is about 50% relevant, and the last one is hardly relevant.
Who do you think produced more accurate annotations here? The crowd, which gave us a ranking of the documents, or the experts, who said all three docs are highly relevant?
Coming back to my introductory slide, do you now see why there is a problem with current annotation practices? Even though they are performed by highly trained experts, one person is sometimes not enough to capture the entire truth; people can have bias, people can make mistakes, and it is very difficult to always be consistent.
Scales such as highly relevant, relevant, and not relevant are usually difficult and confusing, and annotation guidelines cannot help much if the task is subjective or ambiguous.
And all of these may actually contribute to lower accuracy and reliability.
We ran a set of experiments to see whether we could empirically confirm those observations, and we formulated two main research questions to guide our analysis.
We wanted to see whether we can improve the accuracy of topical relevance assessment by adapting, more precisely by relaxing, the annotation guidelines rather than making them stricter, and whether changing the document granularity would also positively influence the accuracy.
Through the second research question we wanted to see whether we can improve the reliability of topical relevance assessment by choosing an optimal relevance scale, one that reduces as much as possible the intrinsic inconsistencies of annotators, and by choosing the optimal number of crowd annotators to gather annotations from.
Based on these research questions, we defined a set of hypotheses related to the annotation guidelines, the document granularity, the relevance scale, and the number of annotators.
We then defined a methodology in which we performed multiple pilot experiments, varying the values of these variables in order to identify the optimal crowdsourcing settings for ranking and labeling documents with topical relevance.
We worked with the short and medium documents in the NYTimes Corpus, covering 250 TREC Robust Track topics.
Let's have a look at how we organized our pilot experiments. We used 120 topic-document pairs, covering 10 topics that were previously annotated by NIST.
In all the pilot experiments we fixed the crowdsourcing platform, Figure Eight, and we asked for 15 annotations from workers in English-speaking countries.
The goal of these pilot experiments was to find out which combination of these variables produces an optimal setting.
Here are all the variations that we tested. We experimented with binary and ternary relevance scales; with different annotation guidelines, asking workers either to provide the relevance value (label) alone or to also highlight the part of the document that motivates their choice; with document granularity, asking workers to annotate the full document or individual paragraphs; and with the order in which the paragraphs are shown.
We consider the way we analyzed the results a contribution of this research. We performed a two-step analysis. First, in order to make sure that we understood the annotations provided by NIST, we reviewed and analyzed them so that we could accurately compare them with the crowd annotations. Second, we did not apply traditional majority-vote approaches to evaluate the crowd annotations; instead, we applied the CrowdTruth quality metrics, which harness the disagreement among annotators.
As I mentioned earlier, we first reviewed the ground-truth labels. We used three reviewers, authors of this paper who were familiar with the NIST annotation guidelines. The reviewers annotated all 120 topic-document pairs used in the pilots.
We computed the inter-rater reliability among reviewers and between reviewers and NIST, and we observed that on a binary scale the agreement is overall much higher than on a ternary scale.