
CIKM 2018: Studying Topical Relevance with Evidence-based Crowdsourcing

Information Retrieval systems rely on large test collections to measure their effectiveness in retrieving relevant documents. While the demand is high, the task of creating such test collections is laborious due to the large amounts of data that need to be annotated, and due to the intrinsic subjectivity of the task itself. In this paper we study topical relevance from a user perspective by addressing the problems of subjectivity and ambiguity. We compare our approach and results with the established TREC annotation guidelines and results. The comparison is based on a series of crowdsourcing pilots experimenting with variables such as relevance scale, document granularity, annotation template and the number of workers. Our results show a correlation between relevance assessment accuracy and smaller document granularity: aggregating relevance at the paragraph level yields better accuracy than assessment done at the level of the full document. As expected, our results also show that collecting binary relevance judgments results in higher accuracy than the ternary scale used in the TREC annotation guidelines. Finally, the crowdsourced annotation tasks provided a more accurate document relevance ranking than a single assessor relevance label. This work resulted in a reliable test collection around the TREC Common Core track.

CIKM 2018: Studying Topical Relevance with Evidence-based Crowdsourcing

  1. Studying Topical Relevance with Evidence-based Crowdsourcing. Oana Inel, Giannis Haralabopoulos, Dan Li, Christophe Van Gysel, Zoltán Szlávik, Elena Simperl, Evangelos Kanoulas and Lora Aroyo. CIKM 2018. @oana_inel https://oana-inel.github.io
  2. TOPICAL RELEVANCE ASSESSMENT PRACTICES. Typically one NIST assessor per topic. NIST-employed assessors: topic originators and subject experts. Assessors use an asymmetrical 3-point relevance scale [highly relevant, relevant, not relevant] and follow strict annotation guidelines to ensure a uniform understanding of the annotation task.
  3. TOPICAL RELEVANCE ASSESSMENT PRACTICES. One NIST assessor per topic (topic originators, subject experts), an asymmetrical 3-point relevance scale [highly relevant, relevant, not relevant], strict annotation guidelines to ensure a uniform understanding of the annotation task. What's wrong with this picture?
  4. TOPICAL RELEVANCE ASSESSMENT PRACTICES. One NIST assessor per topic (topic originators, subject experts), an asymmetrical 3-point relevance scale, strict annotation guidelines. What's wrong with this picture? Is a singular viewpoint enough? Do experts have bias? What if the task is subjective or ambiguous? Are experts always consistent?
  5. What do experts say? TOPIC: Euro Opposition. Identify documents that discuss opposition to the use of the euro, the European currency. Three document excerpts, each labelled "Highly relevant" by NIST:
     ● "Jack Straw, left, Britain's new foreign secretary, said the country's euro policy remained unchanged, producing a brief recovery in the pound from a 15-year low set last week."
     ● "A common currency, administered by an independent European central bank, would take many of these decisions out of national hands, the main reason why so many Conservatives in Britain oppose introducing the euro there."
     ● "Greece never had a prayer of joining the first round of countries eligible for Europe's common currency, so its exclusion from the list of 11 countries ready to adopt the euro in 1999 was not an issue here."
  6. What does the crowd say? Same topic and same three documents; the crowd assigned them relevance scores of 0.11, 0.46 and 0.67.
  7. Who do you think is more accurate? TOPIC: Euro Opposition. NIST: all three documents labelled "Highly relevant". Crowd: relevance scores of 0.11, 0.46 and 0.67.
  8. TOPICAL RELEVANCE ASSESSMENT PRACTICES. One NIST assessor per topic (topic originators, subject experts), strict annotation guidelines, an asymmetrical 3-point relevance scale. Is a singular viewpoint enough? Do experts have bias? What if the task is subjective or ambiguous? Are experts always consistent? All these may contribute to lower accuracy & reliability.
  9. Research Questions. RQ1: Can we improve the accuracy of topical relevance assessment by adapting the Annotation Guidelines and the Document Granularity? RQ2: Can we improve the reliability of topical relevance assessment by optimizing the Relevance Scale and the Number of Annotators?
  10. Hypotheses. Annotation Guidelines: text highlight as relevance choice motivation increases the accuracy of the results. Relevance Scale: a 2-point relevance annotation scale is more reliable. Document Granularity: topical relevance assessment at the level of document paragraphs increases the accuracy of the results. Number of Annotators: large numbers of crowd workers perform as well as or better than NIST assessors.
  11. RELEVANCE ASSESSMENT METHODOLOGY. Pilot experiments to determine the optimal crowdsourcing setting, varying the Relevance Scale, Annotation Guidelines, Document Granularity, Paragraph Order and Number of Annotators. Evaluate the pilot experiments, then run the main experiment with the optimal crowdsourcing setting: annotate documents and produce document relevance ranks & labels.
  12. DATASET. NYTimes Corpus: 250 TREC Robust Track topics; 23,554 short & medium documents (less than 1000 words). NIST annotations: 50 topics; 5,946 documents; ternary scale (929 highly relevant, 1,421 relevant, 3,596 not relevant).
  13. 8 PILOT EXPERIMENTS with 120 topic-document pairs / 10 topics. Independent variables: Relevance Scale (Binary, Ternary); Annotation Guidelines (Relevance Value, Relevance Value + Text Highlight); Document Granularity (Document Paragraphs, Full Document); Paragraph Order (Random Order, Original Order). Controlled variables: Annotators (Figure Eight, 15 workers, English speaking). Which is the optimal setting?
  14. OVERVIEW OF ALL PILOT EXPERIMENT SETTINGS. Annotators in all pilots: Figure Eight, 15 workers, English speaking.
     Pilot settings (Relevance Scale / Annotation Guidelines / Document Granularity / Paragraph Order):
     ● Binary / Relevance Value / Full Document / N/A
     ● Binary / Relevance Value + Text Highlight / Full Document / N/A
     ● Ternary / Relevance Value / Full Document / N/A
     ● Ternary / Relevance Value + Text Highlight / Full Document / N/A
     ● Binary / Relevance Value / Document Paragraphs / Original Order
     ● Binary / Relevance Value + Text Highlight / Document Paragraphs / Original Order
     ● Binary / Relevance Value / Document Paragraphs / Random Order
     ● Binary / Relevance Value + Text Highlight / Document Paragraphs / Random Order
  15. METHODOLOGY FOR RESULTS ANALYSIS. 1. Ground Truth Reviewing: inspection of the annotations provided by NIST; NIST assessors vs. reviewers; reviewers vs. crowd workers. 2. Crowd Judgments Aggregation: CrowdTruth metrics model and aggregate the topical relevance judgments of the crowd by harnessing disagreement. References: Lora Aroyo, Anca Dumitrache, Oana Inel, Chris Welty: CrowdTruth Tutorial, http://crowdtruth.org/tutorial/; Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, Chris Welty: CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement, https://arxiv.org/pdf/1808.06080.pdf
  16. GROUND TRUTH REVIEWING.
     Independent reviewing: three independent reviewers, familiar with the NIST annotation guidelines, annotated the 120 topic-document pairs used in the pilots.
     Consensus reaching:
     ● Reviewers' inter-rater reliability: Fleiss' κ = 0.67 on ternary relevance; Fleiss' κ = 0.79 on binary relevance.
     ● Reviewers vs. NIST inter-rater reliability: Cohen's κ = 0.44 on ternary relevance; Cohen's κ = 0.60 on binary relevance.
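
For readers who want to reproduce this kind of agreement analysis, here is a minimal sketch of how Fleiss' κ and Cohen's κ can be computed with statsmodels and scikit-learn; the label arrays are illustrative placeholders, not the actual reviewer or NIST annotations.

```python
# Minimal sketch: inter-rater reliability on ternary relevance labels.
# The ratings below are illustrative placeholders, not the real annotations.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Ternary labels: 0 = not relevant, 1 = relevant, 2 = highly relevant.
# Rows = topic-document pairs, columns = the three reviewers.
reviewer_labels = np.array([
    [2, 2, 1],
    [0, 0, 0],
    [1, 2, 2],
    [0, 1, 0],
])

# Fleiss' kappa: agreement among the three reviewers.
counts, _ = aggregate_raters(reviewer_labels)   # (items x categories) count table
print("Fleiss' kappa (reviewers):", fleiss_kappa(counts))

# Cohen's kappa: consensus reviewer label vs. the original NIST label.
consensus = np.array([2, 0, 2, 0])
nist      = np.array([2, 0, 1, 1])
print("Cohen's kappa (reviewers vs. NIST):", cohen_kappa_score(consensus, nist))

# Collapsing to a binary scale (relevant vs. not relevant) typically
# raises agreement, as reported on this slide.
print("Cohen's kappa, binary:", cohen_kappa_score(consensus > 0, nist > 0))
```
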
  17. CROWDTRUTH METRICS. For paragraph-level assessments: we first measure the degree of relevance of each document paragraph with regard to the topic, via the Paragraph-Topic-Document-Pair Relevance Score (ParTDP-RelScore); we then calculate the overall topic-document relevance score by taking the MAX of the ParTDP-RelScores. For full-document-level assessments: we measure the degree to which a relevance label is expressed by the topic-document pair, via the Relevance-Label-Topic-Document-Pair Score (TDP-RelVal). In both cases all worker judgments are weighted by their worker quality. References: Lora Aroyo, Anca Dumitrache, Oana Inel, Chris Welty: CrowdTruth Tutorial, http://crowdtruth.org/tutorial/; Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, Chris Welty: CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement, https://arxiv.org/pdf/1808.06080.pdf
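
As a rough illustration of the aggregation described on this slide (and not the full CrowdTruth implementation, which also derives worker quality from inter-annotator agreement), the sketch below computes a worker-quality-weighted relevance score per paragraph and takes the MAX over paragraphs; the worker quality weights and judgments are hypothetical.

```python
# Simplified sketch of the paragraph-level aggregation described above:
# a worker-quality-weighted relevance score per paragraph, and the MAX
# over paragraphs as the topic-document relevance score.
# Worker quality weights are assumed given here; in CrowdTruth they are
# themselves computed from inter-annotator agreement.

def par_tdp_rel_score(judgments, worker_quality):
    """judgments: {worker_id: 1 if 'relevant' was selected, else 0}."""
    total = sum(worker_quality[w] for w in judgments)
    if total == 0:
        return 0.0
    return sum(worker_quality[w] * label for w, label in judgments.items()) / total

def tdp_rel_score(paragraph_judgments, worker_quality):
    """Topic-document score = MAX over its paragraphs' ParTDP-RelScores."""
    return max(par_tdp_rel_score(j, worker_quality) for j in paragraph_judgments)

# Hypothetical example: 3 paragraphs of one document, judged by 4 workers.
worker_quality = {"w1": 0.9, "w2": 0.7, "w3": 0.4, "w4": 0.8}
paragraphs = [
    {"w1": 0, "w2": 0, "w3": 1, "w4": 0},   # mostly judged not relevant
    {"w1": 1, "w2": 1, "w3": 0, "w4": 1},   # mostly judged relevant
    {"w1": 0, "w2": 1, "w3": 0, "w4": 0},   # disputed
]
print(round(tdp_rel_score(paragraphs, worker_quality), 2))
```

Taking the MAX encodes the assumption that a document is relevant to a topic as soon as at least one of its paragraphs is.
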
  18. Do annotation guidelines influence the results? Setting: Ternary scale, Full Document. Accuracy per label (Highly Relevant / Relevant / Not Relevant):
     ● Relevance Value: NIST assessors 0.57 / 0.45 / 0.79; Crowd 0.57 / 0.63 / 0.86
     ● Relevance Value + Text Highlight: NIST assessors 0.57 / 0.45 / 0.79; Crowd 0.69 / 0.64 / 0.92
     Accuracy is better when answers are motivated.
  19. Does the relevance annotation scale influence the results? Setting: Full Document.
     ● Ternary, Relevance Value (Highly Relevant / Relevant / Not Relevant): NIST 0.57 / 0.45 / 0.79; Crowd 0.57 / 0.63 / 0.86
     ● Ternary, Relevance Value + Text Highlight (Highly Relevant / Relevant / Not Relevant): NIST 0.57 / 0.45 / 0.79; Crowd 0.69 / 0.64 / 0.92
     ● Binary, Relevance Value (Relevant / Not Relevant): NIST 0.80 / 0.79; Crowd 0.85 / 0.81
     ● Binary, Relevance Value + Text Highlight (Relevant / Not Relevant): NIST 0.80 / 0.79; Crowd 0.91 / 0.90
     Accuracy is better when answers are merged.
  20. Does the relevance annotation scale influence the results? Same comparison as slide 19. Accuracy is better when answers are merged; not relevant docs show high agreement.
  21. Does the relevance annotation scale influence the results? Same comparison as slide 19. Accuracy is better when answers are merged; not relevant docs show high agreement; relevant & highly relevant docs are often confused.
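
"Merged" here presumably refers to collapsing highly relevant and relevant into a single relevant class when moving from the ternary to the binary scale; a small illustrative sketch of that collapsing step, using made-up labels:

```python
# Illustrative sketch (made-up labels): collapsing the ternary scale
# {highly relevant, relevant, not relevant} into a binary scale
# {relevant, not relevant}, and measuring per-label accuracy.
from collections import Counter

TO_BINARY = {"highly relevant": "relevant",
             "relevant": "relevant",
             "not relevant": "not relevant"}

def per_label_accuracy(gold, predicted):
    """Fraction of gold items of each label that were labelled correctly."""
    totals, correct = Counter(gold), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            correct[g] += 1
    return {label: correct[label] / totals[label] for label in totals}

gold     = ["highly relevant", "relevant", "not relevant", "relevant"]
assessor = ["relevant", "highly relevant", "not relevant", "relevant"]

print(per_label_accuracy(gold, assessor))                       # ternary
print(per_label_accuracy([TO_BINARY[g] for g in gold],
                         [TO_BINARY[a] for a in assessor]))     # binary (merged)
```
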
  22. Does the document granularity influence the results? Setting: Document Paragraphs, Binary, Original Order; Relevance Value vs. Relevance Value + Text Highlight. Accuracy is better with relevance value at paragraph level.
  23. Does the paragraph order influence the results? Setting: Document Paragraphs, Binary; Random Order vs. Original Order; Relevance Value and Relevance Value + Text Highlight. The crowd assesses better at paragraph level & in random order.
  24. What number of workers gives you reliable results? Setting: Document Paragraphs, Binary, Random Order, Relevance Value + Text Highlight. Binary relevance: 5 workers perform comparably to NIST. Ternary relevance: 7 workers perform as well as the NIST experts, on highly relevant & not relevant docs only.
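
A common way to study how many workers are needed, and not necessarily the exact procedure used in the paper, is to repeatedly subsample k judgments per item, aggregate them, and track accuracy against the reference labels as k grows; a sketch with synthetic judgments:

```python
# Sketch of a worker-count analysis: subsample k workers per item many times,
# aggregate by majority vote, and measure accuracy against reference labels.
# The judgment matrix below is synthetic; this is one standard way to run
# such an analysis, not necessarily the paper's exact procedure.
import random

def accuracy_for_k_workers(judgments, reference, k, trials=200, seed=0):
    rng = random.Random(seed)
    mean_accuracy = 0.0
    for _ in range(trials):
        correct = 0
        for item_judgments, gold in zip(judgments, reference):
            sample = rng.sample(item_judgments, k)
            majority = 1 if sum(sample) * 2 > k else 0   # ties count as not relevant
            correct += (majority == gold)
        mean_accuracy += correct / len(reference)
    return mean_accuracy / trials

# 4 items x 15 binary judgments each (1 = relevant), plus reference labels.
judgments = [
    [1]*11 + [0]*4,
    [0]*12 + [1]*3,
    [1]*8  + [0]*7,
    [0]*9  + [1]*6,
]
reference = [1, 0, 1, 0]

for k in (1, 3, 5, 7, 9):
    print(k, round(accuracy_for_k_workers(judgments, reference, k), 3))
```
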
  25. DERIVED NEW TOPICAL RELEVANCE METHODOLOGY. Ask many crowd annotators, from a diverse crowd of people; follow lightweight annotation guidelines; apply a 2-point relevance scale.
  26. MAIN EXPERIMENT with the optimal settings: 23,554 topic-document pairs / 250 topics; Binary relevance scale; Relevance Value + Text Highlight; Document Paragraphs; Random Order; Figure Eight; 7 workers; English speaking.
  27. RESULTS MAIN EXPERIMENT. Kendall's 𝝉 and 𝝉AP rank correlation coefficients between the official TREC systems' ranking using the NIST test collection and the systems' ranking using the crowdsourced test collection (Binary 𝝉 / Binary 𝝉AP / Ternary 𝝉 / Ternary 𝝉AP):
     ● nDCG: 0.6317 / 0.5485 / 0.5650 / 0.4879
     ● MAP: 0.5679 / 0.4576 / 0.4637 / 0.3728
     ● R-Prec: 0.5703 / 0.4849 / 0.4658 / 0.3751
  28. RESULTS MAIN EXPERIMENT. Same correlation table, contrasting the crowdsourced binary and ternary judgments.
  29. RESULTS MAIN EXPERIMENT. Same correlation table. Crowd & NIST agreement: 63% on the binary scale, 54% on the ternary scale. Crowd workers & NIST assessors do agree.
  30. RESULTS MAIN EXPERIMENT. Same correlation table. Fair to moderate correlation between the two rankings; the correlation drops as we focus more on the top-ranked systems.
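
For reference, Kendall's 𝝉 is available in SciPy, while 𝝉AP, the top-weighted AP rank correlation of Yilmaz et al. (SIGIR 2008), is straightforward to implement from its definition; the system scores in the sketch below are made up and are not the TREC Common Core runs.

```python
# Kendall's tau via SciPy, and tau_AP (Yilmaz et al. 2008) implemented from
# its definition. The per-system scores below are made up for illustration.
from scipy.stats import kendalltau

def tau_ap(reference, evaluated):
    """AP rank correlation; both arguments are lists of system ids, best first."""
    ref_pos = {sys: i for i, sys in enumerate(reference)}
    n = len(evaluated)
    total = 0.0
    for i in range(1, n):                      # positions 2..N in the evaluated ranking
        above = evaluated[:i]
        current = evaluated[i]
        correct = sum(1 for s in above if ref_pos[s] < ref_pos[current])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0

# Hypothetical MAP scores for five systems under two test collections.
nist_map  = {"A": 0.31, "B": 0.28, "C": 0.25, "D": 0.22, "E": 0.18}
crowd_map = {"A": 0.29, "B": 0.30, "C": 0.24, "D": 0.19, "E": 0.21}

systems = sorted(nist_map)                     # fixed system order for kendalltau
tau, _ = kendalltau([nist_map[s] for s in systems],
                    [crowd_map[s] for s in systems])

nist_rank  = sorted(nist_map,  key=nist_map.get,  reverse=True)
crowd_rank = sorted(crowd_map, key=crowd_map.get, reverse=True)
print(round(tau, 4), round(tau_ap(nist_rank, crowd_rank), 4))
```

Note that 𝝉AP, unlike 𝝉, is asymmetric: it depends on which ranking is treated as the reference, here the NIST-based ranking.
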
  31. CONCLUSIONS & FUTURE WORK. Contributions: a new methodology for topical relevance assessment; new relevance metrics (CrowdTruth metrics) harnessing diverse opinions & disagreement among annotators; an annotated test collection for topical relevance (23,554 English topic-document pairs). Future work: extend the test collection with the longer documents from the NYT Corpus; investigate the influence of topic difficulty & ambiguity. Code & resources: https://github.com/CrowdTruth/NYT-Crowdsourcing-Topical-Relevance; CrowdTruth metrics: https://github.com/CrowdTruth/CrowdTruth-core; CrowdTruth tutorial: http://crowdtruth.org/tutorial/

Editor's Notes

  • Hello everyone, I am Oana Inel, and today I will present a collaboration between the CrowdTruth team (myself and Lora Aroyo), researchers from UvA and Southampton, and IBM.
  • Today I’ll talk about Assessment Practices, and specifically about Topical Relevance assessment practices. We have collected a couple of observations from reviewing current NIST practices for topical relevance assessment and some of them stood out.
    As probably many of you already know, NIST assessors are topic originators and subject experts. One thing that really stands out is that just one NIST assessor is typically assigned per topic and they use an asymmetrical 3-point scale to label the topic-document pairs.
    Finally, to ensure a uniform understanding of the annotation task, they need to follow strict annotation guidelines.
  • So, does something seem wrong in this picture? What do you think?
  • Do you think that a single viewpoint for every topic-document pair, and even for every topic, is enough to gather a reliable test collection? Do you think that experts cannot have bias, or that they are always consistent when annotating? And furthermore, what happens when the task is subjective or ambiguous?
  • Now, let’s look at some NIST annotations from the corpus that we used in our experiments. We have the following topic - euro opposition: identify documents that discuss …
    Now, let’s see what NIST assessors said about these documents:
  • Now let's see what the crowd said. We gave them the same topic and the same documents, and they said that the first document is about 70% relevant, the second one about 50% relevant, and the last one hardly relevant.
  • Who do you think produced more accurate annotations here? The crowd that gave us a ranking of the documents or the experts that said all three docs are highly relevant?
  • Now, coming back to my introductory slide, do you see why there is a problem in current annotation practices? Even though they are performed by highly trained experts, one person is sometimes not enough to capture the entire truth; people can have bias, people can make mistakes, and it is very difficult to always be consistent.
    Scales such as highly relevant, relevant and not relevant are usually difficult and confusing, and annotation guidelines cannot help much if the task is subjective or ambiguous.

    And all these may actually contribute to lower accuracy and reliability.
  • We did a set of experiments to see whether we could empirically confirm those observations, and we formulated two main research questions to guide our analysis.

    We wanted to see if we can improve the accuracy of topical relevance assessment by adapting, or more exactly relaxing, the annotation guidelines rather than making them more strict, and whether changing the document granularity would also positively influence the accuracy.

    And through the second RQ we wanted to see whether we can improve the reliability of topical relevance assessment by choosing a relevance scale that minimizes the intrinsic inconsistencies of annotators, and by choosing the optimal number of crowd annotators to gather annotations from.
  • As a result of these RQs, we defined a set of hypotheses related to the annotation guidelines, the document granularity, the relevance scale and the number of annotators.
  • And we defined a methodology in which we performed multiple pilot experiments, varying the values of these variables, in order to identify the optimal crowdsourcing settings to rank and label documents with topical relevance.
  • We worked with the short and medium documents in the NYTimes Corpus, covering 250 TREC Robust Track topics.
  • Let's have a look at how we organized our pilot experiments. We used 120 topic-document pairs, covering 10 topics that were previously annotated by NIST.

    In all the pilot experiments we fixed the crowdsourcing platform, Figure Eight, and we asked for 15 annotations from workers in English-speaking countries.

    And the goal of these pilot experiments was to find out which combination of these variables produces an optimal setting.
  • Here are all the variations that we tested. We experimented with binary and ternary relevance scales, and with different annotation guidelines, in which we asked the workers either to provide only the relevance label or to provide the relevance label and highlight the part of the document that motivates the relevance. Further, we experimented with the document granularity, by asking workers to annotate the full document or document paragraphs, and with the order in which the paragraphs are shown.
  • We consider the way we analyzed the results to be a contribution of this research. We performed a two-step analysis. First, in order to make sure that we understood the annotations provided by NIST, we reviewed and analyzed them so we could accurately compare them with the crowd annotations. Second, we did not apply the traditional majority-vote approach to evaluate the crowd annotations, but instead applied the CrowdTruth quality metrics, which harness the disagreement among annotators.
  • As I mentioned earlier, we first reviewed the ground truth labels. We used three reviewers, authors of this paper who were familiar with the NIST annotation guidelines. The reviewers annotated all 120 topic-document pairs used in the pilots.
    We computed the inter-rater reliability among the reviewers and between the reviewers and NIST, and we observed that on a binary scale the agreement is overall much higher than on a ternary scale.