
CIKM 2018: Studying Topical Relevance with Evidence-based Crowdsourcing


Information Retrieval systems rely on large test collections to measure their effectiveness in retrieving relevant documents. While the demand is high, the task of creating such test collections is laborious due to the large amounts of data that need to be annotated and the intrinsic subjectivity of the task itself. In this paper we study topical relevance from a user perspective by addressing the problems of subjectivity and ambiguity. We compare our approach and results with the established TREC annotation guidelines and results. The comparison is based on a series of crowdsourcing pilots experimenting with variables such as the relevance scale, the document granularity, the annotation template and the number of workers. Our results show a correlation between relevance assessment accuracy and smaller document granularity: aggregating relevance at the paragraph level yields better relevance accuracy than assessment done at the level of the full document. As expected, our results also show that collecting binary relevance judgments results in a higher accuracy than the ternary scale used in the TREC annotation guidelines. Finally, the crowdsourced annotation tasks provided a more accurate document relevance ranking than a single assessor relevance label. This work resulted in a reliable test collection around the TREC Common Core track.


CIKM 2018: Studying Topical Relevance with Evidence-based Crowdsourcing

  1. Studying Topical Relevance with Evidence-based Crowdsourcing. Oana Inel, Giannis Haralabopoulos, Dan Li, Christophe Van Gysel, Zoltán Szlávik, Elena Simperl, Evangelos Kanoulas and Lora Aroyo. CIKM 2018. @oana_inel https://oana-inel.github.io
  2. TOPICAL RELEVANCE ASSESSMENT PRACTICES: typically one NIST assessor per topic; NIST-employed assessors (topic originators, subject experts); an asymmetrical 3-point relevance scale [highly relevant, relevant, not relevant]; strict annotation guidelines to ensure a uniform understanding of the annotation task.
  3. TOPICAL RELEVANCE ASSESSMENT PRACTICES: typically one NIST assessor per topic; NIST-employed assessors (topic originators, subject experts); an asymmetrical 3-point relevance scale [highly relevant, relevant, not relevant]; strict annotation guidelines to ensure a uniform understanding of the annotation task. What's wrong with this picture?
  4. TOPICAL RELEVANCE ASSESSMENT PRACTICES: what's wrong with this picture? Typically one NIST assessor per topic; NIST-employed assessors (topic originators, subject experts); strict annotation guidelines to ensure a uniform understanding of the annotation task; an asymmetrical 3-point relevance scale [highly relevant, relevant, not relevant]. Is a singular viewpoint enough? Do experts have bias? What if the task is subjective or ambiguous? Are experts always consistent?
  5. What do the experts say? TOPIC: Euro Opposition. Identify documents that discuss opposition to the use of the euro, the European currency. Three example passages, each judged highly relevant by the experts: (1) "Jack Straw, left, Britain's new foreign secretary, said the country's euro policy remained unchanged, producing a brief recovery in the pound from a 15-year low set last week." (2) "A common currency, administered by an independent European central bank, would take many of these decisions out of national hands, the main reason why so many Conservatives in Britain oppose introducing the euro there." (3) "Greece never had a prayer of joining the first round of countries eligible for Europe's common currency, so its exclusion from the list of 11 countries ready to adopt the euro in 1999 was not an issue here."
  6. What does the crowd say? For the same TOPIC: Euro Opposition and the same three passages, the crowd relevance scores are 0.11, 0.46 and 0.67.
  7. Who do you think is more accurate? Same topic and passages: the experts labeled all three passages highly relevant, while the crowd relevance scores are 0.11, 0.46 and 0.67.
  8. TOPICAL RELEVANCE ASSESSMENT PRACTICES: typically one NIST assessor per topic; NIST-employed assessors (topic originators, subject experts); strict annotation guidelines to ensure a uniform understanding of the annotation task; an asymmetrical 3-point relevance scale [highly relevant, relevant, not relevant]. Is a singular viewpoint enough? Do experts have bias? What if the task is subjective or ambiguous? Are experts always consistent? All these may contribute to lower accuracy & reliability.
  9. Research Questions. RQ1: Can we improve the accuracy of topical relevance assessment by adapting the relevance scale and the annotation guidelines? RQ2: Can we improve the reliability of topical relevance assessment by optimizing the document granularity and the number of annotators?
  10. Hypotheses. Annotation guidelines: text highlight as relevance choice motivation increases the accuracy of the results. Relevance scale: a 2-point relevance annotation scale is more reliable. Document granularity: topical relevance assessment at the level of document paragraphs increases the accuracy of the results. Number of annotators: large numbers of crowd workers perform as well as or better than NIST assessors.
  11. RELEVANCE ASSESSMENT METHODOLOGY: annotate documents in pilot experiments to determine the optimal crowdsourcing setting (varying the relevance scale, annotation guidelines, document granularity, paragraph order and number of annotators); evaluate the pilot experiments; run the main experiment with the optimal crowdsourcing setting; output relevance ranks & labels for the documents.
  12. DATASET. NYTimes Corpus: 250 TREC Robust Track topics; 23,554 short & medium documents (fewer than 1,000 words). NIST annotations: 50 topics; 5,946 documents; ternary scale (929 highly relevant, 1,421 relevant, 3,596 not relevant).
  13. 8 PILOT EXPERIMENTS with 120 topic-document pairs / 10 topics. Independent variables: relevance scale (binary, ternary); annotation guidelines (relevance value, relevance value + text highlight); document granularity (document paragraphs, full document); paragraph order (random order, original order). Controlled variables: annotators (Figure Eight, 15 workers, English speaking). Which is the optimal setting?
  14. OVERVIEW OF ALL PILOT EXPERIMENT SETTINGS (annotators in all pilots: Figure Eight, 15 workers, English speaking):
     ● ternary scale, relevance value, full document (paragraph order: N/A)
     ● ternary scale, relevance value + text highlight, full document (N/A)
     ● binary scale, relevance value, full document (N/A)
     ● binary scale, relevance value + text highlight, full document (N/A)
     ● binary scale, relevance value, document paragraphs, original order
     ● binary scale, relevance value + text highlight, document paragraphs, original order
     ● binary scale, relevance value, document paragraphs, random order
     ● binary scale, relevance value + text highlight, document paragraphs, random order
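  The eight settings above are the admissible crossings of the pilot variables. A minimal sketch of how they can be enumerated programmatically, assuming the crossing shown on this slide (paragraph order only applies to paragraph-level granularity, and the ternary scale was only piloted on full documents); all names are illustrative and not taken from the paper's code.

      from itertools import product

      # Independent variables from the pilot design (slides 13-14).
      RELEVANCE_SCALES = ["binary", "ternary"]
      ANNOTATION_GUIDELINES = ["relevance value", "relevance value + text highlight"]
      DOCUMENT_GRANULARITIES = ["full document", "document paragraphs"]
      PARAGRAPH_ORDERS = ["original order", "random order"]

      def pilot_configurations():
          """Enumerate the 8 pilot settings shown in the overview above."""
          configs = []
          for scale, guideline, granularity in product(
                  RELEVANCE_SCALES, ANNOTATION_GUIDELINES, DOCUMENT_GRANULARITIES):
              if granularity == "full document":
                  configs.append((scale, guideline, granularity, "N/A"))
              elif scale == "binary":  # paragraph-level pilots used the binary scale only
                  for order in PARAGRAPH_ORDERS:
                      configs.append((scale, guideline, granularity, order))
          return configs

      for cfg in pilot_configurations():
          print(cfg)  # prints 8 configurations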
  15. METHODOLOGY FOR RESULTS ANALYSIS. 1. Ground truth reviewing: inspection of the annotations made by NIST; NIST assessors vs. reviewers; reviewers vs. crowd workers. 2. Crowd judgment aggregation: CrowdTruth metrics model and aggregate the topical relevance judgments of the crowd by harnessing disagreement. References: Lora Aroyo, Anca Dumitrache, Oana Inel, Chris Welty: CrowdTruth Tutorial, http://crowdtruth.org/tutorial/; Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, Chris Welty: CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement, https://arxiv.org/pdf/1808.06080.pdf
  16. GROUND TRUTH REVIEWING. Independent reviewing: three independent reviewers, familiar with the NIST annotation guidelines, reviewed the 120 topic-document pairs used in the pilots. Consensus reaching: reviewers' inter-rater reliability was Fleiss' kappa = 0.67 on ternary relevance and 0.79 on binary relevance; reviewers vs. NIST inter-rater reliability was Cohen's kappa = 0.44 on ternary relevance and 0.60 on binary relevance.
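  For reference, the agreement numbers above are standard chance-corrected statistics. A small sketch of how they can be computed with off-the-shelf implementations (statsmodels for Fleiss' kappa, scikit-learn for Cohen's kappa); the label arrays below are made-up placeholders, not the actual reviewer or NIST annotations.

      import numpy as np
      from sklearn.metrics import cohen_kappa_score
      from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

      # Placeholder ternary judgments: rows = topic-document pairs, columns = the 3 reviewers.
      # 0 = not relevant, 1 = relevant, 2 = highly relevant.
      reviewer_labels = np.array([
          [2, 2, 1],
          [0, 0, 0],
          [1, 2, 2],
          [0, 1, 0],
      ])

      # Fleiss' kappa over the three reviewers, on the ternary scale.
      counts, _ = aggregate_raters(reviewer_labels)
      print("Fleiss' kappa (ternary):", fleiss_kappa(counts))

      # Collapse to binary relevance (relevant vs. not relevant) and recompute.
      counts_bin, _ = aggregate_raters((reviewer_labels > 0).astype(int))
      print("Fleiss' kappa (binary):", fleiss_kappa(counts_bin))

      # Cohen's kappa between the reviewers' consensus and the NIST labels (placeholders).
      consensus = np.array([2, 0, 2, 0])
      nist = np.array([2, 0, 1, 1])
      print("Cohen's kappa (ternary):", cohen_kappa_score(consensus, nist))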
  17. CROWDTRUTH METRICS. For full-document-level assessments, we measure the degree to which a relevance label is expressed by the topic-document pair: the Relevance-Label-Topic-Document-Pair Score (TDP-RelVal). For paragraph-level assessments, we first measure the degree of relevance of each document paragraph with regard to the topic, the Paragraph-Topic-Document-Pair Relevance Score (ParTDP-RelScore), and then calculate the overall topic-document relevance score by taking the MAX of the ParTDP-RelScores. In both cases, all worker judgments are weighted by worker quality. References: Lora Aroyo, Anca Dumitrache, Oana Inel, Chris Welty: CrowdTruth Tutorial, http://crowdtruth.org/tutorial/; Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, Chris Welty: CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement, https://arxiv.org/pdf/1808.06080.pdf
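  The actual metrics are implemented in the CrowdTruth-core package referenced above; the following is only a simplified sketch of the aggregation idea on this slide: a paragraph-level relevance score as the worker-quality-weighted fraction of "relevant" votes, and the document-level score as the MAX over its paragraphs. The fixed worker-quality weights are placeholders; in CrowdTruth they are themselves computed iteratively from disagreement.

      def par_relevance_score(judgments, worker_quality):
          """Worker-quality-weighted relevance score for one (topic, paragraph) pair.
          `judgments` is a list of (worker_id, vote), vote 1 = relevant, 0 = not relevant."""
          weighted = sum(worker_quality[w] * vote for w, vote in judgments)
          total = sum(worker_quality[w] for w, _ in judgments)
          return weighted / total if total else 0.0

      def document_relevance_score(paragraph_judgments, worker_quality):
          """Topic-document relevance score = MAX over the paragraph-level scores."""
          return max(par_relevance_score(j, worker_quality)
                     for j in paragraph_judgments.values())

      # Placeholder example: one document with two paragraphs, judged by 4 workers
      # whose (illustrative) quality weights differ.
      worker_quality = {"w1": 0.9, "w2": 0.6, "w3": 0.8, "w4": 0.4}
      paragraph_judgments = {
          "par1": [("w1", 0), ("w2", 1), ("w3", 0), ("w4", 0)],
          "par2": [("w1", 1), ("w2", 1), ("w3", 0), ("w4", 1)],
      }
      print(document_relevance_score(paragraph_judgments, worker_quality))  # about 0.70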
  18. Do annotation guidelines influence the results? Ternary scale, full document. With relevance value only: per-label accuracy (highly relevant / relevant / not relevant) is 0.57 / 0.45 / 0.79 for the NIST assessors and 0.57 / 0.63 / 0.86 for the crowd. With relevance value + text highlight: 0.57 / 0.45 / 0.79 for NIST and 0.69 / 0.64 / 0.92 for the crowd. Takeaway: accuracy is better when answers are motivated.
  19. Does the relevance annotation scale influence the results? Ternary scale, full document: with relevance value only, per-label accuracy (highly relevant / relevant / not relevant) is 0.57 / 0.45 / 0.79 for NIST and 0.57 / 0.63 / 0.86 for the crowd; with relevance value + text highlight, 0.57 / 0.45 / 0.79 for NIST and 0.69 / 0.64 / 0.92 for the crowd. Binary scale, full document: with relevance value only, per-label accuracy (relevant / not relevant) is 0.80 / 0.79 for NIST and 0.85 / 0.81 for the crowd; with relevance value + text highlight, 0.80 / 0.79 for NIST and 0.91 / 0.90 for the crowd. Takeaway: accuracy is better when answers are merged.
  20. Does the relevance annotation scale influence the results? (Same settings and accuracy values as slide 19.) Takeaways: accuracy is better when answers are merged; not relevant documents show high agreement.
  21. Does the relevance annotation scale influence the results? (Same settings and accuracy values as slide 19.) Takeaways: accuracy is better when answers are merged; not relevant documents show high agreement; relevant & highly relevant documents are often confused.
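  The accuracies above compare discrete labels, so the continuous crowd scores first have to be mapped onto the relevance scale. A minimal sketch of that step, assuming a 0.5 threshold for the binary scale (the actual thresholds used in the paper may differ), with per-label accuracy measured against the reviewed ground truth; all data below is a placeholder.

      def to_binary(score, threshold=0.5):
          """Map an aggregated crowd relevance score to a binary label (assumed threshold)."""
          return "relevant" if score >= threshold else "not relevant"

      def per_label_accuracy(predicted, gold):
          """For each gold label, the fraction of pairs whose predicted label matches it."""
          accuracy = {}
          for label in set(gold.values()):
              pairs = [p for p, g in gold.items() if g == label]
              accuracy[label] = sum(predicted[p] == label for p in pairs) / len(pairs)
          return accuracy

      # Placeholder reviewed gold labels, NIST labels and aggregated crowd scores.
      gold = {"d1": "relevant", "d2": "not relevant", "d3": "relevant"}
      nist = {"d1": "relevant", "d2": "relevant", "d3": "relevant"}
      crowd_scores = {"d1": 0.67, "d2": 0.11, "d3": 0.46}

      crowd = {d: to_binary(s) for d, s in crowd_scores.items()}
      print("NIST :", per_label_accuracy(nist, gold))
      print("Crowd:", per_label_accuracy(crowd, gold))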
  22. Does the document granularity influence the results? Binary scale, document paragraphs, original order, with relevance value vs. relevance value + text highlight. Takeaway: accuracy is better with relevance value at paragraph level.
  23. Does the paragraph order influence the results? Binary scale, document paragraphs, relevance value and relevance value + text highlight, in random vs. original order. Takeaway: the crowd assesses better at paragraph level & in random order.
  24. How many workers give reliable results? Binary scale, relevance value + text highlight, document paragraphs, random order. With binary relevance, 5 workers perform comparably to NIST; with ternary relevance, 7 workers perform as well as the NIST experts, but only on highly relevant & not relevant documents.
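  One way to arrive at such a number is to subsample the 15 pilot judgments: repeatedly draw k workers per topic-document pair, aggregate their votes, and measure accuracy against the reviewed labels. A sketch with plain majority voting rather than the CrowdTruth metrics, on made-up placeholder data.

      import random

      def majority_vote(votes):
          """Binary majority vote; ties fall back to not relevant (0)."""
          return 1 if sum(votes) * 2 > len(votes) else 0

      def accuracy_at_k(judgments, gold, k, n_samples=1000, seed=0):
          """Mean accuracy when only k randomly chosen workers per pair are used."""
          rng = random.Random(seed)
          total = 0.0
          for _ in range(n_samples):
              correct = sum(majority_vote(rng.sample(votes, k)) == gold[pair]
                            for pair, votes in judgments.items())
              total += correct / len(judgments)
          return total / n_samples

      # Placeholder: 15 binary votes per pair (as in the pilots) and reviewed labels.
      judgments = {
          "d1": [1] * 11 + [0] * 4,
          "d2": [0] * 13 + [1] * 2,
          "d3": [1] * 8 + [0] * 7,
      }
      gold = {"d1": 1, "d2": 0, "d3": 1}
      for k in (3, 5, 7, 9):
          print(k, "workers:", round(accuracy_at_k(judgments, gold, k), 3))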
  25. DERIVED NEW TOPICAL RELEVANCE METHODOLOGY: ask many crowd annotators; use a diverse crowd of people; follow lightweight annotation guidelines; apply a 2-point relevance scale.
  26. MAIN EXPERIMENT with the optimal settings: 23,554 topic-document pairs / 250 topics; binary scale; relevance value + text highlight; document paragraphs; random order; Figure Eight; 7 workers; English speaking.
  27. RESULTS MAIN EXPERIMENT. Kendall's τ and τAP rank correlation coefficients between the official TREC systems' ranking using the NIST test collection and the systems' ranking using the crowdsourced test collection:
     nDCG: binary τ = 0.6317, τAP = 0.5485; ternary τ = 0.5650, τAP = 0.4879
     MAP: binary τ = 0.5679, τAP = 0.4576; ternary τ = 0.4637, τAP = 0.3728
     R-Prec: binary τ = 0.5703, τAP = 0.4849; ternary τ = 0.4658, τAP = 0.3751
  28. RESULTS MAIN EXPERIMENT (same τ / τAP table as slide 27, highlighting the binary vs. ternary columns).
  29. RESULTS MAIN EXPERIMENT (same τ / τAP table as slide 27). Crowd & NIST agreement: 63% on the binary scale, 54% on the ternary scale. Crowd workers & NIST assessors do agree.
  30. RESULTS MAIN EXPERIMENT (same τ / τAP table as slide 27). There is fair to moderate correlation between the two rankings, and the correlation drops as we focus more on the top-ranked systems.
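  The two coefficients above compare two orderings of the same TREC systems. A sketch of how both can be computed: scipy's kendalltau for τ, and a small hand-rolled implementation of the AP rank correlation τAP, which weights disagreements near the top of the reference ranking more heavily. The system names and rankings below are placeholders, not the actual TREC runs.

      from scipy.stats import kendalltau

      def tau_ap(reference, ranking):
          """AP rank correlation between two rankings of the same systems (best first),
          computed with `reference` as the reference ranking."""
          pos = {sys: i for i, sys in enumerate(ranking)}
          n = len(reference)
          total = 0.0
          for i in range(1, n):
              # Systems above position i in the reference that are also ranked
              # above reference[i] in the evaluated ranking.
              c = sum(pos[s] < pos[reference[i]] for s in reference[:i])
              total += c / i
          return 2.0 * total / (n - 1) - 1.0

      # Placeholder: systems ordered by effectiveness on the NIST qrels vs. the crowd qrels.
      ranking_nist = ["sysA", "sysB", "sysC", "sysD", "sysE"]
      ranking_crowd = ["sysB", "sysA", "sysC", "sysE", "sysD"]

      # Kendall's tau works on paired rank vectors, so convert to positions.
      ranks_nist = [ranking_nist.index(s) for s in ranking_nist]
      ranks_crowd = [ranking_crowd.index(s) for s in ranking_nist]
      tau, _ = kendalltau(ranks_nist, ranks_crowd)
      print("tau   :", tau)
      print("tau_AP:", tau_ap(ranking_nist, ranking_crowd))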
  31. CONCLUSIONS & FUTURE WORK. Contributions: a new methodology for topical relevance assessment; new relevance metrics (CrowdTruth metrics) harnessing diverse opinions & disagreement among annotators; an annotated test collection for topical relevance (23,554 English topic-document pairs). Future work: extend the test collection with the longer documents from the NYT Corpus; investigate the influence of topic difficulty & ambiguity. Code & resources: https://github.com/CrowdTruth/NYT-Crowdsourcing-Topical-Relevance; CrowdTruth metrics: https://github.com/CrowdTruth/CrowdTruth-core; CrowdTruth tutorial: http://crowdtruth.org/tutorial/
