Design and Implementation of Relevance Assessments Using Crowdsourcing, by Omar Alonso and Ricardo Baeza-Yates
  1. 1. Design and Implementation of Relevance Assessments Using Crowdsourcing. Omar Alonso (1) and Ricardo Baeza-Yates (2). (1) Microsoft Corp., Mountain View, California, US; (2) Yahoo! Research, Barcelona, Spain. European Conference on Information Retrieval, 2011
  2. 2. Outline of presentation <ul><li>What is Crowdsourcing? </li></ul><ul><ul><li>Amazon Mechanical Turk (AMT) </li></ul></ul><ul><li>Overview of the paper </li></ul><ul><li>Details of AMT experimental design </li></ul><ul><li>Results </li></ul><ul><li>Recommendations </li></ul>
  3. 3. Crowdsourcing <ul><li>The term &quot;crowdsourcing&quot; is a portmanteau of &quot; crowd &quot; and &quot; outsourcing ,&quot; first coined by Jeff Howe in a June 2006 Wired magazine article &quot;The Rise of Crowdsourcing&quot;. </li></ul>&quot;Crowdsourcing is the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call.&quot; - Jeff Howe
  5. 5. Mechanical Turk <ul><li>The original Mechanical Turk was a fake chess-playing machine constructed in 1770 </li></ul>
  7. 7. Amazon Mechanical Turk Workflow: (1) Design and build HITs; (2) Put HITs on Amazon Mechanical Turk; (3) Collect evaluation results from MTurk Workers; (4) Approve evaluation results according to designed approval rules; (5) Pay Workers whose inputs have been approved. Outcome: collected binary relevance assessments on TREC-8.
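The approval step in the workflow above aggregates the redundant judgments collected per HIT. The paper does not publish code, so the following is only a minimal sketch of one common approval rule, majority vote over the five workers assigned to each HIT; the function name `majority_label` and the label strings are hypothetical.

```python
from collections import Counter

def majority_label(judgments):
    """Aggregate one HIT's worker judgments by majority vote.

    With 5 workers per HIT and binary labels, a strict majority
    always exists. Returns the winning label and its vote share.
    """
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(judgments)

# Hypothetical judgments for one topic-document pair from 5 workers
label, support = majority_label(
    ["relevant", "relevant", "not_relevant", "relevant", "relevant"]
)
```

In this sketch a 4-of-5 agreement yields the consensus label "relevant" with support 0.8; a requester might pay all workers but keep only the majority label as the gold judgment.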
  9. 9. Objective of the paper <ul><li>To introduce a methodology for Crowdsourcing binary relevance assessments using Amazon Mechanical Turk </li></ul>
  10. 10. Methodology <ul><ul><li>Data preparation </li></ul></ul><ul><ul><ul><li>Document collection, topics (queries), documents per topic, number of people that will evaluate one HIT </li></ul></ul></ul><ul><ul><li>Interface design </li></ul></ul><ul><ul><ul><li>Most important part of AMT experiment design </li></ul></ul></ul><ul><ul><ul><li>Keep HITs simple </li></ul></ul></ul><ul><ul><ul><li>Instructions should be clear, concise, specific, free from jargon, and easy to read </li></ul></ul></ul><ul><ul><ul><li>Include examples in HITs </li></ul></ul></ul><ul><ul><ul><li>Use UI elements to specify formatting </li></ul></ul></ul><ul><ul><ul><li>Don’t ask for all or every </li></ul></ul></ul><ul><ul><ul><li>Explain what will not be accepted to avoid conflicts later on </li></ul></ul></ul>
  11. 11. Methodology <ul><li>Filtering the workers </li></ul><ul><ul><li>Approval rate: provided by AMT </li></ul></ul><ul><ul><li>Qualification test: a better quality filter, but involves more development cycles </li></ul></ul><ul><ul><li>Honey pots: interleave items with known answers into the assignments to check for spamming </li></ul></ul><ul><li>Scheduling the tasks </li></ul><ul><ul><li>Split tasks into small chunks: helps avoid Worker fatigue </li></ul></ul><ul><ul><li>Submit shorter tasks first </li></ul></ul><ul><ul><li>Incorporate any implicit or explicit feedback into the experimental design </li></ul></ul>
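The honey-pot idea on the slide above can be sketched as a simple post-hoc filter: workers whose accuracy on the interleaved known-answer items falls below a threshold are flagged for review. This is an illustrative sketch, not code from the paper; the function name, data shapes, and the 0.8 threshold are assumptions.

```python
def flag_spammers(answers, gold, min_accuracy=0.8):
    """Flag workers who fail the interleaved honey-pot items.

    answers: {worker_id: {item_id: label}} - all labels a worker submitted
    gold:    {item_id: correct_label}      - the honey-pot items
    Returns the list of worker ids whose accuracy on gold items
    they actually answered is below min_accuracy.
    """
    flagged = []
    for worker, labels in answers.items():
        gold_items = [i for i in labels if i in gold]
        if not gold_items:
            continue  # worker saw no honey pots; nothing to judge
        correct = sum(labels[i] == gold[i] for i in gold_items)
        if correct / len(gold_items) < min_accuracy:
            flagged.append(worker)
    return flagged

# Hypothetical data: d1 and d2 are honey pots, d3 is a real item
gold = {"d1": "relevant", "d2": "not_relevant"}
answers = {
    "w1": {"d1": "relevant", "d2": "not_relevant", "d3": "relevant"},
    "w2": {"d1": "not_relevant", "d2": "relevant", "d3": "relevant"},
}
flagged = flag_spammers(answers, gold)
```

Here worker w2 misses both honey pots and is flagged, while w1's real-item labels can be trusted and approved.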
  12. 12. Experimental Setup <ul><ul><li>TREC 8 - LA Times and FBIS sub-collections </li></ul></ul><ul><ul><li>50 topics (queries) </li></ul></ul><ul><ul><li>10 documents per query </li></ul></ul><ul><ul><li>5 Workers per HIT </li></ul></ul><ul><ul><li>Budget = $100 </li></ul></ul><ul><ul><ul><li>$0.02 for binary assessment + $0.02 for comment/feedback </li></ul></ul></ul><ul><ul><li>Agreement between raters is measured using Cohen’s kappa (κ) </li></ul></ul>
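Cohen's kappa, the agreement measure named on the slide above, corrects observed agreement for the agreement expected by chance from each rater's label frequencies: κ = (p_o − p_e) / (1 − p_e). A minimal sketch of the two-rater computation (the function name and sample labels are illustrative, not data from the paper):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    p_o: observed proportion of items where the raters agree.
    p_e: chance agreement, from the product of the raters'
         marginal label frequencies, summed over labels.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary relevance labels from two assessors
a = ["R", "R", "N", "R", "N", "N", "R", "N"]
b = ["R", "N", "N", "R", "N", "R", "R", "N"]
kappa = cohens_kappa(a, b)
```

In this example the raters agree on 6 of 8 items (p_o = 0.75) while chance agreement is 0.5, giving κ = 0.5, which is typically read as moderate agreement.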
  19. 19. Effect of Highlighting <ul><li>Two UIs </li></ul><ul><ul><li>One with query terms highlighted </li></ul></ul><ul><ul><li>Other with no highlighting of query terms </li></ul></ul><ul><li>With a couple of exceptions, highlighting contributed to higher relevance judgments compared to the plain UI. </li></ul>
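The highlighted UI described above can be produced with a small term-wrapping step at HIT-generation time. This is a hypothetical sketch (the function name and sample query are mine, not from the paper): each query term is wrapped in `<b>` tags, matching whole words case-insensitively.

```python
import re

def highlight(text, query):
    """Wrap each whole-word occurrence of a query term in <b> tags.

    Matching is case-insensitive; re.escape guards against query
    terms containing regex metacharacters.
    """
    for term in query.split():
        text = re.sub(
            rf"\b({re.escape(term)})\b",
            r"<b>\1</b>",
            text,
            flags=re.IGNORECASE,
        )
    return text

snippet = highlight("Foreign minorities in Germany", "germany minorities")
```

Whole-word matching avoids highlighting substrings (e.g. a query term "man" inside "Germany"), which would make the rendered HIT noisier rather than easier to judge.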
  21. 21. Experiment with comments <ul><li>In experiments E1-E3, comments were optional </li></ul><ul><li>From experiment E4 onwards, comments were mandatory </li></ul><ul><li>Re-launched E5 to see the effect of a bonus on the length and quality of comments </li></ul>
  23. 23. Recommendations <ul><li>Take an iterative approach to designing the UI: retain the ability to incorporate feedback </li></ul><ul><li>Split tasks into small chunks and submit smaller tasks first </li></ul><ul><li>Provide detailed feedback for rejected HITs to build Worker trust through a word-of-mouth effect </li></ul><ul><li>Look out for very fast work, as it might be produced by a robot </li></ul><ul><li>Bonus payments can help generate better comments </li></ul>
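The "very fast work" recommendation above amounts to a simple screen on assignment duration: submissions completed faster than any human could plausibly read the document deserve manual review. A minimal sketch under my own assumptions (the 10-second floor and the data shape are illustrative, not values from the paper):

```python
def flag_fast_work(durations, floor_seconds=10):
    """Return assignment ids completed suspiciously fast.

    durations: {assignment_id: seconds between accept and submit}.
    Anything under floor_seconds is a candidate robot/spam
    submission to inspect before approval.
    """
    return [aid for aid, secs in durations.items() if secs < floor_seconds]

# Hypothetical timing data for three submitted assignments
fast = flag_fast_work({"a1": 4.2, "a2": 35.0, "a3": 7.9})
```

A reasonable floor depends on document length; in practice one might set it per HIT type from the observed distribution of completion times rather than a fixed constant.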
  24. 24. Questions? Thank you!