Crowdsourcing for Book Search Evaluation: Impact of HIT Design on Comparative System Ranking
Presentation Transcript

  • Crowdsourcing for Book Search Evaluation: Impact of HIT Design on Comparative System Ranking
    Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling
    Presented by Kumar Ashish
  • Traditional Method
    ● Requires a trained group of experts to gather information
    ● Requires precise guidelines
    ● Problem of scalability
      ○ INEX Book Collection
        ■ 50,239 books
        ■ 83 Prove-It topics
        ■ An assessor would need 33 days to judge a single topic when spending 95 minutes each day
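The scalability figures on this slide can be turned into a rough effort estimate. The sketch below multiplies out the slide's numbers; extending the per-topic cost to all 83 topics is an assumption made here for illustration, not a figure from the presentation.

```python
# Figures taken from the slide.
DAYS_PER_TOPIC = 33
MINUTES_PER_DAY = 95
TOPICS = 83

# Effort for one topic.
minutes_per_topic = DAYS_PER_TOPIC * MINUTES_PER_DAY
hours_per_topic = minutes_per_topic / 60

# Assumption (not from the slides): every topic costs the same as the quoted one.
total_hours = hours_per_topic * TOPICS

print(f"{minutes_per_topic} minutes (~{hours_per_topic:.2f} h) per topic")
print(f"~{total_hours:.0f} h for all {TOPICS} topics under the equal-cost assumption")
```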
  • Crowdsourcing
    ● A method of outsourcing work through an open call for contributions from members of a crowd, who are invited to carry out Human Intelligence Tasks (HITs) in exchange for micro-payments, social recognition, or entertainment value.
    ● It offers a solution to the scalability problem.
  • Problems with Crowdsourcing
    ● Suffers from poor output quality
      ○ Dishonest and careless worker behaviour
        ■ Workers motivated by financial gain may aim to complete as many HITs as possible within a given time.
      ○ Poor task design by the task requester
  • Solution:
    ● Include trap questions
    ● Include qualifying questions
    ● Use a gold standard data set against which agreement can be measured
    ● Timing controls
    ● Challenge-response tests (CAPTCHA)
    ● Build redundancy into the task design
    ● Model annotator quality
  • Objective
    ● Investigate the impact of aspects of Human Intelligence Task (HIT) design on the quality of relevance labels provided by the crowd.
      ○ The investigation focuses on three aspects:
        ■ Quality control elements
        ■ Document pooling and sampling for relevance judgements by the crowd
        ■ Document ordering within a HIT for presentation to the workers
  • Prove It!
    ● The task aims to investigate effective ways to retrieve relevant parts of books that can aid a user in confirming or refuting a given factual claim.
  • How??
    ● Participating systems are required to retrieve and submit a ranked list of book pages per topic that can confirm or refute the topic claim, or that contain information related to the topic.
    ● Task performed by assessors:
      ○ Assessors are free to choose the topic.
      ○ Assessors are free to choose books. The books are ordered by their rank.
      ○ Once inside a book, assessors are required to judge all listed pages.
      ○ Each page can take one of four values:
        ■ Confirms some aspects of the claim
        ■ Refutes some aspects of the claim
        ■ Contains information related to the claim
        ■ Irrelevant
  • Example:
    Claim: Imperialistic foreign policy led to World War 2
    First page: Confirms
    Second page: Contains information related to the claim
  • Approach
  • Experimental Data
    Gold standard:
    ● INEX 2010 Prove It topics
      ○ The authors use a set of 21 topics with an average of 169 judged pages per topic.
  • Experiment Design
    ● Pooling strategy
    ● Document ordering
    ● HIT design and quality control
    ● Experimental grid
    ● Measures
  • Pooling Strategy
    ● Top-n pool:
      ○ The top n pages of the official Prove It runs are selected using a round-robin strategy.
    ● Rank-boosted pool:
      ○ Pages from the Prove It runs are re-ranked based on their book's highest rank and popularity across all the Best Books runs and the Prove It runs.
    ● Answer-boosted pool:
      ○ Pages from the Prove It runs are re-ranked based on their content similarity to the topic.
    The authors select pages for each HIT by interleaving the three pooling strategies.
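The round-robin selection and the interleaving of pools described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the page identifiers, and the choice to skip duplicate pages are all assumptions, and the re-ranking steps behind the rank-boosted and answer-boosted pools are taken as given inputs.

```python
from itertools import zip_longest

_MISSING = object()  # padding marker for lists of unequal length


def round_robin(ranked_lists, limit=None):
    """Merge ranked lists by taking one item from each list in turn.

    Duplicates are skipped (an assumption); stops early once `limit`
    items have been collected.
    """
    seen, merged = set(), []
    for tier in zip_longest(*ranked_lists, fillvalue=_MISSING):
        for item in tier:
            if item is _MISSING or item in seen:
                continue
            seen.add(item)
            merged.append(item)
            if limit is not None and len(merged) == limit:
                return merged
    return merged


def top_n_pool(runs, n):
    """Hypothetical top-n pool: round-robin over the top n pages of each run."""
    return round_robin([run[:n] for run in runs])


# Toy runs with made-up page identifiers.
runs = [["p1", "p2", "p3"], ["p2", "p4", "p5"]]
top_n = top_n_pool(runs, n=2)

# The two boosted pools are assumed to be already re-ranked lists.
rank_boosted = ["p9", "p1"]
answer_boosted = ["p7", "p8"]

# Interleave the three pools to fill one HIT of five pages.
hit_pages = round_robin([top_n, rank_boosted, answer_boosted], limit=5)
```

Using the same round-robin helper for both steps keeps the sketch short; the slides do not say how the interleaving was actually carried out.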
  • Document Ordering
    ● Biased order
      ○ HITs are constructed by preserving the order of pages produced by a given pooling approach, i.e. by decreasing expected relevance.
    ● Random order
      ○ HITs are constructed by first inserting the known relevant pages at any position in the HITs, and then randomly distributing the remaining pages.
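The two ordering strategies might look like this in code. This is a sketch of one reading of the slide: `biased_order` keeps the pool order, and `random_order` shuffles the pages without a known label and drops the known relevant pages in at random positions. The function names and the seeded generator are illustrative assumptions.

```python
import random


def biased_order(pool):
    """Keep the pool's own order, i.e. decreasing expected relevance."""
    return list(pool)


def random_order(pool, known_relevant, rng):
    """Place known relevant pages at random positions among shuffled others."""
    relevant = [page for page in pool if page in known_relevant]
    hit = [page for page in pool if page not in known_relevant]
    rng.shuffle(hit)
    for page in relevant:
        # randrange(len + 1) allows insertion at the very end as well.
        hit.insert(rng.randrange(len(hit) + 1), page)
    return hit


pool = ["p1", "p2", "p3", "p4", "p5"]
rng = random.Random(0)  # fixed seed only to make the sketch reproducible
biased = biased_order(pool)
shuffled = random_order(pool, known_relevant={"p1"}, rng=rng)
```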
  • Example
    Claim: Imperialistic foreign policy led to World War 2
    Question: What is the relevance label of the document "Fall of Ottoman Empire"?
    Order 1:
      1. Causes of World War 2
      2. World War 2
      3. World War 1
      4. Fall of Ottoman Empire
      5. Indus Valley Civilization
    Order 2:
      1. Indus Valley Civilization
      2. Fall of Ottoman Empire
      3. Causes of World War 2
      4. World War 1
      5. World War 2
  • HIT Design and Quality Control
    ● The authors devised control mechanisms to verify worker engagement, in order to reduce careless behaviour, including the extreme case of dishonest workers.
    ● To check the effect of these control mechanisms, the authors devised two types of HITs:
      ○ Full Design (FullD)
      ○ Simple Design (SimpleD)
  • Full Design (FullD)
    ■ Warning: "At least 60% of the labels need to agree with expert provided labels in order to qualify for payments"
    ■ Trap question: "I did not pay attention"
    ■ Flow control to reduce the effect of random clicking: each question depends on the answer given to the previous question.
    ■ CAPTCHA: to detect human input in the online form
    ■ Participation is restricted to workers who have completed over 100 HITs at a 95+% approval rate.
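The two numeric rules of the Full Design (the participation restriction and the 60% agreement warning) can be written as simple checks. This is a sketch under stated assumptions: the function names and the dictionary layout are hypothetical, and rejecting a worker who labelled no expert-judged pages is a choice made here, not something the slides specify.

```python
def is_eligible(completed_hits, approval_rate):
    """FullD restriction: over 100 completed HITs at a 95+% approval rate."""
    return completed_hits > 100 and approval_rate >= 0.95


def qualifies_for_payment(worker_labels, expert_labels, threshold=0.60):
    """Check that at least `threshold` of the labels agree with expert labels.

    Both arguments map page id -> label. Only pages labelled by both
    sides are compared.
    """
    shared = [page for page in worker_labels if page in expert_labels]
    if not shared:
        return False  # assumption: nothing to verify means no qualification
    agreeing = sum(worker_labels[p] == expert_labels[p] for p in shared)
    return agreeing / len(shared) >= threshold


expert = {"p1": "confirms", "p2": "irrelevant", "p3": "refutes",
          "p4": "related", "p5": "irrelevant"}
worker = {"p1": "confirms", "p2": "irrelevant", "p3": "refutes",
          "p4": "irrelevant", "p5": "confirms"}
paid = qualifies_for_payment(worker, expert)  # 3 of 5 labels agree
```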
  • Example
  • Simple Design (SimpleD)
    ● No restrictions on which workers can participate
    ● Includes only one trap question
    ● No qualifying test
    ● No warning
    ● No CAPTCHA
  • Experimental Grid
    ● FullD-bias
      ○ Full Design with biased ordering of pages
    ● FullD-rand
      ○ Full Design with random ordering
    ● SimpleD-bias
      ○ Simple Design with biased ordering of pages
    ● SimpleD-rand
      ○ Simple Design with random ordering
    The interleaved pooling strategy is common across the experiments.
  • Measures:
    To assess the quality of the crowdsourced labels (CS), the authors introduce two measures:
    ● Exact Agreement (EA): agreement on the exact degree of relevance, i.e. CS = GS (gold standard)
    ● Binary Agreement (BA): agreement on binary relevance, i.e. CS and GS both label the page irrelevant, or both give it one of the relevant labels (confirms, refutes, contains related information)
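The two measures can be computed directly from a pair of label dictionaries. The sketch below assumes labels are stored as short strings ("confirms", "refutes", "related", "irrelevant") keyed by page id; those names and the function names are illustrative, not taken from the presentation.

```python
IRRELEVANT = "irrelevant"  # hypothetical label name


def exact_agreement(cs, gs):
    """Fraction of shared pages where the crowd label equals the gold label."""
    shared = [page for page in cs if page in gs]
    if not shared:
        return 0.0
    return sum(cs[p] == gs[p] for p in shared) / len(shared)


def binary_agreement(cs, gs):
    """Fraction of shared pages where both labels fall on the same side
    of the relevant / irrelevant split."""
    shared = [page for page in cs if page in gs]
    if not shared:
        return 0.0
    same_side = sum((cs[p] != IRRELEVANT) == (gs[p] != IRRELEVANT) for p in shared)
    return same_side / len(shared)


gs = {"p1": "confirms", "p2": "related", "p3": "irrelevant", "p4": "refutes"}
cs = {"p1": "confirms", "p2": "confirms", "p3": "irrelevant", "p4": "irrelevant"}

ea = exact_agreement(cs, gs)   # p1 and p3 match exactly
ba = binary_agreement(cs, gs)  # p1, p2 and p3 agree on relevant vs. irrelevant
```

BA can never be lower than EA on the same data, since every exact match is also a binary match.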
  • Analysis and Discussion
  • Impact of Quality Controls
    ● FullD HITs yield considerably more labels per HIT per worker than SimpleD HITs.
    ● Labels collected from FullD HITs agree significantly more with the gold standard labels than those from SimpleD HITs.
    ● FullD HITs attract workers who achieve significantly higher agreement levels with the gold standard labels.
  • Impact of Ordering Strategies
    ● Comparing biased and random page order in both FullD and SimpleD shows that random page order produces higher accuracy.
  • Refining Relevance Labels
    ● Mean agreement per HIT (with 3 workers per HIT for FullD and 1 worker per HIT for SimpleD) is 62% EA and 69% BA for FullD, and 44% EA and 54% BA for SimpleD.
    ● After applying majority vote, FullD achieves 74% EA and 78% BA, while SimpleD achieves 61% EA and 68% BA.
    When the majority rule is applied, the accuracy of SimpleD labels improves substantially more than the accuracy of FullD labels.
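Majority voting over the labels collected for each page can be sketched as below. The slides do not say how ties were resolved; returning no consensus label for a tie is an assumption of this sketch, as are the function names and the data layout.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label, or None on a tie or empty input."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # assumption: ties yield no consensus label
    return counts[0][0]


def aggregate(votes_by_page):
    """Map page id -> consensus label, dropping pages without consensus."""
    consensus = {}
    for page, votes in votes_by_page.items():
        label = majority_label(votes)
        if label is not None:
            consensus[page] = label
    return consensus


votes_by_page = {
    "p1": ["confirms", "confirms", "related"],
    "p2": ["irrelevant", "related"],            # tie, no consensus
    "p3": ["irrelevant", "irrelevant", "irrelevant"],
}
consensus = aggregate(votes_by_page)
```

The consensus labels can then be scored against the gold standard with the same EA and BA measures used for individual workers.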
  • Removing workers with low label accuracy
    ● Filtering out workers with low-accuracy labels increases the GS agreement of the remaining labels.
    ● Agreement stays unchanged until the minimum accuracy required of workers reaches 40%.
    ● Substantially more workers are removed from SimpleD than from FullD.
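Filtering by worker accuracy amounts to scoring each worker against the gold standard and keeping only the labels of workers above a cut-off. A minimal sketch, assuming labels arrive as `(worker, page, label)` tuples and that workers with no gold-standard overlap are dropped because their accuracy cannot be estimated (both are assumptions of this example):

```python
from collections import defaultdict


def worker_accuracy(labels, gold):
    """Per-worker exact agreement with the gold standard."""
    hits, totals = defaultdict(int), defaultdict(int)
    for worker, page, label in labels:
        if page in gold:
            totals[worker] += 1
            hits[worker] += int(label == gold[page])
    return {worker: hits[worker] / totals[worker] for worker in totals}


def filter_labels(labels, gold, min_accuracy):
    """Keep labels from workers whose accuracy is at least `min_accuracy`."""
    accuracy = worker_accuracy(labels, gold)
    return [
        (worker, page, label)
        for worker, page, label in labels
        if accuracy.get(worker, -1.0) >= min_accuracy
    ]


gold = {"p1": "confirms", "p2": "irrelevant"}
labels = [
    ("w1", "p1", "confirms"), ("w1", "p2", "irrelevant"),
    ("w2", "p1", "irrelevant"), ("w2", "p2", "confirms"),
]
kept = filter_labels(labels, gold, min_accuracy=0.4)
```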
  • Impact of Pooling Strategies
    ● The table shows no substantial difference between label accuracy levels for the three pooling strategies.
    ● Answer-boosted pooling leads to the highest number of unique and relevant pages.
  • Other Factors Impacting Accuracy
    ● The total number of HITs completed by a worker provides no clue about the level of label accuracy.
    ● Average time spent on a HIT is only weakly correlated with accuracy.
    ● Correlation between EA and the number of labels produced by workers is strong (dishonest and careless workers tend to skip parts of HITs).
    ● Completion of the questionnaire flow (Flow) is highly correlated with EA accuracy.
  • Impact on System Rankings
    ● MAP and Bpref
      ○ These system rankings characterise the overall ranking, and their comparison provides insight into the impact of unjudged pages.
    ● P@10 and nDCG@10
      ○ These system rankings focus on search performance in the top 10 retrieved pages.
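For readers unfamiliar with the two top-10 metrics, here is a minimal sketch of P@k and nDCG@k. The log2 rank discount is one common nDCG formulation and the gain values in the example are hypothetical; the slides do not state which variant or gain mapping the authors used.

```python
import math


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k retrieved pages that are relevant."""
    return sum(page in relevant for page in ranked[:k]) / k


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked, gain_by_page, k=10):
    """DCG of the top k pages divided by the DCG of an ideal ordering."""
    actual = dcg([gain_by_page.get(page, 0) for page in ranked[:k]])
    ideal = dcg(sorted(gain_by_page.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Hypothetical graded gains: 2 for confirms/refutes, 1 for related pages.
gains = {"a": 1, "b": 2}
p_at_4 = precision_at_k(["a", "x", "b", "y"], relevant={"a", "b"}, k=4)
ndcg_ideal = ndcg_at_k(["b", "a"], gains, k=10)
ndcg_swapped = ndcg_at_k(["a", "b"], gains, k=10)
```

Pages without a judgement get a gain of 0 here, which is why unjudged pages can depress these scores.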
  • Quality Control
    System rank correlation between the different designs
    ● Agreement between the FullD ranking and the INEX ranking is high across all metrics.
    ● The SimpleD ranking correlates better with the INEX ranking than the FullD ranking does on MAP and Bpref.
    ● The P@10 and nDCG@10 metrics strongly differentiate the effect of the two HIT designs on system ranking.
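System rank correlation compares the order in which two sets of relevance judgements rank the same systems. The slides do not name the coefficient used; the sketch below assumes Kendall's tau (the tau-a variant, with tied pairs counting as neither concordant nor discordant), and the system names and scores are made up.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a over the systems scored in both dictionaries."""
    systems = sorted(set(scores_a) & set(scores_b))
    pairs = len(systems) * (len(systems) - 1) // 2
    if pairs == 0:
        return 0.0
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        direction = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    return (concordant - discordant) / pairs


# Hypothetical MAP scores for three systems under two sets of judgements.
inex = {"sysX": 0.30, "sysY": 0.20, "sysZ": 0.10}
crowd = {"sysX": 0.20, "sysY": 0.30, "sysZ": 0.10}

tau = kendall_tau(inex, crowd)  # one of the three pairs is swapped
```

A value of 1 means both judgement sets order the systems identically, and -1 means the order is fully reversed.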
  • Impact of Ordering Strategy
    Impact of biased and random page order on system rank correlation with the INEX ranking
    ● Random ordering of documents in the HITs yields a higher level of label accuracy than biased ordering.
  • Impact of Pooling Strategies
    Impact of pooling strategy on system rank correlations with the INEX ranking
    ● The rank-boosted pool leads to very high correlations with the INEX ranking based on MAP for both the FullD and SimpleD relevance judgements.
  • Evaluation of Prove It Systems
    ● The authors investigate the use of the crowdsourced relevance judgements to evaluate the Prove-It runs of the INEX 2010 Book Track.
    ● The focus is on the FullD HIT design.
    ● Systems are evaluated:
      ○ With relevance judgements from FullD HITs only
      ○ By merging FullD relevance judgements with gold standard relevance judgements
  • System rank correlations with ranking over official submissions (top) and extended set (bottom)
    ● FullD relevance judgements lead to slightly different system rankings from the INEX ranking since, by design, the crowdsourced document pool included pages outside the GS pool.
    ● The correlation between extended system rankings based on P@10 (0.17) and nDCG@10 (0.12) using FullD relevance judgements is low.
  • Conclusions
    ● FullD leads to significantly higher label quality.
    ● Random page ordering in HITs leads to significantly higher label accuracy.
    ● Consensus over multiple judgements leads to more reliable labels.
    ● Completion rate of the questionnaire flow and the fraction of obtained labels provide good indicators of label quality.
    ● P@10 and nDCG@10 metrics are more effective in evaluating the effectiveness of crowdsourcing through system rankings.
    ● Filtering out workers with low label accuracy reduces the pooling effect.
  • Amazon Mechanical Turk
  • HIT
  • Discussions: