Crowdsourcing for Book Search Evaluation: Impact of HIT Design on Comparative System Ranking
Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling
Presented by Kumar Ashish
Traditional Method
● Requires a trained group of experts to gather information
● Precise guidelines
● Problem of scalability
  ○ INEX book collection
    ■ 50,239 books
    ■ 83 Prove It topics
    ■ An assessor needs 33 days to judge a single topic when spending 95 minutes each day
Crowdsourcing
● A method of outsourcing work through an open call for contributions from members of the crowd, who are invited to carry out Human Intelligence Tasks (HITs) in exchange for micro-payments, social recognition, or entertainment value.
● It offers a solution to the scalability problem.
Problems with Crowdsourcing
● Suffers from poor output quality
  ○ Dishonest and careless worker behaviour
    ■ Workers motivated by financial gain may aim to complete as many HITs as possible within a given time.
  ○ Poor task design by the task requester
Solution:
● Include trap questions
● Include qualifying questions
● Use a gold standard data set against which agreement can be measured
● Timing controls
● Challenge-response tests (captcha)
● Build redundancy into the task design
● Model annotator quality
Objective
● Investigate the impact of aspects of Human Intelligence Task (HIT) design on the quality of relevance labels provided by the crowd.
  ○ The investigation focuses on three aspects:
    ■ Quality control elements
    ■ Document pooling and sampling for relevance judgements by the crowd
    ■ Document ordering within a HIT for presentation to the workers
Prove It!
● The task investigates effective ways to retrieve relevant parts of books that can help a user confirm or refute a given factual claim.
How?
● Participating systems are required to retrieve and submit a ranked list of book pages per topic that can confirm or refute the topic claim, or that contain information related to the topic.
● Task performed by assessors:
  ○ Assessors are free to choose the topic.
  ○ Assessors are free to choose books. The books are ordered by their rank.
  ○ Once inside a book, assessors are required to judge all listed pages.
  ○ Each page is assigned one of four labels:
    ■ Confirms some aspect of the claim
    ■ Refutes some aspect of the claim
    ■ Contains information related to the claim
    ■ Irrelevant
Example:
Claim: Imperialistic Foreign Policy led to World War 2
First page: Confirms
Second page: Contains information related to the claim
Experimental Data
Gold Standard:
● INEX 2010 Prove It topics
  ○ The authors use a set of 21 topics with an average of 169 judged pages per topic.
Experiment Design
● Pooling Strategy
● Document Ordering
● HIT Design and Quality Control
● Experimental Grid
● Measures
Pooling Strategy
● Top-n pool:
  ○ The top n pages of the official Prove It runs are selected using a round-robin strategy.
● Rank-boosted pool:
  ○ Pages from the Prove It runs are re-ranked based on each book's highest rank and popularity across all the Best Books runs and the Prove It runs.
● Answer-boosted pool:
  ○ Pages from the Prove It runs are re-ranked based on their content similarity to the topic.
The authors select pages for each HIT by interleaving the three pooling strategies.
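To make the round-robin and interleaving steps concrete, here is a minimal sketch. The function name `round_robin`, the page identifiers, and the example runs are hypothetical; the slides do not specify how duplicates are handled, so skipping pages that were already pooled is an assumption.

```python
from itertools import zip_longest

def round_robin(ranked_lists, limit):
    """Collect pages rank by rank across the lists until `limit` pages are pooled.

    Assumption: a page already in the pool is skipped rather than added twice.
    """
    pool, seen = [], set()
    for tier in zip_longest(*ranked_lists):  # tier = the pages at one rank position
        for page in tier:
            if page is None or page in seen:  # None pads lists that ran out
                continue
            pool.append(page)
            seen.add(page)
            if len(pool) == limit:
                return pool
    return pool

# Hypothetical Prove It runs (ranked page ids per system).
runs = [["p1", "p2", "p3"], ["p2", "p4"], ["p5", "p1", "p6"]]
top_n_pool = round_robin(runs, limit=4)

# Hypothetical re-ranked pools; the same routine can interleave the three pools.
rank_boosted_pool = ["p4", "p1", "p6"]
answer_boosted_pool = ["p6", "p3", "p2"]
hit_pages = round_robin([top_n_pool, rank_boosted_pool, answer_boosted_pool], limit=5)
```

Reusing one routine for both steps is a convenience of this sketch, not something the slides state about the original setup.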
Document Ordering
● Biased order
  ○ HITs are constructed by preserving the order of pages produced by a given pooling approach, i.e. by decreasing expected relevance.
● Random order
  ○ HITs are constructed by first inserting the known relevant pages at random positions in the HITs, and then randomly distributing the remaining pages.
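The two ordering strategies can be sketched as follows. This is an illustration under assumptions: the function names, the fixed seed, and the choice of which page counts as "known relevant" are hypothetical and not taken from the study.

```python
import random

def biased_order(pool):
    """Keep the pool's own order, i.e. decreasing expected relevance."""
    return list(pool)

def random_order(pool, known_relevant, rng):
    """Place known relevant pages at random positions, then fill the
    remaining slots with the other pages in shuffled order."""
    relevant = [p for p in pool if p in known_relevant]
    others = [p for p in pool if p not in known_relevant]
    rng.shuffle(others)
    slots = [None] * len(pool)
    positions = rng.sample(range(len(pool)), len(relevant))
    for page, pos in zip(relevant, positions):
        slots[pos] = page
    fill = iter(others)
    return [p if p is not None else next(fill) for p in slots]

pool = ["Causes of World War 2", "World War 2", "World War 1",
        "Fall of Ottoman Empire", "Indus Valley Civilization"]
known_relevant = {"Causes of World War 2"}  # hypothetical gold-standard page
rng = random.Random(0)                      # seeded only for reproducibility

biased = biased_order(pool)
shuffled = random_order(pool, known_relevant, rng)
```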
Example
Claim: Imperialistic Foreign Policy led to World War 2
Question: What is the relevance label of the document "Fall of Ottoman Empire"?
Order 1:
1. Causes of World War 2
2. World War 2
3. World War 1
4. Fall of Ottoman Empire
5. Indus Valley Civilization
Order 2:
1. Indus Valley Civilization
2. Fall of Ottoman Empire
3. Causes of World War 2
4. World War 1
5. World War 2
HIT Design and Quality Control
● The authors devised control mechanisms that verify worker engagement in order to reduce careless behaviour, including the extreme case of dishonest workers.
● To measure the effect of these control mechanisms, the authors devised two types of HITs:
  ○ Full Design (FullD)
  ○ Simple Design (SimpleD)
Full Design (FullD)
■ Warning: "At least 60% of the labels need to agree with expert provided labels in order to qualify for payments"
■ Trap question: "I did not pay attention"
■ Flow control: to reduce the effect of random clicking, the next question depends on the answer given to the previous question.
■ Captcha: to detect human input in the online form
■ Participation is restricted to workers who have completed over 100 HITs at a 95+% approval rate.
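A rough sketch of how the FullD checks could be applied to a submitted HIT is given below. The slides list the controls but not how they were enforced, so the function names, the data layout, and the decision to reject a HIT with no expert-judged pages are all assumptions made for illustration.

```python
def is_eligible(completed_hits, approval_rate):
    """Worker restriction from the slide: over 100 HITs at a 95+% approval rate."""
    return completed_hits > 100 and approval_rate >= 0.95

def passes_quality_control(worker_labels, expert_labels, trap_ticked,
                           min_agreement=0.6):
    """Accept a HIT if the trap question was not ticked and at least
    `min_agreement` of the labels on expert-judged pages match."""
    if trap_ticked:  # worker selected "I did not pay attention"
        return False
    judged = [page for page in worker_labels if page in expert_labels]
    if not judged:   # assumption: nothing to verify means no acceptance
        return False
    matches = sum(worker_labels[p] == expert_labels[p] for p in judged)
    return matches / len(judged) >= min_agreement

# Hypothetical labels for five pages.
expert = {"p1": "confirms", "p2": "irrelevant", "p3": "related",
          "p4": "refutes", "p5": "irrelevant"}
careful = {"p1": "confirms", "p2": "irrelevant", "p3": "related",
           "p4": "refutes", "p5": "related"}        # 4 of 5 match
careless = {"p1": "irrelevant", "p2": "irrelevant", "p3": "confirms",
            "p4": "irrelevant", "p5": "irrelevant"}  # 2 of 5 match
```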
Simple Design (SimpleD)
● No restrictions on which workers can participate
● Includes only one trap question
● No qualifying test
● No warning
● No captcha
Experimental Grid
● FullD-bias
  ○ Full Design with biased ordering of pages
● FullD-rand
  ○ Full Design with random ordering of pages
● SimpleD-bias
  ○ Simple Design with biased ordering of pages
● SimpleD-rand
  ○ Simple Design with random ordering of pages
The interleaved pooling strategy is common across the experiments.
Measures:
To assess the quality of the crowdsourced labels (CS), the authors introduce two measures:
● Exact Agreement (EA): agreement on the exact degree of relevance, i.e. CS = GS (Gold Standard).
● Binary Agreement (BA): agreement on whether the page is relevant to the topic of the claim. CS and GS agree if both are Irrelevant, or if both are one of the relevant labels (Confirms, Refutes, Contains related information).
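The two measures can be written down in a few lines. This is a minimal sketch: the label strings, the dictionary layout (page id to label), and the restriction to pages present in both label sets are assumptions of the example.

```python
RELEVANT = {"confirms", "refutes", "related"}  # the three non-irrelevant labels

def exact_agreement(cs, gs):
    """Fraction of shared pages where the crowd label equals the gold label."""
    shared = [page for page in cs if page in gs]
    if not shared:
        raise ValueError("no pages judged by both the crowd and the gold standard")
    return sum(cs[p] == gs[p] for p in shared) / len(shared)

def binary_agreement(cs, gs):
    """Fraction of shared pages where both labels fall on the same side of
    the relevant / irrelevant split."""
    shared = [page for page in cs if page in gs]
    if not shared:
        raise ValueError("no pages judged by both the crowd and the gold standard")
    return sum((cs[p] in RELEVANT) == (gs[p] in RELEVANT) for p in shared) / len(shared)

# Hypothetical labels for four pages.
gs = {"p1": "confirms", "p2": "related", "p3": "irrelevant", "p4": "refutes"}
cs = {"p1": "confirms", "p2": "confirms", "p3": "irrelevant", "p4": "irrelevant"}

ea = exact_agreement(cs, gs)   # p1 and p3 match exactly
ba = binary_agreement(cs, gs)  # p2 also agrees once labels are binarised
```

BA can never be lower than EA on the same pages, since every exact match is also a binary match.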
Impact of Quality Controls
● FullD HITs yield considerably more labels per HIT per worker than SimpleD HITs.
● Labels collected from FullD HITs agree significantly more with the Gold Standard labels than those from SimpleD HITs.
● FullD HITs attract workers who achieve significantly higher agreement with the Gold Standard labels.
Impact of Ordering Strategies
● Comparing biased and random page order within both FullD and SimpleD shows that random page order produces higher label accuracy.
Refining Relevance Labels
● Mean agreement per HIT (with 3 workers per HIT for FullD and 1 worker per HIT for SimpleD) is 62% EA and 69% BA for FullD, and 44% EA and 54% BA for SimpleD.
● After applying majority vote, FullD achieves 74% EA and 78% BA, while SimpleD achieves 61% EA and 68% BA.
When the majority rule is applied, the accuracy of the SimpleD labels improves substantially more than the accuracy of the FullD labels.
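Majority voting over the labels a page received can be sketched as below. The slides do not say how ties were resolved, so returning `None` for a tie is an assumption of this example, as are the function name and label strings.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label, or None when the top count is tied
    (tie handling is an assumption, not taken from the study)."""
    counts = Counter(labels)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else None

# Hypothetical labels from three workers for two pages.
consensus = majority_vote(["confirms", "confirms", "related"])
no_consensus = majority_vote(["confirms", "related", "irrelevant"])
```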
Removing workers with low label accuracy
● Filtering out workers with low-accuracy labels increases the GS agreement of the remaining labels.
● Agreement stays unchanged until the minimum required worker accuracy reaches 40%.
● Substantially more workers are removed from SimpleD than from FullD.
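The filtering step can be illustrated with a short sketch. The tuple layout `(worker, page, label)`, the function name, and the rule that workers with no gold-judged pages are dropped are assumptions; accuracy here means exact agreement with the gold standard.

```python
def filter_workers(labels, gold, min_accuracy):
    """Keep only labels from workers whose accuracy on gold-judged pages
    is at least `min_accuracy`.

    labels: list of (worker, page, label) tuples.
    Assumption: workers with no gold-judged pages cannot be verified and are dropped.
    """
    hits, totals = {}, {}
    for worker, page, label in labels:
        if page in gold:
            totals[worker] = totals.get(worker, 0) + 1
            hits[worker] = hits.get(worker, 0) + (label == gold[page])
    kept = {w for w in totals if hits[w] / totals[w] >= min_accuracy}
    return [row for row in labels if row[0] in kept]

# Hypothetical data: w1 matches the gold standard on both pages, w2 on none.
gold = {"p1": "confirms", "p2": "irrelevant"}
labels = [("w1", "p1", "confirms"), ("w1", "p2", "irrelevant"),
          ("w2", "p1", "irrelevant"), ("w2", "p2", "related")]

remaining = filter_workers(labels, gold, min_accuracy=0.4)
```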
Impact of Pooling Strategies
● The table on this slide shows no substantial difference in label accuracy between the three pooling strategies.
● Answer-boosted pooling leads to the highest number of unique and relevant pages.
Other Factors Impacting Accuracy
● The total number of HITs completed by a worker provides no clue about the level of label accuracy.
● Average time spent on a HIT is only weakly correlated with accuracy.
● The correlation between EA and the number of labels produced by a worker is strong (dishonest and careless workers tend to skip parts of HITs).
● Completion of the flow-controlled questionnaire (Flow) is highly correlated with EA.
Impact on System Rankings
● MAP and Bpref
  ○ System rankings based on these metrics characterise the overall ranking, and comparing them provides insight into the impact of unjudged pages.
● P@10 and nDCG@10
  ○ System rankings based on these metrics focus on search performance in the top 10 retrieved pages.
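For readers less familiar with these metrics, the sketch below gives standard textbook definitions of P@k, average precision (MAP is its mean over topics), and nDCG@k. It uses binary relevance for simplicity, which is an assumption of the example; the study's exact gain values are not given on the slides, and Bpref is omitted. The ranking and relevant set are hypothetical.

```python
import math

def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top k retrieved pages that are relevant."""
    return sum(page in relevant for page in ranking[:k]) / k

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant page appears."""
    hits, total = 0, 0.0
    for rank, page in enumerate(ranking, start=1):
        if page in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranking, relevant, k=10):
    """DCG over the top k with binary gains, normalised by the ideal DCG."""
    dcg = sum(1 / math.log2(rank + 1)
              for rank, page in enumerate(ranking[:k], start=1)
              if page in relevant)
    ideal = sum(1 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical run for one topic: relevant pages at ranks 1 and 3.
ranking = ["a", "b", "c", "d"]
relevant = {"a", "c"}

p_at_2 = precision_at_k(ranking, relevant, k=2)
ap = average_precision(ranking, relevant)
ndcg_at_4 = ndcg_at_k(ranking, relevant, k=4)
```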
Quality Control
System rank correlation between different designs
● Agreement between the FullD ranking and the INEX ranking is high across all metrics.
● The SimpleD ranking correlates better with the INEX ranking than the FullD ranking does on MAP and Bpref.
● The P@10 and nDCG@10 metrics strongly differentiate the effect of the two HIT designs on system ranking.
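System rank correlation compares how two sets of relevance judgements order the same systems. The slides do not name the correlation coefficient, so the sketch below assumes Kendall's tau (the tau-a variant, which does not correct for ties); the system names and scores are hypothetical.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dictionaries over the same systems:
    (concordant pairs - discordant pairs) / total pairs."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical MAP scores for three systems under two sets of judgements.
inex_scores = {"A": 0.30, "B": 0.25, "C": 0.10}
crowd_scores = {"A": 0.28, "B": 0.12, "C": 0.20}

tau = kendall_tau(inex_scores, crowd_scores)  # B and C swap places
```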
Impact of Ordering Strategy
Impact of biased and random page order on system rank correlation with the INEX ranking
● Random ordering of documents in the HITs yields a higher level of label accuracy than biased ordering.
Impact of Pooling Strategies
Impact of pooling strategy on system rank correlations with the INEX ranking
● The rank-boosted pool leads to very high correlations with the INEX ranking based on MAP for both the FullD and SimpleD relevance judgements.
Evaluation of Prove It Systems
● The authors investigate the use of the crowdsourced relevance judgements to evaluate the Prove It runs of the INEX 2010 Book Track.
● The focus is on the FullD HIT design.
● Systems are evaluated in two ways:
  ○ With relevance judgements from FullD HITs only
  ○ By merging FullD relevance judgements with the gold standard relevance judgements
System rank correlations with the ranking over official submissions (top) and the extended set (bottom)
● FullD relevance judgements lead to slightly different system rankings from the INEX ranking since, by design, the crowdsourced document pool included pages outside the GS pool.
● The correlation between extended system rankings based on P@10 (0.17) and nDCG@10 (0.12) using FullD relevance judgements is low.
Conclusions
● FullD leads to significantly higher label quality.
● Random page ordering in HITs leads to significantly higher label accuracy.
● Consensus over multiple judgements leads to more reliable labels.
● Completion rate of the questionnaire flow and the fraction of obtained labels provide good indicators of label quality.
● P@10 and nDCG@10 are more effective for evaluating the effectiveness of crowdsourcing through system rankings.
● Filtering out workers with low label accuracy reduces the pooling effect.