- 1. Amazon Mechanical Turk Requester Meetup (Panos Ipeirotis – New York University) © 2009 Amazon.com, Inc. or its Affiliates.
- 2. Panos Ipeirotis - Introduction
  - New York University, Stern School of Business
  - Blog: “A Computer Scientist in a Business School”, http://behind-the-enemy-lines.blogspot.com/
  - Email: panos@nyu.edu
- 3. Example: Build an Adult Web Site Classifier
  - Need a large number of hand-labeled sites
  - Get people to look at sites and classify them as: G (general), PG (parental guidance), R (restricted), X (porn)
  - Cost/speed statistics:
    - Undergrad intern: 200 websites/hr, cost: $15/hr
    - MTurk: 2500 websites/hr, cost: $12/hr
- 4. Bad news: Spammers!
  - Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience)
- 5. Improve Data Quality through Repeated Labeling
  - Get multiple, redundant labels using multiple workers
  - Pick the correct label based on majority vote
  - 1 worker: 70% correct
  - 11 workers: 93% correct
  - Probability of correctness increases with number of workers
  - Probability of correctness increases with quality of workers
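The effect of redundancy on majority voting can be sketched with a binomial model. This is an illustration only: it assumes binary labels, workers who err independently, and the same accuracy `p` for every worker, none of which the slide states. The function name `majority_correct_prob` is made up for this sketch. Under these assumptions, 11 workers at 70% accuracy reach roughly 92%, in the same range as the 93% quoted on the slide.

```python
from math import comb

def majority_correct_prob(p, n):
    """Probability that a strict majority of n independent workers is correct,
    when each worker is correct with probability p (binary labels, odd n)."""
    need = n // 2 + 1  # smallest number of correct votes that wins the majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

# Accuracy of the majority vote as the number of workers grows (p = 0.7).
for n in (1, 3, 5, 11):
    print(n, round(majority_correct_prob(0.7, n), 3))
```

Raising either `n` or `p` raises the result, which matches the last two bullets of the slide.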
- 6. But Majority Voting is Expensive
  - Single-vote statistics:
    - MTurk: 2500 websites/hr, cost: $12/hr
    - Undergrad: 200 websites/hr, cost: $15/hr
  - 11-vote statistics:
    - MTurk: 227 websites/hr, cost: $12/hr
    - Undergrad: 200 websites/hr, cost: $15/hr
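The 11-vote throughput follows directly from dividing the single-vote rate by the number of votes per site. A short sketch of that arithmetic, using only the figures from the slide (the helper names are made up for illustration):

```python
def sites_per_hour(labels_per_hour, votes_per_site):
    """Distinct sites labeled per hour when every site needs several votes."""
    return labels_per_hour / votes_per_site

def cost_per_site(cost_per_hour, sites_hr):
    """Dollars spent per fully labeled site."""
    return cost_per_hour / sites_hr

mturk_single = sites_per_hour(2500, 1)   # one vote per site
mturk_11 = sites_per_hour(2500, 11)      # eleven votes per site
undergrad = sites_per_hour(200, 1)       # a single intern label per site

print(int(mturk_11))                               # whole sites per hour with 11 votes
print(round(cost_per_site(12, mturk_single), 4))   # MTurk, 1 vote
print(round(cost_per_site(12, mturk_11), 4))       # MTurk, 11 votes
print(round(cost_per_site(15, undergrad), 4))      # undergrad intern
```

With 11 votes the per-site cost on MTurk rises elevenfold and the speed advantage over the intern nearly disappears, which is the point of the slide.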
- 7. Using redundant votes, we can infer worker quality
  - Look at our spammer friend ATAMRO447HWJQ together with 9 other workers
  - We can compute error rates for each worker
  - Error rates for ATAMRO447HWJQ:
    - P[X → X]=9.847%, P[X → G]=90.153%
    - P[G → X]=0.053%, P[G → G]=99.947%
  - Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…
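The slides do not say how the per-worker error rates are estimated. As a simple stand-in, the sketch below treats the majority vote on each site as the reference label and counts how often each worker deviates from it. The vote data, worker names (`w1`, `w2`, `spam`), and function names are all hypothetical.

```python
from collections import Counter, defaultdict

def majority_labels(votes):
    """votes: iterable of (worker, item, label). Returns {item: majority label}."""
    per_item = defaultdict(Counter)
    for _worker, item, label in votes:
        per_item[item][label] += 1
    return {item: c.most_common(1)[0][0] for item, c in per_item.items()}

def worker_error_rates(votes, classes):
    """Estimate P[reference -> assigned] per worker, using the majority vote
    as the reference label (an assumption, not the method from the slides)."""
    reference = majority_labels(votes)
    counts = defaultdict(lambda: defaultdict(Counter))
    for worker, item, label in votes:
        counts[worker][reference[item]][label] += 1
    rates = {}
    for worker, by_ref in counts.items():
        rates[worker] = {}
        for ref in classes:
            total = sum(by_ref[ref].values())
            if total:  # skip classes this worker never saw
                rates[worker][ref] = {c: by_ref[ref][c] / total for c in classes}
    return rates

# Hypothetical votes: two careful workers and one who answers G for everything.
votes = [
    ("w1", "s1", "X"), ("w1", "s2", "X"), ("w1", "s3", "G"), ("w1", "s4", "G"),
    ("w2", "s1", "X"), ("w2", "s2", "X"), ("w2", "s3", "G"), ("w2", "s4", "G"),
    ("spam", "s1", "G"), ("spam", "s2", "G"), ("spam", "s3", "G"), ("spam", "s4", "G"),
]
rates = worker_error_rates(votes, ["G", "X"])
print(rates["spam"]["X"])  # how the always-G worker labels sites whose reference is X
```

Majority vote is only a rough reference: when many workers are spammers it can itself be wrong, so the estimates should be read as approximate.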
- 8. Rejecting spammers and Benefits
  - Random answers error rate = 50%
  - Average error rate for ATAMRO447HWJQ: 45.2%
    - P[X → X]=9.847%, P[X → G]=90.153%
    - P[G → X]=0.053%, P[G → G]=99.947%
  - Action: REJECT and BLOCK
  - Results:
    - Over time you block all spammers
    - Spammers learn to avoid your HITs
    - You can decrease redundancy, as quality of workers is higher
- 9. After rejecting spammers, quality goes up
  - Spam keeps quality down
  - Without spam, workers are of higher quality
  - Need less redundancy for same quality
  - Same quality of results for lower cost
  - With spam: 1 worker 70% correct, 11 workers 93% correct
  - Without spam: 1 worker 80% correct, 5 workers 94% correct
- 10. Correcting biases
  - Classifying sites as G, PG, R, X
  - Sometimes workers are careful but biased
  - Error rates for worker ATLJIK76YH1TF:
    - P[G → G]=20.0%, P[G → P]=80.0%, P[G → R]=0.0%, P[G → X]=0.0%
    - P[P → G]=0.0%, P[P → P]=0.0%, P[P → R]=100.0%, P[P → X]=0.0%
    - P[R → G]=0.0%, P[R → P]=0.0%, P[R → R]=100.0%, P[R → X]=0.0%
    - P[X → G]=0.0%, P[X → P]=0.0%, P[X → R]=0.0%, P[X → X]=100.0%
  - Classifies G → P and P → R
  - Average error rate for ATLJIK76YH1TF: 45.0%
  - Is ATLJIK76YH1TF a spammer?
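The 45.0% figure is consistent with taking the unweighted mean of the off-diagonal mass in each row of the error-rate matrix. The sketch below checks that reading. Treating every class as equally weighted is an assumption, and the names `CONFUSION` and `average_error_rate` are made up for illustration.

```python
# Error rates for worker ATLJIK76YH1TF, copied from the slide.
# CONFUSION[true][assigned] = P[true -> assigned]
CONFUSION = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}

def average_error_rate(confusion):
    """Mean over true classes of the probability of assigning a wrong label,
    with every class weighted equally (an assumption)."""
    per_class = [1.0 - confusion[c][c] for c in confusion]
    return sum(per_class) / len(per_class)

print(round(average_error_rate(CONFUSION), 3))
```

By this measure the worker looks nearly as bad as the spammer from the earlier slides, even though the mistakes are systematic rather than random.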
- 11. Correcting biases
  - Error rates for worker ATLJIK76YH1TF:
    - P[G → G]=20.0%, P[G → P]=80.0%, P[G → R]=0.0%, P[G → X]=0.0%
    - P[P → G]=0.0%, P[P → P]=0.0%, P[P → R]=100.0%, P[P → X]=0.0%
    - P[R → G]=0.0%, P[R → P]=0.0%, P[R → R]=100.0%, P[R → X]=0.0%
    - P[X → G]=0.0%, P[X → P]=0.0%, P[X → R]=0.0%, P[X → X]=100.0%
  - For ATLJIK76YH1TF, we simply need to compute the “non-recoverable” error rate (technical details omitted)
  - Non-recoverable error rate for ATLJIK76YH1TF: 9%
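Since the slide omits the technical details, the following is only one possible way to make “recoverable” concrete: invert the worker's error rates with Bayes' rule to get the likely true class for each assigned label, then measure how often that best guess is still wrong. It assumes uniform class priors and equal cost for every mistake, neither of which comes from the slides, so its output is not expected to match the 9% quoted above. All function names are hypothetical.

```python
CLASSES = ["G", "P", "R", "X"]
# CONFUSION[true][assigned] = P[true -> assigned], copied from the slide.
CONFUSION = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}
UNIFORM_PRIORS = {c: 1.0 / len(CLASSES) for c in CLASSES}  # assumption

def posterior(confusion, priors, assigned):
    """Bayes' rule: P[true class | worker assigned this label]."""
    joint = {t: priors[t] * confusion[t][assigned] for t in confusion}
    total = sum(joint.values())
    return {t: (joint[t] / total if total else 0.0) for t in joint}

def residual_error(confusion, priors):
    """Probability of still being wrong after replacing each assigned label
    with the most likely true class given that label."""
    recovered = 0.0
    for assigned in confusion:
        joint = [priors[t] * confusion[t][assigned] for t in confusion]
        recovered += max(joint)  # mass that the best guess gets right
    return 1.0 - recovered

print(posterior(CONFUSION, UNIFORM_PRIORS, "P"))  # a "P" from this worker points to G
print(posterior(CONFUSION, UNIFORM_PRIORS, "R"))  # an "R" is ambiguous between P and R
print(round(residual_error(CONFUSION, UNIFORM_PRIORS), 3))
```

The G → P bias is fully reversible because only G sites receive a P label from this worker. The remaining error comes from P and R sites both being labeled R, which no correction can separate.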
- 12. Too much theory?
  - Open source implementation available at: http://code.google.com/p/get-another-label/
  - Input:
    - Labels from Mechanical Turk
    - Cost of incorrect labelings (e.g., X → G costlier than G → X)
  - Output:
    - Corrected labels
    - Worker error rates
    - Ranking of workers according to their quality
  - Alpha version, more improvements to come!
  - Suggestions and collaborations welcomed!
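To illustrate why a cost of incorrect labelings is a useful input, the sketch below picks the label with the lowest expected cost given a probability estimate of the true class. It is not the interface of the tool linked above. The cost values, the example probabilities, and the function name `min_cost_label` are hypothetical.

```python
def min_cost_label(posterior, cost):
    """Return the label with the lowest expected misclassification cost.

    posterior[t]   : estimated probability that the true class is t
    cost[t][label] : cost of publishing `label` when the truth is t
    """
    expected = {
        label: sum(posterior[t] * cost[t][label] for t in posterior)
        for label in posterior
    }
    best = min(expected, key=expected.get)
    return best, expected

# Hypothetical costs: calling a porn site "G" is ten times worse than the reverse.
COST = {
    "G": {"G": 0.0, "X": 1.0},
    "X": {"G": 10.0, "X": 0.0},
}

# Hypothetical estimate for one site: probably G, but X is not ruled out.
label, expected = min_cost_label({"G": 0.7, "X": 0.3}, COST)
print(label, expected)
```

With symmetric costs the most probable class (G) would win. The asymmetric cost flips the decision to X, because a missed X site is far more expensive than a false alarm.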
- 13. Thank you! Questions?
  - “A Computer Scientist in a Business School”, http://behind-the-enemy-lines.blogspot.com/
  - Email: panos@nyu.edu

