
Did you mean crowdsourcing for recommender systems?


Keynote talk at CrowdRec (ACM RecSys).


  1. Did you mean crowdsourcing for recommender systems? OMAR ALONSO 6-OCT-2014 CROWDREC 2014
  2. Disclaimer The views and opinions expressed in this talk are mine and do not necessarily reflect the official policy or position of Microsoft.
  3. Outline A bit on human computation Crowdsourcing in information retrieval Opportunities for recommender systems
  4. Human Computation
  5. Human Computation You are a computer
  6. Human-based computation Use humans as processors in a distributed system Address problems that computers aren’t good at Games with a purpose Examples ◦ESP game ◦Captcha ◦reCAPTCHA
  7. Some definitions Human computation is a computation that is performed by a human A human computation system is a system that organizes human efforts to carry out computation Crowdsourcing is a tool that a human computation system can use to distribute tasks. Edith Law and Luis von Ahn. Human Computation. Morgan & Claypool Publishers, 2011.
  8. HC at the core of RecSys “In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients” – Resnick and Varian (CACM 1997) S. Perugini, M. Gonçalves, E. Fox: Recommender Systems Research: A Connection-Centric Survey. J. Intell. Inf. Syst. 23(2): 107-143 (2004)
  9. {where to go on vacation} MTurk: 50 answers, $1.80 Quora: 2 answers Y! Answers: 2 answers FB: 1 answer Tons of results Read title + snippet + URL Explore a few pages in detail
  10. {where to go on vacation} Countries Cities
  11. Information Retrieval and Crowdsourcing
  12. The rise of crowdsourcing in IR Crowdsourcing is hot Lots of interest in the research community ◦Articles showing good results ◦Journal special issues (IR, IEEE Internet Computing, etc.) ◦Workshops and tutorials (SIGIR, NAACL, WSDM, WWW, VLDB, RecSys, CHI, etc.) ◦HCOMP ◦CrowdConf Large companies using crowdsourcing Big data Start-ups Venture capital investment
  13. Why is this interesting? Easy to prototype and test new experiments Cheap and fast No need to set up infrastructure Introduce experimentation early in the cycle In the context of IR, implement and experiment as you go For new ideas, this is very helpful
  14. Caveats and clarifications Trust and reliability Wisdom of the crowd, revisited Adjust expectations Crowdsourcing is another data point for your analysis Complementary to other experiments
  15. Why now? The Web Use humans as processors in a distributed system Address problems that computers aren’t good at Scale Reach
  16. Motivating example: relevance judging Relevance of search results is difficult to judge ◦Highly subjective ◦Expensive to measure Professional editors commonly used Potential benefits of crowdsourcing ◦Scalability (time and cost) ◦Diversity of judgments Matt Lease and Omar Alonso. “Crowdsourcing for search evaluation and social-algorithmic search”, ACM SIGIR 2012 Tutorial.
  17. Crowdsourcing and relevance evaluation For relevance, it combines two main approaches ◦Explicit judgments ◦Automated metrics Other features ◦Large scale ◦Inexpensive ◦Diversity
  18. Development framework Incremental approach Measure, evaluate, and adjust as you go Suitable for repeatable tasks O. Alonso. “Implementing crowdsourcing-based relevance experimentation: an industrial perspective”. Information Retrieval, 16(2), 2013
  19. Asking questions Ask the right questions Part art, part science Instructions are key Workers may not be IR experts, so don’t assume the same understanding of terminology Show examples Hire a technical writer ◦Engineer writes the specification ◦Writer communicates N. Bradburn, S. Sudman, and B. Wansink. Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass, 2004.
  20. UX design Time to apply all those usability concepts The experiment should be self-contained Keep it short and simple; brief and concise Be very clear about the relevance task Engage with the worker; avoid boring stuff Document presentation & design Need to grab attention Always ask for feedback (open-ended question) in an input box Localization
  21. Other design principles Text alignment Legibility Reading level: complexity of words and sentences Attractiveness (worker’s attention & enjoyment) Multi-cultural / multi-lingual Who is the audience (e.g. target worker community) Special-needs communities (e.g. simple color blindness) Cognitive load: mental rigor needed to perform the task Exposure effect
  22. When to assess work quality? Beforehand (prior to main task activity) ◦How: “qualification tests” or similar mechanism ◦Purpose: screening, selection, recruiting, training During ◦How: assess labels as the worker produces them ◦Like random checks on a manufacturing line ◦Purpose: calibrate, reward/penalize, weight After ◦How: compute accuracy metrics post hoc ◦Purpose: filter, calibrate, weight, retain (HR)
  23. How do we measure work quality? Compare the worker’s label vs. ◦Known (correct, trusted) labels ◦Other workers’ labels ◦Model predictions of workers and labels Verify the worker’s label ◦Yourself ◦Tiered approach (e.g. Find-Fix-Verify)
  24. Comparing to known answers AKA: gold, honey pot, verifiable answer, trap Assumes you have known answers Cost vs. benefit ◦Producing known answers (experts?) ◦% of work spent re-producing them Finer points ◦What if workers recognize the honey pots?
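The gold-answer check on this slide can be sketched in a few lines of Python: score each worker only on the hidden trap items and keep workers above an acceptance threshold. The data layout, labels, and 0.7 cutoff below are illustrative assumptions, not from the talk.

```python
# Sketch: score each worker against embedded gold (honey pot) items.

def worker_accuracy(labels, gold):
    """labels: {worker_id: {item_id: label}}; gold: {item_id: correct label}."""
    scores = {}
    for worker, answers in labels.items():
        graded = [(item, lab) for item, lab in answers.items() if item in gold]
        if not graded:
            continue  # worker never saw a trap item; cannot be scored this way
        correct = sum(1 for item, lab in graded if lab == gold[item])
        scores[worker] = correct / len(graded)
    return scores

labels = {
    "w1": {"t1": "rel", "t2": "rel", "t3": "nonrel"},
    "w2": {"t1": "nonrel", "t2": "rel"},
}
gold = {"t1": "rel", "t2": "rel"}  # the hidden honey pots
scores = worker_accuracy(labels, gold)
trusted = {w for w, acc in scores.items() if acc >= 0.7}  # illustrative threshold
```

Note the cost trade-off from the slide shows up directly here: every gold item a worker answers is paid work spent re-producing a label you already have.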
  25. Comparing to other workers AKA: consensus, plurality, redundant labeling Well-known metrics for measuring agreement Cost vs. benefit: % of work that is redundant Finer points ◦Is consensus “truth” or the systematic bias of the group? ◦What if no one really knows what they’re doing? ◦Low agreement across workers indicates the problem is with the task (or a specific example), not the workers
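A minimal sketch of the redundant-labeling idea: aggregate each item by plurality vote and flag items whose agreement is too low to trust (per the slide, a signal about the task or the example, not the workers). The 0.6 threshold is an illustrative assumption.

```python
from collections import Counter

# Sketch: plurality aggregation over redundant labels for one item.
def aggregate(votes, min_agreement=0.6):
    """votes: list of labels for one item from different workers."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]   # plurality winner and its count
    agreement = top / len(votes)            # fraction of workers who agree
    return label, agreement, agreement >= min_agreement

label, agreement, ok = aggregate(["rel", "rel", "nonrel", "rel", "rel"])
# 4 of 5 workers agree, so the item is kept; low-agreement items would be
# routed back for redesign or for more judgments instead.
```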
  26. Methods for measuring agreement What to look for ◦Agreement, reliability, validity Inter-agreement level ◦Agreement between judges ◦Agreement between judges and the gold set Some statistics ◦Percentage agreement ◦Cohen’s kappa (2 raters) ◦Fleiss’ kappa (any number of raters) ◦Krippendorff’s alpha With majority vote, what if 2 say relevant and 3 say not? ◦Use an expert to break ties ◦Collect more judgments as needed to reduce uncertainty
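Of the statistics listed, Fleiss’ kappa is the one that handles any number of raters; a compact implementation from first principles is below. The input layout (a counts matrix with one row per item) is the standard formulation; the example data is made up.

```python
# Sketch: Fleiss' kappa from a ratings matrix M, where M[i][j] is the
# number of raters who put item i into category j (each row sums to the
# same per-item rater count n).

def fleiss_kappa(M):
    N = len(M)          # number of items
    n = sum(M[0])       # raters per item
    k = len(M[0])       # number of categories
    # Mean observed per-item agreement P_bar
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in M) / N
    # Chance agreement P_e from the marginal category proportions
    p = [sum(row[j] for row in M) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

# 3 tweets, 3 raters each, two categories (relevant / not relevant):
perfect = [[3, 0], [0, 3], [3, 0]]   # all raters agree on every item
split   = [[2, 1], [1, 2]]           # raters disagree on both items
```

`fleiss_kappa(perfect)` is 1.0 (complete agreement), while `split` comes out negative (worse than chance), the kind of low-agreement signal the production examples on the next slides report for Q3.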
  27. Pause Crowdsourcing works ◦Fast turnaround, easy to experiment, few dollars to test ◦But: you have to design experiments carefully; quality; platform limitations Crowdsourcing in production ◦Large-scale data sets (millions of labels) ◦Continuous execution ◦Difficult to debug Multiple contingent factors How do you know the experiment is working? Goal: a framework for ensuring reliability in crowdsourcing tasks O. Alonso, C. Marshall and M. Najork. “Crowdsourcing a subjective labeling task: A human centered framework to ensure reliable results”
  28. Labeling tweets – an example of a task Is this tweet interesting? Subjective activity Not focused on specific events Findings ◦Difficult problem, low inter-rater agreement (Fleiss’ kappa, Krippendorff’s alpha) ◦Tested many designs, numbers of workers, platforms (MTurk and others) Multiple contingent factors ◦Worker performance ◦Work ◦Task design O. Alonso, C. Marshall & M. Najork. “Are some tweets more interesting than others? #hardquestion”. HCIR 2013.
  29. Designs that include in-task CAPTCHA Borrowed idea from reCAPTCHA -> use of a control term Adapt your labeling task 2 more questions as controls ◦1 algorithmic ◦1 semantic
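The in-task control idea can be sketched as a filter over responses: each task carries the two extra control questions (one algorithmic, one semantic) alongside the real question, and responses that miss a control are discarded. Field names and answers here are illustrative assumptions, not the talk’s actual task design.

```python
# Sketch: drop responses that fail the embedded control questions.

def passes_controls(response, controls):
    """controls: {question_id: expected_answer} for the control items."""
    return all(response.get(q) == a for q, a in controls.items())

controls = {"ctrl_algorithmic": "B", "ctrl_semantic": "cat"}  # hypothetical
responses = [
    {"worker": "w1", "ctrl_algorithmic": "B", "ctrl_semantic": "cat", "q_main": "interesting"},
    {"worker": "w2", "ctrl_algorithmic": "A", "ctrl_semantic": "cat", "q_main": "interesting"},
]
kept = [r for r in responses if passes_controls(r, controls)]
# Only w1's response survives; w2 missed the algorithmic control.
```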
  30. Production example #1 (screenshot annotations: in-task captcha, tweet de-branded, the main question) Q1 (k = 0.91, alpha = 0.91) Q2 (k = 0.771, alpha = 0.771) Q3 (k = 0.033, alpha = 0.035)
  31. Production example #2 (screenshot annotations: in-task captcha, tweet de-branded; breakdown by categories to get a better signal) Q1 (k = 0.907, alpha = 0.907) Q2 (k = 0.728, alpha = 0.728) Q3 by category: Worthless (alpha = 0.033), Trivial (alpha = 0.043), Funny (alpha = -0.016), Makes me curious (alpha = 0.026), Contains useful info (alpha = 0.048), Important news (alpha = 0.207)
  32. Findings from designs No quality control issues Eliminating workers who did a poor job on question #1 didn’t affect inter-rater agreement for questions #2 and #3 Interestingness is a fully subjective notion We can still build a classifier that identifies tweets that are interesting to a majority of users
  33. Careful with That Axe/Data, Eugene In the area of big data and machine learning: ◦labels -> features -> predictive model -> optimization Labeling/experimentation is perceived as boring Don’t rush labeling ◦Human and machine Label quality is very important ◦Don’t outsource it ◦Own it end to end ◦Large scale
  34. More on label quality Data gathering is not a free lunch You can’t outsource label acquisition and quality Labels for the machine != labels for humans Emphasis on algorithms, models/optimizations, and mining from labels Not so much on algorithms for ensuring high-quality labels Training sets
  35. People are more than HPUs Why is Facebook popular? People are social. Information needs are contextually grounded in our social experiences and social networks Our social networks also embody additional knowledge about us, our needs, and the world We relate to recommendations The social dimension complements computation
  36. Opportunities in RecSys
  37. Humans in the loop Computation loops that mix humans and machines A kind of active learning Double goal: ◦Human checking on the machine ◦Machine checking on humans Example: classifiers for social data
  38. Collaborative Filtering v2 Collaboration with recipients Interactive Learning new data
  39. What’s in a label? Clicks, reviews, ratings, etc. Better or novel systems if we focus more on label quality? New ways of collecting data Training sets Evaluation & measurement
  40. Routing Expertise detection and routing Social load balancing When to switch between machines and humans
  41. Conclusions Crowdsourcing at scale works but requires a solid framework Three aspects need attention: workers, work, and task design Labeling social data is hard Traditional IR approaches don’t seem to work for Twitter data Label quality Outlined areas where RecSys can benefit from crowdsourcing
  42. Thank you @elunca