Comparison GWAP Mechanical Turk


An experiment to replicate GWAP-centered human computation using microtasks

Published in: Education

  1. Comparing human computation services. Elena Simperl (University of Southampton)
  2. Human computation
     • Outsourcing to humans the tasks that machines find difficult to solve (accuracy, efficiency, costs)
  3. Dimensions of human computation (see also [Quinn & Bederson, 2012])
     • What is outsourced
       – Tasks that require human skills that cannot be easily replicated by machines (visual recognition, language understanding, knowledge acquisition, basic human communication, etc.)
       – Sometimes only certain steps of a task are outsourced to humans; the rest is executed automatically
     • How the task is outsourced
       – Tasks are broken down into smaller units undertaken in parallel by different people
       – Coordination is required to handle cases with more complex workflows
       – Partial or independent answers are consolidated and aggregated into a complete solution
  4. Dimensions of human computation (2) (see also [Quinn & Bederson, 2012])
     • How the results are validated
       – Solution space closed (choice of correct answer) vs open (collection of potential solutions)
       – Performance measured objectively or through ratings/votes
       – Statistical techniques employed to predict accurate solutions
         • May take into account confidence values of algorithmically generated solutions
     • How the overall process can be optimized
       – Incentives and motivators (altruism, entertainment, intellectual challenge, social status, competition, financial compensation)
       – Assigning tasks to people based on their skills and performance (as opposed to random assignment)
       – Symbiotic combinations of human- and machine-driven computation, including combinations of different forms of crowdsourcing
  5. Games with a purpose (GWAP) (see also [von Ahn & Dabbish, 2008])
     • Human computation disguised as casual games
     • Tasks are divided into parallelizable atomic units (challenges) solved (consensually) by players
     • Game models
       – Single vs multi-player
       – Selection agreement vs input agreement vs inversion-problem games
  6. Dimensions of GWAP design
     • What tasks are amenable to ‘GWAP-ification’
       – Work is decomposable into simpler (nested) tasks
       – Performance is measurable according to an obvious rewarding scheme
       – Skills can be arranged in a smooth learning curve
       – Player retention vs repetitive tasks
     • Note: not all domains are equally appealing
       – Application domain needs to attract a large user base
       – Knowledge corpus has to be large enough to avoid repetitions
       – Quality of automatically computed input may hamper game experience
     • Attracting and retaining players
       – You need a critical mass of players to validate the results
       – Advertisement, building upon an existing user base
       – Continuous development
  7. Microtask crowdsourcing
     • Similar types of tasks, but a different incentive model (monetary reward)
     • Successfully applied to transcription, classification, content generation, data collection, image tagging, website feedback, usability tests…
  8. Our experiment
     • Goals
       – Compare the two approaches for a given task (ontology engineering)
       – More general: a description framework to compare different human computation models and use them in combination
     • Set-up
       – Re-build OntoPronto within Amazon’s Mechanical Turk, based on existing OntoPronto data
  9. OntoPronto
     • Goal: extend the Proton upper-level ontology
     • Multi-player (single player using pre-recorded rounds)
       – Step 1: topic of a Wikipedia article is classified as class or instance
       – Step 2: browsing the Proton hierarchy from the root to identify the most specific class that matches the topic of the article
     • Consensual answers, additional points for more specific classes
  10. Validation of players’ inputs
     • A topic is played at least six times
     • The number of consensual answers to each question is at least four
     • The number of consensual answers modulo reliability is more than half of the total number of answers received
       – Reliability measures the relation between consensual and correct answers given by a player
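
The three validation conditions on slide 10 can be sketched as a single check. This is a minimal illustration, not the OntoPronto implementation: the `Answer` record, the function name, and the reading of "modulo reliability" as "sum of the reliabilities of the players who gave the consensual answer" are all assumptions made for this example, as is treating the number of answers received as the number of times the topic was played.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Answer:
    """One player's answer to a question (hypothetical record type)."""
    player: str
    value: str
    reliability: float  # assumed to lie in [0, 1]


def validated_answer(answers: List[Answer],
                     min_plays: int = 6,
                     min_consensus: int = 4) -> Optional[str]:
    """Return the consensual answer if all three slide conditions hold, else None."""
    # Condition 1: the topic has been played at least six times.
    if len(answers) < min_plays:
        return None

    # Condition 2: at least four players gave the same (consensual) answer.
    top_value, top_count = Counter(a.value for a in answers).most_common(1)[0]
    if top_count < min_consensus:
        return None

    # Condition 3 (assumed interpretation): the reliability-weighted number of
    # consensual answers exceeds half of all answers received.
    weighted = sum(a.reliability for a in answers if a.value == top_value)
    if weighted <= len(answers) / 2:
        return None

    return top_value
```

Under this reading, five reliable players agreeing out of six plays validates the answer, whereas four agreeing players with low reliability do not.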
  11. Evaluation and collected data
     • 270 distinct players, 365 Wikipedia articles, 2905 game rounds
     • Approach is effective
       – 77% of challenges solved consensually
       – If there is agreement, most answers are correct (97%)
     • …and efficient
       – 122 classes and entities extending Proton (after validation)
  12. Implementation through MTurk
     • Server-side component
       – Generates new HITs
       – Evaluates assignments of existing HITs
     • Two types of HITs
       – Class or instance (1 cent)
       – Proton class (5 cents)
     • HITs generated using title, first paragraph and first image (if available)
     • Qualification test with five questions; turkers with at least 90% accepted tasks
  13. Implementation through MTurk (2)
     • Multiple assignments per HIT, four consensual answers needed
       – Assignments per HIT: (number of answers needed for consensus - 1) × (number of available answer options) + 1
     • HITs with (four) consensual answers are considered completed
     • Assignments matching the consensus are accepted
     • A HIT costs at most (number of answers needed for consensus) × (reward per correct assignment)
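
The two formulas on slide 13 can be worked through with the numbers given on the slides. The sketch below is illustrative only: the function names are made up for this example, and the assumption that a class-or-instance HIT offers exactly two answer options is inferred from its name, not stated in the deck. The first formula is a pigeonhole bound: with one answer fewer than the consensus threshold on every option there is still no consensus, and one more assignment must push some option over the threshold.

```python
def max_assignments(consensus_needed: int, answer_options: int) -> int:
    """Assignments after which some option must have reached consensus.

    Worst case: every option has (consensus_needed - 1) answers, so one
    additional answer necessarily completes a consensus.
    """
    return (consensus_needed - 1) * answer_options + 1


def max_hit_cost_cents(consensus_needed: int, reward_cents: int) -> int:
    """Upper bound on the cost of one HIT.

    Only assignments matching the consensus are accepted, and a HIT is
    complete once the consensus threshold is reached.
    """
    return consensus_needed * reward_cents


# Class-or-instance HIT: four consensual answers, two options (assumed), 1 cent.
print(max_assignments(4, 2))      # 7
print(max_hit_cost_cents(4, 1))   # 4

# Proton class HIT: four consensual answers, 5 cents per accepted assignment.
print(max_hit_cost_cents(4, 5))   # 20
```

The assignment bound for a Proton class HIT depends on how many classes are offered as options, which the slides do not state.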
  14. Evaluation and collected data
  15. Development time and costs per contribution
     • OntoPronto: five months of development
     • MTurk: one month
       – Additional effort required because of the setting of the experiment
       – Less effort, as HIT design and validation mechanisms were adopted from OntoPronto
     • Average cost for a correct answer on MTurk: $0.74
  16. Quality of contributions
     • Both approaches resulted in high-quality data
     • Diversity and biases: MTurk less diverse (270 players vs 16 turkers)
       – Additional functionality of MTurk
     • Game-based approach is economical in the long run if a player retention strategy is available
     • Microtask-based approach uses a ‘predictable’ motivation framework
  17. Challenges and open questions
     • Synchronous vs asynchronous modes of interaction
       – Consensual answers, ratings by other turkers?
     • Executing inter-dependent tasks in MTurk
       – Mapping game steps into HITs
       – Grouping HITs
     • Using game-like interfaces within microtask crowdsourcing platforms
       – Impact on incentives and turkers’ behavior?
     • Using MTurk to test GWAP design decisions
  18. Challenges and open questions (2)
     • Descriptive framework for classification of human computation systems
       – Types of tasks and their mode of execution
       – Participants and their roles
       – Interaction with the system and among participants
       – Validation of results
       – Consolidation and aggregation of inputs into a complete solution
     • Reusable collection of algorithms for quality assurance, task assignment, workflow management, results consolidation, etc.
     • Schemas recording provenance of crowdsourced data
  19. S. Thaler, E. Simperl, S. Wölger. An experiment in comparing human computation techniques. IEEE Internet Computing, 16(5): 52-58, 2012. For more information email: twitter: @esimperl
  20. Theory and practice of social machines. Deadline: 25.02.2013