Crowdsourcing Research Opportunities: Lessons from Natural Language Processing

How is crowdsourcing used in science? How did it impact the field of NLP?

A presentation of the key points described in:
Marta Sabou, Kalina Bontcheva, Arno Scharl (2012) Crowdsourcing Research Opportunities: Lessons from Natural Language Processing. In 12th International Conference on Knowledge Management and Knowledge Technologies (i-KNOW), Special Track on Research 2.0.


  1. Crowdsourcing Research Opportunities: Lessons from Natural Language Processing. Marta Sabou, Kalina Bontcheva, Arno Scharl
  2. Crowdsourcing
  3. Crowdsourcing: Undefined and generally large group
  4. Crowdsourcing in Science; Crowdsourcing for NLP; Challenges
  5. Crowdsourcing in science is not new: Sir Francis Galton, “Vox Populi”; citizen science, from early 19th century, 60,000 – 80,000 yearly volunteers
  6. Genre 1: Mechanised Labour. Participants (workers) paid a small amount of money to complete easy tasks (HIT = Human Intelligence Task)
  7. Genre 2: Games with a purpose. From 2008; 240k players
  8. Crowdsourcing via Facebook
  9. Genre 3: Altruistic Crowdsourcing. >250K players; >670K players
  10. Crowdsourcing in Science - Typical Use. Diagram stages: Input, Algorithm, Output, Process/Evaluation. • Harness human intuition to prune solution space • Form-based data collection • Labeling, classification • Surveys
  11. Crowdsourcing in Science; Crowdsourcing for NLP; Challenges
  12. Crowdsourcing in NLP: Papers relying on crowdsourcing in major NLP venues
  13. Crowdsourcing Genres in NLP
  14. Benefit 1: Affordable, Large-Scale Resources
      • A variety of small- to medium-sized resources can be obtained with as little as $100 using AMT
      • Crowdsourcing is also cost-effective for large resources (Poesio, 2012):

        Method                       $/label    1M labels ($)
        Traditional, high quality    1.00       1,000,000
        Mechanical Turk              0.38       380,000 (<40%)
        Game                         0.19       217,000 (20%)
  15. Benefit 2: Diversification of research
  16. Challenge 1: Contributor Selection and Training. From: prior to resource creation. To: during the resource creation
  17. Challenge 2: Aggregation and Quality Control
      From: a few experts' annotations
      To: multiple, noisy annotations from non-experts
      • Approach 1: Statistical techniques
          • Simplest (and most popular): majority voting
          • More complex: machine learning model trained on various features
      • Approach 2: Crowdsourcing the QC process itself
          • HIT1 (Create): Translate the following sentence:
          • HIT2 (Verify): Which of these 5 sentences is the best translation?
  18. Conclusions (What have we learned from NLP?)
      • Crowdsourcing is revolutionising NLP research
          • Cheaper resource acquisition
          • Diversification of research agenda
      • But it requires more complex methodologies
          • For contributor management
          • For quality control and data aggregation
      • Other findings: most popular
          • Genre: mechanised labour
          • Task: acquiring input data
          • Problem: solving subjective tasks
  19. Crowdsourcing in Science; Crowdsourcing for NLP; Challenges
  20. User Motivation
      • Motivating users
          • Motivations for scientific projects might differ
          • Task-granularity might impact motivation
      • Promoting learning and science
          • Advertise STEM research to young people
          • Support learning and self-improvement through participation in crowdsourcing
  21. Legal and Ethical Issues
      • Acknowledging the crowd's contribution
          • S. Cooper, [other authors], and Foldit players: Predicting protein structures with a multiplayer online game. Nature, 466(7307):756-760, 2010.
      • Ensuring privacy and wellbeing
          • Mechanised labour criticised for low wages (<$2/hour) and lack of worker rights
          • Prevent addiction, prolonged use & user exploitation
      • Licensing and consent
          • Some clearly state the use of Creative Commons licenses
          • General failure to provide informed consent information
  22. Technical Issues
      • Scaling up to large resources
      • Preventing bias
      • Increasing repeatability
          • Through reuse of crowdsourcing elements (e.g., HIT templates)
      • uComp - Embedded Human Computation for Knowledge Extraction and Evaluation
          • 3-year project, starting November 2012
          • Develops a scalable and generic HC framework for knowledge creation
          • Provides reusable HC elements
  23. Thank you!
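
Slide 17 names majority voting as the simplest and most popular way to aggregate multiple noisy annotations from non-experts. The slides give no implementation, so the following is only a minimal sketch under stated assumptions: the function name `majority_vote`, the `(item_id, worker_id, label)` tuple layout, and the choice to return `None` on a tie are all illustrative and not taken from the presentation.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Aggregate crowd labels per item by simple majority.

    `annotations` is an iterable of (item_id, worker_id, label) tuples
    (a hypothetical layout chosen for this sketch).
    Returns {item_id: (winning_label_or_None, agreement_ratio)}.
    """
    # Collect every label submitted for each item.
    votes = defaultdict(list)
    for item_id, _worker_id, label in annotations:
        votes[item_id].append(label)

    result = {}
    for item_id, labels in votes.items():
        ranked = Counter(labels).most_common(2)
        top_label, top_count = ranked[0]
        # Assumption: a tie between the two most frequent labels yields
        # no winner (None) rather than an arbitrary pick.
        if len(ranked) > 1 and ranked[1][1] == top_count:
            top_label = None
        result[item_id] = (top_label, top_count / len(labels))
    return result


if __name__ == "__main__":
    sample = [
        ("s1", "w1", "pos"), ("s1", "w2", "pos"), ("s1", "w3", "neg"),
        ("s2", "w1", "neg"), ("s2", "w2", "pos"),
    ]
    for item, (label, agreement) in majority_vote(sample).items():
        print(item, label, round(agreement, 2))
```

The agreement ratio is returned alongside the label so that low-agreement items can be routed to further checking, for example a verification HIT of the kind described under Approach 2.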