Crowdsourcing Research Opportunities:
Lessons from Natural Language Processing
  Marta Sabou, Kalina Bontcheva, Arno Scharl
Crowdsourcing
Undefined and generally large group
Crowdsourcing in Science
Crowdsourcing for NLP
Challenges
Crowdsourcing in science is not new
Sir Francis Galton, “Vox Populi” (Nature, 1907)
Citizen science, from the early 20th century: 60,000–80,000 volunteers yearly
Genre 1: Mechanised Labour
• Participants (workers) are paid a small amount of money to complete easy tasks (HIT = Human Intelligence Task)
Genre 2: Games with a purpose
From 2008
240k players
Crowdsourcing via Facebook
Genre 3: Altruistic Crowdsourcing

>250K players
>670K players
Crowdsourcing in Science - Typical Use
[Diagram of a research pipeline with the stages: Input, Process/Algorithm, Output, Evaluation]
• Harness human intuition to prune the solution space
• Form-based data collection
• Labeling, classification
• Surveys
Crowdsourcing in Science
Crowdsourcing for NLP
Challenges
Crowdsourcing in NLP
Papers relying on crowdsourcing in major NLP venues
Crowdsourcing Genres in NLP
Benefit 1: Affordable, Large-Scale Resources
• A variety of small to medium-sized resources can be obtained for as little as $100 using AMT
• Crowdsourcing is also cost-effective for large resources (Poesio, 2012)

Method                       $/label   Cost of 1M labels ($)
Traditional, high quality    1.00      1,000,000
Mechanical Turk              0.38      380,000 (<40%)
Game                         0.19      217,000 (20%)
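
As a small worked check of the percentages in the table, the sketch below expresses each method's total as a share of the traditional total. It uses the totals exactly as given on the slide rather than recomputing them from the $/label column, because the game total (217,000) is not simply 0.19 × 1,000,000 and the slide does not say why. The dictionary keys and the function name are illustrative, not taken from the source.

```python
# Totals for 1M labels in US dollars, copied from the table on the slide.
TOTAL_COST_1M = {
    "traditional": 1_000_000,
    "mechanical_turk": 380_000,
    "game": 217_000,
}


def share_of_traditional(method: str) -> float:
    """Return a method's total cost as a fraction of the traditional total."""
    return TOTAL_COST_1M[method] / TOTAL_COST_1M["traditional"]


for method in TOTAL_COST_1M:
    # Prints each method with its share of the traditional cost as a percentage.
    print(f"{method}: {share_of_traditional(method):.1%}")
```

The Mechanical Turk share comes out just under the slide's "<40%" and the game share close to its rounded "20%".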
Benefit 2: Diversification of research
Challenge 1: Contributor Selection and Training
• From: prior to resource creation
• To: during resource creation
Challenge 2: Aggregation and Quality Control
• From: a few experts' annotations
• To: multiple, noisy annotations from non-experts
• Approach 1: Statistical techniques
   - Simplest (and most popular): majority voting
   - More complex: machine learning model trained on various features
• Approach 2: Crowdsourcing the QC process itself
   - HIT1 (Create): Translate the following sentence:
   - HIT2 (Verify): Which of these 5 sentences is the best translation?
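
The two approaches on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any particular paper: the item IDs, labels, candidate strings, and the choice to return `None` on a tie are all hypothetical. The same `majority_vote` helper serves both for aggregating noisy labels (Approach 1) and for picking the winner of a verify HIT (Approach 2).

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(labels: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the most frequent label, or None for empty input or a tie."""
    if not labels:
        return None
    counts = Counter(labels).most_common()
    # A tie between the two most frequent labels is left unresolved here;
    # a real pipeline might collect more annotations instead (assumption).
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Approach 1: aggregate several non-expert labels per item (hypothetical data).
annotations = {
    "sentence-1": ["positive", "positive", "negative"],
    "sentence-2": ["negative", "positive"],
}
aggregated = {item: majority_vote(votes) for item, votes in annotations.items()}

# Approach 2: HIT1 produced five candidate translations (placeholders);
# in HIT2 each worker votes for the index of the candidate they find best.
candidates = ["translation A", "translation B", "translation C",
              "translation D", "translation E"]
verify_votes = [2, 0, 2, 1, 2]
best_index = majority_vote(verify_votes)
best_translation = candidates[best_index] if best_index is not None else None
```

The machine-learning variant mentioned on the slide would replace `majority_vote` with a trained model; it is not sketched here because the slide does not specify its features.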
Conclusions (What have we learned from NLP?)

• Crowdsourcing is revolutionising NLP research
   - Cheaper resource acquisition
   - Diversification of the research agenda
• But it requires more complex methodologies
   - For contributor management
   - For quality control and data aggregation
• Other findings: most popular
   - Genre: mechanised labour
   - Task: acquiring input data
   - Problem: solving subjective tasks
Crowdsourcing in Science
Crowdsourcing for NLP
Challenges
User Motivation

• Motivating users
   - Motivations for scientific projects might differ
   - Task granularity might impact motivation
• Promoting learning and science
   - Advertise STEM research to young people
   - Support learning and self-improvement through participation in crowdsourcing
Legal and Ethical Issues
• Acknowledging the crowd's contribution
   - S. Cooper, [other authors], and Foldit players: Predicting protein structures with a multiplayer online game. Nature, 466(7307):756-760, 2010.
• Ensuring privacy and wellbeing
   - Mechanised labour criticised for low wages (<$2/hour) and lack of worker rights
   - Prevent addiction, prolonged use and user exploitation
• Licensing and consent
   - Some projects clearly state the use of Creative Commons licenses
   - General failure to provide informed consent information
Technical Issues
• Scaling up to large resources
• Preventing bias
• Increasing repeatability
   - Through reuse of crowdsourcing elements (e.g., HIT templates)
• uComp - Embedded Human Computation for Knowledge Extraction and Evaluation
   - 3-year project, starting November 2012
   - Develops a scalable and generic HC framework for knowledge creation
   - Provides reusable HC elements
Thank you!

Editor's Notes

  1. How does crowdsourcing relate to Research 2.0? My talk will illustrate how certain web technologies can reduce the gap between scientists on the one hand and ordinary citizens on the other, thus enabling a certain form of Research 2.0. If Web 2.0 is often associated with “user-generated content”, Research 2.0, at least the kind enabled by crowdsourcing, is “user-generated/supported science”. Taking the field of NLP as an example, I will discuss how crowdsourcing is changing research practices and its effect on this scientific discipline. Research 2.0 deals with the involvement of the web in science. It spans from the use of Web 2.0 tools and technologies in research to a more open and sharing approach to science. Some definitions of Research 2.0 even include notions of a methodological change due to the abundance of data and the nature of the socio-technical systems on the web. Of interest here is the change in scientific practices due to the involvement of Research 2.0 tools and technologies in the research process, and the effect this has on science itself.
  2. But not projects that: do not have the creation of scientific data as their main goal (e.g., Wikipedia); use crowds to support auxiliary scientific processes (e.g., Mendeley); recruit online but experiment in the lab; recruit processing power and NOT human effort (SETI@home); have scientific staff alone as contributors (e.g., collaboratories).
  3. In fact, as early as 1907, Sir Francis Galton (Darwin's cousin and a brilliant Victorian scientist) published a Nature article entitled “Vox Populi” (the voice of the people, the voice of the crowd), in which he describes his experiment at a livestock fair: 787 people were asked to estimate the weight of an ox and, while none came close to the real value, the mean of the guesses was almost spot-on. Meanwhile, some societies were using the crowd differently, namely to support them in gathering scientific data. Since the early 20th century, the Audubon Society has relied on volunteers to count species of local birds. Their campaigns continue to this day, and in 2012 volunteers submitted over 100,000 checklists, leading to observations of 623 species and over 17 million individual birds. These activities are often termed citizen science. This is not a novel phenomenon: citizen science projects have been around since the beginning of the last century (at least). There is a vast landscape and variety of citizen science projects in which scientists call on the public for help; some examples come from Lora's paper (her talk might mention some as well). IT enables virtual citizen science projects, and this upsurge is a direct consequence of new and improved ways to involve the public in scientific processes.
  4. Participants contribute while having fun. Rebecca Hersher, 13 Apr 2012: “Two years ago, FoldIt made headlines, lots of them, when players of the online protein-folding video game took three weeks to solve the three dimensional structure of a simian retroviral protein that is used in animal models of HIV, but whose structure had eluded biochemists for more than a decade.” (http://blogs.nature.com/spoonful/2012/04/foldit-games-next-play-crowdsourcing-better-drug-design.html) Phylo is an experimental video game about multiple sequence alignment optimisation. “Since the launch in November 2010, we received more than 350,000 solutions submitted from more than 12,000 registered users. Our results show that solutions submitted contributed to improving the accuracy of up to 70% of the alignment blocks considered.” It is about showing that humans can aid algorithms rather than comparing human and machine performance.
  5. In 2008, the group built a Facebook game that required players to rate the sentiment associated with a sentence on a 5-value scale, then used the ratings as a training corpus for the sentiment detection module. Over 800 players played the game. In 2009 the game was released in a slightly different form, with the aim of gathering sentiment lexicons, i.e., associations between words and their sentiment polarity (ratings from as many as 12 players were averaged to get the final value). The game ran in 7 different languages and attracted over 4000 players. Let this be an introductory example of a crowdsourcing project; crowdsourcing, however, is not a new phenomenon.
  6. The volunteer contributes because he or she is interested in a domain or supports a cause.
  7. More languages, e.g., Urdu, Arabic, Haitian Creole; Irvine and Klementiev create lexicons between English and 37 low-resourced languages. Diverse types of text (besides newswire): emails, Twitter feeds, augmentative and alternative communication texts. Speech: transcription, accent rating, assessment of dialog systems. Subjective tasks: sentiment detection, translation, word sense disambiguation, anaphora resolution, question answering, textual entailment, text summarization, ... Niche language phenomena. Lab experiments reproduced at a fraction of their cost, e.g., contextual predictability (Cloze task), corpus trends.
  8. Completely new with respect to traditional approaches. Uses “create-verify” workflows. A widespread technique for translation tasks, less so for labeling.
  9. STEM (Science, Technology, Engineering, Mathematics): harness the increased visibility and ease of engagement in social networks to make STEM research more attractive and understandable => more young people studying STEM.
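
The notes mention two numeric aggregations: the mean of the crowd's guesses in Galton's experiment, and sentiment-lexicon values obtained by averaging ratings from as many as 12 players. A minimal sketch of both follows. All numbers are made up for illustration, the -2..2 rating scale is an assumption (the notes do not give the scale used for the lexicon), and `build_lexicon` is a hypothetical helper name.

```python
from statistics import mean
from typing import Dict, List

# Galton-style aggregation: one numeric guess per person (hypothetical values).
guesses = [1150, 1220, 1190, 1300, 1100]
crowd_estimate = mean(guesses)


def build_lexicon(ratings: Dict[str, List[int]],
                  max_raters: int = 12) -> Dict[str, float]:
    """Average up to max_raters ratings per word; skip words with no ratings."""
    return {
        word: mean(scores[:max_raters])
        for word, scores in ratings.items()
        if scores
    }


# Hypothetical player ratings on an assumed -2 (negative) .. 2 (positive) scale.
player_ratings = {
    "great": [2, 1, 2],
    "awful": [-2, -1],
    "unrated": [],
}
lexicon = build_lexicon(player_ratings)
```

Averaging suits numeric judgments like these; for categorical labels, a vote count is the usual counterpart.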