Crowdsourcing Research Opportunities: Lessons from Natural Language Processing

How is crowdsourcing used in science? How did it impact the field of NLP?

A presentation of the key points described in:
Marta Sabou, Kalina Bontcheva, Arno Scharl (2012) Crowdsourcing Research Opportunities: Lessons from Natural Language Processing. In 12th International Conference on Knowledge Management and Knowledge Technologies (i-KNOW), Special Track on Research 2.0.

  • How does crowdsourcing relate to Research 2.0? My talk will illustrate how certain web technologies can reduce the gap between scientists on the one hand and ordinary citizens on the other, thus enabling a certain form of Research 2.0. If Web 2.0 is often associated with “user generated content”, Research 2.0, at least the kind enabled by crowdsourcing, is “user generated/supported science”. Taking the field of NLP as an example, I will discuss how crowdsourcing is changing research practices and its effect on this scientific discipline. Research 2.0 deals with the involvement of the web in science. It spans from the utilization of Web 2.0 tools and technologies in research to a more open and sharing approach to science. Some definitions of Research 2.0 even include notions of a methodological change due to the abundance of data and the nature of the socio-technical systems on the web. A further theme is the change in scientific practices due to the involvement of Research 2.0 tools and technologies in the research process, and the effects this has on science itself.
  • But not projects that: do not have the creation of scientific data as their main goal (e.g., Wikipedia); use crowds to support auxiliary scientific processes (e.g., Mendeley); recruit online but experiment in the lab; recruit processing power and NOT human effort (e.g., SETI@home); have scientific staff alone as contributors (e.g., collaboratories).
  • In fact, as early as 1907, Sir Francis Galton (Darwin's cousin and a brilliant Victorian scientist) published a Nature article entitled “Vox Populi” (the voice of the people, the voice of the crowd), in which he describes his experiment at a livestock fair: 787 persons were asked to estimate the weight of an ox and, while none came close to the real value, the mean of the guesses was almost spot-on. Meanwhile, other societies were using the crowd differently, namely to support them in gathering scientific data. Since the early 20th century, the Audubon Society has been relying on volunteers to count species of local birds. Their campaigns continue to this date, and in 2012 volunteers submitted over 100,000 checklists, leading to observations of 623 species and over 17 million individual birds. These activities are often termed citizen science. This is not a novel phenomenon: citizen science projects have been around since the beginning of the last century (at least). There is a vast landscape and variety of citizen science projects where scientists call on the public for help (some examples, including from Lora's paper; her talk might have some mentions as well). IT enables virtual citizen science projects, and this upsurge is a direct consequence of new and improved ways to involve the public in scientific processes.
  • Participants contribute while having fun. From a Nature blog post (13 Apr 2012, 16:35 EDT, posted by Rebecca Hersher): “Two years ago, FoldIt made headlines, lots of them, when players of the online protein-folding video game took three weeks to solve the three dimensional structure of a simian retroviral protein that is used in animal models of HIV, but whose structure had eluded biochemists for more than a decade.” Source: http://blogs.nature.com/spoonful/2012/04/foldit-games-next-play-crowdsourcing-better-drug-design.html Phylo is an experimental video game about multiple sequence alignment optimisation: “Since the launch in November 2010, we received more than 350,000 solutions submitted from more than 12,000 registered users. Our results show that solutions submitted contributed to improving the accuracy of up to 70% of the alignment blocks considered.” It is about showing that humans can aid algorithms rather than comparing human and machine performance.
  • In 2008, the group built a Facebook game that required players to rate the sentiment associated with a sentence on a 5-value scale, then used these ratings as a training corpus for the sentiment detection module. Over 800 players played the game. In 2009 the game was released in a slightly different form, with the aim of gathering sentiment lexicons, i.e., associations between words and their sentiment polarity (ratings from as many as 12 players were averaged to get the final value). The game ran in 7 different languages and attracted over 4000 players. Let this be an introductory example of a crowdsourcing project; crowdsourcing, however, is not a new phenomenon.
  • The volunteer contributes because they are interested in a domain or support a cause.
  • More languages: e.g., Urdu, Arabic, Haitian Creole; Irvine and Klementiev create lexicons between English and 37 low-resourced languages. Diverse types of text (besides newswire): emails, Twitter feeds, augmentative and alternative communication texts. Speech: transcription, accent rating, assessment of dialog systems. Subjective tasks: sentiment detection, translation, word sense disambiguation, anaphora resolution, question answering, textual entailment, text summarization, etc. Niche language phenomena. Lab experiments reproduced at a fraction of their cost: e.g., contextual predictivity (Cloze task), corpus trends.
  • Completely new with respect to traditional approaches. Uses “create-verify” workflows. Widespread technique for translation tasks, less so for labeling.
  • STEM (Science, Technology, Engineering, Mathematics): harness the increased visibility and ease of engagement in social networks to make STEM research more attractive and understandable, so that more young people study STEM.
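
The notes on the Facebook sentiment game mention that ratings from as many as 12 players were averaged to obtain a word's final polarity value. A minimal sketch of that averaging step might look as follows. The function name, the rating scale from -2 to 2, and the sample ratings are all assumptions made for illustration; none of them come from the original game.

```python
from statistics import mean

# The notes mention averaging ratings from as many as 12 players per word.
MAX_RATINGS_PER_WORD = 12


def build_lexicon(ratings_by_word):
    """Average the player ratings of each word into one polarity value.

    ratings_by_word maps a word to a list of numeric ratings
    (hypothetical scale: -2 = very negative ... 2 = very positive).
    Words without any rating are skipped.
    """
    lexicon = {}
    for word, ratings in ratings_by_word.items():
        if ratings:
            lexicon[word] = mean(ratings[:MAX_RATINGS_PER_WORD])
    return lexicon


# Hypothetical ratings, invented for this example.
ratings = {
    "great": [2, 1, 2, 2],
    "awful": [-2, -2, -1],
    "table": [0, 0, 1, -1],
}
lexicon = build_lexicon(ratings)
```

Galton's ox-weighing experiment relies on the same operation: many individual estimates are reduced to a single mean.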

Presentation Transcript

  • Crowdsourcing Research Opportunities: Lessons from Natural Language Processing. Marta Sabou, Kalina Bontcheva, Arno Scharl
  • Crowdsourcing
  • Crowdsourcing: Undefined and generally large group
  • Crowdsourcing in Science | Crowdsourcing for NLP | Challenges
  • Crowdsourcing in science is not new. Sir Francis Galton, “VOX POPULI”. Citizen science, from the early 20th century; 60,000 – 80,000 yearly volunteers
  • Genre 1: Mechanised Labour. Participants (workers) are paid a small amount of money to complete easy tasks (HIT = Human Intelligence Task)
  • Genre 2: Games with a purpose. From 2008; 240k players
  • Crowdsourcing via Facebook
  • Genre 3: Altruistic Crowdsourcing. >250K players; >670K players
  • Crowdsourcing in Science - Typical Use. [Diagram: Input, Algorithm, Output, Process/Evaluation] Harness human intuition to prune solution space. Form-based data collection. Labeling, classification. Surveys.
  • Crowdsourcing in Science | Crowdsourcing for NLP | Challenges
  • Crowdsourcing in NLP: Papers relying on crowdsourcing in major NLP venues
  • Crowdsourcing Genres in NLP
  • Benefit 1: Affordable, Large-Scale Resources. A variety of small-medium sized resources can be obtained with as little as $100 using AMT. Crowdsourcing is also cost effective for large resources (Poesio, 2012):
    Approach             $/label   1 M labels ($)
    Traditional High Q.  1         1,000,000
    Mechanical Turk      .38       380,000 (<40%)
    Game                 .19       217,000 (20%)
  • Benefit 2: Diversification of research
  • Challenge 1: Contributor Selection and Training. From: prior to resource creation. To: during the resource creation.
  • Challenge 2: Aggregation and Quality Control. From: a few experts' annotations. To: multiple, noisy annotations from non-experts. Approach 1: Statistical techniques. Simplest (and most popular): majority voting. More complex: machine learning model trained on various features. Approach 2: Crowdsourcing the QC process itself. HIT1 (Create): Translate the following sentence. HIT2 (Verify): Which of these 5 sentences is the best translation?
  • Conclusions (What have we learned from NLP?) Crowdsourcing is revolutionising NLP research: cheaper resource acquisition; diversification of research agenda. But it requires more complex methodologies: for contributor management; for quality control and data aggregation. Other findings, most popular: Genre: mechanised labour; Task: acquiring input data; Problem: solving subjective tasks.
  • Crowdsourcing in Science | Crowdsourcing for NLP | Challenges
  • User Motivation. Motivating users: motivations for scientific projects might differ; task granularity might impact motivation. Promoting learning and science: advertise STEM research to young people; support learning and self-improvement through participation in crowdsourcing.
  • Legal and Ethical Issues. Acknowledging the crowd's contribution: S. Cooper, [other authors], and Foldit players: Predicting protein structures with a multiplayer online game. Nature, 466(7307):756-760, 2010. Ensuring privacy and wellbeing: mechanised labour criticised for low wages (<$2/hour) and lack of worker rights; prevent addiction, prolonged use and user exploitation. Licensing and consent: some clearly state the use of Creative Commons licenses; general failure to provide informed consent information.
  • Technical Issues Scaling up to large resources Preventing bias Increasing repeatability  Through reuse of crowdsourcing elements (e.g., HIT templates) uComp - Embedded Human Computation for Knowledge Extraction and Evaluation  3 year project, starting November 2012  Develops a scalable and generic HC framework for knowledge creation  Provides reusable HC elements
  • Thank you!
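
The “Challenge 2” slide names majority voting as the simplest aggregation technique and describes a create-verify workflow for crowdsourcing quality control. The sketch below illustrates both under stated assumptions: the function names, the tie-handling rule (returning no answer on a tie), and all sample data are hypothetical and are not taken from the talk or from any crowdsourcing platform's API.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label, or None for an empty list or a tie."""
    if not labels:
        return None
    ranked = Counter(labels).most_common()
    # A tie for first place means the crowd has no majority answer.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def create_verify(candidates, verify_votes):
    """Pick the candidate that wins the verify step.

    candidates: outputs of a hypothetical "create" HIT (e.g. translations).
    verify_votes: indices into candidates chosen by "verify" workers.
    """
    winner = majority_vote(verify_votes)
    return None if winner is None else candidates[winner]


# Hypothetical labels from five non-expert workers for one sentence.
sentiment = majority_vote(
    ["positive", "positive", "negative", "positive", "neutral"]
)

# Hypothetical create-verify round: three translations, four verify votes.
translations = ["The cat sleeps.", "Cat is sleep.", "The cat is sleeping."]
best = create_verify(translations, [2, 0, 2, 2])
```

Returning `None` on a tie is only one possible design choice; a real project might instead request additional judgments or weight workers by their past accuracy.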