Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines

Annotating data is expensive and often fraught. Crowdsourcing promises a quick, cheap and high-quality solution, but it is critical to understand the process and plan the work appropriately in order to get good results. This presentation and paper discuss the challenges involved and explain simple ways to get reliable, high-quality results when crowdsourcing corpora.

Full paper: https://gate.ac.uk/sale/lrec2014/crowdsourcing/crowdsourcing-NLP-corpora.pdf

Transcript

  • 1. Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines. Marta Sabou, Kalina Bontcheva, Leon Derczynski, Arno Scharl (University of Sheffield, NLP).
  • 2. Crowdsourcing in science is not new: citizen science dates from the early 19th century, with 60,000–80,000 yearly volunteers. A classic example is Sir Francis Galton's "Vox Populi".
  • 3. What is crowdsourcing in our domain? NLP researchers increasingly use crowdsourcing as a collaborative approach for obtaining linguistically annotated corpora and a wide range of other linguistic resources. There are three main kinds of crowdsourcing platforms: paid-for marketplaces, games with a purpose, and volunteer-based platforms.
  • 4. Genre 1: Mechanised Labour. Participants (workers) are paid a small amount of money to complete easy tasks (HIT = Human Intelligence Task).
  • 5. Genre 2: Games with a Purpose.
  • 6. Genre 3: Altruistic Crowdsourcing.
  • 7. Workflow for Crowdsourcing Corpora: 1. Project Definition; 2. Data and UI Preparation; 3. Running the Project; 4. Corpus Delivery.
  • 8. Step 1: Project Definition
    • Data distribution: how "micro" is each microtask? Long paragraphs are hard to digest and cause worker fatigue, but single sentences are not always appropriate, e.g. for co-reference.
    • Reward scheme: at what granularity (per task? per set of tasks? high scores?), what to do with "bad" work, and how much to reward.
    • There are no clear, repeatable results on the quality:reward relation; high rewards get the work done faster, but not better.
    • A pilot task gives timings, so pay at least minimum wage.
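As a worked illustration of that last point (not part of the original slides): a minimal Python sketch that turns pilot timings into a per-task reward that meets a minimum wage. The wage figure and pilot times are invented for the example.

```python
# Minimal sketch, not from the slides: derive a per-task reward from
# pilot timings so that effective pay meets a minimum wage. The wage
# figure and the pilot times are invented for illustration.

from statistics import median

pilot_seconds = [42, 55, 38, 61, 47, 50, 44]  # observed seconds per task in a pilot
min_wage_per_hour = 10.50                     # assumed minimum wage for the workers' locale

seconds_per_task = median(pilot_seconds)      # median is robust to a few outliers
tasks_per_hour = 3600 / seconds_per_task
reward_per_task = min_wage_per_hour / tasks_per_hour

print(f"median {seconds_per_task:.0f}s/task -> pay at least ${reward_per_task:.3f} per task")
```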
  • 9. Step 1: Project Definition (continued)
    • Choose the most appropriate genre, or mixture of crowdsourcing genres; the trade-offs are cost, timescale, and worker skills.
    • Pilot the design, measure performance, and try again. A simple, clear design is important; binary decision tasks get good results.
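As an illustrative aside (not from the slides): one concrete way to "measure performance" in a pilot is chance-corrected agreement between two annotators on a binary task. A minimal Cohen's kappa sketch, with made-up labels:

```python
# Illustrative sketch, not from the slides: Cohen's kappa between two
# pilot annotators on a binary task, as one way to "measure performance"
# of a task design. The labels below are made up.

def cohens_kappa(a, b):
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    labels = set(a) | set(b)
    p_exp = sum((a.count(l) / n) * (b.count(l) / n)        # agreement expected by chance
                for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

ann1 = [1, 1, 0, 1, 0, 0, 1, 0]
ann2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(f"kappa = {cohens_kappa(ann1, ann2):.2f}")           # 0.50 on this toy data
```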
  • 10. Step 2: Data and UI Preparation
    • Build and test the user interfaces: easy to medium difficulty in AMT/CF; medium to hard (and expensive) for GWAPs.
    • A simple, usable interface is more important than presenting the task in the worker's own language (Khanna 2010).
  • 11. Step 2: Data and UI Preparation (continued).
  • 12. Step 3: Running the Crowdsourcing Project
    • A project can run for hours, days or years, depending on genre and size.
    • Quality control: use gold units to control quality.
    • Contributor management: begin with "training" questions (gold only) and a qualification period; have a sufficient number and diversity of contributors to reduce annotator bias.
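A hedged sketch of that gold-based screening, in the style of CrowdFlower's hidden gold units; the gold answers, judgements and the 0.7 accuracy threshold are assumptions for the example, not values from the paper:

```python
# Hedged sketch of gold-based contributor screening, in the style of
# hidden gold units. The gold answers, judgements and the 0.7 accuracy
# threshold are assumptions for the example.

gold = {"u1": "LOC", "u5": "PER"}            # unit_id -> expert ("gold") label

def gold_accuracy(judgements, gold):
    """judgements: list of (unit_id, label) from one contributor."""
    scored = [(u, l) for u, l in judgements if u in gold]
    if not scored:
        return None                          # no gold seen yet: still in qualification
    return sum(l == gold[u] for u, l in scored) / len(scored)

acc = gold_accuracy([("u1", "LOC"), ("u5", "ORG"), ("u7", "PER")], gold)
keep = acc is not None and acc >= 0.7        # drop contributors below the threshold
print(acc, keep)                             # 0.5 False -> this contributor is dropped
```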
  • 13. Step 4: Evaluation and Corpus Delivery
    • Evaluate and aggregate contributor inputs to produce the final decision: majority vote, discarding inputs from low-trust contributors (e.g. Hsueh et al. 2009), or MACE (Hovy et al. 2013).
    • Merge units into complete documents.
    • Tune the expert-created "gold" standard based on annotator feedback; the crowd has a broader knowledge base than a few experts.
    • Deliver the corpus in a widely used format, e.g. TEI, GATE, CoNLL, NIF.
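To make the aggregation step concrete, a minimal sketch of trust-weighted majority voting (not MACE, and not the authors' implementation); the judgements and trust scores are invented:

```python
# Minimal aggregation sketch (not MACE, and not the authors' code):
# trust-weighted majority vote per unit, discarding low-trust
# contributors first. All judgements and trust scores are invented.

from collections import Counter, defaultdict

judgements = [                               # (unit_id, worker_id, label)
    ("u1", "w1", "LOC"), ("u1", "w2", "LOC"), ("u1", "w3", "ORG"),
    ("u2", "w1", "PER"), ("u2", "w2", "PER"), ("u2", "w3", "PER"),
]
trust = {"w1": 0.9, "w2": 0.8, "w3": 0.4}    # e.g. each worker's accuracy on gold units

def aggregate(judgements, trust, min_trust=0.5):
    votes = defaultdict(Counter)
    for unit, worker, label in judgements:
        if trust.get(worker, 0.0) >= min_trust:   # discard low-trust contributors
            votes[unit][label] += trust[worker]   # weight each vote by trust
    return {unit: c.most_common(1)[0][0] for unit, c in votes.items()}

print(aggregate(judgements, trust))          # {'u1': 'LOC', 'u2': 'PER'}
```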
  • 14. Legal and Ethical Issues
    • Acknowledging the crowd's contribution, e.g. S. Cooper, [other authors], and Foldit players. Predicting protein structures with a multiplayer online game. Nature, 466(7307):756-760, 2010.
    • Ensuring privacy and wellbeing: mechanised labour is criticised for low wages and lack of worker rights; a majority of workers rely on microtasks as their main income source; prevent prolonged use and user exploitation (e.g. via daily caps).
    • Licensing and consent: some projects clearly state the use of Creative Commons licences, but there is a general failure to provide informed-consent information.
  • 15. Example (CF): marking locations in tweets.
  • 16. Example (CF): locations selected.
  • 17. How to do it: The Easy Way
    • Download and use the GATE Crowdsourcing plugin: https://gate.ac.uk/wiki/crowdsourcing.html
    • It automatically transforms texts with GATE annotations into CF jobs and generates the CF user interface (based on templates).
    • The researcher then checks and runs the project in CF.
    • On completion, the plugin automatically imports the results back into GATE, aligning them to sentences and representing the multiple annotators.
  • 18. GATE Crowdsourcing Overview (1)
    • Choose a job builder: Classification or Sequence Selection.
    • Configure the corresponding user interface and provide the task instructions.
  • 19. Configure and execute the job in CF. Gold data units can also be uploaded from GATE, so that CF controls quality.
  • 20. Automatic CF Import into GATE
    • Each CF judgement is imported back as a separate annotation with some metadata.
    • Adjudication can happen automatically (e.g. majority vote and/or trust-based) or manually (Annotation Stack editor).
    • The resulting corpus is ready to use for experiments, or can be exported out of GATE as XML/XCES.
  • 21. Summary: 1. Project Definition; 2. Data and UI Preparation; 3. Running the Project; 4. Corpus Delivery.
  • 22. Thank you for your time! Marta Sabou, Kalina Bontcheva, Leon Derczynski, Arno Scharl. This work was part of the uComp project (www.ucomp.eu). uComp receives funding support from EPSRC EP/K017896/1, FWF 1097-N23, and ANR-12-CHRI-0003-03, in the framework of the CHIST-ERA ERA-NET.