GIW/ISCB-Asia2014 Poster#19
Performance evaluation between Crowdworkers and Biocurators towards constructing a CrowdR&D platform
Eli Kaminuma1, Yukino Baba2,3, Takatomo Fujisawa1, Asao Fujiyama1, Hisashi Kashima4, Yasukazu Nakamura1
1. National Institute of Genetics, SOKENDAI; 1111 Yata, Mishima, Shizuoka, 411-8540 Japan. 2. National Institute of Informatics, SOKENDAI, Tokyo 101-8430, Japan 3. JST, ERATO, Kawarabayashi Large Graph Project, Tokyo 101-8430, Japan. 4. Kyoto University, Grad School of Informatics, Kyoto 606-8501, Japan.
■Reference: Examples of conventional crowdsourcing studies (sources (1)-(6) listed below)

| Name | Type | Task | Research members | Crowd# | Features |
|---|---|---|---|---|---|
| DNA barcoding (1) | Data collection | DNA barcode search and data collection | Sara McMullin (SAP Canada); Paul Hebert (Biodiversity Institute of Ontario) | ? | DNA barcodes of 500K species targeted for 2015; SAP: big-data analysis; iBOL: data provider; LifeScanner app |
| GeneWiki (2) | Data annotation | Gene annotation | Luca de Alfaro (UCSC); Andrew Su (Scripps Research Institute) | 6,830 / 2 years | 10,369 human genes (2011); 34,069 edits by 6,830 distinct editors |
| Foldit (3) | Data analysis | Protein folding | Zoran Popovic (UW) | 733 | Crowd performance is better than the computer's |
| Eterna (4) | Data analysis | RNA secondary structure design | Rhiju Das (Stanford); Adrien Treuille (CMU) | 37,000 | AutoBot; crowd performance is better than AutoBot's |
| Fraxinus (5) | Data analysis | DNA sequence alignment for a plant pathogen | Dan MacLean (The Sainsbury Laboratory) | 51,057 (2013/8-12) | BBSRC fund: The British Ash Tree Genome Project; NGS sequencing (GitHub) |
| 23andMe (6) | Data collection | Phenotype + genotype data collection | 23andMe Inc. | 10,000 for 22 traits | P-value trends for participation depend on the locus (6) |
ABSTRACT
High-performance next-generation sequencing (NGS) technologies are advancing genomics and molecular biology research. In 2010, we released an automatic high-throughput annotation pipeline for NGS data, the "DDBJ Read Annotation Pipeline", which runs on the supercomputer of Japan's National Institute of Genetics. After automatic annotation, human curation tasks are performed to correct errors. However, massive amounts of NGS sequence data have created a bottleneck at the manual curation stage. To resolve this problem, we investigated a crowdsourcing approach to accomplishing curation tasks. First, we compared the performance of non-professional crowdworkers on a commercial crowdsourcing platform with that of our expert biocurators on two tasks: image-based gene structural annotation and text annotation of gene names requiring technical knowledge. In the image annotation task, we found cases that all crowdworkers answered incorrectly but all experts answered correctly. This indicates that tasks should be specified with informative instructions reflecting professional knowledge, although writing such instructions may be costly. In the text annotation task, a comparison between three biocurators and 17 crowdworkers confirmed that several crowdworkers performed at a level equivalent to the curators. Next, we propose a crowdsourcing research platform under development, named CrowdR&D. Researchers can use the CrowdR&D site as a portal to generate crowdsourcing tasks. It provides quantitative evaluation of individual tasks and manages separate tasks as a workflow; moreover, it includes user authentication and data sharing functions. Finally, we provide information on the ethical review for protecting crowdworkers that is required when submitting papers to life science research journals.
Acknowledgments: Yoshiki Mochizuki, Yuichi Kodama, Takako Mochizuki, Yasuhiro Tanizawa, Hideki Nagasaki, Naoko Sakamoto, Nori Kurata
Funding: 1) JST CREST 'Advanced Core Technologies for Big Data Integration', H26 pilot study; 2) Transdisciplinary Research Integration Center Project of the Research Organization of Information and Systems.
Publication: Kaminuma E, Fujisawa T, Tanizawa Y, Sakamoto N, Kurata N, Shimizu T, Nakamura Y. H2DB. Nucleic Acids Research 41:D880, 2013.
Crowd performance②: A text annotation task demonstrated the existence of high-performance crowdworkers
A crowdsourcing research platform (ongoing)
[data collection → curation → modeling ]
※DDBJ Read Annotation Pipeline http://p.ddbj.nig.ac.jp/ (Kaminuma et al., NAR 2010; Nagasaki et al., 2013)
→Implemented on the NIG supercomputer, including 10TB- and 2TB-memory nodes, 350 thin nodes, and 100TB of storage.
■An analytic process of automatic sequence annotation using the DDBJ pipeline
・Present path: high-throughput NGS sequencing → automatic annotations → database opened → paper submission
・Curated path: high-throughput NGS sequencing → automatic annotations → wiki-based open curation (Salzberg, 2013) or jamboree / online community curation → database opened → paper submission
■A proposal of 'BigData curation'
The problem: curating 'BigData' after the automatic annotations of the DDBJ Pipeline(※) (666 users, around 8,000 jobs/year as of Dec 2014).
Proposed flow: large-scale DNA sequences → automatic annotations → crowdsourcing-based manual curation with TogoAnnotation (Fujisawa, Nakamura et al., 2014), assigning tasks to experts and non-experts (crowdworkers) according to precision and cost (see the sketch below) → enhancing annotation models via the resulting training data.
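The precision/cost assignment rule could take a very simple form; a minimal sketch, with entirely hypothetical precision and cost figures:

```python
# A minimal sketch of precision/cost-based task assignment between expert
# curators and crowdworkers. Pools, precisions, and costs are hypothetical.
POOLS = {
    "expert":      {"precision": 0.98, "cost_yen": 500},  # per task
    "crowdworker": {"precision": 0.85, "cost_yen": 21},
}

def assign(task_required_precision):
    """Pick the cheapest pool whose estimated precision meets the requirement."""
    ok = [(p["cost_yen"], name) for name, p in POOLS.items()
          if p["precision"] >= task_required_precision]
    return min(ok)[1] if ok else "expert"  # fall back to experts

for req in (0.8, 0.9, 0.99):
    print(req, "->", assign(req))
# 0.8 -> crowdworker, 0.9 -> expert, 0.99 -> expert
```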
Performance difference between experts and non-experts?
→①Image annotation
→②Text annotation
Collaboration:
・Data science: 1. UofBigData (Kyoto Univ., Kashima Lab.); 2. Deep Analytics (Opt Inc., Dr. Saito)
・Microtask: 3. Crowd4U (Tsukuba Univ., Morishima Lab.), providing manual curation tasks for databases
①Confirmed the existence of high-performance crowdworkers.
②Crowdworker recall tends to be higher than crowdworker precision.
→Suggested strategy: first collect crowdworkers broadly, then screen them for curator-level performance (sketched below).
[Figure legend] Red dotted square: expert-level zone.
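The two-step strategy could be implemented as a simple filter against the expert-level zone; a minimal sketch, with hypothetical worker scores and threshold:

```python
# Minimal sketch of the suggested two-step strategy: collect many
# crowdworkers first, then screen them against an expert-level zone.
# Worker scores and the zone threshold are hypothetical.
workers = {"w1": (0.90, 0.95), "w2": (0.55, 0.80), "w3": (0.88, 0.91)}
EXPERT_ZONE = (0.85, 0.85)  # minimum (precision, recall) observed among experts

curator_candidates = [w for w, (p, r) in workers.items()
                      if p >= EXPERT_ZONE[0] and r >= EXPERT_ZONE[1]]
print(curator_candidates)  # ['w1', 'w3']
```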
■Data
・BioNLP/NLPBA 2004 shared task, curated data
・5 tasks per crowdworker, based on random sampling
・Average word count: 187
・Goal: detect specific terms for gene annotation
・Evaluation measures: recall, precision, F-score (see the sketch below)
■Subjects
・Crowdworkers (non-experts): 17
・Biocurators (experts): 3
■Crowdsourcing condition
・Platform: Lancers (http://lancers.jp/)
・Cost: 5 tasks at 1,000 yen/crowdworker × 17 = 17,000 yen (biocurators: free)
■Educational and professional background of crowdworkers
・Life science master's degree: 3
・Life science jobs: 0
■English skill of crowdworkers
・TOEIC average score (self-reported): 675
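These measures can be computed per annotator by comparing predicted term spans against the curated gold spans. A minimal sketch; the span sets are hypothetical:

```python
# Minimal sketch: precision, recall, and F-score for one annotator,
# treating gold and predicted gene-term spans as sets of (start, end) tuples.
def prf(gold, predicted):
    tp = len(gold & predicted)                      # correctly found terms
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold = {(0, 4), (10, 18), (25, 31)}    # curated annotations (hypothetical)
crowd = {(0, 4), (10, 18), (40, 44)}   # one crowdworker's answers
print(prf(gold, crowd))                # (0.667, 0.667, 0.667)
```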
■Results
・Crowdworkers: 20; tasks: 19; performance: 0.85 ± 0.12
・Significant error pattern: cases rated true positive by all experts but true negative by non-experts
→Writing task instructions that convey expert knowledge clearly comes at a high cost.
■No correlation between object parameters and crowd performance
■Data
Cold-stress treatment RNA-seq study of Arabidopsis thaliana Col-0
(http://bioviz.org/quickload/A_thaliana_Jun_2009/cold_stress/, Loraine Lab site at UNC Charlotte)
Crowdsourcing task: detection of overlapping regions in the alignment graph.
■Task
①Sequence information was removed.
②Instructions were written in simple sentences for non-experts:
・No zoom-in/out
・Detect accumulated regions of the RNA-seq alignment
[Figure: IGV viewer screenshot (Robinson et al., 2011)]
Keyword: citizen science, crowd science, commercial crowdsourcing, mobile crowd sensing, participatory sensing
Japanese crowdsourcing research
(7) Dr. Matsuda (Kyushu Univ.): "KOKOPIN!", ecological plant image collection
(8) Dr. Senou (Kanagawa Prefectural Museum of Natural History): "Fish Image Database", 89,195 entries as of Mar 2014
(9) Prof. Washitani (Univ. of Tokyo), Dr. Kitamoto (NII): "Seiyou Status", monitoring of the invasive alien bumblebee B. terrestris
(1)http://www.insightaas.com/populating-the-barcode-of-life/
(2)http://en.wikipedia.org/wiki/Portal:Gene_Wiki
(3)http://fold.it/
(4)http://eterna.cmu.edu/
(5)https://apps.facebook.com/fraxinusgame/
(6) Eriksson et al., PLOS Genetics, e1000993, 2010.
(7)http://www.kokopin.com/
(8)http://nh.kanagawa-museum.jp/staff/data/st3.html
(9)http://www.seiyoubusters.com/seiyou/
■Subjects
・Crowdworkers (non-experts): 20
■Crowdsourcing condition
・Platform: Lancers (http://lancers.jp/)
・Reward: 95 yen/worker × 20 workers = 1,900 yen
[Figure: scatter plots of crowdworker correspondence vs. object width (Spearman r = 0.09, P = 0.71) and vs. distance between objects (r = -0.13, P = 0.60)]
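The correlation test behind the figure can be reproduced with scipy; a minimal sketch, where the data arrays are placeholders since the poster reports only the resulting r and P values:

```python
# Minimal sketch of the Spearman rank-correlation test between crowdworker
# correspondence and an object parameter. The data arrays are hypothetical.
from scipy.stats import spearmanr

correspondence = [0.9, 0.7, 0.95, 0.6, 0.8, 0.85]  # per-object agreement
object_width   = [120, 45, 300, 80, 150, 60]       # e.g. pixels

r, p = spearmanr(correspondence, object_width)
print(f"Spearman r = {r:.2f}, P = {p:.2f}")
```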
[1] A systematic flow from data collection to data modeling
[2] Curation task request
[3] An evaluation of task professionality
Workflow components of [1]: microtask (data collection etc.) → biocuration (term recognition), microtasks (image annotation, tagging, etc.) → modeling
LLR(Query) = log [ P(Query | Professional) / P(Query | Non-professional) ]
NLP N-gram-based professionality models and the proposed evaluation measure:
・Non-professional data: A Brief History of the United States by Joel Dorman Steele (http://www.gutenberg.org/cache/epub/6434/pg6434.txt)
・Professional data: NCBI PubMed (keyword: gene, 2014)
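As an illustration of how such a measure could be computed, here is a minimal sketch of a unigram (N = 1) variant with add-one smoothing; the toy corpora and tokenization are stand-ins for the PubMed and Project Gutenberg data:

```python
# Minimal sketch: unigram language models with add-one smoothing, used to
# score a query by LLR(Query) as defined above. Corpora are toy stand-ins.
import math
from collections import Counter

def train_unigram(tokens):
    """Return unigram counts and the token total for one corpus."""
    return Counter(tokens), len(tokens)

def log_prob(query_tokens, counts, total, vocab_size):
    """Add-one-smoothed log P(query | model) under a unigram assumption."""
    return sum(math.log((counts[t] + 1) / (total + vocab_size))
               for t in query_tokens)

def llr(query_tokens, prof, nonprof, vocab_size):
    """log P(Query|Professional) - log P(Query|Non-professional)."""
    return (log_prob(query_tokens, *prof, vocab_size)
            - log_prob(query_tokens, *nonprof, vocab_size))

prof_tokens = "gene expression promoter transcription gene locus".split()
nonprof_tokens = "the history of the united states began".split()
vocab_size = len(set(prof_tokens) | set(nonprof_tokens))

prof, nonprof = train_unigram(prof_tokens), train_unigram(nonprof_tokens)
print(llr("gene promoter".split(), prof, nonprof, vocab_size))  # > 0
```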
Mining DNA barcodes from the NGS Sequence Read Archive (SRA): crowdsourced data collection
・Completion: 92% of tasks finished
・Cost: 25 tasks = 525 yen (21 yen per task)
・Working time: 83 min in total; 92% finished within the first 21 min
・Estimated time per task: about 3 min
| Dataset | GBIF overlap | Species# | Genus# | Operation |
|---|---|---|---|---|
| SRA entries (ALL※) | x | 8,914 | 275 | Auto |
| SRA entries (WGS) | x | 2,354 | 100 | Auto |
| SRA entries (WGS, uniq species) | x | 236 | 100 | Auto |
| SRA entries (WGS, GBIF removed, uniq species) | | 100 | 25 | Auto |
| SRA entries (WGS, GBIF removed, uniq species, DBCLS SRA search) | | 100+9 | 25 | Manual (crowdsourced) |
SRA-original species (absent from GBIF) = 109 entries, candidates for new barcodes
※NCBI SRA search keyword: plant
42 percent of the SRA plant species were not included in the GBIF database.
8 percent of the GBIF-free SRA species were recovered by crowdsourcing.
CrowdR&D cycle:
①Data collection: unsolved research data
②Data curation: crowdworkers & expert curators produce curated structured data
③Model training / data modeling: automatic annotation models
④Performance evaluation
■A mining protocol for new DNA barcodes (step 2 is sketched in code below)
1. SRA search (keyword: plant): http://www.ncbi.nlm.nih.gov/
2. GBIF species removal (matK, rbcL): http://www.gbif.jp/bol/
3. DBCLS SRA taxid search: http://sra.dbcls.jp/ (Nakazato et al., 2013)
■Prompt actions of crowdworkers (http://lancers.jp/)
■Result: species counts for new DNA barcodes (table above)
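Step 2 of the protocol amounts to a set difference between the SRA species list and the species already covered by GBIF barcode records; a minimal sketch, where the species-list file names are hypothetical:

```python
# Minimal sketch of step 2: removing species that already have GBIF
# barcode (matK/rbcL) records. File names are hypothetical placeholders.
def load_species(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

sra_species = load_species("sra_plant_species.txt")       # from the SRA search
gbif_species = load_species("gbif_barcoded_species.txt")  # matK/rbcL records

candidates = sorted(sra_species - gbif_species)  # species needing new barcodes
print(f"{len(candidates)} candidate species for new DNA barcodes")
```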
Crowd performance①: An image annotation task revealed the high cost of task specification
[Figure legend] Red lines: automatic annotations by Cufflinks (Trapnell et al., 2010). Ground-truth data: NCBI RefSeq annotations.