→ Spearman correlation coefficients between
correspondence and object distance
→ Spearman correlation coefficients between
correspondence and object width
Type | Name | Task | Research member | Crowd# | Feature
Data collection | DNA barcoding (1) | DNA barcode search and data collection | Sara McMullin (SAP Canada); Paul Hebert (Biodiversity Institute of Ontario) | ? | DNA barcodes of 500K species in 2015; SAP: big data analysis; iBOL: data provider; LifeScanner app
Data annotation | GeneWiki (2) | Gene annotation | Luca de Alfaro (UCSC); Andrew Su (Scripps Res Institute) | 6,830 / 2 years | Human gene 10,369 (2011); 34,069 edits by 6,830 distinct editors
Data analysis | Foldit (3) | Protein folding | Zoran Popovic (UW) | 733 | Crowd performance is better than computer.
Data analysis | Eterna (4) | RNA secondary structure design | Rhiju Das (Stanford); Adrien Treuille (CMU) | 37,000 | AutoBot; crowd performance is better than AutoBot.
Data analysis | Fraxinus (5) | DNA sequence alignment for plant pathogen | Dan MacLean (The Sainsbury Laboratory) | 51,057 (2013/8-12) | BBSRC fund: The British Ash Tree Genome Project; NGS sequencing (GitHub)
Data collection | 23andMe (6) | Phenotype + genotype data collection | 23andMe Inc. | 10,000 for 22 traits | P-value trends for participation depend on locus (6)
○ Eli Kaminuma1, Yukino Baba2,3, Takatomo Fujisawa1, Asao Fujiyama1, Hisashi Kashima4, Yasukazu Nakamura1
1. National Institute of Genetics, SOKENDAI; 1111 Yata, Mishima, Shizuoka, 411-8540 Japan. 2. National Institute of Informatics, SOKENDAI, Tokyo 101-8430, Japan
3. JST, ERATO, Kawarabayashi Large Graph Project, Tokyo 101-8430, Japan. 4. Kyoto University, Grad School of Informatics, Kyoto 606-8501, Japan.
Performance evaluation between Crowdworkers and Biocurators towards
constructing a CrowdR&D platform
ABSTRACT
High-performance next-generation sequencing (NGS) technologies are advancing genomics and molecular biology research. In 2010, we released the “DDBJ Read Annotation Pipeline”, an automatic high-throughput
annotation pipeline for NGS data that runs on the supercomputer of Japan’s National Institute of Genetics. After automatic annotation, human curation is performed to correct errors. However, the massive
volume of NGS sequence data has made manual curation a bottleneck. To resolve this problem, we are investigating a crowdsourcing approach to curation tasks. First, we compared the performance of
non-professional crowdworkers on a commercial crowdsourcing platform with that of our expert biocurators. Two tasks that require technical knowledge were attempted: image-based annotation of gene
structure and text annotation of gene names. In the image annotation task, the cases that crowdworkers answered incorrectly were all answered correctly by the experts. This indicates that task
instructions need to be clarified with informative sentences that reflect professional knowledge, although writing such instructions may be costly. In the text annotation task, a comparison of
three biocurators and 17 crowdworkers confirmed that several crowdworkers performed at a level equivalent to the curators. Next, we propose a crowdsourcing research platform, currently under
development, named CrowdR&D. Researchers can use the CrowdR&D site as a portal for generating crowdsourcing tasks. It provides quantitative evaluation of individual tasks and manages separate tasks
as a workflow. It also includes user authentication and data sharing functions. Finally, we provide information on the ethical review for protecting crowdworkers that is required when submitting
papers to life science research journals.
Acknowledgments: ・Yoshiki Mochizuki ・Yuichi Kodama ・Takako Mochizuki ・Yasuhiro Tanizawa ・Hideki Nagasaki ・Naoko Sakamoto ・Nori Kurata
Funding: 1) JST CREST ‘Advanced Core Technologies for Big Data Integration’ H26 pilot study
2) Transdisciplinary Research Integration Center Project of Research Organization of Information and Systems.
Publication: Kaminuma E, Fujisawa T, Tanizawa Y, Sakamoto N, Kurata N, Shimizu T, Nakamura Y, H2DB: Nucleic Acids Research, 41: D880, 2013
Crowd performance② A text annotation task showed
the existence of high-performance crowdworkers
Reference : Examples of conventional crowdsourcing studies
#19
A crowdsourcing research platform (ongoing)
[data collection → curation → modeling ]
※DDBJ Read Annotation Pipeline http://p.ddbj.nig.ac.jp/(Kaminuma et al., NAR 2010; Nagasaki et al., 2013)
→Implemented on the NIG supercomputer inc. 10TB & 2TB mem nodes / 350 thin nodes and 100TB storage.
Wiki-based
open curation
(Salzberg, 2013)
■An analytic process of automatic sequence annotations using DDBJ pipeline
High-throughput NGS sequencing
Jamboree / Online community curation
Database opened
Paper submission
Automatic annotations
Paper submission
Database opened
Present path
■A proposal of ‘BigData curation’
DDBJ Pipeline(※)
User#= 666, Around 8000
jobs/year(Dec 2014)
TogoAnnotation
(Fujisawa, Nakamura
et al., 2014)
Large-scale DNA
sequences
Automatic annotations
Expert
Non-expert(crowdworker)
Task assignment
(Precision, Cost)
Enhancing annotation models
via training data
Crowdsourcing-based manual curation
The problem for data curation for‘BigData’
after automatic annotations of DDBJ Pipeline
Performance difference between experts and non-experts?
→①Image annotation
→②Text annotation
Collaboration:
Data science
1.UofBigData
(Kyoto univ.
Kashima lab.)
2.Deep Analytics.
(Opt Inc., Dr.Saito)
Microtask
3. Crowd4U
(Tsukuba univ.
Morishima lab.)
Manual curation
tasks for database
①Confirming the existence of high-performance crowdworkers
②Crowdworker recall tends to be higher than crowdworker precision.
→Suggested strategy: a first step of collection by crowdworkers, followed by a second
step of screening by curators
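The two-step strategy above can be sketched as follows. This is a minimal illustration, not the poster's implementation: the term sets are invented toy data, and the curator is simulated by a hypothetical `curator_accepts` function that simply checks a reference set. Pooling the answers of several crowdworkers keeps recall high, and the curator then only has to reject the spurious candidates.

```python
# Toy reference terms and crowd answers (hypothetical data for illustration only).
reference_terms = {"IL-2", "CD28", "NF-kappa B", "c-fos"}
crowd_answers = [
    {"IL-2", "CD28", "T cells"},
    {"IL-2", "NF-kappa B", "activation"},
    {"CD28", "c-fos", "IL-2", "promoter"},
]


def curator_accepts(term):
    """Stand-in for an expert decision; a real curator would judge each term."""
    return term in reference_terms


# Step 1: collect every term proposed by any crowdworker (favours recall).
candidates = set().union(*crowd_answers)

# Step 2: the curator screens the pooled candidates (restores precision).
screened = {term for term in candidates if curator_accepts(term)}

missed_terms = reference_terms - candidates   # terms nobody in the crowd found
rejected_terms = candidates - screened        # spurious terms the curator removed

print(f"candidates: {len(candidates)}, kept: {len(screened)}, "
      f"rejected: {len(rejected_terms)}, missed: {len(missed_terms)}")
```

In this toy case the curator reviews seven pooled candidates instead of reading the full text, which is the kind of saving the strategy aims for.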
Red dot square:
Expert level zone
■Data
・BioNLP/NLPBA 2004 shared task,curated data
・5 tasks/crowd based on random sampling
・Average word#:187
・Detect specific terms for gene annotations
・Evaluation measure:recall, precision, fscore
■Subjects
・Crowdworkers (non-expert):17
・Biocurators (expert):3
■Crowdsourcing condition
・Platform:Lancers (http://lancers.jp/)
・Cost:5 tasks 1,000yen/crowd*17=17,000 yen
(Biocurator: free)
■Educational and professional background for crowd
・Lifescience in master degree:3
・Lifescience jobs:0
■English skill for crowd
・TOEIC avg. score(self reported): 675
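The evaluation measures named above (recall, precision, F-score) can be computed as in the sketch below. It assumes exact matching between the terms a worker marked and the curated terms, and a balanced F-score (F1); the poster does not state the matching rule, and the example terms are invented.

```python
def precision_recall_fscore(predicted_terms, gold_terms):
    """Return (precision, recall, f_score) using exact term matching."""
    predicted = set(predicted_terms)
    gold = set(gold_terms)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical example: a worker who marks every curated term plus two extra ones.
gold = {"IL-2", "NF-kappa B", "CD28"}
worker = {"IL-2", "NF-kappa B", "CD28", "T cells", "activation"}
precision, recall, f_score = precision_recall_fscore(worker, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f_score={f_score:.2f}")
# prints: precision=0.60 recall=1.00 f_score=0.75
```

A worker like this one has high recall and lower precision, the pattern reported for the crowd in this task.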
Experimental conditions / Results
■Results
・20 crowdworkers, 19 tasks; performance: 0.85±0.12
・Significant error: expert = true positive, non-expert = true negative
→ Writing clear task instructions that convey expert knowledge is costly.
■ No correlation between object parameters and crowd performance
■ Data
Arabidopsis thaliana Col-0 cold stress treatment RNA-seq study
(http://bioviz.org/quickload/A_thaliana_Jun_2009/cold_stress/,
Loraine Lab. site in UNC Charlotte)
Crowdsourcing task: detection of overlapping parts of graphs.
■ Task
①Sequence information was removed.
②Instructions were written in simple sentences for non-experts
・No zoom-in/out
・Detect accumulated regions of RNA-seq alignment
IGV viewer(Robinson et al, 2011)
Keyword: citizen science, crowd science, commercial crowdsourcing, mobile crowd sensing, participatory sensing
Japanese crowdsourcing research
(7) Dr.Matsuda(Kyushu univ.): 「KOKOPIN!」 ecological plant image collection
(8) Dr.Senou(Kanagawa pref. museum of natural history) : 「Fish image database」 Mar 2014, 89,195entries
(9) Prof.Washitani(Tokyo univ.), Dr.Kitamoto (NII) : 「Seiyou Status」 Monitoring of invasive alien bumblebee B. terrestris.
(1)http://www.insightaas.com/populating-the-barcode-of-life/
(2)http://en.wikipedia.org/wiki/Portal:Gene_Wiki
(3)http://fold.it/
(4)http://eterna.cmu.edu/
(5)https://apps.facebook.com/fraxinusgame/
(6) Eriksson et al., PLOS Genetics, e1000993, 2010.
(7)http://www.kokopin.com/
(8)http://nh.kanagawa-museum.jp/staff/data/st3.html
(9)http://www.seiyoubusters.com/seiyou/
■Subjects
・Crowd worker(Non-expert): 20
■Crowdsourcing condition
・Platform:Lancers(http://lancers.jp/)
・Reward:95yen/worker * 20 workers=1,900 yen
Correspondence of crowdworkers
Object width
r=0.09(P=0.71)
r=-0.13(P=0.60)
Distance between objects
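For readers who want to reproduce this kind of check, the sketch below computes a Spearman rank correlation coefficient as the Pearson correlation of average ranks. The input values are invented placeholders, not the poster's measurements, and the P-values shown on the poster are not computed here.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-object values: object width and crowd correspondence rate.
object_width = [12, 40, 25, 60, 33]
correspondence = [0.80, 0.90, 0.70, 0.85, 0.95]
print(f"r = {spearman(object_width, correspondence):.2f}")
```

A coefficient near zero, as in the values reported on the poster, indicates no monotonic relationship between the object parameter and crowd correspondence.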
[1] A systematic flow from data collection to data modeling
[2] Curation task request
[3] An evaluation of task professionality
Modeling
Microtask
(image annotation, tagging)
Biocuration
(Term recognition)
Microtask
(tagging etc.)
Microtask
(data collection etc)
\[ \mathrm{LLR}(\mathit{Query}) = \log \frac{P(\mathit{Query} \mid \mathit{Professional})}{P(\mathit{Query} \mid \mathit{Non\ professional})} \]
NLP N-gram based professional models and proposed evaluation measure
(Non professional data)http://www.gutenberg.org/cache/epub/6434/pg6434.txt
A Brief History of the United States by Joel Dorman Steele
(Professional data) NCBI PUBMED (keyword: gene 2014)
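A minimal sketch of the log-likelihood ratio measure follows. It assumes unigram models with add-one smoothing, which is a simplification chosen for brevity; the poster states only that N-gram models were used. The two one-sentence training texts are invented stand-ins for the PubMed and Project Gutenberg data.

```python
import math
from collections import Counter


def train_unigram(text):
    """Count lower-cased whitespace tokens (a deliberately simple tokenizer)."""
    return Counter(text.lower().split())


def log_prob(query, counts, vocab_size):
    """Log probability of the query under an add-one smoothed unigram model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[token] + 1) / (total + vocab_size))
        for token in query.lower().split()
    )


def llr(query, professional, non_professional):
    """log P(query | professional) - log P(query | non-professional)."""
    # Shared vocabulary plus one slot for unseen tokens (an assumption).
    vocab_size = len(set(professional) | set(non_professional)) + 1
    return (log_prob(query, professional, vocab_size)
            - log_prob(query, non_professional, vocab_size))


# Hypothetical miniature corpora standing in for the real training data.
professional = train_unigram(
    "the gene encodes a protein kinase that regulates transcription of the target gene"
)
non_professional = train_unigram(
    "the army crossed the river and the general wrote a letter to the president"
)

score_technical = llr("gene transcription", professional, non_professional)
score_plain = llr("the general", professional, non_professional)
print(f"technical query: {score_technical:+.2f}, plain query: {score_plain:+.2f}")
```

A positive score means the query text looks more like the professional corpus, so under this reading a task description with a high LLR would be a candidate for expert rather than crowd assignment.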
Mining DNA barcode from NGS sequence read archive :
Crowdsourced data collection
92% finished
Cost
・25 tasks=525 yen
・1 task=21 yen
Working time
・Total : 83 min.
・92% finished : 21min.
Estimated time
・1 task time~ 3 min
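The cost and time figures above can be cross-checked with a few lines of arithmetic. The sketch assumes that the 83 minutes cover all 25 tasks, which the poster does not state explicitly.

```python
total_cost_yen = 525
task_count = 25
total_minutes = 83
finished_share = 0.92  # share of tasks finished within the first 21 minutes

cost_per_task = total_cost_yen / task_count                  # yen per task
tasks_finished_early = round(finished_share * task_count)    # tasks done in 21 min
minutes_per_task = total_minutes / task_count                # rough per-task time

print(cost_per_task, tasks_finished_early, round(minutes_per_task, 2))
# prints: 21.0 23 3.32
```

The per-task cost matches the stated 21 yen, and the rough per-task time is consistent with the estimate of about 3 minutes.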
Entry set | GBIF overlap | Species# | Genus# | Auto / Manual operation
SRA entries (ALL※) | x | 8,914 | 275 | Auto
SRA entries (WGS) | x | 2,354 | 100 | Auto
SRA entries (WGS, uniq species) | x | 236 | 100 | Auto
SRA entries (WGS, GBIF removed, uniq species) | - | 100 | 25 | Auto
SRA entries (WGS, GBIF removed, uniq species, DBCLS SRA search) | - | 100+9 | 25 | Manual (crowdsourced)
SRA original species (without GBIF) = 109 entries
for new barcodes
※NCBI SRA search : keyword: plant
42 percent of the SRA plant species were not included in the GBIF database.
8 percent of the GBIF-free SRA species were recovered by crowdsourcing.
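The two percentages can be recomputed from the table. The choice of denominators is an assumption: 100 of the 236 unique WGS species for the first figure, and 9 of the 109 GBIF-free species for the second. Both reproduce the rounded values reported here.

```python
wgs_unique_species = 236   # SRA entries (WGS, uniq species)
gbif_free_auto = 100       # remaining after removing species present in GBIF
gbif_free_crowd = 9        # additional species found by crowdworkers

gbif_free_total = gbif_free_auto + gbif_free_crowd
share_missing_from_gbif = gbif_free_auto / wgs_unique_species
share_from_crowd = gbif_free_crowd / gbif_free_total

print(gbif_free_total)                     # prints: 109
print(f"{share_missing_from_gbif:.0%}")    # prints: 42%
print(f"{share_from_crowd:.0%}")           # prints: 8%
```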
[Figure: workflow diagram. Elements: unsolved research data; automatic annotation models; curated structure data; crowdworkers & expert curators. Steps: ① Data collection, ② Data curation, ③ Model training / Data modeling, ④ Performance evaluation]
1. SRA search(plant) 2. GBIF species
removed(matK,rbcL)
■A mining protocol for new DNA barcodes
■Prompt actions of crowdworkers (http://lancers.jp/)
■Result: species# for new DNA barcode
3. DBCLS SRA
taxid search
http://www.gbif.jp/bol/
http://www.ncbi.nlm.nih.gov/
http://sra.dbcls.jp/
(Nakazato et al., 2013)
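The mining protocol (SRA search, removal of species already in GBIF, then a DBCLS SRA taxid search) can be pictured as set operations. The sketch below is a hedged illustration: the function name, the species labels, and the idea that the manual search results are filtered against GBIF in the same way are all assumptions, and no real database is queried.

```python
def new_barcode_candidates(sra_species, gbif_species, manual_hits):
    """Split candidate species into those found automatically and those
    added only by the manual (crowdsourced) taxid search."""
    gbif = set(gbif_species)
    automatic = set(sra_species) - gbif      # steps 1 and 2: search, then remove GBIF species
    manual = set(manual_hits) - gbif         # step 3: manual search, same GBIF filter
    return automatic, manual - automatic


# Hypothetical species labels for illustration only.
sra = ["Species A", "Species B", "Species C"]
gbif = ["Species B"]
hits = ["Species C", "Species D"]

automatic, added_by_crowd = new_barcode_candidates(sra, gbif, hits)
print(sorted(automatic), sorted(added_by_crowd))
# prints: ['Species A', 'Species C'] ['Species D']
```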
Crowd performance① An image annotation task clarified
the high costs of task specification
Red lines : Automatic annotation by cufflinks(Trapnell et al., 2010)
Correct data: NCBI RefSeq annotations
Task professionality
CrowdR&D

[2014-12-15] GIW/ISCB-Asia2014 Performance evaluation between Crowdworkers and Biocurators towards constructing a CrowdR&D platform

  • 1. → Spearman correlation coefficients between correspondence and object distance → Spearman correlation coefficients between correspondence and object width Type Data collection Data annotation Data analysis Data analysis Data analysis Data collection Name DNA barcoding(1) GeneWiki (2) Foldit (3) Eterna (4) Fraxinus(5) 23andMe (6) Task DNA barcode search and data collection Gene annotation Protein folding RNA secondary structure design DNA sequence alignment for plant pathogen Phenotype + genotype data collection Research member ・Sara McMullin(SAP Canada) ・Paul Hebert(Biodiversity Institute of Ontario ) ・Luca de alfaro(UCSC) ・Andrew Su(Scripps Res Institute) ・Zoran Popovic(UW) ・Rhiju Das(Stanford) ・Adrien Treuille(CMU) ・Dan MacLean(The Sainsbury Laboratory) 23andMe Inc. Crowd# ? 6,830 /2 years 733 37,000 51,057 (2013/8-12) 10,000 for 22 traits Feature ・DNA barcodes of 500K species in 2015 ・SAP: bigdata analysis ・iBOL: data provider . LifeScanner app ・Human gene 10,369 (2011) ・34, 069 edits by 6,830 distinct editors ・Crowd performance is better than computer. ・AutoBot ・Crowd performance is better than AutoBot. ・BBSRC fund: The British Ash Tree Genome Project ・NGS sequencing (GitHub) ・P-value trends for participation depend on locus(6) ○ Eli Kaminuma1, Yukino Baba2,3, Takatomo Fujisawa1, Asao Fujiyama1, Hisashi Kashima4, Yasukazu Nakamura1 1. National Institute of Genetics, SOKENDAI; 1111 Yata, Mishima, Shizuoka, 411-8540 Japan. 2. National Institute of Informatics, SOKENDAI, Tokyo 101-8430, Japan 3. JST, ERATO, Kawarabayashi Large Graph Project, Tokyo 101-8430, Japan. 4. Kyoto University, Grad School of Informatics, Kyoto 606-8501, Japan. . Performance evaluation between Crowdworkers and Biocurators towards constructing a CrowdR&D platform ABSTRACT High-performance next-generation sequencing (NGS) technologies are advancing genomics and molecular biological research. 
At 2010, we released an automatic high- throughput annotation pipeline “DDBJ Read Annotation Pipeline” for NGS sequencing data, which analyzes by using computer facilities of Japan’s National Institute of Genetics supercomputer. After automatic annotation analysis, human curation tasks are performed to modify errors. However, massive amounts of NGS sequence data have created a bottleneck at human curation with manual tasks. To resolve the problem, we investigate crowdsourcing approach to accomplish curation tasks. First, we evaluated performances between non-professional crowdworkers of a commercial crowdsourcing platform and our expert biocurators. Two tasks of image-based gene structural annotation and text annotation of gene names with technical knowledge were attempted. In the image annotation task, we found all incorrect cases by crowd with all correct by experts. This indicates that tasks should be clarified with informative sentences reflecting professional knowledge. However it may be high cost. As for the text annotation task, the comparison of performances between three biocurators and 17 crowdworkers confirmed that several crowds exhibited high performance levels equivalent to the curators. Next, we propose a crowdsourcing research platform under development, named by CrowdR&D. Researchers can use the CrowdR&D site as a portal to generate crowdsourcing tasks. It provides quantitative evaluation of individual tasks and manages separated tasks as a workflow. Moreover it includes user authentication function and data sharing function. Finally, we provide the information of ethical review for protecting crowdworkers required at paper submission to life science research journals. 
Acknowledgments: ・Yoshiki Mochizuki ・Yuichi Kodama ・Takako Mochizuki ・Yasuhiro Tanizawa ・Hideki Nagasaki ・Naoko Sakamoto ・Nori Kurata
Funding: 1) JST CREST 'Advanced Core Technologies for Big Data Integration' H26 pilot study 2) Transdisciplinary Research Integration Center Project of the Research Organization of Information and Systems.
Publication: Kaminuma E, Fujisawa T, Tanizawa Y, Sakamoto N, Kurata N, Shimizu T, Nakamura Y, H2DB: Nucleic Acids Research, 41: D880, 2013

Crowd performance② A text annotation task exhibited the existence of high-performance crowdworkers
Reference: Examples of conventional crowdsourcing studies
#19
A crowdsourcing research platform (ongoing) [data collection → curation → modeling]

※DDBJ Read Annotation Pipeline http://p.ddbj.nig.ac.jp/ (Kaminuma et al., NAR 2010; Nagasaki et al., 2013)
→Implemented on the NIG supercomputer, including 10TB & 2TB mem nodes / 350 thin nodes and 100TB storage.
Wiki-based open curation (Salzberg, 2013)

■An analytic process of automatic sequence annotations using DDBJ pipeline
[Flow diagram labels: High-throughput NGS sequencing; Jamboree / Online community curation; Database opened; Paper submission; Automatic annotations; Paper submission; Database opened; Present path]

■A proposal of 'BigData curation'
DDBJ Pipeline(※): User# = 666, around 8,000 jobs/year (Dec 2014)
TogoAnnotation (Fujisawa, Nakamura et al., 2014)
[Diagram labels: Large-scale DNA sequences; Automatic annotations; Expert; Non-expert (crowdworker); Task assignment (Precision, Cost); Enhancing annotation models via training data; Crowdsourcing-based manual curation]

The problem for data curation for 'BigData' after automatic annotations of DDBJ Pipeline
Performance difference between experts and non-experts?
→①Image annotation
→②Text annotation

Collaboration:
Data science 1. UofBigData (Kyoto univ. Kashima lab.) 2. Deep Analytics (Opt Inc., Dr. Saito)
Microtask 3. Crowd4U (Tsukuba univ. Morishima lab.)
Manual curation tasks for database
①Confirming the existence of high-performance crowdworkers
②Recall of crowdworkers tends to be higher than their precision.
→A suggested strategy: a first step of collection by crowdworkers and a second step of screening by curators
Red dotted square: Expert-level zone

■Data
・BioNLP/NLPBA 2004 shared task, curated data
・5 tasks/crowdworker based on random sampling
・Average word#: 187
・Detect specific terms for gene annotations
・Evaluation measure: recall, precision, F-score
■Subjects
・Crowdworkers (non-expert): 17
・Biocurators (expert): 3
■Crowdsourcing condition
・Platform: Lancers (http://lancers.jp/)
・Cost: 5 tasks, 1,000 yen/crowdworker * 17 = 17,000 yen (Biocurator: free)
■Educational and professional background of crowdworkers
・Master's degree in life science: 3
・Life science jobs: 0
■English skill of crowdworkers
・TOEIC avg. score (self-reported): 675

Experimental conditions
Result
Experimental condition

■Results
・Crowdworker 20: 19 task performance: 0.85±0.12
・Significant error: expert = true positive, non-expert = true negative
→ Stating clear task sentences with expert knowledge is high cost.
■No correlation between object parameters and crowd performance
■Data
Cold stress treatment RNA-seq study of Arabidopsis thaliana Col-0 (http://bioviz.org/quickload/A_thaliana_Jun_2009/cold_stress/, Loraine Lab. site at UNC Charlotte)
Crowdsourcing task: Detection of overlapping parts of graphs.
■Task
①Sequence information was removed.
②Easy sentence for non-experts
・No zoom-in/out
・Detect accumulated regions of RNA-seq alignment
IGV viewer (Robinson et al., 2011)

Keyword: citizen science, crowd science, commercial crowdsourcing, mobile crowd sensing, participatory sensing

Japanese crowdsourcing research
(7) Dr. Matsuda (Kyushu univ.): "KOKOPIN!" ecological plant image collection
(8) Dr. Senou (Kanagawa pref. museum of natural history): "Fish image database", Mar 2014, 89,195 entries
(9) Prof. Washitani (Tokyo univ.), Dr. Kitamoto (NII): "Seiyou Status", monitoring of the invasive alien bumblebee B. terrestris.

(1) http://www.insightaas.com/populating-the-barcode-of-life/
(2) http://en.wikipedia.org/wiki/Portal:Gene_Wiki
(3) http://fold.it/
(4) http://eterna.cmu.edu/
(5) https://apps.facebook.com/fraxinusgame/
(6) Eriksson et al., PLOS Genetics, e1000993, 2010.
(7) http://www.kokopin.com/
(8) http://nh.kanagawa-museum.jp/staff/data/st3.html
(9) http://www.seiyoubusters.com/seiyou/

■Subjects
・Crowdworkers (non-expert): 20
■Crowdsourcing condition
・Platform: Lancers (http://lancers.jp/)
・Reward: 95 yen/worker * 20 workers = 1,900 yen
[Figure labels: Correspondence of crowdworkers; Object width; r=0.09 (P=0.71); r=-0.13 (P=0.60); Distance between objects]

[1] A systematic flow from data collection to data modeling
[2] Curation task request
[3] An evaluation of task professionality
[Flow labels: Modeling; Microtask (image annotation, tagging); Biocuration (Term recognition); Microtask (tagging etc.); Microtask (data collection etc.)]

$$\mathrm{LLR}(\mathrm{Query}) = \log \frac{P(\mathrm{Query} \mid \mathrm{Professional})}{P(\mathrm{Query} \mid \mathrm{Non\text{-}professional})}$$
NLP N-gram based professional models and proposed evaluation measure
(Non-professional data) http://www.gutenberg.org/cache/epub/6434/pg6434.txt A Brief History of the United States by Joel Dorman Steele
(Professional data) NCBI PubMed (keyword: gene 2014)

Mining DNA barcode from NGS sequence read archive: Crowdsourced data collection
92% finished
Cost ・25 tasks = 525 yen ・1 task = 21 yen
Working time ・Total: 83 min. ・92% finished: 21 min.
Estimated time ・1 task time ~3 min

Columns: GBIF overlap; Species#; Genus#; Auto / Manual operation
SRA entries (ALL※): GBIF overlap x; Species# 8,914; Genus# 275; Auto
SRA entries (WGS): GBIF overlap x; Species# 2,354; Genus# 100; Auto
SRA entries (WGS, uniq species): GBIF overlap x; Species# 236; Genus# 100; Auto
SRA entries (WGS, GBIF removed, uniq species): GBIF overlap -; Species# 100; Genus# 25; Auto
SRA entries (WGS, GBIF removed, uniq species, DBCLS SRA search): GBIF overlap -; Species# 100+9; Genus# 25; Manual (Crowdsourced)
SRA original species (without GBIF) = 109 entries for new barcodes
※NCBI SRA search: keyword: plant
42 percent of SRA plant species were not included in the GBIF db.
8 percent of GBIF-free SRA species were recovered by crowdsourcing.
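The two percentages and the per-task cost above can be re-derived from the table counts. The short check below is a sketch that assumes 42 percent is the share of unique WGS species left after GBIF removal (100 of 236) and 8 percent is the share of the final 109 entries contributed by the crowdsourced search (9 of 109); the poster does not spell out these denominators, and the variable names are illustrative.

```python
# Counts copied from the table; variable names are illustrative.
wgs_unique_species = 236   # SRA entries (WGS, uniq species)
gbif_free_auto = 100       # left after removing species already in GBIF (automatic)
crowd_added = 9            # extra species from the crowdsourced DBCLS SRA search

total_new = gbif_free_auto + crowd_added
share_not_in_gbif = gbif_free_auto / wgs_unique_species   # assumed denominator
share_from_crowd = crowd_added / total_new                # assumed denominator

# Cost figures from the poster: 25 tasks for 525 yen.
cost_per_task_yen = 525 / 25

print(f"{total_new} candidate species for new barcodes")
print(f"{share_not_in_gbif:.0%} not in GBIF, {share_from_crowd:.0%} added by crowdworkers")
print(f"{cost_per_task_yen:.0f} yen per task")
```

Under these assumptions the rounded values agree with the figures on the poster (109 entries, 42 percent, 8 percent, 21 yen per task).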
[Diagram labels: Unsolved Research Data; Automatic Annotation Models; Curated Structure Data; Crowdworkers & Expert curators; ①Data collection; ②Data curation; ③Model training / Data modeling; ④Performance evaluation]

■A mining protocol for new DNA barcodes
1. SRA search (plant)
2. GBIF species removed (matK, rbcL)
3. DBCLS SRA taxid search
■Prompt actions of crowdworkers (http://lancers.jp/)
■Result: species# for new DNA barcode
http://www.gbif.jp/bol/
http://www.ncbi.nlm.nih.gov/
http://sra.dbcls.jp/ (Nakazato et al., 2013)

Crowd performance① An image annotation task clarified the high costs of task specification
Red lines: Automatic annotation by Cufflinks (Trapnell et al., 2010)
Correct data: NCBI RefSeq annotations

Task professionality
CrowdR&D
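The task professionality measure on the poster is a log-likelihood ratio of a query under a professional and a non-professional N-gram model. The following is a minimal sketch of that idea, assuming bigram models with add-one smoothing and tiny made-up corpora; the poster specifies neither N nor the smoothing method, and the class and function names are hypothetical.

```python
import math
from collections import Counter


def bigrams(text: str) -> list[tuple[str, str]]:
    """Lowercase, whitespace-tokenize, and pad with sentence markers."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


class BigramModel:
    """Add-one smoothed bigram model (an illustrative choice)."""

    def __init__(self, corpus: list[str], vocab_size: int):
        self.vocab_size = vocab_size
        self.pair_counts = Counter(bg for s in corpus for bg in bigrams(s))
        self.context_counts = Counter(bg[0] for s in corpus for bg in bigrams(s))

    def log_prob(self, text: str) -> float:
        return sum(
            math.log((self.pair_counts[(a, b)] + 1)
                     / (self.context_counts[a] + self.vocab_size))
            for a, b in bigrams(text)
        )


def llr(query: str, professional: BigramModel, general: BigramModel) -> float:
    """log P(query | professional) - log P(query | non-professional)."""
    return professional.log_prob(query) - general.log_prob(query)


# Toy stand-ins for the PubMed and Project Gutenberg corpora.
pro_corpus = ["the gene encodes a kinase",
              "expression of the gene is induced by cold stress"]
gen_corpus = ["the army crossed the river",
              "the colonies declared independence"]

# Shared vocabulary size so both models smooth over the same event space.
vocab = {tok for s in pro_corpus + gen_corpus for bg in bigrams(s) for tok in bg}
pro_model = BigramModel(pro_corpus, len(vocab))
gen_model = BigramModel(gen_corpus, len(vocab))

print(llr("the gene is induced", pro_model, gen_model))  # positive: more professional
print(llr("the army crossed", pro_model, gen_model))     # negative: less professional
```

A positive score means the query wording is closer to the professional corpus, which is one way a task description could be flagged as needing expert knowledge.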